Tech Term Decoded: Semi-Structured Data

Definition

Semi-structured data sits midway between structured and unstructured data. It lacks the strict schema of a structured database, yet it is more organized than free-form text documents. Its structure usually comes from tags, keys, or other markers that separate elements and enforce hierarchies within the data [1]. Because it pairs predictable fields with room for variation, it is very common in real-world applications such as forms, JSON files, XML documents, and web data.

Let’s take a look at the following example:

{
  "student_id": "UNN/2024/001",     // Structured (fixed format)
  "name": "Chioma Okonkwo",         // Structured (text field)
  "age": 20,                        // Structured (number)
  "courses": ["Math", "Physics"],   // Structured (array)
  "comment": "I prefer evening classes because I help my parents with their shop during the day"   // Unstructured (free text)
}

It is called "semi"-structured because it has some structure (like a database) but also some flexibility (like a document). It is neither as rigid as a traditional database nor as unorganized as a plain text file.
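As a minimal sketch of how this mix behaves in code, the record above can be parsed with Python's standard json module. The inline comments are dropped because strict JSON has no comment syntax. The split_record helper and the choice of which keys count as "structured" are assumptions made purely for illustration.

```python
import json

# The student record from the example above, without the inline comments
# (strict JSON does not allow comments).
RAW_RECORD = """
{
  "student_id": "UNN/2024/001",
  "name": "Chioma Okonkwo",
  "age": 20,
  "courses": ["Math", "Physics"],
  "comment": "I prefer evening classes because I help my parents with their shop during the day"
}
"""

# Assumption for illustration: these keys form the predictable part of the record.
STRUCTURED_KEYS = {"student_id", "name", "age", "courses"}


def split_record(raw: str) -> tuple[dict, dict]:
    """Parse a JSON record and separate predictable fields from free text."""
    record = json.loads(raw)
    structured = {k: v for k, v in record.items() if k in STRUCTURED_KEYS}
    free_form = {k: v for k, v in record.items() if k not in STRUCTURED_KEYS}
    return structured, free_form


structured, free_form = split_record(RAW_RECORD)

# Keyed fields can be used directly, with their types intact.
print(structured["name"], structured["age"], structured["courses"])

# The free-text field is just a string; extracting meaning from it
# would still require text processing.
print(free_form["comment"])
```

The keys make the first four fields directly addressable, while the comment remains free text that a program can store and display but not interpret without further analysis.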

Semi-structured data: combination of both structured and unstructured data.

Origin

Semi-structured data, though not yet called that, grew out of various issues related to the emergence of the web, among them the usefulness of expressing structured data in a semi-structured way for browsing. These issues first arose and received serious computer science study in the late 1970s and early 1980s.

The first recorded mentions of “semi-structured data” came in two academic papers from 1995: Quass et al., “Querying Semi-structured Heterogeneous Information,” and Tresch et al., “Type Classification of Semi-structured Documents.” The term became popular through two seminal 1997 papers: Abiteboul's “Querying semi-structured data” and Buneman's “Semi-structured data” [2].

Context and Usage

The flexibility of semi-structured data makes it useful across industries for purposes such as discovering customer preferences, understanding customer behavior, and spotting emerging market trends.

In health care, systems merge structured data, such as patient profiles and history, with unstructured notes or written comments from health care providers. The resulting semi-structured data allows for streamlined patient records and can improve patient diagnostics.

In e-commerce, companies gather online customer reviews, which combine unstructured text with structured data, as a way to keep an eye on product performance. Together these elements form semi-structured data that offers brands insight into customer satisfaction [3].

Why it Matters

Compared with structured and unstructured data, semi-structured data suits many modern applications because of its flexibility and ease of use: structured data can be too limiting, and unstructured data is difficult to analyze efficiently. Its built-in markers, such as tags in XML or key-value pairs in JSON, make it easier to parse and analyze than completely unstructured data. The same markers also let data analysis tools and systems ingest it more readily, improving data processing and analytics [4].
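To illustrate why tags and keys make parsing easier, here is a small sketch using only Python's standard library. The product review and its field names are hypothetical, and the same record is written once as XML and once as JSON.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical product review, expressed once as XML and once as JSON.
XML_REVIEW = (
    "<review>"
    "<product>Kettle</product>"
    "<rating>4</rating>"
    "<text>Boils fast, lid is a bit stiff.</text>"
    "</review>"
)
JSON_REVIEW = (
    '{"product": "Kettle", "rating": 4, '
    '"text": "Boils fast, lid is a bit stiff."}'
)


def rating_from_xml(doc: str) -> int:
    """Locate the value through its <rating> tag; XML text is always a string."""
    root = ET.fromstring(doc)
    return int(root.findtext("rating"))


def rating_from_json(doc: str) -> int:
    """Locate the value through its "rating" key; JSON keeps the number type."""
    return json.loads(doc)["rating"]


print(rating_from_xml(XML_REVIEW), rating_from_json(JSON_REVIEW))
```

In both cases the marker points straight at the rating, so no text mining is needed to find it. Only the free-text part of the review would need that kind of analysis.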

In Practice

MongoDB is a good real-life example of a company that works with semi-structured data. Its database is a popular choice in finance, healthcare, retail, and other sectors that need insights from large volumes of data. As a NoSQL (non-relational) database, it can handle both unstructured and semi-structured data. Because it stores data in a flexible, JSON-like format, MongoDB enables rapid development cycles and real-time responsiveness, which makes it well suited to supplying the kind of data AI applications need [5].
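A flexible, JSON-like format means that records stored together do not all need the same fields. The sketch below imitates that idea with plain Python dictionaries rather than MongoDB itself. The sample documents and the find helper are hypothetical and are not MongoDB's API.

```python
from typing import Any

# Hypothetical customer documents; note that they do not share the same fields.
customers: list[dict[str, Any]] = [
    {"name": "Ada", "city": "Enugu", "orders": 3},
    {"name": "Bayo", "city": "Lagos", "loyalty_tier": "gold"},
    {"name": "Chidi", "orders": 7, "notes": "Prefers delivery on weekends"},
]

_MISSING = object()  # sentinel, so a stored None can still be matched


def find(docs: list[dict[str, Any]], **criteria: Any) -> list[dict[str, Any]]:
    """Return the documents whose fields equal every given criterion.

    A document that lacks a requested field simply does not match;
    no schema change or placeholder value is required.
    """
    return [
        doc
        for doc in docs
        if all(doc.get(key, _MISSING) == value for key, value in criteria.items())
    ]


# Query on a field that only some documents have.
print(find(customers, city="Lagos"))

# Documents with an "orders" field, whatever else they contain.
print([doc["name"] for doc in customers if "orders" in doc])
```

Adding a new field to one document, such as loyalty_tier here, does not force any change on the others. This is the kind of flexibility that a fixed-column table would not offer without altering its schema.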
