Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure.
Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to work with.
Non-tabular data could mean anything other than tabular data, but in practice it often means text, or richer media such as images and video. It could mean more constrained data types but with a graph structure rather than a tabular structure. For example, XML and JSON files have a tree structure rather than a table structure.
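To make the tree-versus-table distinction concrete, here is a minimal Python sketch. The sample record and the helper names `depth` and `flatten` are made up for illustration and are not from any particular library. It parses a small JSON document, measures how deeply it nests, and lists each leaf value with the path that leads to it.

```python
import json

# Hypothetical sample record, invented for illustration only.
record = json.loads("""
{"patient": {"id": 17,
             "visits": [{"date": "2020-01-05", "notes": "follow-up"},
                        {"date": "2020-03-12", "notes": "discharged"}]}}
""")

def depth(node):
    """Return the nesting depth of a parsed JSON value (a leaf has depth 0)."""
    if isinstance(node, dict):
        return 1 + max((depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((depth(v) for v in node), default=0)
    return 0

def flatten(node, path=""):
    """Yield (path, value) pairs, one for each leaf of the tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from flatten(value, f"{path}[{i}]")
    else:
        yield path, node

leaves = dict(flatten(record))
for path, value in leaves.items():
    print(path, "=", value)
```

A table would have one row per subject and a fixed set of columns. Here the number of leaves depends on how many visits a patient has, which is the sense in which the structure is a tree rather than a table.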
There’s also overlap between tabular and non-tabular data. For example, DICOM images are medical images (essentially JPEG files) with tabular metadata embedded in the file. Coming from the opposite direction, relational databases often contain “blobs” (binary large objects) that are islands of unstructured data contained within structured data.
More productive discussions
My point here isn’t to quibble over language usage but to offer a constructive suggestion: say what structure data has, not what structure it doesn’t have.
Discussions about “unstructured data” are often unproductive because two people can use the term, with two different ideas of what it means, and think they’re in agreement when they’re not. Maybe an executive and a sales rep shake hands on an agreement that isn’t really an agreement.
Eventually there will have to be a discussion of what structure data actually has rather than what structure it lacks, and to what degree that structure is exploitable. Having that discussion sooner rather than later can save a lot of money.
Free text fields
One form of “unstructured” data is free text fields. These fields are not free of structure. They usually contain prose, written in a particular language, or at most in a small number of languages. That’s a start. There should be more exploitable structure from context. Is the text a pathology report? A Facebook status? A legal opinion?
Clients will ask how to de-identify free text fields. You can’t. If the text is truly free, it could be anything, by definition. But if there’s some known structure, then maybe there’s some practical way to anonymize the data, especially if there’s some tolerance for error.
For example, a program may search for and mask probable names. Such a program would find “Elizabeth” but might fail to find “the queen.” Since there are only a couple of queens [1], this would be a privacy breach. Such software would also have false positives, such as masking the name of the ocean liner Queen Elizabeth 2. [2]
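The following Python sketch illustrates both failure modes. It is a toy under stated assumptions, not a real de-identification tool: the name list `KNOWN_NAMES`, the function `mask_names`, and the `[NAME]` placeholder are all hypothetical, and a practical system would need a far larger dictionary plus context-aware rules.

```python
import re

# Hypothetical, deliberately tiny dictionary of first names (an assumption
# for illustration; real tools use much larger lists and other signals).
KNOWN_NAMES = {"Elizabeth", "Margrethe"}

# Match any known name as a whole word.
NAME_PATTERN = re.compile(r"\b(" + "|".join(sorted(KNOWN_NAMES)) + r")\b")

def mask_names(text, placeholder="[NAME]"):
    """Replace every dictionary name found in text with a placeholder."""
    return NAME_PATTERN.sub(placeholder, text)

# True positive: the name is in the dictionary, so it gets masked.
caught = mask_names("Elizabeth visited the ward.")

# False negative: "the queen" identifies a person but matches no name.
missed = mask_names("The patient said the queen visited the ward.")

# False positive: the ship's name is masked as if it were a person.
ship = mask_names("He sailed on the Queen Elizabeth 2.")

print(caught)
print(missed)
print(ship)
```

Whether errors like these are acceptable depends on the tolerance for error mentioned above, which is why knowing the context of the text matters so much.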
[1] The Wikipedia list of current sovereign monarchs lists only two women, Queen Elizabeth II of the UK and Queen Margrethe II of Denmark.
[2] The ship, also known as QE2, is Queen Elizabeth 2, while the monarch is Queen Elizabeth II.
There is a nebulous zone of “could be structured” data, where the provenance is known, but the relevance is not, owing to a lack of connection to other data. The challenge is to develop a metadata framework that formalizes those aspects of the “unstructured” data that can be formalized with minimal ambiguity.
In my experience it’s seldom one-way: Old data frequently has to be reconceptualized/reinterpreted to be relevant within an expanded metadata context. And, again in my experience, the net result all too often is: “Damn. This old data is crap.”
> Clients will ask how to de-identify *free text fields*. You can’t. If the text is truly free, it could be anything, by definition. But if there’s some known structure, then maybe there’s some practical way to anonymize the data, especially if there’s some tolerance for error.
I feel like this is overselling the claim. Consider the following anonymization scheme for free text: replace the field with a constant value, say the empty string. Of course, the client complains. “Oh, you wanted the anonymization to preserve meaningful structure? But you said the field was unstructured.”
Another way of saying this is that there’s a distinction between data with no structure and data with no explicitly stated structure: in the former case, you’re out of luck, but in the latter case, you can (and need to) figure out a structure for the data from the data.