Why Unstructured Data Proves a Relentless Challenge for Developers


Business, Natural Language Processing

August 19, 2022

Ben Nussbaum

Any organization’s unstructured data is a goldmine of transformational knowledge. But only some will benefit: the ones whose developers can leave behind their traditional tools for analyzing structured data and build new frameworks.

Traditional analytics, like SQL queries on structured tables, won’t do.

The headwinds around analyzing unstructured data are already setting off high-severity alarm bells at Gartner, where analysts claim that 80 to 90 percent of enterprise data is already unstructured. All indicators point toward the proportion of unstructured data growing in the years to come, which means it’s not a problem that’s going to solve itself if organizations, and their developers, ignore it long enough.

New tools for storing, managing, and analyzing unstructured data will come and go, each promising a better way to make sense of this growing portion of your data lake. But without understanding where and why traditional analytics frameworks fail on unstructured data, you won’t be well-positioned to build the pipelines your knowledge worker peers need to make sense of that data.

The basics of unstructured data

Data is either:

structured, in that it’s formatted into a predefined structure, known as a schema, before being stored, queried through the database’s query language, and processed, or
unstructured, stored in its native format, and processed using query languages and processes outside of the database technology.

You usually pair structured data with relational databases, which use tables, columns, and rows to store data, much like an Excel spreadsheet. When you need to query and process your data into other formats, like reports or visualizations that generate organizational knowledge, you use SQL or a similar query language.

Unstructured data is best stored in non-relational databases or data lakes, where you can keep raw files without processing them first. You get the benefit of retaining the original format, with the disadvantage of needing new tools and frameworks to process it.

Unstructured data comes in a variety of formats:

Raw text strings	Word documents	PDFs
Saved or crawled web pages	Social media posts	Public news articles
Video	Audio	Customer reviews
Meeting notes or presentations	IoT sensor data	… and much more.

Let’s not forget semi-structured data like JSON, CSV, and XML, which use metadata to provide a skeletal structure for raw data. This type of data makes certain queries easier—for example, show me all posts from UserX on April 17, 2022—but doesn’t help you process and understand the content of these posts at scale.
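As a minimal sketch of that kind of metadata query, the Python below filters a handful of JSON posts by author and date. The field names (user, posted_at, text) and the sample posts are assumptions made up for illustration, not taken from any real system.

```python
import json
from datetime import date, datetime

# Hypothetical semi-structured records; the field names are assumptions.
RAW_POSTS = """
[
  {"user": "UserX", "posted_at": "2022-04-17T09:30:00", "text": "Pump 3 is making that noise again."},
  {"user": "UserY", "posted_at": "2022-04-17T11:05:00", "text": "All good on line 2."},
  {"user": "UserX", "posted_at": "2022-04-18T08:00:00", "text": "Replaced the seal, watching it."}
]
"""


def posts_by_user_on(posts, user, day):
    """Return posts whose metadata matches the given user and calendar day."""
    return [
        post
        for post in posts
        if post["user"] == user
        and datetime.fromisoformat(post["posted_at"]).date() == day
    ]


posts = json.loads(RAW_POSTS)
matches = posts_by_user_on(posts, "UserX", date(2022, 4, 17))
```

The filter only ever touches the user and posted_at fields. The text field comes back untouched, which is exactly the part that still needs interpreting.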

No matter the type of data or the storage infrastructure, most organizations process only 0.5% of their unstructured data, according to reports from IDC. That’s an enormous cost to your organization, both in monthly invoices from your storage provider of choice and in lost opportunities to derive valuable knowledge.

Why is this the case?

Why unstructured data is hard to work with

Because unstructured data is stored in its native format, you can’t write SQL queries to gather and process its content. You need additional frameworks and tools to make sense of the meaning or intent of the data itself.

Let’s use an example that’s highest on the priority list for every major industrial operation: predictive maintenance reports.

When engineers and technicians perform maintenance on a machine, they report their work into a web app, which formats their entries into a codified schema designed for storage in a relational database.

That creates standard-issue structured data:

The machine/system in question,
Who performed the maintenance,
The date, time, and duration of the work,
The specific action(s) taken, and more.

You can query this structured data using SQL to return all the maintenance done on a single machine, all the work a single engineer/technician did in a week, and so on.
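To make those two queries concrete, here is a rough sketch using Python’s built-in sqlite3 module and an in-memory database. The table name, column names, and sample rows are all hypothetical; a real maintenance app would define its own schema.

```python
import sqlite3

# Hypothetical schema and rows; every name here is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE maintenance_reports (
        id INTEGER PRIMARY KEY,
        machine TEXT NOT NULL,
        technician TEXT NOT NULL,
        performed_at TEXT NOT NULL,
        duration_minutes INTEGER NOT NULL,
        action TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO maintenance_reports "
    "(machine, technician, performed_at, duration_minutes, action) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("press-01", "avery", "2022-08-01T08:00:00", 45, "replaced belt"),
        ("press-01", "jordan", "2022-08-03T14:30:00", 30, "lubricated bearing"),
        ("lathe-07", "avery", "2022-08-04T10:15:00", 60, "calibrated spindle"),
    ],
)


def work_on_machine(machine):
    """All maintenance done on a single machine, oldest first."""
    cursor = conn.execute(
        "SELECT performed_at, technician, action FROM maintenance_reports "
        "WHERE machine = ? ORDER BY performed_at",
        (machine,),
    )
    return cursor.fetchall()


def minutes_by_technician(start, end):
    """Total minutes each technician logged in the half-open range [start, end)."""
    cursor = conn.execute(
        "SELECT technician, SUM(duration_minutes) FROM maintenance_reports "
        "WHERE performed_at >= ? AND performed_at < ? "
        "GROUP BY technician ORDER BY technician",
        (start, end),
    )
    return cursor.fetchall()
```

Because the timestamps are stored as ISO 8601 strings, plain string comparison is enough for the date range, so minutes_by_technician("2022-08-01", "2022-08-08") covers the first week of August.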

But there is one more challenging field in these maintenance reports: a freeform text input field where engineers/technicians can add notes, mention unrelated issues that deserve a follow-up, or make recommendations for future work. This text is stored as a single blob in the relational database.

As a developer, you can’t create insight or knowledge from that blob of text with just SQL. All you can do is return it and display it somewhere else, like a PowerPoint presentation, and hope someone else does the manual work of interpreting it into valuable next steps.

Traditional methods of analyzing unstructured text

But let’s say you’re ambitious and want to turn that blob of maintenance notes into something meaningful. You have two options: using traditional methods of analyzing unstructured data or wading into the complex world of natural language processing.

There are a few traditional methods to try on textual data:

Boolean logic, where you linearly scan through the entire text for a single word or phrase to return a 0 (does not contain the text) or 1 (does contain the text). This process is sometimes called “grepping,” after the tool (grep) you might use. If you string enough of these boolean expressions together using AND or OR operators, you can begin to flag certain blobs for the presence of content that deserves follow-up.
String manipulation for transforming your textual data into more manipulation-friendly formats. For example, you can extract substrings, replace/remove characters, normalize characters to lowercase, and more.
Regular expressions (regex), which let you search strings for specific search patterns and then further manipulate either the result or the full blob itself.
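The three methods above can be sketched in a few lines of Python. The sample notes, the list of trouble phrases, and the PN-1234 part-number format are all assumptions invented for this illustration.

```python
import re

# Hypothetical freeform notes; the wording and part-number format are assumptions.
NOTES = [
    "Replaced belt. Bearing on motor B sounds ROUGH, recommend follow-up. Ordered part PN-4471.",
    "Routine lubrication, nothing to report.",
    "Coolant LEAK near spindle; used part PN-0093 as a temporary fix.",
]


def normalize(text):
    """String manipulation: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def needs_follow_up(text):
    """Boolean logic: flag a note containing any known trouble phrase."""
    cleaned = normalize(text)
    return ("follow-up" in cleaned) or ("leak" in cleaned) or ("rough" in cleaned)


# Regular expression: a token shaped like PN- followed by exactly four digits.
PART_NUMBER = re.compile(r"\bPN-\d{4}\b")


def part_numbers(text):
    """Extract every part number that matches the assumed format."""
    return PART_NUMBER.findall(text)


flags = [needs_follow_up(note) for note in NOTES]
parts = [part_numbers(note) for note in NOTES]
```

Here flags comes out as [True, False, True] and parts as [["PN-4471"], [], ["PN-0093"]]. A note that read "vibration getting worse" would slip through unflagged, because nobody thought to add that phrase to the list.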

There are a handful of issues with these traditional development methods:

You can only search or extract keywords or concepts that you know you should be looking for, not the “unknown unknowns.”
They only work on text data, not images, audio, video, or anything that can’t be quickly converted into a plain string of characters.
They require trial-and-error development cycles, full of manual testing, which waste time and introduce errors that percolate into the organization via reports and knowledge work.
The outputs tend to be simplistic, which means you’re still relying on your knowledge worker peers to turn those results into meaningful insight.

Natural language processing (NLP) is another viable option for processing and analyzing your unstructured data. As a developer, you can create a set of training data from your existing database and then manually run one-off jobs to process and analyze the content of your unstructured text.

In theory, that’s a big win—you’ve finally turned a string or binary file into parseable results. In practice, these one-off efforts don’t use a comprehensive-enough corpus of training data. If you don’t regularly update the underlying model based on changes to your business or market, you will spread skewed results that create misinformed knowledge.

The biggest problem with one-off NLP tools is that, as a developer, you get stuck in a cycle of creating daisy chains of data storage, processing, and analysis tools. Instead of developing new opportunities for your organization’s knowledge workers, like repeatable queries or automatic ingest and NLP from trusted sources, you’re maintaining a sequence of complex handoffs that, to anyone else in the organization, look like a black box.

In attempting to build organizational knowledge, you create technical debt that no one understands.

Processing your unstructured data reliably and easily

The traditional ways of analyzing structured data aren’t going anywhere. As a developer, you’ll still find them invaluable for generating reports and discovering broad trends around your customers, products, or industry. But your unstructured data deserves tooling that’s equally well-suited.

NLP is part of a better future, but the real value comes when you pair NLP with a knowledge graph. Together, they let you convert unstructured data into the most important schema there is to any organization: people, places, and things, and the many fascinating ways they’re all related to each other.
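To give a feel for what "people, places, and things" look like as data, here is a toy sketch that stores relationships as (subject, relationship, object) triples in plain Python. The triples are written by hand as a stand-in for what an NLP extraction step might produce; the entity and relationship names are hypothetical, and this is not any particular product’s API or data model.

```python
from collections import defaultdict

# Hypothetical triples (subject, relationship, object), written by hand to stand
# in for the output of an NLP entity- and relation-extraction step.
TRIPLES = [
    ("Avery", "PERFORMED", "belt replacement"),
    ("belt replacement", "ON_MACHINE", "press-01"),
    ("press-01", "LOCATED_IN", "Plant A"),
    ("Avery", "REPORTED", "rough bearing"),
    ("rough bearing", "ON_MACHINE", "press-01"),
]


def build_graph(triples):
    """Index outgoing relationships by their subject node."""
    graph = defaultdict(list)
    for subject, relationship, obj in triples:
        graph[subject].append((relationship, obj))
    return graph


def neighbors(graph, node, relationship=None):
    """Nodes reachable from `node`, optionally limited to one relationship type."""
    return [
        obj
        for rel, obj in graph.get(node, [])
        if relationship is None or rel == relationship
    ]


def incoming(triples, node, relationship):
    """Subjects that point at `node` through the given relationship type."""
    return [s for s, rel, o in triples if o == node and rel == relationship]


graph = build_graph(TRIPLES)
```

With this in place, incoming(TRIPLES, "press-01", "ON_MACHINE") returns both the belt replacement and the rough bearing, a connection that a single row of a relational table would not surface on its own.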

GraphGrid is a developer-focused platform of connected data tools and processes that help developers build and collaborate on knowledge graphs. It comes with built-in NLP workflows for converting unstructured data into nodes and relationships in context. The more data you process, the better your analytical models get, and you can enhance your team’s understanding with sentiment analysis, similarity scoring, named entity recognition (NER), and much more.

Download GraphGrid for free to start building a robust framework for deriving indispensable knowledge from your vast ocean of unstructured data. If you’re still not seeing the big picture on how NLP and knowledge graphs will help you dig into that goldmine of unstructured data, schedule a GraphGrid demo.