Can you remember the last time you changed the way you spoke or wrote based on who was on the other end? It likely happens more often than you think. In the morning, you might talk to your children and spouse one way, then shift your style once you get to the office (or the first Zoom call of the day). Just before lunch, you fire off an email to a customer using a professional tone and a handful of technical jargon, then pull out your phone and send a casual text message to a friend.
Whether you recognize it or not, you’re constantly adjusting your approach based on your context and audience. And when you’re on the receiving end, you do the same, framing what you read or hear based on whether it’s, say, technical documentation or a social media post. You identify names, phrases, and tones differently.
Natural language processing (NLP) is no different. You need the right combination of models, domains, and expertise from your data and analytics teams to get meaningful, high-quality results. And while that sounds like an extraordinary ask given NLP’s complexity and the “black box” quality of its underlying algorithms, the good news is that the whole process is much easier to understand, and implement, once you break down what “the right tool in the right hands” means for NLP.
That’s how you expand your analytical toolkit and dive into the huge swaths of unstructured data you likely have sitting unutilized in a data lake or warehouse to create new differentiators, insights, or entire products.
What’s an NLP domain?
NLP is a broad field applied to a broad range of applications, all of which have unique linguistic characteristics. You can’t apply the same processing logic against all natural language and expect to get meaningful results. This is the concern of NLP domains: What types of data are you trying to process, and what do high-quality, relevant results look like?
Domains become relevant when your data isn’t the same as everyday language because the meaning and relevance of our words depend entirely on the context.
We’re surrounded by useful examples of this, like how we talk face-to-face versus how we communicate over email or text messages. The phrase “an act of God” has a precise and legally binding meaning in the development and enforcement of contracts and insurance policies in the English-speaking world, but carries a vastly different meaning in religious texts or day-to-day conversation.
Or take an example that’s very relevant to the future of NLP: the domain of healthcare, where unstructured text data covering conditions, medications, histories, and procedures looks and behaves differently from other types of natural language. Your definition of relevant, high-quality results is dramatically different, too.
Domains dictate what models you create and apply to the data you have and the problems you’re trying to solve.
How does an NLP model differ?
Before we can talk about specific NLP models, we need to explain model types, each of which is concerned with understanding a particular aspect of your textual data. If domains are concerned with the what of your data, then model types focus on the how: how can you break down a sentence given the linguistic characteristics of its language and domain?
Named Entity Recognition (NER) is one of the better-known and easier-to-understand NLP model types: it analyzes textual data to determine which names and phrases are proper nouns and whether they belong to people, places, or formal “things.” NER operates on tokens, the smallest units of words or characters useful for processing, and labels the tokens it recognizes as entities.
We call this a model type because there are many ways to train a NER model, and every domain requires different implementations of the same model type. The actual NLP model is one implementation of a model type given a specific domain.
[This could be a good opportunity for some type of diagram/visual aid to demonstrate the differences and crossover between a domain, model type, and model. I’m not sure what type of diagram is most accurate and effective, but I’m thinking of a Venn diagram, with model type as one circle, domain as another circle, and model at the intersection of those two.]
Let’s walk through an example of a particular model to explain how all this works. This model is designed to identify people and things in popular culture, and we feed it this example sentence: “Tom Cruise is a Golden Globe-winning American actor and producer best known for film franchises like Top Gun and Mission: Impossible.”
The model identifies a handful of tokens and groups them based on the categories it’s been trained on:
- person: Tom Cruise
- organization: Golden Globe
- nationality: American
- film: Top Gun, Mission: Impossible
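To make the mechanics concrete, here’s a minimal sketch of how that input-to-output step might look. A real NER model learns these mappings from labeled training data; this toy version hard-codes a tiny lookup table of phrases and categories purely for illustration, so the names and structure here are assumptions, not a real model’s API.

```python
# Toy entity tagger: a trained NER model learns which phrases are
# entities from labeled examples; here we hard-code a tiny gazetteer
# (phrase -> category) purely for illustration.
GAZETTEER = {
    "Tom Cruise": "person",
    "Golden Globe": "organization",
    "American": "nationality",
    "Top Gun": "film",
    "Mission: Impossible": "film",
}

def tag_entities(text):
    """Return {category: [matched phrases]} for known phrases in text."""
    results = {}
    for phrase, category in GAZETTEER.items():
        if phrase in text:
            results.setdefault(category, []).append(phrase)
    return results

sentence = ("Tom Cruise is a Golden Globe-winning American actor and "
            "producer best known for film franchises like Top Gun "
            "and Mission: Impossible.")
print(tag_entities(sentence))
```

The interesting work in a real model is everything this sketch skips: tokenizing the text, disambiguating phrases by context, and handling entities it has never seen before.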
These results, driven by this particular NER model, are also just the beginning. To generate knowledge or insights at scale, you need to integrate this data with a knowledge graph of similar results or combine them with structured data to create a formal schema that can be queried and analyzed further.
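One way to picture that integration step is to flatten the model’s entity output into subject-relation-object triples that a graph database could ingest. The relation names below are invented for illustration; the mapping from entity category to relation is exactly the kind of schema decision your teams would make for their own domain.

```python
def entities_to_triples(subject, entities):
    """Convert {category: [phrases]} NER output into graph triples."""
    # Map illustrative entity categories to invented relation names;
    # a real schema would be designed around your domain.
    relations = {
        "organization": "AWARDED_BY",
        "nationality": "HAS_NATIONALITY",
        "film": "APPEARED_IN",
    }
    triples = []
    for category, phrases in entities.items():
        relation = relations.get(category)
        if relation:  # skip categories (like the subject) with no relation
            for phrase in phrases:
                triples.append((subject, relation, phrase))
    return triples

ner_output = {
    "person": ["Tom Cruise"],
    "organization": ["Golden Globe"],
    "film": ["Top Gun", "Mission: Impossible"],
}
for triple in entities_to_triples("Tom Cruise", ner_output):
    print(triple)
```

Once results from many documents land in the same graph, queries like “which actors appear in the same franchises?” become simple traversals rather than fresh text-processing jobs.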
And NER models are just the tip of the NLP iceberg—model types come in many shapes and have broad applications:
- Key Phrase Extraction: Pinpoint the key concepts in a text to understand its context without extensive reading and research.
- Similarity Sorting: Compare two documents to determine whether and how they’re related.
- Relationship Extraction: Detect which tokens are related to one another and how.
- Coreference Resolution: Find all expressions that refer to a given entity in a text, such as identifying which pronouns belong to which person.
- Paraphrase Detection: Convert two sentences into vector representations and determine whether one paraphrases the other.
- Document Origination: Create and maintain an ordered list that identifies which documents derive from which others.
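To make one of these concrete, the simplest possible version of similarity sorting is a bag-of-words cosine similarity: count the words each document shares and normalize. Real models use far richer representations (like learned embeddings), so treat this as a sketch of the idea, not the technique production systems rely on.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Bag-of-words cosine similarity between two documents (0 to 1)."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("the patient reported chest pain",
                        "patient reported severe chest pain"))
```

Even this toy scorer shows why domain matters: two clinical notes can share almost no surface vocabulary with a pop-culture article yet be nearly identical to each other.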
Selecting the right model type for your needs is just the beginning.
Why domains and models matter to the quality of your NLP results
As you head down the path of defining your NLP solution, your NLP domain should be easier to define than the ideal model type to apply. The domain is the what of your business: the industry you operate in, the types of unstructured data you store, and the big-picture business goals you’d like to achieve in processing that data.
But it’s imperative you identify this domain early and ensure the models you’ll eventually implement are trained using data from the same domain. Applying a NER model trained on a domain of people, places, and things will fail spectacularly in a medical context. It might even output results that create incorrect or dangerous knowledge.
If your organization is just beginning its journey into NLP, the easiest route toward quickly generating relevant, valuable results is implementing a base model around the type of processing you’re interested in. To extend the medical example, you start with a base model, trained on a corpus of healthcare data, that identifies the linguistic characteristics you’re most interested in, like details of patient vitals and symptoms. This base model won’t be perfect, as it’s not tailored specifically for your organization, but it will kickstart a valuable NLP framework without leading anyone astray.
The process of refining and optimizing that base model into one fine-tuned for your organization, its data, and its needs typically runs on two tracks.
First, there’s the business challenge or unmet opportunity that inspired the use of NLP in the first place. Refining your models beyond a baseline level of quality requires business teams to continually communicate how NLP models perform across the organization or for end users. They’re processing data, generating insights, and implementing new techniques, then translating new goals for the technical teams.
Second, you have the technical operations of refinement. As soon as you begin enhancing a base model, your development and data science teams need to version control, effectively deploy, and execute your custom models in an iterable process, just as they would with other end-user applications. They also need a common data structure for their training data, which not only simplifies the training process but also lets them reuse training data across multiple models. The more models you can train with the same data format, the better.
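As a sketch of what such a shared structure can look like, the record below pairs raw text with character-offset entity spans. The exact schema is an assumption, loosely modeled on common annotation formats; the point is that offset-based spans keep the annotations reusable across model types because they never alter the underlying text.

```python
# Hypothetical shared training-data record: raw text plus labeled spans.
# Character offsets let multiple model types reuse the same annotations
# without ever modifying the source text.
record = {
    "text": "Tom Cruise starred in Top Gun.",
    "entities": [
        {"start": 0, "end": 10, "label": "person"},
        {"start": 22, "end": 29, "label": "film"},
    ],
}

def span_text(record, span):
    """Recover the surface text a span annotation points at."""
    return record["text"][span["start"]:span["end"]]

for span in record["entities"]:
    print(span["label"], "->", span_text(record, span))
```

A relationship-extraction model, a coreference model, and an NER model could all train from records shaped like this, which is exactly the reuse the paragraph above argues for.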
Business-technical collaboration only pays dividends if there’s a good developer experience, too. Think about how easily your NLP service integrates with and ingests data from your data lake or warehouse, whether the output format is queryable, and how easily your developers can build the visualizations (graphs, charts, and more) that their knowledge-worker peers expect.
Time to align your NLP domains, models, and goals
The deeper your organization dives into NLP, the more often you’ll go through the continuous cycle of re-training your models based on new, relevant training data for your domain. That’s how your data and analytics teams differentiate the quality of their insights, and the knowledge they generate, from others who might still be stuck on the base model.
GraphGrid is a suite of connected data services that help these teams streamline this cycle by providing base NLP models, integrations with their existing data sources, and a developer experience that encourages continuous improvement and more meaningful results. GraphGrid accelerates your journey into optimized NLP models and delivers results in graph-enhanced solutions for more immediate business impact and knowledge generation.