Investigating the Ukraine War with Change Data Capture and Open-Source Intelligence

Change data capture, Open Source Intelligence

July 19, 2022

Ben Nussbaum

When your organization’s data is changing constantly, you can’t afford to work with it manually. That means:

No historical batch analysis on structured data that’s already out-of-date by the time it’s processed.
No running the same query or looking at the same dashboard repeatedly throughout the day, hoping for a different answer.
No data entry or manual efforts to create and store a new connection between person A and thing B.

A connected and collaborative knowledge graph frees knowledge workers from many of the time-consuming tasks around ingesting, processing, and analyzing data. But in its default state, a knowledge graph is still a “pull”-based system: you have to ask it for information.

A reactive data architecture, enabled partly by change data capture (CDC), can work alongside your collaborative knowledge graph to automatically push new, relevant nodes or changed relationships to knowledge workers or the applications they use regularly. It flips the script and gives real-world meaning to always-changing data with little effort.

To illustrate how a reactive data architecture and CDC work with a knowledge graph, let’s talk about open-source intelligence (OSINT).

Before: Open-source intelligence with a knowledge graph

Let’s say you work for a humanitarian organization named ASDF. You’re part of an intelligence and analytics team responsible for conducting OSINT on the ongoing war in Ukraine, with a particular interest in analyzing any new articles, quotes, images, or data from an official Russian media source.

ASDF already uses a collaborative knowledge graph, which frees the team from the challenges of the SQL-based data lake and various data warehouses they previously used to manually piece together connections between people, places, and things. With the knowledge graph, the ASDF team has also safely escaped the dangers of the “knowledge graph of one” and unlocked new value-added benefits like data durability and a smoother developer experience.

What is open-source intelligence?

Open-source intelligence, often abbreviated as OSINT, is the practice of deriving intelligence or insights based on electronic data and information that’s publicly available on the internet or through human sources.

OSINT encapsulates various information types, sources, and schemas, including news media content, social media content, databases, open-source code, images, personal information (phone numbers, email addresses, etc.), and even meeting notes or presentations. As long as the data is publicly available—even behind a username/password—and you’re not using clandestine or illegal techniques to acquire the data, it’s under the OSINT umbrella.

The data used might be openly accessible, but OSINT still requires technical prowess. You might need to write a custom data-scraping tool or run data through additional tools to process, clean, and normalize it before ingesting it into your knowledge graph.

The biggest challenge with effective OSINT isn’t finding data to work with; it’s knowing which information is most valuable to the solution you’re trying to develop or the insight you’re trying to reach. If you’re not careful, it’s easy to fall into the trap of data-hoarding, which makes generating knowledge even more difficult.

As part of the OSINT team, you’ve been building your collaborative knowledge graph, which now stores the people, places, and things related to the war in Ukraine and the greater Russian media landscape. Because a war zone and the media response to it are both changing in real-time, keeping up with the constant change is a challenge for your OSINT team.

You’ve set up some degree of automation to streamline data acquisition: you have a list of RSS feeds from various Russian media sources, which removes the need to constantly comb through known sources for updates. However, you and your team still need to manually read these new articles, images, databases, and more for details that might prove useful to the knowledge graph. You then have to input them into the knowledge graph and manually run natural language processing (NLP) text analysis to turn unstructured text into structured people, places, and things through named entity recognition (NER), key phrase analysis, relation extraction, and similarity clustering.

After: Open-source intelligence with change data capture

Despite the innate advantages of the knowledge graph over SQL queries and data lakes, your OSINT team at ASDF isn’t pleased with having to manually read articles for relevance and input them into the knowledge graph for NLP-based analysis. Given how quickly the situation on the ground changes, your team wants to automate this part of the toolchain to remove time-consuming manual tasks and focus on exploring relationships between people, places, and things after they’ve been supplemented with the latest data.

First, you need to automate the process of ingesting RSS feeds and transforming their items into nodes in your knowledge graph. With a connected data platform like GraphGrid, you can create an RSS feed policy, which combines an RSS feed and CSS selectors to extract full article content from the URL. Then, using Geequel, the Graph Query Language (GQL) for ONgDB, you filter and transform the properties relevant to the article’s text, like the author, source, date, and more.
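To make the feed-to-node step concrete, here is a minimal sketch in Python using only the standard library. It is not GraphGrid's RSS feed policy or Geequel: the sample feed, the `feed_to_article_nodes` function, and the node field names are all assumptions made for illustration, and the CSS-selector step that fetches full article content is left out.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed used only for this illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Wire</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/a1</link>
      <guid>a1</guid>
      <author>editor@example.com (A. Writer)</author>
      <pubDate>Tue, 19 Jul 2022 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/a2</link>
      <guid>a2</guid>
      <pubDate>Tue, 19 Jul 2022 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def feed_to_article_nodes(feed_xml: str) -> list[dict]:
    """Turn each RSS <item> into a plain dict shaped like an Article node."""
    channel = ET.fromstring(feed_xml).find("channel")
    source = channel.findtext("title", default="")
    nodes = []
    for item in channel.findall("item"):
        published = item.findtext("pubDate")
        nodes.append({
            "label": "Article",
            # Fall back to the link when the feed omits a guid.
            "id": item.findtext("guid") or item.findtext("link"),
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "author": item.findtext("author"),  # None when absent
            "source": source,
            # RSS dates use the RFC 822 format; normalize them to ISO 8601.
            "published": parsedate_to_datetime(published).isoformat()
            if published else None,
        })
    return nodes
```

Calling `feed_to_article_nodes(SAMPLE_FEED)` yields one dict per item, each carrying the author, source, and date properties mentioned above. In a real deployment those dicts would be written to the graph rather than returned.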

You can customize the refresh rate of this RSS ingest process, but it defaults to 120 seconds to minimize the lag between the release of new information and the generation of new nodes or relationships in your knowledge graph.
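The refresh cycle can be pictured as a polling loop that remembers which items it has already seen. The sketch below is an assumption about how such a loop might look, not GraphGrid's implementation; `poll_once`, `run_polling`, and their parameters are hypothetical names, and only the 120-second default comes from the text.

```python
import time
from typing import Callable, Iterable

REFRESH_SECONDS = 120  # default interval mentioned in the text


def poll_once(fetch_ids: Callable[[], Iterable[str]], seen: set) -> list:
    """Return item IDs not seen in earlier polls and remember them."""
    new_ids = []
    for item_id in fetch_ids():
        if item_id not in seen:
            seen.add(item_id)
            new_ids.append(item_id)
    return new_ids


def run_polling(fetch_ids, handle_new, interval=REFRESH_SECONDS,
                max_polls=None, sleep=time.sleep):
    """Poll the feed repeatedly, handing each new item to handle_new.

    max_polls and sleep are injectable so the loop can be bounded and
    exercised without actually waiting.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for item_id in poll_once(fetch_ids, seen):
            handle_new(item_id)
        polls += 1
        # Only wait if another poll will follow.
        if max_polls is None or polls < max_polls:
            sleep(interval)
```

A shorter interval reduces lag at the cost of more requests to the publisher; tracking seen IDs is what keeps repeated polls from creating duplicate nodes.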

Once you enable automatic RSS ingestion in a platform like GraphGrid, you can also enable NLP data extraction based on what your knowledge graph has ingested and created via RSS feeds. In GraphGrid, this is called Continuous Processing (CP), which ensures that NLP processes every new text node added to your knowledge graph.

To do this, you create a document policy, which defines exactly how you want the NLP service to use textual data in your graph. From there, you can enable NLP features like similarity clustering, article summarization, NER, sentiment analysis, and much more.

While the RSS ingest process takes care of adding text nodes to your knowledge graph, the NLP process extracts, processes, and annotates those text nodes with additional information and insights. It adds new people, places, and things—plus the way they’re related to each other—directly into your knowledge graph without any manual intervention.
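As a rough picture of what "annotating a text node" produces, the sketch below swaps real NER for a simple dictionary lookup (a gazetteer). That is a deliberate simplification: a production NLP service would use trained models, and the `GAZETTEER` contents, the `annotate` function, and the `MENTIONS` relationship type are all hypothetical.

```python
import re

# Hypothetical watchlist standing in for a trained NER model.
GAZETTEER = {
    "Kyiv": "Place",
    "Kharkiv": "Place",
    "United Nations": "Organization",
}


def annotate(article_id: str, text: str) -> dict:
    """Find known entity names in text and link them to the article."""
    entities = []
    for name, label in GAZETTEER.items():
        # Whole-word match so "Kyiv" does not match inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            entities.append({"label": label, "name": name})
    relationships = [
        {"from": article_id, "type": "MENTIONS", "to": entity["name"]}
        for entity in entities
    ]
    return {"entities": entities, "relationships": relationships}
```

The point of the sketch is the output shape: entity nodes plus relationships back to the source article, which is what lets later queries connect people, places, and things across many articles.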

This already seems like a massive improvement to ASDF’s OSINT workflows, but all this automation work is just what sets the stage for change data capture to enter the picture.

With CDC and a reactive data architecture, each change to your graph, whether created by the RSS ingest process or NLP processing, creates transaction data that triggers GraphGrid Fuze, the CDC integration within GraphGrid. Fuze collects the incoming transaction data, transforms it as configured, and forwards the messages to any number of message brokers and consumers. From there, there’s no limit to what ASDF can do with these messages. The simplest option is to notify the OSINT team about new and relevant articles added to the knowledge graph or about changed entities. That alone automates the entire cycle of discovering, reading, and inputting the endless stream of new content related to the war in Ukraine.
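The collect, transform, and forward flow described above can be sketched with an in-memory stand-in for a message broker. This is not the Fuze API: `ChangeEvent`, `Distributor`, and `to_message` are hypothetical names, and a real setup would forward to an external broker rather than call Python functions directly.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ChangeEvent:
    """One change to the graph, as it might appear in transaction data."""
    operation: str   # "create", "update", or "delete"
    label: str       # node label, e.g. "Article"
    properties: dict


@dataclass
class Distributor:
    """Transforms each change event and fans it out to every consumer."""
    transform: Callable[[ChangeEvent], dict]
    consumers: list = field(default_factory=list)

    def subscribe(self, consumer: Callable[[dict], None]) -> None:
        self.consumers.append(consumer)

    def publish(self, event: ChangeEvent) -> dict:
        message = self.transform(event)
        for consumer in self.consumers:
            # Each consumer gets its own copy so one cannot mutate another's.
            consumer(dict(message))
        return message


def to_message(event: ChangeEvent) -> dict:
    """Keep only the fields downstream consumers need."""
    return {
        "op": event.operation,
        "label": event.label,
        "title": event.properties.get("title"),
    }
```

A notification consumer is then just a function subscribed to the distributor, for example one that appends each message to a team inbox. Because the producers of changes never call consumers directly, new consumers can be added without touching the ingest or NLP steps.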

Beyond notifications, you can update your knowledge graph’s textual search index to help your OSINT team discover the people, places, and things developing around relevant keywords, trigger an external AI platform that identifies the most notable changes and pushes notifications to your team, update dashboards or visualizations, or even connect graph data to other databases.
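To illustrate the search-index case, here is a sketch of a consumer that keeps a small inverted index in step with change messages. The message shape (`op`, `id`, `text`) and the `SearchIndexConsumer` class are assumptions for this example; a real deployment would more likely update a dedicated search service.

```python
import re
from collections import defaultdict


class SearchIndexConsumer:
    """Keeps a keyword -> node IDs index current as change messages arrive."""

    def __init__(self):
        self.index = defaultdict(set)

    def __call__(self, message: dict) -> None:
        node_id = message["id"]
        # Drop stale entries first so updates and deletes leave no leftovers.
        for ids in self.index.values():
            ids.discard(node_id)
        if message["op"] != "delete":
            for token in re.findall(r"\w+", message.get("text", "").lower()):
                self.index[token].add(node_id)

    def search(self, keyword: str) -> list:
        """Return the sorted IDs of nodes whose text contains the keyword."""
        return sorted(self.index.get(keyword.lower(), ()))
```

Since the index is rebuilt per node on every message, it never needs a full re-scan of the graph. Clearing stale entries with a pass over the whole index is fine for a sketch but would be replaced by a reverse lookup at scale.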

All completely automated, in real-time, based on a system-wide awareness of every change.

Reducing manual work with GraphGrid

While we’ve been using OSINT as an example, the value of a reactive data architecture and CDC isn’t restricted to this use case. Any organization that spends time reading sources in an environment that changes on a rolling, real-time basis can leverage a reactive data architecture, CDC, and NLP to automate those time-consuming steps.

Make the consumers of CDC messages do the hard work and give your knowledge workers pre-filtered data about only the meaningful changes to your organization’s data landscape.
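One way to picture that pre-filtering is a small wrapper that only passes along messages matching a rule. The rule below (creates and updates on a watched set of labels) is purely an assumption for illustration, as are the names `WATCHED_LABELS`, `is_meaningful`, and `filtered`.

```python
# Hypothetical rule: only these node labels matter to the analysts.
WATCHED_LABELS = {"Person", "Organization"}


def is_meaningful(message: dict) -> bool:
    """Treat creates and updates on watched labels as worth surfacing."""
    return (
        message.get("op") in {"create", "update"}
        and message.get("label") in WATCHED_LABELS
    )


def filtered(consumer, predicate=is_meaningful):
    """Wrap a consumer so it only receives messages the predicate accepts."""
    def wrapper(message: dict) -> None:
        if predicate(message):
            consumer(message)
    return wrapper
```

Wrapping a notification consumer as `filtered(notify)` means analysts see only the changes that pass the rule, while unfiltered consumers such as a search index can still receive everything.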

And while a reactive data architecture might seem like complicated infrastructure reserved for enterprise IT teams, GraphGrid includes everything you need to enable CDC on graph data in a single package. With minimal configuration, GraphGrid Fuze Distributor processes transaction data and pushes it to your analytical workhorses, both human and computer, to ensure you’re creating knowledge from the most up-to-date data.

You can download GraphGrid entirely for free, or schedule a demo if you’d like to see how a knowledge graph and CDC could integrate with your existing data.