
Neo4j Data Pipeline

February 01, 2016

Every enterprise has a constant flow of new data that needs to be processed and stored, and a data pipeline is an effective way to do it. When Neo4j is introduced into an enterprise data architecture, it becomes necessary to transform and load data into the Neo4j graph database efficiently. Doing this at scale with the enterprise integration patterns involved requires an intimate understanding of Neo4j write operations, along with routing and queuing frameworks such as Apache Camel and ActiveMQ. Managing this complexity proves to be a common challenge from enterprise to enterprise.

One of the common needs we’ve observed over the years is that an enterprise that wants to move forward efficiently with a Neo4j graph database must be able to rapidly create a reliable, robust data pipeline that can aggregate, manage, and write its ever-increasing volumes of data. The primary reason is to make it possible to write data consistently and reliably at a known flow rate. Solving this once and providing a robust solution for all is the driving force behind the creation of GraphGrid Data Pipeline.

GraphGrid Data Pipeline

The GraphGrid Data Platform offers a robust data pipeline that manages high write throughput to Neo4j from varying input sources. The pipeline manages batch operations, handles writes of highly connected data, throttles data flow, and carries out error handling.
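As a rough illustration of batch operations management, incoming records can be grouped into fixed-size batches so that each Neo4j transaction commits a bounded number of writes. This is a minimal sketch, not the GraphGrid implementation; the batch size of 1000 is a hypothetical tuning value:

```python
from itertools import islice

def batched(records, batch_size=1000):
    """Yield successive fixed-size batches so each Neo4j transaction
    commits a bounded number of writes (batch_size is a tuning knob)."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each batch would then be committed in a single transaction,
# e.g. via a parameterized Cypher UNWIND statement.
batches = list(batched(range(2500), batch_size=1000))
```

Keeping transactions to a bounded size prevents any single commit from holding locks, or memory, for too long.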

Concurrent Write Operation Management

GraphGrid’s data pipeline handles concurrent write operations for any incoming data through strategies that preserve transactional integrity, size transaction batches appropriately, and throttle data flow. The majority of writes to Neo4j work well concurrently, but in scenarios where dense nodes are involved, sequential strategies can be utilized to avoid excessive write retries. The pipeline also supports numerous concurrent processes writing data into the Neo4j graph database in parallel.
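One way to combine the two strategies is to route any write that touches a known dense node onto a single sequential queue while all other writes stay on the concurrent path. The sketch below illustrates that routing idea only; the `is dense` check via a node-id set and the write-dict shape are assumptions for illustration:

```python
import queue

class WritePipeline:
    """Route writes touching dense nodes to a sequential queue
    (one consumer, so no lock contention on those nodes) while
    other writes go to a concurrent queue (many consumers)."""

    def __init__(self, dense_node_ids):
        self.dense = set(dense_node_ids)
        self.sequential = queue.Queue()
        self.concurrent = queue.Queue()

    def route(self, write):
        # A write touching any dense node is serialized to avoid
        # the retry storms that concurrent dense-node writes cause.
        if self.dense & set(write["nodes"]):
            self.sequential.put(write)
            return "sequential"
        self.concurrent.put(write)
        return "concurrent"
```

A dedicated consumer thread would drain the sequential queue one transaction at a time, while a worker pool drains the concurrent queue in parallel.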

Continuous Write Flow

GraphGrid’s data pipeline consistently maintains an uninterrupted write flow of connected data through robust deadlock handling: a sequential fallback process with automated retry. Its deadlock auto-detection improves transaction throughput by minimizing the rollbacks that would otherwise occur during highly concurrent write scenarios.
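The automated-retry idea can be sketched as a wrapper that retries a transaction function with exponential backoff when a deadlock is detected, so the rollback is absorbed inside the pipeline instead of surfacing to the producer. The `DeadlockDetected` exception here is a stand-in for the driver’s transient deadlock error; the retry count and delays are hypothetical:

```python
import time

class DeadlockDetected(Exception):
    """Stand-in for a transient deadlock error from the database driver."""

def write_with_retry(tx_fn, max_retries=5, base_delay=0.05):
    """Run a transaction function, retrying on deadlock with
    exponential backoff so the conflicting transaction can finish."""
    for attempt in range(max_retries):
        try:
            return tx_fn()
        except DeadlockDetected:
            time.sleep(base_delay * 2 ** attempt)  # back off, then retry
    raise RuntimeError(
        "transaction failed after %d deadlock retries" % max_retries)
```

Backing off before retrying gives the transaction holding the contested locks time to commit, which is what keeps rollbacks from cascading under heavy concurrency.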

Data Throttling

GraphGrid’s data pipeline also throttles the data flowing into Neo4j, preserving reasonable access to Neo4j resources for outside applications that need write access. This is useful when the pipeline feeds a production database backing customer-facing applications, where resources must be preserved for real-time application response. The throttle can be tuned to the load Neo4j can safely sustain across all systems and applications using it: opening the pipeline lets updates flow as fast as Neo4j can write them, while narrowing it from its peak reduces the write load on the system.
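A common way to implement this kind of throttle is a token bucket: the pipeline may only commit when a token is available, and tokens refill at a configured rate. This is a generic sketch of the technique, not GraphGrid’s implementation; `rate` and `capacity` are the tuning knobs that correspond to widening or narrowing the pipeline:

```python
import time

class TokenBucket:
    """Throttle writes: a commit is allowed only if a token is
    available; tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # sustained writes per second
        self.capacity = float(capacity)  # burst ceiling
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self, n=1):
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst ceiling.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller waits; write load on Neo4j stays bounded
```

Raising `rate` opens the pipeline; lowering it narrows the pipeline and leaves headroom for the customer-facing applications sharing the database.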

Error Handling

The GraphGrid data pipeline keeps data flowing by handling errors during transaction updates through its quarantine and automated resolution processes. The automated resolution capability respects transactional integrity when specified, but gains flexibility in its retry strategy through transaction decomposition, which quarantines the failing statements in a transaction while successfully posting the others.
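The decomposition-and-quarantine idea can be sketched as follows: when a batch cannot be applied as a unit, replay it statement by statement, setting aside only the offenders so the rest of the batch still lands. The `execute` callable here is a hypothetical hook that runs one statement and raises on error:

```python
def apply_with_quarantine(statements, execute):
    """Apply statements individually, quarantining failures so the
    healthy statements in a decomposed transaction still commit."""
    applied, quarantined = [], []
    for stmt in statements:
        try:
            execute(stmt)
            applied.append(stmt)
        except Exception as err:
            # Held aside with its error for automated resolution later.
            quarantined.append((stmt, str(err)))
    return applied, quarantined
```

Note that this trades atomicity for availability, which is why the pipeline only decomposes a transaction when strict transactional integrity has not been specified for it.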

In our experience, these components of the GraphGrid data pipeline enable an enterprise to get its data connected and flowing efficiently into its Neo4j graph database very quickly, so it can begin realizing the business value of its connected data.