Neo4j supports loading data through a variety of data loading APIs and tools. For processes where big data sets flow in or out of the Neo4j graph database, read and write operations need to be batched into sizes that are sympathetic to the master instance’s memory capacity as well as the transactional overhead of data writes.
Neo4j provides a number of APIs to import big data sets, including:
- the Cypher transactional endpoint, which uses the Cypher query language and is simple to use from any programming language, because files containing Cypher (CQL) statements can be structured to bulk load data and write it consistently.
- the Cypher data import capabilities exposed through LOAD CSV, which let CSV files from a specified remote or local URL be loaded and batched into suitable transaction sizes so that massive data sets import efficiently.
- the batch inserter, which removes transactional overhead but requires the database to be offline.
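As a rough illustration of the first option, the sketch below builds the JSON body that the transactional HTTP endpoint accepts: a list of statements, each carrying a Cypher string and its parameters. It assumes the legacy path `/db/data/transaction/commit` (newer Neo4j releases use a different path), and the `Person` label, the property names, and the helper `build_commit_payload` are made up for the example. Nothing is sent over the network.

```python
import json

# Assumed endpoint path for older Neo4j releases; check the docs for your version.
COMMIT_PATH = "/db/data/transaction/commit"

# Hypothetical statement: one parameterized query that writes many rows at once.
MERGE_PEOPLE = (
    "UNWIND $rows AS row "
    "MERGE (p:Person {id: row.id}) "
    "SET p.name = row.name"
)


def build_commit_payload(rows):
    """Return the JSON body for a single begin-and-commit request."""
    body = {
        "statements": [
            {"statement": MERGE_PEOPLE, "parameters": {"rows": rows}}
        ]
    }
    return json.dumps(body)


payload = build_commit_payload([{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}])
print(f"POST {COMMIT_PATH}")
print(payload)
```

Passing the rows as a parameter rather than pasting them into the query text keeps the statement constant, so the same request shape works for any batch.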
Load Data Transactionally
To load or update data in Neo4j with efficient write throughput, a reasonable transaction size needs to be maintained consistently, based on the complexity of the writes being performed. Transactions that are too small (one or a few updated elements) suffer from transaction commit overhead. Transactions that are too large (hundreds of thousands or millions of elements) can lead to high memory use for the transient transaction state. In our experience, an adequate transaction should therefore consist of anywhere from 1k to 10k elements.
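The batching idea can be sketched as follows. This is a minimal example, not a definitive loader: `BATCH_SIZE`, `batches`, and `load` are names chosen for illustration, and `write_batch` stands in for whatever code runs one transaction against the database.

```python
from itertools import islice

BATCH_SIZE = 5_000  # assumed value inside the 1k to 10k range discussed above


def batches(items, size=BATCH_SIZE):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def load(rows, write_batch, size=BATCH_SIZE):
    """Call write_batch once per chunk; each call stands for one transaction."""
    committed = 0
    for chunk in batches(rows, size):
        write_batch(chunk)  # hypothetical callback that commits one transaction
        committed += 1
    return committed


# Stand-in writer that only records batch sizes instead of touching a database.
sizes = []
transactions = load(range(12_000), lambda chunk: sizes.append(len(chunk)))
print(transactions, sizes)  # 3 [5000, 5000, 2000]
```

Because `batches` consumes an iterator lazily, the input never has to fit in memory all at once; only one chunk is held at a time.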
Importing Initial Data
When it comes to large initial imports consisting of millions or billions of nodes, a transactional process doesn’t deliver maximum write performance. To reach full write speed, it’s important to bypass transaction semantics and create your initial data store in a “raw” manner via a batch-insertion mechanism. This of course isn’t an option once the database is online, so an optimized write pipeline that maximizes online write throughput into the Neo4j graph database becomes essential. That is why we created one as part of the GraphGrid Data Platform.
Making Use of Cypher
If you’re starting off with Neo4j, you’ll need to import data into the graph database or create some initial data to establish your graph data model. For a demo or concept model, it is often sufficient to craft a small graph via the Neo4j Web-UI or the Neo4j-Console. From there, you can build on your graph via the data browser or with Cypher “CREATE” and “MERGE” statements. Cypher also enables you to use LOAD CSV to import your initial test dataset, which we’ve found to be a very effective data loading tool.
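For a sense of what a LOAD CSV import looks like, here is a small sketch that assembles one such statement as text. It assumes a CSV file with `id` and `name` header columns and a hypothetical `Person` label, and it uses the `USING PERIODIC COMMIT` form found in older Neo4j releases; newer versions batch LOAD CSV differently, so treat the exact syntax as version dependent.

```python
def load_csv_statement(url, batch_size=1000):
    """Build a LOAD CSV statement that commits every `batch_size` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Escape characters that would end the single-quoted URL literal early.
    safe_url = url.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"USING PERIODIC COMMIT {batch_size}\n"
        f"LOAD CSV WITH HEADERS FROM '{safe_url}' AS row\n"
        # Hypothetical mapping: the CSV is assumed to have id and name columns.
        "MERGE (p:Person {id: row.id})\n"
        "SET p.name = row.name;"
    )


statement = load_csv_statement("file:///people.csv", 5000)
print(statement)
```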
Cypher, Neo4j’s graph query language, works well for updating the graph (similar to SQL insert statements for a relational database). The easiest approach to move data into or out of the native graph database with Cypher for your initial testing purposes is to generate the proper statements from your input data. That can be accomplished with a spreadsheet or any programming language by concatenating strings. You can then paste or pipe the resulting commands into the Neo4j-Shell, or have it read them from a file.
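Generating statements from input data can be sketched in a few lines. The helpers `cypher_literal` and `create_statement` below are illustrative names, not part of any Neo4j library, and the example assumes trusted labels and property keys, finite numbers, and string values without newlines.

```python
def cypher_literal(value):
    """Render a Python bool, number, or string as a Cypher literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # Escape backslashes first, then single quotes, for a single-quoted string.
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"


def create_statement(label, properties):
    """Build one CREATE statement; label and keys are assumed to be trusted."""
    props = ", ".join(
        f"{key}: {cypher_literal(val)}" for key, val in properties.items()
    )
    return f"CREATE (:{label} {{{props}}});"


# Made-up sample rows standing in for spreadsheet or file input.
rows = [{"name": "Ann", "age": 34}, {"name": "O'Neil", "age": 41}]
script = "\n".join(create_statement("Person", row) for row in rows)
print(script)
```

The printed script can be saved to a file or piped into the shell. For anything beyond a quick test load, parameterized queries are a safer choice than string building, since they avoid escaping mistakes altogether.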
Once you’ve done a simple data load to establish your data model, you’ll want to begin considering what it takes to keep that data updated and consistently flowing into your online Neo4j graph database.