I’ve been receiving many questions recently at trainings and meetups about how to effectively model time series data, with use cases ranging from hour-level precision to microsecond-level precision. In assessing the possible approaches, I landed on a tree structure as the model that best fit the problem. The two key questions I found myself asking as I built the time tree to connect the time series events were: “How granular do I really need to make this to efficiently work with and expose the time-based data being analyzed?” and “Do I need to generate all time nodes down to the desired precision level?” The balance to consider is the cost of initializing and maintaining all the time series nodes up front versus creating them dynamically as time series events require them, and the impact any missing nodes may have when querying time series events by various date and time ranges.

Neo4j: Cypher – Creating a time tree down to the day

The main modeling change in this step was to use a single CONTAINS relationship going from the higher tree level toward the lower tree level, which simplifies movement up and down the entire tree through depth-based pattern matching. Additionally, I concluded that for sub-second measurements it would be most effective to store the full precision (i.e., millisecond, microsecond, etc.) on the event node itself, but connect the event to the time tree at the second in which it occurred, because any filtering or pattern analysis is unlikely to be meaningful within a single second (at least for the use cases I’ve been hearing about).

To find a realistic dataset, I browsed data.gov and looked for cities that have open data portals. I discovered that the city of Seattle has a data portal at data.seattle.gov with a useful, large time series dataset containing all 911 fire calls received since 2010. There were about 400k entries in the CSV, each containing an Address, Latitude, Longitude, Incident Type, Incident Number and Date+Time.
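To make the model concrete, here is a minimal sketch of a time tree and the depth-based pattern matching a single CONTAINS relationship type enables. The labels and the `value` property are illustrative assumptions, not necessarily the names used in the actual import:

```cypher
// Build one year -> month -> day branch of the tree.
// MERGE keeps the branch idempotent: re-running creates nothing new.
MERGE (y:Year  {value: 2014})
MERGE (y)-[:CONTAINS]->(m:Month {value: 3})
MERGE (m)-[:CONTAINS]->(d:Day   {value: 15})
```

Because every level uses the same CONTAINS type, one variable-length pattern can walk any number of levels:

```cypher
// All nodes one or two levels below a given year
// (its months and their days) in a single match.
MATCH (y:Year {value: 2014})-[:CONTAINS*1..2]->(n)
RETURN n
```

This is the payoff of the single-relationship-type design: adding an Hour or Second level later doesn’t change the query shape, only the depth bound.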
After downloading the time series CSV, I needed to do a couple of things to get the data into a friendlier, more complete state for loading into Neo4j.
1. I removed all spaces from the column names.
2. I wanted a millisecond timestamp associated with each row in addition to the long UTC text format, so I ran a process that converted each Datetime string value to milliseconds and stored the result in a new DatetimeMS column.
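If you would rather skip the external preprocessing step, the same string-to-milliseconds conversion can be done at load time with the APOC library’s `apoc.date.parse`. This is a hedged sketch: the file name and the date format string are assumptions you would adjust to match your own CSV:

```cypher
// Preview the conversion on a few rows before committing to an import.
// "MM/dd/yyyy hh:mm:ss a" is a guess at the Datetime format —
// change it to match the actual strings in the file.
LOAD CSV WITH HEADERS FROM "file:///fire_911_calls.csv" AS row
RETURN row.Datetime AS original,
       apoc.date.parse(row.Datetime, "ms", "MM/dd/yyyy hh:mm:ss a") AS datetimeMS
LIMIT 5
```

Doing the conversion in the import query keeps the source file untouched, at the cost of re-parsing the strings on every load.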
Michael Hunger has some very useful tips on using LOAD CSV with Cypher in his post *LOAD CSV into Neo4j quickly and successfully*, which is a worthwhile read before you begin importing your own CSV data.
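Putting the pieces together, the import itself might look like the sketch below. The column names match the preprocessed CSV described above, but the label names, property names, and the idea of keying Second nodes by epoch second are illustrative assumptions about the model, not the exact statements used in the original import:

```cypher
// Create one event per CSV row and attach it to the time tree
// at second granularity; full millisecond precision stays on
// the event node itself, as discussed above.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///fire_911_calls.csv" AS row
CREATE (e:Event {
  incidentNumber: row.IncidentNumber,
  incidentType:   row.IncidentType,
  address:        row.Address,
  latitude:       toFloat(row.Latitude),
  longitude:      toFloat(row.Longitude),
  datetimeMS:     toInteger(row.DatetimeMS)   // full precision lives here
})
// Merge the second-level node on demand rather than pre-generating
// every second since 2010 — the "dynamic creation" side of the tradeoff.
MERGE (s:Second {epochSecond: toInteger(row.DatetimeMS) / 1000})
MERGE (s)-[:CONTAINS]->(e)
```

A follow-up step (or an extension of this query) would merge each Second node into its Minute, Hour, and Day ancestors so the depth-based CONTAINS queries reach the events.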
Some helpful hints for interacting with the graph visualizations:
1. Hold shift and scroll up to zoom in.
2. Hold shift and scroll down to zoom out.
3. Hold shift and click and drag to move the graph around the display area.
4. Double click to zoom in around a specific point.
5. Shift and double click to zoom out around a specific point.
Please let me know if anything needs further clarification. Also, if you’ve got a data modeling use case or challenge you’re experiencing, tell me about it in the comments. I’m always looking for interesting data challenges.