I’ve been receiving many questions recently at trainings and meetups about how to effectively model time series data, with use cases ranging from hour-level to microsecond-level precision. After assessing the possible approaches, I landed on a tree structure as the model that best fit the problem. Two key questions came up repeatedly as I built the time tree to connect the time series events: “How granular does the tree really need to be to efficiently work with and expose the time-based data being analyzed?” and “Do I need to generate all time nodes down to the desired precision level?” The trade-off to weigh is the up-front initialization and ongoing maintenance of every time series node versus creating nodes dynamically as events require them, along with whatever impact the missing nodes may or may not have when querying time series events by various date and time ranges.
I ultimately decided it would be most effective to create the hour, minute, and second level nodes only when needed to connect an event into a day, so I expanded on the work done by Mark Needham in his post Creating a time tree, which builds the tree down to the day level. The main modeling change in this step was to use a single CONTAINS relationship type going from each higher tree level to the level below it, which simplifies movement up and down the entire tree through depth-based pattern matching. Additionally, I concluded that for sub-second level measurements it is most effective to store the full precision (i.e., millisecond, microsecond, etc.) on the event node itself, but to connect the event to the time tree at the second in which it occurred, because any filtering or pattern analysis is unlikely to be meaningful within a single second (at least for the use cases I’ve been hearing about).
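As a rough Geequel sketch of what that looks like (the node labels, property names, and the OCCURRED_AT relationship type here are my own illustrations, not a prescribed schema):

```cypher
// Attach an event, creating the hour/minute/second nodes only on demand.
MATCH (:Year {value: 2014})-[:CONTAINS]->(:Month {value: 3})
      -[:CONTAINS]->(day:Day {value: 18})
MERGE (day)-[:CONTAINS]->(h:Hour {value: 14})
MERGE (h)-[:CONTAINS]->(m:Minute {value: 5})
MERGE (m)-[:CONTAINS]->(s:Second {value: 30})
// Full sub-second precision stays on the event node itself (illustrative value).
CREATE (e:Event {datetimeMs: 1395152730123})
CREATE (e)-[:OCCURRED_AT]->(s)
```

Because every level uses the same CONTAINS type, a single variable-length pattern can reach any depth, e.g. counting all events under a day regardless of which hour/minute/second nodes exist:

```cypher
MATCH (day:Day {value: 18})-[:CONTAINS*1..3]->()<-[:OCCURRED_AT]-(e:Event)
RETURN count(e)
```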
In the time series use cases I’ve been hearing about, millions of events flow through the system over very short periods of time, so I wanted to find an interesting data set of meaningful size to validate the effectiveness of the tree-based approach for modeling time series data. The open data movement in government has been gaining momentum over the last couple of years, so I began searching data.gov and looking for cities with open data portals. I discovered that the city of Seattle has a data portal at data.seattle.gov with a useful, large time series dataset containing all 911 fire calls received since 2010. The CSV held about 400k entries, each with an Address, Latitude, Longitude, Incident Type, Incident Number, and Date+Time.
After downloading the time series CSV, I needed to do a couple of things to get the data into a more friendly and complete state for loading into ONgDB.
1. I removed all spaces from the column names.
2. I wanted a millisecond timestamp associated with each row in addition to the long UTC text format, so I ran a process that inserted a second time column: it converted the Datetime string value to milliseconds and placed that value in a new DatetimeMS column.
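The two preprocessing steps above can be sketched in Python. The `Datetime` column name comes from the dataset, but the exact timestamp format (and treating it as UTC) is an assumption here; adjust `DATETIME_FORMAT` to match your export.

```python
import csv
from datetime import datetime, timezone

# Assumed input format for the Datetime column; adjust to match the export.
DATETIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def preprocess(in_path, out_path):
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)

        header = next(reader)
        # 1. Remove all spaces from the column names.
        header = [name.replace(" ", "") for name in header]
        dt_index = header.index("Datetime")
        # 2. Add a DatetimeMS column directly after the Datetime column.
        header.insert(dt_index + 1, "DatetimeMS")
        writer.writerow(header)

        for row in reader:
            # Convert the Datetime string to a millisecond timestamp (assumed UTC).
            dt = datetime.strptime(row[dt_index], DATETIME_FORMAT)
            ms = int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)
            row.insert(dt_index + 1, str(ms))
            writer.writerow(row)
```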
Michael Hunger has some very useful tips on using LOAD CSV with Geequel in his post LOAD CSV into ONgDB quickly and successfully, which is a worthwhile read before you begin importing your own CSV data.
I’ve included an interactive time series graphgist below with commented Geequel for the generation queries and a couple of exploration queries. You’ll notice that I’m only loading the first 1k lines here in this post; that is just to keep the data set small enough to process and render promptly for this example. In my own testing and in the demonstration for the meetup group, I loaded and connected all the event data. If you’re interested in testing out the larger dataset or one of your own, I’d be happy to help.
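For reference, the load pattern looks roughly like the sketch below. The file name, labels, relationship types, and the 24-hour timestamp parsing are assumptions for illustration (the real export uses an AM/PM format, so you would adjust the parsing or rely on the preprocessed columns); the graphgist carries the actual commented Geequel.

```cypher
// Load a slice of the 911 call CSV and attach each event to the
// second-level node of the time tree.
LOAD CSV WITH HEADERS FROM "file:///Seattle_Real_Time_Fire_911_Calls.csv" AS row
WITH row LIMIT 1000
// Split an assumed "MM/DD/YYYY HH:MM:SS" style Datetime column.
WITH row, split(row.Datetime, " ") AS parts
WITH row, split(parts[0], "/") AS d, split(parts[1], ":") AS t
MERGE (y:Year {value: toInteger(d[2])})
MERGE (y)-[:CONTAINS]->(mo:Month {value: toInteger(d[0])})
MERGE (mo)-[:CONTAINS]->(day:Day {value: toInteger(d[1])})
MERGE (day)-[:CONTAINS]->(h:Hour {value: toInteger(t[0])})
MERGE (h)-[:CONTAINS]->(mi:Minute {value: toInteger(t[1])})
MERGE (mi)-[:CONTAINS]->(s:Second {value: toInteger(t[2])})
CREATE (e:Event {
  incidentNumber: row.IncidentNumber,
  type: row.IncidentType,
  datetimeMs: toInteger(row.DatetimeMS)  // full precision kept on the event
})
CREATE (e)-[:OCCURRED_AT]->(s)
```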
Some helpful hints for interacting with the graph visualizations:
1. Hold shift and scroll up to zoom in.
2. Hold shift and scroll down to zoom out.
3. Hold shift, then click and drag to move the graph around the display area.
4. Double-click to zoom in around a specific point.
5. Hold shift and double-click to zoom out around a specific point.
Please let me know if anything needs further clarification. Also, if you’ve got a time series data modeling use case or challenge you’re experiencing, schedule a demo to talk through it with us, or use the tooling in the GraphGrid Connected Data Platform fully-featured freemium download to try it out yourself.