Take this unique dataset in a CSV format and transform it into a graph using ONgDB. Using data modeling for chemicals in cosmetics data with ONgDB, we can compact the vast number of relationships and properties within the Chemicals in Cosmetics dataset, creating more meaningful and easily applicable data.
Since 2005, all Californian cosmetic companies are required to provide the information of any cosmetic product that contains chemical(s) that cause or are suspected to cause cancer, developmental birth defects, or harm to the reproductive system. This list, of the cosmetics and chemicals in question, is openly provided on the California government website*. Even more intriguing, are the numerous properties about the products and chemicals, such as important dates and times, whether the product is still being sold, whether the chemical is still being used, and much more. The interconnectedness of this data illustrates the power of the property graph model and its ability to succinctly store information by prioritizing both the nodes and the relationships.
Below is a list of the more ambiguous/allusive headers found in the CSV. Going through and understanding the headers is a vital step for developing an accurate graph model.
Headers & Descriptions:
1. CDPHId (#): CA Dpt. of Public Health identification number for product. May occur more than once.
2. CSFId (#): CDPH identification number for CSF.
3. CSF: Color, scent, and/or flavor. Not all products have specific colors, scents, or flavors.
4. CompanyId (#): CDPH internal identification number for company.
5. PrimaryCategoryId (#): CDPH identification number for category.
6. PrimaryCategory: Type of product (13 primary categories).
7. SubCategoryId (#): CDPH internal identification number for subcategory.
8. SubCategory: Type of product within one of the 13 primary categories.
9. CASId (#): CDPH identification number for chemical.
10. CasNumber (#): Chemical identification number as selected by the manufacturer, packer, and/or distributor. Different forms of chemicals may have different CAS numbers.
11. ChemicalId (#): CDPH internal identification number for this chemical’s record specific to this product.
12. InitialDateReported: Date product was first reported to CDPH.
13. MostRecentDateReported: Date the product profile was last modified by manufacturer, packer, and/or distributor.
14. DiscontinuedDate: If applicable, date the product was discontinued.
15. ChemicalCreatedAt: Date that this chemical was first reported to CDPH for this product.
16. ChemicalUpdatedAt: Date this chemical report was last modified by manufacturer, packer, and/or distributor. This field is unique from the product profile.
17. ChemicalDateRemoved: If applicable, date the chemical was removed from product, i.e., reformulation.
18. ChemicalCount (#): Total number of current chemicals reported for this product.
*Find the CSV here: https://cdph.data.ca.gov/Environment/Chemicals-in-Cosmetics/emt8-tzcf
Taking these headers, we can now create a skeleton of how this data is going to look. This data modeling with ONgDB process involves drawing out a very basic property graph model like the one above.
- First, let’s create the indexes and unique constraints to ensure speedy look-ups while querying the nodes and relationships. Initially most of the numerical headers were uniquely constrained since a numerical id seemed like a more definitive/unique identifier. However, doing this, we had trouble creating nodes later on since many of the id’s were not unique to one chemical. Therefore many of the id’s are indexed and through trial and error we found the properties that could be uniquely constrained.
- Next, import the CSV. Instructions on how to do so can be found on the ONgDB website.
- Next, we create the validations which ensures that any rows with specified null values are not returned. As a general rule, any unique property or primary identifier for a node should not be null. It helps to look through the CSV and see which columns have multiple null values (such as CSF since not every product has a color/scent/flavor reported). We will not exclude rows with null CSF values since this would restrict too much of the data and the CSF data is not especially important. In some cases null values are extremely meaningful; you don’t want to exclude these. For example, if the dateChemicalRemoved is null this could tell us that the product still contains the potentially dangerous chemical. In conclusion, the validations are a bit of a hit and miss; sometimes its not until you test and load your CSV into ONgDB when you can finally correct your errors.
- Now we can create the nodes! Use MERGE instead of CREATE to ensure that unique nodes are not duplicated. Merge each node on a unique property so that it can find that one node on which to perform the merge. Use the ON CREATE SET clause for other properties set on the nodes.
- Finally, we set properties on the relationship connecting the chemical to its product. These properties, if you look at the headers’ descriptions, are chemical properties that are unique to its product. For example, by reading the descriptions we find that the ChemicalId is actually a distinct chemical identifier for each product whereas the CasId is a chemical identifier solely for the chemical (independent of its product). ONgDB allows these properties to be set on the relationship since it relates to both the chemical and product nodes.
- You are done! The property graph model for the data is compact and complete. We now have completed data modeling for chemicals in cosmetics.
Even a rudimentary understanding of Geequel allows you to seamlessly navigate and query the data. Below are a few to jog your imagination.
With the richness of this dataset, there are many more queries that can be done. Have fun and good luck. Give GraphGrid Connected Data Platform a try by data modeling chemicals in cosmetics and loading this dataset into it.