An example to illustrate migration of CSV to Grakn

Introduction

This example looks at the migration of genealogy data in CSV format to build a knowledge base in GRAKN.AI. The data is used as the basis of a blog post that illustrates the fundamentals of the Grakn visualiser, reasoner and analytics components.

As the blog post explains, the original data was a document from Lenzen Research that described the family history of Catherine Niesz Titus for three generations of her maternal lineage.

In this example, we will walk through how to migrate the CSV data into Grakn, and confirm that we have succeeded using the Grakn visualiser.

For a detailed overview of CSV migration, we recommend that you take a look at the Grakn documentation on CSV Migration and Graql templating.

Genealogy Data

The data for this example can be found as a set of CSV files in the sample-projects repository on GitHub, which is also included in the Grakn distribution zip file. The data was put together by our team from narrative information gleaned from the original Lenzen Research document, with some minor additions to generate some interesting queries for Grakn’s reasoner.

Let’s take a look at the raw-data directory in the example project, which contains the CSV files. These files were put together by hand by our team, mostly by Michelangelo.

filename      description
people.csv    This comprehensive CSV contains information about all the individuals discussed in the Lenzen document. Each row lists the available information about an individual’s names, gender, birth and death dates and age at death. It also assigns each person a person ID (“pid”), which is a string containing the individual’s full name, and is used to identify individuals in the other CSV files.
births.csv    This CSV lists the person IDs of a child and each of its parents, one parent per row.
weddings.csv  This CSV comprises a row for each marriage. The rows contain the person IDs of each spouse and, where one is available, a link to a picture of the couple.
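
Because births.csv and weddings.csv refer to people only by their pid, every ID they use has to appear in people.csv. The following is a minimal Python sketch of that cross-file check, not part of Grakn; the function names read_rows and unknown_pids are hypothetical, and the in-memory samples stand in for the real files.

```python
import csv
import io

# Small in-memory samples shaped like the files described above.
PEOPLE_CSV = """name1,name2,surname,gender,born,dead,pid,age,picture
Timothy,,Titus,male,,,Timothy Titus,,
Mary,,Guthrie,female,,,Mary Guthrie,,
William,Sanford,Titus,male,1818-03-23,01/01/1905,William Sanford Titus,76,
"""

BIRTHS_CSV = """parent,child
Timothy Titus,William Sanford Titus
Mary Guthrie,William Sanford Titus
"""


def read_rows(text):
    """Parse CSV text into dicts, stripping stray whitespace from each cell."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: (value or "").strip() for key, value in row.items()}
            for row in reader]


def unknown_pids(people_rows, other_rows, columns):
    """Return the person IDs used in other_rows that people_rows does not define."""
    known = {row["pid"] for row in people_rows}
    used = {row[column] for row in other_rows for column in columns}
    return sorted(used - known)


people = read_rows(PEOPLE_CSV)
births = read_rows(BIRTHS_CSV)
# An empty list means every parent and child ID resolves to a person.
print(unknown_pids(people, births, ["parent", "child"]))
```

The same call with the columns spouse1 and spouse2 would check weddings.csv. Cells are stripped on reading because hand-edited CSV files can carry stray spaces or tabs around a value.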

Schema

The schema is a way to describe the entities and their relationships, so the underlying knowledge base can store them according to the Grakn model. You can find out more in our guide to the Grakn Knowledge Model. The schema allows Grakn to perform:

  • logical reasoning over the represented knowledge, such as the extraction of implicit information from explicit data (inference)
  • discovery of inconsistencies in the data (validation).

The schema is shown below. There is a single entity, person, which has a number of resources and can play various roles (parent, child, spouse1 and spouse2) in two possible relationships (parentship and marriage).

define

# Entities

person sub entity
  plays parent
  plays child
  plays spouse1
  plays spouse2

  has identifier
  has firstname
  has surname
  has middlename
  has picture
  has age
  has birth-date
  has death-date
  has gender;

# Roles and Relations

marriage sub relationship
  relates spouse1
  relates spouse2
  has picture;

spouse1 sub role;
spouse2 sub role;

parentship sub relationship
  relates parent
  relates child;

parent sub role;
child sub role;

# Resources

identifier sub attribute datatype string;
name sub attribute datatype string;
firstname sub name datatype string;
surname sub name datatype string;
middlename sub name datatype string;
picture sub attribute datatype string;
age sub attribute datatype long;
"date" sub attribute datatype string;
birth-date sub "date" datatype string;
death-date sub "date" datatype string;
gender sub attribute datatype string;

To load schema.gql into Grakn, make sure the engine is running and choose a clean keyspace in which to work (here we use the default keyspace, so we are cleaning it before we get started).

./grakn server clean
./grakn server start
./graql console -f ./schema.gql

Data Migration

Having loaded the schema, the next step is to populate the knowledge base by migrating data into Grakn from CSV.

We will consider three CSV files that contain data to migrate into Grakn.

people.csv

The people.csv file contains details of the people that we will use to create person entities; the first seven rows are shown below. Note that not all fields are available for each person, but at the very least, each row is expected to have the following:

  • pid (this is the person identifier, and is a string representing their full name)
  • first name
  • gender

name1,name2,surname,gender,born,dead,pid,age,picture
Timothy,,Titus, male,,,	Timothy Titus,,	
Mary,,Guthrie,female,,,Mary Guthrie,,	
John,,Niesz,male,1798-01-02,1872-03-06,John Niesz,74,
Mary,,Young,female,1798-04-09,1868-10-28,Mary Young,70,
William,Sanford,Titus,male,1818-03-23,01/01/1905,William Sanford Titus,76,
Elizabeth,,Niesz,female,1820-08-27,1891-12-08,Elizabeth Niesz,71,
Mary,Melissa,Titus,female,1847-08-12,10/05/1946,Mary Melissa Titus,98,
...

A migrator template is a set of Graql statements that tells the Grakn migrator how the CSV data maps to the schema. The Grakn migrator applies the template to each row of data in a CSV file, replacing the indicated sections in the template with the value from a specific cell, identified by the column header (the key). This sounds complicated, but isn’t really, as we will show.
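
To make the per-row substitution concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the Grakn migrator's implementation: the function name fill_template is hypothetical, and it handles nothing beyond replacing <key> placeholders with quoted cell values.

```python
import re


def fill_template(template, row):
    """Replace each <key> in the template with the quoted value of row[key]."""
    def substitute(match):
        value = row[match.group(1)]
        # Escape embedded double quotes so the result stays one string literal.
        return '"' + value.replace('"', '\\"') + '"'
    return re.sub(r"<(\w+)>", substitute, template)


template = "insert $p isa person has identifier <pid> has firstname <name1>;"
row = {"pid": "John Niesz", "name1": "John"}
print(fill_template(template, row))
# insert $p isa person has identifier "John Niesz" has firstname "John";
```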

The Graql template code for the people migrator is as follows:

insert
  $p isa person has identifier <pid>
  has firstname <name1>,
		
  if (<surname> != "") do 
    {
    has surname <surname>,
    }

  if (<name2> != "") do 
    {
    has middlename <name2>,
    }

  if (<picture> != "") do 
    {
    has picture <picture>,
    }

  if (<age> != "") do 
    {
    has age @long(<age>),
    }

  if (<born> != "") do 
    {
    has birth-date <born>,
    }

  if (<dead> != "") do 
    {
    has death-date <dead>,
    }

  has gender <gender>;

For each row in the CSV file, the template inserts a person entity with resources that take the value of the cells in that row. Where data is optional, the template checks to see if it is present before adding the resources for middlename, surname, picture, age, birth and death dates.
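
The optional-field handling described above can be sketched in Python as follows. This is an approximation of what the template expresses, under the assumption that an empty cell arrives as an empty string; the function name person_insert is hypothetical, and the attribute order in the output is simply the order used here.

```python
def person_insert(row):
    """Build one insert statement for a people.csv row, skipping empty optional cells."""
    def quoted(value):
        return '"' + value + '"'

    parts = ["$p isa person",
             "has identifier " + quoted(row["pid"]),
             "has firstname " + quoted(row["name1"])]
    # Optional columns, paired with the attribute each one maps to in the template.
    optional = [("surname", "surname"), ("name2", "middlename"),
                ("picture", "picture"), ("age", "age"),
                ("born", "birth-date"), ("dead", "death-date")]
    for column, attribute in optional:
        value = row.get(column, "")
        if value == "":
            continue  # same effect as the template's `if (<column> != "")` guard
        if attribute == "age":
            parts.append("has age " + str(int(value)))  # mirrors @long(<age>)
        else:
            parts.append("has " + attribute + " " + quoted(value))
    parts.append("has gender " + quoted(row["gender"]))
    return "insert " + " ".join(parts) + ";"


timothy = {"name1": "Timothy", "name2": "", "surname": "Titus", "gender": "male",
           "born": "", "dead": "", "pid": "Timothy Titus", "age": "", "picture": ""}
print(person_insert(timothy))
# insert $p isa person has identifier "Timothy Titus" has firstname "Timothy" has surname "Titus" has gender "male";
```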

Calling the Grakn migrator on the people.csv file using the above template (named people-migrator.gql) is performed as follows:

./graql migrate csv -i ./people.csv -t ./migrators/people-migrator.gql -k grakn

The data insertion generated by the migrator is as follows:

insert $p0 has death-date "1891-12-08" isa person has gender "female" has identifier "Elizabeth Niesz" has surname "Niesz" has age 71 has firstname "Elizabeth" has birth-date "1820-08-27";
insert $p0 has identifier "William Sanford Titus" has age 76 isa person has firstname "William" has surname "Titus" has gender "male" has birth-date "1818-03-23" has middlename "Sanford" has death-date "1905-01-01";
insert $p0 isa person has surname "Titus" has firstname "Timothy" has identifier "Timothy Titus" has gender "male";
insert $p0 isa person has firstname "Mary" has identifier "Mary Guthrie" has surname "Guthrie" has gender "female";
insert $p0 isa person has firstname "Mary" has death-date "1946-05-10" has surname "Titus" has age 98 has identifier "Mary Melissa Titus" has middlename "Melissa" has gender "female" has birth-date "1847-08-12";
insert $p0 has death-date "1872-03-06" has age 74 has identifier "John Niesz" isa person has birth-date "1798-01-02" has firstname "John" has gender "male" has surname "Niesz";
insert $p0 has identifier "Mary Young" has birth-date "1798-04-09" isa person has firstname "Mary" has death-date "1868-10-28" has surname "Young" has gender "female" has age 70;
# ...
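
Two of the sample rows above give the death date in slash form (01/01/1905 and 10/05/1946), while the generated insertions show 1905-01-01 and 1946-05-10, and the template passes <dead> through unchanged. One way to reconcile the two is to normalise dates in the CSV before migrating. The sketch below is one such pre-processing step in Python, not part of the Grakn migrator. It assumes slash dates are day/month/year, which is consistent with 10/05/1946 appearing as 1946-05-10, and the function name to_iso_date is hypothetical.

```python
from datetime import datetime


def to_iso_date(value):
    """Return value as YYYY-MM-DD, accepting ISO or DD/MM/YYYY input; keep blanks."""
    value = value.strip()
    if value == "":
        return ""
    # Try the ISO form first, then the assumed day/month/year slash form.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError("unrecognised date format: " + value)


print(to_iso_date("10/05/1946"))  # 1946-05-10
print(to_iso_date("1872-03-06"))  # 1872-03-06
```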

births.csv

Each row of births.csv records a parent and child, with two rows for each of the three children listed:

parent,child
Timothy Titus,William Sanford Titus
Mary Guthrie,	William Sanford Titus
John Niesz,Elizabeth Niesz
Mary Young,Elizabeth Niesz
Elizabeth Niesz,Mary Melissa Titus
William Sanford Titus,Mary Melissa Titus
...
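
As a quick sanity check on the "two rows per child" structure, the rows can be counted per child before migrating. This is a small Python sketch over the sample rows shown above; the function name parents_per_child is hypothetical.

```python
import csv
import io
from collections import Counter

BIRTHS_CSV = """parent,child
Timothy Titus,William Sanford Titus
Mary Guthrie,William Sanford Titus
John Niesz,Elizabeth Niesz
Mary Young,Elizabeth Niesz
Elizabeth Niesz,Mary Melissa Titus
William Sanford Titus,Mary Melissa Titus
"""


def parents_per_child(text):
    """Count how many rows (parents) each child has in births-style CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return Counter(row["child"].strip() for row in reader)


counts = parents_per_child(BIRTHS_CSV)
for child, number_of_parents in sorted(counts.items()):
    print(child, number_of_parents)
```

A child with a count other than 2 is not necessarily an error, since one parent may be unknown, but it is worth a look before the parentship relationships are created.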

The Graql template code for the births migrator is as follows:

match
	$c isa person has identifier <child>;
	$p isa person has identifier <parent>;
insert
	(child: $c, parent: $p) isa parentship;

For each row in the CSV file, the template matches the child and parent cells to their corresponding person entities, and then inserts a parentship relationship, placing the entities it has matched into the child and parent roles.

Calling the Grakn migrator on the births.csv file using the above template (named births-migrator.gql) is performed as follows:

./graql migrate csv -i ./births.csv -t ./migrators/births-migrator.gql -k grakn

The data insertion generated by the migrator is as follows:

match $c0 has identifier "William Sanford Titus" isa person; $p0 isa person has identifier "Timothy Titus";
insert (child: $c0, parent: $p0) isa parentship;
match $p0 isa person has identifier "Mary Guthrie"; $c0 has identifier "William Sanford Titus" isa person;
insert (child: $c0, parent: $p0) isa parentship;
match $p0 isa person has identifier "Elizabeth Niesz"; $c0 isa person has identifier "Mary Melissa Titus";
insert (child: $c0, parent: $p0) isa parentship;
match $c0 isa person has identifier "Mary Melissa Titus"; $p0 has identifier "William Sanford Titus" isa person;
insert (child: $c0, parent: $p0) isa parentship;
match $c0 isa person has identifier "Elizabeth Niesz"; $p0 has identifier "John Niesz" isa person;
insert (child: $c0, parent: $p0) isa parentship;
match $c0 isa person has identifier "Elizabeth Niesz"; $p0 has identifier "Mary Young" isa person;
insert (child: $c0, parent: $p0) isa parentship;
# ...

weddings.csv

The weddings.csv file contains two columns, one for each spouse in a marriage, and an optional column for a photograph of the happy couple:

spouse1,spouse2,picture
Timothy Titus,Mary Guthrie,
John Niesz,Mary Young,http://1.bp.blogspot.com/-Ty9Ox8v7LUw/VKoGzIlsMII/AAAAAAAAAZw/UtkUvrujvBQ/s1600/johnandmary.jpg
Elizabeth Niesz,William Sanford Titus,

The Graql template code for the migrator is as follows:

match
	$x has identifier <spouse1>;
	$y has identifier <spouse2>;
insert
	(spouse1: $x, spouse2: $y) isa marriage

	if (<picture> != "") do 
		{
		has picture <picture>
		};

For each row in the CSV file, the template matches the two spouse cells to their corresponding person entities, and then inserts a marriage relationship, placing the entities it has matched into the spouse1 and spouse2 roles. If there is data in the picture cell, a picture attribute is also created for the marriage relationship.

Calling the Grakn migrator on the weddings.csv file using the above template (named weddings-migrator.gql) is performed as follows:

./graql migrate csv -i ./weddings.csv -t ./migrators/weddings-migrator.gql -k grakn

The data insertion generated by the migrator is as follows:

match $x0 has identifier "Timothy Titus"; $y0 has identifier "Mary Guthrie";
insert (spouse1: $x0, spouse2: $y0) isa marriage;
match $x0 has identifier "John Niesz"; $y0 has identifier "Mary Young";
insert has picture "http:\/\/1.bp.blogspot.com\/-Ty9Ox8v7LUw\/VKoGzIlsMII\/AAAAAAAAAZw\/UtkUvrujvBQ\/s1600\/johnandmary.jpg" (spouse1: $x0, spouse2: $y0) isa marriage;
match $y0 has identifier "William Sanford Titus"; $x0 has identifier "Elizabeth Niesz";
insert (spouse1: $x0, spouse2: $y0) isa marriage;
# ...

Migration Script

For simplicity, the /raw-data/ directory of the example project contains a script called loader.sh that calls each migration script in turn, so you can simply call the script from the terminal, passing in the path to the Grakn /bin/ directory.

 ./loader.sh <relative-path-to-Grakn>/bin
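
The contents of loader.sh are not reproduced here, but the sequence of calls it automates can be sketched from the three commands shown earlier. The Python below only builds the command lines and does not run them; the function name migrate_commands and the path ../grakn/bin are hypothetical.

```python
def migrate_commands(bin_dir, keyspace="grakn"):
    """Return one `graql migrate csv` argument list per file, people first."""
    # Order matters: the births and weddings templates match on person
    # entities, so people.csv has to be migrated before the other two.
    names = ["people", "births", "weddings"]
    return [
        [bin_dir + "/graql", "migrate", "csv",
         "-i", "./" + name + ".csv",
         "-t", "./migrators/" + name + "-migrator.gql",
         "-k", keyspace]
        for name in names
    ]


# "../grakn/bin" is a placeholder for the path to the Grakn bin directory.
for command in migrate_commands("../grakn/bin"):
    print(" ".join(command))
```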

The migration will take a minute or two, and the terminal will report which file it is migrating at each step. When it is complete, it will report that it is “Done migrating data”. To check that it has been successful, open the Grakn visualiser and select Types, then Entities, and choose one of those presented to you (the entities should be those described above). The visualiser will display the entities it has imported. The screenshot below illustrates the result of selecting all person entities.

[Screenshot: Person query]

We have completed the data import, and the knowledge base can now be queried. For example, from the Graql shell:

match $x isa person, has identifier $i; aggregate count;

There should be 60 people in the dataset.

Data Export

In this example, we have imported a dataset stored in three separate CSV files into Grakn to build a simple knowledge base. We have discussed the schema and migration templates, and shown how to apply the templates to the CSV data using the shell migrator, with the loader.sh script automating the call to the migrator on each file.

It is also possible to export the data from Grakn, in .gql format, so that it can easily be loaded into a knowledge base again without the need to migrate from CSV.

To export the data, we use the graql migrate command again, as described in the migration documentation:

# Export the schema
./graql migrate export -schema > schema-export.gql

# Export the data
./graql migrate export -data > data-export.gql

Exporting the schema or the data from Grakn into Graql always writes to standard output, so above we redirect the output to an appropriately named file.

Where Next?

This example has illustrated how to migrate CSV data into Grakn. Having read it, you may want to further study our documentation about CSV migration and Graql templating.