Giraph - Giraph Input/Output with Gora

Overview

The Apache Gora project is an open source framework that provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.
The main motivation for integrating these two Apache projects is to turn the NoSQL data stores supported by Gora into graphs that Giraph can process, and to let Giraph store its results in different data stores, so that users can focus on the processing itself.
Gora works by defining the data model, that is, how the data is going to be stored, using a JSON-like schema inspired by Apache Avro, and by describing the physical mapping to the data store in an XML file. The former is used to generate data beans, which are read from or written to the different data stores; the latter defines which data bean goes where. In this way, Giraph is able to read and write data using three files:

The generated data beans representing our data model.
The XML mapping file representing our physical mapping.
A file called gora.properties containing configurations related to which data store Gora will use.

The figure below shows how this integration works.

[Figure: Giraph Gora integration]

Generating DataBeans

The first thing we have to do is define our data model using a JSON-like schema. The examples here resemble graphs stored inside Apache HBase through Gora. The following shows a schema for a vertex:

{"type": "record",
"name": "Vertex",
"namespace": "org.apache.giraph.gora.generated",
"fields" : [
           {"name": "vertexId", "type": "long"},
           {"name": "value", "type": "float"},
           {"name": "edges",
            "type": {
                     "type":"array", "items": {
                                     "name": "Edge",
                                     "type": "record",
                                     "namespace": "org.apache.giraph.gora.generated",
                                     "fields": [
                                             {"name": "vertexId", "type": "long"},
                                             {"name": "edgeValue", "type": "float"}
                                            ]
                                     }
                    }
          }
          ]
}

The following shows what a schema for an edge should look like:

      {
      "type": "record",
      "name": "GEdge",
      "namespace": "org.apache.giraph.gora.generated",
      "fields" : [
                 {"name": "edgeId", "type": "string"},
                 {"name": "edgeWeight", "type": "float"},
                 {"name": "vertexInId", "type": "string"},
                 {"name": "vertexOutId", "type": "string"},
                 {"name": "label", "type": "string"}
                 ]
      }

Now we are ready to generate our data beans. To do this, we use the gora-core.jar that comes with Giraph. The Gora compiler takes three parameters:

        <schema file> - REQUIRED -individual avsc file to be compiled or a directory path containing avsc files
        <output dir> - REQUIRED -output directory for generated Java files
        <-license id> - the preferred license header to add to the

Executing the Gora compiler with the commands below creates the generated data beans in the given output path.

           java -cp gora-core-0.4-SNAPSHOT.jar org.apache.gora.compiler.GoraCompiler vertex.avsc gora-app/src/main/java/
           java -cp gora-core-0.4-SNAPSHOT.jar org.apache.gora.compiler.GoraCompiler edge.avsc gora-app/src/main/java/

This results in a Java class that looks similar to this:

      /**
      * Class for defining a Giraph-Vertex.
      */
     @SuppressWarnings("all")
     public class GVertex extends PersistentBase {
       /**
        * Schema used for the class.
        */
       public static final Schema OBJ_SCHEMA = Schema.parse(
           "{\"type\":\"record\",\"name\":\"Vertex\"," +
           "\"namespace\":\"org.apache.giraph.gora.generated\"," +
           "\"fields\":[{\"name\":\"vertexId\",\"type\":\"string\"}," +
           "{\"name\":\"value\",\"type\":\"float\"},{\"name\":\"edges\"," +
           "\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}");
       
       /**
        * Vertex Id
        */
       private Utf8 vertexId;
       
       /**
        * Gets vertexId
        * @return Utf8 vertexId
        */
       public Utf8 getVertexId() {
         return (Utf8) get(0);
       }
       
       /**
        * Sets vertexId
        * @param value vertexId
        */
       public void setVertexId(Utf8 value) {
         put(0, value);
       }
      . . .

Once this logical data modeling is done, the physical mapping between the generated classes and the actual data repositories has to be made. Gora does this by using an XML "mapping file".
The file below represents a gora-hbase-mapping.xml, i.e. the information necessary to map our data model into HBase tables. Within the table tags, the necessary column families are defined. Within the class tags, the generated Java bean is mapped onto the column families: each field is mapped to its respective column family and to the HBase qualifier used for storing it.
This mapping file can contain as many mappings as there are generated data beans in our application, i.e. we can define more table tags with their own classes and fields.

        <gora-orm>
          <table name="graphGiraph">
            <family name="vertices"/>
          </table>
          <class name="org.apache.giraph.io.gora.generated.GVertex" keyClass="java.lang.String" table="graphGiraph">
            <field name="vertexId" family="vertices" qualifier="vertexId"/>
            <field name="value" family="vertices" qualifier="value"/>
            <field name="edges" family="vertices" qualifier="edges"/>
          </class>
        </gora-orm>

A more complex file can be found inside the giraph-gora/conf folder.
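
To make the structure of the mapping file more tangible, the following sketch lists which column family and qualifier each field is mapped to. It is only an illustration: the class name MappingInspector is hypothetical, and the code uses the JDK's DOM parser rather than Gora's own mapping loader.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative helper (not part of Gora or Giraph). */
class MappingInspector {

  /** Returns field name -> "family:qualifier" for every field element found. */
  static Map<String, String> fieldColumns(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      Map<String, String> columns = new LinkedHashMap<>();
      NodeList fields = doc.getElementsByTagName("field");
      for (int i = 0; i < fields.getLength(); i++) {
        Element field = (Element) fields.item(i);
        columns.put(field.getAttribute("name"),
            field.getAttribute("family") + ":" + field.getAttribute("qualifier"));
      }
      return columns;
    } catch (Exception e) {
      // Parsing problems are reported as unchecked errors to keep the sketch short.
      throw new IllegalStateException("Could not parse mapping file", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<gora-orm>"
        + "<table name=\"graphGiraph\"><family name=\"vertices\"/></table>"
        + "<class name=\"org.apache.giraph.io.gora.generated.GVertex\""
        + " keyClass=\"java.lang.String\" table=\"graphGiraph\">"
        + "<field name=\"vertexId\" family=\"vertices\" qualifier=\"vertexId\"/>"
        + "<field name=\"value\" family=\"vertices\" qualifier=\"value\"/>"
        + "<field name=\"edges\" family=\"vertices\" qualifier=\"edges\"/>"
        + "</class></gora-orm>";
    System.out.println(fieldColumns(xml));
  }
}
```

A check like this can help spot a field that was left unmapped before the job is submitted.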

Preparation

Once the data beans have been generated, the gora.properties file has to be created. This file specifies which data store is going to be used with Gora, and also contains extra information about that data store. An example of such a file can be found inside the giraph-gora/conf folder. Following our example, Apache HBase has been chosen, so gora.properties should contain the configuration shown below:

      # FOR HBASE DATASTORE
      gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Then, to be able to use the Gora API, the user needs to prepare the Gora environment. This amounts to having set up one of the data stores Gora supports, having the data beans generated, and having the gora.properties file set up. A more detailed yet simple tutorial is available in the Apache Gora documentation.
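
As a quick sanity check of the gora.properties content described above, the following sketch reads the gora.datastore.default entry with the JDK's java.util.Properties. The class name GoraPropertiesCheck is hypothetical, and this is not how Gora itself loads the file; it only shows that the file is a plain Java properties file with one key naming the data store class.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative check (not part of Gora or Giraph). */
class GoraPropertiesCheck {

  static final String DEFAULT_STORE_KEY = "gora.datastore.default";

  /** Returns the data store class named in the given properties text. */
  static String defaultDataStore(String propertiesText) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(propertiesText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    String store = props.getProperty(DEFAULT_STORE_KEY);
    if (store == null || store.trim().isEmpty()) {
      throw new IllegalStateException(DEFAULT_STORE_KEY + " is not set");
    }
    return store.trim();
  }

  public static void main(String[] args) {
    // Lines starting with '#' are comments in the properties format.
    String text = "# FOR HBASE DATASTORE\n"
        + "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\n";
    System.out.println(defaultDataStore(text));
  }
}
```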
The data definition files should be available in the classpath when the Giraph job is run. In addition, all configuration files needed for each specific data store should be made available across the cluster. For example, if we were to use HBase alongside Giraph and Gora, then the hbase-site.xml file should be passed along as well. There are several ways to make these files available; one common way is the -files option, which looks similar to this:

      -files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml

Gora also needs to be told which serialization types it will use. These serialization types can be configured across the cluster, but if that is not desired, they can be passed using Hadoop's -D option, which looks similar to this:

      -Dio.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization

Configuration Options

Now that the data beans have been generated and the Gora environment is ready, the user needs to know the configuration options for this API in order to specify them. These options are as follows:

label	type	description
giraph.gora.datastore.class	String	Gora DataStore class to read data from - required.
giraph.gora.key.class	String	Gora Key class to query the datastore - required.
giraph.gora.persistent.class	String	Gora Persistent class to read objects from Gora - required.
giraph.gora.start.key	String	Gora start key to query the datastore.
giraph.gora.end.key	String	Gora end key to query the datastore.
giraph.gora.keys.factory.class	String	Keys factory to convert strings into desired keys - required.
giraph.gora.output.datastore.class	String	Gora DataStore class to write data to - required.
giraph.gora.output.key.class	String	Gora Key class to write to datastore - required.
giraph.gora.output.persistent.class	String	Gora Persistent class to write to Gora - required.

Input/Output Example

To make use of the Giraph input API available for Gora, it is required to extend the class GoraVertexInputFormat or GoraEdgeInputFormat. In the first class, the only method that has to be implemented is transformVertex, which transforms a Gora object into a Giraph Vertex object. Likewise, for the second class the methods that have to be implemented are transformEdge, which converts a Gora edge object into a Giraph Edge object, and getCurrentSourceId. Two examples of such implementations are GoraGVertexVertexInputFormat and GoraGEdgeEdgeInputFormat. One other class that may have to be implemented is the KeyFactory, because this class is used to transform the keys passed as strings through the options into the actual Gora key objects used to query the data store. The default one assumes your key type is a String.
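
The role of a key factory can be pictured with a minimal plain-Java sketch. The names below (KeyConverters, StringKeyConverter, buildKey) are stand-ins chosen for this illustration and are not the actual Giraph KeyFactory API; the point is only that the start and end keys arrive as strings and must be converted to whatever key type the data store uses.

```java
/** Illustrative only: mirrors the idea of a key factory, not Giraph's KeyFactory class. */
class KeyConverters {

  /** Hypothetical stand-in: turns a string option value into a typed key. */
  interface StringKeyConverter<K> {
    K buildKey(String keyString);
  }

  /** Keys that are already strings are passed through unchanged. */
  static final StringKeyConverter<String> STRING = keyString -> keyString;

  /** Keys stored as longs must be parsed; malformed input raises NumberFormatException. */
  static final StringKeyConverter<Long> LONG = keyString -> Long.valueOf(keyString.trim());

  public static void main(String[] args) {
    // Values as they would arrive from giraph.gora.start.key and giraph.gora.end.key.
    Long start = LONG.buildKey("0");
    Long end = LONG.buildKey("10");
    System.out.println("Query range: [" + start + ", " + end + "]");
  }
}
```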
On the other hand, to make use of the Giraph output API available for Gora, it is required to extend the class GoraVertexOutputFormat or GoraEdgeOutputFormat. In the first class, the methods that have to be implemented are getGoraVertex, which transforms a Giraph Vertex object into a Gora object, and getGoraKey, which determines the key that will represent that vertex. Likewise, for the edge output class the methods that have to be implemented are getGoraEdge, which converts a Giraph Edge object into a Gora edge object, and getGoraKey, which determines the key that will represent that edge. Two examples of such implementations are GoraGVertexVertexOutputFormat and GoraGEdgeEdgeOutputFormat.
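
The kind of conversion these methods perform can be sketched in plain Java. Everything below is a stand-in: GEdgeBean only imitates the generated GEdge bean (its fields follow the edge schema), SimpleEdge imitates a Giraph edge with its source id, and real implementations would use the Giraph and Hadoop Writable types instead. Treating vertexInId as the source and vertexOutId as the target, and the "source-target" key format, are assumptions made for this illustration.

```java
/** Illustrative only: plain-Java stand-ins, not the Giraph or Gora classes. */
class EdgeTransformSketch {

  /** Stand-in for the generated GEdge bean; fields follow the edge schema. */
  static final class GEdgeBean {
    String edgeId;
    float edgeWeight;
    String vertexInId;
    String vertexOutId;
    String label;
  }

  /** Stand-in for a Giraph edge together with its source vertex id. */
  static final class SimpleEdge {
    final long sourceId;
    final long targetId;
    final float weight;

    SimpleEdge(long sourceId, long targetId, float weight) {
      this.sourceId = sourceId;
      this.targetId = targetId;
      this.weight = weight;
    }
  }

  /** Input direction: the kind of work transformEdge and getCurrentSourceId do. */
  static SimpleEdge fromGora(GEdgeBean bean) {
    // Assumption: vertexInId is the source and vertexOutId is the target.
    return new SimpleEdge(
        Long.parseLong(bean.vertexInId),
        Long.parseLong(bean.vertexOutId),
        bean.edgeWeight);
  }

  /** Output direction: the kind of key getGoraKey could return (format is arbitrary). */
  static String goraKey(SimpleEdge edge) {
    return edge.sourceId + "-" + edge.targetId;
  }

  /** Output direction: the kind of bean getGoraEdge could build. */
  static GEdgeBean toGora(SimpleEdge edge) {
    GEdgeBean bean = new GEdgeBean();
    bean.edgeId = goraKey(edge);
    bean.edgeWeight = edge.weight;
    bean.vertexInId = Long.toString(edge.sourceId);
    bean.vertexOutId = Long.toString(edge.targetId);
    bean.label = "";
    return bean;
  }

  public static void main(String[] args) {
    GEdgeBean stored = toGora(new SimpleEdge(1L, 2L, 0.5f));
    SimpleEdge loaded = fromGora(stored);
    System.out.println(stored.edgeId + " weight=" + loaded.weight);
  }
}
```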
An example command showing how to put all these classes and configurations together is shown below. The command runs the shortest paths algorithm on the graph database shown previously.

           export GIRAPH_CORE_JAR=$GIRAPH_CORE_TARGET_DIR/giraph-$GIRAPH_VERSION-for-$HADOOP_VERSION-jar-with-dependencies.jar
           export GIRAPH_EXAMPLES_JAR=$GIRAPH_EXAMPLES_TARGET_DIR/giraph-examples-$GIRAPH_VERSION-for-$HADOOP_VERSION-jar-with-dependencies.jar
           export GIRAPH_GORA_JAR=$GIRAPH_GORA_TARGET_DIR/giraph-gora-$GIRAPH_VERSION-SNAPSHOT-jar-with-dependencies.jar
           export GORA_HBASE_JAR=$GORA_HBASE_TARGET_DIR/gora-hbase-$GORA_VERSION.jar
           export HBASE_JAR=$GORA_DIR/gora-hbase/lib/hbase-0.90.4.jar
           export HADOOP_CLASSPATH=$GIRAPH_CORE_JAR:$GIRAPH_EXAMPLES_JAR:$GIRAPH_GORA_JAR:$GORA_HBASE_JAR

           hadoop jar $GIRAPH_EXAMPLES_JAR org.apache.giraph.GiraphRunner \
           -files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml \
           -Dio.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization \
           -Dgiraph.gora.datastore.class=org.apache.gora.hbase.store.HBaseStore \
           -Dgiraph.gora.key.class=java.lang.String \
           -Dgiraph.gora.persistent.class=org.apache.giraph.io.gora.generated.GEdge \
           -Dgiraph.gora.start.key=0 \
           -Dgiraph.gora.end.key=10 \
           -Dgiraph.gora.keys.factory.class=org.apache.giraph.io.gora.utils.KeyFactory \
           -Dgiraph.gora.output.datastore.class=org.apache.gora.hbase.store.HBaseStore \
           -Dgiraph.gora.output.key.class=java.lang.String \
           -Dgiraph.gora.output.persistent.class=org.apache.giraph.io.gora.generated.GEdgeResult \
           -libjars $GIRAPH_GORA_JAR,$GORA_HBASE_JAR,$HBASE_JAR \
           org.apache.giraph.examples.SimpleShortestPathsComputation \
           -eif org.apache.giraph.io.gora.GoraGEdgeEdgeInputFormat \
           -eof org.apache.giraph.io.gora.GoraGEdgeEdgeOutputFormat \
           -w 1