Overview
The Apache Gora project is an open source framework which provides an in-memory
data model and persistence for big data. Gora supports persisting to column
stores, key-value stores, document stores and RDBMSs, and
analyzing the data with extensive Apache Hadoop MapReduce support.
The main motivation for integrating these two Apache projects is to turn
Gora-supported NoSQL data stores into Giraph-processable graphs, and to give
Giraph the ability to store its results in different data stores, letting users
focus on the processing itself.
Gora works by defining the data model, that is, how our data is going to be
stored, using a JSON-like schema inspired by Apache Avro, and by defining the
physical mapping to the data store in an XML file.
The former is used to generate the data beans that will be read from or written
to different data stores; the latter defines which data bean should go where.
In this way, Giraph will be able to read/write data using three files:
- The generated data beans representing our data model.
- The XML mapping file representing our physical mapping.
- A file called gora.properties containing configurations related to which data store Gora will use.
The image below shows how this integration works:
Generating DataBeans
The first thing we have to do is define our data model using a JSON-like schema. Here is
a schema resembling graphs stored inside Apache HBase through Gora. The following shows a schema
for a vertex:
{
  "type": "record",
  "name": "Vertex",
  "namespace": "org.apache.giraph.gora.generated",
  "fields" : [
    {"name": "vertexId", "type": "long"},
    {"name": "value", "type": "float"},
    {"name": "edges",
     "type": {
       "type": "array", "items": {
         "name": "Edge",
         "type": "record",
         "namespace": "org.apache.giraph.gora.generated",
         "fields": [
           {"name": "vertexId", "type": "long"},
           {"name": "edgeValue", "type": "float"}
         ]
       }
     }
    }
  ]
}
The following schema shows what a schema for an edge should look like:
{
  "type": "record",
  "name": "GEdge",
  "namespace": "org.apache.giraph.gora.generated",
  "fields" : [
    {"name": "edgeId", "type": "string"},
    {"name": "edgeWeight", "type": "float"},
    {"name": "vertexInId", "type": "string"},
    {"name": "vertexOutId", "type": "string"},
    {"name": "label", "type": "string"}
  ]
}
Now we are ready to generate our data beans. To do this, we use the gora-core.jar that
comes with Giraph. The Gora compiler takes three parameters:
<schema file> - REQUIRED - individual avsc file to be compiled or a directory path containing avsc files
<output dir> - REQUIRED - output directory for generated Java files
<-license id> - the preferred license header to add to the
Executing the Gora compiler with the commands below creates the generated data beans
in the given output directory.
java -cp gora-core-0.4-SNAPSHOT.jar org.apache.gora.compiler.GoraCompiler vertex.avsc gora-app/src/main/java/
java -cp gora-core-0.4-SNAPSHOT.jar org.apache.gora.compiler.GoraCompiler edge.avsc gora-app/src/main/java/
This produces a Java class that looks similar to the following:
/**
 * Class for defining a Giraph-Vertex.
 */
@SuppressWarnings("all")
public class GVertex extends PersistentBase {
  /**
   * Schema used for the class.
   */
  public static final Schema OBJ_SCHEMA = Schema.parse(
      "{\"type\":\"record\",\"name\":\"Vertex\"," +
      "\"namespace\":\"org.apache.giraph.gora.generated\"," +
      "\"fields\":[{\"name\":\"vertexId\",\"type\":\"string\"}," +
      "{\"name\":\"value\",\"type\":\"float\"},{\"name\":\"edges\"," +
      "\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}");

  /**
   * Vertex Id
   */
  private Utf8 vertexId;

  /**
   * Gets vertexId
   * @return Utf8 vertexId
   */
  public Utf8 getVertexId() {
    return (Utf8) get(0);
  }

  /**
   * Sets vertexId
   * @param value vertexId
   */
  public void setVertexId(Utf8 value) {
    put(0, value);
  }
  . . .
Once this logical data modeling is done, the physical mapping between these generated
classes and the actual data repositories has to be made. Gora does this by using an
XML "mapping file".
The file below represents a gora-hbase-mapping.xml, i.e. the necessary
information to map our data model into HBase tables. Within the table tags,
the necessary column families are defined. Within the class tags, the actual
generated Java bean is mapped onto the column families: each field is mapped
to its respective column family and to the HBase qualifier used for storing it.
This mapping file can contain as many mappings as there are generated data beans
in our application, i.e. we can define more table tags with their own class
and field tags.
<gora-orm>
  <table name="graphGiraph">
    <family name="vertices"/>
  </table>
  <class name="org.apache.giraph.io.gora.generated.GVertex" keyClass="java.lang.String" table="graphGiraph">
    <field name="vertexId" family="vertices" qualifier="vertexId"/>
    <field name="value" family="vertices" qualifier="value"/>
    <field name="edges" family="vertices" qualifier="edges"/>
  </class>
</gora-orm>
A more complex file can be found inside the giraph-gora/conf folder.
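To make the structure of the mapping file concrete, the sketch below parses a
mapping like the one above with the JDK's XML parser and lists which column
family and qualifier each bean field is mapped to. This is only an illustration
of the file layout and is not how Gora itself reads the mapping; the class name
MappingInspector and the method fieldColumns are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Gora or Giraph.
final class MappingInspector {

  /** Returns bean field name -> "family:qualifier" for every field element. */
  static Map<String, String> fieldColumns(String mappingXml) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(mappingXml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse mapping XML", e);
    }
    Map<String, String> columns = new LinkedHashMap<>();
    NodeList fields = doc.getElementsByTagName("field");
    for (int i = 0; i < fields.getLength(); i++) {
      Element field = (Element) fields.item(i);
      columns.put(field.getAttribute("name"),
          field.getAttribute("family") + ":" + field.getAttribute("qualifier"));
    }
    return columns;
  }

  public static void main(String[] args) {
    // Same content as the mapping file shown above, inlined as a string.
    String xml =
        "<gora-orm>"
        + "<table name=\"graphGiraph\"><family name=\"vertices\"/></table>"
        + "<class name=\"org.apache.giraph.io.gora.generated.GVertex\""
        + " keyClass=\"java.lang.String\" table=\"graphGiraph\">"
        + "<field name=\"vertexId\" family=\"vertices\" qualifier=\"vertexId\"/>"
        + "<field name=\"value\" family=\"vertices\" qualifier=\"value\"/>"
        + "<field name=\"edges\" family=\"vertices\" qualifier=\"edges\"/>"
        + "</class></gora-orm>";
    fieldColumns(xml).forEach(
        (field, column) -> System.out.println(field + " -> " + column));
  }
}
```

A check like this can help spot a field that was left unmapped, or mapped to a
column family that the table tag does not declare, before the job is submitted.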
Preparation
Once the data beans have been generated, the gora.properties file
has to be created. This file specifies which data store is going to be used with
Gora, and also contains extra information about that data store. An example of
such a file can be found inside the giraph-gora/conf folder. Following
our example, if it has been decided to use Apache HBase, then gora.properties
should contain the corresponding configuration, as shown below:
# FOR HBASE DATASTORE
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
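As a quick sanity check, the property above can be read back with the JDK's
java.util.Properties class. The sketch below only assumes the
gora.datastore.default key shown in the example; the class name
GoraPropertiesCheck and the method defaultDataStore are hypothetical, and this
is not how Gora itself loads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Gora or Giraph.
final class GoraPropertiesCheck {

  static final String DEFAULT_STORE_KEY = "gora.datastore.default";

  /** Returns the configured default data store class, or fails if it is missing. */
  static String defaultDataStore(String propertiesText) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(propertiesText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    String storeClass = props.getProperty(DEFAULT_STORE_KEY);
    if (storeClass == null || storeClass.trim().isEmpty()) {
      throw new IllegalStateException(DEFAULT_STORE_KEY + " is not set");
    }
    return storeClass.trim();
  }

  public static void main(String[] args) {
    // Same content as the gora.properties example above.
    String text = "# FOR HBASE DATASTORE\n"
        + "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\n";
    System.out.println(defaultDataStore(text));
  }
}
```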
Then, to be able to use the Gora API, the user needs to prepare the Gora environment.
This amounts to having set up one of the data stores Gora supports, having
the data beans generated and having the gora.properties file set up. A more
detailed yet simple tutorial can be found here.
The data definition files should be available in the classpath when the
Giraph job is run. In addition, all configuration files needed for each specific data
store should be made available across the cluster. For example, if we were
to use HBase along with Giraph and Gora, then the hbase-site.xml file should be passed
along as well. There are several ways to make these files available; one common
way is the -files option, which looks similar to this:
-files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml
Gora also needs to be told which serialization types it will use. These serialization
types could be made available across the cluster, but if that is not desired, they can be
passed using the -D option of Hadoop, which looks similar to this:
-Dio.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization
Configuration Options
Now that the data beans have been generated and the Gora environment is ready,
the user needs to know the configuration options this API accepts. These
options are as follows:
| label | type | description |
|---|---|---|
| giraph.gora.datastore.class | String | Gora DataStore class to access data from - required. |
| giraph.gora.key.class | String | Gora Key class to query the datastore - required. |
| giraph.gora.persistent.class | String | Gora Persistent class to read objects from Gora - required. |
| giraph.gora.start.key | String | Gora start key to query the datastore. |
| giraph.gora.end.key | String | Gora end key to query the datastore. |
| giraph.gora.keys.factory.class | String | Keys factory to convert strings into desired keys - required. |
| giraph.gora.output.datastore.class | String | Gora DataStore class to write data to - required. |
| giraph.gora.output.key.class | String | Gora Key class to write to datastore - required. |
| giraph.gora.output.persistent.class | String | Gora Persistent class to write to Gora - required. |
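The options marked as required above can be checked before a job is submitted.
The sketch below is a minimal, self-contained illustration that works on a plain
map of option names to values. Splitting the required options into an input
group and an output group is an assumption based on the option names, and the
class GoraOptionsCheck is hypothetical rather than part of Giraph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Gora or Giraph.
final class GoraOptionsCheck {

  /** Options marked as required that concern reading (assumed grouping). */
  static final List<String> REQUIRED_INPUT = Arrays.asList(
      "giraph.gora.datastore.class",
      "giraph.gora.key.class",
      "giraph.gora.persistent.class",
      "giraph.gora.keys.factory.class");

  /** Options marked as required that concern writing (assumed grouping). */
  static final List<String> REQUIRED_OUTPUT = Arrays.asList(
      "giraph.gora.output.datastore.class",
      "giraph.gora.output.key.class",
      "giraph.gora.output.persistent.class");

  /** Returns the required option names that are absent or blank, in order. */
  static List<String> missing(Map<String, String> options, List<String> required) {
    List<String> absent = new ArrayList<>();
    for (String name : required) {
      String value = options.get(name);
      if (value == null || value.trim().isEmpty()) {
        absent.add(name);
      }
    }
    return absent;
  }

  public static void main(String[] args) {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("giraph.gora.datastore.class", "org.apache.gora.hbase.store.HBaseStore");
    options.put("giraph.gora.key.class", "java.lang.String");
    System.out.println("Missing input options: " + missing(options, REQUIRED_INPUT));
    System.out.println("Missing output options: " + missing(options, REQUIRED_OUTPUT));
  }
}
```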
Input/Output Example
To make use of the Giraph input API available for Gora, it is required to extend the
classes GoraVertexInputFormat or GoraEdgeInputFormat.
In the first class, the only method that has to be implemented is
transformVertex, which transforms a Gora object into a Giraph Vertex object.
Likewise, for the second class the methods that have to be implemented are
transformEdge, which converts a Gora edge object into a Giraph Edge object, and
getCurrentSourceId. Two examples of such implementations are
GoraGVertexVertexInputFormat and GoraGEdgeEdgeInputFormat. One other class that
has to be implemented here is the KeyFactory, because this class is used to
transform the keys passed as strings through the options into the actual Gora key
objects used to query the data store. The default one assumes your key type is a
String.
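To illustrate the kind of conversion a key factory performs, the sketch below
turns the string values given for giraph.gora.start.key and giraph.gora.end.key
into Long keys. It is a standalone illustration under the assumption that the
data store is keyed by long values; the class LongKeySketch and its buildKey
method are hypothetical and do not extend or reproduce Giraph's KeyFactory.

```java
// Hypothetical standalone sketch, not part of Gora or Giraph.
final class LongKeySketch {

  /** Parses the option string into a Long key; rejects empty input. */
  static Long buildKey(String keyString) {
    if (keyString == null || keyString.trim().isEmpty()) {
      throw new IllegalArgumentException("key string must not be empty");
    }
    // Long.valueOf throws NumberFormatException for non-numeric input.
    return Long.valueOf(keyString.trim());
  }

  public static void main(String[] args) {
    // Values as they would be passed through the start/end key options.
    Long start = buildKey("0");
    Long end = buildKey("10");
    System.out.println("query range: [" + start + ", " + end + "]");
  }
}
```

A real implementation would plug the same conversion into the class named by
giraph.gora.keys.factory.class, so that the key objects match the key class
given in giraph.gora.key.class.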
On the other hand, to make use of the Giraph output API available for Gora,
it is required to extend the classes GoraVertexOutputFormat or
GoraEdgeOutputFormat.
In the first class, the methods that have to be implemented are
getGoraVertex, which transforms a Giraph Vertex object into a Gora object, and
getGoraKey, which determines the key that will represent such a vertex.
Likewise, for the edge output class the methods that have to be implemented are
getGoraEdge, which converts a Giraph Edge object into a Gora edge object, and
getGoraKey, which determines the key that will represent such an edge. Two
examples of such implementations are GoraGVertexVertexOutputFormat and
GoraGEdgeEdgeOutputFormat.
An example command showing how to put all these classes and configurations together
is shown below. It runs the shortest paths algorithm on the graph database shown
previously.
export GIRAPH_CORE_JAR=$GIRAPH_CORE_TARGET_DIR/giraph-$GIRAPH_VERSION-for-$HADOOP_VERSION-jar-with-dependencies.jar
export GIRAPH_EXAMPLES_JAR=$GIRAPH_EXAMPLES_TARGET_DIR/giraph-examples-$GIRAPH_VERSION-for-$HADOOP_VERSION-jar-with-dependencies.jar
export GIRAPH_GORA_JAR=$GIRAPH_GORA_TARGET_DIR/giraph-gora-$GIRAPH_VERSION-SNAPSHOT-jar-with-dependencies.jar
export GORA_HBASE_JAR=$GORA_HBASE_TARGET_DIR/gora-hbase-$GORA_VERSION.jar
export HBASE_JAR=$GORA_DIR/gora-hbase/lib/hbase-0.90.4.jar
export HADOOP_CLASSPATH=$GIRAPH_CORE_JAR:$GIRAPH_EXAMPLES_JAR:$GIRAPH_GORA_JAR:$GORA_HBASE_JAR
hadoop jar $GIRAPH_EXAMPLES_JAR org.apache.giraph.GiraphRunner \
  -files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml \
  -Dio.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization \
  -Dgiraph.gora.datastore.class=org.apache.gora.hbase.store.HBaseStore \
  -Dgiraph.gora.key.class=java.lang.String \
  -Dgiraph.gora.persistent.class=org.apache.giraph.io.gora.generated.GEdge \
  -Dgiraph.gora.start.key=0 \
  -Dgiraph.gora.end.key=10 \
  -Dgiraph.gora.keys.factory.class=org.apache.giraph.io.gora.utils.KeyFactory \
  -Dgiraph.gora.output.datastore.class=org.apache.gora.hbase.store.HBaseStore \
  -Dgiraph.gora.output.key.class=java.lang.String \
  -Dgiraph.gora.output.persistent.class=org.apache.giraph.io.gora.generated.GEdgeResult \
  -libjars $GIRAPH_GORA_JAR,$GORA_HBASE_JAR,$HBASE_JAR \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -eif org.apache.giraph.io.gora.GoraGEdgeEdgeInputFormat \
  -eof org.apache.giraph.io.gora.GoraGEdgeEdgeOutputFormat \
  -w 1