Tuesday, January 21, 2014

Exploring OpenStack Savanna, Parts II and III: Elastic Data Processing

In addition to provisioning Hadoop clusters, OpenStack Savanna can also be used to directly run Hadoop jobs using a feature called Elastic Data Processing (EDP).  One of the key advantages of this approach is that users do not need to deal with multiple environments -- everything can be run directly from the Savanna UI.  In this sense, Savanna is competing with Amazon's Elastic MapReduce service to offer "Analytics as a Service" (AaaS).

Part II: Exploring the Savanna Job Interface
There are three relevant tabs in the Savanna Dashboard plugin: "Job Binaries," "Jobs," and "Job Execution."  The Job Binaries tab allows you to upload the code you want to execute.  Options include Pig and Hive scripts as well as MapReduce Java jars.


After uploading the binaries, you can create a job using the Jobs tab:


At this stage, you need to select your job type: Pig, Hive, or a MapReduce Java jar.  If your code spans multiple files, you can use the "Libs" tab to select files to be included in the library path.

The final stage is executing the job.  The user is asked to choose the job, the cluster, the number of mappers and reducers, and any arguments and parameters.  If the chosen cluster is not already running, Savanna will start the cluster before the job is run and shut down the cluster when the job is finished.
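
This start-and-teardown behavior amounts to a simple control flow.  The sketch below illustrates it in Python; every function in it is a hypothetical placeholder standing in for what Savanna does internally, not an actual Savanna API.

# Illustrative sketch of the EDP execution flow described above.
# Every helper here is a hypothetical stub, not a Savanna API call.

def cluster_is_running(cluster):
    return cluster["status"] == "Active"

def start_cluster(cluster):
    cluster["status"] = "Active"        # stand-in for real provisioning

def shutdown_cluster(cluster):
    cluster["status"] = "Deleted"       # stand-in for real teardown

def submit_job(job, cluster, configs, args):
    print("running %s on %s with %s %s" % (job, cluster["name"], configs, args))

def run_edp_job(job, cluster, num_mappers, num_reducers, args):
    started_here = not cluster_is_running(cluster)
    if started_here:
        start_cluster(cluster)          # cluster is provisioned first
    try:
        # Mapper/reducer counts become standard Hadoop job properties.
        submit_job(job, cluster,
                   configs={"mapred.map.tasks": num_mappers,
                            "mapred.reduce.tasks": num_reducers},
                   args=args)
    finally:
        if started_here:
            shutdown_cluster(cluster)   # transient cluster is torn down afterwards

run_edp_job("wordcount.pig", {"name": "demo", "status": "Shutdown"}, 4, 2, [])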

Part III: Elastic Data Processing (EDP) Behind the Scenes
After the user provides the job descriptions and binaries, Savanna stores the data in a local database using the SQLAlchemy object-relational mapper.
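
For readers who have not used SQLAlchemy, the toy model below shows what persisting job metadata with it looks like.  The table and column names are made up for illustration; they are not Savanna's actual schema.

# Toy SQLAlchemy model illustrating how job metadata can be persisted.
# Names below are illustrative, not Savanna's real schema.
import uuid

from sqlalchemy import Column, String, Text, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class JobBinary(Base):
    __tablename__ = "job_binaries"

    id = Column(String(36), primary_key=True,
                default=lambda: str(uuid.uuid4()))
    name = Column(String(80), nullable=False)
    url = Column(String(256))    # placeholder for where the uploaded binary lives
    description = Column(Text)

# An in-memory SQLite database is enough to exercise the model.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(JobBinary(name="wordcount.pig", url="example://wordcount.pig"))
session.commit()
print(session.query(JobBinary).count())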

When a job is executed, Savanna converts the data model into an XML workflow for Oozie, a high-level workflow manager for running Pig, Hive, and MapReduce jobs on Hadoop.  By default, a new HDFS directory is created for every job:

/user/$username/$jobname/$uuid

where $username is the name of the user, $jobname is the name given to the job, and $uuid is a randomly generated identifier.  The workflow itself is stored in a file named, appropriately enough, "workflow.xml".  The main executable is placed in the job directory, while the libraries are placed in

/user/$username/$jobname/$uuid/libs
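
To make the layout concrete, the sketch below builds the directory names described above and a bare-bones Oozie workflow definition for a Pig job.  The XML Savanna actually emits is more involved, and the user and job names here are examples; this only shows the general shape.

# Sketch: build the per-job HDFS paths described above and a minimal
# Oozie workflow definition for a Pig job.
import uuid

username = "alice"          # example values
jobname = "wordcount"
job_uuid = str(uuid.uuid4())

job_dir = "/user/%s/%s/%s" % (username, jobname, job_uuid)
lib_dir = job_dir + "/libs"

# A bare-bones Oozie workflow for a Pig action.
workflow_xml = """<workflow-app name="%s" xmlns="uri:oozie:workflow:0.2">
  <start to="job-node"/>
  <action name="job-node">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>wordcount.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
""" % jobname

print(job_dir)        # /user/alice/wordcount/<uuid>
print(lib_dir)        # /user/alice/wordcount/<uuid>/libs
print(workflow_xml)   # would be written to <job_dir>/workflow.xml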

Savanna currently provides two options for handling data sources.  If the user has provided data through Swift (the OpenStack object store), the Oozie workflow points to the appropriate Swift locations.  Otherwise, it is assumed that the user has provided paths to the data in the command-line arguments given when executing the job.  For example, if you intend to use HDFS, you would be responsible for manually uploading the data to the Hadoop cluster.
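
As a concrete example of that last point, staging input data on HDFS by hand might look like the following, run from a node that has the hadoop command-line client configured for the cluster.  All paths are examples only.

# Manually staging input data on the cluster's HDFS with the standard
# "hadoop fs" client.  Paths below are examples.
import subprocess

input_dir = "/user/alice/wordcount-input"

subprocess.check_call(["hadoop", "fs", "-mkdir", input_dir])
subprocess.check_call(["hadoop", "fs", "-put", "local_data.txt", input_dir])

# The resulting HDFS paths are then passed as command-line arguments when
# the job is executed, e.g. an input of /user/alice/wordcount-input and an
# output of /user/alice/wordcount-output.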
