Clutch: Documentation

Overview

To the right is a picture of the entire Clutch software stack for a typical MapReduce application. The Java Virtual Machine (JVM) is, of course, at the very bottom of the stack. Clutch itself is built on top of ProActive/Remote Method Invocation, which provide the capability to make method calls on remote objects transparently. MapReduce- "a programming model and an associated implementation for processing and generating large data sets" - is then implemented using Clutch agents. Finally, the end user developer's application is implemented on top of the Clutch implementation of MapReduce. Because all of the communication details are handled by Clutch and all of the MapReduce bookkeeping has already been implemented, the end user developer need only worry about implementing a few simple classes in order to accomplish the task at hand.

Clutch doubletree structure The double-tree to the left is the intended overall organization of a system of Clutch agents. Using a few simple organizational rules, the agents attempt to organize themselves in this manner using only local information. In other words, there is no global scheduler or specialized organizing agent that tells new agents "where to go" within the system. Agents that end up in the leftmost tree take on the role of "mappers", while agents that find themselves in the right side of the tree are "reducers". (More information on these two types of computational roles can be found in Google's paper on MapReduce ). In addition, each agent maintains one or more links to the "other side" of the double-tree (not pictured) in order to propagate data from left-to-right; in other words, raw data is processed by mappers, propagated to one or more reducers for further processing, and eventually finds its way to the root of the reduce side of the tree where final results are written to disk to be collected by the user. More on the implementational details later.

A simple example

In order to implement any Clutch application, three basic components are needed:

a MapTaskFactorythat describes how to generate a MapTaskfrom raw input
a MapTaskimplementation that can map raw input (provided by the MapTaskFactory) to a set of key-value pairs
a ReduceTaskimplementation that can reduce sets of intermediate key-value pairs (i.e., output from a previous MapTaskor ReduceTask) to sets of result key-value pairs

One example of a simple Clutch application is one that computes the frequency of unique words in a file. The MapTaskFactoryfor such an application would take the following form:

public class WCMapTaskFactory extends AbstractMapTaskFactory {

/* required: empty no-args constructor */

public WCMapTaskFactory() {}

/* create a MapTask starting at our current offset

* using our "default" task length */

    public MapTask createMapTask()
        throws IOException
    {
        WCMapTask task = new WCMapTask();

        ...
        // some code that reads data and generates a WCMapTask
        ...

        task.addLengthOffsetPair(
            new Pair<Long, Long>(offset, length));
        task.setTotalLength(file.length());

        return task;
    }

/* create a MapTask starting at the specified offset

* that is of the specified length */

    public MapTask createMapTask(long offset, long length)
        throws IOException
    {
        WCMapTask task = new WCMapTask();

        ...
        // some code that reads data and generates a WCMapTask
        ...

        task.addLengthOffsetPair(
            new Pair<Long, Long>(offset, length));
        task.setTotalLength(file.length());

        return task;
    }
}

Note that the class musthave an empty no-argument constructor. This is a constraint inherited from ProActive, the middleware layer used to build Clutch. It is also important to note that the MapTaskFactoryis responsible for generating length-offset pairs for the dataset as well as knowing what the total "length" of the dataset is. These lengths and offsets can be arbitrary - e.g., tasks could be assigned sequence numbers, with the total length being the highest number in the sequence - or they can be derived in some logical way, such as the total length of the file being read (as in the example above). These are strictly for bookkeeping and are the mechanism that Clutch uses to handle task re-execution and job completion.

The MapTaskin our example would look like the following:

public class WCMapTask extends MapTask {

/* required: empty no-args constructor */

public WCMapTask() {}

/* this is the important function for a mapper */

    public ReduceTask map() {

        WCReduceTask task = new WCReduceTask();

        ...
        // some code that reduces the key-value pairs and populates
        // the WCReduceTask with the result
        ...

        return task;
    }
}

The map()function is the only function that WCMapTaskmust implement. For our word frequency example, the map()function could generate a list of key-value pairs each of which had a word as the key and 1 as the value. The pairs would then need to get propagated to the ReduceTask being generated.

Finally, WCReduceTaskwould be structured as follows:

public class WCReduceTask extends MapTask {

/* required: empty no-args constructor */

public WCReduceTask() {}

/* this is the important function for a reducer */

    public void reduce() {
        ...
        // some code that reduces intermediate key-value pairs
        // into their "final" form
        ...
    }

/* we need a way to retrieve resuls from the final Reducer */

    public List getResult() {
        return resultList;
    }
}

Coming soon: ReduceTask discussion, configuration, and code listings

More to do

Class diagrams, in-depth discussion of the design and implementation of Clutch, and a practical developer user's guide coming soon!