Cascalog-checkpoint: Fault-tolerant MapReduce Topologies

Cascalog is an abstraction library on top of Cascading [for writing MapReduce jobs][]. Since the Cascalog library is maturing, the Twitter guys (core committers) have been building features around it so that it's not just an abstraction for Cascading. One of which is [Cascalog-checkpoint][]. It is a small, easy-to-use, and very powerful little add-on for Cascalog. In particular, it enables fault-tolerant MapReduce topologies. Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine that you have Flow A and B. Flow B uses the result from A along with other bits. Thus, Flow B is dependent on A. Typically, if a MapReduce job fail for whatever reason, you simply fix what's wrong and start the job all over again. But what if Flow A takes hours to run (which is common for a MR job) and the error happened in Flow B? Why re-do all that processing for Flow A if we know that it finished successfully? By using Cascalog-checkpoint, you can *stage* intermediate results (e.g. result of Flow A) and failed jobs can automatically pickup from the last checked point. An obvious thing to do but not something I've seen done in Hadoop. At least not as easy as this: See [Sam Ritchie's post on cascalog-checkpoint][] for more examples. Of course, you need to coerce your flows such that output from Flow A can be read by Flow B. However, this is almost trivial via Cascalog/Cascading. As this notion of mix and match pipes and flows is a fundamental concept in Cascalog/Cascading. With so many choices of abstraction frameworks for coding MapReduce on Hadoop, I feel sorry for anyone using vanilla Java for writing MapReduce besides the most simplest or recurring jobs.