Cascalog-checkpoint: Fault-tolerant MapReduce Topologies

Cascalog is an abstraction library on top of Cascading [for writing MapReduce jobs][]. As the Cascalog library matures, the Twitter guys (core committers) have been building features around it so that it's more than just an abstraction over Cascading. One of these is [Cascalog-checkpoint][]: a small, easy-to-use, and very powerful add-on for Cascalog. In particular, it enables fault-tolerant MapReduce topologies.

Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine that you have Flows A and B, where Flow B uses the result of A along with other inputs; Flow B is therefore dependent on A. Typically, if a MapReduce job fails for whatever reason, you fix what's wrong and start the job all over again. But what if Flow A takes hours to run (which is common for a MapReduce job) and the error happened in Flow B? Why redo all that processing for Flow A if we know it finished successfully?

With Cascalog-checkpoint, you can *stage* intermediate results (e.g. the result of Flow A), and failed jobs automatically pick up from the last checkpoint. An obvious thing to do, but not something I've seen done in Hadoop, at least not as easily as this. See [Sam Ritchie's post on cascalog-checkpoint][] for more examples.

Of course, you need to arrange your flows so that the output of Flow A can be read by Flow B. This is almost trivial in Cascalog/Cascading, since mixing and matching pipes and flows is a fundamental concept there. With so many abstraction frameworks to choose from for coding MapReduce on Hadoop, I feel sorry for anyone writing MapReduce in vanilla Java for anything but the simplest or most recurring jobs.
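To make the Flow A / Flow B idea concrete, here is a minimal sketch using cascalog-checkpoint's `workflow` macro, as I understand it from the docs. `query-a`, `query-b`, and the paths are hypothetical placeholders, so treat this as the shape of the thing rather than copy-paste-ready code:

```clojure
(ns example.checkpoint
  (:use cascalog.api
        cascalog.checkpoint))

(defn run-flows [tmp-root out-path]
  ;; `workflow` stages each step's output under tmp-root. If a later
  ;; step fails, re-running the workflow skips the steps that already
  ;; completed and resumes from the last checkpoint.
  (workflow [tmp-root]
    ;; Flow A: writes its result to a checkpointed tmp dir.
    flow-a ([:tmp-dirs [a-out]]
            (?- (hfs-seqfile a-out) query-a))
    ;; Flow B: depends on all prior steps and reads A's staged output.
    flow-b ([:deps :all]
            (?- (hfs-textline out-path)
                (query-b (hfs-seqfile a-out))))))
```

The key point is that the checkpointing is declarative: you name the steps and their dependencies, and the macro handles skipping completed work on a re-run.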

My 5 minute lightning talk on Cascalog

Cascalog is a data processing library for building MapReduce jobs; it makes it a lot simpler to build, for example, distributed strategy backtesters over terabytes of market data. I've been spiking out a data processing project with it at work for the past couple of weeks, so I thought I might as well give a lightning talk about it at our monthly developers meetup. Here are my presentation slides introducing Cascalog and outlining its features.
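For a taste of what the talk covers, here is the canonical Cascalog word count, a sketch assuming an `input/` directory of text files (the path and namespace are placeholders of my choosing):

```clojure
(ns example.wordcount
  (:use cascalog.api)
  (:require [cascalog.ops :as c]
            [clojure.string :as s]))

;; Custom operation: emit one tuple per word on a line.
(defmapcatop split-words [line]
  (s/split line #"\s+"))

(defn -main []
  ;; Read lines, split into words, count occurrences per word,
  ;; and print the results to stdout.
  (?<- (stdout)
       [?word ?count]
       ((hfs-textline "input/") ?line)
       (split-words ?line :> ?word)
       (c/count ?count)))
```

That whole query compiles down to a Cascading flow, which in turn runs as one or more MapReduce jobs on Hadoop.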

The possibilities...

Local Hadoop test cluster up and running

Thanks to Cloudera's CDH3 image, I have a virtual machine running Hadoop on CentOS 5. I'm more of an Ubuntu guy, so CentOS is new to me, but nothing Google couldn't solve. I also ran into a Hadoop exception about the Java heap space; I couldn't find a solution online, so I just bumped up the memory on the virtual machine and that solved the problem. In any case, I managed to run the pi calculation example on my local Hadoop cluster.
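For reference, the pi example ships with the Hadoop examples jar. The exact path and jar name vary by CDH release, so adjust to match your install:

```shell
# Run the bundled pi estimator: 10 maps, 100 samples each.
# On CDH3 the examples jar typically lives under /usr/lib/hadoop-0.20/.
hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-*.jar pi 10 100

# If tasks die with java.lang.OutOfMemoryError, raising the task JVM
# heap in mapred-site.xml is another option besides more VM memory:
#   <property>
#     <name>mapred.child.java.opts</name>
#     <value>-Xmx1024m</value>
#   </property>
```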


Building a distributed back-tester with Hadoop on Amazon AWS

Testing is arguably the single most important aspect of trading system development. You can't tell how well an idea works unless you test it out. [Testing can also help you identify weaknesses or strengths in your model][]. The downside to testing is that it takes time to churn through those gigabytes of data.

Backtesting is inherently a sequential process: you feed tick data into your algorithm and expect some actions. You can't really use fork/join to let other threads steal from the processing queue, because later steps depend on results from earlier calculations. More often than not, however, you're interested in testing many variations of a strategy. This is where MapReduce comes into play.

MapReduce is a software framework from Google, inspired by the map and reduce functions ubiquitous in functional programming (they are as common as for-loops in the Java world). The map step partitions the input into smaller problems and runs them concurrently, e.g. each of the strategy's variants is executed on its own node. The reduce step takes the results from all the nodes and aggregates them into an output, e.g. collecting the backtest results for each variant.

Having used functional programming for some time now, map/reduce feels very natural to me. Where my knowledge falls short is in implementing a distributed infrastructure for running these maps and reduces at a scale far beyond my own multi-core computer. It just so happens that Amazon AWS has a hosted Hadoop PaaS, where Hadoop is Apache's framework for MapReduce. Hardware, check. Framework, check.

This will be the ~~first~~ second system that I'll be working on in my goal to build a complete trading R&D platform. Expect some technical discussions in the coming months as I work my way through. Now, where should I start...
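The map-over-variants, reduce-to-an-answer shape can be sketched in a few lines of plain Clojure. This is a toy, not a real backtester: `backtest` just "scores" a moving-average-style window against a fake price series, and every name here is made up for illustration. The point is only the structure, where `map` runs each variant independently (MapReduce distributes this across nodes) and `reduce` aggregates:

```clojure
(ns example.variants)

;; Stand-in for a real backtester: score one parameter variant
;; (a lookback window) against a price series.
(defn backtest [prices window]
  {:window window
   :pnl (reduce + (map - (drop window prices) prices))})

(defn best-variant [prices windows]
  ;; map phase: each variant is independent, so these runs could be
  ;; farmed out to separate nodes (or cores, via pmap).
  (let [results (map #(backtest prices %) windows)]
    ;; reduce phase: aggregate per-variant results into one answer.
    (reduce (fn [a b] (if (> (:pnl a) (:pnl b)) a b)) results)))

(best-variant [1 2 3 5 8 13] [1 2 3])
;; → {:window 3, :pnl 20}
```

Each individual backtest is still sequential internally; the parallelism comes entirely from running the variants side by side.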

[Testing can also help you identify weaknesses or strengths in your model]: