What should I work on next for Cascalog?

Too many things to do, too little time. I figured we could decide this in a data-driven way, so here's a poll. Please only submit an entry if you use Cascalog, and only one per person. Let's see if an honour system works.

Poll result one week later


Votes Choice
15 Self-contained documentation site, e.g. http://cascalog.quantisan.com (demo address only)
10 Improve and consolidate guides into Github wiki
11 Bring Cascalog to Cascading 2.1/2.2
8 Fix open issues
13 Add features and performance increase, e.g. new logic solver, make use of new Cascading features since 2.0
2 Other

One person voted for integrating another machine learning library into Cascalog, and another for isolating the system library.

Posted 08 May 2013 in computing.

My talk at OR54 on knowledge discovery with web log data


Web log data contains a wealth of information about online visitors. We have a record of each and every customer interaction for the millions of visitors coming through uSwitch.com each month. The challenge is to analyse this discrete-time-series, semi-structured dataset to understand the behaviour of our visitors on a personal level. This talk is a case study of how our data team of three leveraged a heterogeneous architecture and agile methodologies to tackle this problem. And we had three months.


Cascalog-checkpoint: Fault-tolerant MapReduce Topologies

Cascalog is an abstraction library on top of Cascading [for writing MapReduce jobs][]. As the library matures, the Twitter guys (core committers) have been building features around it so that it's more than just an abstraction over Cascading. One of these is [Cascalog-checkpoint][], a small, easy-to-use, and very powerful add-on. In particular, it enables fault-tolerant MapReduce topologies.

Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine you have Flow A and Flow B, where Flow B uses the result from A along with other inputs; thus, Flow B depends on A. Typically, if a MapReduce job fails for whatever reason, you fix what's wrong and start the job all over again. But what if Flow A takes hours to run (which is common for an MR job) and the error happened in Flow B? Why redo all that processing for Flow A if we know it finished successfully?

With Cascalog-checkpoint, you can *stage* intermediate results (e.g. the result of Flow A), and failed jobs automatically pick up from the last checkpoint. An obvious thing to do, but not something I've seen done in Hadoop, at least not this easily. See [Sam Ritchie's post on cascalog-checkpoint][] for more examples.

Of course, you need to coerce your flows such that the output of Flow A can be read by Flow B. However, this is almost trivial via Cascalog/Cascading, since this notion of mixing and matching pipes and flows is a fundamental concept in both. With so many choices of abstraction frameworks for coding MapReduce on Hadoop, I feel sorry for anyone using vanilla Java to write anything beyond the simplest or most recurring MapReduce jobs.
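To make the Flow A/Flow B scenario concrete, here is a rough sketch of what a checkpointed topology can look like with the `workflow` macro from `cascalog.checkpoint`. The query names (`expensive-query`, `dependent-query`) and paths are hypothetical placeholders, and the exact option keywords should be checked against the cascalog-checkpoint README; treat this as an illustration of the shape, not a definitive listing.

```clojure
(ns example.checkpointed
  (:use cascalog.api
        [cascalog.checkpoint :only (workflow)]))

;; Hypothetical queries standing in for Flow A and Flow B.
(declare expensive-query dependent-query)

(defn -main [in-path out-path]
  ;; Checkpoint state is kept under "/tmp/checkpoints"; on failure,
  ;; re-running -main skips any step that already completed.
  (workflow ["/tmp/checkpoints"]
    ;; Flow A: stage its result in a temporary seqfile directory.
    flow-a ([:tmp-dirs a-out]
            (?- (hfs-seqfile a-out)
                (expensive-query (hfs-textline in-path))))
    ;; Flow B: declared to depend on flow-a, reads A's staged output.
    flow-b ([:deps flow-a]
            (?- (hfs-textline out-path)
                (dependent-query (hfs-seqfile a-out))))))
```

If `flow-b` throws, a rerun restarts from `flow-b` and reuses the staged output of `flow-a` instead of recomputing it.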

My 5 minute lightning talk on Cascalog

Cascalog is a data processing library for building MapReduce jobs. It makes it a lot simpler to build, for example, distributed strategy backtesters on terabytes of market data. I've been spiking out a data processing project with it at work for the past couple of weeks, so I thought I might as well give a lightning talk about it at our monthly developers meetup. Here are my presentation slides introducing Cascalog and outlining its features.
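For anyone who hasn't seen Cascalog before, the flavour of its Datalog-style query syntax is easy to show with the introductory example from the Cascalog README, which uses the in-memory sample data in `cascalog.playground` (so it runs locally, no Hadoop cluster required):

```clojure
(use 'cascalog.api)          ; ?<-, stdout, and friends
(use 'cascalog.playground)   ; sample datasets, including `age`

;; Find every person in the sample dataset younger than 30.
;; ?person and ?age are logic variables; the predicates on the
;; right constrain them, and the vector names the output fields.
(?<- (stdout)
     [?person]
     (age ?person ?age)
     (< ?age 30))
```

The same query shape compiles down to a Cascading flow and runs unchanged on a Hadoop cluster when the source and sink taps point at HDFS instead of in-memory data.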

The possibilities...