A hypothetical data analysis platform

My definition of a statistical platform is the glue that ties orthogonal data analysis functions together. Take R, for instance: it is a platform-as-application. You fire up R and everything is accessible to you. However, all the packages work only on top of R.

Python, on the other hand, takes a platform-as-libraries approach. A basic data analysis setup is to pip install NumPy, SciPy, and Matplotlib. High-level libraries, such as scikit-learn and pandas, are built on top of these. This is somewhat more flexible for picking and choosing, but the dependencies between packages still form a tree-like structure.

Then there's Incanter.

You don't like using Parallel Colt for your matrices? Here, try this BLAS drop-in replacement and everything will just work, ten times faster.

Much of this flexibility is due to earlier design choices by Liebke et al. to leverage Clojure's idiom that "it is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures."
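
As a small illustration of that idiom, here is a minimal sketch assuming nothing beyond a standard Incanter setup (incanter.core and incanter.stats): several unrelated functions all consume the one dataset structure.

```clojure
;; Minimal sketch of "many functions, one data structure",
;; assuming a standard Incanter setup (incanter.core, incanter.stats).
(use '(incanter core stats))

(def d (dataset [:x :y] [[1 2] [3 4] [5 6]]))

;; Unrelated functions all operate on the same dataset structure:
(nrow d)           ; => 3
(sel d :cols :x)   ; => (1 3 5)
(mean ($ :y d))    ; => 4.0
```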

The thing is, I think we're only scratching the surface. Excuse me while I dream for a minute.

Say that instead of jBLAS, you want to use a hybrid CPU/GPU library. Suppose you could just do a (use 'incanter-magma) and your Incanter code would run with MAGMA (via Matthew Rocklin) under the hood without any other change.
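
Such a swap could plausibly be built on Clojure protocols. Here is a hypothetical sketch; neither incanter-magma nor the MatrixOps protocol exists in Incanter, and every name below is invented:

```clojure
;; Hypothetical sketch only: none of these names exist in Incanter today.
;; The idea is that (use 'incanter-magma) would merely rebind the backend.
(defprotocol MatrixOps
  (mmult* [backend a b] "Multiply matrices a and b on this backend."))

(defrecord JblasBackend []
  MatrixOps
  (mmult* [_ a b] (comment "call jBLAS here")))

(defrecord MagmaBackend []
  MatrixOps
  (mmult* [_ a b] (comment "call MAGMA's hybrid CPU/GPU multiply here")))

(def ^:dynamic *backend* (->JblasBackend))

;; User code only ever calls this; the backend is invisible to it.
(defn mmult [a b]
  (mmult* *backend* a b))

;; Loading incanter-magma would then boil down to something like:
;; (alter-var-root #'*backend* (constantly (->MagmaBackend)))
```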

Let's take this idea of interfacing libraries into a hypothetical use case. Imagine that you have cleaned and structured your data on Hadoop using Cascalog and are looking to analyse this dataset. You start your Incanter session and pull in your data with (use 'incanter-cascalog). You write some Incanter script to interrogate the dataset, but find that the data is still too big for your laptop. So you (use 'incanter-storm) to make use of distributed processing instead. Incanter would then flow data directly from Cascalog to Storm inside your cluster.

For your results, you find JFreeChart limiting, so you (use 'incanter-c2) to spiff up your visualisations with C2, all without changing a single line of your Incanter script. A purely imaginary session might look like the sketch below.
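
To be clear, none of these namespaces exist today; cascalog-dataset and group-summary are invented names, and only view and bar-chart come from the real Incanter API. This is just what such a session could feel like:

```clojure
;; Purely imaginary REPL session: incanter-cascalog, incanter-storm and
;; incanter-c2 do not exist; cascalog-dataset and group-summary are invented.
(use 'incanter-cascalog)            ; pull the cleaned data off Hadoop
(def orders (cascalog-dataset "/data/orders"))

(use 'incanter-storm)               ; too big for the laptop? go distributed
(def summary (group-summary orders :by :region))

(use 'incanter-c2)                  ; render with C2 instead of JFreeChart
(view (bar-chart summary))          ; same Incanter call, new backend
```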

Instead of the star-like dependency structure of R and its packages, or the tree-like structure of Python and its packages, Incanter could be an interface to stand-alone libraries, encapsulated by an application for the user.

Incanter, the library, could be a set of modules that transform data to and from external libraries into standard Incanter-compatible data structures. Incanter, the application, could be a domain-specific language, a client, and an in-REPL package manager.
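
The conversion half could be as simple as a protocol that every interface module extends to its library's native results. The ToDataset protocol below is a hypothetical sketch, not an existing Incanter API:

```clojure
;; Hypothetical sketch: a conversion protocol that each interface module
;; (incanter-cascalog, incanter-storm, ...) would extend for its library.
(require 'incanter.core)

(defprotocol ToDataset
  (to-dataset* [x] "Coerce an external result into an Incanter dataset."))

;; e.g. an incanter-cascalog module could turn raw tuple seqs into datasets:
(extend-protocol ToDataset
  clojure.lang.ISeq
  (to-dataset* [tuples]
    (incanter.core/dataset
     (map #(keyword (str "col" %)) (range (count (first tuples))))
     tuples)))
```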

Another benefit is that this would help mitigate Incanter's developer shortage by making use of external, stand-alone libraries.

I call this platform-as-interface.

R language lacks consistency

That's the thing that gets under my skin about R. For example, why do I need to use do.call on a list but apply on a data frame? Not only that, the two functions have entirely different names and signatures (edit: this is a bad example; see Joshua's explanation in the comments). I find that a lot of R is just a matter of memorising what to do under which condition. You can't simply guess or deduce what to do, as a sane programming language would let you. I know that many solutions are just a Google search away, but I am not comfortable with the awkwardness of the language.

Having said that, I've been using R reluctantly for a few years. I try to avoid it as much as possible, but there's just no other statistical platform that's so easy to be productive in. I tried Incanter for a while, but it doesn't seem well suited for exploratory analysis, as I usually end up writing functions from scratch. More recently, I played with Julia briefly, but it is too bleeding edge; there isn't even a release version yet.

As much as I don't like R the language, the R platform and its package repertoire are incomparable at the moment. We did a bit of ggplot today at work. With the help of my coworker, it only took a few minutes to hook R up to our Hadoop cluster, pull some data, and produce the graphs that we wanted. In comparison, Incanter's charts are pretty too, but not very customisable. D3.js is very customisable, but not quick to use at all. Then there's Julia, which can't do more than a bar or line chart for now.

I haven't mentioned the other big contender: Python + NumPy + SciPy + pandas + PyPy + Matplotlib. I tried some of that some time ago too, but didn't get far with it. Come to think of it, I wrote a similar babble a year ago...

Quantisan.com is now compiled and statically served

I just migrated this blog from a self-hosted WordPress site to a statically generated Pelican blog. It took me a whole weekend because I have almost 600 posts. Much is still broken at the moment, but the site seems presentable, so I pushed it through. The fixes will just have to wait.

I considered using Jekyll, as it appears to be the most popular static site generator. A look at its GitHub page, though, shows that no patch has been made for months; development on Jekyll appears to have paused for some reason. Furthermore, I didn't have much luck getting exitwp, a WordPress-XML-to-Jekyll importer, to gracefully parse my many custom posting tags. I gave Ruhoh a quick trial as well, but it too relies on exitwp to import WordPress posts.

Then I found that Pelican has its own importer tool based on pandoc, which, in my testing, handles the non-standard tags in my posts much better.

In any case, here are the steps I took to migrate to Pelican.

  1. Exported WordPress into an XML file
  2. Transferred comments to Disqus
  3. Installed the Pelican 3.0.0 dev build
  4. Used pelican-import to convert the XML file into Markdown posts
  5. Moved some posts into their directory/category
  6. Used TextMate to do a bunch of batch fixes
  7. Configured Pelican and made a Makefile
  8. Rolled my own gh-pages script into the Makefile (instead of installing ghp-import)
  9. Customised a Pelican theme
  10. Set up GitHub for the custom domain
  11. Updated DNS

After all of that, this site is now hosted and served on GitHub.

Cascalog-checkpoint: Fault-tolerant MapReduce Topologies

Cascalog is an abstraction library on top of Cascading for writing MapReduce jobs. As the Cascalog library matures, the Twitter guys (core committers) have been building features around it so that it's not just an abstraction for Cascading. One of these is Cascalog-checkpoint: a small, easy-to-use, and very powerful little add-on for Cascalog. In particular, it enables fault-tolerant MapReduce topologies.

Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine that you have Flow A and Flow B. Flow B uses the result from A along with other bits, so Flow B is dependent on A. Typically, if a MapReduce job fails for whatever reason, you simply fix what's wrong and start the job all over again. But what if Flow A takes hours to run (which is common for an MR job) and the error happened in Flow B? Why redo all that processing for Flow A if we know that it finished successfully?

By using Cascalog-checkpoint, you can *stage* intermediate results (e.g. the result of Flow A), and failed jobs can automatically pick up from the last checkpoint. An obvious thing to do, but not something I've seen done in Hadoop, at least not as easily as the sketch below. See Sam Ritchie's post on cascalog-checkpoint for more examples.

Of course, you need to coerce your flows such that the output from Flow A can be read by Flow B. However, this is almost trivial via Cascalog/Cascading, as this notion of mixing and matching pipes and flows is a fundamental concept there.

With so many choices of abstraction frameworks for coding MapReduce on Hadoop, I feel sorry for anyone using vanilla Java to write MapReduce jobs beyond the simplest or most recurring ones.
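
To give a flavour of how little code this takes, here is a sketch modelled on the examples in Sam Ritchie's post; expensive-query and cheap-query are hypothetical placeholders for your own Cascalog queries:

```clojure
;; Sketch of a checkpointed two-step topology with cascalog.checkpoint.
;; expensive-query and cheap-query stand in for your own Cascalog queries.
(ns example.checkpointed
  (:use cascalog.api
        [cascalog.checkpoint :only [workflow]]))

(defn -main [in-path out-path]
  (workflow ["/tmp/example-checkpoint"]     ; staged results live here
    flow-a ([:tmp-dirs [a-out]]             ; Flow A checkpoints to a-out
            (?- (hfs-seqfile a-out)
                (expensive-query (hfs-textline in-path))))
    flow-b ([:deps :last]                   ; runs only after Flow A succeeds
            (?- (hfs-textline out-path)
                (cheap-query (hfs-seqfile a-out))))))

;; If flow-b blows up, rerunning -main skips flow-a and resumes from
;; its staged output instead of recomputing it.
```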
