A hypothetical data analysis platform

My definition of a statistical platform is that it is a glue that ties orthogonal data analysis functions together. Take R, for instance: it is a platform-as-application. You fire up R and everything is accessible to you. However, all the packages only work on top of R.

Python, on the other hand, takes a platform-as-libraries approach. A basic data analysis setup is to pip install NumPy, SciPy and Matplotlib. High-level libraries, such as scikit-learn and pandas, are built on top of these. It is somewhat more flexible for picking and choosing, but the dependencies between packages still form a tree-like structure.

Then there's Incanter.

You don't like to use Parallel Colt for your matrices? Here, try this BLAS drop-in replacement and everything would just work with 10x speed.

Much of this flexibility is due to earlier design choices by Liebke et al. to leverage Clojure's idiom that "it is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures."

The thing is, I think we're only scratching the surface. Excuse me while I dream for a minute.

Say instead of jBLAS, you want to use a CPU/GPU hybrid. Suppose you could just do a (use 'incanter-magma) and your Incanter code would run with MAGMA (via Matthew Rocklin) under the hood without any other change.
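To make the backend-swapping idea a bit more concrete, here is a rough sketch in Python rather than Clojure. It is not Incanter's API: the names `use`, `matmul`, `gram` and the two toy backends are all made up for illustration. The point is only that the "script" (`gram`) never changes while the implementation underneath it does.

```python
from typing import Callable, Dict, List

# The one shared data structure every function operates on (an assumption
# for this sketch): a matrix as a list of rows.
Matrix = List[List[float]]


def naive_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference backend: textbook triple loop."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transposed_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for a drop-in replacement: same result, different strategy."""
    cols = list(zip(*b))  # columns of b, computed once
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


# Hypothetical registry of interchangeable backends.
_BACKENDS: Dict[str, Callable[[Matrix, Matrix], Matrix]] = {
    "naive": naive_matmul,
    "transposed": transposed_matmul,
}
_active = "naive"


def use(name: str) -> None:
    """Select the backend, loosely mimicking (use 'incanter-...)."""
    global _active
    if name not in _BACKENDS:
        raise KeyError(f"unknown backend: {name}")
    _active = name


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """The only entry point user code sees; dispatches to the active backend."""
    return _BACKENDS[_active](a, b)


def gram(x: Matrix) -> Matrix:
    """The user's 'script': computes X^T X without knowing the backend."""
    xt = [list(col) for col in zip(*x)]
    return matmul(xt, x)
```

Calling `use("transposed")` before `gram(x)` changes which function does the work, but `gram` itself is untouched. A real MAGMA or BLAS binding would obviously be far more involved than a dictionary lookup.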

Let's take this idea of interfacing libraries into a hypothetical use case. Imagine that you have cleaned and structured your data on Hadoop using Cascalog and are looking to analyse this dataset. You start your Incanter session and pull in your data with (use 'incanter-cascalog). You write some Incanter script to interrogate the dataset but find the data is still too big for your laptop. So you (use 'incanter-storm) to make use of distributed processing instead. Incanter would then flow data directly from Cascalog to Storm inside your cluster.

For your results, you find JFreeChart limiting, so you (use 'incanter-c2) to spiff up your visualisations with C2, all without changing a single line of your Incanter script.

Instead of the star-like dependency of R and its packages, or the tree-like structure for Python and its packages, Incanter could be an interface to stand-alone libraries encapsulated by an application for the user.

Incanter, the library, could be a set of modules that transform data between external libraries and standard Incanter-compatible data structures. Incanter, the application, could be a domain-specific language, a client, and an in-REPL package manager.
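As a rough illustration of the "library as adapters" half of that, here is a small Python sketch. Everything in it is an assumption made for the example: the `Dataset` shape (a list of row dictionaries), and the function names `from_csv`, `to_json` and `column_mean`. CSV and JSON simply stand in for whatever external libraries sit on either side.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical "standard" structure that every adapter converts to or from.
Dataset = List[Dict[str, str]]


def from_csv(text: str) -> Dataset:
    """Inbound adapter: external CSV text -> standard dataset."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def to_json(data: Dataset) -> str:
    """Outbound adapter: standard dataset -> JSON for some external consumer."""
    return json.dumps(data, sort_keys=True)


def column_mean(data: Dataset, name: str) -> float:
    """An analysis function that only knows about the standard structure."""
    values = [float(row[name]) for row in data]
    return sum(values) / len(values)
```

Analysis functions such as `column_mean` depend only on the shared structure, so adding another source or sink means writing one more adapter rather than touching the analysis code.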

Another benefit is that making use of external, stand-alone libraries helps to mitigate Incanter's developer shortage problem.

I call this platform-as-interface.

R language lacks consistency

That's the thing that gets under my skin about R. For example, why do I need to use do.call on a list but apply for a data frame? Not only that, the two functions have entirely different names and signatures (edit: this is a bad example, see Joshua's explanation in the comment). I find that a lot of R is just a matter of memorising what to do under which condition. You can't simply guess or deduce what to do, the way a sane programming language would let you. I know that many solutions are just a Google search away, but I am not comfortable with the awkwardness of the language.

Having said that, I've been using R reluctantly for a few years. I try to avoid it as much as possible. But there's just no other statistical platform that's so easy to be productive in. I tried Incanter for a while but it doesn't seem that well suited for exploratory analysis, as I usually end up writing functions from scratch. More recently I played with Julia briefly, although it is too bleeding edge as there isn't even a release version yet.

As much as I don't like R the language, the R platform and its package repertoire are incomparable at the moment. We did a bit of ggplot today at work. With the help of my coworker, it only took a few minutes to hook R up to our Hadoop cluster, pull some data and produce the graphs that we wanted. In comparison, Incanter's charts are pretty too but not very customisable. D3.js is very customisable but not quick to use at all. Then there's Julia, which can't do more than a bar or line chart for now.

I haven't mentioned the other big contender, Python + NumPy + SciPy + pandas + PyPy + Matplotlib. I tried some of that too some time ago, but didn't get far with it. Come to think of it, I wrote a similar babble a year ago ...