Data science is a collaborative effort
The other day our marketing guy dropped a laundry list of reports that he wanted me to produce. I work with him sometimes, so I don’t mind putting on a business intelligence hat occasionally to help him out. But I mustn’t forget that driving long-term value to the business is what I am good at, so I had to say no to him. What was supposed to be a quick chat ended up as an hour-long discussion in which I showcased what I’m working on and demonstrated that I can do so much more than count things for him. In short, if you want data, ask the business analyst; if you want to solve problems, let’s talk.
It’s been said that Big Data analysis is a feedback loop. However, I don’t think having the data science team toil away at data is the message. Who’s to say that analytical steps in a big data reference model can’t involve feedback from people in addition to models?
For the past week, I’ve been showcasing a new internal data product to various stakeholders in our business. This one data product is now leading to a new process for our accountant, a product for our online marketing manager, and an innovative company-wide metric. All of this happened because we collaborated rather than delegated tasks. Analysing data is not a one-way process. Make it a feedback loop of math and people.
Cascalog-checkpoint: Fault-tolerant MapReduce Topologies
Cascalog is an abstraction library on top of Cascading for writing MapReduce jobs. As the Cascalog library matures, the Twitter guys (core committers) have been building features around it so that it’s not just an abstraction for Cascading. One of these is Cascalog-checkpoint, a small, easy-to-use, and very powerful add-on for Cascalog. In particular, it enables fault-tolerant MapReduce topologies.
Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine that you have Flow A and Flow B. Flow B uses the result from A along with other bits. Thus, Flow B is dependent on A. Typically, if a MapReduce job fails for whatever reason, you simply fix what’s wrong and start the job all over again. But what if Flow A takes hours to run (which is common for an MR job) and the error happened in Flow B? Why redo all that processing for Flow A if we know that it finished successfully?
By using Cascalog-checkpoint, you can stage intermediate results (e.g. the result of Flow A), and failed jobs can automatically pick up from the last checkpoint. It is an obvious thing to do, but not something I’ve seen done in Hadoop, at least not as easily as this:
See Sam Ritchie’s post on cascalog-checkpoint for more examples.
Of course, you need to coerce your flows such that the output from Flow A can be read by Flow B. However, this is almost trivial in Cascalog/Cascading, as mixing and matching pipes and flows is a fundamental concept in both.
With so many choices of abstraction frameworks for coding MapReduce on Hadoop, I feel sorry for anyone using vanilla Java to write anything but the simplest or most routine MapReduce jobs.
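The checkpointing idea itself (stage each flow's output, and skip flows that already finished when the job is rerun) can be sketched in a few lines of plain Python. This is only a conceptual illustration and not Cascalog-checkpoint's API: the function `run_with_checkpoints`, the JSON staging files, and the two toy flows are all assumptions made up for this sketch.

```python
import json
import os
import tempfile

def run_with_checkpoints(steps, checkpoint_dir):
    """Run (name, fn) steps in order, skipping steps whose output is already staged.

    Each fn receives a dict of earlier results keyed by step name and must
    return something JSON-serialisable. (Hypothetical helper, not a library API.)
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    results = {}
    for name, fn in steps:
        path = os.path.join(checkpoint_dir, name + ".json")
        if os.path.exists(path):
            # This step finished in an earlier run: reuse the staged output.
            with open(path) as fh:
                results[name] = json.load(fh)
            continue
        results[name] = fn(results)
        # Stage the output only after the step has succeeded.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(results[name], fh)
        os.replace(tmp_path, path)
    return results

# Toy flows: Flow A is the "expensive" one, Flow B fails on its first attempt.
calls = {"a": 0, "b": 0}

def flow_a(_results):
    calls["a"] += 1
    return [1, 2, 3]

def flow_b(results):
    calls["b"] += 1
    if calls["b"] == 1:
        raise RuntimeError("simulated failure in Flow B")
    return sum(results["flow_a"])
```

Running the two flows once raises the simulated error after Flow A's output has been staged; running them again with the same checkpoint directory reads Flow A's result from disk and only re-executes Flow B.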
A weather station data scraper in R
This R code scrapes publicly available weather station data given a weather station ID and a date range. I patched existing source code from UC Davis, so it’s not my original code. Thought I’d share it here anyway. It doesn’t fail safely, though, because a null return value would break the fetching loop.
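One way a fetching loop can tolerate a null return is to record the missing day and carry on. The sketch below shows that pattern in Python rather than R, and it is not the UC Davis code: `fetch_range` and the `fetch_day` callable passed into it are hypothetical names used only for illustration.

```python
from datetime import date, timedelta

def fetch_range(fetch_day, station_id, start, end):
    """Collect one result per day from start to end (inclusive).

    fetch_day is a caller-supplied function (hypothetical here) that takes
    (station_id, day) and returns that day's data, or None when nothing is
    available. Days that return None or raise are recorded, not fatal.
    """
    rows, missing = [], []
    day = start
    while day <= end:
        try:
            result = fetch_day(station_id, day)
        except Exception:
            # Treat any fetch error like an empty response for that day.
            result = None
        if result is None:
            missing.append(day)
        else:
            rows.append(result)
        day += timedelta(days=1)
    return rows, missing
```

Catching every exception is a blunt choice that keeps the sketch short; a real scraper would probably catch only network and parsing errors and retry the missing days later.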
My 5 minute lightning talk on Cascalog
Cascalog makes it a lot simpler to build distributed strategy backtesters on terabytes of market data, for example. It is a data processing library for building MapReduce jobs. I’ve been spiking out a data processing project with it at work for the past couple of weeks. So I thought I might as well give a lightning talk about it at our monthly developers meetup. Here are my presentation slides introducing Cascalog and outlining its features.
The possibilities…
Ask not what accuracy your algorithm achieves but what value it can add
Partnering with the marketing team at work earlier this month reminded me of an important lesson in my algorithmic trading. Our external TV agency presented a mid-term performance analysis and I was tasked with performing due diligence on their analysis methodology. It took me 2 minutes to realise that they were using a linear regression model. At first I thought it was too simplistic. Then I realised my own naivete because it was never about accuracy to begin with.
Measuring ROI on a marketing campaign is an elusive task. A quick metric is to perform a regression on the money you spent versus the revenue generated in the same time period. For example, let y(t) = x(t), where y represents revenue, x is marketing spending, and t is the day. Plot your data as y versus x for a number of days. Then draw a best-fit straight line to show the correlation graphically. If you’re feeling adventurous, throw in a few other factors that you think might have an important influence on your daily revenue, so that you have x2(t), x3(t), etc. This is what the agency did in their model.
One major flaw with using a regression model for this purpose is that it assumes each data point is independent of the others, so one day’s events do not influence another’s. This is simply not true in the real world. For example, it might take a few days after someone sees your ad until they click your buy button, or it might take more than a few showings of the ad until people take action. A better regression model for the first example is y(t) = x1(t – a), where ‘a’ is the delay. And for the second example, y(t) = ∑x1(t – A), where A is a vector of ‘a’.
The problem with this is that finding ‘a’ and ‘A’ is another regression problem in and of itself. Luckily, this is conceptually what an Artificial Neural Network does: it is a series of regression models that take into account the interrelation of the factors and the non-linear effects of the response.
And so, within the span of a few minutes of our conference call with them, I had convinced myself to try another algorithm when I had a chance. That weekend I wasted a few hours in R tinkering with the data. Then it finally dawned on me. What is the value of this?
To calculate an ROI figure for our marketing campaign, is it necessary to spike a machine learning project just so we can be 95% confident in that figure? The justification is further weakened by the fact that ROI is merely one of the many metrics available when evaluating a marketing campaign. So the gain from an accurate model over a throw of the dart might not be worth an extra two weeks of work, whereas throwing a dart takes only one command in R.
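For the single-delay model y(t) = x1(t – a), a much cruder alternative to a neural network is to fit an ordinary least-squares line for each candidate delay and keep the one that fits best. The Python sketch below does that; the helper names `fit_line` and `best_delay` and the spend and revenue numbers are made up for illustration, and comparing R² across delays that use slightly different sample sizes is a rough heuristic rather than a rigorous test.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r2). Assumes xs is not constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return intercept, slope, r2

def best_delay(spend, revenue, max_delay):
    """Fit revenue(t) against spend(t - a) for a = 0..max_delay.

    Returns (a, (intercept, slope, r2)) for the delay with the highest R^2.
    spend and revenue must be daily series of equal length.
    """
    best = None
    for a in range(max_delay + 1):
        xs = spend[: len(spend) - a]  # spend on day t - a
        ys = revenue[a:]              # revenue on day t
        fit = fit_line(xs, ys)
        if best is None or fit[2] > best[1][2]:
            best = (a, fit)
    return best

# Synthetic data, invented for this example: revenue follows spend from two days earlier.
spend = [10, 40, 20, 50, 30, 60, 15, 45, 25, 55]
revenue = [100, 100] + [100 + 3 * s for s in spend[:-2]]
delay, (intercept, slope, r2) = best_delay(spend, revenue, max_delay=4)
```

Extending this to the summed model with several delays means regressing on multiple lagged copies of the spend series at once, which this sketch does not attempt.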
