Data science is a collaborative effort

The other day our marketing guy dropped off a laundry list of reports that he wanted me to produce. I work with him sometimes, so I don't mind putting on a business intelligence hat occasionally to help him out. But I mustn't forget that driving long-term value for the business is what I am good at, so I had to say no to him. What was supposed to be a quick chat turned into an hour-long discussion in which I showcased what I'm working on and demonstrated that I can do so much more than count things for him. In short, if you want data, ask the business analyst; if you want to solve problems, let's talk.

It's been said that Big Data analysis is a feedback loop. However, I don't think having the data science team toil away at data is the message. Who's to say that analytical steps in a big data reference model can't involve feedback from people in addition to models? For the past week, I've been showcasing a new internal data product to various stakeholders in our business. This one data product is now leading to a new process for our accountant, a product for our online marketing manager, and an innovative company-wide metric. All of this happened because we collaborated rather than delegated tasks. Analysing data is not a one-way process. Make it a feedback loop of math and people.

Posted 29 April 2012 in data-science.

My talk on bootstrapping data science in a company

Posted 25 March 2012 in data-science.

A weather station data scraper in R

This R code scrapes publicly available weather station data, given a weather station ID and a date range. I patched existing source code from UC Davis, so it's not my original code. Thought I'd share it here anyway. It doesn't fail safely, though: a null return value breaks the fetching loop.
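
One way to make such a loop fail safely is to treat an empty result as a missing day and carry on. The sketch below is in Python rather than R and is not the patched UC Davis code; `fetch_day` is a hypothetical stand-in for whatever function actually downloads one day of data, and the station ID in the usage lines is made up.

```python
from datetime import date, timedelta


def daterange(start, end):
    """Yield each date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def fetch_range(station_id, start, end, fetch_day):
    """Collect one record per day, skipping days that return nothing.

    fetch_day is a hypothetical callable (station_id, day) -> record or None.
    Days that return None or raise an exception are logged in `missing`
    instead of stopping the loop.
    """
    records, missing = [], []
    for day in daterange(start, end):
        try:
            result = fetch_day(station_id, day)
        except Exception:
            result = None  # treat a failed request like an empty one
        if result is None:
            missing.append(day)
            continue
        records.append(result)
    return records, missing


# Usage with a fake fetcher that has no data for the 2nd of the month.
def fake_fetch(station_id, day):
    if day.day == 2:
        return None
    return {"station": station_id, "date": day.isoformat()}


records, missing = fetch_range("STATION1", date(2012, 2, 1), date(2012, 2, 3), fake_fetch)
print(len(records), "days fetched;", len(missing), "missing")
```

Returning the list of missing days alongside the records makes it easy to retry just those days later.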

Posted 12 February 2012 in data-science.

Ask not what accuracy your algorithm achieves but what value it can add

Partnering with the marketing team at work earlier this month reminded me of an important lesson from my algorithmic trading. Our external TV agency presented a mid-term performance analysis, and I was tasked with performing due diligence on their analysis methodology. It took me 2 minutes to realise that they were using a linear regression model. At first I thought it was too simplistic. Then I realised my own naivete, because it was never about accuracy to begin with.
Measuring ROI on a marketing campaign is an elusive task. A quick metric is to regress the revenue generated against the money you spent in the same time period. For example, in shorthand (coefficients omitted), let y(t) = x1(t), where y represents revenue, x1 is marketing spending, and t is the day. Plot y versus x1 for a number of days, then draw a best-fit straight line to show the correlation graphically. If you're feeling adventurous, throw in a few other factors that you think might have an important influence on your daily revenue, so that you have x2(t), x3(t), etc. This is what the agency did in their model.
One major flaw with using a regression model for this purpose is that it assumes each data point is independent of the others, so one day's events do not influence another's. This is simply not true in the real world. For example, it might take a few days after someone sees your ad until they click your buy button, or it might take more than a few showings of the ad until people take action. A better regression model for the first example is y(t) = x1(t - a), where 'a' is the delay. And for the second example, y(t) = ∑x1(t - A), where A is a vector of delays 'a'. The problem with this is that finding 'a' and 'A' is another regression problem in and of itself. Luckily, this is conceptually what an Artificial Neural Network does: a series of regression models that takes into account the interrelation of the factors and the non-linear effects on the response.
And so, within the span of a few minutes of our conference call with them, I had convinced myself to try another algorithm when I had a chance. That weekend I wasted a few hours in R tinkering with the data. Then it finally dawned on me. What is the value of this? To calculate an ROI figure for our marketing campaign, is it necessary to spike a machine learning project just so we can be 95% confident in that figure? The justification is further weakened by the fact that ROI is merely one of the many metrics available when evaluating a marketing campaign. So the gain from an accurate model over a throw of the dart might not be worth an extra two weeks of work, when throwing the dart takes only one command in R.
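
For the delayed-response model y(t) = x1(t - a), the crudest way to find 'a' is to fit one straight line per candidate delay and keep the best one. The sketch below does that in plain Python rather than R. The function names and the spend and revenue numbers are made up for illustration (the revenue series is built with a two-day delay on purpose), and comparing R² across delays is a rough heuristic, not a proper model selection procedure.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return intercept, slope, r_squared


def best_lag(spend, revenue, max_lag):
    """Fit revenue(t) against spend(t - a) for a = 0..max_lag.

    Returns (a, intercept, slope, r_squared) for the delay with the highest R^2.
    """
    best = None
    for a in range(max_lag + 1):
        x = spend[:len(spend) - a]  # spend on day t - a
        y = revenue[a:]             # revenue on day t
        intercept, slope, r2 = fit_line(x, y)
        if best is None or r2 > best[3]:
            best = (a, intercept, slope, r2)
    return best


# Made-up daily figures: revenue follows spend from two days earlier.
spend = [10, 40, 20, 50, 30, 60, 15, 45, 25, 55]
revenue = [150, 150] + [100 + 3 * s for s in spend[:-2]]

lag, intercept, slope, r2 = best_lag(spend, revenue, max_lag=4)
print("best delay:", lag, "slope:", slope, "intercept:", intercept)
```

Scanning a handful of delays like this costs a few lines, which is roughly the "throw of the dart" end of the effort scale; the multi-delay version with a vector A is where the real work would start.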

Posted 29 January 2012 in data-science.
