Trading and Systems Blog
A weather station data scraper in R
This R code scrapes publicly available weather station data given a weather station ID and a date range. I patched existing source code from UC Davis, so it’s not my original code, but I thought I’d share it here anyway. It doesn’t fail safely, though: a null return value would break the fetching loop.
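For what it's worth, the usual guard against that kind of failure is to check each result before using it. The sketch below is in Python rather than R and is not the patched UC Davis code; fetch_range, fake_fetch, the station ID and the dates are all made up for illustration.

```python
def fetch_range(fetch_day, station_id, days):
    """Fetch each day in turn, skipping days for which nothing came back."""
    rows = []
    missing = []
    for day in days:
        result = fetch_day(station_id, day)
        if result is None:
            # Record the gap and carry on instead of breaking the loop.
            missing.append(day)
            continue
        rows.extend(result)
    return rows, missing


def fake_fetch(station_id, day):
    """Hypothetical stand-in for the real scraper: one day returns nothing."""
    if day == "2012-01-02":
        return None
    return [(station_id, day, 10.0)]


days = ["2012-01-01", "2012-01-02", "2012-01-03"]
rows, missing = fetch_range(fake_fetch, "STATION1", days)
print(len(rows), missing)  # 2 ['2012-01-02']
```

Returning the list of missing days alongside the data makes it easy to retry just those days later.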
My 5 minute lightning talk on Cascalog
Cascalog makes it a lot simpler to build distributed strategy backtesters on terabytes of market data, for example. It is a data processing library for building MapReduce jobs. I’ve been spiking out a data processing project with it at work for the past couple of weeks. So I thought I might as well give a lightning talk about it at our monthly developers meetup. Here are my presentation slides introducing Cascalog and outlining its features.
The possibilities…
Ask not what accuracy your algorithm achieves but what value it can add
Partnering with the marketing team at work earlier this month reminded me of an important lesson in my algorithmic trading. Our external TV agency presented a mid-term performance analysis and I was tasked with performing due diligence on their analysis methodology. It took me 2 minutes to realise that they were using a linear regression model. At first I thought it was too simplistic. Then I realised my own naivete because it was never about accuracy to begin with.
Measuring ROI on a marketing campaign is an elusive task. A quick metric is to perform a regression on the money you spent versus the revenue generated in the same time period. For example, let y(t) = x(t) (leaving out the fitted slope and intercept for brevity), where y represents revenue, x is marketing spending, and t is the day. Plot y against x for a number of days, then draw a best-fit straight line to show the correlation graphically. If you’re feeling adventurous, throw in a few other factors that you think might have an important influence on your daily revenue, so that you have x2(t), x3(t), etc. This is what the agency did in their model.
One major flaw with using a regression model for this purpose is that it assumes each data point is independent of the others, so that one day’s events do not influence another’s. This is simply not true in the real world. For example, it might take a few days after someone sees your ad until they click your buy button, or it might take several showings of the ad before people take action. A better regression model for the first example is y(t) = x1(t – a), where ‘a’ is the delay. And for the second example, y(t) = ∑x1(t – A), where A is a vector of delays ‘a’ and the sum runs over each of them.
The problem with this is that finding ‘a’ and ‘A’ is another regression problem in and of itself. Luckily, this is conceptually what an Artificial Neural Network does: it behaves like a series of regression models that take into account the interrelation of the factors and the non-linear effects of the response.
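To make the single-delay model concrete, here is a minimal Python sketch (not the R I tinkered with, and not the agency's model) that brute-forces the delay: it fits revenue(t) against spend(t – a) for each candidate ‘a’ and keeps the one with the lowest mean squared error. The function names and the data are made up; the revenue series is constructed to be exactly 5 + 2 × spend two days earlier, so the search has a known answer.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the fitted line and the observed ys."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def best_delay(spend, revenue, max_delay):
    """Fit revenue(t) on spend(t - a) for a = 0..max_delay; keep the best fit."""
    best = None
    for a in range(max_delay + 1):
        xs = spend[: len(spend) - a]  # spend(t - a)
        ys = revenue[a:]              # revenue(t) for t >= a
        slope, intercept = fit_line(xs, ys)
        mse = mean_squared_error(xs, ys, slope, intercept)
        if best is None or mse < best[0]:
            best = (mse, a, slope, intercept)
    return best[1], best[2], best[3]


# Made-up daily spend; revenue responds to spend from two days earlier.
spend = [10, 40, 20, 50, 30, 60, 15, 45, 25, 55, 35, 5]
revenue = [5, 5] + [5 + 2 * s for s in spend[:-2]]

delay, slope, intercept = best_delay(spend, revenue, max_delay=4)
print(delay, round(slope, 6), round(intercept, 6))
```

Real data is noisy and longer delays leave fewer points to fit, so in practice the lowest error is only a hint about the true delay, not proof of it.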
And so, within the span of a few minutes of our conference call with them, I had convinced myself to try another algorithm when I had the chance. That weekend I wasted a few hours in R tinkering with the data. Then it finally dawned on me. What is the value of this?
To calculate an ROI figure on our marketing campaign, is it necessary to spike a machine learning project just so we can be 95% confident in that figure? The justification is further weakened by the fact that ROI is merely one of the many metrics available when evaluating a marketing campaign. So the gain from an accurate model over a throw of the dart might not be worth an extra two weeks of work, when throwing the dart takes only one command in R.
Business is on hold
Back in November, I started working as a Data Scientist at uSwitch, a utility price comparison site. I am very fortunate to be able to work with so many smart and passionate people there. In fact, I am learning so much that I haven’t had time to do much else, although my ridiculous 4-hour commute is also a factor. I didn’t even notice that EUR/USD dropped 1000 pips! As such, I am officially putting my own business and research on hold until further notice. I will continue to post relevant technical discussions on this blog. All of my existing clients were notified, and arrangements were made, well before I began my employment. Thank you for all your support!
Algorithmic ownage
I felt good when I simplified one of my algorithms and sped it up 10 times. I felt so good that I even wrote an entire blog post about it patting myself on the back. Then last week I got an email from Kevin of Keming Labs suggesting a few alternatives.
First of all, his solutions looked much cleaner than mine. Then over the weekend I was able to incorporate his 3 algorithms into my program. I ran a few benchmarks, and here are the averages of 2 runs using a dataset of 28,760 items.
My algorithm. Elapsed time: 68372.026532 msecs.
Kevin’s solution #1. Elapsed time: 156.940976 msecs.
Kevin’s solution #2. Elapsed time: 60.165483 msecs.
Kevin’s solution #3. Elapsed time: 296.162042 msecs.
Total ownage.
That’s what I like about sharing my work; once in a blue moon, a random person drops by and generously shows me how I can improve a solution 1,000 times! Now the ball is in my court to understand what he has done and improve myself.
Collaborating and learning, that’s why I open source.
Update: I’ve done some more digging and it seems that one of the reasons for the drastic improvement in performance is the use of transients in the built-in functions. Lesson of the day: leverage the language’s inherent optimizations by staying with core data structures and functions as much as possible.
Eureka moment on design patterns for functional programming
Understanding design patterns for object-oriented programming made my life easier as a Java programmer. So I have been looking for a comparable book for functional programming ever since my sojourn into this age-old paradigm. It looks as though I’m not the only one looking, either. But the thing is, I think I’ve just had a revelation of sorts.
There is one and only one guideline to designing functional architectures — Keep it simple.
Simple as in giving each function one purpose only. Simple as in working with the data directly and not conjuring up unnecessary intermediaries. Simple, as elaborated by Rich Hickey in his talk, Simple Made Easy.
Much of this is conveyed in Bloch’s Effective Java: item #5 (avoid creating unnecessary objects) and item #13 (minimize accessibility of classes and members), for example. As Majewski said in a stackoverflow reply, “the patterns movement started because object-oriented programming was often turning into ‘spaghetti objects’”. So keeping it simple is what design patterns ultimately strive for.
“Functional programming is a restricted style of programming, so it didn’t need to grow a set of restricted conventions to limit the chaos.” As such, there is no design pattern book for functional programming. I didn’t get that earlier this year. But something clicked recently.
During the past few months, I’ve been doing some consulting and open source projects solving algorithmic problems with Clojure. One problem I faced in a project this week was counting the occurrences of each distinct element within a list of elements.
Say we have a list, coll = ("orange", "bottle", "coke", "bottle"). The output would be something like [("orange", "bottle", "coke") (1 2 1)].
This is my first solution.
The specs are not exactly as I described, but the concept remains. What I did was use tail calls (they’re supposed to be fast, aren’t they?) to aggregate each counter into a vector of counts. Then I paired each fragment with its corresponding count to generate the final output collection. Sounds overly complicated, doesn’t it? This is the first warning sign of a bad functional design.
For a collection of 30,000 items, this function took 11 minutes to compute on my notebook. This looks like a good place to exploit the parallel nature of this problem. Specifically, the counting of each fragment is independent of other fragments. Thus, there’s no need for the program to wait for one fragment to finish to process the next. I simplified the program to remove this inherent assumption of procedural processing.
Here is the gist of the refactored code, where each function does only one job. Since the processing is modularised, I can parallelize the algorithm easily by using pmap instead of map on the last line.
I’ve split the first function into 3 functions (2 shown here). As Hickey said in his talk, simplifying can often produce more, not fewer, functions. Yet the program is not only easier to read but also runs in less than a minute. An order of magnitude faster!
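As a rough illustration of the same idea in Python (this is a sketch, not the Clojure from the project; the names distinct, count_of and occurrences are made up for the example), each function below does one job, and the map step is passed in so a parallel map can be swapped for the sequential one, much like trading map for pmap.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def distinct(coll):
    """Return the distinct elements of coll in first-seen order."""
    return list(dict.fromkeys(coll))


def count_of(coll, item):
    """Count how many times item occurs in coll."""
    return sum(1 for x in coll if x == item)


def occurrences(coll, mapper=map):
    """Pair the distinct elements with their counts.

    mapper is any map-like callable, so a parallel map can be swapped in.
    """
    items = distinct(coll)
    counts = list(mapper(lambda item: count_of(coll, item), items))
    return [tuple(items), tuple(counts)]


coll = ["orange", "bottle", "coke", "bottle"]

# Sequential version.
print(occurrences(coll))  # [('orange', 'bottle', 'coke'), (1, 2, 1)]

# Same function with a parallel map; each count is independent of the others.
with ThreadPoolExecutor() as pool:
    parallel_result = occurrences(coll, mapper=pool.map)
print(parallel_result == occurrences(coll))  # True

# The standard library also does the whole job in a single pass.
single_pass = Counter(coll)
print(single_pass["bottle"])  # 2
```

One caveat about the analogy: threads in CPython will not speed up CPU-bound counting the way pmap can on the JVM, so the parallel call here only shows the shape of the swap. The single-pass Counter is closer in spirit to staying with core functions, such as Clojure’s built-in frequencies.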
There is still a lot for me to learn. I want to find more challenging projects to push my own limits. But rather than solving arbitrary problems, I prefer to tackle real-world challenges. So if you know of anyone who could benefit from collaborating with a functional developer to build robust and scalable software, please pass along my contact details. I offer a 15% referral reward for successful paid jobs too.
