Building a distributed back-tester with Hadoop on Amazon AWS

Testing is arguably the single most important aspect of trading system development. You can't tell how well an idea works unless you test it. Testing can also help you identify weaknesses or strengths in your model. The downside is that it takes time to churn through gigabytes of data.

Backtesting is inherently a linear process. You feed your tick data into your algorithm and it produces some actions. You can't really use fork/join to let other threads steal work from the queue, because each later step depends on the results of earlier calculations. More often than not, however, you're interested in testing many variations of a strategy. This is where MapReduce comes into play.

MapReduce is a programming model popularised by Google. It is inspired by the map and reduce functions that are ubiquitous in functional programming, where they are as common as for-loops are in the Java world. The map step splits the work into smaller independent problems and runs them concurrently, e.g. each variant of the strategy is executed on its own node. The reduce step takes the results from all the nodes and aggregates them into a single output, e.g. the back-test results from each strategy.

Having used functional programming for some time now, I find map/reduce very natural. Where my knowledge falls short is in implementing a distributed infrastructure that can run these map and reduce steps at a scale far beyond my own multi-core computer. It just so happens that Amazon AWS offers a hosted Hadoop service, Hadoop being Apache's framework for MapReduce. Hardware, check. Framework, check. This will be the ~~first~~ second system that I'll be working on in my goal to build a complete trading R&D platform. Expect some technical discussions in the coming months as I work my way through. Now, where should I start...
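
As a rough illustration of the map/reduce idea described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the toy price series `PRICES`, the moving-average rule inside `backtest`, and the `best` reducer are hypothetical and are not my actual strategy or Hadoop's API. Each back-test is still a sequential loop over the data; only the independent variants are mapped.

```python
from functools import reduce
from typing import NamedTuple


class Result(NamedTuple):
    window: int   # the strategy parameter that defines this variant
    pnl: float    # profit and loss of the variant over the whole series


# Hypothetical toy price series standing in for real tick data.
PRICES = [100.0, 101.0, 102.0, 101.0, 99.0, 98.0, 100.0, 103.0, 104.0, 102.0]


def backtest(window: int, prices=PRICES) -> Result:
    """Map step: run one strategy variant sequentially over the series.

    Toy rule (an assumption for this sketch): hold one unit over the next
    tick whenever the current price is above the average of the previous
    `window` prices, otherwise stay flat.
    """
    pnl = 0.0
    for i in range(window, len(prices) - 1):
        average = sum(prices[i - window:i]) / window
        if prices[i] > average:
            pnl += prices[i + 1] - prices[i]
    return Result(window, pnl)


def best(a: Result, b: Result) -> Result:
    """Reduce step: keep the better of two variant results."""
    return a if a.pnl >= b.pnl else b


variants = [2, 3, 4]                      # one entry per strategy variant
results = list(map(backtest, variants))   # map: each variant is independent
winner = reduce(best, results)            # reduce: aggregate into one answer
print(winner)
```

The built-in `map` here runs the variants one after another in a single process. The point of a framework like Hadoop is that each call to the map function could instead run on a different node, since no variant depends on another.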


Towards a broker-agnostic trading system

The first generation of my trading system used a broker's proprietary scripting language on that same broker's platform. The second generation used a general-purpose programming language on another broker's platform. I am now working on the third generation. It is language-agnostic, platform-agnostic, and broker-agnostic. Why? Because designing trading systems is as much about creativity as it is about science. An artist would not limit themselves to a fixed toolset. I want to be free to use whatever tools are necessary to get the job done efficiently, and to be able to try new things as technologies evolve. Having a broker-agnostic system frees me from being locked into any particular broker. It also means that I could use one system to gather data from, and trade through, multiple brokers for the best execution prices. Yes, this is over-engineering for a small entity like myself. But heck, this is something that I enjoy building. So why not?

It's an open buffet in a small business

One of the few benefits of working for myself is that I don't need to worry about compatibility with legacy systems. I am free to use whatever open source tools get the job done well. The downside is that there are so many technologies out there that it's hard to choose the right ones for the job. To give you a sense of what I mean, here are some of the technologies that I have either tried or seriously considered in the past year.

  • Programming languages: Java, Python, R, C#, F#, Scala, Clojure, Haskell.
  • Data storage: HDF5, CSV, Binary, MySQL, PostgreSQL, MongoDB, MonetDB, Redis, HBase.
  • Cloud server: Amazon EC2, Microsoft Azure, Google App Engine, Rackspace Cloud, plain old VPSs.

Programming language choices are important because they sit at the bottom of the technology stack. Most of the work that I do is built on top of them. For a long while, I settled on a combination of Java, Python, and R. I prototyped small ideas in Python, implemented production code in Java, and performed analysis in R. A year ago I discussed why one should use the right tool for the right task. By the end of my previous project, however, I had found that the popular development triplet of Java, Python, and R is not ideal for a solo operation. Seeing that I have more time on my hands because I am using QTD to trade for me now, I am taking a break this summer to expand my knowledge and learn new technologies. Some of the technologies that I am experimenting with include:

  • an in-memory data store for instantaneous and persistent tick data
  • parallel programming for concurrent processing with no locks
  • mathematically intuitive algorithm implementations using higher-order functions

Don't mind me as I help myself in this open buffet of technologies.

The secret to trading system development is to fail faster

If I were to offer only one piece of advice on developing trading systems, it would be this: fail faster. The sad truth of trading is that there is no magical system that can guarantee profitability indefinitely. One reason top firms like Renaissance Technologies employ hundreds of PhDs and are still actively hiring is that one can never reach the end of the rainbow in trading. A system can always be made better, or be made obsolete. In fact, you need to stay ahead of the curve, as competitors are like vultures that will eat away at your edge if you don't keep moving. On top of that, the changing dynamics of the market make it a moving target. As such, you simply need to keep innovating. My way of innovating boils down to five cyclical steps: conceptualising, implementing, testing, measuring, and analysing, over and over again. In essence, my innovation methodology is a simulated annealing process that might or might not eventually lead to a breakthrough. I do this because:

  1. I am intelligent but I am not a genius. As such, I don't expect to make leaps and bounds through regular sparks of genius. Instead, I take small steps. Adding a 1% yield to your system is extremely difficult. Adding 0.01% is a lot easier. Adding 0.0001% is almost pedantic. So I aim for smaller improvements and make them hundreds of times over a development cycle.
  2. My view of the world is not absolute. In fact, I am wrong more often than I am right. So it follows that most of my ideas don't have all the pieces right. Thus, I inevitably need to polish my ideas through trial and error.
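
To put a number on the first point, here is a small Python sketch of how small steps add up. It rests on two assumptions that are mine, for illustration only: that each improvement compounds multiplicatively on the previous ones, and that a development cycle contains exactly 100 of them. The helper `compounded_gain` is hypothetical.

```python
def compounded_gain(step: float, n_steps: int) -> float:
    """Total relative gain after n_steps improvements of size `step`,
    assuming each improvement compounds on the previous ones."""
    return (1.0 + step) ** n_steps - 1.0


small_step = 0.0001   # 0.01% expressed as a fraction
total = compounded_gain(small_step, 100)

# Print the cumulative gain as a percentage.
print(f"{total:.4%}")
```

Under those assumptions, one hundred improvements of 0.01% come to roughly the same 1% that is so hard to find in a single step.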

Failure is inevitable when developing trading systems. It is part of the process. You come up with an idea, implement it, test it, find out why it isn't meeting your expectations, and then make it better. So the quicker you can fail and learn from it, the quicker you can discover something useful.
