Building a distributed back-tester with Hadoop on Amazon AWS

Testing is arguably the single most important aspect of trading system development. You can't tell how well an idea works unless you test it out. Testing can also help you identify weaknesses and strengths in your model. The downside to testing is that it takes time to churn through those gigabytes of data.

Backtesting is inherently a linear process: you feed your tick data into your algorithm and expect some actions. You can't really make use of fork/join to let other threads steal work from the process queue, because each step depends on the results of earlier calculations. More often than not, though, you're interested in testing many variations of a strategy. This is where MapReduce comes into play. MapReduce is a software framework from Google, inspired by the map and reduce functions ubiquitous in functional programming (they are as common there as for-loops are in the Java world). The map function partitions an input into smaller problems and runs them concurrently, e.g. each variant of the strategy is executed on its own node. The reduce function takes the results from all the nodes and aggregates them into an output, e.g. the back-test results for each strategy.

Having used functional programming for some time now, I find map/reduce very natural. Where my knowledge falls short is in implementing a distributed infrastructure for running these map and reduce operations at a scale beyond my own multi-core computer. It just so happens that Amazon AWS has a hosted Hadoop PaaS (Elastic MapReduce), Hadoop being Apache's framework for MapReduce. Hardware, check. Framework, check. This will be the ~~first~~ second system that I'll be working on in my goal to build a complete trading R&D platform. Expect some technical discussions in the coming months as I work my way through. Now, where should I start...
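Perhaps with the shape of the job itself. Here's a minimal sketch of how a strategy sweep might map onto Hadoop. It assumes the input is a text file where each line names one strategy variant (e.g. "sma,10,50"); `BacktestJob`, `BacktestMapper`, `ResultReducer` and `runBacktest()` are all hypothetical names of mine, and `runBacktest()` is just a placeholder for the real sequential back-test engine:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BacktestJob {

    // Map: each input line names one strategy variant. The mapper runs the
    // full linear back-test for that variant and emits (variant, result).
    public static class BacktestMapper
            extends Mapper<Object, Text, Text, DoubleWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String variant = value.toString().trim();
            if (variant.isEmpty()) return;
            // Hypothetical: replay the tick data through this variant.
            double pnl = runBacktest(variant);
            context.write(new Text(variant), new DoubleWritable(pnl));
        }

        private double runBacktest(String variant) {
            return 0.0; // placeholder for the real sequential back-test
        }
    }

    // Reduce: gather the per-variant results into the final report.
    public static class ResultReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text variant, Iterable<DoubleWritable> results,
                              Context context)
                throws IOException, InterruptedException {
            for (DoubleWritable r : results) {
                context.write(variant, r); // one result per variant here
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "strategy back-test sweep");
        job.setJarByClass(BacktestJob.class);
        job.setMapperClass(BacktestMapper.class);
        job.setReducerClass(ResultReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // file of variant specs
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The appeal of this split is that the back-test itself stays strictly linear inside map(); Hadoop just fans the variants out across the cluster and collects the per-variant results on the way back.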