Just because it's math does not mean it comes without caveats

I am more than halfway through my first speed run of the CFA Level 1 material. The current chapter discusses portfolio management; in particular, I just read about the Markowitz Efficient Frontier. What strikes me most about this chapter is how matter-of-factly the mathematical material is presented. Recall that in chapter 1, Ethics, the ideas of prudence and due diligence were front and center. This was exemplified later in the Financial Statements chapter: methods of drafting the balance sheet, income statement, and cash flow statement were presented in excruciating detail before any discussion of ways to analyze them, just so we could understand the mechanics well enough to spot bias, pitfalls, misrepresentation, and outright fraud. In fact, a good portion of that section was reserved for discussing the caveats of financial statement analysis. As boring as it was to sit through the accounting details, that was good prudence and good due diligence.

Fast forward to this discussion of Modern Portfolio Theory (MPT), and the text simply explains that the expected risk of a portfolio whose assets have pairwise correlations less than 1 will be less than its weighted-average risk. Where are the billboard-sized warnings and neon-red danger signs? Perhaps criticisms of MPT will be discussed later, or perhaps I simply missed them at my ~200 pages/day scanning pace (in which case, just ignore me). But if I don't see a caveat lector on MPT in bold, underlined, and boxed type later, then I am surely going to be very disappointed.

This is a big deal because of the historic events of 2008, in which we witnessed the havoc as countless funds that blindly relied on MPT's validity (i.e., diversification of risk on the naive basis of constant correlations and normally distributed returns) and other risk models went under. There can be no worse mistake than underestimating your risk in trading. Let me repeat that: there can be no worse mistake than underestimating your risk in trading. Careless use of MPT (not MPT per se) leads to exactly that. The problem is that people take these financial theories to heart without reading the fine print. Just because a theory is presented in mathematical formulas does not make it law. The least the CFA text can do is broaden financial prudence from dissecting financial statements to the other important intellectual matters of the job. I am appalled that thousands of CFA Level 1 hopefuls might actually believe that you can always reduce portfolio risk and identify an efficient asset mix from mere correlations and standard deviations (which are themselves flawed measures in the real world), with no strings attached. "Science is the belief in the ignorance of experts," as Richard Feynman put it.
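To see exactly what the text is claiming, here is the two-asset arithmetic in a quick Python sketch (the weights, volatilities, and correlations are made-up numbers for illustration):

# Two-asset portfolio risk under MPT, using hypothetical numbers.
# sigma_p^2 = w1^2*s1^2 + w2^2*s2^2 + 2*w1*w2*rho*s1*s2
w1, w2 = 0.6, 0.4    # portfolio weights
s1, s2 = 0.20, 0.30  # annualized standard deviation of each asset
for rho in (1.0, 0.5, 0.0, -0.5):
    var_p = w1**2 * s1**2 + w2**2 * s2**2 + 2 * w1 * w2 * rho * s1 * s2
    print("rho = %+.1f  portfolio sigma = %.4f  weighted-average sigma = %.4f"
          % (rho, var_p ** 0.5, w1 * s1 + w2 * s2))

Portfolio risk equals the weighted average only at rho = 1, which is the textbook claim. The catch, and the 2008 lesson, is that the correlation inputs are not constants: they tend to lurch toward 1 in a crisis, precisely when you are counting on diversification to save you.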

Social media: The fourth dimension in market data

It looks as though the first hedge fund utilizing social media data is about to launch in April. This "Twitter hedge fund" is based on the work of Bollen et al., "Twitter mood predicts the stock market," Journal of Computational Science, 2011. In more general terms, this is an example of using Natural Language Processing (NLP) to assess the semantics of tweets relating to the stock market. NLP, or more specifically semantic search, has been the holy grail of search engine companies ever since people realized that you can make a boatload of money by providing good search results. Semantic search improves search accuracy by understanding the searcher's intent and the contextual meaning of terms rather than going on vocabulary alone (see Nature: Quiz-playing computer system could revolutionize research). What Bollen et al. propose is a subset of the grander NLP problem, in that they are only concerned with "mood states" rather than the whole meaning of the tweets. In other words, sentiment analysis.

NLP is not new. R, for example, has a Natural Language Processing task view. Python has the Natural Language Toolkit. Wikipedia has a good list of NLP toolkits for other languages.

Traders understand that the market is driven by people, and people are ultimately driven by emotions. So far, there hasn't been a way to directly evaluate the emotions of market participants. Would social media provide a glimpse of the emotions of the masses? According to Bollen et al., yes. At any rate, data is king in quantitative trading. I have no idea whether their particular implementation of mining social media data will work or not. What is certain is that as more and more people publish their lives on the internet, all this non-quantitative information is just waiting to be exploited. Ethics and privacy issues aside, look at what Facebook is doing with their ads as an example. In addition to time, volume, and price, data mining social media may provide us with a fourth dimension to market data.
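To make "sentiment analysis" concrete: Bollen et al. score tweets against mood word lists (OpinionFinder, plus an expanded version of the Profile of Mood States). Here is a toy version of that idea in plain Python; the word lists are invented for illustration, and a real system would use a curated lexicon:

# Toy bag-of-words mood scorer, in the spirit of the word-list approach
# used by Bollen et al. These two word lists are made up for illustration.
POSITIVE = {"bullish", "rally", "gain", "calm", "happy", "confident"}
NEGATIVE = {"bearish", "crash", "loss", "fear", "panic", "worried"}

def mood_score(tweet):
    """Count positive minus negative words in one tweet."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "feeling bullish today as markets rally on good news",
    "total panic out there and fear of another crash",
]
# Aggregate into a daily mood index, to be correlated with subsequent returns.
daily_mood = sum(mood_score(t) for t in tweets) / float(len(tweets))
print(daily_mood)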

Network latency on Amazon EC2 t1.micro to Dukascopy

While the processing power of an EC2 t1.micro server sucks (benchmarks have shown it to be slower than a Nokia N900 phone), network performance is well known to be exceptional across the whole spectrum of EC2 offerings. Here's a ping test from my home computer on an ADSL connection in Ottawa, Canada to 194.8.15.1, one of the Dukascopy web servers.

PING 194.8.15.1 (194.8.15.1) 56(84) bytes of data.
64 bytes from 194.8.15.1: icmp_req=1 ttl=242 time=149 ms
64 bytes from 194.8.15.1: icmp_req=2 ttl=242 time=134 ms
64 bytes from 194.8.15.1: icmp_req=3 ttl=242 time=135 ms
64 bytes from 194.8.15.1: icmp_req=4 ttl=242 time=148 ms
64 bytes from 194.8.15.1: icmp_req=5 ttl=242 time=135 ms
--- 194.8.15.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4003ms
rtt min/avg/max/mdev = 134.364/140.579/149.726/7.053 ms

Then the same test from an EC2 t1.micro instance located in Dublin, Ireland (the closest Amazon EC2 location to Geneva, where Dukascopy is based).

PING 194.8.15.1 (194.8.15.1) 56(84) bytes of data.
64 bytes from 194.8.15.1: icmp_req=1 ttl=243 time=40.9 ms
64 bytes from 194.8.15.1: icmp_req=2 ttl=243 time=41.2 ms
64 bytes from 194.8.15.1: icmp_req=3 ttl=243 time=49.0 ms
64 bytes from 194.8.15.1: icmp_req=4 ttl=243 time=41.2 ms
64 bytes from 194.8.15.1: icmp_req=5 ttl=243 time=41.2 ms
--- 194.8.15.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 40.961/42.728/49.013/3.154 ms

Exceptional, indeed. With a t1.micro going for around $12/month, there are few reasons not to get a remote server to run your trading system rather than praying for your home internet connection to remain reliable 24/7.
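If you would rather track latency from inside your trading process than eyeball ping output, here is a minimal Python sketch. It times the TCP handshake, a reasonable proxy for ping's RTT that doesn't need the root privileges raw ICMP does; the port is an assumption, so substitute whatever service your broker actually exposes:

import socket
import time

HOST = "194.8.15.1"  # the Dukascopy web server from the tests above
PORT = 80            # assumed port; use your broker's actual service

# One TCP handshake costs roughly one round trip, so connect time ~ ping RTT.
samples = []
for _ in range(5):
    start = time.perf_counter()
    conn = socket.create_connection((HOST, PORT), timeout=5)
    samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    conn.close()
    time.sleep(1)

print("rtt min/avg/max = %.1f/%.1f/%.1f ms"
      % (min(samples), sum(samples) / len(samples), max(samples)))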

Update with a ping test from Berlin, courtesy of commenter Holger:

Pinging 194.8.15.1 with 32 bytes of data:
Reply from 194.8.15.1: bytes=32 time=43ms TTL=244
Reply from 194.8.15.1: bytes=32 time=43ms TTL=244
Reply from 194.8.15.1: bytes=32 time=43ms TTL=244
Reply from 194.8.15.1: bytes=32 time=43ms TTL=244
Ping statistics for 194.8.15.1:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milliseconds: Minimum = 43ms, Maximum = 43ms, Average = 43ms

Update April 12, 2011: I've been using Rackspace Cloud (the #2 cloud server provider, after Amazon). Here's the ping result from their London, UK server.

PING 194.8.15.1 (194.8.15.1) 56(84) bytes of data.
64 bytes from 194.8.15.1: icmp_seq=1 ttl=246 time=69.2 ms
64 bytes from 194.8.15.1: icmp_seq=2 ttl=246 time=69.6 ms
64 bytes from 194.8.15.1: icmp_seq=3 ttl=246 time=68.8 ms
64 bytes from 194.8.15.1: icmp_seq=4 ttl=246 time=66.6 ms
64 bytes from 194.8.15.1: icmp_seq=5 ttl=246 time=68.5 ms
64 bytes from 194.8.15.1: icmp_seq=6 ttl=246 time=64.6 ms
64 bytes from 194.8.15.1: icmp_seq=7 ttl=246 time=68.5 ms
64 bytes from 194.8.15.1: icmp_seq=8 ttl=246 time=69.1 ms
64 bytes from 194.8.15.1: icmp_seq=9 ttl=246 time=66.6 ms
--- 194.8.15.1 ping statistics ---
9 packets transmitted, 9 received, 0% packet loss, time 8012ms
rtt min/avg/max/mdev = 64.677/67.986/69.617/1.573 ms

Update April 26, 2011: I'm testing Quickweb Germany's VPS.

PING 194.8.15.1 (194.8.15.1) 56(84) bytes of data.
64 bytes from 194.8.15.1: icmp_seq=1 ttl=247 time=15.4 ms
64 bytes from 194.8.15.1: icmp_seq=2 ttl=247 time=15.3 ms
64 bytes from 194.8.15.1: icmp_seq=3 ttl=247 time=15.2 ms
64 bytes from 194.8.15.1: icmp_seq=4 ttl=247 time=15.3 ms
64 bytes from 194.8.15.1: icmp_seq=5 ttl=247 time=15.1 ms
64 bytes from 194.8.15.1: icmp_seq=6 ttl=247 time=15.2 ms
64 bytes from 194.8.15.1: icmp_seq=7 ttl=247 time=15.4 ms
--- 194.8.15.1 ping statistics ---
7 packets transmitted, 7 received, 0% packet loss, time 6000ms
rtt min/avg/max/mdev = 15.152/15.306/15.463/0.116 ms

Update September 2013: DigitalOcean offers a $5/month 512MB server plan with a 20GB SSD. I've been using them exclusively for other work. Here's a ping test from one of their Amsterdam servers.

PING 194.8.15.1 (194.8.15.1) 56(84) bytes of data.
64 bytes from 194.8.15.1: icmp_req=2 ttl=245 time=21.5 ms
64 bytes from 194.8.15.1: icmp_req=3 ttl=245 time=22.4 ms
64 bytes from 194.8.15.1: icmp_req=4 ttl=245 time=22.6 ms
64 bytes from 194.8.15.1: icmp_req=5 ttl=245 time=21.5 ms
64 bytes from 194.8.15.1: icmp_req=6 ttl=245 time=22.4 ms

--- 194.8.15.1 ping statistics ---
6 packets transmitted, 5 received, 16% packet loss, time 5013ms
rtt min/avg/max/mdev = 21.540/22.142/22.686/0.484 ms

6 ways to filter data for your trading system

The imminent extortion of metered internet usage here in Canada got me thinking about the large amount of data (at least a few GB a month) that I use for my trading work. In particular, I'd like to talk about methods of filtering time-series price data in this post. There is a tradeoff between reading every tick that comes in and sampling the data at less frequent intervals. A common way discretionary traders present data is with open-high-low-close-volume (OHLCV) bars of a certain time period; a 5-minute chart looks similar to, but shows different information than, a 4-hour chart, for example. Technically, this is one way of filtering, or presenting, data: sampling at a fixed time interval. The thing with summarizing tick data into 5-minute or 4-hour bars is that you lose detail in exchange for simplicity. However, the reduction in the amount of data is significant: a year's worth of tick data for a single currency pair, for example, runs into the gigabytes, whereas a year's worth of the same data in 5-minute OHLCV bars is merely a few megabytes.

Another factor to consider is that market data are inherently noisy. Would feeding your system each and every tick be useful? Of course, if your system is able to handle the noise, or better yet make use of it, then that's all the more reason to opt for more detail in the data. But that's another topic in itself. In any case, there are 6 ways to present/filter data that I can think of (the first of which is sketched in code after the list).

  1. Time interval.
  2. Price interval.
  3. Volume interval.
  4. Regression modelling.
  5. Frequency filtering.
  6. Wavelet transform.
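To make the first method concrete, here is a minimal sketch that buckets raw ticks into 5-minute OHLCV bars (the tick data is made up for illustration):

# Bucket (epoch_seconds, price, size) ticks into 5-minute OHLCV bars.
# The ticks below are invented for illustration.
ticks = [
    (1300000000, 1.4101, 2), (1300000090, 1.4105, 1),
    (1300000250, 1.4098, 3), (1300000400, 1.4110, 1),
]
BAR = 5 * 60  # bar length in seconds

bars = {}
for ts, price, size in sorted(ticks):
    bucket = ts - ts % BAR  # start of the 5-minute window this tick falls in
    if bucket not in bars:
        bars[bucket] = [price, price, price, price, size]  # O, H, L, C, V
    else:
        b = bars[bucket]
        b[1] = max(b[1], price)  # high
        b[2] = min(b[2], price)  # low
        b[3] = price             # close is the latest tick seen
        b[4] += size             # accumulate volume

for start, (o, h, l, c, v) in sorted(bars.items()):
    print(start, o, h, l, c, v)

The same skeleton covers methods 2 and 3: key the buckets on cumulative price movement or cumulative volume instead of the clock.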

As you can see, massaging data is not limited to choosing the time period for your OHLCV bars. In fact, it's a whole field in its own right, with its own IEEE journal, Transactions on Knowledge and Data Engineering. There are probably many more ways to filter data than the 6 I've introduced here. This is yet another topic I'd like to learn more about to improve my trading system.
