Minimal variance asset allocation for Stocks ISA

With interest rates in the UK so pathetically low, I thought I might take a chance by making use of a Stocks ISA account. The problem, though, is that I have no knowledge of the London stock market, nor do I have the time to follow it. So I wrote a program to pick some Exchange Traded Funds (ETFs) with the primary goal of minimising risk and opportunity cost.

Here is what I wanted to achieve:

  • only a few trades a year at most
  • less risk than the FTSE 100
  • more yield than a laddered government bond portfolio
  • less than an hour per month of maintenance

Basically, this is a computer-assisted passive investment portfolio.

The first step is to scrape all the ETF symbols from the London Stock Exchange on these pages. I use getNodeSet from the XML package in R to select the relevant data from the HTML pages with XPath.

library(RCurl)
library(XML)
# Fetch the listing page and extract the symbol table with XPath
page <- getURL(url, curl=curl)
tree <- htmlTreeParse(page, useInternalNodes=TRUE)
xpath <- "//table[@class = 'table_dati']/tbody"
node <- getNodeSet(tree, xpath)

This program only considers ETFs because the portfolio is meant to diversify risk, not pick winning stocks. ETFs provide convenient exposure to various asset classes such as equities, bonds, and commodities at low cost.

Next is to scrape profile information for each symbol from Yahoo. We want data such as the fund's expense ratio and asset class category.

# Yahoo Finance profile page for this symbol
url <- paste("http://finance.yahoo.com/q/pr?s=", symbol, "+Profile", sep="")
tree <- htmlTreeParse(url, useInternalNodes=TRUE)
xpath <- "//table[contains(concat(' ', @class, ' '), ' yfnc_datamodoutline1 ')]/tr/td/table"
node <- getNodeSet(tree, xpath)

# The first and second tables hold the fund's overview and operations data
operation <- tryCatch(readHTMLTable(node[[2]]), error = function(e) NA)
overview  <- tryCatch(readHTMLTable(node[[1]]), error = function(e) NA)

Once the funds' fundamental data are fetched, we can do a preliminary screening. I filter for:

  1. Actively traded funds,
  2. Sufficient age (at least 3 years old), and
  3. Only the 3 funds with the best expense ratios in each category.

That last point is particularly important as illustrated in this plot.

London funds expense by category

The plot above doesn't fit in this frame, but it shows that expense ratios are all over the place. What matters is that the raw data is available for use.

The plot below is clearer: it shows expense ratios by fund issuer. You can see that, as is commonly known, Vanguard funds generally have the best expense ratios.

London funds expense by issuer

The initial ETF list has 667 funds in 104 categories. The screening narrows it down to 20 funds in 18 categories. Most of the funds screened out are niche offerings such as Islamic Global Equity and regional real estate funds.
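For illustration, the screen could look something like the sketch below, where the data frame funds and its columns (avg.volume, inception.date, expense.ratio, category) are hypothetical stand-ins for whatever the scraper actually produces.

library(plyr)

# Keep funds that trade and are at least three years old
actively.traded <- funds$avg.volume > 0
old.enough      <- as.numeric(Sys.Date() - as.Date(funds$inception.date)) > 3 * 365
screened        <- funds[actively.traded & old.enough, ]

# Keep only the 3 cheapest funds (by expense ratio) in each category
screened <- ddply(screened, "category", function(df)
  head(df[order(df$expense.ratio), ], 3))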

Out of those 20 funds, I apply the popular Modern Portfolio Theory to minimise risk, using historical quotes fetched with quantmod's Yahoo data fetcher. Given the expected returns of each asset, er, and their covariance matrix, cov.mat, a long-only efficient portfolio weighting of those assets can be solved with quadratic programming like so.

library(quadprog)
# Minimise portfolio variance subject to: weights summing to 1, hitting the
# target return exactly, and no short selling
Dmat <- 2*cov.mat
dvec <- rep.int(0, N)
Amat <- cbind(rep(1,N), er, diag(1,N))
bvec <- c(1, target.return, rep(0,N))
result <- solve.QP(Dmat=Dmat, dvec=dvec, Amat=Amat, bvec=bvec, meq=2)

Solving it gives these weightings:

IGLT.L   MIDD.L   INXG.L   EQQQ.L   SLXX.L   IBGS.L   IUKP.L   LUK2.L 
0.603023 0.122829 0.116879 0.084122 0.037906 0.017975 0.014051 0.003215

But here's the catch: this mean-variance optimisation approach does not work well in the real world. The problem is that it optimises over historical data under simplistic assumptions. For potential improvements on this model, start with this Q&A on StackExchange, but be warned that it's a rabbit hole.

Knowing that I shouldn't trust this model much, I run it a few times under different scenarios on the efficient frontier and take the union of the top-weighted assets from each run as a crude compensation by sampling.
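As a sketch of that step, the quadratic program above can be wrapped in a small helper and re-run at a few target returns along the frontier; the helper name, the five scenarios, and the 5% weight cut-off below are arbitrary choices of mine, not anything from the model.

# Re-run the optimiser at several target returns along the efficient frontier
# (assumes er is a named vector of expected returns)
efficient.weights <- function(er, cov.mat, target.return) {
  N <- length(er)
  sol <- solve.QP(Dmat = 2 * cov.mat,
                  dvec = rep.int(0, N),
                  Amat = cbind(rep(1, N), er, diag(1, N)),
                  bvec = c(1, target.return, rep(0, N)),
                  meq  = 2)
  setNames(sol$solution, names(er))
}

targets <- seq(min(er), max(er), length.out = 5)
top.assets <- unique(unlist(lapply(targets, function(target) {
  w <- efficient.weights(er, cov.mat, target)
  names(w)[w > 0.05]   # keep anything with a meaningful weight
})))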

The result is a suggestion of six ETFs.

Symbol   Name              Category
BRIC.L   ISHARESII 50      BRIC Equity
EQQQ.L   POWERSHS EQQQ     US Large-Cap Growth Equity
IGLS.L   ISHARESIII 0-5£   GBP Government Bond
INXG.L   GBP IDX-LNK GLT   GBP Inflation-Linked Bond
MIDD.L   ISHARESFTSE250    UK Mid-Cap Equity
SLXX.L   ISHSIII IBX £CB   GBP Corporate Bond

Out of these I hand-picked IGLS.L and MIDD.L for a conservative 80% bond, 20% equity portfolio. The plot below shows the annualised return versus risk of ISF (a FTSE 100 tracker), an equal-weighted portfolio of the pre-screened 20 ETFs, and this final portfolio of two ETFs. Notice that the historic risk of this final portfolio is a third of the FTSE 100's.

Return vs risk
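For reference, these annualised figures could be computed along these lines with quantmod and PerformanceAnalytics; treat it as a sketch, with ISF.L standing in for the FTSE 100 tracker and an 80/20 IGLS.L/MIDD.L split as above.

library(quantmod)
library(PerformanceAnalytics)

# Daily adjusted closes for the FTSE 100 tracker and the two chosen ETFs
symbols <- c("ISF.L", "IGLS.L", "MIDD.L")
prices  <- do.call(merge, lapply(symbols, function(s)
  Ad(getSymbols(s, auto.assign = FALSE))))
returns <- na.omit(Return.calculate(prices))
colnames(returns) <- symbols

# 80% bond / 20% equity portfolio versus the FTSE 100 tracker alone
portfolio <- Return.portfolio(returns[, c("IGLS.L", "MIDD.L")], weights = c(0.8, 0.2))
perf <- merge(returns[, "ISF.L"], portfolio)
Return.annualized(perf)   # annualised return
StdDev.annualized(perf)   # annualised risk (standard deviation)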

Not surprisingly, what my program derived from scratch is similar to the commonly suggested blend of bonds, local equities, and emerging markets. What this program offers is picking the specific ETFs for a balanced asset allocation out of the hundreds traded on the London Stock Exchange.

The complete R source code for this project is available on Github.

Posted 31 January 2013 in stocks.

Unconfusing false-positive and false-negative statistical errors confusion

I was reading a blog post about real-time analytics over lunch today. In it, the author claims that "funny business with timeframes can coerce most A/B tests into statistical significance." There's also a plot showing two time series of the cumulative number of heads in a comparison of two fair coins. Yet neither time nor ordering has any effect on the test result, because each flip is independent. Unconvinced by his claim, I wrote a coin flipping simulation in R to prove him wrong.
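The core of that simulation is roughly the sketch below (the complete code is in the gist linked at the end); p.value.at is just a helper name I use here.

# Flip two coins N times each and test whether their observed
# proportions of heads differ
p.value.at <- function(N, p1 = 0.5, p2 = 0.5) {
  heads.A <- rbinom(1, N, p1)
  heads.B <- rbinom(1, N, p2)
  prop.test(c(heads.A, heads.B), n = c(N, N), alternative = "two.sided")$p.value
}

# One test per sample size, with an increasing number of flips per test
set.seed(42)
sizes    <- seq(100, 1000, by = 10)
p.values <- sapply(sizes, p.value.at)
plot(sizes, p.values, xlab = "flips per test", ylab = "p-value")
abline(h = 0.05, col = "red")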

This plot shows the p-values of proportion tests asking whether two simulated fair coins are different, with each test using an increasing number of flips. Since both coins are fair, we would hope that no p-value dips below the 0.05 threshold of our 95% significance level (red horizontal line). Yet we see some false positives (i.e. a claim of an effect when there really isn't one) saying the two coins are statistically different.

false positive vs sample size, up to N=1000

A better illustration is to run a test with 1000 flips, record the result, and repeat many times. We see that false positives do sometimes happen. Given our 95% significance level, we should expect a false positive about 1 in 20 times.

repeated sampling at 1000 flips
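In terms of the p.value.at helper from the sketch above, that is roughly:

# Many tests of 1000 flips each; both coins are fair, so any "significant"
# result is a false positive
p.values <- replicate(1000, p.value.at(1000))
mean(p.values < 0.05)   # roughly 0.05, i.e. about 1 in 20 tests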

Remembering that I should do a power calculation to get an appropriate sample size, power.prop.test(p1=0.5, p2=0.501, power=0.90, alternative="two.sided") says N should be 5253704.

So this is a plot of doing many tests with 5253704 flips each.

N=5253704

But the false positives didn't improve at all! By now I was quite confused, so I asked for help on StackExchange and received this insight:

What's being gained by running more trials is an increase
in the number of true positives or, equivalently, a decrease
in the number of false negatives. That the number of false
positives does not change is precisely the guarantee of the test.

And so a 95% significance level remains 95% significant (a 1 in 20 chance of a false positive) regardless of how much the sample size increases, as shown again below.

false positive up to 10k trials

What is gained by increasing the sample size is a reduction in false negatives, a false negative being the failure to detect an effect that really is there. To illustrate that, we need a different plot because it is an entirely different situation: this time we have two new coins, and they really are different.

Say we have one fair coin (p=50%) and another that's slightly biased (p=51%). This plot shows the result of running the same proportion test to see whether the two are statistically different. The false negatives, the points above the red line (p-value of 0.05, 95% significance level), clearly thin out as the sample size increases.

false negative increasing samples
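Using the same p.value.at helper, that plot comes from something like:

# One fair coin against a slightly biased (p = 0.51) coin; points above
# the 0.05 line are false negatives, and they thin out as N grows
sizes    <- seq(1000, 100000, by = 1000)
p.values <- sapply(sizes, function(N) p.value.at(N, p1 = 0.5, p2 = 0.51))
plot(sizes, p.values, xlab = "flips per test", ylab = "p-value")
abline(h = 0.05, col = "red")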

"Funny business" do not coerce A/B tests into statistical significance. The fact that a 95% significance gives 1 in 20 false positives is in fact what it guarantees. To decrease false positive, simply test at a higher significance level. For example, prop.test(c(heads.A, heads.B), n=c(N, N), alternative="two.sided", conf.level=0.99) to set it to 99% instead of the default 95%.

The R source code for this mental sojourn is available in this gist on Github.

A hypothetical data analysis platform

My definition of a statistical platform is that it is the glue that ties orthogonal data analysis functions together. Take R, for instance: it is a platform-as-application. You fire up R and everything is accessible to you. However, all the packages only work on top of R.

Python, on the other hand, takes a platform-as-libraries approach. A basic data analysis setup is to pip install Numpy, Scipy, and Matplotlib. High-level libraries, such as scikit-learn and pandas, are built on top of these. It is somewhat more flexible for picking and choosing, but the dependencies still form a tree-like structure between packages.

Then there's Incanter.

Don't like using Parallel Colt for your matrices? Here, try this BLAS drop-in replacement and everything just works, 10 times faster.

Much of this flexibility is due to earlier design choices by Liebke et al. to leverage Clojure's idiom that "it is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures."

The thing is, I think we're only scratching the surface. Excuse me while I dream for a minute.

Say that instead of jBLAS you want to use a CPU/GPU hybrid. Suppose you could just do a (use 'incanter-magma) and your Incanter code would run with MAGMA (via Mathew Rocklin) under the hood without any other change.

Taking this idea of interfacing libraries into a hypothetical use case: imagine you have cleaned and structured your data on Hadoop using Cascalog and are looking to analyse this dataset. You start your Incanter session and pull in your data with (use 'incanter-cascalog). You write some Incanter script to interrogate the dataset, but find the data is still too big for your laptop. So you (use 'incanter-storm) to switch to distributed processing instead. Incanter would then flow data directly from Cascalog to Storm inside your cluster.

For your results, you find JFreeChart limiting, so you (use 'incanter-c2) to spiff up your visualisations with C2, all without changing a single line of your Incanter script.

Instead of the star-like dependency structure of R and its packages, or the tree-like structure of Python and its packages, Incanter could be an interface to stand-alone libraries, encapsulated by an application for the user.

Incanter, the library, could be a set of modules that transform data to and from external libraries into standard Incanter-compatible data structures. Incanter, the application, could be a domain specific language, a client, and an in-REPL package manager.

Another benefit is that this would help mitigate Incanter's developer shortage by making use of external, stand-alone libraries.

I call this platform-as-interface.

R language lacks consistency

That's the thing that gets under my skin about R. For example, why do I need to use do.call on a list but apply on a data frame? Not only that, the two functions have entirely different names and signatures (edit: this is a bad example, see Joshua's explanation in the comments). I find that a lot of R is just a matter of memorising what to do under which condition. You can't simply guess or deduce what to do, as a sane programming language would let you. I know that many solutions are just a Google search away, but I am not comfortable with the awkwardness of the language.

Having said that, I've been using R reluctantly for a few years. I try to avoid it as much as possible, but there's just no other statistical platform that makes it so easy to be productive. I tried Incanter for a while, but it doesn't seem well suited to exploratory analysis, as I usually end up writing functions from scratch. More recently I played with Julia briefly, although it is too bleeding edge: there isn't even a release version yet.

As much as I don't like R the language, the R platform and its package repertoire are unmatched at the moment. We did a bit of ggplot today at work. With the help of my coworker, it only took a few minutes to hook R up to our Hadoop cluster, pull some data, and produce the graphs we wanted. In comparison, Incanter's charts are pretty but not very customisable. D3.js is very customisable but not quick to use at all. Then there's Julia, which can't do more than a bar or line chart for now.

I haven't mentioned the other big contender: Python + Numpy + Scipy + pandas + PyPy + Matplotlib. I tried some of that too a while ago, but didn't get far with it. Come to think of it, I wrote a similar babble a year ago ...
