Unconfusing false-positive and false-negative statistical errors

I was reading a blog post about real-time analytics over lunch today. In it, the author claimed that "funny business with timeframes can coerce most A/B tests into statistical significance." The post also has a plot of two time series: the cumulative number of heads for each of two fair coins being compared. Yet neither time nor ordering affects the test result, because each flip is independent. Not convinced by the claim, I wrote a coin-flipping simulation in R to prove him wrong.

This plot shows p-values from two-sample proportion tests of whether two simulated fair coins are different. The test is repeated with an increasing number of flips per test. Since both coins are fair, I expected no p-value to dip below the 0.05 significance threshold (95% confidence level, red horizontal line). Yet we're seeing some false positives (i.e. a claim of evidence when there really isn't any) that say the two coins are statistically different.

Figure: false positives vs. sample size, up to N=1000
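Here is a minimal sketch of how a plot like this can be produced. It is not the code from my gist; the seed, the grid of sample sizes, and the variable names are assumptions picked for illustration.

```r
# Sketch: p-value of a two-sample proportion test on two fair coins,
# repeated with an increasing number of flips per coin.
set.seed(42)                       # arbitrary seed, for reproducibility only
flips <- seq(50, 1000, by = 10)    # flips per coin in each test (assumed grid)

p.values <- sapply(flips, function(n) {
  heads.A <- rbinom(1, n, 0.5)     # fair coin A
  heads.B <- rbinom(1, n, 0.5)     # fair coin B
  prop.test(c(heads.A, heads.B), n = c(n, n),
            alternative = "two.sided")$p.value
})

plot(flips, p.values, xlab = "flips per coin", ylab = "p-value")
abline(h = 0.05, col = "red")      # 0.05 threshold (95% confidence level)

# Both coins are fair, so every point under the line is a false positive.
false.positives <- sum(p.values < 0.05)
```

Each point is an independent test, so a few of them landing under the red line is expected rather than alarming.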

A better illustration is to run a test with 1000 flips, record the result, and repeat many times. We see that false positives do happen sometimes. Given a 0.05 significance level (95% confidence), we can expect a false positive about 1 time in 20.

Figure: repeated sampling at 1000 flips
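The repeated-sampling idea can be sketched roughly like this. The number of repetitions (2000) and the seed are my own assumptions, not values from the original experiment.

```r
# Sketch: repeat the same 1000-flip test many times on two fair coins
# and measure how often it (wrongly) reports a difference.
set.seed(1)      # arbitrary seed
N <- 1000        # flips per coin in each test
trials <- 2000   # number of repeated tests (assumed)

heads.A <- rbinom(trials, N, 0.5)
heads.B <- rbinom(trials, N, 0.5)

p.values <- mapply(function(a, b) {
  prop.test(c(a, b), n = c(N, N), alternative = "two.sided")$p.value
}, heads.A, heads.B)

# Fraction of tests below the 0.05 threshold: the observed false-positive rate.
false.positive.rate <- mean(p.values < 0.05)
print(false.positive.rate)
```

The rate should come out near 0.05. It will typically land a little under that, because prop.test applies a continuity correction by default, which makes the test slightly conservative.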

Remembering that I should do a power calculation to get the required sample size, I ran power.prop.test(p1=0.5, p2=0.501, power=0.90, alternative="two.sided"), which says N should be 5253704 flips per coin.
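As a sanity check on that number, the sketch below compares it with the textbook normal-approximation formula for two proportions. That formula is my assumption for a back-of-the-envelope check; it is not necessarily what power.prop.test computes internally, and sig.level = 0.05 is simply the function's default.

```r
# Sketch: sample size per coin to detect p1 = 0.5 vs p2 = 0.501
# with 90% power at the default 0.05 significance level.
res <- power.prop.test(p1 = 0.5, p2 = 0.501, power = 0.90,
                       alternative = "two.sided")
n.per.coin <- ceiling(res$n)

# Back-of-the-envelope version (assumed textbook approximation):
#   n = (z_{1 - alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p2 - p1)^2
z.alpha <- qnorm(1 - 0.05 / 2)
z.beta  <- qnorm(0.90)
n.approx <- (z.alpha + z.beta)^2 *
  (0.5 * 0.5 + 0.501 * 0.499) / (0.501 - 0.5)^2

print(c(power.prop.test = n.per.coin, approximation = round(n.approx)))
```

The two numbers agree closely, which makes sense: the difference being detected is only 0.001, and the required N grows with the inverse square of that difference.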

So this is a plot of many repeated tests with 5253704 flips per coin each.


But the false positives didn't improve at all! By now, I'm quite confused. So, I asked for help on StackExchange and received this insight.

What's being gained by running more trials is an increase
in the number of true positives or, equivalently, a decrease
in the number of false negatives. That the number of false
positives does not change is precisely the guarantee of the test.

And so a test at the 95% confidence level keeps its 5% false-positive rate (1 in 20 chance of a false positive) regardless of sample size, as shown here again.

Figure: false positives, up to 10k trials

What is in fact gained by increasing the sample size is fewer false negatives, where a false negative means failing to claim a difference when one really is there. To illustrate that, we need a different plot because it is an entirely different circumstance. We have two new coins, and they are different.

Say we have one fair coin (p=50%) and another that's slightly biased (p=51%). This plot shows the result of running the same proportion test to see if these two are statistically different. Points above the red line (p-value of 0.05, 95% confidence level) are negative results, and since the coins really are different, each of them is a false negative. They clearly thin out as the sample size increases. Thus this plot illustrates that false negatives decrease as the sample size increases.

Figure: false negatives with increasing sample size
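The same effect can be shown as a number instead of a scatter plot. This is only a sketch: the three sample sizes, the 500 repetitions per size, and the seed are assumptions of mine.

```r
# Sketch: false-negative rate of the proportion test for a fair coin (p = 0.50)
# vs. a slightly biased coin (p = 0.51), at a few sample sizes.
set.seed(7)                        # arbitrary seed
sizes  <- c(1000, 10000, 100000)   # flips per coin (assumed grid)
trials <- 500                      # repeated tests per sample size (assumed)

false.negative.rate <- sapply(sizes, function(n) {
  heads.fair   <- rbinom(trials, n, 0.50)
  heads.biased <- rbinom(trials, n, 0.51)
  p.values <- mapply(function(a, b) {
    prop.test(c(a, b), n = c(n, n), alternative = "two.sided")$p.value
  }, heads.fair, heads.biased)
  mean(p.values >= 0.05)           # test failed to detect the real difference
})
names(false.negative.rate) <- sizes
print(false.negative.rate)
```

The rate should fall steadily from the smallest sample size to the largest, while the 0.05 threshold itself never moves.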

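"Funny business" does not coerce A/B tests like these, each run at a fixed sample size, into statistical significance. One false positive in 20 is exactly what a test at the 95% confidence level guarantees. To decrease false positives, simply test at a stricter level, for example 99% instead of the default 95%. In R that means running prop.test(c(heads.A, heads.B), n=c(N, N), alternative="two.sided", conf.level=0.99) and comparing the p-value against 0.01 instead of 0.05. Note that conf.level only sets the level of the reported confidence interval; the p-value itself does not change, so the stricter cutoff has to be applied when reading it.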
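One detail is worth checking when tightening the threshold. As far as I can tell from R's documentation, conf.level in prop.test controls the reported confidence interval and leaves the p-value untouched, so the stricter cutoff has to be applied to the p-value directly. A small sketch, with made-up head counts purely for illustration:

```r
# Sketch: conf.level changes the confidence interval, not the p-value.
heads.A <- 520; heads.B <- 480; N <- 1000   # hypothetical counts

t95 <- prop.test(c(heads.A, heads.B), n = c(N, N),
                 alternative = "two.sided")                 # default 95%
t99 <- prop.test(c(heads.A, heads.B), n = c(N, N),
                 alternative = "two.sided", conf.level = 0.99)

same.p <- isTRUE(all.equal(t95$p.value, t99$p.value))      # identical p-values
wider  <- diff(t99$conf.int) > diff(t95$conf.int)          # 99% interval is wider

# The stricter test is a stricter cutoff on the p-value.
alpha <- 0.01
significant <- t99$p.value < alpha
print(c(same.p = same.p, wider = wider, significant = significant))
```

So the lower false-positive rate comes from comparing the p-value against 0.01, not from the conf.level argument alone.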

The R source code for this mental sojourn is available at this gist on GitHub.

The Power of Tangential Learning

According to Wikipedia,

> tangential learning is the process by which some portion of people
> will self-educate if a topic is exposed to them in something that they
> already enjoy.

I was just organizing my bookshelf the other day and was surprised by how fast my collection of statistics books has grown. I studied statistics back in school as a stepping stone to learning stochastic processes for wireless communication theory (e.g. CDMA). Detecting radio-frequency signals and resolving them into meaningful messages is fundamentally about assessing the state of random processes. I hated statistics back then because I didn't *get it* and did poorly in those courses. It is ironic that years later I would realize how much I have grown to rely on and love the little *p*'s and *q*'s in my little hobby of quantitative trading. I still don't like statistics (or theoretical math), per se. I just love *applying* them in my trading and quant programs.

The more that I learn about statistics, the more that I realize how powerful they can be and how ignorant I am. For example, in my clinical study days I used either a t-test or ANOVA for everything under the sun. Now that I've come to understand inferential statistics, I am aware of the assumptions and pitfalls, such as the t-test's assumption of homogeneity of variance between the two samples tested. If this assumption is violated, then the unequal variance t-test (Welch's t-test) should be used. Finer points like these are routinely ignored in practice because many clinical studies are designed so that the experiment meets these criteria.

However, that's not the case when I am making creative use of statistics in my trading. I don't have a clinical study committee looking over my shoulder. If I make a false or weaker-than-expected claim and don't know better, then it is my bank account that will suffer the consequences. Learning statistics was initially due to this got-to-know-better necessity.
However, the more that I learn about statistics, the more that I appreciate it. If used correctly, statistics can provide a new dimension to the scientific assessment of your trading performance and market data.