Post gone viral, 16000 visitors in a day, how many actually read the article?

Edit: A few people have pointed out that my assumption about the Analytics engagement metric might be wrong, because a single-page visit could be counted as zero engagement time. I'll update this post when people smarter than me on HN can agree on a metric. So I open this problem to Analytics experts: how can I discern the readership ratio from Analytics data?

My reminiscing post about my time as an aerospace engineer versus software was on the front page of Hacker News for about 12 hours on Friday. That garnered 16,374 unique visitors to this site on that single day. However, Google Analytics data say that only 975 of those people spent more than 10 seconds here. Given that there are 652 words in that viral post, I doubt anyone could actually read it in less time than that. If we assume that only people who spent more than 10 seconds here have meaningfully read the article, then only about 6% (975 of 16,374) of the traffic from this Hacker News blitz were real readers.

Viral post visitors engagement

Given that my usual figure is above 10%, the viral audience is understandably less targeted, but it isn't abysmal by comparison. However, as a data scientist, I'm obliged to say that because this is a one-off event, we can't draw a statistically significant conclusion from it.

Interestingly, overall traffic the day after, on Saturday, was back down to 901 visits, and the share of visitors spending more than 10 seconds was up at 8.3%. This residual traffic is coming in from domains like Twitter and link sharers.

Unconfusing false-positive and false-negative statistical errors confusion

I was reading a blog post about real-time analytics over lunch today. In it, the author made a claim that "funny business with timeframes can coerce most A/B tests into statistical significance." There's also a plot illustrating two time series of the cumulative number of heads in a two-fair-coin comparison. Yet neither time nor ordering has an effect on the test results, because each flip is independent. Not content with his claim, I wrote a coin-flipping simulation in R to prove him wrong.

This plot shows the p-values of proportion tests on two simulated fair coins, testing whether they are different. Each test is repeated with an increasing number of flips per test. Since both coins are fair, I expected no p-value to dip below the 0.05 threshold that corresponds to our 95% confidence level (red horizontal line). Yet we're seeing some false positives (i.e. a claim of evidence when there really isn't any) that say the two coins are statistically different.

false positive vs sample size, up to N=1000

A better illustration is to run a test with 1000 flips, get a test result, and repeat many times for many results. We see that false positives do happen sometimes. Given that we test at a 95% confidence level (a p-value threshold of 0.05), we can expect a false positive about 1 in 20 times when there is no real difference.

repeated sampling at 1000 flips
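
The repeated-sampling experiment can be sketched in a few lines. This version is in Python rather than the R used for the plots, and it swaps in a pooled two-proportion z-test for R's prop.test, so individual p-values may differ slightly from the plotted ones. The function names, the seed, and the number of repetitions are illustrative choices, not taken from the original gist.

```python
import math
import random


def two_proportion_p_value(heads_a, heads_b, n):
    """Two-sided p-value of a pooled two-proportion z-test (equal group sizes)."""
    pooled = (heads_a + heads_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0  # all heads or all tails in both groups: no evidence of a difference
    z = (heads_a - heads_b) / n / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


def flip(n, p, rng):
    """Number of heads in n flips of a coin with heads probability p."""
    return sum(rng.random() < p for _ in range(n))


def false_positive_rate(n_flips=1000, n_tests=2000, alpha=0.05, seed=42):
    """Share of tests that call two fair coins 'different'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        p_value = two_proportion_p_value(
            flip(n_flips, 0.5, rng), flip(n_flips, 0.5, rng), n_flips
        )
        hits += p_value < alpha
    return hits / n_tests


rate = false_positive_rate()
print(f"false positive rate: {rate:.3f}")
```

The printed rate should hover around 0.05, which is the 1 in 20 described above.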

Remembering that I should do a power calculation to get an optimal sample size, doing power.prop.test(p1=0.5, p2=0.501, power=0.90, alternative="two.sided") says N should be 5253704.
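
For readers without R at hand, the same sample size can be approximated with the textbook normal-approximation formula for comparing two proportions. This is a sketch in Python, not the R internals: power.prop.test solves for N numerically, so the last digits may differ. The function name is made up for illustration.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, power=0.90, alpha=0.05):
    """Per-group sample size for a two-sided two-proportion test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    p_bar = (p1 + p2) / 2
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))        # spread if the coins were identical
    sd_alt = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))   # spread under the assumed difference
    return ((z_alpha * sd_null + z_power * sd_alt) / abs(p1 - p2)) ** 2


n = sample_size_two_proportions(0.5, 0.501)
print(round(n))
```

The result should land close to the figure R reports. The size is driven almost entirely by the tiny assumed difference of 0.001 between the two coins.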

So this is a plot of doing many tests with 5253704 flips each.

N=5253704

But the false positives didn't improve at all! By now, I'm quite confused. So, I asked for help on StackExchange and received this insight.

What's being gained by running more trials is an increase
in the number of true positives or, equivalently, a decrease
in the number of false negatives. That the number of false
positives does not change is precisely the guarantee of the test.

And so, a test at the 95% level keeps its 1 in 20 chance of a false positive regardless of increasing sample size, as shown again here.

false positive up to 10k trials

What is, in fact, gained by increasing the sample size is a reduction in false negatives, where a false negative means failing to detect a difference that really is there. To illustrate that, we need a different plot because it is an entirely different circumstance. We have two new coins, and they are different.

Say we have one fair coin (p=50%) and another that's slightly biased (p=51%). This plot shows the result of running the same proportion test to see if these two are statistically different. Points above the red line (0.05 p-value, 95% confidence level) are negative results, and since the coins really are different, every one of them is a false negative. They clearly thin out as the sample size increases, which illustrates that false negatives decrease as sample size increases.
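
The same trend can be computed without simulation. The sketch below, in Python rather than the R behind the plot, uses the normal approximation to estimate the false negative rate for these two coins at a few sample sizes. The function name and the chosen sample sizes are illustrative assumptions, and the tiny chance of rejecting in the wrong direction is ignored.

```python
import math
from statistics import NormalDist


def false_negative_rate(n, p1=0.50, p2=0.51, alpha=0.05):
    """Approximate chance of missing a real difference with n flips per coin."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))
    sd_alt = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    # Power: probability that the test statistic clears the critical value
    power = z.cdf((math.sqrt(n) * abs(p1 - p2) - z_alpha * sd_null) / sd_alt)
    return 1 - power


sample_sizes = [1_000, 10_000, 100_000]
rates = [false_negative_rate(n) for n in sample_sizes]
for n, r in zip(sample_sizes, rates):
    print(f"n={n:>7}: false negative rate ~ {r:.3f}")
```

With only 1000 flips per coin the test misses the 1% bias most of the time; by 100,000 flips it almost never does.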

false negative increasing samples

"Funny business" does not coerce A/B tests into statistical significance. A 1 in 20 false positive rate is exactly what a test at the 95% level guarantees. To decrease false positives, simply test at a stricter level. For example, prop.test(c(heads.A, heads.B), n=c(N, N), alternative="two.sided", conf.level=0.99) sets it to 99% instead of the default 95%. Note that conf.level only changes the confidence interval that prop.test reports; the p-value itself is unchanged and has to be compared against 0.01 rather than 0.05.

The R source code for this mental sojourn is available in this gist on GitHub.

Recommendation discovery via graph traversal

I am quite excited about graph computing these days. A graph represents relational data such as customer behaviour naturally, and otherwise complicated problems break down into simple pattern-matching queries. Take a recommendation system, for example. One way to build one is with machine learning, as Wikipedia suggests. But if we represent the data in a property graph, a simple solution surfaces intuitively.

Picture this. If Bob likes item A, and Cathy likes both item A and item B, then we can infer a link from Bob to item B.

Let's try it out in Neo4j using this pre-built web console example. You should see this graph with 4 people and 5 food items.

simple graph

Using this Cypher query, we get a list of all users and what food they like.

START   user = node:node_auto_index(type = "user") 
MATCH   person-[:IS_A]->user, person-[:LIKE]->x
RETURN  person.name, x.name

The second line is where we match the pattern: person is a user, and that person likes x. This query reads almost like the question we want to ask.

We return every person along with each food, x, that they like:

+------------------------+
| person.name | x.name   |
+------------------------+
| "Andy"      | "apple"  |
| "Andy"      | "orange" |
| "Andy"      | "bread"  |
| "Bob"       | "apple"  |
| "Bob"       | "bread"  |
| "Cat"       | "apple"  |
| "Cat"       | "orange" |
| "Cat"       | "bread"  |
| "Cat"       | "fish"   |
| "Doug"      | "apple"  |
| "Doug"      | "orange" |
+------------------------+

Taking this a step further, we can find the pairs of foods, x and y, that are most often liked together by the same person in the above graph.

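// Anchor on the "food" and "user" type nodes via the auto index
START   food = node:node_auto_index(type = "food"), user = node:node_auto_index(type = "user")
// Each person who is a user and likes two foods, x and y
MATCH   food<-[:IS_A]-x<-[:LIKE]-person-[:IS_A]->user,
        person-[:LIKE]->y-[:IS_A]->food
// Skip pairing a food with itself
WHERE   NOT x = y
RETURN  x.name, y.name, count(*) as cnt
ORDER BY cnt DESC
LIMIT 10;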

Resulting in:

+---------------------------+
| x.name   | y.name   | cnt |
+---------------------------+
| "apple"  | "orange" | 3   |
| "apple"  | "bread"  | 3   |
| "orange" | "apple"  | 3   |
| "bread"  | "apple"  | 3   |
| "bread"  | "orange" | 2   |
| "orange" | "bread"  | 2   |
| "fish"   | "apple"  | 1   |
| "fish"   | "bread"  | 1   |
| "apple"  | "fish"   | 1   |
| "bread"  | "fish"   | 1   |
+---------------------------+
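
To make the counting explicit, here is the same co-occurrence logic in plain Python, outside Neo4j. It is only a sketch: the `likes` dictionary is copied by hand from the first result table above, and the variable names are made up for illustration.

```python
from collections import Counter
from itertools import permutations

# Who likes what, transcribed from the person/food table above
likes = {
    "Andy": ["apple", "orange", "bread"],
    "Bob": ["apple", "bread"],
    "Cat": ["apple", "orange", "bread", "fish"],
    "Doug": ["apple", "orange"],
}

# Count every ordered pair (x, y) of distinct foods liked by the same person
pair_counts = Counter()
for foods in likes.values():
    pair_counts.update(permutations(foods, 2))

for (x, y), cnt in pair_counts.most_common(10):
    print(x, y, cnt)
```

Each pair appears in both orders, just as in the Cypher result, because the pattern matches (x, y) and (y, x) separately. The fish and orange pair also has a count of 1, so which of the tied pairs make the cut depends on where the limit of 10 falls.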

So we find that a lot of users who like apple also like orange or bread. We can then pick out all the people who like apple but not yet orange, and suggest (read: spam) orange to them.

START   apple = node:node_auto_index(name = "apple"), orange = node:node_auto_index(name = "orange")
MATCH   person-[:LIKE]->apple
WHERE   NOT(person-[:LIKE]->orange)
RETURN  person.name

+-------------+
| person.name |
+-------------+
| "Bob"       |
+-------------+
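
The filter in that last query amounts to a set membership check. A minimal Python equivalent, again using data transcribed by hand from the first result table and a hypothetical helper name, looks like this:

```python
# Who likes what, transcribed from the person/food table earlier in the post
likes = {
    "Andy": {"apple", "orange", "bread"},
    "Bob": {"apple", "bread"},
    "Cat": {"apple", "orange", "bread", "fish"},
    "Doug": {"apple", "orange"},
}


def suggest(likes, liked, candidate):
    """People who like `liked` but do not yet like `candidate`."""
    return sorted(
        person
        for person, foods in likes.items()
        if liked in foods and candidate not in foods
    )


targets = suggest(likes, "apple", "orange")
print(targets)  # ['Bob']
```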

Easy, yes?

I construct models, not theories

I gave a talk at the 54th Annual Conference on Operational Research last week in Edinburgh. Operational Research is "using advanced analytical methods to help make better decisions" [wiki]. This field was around long before data science and business intelligence. After listening to so many talks and talking to so many academics, from marketing to finance to industrial engineering, I find that what we do is quite similar at a 30,000-foot level -- using data to solve problems. Yet there is a fundamental difference in our approaches. Whereas operational research is about constructing analytical theories, data science is about constructing models.

One of the talks that I recall was by a PhD from Dubai about optimising maintenance scheduling in desalination plants. Desalination plants are big business in the Middle East, as they provide a major source of fresh water there. However, components in these plants fail often because of the harsh conditions they work in, and servicing some of them might require bringing the entire plant down for hours. The presenter proceeded to explain their method of using a Poisson process on the failure data to optimise maintenance work.

Now if it were me, I would add tons of sensors everywhere to increase the frequency and granularity of the data captured, similar to what we do at work for web data. Then the problem practically solves itself, as we'd be able to build predictive models for each and every crucial component. Feeding ongoing data into these predictive models, we can flag high-risk components and service them before they cause trouble.

The problem is that adding sensors to a physical system is not trivial (as I know from my electrical engineering days). The high cost of installing all of that and the questionable efficacy of the measurements make getting data a challenge. For problems like this, I can see where the traditional scientific approach of using sparse data to support theories is practical.

Yet not everything can be reduced to formulas and solved analytically. As this blog piece on Scientific American points out, science is moving towards solving problems computationally. So too is industry, as we've seen in examples from Amazon and LinkedIn driving massive sales by modelling and enabling a feedback loop with their data.

It's a shame that so many companies are poisoning the term Big Data these days by plastering it all over their marketing material to sell products with no substance. There are real strategic advantages to be reaped if companies can do it right.

My talk at OR54 on knowledge discovery with web log data

Abstract

Web log data contains a wealth of information about online visitors. We have a record of each and every customer interaction for the millions of visitors coming through each month at uSwitch.com. The challenge is to analyse this discrete time series, semi-structured dataset to understand the behaviour of our visitors on a personal level. This talk is a case study of how our data team of three leveraged heterogeneous architecture and agile methodologies to tackle this problem. And we had three months.

Slides

People is the biggest obstacle to becoming a data-driven organisation

Our data team of 3 is tasked with pushing our business to be more data-driven, from encouraging management to make strategic decisions with hard data to building real-time feedback loops within our products themselves. We have put the infrastructure in place to capture and analyse streams of high-granularity data using open source tools like Kafka, Hadoop, and Cascalog. Now the hard part is to convince the rest of the company that counted numbers on an Excel spreadsheet are insufficient, so that they incorporate this new high-frequency, high-granularity information into their everyday decision-making process.

This came to light for me when I tried to revamp our Attribution Modelling, which is a fancy marketing term for how referral is attributed and doesn't have to do with mathematical modelling at all. Many companies traditionally use a last-referral-takes-the-cake approach. Say a customer first comes in via an email referral, leaves the site, then comes back in from an AdWords ad to ultimately make a purchase; the AdWords ad would get credited with all of the revenue produced. This last-referral method is a lazy way of doing things. Lead-generating referrals get no credit, and return-on-investment values are biased towards closing-stage marketing efforts. Granted, doing anything else would require a global view of the customer journey, which is a non-trivial matter.

However, that is exactly what we're able to do easily now. So we took a couple of days to spike out a fair-share attribution model in which every touch point a customer used gets a share of the pie. The intention is to provide a holistic view of referrer efficacy to feed into our higher-level models. But seeing that the old system is so unrealistic, I thought I might as well open it up to everyone else.
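
The difference between the two attribution rules is easy to see on a single customer journey. The sketch below is illustrative only: it assumes "fair share" means an equal split across touch points, which may not be how the real model weights them, and the function names and revenue figure are invented.

```python
from collections import defaultdict


def last_referral(journey, revenue):
    """Old rule: the final touch point takes all the credit."""
    return {journey[-1]: revenue}


def fair_share(journey, revenue):
    """Assumed rule: every touch point gets an equal slice of the revenue."""
    credit = defaultdict(float)
    share = revenue / len(journey)
    for touch_point in journey:
        credit[touch_point] += share
    return dict(credit)


# The email-then-AdWords journey from the example above, with a made-up sale of 100
journey = ["email", "adwords"]
last = last_referral(journey, 100.0)
fair = fair_share(journey, 100.0)
print(last)  # {'adwords': 100.0}
print(fair)  # {'email': 50.0, 'adwords': 50.0}
```

Under the last-referral rule the email campaign that brought the customer in earns nothing, which is exactly the bias towards closing-stage marketing described earlier.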

Opening a can of worms, that was what it felt like. I found that things like financial reports and people's bonuses are tied to this old attribution modelling system. Even though it is obvious to everyone that the old attribution modelling system is unrealistic, changing it would require changing the work processes of multiple individuals across the company. You know what they say about people's habits? Habits are hard to change.

Seeing that this is a fundamental obstacle to our company's competitiveness, I've taken a break from pushing our data architecture to build and evangelise internal data products. I've been working closely, one on one, with our business people to identify ways of using data, staging data views from Hadoop/Cascalog back into good old MySQL, and helping them generate actionable data with SQL to make their lives easier.

One such view borrows a technique from my algorithmic trading. By applying a breakout strategy (which I wrote entirely in SQL, good fun) to the revenue and cost time-series data of our PPC and SEO campaigns, our PPC/SEO manager now has a customisable screener for abnormalities in any of our thousands of continuous marketing campaigns, very much like a stock screener for breakouts. This used to be a subjective process based on a spreadsheet of numbers and expert opinions. Now it's data-driven.
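
The breakout rule itself was written in SQL and is not shown here. As a rough illustration only, the sketch below uses one common definition from trading: a value breaks out when it moves above the highest, or below the lowest, value of the preceding window. The function name, the window length, and the sample revenue figures are all hypothetical.

```python
def find_breakouts(series, window):
    """Return (index, direction) for values outside the range of the previous `window` values."""
    breakouts = []
    for i in range(window, len(series)):
        previous = series[i - window:i]
        if series[i] > max(previous):
            breakouts.append((i, "up"))
        elif series[i] < min(previous):
            breakouts.append((i, "down"))
    return breakouts


# Made-up daily revenue for a single campaign
daily_revenue = [100, 102, 101, 103, 102, 101, 130, 128, 90]
flags = find_breakouts(daily_revenue, window=3)
print(flags)  # [(3, 'up'), (6, 'up'), (8, 'down')]
```

Running a rule like this per campaign gives a short list of days worth a closer look, instead of a full spreadsheet to scan by eye.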

There's a saying that a business is its people. To become a data-driven organisation, you need data-driven people. I never realised the significance of this until now.
