Recommendation discovery via graph traversal

I am quite excited about graph computing these days. It represents relational data, such as customer behaviour, naturally, and otherwise complicated problems break down into simple pattern-matching queries. Take recommendation systems, for example. One way to build them is with machine learning, as Wikipedia suggests. But if we represent the data as a property graph, a simple solution surfaces intuitively.

Picture this: if Bob likes item A, and Cathy likes both item A and item B, then we can recommend item B to Bob.
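Before reaching for a graph database, the idea can be sketched with plain sets. This is a minimal Python sketch of the logic, using the names from the example above; the dict layout is purely illustrative:

```python
# Each person maps to the set of items they like
likes = {
    "Bob": {"A"},
    "Cathy": {"A", "B"},
}

def recommend(person, likes):
    """Suggest items liked by people who share a like with `person`."""
    suggestions = set()
    for other, items in likes.items():
        if other != person and likes[person] & items:  # shared taste
            suggestions |= items - likes[person]       # new items only
    return suggestions

print(recommend("Bob", likes))  # {'B'}
```

Cathy shares item A with Bob, so her other like, item B, becomes Bob's suggestion.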

Let's try it out in Neo4j using this pre-built web console example. You should see a graph with four people and five food items.

(figure: the simple graph of people and food items)

Using this Cypher query, we get a list of all users and what food they like.

START   user = node:node_auto_index(type = "user")
MATCH   person-[:IS_A]->user, person-[:LIKE]->x
RETURN  person.name, x.name

The second line is where we match the pattern that person is a user and that person likes x. This query reads almost like the question we want to ask.

We return every person and the foods they like, x:

| person.name | x.name   |
| "Andy"      | "apple"  |
| "Andy"      | "orange" |
| "Andy"      | "bread"  |
| "Bob"       | "apple"  |
| "Bob"       | "bread"  |
| "Cat"       | "apple"  |
| "Cat"       | "orange" |
| "Cat"       | "bread"  |
| "Cat"       | "fish"   |
| "Doug"      | "apple"  |
| "Doug"      | "orange" |

Taking this a step further, we can find the top pairs of items x and y that people like together in the above graph.

START   food = node:node_auto_index(type = "food"), user = node:node_auto_index(type = "user")
MATCH   food<-[:IS_A]-x<-[:LIKE]-person-[:LIKE]->y-[:IS_A]->food,
        person-[:IS_A]->user
WHERE   NOT x = y
RETURN  x.name, y.name, count(*) AS cnt
ORDER BY cnt DESC

Resulting in:

| x.name   | y.name   | cnt |
| "apple"  | "orange" | 3   |
| "apple"  | "bread"  | 3   |
| "orange" | "apple"  | 3   |
| "bread"  | "apple"  | 3   |
| "bread"  | "orange" | 2   |
| "orange" | "bread"  | 2   |
| "fish"   | "apple"  | 1   |
| "fish"   | "bread"  | 1   |
| "apple"  | "fish"   | 1   |
| "bread"  | "fish"   | 1   |
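For readers without a Neo4j console at hand, the same aggregation can be reproduced in plain Python from the likes table above; the dict encoding of the graph is an assumption for illustration:

```python
from collections import Counter
from itertools import permutations

# Who likes what, taken from the first result table above
likes = {
    "Andy": {"apple", "orange", "bread"},
    "Bob":  {"apple", "bread"},
    "Cat":  {"apple", "orange", "bread", "fish"},
    "Doug": {"apple", "orange"},
}

# Count ordered pairs (x, y) liked by the same person,
# mirroring the Cypher count(*) aggregation
cnt = Counter()
for items in likes.values():
    for x, y in permutations(items, 2):
        cnt[(x, y)] += 1

print(cnt[("apple", "orange")])  # 3
print(cnt[("bread", "orange")])  # 2
print(cnt[("fish", "apple")])    # 1
```

The counts agree with the table: three people like apple and orange together, two like bread and orange, and only Cat likes fish alongside anything else.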

So we find that many users who like apple also like orange or bread. We can then pick out all the people who like apple but not orange yet, and suggest (read: spam) orange to them.

START   apple = node:node_auto_index(name = "apple"), orange = node:node_auto_index(name = "orange")
MATCH   person-[:LIKE]->apple
WHERE   NOT(person-[:LIKE]->orange)
RETURN  person.name

| person.name |
| "Bob"       |
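As a sanity check, the same filter can be expressed as a one-line comprehension over the likes from the first table (the dict encoding is illustrative):

```python
likes = {
    "Andy": {"apple", "orange", "bread"},
    "Bob":  {"apple", "bread"},
    "Cat":  {"apple", "orange", "bread", "fish"},
    "Doug": {"apple", "orange"},
}

# People to whom we could suggest orange: they like apple but not orange
targets = [p for p, items in likes.items()
           if "apple" in items and "orange" not in items]
print(targets)  # ['Bob']
```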

Easy, yes?

I construct models, not theories

I gave a talk at the 54th Annual Conference on Operational Research last week in Edinburgh. Operational Research is "using advanced analytical methods to help make better decisions" [wiki]. The field has been around since long before data science and business intelligence. After listening to so many talks and speaking with so many academics, from marketing to finance to industrial engineering, I find that what we do is quite similar at the 30,000-foot level -- using data to solve problems. Yet there is a fundamental difference in our approaches: operational research is about constructing analytical theories; data science is about constructing models.

One of the talks I recall was by a PhD student from Dubai on optimising maintenance scheduling in desalination plants. Desalination is big business in the Middle East, as these plants provide a major source of fresh water there. However, their components fail often because of the harsh conditions they operate in, and servicing some of these components may require bringing the entire plant down for hours. The presenter proceeded to explain their method of fitting a Poisson process to the failure data to optimise maintenance work.

Now if it were me, I would add tons of sensors everywhere to increase the frequency and granularity of the data captured, similar to what we do at work with web data. Then the problem would practically solve itself, as we would be able to build predictive models for each and every crucial component. Feeding ongoing data into these predictive models, we could flag high-risk components and service them before they cause trouble.

The problem is that adding sensors to a physical system is not trivial (as I remember from my electrical engineering days). The high cost of installation and the questionable efficacy of the measurements make getting data a challenge. For problems like this, I can see where the traditional scientific approach of using sparse data to support theories is practical.

Yet not everything can be reduced to formulas and solved analytically. As this blog piece on Scientific American points out, science is moving towards solving problems computationally. So too is industry, as we've seen with Amazon and LinkedIn driving massive sales by modelling their data and building feedback loops on top of it.

It's a shame that so many companies are poisoning the term Big Data these days by plastering it all over their marketing material to sell products with no substance. There are real strategic advantages to be reaped if companies can do it right.

My talk at OR54 on knowledge discovery with web log data


Web log data contains a wealth of information about online visitors. We have a record of each and every customer interaction for the millions of visitors coming through each month. The challenge is to analyse this discrete-time, semi-structured dataset to understand the behaviour of our visitors on a personal level. This talk is a case study of how our data team of three leveraged heterogeneous architecture and agile methodologies to tackle this problem. And we had three months.


People are the biggest obstacle to becoming a data-driven organisation

Our data team of three is tasked with pushing our business to be more data-driven: from encouraging management to make strategic decisions with hard data, to building real-time feedback loops into our products themselves. We have built the infrastructure to capture and analyse streams of high-granularity data using open source tools like Kafka, Hadoop, and Cascalog. Now the hard part is convincing the rest of the company that hand-counted numbers in an Excel spreadsheet are insufficient, so that they incorporate this new high-frequency, high-granularity information into their everyday decision-making.

This came to light for me when I tried to revamp our attribution modelling, which is a fancy marketing term for how referrals are credited and has nothing to do with mathematical modelling at all. Many companies traditionally use a last-referral-takes-the-cake approach: if a customer comes in via an email referral first, leaves the site, then comes back through an Adword to ultimately make a purchase, the Adword gets all the credit for that revenue. This last-referral method is a lazy way of doing things. Lead-generating referrals get no credit, and return-on-investment figures are biased towards closing-stage marketing efforts. Granted, doing anything else requires a global view of the customer journey, which is a non-trivial matter.

However, that is exactly what we are now able to do easily. So we took a couple of days to spike out a fair-share attribution model in which every touch point a customer used gets a share of the pie. The intention was to provide a holistic view of referrers' efficacy to feed into our higher-level models. But seeing how unrealistic the old system was, I thought I might as well open it up to everyone else.
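To make the equal-split idea concrete, here is a hedged Python sketch of fair-share attribution. The function name and journey data are invented for illustration; this is not our production code:

```python
from collections import defaultdict

def fair_share(journeys):
    """journeys: list of (touch_points, revenue) per converting customer.
    Every touch point in a journey gets an equal share of the revenue."""
    credit = defaultdict(float)
    for touches, revenue in journeys:
        share = revenue / len(touches)
        for t in touches:
            credit[t] += share
    return dict(credit)

# The email-then-Adword example from above: last-referral attribution
# would hand Adwords the full 100; fair-share splits it evenly.
journeys = [(["email", "adwords"], 100.0)]
print(fair_share(journeys))  # {'email': 50.0, 'adwords': 50.0}
```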

Opening a can of worms: that is what it felt like. I found that things like financial reports and people's bonuses were tied to this old attribution system. Even though it was obvious to everyone that the old system was unrealistic, changing it would mean changing the work processes of multiple individuals across the company. You know what they say about habits? They are hard to change.

Seeing that this is a fundamental obstacle to our company's competitiveness, I've taken a break from pushing our data architecture to build and evangelise internal data products instead. I've been working closely with our business people to identify ways of using data, staging data views from Hadoop/Cascalog back into good old MySQL, and helping them generate actionable data with SQL to make their lives easier.

One such view borrows a technique from my algorithmic trading days. By applying a breakout strategy (which I wrote entirely in SQL, good fun) to the revenue and cost time series of our PPC and SEO campaigns, our PPC/SEO manager now has a customisable screener for abnormalities across any of our thousands of running marketing campaigns -- very much like a stock screener for breakouts. This used to be a subjective process based on a spreadsheet of numbers and expert opinion. Now it's data-driven.
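The production screener was plain SQL, but the core breakout test can be sketched in a few lines of Python; the window size and sample revenue series below are invented for illustration:

```python
def breakouts(series, window=3):
    """Return indices where the value breaks above the rolling
    maximum of the previous `window` observations."""
    flags = []
    for i in range(window, len(series)):
        if series[i] > max(series[i - window:i]):
            flags.append(i)
    return flags

# Daily revenue for one hypothetical campaign
revenue = [10, 12, 11, 11, 30, 12, 13]
print(breakouts(revenue))  # [4]: day 4 breaks above the prior 3-day high
```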

There's a saying that a business is its people. To become a data-driven organisation, you need data-driven people. I never realised the significance of this until now.
