What I learned from 2 years of 'data sciencing'

Last week was my last at uSwitch.com. In two years I went from first seeing ‘data scientist’ as a valid job title on my offer letter, to speaking at Strata London, to signing a book deal and writing about the work in our book on Web Data Mining (which is progressing at a glacial pace). I figured that I should jot down some takeaway lessons while the experience is still fresh.

sad day today as @uSwitchEng had to say goodbye to the amazing @Quantisan. you will be missed and best of luck in the next great adventure!
— Tim Goodwin (@timrgoodwin) December 18, 2013

It's not about the science but the data.

In my first year, our team delivered a handful of data projects. For example, we developed a dashboard showing lifetime values for all of our millions of customers, demonstrated a 6 percent revenue gain with a product showcase sorting algorithm modelled as a multi-armed bandit problem, and simulated the impact of offline advertising on online sales to optimise marketing spend, thereby saving £20,000 a month. For various reasons, however, none of these projects gained traction within the company, and all were abandoned.
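For readers unfamiliar with the multi-armed bandit framing, here is a minimal sketch of how a showcase ordering could be derived from it. This is not the code we ran: the post does not say which bandit strategy was used, so the sketch assumes UCB1, one standard strategy, and the product names and counts are made up for illustration.

```python
import math

def ucb1_ranking(stats):
    """Rank products by UCB1 score, highest first.

    stats maps a (hypothetical) product name to (impressions, clicks).
    """
    total = sum(impressions for impressions, _ in stats.values())

    def score(item):
        impressions, clicks = item[1]
        if impressions == 0:
            # Never shown before: put it at the top to gather data.
            return math.inf
        mean = clicks / impressions
        # Exploration bonus shrinks as a product accumulates impressions.
        bonus = math.sqrt(2 * math.log(total) / impressions)
        return mean + bonus

    ordered = sorted(stats.items(), key=score, reverse=True)
    return [name for name, _ in ordered]

# Illustrative numbers only, not real uSwitch data.
showcase = {
    "tariff_a": (1000, 50),
    "tariff_b": (1000, 80),
    "tariff_c": (10, 1),
    "tariff_d": (0, 0),
}
ranking = ucb1_ranking(showcase)
```

The idea is that a product with little data still gets a chance near the top of the showcase, while well-measured products are ordered by their observed click rate.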

Much of the effort spent on those projects went into getting the right data into the right shape. We needed to capture events across applications on different technology stacks, associate individual events with unique customers, and be able to process all that data in an ad hoc manner. Over the course of my first year, our team of two built and evolved a distributed data architecture and scalable data workflow based on open source tools and publications from companies such as Google, LinkedIn, and Twitter. In fact, I scratched enough of my own itch on an open source big data processing project to become one of its maintainers.

Congratulations to @Quantisan for becoming a Cascalog committer today
— Nathan Marz (@nathanmarz) March 14, 2013
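To give a flavour of what "associating events with unique customers" means, here is a toy sketch. It is not our Cascalog workflow: the field names (`session_id`, `customer_id`, `action`) are hypothetical, and it assumes a session belongs to whichever customer identified themselves during it.

```python
from collections import defaultdict

def attribute_events(events):
    """Group event actions by customer.

    A session is attributed to a customer if any event in that session
    carries a customer_id (an assumption made for this sketch).
    """
    # First pass: learn which customer owns each session.
    session_owner = {}
    for event in events:
        customer = event.get("customer_id")
        if customer is not None:
            session_owner[event["session_id"]] = customer

    # Second pass: attribute every event, including anonymous ones.
    by_customer = defaultdict(list)
    for event in events:
        owner = event.get("customer_id")
        if owner is None:
            owner = session_owner.get(event["session_id"])
        if owner is not None:
            by_customer[owner].append(event["action"])
    return dict(by_customer)

# Made-up events from two sessions; only the first is ever identified.
events = [
    {"session_id": "s1", "action": "view_energy_deals"},
    {"session_id": "s1", "customer_id": "c42", "action": "login"},
    {"session_id": "s2", "action": "view_broadband_deals"},
    {"session_id": "s1", "action": "switch_supplier"},
]
timeline = attribute_events(events)
```

Events from sessions that are never identified, like `s2` here, are simply left out of the per-customer view.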

Almost as a by-product, we structured lifetime views of customer data from the disparate signals across company verticals. Word went around the company that we had this new business intelligence tool, and more and more people asked us to help them answer questions with data on their side of the business. The data we surfaced from our workflow satisfied a widespread need in the company to understand customer behaviours. Little did I know that we'd be cleaning and shaping data for most of my second year at uSwitch. Following that, our commercial team released an external data product that I can't say much about, but which might bring sizable benefits to the company soon.

It is glamorous to talk about the latest and greatest machine learning or data visualisation. In practice, I was just cleaning and shaping data. Enabling more people to make use of deep and structured data was the part that delivered value to the company.

Figuring out the right problems to solve is not easy.

Had we known that customer behaviour analytics were so valuable, we would have done that work earlier on (although many of the other projects were definitely a lot of fun to do). Figuring out the right work to do is one of the most difficult tasks for a data science team. The fact that the data science role is so vague doesn't help. The marketing crew think we are mining for customer insights. Developers think we're toying with Riemann, Storm, or something bleeding edge. Product managers think we are plotting graphs.

Everyone had ideas, but there were only three of us on the team. Deciding where to devote our time and effort was not as easy as it sounds, because a new project could be almost anything. So which one should we do? The plethora of choice can be confusing.

Seeing that this is data science, why not dive right into the data, as they often say at hackathons? In the first few weeks of my data science career, I made the mistake of hacking away at the data and then trying to persuade people to make use of the result ... somehow.

Some interesting graphs came about. But as Marc often likes to ask, "so what?" Unless someone or something can act on the data, results can only satisfy intellectual curiosity. A business can't survive by funding people to carry out academic studies forever.

Nowadays, we talk to different stakeholders to try to dig as deep as possible into their needs before writing any code for a new project. This is me handwaving. Frankly, I'm still learning my way and rely a lot on luck and trial and error in discovering the right problems to solve.

It is a humbling experience.

Working with Paul Ingles and Abigail Lebrecht has been frustratingly awesome. Paul is opinionated about doing things as simply as possible. On more occasions than I can remember, we implemented our own little Clojure libraries because the open source ones available were "trying to do too much". Abigail was adamant about getting the data and analyses right. "What do you mean this data is only 99 percent correct?" Working day in and day out with Paul and Abigail showed me that I still had much to learn about efficient problem-solving and taught me to question all hidden assumptions.

In my previous role as a biomedical engineer, I also had the opportunity to work in a multidisciplinary team. But for my haptic-robotic therapy project, I never even considered going into a workshop to build my robot or providing clinical therapy for the stroke patients. What "multidisciplinary" meant back then was a small group of professionals coming together to work on a project, with each person doing a different task to get the thing to work.

The advantage of being a data scientist was that I was very hands-on in all aspects of the work. One week I might be pair programming with Paul and fighting to keep him away from my keyboard; another week, I might be integrating Riemann to monitor our data architecture. Other days I was debating with Abigail on the data mining side. That usually resulted from her finding flaws in the materialised tables that I produced from Cascalog, and then I would have to come up with a better estimation model for the missing data.

So do you want to be a data scientist?

This is it for me formally as a data scientist. I am moving back across the Atlantic to the States to co-found a new venture and continue my journey to make information accessible. Now back to the topic at hand. If cleaning vast amounts of data, being clueless as to what to do, and debating with brilliant colleagues all add up to a challenge that you want to take on, I know a company in London that's looking for a data scientist!

Paul Lam

Clojure Developer. Enterprise Startups. Being Useful.