[Update]: This article is getting a good bit of engagement. If it resonates with you, I’d love it if you could answer a short 2-minute survey on your data journey here. I will add the same survey link at the end of this post as well.
Depending on who you ask, you’re going to hear data science described as being sexy by some, and decidedly not so by others.
Sexy, I suspect, because in today’s geekdom-loving world, we imagine the lab coats have finally turned their laser-like, academic precision to the final economic frontier, data, and the dam holding back all those dollar-laden insights from that data is about to burst.
Decidedly unsexy, because for the most part, the work is laborious, boring, messy, and easy to trip up on.
As you might expect, the reality is somewhere in between sexy and not (but definitely right of centre). There is a lot of grunt work that goes into being able to answer the nuanced questions that actually add value to businesses. The oft-quoted rule of thumb is that 80% of a data scientist’s time goes into preparing the data, and only 20% into actually looking at it.
Let’s break down a typical approach to a data science problem, so we’re in the same ballpark.
Steps to solving a data science problem
- Identifying objectives – The business lays out objectives or goals
- Identifying levers – The business levers that can be tweaked or deployed are identified [this step is optional and may also come later]. This helps bound the analysis (e.g. the business picks the lever of discount coupon assignment to select customers)
- Data gathering – The right data is gathered, whether from the business’ own existing data stores, from secondary sources (bought or traded from an outside party), or from primary sources (e.g. specifically commissioned market research)
- Data preparation – Most data is going to have some amount of incompleteness or dirtiness. The data scientist makes choices, consistent with the business context, about how to kludge the data into usable shape
- Data Modeling and insight generation – The right algorithms, the necessary mathematical modeling, everything that is “sciencey” about the field, are applied at this step to identify patterns (or the lack thereof) in the data. It is at this point that insights crystallize
- Story-telling – The insight is framed in the context of the business’ larger story and tied back to the objective, with recommendations of specific interventions to act on it
- Feedback loop of Predict -> Intervene -> Measure – Ideally, the data scientist creates a prediction of the outcomes should said interventions be carried out, so that the actual outcomes can be measured against the prediction. The measured gap helps the data scientist improve her own understanding and process (a minimal sketch of this comparison follows this list)
- Build repeatability – The tools and processes used are documented so that they can be reused easily by others in the organization. This serves two purposes: (i) reproducibility of results increases trust in the process, and (ii) not having to redo good work brings down the cost of asking questions throughout the organization.
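To make the Predict -> Intervene -> Measure loop concrete, here is a minimal sketch in Python (pandas) of the measurement step: joining the predicted outcomes to the outcomes actually observed after an intervention, and summarising the gap. The file names, the customer_id key, and the “uplift” columns are illustrative assumptions on my part, not part of any prescribed toolchain.

```python
# Minimal sketch of the Measure step in the Predict -> Intervene -> Measure loop.
# File names and column names are illustrative assumptions.
import pandas as pd

def measure_prediction_gap(predictions_csv: str, actuals_csv: str) -> pd.DataFrame:
    """Join per-customer predicted and actual outcomes and compute the gap."""
    predicted = pd.read_csv(predictions_csv)  # assumed columns: customer_id, predicted_uplift
    actual = pd.read_csv(actuals_csv)         # assumed columns: customer_id, actual_uplift
    merged = predicted.merge(actual, on="customer_id", how="inner")
    merged["gap"] = merged["actual_uplift"] - merged["predicted_uplift"]
    return merged

if __name__ == "__main__":
    results = measure_prediction_gap("predictions.csv", "outcomes.csv")
    # The mean absolute gap is one simple summary of how far the measured
    # outcomes drifted from the prediction; a persistent gap is a prompt
    # to revisit the model's assumptions.
    print("Mean absolute gap:", results["gap"].abs().mean())
```

Keeping the comparison as a small, rerunnable script rather than an ad hoc spreadsheet also feeds directly into the “build repeatability” step above.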
It’s a complex set of steps, with micro-loops scattered throughout, and while some pitfalls are obvious, it’s the lurkers, the ones that swim under the surface, that often get us. I’ve consolidated below some of the more insidious ways I’ve found the analytic process getting derailed, along with some thoughts on how to prepare for them, or recover from them.
Here be the pitfalls:
- Fluid (as opposed to concrete) business goals: When the business objective is a moving target, the data scientist should wait till it settles down, or press management to fix it at a point. However, it’s difficult to know when a target is about to move, so one way to tackle this is to work on long-term business goals (less likely to move) in parallel with the tactical ones. As the tactical ones become more volatile, data scientists (and indeed, heads of data) should allocate more of their own time towards solving for the business’ long-term goals until the volatility in the tactical goals reduces. Essentially, this is a case of managing upwards
- Imposing unrealistic time constraints: Business deadlines are often not in line with the non-mechanical, unpredictable grind of data science. If data science success is measured by the frequency of insight generation, these whack-a-mole insights will soon degrade in quality and reliability as the team starts to cut corners or chunk up a story just so there’s something left over to deliver at the next meeting. Instead, a business is best served by laying out objectives and measuring the team’s progress towards those objectives week on week
- Underestimating the messiness of data: Data scientists have to use judgment to deal with dirty or incomplete data. The assumptions they make, and the methods they use to plug gaps in the data, need to be consistent with the context of the business. Additionally, whatever assumptions are made and approaches taken need to be documented for the rest of the team, so that the same ones are used consistently over time (keeping any parallax error constant). A sketch of one such documented choice appears after this list
- Inadequate statistical chops: There’s no question that the more nuanced questions require the data scientist to have a background or experience in statistics to give their analysis depth, such as checking preconditions and stating what guarantees the results actually carry. One can’t compensate by throwing more Python or compute power at the problem. This needs to be solved at the hiring stage, or later by investing in the right training to ensure the data scientist has the requisite skills
- Misreading complexity (1): A misapplication of Occam’s razor. Data scientists can sometimes simplify a problem to the extent that it loses meaning. Instead, use simple methods to understand the nature of the problem, as a starting point. Then, make incremental progress.
- Misreading complexity (2): Over-complicating things when simplicity would have sufficed. This includes incorporating tangential data or methods that add far more noise than signal, as well as using overly complex statistical models. Why does this happen? Ego, lack of experience, bragging rights, resume embellishment. The intervention is the same as above: start simple, build incrementally. This applies to how you incorporate data, just as it does to the algorithms.
- Data Biases: The data can have implicit biases based on how it is collected, as well as where, and by whom. It is critical that the data scientist has an end-to-end view of what data is being collected, and how it is being collected. For example, at first pass, many businesses in the US are surprised to find that a disproportionate number of their customers appear to live in Schenectady, NY. It turns out the zip code there is 12345, a number many customers type in when their information is being collected and the zip code field is marked as mandatory. The data scientist has to correct for such biases in various ways, whether by applying their own filters or by adding a relevance score (weightage) to such columns; a sketch of one such filter appears after this list
- Missing Context: This ties into the point above. Models have to be combined with domain knowledge to ensure the correct reading of the data. That domain knowledge is sometimes missing all the way from the hiring of the data scientist through to the day the newly appointed data scientist finds themselves being shown the door with ‘best wishes on their next endeavour’, because they had no prior experience of the industry their business operates in, and/or didn’t bother to gather it during their employment. One can’t just throw ‘science’ (math + computer science) at data and expect pots of gold. There is an art to it, and a large part of the art is combining context (including domain knowledge) with the science (tools and techniques).
- Lossy story-telling: When faced with an impatient cluster of senior management, chomping at the bit to take ‘action’, the nuance and caveats that statistical analysis introduces into an insight can quickly get lost. This creates a gap between the insight and the action taken on it. Doing something based on a wrong understanding can be worse than doing nothing.
- Incentive/constraint misalignment: Decision-makers may have other constraints or incentives that the data scientist did not consider. This again can lead to a gap between insight and the final action taken. The worst outcome, however, is distrust between the data team and decision-makers. In that case, feedback loops are never created, or they run askew with a lot of finger-pointing and no ownership
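On the “underestimating the messiness of data” pitfall: one lightweight way to keep imputation choices consistent and visible is to encode each choice as a small, documented function that the whole team reuses. The sketch below assumes a hypothetical orders table with region and order_value columns, and a per-region median imputation; both the names and the choice of method are illustrative assumptions, not a recommendation from this article.

```python
# A minimal sketch of a documented data-preparation choice, so the same
# assumption is applied consistently over time. Dataset, column names, and
# the median-imputation choice are illustrative assumptions.
import pandas as pd

def impute_order_value(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing order_value entries with the per-region median.

    Documented assumption (hypothetical): missing values mostly come from one
    point-of-sale integration, and the regional median was agreed with the
    business as the least-bad stand-in. Reusing this one function everywhere
    keeps the "parallax error" constant across analyses.
    """
    out = df.copy()
    regional_median = out.groupby("region")["order_value"].transform("median")
    out["order_value"] = out["order_value"].fillna(regional_median)
    return out
```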
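And on the “Data Biases” pitfall, the Schenectady example suggests a simple corrective: flag or down-weight records whose zip code looks like a keyboard placeholder before they skew any geographic analysis. The placeholder values and the zip_code/zip_weight column names below are assumptions for illustration.

```python
# Minimal sketch of flagging placeholder zip codes such as "12345" so that
# downstream analysis can filter or down-weight them. The placeholder set
# and column names are illustrative assumptions.
import pandas as pd

# Assumed junk values; a real list would come from inspecting the data.
PLACEHOLDER_ZIPS = {"12345", "00000", "99999"}

def flag_placeholder_zips(df: pd.DataFrame) -> pd.DataFrame:
    """Add a zip_weight column: 0 for likely placeholder zips, 1 otherwise."""
    out = df.copy()
    zips = out["zip_code"].astype(str).str.zfill(5)
    out["zip_weight"] = (~zips.isin(PLACEHOLDER_ZIPS)).astype(float)
    return out
```

A downstream aggregation can then use zip_weight as a weighting column, or simply drop rows where it is zero.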
Avoiding these pitfalls requires a greater level of self-awareness than many of us have. Sometimes we miss the forest for the trees, and it is helpful to have a peer review system built into the process. A second level of inspection also needs to be built in, where the results of actions taken on recommendations are closely compared against previously made predictions, to gauge their efficacy.
As you deal with ever more complex data questions, I hope you set yourself up for success: first, by being aware of the potential missteps you might take and how much they might cost, and second, by setting up systems and processes so that with each iteration you bring down both the number of potential mistakes and their cost.
In the end, you want your organization to be able to freely ask more questions of its data, at every level, knowing that each such question will not only provide reliable insight, but that the very act of asking will strengthen the organization’s analytic process that much more.
[Update]: Thanks for reading. I have a 2-minute survey for you that will help me understand more about the challenges folks face in consuming data. I will be summarizing and sharing the responses. Click here for the Google Forms survey.
Indra is a co-founder of Scribble Data, a data analytics solutions company that helps businesses channel their data, tools and people, for maximum analytic mileage.
E: indrayudh@scribbledata.io