The butterfly effect of a coronavirus has caused seismic shifts in the appreciation of data science and its fallibility.
The COVID-19 pandemic has caused a worldwide data obsession — and not just among data scientists.
Every day, ordinary people log onto their devices to look at the numbers: how many people are sick, getting tested, vaccinated or recovering at any given point, and then we try to infer what will happen next.
This interest is turning the pandemic into a data product in and of itself. We have watched how a new wave of infections in one place may correlate with a rise in cases elsewhere, and we now collectively see how important data analysis and prediction are for managing the global recovery.
Ironically, however, as the data has become more interesting, important and relevant, it has also endangered data science models based on history. And this has changed what it means to be a data scientist.
Dealing with data drift
Historically, data science (roughly speaking) meant inputting parameters, variables or factors, and then applying algorithms to analyze them and predict outcomes. In other words, we could use history to predict the future.
Even if the exact same things were not happening, they were still following the same paths. Then the pandemic threw the predictability element off kilter. Last year, many businesses that relied on data in any capacity were forced to confront the phenomena of concept drift and data drift.
- Concept drift is when the parameters are the same, but the outcome is different. For instance, if at the start of the pandemic every business's sales dropped by 50% all at once, it would not matter whether one business was better run than another; they would all sell half as much as they did before. The parameters are correct, but the outcome is different.
- Data drift is when the parameters no longer affect things the same way as before. A regional example is people changing their shopping habits and preferences: previously successful ventures (such as the Singapore F1 Night Race) suddenly came to a standstill, whereas others that adapted quickly (such as Sport Singapore, which launched its Get Active TV clips to promote home workouts) may have seen sales skyrocket. A minimal detection sketch for both kinds of drift follows this list.
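To make the distinction concrete, here is a minimal sketch, assuming a hypothetical sales dataset split into a pre-pandemic reference window and a recent window; the column names and thresholds are illustrative assumptions, not a prescription. Shifted input distributions point to data drift, while stable inputs with a growing prediction error point to concept drift.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_report(reference, recent, feature_cols, y_col, pred_col, p_threshold=0.01):
    """Rough drift check on two pandas DataFrames: a pre-pandemic reference
    window and a recent window. All column names here are illustrative."""
    report = {}

    # Data drift: have the input feature distributions shifted?
    for col in feature_cols:
        stat, p_value = ks_2samp(reference[col], recent[col])
        report[col] = {"ks_stat": stat, "p_value": p_value,
                       "data_drift": p_value < p_threshold}

    # Concept drift: inputs look similar, but the model's error is growing.
    ref_error = np.mean(np.abs(reference[y_col] - reference[pred_col]))
    new_error = np.mean(np.abs(recent[y_col] - recent[pred_col]))
    report["error_ratio"] = new_error / max(ref_error, 1e-9)
    report["concept_drift_suspected"] = report["error_ratio"] > 1.5
    return report
```

In practice a check like this would run on a schedule, with the error ratio and p-value cutoffs tuned to the business rather than fixed in code.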
In practice, this meant that even if businesses were able to keep running their operations globally (or regionally) as before, it became increasingly hard to predict outcomes.
At a local country level, few predicted how fast Vietnam would recover compared to other South-east Asian countries last year, and this is undoubtedly an example of shifting parameters.
Rolling with the punches
Some companies reacted to the pandemic data shifts by moving to manual intervention. Existing data models were turned off, and humans came in to manually review the data and act on it.
However, this slowed down processes significantly, and unfortunately the work was not scalable and therefore not sustainable.
Another response was to look at emerging data and try to adapt quickly. Essentially, this entailed adding some intuition to the models: keeping the same parameters but ruling out certain factors. For some companies, this meant discounting the initial months of the pandemic because they were too volatile: March, April and May 2020 had to be excluded in order to keep using older models built on previous time periods.
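As a rough sketch of that workaround, assuming a pandas DataFrame of dated observations (the column name is a placeholder and the date range is the one mentioned above), the volatile months can simply be dropped before the older model is refreshed:

```python
import pandas as pd

def drop_volatile_window(df, date_col="date",
                         start="2020-03-01", end="2020-05-31"):
    """Remove the most volatile early-pandemic months (here March-May 2020)
    so an existing model can keep learning from pre- and post-window data."""
    dates = pd.to_datetime(df[date_col])
    in_window = (dates >= start) & (dates <= end)
    return df.loc[~in_window].copy()
```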
The problem was that injecting intuition and guesswork into these algorithms meant losing a significant part of the power of machine learning.
A third way has been to take a middle path and simply reduce risk and lower overall exposure. It is a more conservative approach in general, and the one most favored by the traditional financial services sector, particularly traditional lending firms.
The unfortunate result is that small businesses have suddenly found it next to impossible to get loans, as traditional lenders have lowered their risk appetites (which already barely covered these businesses). This has left an even bigger gap in the market for alternative providers of working capital to fill.
Simply put, the pandemic has fundamentally altered the history of data as we know it, and the ripple effects are causing seismic shifts at both a societal and business level.
Where do we go from here?
One of the major lessons learned over the past year is that businesses need to step up data collection and processing and make it a primary focus going forward. The faster these businesses can recognize unusual patterns and reevaluate their data models, the better.
Previously, data scientists could frame their models, run test cases and so on. Now, the focus must be on retraining: working out when the data has returned to something like 'normal'. This means deciding how to weight and combine older and newer data, and/or normalizing models by ignoring outlier periods of extreme change entirely. In Asia, we would consider this period to run from February to May 2020.
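One way to act on that trade-off, sketched here under the assumption of a scikit-learn-style regressor and a dated training table (the feature names, half-life and outlier window are illustrative), is to exclude the outlier period entirely and weight the remaining rows so that newer data counts for more:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def retrain_with_recency_weights(df, feature_cols, target_col, date_col="date",
                                 half_life_days=90,
                                 outlier_start="2020-02-01",
                                 outlier_end="2020-05-31"):
    """Retrain on all data outside the outlier window, giving newer rows more
    weight via exponential decay with the given half-life."""
    dates = pd.to_datetime(df[date_col])
    keep = ~((dates >= outlier_start) & (dates <= outlier_end))
    df, dates = df.loc[keep], dates.loc[keep]

    age_days = (dates.max() - dates).dt.days.to_numpy()
    weights = 0.5 ** (age_days / half_life_days)  # newest rows get weight close to 1

    model = GradientBoostingRegressor()
    model.fit(df[feature_cols], df[target_col], sample_weight=weights)
    return model
```

The half-life controls the balance between accepting older data and trusting newer data; shortening it leans harder on the post-pandemic pattern.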
In the future, data scientists may well come to treat the pandemic as a parameter in their modeling. Part of future-proofing businesses will mean using today's data as tomorrow's historical data.
If the pandemic has taught the data science community anything, it is the importance of comparing and measuring the quality of data results, and of implementing tools that notify users relatively early when typical patterns are significantly off. It also means reacting quickly to those signals and adjusting to situations in an agile way.
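As a minimal illustration of that kind of early warning, assuming a daily time series of one key metric (the window length and threshold are assumptions to tune), a rolling z-score can flag days whose values sit far outside the recent pattern:

```python
import pandas as pd

def flag_unusual_days(series: pd.Series, window: int = 28,
                      z_threshold: float = 3.0) -> pd.Series:
    """Return a boolean Series marking days that deviate sharply from the
    trailing window; True means 'raise an alert and take a closer look'."""
    trailing = series.shift(1).rolling(window)   # only use past values
    z_scores = (series - trailing.mean()) / trailing.std()
    return z_scores.abs() > z_threshold
```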
At the end of the day, it’s not about the data tools we use but about the people who use them. There has been a permanent change in the way society and businesses now view data scientists: they have a seat at the decision-making table and are the ones leading the charge in the global recovery.
However exciting this may seem, it is also humbling: the very situation that put data science in the limelight has demonstrated that, when events are big enough to alter history, nothing can be certain.