Data Science in Black Swan Times
Part 1: Understand your Causal Chain
In his book “The Black Swan”, author Nassim Taleb describes black swan events as those that are unpredictable, carry drastic consequences, and are retrospectively described as completely predictable. I think it’s reasonable to apply all three conditions to the pandemic caused by COVID-19. During these times, practically every part of ordinary life is altered, which calls into question the validity of most predictive models. It’s during times like these that assessing the underlying assumptions of your models is particularly important, and having a background in statistics is helpful.
Revisiting the Chain Rule
Before I was a data scientist, I was a particle physicist. Particle physics is unique in that you study phenomena that are invisible to the human eye. Signal is exceedingly rare. Signal events are identified by combining signatures left in various electronic detectors, and the resulting electronic signature is well described as a random event. While machine learning is now being leveraged to identify the signal of these types of rare events more effectively, these techniques were not quickly adopted. The reason, aside from the stubbornness of physicists, is that in this field the underlying chain of events is well modeled. Understanding variations in these underlying processes is interesting in its own right, as it can indicate the presence of a new physical phenomenon.
Classifying random, often rare, events is one of the tasks that machine learning is best suited for, and its most ubiquitous application. Unlike the above scenario, however, machine learning is often applied in scenarios where the intermediary steps are not well studied. In most cases, an effective model for the end outcome is useful, even if every step in the causal chain isn’t perfectly understood. And while an electron traveling through gas at nearly the speed of light sounds difficult to model, it doesn’t even scratch the surface of the complications presented by human behavior.
As an example, consider a model applied to a Medicare population that is trying to predict whether a member will have an inpatient admission due to influenza. As a feature set, you will leverage a patient’s medical billing history, along with demographic information. While this may seem like a straightforward event, there are actually multiple events that happen along the way. First, a person needs to be exposed to influenza. Second, this exposure leads to an infection severe enough to merit an inpatient admission. Of course, even this picture is an oversimplification, but it’s useful for illustrative purposes.
When we train a model, it’s simply learning P(I|M) — the probability of an inpatient admission I given a member’s history M — not modeling the intermediary steps along the way. Most of the time, this is OK, because it is reasonable to assume that most individuals will be exposed (E) to influenza every flu season, and the real thing we are learning is P(I|E, M). During these times of social distancing, that original assumption is not well founded. In normal times, P(E|M) can be modeled as a high constant. During the pandemic, however, this assumption does not hold. The second step in the causal chain, P(I|E, M), will likely be largely unchanged during these times.
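The arithmetic of this causal chain can be sketched in a few lines. This is a minimal illustration, not the author’s actual model: it assumes admission without exposure is negligible, so P(I|M) ≈ P(I|E, M) × P(E|M), and all the numbers below are made up for the example.

```python
# Illustrative sketch of the causal chain P(I|M) ~= P(I|E,M) * P(E|M),
# assuming admissions without exposure are negligible.
# All probabilities below are hypothetical, not real estimates.

def admission_prob(p_exposure: float, p_admit_given_exposure: float) -> float:
    """P(I|M): exposure happens first, then infection severe enough to admit."""
    return p_exposure * p_admit_given_exposure

# P(I|E, M): member-specific risk, assumed stable across both regimes.
p_admit_given_exposure = 0.02

# In normal times P(E|M) is a high constant; under social distancing it drops.
p_flu_normal = admission_prob(0.9, p_admit_given_exposure)
p_flu_distancing = admission_prob(0.3, p_admit_given_exposure)

# The trained model learned P(I|M) under the first regime; the second regime
# changes only the exposure link, yet the end-to-end prediction shifts a lot.
```

The point of the sketch: even though P(I|E, M) is unchanged, the learned P(I|M) is off by the ratio of the two exposure rates once P(E|M) moves.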
All of this is a bit theoretical, but these ideas have real-world implications for the effectiveness of the model. Social distancing measures have made the initial assumption about P(E|M) invalid. The knock-on effect is that the entire healthcare system is observing a drastic decrease in utilization. The real question is: do you need to retrain your model? The answer is that it depends on context, specifically the context surrounding the deployment of the model.
One use case for a flu model is a hospital department trying to predict overall load on inpatient resources. The output of our model gives us probabilities of various members of our population having an inpatient admission, which, combined with some simple arithmetic (or some fancy Monte Carlo simulations), can produce predictions of expected hospital load. This is an example of a model deployment that is completely invalidated by the violation of the initial link in the causal chain. The fix will be pretty complicated. As stated above, P(E|M) can be modeled as a high constant in normal times. The effects social distancing measures have on this distribution are very uneven: while it amounts to a net decrease, some groups are less able to self-isolate. You’ll probably need to retrain the model incorporating current data.
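The “simple arithmetic (or some fancy Monte Carlo simulations)” step can be sketched as follows. This is a hedged toy example: the member probabilities are hypothetical model outputs, and it assumes admissions are independent across members, which a real capacity model would want to check.

```python
# Hypothetical sketch: turn per-member admission probabilities into a
# distribution of total inpatient load via Monte Carlo simulation.
import random

def simulate_load(member_probs, n_sims=10_000, seed=0):
    """For each simulation, flip a biased coin per member and sum admissions.
    Assumes members admit independently with their predicted probability."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for p in member_probs) for _ in range(n_sims)]

# Hypothetical P(I|M) outputs from the flu model for five members.
member_probs = [0.02, 0.10, 0.05, 0.01, 0.30]

totals = simulate_load(member_probs)
expected_load = sum(totals) / len(totals)  # ~= sum(member_probs) = 0.48

# The simulation also yields tail estimates a simple sum cannot, e.g. the
# chance that load exceeds staffed capacity on a bad day.
p_two_or_more = sum(t >= 2 for t in totals) / len(totals)
```

Note that if the model’s P(I|M) values were learned under normal-times P(E|M), every number this simulation produces inherits that stale exposure assumption, which is exactly why this deployment breaks.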
A completely separate use case is to take predictions over a population and identify the individuals most likely to have an acute complication if they were to contract influenza. These days, most people are rightfully cautious about going to the hospital for non-acute care, and that caution is prudent. That said, for those most at risk for flu complications, it is still a good idea to get a flu shot. For most people, knowing whether they are better off getting the flu shot is a tough call, and most will default to staying home. In this case, the causal model stays intact. Even though the probability of an individual being exposed to influenza is not well modeled, in this application it’s not a primary consideration. A care management team leveraging this model is really communicating the second step in the causal chain, P(I|E, M). Even setting the math aside, you can imagine what a care manager communicating with a member might sound like: “While we want you to keep safe, we think that if you were to get the flu this year, you would be at risk for a bad reaction.” A care manager would naturally convey the second step of the causal model without giving it any thought. In this case, the model is doing its job, and its application in this context will be unaffected.
Even if context doesn’t consign your model to the dustbin, that doesn’t mean it should escape scrutiny. In my next article, I’ll discuss data drift and look at how to evaluate whether the underlying conditions of a model have changed to a degree that requires retraining.
Originally published at https://closedloop.ai on October 8, 2020.
Data Science in Black Swan Times, Part 1: Understand your Causal Chain was originally published in Towards Data Science on Medium.