What happens if there is a human in the loop of a data generation process? Let's work through an example.

Imagine you have a data generation process with two independent variables, x1 and x2, and a dependent variable y = f(x1, x2) = x1 + x2. We sample x1 and x2 uniformly from the range [0, 1]:

/blog/2023-08-08_interacion_effect/heatmap_with_scatter_all.png
Uniformly sampled data

If we look at just x1 and y, ignoring x2, we see a strong correlation (for independent, uniformly sampled x1 and x2, the Pearson coefficient is 1/√2 ≈ 0.71):

/blog/2023-08-08_interacion_effect/regression_x1_y_all.png
Regression of y with x1, all data
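To make the uniform case concrete, here is a minimal NumPy sketch of the process described above. The sample size, the seed, and the variable names are illustrative assumptions, not the code behind the plots.

```python
import numpy as np

# Assumed settings for illustration only.
rng = np.random.default_rng(0)
n = 10_000

# Sample both independent variables uniformly from [0, 1].
x1_all = rng.uniform(0.0, 1.0, n)
x2_all = rng.uniform(0.0, 1.0, n)

# The dependent variable from the text: y = f(x1, x2) = x1 + x2.
y_all = x1_all + x2_all

# Pearson correlation between x1 and y, ignoring x2.
r_all = np.corrcoef(x1_all, y_all)[0, 1]
print(f"r(x1, y), uniform sampling: {r_all:.2f}")
```

With this many samples the printed value should land close to the theoretical 1/√2 ≈ 0.71.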

However, imagine there is a human who sets the values of x1 and x2. The human has been instructed to keep y close to 1, and is otherwise free to vary x1 and x2 to achieve some secondary task that we're not concerned with here. What will the data look like? They will lie in the region where y = f(x1, x2) = x1 + x2 ≈ 1:

/blog/2023-08-08_interacion_effect/heatmap_with_scatter_masked.png
Data sampled by human in the loop

Now suppose all we have is a dataset produced by this second process, where a human in the loop sets x1 and x2. If we again look at the correlation of x1 with y, we find essentially none: the human compensates for every change in x1 with an opposite change in x2, so y stays near 1 whatever value x1 takes.

/blog/2023-08-08_interacion_effect/regression_x1_y_masked.png
Regression of y with x1, data sampled by human in the loop
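The human-in-the-loop case can be simulated in a similar way. The sketch below is one possible model of the operator, not necessarily how the plots above were produced: the operator picks x1 freely and then sets x2 so that y ends up within a tolerance `tol` of 1. The tolerance, seed, and sample size are assumed values.

```python
import numpy as np

# Assumed settings for illustration only.
rng = np.random.default_rng(1)
n = 10_000
tol = 0.05  # hypothetical: how far from 1 the operator lets y drift

# The operator chooses x1 freely (kept inside [tol, 1 - tol] so that
# the compensating x2 below stays within [0, 1]).
x1_h = rng.uniform(tol, 1.0 - tol, n)

# The operator then sets x2 to bring y close to 1, with a small error.
err = rng.uniform(-tol, tol, n)
x2_h = 1.0 - x1_h + err

# Same underlying process as before: y = x1 + x2.
y_h = x1_h + x2_h

r_x1_y = np.corrcoef(x1_h, y_h)[0, 1]
r_x1_x2 = np.corrcoef(x1_h, x2_h)[0, 1]
print(f"r(x1, y),  human in the loop: {r_x1_y:.2f}")
print(f"r(x1, x2), human in the loop: {r_x1_x2:.2f}")
```

Under these assumptions, the correlation between x1 and y should come out near zero, while x1 and x2 are almost perfectly anti-correlated. That anti-correlation is the operator's compensation showing up in the data.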

I think this is a reasonably realistic scenario in many industry use cases where a machine is operated by a human and data are collected along the way. A data scientist who obtains the dataset but doesn't talk much with the operator will have a hard time working out whether x1 and x2 are related to y, and may even wrongly conclude that they have no relationship with y at all.