Quick thought: Human in the loop during data generation masks correlation

What happens if there is a human in the loop of a data generation process? This blog post is an illustration of a thought we had during a collaboration with BASF. It is obvious in hindsight but puzzled us for a while when we analyzed process data.

Imagine you have some kind of data generation process involving independent variables x1 and x2, and a dependent variable y = f(x1, x2) = x1 + x2. We sample x1 and x2 uniformly from the range [0, 1]:

/blog/2023-08-08_interaction_effect/heatmap_with_scatter_all.png — Uniformly sampled data

If we look at the correlation of just x1 with y, ignoring x2, we will see a strong correlation:

/blog/2023-08-08_interaction_effect/regression_x1_y_all.png — Regression of y with x1, all data
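To make this concrete, here is a minimal sketch of the uniform case in NumPy. The sample size and seed are arbitrary choices of ours, not taken from the original analysis. For independent uniform x1 and x2, the theoretical correlation between x1 and y = x1 + x2 is 1/sqrt(2), roughly 0.71, and the sample estimate should land close to that.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 10_000                      # arbitrary sample size

# Sample both inputs uniformly and independently from [0, 1].
x1 = rng.uniform(0.0, 1.0, size=n)
x2 = rng.uniform(0.0, 1.0, size=n)
y = x1 + x2

# Pearson correlation of x1 with y, ignoring x2.
r_all = np.corrcoef(x1, y)[0, 1]
print(f"corr(x1, y), uniform sampling: {r_all:.2f}")
```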

However, imagine there is a human who sets the values of x1 and x2. The human has been instructed to keep y close to 1, and is otherwise free to vary x1 and x2 to achieve some secondary task that we're not concerned with here. What will the data look like? They will lie in the region where y = f(x1, x2) = x1 + x2 ≈ 1:

/blog/2023-08-08_interaction_effect/heatmap_with_scatter_masked.png — data sampled by human in the loop

Now assume all we have is a dataset produced by this second process, where a human in the loop sets x1 and x2. If we again look at the correlation of x1 with y, we find essentially no correlation, because the operator compensates every change in x1 with an opposite change in x2, so y stays close to 1:

/blog/2023-08-08_interaction_effect/regression_x1_y_masked.png — Regression of y with x1, data sampled by human in the loop
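The same effect can be reproduced with a rough simulation of the operator. This is a sketch under our own assumptions, not the procedure behind the figures: the operator picks x1 freely and then sets x2 so that y ends up within a hypothetical `tolerance` of 1. Under this simplification x2 can land slightly outside [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 10_000                      # arbitrary sample size
tolerance = 0.05                # hypothetical: how far the operator lets y drift from 1

# The operator chooses x1 freely, then compensates with x2 to hold y near 1.
x1 = rng.uniform(0.0, 1.0, size=n)
drift = rng.uniform(-tolerance, tolerance, size=n)
x2 = 1.0 - x1 + drift           # may fall slightly outside [0, 1]
y = x1 + x2                     # equals 1 + drift, independent of x1

r_masked = np.corrcoef(x1, y)[0, 1]   # x1 vs. y: close to zero
r_inputs = np.corrcoef(x1, x2)[0, 1]  # x1 vs. x2: strongly negative
print(f"corr(x1, y),  human in the loop: {r_masked:.2f}")
print(f"corr(x1, x2), human in the loop: {r_inputs:.2f}")
```

The strong negative correlation between x1 and x2 is the fingerprint of the operator: the inputs are no longer independent, which is what hides the effect of x1 on y.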

I think this is a reasonably realistic scenario in many industrial use cases where a machine is operated by a human and data are collected along the way. A data scientist who obtains the dataset but doesn't talk much with the operator will have a hard time working out whether x1 and x2 are related to y, and may even wrongly conclude that they are not.