We have trained a time series model based on historical data and want to make sure it still performs well on current data before deploying it. Our biggest concern is data quality. It is not uncommon for our data sources to break in strange and mysterious ways. When data breakages occur, they can result in suboptimal performance at best, and embarrassingly meaningless predictions at worst. So, finding them is very important!
In this example, we have taken the publicly available Boston housing dataset and introduced some obvious (and some subtle) breakages that are reflective of what we tend to see.
You are provided with [login to view URL], a pickle file containing:
1. X_test, X_train: the test/train data features
2. preds_test, preds_train: the test/train predictions
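As a starting point, loading and sanity-checking the file might look like the sketch below. It assumes the pickle holds a dict keyed by the four names listed above, which the brief does not confirm. The helper names `load_bundle` and `describe_bundle` and the filename in the usage comment are hypothetical.

```python
import pickle
from pathlib import Path


def load_bundle(path):
    """Load the pickle and return its contents.

    Assumption: the file stores a dict of named objects
    (X_train, X_test, preds_train, preds_test).
    """
    with Path(path).open("rb") as fh:
        bundle = pickle.load(fh)  # only unpickle files from a trusted source
    if not isinstance(bundle, dict):
        raise TypeError(f"expected a dict, got {type(bundle).__name__}")
    return bundle


def describe_bundle(bundle):
    """Return {name: (type name, shape or length)} for a quick overview."""
    summary = {}
    for name, obj in bundle.items():
        # Arrays and DataFrames expose .shape; fall back to len() otherwise.
        shape = getattr(obj, "shape", None)
        if shape is None and hasattr(obj, "__len__"):
            shape = (len(obj),)
        summary[name] = (type(obj).__name__, shape)
    return summary


# Usage (hypothetical filename):
# bundle = load_bundle("data.pkl")
# print(describe_bundle(bundle))
```

Comparing the reported shapes and types of X_train and X_test is a cheap first check, since a changed column count or container type is itself a breakage worth reporting.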
Your task is to find and describe all the changes between X_test and X_train. To do so, you should use a Python Jupyter notebook (with any open-source libraries you like). The final submission should be a notebook that can be run end to end to produce graphs detailing each change. You will be evaluated both on the number of changes found and on the quality of your visualizations.
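One possible way to get a first overview of the changes is a per-feature comparison table. This is a minimal sketch, not a prescribed method: it assumes X_train and X_test can be converted to pandas DataFrames with matching column labels, and it uses a two-sample Kolmogorov-Smirnov test plus the missing-value rate as drift signals. The function name `drift_report` and both thresholds (`alpha`, `nan_tol`) are illustrative choices.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def drift_report(train, test, alpha=0.01, nan_tol=0.05):
    """Compare each shared column of train and test.

    alpha and nan_tol are arbitrary illustrative thresholds.
    """
    train, test = pd.DataFrame(train), pd.DataFrame(test)
    rows = []
    for col in train.columns.intersection(test.columns):
        # Coerce to numeric so stray strings show up as extra NaNs.
        a = pd.to_numeric(train[col], errors="coerce")
        b = pd.to_numeric(test[col], errors="coerce")
        a_ok, b_ok = a.dropna(), b.dropna()
        if len(a_ok) and len(b_ok):
            res = ks_2samp(a_ok, b_ok)
            stat, p_value = res.statistic, res.pvalue
        else:
            stat, p_value = np.nan, np.nan  # nothing left to compare
        rows.append({
            "column": col,
            "train_mean": a.mean(),
            "test_mean": b.mean(),
            "train_nan_frac": a.isna().mean(),
            "test_nan_frac": b.isna().mean(),
            "ks_stat": stat,
            "p_value": p_value,
        })
    report = pd.DataFrame(rows).set_index("column")
    nan_shift = (report["test_nan_frac"] - report["train_nan_frac"]).abs()
    # Flag a column if its distribution or its missing-value rate moved.
    report["flagged"] = (report["p_value"] < alpha) | (nan_shift > nan_tol)
    return report.sort_values("ks_stat", ascending=False)
```

A table like this only narrows the search. Each flagged column still needs its own plot (for example, overlaid train and test histograms), and subtle breakages such as changed units, shuffled columns, or altered feature correlations may need separate checks.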