Use self-joins with panel data to improve model accuracy

Access this AI accelerator on GitHub

In this accelerator, explore how to implement self-joins in panel data analysis. Regardless of your industry, if you work with panel data, this guide is tailored to help you accelerate feature engineering and extract valuable insights.

Panel data, which records repeated observations of the same subjects over time, is common across many domains. While panel data is often spread across multiple tables, it can also exist in a single dataset that has several features suitable as panel dimensions. The self-join technique enables automated, time-aware feature engineering with just one dataset, generating hundreds of candidate features built from lagged aggregations and statistics. Combining these features within panel dimensions can substantially improve predictive model performance.
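To make the idea concrete, here is a minimal plain-Python sketch of a self-join that produces one lagged aggregation per panel dimension. It is an illustration only, not the accelerator's code or the DataRobot API: the `flights` records, the column names, and the `lagged_mean` helper are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical flight records: one table with two candidate panel dimensions
# ("carrier" and "origin"). Values are invented for illustration.
flights = [
    {"carrier": "AA", "origin": "JFK", "dep_time": datetime(2023, 1, 1, 8), "delay_min": 10},
    {"carrier": "AA", "origin": "JFK", "dep_time": datetime(2023, 1, 1, 12), "delay_min": 50},
    {"carrier": "AA", "origin": "BOS", "dep_time": datetime(2023, 1, 2, 9), "delay_min": 30},
    {"carrier": "DL", "origin": "JFK", "dep_time": datetime(2023, 1, 2, 10), "delay_min": 0},
]


def lagged_mean(rows, key, value, time_col, window):
    """Join rows to themselves on `key` and average `value` over strictly
    earlier rows that fall inside the lookback `window`."""
    features = []
    for row in rows:
        start = row[time_col] - window
        history = [
            other[value]
            for other in rows
            if other[key] == row[key] and start <= other[time_col] < row[time_col]
        ]
        features.append(sum(history) / len(history) if history else None)
    return features


# Same dataset, two different join keys: each self-join yields its own feature.
carrier_lag = lagged_mean(flights, "carrier", "delay_min", "dep_time", timedelta(days=2))
origin_lag = lagged_mean(flights, "origin", "delay_min", "dep_time", timedelta(days=2))
```

Each call treats a different column as the panel dimension, so one table yields several families of lagged features. Varying the window or the aggregation function multiplies the candidates further.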

The accelerator focuses on predicting airline take-off delays of 30 minutes or more to illustrate the self-join technique. However, this framework applies broadly across verticals and can easily be adapted to your use case. Starting from a single dataset, the accelerator joins it to itself four times across different features, engineers time-based features from each join, and uses the AI Catalog for data management.

The accelerator covers data preparation with multiple joins and time horizons, and shows how to mitigate target leakage by using multiple feature lists and time gaps in time-aware joins.
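The role of a time gap can be sketched in a few lines of plain Python. This is an assumption-laden illustration rather than the accelerator's implementation: the `history` records and the `delayed_rate` helper are hypothetical, and the gap stands in for the period during which recent outcomes would not yet be known at prediction time.

```python
from datetime import datetime, timedelta

# Hypothetical history for one carrier: (departure time, delayed 30+ minutes?).
history = [
    (datetime(2023, 1, 1, 8), 1),
    (datetime(2023, 1, 1, 20), 0),
    (datetime(2023, 1, 2, 7), 1),
]


def delayed_rate(history, prediction_time, window, gap):
    """Share of delayed flights in the half-open interval
    [prediction_time - gap - window, prediction_time - gap)."""
    end = prediction_time - gap
    start = end - window
    flags = [flag for ts, flag in history if start <= ts < end]
    return sum(flags) / len(flags) if flags else None


prediction_time = datetime(2023, 1, 2, 9)

# Without a gap, the 07:00 flight from two hours earlier feeds the feature.
no_gap = delayed_rate(history, prediction_time, timedelta(days=2), timedelta(0))

# With a 6-hour gap, that recent flight is excluded from the lookback window.
with_gap = delayed_rate(history, prediction_time, timedelta(days=2), timedelta(hours=6))
```

Shifting the end of the lookback window back by the gap keeps records that are too recent out of the aggregation, which is one way to reduce the risk of target leakage.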

Panel data analysis unlocks valuable insights into how subjects evolve over time, yet it is often overlooked when all of the data lives in a single dataset.


Updated October 17, 2023