Time series data wrangling

Preview

Time series data wrangling is a preview feature, on by default.

Feature flags:

  • Enable time series data wrangling
  • Enable time series features

Feature engineering is a major component of successfully addressing use case requirements. Wrangling means creating recipes of operations and applying them first to a sample and then, when verified, to the full dataset. By wrangling time-aware data, you can perform time series feature engineering during the data preparation phase. Executing operations like lags and rolling statistics on input data gives you control over which time-based features are generated before modeling. By reviewing the preview that results from adding both time-aware and non-time-aware operations, you can make adjustments before publishing. This avoids having to rerun modeling when the features that would otherwise be generated automatically don't fit your use case.

This page describes the workflow specific to time series wrangling. See also the full data wrangling documentation for more information.

Basic workflow

The following describes the basic workflow:

Note

There is no validation of settings as configuration of operations progresses. This is because partitioning, where validation occurs in conjunction with other experiment settings, happens later in the process.

  1. Connect to Snowflake, using a configured data connection.
  2. From Preview settings, configure a live sample using date/time as the sampling method.
  3. When sampling settings are complete, click Resample to apply the settings.
  4. Add operations from the Recipes panel.

Set date/time sampling

By default, DataRobot retrieves 10000 random rows for the live sample. To use time series wrangling, and more specifically, to configure DataRobot to create a derivation plan for your sample, you must use the date/time sampling method. That is, while you can apply lags and rolling statistics to non-date/time samples, you cannot take advantage of the automation provided by the suggested derivation plan.

Note

When setting the sampling method, the data sample you configure is used only inside wrangling; it is not applied on the full data until the recipe is published.

To configure date/time sampling, open Preview settings and set the Sampling method to date/time.

Then, complete the date/time-specific fields described in the table below:

Field | Description
Sampling method | Set the method to date/time to create a sample that contains the specified latest/earliest rows, ordered by the date/time feature. If you choose Random or First-N Rows, as described here, you will not have access to the automated time series feature derivation plan suggestions.
Rows | Specify the number of rows to retrieve from the source data. Enter a value under 10000. Note that the more rows you retrieve, the longer it will take to render the live sample.
Ordering column | Select the name of the column that contains the primary date/time feature that DataRobot will use to create a sample. Use the same feature you intend to use as the ordering feature when configuring date/time-aware feature transformations.
Strategy | Set which rows are pulled for the sample, either the earliest or the latest. Earliest is the better selection if DataRobot will be suggesting a derivation plan. This is because when the plan is generated, target values are used in the feature reduction process. Given that, using the earliest rows to create modeling data minimizes the likelihood that those rows will be part of validation or holdout later.
Series identifier column | Enter the name of the column that multiseries modeling will use to identify multiple, individual time series datasets. The identifier indicates which series each row belongs to.
Selected series | (Optional) Select one or more series, within the multiseries data, that will be present in the sample.

Complete all fields and click Resample. The live sample updates to display the data selected based on the configuration. You can now add operations.
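
As a rough illustration of what these settings select, the sketch below optionally filters to selected series, orders rows by the ordering column, and keeps the earliest or latest N rows. It is a minimal plain-Python sketch, not DataRobot's implementation; the helper name `datetime_sample` and the column names (`date`, `store`, `sales`) are hypothetical.

```python
from datetime import date


def datetime_sample(rows, ordering_col, n_rows, strategy="earliest",
                    series_col=None, selected_series=None):
    """Return up to n_rows rows ordered by ordering_col (hypothetical helper)."""
    if strategy not in ("earliest", "latest"):
        raise ValueError("strategy must be 'earliest' or 'latest'")
    if n_rows <= 0:
        return []
    # Optionally keep only the series chosen for the sample.
    if series_col is not None and selected_series is not None:
        rows = [r for r in rows if r[series_col] in selected_series]
    # Order by the date/time feature, then take rows from one end.
    ordered = sorted(rows, key=lambda r: r[ordering_col])
    return ordered[:n_rows] if strategy == "earliest" else ordered[-n_rows:]


rows = [
    {"date": date(2024, 1, 3), "store": "A", "sales": 12},
    {"date": date(2024, 1, 1), "store": "A", "sales": 10},
    {"date": date(2024, 1, 2), "store": "B", "sales": 7},
    {"date": date(2024, 1, 2), "store": "A", "sales": 11},
]

earliest_a = datetime_sample(rows, "date", 2, strategy="earliest",
                             series_col="store", selected_series={"A"})
latest_all = datetime_sample(rows, "date", 2, strategy="latest")
```

Here `earliest_a` holds the two oldest rows of series A, while `latest_all` holds the two most recent rows across all series.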

Add operations

Operations are the transformations that will be applied to the source data to prepare it for modeling. The sections below describe the time-aware operations; see the operation reference for information on other operations and working with operations within a recipe. Click + Add operations to begin configuring transformations.

The Lag features and Derive rolling statistics operations can result in more than one feature added to the dataset. The Derive time series features operation, on the other hand, substitutes the input dataset with an output dataset that is expanded by the specified forecast distances for each operation.

When all fields are complete for a given operation, click +Add to recipe. The preview updates based on your changes. A summary of each operation is provided in the Recipe panel.

Derive time series features

When working with time series wrangling, the data is ordered by the specified date column (the "ordering feature") and partitioned by series ID column, if applicable. The feature derivation windows (FDW) are defined in terms of rows. In those windows, the window start is excluded and the end is included. So, the dataset is "reordered" according to date and then, for example, if FDW=5, five rows are used to derive features.
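
The row-based window described above can be sketched as follows. This is an illustrative plain-Python sketch under stated assumptions, not DataRobot's implementation: the helper name `window_rows` and the column names are hypothetical, and returning a shorter window for rows near the start of a series is an assumption, since the text does not say how incomplete windows are handled.

```python
from collections import defaultdict


def window_rows(rows, ordering_col, fdw, series_col=None):
    """Pair each row with the rows in its derivation window (hypothetical helper)."""
    # Partition by series ID, if one is given.
    groups = defaultdict(list)
    for r in rows:
        groups[r[series_col] if series_col else None].append(r)

    pairs = []
    for group in groups.values():
        # Order each partition by the ordering feature.
        ordered = sorted(group, key=lambda r: r[ordering_col])
        for i, row in enumerate(ordered):
            start = i - fdw  # window start: excluded
            # Window end (the current row, index i) is included.
            window = ordered[max(start + 1, 0): i + 1]
            pairs.append((row, window))
    return pairs


rows = [{"day": d, "sales": 10 * d} for d in range(1, 8)]
pairs = window_rows(rows, "day", fdw=5)
last_row, last_window = pairs[-1]
```

For the row with `day` 7 and FDW=5, the window holds the five rows with `day` 3 through 7.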

The Derive time series features operation generates new time series features. To start, you supply fundamental information and DataRobot then expands the data according to forecast distances, adds known in advance columns (if specified) and naive baseline features, and then replaces the original sample. To create new features, you can have DataRobot suggest a derivation plan and automatically create new features, manually add tasks for up to 200 features, or apply a hybrid approach.

Note

If you manually add features and then request and execute a suggested derivation plan, the manual additions will be overwritten by the auto-generated plan.

Set parameters

To configure parameters that set up the ability to use the automated derivation plan, after setting the sampling method to date/time, select the Derive time series features operation. Alternatively, if the sampling method is random or N-rows, you can configure time series wrangling parameters but cannot take advantage of the automation.

The following table describes the parameters to configure:

Field | Description
Target feature | Set the target feature, which will be used for generating naive baseline features (the last known value, possibly seasonal) during feature derivation. The target feature must be numeric.
Ordering feature | Select the name of the column that contains the primary date/time feature that DataRobot will use to order rows during feature transformation.
Series identifier column | Enter the name of the column that multiseries modeling will use to identify multiple, individual time series datasets. The identifier indicates which series each row belongs to.
Forecast distances | Set the relative positions that determine how many rows ahead to predict from each position. Enter one or more integers.
Naive baseline periodicities (rows) | Set one or more integers, which represent the periodicities, expressed as a number of rows. These values are used to calculate naive baseline features. Periodicities assume that the value at a given point in the future will be the same as the last observed value from the same period in the cycle.
Known in advance features | (Optional) Specify any features for which you know the value in advance. This maintains the feature's actual value, making the feature value for the prediction date a predictor.
Rolling median user-defined function (numeric) | (Optional, advanced) Specify the path to a helper function that can improve performance.
Rolling most frequent user-defined function (categorical) | (Optional, advanced) Specify the path to a helper function that can improve performance.

Note that all settings are summarized in the right panel summary. When all fields are complete and correct, click Next.
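
To make the interaction between forecast distances and naive baseline periodicities more concrete, the sketch below expands one series by forecast distance and looks up, for each periodicity, the last observed value from the same position in the cycle. It is a minimal sketch under stated assumptions, not DataRobot's implementation: the function name, the output column names, and the use of `None` for rows without enough history are all hypothetical. It assumes forecast distances and periodicities are positive integers.

```python
import math


def expand_with_naive_baseline(target, forecast_distances, periodicities):
    """Expand one date-ordered series by forecast distance (hypothetical helper)."""
    expanded = []
    for t in range(len(target)):  # t is the forecast point (row index)
        for d in forecast_distances:
            row = {"forecast_point": t, "forecast_distance": d}
            for p in periodicities:
                # The predicted row is t + d. Step back whole cycles of
                # length p until reaching a row that is already observed.
                cycles_back = math.ceil(d / p)
                src = t + d - cycles_back * p
                row[f"naive_baseline_p{p}"] = target[src] if src >= 0 else None
            expanded.append(row)
    return expanded


target = [10, 20, 30, 40]
expanded = expand_with_naive_baseline(target, forecast_distances=[1, 2],
                                      periodicities=[1, 2])
```

With a periodicity of 1, the baseline is the last known value at the forecast point. With a periodicity of 2 and a forecast distance of 1, it is the value one row before the forecast point, because that row sits at the same position in the two-row cycle as the row being predicted.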

Features for derivation

After clicking Next, the Features for derivation panel opens.

The following options are available for creating derived time series features:

# | Method | Description | Sampling
1 | Suggest derivation plan | Have DataRobot suggest a derivation plan and automatically create new features. | Requires date/time
2 | Add features manually | Add up to 200 features and assign tasks. | Any
1+2 | Hybrid approach | Use the suggested derivation plan and then manually add more features to the results. | Requires date/time (for the automated plan)

Suggest derivation plan

Use the Suggest derivation plan option when you have set the sampling method to date/time and want DataRobot to create lags and rolling statistics based on the configuration settings you supply. Click the option and complete the fields:

Field | Description
Feature derivation windows* | Configures the number of rows (periods of data) that DataRobot uses to derive features. Enter one or more values.
Features with minimal derivation | (Optional) Specify features for which only a single lag (the first) is created. You cannot exclude the primary date/time feature or the series ID.
Maximum number of lags per feature | (Optional) Specify the maximum number of lags created by the derivation plan.
Feature reduction threshold | Set the threshold used for feature reduction, which selects the most impactful features. For example, 0.9, the default, means that the features which cumulatively reach 90% of importance are returned. Importance is derived based on SHAP impact calculations.
Exclude low-information features | When enabled, the default, features must pass a "reasonableness" check that determines whether they contain information useful for building a generalizable model.

* The forecast point is part of a feature derivation window; the feature derivation window end is always zero. For this reason, there is no blind history gap in time series data wrangling.
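
The cumulative-importance rule behind the feature reduction threshold can be illustrated with a short sketch. This is a generic plain-Python illustration, not DataRobot's implementation; the function name, the feature names, and the importance scores are made up for the example, and the scores stand in for whatever the SHAP-based calculation produces.

```python
def select_by_cumulative_importance(importances, threshold=0.9):
    """Keep the top features whose normalized importance sums to the threshold."""
    total = sum(importances.values())
    if total <= 0:
        return []
    selected, cumulative = [], 0.0
    # Walk features from most to least important.
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    for name, value in ranked:
        selected.append(name)
        cumulative += value / total
        # Small tolerance guards against floating-point rounding.
        if cumulative >= threshold - 1e-12:
            break
    return selected


# Hypothetical importance scores for four derived features.
importances = {
    "sales_lag1": 50,
    "sales_lag2": 30,
    "price_mean5": 15,
    "promo_lag1": 5,
}
kept = select_by_cumulative_importance(importances, threshold=0.9)
```

In this example the first two features reach 80% of total importance and the third brings the total to 95%, so three features are kept and `promo_lag1` is dropped.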

Once all fields are completed, select Create plan. DataRobot applies operations (tasks) for features.

Click a feature name in the left panel to see the tasks created for it in the middle panel. From there, click Add derivation task to add more tasks to the feature, or click Delete task to remove any tasks you do not want applied to the modeling dataset. In the right panel, tasks-per-feature are added to the summary.

Review the plan:

  • If you are satisfied with the derivation plan, click Add to recipe. The live sample updates after DataRobot retrieves a new sample from the data source and applies the operation, allowing you to review the transformation in real time.
  • If you are not satisfied, click Suggest derivation plan and reset the configuration to different values. The original output will be overwritten by the output from the new plan.

Add features manually

To add features manually, regardless of the sampling method, configure parameters and then toggle on the Add features manually option. Click in the entry box to select a feature to derive and then click Add.

The feature is added to the list in the left panel and derivation configuration becomes available in the center panel. Or, click on any previously added feature in the left panel to open the derivation configuration, for example, if you want to revisit the configuration, and add or remove tasks.

To add a task, from the Task dropdown, make a selection. Each task has its own configuration fields:

Task | Field | Description
Lag | Lags to create | Enter one or more integers, representing the lag order to create for the named feature.
Numeric rolling statistics | Feature derivation window (rows) | Set the number of rows that each rolling window contains. Statistics are calculated in a window that includes the current row.
Numeric rolling statistics | Statistical methods | Set the method: Average, Median, Standard deviation, Minimum, or Maximum.
Categorical rolling statistics | Feature derivation window (rows) | Set the number of rows that each rolling window contains. Statistics are calculated in a window that includes the current row.
Categorical rolling statistics | Statistical methods | This field is preset to Most frequent and cannot be changed.

Click Add derivation task to add more tasks, or Delete task to remove an individual task from the derivation configuration. As you work with features, DataRobot reports the number of tasks (including incompletely configured tasks) in the right and left panels.

When all derivation tasks are configured for a feature, click Add to recipe. The live sample updates after DataRobot retrieves a new sample from the data source and applies the operation, allowing you to review the transformation in real time.

Hybrid approach

With sampling set to date/time, you can use a hybrid approach that leverages the automation of Suggest derivation plan while allowing you to add features that may have been missed due to your configuration. You can also add tasks to features that were transformed.

To add new features after the automated plan creates transformations, either:

  • Toggle on Add features manually to search for features and then configure tasks.
  • Click a feature in the left panel to show the task configuration and click Add derivation task to set new tasks for that feature.

When all feature transformation instructions are complete, click Add to recipe. The live sample updates after DataRobot retrieves a new sample from the data source and applies the operation, allowing you to review the transformation in real time.

Lag features

A lag represents a specific time period between an occurrence and its impact. Lags are important for capturing delayed relationships between features in time series data. The lag measurement (e.g., 3, 7) is implemented based on the time step that is detected in the data. A lag is calculated relative to the current row: the first lag is the previous row, the second lag is two rows back, and so on. Click Lag features to set the lag configuration.

The following table describes the fields:

Field | Description
Feature name | Set the name of the feature to lag.
Ordering feature | Select the name of the column that contains the primary date/time feature that DataRobot will use to order rows during feature transformation.
Series identifier column | Enter the name of the column that multiseries modeling will use to identify multiple, individual time series datasets. The identifier indicates which series each row belongs to.
Lags to create | Enter one or more integers, representing the lag order to create for the named feature.
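
A row-based lag, as described above, can be sketched in a few lines. This is an illustrative plain-Python sketch, not the SQL that DataRobot generates; the helper name `add_lags`, the `<feature>_lag<k>` output column naming, and the use of `None` where no earlier row exists are assumptions. It assumes lag orders are positive integers.

```python
from collections import defaultdict


def add_lags(rows, feature, ordering_col, lags, series_col=None):
    """Add one lag column per requested lag order (hypothetical helper)."""
    # Partition by series so lags never cross series boundaries.
    groups = defaultdict(list)
    for r in rows:
        groups[r[series_col] if series_col else None].append(r)

    out = []
    for group in groups.values():
        ordered = sorted(group, key=lambda r: r[ordering_col])
        for i, row in enumerate(ordered):
            new_row = dict(row)
            for lag in lags:
                # Lag k is the value k rows before the current row.
                src = i - lag
                new_row[f"{feature}_lag{lag}"] = (
                    ordered[src][feature] if src >= 0 else None
                )
            out.append(new_row)
    return out


rows = [
    {"day": 2, "store": "A", "sales": 11},
    {"day": 1, "store": "B", "sales": 5},
    {"day": 1, "store": "A", "sales": 10},
    {"day": 3, "store": "A", "sales": 12},
    {"day": 2, "store": "B", "sales": 6},
]
lagged = add_lags(rows, "sales", "day", lags=[1, 2], series_col="store")
by_key = {(r["store"], r["day"]): r for r in lagged}
```

For store A on day 3, the first lag is the day 2 value and the second lag is the day 1 value. The first row of each series has no earlier row, so its lag columns are empty.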

Derive rolling statistics

Rolling statistics let you calculate statistics over a rolling window composed of a specified number of rows. They can be created for both numeric and categorical features.

Output of the operation is:

  • Numeric: One or several columns, depending on the statistical method specified.
  • Categorical: One column, calculated in a window including the current row.

The following table describes the fields for both numeric and categorical rolling statistics:

Field | Description
Feature name | Set the name of the feature for which to calculate rolling statistics.
Ordering feature | Select the name of the column that contains the primary date/time feature that DataRobot will use to order rows during feature transformation.
Series identifier column | Enter the name of the column that multiseries modeling will use to identify multiple, individual time series datasets. The identifier indicates which series each row belongs to.
Feature derivation window (rows) | Set the number of rows that each rolling window contains. Statistics are calculated in a window that includes the current row.
Statistical methods (numeric) | Set the method: Average, Median, Standard deviation, Minimum, or Maximum.
Statistical methods (categorical) | This field is preset to Most frequent and cannot be changed.
Rolling median user-defined function (numeric) | (Optional) Specify the path to a helper function.
Rolling most frequent user-defined function (categorical) | (Optional) Specify the path to a helper function.
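
The sketch below shows rolling statistics over a row-based window that includes the current row, for one series that is already ordered by date. It is a minimal plain-Python illustration, not DataRobot's implementation. The helper names are hypothetical, and two behaviors are assumptions: rows near the start of the series use a shorter window, and ties for the most frequent value go to the value seen first in the window. Standard deviation is left out because the text does not state whether the sample or population form is used.

```python
import statistics
from collections import Counter


def rolling(values, window, func):
    """Apply func to each window of rows ending at (and including) row i."""
    return [
        func(values[max(i - window + 1, 0): i + 1])
        for i in range(len(values))
    ]


def most_frequent(items):
    """Most common item; ties go to the item encountered first."""
    return Counter(items).most_common(1)[0][0]


sales = [10, 20, 30, 40]
rolling_average = rolling(sales, 3, statistics.fmean)
rolling_median = rolling(sales, 3, statistics.median)
rolling_max = rolling(sales, 3, max)

colors = ["red", "blue", "blue", "red", "red"]
rolling_mode = rolling(colors, 3, most_frequent)
```

Each numeric method produces its own output column, which matches the "one or several columns" output noted above, while the categorical case always yields a single most-frequent column.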

Specify a UDF

Both options provide an optional input field where you can provide a path to user-defined functions (UDFs), which are available as SQL scripts on GitHub. They include user-defined functions/aggregations that compute rolling median and most frequent statistics for different databases. DataRobot recommends using these functions when wrangling with time series operations. They generate SQL that is smaller and faster, without needing additional joins to create windows.

To use the UDFs, download the scripts locally and then provide them in the Rolling median user-defined function or Rolling most frequent user-defined function field, in the format catalog.schema.udf_name.

Considerations

  • For time series wrangling, only Snowflake connections are supported in the UI. Postgres connections are available via the API. Other data sources are under development.

  • The maximum number of output features is 200.

  • Only regression experiments (numeric targets) are supported.

  • Only one derivation plan is allowed. You can add features manually to the plan output if more are needed.

  • Windows defined in terms of time units—for example, days and minutes—are not supported.

  • There is no validation of the data quality on the input dataset.


Updated November 1, 2024