Automatic Project Flows¶
The APF feature must be enabled. If you do not see the Create Project Flow button at the top of the Projects page, contact your Data Prep System Administrator.
The Data Prep Automatic Project Flows feature (APF) allows you to intelligently operationalize curated data flows. APF computes the entire sequence of data prep steps across projects, datasets, and AnswerSets to produce an automated end-to-end output Flow for your data. Set the Flow to run on a recurring time-based schedule or run it just once to produce an end result AnswerSet. Then manage all runs using APF's monitoring capabilities. Business Analysts and Data Engineers use APF to simplify complex data flows by breaking them into smaller groups of Data Prep projects.
APF lets you operationalize the data flow—each project performs a related or cohesive set of steps for improved readability and limited complexity. After creating projects, you can select the final project in the sequence as your target project. APF takes care of the rest—sequencing, preparing, and automating the entire end-to-end flow without requiring manual stitching.
APF helps teams share data and gather input from business and IT leaders. Team members can build Data Prep projects that depend on output AnswerSets created by others. Members complete their data prep work in their own Data Prep project, and then the entire sequence is operationalized from a single target project. APF takes care of the rest with no manual stitching required, regardless of who creates or owns the projects and AnswerSets. Members of the team can monitor the Flow and view a graph to see how their projects and AnswerSets contribute to the Flow's final output.
In this example, APF produces the end-state "Sales Variance Report" from a series of Data Prep projects and AnswerSets produced by multiple people.
Bob connects to the data lake for his "Product Hierarchy" data, preps it, and produces an AnswerSet that he shares with Susan, who pulls in "Sales Transaction history" data from a cloud application.
Susan preps this data and produces an AnswerSet, which she shares with Varya for the Sales Variance project that she maintains. In addition to the AnswerSet from Susan, Varya also combines data from an Excel report that she pulls in from a cloud storage system.
When Varya is finished with her data prep, she produces a "Sales Variance Report" AnswerSet. She needs to produce this report each week. She clicks Create Project Flow in her Sales Variance project and configures a time-based trigger for running the Flow. APF traverses back through the Flow of related projects, AnswerSets, and datasets to create the dependency chain required to produce the end-state AnswerSet. Varya then uses the APF Monitoring Interface to manage all subsequent runs of the Flow.
Contributors must have permissions to all of the datasets and projects in a Flow before creating it; otherwise, the Flow will not run successfully.
If a contributor has permissions to an AnswerSet but not to the project that produced it, they can still create a Flow up to the point at which they cease to have read permission. This flexibility lets contributors independently operationalize the portions of a Flow they have permission to access.
Contributors must also have permissions to all datasets and projects in the Flow to manage them from the Monitoring Interface. Data Prep System Administrators provide these permissions.
The target project does not include anything produced downstream in the defined Flow. In the previous example, if the "Sales Variance Report" AnswerSet consumes a project, the project is not included in the Flow—the target project is always the end point for a Flow.
Set up a Project Flow¶
To create a Project Flow:
Open your target project, the project that will produce your end-state AnswerSet.
Click Create Project Flow on the top right of the Projects page.
Provide a name and optional description for the Flow, then click Create.
The intelligent automation engine calculates the Flow dependencies and APF displays the Project Flows page where you configure APF. You can also access the Project Flows page when you edit an existing Project Flow.
See Manage Flows for common actions you can take for all Flows.
You configure APF by setting triggers and notifications on the Project Flows page. You can also adjust settings for the Flow's input and output datasets.
The Project Flows page has three tabs where you configure Flow settings:
Use the General tab to update Project Flow details and to add triggers.
On the General tab, you can:
- Update the Name and Description of a Flow that you've created.
- Specify the triggers to run your Flow. The triggers are time- and frequency-based. You can also use the custom option to provide a cron expression for the trigger.
- Provide email addresses for run status. Separate each address with a comma.
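As an illustration of the custom option, a trigger can be expressed as a cron expression. The exact cron dialect supported is an assumption here (confirm with your Data Prep System Administrator); a standard five-field expression looks like:

```
# minute  hour  day-of-month  month  day-of-week
0 6 * * 1       # every Monday at 06:00
0 22 1 * *      # 22:00 on the first day of every month
```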
As soon as a Flow is created, a Project Flow ID displays on the General tab. This ID identifies the Flow for REST API calls and for troubleshooting.
The Inputs tab provides a list of the datasets used in the Flow, the versions of those datasets used to create the Flow, and the projects in which each dataset is used.
On the Inputs tab, you can:
- Specify that a dataset is automatically reimported every time the Flow is run.

  By default, all projects are configured to use the latest version of a dataset saved in the library. However, newer versions of a dataset may be available from the original data source before a new version is manually imported to the Data Prep library. In this case, you can configure a dataset to be automatically reimported from its original data source every time the Flow is run, so that the latest version is always saved in the library. To enable this automatic update, click Reimport dataset on run. When the option is enabled, a Configure Reimport Options button also displays. This button opens the library import pane, where you can change the data source path, query, or export parsing options. These options are saved with the dataset in the library; you only need to configure them if you want to change the current settings.

- Configure the dataset version that each project uses.

  By default, all projects use the latest versions of datasets saved in the library. To change this behavior, click Edit (in the Options for Datasets as used in Projects column) and choose one of the following:

  - Pin to version: The project continues to use the exact dataset version it currently uses.
  - Fail if columns changed: The dataset fails to import into the project if the latest library version has a different layout (schema): for example, new columns are added, columns not used in the project's steps are removed, a column's type changes, or the column order changes.

  If more than one project uses the same dataset as input for the Flow, this is noted in the projects column. Click See All Projects to view all projects that use the dataset and, optionally, to configure a different dataset version per project. For example, one project can use the latest version of the dataset from the library, while another project uses the exact version of the dataset currently saved in the associated version of the project.

- View metadata statistics for the dataset inputs by hovering over a dataset name in the Datasets column. A pop-up window displays the dataset's version, creation date, the user who added it to the library, and the number of columns and rows.
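The Fail if columns changed option amounts to a strict schema comparison. Here is a minimal sketch of that idea (the function, column names, and types are illustrative and not part of the product):

```python
def schema_changed(pinned, latest):
    """Return True if the incoming layout differs from the pinned one.

    A schema here is an ordered list of (column_name, column_type) pairs.
    Any added or removed column, changed type, or reordered column counts
    as a change, even if the project's steps never use that column.
    """
    return pinned != latest

# Same columns and types, but a new column order still counts as a change.
pinned = [("region", "string"), ("sales", "number")]
latest = [("sales", "number"), ("region", "string")]
print(schema_changed(pinned, latest))  # True
```

Because the comparison is ordered and exact, a reordering or type change fails the import just as an added or removed column does.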
The Outputs tab provides a list of all the output AnswerSets that are published from the Flow.
All outputs are configured at the lens level because a publishing lens is always required to create a publishing point from a Data Prep project.
Your Flow may include a project that has multiple lenses, not all of which are required to produce output AnswerSets. By default, only required lenses automatically publish AnswerSets to the library. To publish AnswerSets from lenses that are not required for the Flow, enable them on the Outputs tab.
Lenses that produce output AnswerSets and are required for the Flow can never be disabled.
In addition to adjusting the publishing options for non-essential AnswerSets, you can publish any lens output AnswerSet to an external data source, for example, a database or a cloud storage system. To specify a publish location in addition to the Data Prep library, click Configure Lens and open the Exports pane.
You can take the following actions on the Outputs tab:
- Disable a non-bridging lens to prevent it from publishing AnswerSets to the library. Click the slider adjacent to the lens to disable it.

- Export the published AnswerSet to a data source (in addition to the default library setting). Click Configure Lens for the lens to open the Export pane at the bottom of the page. By default, Data Prep publishes AnswerSets to the Data Prep library. To publish to an external data source, click the dropdown menu for the Export Lens field and select Library and Export. You can then specify the output location details and export parsing options for that AnswerSet.
APF lets you monitor the status of Flows. The key components for generating a Flow's output are Snapshots, Runs, and Chores. The following diagram illustrates how these components fit together; see the following sections for details.
Project Flows page¶
The Project Flows page lists the Flows that you have permissions to view and edit, as well as the current status of the most recent run for each. On this page, you can:
Edit the configuration details for the Flow. Click Edit to open the APF Configuration Interface where you can make adjustments to the configuration. See Configure APF.
Click Run to run the Flow manually. Starting a Flow manually is particularly useful if you need to test out a new Flow or a configuration change to the Flow and you don't want to wait for the time-based trigger to start it.
Show the Snapshots for the Flow. Click Show all Snapshots to open the Snapshots pane.
Click More Actions > Permissions to update the permissions settings and share the Flow with another person. Note that permissions are visible only to the user who created the Flow or to users with whom the creator has shared all of the permissions.
Click More Actions > View latest results to go to the results of the latest run of the Flow. This option does not display until the Flow has run at least once.
The Snapshots page lists the Snapshots for a Flow. Every time a Flow is executed (called a "run" of the Flow), a Snapshot is created to capture the configuration settings used to create the output for the run. Runs continue to use this Snapshot until a configuration change is made to the Flow—for example, a change to the schedule, notifications, inputs, or output settings. At that point, a new Snapshot is created, and it captures the runs executed with the modified configuration settings. Snapshots allow you to audit the exact state of a Project Flow for each run.
APF does not create a new Snapshot if datasets are configured to use the latest version from the library. See the Inputs tab for dataset configuration options.
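The Snapshot behavior described above can be sketched as a small data model. This is purely illustrative (class and field names are invented, not the product's implementation):

```python
class Flow:
    """Toy model: a Flow keeps one Snapshot per configuration state."""

    def __init__(self, config):
        self.config = dict(config)
        self.snapshots = [dict(config)]  # first Snapshot at creation

    def update_config(self, **changes):
        # Any configuration change produces a new Snapshot.
        self.config.update(changes)
        self.snapshots.append(dict(self.config))

    def run(self):
        # Each run executes against the latest Snapshot's settings.
        return self.snapshots[-1]

flow = Flow({"schedule": "weekly"})
first = flow.run()                    # runs under the original Snapshot
flow.update_config(schedule="daily")  # creates a second Snapshot
second = flow.run()                   # runs under the new Snapshot
print(len(flow.snapshots))            # 2
print(second["schedule"])             # daily
```

Runs that happen between configuration changes all attach to the same Snapshot, which is what makes each run auditable against an exact configuration state.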
On this page, you can:
- Click View to open a read-only view of the APF configuration settings for the Snapshot.
- Click Show All Runs to open the Run List page, which details every run for the Snapshot.
The Run List page captures all details for each individual run under a Snapshot. The page lists the number of discrete chores that must be completed to finish the run (for example, publishing a dependency AnswerSet). Every time a Flow is run, a new run entry displays on this page.
To open a read-only view of the APF configuration settings associated with a run, click View.
If there is no change to the data used to create the Flow (for example, all datasets remain exactly the same versions used in the previous run), the APF engine conserves resources and does not rerun the Flow until new data inputs are available.
The APF quotas meter displays at the top of the Flows page to indicate your usage. Hover over one of the counts—Daily, Weekly, or Monthly. A tooltip provides details of your current usage and limits.
Quotas are based on Chore count, and Chores are defined as:
- The running of an individual project that is required to produce a Flow.
- An import (but not a publish) of any dataset or AnswerSet that is required to produce a Flow.
The sum of all Chores ultimately produces the output for your Flow. While a Flow is running, refresh your browser to update the quotas meter on the Flow's page. If you need your Chore count quotas increased, contact your DataRobot Data Prep Administrator or DataRobot Customer Success.
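To make the quota arithmetic concrete, here is a minimal sketch of how a Chore count adds up for a hypothetical Flow (the project and dataset names are invented):

```python
# Hypothetical Flow: three chained projects, plus two datasets configured
# with "Reimport dataset on run".
projects_in_flow = ["Product Hierarchy", "Sales Transactions", "Sales Variance"]
reimported_datasets = ["Product Hierarchy extract", "March Transactions"]

# Each project execution is one Chore; each dataset or AnswerSet import is
# one Chore. Publishing an AnswerSet does not count as a separate Chore.
chore_count = len(projects_in_flow) + len(reimported_datasets)
print(chore_count)  # 5
```

A single run of this hypothetical Flow would therefore consume five Chores against the daily, weekly, and monthly quotas.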
Access tools for managing Flows on the top right of the Project Flows page.
You can manage your saved Flow by:
- Generating a visual graph for a Flow
- Running a Flow manually
- Deleting a Flow
- Updating Flows to use latest project versions
Generate a visual graph for a Flow¶
The Graph button generates an APF graph in a new browser window that displays the datasets and how they flow into the individual projects used to generate a Flow's final output AnswerSet.
Hover over a dataset or project in the Flow to display the corresponding downstream lineage (in pink) and upstream dependencies (in blue).
For example, hovering over the dataset for March 2016 Transactions displays the following:
Hovering over an intermediate project in the Flow—in this example, Customer Loyalty-Women Members—the upstream dependencies display in blue and the downstream lineage displays in pink.
Notice in both examples that datasets and projects that do not participate in the selected portion of the Flow are grayed out in the graph.
You may see a dotted line in a graph for some Flows. The dotted line indicates a looping input: an AnswerSet that was published from a project in the Flow and later consumed again by the same or another project in the Flow.
Run a Flow manually¶
There may be times when you want to manually kick off a run of a Flow without having to wait for its scheduled start time. This can be done from the Actions dropdown menu. Click Run now.
Delete a Flow¶
If you no longer want to keep a saved Flow, you can delete it. Click Actions > Delete. You are prompted to confirm your selection. Note that any AnswerSets that were published to the library as a result of running this Flow will not be deleted as a result of deleting the Flow.
Update a Flow with the latest project versions¶
Every time an action is taken in your project—for example, adding a step, removing a step, or rearranging steps—a new version of your project is created. Each version provides an audit trail of the changes you have made to your data during the course of your data prep work. When creating a Project Flow, the Flow is always pinned to the specific project versions at the time of the Flow's creation. However, you can update a Flow to use the latest version of all projects. This can be done from the Actions dropdown menu. Select Update Project Versions and you are prompted to confirm your selection:
You can choose to overwrite the existing APF or create a new one. If you choose to create a new APF, all triggers are copied to the new APF but they are disabled by default.
The ability to update an existing APF must be enabled. If you do not see the Update All Project Versions window, contact your Data Prep System Administrator to enable this. If this feature is not enabled, a warning message displays, and you can only update the versions if there were no significant changes to the project (for example, no changes to the project's datasets or lenses).
To update a specific project's version—instead of all projects in the Flow—on the Outputs tab, hover over the project for which you want to update the version, then click Update Project Version in the right-hand column.
APF terminology¶
The following terms are specific to APF.
| Term | Definition |
|------|------------|
| Chore | A dataset import or a project execution. The dataset import chore performs a reimport of your dataset through a data source. The project execution chore addresses all other tasks required for the Flow, such as publishing an AnswerSet to the library or exporting an AnswerSet. |
| Flow | A collection of projects that can be run as a unit. One or more frequency-based schedules can be associated with a Flow, which allows a Flow to run on a recurring basis. |
| Inputs | Datasets from the library that are required to run a Flow. |
| Outputs | The AnswerSets written to the library, generated by the run of a Flow. |
| Run | The execution of each of the projects that are required by the target project. The run executes all of the steps from the upstream dependency projects, then writes the resulting AnswerSet(s) to the library. |
| Snapshot | The configuration settings captured for each run of a Flow. Your Data Prep Administrator must enable this feature in your application. |
| Target Project | The Data Prep project from which a Flow is created. Once a Flow is created, all upstream dependencies are automatically calculated by the APF engine. |