Features¶
Features represent the columns in your dataset that DataRobot uses for modeling. Each feature has properties such as type, statistics, and importance that help you understand your data and make informed modeling decisions. This page describes how to work with features in your projects.
Retrieve features¶
You can retrieve all features from a project or get a specific feature by name.
Get all features¶
To retrieve all features from a project, use Project.get_features():
>>> import datarobot as dr
>>> project = dr.Project.get('5e3c94aff86f2d10692497b5')
>>> features = project.get_features()
>>> len(features)
21
>>> features[0].name
'Partition'
>>> features[0].feature_type
'Numeric'
You can also iterate through features using Project.iterate_features():
>>> from itertools import islice
>>> feature_iterator = project.iterate_features(offset=0, limit=10)
>>> for feature in islice(feature_iterator, 5):
...     print(feature.name, feature.feature_type)
Partition Numeric
CustomerID Categorical
Age Numeric
Income Numeric
Education Categorical
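If you want to process features in fixed-size chunks rather than one at a time, you can wrap any iterator with a small batching helper. This is a minimal sketch and not part of the DataRobot client: `in_batches` is a hypothetical helper, and the list of names is a stand-in for the objects returned by Project.iterate_features().

```python
from itertools import islice


def in_batches(iterator, size):
    """Yield lists of up to `size` items from any iterable, one batch at a time."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for project.iterate_features(); any iterator works here.
names = iter(["Partition", "CustomerID", "Age", "Income", "Education"])
for batch in in_batches(names, 2):
    print(batch)
```

The last batch may be shorter than `size`, so downstream code should not assume a fixed length.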
Get a specific feature¶
To retrieve a single feature by name, use Feature.get():
>>> feature = dr.Feature.get(project_id=project.id, feature_name='Age')
>>> feature.name
'Age'
>>> feature.feature_type
'Numeric'
>>> feature.project_id
'5e3c94aff86f2d10692497b5'
Explore feature properties¶
Each feature object contains detailed information about the feature's characteristics and distribution.
Basic feature information¶
>>> feature = dr.Feature.get(project_id=project.id, feature_name='Age')
>>> feature.name
'Age'
>>> feature.feature_type
'Numeric'
>>> feature.id
12345
>>> feature.project_id
'5e3c94aff86f2d10692497b5'
Feature statistics¶
For numeric features, you can access summary statistics from the EDA sample:
>>> numeric_feature = dr.Feature.get(project_id=project.id, feature_name='Income')
>>> numeric_feature.min
25000.0
>>> numeric_feature.max
150000.0
>>> numeric_feature.mean
67500.0
>>> numeric_feature.median
65000.0
>>> numeric_feature.std_dev
18500.5
For date features, summary statistics are expressed as ISO-8601 formatted date strings:
>>> date_feature = dr.Feature.get(project_id=project.id, feature_name='TransactionDate')
>>> date_feature.min
'2020-01-01T00:00:00Z'
>>> date_feature.max
'2023-12-31T23:59:59Z'
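Because date statistics are returned as strings, they need to be parsed before you can do arithmetic on them. The following is a minimal sketch that assumes the strings follow the format shown above. `parse_iso_utc` is a hypothetical helper, not a client method, and the two values are copied from the example output rather than fetched from a project.

```python
from datetime import datetime


def parse_iso_utc(value):
    """Parse a string such as '2020-01-01T00:00:00Z' into a timezone-aware datetime."""
    # datetime.fromisoformat() accepts a trailing 'Z' only from Python 3.11 on,
    # so replace it with an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


# Stand-ins for date_feature.min and date_feature.max.
earliest = parse_iso_utc("2020-01-01T00:00:00Z")
latest = parse_iso_utc("2023-12-31T23:59:59Z")

span = latest - earliest
print(f"EDA sample covers {span.days} whole days")
```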
Feature data quality¶
To check data quality metrics for features:
>>> feature = dr.Feature.get(project_id=project.id, feature_name='Email')
>>> feature.unique_count
1250
>>> feature.na_count
5
>>> feature.low_information
False
>>> feature.importance
0.85
The importance attribute is a numeric measure of the strength of the relationship between the feature and the target, independent of any model. This value may be None for non-modeling features such as partition columns.
Target leakage detection¶
To check if a feature has target leakage:
>>> feature = dr.Feature.get(project_id=project.id, feature_name='CustomerID')
>>> feature.target_leakage
'FALSE'
The target_leakage attribute takes one of the following values:
- FALSE: No target leakage detected.
- MODERATE: Moderate risk of target leakage.
- HIGH_RISK: High risk of target leakage.
- SKIPPED_DETECTION: Target leakage detection was not run on this feature.
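To act on these values across a whole project, you can collect the features flagged as MODERATE or HIGH_RISK. The sketch below is illustrative only: `features_with_leakage_risk` is a hypothetical helper, and the sample objects (including their names) are invented stand-ins that expose only the `name` and `target_leakage` attributes.

```python
from types import SimpleNamespace

# Values listed above that indicate a detected leakage risk.
RISKY_LEAKAGE_VALUES = {"MODERATE", "HIGH_RISK"}


def features_with_leakage_risk(features):
    """Return the names of features whose target_leakage value signals risk."""
    return [f.name for f in features if f.target_leakage in RISKY_LEAKAGE_VALUES]


# Hypothetical stand-ins for Feature objects; in practice pass project.get_features().
sample = [
    SimpleNamespace(name="CustomerID", target_leakage="FALSE"),
    SimpleNamespace(name="RefundIssued", target_leakage="HIGH_RISK"),
    SimpleNamespace(name="Partition", target_leakage="SKIPPED_DETECTION"),
    SimpleNamespace(name="LastContact", target_leakage="MODERATE"),
]

print(features_with_leakage_risk(sample))
```

Note that SKIPPED_DETECTION is treated as "unknown" here, not as "safe", so those features are simply not reported.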
Time series eligibility¶
For time series projects, you can check whether a feature can be used as the datetime partition column:
>>> date_feature = dr.Feature.get(project_id=project.id, feature_name='Date')
>>> date_feature.time_series_eligible
True
>>> date_feature.time_series_eligibility_reason
'Suitable for use as datetime partition column'
>>> date_feature.time_step
1
>>> date_feature.time_unit
'DAY'
Get feature histograms¶
Histograms provide a visual representation of feature distributions. To retrieve histogram data for any feature:
>>> feature = dr.Feature.get(project_id=project.id, feature_name='Age')
>>> histogram = feature.get_histogram()
>>> histogram.plot
[{'count': 150, 'target': None, 'label': '18-25'},
 {'count': 320, 'target': None, 'label': '26-35'},
 {'count': 450, 'target': None, 'label': '36-45'},
 {'count': 280, 'target': None, 'label': '46-55'},
 {'count': 100, 'target': None, 'label': '56+'}]
You can specify the maximum number of bins:
>>> histogram = feature.get_histogram(bin_limit=20)
>>> len(histogram.plot)
20
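Raw counts are often easier to read as shares of the sample. The following sketch converts a list of bins into fractions. `bin_shares` is a hypothetical helper, and `plot` is copied from the five-bin example output earlier in this section as a stand-in for `histogram.plot`.

```python
def bin_shares(plot):
    """Return (label, share) pairs, where share is the bin's fraction of all counted rows."""
    total = sum(bin_["count"] for bin_ in plot)
    if total == 0:
        # Avoid dividing by zero for an empty or all-zero histogram.
        return [(bin_["label"], 0.0) for bin_ in plot]
    return [(bin_["label"], bin_["count"] / total) for bin_ in plot]


# Stand-in for histogram.plot, copied from the example output.
plot = [
    {"count": 150, "target": None, "label": "18-25"},
    {"count": 320, "target": None, "label": "26-35"},
    {"count": 450, "target": None, "label": "36-45"},
    {"count": 280, "target": None, "label": "46-55"},
    {"count": 100, "target": None, "label": "56+"},
]

for label, share in bin_shares(plot):
    print(f"{label}: {share:.1%}")
```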
Work with feature lists¶
Feature lists are collections of features used for modeling. You can retrieve feature lists from a project and examine which features they contain.
Get project feature lists¶
>>> project = dr.Project.get('5e3c94aff86f2d10692497b5')
>>> featurelists = project.get_featurelists()
>>> featurelists
[Featurelist('Raw Features'),
Featurelist('Informative Features'),
Featurelist('universe')]
Examine features in a feature list¶
>>> raw_features = project.get_featurelists()[0]
>>> raw_features.features[:6]
['Partition', 'CustomerID', 'Age', 'Income', 'Education', 'Email']
>>> len(raw_features.features)
21
Create a custom feature list¶
You can create custom feature lists from a subset of available features:
>>> all_features = project.get_features()
>>> selected_feature_names = [f.name for f in all_features if f.feature_type == 'Numeric']
>>> custom_featurelist = project.create_featurelist(
...     name='Numeric Features Only',
...     features=selected_feature_names
... )
>>> custom_featurelist
Featurelist('Numeric Features Only')
Analyze categorical features¶
For categorical features, you can access additional insights about the distribution of categories.
Get key summary for categorical features¶
For summarized categorical features, you can retrieve statistics for the top keys:
>>> categorical_feature = dr.Feature.get(project_id=project.id, feature_name='ProductCategory')
>>> key_summary = categorical_feature.key_summary
>>> key_summary[0]
{'key': 'Electronics',
'summary': {'min': 0, 'max': 29815.0, 'stdDev': 6498.029, 'mean': 1490.75,
'median': 0.0, 'pctRows': 5.0}}
The key summary provides statistics for the top 50 keys, including:
- min: Minimum value of the key.
- max: Maximum value of the key.
- mean: Mean value of the key.
- median: Median value of the key.
- stdDev: Standard deviation of the key.
- pctRows: Percentage occurrence of key in the EDA sample.
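Since each entry is a plain dictionary, you can rank keys with standard Python tools. The sketch below orders keys by pctRows. `top_keys_by_coverage` is a hypothetical helper. The first sample entry is copied from the example output above, and the other two are invented for illustration and trimmed to the one field the helper reads.

```python
def top_keys_by_coverage(key_summary, n=3):
    """Return up to n (key, pctRows) pairs, highest pctRows first."""
    ranked = sorted(
        key_summary,
        key=lambda entry: entry["summary"]["pctRows"],
        reverse=True,
    )
    return [(entry["key"], entry["summary"]["pctRows"]) for entry in ranked[:n]]


# Stand-in for categorical_feature.key_summary.
sample_summary = [
    {"key": "Electronics",
     "summary": {"min": 0, "max": 29815.0, "stdDev": 6498.029, "mean": 1490.75,
                 "median": 0.0, "pctRows": 5.0}},
    {"key": "Clothing", "summary": {"pctRows": 12.5}},  # invented entry
    {"key": "Books", "summary": {"pctRows": 8.0}},      # invented entry
]

for key, pct in top_keys_by_coverage(sample_summary, n=2):
    print(f"{key}: {pct}% of rows")
```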
Analyze multicategorical features¶
For multicategorical features, you can retrieve specialized insights about label relationships.
Get a multicategorical histogram¶
>>> multicat_feature = dr.Feature.get(project_id=project.id, feature_name='Tags')
>>> histogram = multicat_feature.get_multicategorical_histogram()
>>> histogram
MulticategoricalHistogram(...)
Get pairwise correlations¶
Analyze correlations between labels in a multicategorical feature:
>>> correlations = multicat_feature.get_pairwise_correlations()
>>> correlations
PairwiseCorrelations(...)
Get pairwise joint probabilities¶
>>> joint_probs = multicat_feature.get_pairwise_joint_probabilities()
>>> joint_probs
PairwiseJointProbabilities(...)
Get pairwise conditional probabilities¶
>>> cond_probs = multicat_feature.get_pairwise_conditional_probabilities()
>>> cond_probs
PairwiseConditionalProbabilities(...)
Time series feature properties¶
For time series projects, you can retrieve additional properties for features when used with multiseries or cross-series configurations.
Get multiseries properties¶
Retrieve time series properties for a potential multiseries datetime partition column:
>>> date_feature = dr.Feature.get(project_id=project.id, feature_name='Date')
>>> properties = date_feature.get_multiseries_properties(
...     multiseries_id_columns=['StoreID']
... )
>>> properties
{'time_series_eligible': True,
'time_unit': 'DAY',
'time_step': 1}
Get cross-series properties¶
To retrieve cross-series properties for multiseries ID columns:
>>> multiseries_feature = dr.Feature.get(project_id=project.id, feature_name='StoreID')
>>> properties = multiseries_feature.get_cross_series_properties(
...     datetime_partition_column='Date',
...     cross_series_group_by_columns=['Region']
... )
>>> properties
{'name': 'StoreID',
'eligibility': 'Eligible as cross-series group-by column',
'isEligible': True}
Filter and search features¶
You can filter features by various criteria to find specific features of interest.
Filter by feature type¶
>>> all_features = project.get_features()
>>> numeric_features = [f for f in all_features if f.feature_type == 'Numeric']
>>> categorical_features = [f for f in all_features if f.feature_type == 'Categorical']
>>> text_features = [f for f in all_features if f.feature_type == 'Text']
Find features with missing data¶
>>> features_with_missing = [f for f in project.get_features()
...                          if f.na_count is not None and f.na_count > 0]
>>> for feature in features_with_missing:
...     print(f"{feature.name}: {feature.na_count} missing values")
Email: 5 missing values
Phone: 12 missing values
Find low-information features¶
>>> low_info_features = [f for f in project.get_features() if f.low_information]
>>> [f.name for f in low_info_features]
['ConstantColumn', 'SingleValueColumn']
Find features by importance threshold¶
>>> important_features = [f for f in project.get_features()
...                       if f.importance is not None and f.importance > 0.5]
>>> sorted_features = sorted(important_features, key=lambda x: x.importance, reverse=True)
>>> for feature in sorted_features[:5]:
...     print(f"{feature.name}: {feature.importance:.3f}")
Income: 0.892
Age: 0.756
Education: 0.643
Common workflows¶
Analyze all features in a project¶
>>> project = dr.Project.get('5e3c94aff86f2d10692497b5')
>>> features = project.get_features()
>>>
>>> print(f"Total features: {len(features)}")
>>> print(f"Feature types: {set(f.feature_type for f in features)}")
>>>
>>> for feature in features:
...     print(f"\n{feature.name} ({feature.feature_type}):")
...     if feature.na_count is not None:
...         print(f" Missing values: {feature.na_count}")
...     if feature.importance is not None:
...         print(f" Importance: {feature.importance:.3f}")
...     if feature.target_leakage != 'SKIPPED_DETECTION':
...         print(f" Target leakage: {feature.target_leakage}")
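For a more compact overview than a per-feature printout, you can aggregate the same attributes. The sketch below counts features per type and lists those with missing values. `summarize_features` is a hypothetical helper, and the sample objects are invented stand-ins that expose only the attributes the helper reads.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_features(features):
    """Return (count per feature type, names of features with missing values)."""
    type_counts = Counter(f.feature_type for f in features)
    # A falsy na_count (None or 0) means no known missing values.
    with_missing = [f.name for f in features if f.na_count]
    return type_counts, with_missing


# Hypothetical stand-ins for Feature objects; in practice pass project.get_features().
sample = [
    SimpleNamespace(name="Age", feature_type="Numeric", na_count=0),
    SimpleNamespace(name="Income", feature_type="Numeric", na_count=None),
    SimpleNamespace(name="Email", feature_type="Text", na_count=5),
    SimpleNamespace(name="Education", feature_type="Categorical", na_count=0),
]

type_counts, with_missing = summarize_features(sample)
print(dict(type_counts))
print(with_missing)
```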
Export feature information to a DataFrame¶
>>> import pandas as pd
>>>
>>> features = project.get_features()
>>> feature_data = []
>>> for feature in features:
...     feature_data.append({
...         'name': feature.name,
...         'type': feature.feature_type,
...         'importance': feature.importance,
...         'missing_count': feature.na_count,
...         'unique_count': feature.unique_count,
...         'low_information': feature.low_information,
...         'target_leakage': feature.target_leakage
...     })
>>>
>>> df = pd.DataFrame(feature_data)
>>> df.to_csv('feature_summary.csv', index=False)
Compare features across projects¶
>>> project1 = dr.Project.get('5e3c94aff86f2d10692497b5')
>>> project2 = dr.Project.get('5e3c94aff86f2d10692497b6')
>>>
>>> features1 = {f.name: f for f in project1.get_features()}
>>> features2 = {f.name: f for f in project2.get_features()}
>>>
>>> common_features = set(features1.keys()) & set(features2.keys())
>>> print(f"Common features: {len(common_features)}")
>>>
>>> for feature_name in common_features:
...     f1 = features1[feature_name]
...     f2 = features2[feature_name]
...     if f1.feature_type != f2.feature_type:
...         print(f"{feature_name}: type mismatch ({f1.feature_type} vs {f2.feature_type})")
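The comparison above reports only type mismatches among shared features. A slightly fuller sketch also reports features that exist in only one project. `compare_feature_types` is a hypothetical helper that works on plain `{name: feature_type}` dictionaries, and the two sample mappings are invented for illustration.

```python
def compare_feature_types(types_a, types_b):
    """Compare two {feature name: feature type} mappings."""
    names_a, names_b = set(types_a), set(types_b)
    return {
        "only_in_first": sorted(names_a - names_b),
        "only_in_second": sorted(names_b - names_a),
        "type_mismatches": {
            name: (types_a[name], types_b[name])
            for name in sorted(names_a & names_b)
            if types_a[name] != types_b[name]
        },
    }


# In practice, build each mapping with
# {f.name: f.feature_type for f in project.get_features()}.
first = {"Age": "Numeric", "Income": "Numeric", "Email": "Text"}
second = {"Age": "Categorical", "Income": "Numeric", "Phone": "Text"}

report = compare_feature_types(first, second)
print(report)
```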
Considerations¶
- Feature statistics (min, max, mean, median, std_dev) are calculated from the EDA sample data.
- For non-numeric features or features created prior to summary statistics becoming available, their values will be None.
- The importance attribute is independent of any model and measures the relationship strength between the feature and target.
- In time series projects, Feature objects represent input features, while ModelingFeature objects represent features used for modeling after partitioning.
- Feature histograms are based on the EDA sample and may not reflect the full dataset distribution.