Create and manage datasets¶
To create a project and begin modeling, you must first upload your data to DataRobot and prepare a dataset.
Create a dataset¶
There are several ways to create a dataset.
Dataset.upload accepts a path to a local file, a URL to an external file, a streamable file object, or a pandas DataFrame.
>>> import datarobot as dr
>>> # Upload a local file
>>> dataset_one = dr.Dataset.upload("./data/examples.csv")
>>> # Create a dataset with a URL
>>> dataset_two = dr.Dataset.upload("https://raw.githubusercontent.com/curran/data/gh-pages/dbpedia/cities/data.csv")
>>> # Create a dataset using a pandas DataFrame
>>> dataset_three = dr.Dataset.upload(my_df)
>>> # Create a dataset from an open file object
>>> with open("./data/examples.csv", "rb") as file_pointer:
...     dataset_four = dr.Dataset.create_from_file(filelike=file_pointer)
Dataset.create_from_file can take either a path to a local file or any streamable file object.
>>> import datarobot as dr
>>> dataset = dr.Dataset.create_from_file(file_path='data_dir/my_data.csv')
>>> with open('data_dir/my_data.csv', 'rb') as f:
...     other_dataset = dr.Dataset.create_from_file(filelike=f)
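Because create_from_file takes either file_path or filelike, a small helper can choose the right keyword for a given input. The following is a minimal sketch: file_source_kwargs is a hypothetical helper name, not part of the DataRobot client, and it only builds the keyword arguments shown above.

```python
import io
import os


def file_source_kwargs(source):
    """Return the keyword argument for Dataset.create_from_file.

    Hypothetical helper (not part of the datarobot package): paths map to
    ``file_path`` and objects with a ``read`` method map to ``filelike``.
    """
    if isinstance(source, (str, os.PathLike)):
        return {"file_path": os.fspath(source)}
    if hasattr(source, "read"):
        return {"filelike": source}
    raise TypeError("expected a path or a readable file object")


# Both kinds of input resolve to exactly one keyword argument.
path_kwargs = file_source_kwargs("data_dir/my_data.csv")
stream = io.BytesIO(b"a,b\n1,2\n")
stream_kwargs = file_source_kwargs(stream)
```

With a configured client, the result could be passed along as dr.Dataset.create_from_file(**file_source_kwargs(source)).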
Dataset.create_from_in_memory_data creates a dataset from either a pandas.DataFrame or a list of dictionaries representing rows of data.
The dictionaries representing the rows must all contain the same keys.
>>> import pandas as pd
>>> data_frame = pd.read_csv('data_dir/my_data.csv')
>>> pandas_dataset = dr.Dataset.create_from_in_memory_data(data_frame=data_frame)
>>> in_memory_data = [{'key1': 'value', 'key2': 'other_value', ...},
... {'key1': 'new_value', 'key2': 'other_new_value', ...}, ...]
>>> in_memory_dataset = dr.Dataset.create_from_in_memory_data(records=in_memory_data)
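Since every row dictionary must contain the same keys, it can help to check the records locally before uploading them. The sketch below shows one way to do that; check_uniform_keys is a hypothetical pre-flight helper, not part of the DataRobot client, and the sample rows are illustrative.

```python
def check_uniform_keys(records):
    """Raise ValueError if any record's keys differ from the first record's.

    Hypothetical pre-flight check, not part of the datarobot package.
    """
    if not records:
        raise ValueError("records is empty")
    expected = set(records[0])
    for index, row in enumerate(records):
        if set(row) != expected:
            # Report the keys that are missing or unexpected in this row.
            difference = sorted(set(row) ^ expected)
            raise ValueError(f"row {index} has mismatched keys: {difference}")
    return records


good_rows = [
    {"key1": "value", "key2": "other_value"},
    {"key1": "new_value", "key2": "other_new_value"},
]
bad_rows = [{"key1": "value", "key2": "other_value"}, {"key1": "new_value"}]

checked = check_uniform_keys(good_rows)
try:
    check_uniform_keys(bad_rows)
    mismatch_message = None
except ValueError as error:
    mismatch_message = str(error)
```

The validated list could then be passed as records=check_uniform_keys(in_memory_data).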
Dataset.create_from_url takes CSV data from a URL. If you have set DISABLE_CREATE_SNAPSHOT_DATASOURCE, you must set do_snapshot=False.
>>> url_dataset = dr.Dataset.create_from_url('https://s3.amazonaws.com/my_data/my_dataset.csv',
... do_snapshot=False)
Dataset.create_from_data_source takes data from a data source.
If you have set DISABLE_CREATE_SNAPSHOT_DATASOURCE, you must set do_snapshot=False.
>>> data_source_dataset = dr.Dataset.create_from_data_source(data_source.id, do_snapshot=False)
or
>>> data_source_dataset = data_source.create_dataset(do_snapshot=False)
Use datasets¶
After creating a dataset, you can create Projects from it and begin training models.
You can also combine project creation and a dataset upload in one method using Project.create.
However, with this method the data is accessible only to the project that created it.
>>> project = dataset.create_project(project_name='New Project')
>>> project.analyze_and_model('some target')
Project(New Project)
Get information from a dataset¶
The dataset object contains some basic information that you can query, as shown in the snippet below.
>>> dataset.id
u'5e31cdac39782d0f65842518'
>>> dataset.name
u'my_data.csv'
>>> dataset.categories
["TRAINING", "PREDICTION"]
>>> dataset.created_at
datetime.datetime(2020, 2, 7, 16, 51, 10, 311000, tzinfo=tzutc())
The snippet below outlines several methods available to retrieve details from a dataset.
# Details
>>> details = dataset.get_details()
>>> details.last_modification_date
datetime.datetime(2020, 2, 7, 16, 51, 10, 311000, tzinfo=tzutc())
>>> details.feature_count_by_type
[FeatureTypeCount(count=1, feature_type=u'Text'),
FeatureTypeCount(count=1, feature_type=u'Boolean'),
FeatureTypeCount(count=16, feature_type=u'Numeric'),
FeatureTypeCount(count=3, feature_type=u'Categorical')]
>>> details.to_dataset().id == details.dataset_id
True
# Projects
>>> dr.Project.create_from_dataset(dataset.id, project_name='Project One')
Project(Project One)
>>> dr.Project.create_from_dataset(dataset.id, project_name='Project Two')
Project(Project Two)
>>> dataset.get_projects()
[ProjectLocation(url=u'https://app.datarobot.com/api/v2/projects/5e3c94aff86f2d10692497b5/', id=u'5e3c94aff86f2d10692497b5'),
ProjectLocation(url=u'https://app.datarobot.com/api/v2/projects/5e3c94eb9525d010a9918ec1/', id=u'5e3c94eb9525d010a9918ec1')]
>>> first_id = dataset.get_projects()[0].id
>>> dr.Project.get(first_id).project_name
'Project One'
# Features
>>> all_features = dataset.get_all_features()
>>> feature = next(dataset.iterate_all_features(offset=2, limit=1))
>>> feature.name == all_features[2].name
True
>>> print(feature.name, feature.feature_type, feature.dataset_id)
Partition Numeric 5e31cdac39782d0f65842518
>>> feature.get_histogram().plot
[{'count': 3522, 'target': None, 'label': u'0.0'},
{'count': 3521, 'target': None, 'label': u'1.0'}, ... ]
# The raw data
>>> with open('myfile.csv', 'wb') as f:
...     dataset.get_file(filelike=f)
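The feature_count_by_type list shown above can be condensed into a dictionary for quick lookups. This is a minimal sketch that assumes each item exposes count and feature_type as attributes, as the printed representation suggests; the FeatureTypeCount defined here is a local stand-in, and counts_by_type is a hypothetical helper name.

```python
from collections import namedtuple

# Local stand-in for the FeatureTypeCount items shown above; in practice the
# list would come from dataset.get_details().feature_count_by_type.
FeatureTypeCount = namedtuple("FeatureTypeCount", ["count", "feature_type"])


def counts_by_type(feature_counts):
    """Map each feature type to its count."""
    return {item.feature_type: item.count for item in feature_counts}


sample = [
    FeatureTypeCount(count=1, feature_type="Text"),
    FeatureTypeCount(count=1, feature_type="Boolean"),
    FeatureTypeCount(count=16, feature_type="Numeric"),
    FeatureTypeCount(count=3, feature_type="Categorical"),
]
by_type = counts_by_type(sample)
total_features = sum(by_type.values())
```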
Retrieve datasets¶
You can retrieve specific datasets, a list of all datasets, or an iterator that retrieves all or some datasets.
>>> dataset_id = '5e387c501a438646ed7bf0f2'
>>> dataset = dr.Dataset.get(dataset_id)
>>> dataset.id == dataset_id
True
# A blocking call that returns all datasets
>>> dr.Dataset.list()
[Dataset(name=u'Untitled Dataset', id=u'5e3c51e0f86f2d1087249728'),
Dataset(name=u'my_data.csv', id=u'5e3c2028162e6a5fe9a0d678'), ...]
# Avoid listing datasets that fail to properly upload
>>> dr.Dataset.list(filter_failed=True)
[Dataset(name=u'my_data.csv', id=u'5e3c2028162e6a5fe9a0d678'),
Dataset(name=u'my_other_data.csv', id=u'3efc2428g62eaa5f39a6dg7a'), ...]
# An iterator that lazily retrieves from the server page-by-page
>>> from itertools import islice
>>> iterator = dr.Dataset.iterate(offset=2)
>>> for element in islice(iterator, 3):
...     print(element)
Dataset(name='some_data.csv', id='5e8df2f21a438656e7a23d12')
Dataset(name='other_data.csv', id='5e8df2e31a438656e7a23d0b')
Dataset(name='Untitled Dataset', id='5e6127681a438666cc73c2b0')
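To illustrate what lazy, page-by-page retrieval means, the sketch below implements a generic offset-based paging generator against an in-memory stand-in for a server. It is not the DataRobot client's implementation; iterate_pages, fake_fetch, the page size, and the file names are all hypothetical.

```python
from itertools import islice


def iterate_pages(fetch_page, offset=0, page_size=2):
    """Yield items lazily, requesting one page at a time.

    ``fetch_page(offset, limit)`` is a hypothetical callable that returns a
    list of at most ``limit`` items starting at ``offset``.
    """
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        offset += len(page)


# In-memory stand-in for a server that records each request it receives.
ITEMS = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv", "f.csv", "g.csv"]
requests = []


def fake_fetch(offset, limit):
    requests.append((offset, limit))
    return ITEMS[offset:offset + limit]


# Only the pages needed to produce three items are requested.
first_three = list(islice(iterate_pages(fake_fetch, offset=2), 3))
```

Here only two pages are fetched, although more items remain, which is the benefit of an iterator over a blocking call that returns everything.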
Manage datasets¶
You can modify, delete, and restore datasets. Note that you need the dataset’s ID in order to restore it from deletion. If you do not keep track of the ID, you will be unable to restore a dataset. If your deleted dataset was used to create a project, that project can still access it, but you will not be able to create new projects using that dataset.
>>> dataset.modify(name='A Better Name')
>>> dataset.name
'A Better Name'
>>> new_project = dr.Project.create_from_dataset(dataset.id)
>>> stored_id = dataset.id
>>> dr.Dataset.delete(dataset.id)
# new_project is still ok
>>> dr.Project.create_from_dataset(stored_id)
Traceback (most recent call last):
...
datarobot.errors.ClientError: 410 client error: {u'message': u'Requested Dataset 5e31cdac39782d0f65842518 was previously deleted.'}
>>> dr.Dataset.un_delete(stored_id)
>>> dr.Project.create_from_dataset(stored_id, project_name='Successful')
Project(Successful)
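Because a deleted dataset can only be restored by ID, one option is to record each ID before deleting. The following is a minimal sketch under stated assumptions: delete_and_record is a hypothetical helper, delete_fn stands in for a call such as dr.Dataset.delete, and the JSON log file is an illustrative choice rather than anything the client provides.

```python
import json
import os
import tempfile


def delete_and_record(dataset_id, delete_fn, log_path):
    """Append dataset_id to a JSON log, then call delete_fn(dataset_id).

    Hypothetical helper: ``delete_fn`` stands in for a call such as
    dr.Dataset.delete, and the log file format is an assumption.
    """
    recorded = []
    if os.path.exists(log_path):
        with open(log_path) as log_file:
            recorded = json.load(log_file)
    recorded.append(dataset_id)
    # Write the log first so the ID survives even if the delete call fails.
    with open(log_path, "w") as log_file:
        json.dump(recorded, log_file)
    delete_fn(dataset_id)
    return recorded


# Demonstration with a stub in place of the real delete call.
deleted = []
log_path = os.path.join(tempfile.mkdtemp(), "deleted_datasets.json")
delete_and_record("5e31cdac39782d0f65842518", deleted.append, log_path)
history = delete_and_record("5e387c501a438646ed7bf0f2", deleted.append, log_path)
```

Any ID in the log could later be passed to dr.Dataset.un_delete.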
You can share a dataset as demonstrated in the following code snippet.
>>> from datarobot.enums import SHARING_ROLE
>>> from datarobot.models.dataset import Dataset
>>> from datarobot.models.sharing import SharingAccess
>>>
>>> new_access = SharingAccess(
...     "new_user@datarobot.com",
...     SHARING_ROLE.OWNER,
...     can_share=True,
... )
>>> access_list = [
...     SharingAccess("old_user@datarobot.com", SHARING_ROLE.OWNER, can_share=True),
...     new_access,
... ]
>>>
>>> Dataset.get('my-dataset-id').share(access_list)
Manage dataset feature lists¶
You can create, modify, and delete custom feature lists on a given dataset. Some feature lists are automatically created by DataRobot and cannot be modified or deleted. Note that you cannot restore a deleted feature list.
>>> dataset.get_featurelists()
[DatasetFeaturelist(Raw Features),
DatasetFeaturelist(universe),
DatasetFeaturelist(Informative Features)]
>>> dataset_features = [feature.name for feature in dataset.get_all_features()]
>>> custom_featurelist = dataset.create_featurelist('Custom Features', dataset_features[:5])
>>> custom_featurelist
DatasetFeaturelist(Custom Features)
>>> dataset.get_featurelists()
[DatasetFeaturelist(Raw Features),
DatasetFeaturelist(universe),
DatasetFeaturelist(Informative Features),
DatasetFeaturelist(Custom Features)]
>>> custom_featurelist.update('New Name')
>>> custom_featurelist.name
'New Name'
>>> custom_featurelist.delete()
>>> dataset.get_featurelists()
[DatasetFeaturelist(Raw Features),
DatasetFeaturelist(universe),
DatasetFeaturelist(Informative Features)]
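A custom feature list is often built from features of a single type rather than a slice of all names. The sketch below filters feature names by type using only the name and feature_type attributes shown earlier; the Feature namedtuple is a local stand-in, names_of_type is a hypothetical helper, and the sample feature names other than Partition are invented for illustration.

```python
from collections import namedtuple

# Local stand-in for the feature objects returned by
# dataset.get_all_features(); only name and feature_type are used.
Feature = namedtuple("Feature", ["name", "feature_type"])


def names_of_type(features, feature_type):
    """Return the names of features whose feature_type matches."""
    return [f.name for f in features if f.feature_type == feature_type]


features = [
    Feature("Partition", "Numeric"),
    Feature("Description", "Text"),
    Feature("Amount", "Numeric"),
]
numeric_names = names_of_type(features, "Numeric")
```

The resulting list could be passed as the second argument to dataset.create_featurelist, for example with a list name such as 'Numeric Only'.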
Use credential data¶
For methods that accept credential data instead of a username and password or a credential ID, see the Credential Data section.