dlhub_sdk.models package

Submodules

dlhub_sdk.models.datasets module

class dlhub_sdk.models.datasets.Dataset

Bases: dlhub_sdk.models.BaseMetadataModel

Base class for describing a dataset

The Dataset class and any of its subclasses contain operations for describing what a dataset is and how to use it.

class dlhub_sdk.models.datasets.TabularDataset

Bases: dlhub_sdk.models.datasets.Dataset

Read a dataset stored as a single file in a tabular format.

Will read in the names of the columns, and allow users to associate column names with descriptions of the data provided.

This class is compatible with any data format readable by the Pandas library. See the list of read functions in the Pandas documentation.

annotate_column(column_name, description=None, data_type=None, units=None)

Provide documentation about a certain column within a dataset.

Overwrites any type values inferred from reading the dataset

Parameters
  • column_name (string) – Name of a column

  • description (string) – Longer description of a column

  • data_type (string) – Short description of the data type

  • units (string) – Units for the column's data (if applicable)

classmethod create_model(path, format='csv', read_kwargs=None)

Initialize the description of a tabular dataset

Parameters
  • path (string) – Path to dataset

  • format (string) – Format of the dataset. We support all of the read operations of Pandas (e.g., read_csv). Provide the format of your dataset as the suffix for the Pandas read command (e.g., “csv” for “read_csv”).

  • read_kwargs (dict) – Any keyword arguments for the pandas read command

get_unannotated_columns()

Get the names of columns that have not been described
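
The relationship between annotate_column and get_unannotated_columns can be sketched with a plain dictionary. This is a minimal illustration and not the dlhub_sdk implementation: the columns dictionary, its keys, and the column names are hypothetical, and it assumes a column counts as annotated once it has a description.

```python
# Hypothetical stand-in for the column metadata kept by a tabular dataset.
columns = {
    "temperature": {"type": "float"},
    "pressure": {"type": "float"},
    "phase": {"type": "string"},
}


def annotate_column(column_name, description=None, data_type=None, units=None):
    """Record documentation for one column, overwriting any inferred type."""
    if column_name not in columns:
        raise ValueError("Unknown column: {}".format(column_name))
    info = columns[column_name]
    if description is not None:
        info["description"] = description
    if data_type is not None:
        info["type"] = data_type  # replaces the type inferred on load
    if units is not None:
        info["units"] = units


def get_unannotated_columns():
    """Return the names of columns that still lack a description."""
    return [name for name, info in columns.items() if "description" not in info]


annotate_column("temperature", description="Sample temperature", units="K")
annotate_column("phase", description="Observed phase", data_type="category")
remaining = get_unannotated_columns()
```

Here only the "pressure" column is left without a description, so it is the one a user would still need to document.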

load_dataset(path, format, **kwargs)

Load in a dataset to get some high-level descriptions of it

Parameters
  • path (string) – Path to dataset

  • format (string) – Format of the dataset. We support all of the read operations of Pandas (e.g., read_csv). Provide the format of your dataset as the suffix for the Pandas read command (e.g., “csv” for “read_csv”).

  • **kwargs (dict) – arguments for the Pandas read function
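
The format argument names the suffix of a Pandas read function. One way such a lookup could work is by attribute name, sketched below. The helper resolve_reader and the stand-in namespace are hypothetical and are not part of dlhub_sdk or Pandas; the stand-in is used so the sketch runs without Pandas installed.

```python
from types import SimpleNamespace


def resolve_reader(namespace, format):
    """Return the function named ``read_<format>`` from ``namespace``."""
    reader_name = "read_" + format
    reader = getattr(namespace, reader_name, None)
    if reader is None:
        raise ValueError("No reader named {!r}".format(reader_name))
    return reader


# Stand-in for the pandas module (assumption: only read_csv is defined here).
fake_pandas = SimpleNamespace(
    read_csv=lambda path, **kwargs: ("csv", path, kwargs)
)

reader = resolve_reader(fake_pandas, "csv")
# Keyword arguments are forwarded unchanged, like read_kwargs or **kwargs.
result = reader("data.csv", sep=";")
```

With Pandas installed, passing the pandas module itself as the namespace would resolve "csv" to pandas.read_csv and "json" to pandas.read_json.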

mark_inputs(column_names)

Mark which columns are inputs to a model

Parameters

column_names ([string]) – Names of columns

mark_labels(column_names)

Mark which columns are labels

Parameters

column_names ([string]) – Names of columns

dlhub_sdk.models.pipeline module

Model for a pipeline of several servables

class dlhub_sdk.models.pipeline.PipelineModel

Bases: dlhub_sdk.models.BaseMetadataModel

Model for a pipeline of several servables

A Pipeline is created after individual servables have been published in DLHub, or at least assigned a DLHub identifier. A Pipeline is formed from a list of these servables, and any options to employ when running them. A step in a pipeline could also be another pipeline.

A simple example of a DLHub pipeline is an image classification tool. The first step in the pipeline is an image reader that takes any image type and produces an array. The second step standardizes the shape of the image, where the “options” to the servable are the desired resolution and whether the image is grayscale or not. The final step is the classification pipeline. Put together, the image pipeline can support any type of input data.

add_step(author, name, description, parameters=None)

Add a step to the pipeline

Parameters
  • author (string) – DLHub username of the owner of the servable

  • name (string) – Name of the DLHub servable

  • description (string) – A short description of this step

  • parameters (dict) – Any options for the servable. See the list of parameters for a servable
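
The image classification example above can be sketched as an ordered list of steps. This is only an illustration of the data being described, not the PipelineModel implementation: the module-level add_step function, the dictionary layout, the username, the servable names, and the option names "resolution" and "grayscale" are all hypothetical.

```python
# Hypothetical in-memory record of the pipeline steps, in execution order.
steps = []


def add_step(author, name, description, parameters=None):
    """Append one servable to the pipeline along with its options."""
    steps.append({
        "author": author,
        "name": name,
        "description": description,
        "parameters": parameters or {},
    })


add_step("example_user", "image_reader", "Read an image into an array")
add_step(
    "example_user",
    "image_resizer",
    "Standardize the shape of the image",
    parameters={"resolution": [64, 64], "grayscale": True},
)
add_step("example_user", "image_classifier", "Classify the image")

step_names = [step["name"] for step in steps]
```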

Module contents

This module contains tools for describing objects being published to DLHub.

class dlhub_sdk.models.BaseMetadataModel

Bases: object

Base class for models describing objects published via DLHub

Covers information that goes in the datacite block of the metadata file and some of the DLHub block.

There are many kinds of MetadataModel classes that each describe a different kind of object. Each type is created using the create_model operation (e.g., KerasModel.create_model('model.h5')), but the arguments differ depending on the type of object. For example, TensorFlow models only require the directory created when saving the model for serving, but scikit-learn models require the pickle file, how the pickle was created (e.g., with joblib), and how many input features it requires.

Once created, you will need to fill in additional details about the object to make it reusable. The MetadataModel classes attempt to learn as much about an object as possible automatically, but there is some information that must be provided by a human. To start, you must define a title and name for the object and are encouraged to provide an abstract describing the model and list any associated papers/websites that describe the model. You will find plenty of examples of how to describe models in the DLHub_containers repository. Some types of objects require data specific to their type (e.g., Python servables need a list of required packages). We encourage you to find examples for your specific type of object in the containers repository for inspiration and to see the Python documentation for each Metadata Model.

The MetadataModel object can be saved using the to_dict operation and read back into memory using the from_dict method. We recommend you save your dictionary to disk in the JSON or YAML format, which will allow for manual edits to be made before submitting or resubmitting an object description.
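
Saving the dictionary as JSON and loading it again could look like the sketch below. The metadata dictionary is a hypothetical stand-in for the output of to_dict, and its keys are illustrative only; the file name is also an assumption.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the dictionary returned by to_dict().
metadata = {
    "datacite": {"titles": [{"title": "Example model"}]},
    "dlhub": {"name": "example_model"},
}

path = os.path.join(tempfile.mkdtemp(), "metadata.json")

# Indented output keeps the file easy to edit by hand before submission.
with open(path, "w") as fp:
    json.dump(metadata, fp, indent=2)

# The loaded dictionary is what would be passed to from_dict().
with open(path) as fp:
    restored = json.load(fp)
```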

add_alternate_identifier(identifier, identifier_type)

Add an identifier of this artifact in another service

Parameters
  • identifier (string) – Identifier

  • identifier_type (string) – Identifier type

add_directory(directory, include=(), exclude=(), recursive=False)

Add all the files in a directory

Parameters
  • include (string or [string]) – Only add files that match any of these patterns

  • exclude (string or [string]) – Exclude all files that match any of these patterns

  • directory (string) – Path to a directory

  • recursive (bool) – Whether to also add files from subdirectories of the directory
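
A minimal sketch of how include and exclude patterns could be combined, assuming glob-style patterns matched against file names. The source does not state the pattern syntax, so the use of fnmatch is an assumption, and select_files and as_list are hypothetical helpers rather than dlhub_sdk functions.

```python
from fnmatch import fnmatch


def as_list(patterns):
    """Accept a single pattern string or a sequence of patterns."""
    return [patterns] if isinstance(patterns, str) else list(patterns)


def select_files(names, include=(), exclude=()):
    """Keep names matching any include pattern and no exclude pattern."""
    include, exclude = as_list(include), as_list(exclude)
    selected = []
    for name in names:
        # An empty include list means every file is a candidate.
        if include and not any(fnmatch(name, p) for p in include):
            continue
        if any(fnmatch(name, p) for p in exclude):
            continue
        selected.append(name)
    return selected


names = ["model.pkl", "train.csv", "notes.txt", "test.csv"]
chosen = select_files(names, include="*.csv", exclude=["test*"])
```

In this sketch the exclude patterns take precedence, so a file that matches both lists is left out.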

add_file(file, name=None)

Add a file to the list of files to be distributed with the artifact

Parameters
  • file (string) – Path to the file

  • name (string) – Optional. Name of the file, if it is a file that serves a specific purpose in software based on this artifact (e.g., if this is a pickle file of a scikit-learn model)

add_files(files)

Add files that should be distributed with this artifact.

Parameters

files ([string]) – Paths of files that should be published

add_funding_reference(name, identifier=None, identifier_type=None, award_number=None, award_title=None, award_uri=None)

Add a funding source to the list of resources

Parameters
  • name (string) – Name of funding provider

  • identifier (string) – Identifier (e.g., ISNI) of the funder

  • identifier_type (string) – Type of the identifier (ISNI, GRID, Crossref Funder ID, Other)

  • award_number (string) – Code assigned by the funder

  • award_title (string) – Title of the award

  • award_uri (string) – URI of the award

add_related_identifier(identifier, identifier_type, relation_type)

Add an identifier of an artifact that is related to this resource (e.g., a paper that describes a dataset).

You must define both the identifier and how it relates to this resource. The possible types of relations are listed on page 25 of the DataCite documentation (https://schema.datacite.org/meta/kernel-4.1/doc/DataCite-MetadataKernel_v4.1.pdf). The most common ones used in DLHub will likely be:

  • “IsDescribedBy”: For a paper that describes a dataset or model

  • “IsDocumentedBy”: For the software documentation for a model

  • “IsDerivedFrom”: For the database a training set was pulled from

  • “Requires”: For any software libraries that are required for this module

Parameters
  • identifier (string) – Identifier

  • identifier_type (string) – Identifier type

  • relation_type (string) – Relation between this resource and the related artifact

add_requirement(library, version=None)

Add a required Python library.

The name of the library should be either the name on PyPI, or a URL for the git repository holding the code (e.g., git+https://github.com/DLHub-Argonne/dlhub_sdk.git)

Parameters
  • library (string) – Name of library

  • version (string) – Required version. ‘latest’ to use the most recent version on PyPI (if available). ‘detect’ will attempt to find the version of the library installed on the computer running this software. Default is None

add_requirements(requirements)

Add a dictionary of requirements

Utility wrapper for add_requirement

Parameters

requirements (dict) – Keys are the names of libraries (str); values are the required versions
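
The shape of the requirements dictionary, and the way a wrapper could forward each entry to add_requirement, is sketched below. RequirementList is a hypothetical stand-in class and not the dlhub_sdk implementation; the version values follow the options described for add_requirement.

```python
class RequirementList:
    """Hypothetical stand-in that only records what was requested."""

    def __init__(self):
        self.requirements = {}

    def add_requirement(self, library, version=None):
        # The real method may resolve 'latest' or 'detect'; this sketch
        # stores the requested value unchanged.
        self.requirements[library] = version

    def add_requirements(self, requirements):
        # Forward each (library, version) pair to add_requirement.
        for library, version in requirements.items():
            self.add_requirement(library, version)


model = RequirementList()
model.add_requirements({
    "numpy": "latest",
    "scikit-learn": "detect",
    "git+https://github.com/DLHub-Argonne/dlhub_sdk.git": None,
})
```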

add_rights(uri=None, rights=None)

Any rights information for this resource. Provide a rights management statement for the resource or reference a service providing such information. Include embargo information if applicable. Use the complete title of a license and include version information if applicable.

Parameters
  • uri (string) – URI of the rights

  • rights (string) – Description of the rights

classmethod create_model(**kwargs)

Instantiate the metadata model.

Takes in arguments that allow metadata describing a dataset to be autogenerated. For example, these could include options describing how to read a dataset from a CSV file or which class method to invoke on a Python pickle object.

classmethod from_dict(data)

Reconstitute class from dictionary

Parameters

data (dict) – Metadata for this class

get_zip_file(path)

Write all the listed files to a ZIP object

Takes all of the files returned by list_files. First determines the largest common path of all files, and preserves directory structure by using this common path as the root directory. For example, if the files are “/home/a.pkl” and “/home/a/b.dat”, the common directory is “/home” and the files will be stored in the Zip as “a.pkl” and “a/b.dat”

Parameters

path (string) – Path for the ZIP File

Returns

Base path for the ZIP file (useful for adjusting the paths of the files included in the metadata model)

Return type

(string)
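
The common-path rule described above can be sketched with the standard library. This is an illustration of the path arithmetic only, not the dlhub_sdk implementation; archive_names is a hypothetical helper, and it assumes POSIX-style absolute paths.

```python
import posixpath


def archive_names(files):
    """Return the common base directory and each file's path relative to it."""
    # Use the directories of the files so a single file still gets a base.
    directories = [posixpath.dirname(f) for f in files]
    base = posixpath.commonpath(directories)
    names = [posixpath.relpath(f, base) for f in files]
    return base, names


base, names = archive_names(["/home/a.pkl", "/home/a/b.dat"])
```

Each relative name could then be passed as the arcname argument of zipfile.ZipFile.write so that the archive preserves the directory structure below the base path.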

list_files()

Provide a list of files associated with this artifact.

Returns

([string]) list of file paths

property name

Get the name of the servable

Returns

(string) Name of the servable

parse_repo2docker_configuration(directory=None)

Gathers information about required environment from repo2docker configuration files.

See https://repo2docker.readthedocs.io/en/latest/config_files.html for more details

Parameters

directory (str) – Path to directory containing configuration files (default: current working directory)

read_codemeta_file(directory=None)

Read in metadata from a codemeta.json file

Parameters

directory (string) – Path to directory containing the codemeta.json file (default: current working directory)

set_abstract(abstract)

Define an abstract for this artifact. Use for a high-level summary

Parameters

abstract (string) – Description of this artifact

set_authors(authors, affiliations=[])

Add authors to a dataset

Parameters
  • authors ([string]) – List of authors for the dataset. In format: “<Family Name>, <Given Name>”

  • affiliations ([[string]]) – List of affiliations for each author.
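
The expected shapes of the two arguments can be sketched as follows: one name string per author, and one list of affiliations per author in the same order. The names, affiliations, record keys, and the split_author helper are made up for illustration and are not part of dlhub_sdk.

```python
# One "<Family Name>, <Given Name>" string per author.
authors = ["Smith, Jane", "Doe, John"]

# One list of affiliations per author, in the same order as authors.
affiliations = [
    ["Example University"],
    ["Example University", "Example National Laboratory"],
]


def split_author(author):
    """Split a 'Family, Given' string into its two parts."""
    family, given = [part.strip() for part in author.split(",", 1)]
    return {"family_name": family, "given_name": given}


records = [
    dict(split_author(author), affiliations=affils)
    for author, affils in zip(authors, affiliations)
]
```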

set_doi(doi)

Set the DOI of this object, if available

This function is only for advanced usage. Most users of the toolbox will not know the DOI before submitting the object to DLHub.

Parameters

doi (string) – DOI of the object

set_domains(domains)

Set the fields of science that are associated with this artifact

Parameters

domains ([string]) – Names of fields of science (e.g., “materials science”)

set_methods(methods)

Define a methods section for this artifact. Use to describe any specific details about how the dataset, model, etc. was generated.

Parameters

methods (str) – Detailed method descriptions

set_name(name)

Set the name of the artifact.

Should be something short, descriptive, and memorable

Parameters

name (string) – Name of artifact

set_publication_year(year)

Define the publication year

This function is only for advanced usage. Normally, this will be assigned automatically

Parameters

year (string) – Publication year

set_title(title)

Add a title to the dataset

set_version(version)

Set the version of this resource

Parameters

version (string) – Version number

set_visibility(visible_to)

Define the list of people and groups who have permissions to see and use this model.

By default, it will be visible to anyone ([“public”]).

Parameters

visible_to ([string]) – List of allowed users and groups, listed by GlobusAuth UUID

to_dict(simplify_paths=False, save_class_data=False)

Render the dataset to a JSON description

Parameters
  • simplify_paths (bool) – Whether to simplify the paths of each file

  • save_class_data (bool) – Whether to save data about the class

Returns

(dict) A description of the dataset in a form suitable for download