Utility functions¶
How to use this reference
This page documents each class and function in this module and is meant as a detailed reference. If you're looking for an introduction, we recommend reviewing the How to section.
This module contains utility functions used during tool execution. In general, you will not need to use many of these functions directly.
add_provider_if_databaseid_found¶
add_provider_if_databaseid_found(data)
Recursively traverse a data structure of nested dictionaries/lists. If a dict contains the key 'databaseId', add a peer key '$provider' set to 'datahub-cell'. Return the modified data structure.
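A minimal usage sketch (the import path is an assumption; adjust it to wherever this module lives):

```python
from tools.utils import add_provider_if_databaseid_found  # hypothetical import path

data = {
    "rows": [
        {"databaseId": "db-123", "name": "example"},  # gains a '$provider' key
        {"name": "no-database-id"},                   # left unchanged
    ]
}

result = add_provider_if_databaseid_found(data)
# result["rows"][0] now also contains {"$provider": "datahub-cell"}
```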
get_job_dataframe¶
get_job_dataframe(update: bool = False) -> Any
Return a dataframe of all jobs and their statuses, reading from the local cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `update` | `bool` | Whether to check for updates on non-terminal jobs. Defaults to `False`. | `False` |
Note that this function is deliberately not annotated with a return type because pandas is imported inside this function.
Returns:
| Type | Description |
|---|---|
| `Any` | `pd.DataFrame`: a dataframe containing job information. |
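A minimal usage sketch (import path assumed):

```python
from tools.utils import get_job_dataframe  # hypothetical import path

# Re-check non-terminal jobs before reading from the local cache
df = get_job_dataframe(update=True)
print(df.head())  # a pd.DataFrame of jobs and their statuses
```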
make_payload¶
make_payload(
*,
inputs: dict,
outputs: dict,
cluster_id: Optional[str] = None,
cols: Optional[list] = None
) -> dict
Helper function to create the payload for tool execution. It is used by all wrapper functions in the run module to create the payload.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `dict` | Inputs. | required |
| `outputs` | `dict` | Outputs. | required |
| `cluster_id` | `Optional[str]` | Cluster ID. Defaults to `None`. If not provided, the default cluster (us-west-2) is used. | `None` |
| `cols` | `Optional[list]` | List of columns. Defaults to `None`. If provided, column names (in inputs or outputs) are converted to column IDs. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Correctly formatted payload, ready to be passed to `execute_tool`. |
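A sketch of building a payload; the import path and the input/output keys and values are illustrative only:

```python
from tools.utils import make_payload  # hypothetical import path

payload = make_payload(
    inputs={"protein": "file-id"},         # illustrative input mapping
    outputs={"results": "output-target"},  # illustrative output mapping
    cluster_id=None,                       # None -> default cluster (us-west-2)
)
# payload is a dict, ready to be passed to execute_tool
```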
query_run_status¶
query_run_status(execution_id: str) -> str
Determine the status of a run, identified by its execution ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `execution_id` | `str` | Execution ID. | required |
Returns:
| Type | Description |
|---|---|
| `str` | One of `"Created"`, `"Queued"`, `"Running"`, `"Succeeded"`, or `"Failed"`. |
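A minimal usage sketch (import path and execution ID are illustrative):

```python
from tools.utils import query_run_status  # hypothetical import path

status = query_run_status("my-execution-id")  # illustrative execution ID
assert status in {"Created", "Queued", "Running", "Succeeded", "Failed"}
```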
query_run_statuses¶
query_run_statuses(job_ids: list[str]) -> dict
Get statuses for multiple jobs in parallel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `job_ids` | `list[str]` | List of job IDs. | required |
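A minimal usage sketch; this assumes the returned dict maps each job ID to its status, which the reference above does not spell out:

```python
from tools.utils import query_run_statuses  # hypothetical import path

statuses = query_run_statuses(["job-1", "job-2"])  # illustrative job IDs
for job_id, status in statuses.items():            # assumed dict shape
    print(job_id, status)
```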
run_tool¶
run_tool(*, data: dict, tool_key: str)
Run any tool using the provided data transfer object (DTO).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `dict` | Data transfer object. This is typically generated by the | required |
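A minimal usage sketch; pairing run_tool with a payload produced by make_payload is an assumption based on the (truncated) description above:

```python
from tools.utils import make_payload, run_tool  # hypothetical import path

data = make_payload(
    inputs={"protein": "file-id"},  # illustrative
    outputs={"results": "target"},  # illustrative
)
run_tool(data=data, tool_key="some-tool-key")  # both arguments are keyword-only
```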
wait_for_job¶
wait_for_job(
execution_id: str, *, poll_interval: int = 4
) -> None
Repeatedly poll Deep Origin for the job status, until the status is "Succeeded" or "Failed" (a terminal state).
This function is useful for blocking execution of your code until a specific task is complete.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `execution_id` | `str` | Execution ID. This is typically printed to screen and returned when a job is initialized. | required |
| `poll_interval` | `int` | Number of seconds to wait between polls. Defaults to 4. | `4` |
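A minimal usage sketch (import path and execution ID are illustrative):

```python
from tools.utils import wait_for_job  # hypothetical import path

# Blocks until the job reaches "Succeeded" or "Failed"
wait_for_job("my-execution-id", poll_interval=10)
```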
wait_for_jobs¶
wait_for_jobs(
refresh_time: int = 3, hide_succeeded: bool = True
) -> Any
Wait for all jobs started via this client to complete.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `refresh_time` | `int` | Number of seconds to wait between polls. Defaults to 3. | `3` |
| `hide_succeeded` | `bool` | Whether to hide jobs that have already completed. Defaults to `True`. | `True` |
Note that this function is deliberately not annotated with a return type, to avoid importing pandas outside this function.
Returns:
| Type | Description |
|---|---|
| `Any` | `pd.DataFrame`: dataframe of all jobs. |
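A minimal usage sketch (import path assumed):

```python
from tools.utils import wait_for_jobs  # hypothetical import path

# Poll every 5 seconds and keep succeeded jobs visible in the output
df = wait_for_jobs(refresh_time=5, hide_succeeded=False)
print(df)  # a pd.DataFrame of all jobs started via this client
```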