User Guide

Thank you for using aind-data-transfer-service! This guide is intended for scientists and engineers at AIND who wish to upload data from our shared network drives (e.g., VAST) to the cloud.

Concepts

There are two important concepts that should be highlighted: Tasks and job_types.

The Task model can be imported from aind_data_transfer_service.models.core.Task; however, it can also be built as a plain Python dictionary. It has the following fields:
- skip_task: Whether or not to skip a Task such as checking if the s3_folder already exists.
- image: A docker image to run for a particular task
- image_version: The version of the docker image to run
- image_resources: The HPC resources that are being requested.
- job_settings: A dictionary that can be passed into the command script.
- command_script: The command to run in the docker image.
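
As a rough illustration of the fields above, a Task can be written as a plain Python dictionary. Every value below (image name, version, resource keys, script, path) is a hypothetical placeholder rather than a real preset, and the sketch assumes that fields left out of a Task are optional. The service validates the actual contents.

```python
# A Task expressed as a plain dictionary, using the field names listed above.
# All values are hypothetical placeholders for illustration only.
example_task = {
    "skip_task": False,
    "image": "ghcr.io/example-org/example-transformation",  # hypothetical image
    "image_version": "0.0.0",  # hypothetical version
    "image_resources": {"memory_gb": 8, "cpus": 4},  # hypothetical resource keys
    "job_settings": {
        "input_source": "/allen/aind/scratch/working_group/session_123456_2024-06-19/ecephys"
    },
    "command_script": "#!/bin/bash\necho 'run the transformation here'",
}

# Assumption: a Task that only skips a step needs just the skip_task field.
skip_only_task = {"skip_task": True}
```
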
The following Tasks are run during the transform and upload pipeline.
- send_job_start_email: Sends an optional email to signal that a workflow has started.
- check_s3_folder_exists: Raises an error if the s3_folder already exists. Set skip_task to True to skip this check, but do so sparingly: skipping it forces a sync to AWS even if the folder already exists in the cloud, which overwrites data already uploaded but does not delete any data.
- check_source_folders_exist: Checks that the source folders exist. This raises an error earlier in the workflow if the source folders were configured incorrectly.
- create_staging_folder: Creates a temporary folder on VAST to store some temporary files for staging.
- gather_preliminary_metadata: Automatically gathers metadata for subject, procedures, and data_description by running the aind-metadata-mapper GatherMetadataJob. You can optionally specify a directory with pre-compiled metadata files.
- check_metadata_files: Checks that the metadata files exist and are json.
- copy_derivatives_folder: Optionally uploads a derivatives folder. Set the job_settings to a dictionary like {"input_source": "path_to_folder"}.
- modality_transformation_settings: This creates the settings for transforming the modality folders. This is a dictionary that is keyed by the modality abbreviation and contains the settings for the corresponding modality.
- compress_data/submit_job: This is a mapped task. Each Task in the modality_transformation_settings will run in parallel in a separate container here.
- compress_data/monitor_job: This is a mapped task that monitors the compress_data/submit_job tasks. It is expected that the status of this task alternates between “running” and “up_for_reschedule” until the compression job is finished.
- gather_final_metadata: Automatically generates a processing.json by running the aind-metadata-mapper GatherProcessingJob.
- upload_data_to_s3: Uploads the data to S3.
- register_data_asset: Registers the record to DocDB and Code Ocean.
- get_codeocean_asset_id: Retrieves the Code Ocean data asset ID from DocDB.
- codeocean_pipeline_settings: Like modality_transformation_settings, this creates the parameters to send to the Code Ocean pipeline monitor capsule. You can specify at most one task per modality.
- run_codeocean_pipeline: This is a mapped task that will run each task in the codeocean_pipeline_settings dictionary in an individual container.
- remove_staging_folder: Removes the staging folder created above.
- remove_source_folders: Optionally removes the source folders from VAST. By default, this is turned off. Please be careful when running this task.
- send_job_end_email: Sends an optional email to signal that a workflow has ended.
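
The task names above can serve as keys when overriding individual steps. The sketch below assumes the overrides are collected in a single dictionary keyed by task name; the variable name tasks, the derivatives path, and the modality folders are hypothetical. It shows skipping the S3 check, pointing copy_derivatives_folder at a folder, and keying modality_transformation_settings by modality abbreviation.

```python
import json

# Hypothetical session folder, reused to build the example paths.
session = "/allen/aind/scratch/working_group/session_123456_2024-06-19"

# Assumed layout: one entry per task name, each value a Task-like dictionary.
tasks = {
    # Skip the existence check (use sparingly: this forces a sync to AWS).
    "check_s3_folder_exists": {"skip_task": True},
    # Upload a derivatives folder; the path is a placeholder.
    "copy_derivatives_folder": {
        "job_settings": {"input_source": f"{session}/derivatives"}
    },
    # Keyed by modality abbreviation, one Task per modality.
    "modality_transformation_settings": {
        "ecephys": {"job_settings": {"input_source": f"{session}/ecephys"}},
        "behavior": {"job_settings": {"input_source": f"{session}/behavior"}},
    },
}

print(json.dumps(tasks, indent=2))
```
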
job_type: Since the majority of workflows may use the same parameters repeatedly, the Tasks can be stored in AWS Parameter Store. A user will only need to define a job_type, and the presets will be used. This is the recommended way of using aind-data-transfer-service. Please reach out to a member of Scientific Computing for help with defining a job_type.

Prerequisites

It’s assumed that raw data is already stored and organized on a shared network drive such as VAST.
The raw data should be organized by modality.

Please see aind-file-standards for more information.

Example 1:

- /allen/aind/scratch/working_group/session_123456_2024-06-19
  - /ecephys
  - /behavior
  - /behavior_videos
  - session.json
  - rig.json
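
A layout like Example 1 can be translated into the modality columns used when submitting a job. The sketch below is a hypothetical helper, not part of aind-data-transfer-service. It treats each subfolder of a session directory as one modality and assumes the folder name matches the modality abbreviation.

```python
from pathlib import Path


def modality_columns(session_dir):
    """Map each subfolder to modality{n} and modality{n}.input_source entries."""
    columns = {}
    # Only directories count as modalities; files such as session.json are ignored.
    subfolders = sorted(p for p in Path(session_dir).iterdir() if p.is_dir())
    for n, folder in enumerate(subfolders):
        columns[f"modality{n}"] = folder.name
        columns[f"modality{n}.input_source"] = folder.as_posix()
    return columns
```

Calling modality_columns on the session folder from Example 1 would yield one pair of entries for each of behavior, behavior_videos, and ecephys, in alphabetical order.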

Using the web portal

Access to the web portal is available only through the VPN. The web portal can be accessed at http://aind-data-transfer-service/

Download the Excel template file by clicking the Job Submit Template link.
If there are compatibility issues with the Excel template, you can save it as a CSV file and edit the CSV file instead.
Create one row per data acquisition session.
Required fields
- user_email: Your email address to receive notifications about the job status (e.g., start, completion, failure)
- job_type: We store pre-compiled default configurations in AWS Parameter Store (e.g. modality transformation settings, Code Ocean pipeline settings). This field determines which preset to use when running the upload job. A list of job types can be seen by clicking the Job Parameters link.
- project_name: A list of project names can be seen by clicking the Project Names link
- subject_id: The LabTracks ID of the mouse
- acq_datetime: The date and time the data was acquired. Should be in ISO format, for example, 2024-05-27T16:07:59
- modalities: Two columns must be added per modality: the modality itself (chosen from the drop-down menu) and a POSIX-style path to the data source. For example,
  - modality0 (e.g., ecephys)
  - modality0.input_source (e.g., /allen/aind/scratch/working_group/session_123456_2024-06-19/ecephys_data)
  - modality1 (e.g., behavior)
  - modality1.input_source (e.g., /allen/aind/scratch/working_group/session_123456_2024-06-19/behavior_data)
  - modality2 (e.g., behavior_videos)
  - modality2.input_source (e.g., /allen/aind/scratch/working_group/session_123456_2024-06-19/behavior_videos)
Optional fields
- platform: Standardized way of collecting and processing data (chosen from the drop-down menu). Note: This field is deprecated and will be removed when aind-data-schema 2.0 is rolled out.
- metadata_dir: If metadata files are pre-compiled and saved to a directory, you can add the POSIX-style path to the directory under this column.
- derivatives_dir: If a derivatives folder is available for upload, you can add the POSIX-style path to the directory under this column.
- s3_bucket: By default, data will be uploaded to a default bucket in S3 managed by AIND. Please reach out to the Scientific Computing department if you wish to upload to a different bucket.
- modality{n}.pipeline_id (or modality{n}.capsule_id): It is possible to add a Code Ocean pipeline_id or capsule_id to a modality. For more complex parameters, please define a job_type or use the REST API.
  - modality0.capsule_id (e.g., 123-456)
  - modality1.pipeline_id (e.g., 123-456)
- force_cloud_sync: We recommend using this flag sparingly. It forces a sync to AWS even if the folder already exists in the cloud, which overwrites the data already uploaded but does not delete any data. Please reach out to a member of Scientific Computing for help clearing data from AWS.
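
For those who edit the CSV version of the template, one row with the required fields might be produced as sketched below. This assumes the column headers match the field names listed above. The email address, job type, project name, and acquisition time are hypothetical values; real job types and project names come from the Job Parameters and Project Names links.

```python
import csv
import io
from datetime import datetime

session = "/allen/aind/scratch/working_group/session_123456_2024-06-19"

# One row per data acquisition session; values are placeholders.
row = {
    "user_email": "first.last@example.org",  # hypothetical address
    "job_type": "ecephys",  # hypothetical job type
    "project_name": "Example Project",  # hypothetical project name
    "subject_id": "123456",
    # isoformat() produces the ISO format described above.
    "acq_datetime": datetime(2024, 6, 19, 11, 30, 0).isoformat(),
    "modality0": "ecephys",
    "modality0.input_source": f"{session}/ecephys_data",
    "modality1": "behavior",
    "modality1.input_source": f"{session}/behavior_data",
}

# Write the header and the row to an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```
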

Using the REST API

For more granular configuration, jobs can be submitted via a REST API at the endpoint:

http://aind-data-transfer-service/api/v2/submit_jobs

You may pip install aind-data-transfer-service for access to the Task model; however, this isn't strictly necessary. You can form the POST request body as a dictionary, and the service will validate it.

Note: The user_email field is required in the SubmitJobRequestV2 model to receive notifications about job status.

We strongly recommend using customized job_types to simplify the requests. For more detailed examples please check the scripts in examples.
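
A minimal sketch of forming the POST request as a dictionary, using only the Python standard library, might look like the following. Only the endpoint URL and the user_email and job_type fields come from this guide. The upload_jobs key and the other field names are assumptions modeled on the web portal columns, the values are hypothetical, and the job dictionary is intentionally incomplete. The scripts in examples show complete requests.

```python
import json
import urllib.request

SUBMIT_URL = "http://aind-data-transfer-service/api/v2/submit_jobs"


def build_submit_request(user_email, upload_jobs):
    """Build (but do not send) a JSON POST request for the submit endpoint."""
    # Assumption: jobs are listed under an "upload_jobs" key.
    payload = {"user_email": user_email, "upload_jobs": upload_jobs}
    return urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical job that relies on a job_type preset for most settings.
job = {
    "job_type": "ecephys",
    "project_name": "Example Project",
    "subject_id": "123456",
    "acq_datetime": "2024-06-19T11:30:00",
}
request = build_submit_request("first.last@example.org", [job])

# To send it, assuming the host is reachable from your network:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the JSON body before anything is submitted.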

Viewing the status of submitted jobs

The status of submitted jobs can be viewed at: http://aind-data-transfer-service/jobs

This page shows the jobs submitted in the last 14 days. You can filter/sort by status, asset name, job type, etc. You can also click into a job’s tasks to view the status and logs from individual tasks.

Please note that certain tasks, such as compress_data/monitor_job, will alternate between “running” and “up_for_reschedule” status until finished.

Viewing job parameters based on job type

We store pre-compiled job configurations in AWS Parameter Store based on job_type. Available job types and their configurations can be viewed at: http://aind-data-transfer-service/job_params

To request a new job type, please reach out to Scientific Computing. Admins can manage job types directly on the Job Parameters page.

Reporting bugs or making feature requests

Please report any bugs or feature requests here: issues