User Guide
==========

Thank you for using ``aind-data-transfer-service``! This guide is intended for
scientists and engineers in AIND who wish to upload data from our shared
network drives (e.g., VAST) to the cloud.

Concepts
--------

There are two important concepts that should be highlighted: Tasks and
job_types.

- The Task model can be imported as
  aind_data_transfer_service.models.core.Task; however, it is also simple to
  build it as a Python dictionary. It has the following fields:

  - skip_task: Whether or not to skip a Task, such as checking if the
    s3_folder already exists.
  - image: A Docker image to run for a particular task.
  - image_version: The version of the Docker image to run.
  - image_resources: The HPC resources that are being requested.
  - job_settings: A dictionary that can be passed into the command script.
  - command_script: The command to run in the Docker image.

- The following Tasks are run during the transform and upload pipeline. The
  more important ones are in bold.

  - send_job_start_email: Sends an optional email to signal that a workflow
    has started.
  - check_s3_folder_exists: Raises an error if an s3_folder already exists.
    Set skip_task to True to skip this check. We recommend skipping this task
    sparingly. Skipping it forces a sync to AWS even if the folder already
    exists in the cloud. This will overwrite the data already uploaded, but
    won't delete any data.
  - check_source_folders_exist: Checks that the source folders exist. This
    raises an error earlier in the workflow if the source folders were
    configured incorrectly.
  - create_staging_folder: Creates a temporary folder on VAST to store some
    temporary files for staging.
  - **gather_preliminary_metadata**: Automatically gathers metadata for
    subject, procedures, and data_description by running the
    aind-metadata-mapper GatherMetadataJob. You can optionally specify a
    directory with pre-compiled metadata files.
  - check_metadata_files: Checks that the metadata files exist and are JSON.
  - copy_derivatives_folder: Can specify a derivatives folder to upload. Set
    the job_settings like {"input_source": "path_to_folder"}.
  - **modality_transformation_settings**: This creates the settings for
    transforming the modality folders. This is a dictionary that is keyed by
    the modality abbreviation and contains the settings for the corresponding
    modality.
  - compress_data/submit_job: This is a mapped task. Each Task in the
    modality_transformation_settings will run in parallel in a separate
    container here.
  - compress_data/monitor_job: This is a mapped task that monitors the
    compress_data/submit_job tasks. It is expected that the status of this
    task alternates between "running" and "up_for_reschedule" until the
    compression job is finished.
  - gather_final_metadata: Automatically generates a processing.json by
    running the aind-metadata-mapper GatherProcessingJob.
  - upload_data_to_s3: Uploads the data to S3.
  - register_data_asset: Registers the record to DocDB and Code Ocean.
  - get_codeocean_asset_id: Retrieves the Code Ocean data asset ID from DocDB.
  - **codeocean_pipeline_settings**: As with the
    modality_transformation_settings, this is a dictionary. It holds the
    parameters to send to the Code Ocean pipeline monitor capsule. You can
    specify at most one task per modality.
  - run_codeocean_pipeline: This is a mapped task that will run each task in
    the codeocean_pipeline_settings dictionary in an individual container.
  - remove_staging_folder: Removes the staging folder created above.
  - remove_source_folders: Optionally removes the source folders from VAST.
    By default, this is turned off. Please be careful running this task.
  - send_job_end_email: Sends an optional email to signal that a workflow has
    ended.

- job_type: Since the majority of workflows may use the same parameters
  repeatedly, the Tasks can be stored in AWS Parameter Store. A user will only
  need to define a job_type, and the presets will be used. This is the
  recommended way of using aind-data-transfer-service. Please reach out to a
  member of Scientific Computing for help with defining a job_type.

Prerequisites
-------------

- It's assumed that raw data is already stored and organized on a shared
  network drive such as VAST.
- The raw data should be organized by modality.

  - Please see `aind-file-standards `__ for more information.
  - Example 1:

    .. code:: bash

       - /allen/aind/scratch/working_group/session_123456_2024-06-19
         - /ecephys
         - /behavior
         - /behavior_videos
         - session.json
         - rig.json

Using the web portal
--------------------

Access to the web portal is available only through the VPN. The web portal can
be accessed at `http://aind-data-transfer-service/ `__

- Download the Excel template file by clicking the ``Job Submit Template``
  link.

  - If there are compatibility issues with the Excel template, you can try
    saving it as a CSV file and modifying the CSV file.

- Create one row per data acquisition session.
- Required fields

  - user_email: Your email address to receive notifications about the job
    status (e.g., start, completion, failure).
  - job_type: We store pre-compiled default configurations in AWS Parameter
    Store (e.g., modality transformation settings, Code Ocean pipeline
    settings). This field determines which preset to use when running the
    upload job. A list of job types can be seen by clicking the
    ``Job Parameters`` link.
  - project_name: A list of project names can be seen by clicking the
    ``Project Names`` link.
  - subject_id: The LabTracks ID of the mouse.
  - acq_datetime: The date and time the data was acquired. Should be in ISO
    format, for example, 2024-05-27T16:07:59.
  - **modalities**: Two columns must be added per modality: a **modality**
    (chosen from the drop-down menu) and a POSIX-style path to the data
    source. For example,

    - modality0 (e.g., ecephys)
    - modality0.input_source (e.g.,
      /allen/aind/scratch/working_group/session_123456_2024-06-19/ecephys_data)
    - modality1 (e.g., behavior)
    - modality1.input_source (e.g.,
      /allen/aind/scratch/working_group/session_123456_2024-06-19/behavior_data)
    - modality2 (e.g., behavior_videos)
    - modality2.input_source (e.g.,
      /allen/aind/scratch/working_group/session_123456_2024-06-19/behavior_videos)

- Optional fields

  - platform: Standardized way of collecting and processing data (chosen from
    the drop-down menu). Note: This field is deprecated and will be removed
    when aind-data-schema 2.0 is rolled out.
  - metadata_dir: If metadata files are pre-compiled and saved to a
    directory, you can add the POSIX-style path to the directory under this
    column.
  - derivatives_dir: If a derivatives folder is available for upload, it can
    be specified as a POSIX-style path to the directory under this column.
  - s3_bucket: By default, data will be uploaded to a default bucket in S3
    managed by AIND. Please reach out to the Scientific Computing department
    if you wish to upload to a different bucket.
  - modality{n}.pipeline_id (or modality{n}.capsule_id): It is possible to
    add a Code Ocean pipeline_id or capsule_id to a modality. For more
    complex parameters, please define a job_type or use the REST API.

    - modality0.capsule_id (e.g., 123-456)
    - modality1.pipeline_id (e.g., 123-456)

  - force_cloud_sync: We recommend using this flag sparingly. This will force
    a sync to AWS even if the folder already exists in the cloud. This will
    overwrite the data already uploaded, but won't delete any data. Please
    reach out to a member of Scientific Computing for help clearing data from
    AWS.

Using the REST API
------------------

For more granular configuration, jobs can be submitted via a REST API at the
endpoint:

``http://aind-data-transfer-service/api/v2/submit_jobs``

You may pip install aind-data-transfer-service for access to the Task model;
however, this isn't strictly necessary.
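
As a rough illustration of submitting through this endpoint, the sketch below
assembles a request body from plain Python dictionaries, using only the
standard library. The field names user_email, job_type, project_name,
subject_id, acq_datetime, job_settings, and input_source come from this guide.
Everything else is an assumption made for illustration and should be checked
against the scripts in the repository: the top-level ``upload_jobs`` key, the
``modalities`` list, the nesting under ``tasks``, the helper function names,
and the sample job_type, project name, and email address.

.. code:: python

   import json
   import urllib.request
   from datetime import datetime

   # Endpoint quoted in this guide; reachable only from inside the VPN.
   SUBMIT_JOBS_URL = "http://aind-data-transfer-service/api/v2/submit_jobs"


   def build_upload_job(
       job_type, project_name, subject_id, acq_datetime, modality_sources
   ):
       """Build one upload job as a plain dictionary.

       modality_sources maps a modality abbreviation to its source folder.
       The "modalities" and "tasks" key names are assumptions.
       """
       return {
           "job_type": job_type,
           "project_name": project_name,
           "subject_id": subject_id,
           # ISO format, as this guide requests for acq_datetime.
           "acq_datetime": acq_datetime.isoformat(),
           # Assumed: list of the modality abbreviations in this job.
           "modalities": list(modality_sources),
           "tasks": {
               # Keyed by modality abbreviation, one Task-like dict each.
               "modality_transformation_settings": {
                   abbreviation: {"job_settings": {"input_source": source}}
                   for abbreviation, source in modality_sources.items()
               }
           },
       }


   def build_submit_request(user_email, upload_jobs):
       """Wrap upload jobs in a request body (assumed top-level layout)."""
       return {"user_email": user_email, "upload_jobs": list(upload_jobs)}


   def post_submit_request(payload, url=SUBMIT_JOBS_URL):
       """POST the payload as JSON and return the HTTP status code."""
       request = urllib.request.Request(
           url,
           data=json.dumps(payload).encode("utf-8"),
           headers={"Content-Type": "application/json"},
           method="POST",
       )
       with urllib.request.urlopen(request) as response:
           return response.status


   session = "/allen/aind/scratch/working_group/session_123456_2024-06-19"
   job = build_upload_job(
       job_type="ecephys",  # hypothetical job_type name
       project_name="Example Project",  # hypothetical project name
       subject_id="123456",
       acq_datetime=datetime(2024, 5, 27, 16, 7, 59),
       modality_sources={
           "ecephys": session + "/ecephys",
           "behavior": session + "/behavior",
       },
   )
   payload = build_submit_request("user@example.com", [job])

   # Inspect the body before sending it; nothing is posted here.
   print(json.dumps(payload, indent=2))

Calling ``post_submit_request(payload)`` from a machine on the VPN would send
the body to the endpoint. Building and printing the dictionary first makes it
easier to compare the request against an existing job_type before submitting.
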
You can form the post request as a dictionary. The service will perform
validation.

**Note:** The ``user_email`` field is required in the SubmitJobRequestV2 model
to receive notifications about job status.

We strongly recommend using customized job_types to simplify the requests.

For more detailed examples please check the scripts in `examples `__.

Viewing the status of submitted jobs
------------------------------------

The status of submitted jobs can be viewed at:
http://aind-data-transfer-service/jobs

This page shows the jobs submitted in the last 14 days. You can filter/sort by
status, asset name, job type, etc.

You can also click into a job's tasks to view the status and logs from
individual tasks. Please note that certain tasks, such as
compress_data/monitor_job, will alternate between "running" and
"up_for_reschedule" status until finished.

Viewing job parameters based on job type
----------------------------------------

We store pre-compiled job configurations in AWS Parameter Store based on
``job_type``. Available job types and their configurations can be viewed at:
http://aind-data-transfer-service/job_params

To request a new job type, please reach out to Scientific Computing. Admins
can manage job types directly in the Job Parameters page.

Reporting bugs or making feature requests
-----------------------------------------

Please report any bugs or feature requests here: `issues `__