User Guide
==========

Thank you for using ``aind-data-transfer-service``! This guide is
intended for scientists and engineers in AIND who wish to upload data
from our shared network drives (e.g., VAST) to the cloud.

Concepts
--------

Two concepts are central to using the service: Tasks and job_types.

-  The Task model can be imported as
   ``aind_data_transfer_service.models.core.Task``; however, it is also simple
   to build one as a Python dictionary. It has the following fields:

   -  skip_task: Whether to skip a Task, for example the check that the
      s3_folder already exists.
   -  image: A Docker image to run for a particular Task.
   -  image_version: The version of the Docker image to run.
   -  image_resources: The HPC resources being requested.
   -  job_settings: A dictionary that can be passed into the command script.
   -  command_script: The command to run in the Docker image.

-  The following Tasks run during the transform and upload pipeline.
   The more important ones are in bold.

   -  send_job_start_email: Sends an optional email to signal that a workflow
      has started.
   -  check_s3_folder_exists: Raises an error if the s3_folder already exists.
      Set skip_task to True to skip this check, but do so sparingly: skipping
      it forces a sync to AWS even if the folder already exists in the cloud,
      which overwrites the data already uploaded but does not delete any data.
   -  check_source_folders_exist: Checks that the source folders exist. This
      raises an error earlier in the workflow if the source folders were
      configured incorrectly.
   -  create_staging_folder: Creates a temporary folder on VAST to store some
      temporary files for staging.
   -  **gather_preliminary_metadata**: Automatically gathers metadata for
      subject, procedures, and data_description by running the
      aind-metadata-mapper GatherMetadataJob. You can optionally specify a
      directory with pre-compiled metadata files.
   -  check_metadata_files: Checks that the metadata files exist and are JSON.
   -  copy_derivatives_folder: Lets you specify a derivatives folder to upload.
      Set the job_settings like ``{"input_source": "path_to_folder"}``.
   -  **modality_transformation_settings**: This creates the settings for
      transforming the modality folders. This is a dictionary that is keyed
      by the modality abbreviation and contains the settings for the
      corresponding modality.
   -  compress_data/submit_job: This is a mapped task. Each Task in
      the modality_transformation_settings will run in parallel in a separate
      container here.
   -  compress_data/monitor_job: This is a mapped task that monitors the
      compress_data/submit_job tasks. It is expected that the status of this
      task alternates between "running" and "up_for_reschedule" until the
      compression job is finished.
   -  gather_final_metadata: Automatically generates a processing.json
      by running the aind-metadata-mapper GatherProcessingJob.
   -  upload_data_to_s3: Uploads the data to S3.
   -  register_data_asset: Registers the record to DocDB and Code Ocean.
   -  get_codeocean_asset_id: Retrieves the Code Ocean data asset ID from DocDB.
   -  **codeocean_pipeline_settings**: As with
      modality_transformation_settings, this is a dictionary of the parameters
      to send to the Code Ocean pipeline monitor capsule. You can specify at
      most one task per modality.
   -  run_codeocean_pipeline: This is a mapped task that will run each task in
      the codeocean_pipeline_settings dictionary in an individual container.
   -  remove_staging_folder: Removes the staging folder created above.
   -  remove_source_folders: Optionally removes the source folders from VAST.
      By default, this is turned off. Please be careful when running this task.
   -  send_job_end_email: Sends an optional email to signal that a workflow
      has ended.

-  job_type: Since the majority of workflows may use the same parameters
   repeatedly, the Tasks can be stored in AWS Parameter Store. A user will
   only need to define a job_type, and the presets will be used. This is the
   recommended way of using aind-data-transfer-service. Please reach out to a
   member of Scientific Computing for help with defining a job_type.
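
To make the Task fields described earlier in this section more concrete, a
Task can be written as a plain Python dictionary. The following is a minimal
sketch rather than the service's own code: only the field names and task names
come from the lists above, while the image name, version, command, and the
keys inside ``image_resources`` are hypothetical placeholders that should be
checked against the Task model.

.. code:: python

   import json

   # Hypothetical Task that skips the check_s3_folder_exists step.
   skip_s3_check = {"skip_task": True}

   # Hypothetical Task for one modality. Every value is a placeholder.
   ecephys_task = {
       "skip_task": False,
       "image": "example-org/example-transformation",  # placeholder image
       "image_version": "0.0.0",  # placeholder version
       "image_resources": {"memory_gb": 8},  # placeholder resource keys
       "job_settings": {
           "input_source": (
               "/allen/aind/scratch/working_group/"
               "session_123456_2024-06-19/ecephys"
           ),
       },
       "command_script": "echo 'placeholder command'",
   }

   # modality_transformation_settings is keyed by modality abbreviation.
   tasks = {
       "check_s3_folder_exists": skip_s3_check,
       "modality_transformation_settings": {"ecephys": ecephys_task},
   }

   # The structure is JSON-serializable, so it can be sent in a request body.
   print(json.dumps(tasks, indent=2))

In practice, most of these values would come from a job_type preset, so a
hand-written dictionary like this is mainly useful for one-off overrides.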

Prerequisites
-------------

-  It's assumed that raw data is already stored and organized on a
   shared network drive such as VAST.
-  The raw data should be organized by modality.
-  Please see `aind-file-standards <https://allenneuraldynamics.github.io/aind-file-standards>`__ for more information.

   -  Example 1:

      .. code:: bash

         - /allen/aind/scratch/working_group/session_123456_2024-06-19
           - /ecephys
           - /behavior
           - /behavior_videos
           - session.json
           - rig.json
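
Before submitting a job, it can help to confirm that a session folder is laid
out as in the example above. The sketch below is a hypothetical pre-flight
helper (``missing_modality_folders`` is not part of
``aind-data-transfer-service``) that reports which expected modality
subfolders are absent.

.. code:: python

   from pathlib import Path


   def missing_modality_folders(session_dir, modalities):
       """Return the modality subfolders that are absent from session_dir."""
       root = Path(session_dir)
       return [name for name in modalities if not (root / name).is_dir()]


   # Example usage with the layout shown above (the path is illustrative).
   missing = missing_modality_folders(
       "/allen/aind/scratch/working_group/session_123456_2024-06-19",
       ["ecephys", "behavior", "behavior_videos"],
   )
   print("Missing folders:", missing)

An empty list means every expected modality folder was found.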

Using the web portal
--------------------

Access to the web portal is available only through the VPN. The web
portal can be accessed at
`http://aind-data-transfer-service/ <http://aind-data-transfer-service>`__

-  Download the Excel template file by clicking the
   ``Job Submit Template`` link.

-  If there are compatibility issues with the Excel template, you can
   try saving it as a CSV file and editing the CSV file instead.

-  Create one row per data acquisition session.

-  Required fields

   -  user_email: Your email address to receive notifications about the job
      status (e.g., start, completion, failure)
   -  job_type: We store pre-compiled default configurations in AWS Parameter
      Store (e.g. modality transformation settings, Code Ocean pipeline
      settings). This field determines which preset to use when
      running the upload job. A list of job types can be seen by clicking the
      ``Job Parameters`` link.
   -  project_name: A list of project names can be seen by clicking the
      ``Project Names`` link
   -  subject_id: The LabTracks ID of the mouse.
   -  acq_datetime: The date and time the data was acquired. It should be
      in ISO format, for example, 2024-05-27T16:07:59.
   -  **modalities**: Two columns must be added per modality: a
      **modality** (chosen from the drop-down menu) and a POSIX-style path
      to the data source. For example:

      -  modality0 (e.g., ecephys)
      -  modality0.input_source (e.g.,
         /allen/aind/scratch/working_group/session_123456_2024-06-19/ecephys_data)
      -  modality1 (e.g., behavior)
      -  modality1.input_source (e.g.,
         /allen/aind/scratch/working_group/session_123456_2024-06-19/behavior_data)
      -  modality2 (e.g., behavior_videos)
      -  modality2.input_source (e.g.,
         /allen/aind/scratch/working_group/session_123456_2024-06-19/behavior_videos)

-  Optional fields

   -  platform: Standardized way of collecting and processing data
      (chosen from the drop-down menu). Note: This field is deprecated and
      will be removed when aind-data-schema 2.0 is rolled out.
   -  metadata_dir: If metadata files are pre-compiled and saved to a
      directory, you can add the POSIX-style path to the directory under
      this column.
   -  derivatives_dir: If a derivatives folder is available for upload, you
      can add the POSIX-style path to the directory under this column.
   -  s3_bucket: By default, data will be uploaded to a default bucket
      in S3 managed by AIND. Please reach out to the Scientific
      Computing department if you wish to upload to a different bucket.
   -  modality{n}.pipeline_id (or modality{n}.capsule_id): It is possible to add
      a Code Ocean pipeline_id or capsule_id to a modality. For more complex
      parameters, please define a job_type or use the REST API.

      -  modality0.capsule_id (e.g., 123-456)
      -  modality1.pipeline_id (e.g., 123-456)

   -  force_cloud_sync: We recommend using this flag sparingly. It forces
      a sync to AWS even if the folder already exists in the cloud, which
      overwrites the data already uploaded but does not delete any data.
      Please reach out to a member of Scientific Computing for help clearing data
      from AWS.
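
When many sessions need to be uploaded, the rows can also be generated with a
script instead of being typed by hand. The following is a minimal sketch,
assuming the CSV column headers match the field names listed above; the file
name ``jobs.csv``, the email address, and the job_type and project_name values
are placeholders.

.. code:: python

   import csv
   from datetime import datetime

   # One row per data acquisition session. All values are illustrative.
   session_dir = "/allen/aind/scratch/working_group/session_123456_2024-06-19"
   row = {
       "user_email": "first.last@example.org",
       "job_type": "example_job_type",  # placeholder; see Job Parameters
       "project_name": "Example Project",  # placeholder; see Project Names
       "subject_id": "123456",
       # isoformat() on a naive datetime gives YYYY-MM-DDTHH:MM:SS.
       "acq_datetime": datetime(2024, 5, 27, 16, 7, 59).isoformat(),
       "modality0": "ecephys",
       "modality0.input_source": f"{session_dir}/ecephys_data",
       "modality1": "behavior",
       "modality1.input_source": f"{session_dir}/behavior_data",
   }

   # Write a header line followed by the single session row.
   with open("jobs.csv", "w", newline="") as f:
       writer = csv.DictWriter(f, fieldnames=list(row))
       writer.writeheader()
       writer.writerow(row)

Additional sessions would be written as further rows under the same header.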

Using the REST API
------------------

For more granular configuration, jobs can be submitted via a REST API at the
endpoint:

``http://aind-data-transfer-service/api/v2/submit_jobs``

You may ``pip install aind-data-transfer-service`` for access to the Task
model; however, this isn't strictly necessary. You can form the POST request
as a dictionary, and the service will validate it.

**Note:** The ``user_email`` field is required in the SubmitJobRequestV2 model
to receive notifications about job status.

We strongly recommend using customized job_types to simplify the requests.
For more detailed examples, please check the scripts in
`examples <https://github.com/AllenNeuralDynamics/aind-data-transfer-service/tree/main/docs/examples>`__.
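
A request built as a plain dictionary might look like the sketch below. The
per-job field names mirror the required web portal fields described earlier.
The top-level ``upload_jobs`` key, the list form of ``modalities``, and the
layout of ``tasks`` are assumptions about the SubmitJobRequestV2 model and
should be checked against the linked example scripts. All values are
placeholders, and ``build_request`` is a hypothetical helper.

.. code:: python

   import json
   import urllib.request

   SUBMIT_URL = "http://aind-data-transfer-service/api/v2/submit_jobs"

   # One upload job. Field values are placeholders.
   upload_job = {
       "job_type": "example_job_type",
       "project_name": "Example Project",
       "subject_id": "123456",
       "acq_datetime": "2024-05-27T16:07:59",
       "modalities": ["ecephys"],  # assumed representation
       "tasks": {  # assumed layout, keyed by task name
           "modality_transformation_settings": {
               "ecephys": {
                   "job_settings": {
                       "input_source": (
                           "/allen/aind/scratch/working_group/"
                           "session_123456_2024-06-19/ecephys_data"
                       )
                   }
               }
           }
       },
   }

   # user_email is required to receive notifications about job status.
   payload = {
       "user_email": "first.last@example.org",
       "upload_jobs": [upload_job],  # assumed key name
   }


   def build_request(body):
       """Build (but do not send) a JSON POST request for the endpoint."""
       return urllib.request.Request(
           SUBMIT_URL,
           data=json.dumps(body).encode("utf-8"),
           headers={"Content-Type": "application/json"},
           method="POST",
       )


   request = build_request(payload)

   # To submit while connected to the VPN:
   # with urllib.request.urlopen(request) as response:
   #     print(response.status, response.read().decode("utf-8"))

Because the service validates the request, a malformed body should come back
as an error response rather than as a submitted job.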


Viewing the status of submitted jobs
------------------------------------

The status of submitted jobs can be viewed at:
http://aind-data-transfer-service/jobs

This page shows the jobs submitted in the last 14 days. You can filter/sort by
status, asset name, job type, etc. You can also click into a job's tasks to view
the status and logs from individual tasks.

Please note that certain tasks, such as compress_data/monitor_job, will
alternate between "running" and "up_for_reschedule" status until finished.

Viewing job parameters based on job type
----------------------------------------

We store pre-compiled job configurations in AWS Parameter Store based on ``job_type``.
Available job types and their configurations can be viewed at:
http://aind-data-transfer-service/job_params

To request a new job type, please reach out to Scientific Computing.
Admins can manage job types directly on the ``Job Parameters`` page.

Reporting bugs or making feature requests
-----------------------------------------

Please report any bugs or feature requests here:
`issues <https://github.com/AllenNeuralDynamics/aind-data-transfer-service/issues/new/choose>`__