===================
Synicix ML Pipeline
===================

Prerequisites
=============

- DataJoint
- Docker
- Kubernetes (K8s)
- PyTorch

Introduction
============

Synicix ML Pipeline is a PyTorch-based machine learning pipeline built to facilitate training machine learning models on various datasets with different training configurations, and it provides other useful tools such as FP16 mixed-precision training.

To help keep improving the pipeline, please file bugs and feature requests at the GitHub repository listed under the Other Resources section.

Overview of the Architecture
============================

.. image:: images/synicix_ml_pipeline_tutorial/synicix_ml_pipeline_erd.png
    :width: 100%
    :align: center

DatasetConfig, ModelConfig, TrainingConfig
------------------------------------------

At the core of the pipeline are **DatasetConfig**, **ModelConfig**, and **TrainingConfig**, shown in their respective tables above. As their names suggest, these core tables define which dataset and model to use and how to train.

TrainingTask
------------

Following the three core tables is the **TrainingTask** table, which indicates which combinations of **DatasetConfig**, **ModelConfig**, and **TrainingConfig** should be fed through the pipeline. Typically it is used to restrict training to a subset of all possible combinations of the three core tables.

TrainingResult
--------------

Finally, the **TrainingResult** table stores the training results of every combination in the **TrainingTask** table.

**DatasetConfig**, **ModelConfig**, **TrainingConfig**, and **TrainingTask** are **dj.Manual** tables, while **TrainingResult** is **dj.Computed**.

Handling of Abstraction
=======================

To accommodate a variety of needs and use cases, the pipeline was built with heavy abstraction in mind, using Python's import machinery. Because of this, you will often see table definitions requiring attribute pairs such as **model_class_module_name** and **model_class_name**, which tell the pipeline which class to import and from where. Further details are covered in the following sections.
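As a rough illustration of this pattern, such module/class name pairs can be resolved with Python's ``importlib`` along the lines of the sketch below (``load_class`` is a hypothetical helper shown for illustration only, not part of the pipeline's API):

.. code-block:: python
    :linenos:

    import importlib


    def load_class(module_name, class_name):
        """Import the given module and return the class named class_name from it."""
        module = importlib.import_module(module_name)
        return getattr(module, class_name)


    # Hypothetical usage with a module/class name pair as stored in the config tables
    model_class = load_class('synicix_ml_pipeline.models.SimpleMLP', 'SimpleMLP')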
Using the Pipeline
==================

Jupyter Notebook Example
------------------------

https://github.com/cajal/SynicixMLPipeline/blob/master/Pipeline%20Configuration.ipynb

**Note: the notebook only shows how to insert configurations, not how to create your own datasets and models for the pipeline; the following sections go into more detail on that.**

Directory Setup
===============

Before running any code, the pipeline typically requires certain global variable paths to be defined:

- **dj.config['stores']['external_training_result']:** where to store the external blobs for the TrainingResult table
- **dataset_dir:** where the dataset files are stored
- **dataset_cache_dir:** where to cache the dataset files on the local machine during training
- **model_save_dir:** where to save the model checkpoint files

Here is an example:

.. code-block:: python
    :linenos:

    import os
    import datajoint as dj

    if os.name == 'nt':
        dj.config['stores'] = dict(external_training_result=dict(protocol='file', location='\\mnt\\scratch07\\external'))
        dataset_dir = '\\\\at-storage3.ad.bcm.edu\\scratch07\\synicix_dev\\datasets\\'
        dataset_cache_dir = 'C:\\\\dataset_cache\\'
        model_save_dir = '\\\\at-storage3.ad.bcm.edu\\scratch07\\synicix_dev\\model_storage\\'
    elif os.name == 'posix':
        dj.config['stores'] = dict(external_training_result=dict(protocol='file', location='/mnt/scratch07/external/training_result'))
        dataset_dir = '/mnt/scratch07/synicix_dev/datasets/'
        dataset_cache_dir = 'dataset_cache/'
        model_save_dir = '/mnt/scratch07/synicix_dev/model_storage/'

DatasetConfig
=============

A dj.Manual table class that handles the storage of dataset configs, with details on which dataset class and parameters to load the dataset and dataloaders with.

Definition
----------

.. code-block:: python
    :linenos:

    definition = """
    dataset_config_md5_hash         : char(128)
    ---
    dataset_file_name               : varchar(256)
    dataset_type                    : varchar(256)
    dataset_class_module_name       : varchar(256)
    dataset_class_name              : varchar(256)
    dataset_class_params            : longblob
    train_sampler_module_name       : varchar(256)
    train_sampler_class_name        : varchar(256)
    train_sampler_class_params      : longblob
    validation_sampler_module_name  : varchar(256)
    validation_sampler_class_name   : varchar(256)
    validation_sampler_class_params : longblob
    test_sampler_module_name        : varchar(256)
    test_sampler_class_name         : varchar(256)
    test_sampler_class_params       : longblob
    input_shape                     : longblob
    output_shape                    : longblob
    additional_model_params         : longblob
    """

Additional Details on the Attributes
------------------------------------

- **dataset_class_module_name, dataset_class_name, dataset_class_params:** which PyTorch-based dataset class the pipeline should import, and the parameters to construct it with (user defined) [REQUIRED]
- **(train/validation/test)_sampler_module_name, _sampler_class_name, _sampler_class_params:** which PyTorch sampler class should be passed into the respective dataloader (user defined) [if not defined, the dataloader falls back to PyTorch's default sampler]
- **input_shape, output_shape:** computed from the validation dataset and passed to the model during model creation [requires validation examples]
- **additional_model_params:** obtained by calling get_additional_model_params on the dataset class [if no additional params are needed, this should be dict()]

Additional Notes on Dataloader Default Behavior
-----------------------------------------------

- By default, if no sampler is defined for train, shuffle is set to True; otherwise it is set to False.
- By default, if no sampler is defined for validation/test, shuffle is set to False.

Implementation of a Dataset Class
---------------------------------

DatasetConfig expects a PyTorch-based dataset class with a few additional requirements.

**The following functions need to be defined:**

- **__len__(self):** PyTorch Dataset requirement
- **__getitem__(self, index):** PyTorch Dataset requirement
- **get_additional_model_params(self):** DatasetConfig requirement, used to define additional_model_params

**Example:** :ref:`NeuroDataDataset`
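As a rough sketch of what such a class might look like (the class below and its constructor arguments are hypothetical; the real :ref:`NeuroDataDataset` differs):

.. code-block:: python
    :linenos:

    import torch
    from torch.utils.data import Dataset


    class ArrayDataset(Dataset):
        """Hypothetical dataset class satisfying the DatasetConfig requirements."""

        def __init__(self, inputs, targets):
            self.inputs = torch.as_tensor(inputs, dtype=torch.float32)
            self.targets = torch.as_tensor(targets, dtype=torch.float32)

        def __len__(self):
            # PyTorch Dataset requirement
            return len(self.inputs)

        def __getitem__(self, index):
            # PyTorch Dataset requirement
            return self.inputs[index], self.targets[index]

        def get_additional_model_params(self):
            # DatasetConfig requirement; return dict() when nothing extra needs to be passed to the model
            return dict()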
Inserting into DatasetConfig
----------------------------

DatasetConfig has an insert_tuples function that handles the computation of the MD5 hash as well as other information such as input_shape. As such, one should always use this function to insert into the table.

Below is an example of how to insert multiple DatasetConfigs (**note: dataset_dir and dataset_cache_dir must be defined ahead of time**):

.. code-block:: python
    :linenos:

    # Get all dataset file names under dataset_dir and insert them
    dataset_file_names = os.listdir(dataset_dir)

    tuple_dicts_to_insert = []

    for dataset_file_name in dataset_file_names:
        tuple_dict = dict(
            dataset_file_name=dataset_file_name,
            dataset_type='NeuroDataDataset',
            dataset_class_module_name='synicix_ml_pipeline.dataset_classes.NeuroDataDataset',
            dataset_class_name='NeuroDataDataset',
            dataset_class_params=dict(mode='full-encoding'),
            train_sampler_module_name='',
            train_sampler_class_name='',
            train_sampler_class_params=dict(),
            validation_sampler_module_name='',
            validation_sampler_class_name='',
            validation_sampler_class_params=dict(),
            test_sampler_module_name='',
            test_sampler_class_name='',
            test_sampler_class_params=dict(),
        )

        tuple_dicts_to_insert.append(tuple_dict)

    dataset_config.insert_tuples(tuple_dicts_to_insert)

ModelConfig
===========

A dj.Manual table class that handles the storage of PyTorch model definitions, along with some helper functions to load the models.

Definition
----------

.. code-block:: python
    :linenos:

    definition = """
    model_config_md5_hash   : char(128)     # MD5 Hash of network_class_name + network_module_code
    ---
    model_class_module_name : varchar(256)
    model_class_name        : varchar(256)  # Class name of the network
    model_class_params      : longblob
    """

Implementation of a Model Class
-------------------------------

The pipeline expects a standard PyTorch model with a few additional requirements.

**The following functions need to be defined:**

- **__init__(self, input_shape, output_shape):** standard PyTorch module constructor, where input_shape and output_shape are required by the pipeline
- **forward(self, x):** PyTorch module requirement, **must return two values: the output and a regularization loss**

**Example:** :ref:`SimpleMLP`
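For reference, a minimal model satisfying these requirements might look like the sketch below (assumptions: ``input_shape`` and ``output_shape`` are treated as per-sample shapes and a zero regularization loss is returned; the actual SimpleMLP differs):

.. code-block:: python
    :linenos:

    import numpy as np
    import torch
    import torch.nn as nn


    class ExampleMLP(nn.Module):
        """Hypothetical model class following the pipeline's model conventions."""

        def __init__(self, input_shape, output_shape, hidden_size=1000):
            super().__init__()
            in_features = int(np.prod(input_shape))
            out_features = int(np.prod(output_shape))
            self.net = nn.Sequential(
                nn.Flatten(),
                nn.Linear(in_features, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, out_features),
            )

        def forward(self, x):
            output = self.net(x)
            # The pipeline expects two return values: the output and a regularization loss
            regularization_loss = torch.zeros((), device=x.device)
            return output, regularization_loss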
Inserting into ModelConfig
--------------------------

ModelConfig has an insert_tuples function that handles the computation of the MD5 hash; as such, one should always use this function to insert into the table.

Below is an example of how to insert into ModelConfig:

.. code-block:: python
    :linenos:

    tuple_dicts = []

    tuple_dict = dict(
        model_class_module_name='synicix_ml_pipeline.models.SimpleMLP',
        model_class_name='SimpleMLP',
        model_class_params=dict(num_hidden_layers=1, hidden_size=1000, l1_loss_lamda=0.0, l2_loss_lamda=0.0)
    )

    tuple_dicts.append(tuple_dict)

    model_config.insert_tuples(tuple_dicts)

TrainingConfig
==============

Definition
----------

.. code-block:: python
    :linenos:

    definition = """
    training_config_md5_hash    : char(128)         # MD5 hash of the attributes below
    ---
    trainer_class_module_name   : varchar(256)
    trainer_class_name          : varchar(256)
    trainer_class_params        : longblob
    batch_size                  : smallint unsigned
    epoch_limit                 : int unsigned
    optimizer_class_module_name : varchar(256)
    optimizer_class_name        : varchar(256)
    optimizer_class_params      : longblob
    criterion_class_module_name : varchar(256)
    criterion_class_name        : varchar(256)
    criterion_class_params      : longblob
    """

Additional Details on the Attributes
------------------------------------

- **trainer_class_module_name, trainer_class_name, trainer_class_params:** which trainer the pipeline should import and use; the default is NNTrainer, located under synicix_ml_pipeline/trainers/NNTrainer.py (user defined) [REQUIRED]
- **optimizer_class_module_name, optimizer_class_name, optimizer_class_params:** which PyTorch-based optimizer the pipeline should import and use (user defined) [REQUIRED]
- **criterion_class_module_name, criterion_class_name, criterion_class_params:** which PyTorch-based criterion the pipeline should import and use (user defined) [REQUIRED]

Implementing a Trainer Class
----------------------------

**Required init parameters:**

- **train_dataloader** (PyTorch dataloader)
- **validation_dataloader** (PyTorch dataloader)
- **test_dataloader** (PyTorch dataloader)
- **device** (PyTorch device)
- **model_class** (user defined)
- **model_class_params** (user defined dict)
- **optimizer_class** (user defined)
- **optimizer_class_params** (user defined dict)
- **criterion_class** (user defined)
- **criterion_class_params** (user defined dict)
- **model_save_path** (str)
- **max_epoch** (int)

**Required functions:**

- **train(self):** starts the training process
- **validate(self):** runs the validation dataset
- **evaluate(self, return_outputs_targets_and_loss=False):** runs the test dataset and returns the loss, or a dict of (outputs, targets, loss)
- **load_best_performing_model(self):** loads the best-performing model once training is done

**Example:** :ref:`NNTrainer`
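To make the expected interface concrete, below is a bare-bones trainer skeleton (a sketch only, not the actual NNTrainer; the model construction, checkpointing, and loss bookkeeping details are assumptions, and the outputs/targets dict branch of evaluate is omitted):

.. code-block:: python
    :linenos:

    import torch


    class MinimalTrainer:
        """Hypothetical trainer exposing the interface the pipeline expects."""

        def __init__(self, train_dataloader, validation_dataloader, test_dataloader, device,
                     model_class, model_class_params, optimizer_class, optimizer_class_params,
                     criterion_class, criterion_class_params, model_save_path, max_epoch):
            self.train_dataloader = train_dataloader
            self.validation_dataloader = validation_dataloader
            self.test_dataloader = test_dataloader
            self.device = device
            self.model = model_class(**model_class_params).to(device)
            self.optimizer = optimizer_class(self.model.parameters(), **optimizer_class_params)
            self.criterion = criterion_class(**criterion_class_params)
            self.model_save_path = model_save_path
            self.max_epoch = max_epoch

        def train(self):
            # Start the training process
            for epoch in range(self.max_epoch):
                self.model.train()
                for inputs, targets in self.train_dataloader:
                    inputs, targets = inputs.to(self.device), targets.to(self.device)
                    outputs, regularization_loss = self.model(inputs)
                    loss = self.criterion(outputs, targets) + regularization_loss
                    self.optimizer.zero_grad()
                    loss.backward()
                    self.optimizer.step()
                torch.save(self.model.state_dict(), self.model_save_path)
                self.validate()

        def validate(self):
            # Run the validation dataset
            return self._mean_loss(self.validation_dataloader)

        def evaluate(self, return_outputs_targets_and_loss=False):
            # Run the test dataset and return the loss
            return self._mean_loss(self.test_dataloader)

        def load_best_performing_model(self):
            # Load the best-performing model once training is done
            self.model.load_state_dict(torch.load(self.model_save_path))

        def _mean_loss(self, dataloader):
            self.model.eval()
            total_loss, num_batches = 0.0, 0
            with torch.no_grad():
                for inputs, targets in dataloader:
                    inputs, targets = inputs.to(self.device), targets.to(self.device)
                    outputs, _ = self.model(inputs)
                    total_loss += self.criterion(outputs, targets).item()
                    num_batches += 1
            return total_loss / max(num_batches, 1)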
TrainingTask
============

This table serves as a subset of all possible DatasetConfig, ModelConfig, and TrainingConfig combinations. Whatever is inserted here will be trained, with its result recorded in TrainingResult.

Definition
----------

.. code-block:: python
    :linenos:

    definition = """
    training_task_md5_hash : char(128)  # MD5 hash of the attributes below
    ---
    -> DatasetConfig
    -> ModelConfig
    -> TrainingConfig
    """

Inserting into TrainingTask
---------------------------

.. code-block:: python
    :linenos:

    dataset_config_keys = DatasetConfig().fetch('KEY')
    model_config_keys = ModelConfig().fetch('KEY')
    training_config_keys = TrainingConfig().fetch('KEY')

    tuple_dicts = []

    # Insert all possible combinations based on the restrictions above
    for dataset_config_key in dataset_config_keys:
        for model_config_key in model_config_keys:
            for training_config_key in training_config_keys:
                tuple_dict = dict()
                tuple_dict.update(dataset_config_key)
                tuple_dict.update(model_config_key)
                tuple_dict.update(training_config_key)
                tuple_dicts.append(tuple_dict)

    training_task.insert_tuple(tuple_dicts)

TrainingResult
==============

Definition
----------

.. code-block:: python
    :linenos:

    definition = """
    -> TrainingTask
    ---
    test_score                    : float
    training_epoch_loss_history   : blob@external_training_result
    validation_epoch_loss_history : blob@external_training_result
    regularization_loss_history   : blob@external_training_result
    model_class_params_history    : blob@external_training_result
    model_save_path               : varchar(256)
    utc_insert_time = CURRENT_TIMESTAMP : timestamp
    """

Populating TrainingResult
-------------------------

TrainingResult requires dataset_dir, dataset_cache_dir, model_save_dir, and num_workers to be defined, where num_workers is the number of dataloader worker threads. Typically, TrainingResult is populated via a .py script and a Kubernetes job.

**training_script.py**

.. code-block:: python
    :linenos:

    import os
    import sys
    import datajoint as dj
    from synicix_ml_pipeline.datajoint_tables.TrainingResult import TrainingResult

    if __name__ == '__main__':
        if os.name == 'nt':
            dj.config['stores'] = dict(external_training_result=dict(protocol='file', location='\\mnt\\scratch07\\external'))
            dataset_dir = '\\\\at-storage3.ad.bcm.edu\\scratch07\\synicix_dev\\datasets\\'
            dataset_cache_dir = 'C:\\\\dataset_cache\\'
            model_save_dir = '\\\\at-storage3.ad.bcm.edu\\scratch07\\synicix_dev\\model_storage\\'
        elif os.name == 'posix':
            dj.config['stores'] = dict(external_training_result=dict(protocol='file', location='/mnt/scratch07/external/training_result'))
            dataset_dir = '/mnt/scratch07/synicix_dev/datasets/'
            dataset_cache_dir = 'dataset_cache/'
            model_save_dir = '/mnt/scratch07/synicix_dev/model_storage/'

        # Get num_workers from args
        num_workers = int(sys.argv[1])

        # Create the TrainingResult instance
        training_result = TrainingResult(dataset_dir=dataset_dir, dataset_cache_dir=dataset_cache_dir, model_save_dir=model_save_dir, num_workers=num_workers)

        # Begin populating
        training_result.populate(reserve_jobs=True, order='random')
**Kubernetes population YAML file**

.. code-block:: yaml
    :linenos:

    apiVersion: batch/v1              # Default Kubernetes API for Jobs
    kind: Job                         # Tells Kubernetes what kind of object it is working with
    metadata:
      name: synicix-ml-pipeline       # Name of the Job
    spec:
      parallelism: 110
      template:                       # Pod template
        spec:
          restartPolicy: Never        # Options are OnFailure and Never
          hostNetwork: true           # Allows the pod to use the host network for internet access
          tolerations:                # Allows the pod to be scheduled onto gpu-only machines; remove this if you are not using a gpu
            - key: "gpu"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
          volumes:
            - name: mnt
              hostPath:
                path: /mnt            # Directory on the host machine to be mounted
          affinity:                   # Affinity to select nodes with certain GPU memory sizes
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:   # Require nodes to have this label
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: gpu_mem_size     # Target label is gpu_mem_size
                        operator: In          # Key must have one of the following values
                        values:
                          - 8GB
                          - 11GB
                          - 12GB
                          - 24GB
                          - 32GB
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  preference:
                    matchExpressions:
                      - key: tensor_cores
                        operator: In
                        values:
                          - "true"
          containers:                                     # Container level
            - name: synicix-ml-pipeline                   # Container name (can be set to whatever you like)
              image: synicix/pytorch-fp16-base:latest     # Docker image hosted on Docker Hub
              resources:
                limits:
                  nvidia.com/gpu: 1                       # Requesting 1 GPU
              volumeMounts:                               # Container references to the volumes defined above
                - name: mnt                               # Name of the volume defined above
                  mountPath: /mnt                         # Where to mount it in the container
              env:                                        # Secrets created under the user namespace, set as environment variables
                - name: DJ_HOST
                  valueFrom:
                    secretKeyRef:
                      name: datajoint-credentials
                      key: DJ_HOST
                - name: DJ_USER
                  valueFrom:
                    secretKeyRef:
                      name: datajoint-credentials
                      key: DJ_USER
                - name: DJ_PASS
                  valueFrom:
                    secretKeyRef:
                      name: datajoint-credentials
                      key: DJ_PASS
                - name: GITHUB_USERNAME
                  valueFrom:
                    secretKeyRef:
                      name: github-credentials
                      key: GITHUB_USERNAME
                - name: GITHUB_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: github-credentials
                      key: GITHUB_PASSWORD
              command: ["/bin/bash"]                      # Entry point for the container
              args: ["-c", "git clone https://$(GITHUB_USERNAME):$(GITHUB_PASSWORD)@github.com/Synicix/SynicixMLPipeline.git \
                && pip3 install /SynicixMLPipeline \
                && python3 -u /SynicixMLPipeline/K8/TrainingDeployment/training_script.py 0"]   # Shell commands to clone the repo and run the python script

Other Resources
===============

| **GitHub**: https://github.com/cajal/SynicixMLPipeline