Application Framework Options
The Latent AI Machine Learning Application Framework (AF) is a modular framework that lets users bring their own data, quickly train and evaluate different models on that data, and select the best performing model that meets their design requirements. Models exported from AF can be optimized, compiled, and evaluated on target edge hardware to verify that those criteria are met.
When used with LEIP Recipes, AF provides default configurations designed to give good performance across a broad set of applications. Depending on your dataset, you may need to change the defaults, such as the input shapes. You may also wish to alter parameters to explore different learning rates, or to trade accuracy for faster training, or vice versa.
AF builds on top of many component technologies, including Hydra and PyTorch Lightning, giving users configurable access to many underlying components.
We recommend you start with LEIP Recipes to gain experience with AF and a number of available models that not only work with AF out-of-the-box, but are also guaranteed to compile, optimize, and run on many different hardware platforms. The following sections will give you an introduction to AF’s underlying capabilities if you would like to experiment with different parameters to find more optimal settings for your dataset. If you have more specific needs or different models you would like to use in this modular fashion, please contact us at Latent AI.
Basic Commands
AF supports a set of commands (modes), each of which configures a different aspect of the ML process:
train
: Train a model
evaluate
: Evaluate a trained model
predict
: Visualize and summarize the predictions of a trained model
vizdata
: Visualize the input data to verify that it is correctly ingested
export
: Export a trained model for further processing in the LEIP SDK (i.e., compile, optimize, etc.)
export_data
: Export the dataset defined in the configuration to another format (for example, export detection data in the COCO format). *
data_report
: Generate a PDF report on the configured dataset with useful statistical and sample information. *
* = Experimental mode.
Each mode can be set as shown here (evaluate mode as example):
af [...] command=evaluate
Historically, the default mode (that is, if no command is specified) is train. This behavior is deprecated and will be removed in future versions of AF.
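Because of this, it is safest to always pass the command explicitly. A minimal training call might look like the following sketch (the recipe name is illustrative):
af --config-name=classifier-recipe command=train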
Available Models
Refer to the list of supported detector models or supported classifiers for the recipe_name and architecture names.
The configuration and architecture names are used as follows:
af --config-name=<recipe_name> model.architecture=<architecture_name> command=<command_name>
Here is an example exporting a YOLOv5 Small model:
af --config-name=yolov5 model.architecture=yolov5s command=export
Here is an example exporting a timm:inception_v4 classifier:
af --config-name=classifier-recipe model.architecture=timm:inception_v4 command=export
How Do I?
Train a Model With My Data: BYOD
The BYOD instructions differ depending on the type of recipe. Instructions are available for both Classifier and Detector models.
More Training Options
Change the Learning Rate
Use the following command to change the learning rate:
af [...] model.module.optimizer.lr=0.1
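As a fuller sketch, the learning rate override can be combined with the recipe and architecture options shown earlier (the recipe, architecture, and value are illustrative):
af --config-name=classifier-recipe model.architecture=timm:inception_v4 command=train model.module.optimizer.lr=0.01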
Change the Processing Resolution
Use the following command to change the processing resolution:
af [...] task.width=384 task.height=384
Change the Batch Size
Use the following command to change the batch size:
af [...] task.bs_train=16 task.bs_val=64
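These overrides can be combined in a single call; for example, a sketch that sets both the resolution and the batch sizes (values are illustrative):
af [...] task.width=384 task.height=384 task.bs_train=16 task.bs_val=64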
Change the Optimizer
Use the following command to change the optimizer:
af [...] model.module.optimizer=timm.adamw
For all available optimizer options, visit the Optimizers section.
Add ML Metrics Logging -- Tensorboard
Use the following command to add Tensorboard to the ML metrics logging:
af [...] loggers=tensorboard
The logs will be stored in the configured experiment output folder, ./outputs by default.
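Assuming TensorBoard is installed in your environment, the logs can then be inspected with the standard TensorBoard CLI:
tensorboard --logdir ./outputs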
Add ML Metrics Logging -- Neptune.ai
Use the following command to add Neptune.ai to the ML metrics logging:
af [...] loggers=neptune loggers.neptune.project="<your_neptune_project_id>"
Note: You have to provide your Neptune credentials in NEPTUNE_API_TOKEN; refer to https://docs.neptune.ai/getting-started/installation#authentication-neptune-api-token. The logs will be stored in your Neptune project.
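A typical sequence, assuming you already have a Neptune account and API token, looks like this sketch (the token and project ID are placeholders):
export NEPTUNE_API_TOKEN="<your_neptune_api_token>"
af [...] loggers=neptune loggers.neptune.project="<your_neptune_project_id>"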
Change the Learning Rate Scheduler
The scheduler is a group of configuration values, so the command-line syntax differs from single-value overrides:
af [...] model/module/scheduler=OneCycle
Increase the Number of Training Epochs
Use the following command to increase the number of training epochs:
af [...] trainer.max_epochs=42
Limit the Training Time
Use the following command to limit total training time, here to 2 hours and 42 minutes:
af [...] trainer.max_time="00:02:42:00"
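Both limits can be set together, and training stops at whichever is reached first (values are illustrative):
af [...] trainer.max_epochs=42 trainer.max_time="00:02:42:00"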
Change the Display and Log Metrics
For the classifiers:
# add one or more metrics
+model/module/metrics@model.module.metrics=[AUROC,AveragePrecision]
# override to one or more metrics
model/module/metrics=[Accuracy,AUROC,AveragePrecision]
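For example, a full classifier training call that overrides the metrics might look like this sketch (the recipe name is illustrative):
af --config-name=classifier-recipe command=train model/module/metrics=[Accuracy,AUROC,AveragePrecision]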
Train with Multiple GPUs on One Machine
Enter the following commands to train with multiple GPUs on one machine.
# use all available gpus
af [...] trainer.devices=-1
# use first and third available gpus
af [...] trainer.devices=[0,2]
# use two gpus
af [...] trainer.devices=2
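For example, a sketch combining multi-GPU training with an epoch limit (the recipe name and values are illustrative):
af --config-name=classifier-recipe command=train trainer.devices=-1 trainer.max_epochs=42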
Get More Debug Output in the Console
Enter the following command to receive more debugging output:
af [...] hydra.verbose=[af]
When Does Training Stop?
Training generally stops when any of the following conditions is met:
trainer.max_epochs is reached.
trainer.max_time is reached.
An early termination callback is enabled and its conditions are met, for example EarlyStopping based on val_loss_epoch.
The user enters Ctrl-C.
Specify a Trained Checkpoint for Further Processing (Evaluate, Visualize Predictions, and Export)
The training process will generate checkpoints of the best models as they are being trained. They end up in the artifacts/ folder (for example, if your current folder is /latentai):
/latentai/artifacts/train/2022-06-15_13-53-38_task_leip_classifier/epoch=2-step=303.ckpt
Use the following syntax in order to specify such an existing checkpoint for continuing training, exporting, evaluating, or visualizing:
af [...] +checkpoint=<path of .ckpt file>
The pathname could be absolute or relative to the current working directory.
If you would like to change the checkpoint filename, for example to simplify scripting for automated test and integration, you can use the following option:
callbacks.checkpoint.filename=<checkpoint filename>
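For example, to give the checkpoint a fixed name for a scripted pipeline (the filename is illustrative):
af [...] command=train callbacks.checkpoint.filename=best_model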
Export a Model
To export a pre-trained model (for example, the YOLOv5 pre-trained on MS COCO), call it with the same configuration used for training and add command=export
af [...] command=export
Notice that if you do not provide a +checkpoint=</absolute/path/to.ckpt>, AF pulls in the pretrained weights of the model by default.
Export a Trained Model from a Checkpoint
To export a newly trained model, locate the checkpoint that you would like to export and call it with the same configuration used for training:
af [...] command=export +checkpoint=<path of .ckpt file>
The default location for the exported .pt file will be:
./artifacts/export/[task.moniker]_[backbone]_[batch_size]x[height]x[width].pt
Example: leip_classifier_ptcv-mobilenetv2_w1_1x224x224.pt
Export Using a Specified Batch Size
The default batch size for the exported model is 1. If you would like to trace using a different batch size:
af [...] command=export [...] export.jit.batch_size=8
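Putting these together, a full export sketch with an explicit checkpoint and batch size might look like this (the recipe, checkpoint path, and batch size are illustrative):
af --config-name=yolov5 model.architecture=yolov5s command=export +checkpoint=/latentai/artifacts/train/<run_folder>/epoch=2-step=303.ckpt export.jit.batch_size=8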
Evaluate with a Trained Model Checkpoint
Locate the checkpoint that you would like to evaluate and call it with the same configuration used for training. Call with command=evaluate
af [...] command=evaluate +checkpoint=<path of .ckpt file>
The AF will predict and run evaluation metrics over the entire validation set. At the end, you will see a metrics report, which will also be exported to /latentai/artifacts/evaluate/<data_name>/metrics_report.json
By default, the recipe you select will define what evaluation protocol to use.
The following protocols are available for detection models:
Evaluate using MS COCO protocol:
eval=tm-map
Evaluate the mAP for 0.5:0.95 with step 0.05:
eval=coco
Evaluate using Pascal protocol, computes the AP for each class:
eval=pascal
Generate Precision versus Recall curves for each class:
eval=prc
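For example, to evaluate a detector checkpoint with the Pascal protocol instead of the recipe default (the checkpoint path is a placeholder):
af [...] command=evaluate +checkpoint=<path of .ckpt file> eval=pascal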
Visualize Predictions with a Trained Model Checkpoint
Locate the checkpoint that you would like to use for prediction and visualization, and call it with the same configuration used for training. Call with command=predict:
af [...] command=predict +checkpoint=<path of .ckpt file>
After the command runs, the images with predictions will also be exported to /latentai/artifacts/predict/<data_name>/
Confidence Threshold
When visualizing detector model predictions, it can be useful to set a minimum confidence for bounding boxes that we want to display:
af [...] command=predict +checkpoint=<path> predict.annotation_renderers.bbox.confidence_threshold=0.1

Left: Ground truth. Right: Predictions with confidence > 0.1

Left: Ground truth. Right: Predictions with confidence > 0.5
Show Ground Truth
The default behavior is to display the ground truth and the predictions side by side. Users can optionally choose not to include the ground truth next to the predictions:
af [...] command=predict +checkpoint=<path> predict.annotation_renderers.bbox.show_gt=False
Show Errors
When visualizing the ground truth next to the predictions, it can be useful to highlight false positives and false negatives:
af [...] command=predict +checkpoint=<path> predict.annotation_renderers.bbox.show_errors=True
Note that the visualizer will consider an error any prediction with an IoU < 0.5 with any ground truth box.

Red on the right image (predictions) means false positives.
Red on the left (ground truth) means false negative / missed prediction.
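These prediction options can be combined in one call; for example, a sketch with a higher confidence threshold and error highlighting (the checkpoint path and threshold are illustrative):
af [...] command=predict +checkpoint=<path of .ckpt file> predict.annotation_renderers.bbox.confidence_threshold=0.5 predict.annotation_renderers.bbox.show_errors=True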
Export datasets in other formats
The export_data command exports the dataset defined in the configuration to another format. For example, you can export a detection dataset that is originally in the Pascal format to the COCO format.
General usage
af [...] command=export_data export_data=<output_format>
Options for output_format are:
coco
: for detection data
pascal
: for detection data
kitti
: for detection data
classifier_imagefolder
: for classifier data
Advanced usage
If you want to export only a small subset of the dataset, you can use the RandomSubset dataset wrapper to subset the data that will be exported.
af data=composable/randomsubset \
data@data.full_dataset=../data_old/torchvision/coco-detection-90 \
data.module.dataset_generator.fraction=0.2 \
data.module.dataset_generator.random_seed=123 \
command=export_data \
export_data=coco
where the data config passed to data@data.full_dataset can be any of the supported datasets or the name of a data config you created via the BYOD tutorials.
The above command will take the 90-class version of the COCO detection data and keep only a 20% fraction of it, selecting samples using random seed 123. That new, smaller dataset will then be exported in the coco format using the export_data command.
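Following the same pattern, a sketch that exports a 10% subset of a classifier dataset as an image folder (the data config, fraction, and seed are illustrative):
af data=composable/randomsubset \
data@data.full_dataset=<your classifier data config> \
data.module.dataset_generator.fraction=0.1 \
data.module.dataset_generator.random_seed=42 \
command=export_data \
export_data=classifier_imagefolder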
All Optimizer and Scheduler Options
Optimizer
How to Configure
The optimizer can be changed on the command line:
af [...] model.module.optimizer.moniker=timm.adamw
Change in YAML:
model:
  module:
    optimizer:
      moniker: timm.adamw
Supported Values
torch:Adadelta
torch:Adagrad
torch:Adam
torch:AdamW
torch:SparseAdam
torch:Adamax
torch:ASGD
torch:LBFGS
torch:NAdam
torch:RAdam
torch:RMSprop
torch:Rprop
torch:SGD
timm:sgd
timm:nesterov
timm:momentum
timm:sgdp
timm:adam
timm:adamw
timm:adamp
timm:nadam
timm:radam
timm:adamax
timm:adabelief
timm:radabelief
timm:adadelta
timm:adagrad
timm:adafactor
timm:lamb
timm:lambc
timm:larc
timm:lars
timm:nlarc
timm:nlars
timm:madgrad
timm:madgradw
timm:novograd
timm:nvnovograd
timm:rmsprop
timm:rmsproptf
Schedulers
How to Configure
The schedulers can be configured either via the command line OR by modifying the recipe YAML file directly.
Note: internally, schedulers are groups of values, so the syntax for command-line and YAML changes is different from changes to single values.
Change the scheduler on command line:
af [...] model/module/scheduler=OneCycle
Supported Values
ExponentialDecay
ReduceOnPlateau
OneCycle
OneCycleAnnealed
model/module/scheduler=ExponentialDecay
Reference: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ExponentialLR.html#exponentiallr

| Parameter | Parameter Explanation |
|---|---|
| | Starting LR (e.g., 0.01) |
| | Decay rate (e.g., 0.95) |
| | Any other parameter of the torch scheduler |
model/module/scheduler=ExponentialDecayScaled
Reference: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ExponentialLR.html#exponentiallr

| Parameter | Parameter Explanation |
|---|---|
| | Max epoch for LR scaling (e.g., 42) |
| | Starting LR (e.g., 0.1) |
| | Ending LR at last epoch (e.g., 0.0001) |
| | Any other parameter of the torch scheduler |
model/module/scheduler=OneCycle
Reference: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html

| Parameter | Parameter Explanation |
|---|---|
| | Max epoch for LR scheduling (e.g., 42) |
| | Max LR (e.g., 0.1) |
| | Any parameter of the torch scheduler |
model/module/scheduler=OneCycleAnnealed
Two-phase LR: (1) OneCycle → (2) constant LR annealing
References:
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.SequentialLR.html
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ConstantLR.html

| Parameter | Parameter Explanation |
|---|---|
| | Number of epochs for the initial OneCycle phase |
| | Factor of the initial LR to anneal to |
| | Max LR (e.g., 0.1) |
| | Any parameter of the torch sequential scheduler |
| | Any parameter of the first-phase OneCycle scheduler |
| | Any parameter of the second-phase constant scheduler |
model/module/scheduler=ReduceOnPlateau
Reduces the learning rate when a metric has stopped improving.

A cosine annealing scheduler sets the learning rate of each parameter group using a cosine annealing schedule.