
# SageMaker Training Example

This example downloads a small image-classification dataset, uploads it through `qc-cli`, and submits a live SageMaker training job.

## Prerequisites

- AWS credentials configured for the profile in `config.yaml`
- Infrastructure already deployed with `qc-cli infra setup`
- `config.yaml` updated with:

```yaml
s3:
  bucket: your-bucket-name

sagemaker:
  training:
    image_uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.6-cpu-py312-ubuntu22.04-sagemaker-v1
    instance_type: ml.m4.xlarge
    instance_count: 1
    source_dir: examples/training/source
    entry_point: train.py
    hyperparameters:
      epochs: 1
      batch-size: 32
      learning-rate: 0.001
      image-size: 160
      validation-split: 0.2
```

## Training Hyperparameters

Values under `sagemaker.training.hyperparameters` are passed to the training entry point as command-line arguments, one `--<name> <value>` pair per key. For this example, they map to the arguments defined in [source/train.py](source/train.py).

This example supports the following hyperparameters:

| Name | Type | Default | Description |
|---|---:|---:|---|
| `epochs` | int | `1` | Number of training epochs. |
| `batch-size` | int | `32` | Images per training batch. |
| `learning-rate` | float | `0.001` | Adam optimizer learning rate. |
| `image-size` | int | `160` | Side length in pixels; images are resized to a square of `image-size` x `image-size`. |
| `validation-split` | float | `0.2` | Fraction of data used for validation. |
| `max-samples` | int | `0` | Optional cap for smoke tests; `0` means use all images. |
| `seed` | int | `13` | Random seed for reproducible splitting. |
| `num-workers` | int | `2` | DataLoader worker count. |

Do not set `train-dir` or `model-dir` in normal SageMaker runs. SageMaker sets those automatically through `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`.
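
The mapping from the table above to command-line arguments can be sketched with `argparse`. This is a minimal illustration, not a copy of `source/train.py`: the function name `build_parser` and the local fallback paths `data` and `model` are assumptions, while the argument names and defaults come from the table.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser that mirrors the hyperparameter table."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--image-size", type=int, default=160)
    parser.add_argument("--validation-split", type=float, default=0.2)
    parser.add_argument("--max-samples", type=int, default=0)
    parser.add_argument("--seed", type=int, default=13)
    parser.add_argument("--num-workers", type=int, default=2)
    # SageMaker sets these variables inside the training container.
    # The fallback paths are assumptions for running outside SageMaker.
    parser.add_argument("--train-dir", default=os.environ.get("SM_CHANNEL_TRAIN", "data"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    return parser


if __name__ == "__main__":
    # Roughly how the hyperparameters from config.yaml arrive on the command line.
    args = build_parser().parse_args(
        ["--epochs", "1", "--batch-size", "32", "--learning-rate", "0.001"]
    )
    print(args.epochs, args.batch_size, args.learning_rate)
```

Note that `argparse` converts hyphens to underscores, so `--batch-size` is read as `args.batch_size`.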

## 1. Download The Dataset

```bash
bash examples/training/download_flower_photos.sh
```

This creates:

```text
examples/training/data/flower_photos_sagemaker/
  daisy/
  dandelion/
  roses/
  sunflowers/
  tulips/
```
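
Before uploading, it can help to confirm that each class folder actually contains images. The following is a small sketch for that check. The helper name `count_images_per_class` and the set of accepted file extensions are assumptions, not part of the example's scripts.

```python
from pathlib import Path

# Assumed image extensions; adjust if the dataset uses others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root: str) -> dict[str, int]:
    """Return {class_folder_name: image_count} for each subdirectory of root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for path in class_dir.rglob("*")
                if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


if __name__ == "__main__":
    root = "examples/training/data/flower_photos_sagemaker"
    if Path(root).is_dir():
        for name, count in count_images_per_class(root).items():
            print(f"{name}: {count}")
    else:
        print(f"{root} not found; run the download script first")
```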

## 2. Run Training

Run the training script and wait until it finishes:

```bash
bash examples/training/run_training.sh --config config.yaml --wait
```

To reuse a dataset that is already uploaded under `s3.data_prefix`, skip the upload step:

```bash
bash examples/training/run_training.sh \
  --config config.yaml \
  --skip-upload \
  --wait
```

## Notes

- The default dataset path is `examples/training/data/flower_photos_sagemaker`.
- Uploaded data uses the `s3.bucket` and `s3.data_prefix` values from `config.yaml`.
- Training artifacts are written under `s3://<bucket>/<model_prefix>/`.
- The SageMaker `model.tar.gz` contains `model.onnx`, `model.pt`, `class_to_idx.json`, and `metrics.json`.
- The contents of `examples/training/source` are packaged for SageMaker, which installs the dependencies in `requirements.txt` and runs `train.py`.
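
Once a job finishes, the archive contents listed above can be checked locally. The sketch below assumes `model.tar.gz` has already been downloaded from the bucket to a local path and that `metrics.json` holds a JSON object. The helper name `inspect_model_archive` is hypothetical.

```python
import json
import tarfile

# File names taken from the notes above.
EXPECTED_FILES = {"model.onnx", "model.pt", "class_to_idx.json", "metrics.json"}


def inspect_model_archive(archive_path: str) -> dict:
    """Verify the expected files are present and return the parsed metrics.json."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Map normalized names (without a leading "./") to archive members.
        members = {
            m.name.removeprefix("./"): m for m in tar.getmembers() if m.isfile()
        }
        missing = EXPECTED_FILES - members.keys()
        if missing:
            raise FileNotFoundError(f"archive is missing: {sorted(missing)}")
        with tar.extractfile(members["metrics.json"]) as fh:
            return json.load(fh)


if __name__ == "__main__":
    import sys

    # Usage: python inspect_model.py path/to/model.tar.gz
    if len(sys.argv) == 2:
        print(json.dumps(inspect_model_archive(sys.argv[1]), indent=2))
```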