# SageMaker Training Example

This example downloads a small image-classification dataset, uploads it through `qc-cli`, and submits a live SageMaker training job.

## Prerequisites

- AWS credentials configured for the profile in `config.yaml`
- Infrastructure already deployed with `qc-cli infra setup`
- `config.yaml` updated with:

```yaml
s3:
  bucket: your-bucket-name
sagemaker:
  training:
    image_uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.6-cpu-py312-ubuntu22.04-sagemaker-v1
    instance_type: ml.m4.xlarge
    instance_count: 1
    source_dir: examples/training/source
    entry_point: train.py
    hyperparameters:
      epochs: 1
      batch-size: 32
      learning-rate: 0.001
      image-size: 160
      validation-split: 0.2
```

## Training Hyperparameters

Values under `sagemaker.training.hyperparameters` are passed to the training entry point as command-line arguments. For this example, they map to arguments defined in [source/train.py](source/train.py).

Supported by this example:

| Name | Type | Default | Description |
|---|---:|---:|---|
| `epochs` | int | `1` | Number of training epochs. |
| `batch-size` | int | `32` | Images per training batch. |
| `learning-rate` | float | `0.001` | Adam optimizer learning rate. |
| `image-size` | int | `160` | Resize images to square `image-size x image-size`. |
| `validation-split` | float | `0.2` | Fraction of data used for validation. |
| `max-samples` | int | `0` | Optional cap for smoke tests; `0` means use all images. |
| `seed` | int | `13` | Random seed for reproducible splitting. |
| `num-workers` | int | `2` | DataLoader worker count. |

Do not set `train-dir` or `model-dir` in normal SageMaker runs. SageMaker sets those automatically through `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`.

## 1. Download The Dataset

```bash
bash examples/training/download_flower_photos.sh
```

This creates:

```text
examples/training/data/flower_photos_sagemaker/
  daisy/
  dandelion/
  roses/
  sunflowers/
  tulips/
```

## 2. Run Training

Run the training script and wait until it finishes:

```bash
bash examples/training/run_training.sh --config config.yaml --wait
```

Use a dataset that is already uploaded to `s3.data_prefix`:

```bash
bash examples/training/run_training.sh \
  --config config.yaml \
  --skip-upload \
  --wait
```

## Notes

- The default dataset path is `examples/training/data/flower_photos_sagemaker`.
- Uploaded data uses the `s3.bucket` and `s3.data_prefix` values from `config.yaml`.
- Training artifacts are written under `s3:////`.
- The SageMaker `model.tar.gz` contains `model.onnx`, `model.pt`, `class_to_idx.json`, and `metrics.json`.
- SageMaker packages `examples/training/source`, installs `requirements.txt`, and runs `train.py`.
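
For readers who want to see how the hyperparameters listed under Training Hyperparameters could reach the entry point, here is a minimal sketch of an `argparse` setup that mirrors the table. It is not copied from `source/train.py`; the function name `build_parser` and the local fallback paths `data` and `model` are assumptions made for illustration. Only the argument names, types, defaults, and the `SM_CHANNEL_TRAIN` / `SM_MODEL_DIR` environment variables come from this README.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the hyperparameter table above."""
    parser = argparse.ArgumentParser(description="Illustrative training arguments")
    # Names, types, and defaults follow the hyperparameter table.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--image-size", type=int, default=160)
    parser.add_argument("--validation-split", type=float, default=0.2)
    parser.add_argument("--max-samples", type=int, default=0)
    parser.add_argument("--seed", type=int, default=13)
    parser.add_argument("--num-workers", type=int, default=2)
    # SageMaker provides these locations through environment variables.
    # The fallbacks "data" and "model" are assumptions for local runs only.
    parser.add_argument(
        "--train-dir", default=os.environ.get("SM_CHANNEL_TRAIN", "data")
    )
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "model")
    )
    return parser


if __name__ == "__main__":
    # Simulate the command line built from two hyperparameters in config.yaml.
    args = build_parser().parse_args(["--epochs", "3", "--batch-size", "16"])
    # argparse converts dashes to underscores in attribute names.
    print(args.epochs, args.batch_size, args.learning_rate)
```

Reading the directories from the environment at parser-construction time is one way to keep `train-dir` and `model-dir` out of `config.yaml` while still allowing an explicit override when running the script outside SageMaker.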