SageMaker Training Example
This example downloads a small image-classification dataset, uploads it through qc-cli, and submits a live SageMaker training job.
Prerequisites
- AWS credentials configured for the profile in config.yaml
- Infrastructure already deployed with qc-cli infra setup
- config.yaml updated with:
s3:
  bucket: your-bucket-name
sagemaker:
  training:
    image_uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.6-cpu-py312-ubuntu22.04-sagemaker-v1
    instance_type: ml.m4.xlarge
    instance_count: 1
    source_dir: examples/training/source
    entry_point: train.py
    hyperparameters:
      epochs: 1
      batch-size: 32
      learning-rate: 0.001
      image-size: 160
      validation-split: 0.2
Training Hyperparameters
Values under sagemaker.training.hyperparameters are passed to the training entry point as command-line arguments. For this example, they map to arguments defined in source/train.py.
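As a rough illustration of that mapping, the sketch below flattens a hyperparameter dictionary into `--name value` pairs, which is the shape the entry point receives on its command line. The helper name `to_cli_args` is hypothetical; the real conversion happens inside the SageMaker training container and may differ in details such as quoting.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter mapping into ['--name', 'value', ...].

    Hypothetical helper for illustration only; SageMaker performs the
    real conversion before it launches the entry point.
    """
    args = []
    for name, value in hyperparameters.items():
        args.extend([f"--{name}", str(value)])
    return args


# Values copied from the config.yaml example in this README.
hyperparameters = {
    "epochs": 1,
    "batch-size": 32,
    "learning-rate": 0.001,
    "image-size": 160,
    "validation-split": 0.2,
}

print(" ".join(to_cli_args(hyperparameters)))
```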
Supported by this example:
| Name | Type | Default | Description |
|---|---|---|---|
| epochs | int | 1 | Number of training epochs. |
| batch-size | int | 32 | Images per training batch. |
| learning-rate | float | 0.001 | Adam optimizer learning rate. |
| image-size | int | 160 | Resize images to square image-size x image-size. |
| validation-split | float | 0.2 | Fraction of data used for validation. |
| max-samples | int | 0 | Optional cap for smoke tests; 0 means use all images. |
| seed | int | 13 | Random seed for reproducible splitting. |
| num-workers | int | 2 | DataLoader worker count. |
Do not set train-dir or model-dir in normal SageMaker runs. SageMaker sets those automatically through SM_CHANNEL_TRAIN and SM_MODEL_DIR.
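An entry point that accepts the arguments in the table above might declare them roughly as follows. This is a sketch, not the contents of source/train.py: the function name `build_parser` and the local fallback directories `data` and `model` are assumptions. Only the argument names, types, and defaults come from the table, and the two environment variables are the ones SageMaker sets.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser mirroring the hyperparameter table."""
    parser = argparse.ArgumentParser(description="Sketch of train.py arguments")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--image-size", type=int, default=160)
    parser.add_argument("--validation-split", type=float, default=0.2)
    parser.add_argument("--max-samples", type=int, default=0)
    parser.add_argument("--seed", type=int, default=13)
    parser.add_argument("--num-workers", type=int, default=2)
    # SageMaker exports these variables inside the training container.
    # The "data" and "model" fallbacks are assumptions for local runs.
    parser.add_argument("--train-dir", default=os.environ.get("SM_CHANNEL_TRAIN", "data"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    return parser


# argparse converts dashes to underscores: --batch-size becomes args.batch_size.
args = build_parser().parse_args(["--epochs", "2", "--batch-size", "16"])
print(args.epochs, args.batch_size, args.learning_rate)
```

Reading the directories from the environment at parse time is why they normally do not need to be passed as hyperparameters.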
1. Download The Dataset
bash examples/training/download_flower_photos.sh
This creates:
examples/training/data/flower_photos_sagemaker/
daisy/
dandelion/
roses/
sunflowers/
tulips/
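The one-folder-per-class layout above lends itself to a simple label mapping. The sketch below derives a class-to-index dictionary from the subdirectory names in sorted order. It assumes alphabetical ordering (the convention used by loaders such as torchvision's ImageFolder); whether train.py builds class_to_idx.json this way is not confirmed here, and `build_class_to_idx` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def build_class_to_idx(root):
    """Map each class subdirectory name to an integer index, sorted by name."""
    classes = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


# Recreate the directory names from this README in a temporary folder,
# so the sketch runs without downloading the dataset.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["tulips", "roses", "daisy", "sunflowers", "dandelion"]:
        (Path(tmp) / name).mkdir()
    print(build_class_to_idx(tmp))
    # → {'daisy': 0, 'dandelion': 1, 'roses': 2, 'sunflowers': 3, 'tulips': 4}
```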
2. Run Training
Run the training script and wait until it finishes:
bash examples/training/run_training.sh --config config.yaml --wait
To reuse a dataset that is already uploaded under s3.data_prefix, skip the upload step:
bash examples/training/run_training.sh \
--config config.yaml \
--skip-upload \
--wait
Notes
- The default dataset path is examples/training/data/flower_photos_sagemaker.
- Uploaded data uses the s3.bucket and s3.data_prefix values from config.yaml.
- Training artifacts are written under s3://<bucket>/<model_prefix>/.
- The SageMaker model.tar.gz contains model.onnx, model.pt, class_to_idx.json, and metrics.json.
- SageMaker packages examples/training/source, installs requirements.txt, and runs train.py.
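
After downloading model.tar.gz from the artifact location, a quick check that the four expected files are present could look like the sketch below. It uses only the Python standard library. The helper names `missing_artifacts` and `write_demo_archive` are hypothetical, and the demo builds a small stand-in archive rather than a real training output.

```python
import io
import os
import tarfile
import tempfile

# File names listed in the Notes above.
EXPECTED_FILES = {"model.onnx", "model.pt", "class_to_idx.json", "metrics.json"}


def missing_artifacts(tar_path):
    """Return the sorted list of expected files not found in the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        found = {os.path.basename(m.name) for m in tar.getmembers() if m.isfile()}
    return sorted(EXPECTED_FILES - found)


def write_demo_archive(tar_path, names):
    """Create a tiny placeholder archive so the check can be tried offline."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for name in names:
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "model.tar.gz")
    write_demo_archive(demo, ["model.onnx", "model.pt", "metrics.json"])
    print(missing_artifacts(demo))  # → ['class_to_idx.json']
```

For a real run, call missing_artifacts with the path of the downloaded archive; an empty list means all four files are present.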