SageMaker Training Example

This example downloads a small image-classification dataset, uploads it through qc-cli, and submits a live SageMaker training job.

Prerequisites

  • AWS credentials configured for the profile in config.yaml
  • Infrastructure already deployed with qc-cli infra setup
  • config.yaml updated with:
s3:
  bucket: your-bucket-name

sagemaker:
  training:
    image_uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.6-cpu-py312-ubuntu22.04-sagemaker-v1
    instance_type: ml.m4.xlarge
    instance_count: 1
    source_dir: examples/training/source
    entry_point: train.py
    hyperparameters:
      epochs: 1
      batch-size: 32
      learning-rate: 0.001
      image-size: 160
      validation-split: 0.2

Training Hyperparameters

Values under sagemaker.training.hyperparameters are passed to the training entry point as command-line arguments; for example, epochs: 1 becomes --epochs 1. In this example, they map to the arguments defined in source/train.py.

Supported by this example:

Name              Type   Default  Description
epochs            int    1        Number of training epochs.
batch-size        int    32       Images per training batch.
learning-rate     float  0.001    Adam optimizer learning rate.
image-size        int    160      Resize images to square image-size x image-size.
validation-split  float  0.2      Fraction of data used for validation.
max-samples       int    0        Optional cap for smoke tests; 0 means use all images.
seed              int    13       Random seed for reproducible splitting.
num-workers       int    2        DataLoader worker count.

Do not set train-dir or model-dir in normal SageMaker runs. SageMaker supplies those paths inside the training container through the SM_CHANNEL_TRAIN and SM_MODEL_DIR environment variables.
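The mapping from hyperparameters to command-line arguments can be sketched with argparse. This is a minimal illustration built from the table above, not the actual contents of source/train.py: the function name build_parser is hypothetical, and the fallback paths are the standard SageMaker container locations, assumed here so the sketch also runs outside a container.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the hyperparameter table (not the real train.py)."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--image-size", type=int, default=160)
    parser.add_argument("--validation-split", type=float, default=0.2)
    parser.add_argument("--max-samples", type=int, default=0)
    parser.add_argument("--seed", type=int, default=13)
    parser.add_argument("--num-workers", type=int, default=2)
    # SageMaker exports these variables in the training container, so the
    # directories normally never need to be passed explicitly.
    parser.add_argument(
        "--train-dir",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    return parser


if __name__ == "__main__":
    # Simulate the arguments SageMaker would pass for two hyperparameters.
    args = build_parser().parse_args(["--epochs", "1", "--batch-size", "32"])
    print(args.epochs, args.batch_size, args.model_dir)
```

argparse converts dashes to underscores, so batch-size is read back as args.batch_size.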

1. Download The Dataset

bash examples/training/download_flower_photos.sh

This creates:

examples/training/data/flower_photos_sagemaker/
  daisy/
  dandelion/
  roses/
  sunflowers/
  tulips/
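The class-per-folder layout above lends itself to a simple indexing step. The following is a rough sketch, assuming labels are derived from sorted folder names and that the seed and validation-split hyperparameters drive a shuffled hold-out split. The helper names index_dataset and split_samples and the set of image suffixes are hypothetical; the real train.py may do this differently.

```python
import random
from pathlib import Path

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_dataset(root):
    """Return (class_to_idx, samples) for a class-per-folder dataset."""
    root = Path(root)
    # Each subdirectory name is treated as one class label.
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = [
        (path, class_to_idx[name])
        for name in classes
        for path in sorted((root / name).iterdir())
        if path.suffix.lower() in IMAGE_SUFFIXES
    ]
    return class_to_idx, samples


def split_samples(samples, validation_split=0.2, seed=13):
    """Shuffle deterministically and hold out a validation fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    return shuffled[n_val:], shuffled[:n_val]
```

With the five folders listed above, sorted order would give daisy index 0 and tulips index 4 under this sketch.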

2. Run Training

Run the training script, which uploads the dataset and submits the training job. The --wait flag blocks until the job finishes:

bash examples/training/run_training.sh --config config.yaml --wait

To reuse a dataset that is already uploaded under s3.data_prefix, skip the upload step:

bash examples/training/run_training.sh \
  --config config.yaml \
  --skip-upload \
  --wait

Notes

  • The default dataset path is examples/training/data/flower_photos_sagemaker.
  • Uploaded data uses the s3.bucket and s3.data_prefix values from config.yaml.
  • Training artifacts are written under s3://<bucket>/<model_prefix>/.
  • The SageMaker model.tar.gz contains model.onnx, model.pt, class_to_idx.json, and metrics.json.
  • SageMaker packages examples/training/source, installs requirements.txt, and runs train.py.
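After downloading model.tar.gz from the model prefix, its contents can be checked with a short script. This is a sketch using only the Python standard library. The helper name inspect_model_archive is hypothetical, and the structure of metrics.json is not documented here, so it is loaded as generic JSON.

```python
import json
import tarfile

# File names taken from the notes above.
EXPECTED_MEMBERS = {"model.onnx", "model.pt", "class_to_idx.json", "metrics.json"}


def inspect_model_archive(archive_path):
    """Return (missing_files, metrics) for a downloaded model.tar.gz."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Normalize names such as "./metrics.json" to "metrics.json".
        members = {
            m.name.removeprefix("./"): m
            for m in archive.getmembers()
            if m.isfile()
        }
        missing = sorted(EXPECTED_MEMBERS - members.keys())
        metrics = None
        if "metrics.json" in members:
            with archive.extractfile(members["metrics.json"]) as handle:
                metrics = json.load(handle)
    return missing, metrics


if __name__ == "__main__":
    # Assumes model.tar.gz has been downloaded to the current directory.
    missing, metrics = inspect_model_archive("model.tar.gz")
    print("missing files:", missing)
    print("metrics:", metrics)
```

An empty list of missing files indicates that the archive holds all four expected artifacts. str.removeprefix requires Python 3.9 or newer.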