command to start sagemaker training

include sample training
2026-05-25 16:48:31 -04:00
parent 62ffe163e8
commit 0e728cc193
13 changed files with 796 additions and 5 deletions
--- a/examples/training/README.md
+++ b/examples/training/README.md
@@ -0,0 +1,90 @@
+# SageMaker Training Example
+
+This example downloads a small image-classification dataset, uploads it through `qc-cli`, and submits a live SageMaker training job.
+
+## Prerequisites
+
+- AWS credentials configured for the profile in `config.yaml`
+- Infrastructure already deployed with `qc-cli infra setup`
+- `config.yaml` updated with:
+
+```yaml
+s3:
+  bucket: your-bucket-name
+
+sagemaker:
+  role_name: <role-name>
+  training:
+    image_uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.6-cpu-py312-ubuntu22.04-sagemaker-v1
+    instance_type: ml.m4.xlarge
+    instance_count: 1
+    source_dir: examples/training/source
+    entry_point: train.py
+    hyperparameters:
+      epochs: 1
+      batch-size: 32
+      learning-rate: 0.001
+      image-size: 160
+      validation-split: 0.2
+```
+
+## Training Hyperparameters
+
+SageMaker passes each key under `sagemaker.training.hyperparameters` to the training entry point as a `--key value` command-line argument. In this example, the keys map to the arguments defined in [source/train.py](source/train.py).
+
+Hyperparameters supported by this example:
+
+| Name | Type | Default | Description |
+|---|---:|---:|---|
+| `epochs` | int | `1` | Number of training epochs. |
+| `batch-size` | int | `32` | Images per training batch. |
+| `learning-rate` | float | `0.001` | Adam optimizer learning rate. |
+| `image-size` | int | `160` | Images are resized to a square of `image-size` x `image-size` pixels. |
+| `validation-split` | float | `0.2` | Fraction of data used for validation. |
+| `max-samples` | int | `0` | Optional cap for smoke tests; `0` means use all images. |
+| `seed` | int | `13` | Random seed for reproducible splitting. |
+| `num-workers` | int | `2` | DataLoader worker count. |
+
+Do not set `train-dir` or `model-dir` for normal SageMaker runs. SageMaker supplies those paths automatically through the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables.
+
+## 1. Download The Dataset
+
+```bash
+bash examples/training/download_flower_photos.sh
+```
+
+This creates:
+
+```text
+examples/training/data/flower_photos_sagemaker/
+  daisy/
+  dandelion/
+  roses/
+  sunflowers/
+  tulips/
+```
+
+## 2. Run Training
+
+Upload the dataset, submit the training job, and wait for it to finish:
+
+```bash
+bash examples/training/run_training.sh --config config.yaml --wait
+```
+
+To reuse a dataset that is already uploaded under `s3.data_prefix`, skip the upload step:
+
+```bash
+bash examples/training/run_training.sh \
+  --config config.yaml \
+  --skip-upload \
+  --wait
+```
+
+## Notes
+
+- The default dataset path is `examples/training/data/flower_photos_sagemaker`.
+- Uploaded data uses the `s3.bucket` and `s3.data_prefix` values from `config.yaml`.
+- Training artifacts are written under `s3://<bucket>/<model_prefix>/`.
+- The SageMaker `model.tar.gz` contains `model.onnx`, `model.pt`, `class_to_idx.json`, and `metrics.json`.
+- The contents of `examples/training/source` are packaged for SageMaker, which installs the dependencies listed in `requirements.txt` and then runs `train.py`.