command to start sagemaker training

include sample training
2026-05-25 16:48:31 -04:00
parent 62ffe163e8
commit 0e728cc193
13 changed files with 796 additions and 5 deletions


@@ -30,11 +30,16 @@ qc-cli --help
# 1. Create config.yaml in the current directory
qc-cli init
# 2. Edit config.yaml — at minimum set s3.bucket and sagemaker.role_name
# 2. Edit config.yaml — at minimum set s3.bucket and sagemaker.training.image_uri
# 3. Provision AWS infrastructure (S3 bucket + SageMaker IAM role).
# This is the step that requires the AWS CDK CLI.
qc-cli infra setup
# 4. Upload training data, then submit a SageMaker training job.
qc-cli upload ./my-dataset
qc-cli train start
qc-cli train status
```
## Configuration
@@ -51,8 +56,17 @@ s3:
sagemaker:
role_name: qc-cli-sagemaker-role
training:
image_uri: "" # ECR URI for your training container
instance_type: ml.m5.xlarge
instance_count: 1
entry_point: null # Optional: script inside source_dir
source_dir: null # Optional: local dir packaged and uploaded automatically
hyperparameters: {}
```
`hyperparameters` is a flat map of values passed to the training container. Valid keys depend on the selected training image and entry point.
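The SageMaker `CreateTrainingJob` API accepts hyperparameters as a string-to-string map, so YAML values such as numbers and booleans need to be stringified at some point. The following is a minimal sketch of one way to do that. The helper name `to_sagemaker_hyperparameters` is hypothetical, and whether `qc-cli` converts values this way (for example, rendering `true` in lowercase) is an assumption, not something this README states.

```python
import json


def to_sagemaker_hyperparameters(hyperparameters):
    """Convert a flat config map into a string-to-string map.

    Hypothetical helper: the conversion rules are an assumption,
    not qc-cli's documented behavior.
    """
    converted = {}
    for key, value in hyperparameters.items():
        # The README describes a flat map, so nested values are rejected.
        if isinstance(value, (dict, list)):
            raise ValueError(
                f"hyperparameter {key!r} must be a scalar, "
                f"got {type(value).__name__}"
            )
        if isinstance(value, str):
            converted[str(key)] = value
        else:
            # json.dumps renders True as "true" and None as "null".
            converted[str(key)] = json.dumps(value)
    return converted
```

With this sketch, `{"epochs": 10, "shuffle": True}` becomes `{"epochs": "10", "shuffle": "true"}`. Your training script still has to parse the strings back into the types it expects.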
To provision an MLflow tracking server, set:
```yaml
@@ -101,6 +115,19 @@ qc-cli upload <file> --s3-key <key> Upload a file to a custom S3 key
Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads default to `s3://<bucket>/<data_prefix>/<filename>`. Directory uploads are recursive, preserve paths relative to the uploaded directory, and place files under `s3://<bucket>/<data_prefix>/`.
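The key layout described above can be sketched as two small path computations. The function names are hypothetical and the code only illustrates the documented mapping; it is not taken from the `qc-cli` source. It assumes POSIX-style local paths and does not cover the `--s3-key` override.

```python
from pathlib import PurePosixPath


def file_upload_key(local_file, data_prefix):
    """Key for a single-file upload: <data_prefix>/<filename>."""
    prefix = PurePosixPath(data_prefix.strip("/"))
    return str(prefix / PurePosixPath(local_file).name)


def directory_upload_key(local_file, directory, data_prefix):
    """Key for a file found during a recursive directory upload.

    The path relative to the uploaded directory is preserved, and the
    directory's own name is not part of the key.
    """
    prefix = PurePosixPath(data_prefix.strip("/"))
    relative = PurePosixPath(local_file).relative_to(PurePosixPath(directory))
    return str(prefix / relative)
```

For example, with `data_prefix` set to `data`, uploading the directory `my-dataset` would place `my-dataset/images/cat.png` at the key `data/images/cat.png` under this reading of the rules.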
### `train`
```
qc-cli train start Submit a SageMaker training job
qc-cli train status [job-name] Show job status; defaults to the last submitted job
qc-cli train list List recent training jobs
qc-cli train list --limit 3 Show a custom number of recent jobs
```
`train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.
The expected output artifact is SageMaker's `model.tar.gz`, which normally contains the trained model files your container writes to `/opt/ml/model`.
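To make the mapping from `config.yaml` to a training job concrete, here is a rough sketch of a request body in the shape that boto3's `create_training_job` accepts. Several details are assumptions rather than documented `qc-cli` behavior: the function name, the channel name `training`, the volume size and runtime limit, and passing `sagemaker_program` and `sagemaker_submit_directory` as hyperparameters (the convention used by SageMaker framework containers).

```python
def build_training_request(job_name, role_arn, bucket, data_prefix,
                           model_prefix, training, source_s3_uri=None):
    """Build a CreateTrainingJob request dict (hypothetical helper)."""
    # Values are assumed to be strings already.
    hyperparameters = dict(training.get("hyperparameters") or {})
    if source_s3_uri:
        # Assumption: the packaged source_dir is referenced via hyperparameters.
        hyperparameters["sagemaker_submit_directory"] = source_s3_uri
        if training.get("entry_point"):
            hyperparameters["sagemaker_program"] = training["entry_point"]
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": training["image_uri"],
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [{
            "ChannelName": "training",  # assumed channel name
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/{data_prefix.strip('/')}/",
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {
            "S3OutputPath": f"s3://{bucket}/{model_prefix.strip('/')}/",
        },
        "ResourceConfig": {
            "InstanceType": training["instance_type"],
            "InstanceCount": training["instance_count"],
            "VolumeSizeInGB": 30,  # placeholder value
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # placeholder value
        "HyperParameters": hyperparameters,
    }
```

A dict like this could be submitted with `boto3.client("sagemaker").create_training_job(**request)`. Keeping the builder free of AWS calls makes it easy to inspect the S3 URIs before a job is created.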
## AWS permissions required
The IAM user or role running the CLI needs:
@@ -111,6 +138,7 @@ The IAM user or role running the CLI needs:
| CreateRole, GetRole, DeleteRole, AttachRolePolicy, DetachRolePolicy | IAM |
| CreateStack, UpdateStack, DeleteStack, DescribeStacks, DescribeStackEvents | CloudFormation |
| GetCallerIdentity | STS |
| CreateTrainingJob, DescribeTrainingJob, ListTrainingJobs | SageMaker AI |
| CreateMlflowTrackingServer, DescribeMlflowTrackingServer, DeleteMlflowTrackingServer | SageMaker AI, when `mlflow.mode` is `create` or `existing` |
`AdministratorAccess` covers all of the above.
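If `AdministratorAccess` is broader than you want, the rows shown here can be expressed as an IAM policy document. The sketch below generates one. It covers only the IAM, CloudFormation, STS, and SageMaker actions listed in this part of the table, so it is likely incomplete for real use, and `"Resource": "*"` is a placeholder you would normally narrow. The function name is hypothetical.

```python
import json


def build_cli_policy(include_mlflow=False):
    """Return an IAM policy document for the actions listed in the table."""
    actions = [
        "iam:CreateRole", "iam:GetRole", "iam:DeleteRole",
        "iam:AttachRolePolicy", "iam:DetachRolePolicy",
        "cloudformation:CreateStack", "cloudformation:UpdateStack",
        "cloudformation:DeleteStack", "cloudformation:DescribeStacks",
        "cloudformation:DescribeStackEvents",
        "sts:GetCallerIdentity",
        "sagemaker:CreateTrainingJob", "sagemaker:DescribeTrainingJob",
        "sagemaker:ListTrainingJobs",
    ]
    if include_mlflow:
        # Only needed when mlflow.mode is `create` or `existing`.
        actions += [
            "sagemaker:CreateMlflowTrackingServer",
            "sagemaker:DescribeMlflowTrackingServer",
            "sagemaker:DeleteMlflowTrackingServer",
        ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Placeholder resource scope: narrow this for production use.
            {"Effect": "Allow", "Action": actions, "Resource": "*"},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_cli_policy(include_mlflow=True), indent=2))
```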