command to start sagemaker training
include sample training
30
README.md
@@ -30,11 +30,16 @@ qc-cli --help
 # 1. Create config.yaml in the current directory
 qc-cli init

-# 2. Edit config.yaml — at minimum set s3.bucket and sagemaker.role_name
+# 2. Edit config.yaml — at minimum set s3.bucket and sagemaker.training.image_uri

 # 3. Provision AWS infrastructure (S3 bucket + SageMaker IAM role).
 # This is the step that requires the AWS CDK CLI.
 qc-cli infra setup
+
+# 4. Upload training data, then submit a SageMaker training job.
+qc-cli upload ./my-dataset
+qc-cli train start
+qc-cli train status
 ```

 ## Configuration
@@ -51,8 +56,17 @@ s3:

 sagemaker:
   role_name: qc-cli-sagemaker-role
+  training:
+    image_uri: "" # ECR URI for your training container
+    instance_type: ml.m5.xlarge
+    instance_count: 1
+    entry_point: null # Optional: script inside source_dir
+    source_dir: null # Optional: local dir packaged and uploaded automatically
+    hyperparameters: {}
 ```

+`hyperparameters` is a flat map of values passed to the training container. Valid keys depend on the selected training image and entry point.
+
 To provision an MLflow tracking server, set:

 ```yaml
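Note on the hunk above: `hyperparameters` is described as a flat map, and the SageMaker `CreateTrainingJob` API accepts hyperparameters as a string-to-string map. The following is a minimal sketch of how a config value could be checked and converted before submission. The function name `stringify_hyperparameters` is hypothetical and is not taken from qc-cli; how the CLI actually handles this is not shown in the diff.

```python
from typing import Any, Dict


def stringify_hyperparameters(raw: Dict[str, Any]) -> Dict[str, str]:
    """Reject nested values and convert every remaining value to a string."""
    result: Dict[str, str] = {}
    for key, value in raw.items():
        # A flat map has no containers as values.
        if isinstance(value, (dict, list, tuple, set)):
            raise ValueError(
                f"hyperparameter {key!r} must be a scalar, got {type(value).__name__}"
            )
        result[str(key)] = str(value)
    return result


if __name__ == "__main__":
    # Hypothetical keys; valid keys depend on the training image and entry point.
    print(stringify_hyperparameters({"epochs": 10, "lr": 0.01}))
```

Failing early on nested values keeps the error local to `config.yaml` instead of surfacing later as a rejected training job.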
@@ -101,6 +115,19 @@ qc-cli upload <file> --s3-key <key> Upload a file to a custom S3 key

 Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads default to `s3://<bucket>/<data_prefix>/<filename>`. Directory uploads are recursive, preserve paths relative to the uploaded directory, and place files under `s3://<bucket>/<data_prefix>/`.

+### `train`
+
+```
+qc-cli train start                 Submit a SageMaker training job
+qc-cli train status [job-name]     Show job status; defaults to the last submitted job
+qc-cli train list                  List recent training jobs
+qc-cli train list --limit 3        Show a custom number of recent jobs
+```
+
+`train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.
+
+The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.
+
 ## AWS permissions required

 The IAM user or role running the CLI needs:
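Note on the hunk above: the `train start` paragraph names two S3 locations derived from `config.yaml`. The sketch below shows one way those URIs could be built from a bucket and two prefixes. The helper names `s3_uri` and `training_locations` are hypothetical, and stripping stray slashes is an assumption about how prefixes might be normalized, not documented qc-cli behavior.

```python
from typing import Dict


def s3_uri(bucket: str, prefix: str) -> str:
    """Join a bucket and prefix into an s3:// URI that ends with a slash."""
    cleaned = prefix.strip("/")
    return f"s3://{bucket}/{cleaned}/" if cleaned else f"s3://{bucket}/"


def training_locations(bucket: str, data_prefix: str, model_prefix: str) -> Dict[str, str]:
    """Return the training channel and output location described in the README."""
    return {
        "training_channel": s3_uri(bucket, data_prefix),
        "output_path": s3_uri(bucket, model_prefix),
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefixes, for illustration only.
    print(training_locations("my-bucket", "data", "models"))
```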
@@ -111,6 +138,7 @@ The IAM user or role running the CLI needs:
 | CreateRole, GetRole, DeleteRole, AttachRolePolicy, DetachRolePolicy | IAM |
 | CreateStack, UpdateStack, DeleteStack, DescribeStacks, DescribeStackEvents | CloudFormation |
 | GetCallerIdentity | STS |
+| CreateTrainingJob, DescribeTrainingJob, ListTrainingJobs | SageMaker AI |
 | CreateMlflowTrackingServer, DescribeMlflowTrackingServer, DeleteMlflowTrackingServer | SageMaker AI, when `mlflow.mode` is `create` or `existing` |

 `AdministratorAccess` covers all of the above.
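Note on the new table row: for readers who prefer something narrower than `AdministratorAccess`, the sketch below builds an IAM policy document covering only the three training-job actions the row lists. The statement ID and the `"*"` resource are assumptions made for illustration. The other rows in the table would need their own statements, and a real deployment may need further permissions that this diff does not mention.

```python
import json

# Actions from the "SageMaker AI" training row of the permissions table.
TRAINING_ACTIONS = [
    "sagemaker:CreateTrainingJob",
    "sagemaker:DescribeTrainingJob",
    "sagemaker:ListTrainingJobs",
]


def training_policy(resource: str = "*") -> dict:
    """Build an IAM policy document that allows only the training-job actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "QcCliTraining",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": list(TRAINING_ACTIONS),
                "Resource": resource,  # assumption: "*" for simplicity
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(training_policy(), indent=2))
```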