# qc-cli

A CLI for Qualcomm's MLOps pipeline: browse and download models from Qualcomm AI Hub, fine-tune them on custom datasets using SageMaker, validate inference, and prepare artifacts for Qualcomm hardware deployment.

## Requirements

- Python 3.13+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
- AWS account with credentials configured (`aws configure`) when using `qc-cli infra`
- AWS CDK CLI (`npm install -g aws-cdk`) when using `qc-cli infra setup` or `qc-cli infra destroy`

## Installation

```bash
git clone <repo>
cd qc-cli
uv sync
```

Run commands with `uv run qc-cli <command>` or activate the venv first:

```bash
source .venv/bin/activate
qc-cli --help
```

## Quick start

```bash
# 1. Create config.yaml in the current directory
qc-cli init

# 2. Edit config.yaml; at minimum set sagemaker.training.image_uri

# 3. Provision AWS infrastructure (S3 bucket + SageMaker IAM role).
# This is the step that requires the AWS CDK CLI.
qc-cli infra setup

# 4. Upload training data, then submit a SageMaker training job.
qc-cli upload ./my-dataset
qc-cli train start
qc-cli train status
```

## Configuration

`qc-cli init` writes a `config.yaml` in the current directory. The fields you must fill in before using the tool:

```yaml
infra:
  stack_name: qc-cli-mlops-1a2b3c4d5e6f

aws:
  region: us-east-1
  profile: default  # AWS CLI profile name

s3:
  bucket: qc-cli-mlops-1a2b3c4d5e6f-data

sagemaker:
  training:
    image_uri: ""  # ECR URI for your training container
    instance_type: ml.m5.xlarge
    instance_count: 1
    entry_point: null  # Optional: script inside source_dir
    source_dir: null  # Optional: local dir packaged and uploaded automatically
    hyperparameters: {}

aihub:
  device:
    name: Samsung Galaxy S25 (Family)
  target_runtime: tflite
  input_specs: {}  # Required before running qc-cli ai-hub commands
  job_name: null  # Optional prefix for AI Hub Workbench jobs
  model_name: null  # Optional name for uploaded local ONNX models
  compile_options: null
  profile_options: null
  quantize_options: null
  output_dir: build/qai-hub
```

`qc-cli init` generates the `infra.stack_name` and `s3.bucket` namespace once and writes it to `config.yaml`. Keep these values stable for a deployment; changing them points the CLI at different infrastructure.

The CLI isolates both application resources and CDK bootstrap resources. The application CloudFormation stack uses `infra.stack_name`, the S3 bucket uses the same generated namespace because bucket names are globally unique, and the SageMaker IAM role uses a CloudFormation-generated physical name. CDK bootstrap resources are derived internally from `infra.stack_name`, including a bootstrap stack named `<stack_name>-bootstrap` and a matching non-default CDK asset bucket qualifier. `qc-cli infra destroy` removes the application stack but leaves the CDK bootstrap stack in place; the command prints the retained bootstrap stack name.

`hyperparameters` is a flat map of values passed to the training container. Valid keys depend on the selected training image and entry point.

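The two fields this section flags as needing manual input (`sagemaker.training.image_uri`, and `aihub.input_specs` before any `ai-hub` command) can be checked up front. The following is a minimal sketch rather than the CLI's own validation: `unset_required_fields` is a hypothetical helper, and it assumes the config has already been parsed into a dictionary.

```python
from typing import Any


def unset_required_fields(config: dict[str, Any], for_ai_hub: bool = False) -> list[str]:
    """Return dotted names of fields that still hold their empty defaults.

    Hypothetical helper, not part of qc-cli. It only checks the two fields
    this README calls out as needing manual input.
    """
    missing: list[str] = []

    # Tolerate missing or null sections by falling back to empty dicts.
    training = (config.get("sagemaker") or {}).get("training") or {}
    if not training.get("image_uri"):
        missing.append("sagemaker.training.image_uri")

    if for_ai_hub:
        aihub = config.get("aihub") or {}
        if not aihub.get("input_specs"):
            missing.append("aihub.input_specs")

    return missing


# Shape mirrors a freshly generated config.yaml after parsing (abbreviated).
fresh_config = {
    "sagemaker": {"training": {"image_uri": "", "instance_type": "ml.m5.xlarge"}},
    "aihub": {"target_runtime": "tflite", "input_specs": {}},
}
print(unset_required_fields(fresh_config, for_ai_hub=True))
```

With the fresh defaults shown, both field names are reported; once `image_uri` and `input_specs` are filled in, the list is empty.
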
To provision an MLflow tracking server, set:

```yaml
mlflow:
  mode: create
  experiment_name: qc-cli-training
  registered_model_name: qc-cli-model
  register_trained_models: true
```

In `create` mode, the CLI manages the tracking server name from `infra.stack_name`; you do not need to set `tracking_server_name`.

To use an existing MLflow tracking server, set:

```yaml
mlflow:
  mode: existing
  tracking_server_name: your-tracking-server-name
```

When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. Training metrics can be uploaded with `train start --upload-metrics` or `mlflow upload-metrics`.

To open the managed SageMaker MLflow UI, request a fresh presigned URL:

```bash
qc-cli mlflow open --config config.yaml
```

The command opens the presigned URL in a browser. It works for `mode: create` and for `mode: existing` when the existing server is managed by Amazon SageMaker. In `create` mode, the command uses the CLI-managed tracking server name. In `existing` mode, it uses `mlflow.tracking_server_name`. If the existing MLflow server is external to SageMaker, open it with that server's own URL instead.

## Commands

### `init`

```
qc-cli init                   Write config.yaml
qc-cli init --output <path>   Write config to a custom path
qc-cli init --force           Overwrite an existing config file
```

### `infra`

```
qc-cli infra setup                                           Deploy the CDK stack
qc-cli infra setup --no-bootstrap                            Deploy without running CDK bootstrap
qc-cli infra setup --cloudformation-execution-policy <arn>   Set CDK bootstrap execution policy ARN
qc-cli infra status                                          Show CDK stack/resource status
qc-cli infra destroy                                         Destroy stack, retaining S3 data
qc-cli infra destroy --yes                                   Destroy stack without confirmation
qc-cli infra destroy --delete-bucket-data                    Destroy stack and delete S3 data
```

`--cloudformation-execution-policy` is a one-time CDK bootstrap option, not a `config.yaml` setting. Pass it on `infra setup` when you need the CDK bootstrap CloudFormation execution role to use a policy other than the default `AdministratorAccess`:

```bash
qc-cli infra setup --cloudformation-execution-policy arn:aws:iam::aws:policy/PowerUserAccess
```

### `mlflow`

```
qc-cli mlflow open                        Open a presigned MLflow UI URL
qc-cli mlflow upload-metrics [job-name]   Upload completed training metrics
```

`mlflow upload-metrics` defaults to the last submitted training job. It creates or recovers the job's MLflow run, imports `training_metrics.json` from the SageMaker model artifact, and records successful upload in `.qc-cli.json`. Use `--force` to upload the metrics again.

### `upload`

```
qc-cli upload <file>                  Upload a single file to S3
qc-cli upload <dir>                   Upload all files in a directory tree to S3
qc-cli upload <file> --s3-key <key>   Upload a file to a custom S3 key
```

Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads default to `s3://<bucket>/<data_prefix>/<filename>`. Directory uploads are recursive, preserve paths relative to the uploaded directory, and place files under `s3://<bucket>/<data_prefix>/`.

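The mapping from local paths to S3 locations described above can be illustrated with a short sketch. `upload_targets` is a hypothetical function written for this README, not the CLI's implementation; the bucket name and the `data` prefix are placeholder values, and a non-empty prefix is assumed.

```python
from pathlib import Path
import tempfile


def upload_targets(bucket: str, data_prefix: str, source: Path) -> dict[Path, str]:
    """Map each local file to its destination S3 URI (illustrative only)."""
    prefix = data_prefix.strip("/")
    if source.is_file():
        # Single file: s3://<bucket>/<data_prefix>/<filename>
        return {source: f"s3://{bucket}/{prefix}/{source.name}"}

    targets: dict[Path, str] = {}
    # Directory: walk recursively and keep paths relative to the directory.
    for path in sorted(source.rglob("*")):
        if path.is_file():
            relative = path.relative_to(source).as_posix()
            targets[path] = f"s3://{bucket}/{prefix}/{relative}"
    return targets


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my-dataset"
    (root / "images").mkdir(parents=True)
    (root / "images" / "a.jpg").write_bytes(b"")
    (root / "labels.csv").write_text("id,label\n")
    for uri in upload_targets("example-bucket", "data", root).values():
        print(uri)
```

Note that the directory name itself (`my-dataset`) does not appear in the resulting keys, which matches the "relative to the uploaded directory" rule.
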
### `train`

```
qc-cli train start                    Submit a SageMaker training job
qc-cli train start --upload-metrics   Submit, wait, and upload metrics
qc-cli train status [job-name]        Show job status; defaults to the last submitted job
qc-cli train list                     List recent training jobs
qc-cli train list --limit 3           Show a custom number of recent jobs
```

`train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.

`train start --upload-metrics` checks SageMaker every 30 seconds by default, then uploads metrics after completion. Use `--poll-interval <seconds>` to choose another positive interval. Stopping the local command does not stop the SageMaker job.

The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.

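The polling behavior of `train start --upload-metrics` can be pictured as a loop like the one below. This is a sketch under assumptions, not the CLI's code: `wait_for_training_job` and its parameters are hypothetical, and the status source is injected so the example runs without AWS access. The terminal status values are those of SageMaker's `TrainingJobStatus` field.

```python
import time
from typing import Callable

# Terminal values of SageMaker's TrainingJobStatus field.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(
    get_status: Callable[[], str],
    poll_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status and return that status."""
    if poll_interval <= 0:
        raise ValueError("poll interval must be positive")
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Interrupting here ends only this local loop, not the remote job.
        sleep(poll_interval)


# Simulated status sequence; the sleep is stubbed out so the demo is instant.
statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_training_job(lambda: next(statuses), poll_interval=1, sleep=lambda _: None)
print(final)
```

Against a real job, `get_status` could wrap a boto3 call such as `client.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]`.
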
### `ai-hub`

```
qc-cli ai-hub upload <calibration.npz|calibration-dir> <inputs.npz|inputs.npy>
qc-cli ai-hub upload <calibration> <inputs> --from-step validate
qc-cli ai-hub optimize [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
qc-cli ai-hub quantize <calibration.npz|calibration-dir> [--model-id ID] [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
qc-cli ai-hub compile [--model-id ID] [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
qc-cli ai-hub validate <inputs.npz|inputs.npy> [--model-id ID] [--input-name NAME]
qc-cli ai-hub profile [--model-id ID]
qc-cli ai-hub download [--model-id ID] [--output PATH]
```

`ai-hub upload` optimizes to ONNX, quantizes, validates, and profiles. When `aihub.target_runtime` is not `onnx`, it also compiles the quantized model to that deployment runtime. The initial ONNX optimization gives external models Workbench provenance and applies compiler optimization passes before quantization.

Resume behavior:

```text
--from-step optimize   Run optimize, quantize, optional final compile, validate, and profile.
--from-step quantize   Quantize the last optimized ONNX, then optionally compile, validate, and profile.
--from-step compile    Skip optimize and quantize; finalize the last quantized model for the target runtime.
--from-step validate   Skip optimize, quantize, and compile; validate the last compiled model.
--from-step profile    Skip optimize, quantize, compile, and validate; profile the last compiled model.
```

When a step runs in the current command, `upload` passes its returned model ID directly to the next step. When a step is skipped, the next step resolves the needed model ID from `.qc-cli.json`. This avoids re-running earlier AI Hub jobs when you only need to continue from a later step.

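The resume rules above amount to slicing an ordered step list and dropping the compile step for ONNX targets. The sketch below illustrates that reading; `planned_steps` is a hypothetical function, and treating `--from-step compile` with `target_runtime: onnx` as "skip straight to validate" is an assumption drawn from the note that the redundant compile is skipped.

```python
# Order of the ai-hub upload pipeline as described in this README.
STEPS = ("optimize", "quantize", "compile", "validate", "profile")


def planned_steps(from_step: str = "optimize", target_runtime: str = "tflite") -> list[str]:
    """Return the steps that would run for a given --from-step (illustrative)."""
    if from_step not in STEPS:
        raise ValueError(f"unknown step: {from_step!r}")
    # Everything before from_step is skipped and resolved from saved state.
    steps = list(STEPS[STEPS.index(from_step):])
    # For ONNX targets the quantized ONNX is already the final model.
    if target_runtime == "onnx" and "compile" in steps:
        steps.remove("compile")
    return steps


print(planned_steps("quantize"))
print(planned_steps("optimize", target_runtime="onnx"))
```

The first call lists `quantize`, `compile`, `validate`, `profile`; the second omits `compile` because the target runtime is `onnx`.
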
`ai-hub optimize` compiles an external model with `--target_runtime onnx`. `ai-hub quantize` uses an explicit `--model-id`, the last optimized ONNX model, or an explicit/local model source in that order. `ai-hub compile` resolves model sources in this order: `--model-id`, explicit source options, last quantized model, then the last training job. For `target_runtime: onnx`, `ai-hub upload` treats the quantized ONNX as the final model and skips a redundant second compile. `ai-hub download` remains separate because downloading is outside the Workbench processing loop.

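The `ai-hub compile` resolution order can be sketched as a first-match lookup. Everything named here is hypothetical: `resolve_compile_source`, the `state` keys standing in for `.qc-cli.json` contents, and the relative order of the three explicit source options, which this README does not specify.

```python
from typing import Optional


def resolve_compile_source(
    model_id: Optional[str] = None,
    onnx_path: Optional[str] = None,
    model_s3_uri: Optional[str] = None,
    from_job: Optional[str] = None,
    state: Optional[dict] = None,
) -> tuple[str, str]:
    """Return (kind, value) for the first available model source."""
    state = state or {}
    candidates = [
        ("model_id", model_id),                               # 1. --model-id
        ("onnx_path", onnx_path),                             # 2. explicit source options
        ("model_s3_uri", model_s3_uri),                       #    (order among these is assumed)
        ("training_job", from_job),
        ("model_id", state.get("last_quantized_model_id")),   # 3. last quantized model
        ("training_job", state.get("last_training_job")),     # 4. last training job
    ]
    for kind, value in candidates:
        if value:
            return kind, value
    raise LookupError("no model source available")


saved_state = {"last_quantized_model_id": "m-quantized", "last_training_job": "job-42"}
print(resolve_compile_source(state=saved_state))
print(resolve_compile_source(onnx_path="model.onnx", state=saved_state))
```

With no flags the saved quantized model wins; passing any explicit option overrides the saved state.
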
AI Hub authentication currently uses the local `qai-hub` SDK configuration. A planned follow-up is support for team-managed tokens stored as AWS Systems Manager Parameter Store `SecureString` parameters. In that design, `config.yaml` stores only a parameter name such as `/qc-cli/aihub/token`, AWS KMS encrypts the token at rest, and the CLI retrieves it at runtime, which requires `ssm:GetParameter` and `kms:Decrypt` permissions.

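Since the Parameter Store integration is only planned, the following is a speculative sketch of what the retrieval step might look like, not existing CLI behavior. `read_aihub_token` is a hypothetical name; the client is injected so the example runs without AWS credentials, and the fake client mimics the response shape of the SSM `GetParameter` call.

```python
from typing import Any


def read_aihub_token(ssm_client: Any, parameter_name: str = "/qc-cli/aihub/token") -> str:
    """Fetch and decrypt a SecureString parameter (hypothetical helper)."""
    # WithDecryption=True asks SSM to decrypt the value via KMS, which is why
    # the caller needs kms:Decrypt in addition to ssm:GetParameter.
    response = ssm_client.get_parameter(Name=parameter_name, WithDecryption=True)
    return response["Parameter"]["Value"]


class FakeSsmClient:
    """Stand-in for an SSM client, used only for this demonstration."""

    def get_parameter(self, Name: str, WithDecryption: bool) -> dict:
        return {"Parameter": {"Name": Name, "Type": "SecureString", "Value": "example-token"}}


print(read_aihub_token(FakeSsmClient()))
```

In real use the argument would be a boto3 client, for example `boto3.client("ssm")`, created with the profile and region from `config.yaml`.
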
## Model lifecycle

The CLI uses neutral experiment naming for trained artifacts and reserves release terminology for an explicit promotion step.

Current behavior:

1. `qc-cli train start` submits a SageMaker training job.
2. `qc-cli train status` reads and displays SageMaker status only; it does not contact MLflow.
3. `qc-cli train start --upload-metrics` polls every 30 seconds by default, then uploads per-epoch metrics after completion.
4. `qc-cli mlflow upload-metrics [job-name]` uploads or retries metrics for an existing completed job.
5. The metrics upload workflow finalizes the MLflow run and, when `mlflow.register_trained_models` is enabled, registers the SageMaker `model.tar.gz` as a new MLflow model version with:
   - `qc_cli.stage=experiment`
   - `qc_cli.artifact_kind=trained_source`
   - `qc_cli.source=sagemaker`
6. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
7. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.

Training scripts can include a `training_metrics.json` file in the SageMaker model directory. When present, the explicit metrics upload command logs its ordered metrics to the associated MLflow run using each epoch as the MLflow step and stores the JSON as a run artifact:

```json
{
  "schema_version": 1,
  "steps": [
    {"step": 0, "metrics": {"val.precision": 0.72, "val.recall": 0.68}}
  ],
  "summary": {"summary.best_epoch": 0}
}
```

Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative, unique, and strictly increasing. If the file is missing, the command uploads the final metrics reported by SageMaker and continues model registration without per-epoch history. A malformed metrics artifact still fails the upload command without affecting the trained model or model registration.

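A training script can check its own `training_metrics.json` against these rules before writing it. The validator below is a minimal sketch, not the CLI's implementation: `validate_training_metrics` is a hypothetical name, steps are assumed to be integers, and only the `steps` list is checked.

```python
import json
import math
from numbers import Real


def validate_training_metrics(document: dict) -> None:
    """Raise ValueError if the per-step metrics break the rules above.

    Illustrative only: "summary" and "schema_version" are not checked.
    """
    previous_step = -1
    for entry in document.get("steps", []):
        step = entry["step"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(step, bool) or not isinstance(step, int) or step < 0:
            raise ValueError(f"step must be a non-negative integer: {step!r}")
        # Strictly increasing steps are also unique.
        if step <= previous_step:
            raise ValueError(f"steps must be strictly increasing: {step!r}")
        previous_step = step
        for name, value in entry["metrics"].items():
            if not isinstance(name, str) or not name:
                raise ValueError("metric names must be non-empty strings")
            if isinstance(value, bool) or not isinstance(value, Real) or not math.isfinite(value):
                raise ValueError(f"metric {name!r} must be a finite number")


example = json.loads("""
{
  "schema_version": 1,
  "steps": [
    {"step": 0, "metrics": {"val.precision": 0.72, "val.recall": 0.68}}
  ],
  "summary": {"summary.best_epoch": 0}
}
""")
validate_training_metrics(example)  # the README example passes silently
```

A document with a repeated step, a negative step, an empty metric name, or a `NaN` value raises `ValueError` instead.
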
Future release aliases such as `v1` or `production` can point at a selected deployable artifact.

Example future metadata:

```text
qc-cli-model version 12
  qc_cli.stage=experiment
  qc_cli.artifact_kind=trained_source
  qc_cli.source=sagemaker

qc-cli-model-aihub version 3
  qc_cli.stage=ai_hub_compiled
  qc_cli.artifact_kind=deployable
  qc_cli.parent_registered_model_name=qc-cli-model
  qc_cli.parent_model_version=12
  qc_cli.runtime=tflite
  qc_cli.quantization=int8
  qc_cli.target_device=Samsung Galaxy S25
```

In that flow, `experiment-latest` remains a training convenience alias. Release selection is a separate promotion decision based on the derived artifact, not on the experiment name.

## AWS permissions required

The IAM user or role running the CLI needs:

| Action | Service |
|---|---|
| CreateBucket, DeleteBucket, PutObject, GetObject, ListBucket, DeleteObject | S3 |
| CreateRole, GetRole, DeleteRole, AttachRolePolicy, DetachRolePolicy | IAM |
| CreateStack, UpdateStack, DeleteStack, DescribeStacks, DescribeStackEvents | CloudFormation |
| GetCallerIdentity | STS |
| CreateTrainingJob, DescribeTrainingJob, ListTrainingJobs | SageMaker AI |
| CreateMlflowTrackingServer, DescribeMlflowTrackingServer, DeleteMlflowTrackingServer | SageMaker AI, when `mlflow.mode` is `create` or `existing` |

`AdministratorAccess` covers all of the above.