# qc-cli

A CLI for Qualcomm's MLOps pipeline: browse and download models from Qualcomm AI Hub, fine-tune them on custom datasets using SageMaker, validate inference, and prepare artifacts for Qualcomm hardware deployment.

## Requirements

- Python 3.13+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
- AWS account with credentials configured (`aws configure`) when using `qc-cli infra`
- AWS CDK CLI (`npm install -g aws-cdk`) when using `qc-cli infra setup` or `qc-cli infra destroy`

## Installation

```bash
git clone <repo>
cd qc-cli
uv sync
```

Run commands with `uv run qc-cli <command>` or activate the venv first:

```bash
source .venv/bin/activate
qc-cli --help
```

## Quick start

```bash
# 1. Create config.yaml in the current directory
qc-cli init

# 2. Edit config.yaml; at minimum, set sagemaker.training.image_uri

# 3. Provision AWS infrastructure (S3 bucket + SageMaker IAM role).
#    This is the step that requires the AWS CDK CLI.
qc-cli infra setup

# 4. Upload training data, then submit a SageMaker training job.
qc-cli upload ./my-dataset
qc-cli train start
qc-cli train status
```

## Configuration

`qc-cli init` writes a `config.yaml` in the current directory. Review the generated fields before using the tool; at minimum, set `sagemaker.training.image_uri` before training and `aihub.input_specs` before running `qc-cli ai-hub` commands:

```yaml
infra:
  stack_name: qc-cli-mlops-1a2b3c4d5e6f

aws:
  region: us-east-1
  profile: default # AWS CLI profile name

s3:
  bucket: qc-cli-mlops-1a2b3c4d5e6f-data

sagemaker:
  training:
    image_uri: "" # ECR URI for your training container
    instance_type: ml.m5.xlarge
    instance_count: 1
    entry_point: null # Optional: script inside source_dir
    source_dir: null # Optional: local dir packaged and uploaded automatically
    hyperparameters: {}

aihub:
  device:
    name: Samsung Galaxy S25 (Family)
  target_runtime: tflite
  input_specs: {} # Required before running qc-cli ai-hub commands
  job_name: null # Optional prefix for AI Hub Workbench jobs
  model_name: null # Optional name for uploaded local ONNX models
  compile_options: null
  profile_options: null
  quantize_options: null
  output_dir: build/qai-hub
```

`qc-cli init` generates the namespace shared by `infra.stack_name` and `s3.bucket` once and writes it to `config.yaml`. Keep these values stable for a deployment; changing them points the CLI at different infrastructure.

The CLI isolates both application resources and CDK bootstrap resources. The application CloudFormation stack uses `infra.stack_name`, the S3 bucket uses the same generated namespace because bucket names are globally unique, and the SageMaker IAM role uses a CloudFormation-generated physical name. CDK bootstrap resources are derived internally from `infra.stack_name`, including a bootstrap stack named `<stack_name>-bootstrap` and a matching non-default CDK asset bucket qualifier. `qc-cli infra destroy` removes the application stack but leaves the CDK bootstrap stack in place; the command prints the retained bootstrap stack name.

`hyperparameters` is a flat map of values passed to the training container. Valid keys depend on the selected training image and entry point.

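
The rules above (a non-empty `image_uri`, non-empty `input_specs` before AI Hub commands, and a flat `hyperparameters` map) can be checked before running anything. The following is a minimal sketch, assuming the YAML has already been loaded into a plain dictionary; `missing_settings` is a hypothetical helper for illustration, not a function of `qc-cli`.

```python
def missing_settings(config: dict, *, for_ai_hub: bool = False) -> list[str]:
    """Return human-readable problems; an empty list means the config looks usable."""
    problems = []
    training = config.get("sagemaker", {}).get("training", {})
    if not training.get("image_uri"):
        problems.append("sagemaker.training.image_uri is empty")
    # "Flat map" is read here as: scalar values only, no nested maps or lists.
    for key, value in (training.get("hyperparameters") or {}).items():
        if isinstance(value, (dict, list)):
            problems.append(f"sagemaker.training.hyperparameters.{key} is not a scalar")
    if for_ai_hub and not config.get("aihub", {}).get("input_specs"):
        problems.append("aihub.input_specs is empty")
    return problems


# Example config with all three problems present.
config = {
    "sagemaker": {
        "training": {
            "image_uri": "",
            "hyperparameters": {"epochs": 10, "optimizer": {"lr": 0.1}},
        }
    },
    "aihub": {"input_specs": {}},
}
for problem in missing_settings(config, for_ai_hub=True):
    print(problem)
```
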
To provision an MLflow tracking server, set:

```yaml
mlflow:
  mode: create
  experiment_name: qc-cli-training
  registered_model_name: qc-cli-model
  register_trained_models: true
```

In `create` mode, the CLI manages the tracking server name from `infra.stack_name`; you do not need to set `tracking_server_name`.

To use an existing MLflow tracking server, set:

```yaml
mlflow:
  mode: existing
  tracking_server_name: your-tracking-server-name
```

When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. `train status` finalizes that run once the job reaches a terminal state and registers completed model artifacts as experiment model versions using the `experiment-latest` MLflow alias. An experiment version is an immutable trained-source artifact; it records that training produced a model, not that the model is better than earlier versions or ready for release.

To open the managed SageMaker MLflow UI, request a fresh presigned URL:

```bash
qc-cli mlflow open --config config.yaml
```

This opens a browser to a fresh presigned URL. It works for `mode: create` and for `mode: existing` when the existing server is managed by Amazon SageMaker. In `create` mode, the command uses the CLI-managed tracking server name. In `existing` mode, it uses `mlflow.tracking_server_name`. If the existing MLflow server is external to SageMaker, open it with that server's own URL instead.

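
For readers who want to see what requesting a presigned URL involves, here is a rough sketch using the boto3 SageMaker client call `create_presigned_mlflow_tracking_server_url`. It is an assumption that `qc-cli mlflow open` works along these lines; the helper names are hypothetical, and the CLI-managed server name is passed in because its derivation is not documented here.

```python
def resolve_tracking_server_name(mlflow_cfg: dict, managed_name: str) -> str:
    """Pick the tracking server name according to mlflow.mode."""
    mode = mlflow_cfg.get("mode")
    if mode == "create":
        return managed_name  # CLI-managed name (assumed to be supplied by the caller)
    if mode == "existing":
        return mlflow_cfg["tracking_server_name"]
    raise ValueError("MLflow is not enabled in config.yaml")


def presigned_mlflow_url(sagemaker_client, server_name: str) -> str:
    """Request a fresh presigned URL for a SageMaker-managed MLflow server."""
    response = sagemaker_client.create_presigned_mlflow_tracking_server_url(
        TrackingServerName=server_name
    )
    return response["AuthorizedUrl"]


# Usage (needs boto3 and AWS credentials, so it is left as a comment):
#   import boto3, webbrowser
#   session = boto3.Session(profile_name="default", region_name="us-east-1")
#   name = resolve_tracking_server_name({"mode": "existing",
#                                        "tracking_server_name": "your-tracking-server-name"}, "")
#   webbrowser.open(presigned_mlflow_url(session.client("sagemaker"), name))
```

Passing the client in as an argument keeps the sketch testable with a stub and leaves profile and region handling to the caller.
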
## Commands

### `init`

```
qc-cli init                    Write config.yaml
qc-cli init --output <path>    Write config to a custom path
qc-cli init --force            Overwrite an existing config file
```

### `mlflow`

```
qc-cli mlflow open    Open a presigned MLflow UI URL in a browser
```

### `infra`

```
qc-cli infra setup                                            Deploy the CDK stack
qc-cli infra setup --no-bootstrap                             Deploy without running CDK bootstrap
qc-cli infra setup --cloudformation-execution-policy <arn>    Set CDK bootstrap execution policy ARN
qc-cli infra status                                           Show CDK stack/resource status
qc-cli infra destroy                                          Destroy stack, retaining S3 data
qc-cli infra destroy --yes                                    Destroy stack without confirmation
qc-cli infra destroy --delete-bucket-data                     Destroy stack and delete S3 data
```

`--cloudformation-execution-policy` is a one-time CDK bootstrap option, not a `config.yaml` setting. Pass it on `infra setup` when you need the CDK bootstrap CloudFormation execution role to use a policy other than the default `AdministratorAccess`:

```bash
qc-cli infra setup --cloudformation-execution-policy arn:aws:iam::aws:policy/PowerUserAccess
```

### `upload`

```
qc-cli upload <file>                   Upload a single file to S3
qc-cli upload <dir>                    Upload all files in a directory tree to S3
qc-cli upload <file> --s3-key <key>    Upload a file to a custom S3 key
```

Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads default to `s3://<bucket>/<data_prefix>/<filename>`. Directory uploads are recursive, preserve paths relative to the uploaded directory, and place files under `s3://<bucket>/<data_prefix>/`.

### `train`

```
qc-cli train start                Submit a SageMaker training job
qc-cli train status [job-name]    Show job status; defaults to the last submitted job
qc-cli train list                 List recent training jobs
qc-cli train list --limit 3       Show a custom number of recent jobs
```

`train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.

The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.

### `ai-hub`

```
qc-cli ai-hub upload <calibration.npz|calibration-dir> <inputs.npz|inputs.npy>
qc-cli ai-hub upload <calibration> <inputs> --from-step validate
qc-cli ai-hub quantize <calibration.npz|calibration-dir> [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
qc-cli ai-hub compile [--model-id ID] [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
qc-cli ai-hub validate <inputs.npz|inputs.npy> [--model-id ID] [--input-name NAME]
qc-cli ai-hub profile [--model-id ID]
qc-cli ai-hub download [--model-id ID] [--output PATH]
```

`ai-hub upload` runs the four Workbench upload steps in order: quantize, compile, validate, and profile. Use `--from-step compile`, `--from-step validate`, or `--from-step profile` to resume from saved local state after a completed earlier step.

Resume behavior:

```text
--from-step quantize    Run quantize, compile, validate, and profile.
--from-step compile     Skip quantize; compile the last quantized model unless an explicit source is passed.
--from-step validate    Skip quantize and compile; validate the last compiled model.
--from-step profile     Skip quantize, compile, and validate; profile the last compiled model.
```

When a step runs in the current command, `upload` passes its returned model ID directly to the next step. When a step is skipped, the next step resolves the needed model ID from `.qc-cli.json`. This avoids re-running earlier AI Hub jobs when you only need to continue from a later step.

`ai-hub compile` resolves model sources in this order: `--model-id`, explicit source options (`--onnx-path`, `--model-s3-uri`, `--from-job`), last quantized model from state, then the last training job from local state. `ai-hub download` is separate because downloading the optimized artifact is outside the four-step Workbench upload loop.

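
That precedence can be written down as a small function. The sketch below is an assumption about shape, not the CLI's code: the state keys are hypothetical, and the order among the three explicit source options is arbitrary here because the text does not say how the CLI treats more than one of them.

```python
def resolve_compile_source(options: dict, state: dict) -> tuple[str, str]:
    """Return (kind, value) for the first available model source."""
    # 1. An explicit model ID always wins.
    if options.get("model_id"):
        return ("model_id", options["model_id"])
    # 2. Explicit source options.
    for name in ("onnx_path", "model_s3_uri", "from_job"):
        if options.get(name):
            return (name, options[name])
    # 3. Last quantized model from local state (hypothetical key).
    if state.get("last_quantized_model_id"):
        return ("quantized_model", state["last_quantized_model_id"])
    # 4. Last training job from local state (hypothetical key).
    if state.get("last_training_job"):
        return ("training_job", state["last_training_job"])
    raise LookupError("no model source available; pass --model-id or a source option")


state = {"last_quantized_model_id": "q-1", "last_training_job": "job-42"}
print(resolve_compile_source({}, state))
print(resolve_compile_source({"from_job": "job-7"}, state))
```
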
When MLflow is enabled, AI Hub job-producing commands (`quantize`, `compile`, `validate`, `profile`, and `upload`) log AI Hub metadata to MLflow. Each command execution receives a `qc_cli.aihub_submission_id`; all steps inside one `ai-hub upload` share that submission ID. Runs are nested under the MLflow run for the resolved source model when the CLI can prove that source from local state, such as `--from-job` or a model produced by a prior tracked AI Hub step. Otherwise, AI Hub runs are standalone. `validate` also logs output summaries, and `profile` logs profile metrics plus the raw profile JSON. `ai-hub download` does not create an MLflow run because it does not submit or measure an AI Hub job.

AI Hub authentication currently uses the local `qai-hub` SDK configuration. A planned follow-up is to support AWS Systems Manager Parameter Store `SecureString` for team-managed tokens, where `config.yaml` stores only a parameter name such as `/qc-cli/aihub/token`, AWS KMS encrypts the token at rest, and the CLI retrieves it at runtime with `ssm:GetParameter` plus `kms:Decrypt` permissions.

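
Since the Parameter Store integration is only planned, the following is a speculative sketch of what the retrieval step might look like, using the boto3 SSM client's `get_parameter` call with decryption enabled. `fetch_aihub_token` is a hypothetical name.

```python
def fetch_aihub_token(ssm_client, parameter_name: str) -> str:
    """Read a SecureString parameter and return its decrypted value."""
    response = ssm_client.get_parameter(Name=parameter_name, WithDecryption=True)
    return response["Parameter"]["Value"]


# Usage (needs boto3, AWS credentials, and ssm:GetParameter + kms:Decrypt):
#   import boto3
#   token = fetch_aihub_token(boto3.client("ssm"), "/qc-cli/aihub/token")
```

With this shape, `config.yaml` would hold only the parameter name, and the token itself would never be written to disk by the CLI.
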
## Model lifecycle

The CLI uses neutral experiment naming for trained artifacts and reserves release terminology for an explicit promotion step.

Current behavior:

1. `qc-cli train start` submits a SageMaker training job.
2. `qc-cli train status` finalizes the MLflow run after the job reaches a terminal state.
3. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
   - `qc_cli.stage=experiment`
   - `qc_cli.artifact_kind=trained_source`
   - `qc_cli.source=sagemaker`
4. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
5. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.

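
Steps 3 and 4 can be pictured with the MLflow client API. The sketch below assumes the `qc_cli.*` values are stored as model version tags, which the text does not state explicitly; `mark_experiment_version` is a hypothetical helper, and `client` is expected to behave like `mlflow.tracking.MlflowClient`.

```python
EXPERIMENT_TAGS = {
    "qc_cli.stage": "experiment",
    "qc_cli.artifact_kind": "trained_source",
    "qc_cli.source": "sagemaker",
}


def mark_experiment_version(client, model_name: str, version: str) -> None:
    """Tag a freshly registered model version and move the convenience alias to it."""
    for key, value in EXPERIMENT_TAGS.items():
        client.set_model_version_tag(model_name, version, key, value)
    # The alias always follows the most recently registered experiment version.
    client.set_registered_model_alias(model_name, "experiment-latest", version)


# Usage (needs mlflow and a reachable tracking server):
#   from mlflow.tracking import MlflowClient
#   mark_experiment_version(MlflowClient(), "qc-cli-model", "12")
```
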
Future release aliases such as `v1` or `production` can point at a selected deployable artifact.

Example future metadata:

```text
qc-cli-model version 12
  qc_cli.stage=experiment
  qc_cli.artifact_kind=trained_source
  qc_cli.source=sagemaker

qc-cli-model-aihub version 3
  qc_cli.stage=ai_hub_compiled
  qc_cli.artifact_kind=deployable
  qc_cli.parent_registered_model_name=qc-cli-model
  qc_cli.parent_model_version=12
  qc_cli.runtime=tflite
  qc_cli.quantization=int8
  qc_cli.target_device=Samsung Galaxy S25
```

In that flow, `experiment-latest` remains a training convenience alias. Release selection is a separate promotion decision based on the derived artifact, not on the experiment name.

## AWS permissions required

The IAM user or role running the CLI needs:

| Action | Service |
|---|---|
| CreateBucket, DeleteBucket, PutObject, GetObject, ListBucket, DeleteObject | S3 |
| CreateRole, GetRole, DeleteRole, AttachRolePolicy, DetachRolePolicy | IAM |
| CreateStack, UpdateStack, DeleteStack, DescribeStacks, DescribeStackEvents | CloudFormation |
| GetCallerIdentity | STS |
| CreateTrainingJob, DescribeTrainingJob, ListTrainingJobs | SageMaker AI |
| CreateMlflowTrackingServer, DescribeMlflowTrackingServer, DeleteMlflowTrackingServer | SageMaker AI, when `mlflow.mode` is `create` or `existing` |

`AdministratorAccess` covers all of the above.

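
If `AdministratorAccess` is too broad, the table can be turned into a starting-point IAM policy. The sketch below only transcribes the actions listed above into the standard policy JSON shape; the `Resource: "*"` entries are placeholders to be narrowed, the statement IDs are made up, and it is an assumption that no further actions are needed in a given account.

```python
import json

# Actions copied from the table above, keyed by IAM service prefix.
REQUIRED_ACTIONS = {
    "s3": ["CreateBucket", "DeleteBucket", "PutObject", "GetObject", "ListBucket", "DeleteObject"],
    "iam": ["CreateRole", "GetRole", "DeleteRole", "AttachRolePolicy", "DetachRolePolicy"],
    "cloudformation": ["CreateStack", "UpdateStack", "DeleteStack", "DescribeStacks", "DescribeStackEvents"],
    "sts": ["GetCallerIdentity"],
    "sagemaker": ["CreateTrainingJob", "DescribeTrainingJob", "ListTrainingJobs"],
}
MLFLOW_ACTIONS = [
    "CreateMlflowTrackingServer",
    "DescribeMlflowTrackingServer",
    "DeleteMlflowTrackingServer",
]


def policy_document(mlflow_enabled: bool = False) -> dict:
    """Build an IAM policy document with one statement per service."""
    actions = {service: list(names) for service, names in REQUIRED_ACTIONS.items()}
    if mlflow_enabled:
        actions["sagemaker"] += MLFLOW_ACTIONS
    statements = [
        {
            "Sid": f"QcCli{service.capitalize()}",  # hypothetical statement IDs
            "Effect": "Allow",
            "Action": [f"{service}:{name}" for name in names],
            "Resource": "*",  # placeholder; narrow to specific ARNs in practice
        }
        for service, names in actions.items()
    ]
    return {"Version": "2012-10-17", "Statement": statements}


print(json.dumps(policy_document(mlflow_enabled=True), indent=2))
```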