qai-cli/README.md

# qc-cli

A CLI for Qualcomm's MLOps pipeline — browse and download models from Qualcomm AI Hub, fine-tune them on custom datasets using SageMaker, validate inference, and prepare artifacts for Qualcomm hardware deployment.

## Requirements

- Python 3.13+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
- AWS account with credentials configured (`aws configure`) when using `qc-cli infra`
- AWS CDK CLI (`npm install -g aws-cdk`) when using `qc-cli infra setup` or `qc-cli infra destroy`

## Installation

```bash
git clone <repo>
cd qc-cli
uv sync
```

Run commands with `uv run qc-cli <command>` or activate the venv first:

```bash
source .venv/bin/activate
qc-cli --help
```

## Quick start

```bash
# 1. Create config.yaml in the current directory
qc-cli init

# 2. Edit config.yaml — at minimum set sagemaker.training.image_uri

# 3. Provision AWS infrastructure (S3 bucket + SageMaker IAM role).
#    This is the step that requires the AWS CDK CLI.
qc-cli infra setup

# 4. Upload training data, then submit a SageMaker training job.
qc-cli upload ./my-dataset
qc-cli train start
qc-cli train status
```

## Configuration

`qc-cli init` writes a `config.yaml` in the current directory. The fields you must fill in before using the tool:

```yaml
infra:
  stack_name: qc-cli-mlops-1a2b3c4d5e6f

aws:
  region: us-east-1
  profile: default          # AWS CLI profile name

s3:
  bucket: qc-cli-mlops-1a2b3c4d5e6f-data

sagemaker:
  training:
    image_uri: ""           # ECR URI for your training container
    instance_type: ml.m5.xlarge
    instance_count: 1
    entry_point: null       # Optional: script inside source_dir
    source_dir: null        # Optional: local dir packaged and uploaded automatically
    hyperparameters: {}

aihub:
  device:
    name: Samsung Galaxy S25 (Family)
  target_runtime: tflite
  input_specs: {}           # Required before running qc-cli ai-hub commands
  job_name: null            # Optional prefix for AI Hub Workbench jobs
  model_name: null          # Optional name for uploaded local ONNX models
  compile_options: null
  profile_options: null
  quantize_options: null
  output_dir: build/qai-hub
```

`qc-cli init` generates the `infra.stack_name` and `s3.bucket` namespace once and writes it to `config.yaml`. Keep these values stable for a deployment; changing them points the CLI at different infrastructure.

The CLI isolates both application resources and CDK bootstrap resources. The application CloudFormation stack uses `infra.stack_name`, the S3 bucket uses the same generated namespace because bucket names are globally unique, and the SageMaker IAM role uses a CloudFormation-generated physical name. CDK bootstrap resources are derived internally from `infra.stack_name`, including a bootstrap stack named `<stack_name>-bootstrap` and a matching non-default CDK asset bucket qualifier. `qc-cli infra destroy` removes the application stack but leaves the CDK bootstrap stack in place; the command prints the retained bootstrap stack name.

`hyperparameters` is a flat map of values passed to the training container. Valid keys depend on the selected training image and entry point.

To provision an MLflow tracking server, set:

```yaml
mlflow:
  mode: create
  experiment_name: qc-cli-training
  registered_model_name: qc-cli-model
  register_trained_models: true
```

In `create` mode, the CLI manages the tracking server name from `infra.stack_name`; you do not need to set `tracking_server_name`.

To use an existing MLflow tracking server, set:

```yaml
mlflow:
  mode: existing
  tracking_server_name: your-tracking-server-name
```

When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. Training metrics can be upload with `train start --upload-metrics` or `mlflow upload-metrics`.

To open the managed SageMaker MLflow UI, request a fresh presigned URL:

```bash
qc-cli mlflow open --config config.yaml
```

This opens a browser to a fresh presigned URL. It works for `mode: create` and for `mode: existing` when the existing server is managed by Amazon SageMaker. In `create` mode, the command uses the CLI-managed tracking server name. In `existing` mode, it uses `mlflow.tracking_server_name`. If the existing MLflow server is external to SageMaker, open it with that server's own URL instead.

## Commands

### `init`

```
qc-cli init                  Write config.yaml
qc-cli init --output <path>  Write config to a custom path
qc-cli init --force          Overwrite an existing config file
```

### `infra`

```
qc-cli infra setup                         Deploy the CDK stack
qc-cli infra setup --no-bootstrap          Deploy without running CDK bootstrap
qc-cli infra setup --cloudformation-execution-policy <arn> Set CDK bootstrap execution policy ARN
qc-cli infra status                        Show CDK stack/resource status
qc-cli infra destroy                       Destroy stack, retaining S3 data
qc-cli infra destroy --yes                 Destroy stack without confirmation
qc-cli infra destroy --delete-bucket-data  Destroy stack and delete S3 data
```

`--cloudformation-execution-policy` is a one-time CDK bootstrap option, not a `config.yaml` setting. Pass it on `infra setup` when you need the CDK bootstrap CloudFormation execution role to use a policy other than the default `AdministratorAccess`:

```bash
qc-cli infra setup --cloudformation-execution-policy arn:aws:iam::aws:policy/PowerUserAccess
```

### `mlflow`

```
qc-cli mlflow open                       Open a presigned MLflow UI URL
qc-cli mlflow upload-metrics [job-name] Upload completed training metrics
```

`mlflow upload-metrics` defaults to the last submitted training job. It creates or recovers the job's MLflow run, imports `training_metrics.json` from the SageMaker model artifact, and records successful upload in `.qc-cli.json`. Use `--force` to upload the metrics again.

### `upload`

```
qc-cli upload <file>                 Upload a single file to S3
qc-cli upload <dir>                  Upload all files in a directory tree to S3
qc-cli upload <file> --s3-key <key>  Upload a file to a custom S3 key
```

Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads default to `s3://<bucket>/<data_prefix>/<filename>`. Directory uploads are recursive, preserve paths relative to the uploaded directory, and place files under `s3://<bucket>/<data_prefix>/`.

### `train`

```
qc-cli train start              Submit a SageMaker training job
qc-cli train start --upload-metrics  Submit, wait, and upload metrics
qc-cli train status [job-name]  Show job status; defaults to the last submitted job
qc-cli train list               List recent training jobs
qc-cli train list --limit 3     Show a custom number of recent jobs
```

`train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.

`train start --upload-metrics` checks SageMaker every 30 seconds by default, then uploads metrics after completion. Use `--poll-interval <seconds>` to choose another positive interval. Stopping the local command does not stop the SageMaker job.

The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.

### `ai-hub`

```
qc-cli ai-hub upload <calibration.npz|calibration-dir> <inputs.npz|inputs.npy>
qc-cli ai-hub upload <calibration> <inputs> --from-step validate
qc-cli ai-hub optimize [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
qc-cli ai-hub quantize <calibration.npz|calibration-dir> [--model-id ID] [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
qc-cli ai-hub compile [--model-id ID] [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
qc-cli ai-hub validate <inputs.npz|inputs.npy> [--model-id ID] [--input-name NAME]
qc-cli ai-hub profile [--model-id ID]
qc-cli ai-hub download [--model-id ID] [--output PATH]
```

`ai-hub upload` optimizes to ONNX, quantizes, validates, and profiles. When `aihub.target_runtime` is not `onnx`, it also compiles the quantized model to that deployment runtime. The initial ONNX optimization gives external models Workbench provenance and applies compiler optimization passes before quantization.

Resume behavior:

```text
--from-step optimize  Run optimize, quantize, optional final compile, validate, and profile.
--from-step quantize  Quantize the last optimized ONNX, then optionally compile, validate, and profile.
--from-step compile   Skip optimize and quantize; finalize the last quantized model for the target runtime.
--from-step validate  Skip optimize, quantize, and compile; validate the last compiled model.
--from-step profile   Skip optimize, quantize, compile, and validate; profile the last compiled model.
```

When a step runs in the current command, `upload` passes its returned model ID directly to the next step. When a step is skipped, the next step resolves the needed model ID from `.qc-cli.json`. This avoids re-running earlier AI Hub jobs when you only need to continue from a later step.

`ai-hub optimize` compiles an external model with `--target_runtime onnx`. `ai-hub quantize` uses an explicit `--model-id`, the last optimized ONNX model, or an explicit/local model source in that order. `ai-hub compile` resolves model sources in this order: `--model-id`, explicit source options, last quantized model, then the last training job. For `target_runtime: onnx`, upload treats the quantized ONNX as the final model and skips a redundant second compile. `ai-hub download` remains separate because downloading is outside the Workbench processing loop.

AI Hub authentication currently uses the local `qai-hub` SDK configuration. A planned follow-up is to support AWS Systems Manager Parameter Store `SecureString` for team-managed tokens, where `config.yaml` stores only a parameter name such as `/qc-cli/aihub/token`, AWS KMS encrypts the token at rest, and the CLI retrieves it at runtime with `ssm:GetParameter` plus `kms:Decrypt` permissions.

## Model lifecycle

The CLI uses neutral experiment naming for trained artifacts and reserves release terminology for an explicit promotion step.

Current behavior:

1. `qc-cli train start` submits a SageMaker training job.
2. `qc-cli train status` reads and displays SageMaker status only; it does not contact MLflow.
3. `qc-cli train start --upload-metrics` polls every 30 seconds by default, then uploads per-epoch metrics after completion.
4. `qc-cli mlflow upload-metrics [job-name]` uploads or retries metrics for an existing completed job.
5. The metrics upload workflow finalizes the MLflow run and, when `mlflow.register_trained_models` is enabled, registers the SageMaker `model.tar.gz` as a new MLflow model version with:
   - `qc_cli.stage=experiment`
   - `qc_cli.artifact_kind=trained_source`
   - `qc_cli.source=sagemaker`
6. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
7. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.

Training scripts can include a `training_metrics.json` file in the SageMaker model directory. When present, the explicit metrics upload command logs its ordered metrics to the associated MLflow run using each epoch as the MLflow step and stores the JSON as a run artifact:

```json
{
  "schema_version": 1,
  "steps": [
    {"step": 0, "metrics": {"val.precision": 0.72, "val.recall": 0.68}}
  ],
  "summary": {"summary.best_epoch": 0}
}
```

Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative, unique, and strictly increasing. If the file is missing, the command uploads the final metrics reported by SageMaker and continues model registration without per-epoch history. A malformed metrics artifact still fails the upload command without affecting the trained model or model registration.

Future release aliases such as `v1` or `production` can point at a selected deployable artifact.

Example future metadata:

```text
qc-cli-model version 12
qc_cli.stage=experiment
qc_cli.artifact_kind=trained_source
qc_cli.source=sagemaker

qc-cli-model-aihub version 3
qc_cli.stage=ai_hub_compiled
qc_cli.artifact_kind=deployable
qc_cli.parent_registered_model_name=qc-cli-model
qc_cli.parent_model_version=12
qc_cli.runtime=tflite
qc_cli.quantization=int8
qc_cli.target_device=Samsung Galaxy S25
```

In that flow, `experiment-latest` remains a training convenience alias. Release selection is a separate promotion decision based on the derived artifact, not on the experiment name.

## AWS permissions required

The IAM user or role running the CLI needs:

| Action | Service |
|---|---|
| CreateBucket, DeleteBucket, PutObject, GetObject, ListBucket, DeleteObject | S3 |
| CreateRole, GetRole, DeleteRole, AttachRolePolicy, DetachRolePolicy | IAM |
| CreateStack, UpdateStack, DeleteStack, DescribeStacks, DescribeStackEvents | CloudFormation |
| GetCallerIdentity | STS |
| CreateTrainingJob, DescribeTrainingJob, ListTrainingJobs | SageMaker AI |
| CreateMlflowTrackingServer, DescribeMlflowTrackingServer, DeleteMlflowTrackingServer | SageMaker AI, when `mlflow.mode` is `create` or `existing` |

`AdministratorAccess` covers all of the above.