
slalom a8c736e28e WIP: add ai-hub metrics to MLFlow

2026-06-05 14:46:04 -04:00

12 KiB


qc-cli

A CLI for Qualcomm's MLOps pipeline — browse and download models from Qualcomm AI Hub, fine-tune them on custom datasets using SageMaker, validate inference, and prepare artifacts for Qualcomm hardware deployment.

Requirements

Python 3.13+
uv
AWS account with credentials configured (aws configure) when using qc-cli infra
AWS CDK CLI (npm install -g aws-cdk) when using qc-cli infra setup or qc-cli infra destroy

Installation

git clone <repo>
cd qc-cli
uv sync

Run commands with uv run qc-cli <command> or activate the venv first:

source .venv/bin/activate
qc-cli --help

Quick start

# 1. Create config.yaml in the current directory
qc-cli init

# 2. Edit config.yaml — at minimum set sagemaker.training.image_uri

# 3. Provision AWS infrastructure (S3 bucket + SageMaker IAM role).
#    This is the step that requires the AWS CDK CLI.
qc-cli infra setup

# 4. Upload training data, then submit a SageMaker training job.
qc-cli upload ./my-dataset
qc-cli train start
qc-cli train status

Configuration

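qc-cli init writes a config.yaml in the current directory. The main fields are shown below. At minimum, set sagemaker.training.image_uri before training and aihub.input_specs before running qc-cli ai-hub commands: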

infra:
  stack_name: qc-cli-mlops-1a2b3c4d5e6f

aws:
  region: us-east-1
  profile: default          # AWS CLI profile name

s3:
  bucket: qc-cli-mlops-1a2b3c4d5e6f-data

sagemaker:
  training:
    image_uri: ""           # ECR URI for your training container
    instance_type: ml.m5.xlarge
    instance_count: 1
    entry_point: null       # Optional: script inside source_dir
    source_dir: null        # Optional: local dir packaged and uploaded automatically
    hyperparameters: {}

aihub:
  device:
    name: Samsung Galaxy S25 (Family)
  target_runtime: tflite
  input_specs: {}           # Required before running qc-cli ai-hub commands
  job_name: null            # Optional prefix for AI Hub Workbench jobs
  model_name: null          # Optional name for uploaded local ONNX models
  compile_options: null
  profile_options: null
  quantize_options: null
  output_dir: build/qai-hub
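
A minimal sketch of a pre-flight check for the two fields the sample marks as required, assuming config.yaml has already been parsed into a plain dict. The function name missing_required_fields is hypothetical and is not part of qc-cli.

```python
def missing_required_fields(config: dict, *, for_ai_hub: bool = False) -> list[str]:
    """Return dotted names of required fields that are still empty."""
    missing = []
    training = (config.get("sagemaker") or {}).get("training") or {}
    if not training.get("image_uri"):
        missing.append("sagemaker.training.image_uri")
    # input_specs only matters for the ai-hub commands.
    if for_ai_hub and not (config.get("aihub") or {}).get("input_specs"):
        missing.append("aihub.input_specs")
    return missing


# Values as written by a fresh `qc-cli init`, per the sample above.
fresh_config = {
    "sagemaker": {"training": {"image_uri": ""}},
    "aihub": {"input_specs": {}},
}
print(missing_required_fields(fresh_config, for_ai_hub=True))
```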

qc-cli init generates the namespace shared by infra.stack_name and s3.bucket once and writes both values to config.yaml. Keep these values stable for a deployment; changing them points the CLI at different infrastructure.

The CLI isolates both application resources and CDK bootstrap resources. The application CloudFormation stack uses infra.stack_name, the S3 bucket uses the same generated namespace because bucket names are globally unique, and the SageMaker IAM role uses a CloudFormation-generated physical name. CDK bootstrap resources are derived internally from infra.stack_name, including a bootstrap stack named <stack_name>-bootstrap and a matching non-default CDK asset bucket qualifier. qc-cli infra destroy removes the application stack but leaves the CDK bootstrap stack in place; the command prints the retained bootstrap stack name.
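
The naming scheme can be sketched as follows. This is an illustration under assumptions, not the CLI's actual code: the 12-character hex suffix and the -data bucket suffix are inferred from the sample config.yaml, and the function names are hypothetical. The CDK qualifier is left out because its derivation is not documented here.

```python
import secrets


def new_stack_name(prefix: str = "qc-cli-mlops") -> str:
    # Assumption: a 12-character hex suffix, matching the sample value
    # qc-cli-mlops-1a2b3c4d5e6f.
    return f"{prefix}-{secrets.token_hex(6)}"


def derived_names(stack_name: str) -> dict[str, str]:
    return {
        "stack": stack_name,
        # Assumption: "-data" suffix, as in the sample s3.bucket value.
        "bucket": f"{stack_name}-data",
        # Documented: the bootstrap stack is <stack_name>-bootstrap.
        "bootstrap_stack": f"{stack_name}-bootstrap",
    }


names = derived_names("qc-cli-mlops-1a2b3c4d5e6f")
print(names["bucket"])
print(names["bootstrap_stack"])
```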

hyperparameters is a flat map of values passed to the training container. Valid keys depend on the selected training image and entry point.
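
The SageMaker CreateTrainingJob API accepts hyperparameters only as a string-to-string map, which is why the config value must stay flat. A minimal sketch of that conversion, assuming scalar values are stringified with str(); the helper name and the example keys are hypothetical, and the CLI's own conversion may differ.

```python
def to_sagemaker_hyperparameters(values: dict) -> dict[str, str]:
    """Convert a flat config map into the string map SageMaker expects."""
    flat = {}
    for key, value in values.items():
        if isinstance(value, (dict, list, tuple, set)):
            raise ValueError(
                f"hyperparameter {key!r} must be a scalar, got {type(value).__name__}"
            )
        flat[str(key)] = str(value)
    return flat


# Example keys only; valid keys depend on the training image.
print(to_sagemaker_hyperparameters({"epochs": 10, "lr": 0.001}))
```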

To provision an MLflow tracking server, set:

mlflow:
  mode: create
  experiment_name: qc-cli-training
  registered_model_name: qc-cli-model
  register_trained_models: true

In create mode, the CLI manages the tracking server name from infra.stack_name; you do not need to set tracking_server_name.

To use an existing MLflow tracking server, set:

mlflow:
  mode: existing
  tracking_server_name: your-tracking-server-name

When MLflow is enabled, train start creates an MLflow run for the SageMaker job. train status finalizes that run once the job reaches a terminal state and registers completed model artifacts as experiment model versions using the experiment-latest MLflow alias. An experiment version is an immutable trained-source artifact; it records that training produced a model, not that the model is better than earlier versions or ready for release.

To open the managed SageMaker MLflow UI, request a fresh presigned URL:

qc-cli mlflow open --config config.yaml

The command opens the presigned URL in a browser. It works for mode: create, and for mode: existing when the existing server is managed by Amazon SageMaker. In create mode, the command uses the CLI-managed tracking server name. In existing mode, it uses mlflow.tracking_server_name. If the existing MLflow server is external to SageMaker, open it with that server's own URL instead.
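
For reference, a presigned URL for a SageMaker-managed MLflow server can be requested with boto3 roughly as sketched below. This is not necessarily how qc-cli implements mlflow open; the helper name is hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
def presigned_mlflow_url(sagemaker_client, tracking_server_name: str,
                         expires_in_seconds: int = 300) -> str:
    """Request a short-lived URL for the SageMaker-managed MLflow UI."""
    response = sagemaker_client.create_presigned_mlflow_tracking_server_url(
        TrackingServerName=tracking_server_name,
        ExpiresInSeconds=expires_in_seconds,
    )
    return response["AuthorizedUrl"]


# Usage with real credentials (requires boto3):
#   import boto3, webbrowser
#   client = boto3.client("sagemaker", region_name="us-east-1")
#   webbrowser.open(presigned_mlflow_url(client, "your-tracking-server-name"))
```

Because the URL expires quickly, it is requested on demand rather than stored in config.yaml.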

Commands

`init`

qc-cli init                  Write config.yaml
qc-cli init --output <path>  Write config to a custom path
qc-cli init --force          Overwrite an existing config file

`mlflow`

qc-cli mlflow open  Open a presigned MLflow UI URL in a browser

`infra`

qc-cli infra setup                         Deploy the CDK stack
qc-cli infra setup --no-bootstrap          Deploy without running CDK bootstrap
qc-cli infra setup --cloudformation-execution-policy <arn> Set CDK bootstrap execution policy ARN
qc-cli infra status                        Show CDK stack/resource status
qc-cli infra destroy                       Destroy stack, retaining S3 data
qc-cli infra destroy --yes                 Destroy stack without confirmation
qc-cli infra destroy --delete-bucket-data  Destroy stack and delete S3 data

--cloudformation-execution-policy is a one-time CDK bootstrap option, not a config.yaml setting. Pass it on infra setup when you need the CDK bootstrap CloudFormation execution role to use a policy other than the default AdministratorAccess:

qc-cli infra setup --cloudformation-execution-policy arn:aws:iam::aws:policy/PowerUserAccess

`upload`

qc-cli upload <file>                 Upload a single file to S3
qc-cli upload <dir>                  Upload all files in a directory tree to S3
qc-cli upload <file> --s3-key <key>  Upload a file to a custom S3 key

Uploads use s3.bucket and s3.data_prefix from config.yaml. File uploads default to s3://<bucket>/<data_prefix>/<filename>. Directory uploads are recursive, preserve paths relative to the uploaded directory, and place files under s3://<bucket>/<data_prefix>/.

`train`

qc-cli train start              Submit a SageMaker training job
qc-cli train status [job-name]  Show job status; defaults to the last submitted job
qc-cli train list               List recent training jobs
qc-cli train list --limit 3     Show a custom number of recent jobs

train start uses s3://<bucket>/<data_prefix>/ as the training channel and writes outputs under s3://<bucket>/<model_prefix>/. If sagemaker.training.source_dir is set, the CLI packages that directory, uploads it beside the job output prefix, and passes sagemaker_program/sagemaker_submit_directory to the SageMaker container.
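
A rough sketch of the locations and script-mode hyperparameters described above. The helper names and the prefix values are hypothetical, and the exact S3 location of the packaged source archive is passed in as a parameter because it is not specified here.

```python
def training_locations(bucket: str, data_prefix: str, model_prefix: str) -> dict[str, str]:
    """Build the training channel and output URIs from config values."""
    return {
        "training_channel": f"s3://{bucket}/{data_prefix.strip('/')}/",
        "output_path": f"s3://{bucket}/{model_prefix.strip('/')}/",
    }


def script_mode_hyperparameters(entry_point: str, source_s3_uri: str) -> dict[str, str]:
    """Hyperparameters that tell the container which script to run and where
    the packaged source_dir archive lives."""
    return {
        "sagemaker_program": entry_point,
        "sagemaker_submit_directory": source_s3_uri,
    }


locations = training_locations("qc-cli-mlops-1a2b3c4d5e6f-data", "data", "models")
print(locations["training_channel"])
print(locations["output_path"])
```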

The expected output artifact is SageMaker’s model.tar.gz, normally containing the trained model file your container writes to /opt/ml/model.
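
To check what a training container actually wrote, the archive can be listed with the standard library. The sketch below builds a stand-in model.tar.gz so it runs without AWS; the member name model.onnx is only an example, and list_model_artifact is a hypothetical helper.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_model_artifact(archive_path) -> list[str]:
    """Return the regular files contained in a model.tar.gz."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "model.tar.gz"
    payload = b"fake-weights"
    # Stand-in for what SageMaker produces from /opt/ml/model.
    with tarfile.open(archive_path, "w:gz") as archive:
        info = tarfile.TarInfo(name="model.onnx")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    names = list_model_artifact(archive_path)

print(names)
```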

`ai-hub`

qc-cli ai-hub upload <calibration.npz|calibration-dir> <inputs.npz|inputs.npy>
qc-cli ai-hub upload <calibration> <inputs> --from-step validate
qc-cli ai-hub quantize <calibration.npz|calibration-dir> [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
qc-cli ai-hub compile [--model-id ID] [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
qc-cli ai-hub validate <inputs.npz|inputs.npy> [--model-id ID] [--input-name NAME]
qc-cli ai-hub profile [--model-id ID]
qc-cli ai-hub download [--model-id ID] [--output PATH]

ai-hub upload runs the four Workbench upload steps in order: quantize, compile, validate, and profile. Use --from-step compile, --from-step validate, or --from-step profile to resume from saved local state after a completed earlier step.

Resume behavior:

--from-step quantize  Run quantize, compile, validate, and profile.
--from-step compile   Skip quantize; compile the last quantized model unless an explicit source is passed.
--from-step validate  Skip quantize and compile; validate the last compiled model.
--from-step profile   Skip quantize, compile, and validate; profile the last compiled model.

When a step runs in the current command, upload passes its returned model ID directly to the next step. When a step is skipped, the next step resolves the needed model ID from .qc-cli.json. This avoids re-running earlier AI Hub jobs when you only need to continue from a later step.
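
The resume rules reduce to slicing an ordered list of steps. A minimal sketch, with a hypothetical function name:

```python
STEPS = ("quantize", "compile", "validate", "profile")


def steps_to_run(from_step: str = "quantize") -> tuple[str, ...]:
    """Return the steps that `ai-hub upload --from-step <from_step>` would run."""
    if from_step not in STEPS:
        raise ValueError(f"unknown step {from_step!r}; expected one of {STEPS}")
    return STEPS[STEPS.index(from_step):]


print(steps_to_run("validate"))
```

Every step before the returned slice is skipped, which is where the saved state in .qc-cli.json supplies the model ID instead.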

ai-hub compile resolves model sources in this order: --model-id, explicit source options (--onnx-path, --model-s3-uri, --from-job), last quantized model from state, then the last training job from local state. ai-hub download is separate because downloading the optimized artifact is outside the four-step Workbench upload loop.
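
The resolution order for ai-hub compile can be sketched as a first-match lookup. The state keys (last_quantized_model_id, last_training_job) and the function name are hypothetical, and the relative order among the three explicit source options is an assumption; only the order of the four groups comes from the description above.

```python
def resolve_compile_source(*, model_id=None, onnx_path=None, model_s3_uri=None,
                           from_job=None, state=None) -> tuple[str, str]:
    """Return (kind, value) for the first available model source."""
    state = state or {}
    candidates = [
        ("model_id", model_id),
        ("onnx_path", onnx_path),
        ("model_s3_uri", model_s3_uri),
        ("from_job", from_job),
        ("last_quantized_model", state.get("last_quantized_model_id")),
        ("last_training_job", state.get("last_training_job")),
    ]
    for kind, value in candidates:
        if value:
            return kind, value
    raise LookupError("no model source given and none found in local state")


saved_state = {"last_quantized_model_id": "mq123", "last_training_job": "job-1"}
print(resolve_compile_source(state=saved_state))
print(resolve_compile_source(onnx_path="model.onnx", state=saved_state))
```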

When MLflow is enabled, AI Hub job-producing commands (quantize, compile, validate, profile, and upload) log AI Hub metadata to MLflow. Each command execution receives a qc_cli.aihub_submission_id; all steps inside one ai-hub upload share that submission ID. Runs are nested under the MLflow run for the resolved source model when the CLI can prove that source from local state, such as --from-job or a model produced by a prior tracked AI Hub step. Otherwise, AI Hub runs are standalone. validate also logs output summaries, and profile logs profile metrics plus the raw profile JSON. ai-hub download does not create an MLflow run because it does not submit or measure an AI Hub job.

AI Hub authentication currently uses the local qai-hub SDK configuration. A planned follow-up is to support AWS Systems Manager Parameter Store SecureString for team-managed tokens, where config.yaml stores only a parameter name such as /qc-cli/aihub/token, AWS KMS encrypts the token at rest, and the CLI retrieves it at runtime with ssm:GetParameter plus kms:Decrypt permissions.
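
Since this is a planned follow-up rather than current behavior, the following is only a sketch of what the runtime lookup might look like with boto3. The helper name is hypothetical; the client is passed in so the function can be exercised without AWS access.

```python
def fetch_aihub_token(ssm_client, parameter_name: str) -> str:
    """Read a SecureString parameter; the caller's IAM identity needs
    ssm:GetParameter and kms:Decrypt on the relevant resources."""
    response = ssm_client.get_parameter(Name=parameter_name, WithDecryption=True)
    return response["Parameter"]["Value"]


# Usage with real credentials (requires boto3):
#   import boto3
#   token = fetch_aihub_token(boto3.client("ssm"), "/qc-cli/aihub/token")
```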

Model lifecycle

The CLI uses neutral experiment naming for trained artifacts and reserves release terminology for an explicit promotion step.

Current behavior:

qc-cli train start submits a SageMaker training job.
qc-cli train status finalizes the MLflow run after the job reaches a terminal state.
If the job completed and mlflow.register_trained_models is enabled, the SageMaker model.tar.gz is registered as a new MLflow model version with:
- qc_cli.stage=experiment
- qc_cli.artifact_kind=trained_source
- qc_cli.source=sagemaker
The MLflow alias experiment-latest points at the most recently registered experiment version.
AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.

Future release aliases such as v1 or production can point at a selected deployable artifact.

Example future metadata:

qc-cli-model version 12
qc_cli.stage=experiment
qc_cli.artifact_kind=trained_source
qc_cli.source=sagemaker

qc-cli-model-aihub version 3
qc_cli.stage=ai_hub_compiled
qc_cli.artifact_kind=deployable
qc_cli.parent_registered_model_name=qc-cli-model
qc_cli.parent_model_version=12
qc_cli.runtime=tflite
qc_cli.quantization=int8
qc_cli.target_device=Samsung Galaxy S25

In that flow, experiment-latest remains a training convenience alias. Release selection is a separate promotion decision based on the derived artifact, not on the experiment name.
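
The lineage in the example metadata can be sketched as plain tag dictionaries. This is illustrative only: the helper names are hypothetical, the tags are the future metadata described above rather than current CLI behavior, and the promotion check is one possible reading of "release selection is based on the derived artifact".

```python
def derived_artifact_tags(parent_name: str, parent_version: int, *,
                          runtime: str, quantization: str,
                          target_device: str) -> dict[str, str]:
    """Tags linking a compiled AI Hub artifact back to its trained source."""
    return {
        "qc_cli.stage": "ai_hub_compiled",
        "qc_cli.artifact_kind": "deployable",
        "qc_cli.parent_registered_model_name": parent_name,
        "qc_cli.parent_model_version": str(parent_version),
        "qc_cli.runtime": runtime,
        "qc_cli.quantization": quantization,
        "qc_cli.target_device": target_device,
    }


def is_promotable(tags: dict[str, str]) -> bool:
    # Only derived, deployable artifacts are candidates for a release alias.
    return tags.get("qc_cli.artifact_kind") == "deployable"


trained = {"qc_cli.stage": "experiment", "qc_cli.artifact_kind": "trained_source"}
compiled = derived_artifact_tags("qc-cli-model", 12, runtime="tflite",
                                 quantization="int8",
                                 target_device="Samsung Galaxy S25")
print(is_promotable(trained), is_promotable(compiled))
```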

AWS permissions required

The IAM user or role running the CLI needs:

| Action | Service |
| --- | --- |
| CreateBucket, DeleteBucket, PutObject, GetObject, ListBucket, DeleteObject | S3 |
| CreateRole, GetRole, DeleteRole, AttachRolePolicy, DetachRolePolicy | IAM |
| CreateStack, UpdateStack, DeleteStack, DescribeStacks, DescribeStackEvents | CloudFormation |
| GetCallerIdentity | STS |
| CreateTrainingJob, DescribeTrainingJob, ListTrainingJobs | SageMaker AI |
| CreateMlflowTrackingServer, DescribeMlflowTrackingServer, DeleteMlflowTrackingServer | SageMaker AI, when `mlflow.mode` is `create` or `existing` |

AdministratorAccess covers all of the above.
