Files
qai-cli/README.md

172 lines
7.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# qc-cli
A CLI for Qualcomm's MLOps pipeline — browse and download models from Qualcomm AI Hub, fine-tune them on custom datasets using SageMaker, validate inference, and prepare artifacts for Qualcomm hardware deployment.
## Requirements
- Python 3.13+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
- AWS account with credentials configured (`aws configure`) when using `qc-cli infra`
- AWS CDK CLI (`npm install -g aws-cdk`) when using `qc-cli infra setup` or `qc-cli infra destroy`
## Installation
```bash
git clone <repo>
cd qc-cli
uv sync
```
Run commands with `uv run qc-cli <command>` or activate the venv first:
```bash
source .venv/bin/activate
qc-cli --help
```
## Quick start
```bash
# 1. Create config.yaml in the current directory
qc-cli init
# 2. Edit config.yaml — at minimum set sagemaker.training.image_uri
# 3. Provision AWS infrastructure (S3 bucket + SageMaker IAM role).
# This is the step that requires the AWS CDK CLI.
qc-cli infra setup
# 4. Upload training data, then submit a SageMaker training job.
qc-cli upload ./my-dataset
qc-cli train start
qc-cli train status
```
## Configuration
`qc-cli init` writes a `config.yaml` in the current directory. The fields you must fill in before using the tool:
```yaml
infra:
stack_name: qc-cli-mlops-1a2b3c4d5e6f
aws:
region: us-east-1
profile: default # AWS CLI profile name
s3:
bucket: qc-cli-mlops-1a2b3c4d5e6f-data
sagemaker:
training:
image_uri: "" # ECR URI for your training container
instance_type: ml.m5.xlarge
instance_count: 1
entry_point: null # Optional: script inside source_dir
source_dir: null # Optional: local dir packaged and uploaded automatically
hyperparameters: {}
```
`qc-cli init` generates the `infra.stack_name` and `s3.bucket` namespace once and writes it to `config.yaml`. Keep these values stable for a deployment; changing them points the CLI at different infrastructure.
The CLI isolates both application resources and CDK bootstrap resources. The application CloudFormation stack uses `infra.stack_name`, the S3 bucket uses the same generated namespace because bucket names are globally unique, and the SageMaker IAM role uses a CloudFormation-generated physical name. CDK bootstrap resources are derived internally from `infra.stack_name`, including a bootstrap stack named `<stack_name>-bootstrap` and a matching non-default CDK asset bucket qualifier. `qc-cli infra destroy` removes the application stack but leaves the CDK bootstrap stack in place; the command prints the retained bootstrap stack name.
`hyperparameters` is a flat map of values passed to the training container. Valid keys depend on the selected training image and entry point.
To provision an MLflow tracking server, set:
```yaml
mlflow:
mode: create
experiment_name: qc-cli-training
registered_model_name: qc-cli-model
register_trained_models: true
```
In `create` mode, the CLI manages the tracking server name from `infra.stack_name`; you do not need to set `tracking_server_name`.
To use an existing MLflow tracking server, set:
```yaml
mlflow:
mode: existing
tracking_server_name: your-tracking-server-name
```
When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. `train status` finalizes that run once the job reaches a terminal state and registers completed model artifacts as pre-release model versions using the `prerelease-latest` MLflow alias.
To open the managed SageMaker MLflow UI, request a fresh presigned URL:
```bash
qc-cli infra mlflow-url --config config.yaml
```
This works for `mode: create` and for `mode: existing` when the existing server is managed by Amazon SageMaker. In `create` mode, the command uses the CLI-managed tracking server name. In `existing` mode, it uses `mlflow.tracking_server_name`. If the existing MLflow server is external to SageMaker, open it with that server's own URL instead.
## Commands
### `init`
```
qc-cli init Write config.yaml
qc-cli init --output <path> Write config to a custom path
qc-cli init --force Overwrite an existing config file
```
### `infra`
```
qc-cli infra setup Deploy the CDK stack
qc-cli infra setup --no-bootstrap Deploy without running CDK bootstrap
qc-cli infra setup --cloudformation-execution-policy <arn> Set CDK bootstrap execution policy ARN
qc-cli infra status Show CDK stack/resource status
qc-cli infra mlflow-url Print a presigned MLflow UI URL
qc-cli infra destroy Destroy stack, retaining S3 data
qc-cli infra destroy --yes Destroy stack without confirmation
qc-cli infra destroy --delete-bucket-data Destroy stack and delete S3 data
```
`--cloudformation-execution-policy` is a one-time CDK bootstrap option, not a `config.yaml` setting. Pass it on `infra setup` when you need the CDK bootstrap CloudFormation execution role to use a policy other than the default `AdministratorAccess`:
```bash
qc-cli infra setup --cloudformation-execution-policy arn:aws:iam::aws:policy/PowerUserAccess
```
### `upload`
```
qc-cli upload <file> Upload a single file to S3
qc-cli upload <dir> Upload all files in a directory tree to S3
qc-cli upload <file> --s3-key <key> Upload a file to a custom S3 key
```
Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads default to `s3://<bucket>/<data_prefix>/<filename>`. Directory uploads are recursive, preserve paths relative to the uploaded directory, and place files under `s3://<bucket>/<data_prefix>/`.
### `train`
```
qc-cli train start Submit a SageMaker training job
qc-cli train status [job-name] Show job status; defaults to the last submitted job
qc-cli train list List recent training jobs
qc-cli train list --limit 3 Show a custom number of recent jobs
```
`train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.
The expected output artifact is SageMakers `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.
## AWS permissions required
The IAM user or role running the CLI needs:
| Action | Service |
|---|---|
| CreateBucket, DeleteBucket, PutObject, GetObject, ListBucket, DeleteObject | S3 |
| CreateRole, GetRole, DeleteRole, AttachRolePolicy, DetachRolePolicy | IAM |
| CreateStack, UpdateStack, DeleteStack, DescribeStacks, DescribeStackEvents | CloudFormation |
| GetCallerIdentity | STS |
| CreateTrainingJob, DescribeTrainingJob, ListTrainingJobs | SageMaker AI |
| CreateMlflowTrackingServer, DescribeMlflowTrackingServer, DeleteMlflowTrackingServer | SageMaker AI, when `mlflow.mode` is `create` or `existing` |
`AdministratorAccess` covers all of the above.