# qc-cli

A CLI for Qualcomm's MLOps pipeline — browse and download models from Qualcomm AI Hub, fine-tune them on custom datasets using SageMaker, validate inference, and prepare artifacts for Qualcomm hardware deployment.

## Requirements

- Python 3.13+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
- AWS account with credentials configured (`aws configure`) when using `qc-cli infra`
- AWS CDK CLI (`npm install -g aws-cdk`) when using `qc-cli infra setup` or `qc-cli infra destroy`

## Installation

```bash
git clone <repo>
cd qc-cli
uv sync
```

Run commands with `uv run qc-cli <command>` or activate the venv first:

```bash
source .venv/bin/activate
qc-cli --help
```

## Quick start

```bash
# 1. Create config.yaml in the current directory
qc-cli init

# 2. Edit config.yaml — at minimum set sagemaker.training.image_uri

# 3. Provision AWS infrastructure (S3 bucket + SageMaker IAM role).
#    This is the step that requires the AWS CDK CLI.
qc-cli infra setup

# 4. Upload training data, then submit a SageMaker training job.
qc-cli upload ./my-dataset
qc-cli train start
qc-cli train status
```

## Configuration

`qc-cli init` writes a `config.yaml` in the current directory. The main fields are shown below; at minimum, set `sagemaker.training.image_uri` before submitting a training job:

```yaml
infra:
  stack_name: qc-cli-mlops-1a2b3c4d5e6f

aws:
  region: us-east-1
  profile: default          # AWS CLI profile name

s3:
  bucket: qc-cli-mlops-1a2b3c4d5e6f-data

sagemaker:
  training:
    image_uri: ""           # ECR URI for your training container
    instance_type: ml.m5.xlarge
    instance_count: 1
    entry_point: null       # Optional: script inside source_dir
    source_dir: null        # Optional: local dir packaged and uploaded automatically
    hyperparameters: {}
```

`qc-cli init` generates a shared namespace for `infra.stack_name` and `s3.bucket` once and writes both values to `config.yaml`. Keep these values stable for the lifetime of a deployment; changing them points the CLI at different infrastructure.

The CLI isolates both application resources and CDK bootstrap resources:

- The application CloudFormation stack is named after `infra.stack_name`.
- The S3 bucket uses the same generated namespace, because bucket names must be globally unique.
- The SageMaker IAM role uses a CloudFormation-generated physical name.
- CDK bootstrap resources are derived internally from `infra.stack_name`, including a bootstrap stack named `<stack_name>-bootstrap` and a matching non-default CDK asset bucket qualifier.

`qc-cli infra destroy` removes the application stack but leaves the CDK bootstrap stack in place; the command prints the retained bootstrap stack name.
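The naming relationships above can be sketched in a few lines of Python. This is an illustration only, not the CLI's actual code: `new_namespace` and `derived_names` are hypothetical helpers, and the 12-hex-digit suffix and the `-data` bucket suffix are inferred from the sample `config.yaml`. The CDK qualifier is left out because this README does not document how it is derived.

```python
import secrets


def new_namespace(prefix: str = "qc-cli-mlops") -> str:
    """Return a stack name with a random 12-hex-digit suffix (assumed format)."""
    return f"{prefix}-{secrets.token_hex(6)}"


def derived_names(stack_name: str) -> dict[str, str]:
    """Derive the resource names that share the stack's namespace."""
    return {
        "stack": stack_name,
        # "-data" suffix inferred from the sample config, not confirmed.
        "bucket": f"{stack_name}-data",
        # Bootstrap stack name as described in this README.
        "bootstrap_stack": f"{stack_name}-bootstrap",
    }


names = derived_names("qc-cli-mlops-1a2b3c4d5e6f")
for kind, name in names.items():
    print(f"{kind}: {name}")
```

Because every name hangs off `infra.stack_name`, editing that one value in `config.yaml` changes which stack, bucket, and bootstrap stack the CLI looks for.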

`hyperparameters` is a flat map of values passed to the training container. Valid keys depend on the selected training image and entry point.
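The SageMaker `CreateTrainingJob` API takes hyperparameters as a string-to-string map, so a flat YAML map has to be stringified at some point before submission. The sketch below shows one way to do that and to reject nested values early; `to_sagemaker_hyperparameters` is a hypothetical helper, and the CLI's own conversion may differ.

```python
def to_sagemaker_hyperparameters(params: dict) -> dict[str, str]:
    """Convert a flat mapping into the string-to-string form SageMaker accepts."""
    result: dict[str, str] = {}
    for key, value in params.items():
        # Nested structures are not a flat map; fail before submitting a job.
        if isinstance(value, (dict, list, tuple, set)):
            raise ValueError(
                f"hyperparameter {key!r} must be a scalar, got {type(value).__name__}"
            )
        result[str(key)] = str(value)
    return result


flat = to_sagemaker_hyperparameters({"epochs": 10, "lr": 0.001, "optimizer": "adam"})
print(flat)
```

Your training container receives these values as strings and is responsible for parsing them back into numbers or booleans.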

To provision an MLflow tracking server, set:

```yaml
mlflow:
  mode: create
  experiment_name: qc-cli-training
  registered_model_name: qc-cli-model
  register_trained_models: true
```

In `create` mode, the CLI manages the tracking server name from `infra.stack_name`; you do not need to set `tracking_server_name`.

To use an existing MLflow tracking server, set:

```yaml
mlflow:
  mode: existing
  tracking_server_name: your-tracking-server-name
```

When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. `train status` finalizes that run once the job reaches a terminal state and registers completed model artifacts as pre-release model versions using the `prerelease-latest` MLflow alias.

To open the managed SageMaker MLflow UI, request a fresh presigned URL:

```bash
qc-cli infra mlflow-url --config config.yaml
```

This works for `mode: create` and for `mode: existing` when the existing server is managed by Amazon SageMaker. In `create` mode, the command uses the CLI-managed tracking server name. In `existing` mode, it uses `mlflow.tracking_server_name`. If the existing MLflow server is external to SageMaker, open it with that server's own URL instead.
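For reference, here is a minimal sketch of how such a URL could be requested directly. It assumes the command wraps the SageMaker `CreatePresignedMlflowTrackingServerUrl` API, which this README does not confirm. `resolve_tracking_server_name` and `presigned_mlflow_url` are hypothetical helpers, the managed server name is passed in because its derivation is not documented here, and `_StubClient` only stands in for a real client so the sketch runs without AWS credentials.

```python
def resolve_tracking_server_name(mlflow_cfg: dict, managed_name: str) -> str:
    """Pick the tracking server name according to mlflow.mode."""
    mode = mlflow_cfg.get("mode")
    if mode == "create":
        return managed_name  # CLI-managed name; derivation not shown here
    if mode == "existing":
        name = mlflow_cfg.get("tracking_server_name")
        if not name:
            raise ValueError("mlflow.tracking_server_name is required in existing mode")
        return name
    raise ValueError(f"MLflow is not enabled (mode={mode!r})")


def presigned_mlflow_url(sagemaker_client, server_name: str, expires_in: int = 300) -> str:
    """Request a short-lived URL for a SageMaker-managed MLflow UI."""
    response = sagemaker_client.create_presigned_mlflow_tracking_server_url(
        TrackingServerName=server_name,
        ExpiresInSeconds=expires_in,
    )
    return response["AuthorizedUrl"]


class _StubClient:
    """Stand-in for a boto3 SageMaker client; returns a placeholder URL."""

    def create_presigned_mlflow_tracking_server_url(self, **kwargs):
        return {"AuthorizedUrl": f"https://example.invalid/{kwargs['TrackingServerName']}"}


server = resolve_tracking_server_name(
    {"mode": "existing", "tracking_server_name": "your-tracking-server-name"},
    managed_name="unused-in-existing-mode",
)
url = presigned_mlflow_url(_StubClient(), server)
print(url)
```

With real credentials, you would pass `boto3.client("sagemaker", region_name="us-east-1")` in place of the stub. Presigned URLs expire quickly, which is why the command issues a fresh one on every call.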

## Commands

### `init`

```
qc-cli init                  Write config.yaml
qc-cli init --output <path>  Write config to a custom path
qc-cli init --force          Overwrite an existing config file
```

### `infra`

```
qc-cli infra setup                                          Deploy the CDK stack
qc-cli infra setup --no-bootstrap                           Deploy without running CDK bootstrap
qc-cli infra setup --cloudformation-execution-policy <arn>  Set CDK bootstrap execution policy ARN
qc-cli infra status                                         Show CDK stack/resource status
qc-cli infra mlflow-url                                     Print a presigned MLflow UI URL
qc-cli infra destroy                                        Destroy stack, retaining S3 data
qc-cli infra destroy --yes                                  Destroy stack without confirmation
qc-cli infra destroy --delete-bucket-data                   Destroy stack and delete S3 data
```

`--cloudformation-execution-policy` is a one-time CDK bootstrap option, not a `config.yaml` setting. Pass it on `infra setup` when you need the CDK bootstrap CloudFormation execution role to use a policy other than the default `AdministratorAccess`:

```bash
qc-cli infra setup --cloudformation-execution-policy arn:aws:iam::aws:policy/PowerUserAccess
```

### `upload`

```
qc-cli upload <file>                 Upload a single file to S3
qc-cli upload <dir>                  Upload all files in a directory tree to S3
qc-cli upload <file> --s3-key <key>  Upload a file to a custom S3 key
```

Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads default to `s3://<bucket>/<data_prefix>/<filename>`. Directory uploads are recursive, preserve paths relative to the uploaded directory, and place files under `s3://<bucket>/<data_prefix>/`.

### `train`

```
qc-cli train start              Submit a SageMaker training job
qc-cli train status [job-name]  Show job status; defaults to the last submitted job
qc-cli train list               List recent training jobs
qc-cli train list --limit 3     Show a custom number of recent jobs
```

`train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.

The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.
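After downloading a finished job's `model.tar.gz`, you can check its contents with the standard library before handing it to later steps. The sketch below is a generic inspection helper, not part of the CLI; `list_model_artifact` and the `model.pt` file name are hypothetical, and the archive is built locally so the example runs on its own.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_model_artifact(archive: Path) -> list[str]:
    """Return the regular files stored in a gzip-compressed tar archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(member.name for member in tar.getmembers() if member.isfile())


# Build a tiny stand-in archive; a real one comes from the job's S3 output path.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "model.tar.gz"
    payload = b"weights"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="model.pt")  # hypothetical model file name
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    members = list_model_artifact(archive)

print(members)
```

An empty listing usually means the training script never wrote anything to `/opt/ml/model`.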

## AWS permissions required

The IAM user or role running the CLI needs:

| Action | Service |
|---|---|
| CreateBucket, DeleteBucket, PutObject, GetObject, ListBucket, DeleteObject | S3 |
| CreateRole, GetRole, DeleteRole, AttachRolePolicy, DetachRolePolicy | IAM |
| CreateStack, UpdateStack, DeleteStack, DescribeStacks, DescribeStackEvents | CloudFormation |
| GetCallerIdentity | STS |
| CreateTrainingJob, DescribeTrainingJob, ListTrainingJobs | SageMaker AI |
| CreateMlflowTrackingServer, DescribeMlflowTrackingServer, DeleteMlflowTrackingServer | SageMaker AI, when `mlflow.mode` is `create` or `existing` |

`AdministratorAccess` covers all of the above.
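If you prefer a narrower policy, the table can be turned into an IAM policy document mechanically. The sketch below does only that: `build_policy` is a hypothetical helper, `Resource: "*"` is a placeholder you should scope down, and the result has not been verified as sufficient for every command.

```python
import json

# Actions copied from the table above, keyed by IAM service prefix.
ACTIONS_BY_SERVICE = {
    "s3": ["CreateBucket", "DeleteBucket", "PutObject", "GetObject", "ListBucket", "DeleteObject"],
    "iam": ["CreateRole", "GetRole", "DeleteRole", "AttachRolePolicy", "DetachRolePolicy"],
    "cloudformation": ["CreateStack", "UpdateStack", "DeleteStack", "DescribeStacks", "DescribeStackEvents"],
    "sts": ["GetCallerIdentity"],
    "sagemaker": ["CreateTrainingJob", "DescribeTrainingJob", "ListTrainingJobs"],
}

# Only needed when mlflow.mode is set.
MLFLOW_ACTIONS = [
    "CreateMlflowTrackingServer",
    "DescribeMlflowTrackingServer",
    "DeleteMlflowTrackingServer",
]


def build_policy(mlflow_enabled: bool = False) -> dict:
    """Build a single-statement IAM policy from the permissions table."""
    actions = [
        f"{service}:{name}"
        for service, names in ACTIONS_BY_SERVICE.items()
        for name in names
    ]
    if mlflow_enabled:
        actions += [f"sagemaker:{name}" for name in MLFLOW_ACTIONS]
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": actions, "Resource": "*"}],
    }


policy = build_policy(mlflow_enabled=True)
print(json.dumps(policy, indent=2))
```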