update

match
move mlflow to its own command
2026-06-08 14:59:44 -04:00 · 2026-06-08 14:54:13 -04:00 · 2026-06-05 11:47:38 -04:00 · 2026-06-05 11:25:04 -04:00 · 2026-06-04 17:28:17 -04:00 · 2026-06-03 17:13:00 -04:00
31 changed files with 3543 additions and 169 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -218,7 +218,7 @@ __marimo__/
 .streamlit/secrets.toml
 .venv/
-config.yaml
+config*.yaml
 cdk.out/
 .qc-cli*.json
 examples/*/data/
--- a/README.md
+++ b/README.md
@@ -30,7 +30,7 @@ qc-cli --help
 # 1. Create config.yaml in the current directory
 qc-cli init
-# 2. Edit config.yaml — at minimum set s3.bucket and sagemaker.training.image_uri
+# 2. Edit config.yaml — at minimum set sagemaker.training.image_uri
 # 3. Provision AWS infrastructure (S3 bucket + SageMaker IAM role).
 #    This is the step that requires the AWS CDK CLI.
@@ -47,15 +47,17 @@ qc-cli train status
 `qc-cli init` writes a `config.yaml` in the current directory. The fields you must fill in before using the tool:
 ```yaml
 infra:
  stack_name: qc-cli-mlops-1a2b3c4d5e6f
 aws:
  region: us-east-1
  profile: default          # AWS CLI profile name
 s3:
-  bucket: your-unique-bucket-name
+  bucket: qc-cli-mlops-1a2b3c4d5e6f-data
 sagemaker:
  role_name: qc-cli-sagemaker-role
  training:
    image_uri: ""           # ECR URI for your training container
    instance_type: ml.m5.xlarge
@@ -63,8 +65,24 @@ sagemaker:
    entry_point: null       # Optional: script inside source_dir
    source_dir: null        # Optional: local dir packaged and uploaded automatically
    hyperparameters: {}
 aihub:
  device:
    name: Samsung Galaxy S25 (Family)
  target_runtime: tflite
  input_specs: {}           # Required before running qc-cli ai-hub commands
  job_name: null            # Optional prefix for AI Hub Workbench jobs
  model_name: null          # Optional name for uploaded local ONNX models
  compile_options: null
  profile_options: null
  quantize_options: null
  output_dir: build/qai-hub
 ```
 `qc-cli init` generates the `infra.stack_name` and `s3.bucket` namespace once and writes it to `config.yaml`. Keep these values stable for a deployment; changing them points the CLI at different infrastructure.
 The CLI isolates both application resources and CDK bootstrap resources. The application CloudFormation stack uses `infra.stack_name`, the S3 bucket uses the same generated namespace because bucket names are globally unique, and the SageMaker IAM role uses a CloudFormation-generated physical name. CDK bootstrap resources are derived internally from `infra.stack_name`, including a bootstrap stack named `<stack_name>-bootstrap` and a matching non-default CDK asset bucket qualifier. `qc-cli infra destroy` removes the application stack but leaves the CDK bootstrap stack in place; the command prints the retained bootstrap stack name.
 `hyperparameters` is a flat map of values passed to the training container. Valid keys depend on the selected training image and entry point.
 To provision an MLflow tracking server, set:
@@ -72,9 +90,13 @@ To provision an MLflow tracking server, set:
 ```yaml
 mlflow:
  mode: create
-  tracking_server_name: your-tracking-server-name
+  experiment_name: qc-cli-training
  registered_model_name: qc-cli-model
  register_trained_models: true
 ```
 In `create` mode, the CLI manages the tracking server name from `infra.stack_name`; you do not need to set `tracking_server_name`.
 To use an existing MLflow tracking server, set:
 ```yaml
@@ -83,6 +105,16 @@ mlflow:
  tracking_server_name: your-tracking-server-name
 ```
 When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. `train status` finalizes that run once the job reaches a terminal state and registers completed model artifacts as experiment model versions using the `experiment-latest` MLflow alias. An experiment version is an immutable trained-source artifact; it records that training produced a model, not that the model is better than earlier versions or ready for release.
 To open the managed SageMaker MLflow UI, request a fresh presigned URL:
 ```bash
 qc-cli mlflow open --config config.yaml
 ```
 This opens a browser to a fresh presigned URL. It works for `mode: create` and for `mode: existing` when the existing server is managed by Amazon SageMaker. In `create` mode, the command uses the CLI-managed tracking server name. In `existing` mode, it uses `mlflow.tracking_server_name`. If the existing MLflow server is external to SageMaker, open it with that server's own URL instead.
 ## Commands
 ### `init`
@@ -93,6 +125,12 @@ qc-cli init --output <path>  Write config to a custom path
 qc-cli init --force          Overwrite an existing config file
 ```
 ### `mlflow`
 ```
 qc-cli mlflow open  Open a presigned MLflow UI URL in a browser
 ```
 ### `infra`
 ```
@@ -105,6 +143,12 @@ qc-cli infra destroy --yes                 Destroy stack without confirmation
 qc-cli infra destroy --delete-bucket-data  Destroy stack and delete S3 data
 ```
 `--cloudformation-execution-policy` is a one-time CDK bootstrap option, not a `config.yaml` setting. Pass it on `infra setup` when you need the CDK bootstrap CloudFormation execution role to use a policy other than the default `AdministratorAccess`:
 ```bash
 qc-cli infra setup --cloudformation-execution-policy arn:aws:iam::aws:policy/PowerUserAccess
 ```
 ### `upload`
 ```
@@ -128,6 +172,72 @@ qc-cli train list --limit 3     Show a custom number of recent jobs
 The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.
 ### `ai-hub`
 ```
 qc-cli ai-hub upload <calibration.npz|calibration-dir> <inputs.npz|inputs.npy>
 qc-cli ai-hub upload <calibration> <inputs> --from-step validate
 qc-cli ai-hub quantize <calibration.npz|calibration-dir> [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
 qc-cli ai-hub compile [--model-id ID] [--onnx-path PATH] [--model-s3-uri URI] [--from-job NAME]
 qc-cli ai-hub validate <inputs.npz|inputs.npy> [--model-id ID] [--input-name NAME]
 qc-cli ai-hub profile [--model-id ID]
 qc-cli ai-hub download [--model-id ID] [--output PATH]
 ```
 `ai-hub upload` runs the four Workbench upload steps in order: quantize, compile, validate, and profile. Use `--from-step compile`, `--from-step validate`, or `--from-step profile` to resume from saved local state after a completed earlier step.
 Resume behavior:
 ```text
 --from-step quantize  Run quantize, compile, validate, and profile.
 --from-step compile   Skip quantize; compile the last quantized model unless an explicit source is passed.
 --from-step validate  Skip quantize and compile; validate the last compiled model.
 --from-step profile   Skip quantize, compile, and validate; profile the last compiled model.
 ```
 When a step runs in the current command, `upload` passes its returned model ID directly to the next step. When a step is skipped, the next step resolves the needed model ID from `.qc-cli.json`. This avoids re-running earlier AI Hub jobs when you only need to continue from a later step.
 `ai-hub compile` resolves model sources in this order: `--model-id`, explicit source options (`--onnx-path`, `--model-s3-uri`, `--from-job`), last quantized model from state, then the last training job from local state. `ai-hub download` is separate because downloading the optimized artifact is outside the four-step Workbench upload loop.
 AI Hub authentication currently uses the local `qai-hub` SDK configuration. A planned follow-up is to support AWS Systems Manager Parameter Store `SecureString` for team-managed tokens, where `config.yaml` stores only a parameter name such as `/qc-cli/aihub/token`, AWS KMS encrypts the token at rest, and the CLI retrieves it at runtime with `ssm:GetParameter` plus `kms:Decrypt` permissions.
 ## Model lifecycle
 The CLI uses neutral experiment naming for trained artifacts and reserves release terminology for an explicit promotion step.
 Current behavior:
 1. `qc-cli train start` submits a SageMaker training job.
 2. `qc-cli train status` finalizes the MLflow run after the job reaches a terminal state.
 3. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
   - `qc_cli.stage=experiment`
   - `qc_cli.artifact_kind=trained_source`
   - `qc_cli.source=sagemaker`
 4. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
 5. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
 Future release aliases such as `v1` or `production` can point at a selected deployable artifact.
 Example future metadata:
 ```text
 qc-cli-model version 12
 qc_cli.stage=experiment
 qc_cli.artifact_kind=trained_source
 qc_cli.source=sagemaker
 qc-cli-model-aihub version 3
 qc_cli.stage=ai_hub_compiled
 qc_cli.artifact_kind=deployable
 qc_cli.parent_registered_model_name=qc-cli-model
 qc_cli.parent_model_version=12
 qc_cli.runtime=tflite
 qc_cli.quantization=int8
 qc_cli.target_device=Samsung Galaxy S25
 ```
 In that flow, `experiment-latest` remains a training convenience alias. Release selection is a separate promotion decision based on the derived artifact, not on the experiment name.
 ## AWS permissions required
 The IAM user or role running the CLI needs:
--- a/app.py
+++ b/app.py
@@ -8,17 +8,19 @@ from src.infra.stack import QCStack
 app = cdk.App()
 config_path = app.node.try_get_context("config") or "config.yaml"
 stack_name = app.node.try_get_context("stack_name") or "MLOpsStack"
 account_id = app.node.try_get_context("account_id") or os.getenv("CDK_DEFAULT_ACCOUNT")
 delete_bucket_data = str(app.node.try_get_context("delete_bucket_data") or "false").lower() == "true"
 cfg = load_config(config_path)
 stack_name = app.node.try_get_context("stack_name") or cfg.infra.stack_name
 bootstrap_qualifier = app.node.try_get_context("bootstrap_qualifier") or cfg.infra.effective_bootstrap_qualifier
 QCStack(
    app,
    stack_name,
    config=cfg,
    delete_bucket_data=delete_bucket_data,
    synthesizer=cdk.DefaultStackSynthesizer(qualifier=bootstrap_qualifier),
    env=cdk.Environment(
        account=account_id,
        region=cfg.aws.region,
--- a/examples/ai-hub/README.md
+++ b/examples/ai-hub/README.md
@@ -0,0 +1,79 @@
 # Qualcomm AI Hub Example
 This example takes the ONNX model produced by the SageMaker training example and runs the Qualcomm AI Hub upload workflow:
 1. Quantize
 2. Compile
 3. Validate
 4. Profile
 5. Download the compiled artifact
 ## Prerequisites
 Run the training example first and wait for it to complete:
 ```bash
 examples/training/run_training.sh --wait
 ```
 The `config.yaml` file must include AI Hub settings:
 ```yaml
 aihub:
  device:
    name: Samsung Galaxy S25 (Family)
  target_runtime: tflite
  input_specs:
    input: [[1, 3, 160, 160], float32]
  output_dir: build/qai-hub
 ```
 Finally, the user needs to authenticate with Qualcomm AI Hub using:
 ```bash
 qai-hub configure --api_token
 ```
 ## Prepare Inputs
 AI Hub does not consume the raw JPG training images directly. It needs NumPy tensors that match the ONNX model input shape and preprocessing.
 To generate calibration and validation inputs:
 ```bash
 python examples/ai-hub/prepare_inputs.py
 ```
 This writes:
 ```text
 examples/training/data/aihub_calibration/*.npy
 examples/training/data/inputs.npz
 ```
 The script applies the same image preprocessing used by the training example:
 - resize to `160x160`
 - convert to channel-first `1x3x160x160`
 - normalize with ImageNet mean and standard deviation
 ## Upload Model to Qualcomm Workbench
 The model can be uploaded to Qualcomm Workbench using:
 ```bash
 qc-cli ai-hub upload examples/training/data/aihub_calibration examples/training/data/inputs.npz
 ```
 The first argument is the calibration path for the model and the second argument is the input file, both of which were created by the `prepare_inputs.py` script. For more details, add `--help` after the `upload` command.
 The `upload` command runs the following commands in order:
 1. `qc-cli ai-hub quantize`
 2. `qc-cli ai-hub compile`
 3. `qc-cli ai-hub validate`
 4. `qc-cli ai-hub profile`
 Finally the user can download the model from AI Workbench using the command
 ```bash
 qc-cli ai-hub download
 ```
--- a/examples/ai-hub/prepare_inputs.py
+++ b/examples/ai-hub/prepare_inputs.py
@@ -0,0 +1,74 @@
 #!/usr/bin/env python3
 """Prepare Qualcomm AI Hub calibration and validation inputs for the training example."""
 from __future__ import annotations
 import argparse
 from pathlib import Path
 import numpy as np
 from PIL import Image
 IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}
 def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--dataset-dir",
        type=Path,
        default=Path("examples/training/data/flower_photos_sagemaker"),
        help="ImageFolder-style dataset used for training.",
    )
    parser.add_argument(
        "--calibration-dir",
        type=Path,
        default=Path("examples/training/data/aihub_calibration"),
        help="Directory where .npy calibration samples will be written.",
    )
    parser.add_argument(
        "--input-file",
        type=Path,
        default=Path("examples/training/data/inputs.npz"),
        help="Validation .npz input file for qc-cli ai-hub validate.",
    )
    parser.add_argument("--input-name", default="input", help="ONNX input name.")
    parser.add_argument("--image-size", type=int, default=160, help="Square image size used by training.")
    parser.add_argument("--samples", type=int, default=16, help="Number of calibration samples to write.")
    return parser.parse_args()
 def preprocess_image(path: Path, image_size: int) -> np.ndarray:
    image = Image.open(path).convert("RGB").resize((image_size, image_size), Image.Resampling.BILINEAR)
    array = np.asarray(image, dtype=np.float32) / 255.0
    array = np.transpose(array, (2, 0, 1))
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)[:, None, None]
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)[:, None, None]
    return ((array - mean) / std)[None, ...].astype("float32")
 def main() -> None:
    args = parse_args()
    images = sorted(p for p in args.dataset_dir.rglob("*") if p.suffix.lower() in IMAGE_EXTENSIONS)
    if not images:
        raise SystemExit(f"No images found under {args.dataset_dir}")
    if args.samples < 1:
        raise SystemExit("--samples must be at least 1")
    args.calibration_dir.mkdir(parents=True, exist_ok=True)
    args.input_file.parent.mkdir(parents=True, exist_ok=True)
    sample_count = min(args.samples, len(images))
    prepared = []
    for index, image_path in enumerate(images[:sample_count]):
        sample = preprocess_image(image_path, args.image_size)
        np.save(args.calibration_dir / f"sample_{index:03d}.npy", sample)
        prepared.append(sample)
    np.savez(args.input_file, **{args.input_name: prepared[0]})
    print(f"Wrote {sample_count} calibration samples to {args.calibration_dir}")
    print(f"Wrote validation input to {args.input_file}")
 if __name__ == "__main__":
    main()
--- a/examples/training/README.md
+++ b/examples/training/README.md
@@ -13,7 +13,6 @@ s3:
  bucket: your-bucket-name
 sagemaker:
  role_name: <role-name>
  training:
    image_uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.6-cpu-py312-ubuntu22.04-sagemaker-v1
    instance_type: ml.m4.xlarge
--- a/examples/training/run_training.sh
+++ b/examples/training/run_training.sh
@@ -72,10 +72,11 @@ if [[ "${SKIP_UPLOAD}" == false ]]; then
  run uv run qc-cli upload "${DATASET_DIR}" --config "${CONFIG_PATH}"
 fi
-TRAIN_OUTPUT="$(uv run qc-cli train start --config "${CONFIG_PATH}")"
+TRAIN_OUTPUT_FILE="$(mktemp)"
-echo "${TRAIN_OUTPUT}"
+trap 'rm -f "${TRAIN_OUTPUT_FILE}"' EXIT
 run uv run qc-cli train start --config "${CONFIG_PATH}" | tee "${TRAIN_OUTPUT_FILE}"
-JOB_NAME="$(printf '%s\n' "${TRAIN_OUTPUT}" | grep -Eo 'qc-cli-[0-9]{8}-[0-9]{6}' | tail -n 1)"
+JOB_NAME="$(grep -Eo 'qc-cli-[0-9]{8}-[0-9]{6}' "${TRAIN_OUTPUT_FILE}" | tail -n 1)"
 if [[ -z "${JOB_NAME}" ]]; then
  echo "Could not find training job name in qc-cli output." >&2
  exit 1
--- a/examples/training/source/train.py
+++ b/examples/training/source/train.py
@@ -126,10 +126,6 @@ def export_onnx(model: nn.Module, model_dir: Path, image_size: int) -> None:
        do_constant_folding=True,
        input_names=["input"],
        output_names=["logits"],
        dynamic_axes={
            "input": {0: "batch_size"},
            "logits": {0: "batch_size"},
        },
    )
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,15 +5,19 @@ build-backend = "hatchling.build"
 [project]
 name = "qc-cli"
 version = "0.1.0"
-description = "CLI for SageMaker ONNX training and Qualcomm AI Hub optimization"
+description = "CLI for training and deploying models for Qualcomm AI Hub"
 requires-python = ">=3.13"
 dependencies = [
    "aws-cdk-lib>=2.180.0",
    "typer==0.25.0",
    "boto3>=1.34,<1.42",
    "constructs>=10.0.0",
    "mlflow>=3.0",
    "numpy>=1.26",
    "pydantic>=2.13.3",
    "pyyaml>=6.0.3",
    "qai-hub>=0.49.0",
    "sagemaker-mlflow>=0.4.0",
 ]
 [project.scripts]
--- a/src/aws/cloudformation.py
+++ b/src/aws/cloudformation.py
@@ -3,13 +3,11 @@ from typing import Any
 import boto3
 from botocore.exceptions import ClientError
 from src.infra.provisioning import STACK_NAME
-
+def stack_status(region: str, profile: str, stack_name: str) -> dict[str, Any] | None:
 def stack_status(region: str, profile: str) -> dict[str, Any] | None:
    client = boto3.Session(profile_name=profile, region_name=region).client("cloudformation")
    try:
-        stack = client.describe_stacks(StackName=STACK_NAME)["Stacks"][0]
+        stack = client.describe_stacks(StackName=stack_name)["Stacks"][0]
    except ClientError as e:
        message = e.response.get("Error", {}).get("Message", "")
        if "does not exist" in message:
--- a/src/aws/mlflow.py
+++ b/src/aws/mlflow.py
@@ -17,3 +17,20 @@ def describe_tracking_server(region: str, profile: str, name: str) -> dict[str,
        ):
            return None
        raise
 def get_tracking_server_arn(region: str, profile: str, name: str) -> str:
    server = describe_tracking_server(region, profile, name)
    if not server:
        raise ValueError(f"MLflow tracking server not found: {name}")
    arn = server.get("TrackingServerArn")
    if not arn:
        raise ValueError(f"MLflow tracking server has no ARN: {name}")
    return str(arn)
 def create_presigned_tracking_server_url(region: str, profile: str, name: str) -> str:
    client = boto3.Session(profile_name=profile, region_name=region).client("sagemaker")
    response = client.create_presigned_mlflow_tracking_server_url(TrackingServerName=name)
    return str(response["AuthorizedUrl"])
--- a/src/aws/s3.py
+++ b/src/aws/s3.py
@@ -21,6 +21,24 @@ def upload_file(
    return f"s3://{bucket}/{s3_key}"
 def download_file(
    region: str,
    profile: str,
    s3_uri: str,
    local_path: str,
 ) -> str:
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Expected S3 URI, got: {s3_uri}")
    bucket_key = s3_uri.removeprefix("s3://")
    bucket, _, key = bucket_key.partition("/")
    if not bucket or not key:
        raise ValueError(f"Expected S3 URI with bucket and key, got: {s3_uri}")
    dest = Path(local_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    _client(region, profile).download_file(bucket, key, str(dest))
    return str(dest)
 def upload_dir(
    region: str,
    profile: str,
--- a/src/aws/sagemaker.py
+++ b/src/aws/sagemaker.py
@@ -36,6 +36,7 @@ class TrainingJobStatus:
    modified: datetime | None
    model_artifacts: str | None
    failure_reason: str | None
    raw: dict[str, Any] = field(default_factory=dict)
 def _sm(session: Boto3SessionKwargs) -> SageMakerClient:
@@ -116,9 +117,20 @@ def get_training_job_status(session: Boto3SessionKwargs, job_name: str) -> Train
        modified=resp.get("LastModifiedTime"),
        model_artifacts=resp.get("ModelArtifacts", {}).get("S3ModelArtifacts"),
        failure_reason=resp.get("FailureReason"),
        raw=dict(resp),
    )
 def get_model_artifacts(region: str, profile: str, job_name: str) -> str:
    resp = boto3.Session(profile_name=profile, region_name=region).client("sagemaker").describe_training_job(
        TrainingJobName=job_name
    )
    artifact = resp.get("ModelArtifacts", {}).get("S3ModelArtifacts")
    if not artifact:
        raise RuntimeError(f"Training job '{job_name}' does not have model artifacts yet.")
    return str(artifact)
 def list_training_jobs(
    session: Boto3SessionKwargs,
    max_results: int = 10,
--- a/src/cloud/init.py
+++ b/src/cloud/init.py
--- a/src/commands/ai_hub.py
+++ b/src/commands/ai_hub.py
@@ -0,0 +1,406 @@
 from collections.abc import Mapping, Sequence
 from datetime import datetime
 from enum import StrEnum
 from pathlib import Path
 from typing import Any
 import qai_hub.hub as hub
 import typer
 from qai_hub.client import Device
 from src import state as state_ops
 from src.commands.utils import CONFIG_OPT, CONSOLE, load_cfg
 from src.config import Config
 from src.qualcomm import aihub_jobs
 from src.qualcomm.artifacts import resolve_onnx
 app = typer.Typer(help="Quantize, compile, validate, profile, and download models with Qualcomm Workbench")
 _RUNTIME_EXTENSIONS = {
    "tflite": "tflite",
    "qnn_context_binary": "bin",
    "onnx": "onnx",
 }
 class UploadStep(StrEnum):
    quantize = "quantize"
    compile = "compile"
    validate = "validate"
    profile = "profile"
 def _input_specs(cfg: Config) -> dict[str, tuple[tuple[int, ...], str]]:
    specs = {name: (tuple(shape), dtype) for name, (shape, dtype) in cfg.aihub.input_specs.items()}
    if not specs:
        CONSOLE.print("[red]aihub.input_specs must define at least one input.[/red]")
        raise typer.Exit(1)
    return specs
 def _load_inputs(
    input_file: Path,
    specs: Mapping[str, tuple[Sequence[int], str]],
    input_name: str | None = None,
 ) -> dict[str, Any]:
    import numpy as np
    if not input_file.exists():
        raise FileNotFoundError(f"File not found: {input_file}")
    if input_file.suffix == ".npz":
        loaded = np.load(input_file)
        missing = set(specs) - set(loaded.files)
        if missing:
            raise ValueError(f"Missing input(s) in NPZ: {', '.join(sorted(missing))}")
        return {name: loaded[name] for name in specs}
    if input_file.suffix == ".npy":
        if input_name is None:
            if len(specs) != 1:
                raise ValueError("--input-name is required when config has multiple inputs")
            input_name = next(iter(specs))
        if input_name not in specs:
            raise ValueError(f"Input name '{input_name}' is not defined in aihub.input_specs")
        return {input_name: np.load(input_file)}
    raise ValueError("Input file must be .npz or .npy")
 def _load_calibration(path: Path, specs: Mapping[str, tuple[Sequence[int], str]]) -> dict[str, Any]:
    import numpy as np
    if path.is_file():
        return _load_inputs(path, specs)
    if not path.is_dir():
        raise FileNotFoundError(f"Calibration path not found: {path}")
    if len(specs) != 1:
        raise ValueError("Directory calibration data is supported only for single-input models.")
    input_name = next(iter(specs))
    samples = [np.load(p) for p in sorted(path.glob("*.npy"))]
    if not samples:
        raise ValueError(f"No .npy calibration samples found in {path}")
    return {input_name: samples}
 def _job_name(cfg: Config, operation: str) -> str | None:
    if not cfg.aihub.job_name:
        return None
    return f"{cfg.aihub.job_name}-{operation}"
 def _model_id_or_state(config_path: str, model_id: str | None, *, quantized: bool = False) -> str:
    st = state_ops.store(config_path)
    resolved = model_id or (st.get_last_quantized_model_id() if quantized else st.get_last_compiled_model_id())
    if not resolved:
        source = "quantized" if quantized else "compiled"
        CONSOLE.print(f"[red]No {source} model found. Pass --model-id or run the previous AI Hub step first.[/red]")
        raise typer.Exit(1)
    return resolved
 def _device_selector(device: Device) -> str:
    parts: list[str] = []
    if device.name:
        parts.append(f"name={device.name!r}")
    if device.os:
        parts.append(f"os={device.os!r}")
    if device.attributes:
        parts.append(f"attributes={device.attributes!r}")
    return ", ".join(parts) if parts else "empty selector"
 def _validate_device(cfg: Config) -> None:
    device = cfg.aihub.device
    try:
        matches = hub.get_devices(name=device.name, os=device.os, attributes=device.attributes)
    except Exception as e:
        CONSOLE.print(f"[red]Unable to validate AI Hub device {_device_selector(device)}: {e}[/red]")
        raise typer.Exit(1)
    if matches:
        return
    CONSOLE.print(f"[red]AI Hub device not found: {_device_selector(device)}[/red]")
    CONSOLE.print("Run [bold]qai-hub list-devices[/bold] to see valid device names.")
    raise typer.Exit(1)
 def _quantize_step(
    cfg: Config,
    config_path: str,
    calibration_path: Path,
    from_job: str | None,
    model_s3_uri: str | None,
    onnx_path: str | None,
 ) -> str:
    st = state_ops.store(config_path)
    specs = _input_specs(cfg)
    try:
        resolved = resolve_onnx(
            cfg=cfg,
            output_dir=cfg.aihub.output_dir,
            from_job=from_job,
            model_s3_uri=model_s3_uri or st.get_last_model_artifact(),
            onnx_path=onnx_path,
            last_training_job=st.get_last_training_job(),
        )
        calibration_data = _load_calibration(calibration_path, specs)
    except (FileNotFoundError, ValueError) as e:
        CONSOLE.print(f"[red]{e}[/red]")
        raise typer.Exit(1)
    try:
        result = aihub_jobs.submit_quantize_job(
            resolved.onnx_path,
            calibration_data,
            cfg.aihub.quantize_options,
            job_name=_job_name(cfg, "quantize"),
            model_name=cfg.aihub.model_name,
        )
    except Exception as e:
        CONSOLE.print(f"[red]AI Hub quantize failed: {e}[/red]")
        raise typer.Exit(1)
    st.update(
        last_model_artifact=resolved.model_artifact,
        last_quantize_job_id=result["job_id"],
        last_quantized_model_id=result["model_id"],
    )
    CONSOLE.print(f"[green]✓[/green] Quantize job: [bold]{result['job_id']}[/bold]")
    CONSOLE.print(f"[green]✓[/green] Quantized model: [bold]{result['model_id']}[/bold]")
    return str(result["model_id"])
 def _compile_step(
    cfg: Config,
    config_path: str,
    model_id: str | None,
    from_job: str | None,
    model_s3_uri: str | None,
    onnx_path: str | None,
    *,
    prefer_quantized: bool,
 ) -> str:
    st = state_ops.store(config_path)
    _validate_device(cfg)
    specs = _input_specs(cfg)
    model: Any
    model_artifact: str | None = None
    has_explicit_source = bool(from_job or model_s3_uri or onnx_path)
    if model_id:
        model = model_id
    elif prefer_quantized and not has_explicit_source and st.get_last_quantized_model_id():
        model = st.get_last_quantized_model_id()
    else:
        try:
            resolved = resolve_onnx(
                cfg=cfg,
                output_dir=cfg.aihub.output_dir,
                from_job=from_job,
                model_s3_uri=model_s3_uri,
                onnx_path=onnx_path,
                last_training_job=st.get_last_training_job(),
            )
        except (FileNotFoundError, ValueError) as e:
            CONSOLE.print(f"[red]{e}[/red]")
            raise typer.Exit(1)
        model = resolved.onnx_path
        model_artifact = resolved.model_artifact
    try:
        result = aihub_jobs.submit_compile_job(
            model=model,
            device=cfg.aihub.device,
            input_specs=specs,
            target_runtime=cfg.aihub.target_runtime,
            options=cfg.aihub.compile_options,
            job_name=_job_name(cfg, "compile"),
            model_name=cfg.aihub.model_name if isinstance(model, Path) else None,
        )
    except Exception as e:
        CONSOLE.print(f"[red]AI Hub compile failed: {e}[/red]")
        raise typer.Exit(1)
    updates: dict[str, Any] = {
        "last_compile_job_id": result["job_id"],
        "last_compiled_model_id": result["model_id"],
    }
    if model_artifact:
        updates["last_model_artifact"] = model_artifact
    st.update(**updates)
    CONSOLE.print(f"[green]✓[/green] Compile job: [bold]{result['job_id']}[/bold]")
    CONSOLE.print(f"[green]✓[/green] Compiled model: [bold]{result['model_id']}[/bold]")
    return str(result["model_id"])
 def _validate_step(
    cfg: Config,
    config_path: str,
    input_file: Path,
    model_id: str | None,
    input_name: str | None,
 ) -> str:
    _validate_device(cfg)
    specs = _input_specs(cfg)
    resolved_model_id = _model_id_or_state(config_path, model_id)
    try:
        inputs = _load_inputs(input_file, specs, input_name)
    except (FileNotFoundError, ValueError) as e:
        CONSOLE.print(f"[red]{e}[/red]")
        raise typer.Exit(1)
    run = datetime.now().strftime("%Y%m%d-%H%M%S")
    out_dir = Path(cfg.aihub.output_dir) / run / "validation"
    try:
        result = aihub_jobs.submit_inference_job(
            resolved_model_id,
            cfg.aihub.device,
            inputs,
            out_dir,
            job_name=_job_name(cfg, "validate"),
        )
    except Exception as e:
        CONSOLE.print(f"[red]AI Hub inference failed: {e}[/red]")
        raise typer.Exit(1)
    state_ops.store(config_path).update(last_inference_job_id=result["job_id"])
    CONSOLE.print(f"[green]✓[/green] Inference job: [bold]{result['job_id']}[/bold]")
    outputs = result.get("outputs")
    if isinstance(outputs, dict):
        for name, value in outputs.items():
            CONSOLE.print(f"  {name}: shape={getattr(value, 'shape', '?')}")
    CONSOLE.print(f"Outputs: [cyan]{out_dir}[/cyan]")
    return str(result["job_id"])
 def _profile_step(cfg: Config, config_path: str, model_id: str | None) -> str:
    _validate_device(cfg)
    resolved_model_id = _model_id_or_state(config_path, model_id)
    try:
        result = aihub_jobs.submit_profile_job(
            resolved_model_id,
            cfg.aihub.device,
            cfg.aihub.profile_options,
            job_name=_job_name(cfg, "profile"),
        )
    except Exception as e:
        CONSOLE.print(f"[red]AI Hub profile failed: {e}[/red]")
        raise typer.Exit(1)
    state_ops.store(config_path).update(last_profile_job_id=result["job_id"])
    CONSOLE.print(f"[green]✓[/green] Profile job: [bold]{result['job_id']}[/bold]")
    return str(result["job_id"])
@app.command()
 def quantize(
    calibration_path: Path = typer.Argument(..., help="Calibration .npz file or directory of .npy samples"),
    from_job: str | None = typer.Option(None, "--from-job", help="Training job name whose model artifact should quantize"),
    model_s3_uri: str | None = typer.Option(None, "--model-s3-uri", help="S3 URI of model.tar.gz to quantize"),
    onnx_path: str | None = typer.Option(
        None, "--onnx-path", help="Local ONNX path or ONNX path inside extracted artifact"
    ),
    config: str = CONFIG_OPT,
 ) -> None:
    """Quantize an ONNX model to INT8."""
    cfg = load_cfg(config)
    _quantize_step(cfg, config, calibration_path, from_job, model_s3_uri, onnx_path)
@app.command()
 def compile(
    model_id: str | None = typer.Option(None, "--model-id", help="AI Hub model ID to compile"),
    from_job: str | None = typer.Option(None, "--from-job", help="Training job name whose model artifact should compile"),
    model_s3_uri: str | None = typer.Option(None, "--model-s3-uri", help="S3 URI of model.tar.gz to compile"),
    onnx_path: str | None = typer.Option(
        None, "--onnx-path", help="Local ONNX path or ONNX path inside extracted artifact"
    ),
    config: str = CONFIG_OPT,
 ) -> None:
    """Compile a model for the configured Qualcomm AI Hub target."""
    cfg = load_cfg(config)
    _compile_step(cfg, config, model_id, from_job, model_s3_uri, onnx_path, prefer_quantized=True)
@app.command()
 def validate(
    input_file: Path = typer.Argument(..., help="NumPy .npz or .npy inputs to run on device"),
    model_id: str | None = typer.Option(None, "--model-id", help="AI Hub compiled model ID"),
    input_name: str | None = typer.Option(None, "--input-name", help="Input name for .npy files"),
    config: str = CONFIG_OPT,
 ) -> None:
    """Run an AI Hub inference job using sample inputs."""
    cfg = load_cfg(config)
    _validate_step(cfg, config, input_file, model_id, input_name)
@app.command()
 def profile(
    model_id: str | None = typer.Option(None, "--model-id", help="AI Hub compiled model ID"),
    config: str = CONFIG_OPT,
 ) -> None:
    """Profile a compiled model on the configured AI Hub device."""
    cfg = load_cfg(config)
    _profile_step(cfg, config, model_id)
@app.command()
 def upload(
    calibration_path: Path = typer.Argument(..., help="Calibration .npz file or directory of .npy samples"),
    input_file: Path = typer.Argument(..., help="Validation .npz or .npy inputs to run on device"),
    from_step: UploadStep = typer.Option(UploadStep.quantize, "--from-step", help="Resume from this Workbench step"),
    from_job: str | None = typer.Option(None, "--from-job", help="Training job name whose model artifact should upload"),
    model_s3_uri: str | None = typer.Option(None, "--model-s3-uri", help="S3 URI of model.tar.gz to upload"),
    onnx_path: str | None = typer.Option(
        None, "--onnx-path", help="Local ONNX path or ONNX path inside extracted artifact"
    ),
    input_name: str | None = typer.Option(None, "--input-name", help="Input name for .npy validation files"),
    config: str = CONFIG_OPT,
 ) -> None:
    """Run the four Workbench upload steps: quantize, compile, validate, and profile."""
    cfg = load_cfg(config)
    steps = [UploadStep.quantize, UploadStep.compile, UploadStep.validate, UploadStep.profile]
    selected = steps[steps.index(from_step) :]
    quantized_model_id: str | None = None
    compiled_model_id: str | None = None
    if UploadStep.quantize in selected:
        quantized_model_id = _quantize_step(cfg, config, calibration_path, from_job, model_s3_uri, onnx_path)
    if UploadStep.compile in selected:
        compiled_model_id = _compile_step(
            cfg,
            config,
            model_id=quantized_model_id,
            from_job=from_job,
            model_s3_uri=model_s3_uri,
            onnx_path=onnx_path,
            prefer_quantized=True,
        )
    if UploadStep.validate in selected:
        _validate_step(cfg, config, input_file, compiled_model_id, input_name)
    if UploadStep.profile in selected:
        _profile_step(cfg, config, compiled_model_id)
@app.command()
 def download(
    model_id: str | None = typer.Option(None, "--model-id", help="AI Hub compiled model ID"),
    output: Path | None = typer.Option(None, "--output", "-o", help="Destination file path"),
    config: str = CONFIG_OPT,
 ) -> None:
    """Download the last compiled deployable artifact from AI Hub."""
    cfg = load_cfg(config)
    resolved_model_id = _model_id_or_state(config, model_id)
    ext = _RUNTIME_EXTENSIONS.get(cfg.aihub.target_runtime, cfg.aihub.target_runtime)
    dest = output or (Path(cfg.aihub.output_dir) / f"model.{ext}")
    try:
        written = aihub_jobs.download_model(resolved_model_id, dest)
    except Exception as e:
        CONSOLE.print(f"[red]AI Hub download failed: {e}[/red]")
        raise typer.Exit(1)
    state_ops.store(config).update(last_downloaded_model=written)
    CONSOLE.print(f"[green]✓[/green] Downloaded model: [cyan]{written}[/cyan]")
--- a/src/commands/infra.py
+++ b/src/commands/infra.py
@@ -51,6 +51,8 @@ def setup(
                    profile=cfg.aws.profile,
                    account_id=account_id,
                    region=cfg.aws.region,
                    bootstrap_qualifier=cfg.infra.effective_bootstrap_qualifier,
                    toolkit_stack_name=cfg.infra.effective_toolkit_stack_name,
                    cloudformation_execution_policy=cloudformation_execution_policy,
                )
        with CONSOLE.status("Running cdk deploy..."):
@@ -58,6 +60,9 @@ def setup(
                profile=cfg.aws.profile,
                account_id=account_id,
                region=cfg.aws.region,
                stack_name=cfg.infra.stack_name,
                bootstrap_qualifier=cfg.infra.effective_bootstrap_qualifier,
                toolkit_stack_name=cfg.infra.effective_toolkit_stack_name,
                config_path=config,
                config_dir=str(Path(config).parent),
                config_snapshot=cfg.model_dump(mode="json"),
@@ -72,7 +77,8 @@ def setup(
    if outputs.get("SageMakerRoleArn"):
        CONSOLE.print(f"[green]✓[/green] IAM role:  {outputs['SageMakerRoleArn']}")
    if cfg.mlflow.mode is MlflowMode.create and outputs.get("MlflowTrackingServerArn"):
-        CONSOLE.print(f"[green]✓[/green] MLflow:    {outputs['MlflowTrackingServerArn']}")
+        mlflow_name = outputs.get("MlflowTrackingServerName", cfg.managed_mlflow_tracking_server_name)
        CONSOLE.print(f"[green]✓[/green] MLflow:    {mlflow_name}")
    elif cfg.mlflow.mode is MlflowMode.existing:
        CONSOLE.print(f"[green]✓[/green] MLflow:    {cfg.mlflow.tracking_server_name}")
    CONSOLE.print("\n[bold green]Infrastructure ready.[/bold green]")
@@ -82,7 +88,7 @@ def setup(
 def status(config: str = CONFIG_OPT) -> None:
    """Show current infrastructure status."""
    cfg = load_cfg(config)
-    stack = cloudformation.stack_status(cfg.aws.region, cfg.aws.profile)
+    stack = cloudformation.stack_status(cfg.aws.region, cfg.aws.profile, cfg.infra.stack_name)
    table = Table(title="Infrastructure Status")
    table.add_column("Resource", style="cyan")
@@ -91,13 +97,13 @@ def status(config: str = CONFIG_OPT) -> None:
    table.add_column("ARN / URI")
    if not stack:
-        table.add_row("CDK Stack", provisioning.STACK_NAME, "[red]missing[/red]", "-")
+        table.add_row("CDK Stack", cfg.infra.stack_name, "[red]missing[/red]", "-")
        table.add_row("S3 Bucket", cfg.s3.bucket, "[red]unknown[/red]", "-")
        table.add_row("IAM Role", cfg.sagemaker.role_name, "[red]unknown[/red]", "-")
        if cfg.mlflow.mode is not MlflowMode.disabled:
            table.add_row(
                "MLflow",
-                cfg.mlflow.tracking_server_name or "-",
+                cfg.effective_mlflow_tracking_server_name or "-",
                "[red]unknown[/red]",
                "-",
            )
@@ -114,14 +120,14 @@ def status(config: str = CONFIG_OPT) -> None:
    )
    table.add_row(
        "IAM Role",
-        cfg.sagemaker.role_name,
+        _role_name(cfg.sagemaker.role_name, outputs.get("SageMakerRoleArn", "")),
        "[green]managed[/green]",
        outputs.get("SageMakerRoleArn", "-"),
    )
    if cfg.mlflow.mode is MlflowMode.create:
        table.add_row(
            "MLflow",
-            cfg.mlflow.tracking_server_name or "-",
+            outputs.get("MlflowTrackingServerName", cfg.managed_mlflow_tracking_server_name),
            "[green]managed[/green]",
            outputs.get("MlflowTrackingServerArn", outputs.get("MlflowArtifactUri", "-")),
        )
@@ -156,10 +162,13 @@ def destroy(
 ) -> None:
    """Destroy the CDK stack."""
    cfg = _destroy_config(config)
    stack_name = _destroy_stack_name(config, cfg)
    bootstrap_qualifier = _destroy_bootstrap_qualifier(config, cfg)
    toolkit_stack_name = _destroy_toolkit_stack_name(config, cfg)
    if not yes and not delete_bucket_data:
        typer.confirm(
-            f"Destroy CDK stack '{provisioning.STACK_NAME}' while retaining S3 bucket data?",
+            f"Destroy CDK stack '{stack_name}' while retaining S3 bucket data?",
            abort=True,
        )
@@ -172,13 +181,17 @@ def destroy(
                provisioning.destroy(
                    profile=cfg.aws.profile,
                    account_id=account_id,
                    stack_name=stack_name,
                    bootstrap_qualifier=bootstrap_qualifier,
                    toolkit_stack_name=toolkit_stack_name,
                    config_path=str(snapshot_path),
                    delete_bucket_data=delete_bucket_data,
                )
    except RuntimeError as e:
        CONSOLE.print(f"[red]{e}[/red]")
        raise typer.Exit(1)
-    CONSOLE.print(f"[green]✓[/green] Destroyed stack: {provisioning.STACK_NAME}")
+    CONSOLE.print(f"[green]✓[/green] Destroyed stack: {stack_name}")
    CONSOLE.print(f"[yellow]CDK bootstrap stack retained: {toolkit_stack_name}[/yellow]")
 def _destroy_config(config_path: str) -> Config:
@@ -190,6 +203,14 @@ def _destroy_config(config_path: str) -> Config:
    return load_cfg(config_path)
 def _role_name(configured_name: str, role_arn: str) -> str:
    if configured_name:
        return configured_name
    if role_arn:
        return role_arn.rsplit("/", 1)[-1]
    return "-"
 def _destroy_account_id(config_path: str, cfg: Config) -> str:
    config_dir = str(Path(config_path).parent)
    state = read_infra_state(config_dir)
@@ -197,3 +218,30 @@ def _destroy_account_id(config_path: str, cfg: Config) -> str:
    if account_id:
        return str(account_id)
    return identity.account_id(cfg.aws.region, cfg.aws.profile)
 def _destroy_stack_name(config_path: str, cfg: Config) -> str:
    config_dir = str(Path(config_path).parent)
    state = read_infra_state(config_dir)
    stack_name = state.get("stack_name")
    if stack_name:
        return str(stack_name)
    return cfg.infra.stack_name
 def _destroy_bootstrap_qualifier(config_path: str, cfg: Config) -> str:
    config_dir = str(Path(config_path).parent)
    state = read_infra_state(config_dir)
    bootstrap_qualifier = state.get("bootstrap_qualifier")
    if bootstrap_qualifier:
        return str(bootstrap_qualifier)
    return cfg.infra.effective_bootstrap_qualifier
 def _destroy_toolkit_stack_name(config_path: str, cfg: Config) -> str:
    config_dir = str(Path(config_path).parent)
    state = read_infra_state(config_dir)
    toolkit_stack_name = state.get("toolkit_stack_name")
    if toolkit_stack_name:
        return str(toolkit_stack_name)
    return cfg.infra.effective_toolkit_stack_name
--- a/src/commands/init.py
+++ b/src/commands/init.py
@@ -0,0 +1,40 @@
 import secrets
 from pathlib import Path
 import typer
 import yaml
 from src.commands.utils import CONSOLE
 from src.config import GENERATED_STACK_PREFIX, Config, InfraConfig, S3Config
 app = typer.Typer()
@app.command()
 def init(
    output: str = typer.Option("config.yaml", help="Destination path for the config file"),
    force: bool = typer.Option(False, "--force", "-f", help="Overwrite an existing config file"),
 ) -> None:
    """Write a starter config.yaml to the current directory."""
    dest = Path(output)
    if dest.exists() and not force:
        CONSOLE.print(f"[yellow]{dest} already exists.[/yellow] Use --force to overwrite.")
        raise typer.Exit(1)
    config = _new_isolated_config()
    dest.parent.mkdir(parents=True, exist_ok=True)
    config_data = config.model_dump(mode="json")
    config_data["sagemaker"].pop("role_name", None)
    with open(dest, "w") as f:
        yaml.safe_dump(config_data, f, sort_keys=False)
    CONSOLE.print(f"[green]✓[/green] Config written to [bold]{dest}[/bold]")
    CONSOLE.print("Edit [cyan]sagemaker.training.image_uri[/cyan] before running training commands.")
 def _new_isolated_config() -> Config:
    suffix = secrets.token_hex(6)
    namespace = f"{GENERATED_STACK_PREFIX}{suffix}"
    config = Config(infra=InfraConfig(stack_name=namespace))
    config.s3 = S3Config(bucket=f"{namespace}-data")
    return config
--- a/src/commands/mlflow.py
+++ b/src/commands/mlflow.py
@@ -0,0 +1,41 @@
 import webbrowser
 import typer
 from src.aws import mlflow as aws_mlflow
 from src.commands.utils import CONFIG_OPT, CONSOLE, load_cfg
 app = typer.Typer(help="Manage MLflow tracking server access")
@app.command(name="open")
 def open_mlflow(config: str = CONFIG_OPT) -> None:
    """Open a presigned URL for the configured MLflow tracking server."""
    cfg = load_cfg(config)
    tracking_server_name = cfg.effective_mlflow_tracking_server_name
    if not tracking_server_name:
        CONSOLE.print("[red]MLflow is disabled in config.yaml.[/red]")
        raise typer.Exit(1)
    try:
        url = aws_mlflow.create_presigned_tracking_server_url(
            cfg.aws.region,
            cfg.aws.profile,
            tracking_server_name,
        )
    except Exception as e:
        CONSOLE.print("[yellow]Could not create a SageMaker MLflow UI URL.[/yellow]")
        CONSOLE.print(f"Tracking server: [cyan]{tracking_server_name}[/cyan]")
        CONSOLE.print(f"Reason: {e}")
        CONSOLE.print(
            "This command can create presigned URLs only for MLflow tracking servers managed by "
            "Amazon SageMaker. If this is an external MLflow server, open it with that server's own URL."
        )
        raise typer.Exit(1)
    CONSOLE.print(f"MLflow tracking server: [cyan]{tracking_server_name}[/cyan]")
    CONSOLE.print(f"MLflow UI: {url}")
    if webbrowser.open(url):
        CONSOLE.print("[green]✓[/green] Opened MLflow UI in your browser.")
    else:
        CONSOLE.print("[yellow]Could not open a browser automatically. Open the URL above manually.[/yellow]")
--- a/src/commands/train.py
+++ b/src/commands/train.py
@@ -1,4 +1,5 @@
 from datetime import datetime
 from pathlib import Path
 import typer
 from rich.table import Table
@@ -7,6 +8,9 @@ from src import state as state_ops
 from src.aws import iam
 from src.aws import sagemaker as sm_ops
 from src.commands.utils import CONFIG_OPT, CONSOLE, load_cfg
 from src.config import Config, MlflowMode
 from src.infra.state import read_infra_state
 from src.tracking.mlflow import MlflowTracker
 app = typer.Typer(help="Manage SageMaker training jobs")
@@ -19,11 +23,31 @@ _STATUS_COLOR = {
 }
 def _tracker(cfg):
    try:
        return MlflowTracker.from_config(cfg)
    except Exception as e:
        CONSOLE.print(f"[red]MLflow setup failed: {e}[/red]")
        raise typer.Exit(1)
 def _config_dir(config_path: str) -> str:
    from pathlib import Path
    return str(Path(config_path).parent)
 def _sagemaker_role_arn(config_path: str, cfg: Config) -> str:
    state = read_infra_state(_config_dir(config_path))
    role_arn = state.get("outputs", {}).get("SageMakerRoleArn")
    if role_arn:
        return str(role_arn)
    if cfg.sagemaker.role_name:
        role_arn = iam.get_role_arn(cfg.aws.profile, cfg.sagemaker.role_name)
        if role_arn:
            return role_arn
        raise RuntimeError(f"IAM role '{cfg.sagemaker.role_name}' not found. Run 'qc-cli infra setup' first.")
    raise RuntimeError("SageMaker role not found in infra state. Run 'qc-cli infra setup' first.")
@app.command()
 def start(config: str = CONFIG_OPT) -> None:
    """Submit a SageMaker training job."""
@@ -37,11 +61,13 @@ def start(config: str = CONFIG_OPT) -> None:
        )
        raise typer.Exit(1)
-    role_arn = iam.get_role_arn(cfg.aws.profile, cfg.sagemaker.role_name)
+    try:
-    if not role_arn:
+        role_arn = _sagemaker_role_arn(config, cfg)
-        CONSOLE.print(f"[red]IAM role '{cfg.sagemaker.role_name}' not found. Run 'qc-cli infra setup' first.[/red]")
+    except RuntimeError as e:
        CONSOLE.print(f"[red]{e}[/red]")
        raise typer.Exit(1)
    tracker = _tracker(cfg)
    job_name = f"qc-cli-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
    s3_train_uri = f"s3://{cfg.s3.bucket}/{cfg.s3.data_prefix}"
    s3_output = f"s3://{cfg.s3.bucket}/{cfg.s3.model_prefix}"
@@ -61,9 +87,21 @@ def start(config: str = CONFIG_OPT) -> None:
    )
    sm_ops.start_training_job(cfg.aws.boto3_session, training_job)
-    state_ops.write_state(_config_dir(config), last_training_job=job_name)
+    st = state_ops.store(config)
    st.set_last_training_job(job_name)
    run_id = tracker.start_training_run(
        training_job,
        region=cfg.aws.region,
        profile=cfg.aws.profile,
        role_arn=role_arn,
    )
    if run_id:
        st.update_training_job(job_name, mlflow_run_id=run_id)
    CONSOLE.print(f"[green]✓[/green] Job submitted: [bold]{job_name}[/bold]")
    if run_id:
        CONSOLE.print(f"MLflow run: [cyan]{run_id}[/cyan]")
        CONSOLE.print("Open MLflow: [cyan]qc-cli mlflow open[/cyan]")
    CONSOLE.print("Track progress: [cyan]qc-cli train status[/cyan]")
@@ -74,9 +112,10 @@ def status(
 ) -> None:
    """Show training job status."""
    cfg = load_cfg(config)
    st = state_ops.store(config)
    if not job_name:
-        job_name = state_ops.get_last_training_job(_config_dir(config))
+        job_name = st.get_last_training_job()
        if not job_name:
            CONSOLE.print(
                "[red]No training job found in state. Pass a job name or run 'qc-cli train start' first.[/red]"
@@ -95,6 +134,25 @@ def status(
    if status.failure_reason:
        CONSOLE.print(f"[red]Failure: {status.failure_reason}[/red]")
    job_state = st.get_training_job(job_name)
    run_id = job_state.get("mlflow_run_id")
    already_registered = job_state.get("registered_model_version")
    if run_id and not already_registered and status.status in {"Completed", "Failed", "Stopped"}:
        tracker = _tracker(cfg)
        version = tracker.finalize_training_run(
            run_id=str(run_id),
            training_job_status=status,
        )
        updates = {"mlflow_finalized_status": status.status}
        if version:
            updates["registered_model_version"] = version
        st.update_training_job(job_name, **updates)
        if version:
            st.set_latest_experiment_model_version(version)
            CONSOLE.print(f"MLflow model version: [cyan]{version}[/cyan] ([cyan]experiment-latest[/cyan])")
    if run_id and cfg.mlflow.mode is not MlflowMode.disabled:
        CONSOLE.print("Open MLflow: [cyan]qc-cli mlflow open[/cyan]")
@app.command(name="list")
 def list_jobs(
--- a/src/commands/upload.py
+++ b/src/commands/upload.py
@@ -0,0 +1,70 @@
 from pathlib import Path
 import typer
 from rich.progress import BarColumn, Progress, SpinnerColumn, TaskProgressColumn, TextColumn
 from src.aws import s3 as s3_ops
 from src.commands.utils import CONFIG_OPT, CONSOLE, load_cfg
 app = typer.Typer()
@app.command()
 def upload(
    path: Path = typer.Argument(..., help="Local file or directory to upload"),
    s3_key: str | None = typer.Option(None, "--s3-key", help="S3 key for file uploads"),
    config: str = CONFIG_OPT,
 ) -> None:
    """Upload a local file or directory to S3."""
    cfg = load_cfg(config)
    if path.is_file():
        key = s3_key or f"{cfg.s3.data_prefix.rstrip('/')}/{path.name}"
        try:
            with CONSOLE.status(f"Uploading {path.name}..."):
                uri = s3_ops.upload_file(cfg.aws.region, cfg.aws.profile, cfg.s3.bucket, str(path), key)
        except Exception as e:
            CONSOLE.print(f"[red]Upload failed: {e}[/red]")
            raise typer.Exit(1)
        CONSOLE.print(f"[green]✓[/green] {path.name} -> {uri}")
        return
    if path.is_dir():
        if s3_key is not None:
            CONSOLE.print("[red]--s3-key can only be used when uploading a single file.[/red]")
            raise typer.Exit(1)
        files = [file for file in path.rglob("*") if file.is_file()]
        if not files:
            CONSOLE.print("[yellow]No files found in directory.[/yellow]")
            raise typer.Exit(0)
        prefix = cfg.s3.data_prefix
        CONSOLE.print(f"Uploading {len(files)} files to s3://{cfg.s3.bucket}/{prefix.rstrip('/')}/")
        try:
            with Progress(
                SpinnerColumn(),
                TextColumn("[progress.description]{task.description}"),
                BarColumn(),
                TaskProgressColumn(),
                console=CONSOLE,
            ) as progress:
                task = progress.add_task("Uploading...", total=len(files))
                count = s3_ops.upload_dir(
                    cfg.aws.region,
                    cfg.aws.profile,
                    cfg.s3.bucket,
                    str(path),
                    prefix,
                    on_progress=lambda: progress.advance(task),
                )
        except Exception as e:
            CONSOLE.print(f"[red]Upload failed: {e}[/red]")
            raise typer.Exit(1)
        CONSOLE.print(f"[green]✓[/green] Uploaded {count} files to s3://{cfg.s3.bucket}/{prefix.rstrip('/')}/")
        return
    CONSOLE.print(f"[red]Path not found: {path}[/red]")
    raise typer.Exit(1)
--- a/src/config.py
+++ b/src/config.py
@@ -1,18 +1,20 @@
-from enum import Enum
+import re
 from enum import StrEnum
 from typing import Any, Literal, TypedDict
 from mypy_boto3_s3.literals import BucketLocationConstraintType
 from mypy_boto3_sagemaker.literals import TrainingInstanceTypeType
-from pydantic import BaseModel, Field, model_validator
+from pydantic import BaseModel, Field, field_validator, model_validator
 from qai_hub.client import Device
-class MlflowMode(str, Enum):
+class MlflowMode(StrEnum):
    disabled = "disabled"
    create = "create"
    existing = "existing"
-class MlflowServerSize(str, Enum):
+class MlflowServerSize(StrEnum):
    small = "Small"
    medium = "Medium"
    large = "Large"
@@ -32,6 +34,33 @@ class AwsConfig(BaseModel):
        return {"profile_name": self.profile, "region_name": self.region}
 DEFAULT_BOOTSTRAP_QUALIFIER = "hnb659fds"
 GENERATED_STACK_PREFIX = "qc-cli-mlops-"
 class InfraConfig(BaseModel):
    stack_name: str
    @property
    def effective_bootstrap_qualifier(self) -> str:
        sanitized = re.sub(r"[^a-z0-9]", "", self.stack_name.lower())
        if not sanitized:
            return DEFAULT_BOOTSTRAP_QUALIFIER
        if self.stack_name.startswith(GENERATED_STACK_PREFIX):
            suffix = re.sub(r"[^a-z0-9]", "", self.stack_name.removeprefix(GENERATED_STACK_PREFIX).lower())
            if suffix:
                return f"q{suffix}"[:10]
        return f"q{sanitized}"[:10]
    @property
    def effective_toolkit_stack_name(self) -> str:
        if self.stack_name.startswith(GENERATED_STACK_PREFIX):
            suffix = re.sub(r"[^A-Za-z0-9-]", "", self.stack_name.removeprefix(GENERATED_STACK_PREFIX))
            if suffix:
                return f"{self.stack_name}-bootstrap"
        return f"{self.stack_name}-bootstrap"
 class S3Config(BaseModel):
    bucket: str = "my-qc-mlops-bucket"
    data_prefix: str = "data/"
@@ -48,13 +77,35 @@ class TrainingConfig(BaseModel):
 class SageMakerConfig(BaseModel):
-    role_name: str = "qc-cli-sagemaker-role"
+    role_name: str = ""
    training: TrainingConfig = Field(default_factory=TrainingConfig)
 class AIHubConfig(BaseModel):
    device: Device = Field(default_factory=lambda: Device("Samsung Galaxy S25 (Family)"))
    target_runtime: str = "tflite"
    input_specs: dict[str, tuple[list[int], str]] = Field(default_factory=dict)
    job_name: str | None = None
    model_name: str | None = None
    compile_options: str | None = None
    profile_options: str | None = None
    quantize_options: str | None = None
    output_dir: str = "build/qai-hub"
    @field_validator("device", mode="before")
    @classmethod
    def parse_device(cls, value: Any) -> Any:
        if isinstance(value, str):
            return Device(value)
        return value
 class MlflowConfig(BaseModel):
    mode: MlflowMode = MlflowMode.disabled
    tracking_server_name: str | None = None
    experiment_name: str = "qc-cli-training"
    registered_model_name: str = "qc-cli-model"
    register_trained_models: bool = True
    artifact_prefix: str = "mlflow/"
    tracking_server_size: MlflowServerSize = MlflowServerSize.small
    mlflow_version: str | None = None
@@ -63,13 +114,27 @@ class MlflowConfig(BaseModel):
    @model_validator(mode="after")
    def require_tracking_server_name(self) -> "MlflowConfig":
-        if self.mode in {MlflowMode.create, MlflowMode.existing} and not self.tracking_server_name:
+        if self.mode is MlflowMode.existing and not self.tracking_server_name:
-            raise ValueError("mlflow.tracking_server_name is required when mlflow.mode is create or existing")
+            raise ValueError("mlflow.tracking_server_name is required when mlflow.mode is existing")
        return self
 class Config(BaseModel):
    infra: InfraConfig
    aws: AwsConfig = Field(default_factory=AwsConfig)
    s3: S3Config = Field(default_factory=S3Config)
    sagemaker: SageMakerConfig = Field(default_factory=SageMakerConfig)
    aihub: AIHubConfig = Field(default_factory=AIHubConfig)
    mlflow: MlflowConfig = Field(default_factory=MlflowConfig)
    @property
    def managed_mlflow_tracking_server_name(self) -> str:
        return f"{self.infra.stack_name}-mlflow"
    @property
    def effective_mlflow_tracking_server_name(self) -> str | None:
        if self.mlflow.mode is MlflowMode.disabled:
            return None
        if self.mlflow.mode is MlflowMode.existing:
            return self.mlflow.tracking_server_name
        return self.managed_mlflow_tracking_server_name
--- a/src/infra/provisioning.py
+++ b/src/infra/provisioning.py
@@ -5,17 +5,27 @@ from typing import Any
 from src.infra.state import state_path, write_infra_state
 STACK_NAME = "MLOpsStack"
 def bootstrap(
    *,
    profile: str,
    account_id: str,
    region: str,
    bootstrap_qualifier: str,
    toolkit_stack_name: str,
    cloudformation_execution_policy: str | None = None,
 ) -> None:
-    cmd = ["cdk", "bootstrap", f"aws://{account_id}/{region}", "--profile", profile]
+    cmd = [
        "cdk",
        "bootstrap",
        f"aws://{account_id}/{region}",
        "--profile",
        profile,
        "--qualifier",
        bootstrap_qualifier,
        "--toolkit-stack-name",
        toolkit_stack_name,
    ]
    if cloudformation_execution_policy:
        cmd.extend(["--cloudformation-execution-policies", cloudformation_execution_policy])
    _run(cmd)
@@ -26,6 +36,9 @@ def deploy(
    profile: str,
    account_id: str,
    region: str,
    stack_name: str,
    bootstrap_qualifier: str,
    toolkit_stack_name: str,
    config_path: str,
    config_dir: str,
    config_snapshot: dict[str, Any],
@@ -35,19 +48,24 @@ def deploy(
        "deploy",
        profile=profile,
        account_id=account_id,
        stack_name=stack_name,
        bootstrap_qualifier=bootstrap_qualifier,
        toolkit_stack_name=toolkit_stack_name,
        config_path=config_path,
        delete_bucket_data=False,
    ) + ["--require-approval", "never", "--outputs-file", str(outputs_file)]
    _run(cmd)
-    outputs = _read_outputs(outputs_file)
+    outputs = _read_outputs(outputs_file, stack_name)
    state = {
-        "stack_name": STACK_NAME,
+        "stack_name": stack_name,
        "aws": {
            "account_id": account_id,
            "region": region,
            "profile": profile,
        },
        "bootstrap_qualifier": bootstrap_qualifier,
        "toolkit_stack_name": toolkit_stack_name,
        "config": config_snapshot,
        "outputs": outputs,
    }
@@ -59,6 +77,9 @@ def destroy(
    *,
    profile: str,
    account_id: str,
    stack_name: str,
    bootstrap_qualifier: str,
    toolkit_stack_name: str,
    config_path: str,
    delete_bucket_data: bool,
 ) -> None:
@@ -67,6 +88,9 @@ def destroy(
            "deploy",
            profile=profile,
            account_id=account_id,
            stack_name=stack_name,
            bootstrap_qualifier=bootstrap_qualifier,
            toolkit_stack_name=toolkit_stack_name,
            config_path=config_path,
            delete_bucket_data=True,
        ) + ["--require-approval", "never"]
@@ -76,6 +100,9 @@ def destroy(
        "destroy",
        profile=profile,
        account_id=account_id,
        stack_name=stack_name,
        bootstrap_qualifier=bootstrap_qualifier,
        toolkit_stack_name=toolkit_stack_name,
        config_path=config_path,
        delete_bucket_data=delete_bucket_data,
    ) + ["--force"]
@@ -87,26 +114,35 @@ def _cdk_cmd(
    *,
    profile: str,
    account_id: str,
    stack_name: str,
    bootstrap_qualifier: str,
    toolkit_stack_name: str,
    config_path: str,
    delete_bucket_data: bool,
 ) -> list[str]:
    cmd = [
        "cdk",
        action,
-        STACK_NAME,
+        stack_name,
        "--app",
        "python app.py",
        "--profile",
        profile,
    ]
    if action == "deploy":
        cmd.extend(["--toolkit-stack-name", toolkit_stack_name])
    cmd.extend([
        "-c",
        f"account_id={account_id}",
        "-c",
        f"config={config_path}",
        "-c",
-        f"stack_name={STACK_NAME}",
+        f"stack_name={stack_name}",
        "-c",
        f"bootstrap_qualifier={bootstrap_qualifier}",
        "-c",
        f"delete_bucket_data={str(delete_bucket_data).lower()}",
-    ]
+    ])
    return cmd
@@ -119,9 +155,9 @@ def _run(cmd: list[str]) -> None:
        raise RuntimeError(f"CDK command failed with exit code {e.returncode}.") from e
-def _read_outputs(path: Path) -> dict[str, str]:
+def _read_outputs(path: Path, stack_name: str) -> dict[str, str]:
    if not path.exists():
        return {}
    with open(path) as f:
        data = json.load(f)
-    return data.get(STACK_NAME, {})
+    return data.get(stack_name, {})
--- a/src/infra/stack.py
+++ b/src/infra/stack.py
@@ -34,7 +34,7 @@ class QCStack(Stack):
        role = iam.CfnRole(
            self,
            "SageMakerRole",
-            role_name=config.sagemaker.role_name,
+            role_name=config.sagemaker.role_name or None,
            assume_role_policy_document=self._sagemaker_trust_policy(),
            managed_policy_arns=[
                f"arn:{self.partition}:iam::aws:policy/AmazonSageMakerFullAccess",
@@ -74,6 +74,7 @@ class QCStack(Stack):
        CfnOutput(self, "SageMakerRoleArn", value=role.attr_arn)
        if config.mlflow.mode is MlflowMode.create:
            tracking_server_name = config.managed_mlflow_tracking_server_name
            artifact_prefix = config.mlflow.artifact_prefix.strip("/")
            artifact_uri = (
                f"s3://{data_bucket.bucket_name}/{artifact_prefix}/"
@@ -145,14 +146,14 @@ class QCStack(Stack):
                "MlflowTrackingServer",
                artifact_store_uri=artifact_uri,
                role_arn=mlflow_role.attr_arn,
-                tracking_server_name=config.mlflow.tracking_server_name or "",
+                tracking_server_name=tracking_server_name,
                automatic_model_registration=config.mlflow.automatic_model_registration,
                mlflow_version=config.mlflow.mlflow_version,
                tracking_server_size=config.mlflow.tracking_server_size.value,
                weekly_maintenance_window_start=config.mlflow.weekly_maintenance_window_start,
            )
-            CfnOutput(self, "MlflowTrackingServerName", value=config.mlflow.tracking_server_name or "")
+            CfnOutput(self, "MlflowTrackingServerName", value=tracking_server_name)
            CfnOutput(self, "MlflowTrackingServerArn", value=tracking_server.attr_tracking_server_arn)
            CfnOutput(self, "MlflowArtifactUri", value=artifact_uri)
            CfnOutput(self, "MlflowRoleArn", value=mlflow_role.attr_arn)
--- a/src/main.py
+++ b/src/main.py
@@ -1,104 +1,14 @@
 from pathlib import Path
 import typer
 import yaml
 from rich.console import Console
 from rich.progress import BarColumn, Progress, SpinnerColumn, TaskProgressColumn, TextColumn
-from src.aws import s3 as s3_ops
+from src.commands import ai_hub, infra, init, mlflow, train, upload
 from src.commands import infra, train
 from src.commands.utils import CONFIG_OPT, load_cfg
 from src.config import Config
 app = typer.Typer(
    help="qc-cli: End-to-end model managment for Qualcomm AI Hub.",
    no_args_is_help=True,
 )
 app.add_typer(init.app)
 app.add_typer(upload.app)
 app.add_typer(mlflow.app, name="mlflow")
 app.add_typer(infra.app, name="infra")
 app.add_typer(train.app, name="train")
-
+app.add_typer(ai_hub.app, name="ai-hub")
 console = Console()
@app.command()
 def init(
    output: str = typer.Option("config.yaml", help="Destination path for the config file"),
    force: bool = typer.Option(False, "--force", "-f", help="Overwrite an existing config file"),
 ) -> None:
    """Write a starter config.yaml to the current directory."""
    dest = Path(output)
    if dest.exists() and not force:
        console.print(f"[yellow]{dest} already exists.[/yellow] Use --force to overwrite.")
        raise typer.Exit(1)
    config = Config()
    dest.parent.mkdir(parents=True, exist_ok=True)
    with open(dest, "w") as f:
        yaml.safe_dump(config.model_dump(mode="json"), f, sort_keys=False)
    console.print(f"[green]✓[/green] Config written to [bold]{dest}[/bold]")
    console.print(
        "Edit it (especially [cyan]s3.bucket[/cyan] and [cyan]sagemaker.training.image_uri[/cyan]) "
        "before running other commands."
    )
@app.command()
 def upload(
    path: Path = typer.Argument(..., help="Local file or directory to upload"),
    s3_key: str | None = typer.Option(None, "--s3-key", help="S3 key for file uploads"),
    config: str = CONFIG_OPT,
 ) -> None:
    """Upload a local file or directory to S3."""
    cfg = load_cfg(config)
    if path.is_file():
        key = s3_key or f"{cfg.s3.data_prefix.rstrip('/')}/{path.name}"
        try:
            with console.status(f"Uploading {path.name}..."):
                uri = s3_ops.upload_file(cfg.aws.region, cfg.aws.profile, cfg.s3.bucket, str(path), key)
        except Exception as e:
            console.print(f"[red]Upload failed: {e}[/red]")
            raise typer.Exit(1)
        console.print(f"[green]✓[/green] {path.name} -> {uri}")
        return
    if path.is_dir():
        if s3_key is not None:
            console.print("[red]--s3-key can only be used when uploading a single file.[/red]")
            raise typer.Exit(1)
        files = [file for file in path.rglob("*") if file.is_file()]
        if not files:
            console.print("[yellow]No files found in directory.[/yellow]")
            raise typer.Exit(0)
        prefix = cfg.s3.data_prefix
        console.print(f"Uploading {len(files)} files to s3://{cfg.s3.bucket}/{prefix.rstrip('/')}/")
        try:
            with Progress(
                SpinnerColumn(),
                TextColumn("[progress.description]{task.description}"),
                BarColumn(),
                TaskProgressColumn(),
                console=console,
            ) as progress:
                task = progress.add_task("Uploading...", total=len(files))
                count = s3_ops.upload_dir(
                    cfg.aws.region,
                    cfg.aws.profile,
                    cfg.s3.bucket,
                    str(path),
                    prefix,
                    on_progress=lambda: progress.advance(task),
                )
        except Exception as e:
            console.print(f"[red]Upload failed: {e}[/red]")
            raise typer.Exit(1)
        console.print(f"[green]✓[/green] Uploaded {count} files to s3://{cfg.s3.bucket}/{prefix.rstrip('/')}/")
        return
    console.print(f"[red]Path not found: {path}[/red]")
    raise typer.Exit(1)
--- a/src/qualcomm/init.py
+++ b/src/qualcomm/init.py
--- a/src/qualcomm/aihub_jobs.py
+++ b/src/qualcomm/aihub_jobs.py
@@ -0,0 +1,129 @@
 from pathlib import Path
 from typing import Any, TypedDict
 import qai_hub.hub as hub
 from qai_hub.client import CompileJob, Device, InferenceJob, Model, ProfileJob, QuantizeDtype, QuantizeJob
 class ModelJobResult(TypedDict):
    job: CompileJob | QuantizeJob
    job_id: str
    model: Model
    model_id: str
 class InferenceJobResult(TypedDict):
    job: InferenceJob
    job_id: str
    outputs: Any
 class ProfileJobResult(TypedDict):
    job: ProfileJob
    job_id: str
 def _dataset_entries(inputs: dict[str, Any]) -> dict[str, list[Any]]:
    return {name: value if isinstance(value, list) else [value] for name, value in inputs.items()}
 def submit_compile_job(
    model: Any,
    device: Device,
    input_specs: dict[str, tuple[tuple[int, ...], str]],
    target_runtime: str,
    options: str | None = None,
    job_name: str | None = None,
    model_name: str | None = None,
 ) -> ModelJobResult:
    compile_options = f"--target_runtime {target_runtime}"
    if options:
        compile_options = f"{compile_options} {options}"
    model_arg = model
    if isinstance(model, Path):
        model_arg = str(model)
    elif isinstance(model, str):
        candidate = Path(model)
        model_arg = model if candidate.exists() or candidate.suffix else hub.get_model(model)
    if model_name and isinstance(model_arg, str) and Path(model_arg).exists():
        model_arg = hub.upload_model(model_arg, name=model_name)
    job = hub.submit_compile_job(
        model=model_arg,
        device=device,
        name=job_name,
        input_specs=input_specs,
        options=compile_options,
    )
    target_model = job.get_target_model()
    if target_model is None:
        raise RuntimeError(f"Compile job {job.job_id} did not produce a target model.")
    return {"job": job, "job_id": str(job.job_id), "model": target_model, "model_id": str(target_model.model_id)}
 def submit_inference_job(
    model_id: str,
    device: Device,
    inputs: dict[str, Any],
    output_dir: str | Path,
    job_name: str | None = None,
 ) -> InferenceJobResult:
    job = hub.submit_inference_job(
        model=hub.get_model(model_id),
        device=device,
        inputs=_dataset_entries(inputs),
        name=job_name,
    )
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    data = job.download_output_data(str(out))
    return {"job": job, "job_id": str(job.job_id), "outputs": data}
 def submit_profile_job(
    model_id: str,
    device: Device,
    options: str | None = None,
    job_name: str | None = None,
 ) -> ProfileJobResult:
    job = hub.submit_profile_job(
        model=hub.get_model(model_id),
        device=device,
        name=job_name,
        options=options or "",
    )
    return {"job": job, "job_id": str(job.job_id)}
 def submit_quantize_job(
    model: str | Path,
    calibration_data: dict[str, Any],
    options: str | None = None,
    job_name: str | None = None,
    model_name: str | None = None,
 ) -> ModelJobResult:
    model_arg = str(model)
    if model_name and Path(model_arg).exists():
        model_arg = hub.upload_model(model_arg, name=model_name)
    job = hub.submit_quantize_job(
        model=model_arg,
        calibration_data=_dataset_entries(calibration_data),
        weights_dtype=QuantizeDtype.INT8,
        activations_dtype=QuantizeDtype.INT8,
        name=job_name,
        options=options or "",
    )
    target_model = job.get_target_model()
    if target_model is None:
        raise RuntimeError(f"Quantize job {job.job_id} did not produce a target model.")
    return {"job": job, "job_id": str(job.job_id), "model": target_model, "model_id": str(target_model.model_id)}
 def download_model(model_id: str, output_path: str | Path) -> str:
    dest = Path(output_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    model = hub.get_model(model_id)
    result = model.download(str(dest))
    return str(result or dest)
--- a/src/qualcomm/artifacts.py
+++ b/src/qualcomm/artifacts.py
@@ -0,0 +1,83 @@
 import tarfile
 from dataclasses import dataclass
 from pathlib import Path
 from src.aws import s3 as s3_ops
 from src.aws import sagemaker as sm_ops
 from src.config import Config
@dataclass(frozen=True)
 class ResolvedOnnx:
    onnx_path: Path
    model_artifact: str | None
    run_name: str
 def _safe_extract(tar: tarfile.TarFile, dest: Path) -> None:
    dest_root = dest.resolve()
    for member in tar.getmembers():
        target = (dest / member.name).resolve()
        if dest_root != target and dest_root not in target.parents:
            raise ValueError(f"Unsafe tar member path: {member.name}")
    tar.extractall(dest, filter="data")
 def _find_onnx(root: Path, explicit: str | None = None) -> Path:
    if explicit:
        p = Path(explicit)
        if not p.is_absolute():
            p = root / p
        if not p.exists():
            raise FileNotFoundError(f"ONNX file not found: {p}")
        return p
    matches = sorted(root.rglob("model.onnx"))
    if not matches:
        matches = sorted(root.rglob("*.onnx"))
    if not matches:
        raise FileNotFoundError(f"No ONNX file found under {root}")
    if len(matches) > 1:
        joined = ", ".join(str(p.relative_to(root)) for p in matches)
        raise ValueError(f"Multiple ONNX files found ({joined}). Pass --onnx-path.")
    return matches[0]
 def resolve_onnx(
    cfg: Config,
    output_dir: str,
    from_job: str | None = None,
    model_s3_uri: str | None = None,
    onnx_path: str | None = None,
    last_training_job: str | None = None,
 ) -> ResolvedOnnx:
    if onnx_path:
        path = Path(onnx_path)
        if path.exists():
            return ResolvedOnnx(onnx_path=path, model_artifact=None, run_name=path.stem)
    job = from_job or last_training_job
    artifact = model_s3_uri
    if not artifact:
        if not job:
            raise ValueError("No model source found. Pass --onnx-path, --model-s3-uri, --from-job, or run training first.")
        artifact = sm_ops.get_model_artifacts(cfg.aws.region, cfg.aws.profile, job)
    run_name = job or Path(artifact).name.removesuffix(".tar.gz").replace("/", "-")
    root = Path(output_dir) / run_name / "source"
    tar_path = root / "model.tar.gz"
    s3_ops.download_file(cfg.aws.region, cfg.aws.profile, artifact, str(tar_path))
    extract_dir = root / "extracted"
    extract_dir.mkdir(parents=True, exist_ok=True)
    try:
        with tarfile.open(tar_path, "r:gz") as tar:
            _safe_extract(tar, extract_dir)
    except tarfile.TarError as e:
        raise ValueError(f"Invalid model tarball: {tar_path}") from e
    return ResolvedOnnx(
        onnx_path=_find_onnx(extract_dir, onnx_path),
        model_artifact=artifact,
        run_name=run_name,
    )
--- a/src/state.py
+++ b/src/state.py
@@ -1,30 +1,81 @@
 import json
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any
 STATE_FILE = ".qc-cli.json"
-def _path(config_dir: str) -> Path:
+@dataclass(frozen=True)
-    return Path(config_dir) / STATE_FILE
+class CliStateStore:
    config_dir: str = "."
    @property
    def path(self) -> Path:
        return Path(self.config_dir) / STATE_FILE
-def read_state(config_dir: str = ".") -> dict[str, Any]:
+    def read(self) -> dict[str, Any]:
-    path = _path(config_dir)
+        if not self.path.exists():
    if not path.exists():
            return {}
-    with open(path) as f:
+        with open(self.path) as f:
-        return json.load(f)
+            value = json.load(f)
        return dict(value) if isinstance(value, dict) else {}
-
+    def update(self, **updates: Any) -> None:
-def write_state(config_dir: str = ".", **updates: str | None) -> None:
+        state = self.read()
    path = _path(config_dir)
    state = read_state(config_dir)
        state.update(updates)
-    with open(path, "w") as f:
+        self._write(state)
    def get(self, key: str, default: Any = None) -> Any:
        return self.read().get(key, default)
    def get_last_training_job(self) -> str | None:
        value = self.get("last_training_job")
        return str(value) if value else None
    def get_last_model_artifact(self) -> str | None:
        value = self.get("last_model_artifact")
        return str(value) if value else None
    def get_last_quantized_model_id(self) -> str | None:
        value = self.get("last_quantized_model_id")
        return str(value) if value else None
    def get_last_compiled_model_id(self) -> str | None:
        value = self.get("last_compiled_model_id")
        return str(value) if value else None
    def get_last_downloaded_model(self) -> str | None:
        value = self.get("last_downloaded_model")
        return str(value) if value else None
    def set_last_training_job(self, job_name: str) -> None:
        self.update(last_training_job=job_name)
    def get_training_job(self, job_name: str) -> dict[str, Any]:
        jobs = self._training_jobs(self.read())
        value = jobs.get(job_name, {})
        return dict(value) if isinstance(value, dict) else {}
    def update_training_job(self, job_name: str, **updates: Any) -> None:
        state = self.read()
        jobs = self._training_jobs(state)
        jobs[job_name] = {**jobs.get(job_name, {}), **updates}
        state["training_jobs"] = jobs
        self._write(state)
    def set_latest_experiment_model_version(self, version: str) -> None:
        self.update(latest_experiment_model_version=version)
    def _write(self, state: dict[str, Any]) -> None:
        with open(self.path, "w") as f:
            json.dump(state, f, indent=2)
    def _training_jobs(self, state: dict[str, Any]) -> dict[str, Any]:
        value = state.get("training_jobs", {})
        return dict(value) if isinstance(value, dict) else {}
-def get_last_training_job(config_dir: str = ".") -> str | None:
+
-    value = read_state(config_dir).get("last_training_job")
+def store(config_path: str) -> CliStateStore:
-    return str(value) if value else None
+    config_dir = str(Path(config_path).parent)
    return CliStateStore(config_dir)
--- a/src/tracking/init.py
+++ b/src/tracking/init.py
@@ -0,0 +1,3 @@
 from src.tracking.mlflow import MlflowTracker, NoopTracker, Tracker
 __all__ = ["MlflowTracker", "NoopTracker", "Tracker"]
--- a/src/tracking/mlflow.py
+++ b/src/tracking/mlflow.py
@@ -0,0 +1,153 @@
 import os
 from dataclasses import dataclass
 from typing import Any, Protocol
 import mlflow
 from mlflow.tracking import MlflowClient
 from src.aws import mlflow as aws_mlflow
 from src.config import Config, MlflowMode
 class Tracker(Protocol):
    def start_training_run(self, training_job: Any, *, region: str, profile: str, role_arn: str) -> str | None: ...
    def finalize_training_run(self, *, run_id: str | None, training_job_status: Any) -> str | None: ...
@dataclass(frozen=True)
 class NoopTracker:
    def start_training_run(self, training_job: Any, *, region: str, profile: str, role_arn: str) -> str | None:
        return None
    def finalize_training_run(self, *, run_id: str | None, training_job_status: Any) -> str | None:
        return None
@dataclass(frozen=True)
 class MlflowTracker:
    tracking_uri: str
    experiment_name: str
    registered_model_name: str
    register_trained_models: bool
    @classmethod
    def from_config(cls, cfg: Config) -> Tracker:
        if cfg.mlflow.mode is MlflowMode.disabled:
            return NoopTracker()
        os.environ.setdefault("MLFLOW_SUPPRESS_PRINTING_URL_TO_STDOUT", "true")
        tracking_server_name = cfg.effective_mlflow_tracking_server_name
        if not tracking_server_name:
            raise RuntimeError("MLflow tracking server name could not be resolved.")
        tracking_uri = aws_mlflow.get_tracking_server_arn(
            cfg.aws.region,
            cfg.aws.profile,
            tracking_server_name,
        )
        mlflow.set_tracking_uri(tracking_uri)
        mlflow.set_experiment(cfg.mlflow.experiment_name)
        return cls(
            tracking_uri=tracking_uri,
            experiment_name=cfg.mlflow.experiment_name,
            registered_model_name=cfg.mlflow.registered_model_name,
            register_trained_models=cfg.mlflow.register_trained_models,
        )
    def start_training_run(self, training_job: Any, *, region: str, profile: str, role_arn: str) -> str | None:
        run = mlflow.start_run(run_name=training_job.job_name)
        run_id = str(run.info.run_id)
        params = {
            "aws.region": region,
            "aws.profile": profile,
            "sagemaker.role_arn": role_arn,
            "sagemaker.job_name": training_job.job_name,
            "sagemaker.training_image": training_job.image_uri,
            "sagemaker.instance_type": training_job.instance_type,
            "sagemaker.instance_count": training_job.instance_count,
            "sagemaker.s3_train_uri": training_job.s3_train_uri,
            "sagemaker.s3_output_path": training_job.s3_output_path,
            "sagemaker.entry_point": training_job.entry_point,
            "sagemaker.source_dir": training_job.source_dir,
        }
        self._log_params(params)
        self._log_params({f"hyperparameters.{key}": value for key, value in training_job.hyperparameters.items()})
        mlflow.set_tags(
            {
                "qc_cli.stage": "experiment",
                "qc_cli.artifact_kind": "trained_source",
                "qc_cli.source": "sagemaker",
                "qc_cli.command": "train start",
                "sagemaker.job_name": training_job.job_name,
            }
        )
        mlflow.end_run()
        return run_id
    def finalize_training_run(self, *, run_id: str | None, training_job_status: Any) -> str | None:
        if not run_id:
            return None
        with mlflow.start_run(run_id=run_id):
            self._log_params(
                {
                    "sagemaker.training_status": training_job_status.status,
                    "sagemaker.created_at": training_job_status.created,
                    "sagemaker.modified_at": training_job_status.modified,
                    "sagemaker.model_artifacts": training_job_status.model_artifacts,
                    "sagemaker.failure_reason": training_job_status.failure_reason,
                }
            )
            self._log_final_metrics(training_job_status.raw)
            mlflow.set_tag("qc_cli.command", "train status")
            if training_job_status.status != "Completed" or not training_job_status.model_artifacts:
                mlflow.set_tag("qc_cli.training_terminal_status", training_job_status.status)
                return None
            if not self.register_trained_models:
                return None
            client = MlflowClient()
            self._ensure_registered_model(client, self.registered_model_name)
            version = client.create_model_version(
                name=self.registered_model_name,
                source=training_job_status.model_artifacts,
                run_id=run_id,
                tags={
                    "qc_cli.stage": "experiment",
                    "qc_cli.artifact_kind": "trained_source",
                    "qc_cli.source": "sagemaker",
                    "sagemaker.job_name": training_job_status.name,
                },
            )
            version_number = str(version.version)
            client.set_registered_model_alias(self.registered_model_name, "experiment-latest", version_number)
            mlflow.set_tag("qc_cli.registered_model_name", self.registered_model_name)
            mlflow.set_tag("qc_cli.registered_model_version", version_number)
            return version_number
    def _log_params(self, params: dict[str, Any]) -> None:
        cleaned = {key: str(value) for key, value in params.items() if value is not None}
        if cleaned:
            mlflow.log_params(cleaned)
    def _log_final_metrics(self, training_job: dict[str, Any]) -> None:
        metrics = {}
        for metric in training_job.get("FinalMetricDataList", []):
            name = metric.get("MetricName")
            value = metric.get("Value")
            if name and value is not None:
                metrics[str(name)] = float(value)
        if metrics:
            mlflow.log_metrics(metrics)
    def _ensure_registered_model(self, client: MlflowClient, name: str) -> None:
        try:
            client.get_registered_model(name)
        except Exception:
            client.create_registered_model(name)
--- a/uv.lock
+++ b/uv.lock
Author	SHA1	Message	Date
samirodr	5360a482fc	update	2026-06-08 14:59:44 -04:00
samirodr	6a560a8610	match	2026-06-08 14:54:13 -04:00
slalom	d244150d98	move mlflow to its own command	2026-06-05 11:47:38 -04:00
slalom	d7c7158464	clean main file	2026-06-05 11:25:04 -04:00
slalom	6bc25dc183	restructure config to use Device class directly Also include device validation	2026-06-04 17:28:17 -04:00
samirodr	71a95aa3a7	update description	2026-06-03 17:13:00 -04:00
slalom	a3f3060e13	ai-hub (#3 ) Reviewed-on: #3	2026-06-03 21:06:06 +00:00
slalom	e9ada2612f	Mlflow implementation (#2 ) Reviewed-on: #2	2026-06-02 19:04:23 +00:00
slalom	6ac9702dc5	make sure resources are set up in isolated namespaces (#1 ) Reviewed-on: #1	2026-05-27 12:51:26 +00:00
		`@@ -0,0 +1,3 @@`
							`from src.tracking.mlflow import MlflowTracker, NoopTracker, Tracker`

							`__all__ = ["MlflowTracker", "NoopTracker", "Tracker"]`