This commit is contained in:
2026-06-12 14:34:45 -04:00
parent a1ffbb77c5
commit b56b77330c


@@ -105,11 +105,7 @@ mlflow:
tracking_server_name: your-tracking-server-name
```
When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. Metric upload through
`train start --upload-metrics` or `mlflow upload-metrics` finalizes that run and registers completed model artifacts
as experiment model versions using the `experiment-latest` MLflow alias. `train status` reads SageMaker status only.
An experiment version is an immutable trained-source artifact; it records that training produced a model, not that
the model is better than earlier versions or ready for release.
When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. Training metrics can be uploaded with `train start --upload-metrics` or `mlflow upload-metrics`.
To open the managed SageMaker MLflow UI, request a fresh presigned URL:
@@ -129,17 +125,6 @@ qc-cli init --output <path> Write config to a custom path
qc-cli init --force Overwrite an existing config file
```
### `mlflow`
```
qc-cli mlflow open Open a presigned MLflow UI URL
qc-cli mlflow upload-metrics [job-name] Upload completed training metrics
```
`mlflow upload-metrics` defaults to the last submitted training job. It creates or recovers the job's MLflow run,
imports `training_metrics.json` from the SageMaker model artifact, and records successful upload in `.qc-cli.json`.
Use `--force` to upload the metrics again.
### `infra`
```
@@ -158,6 +143,15 @@ qc-cli infra destroy --delete-bucket-data Destroy stack and delete S3 data
qc-cli infra setup --cloudformation-execution-policy arn:aws:iam::aws:policy/PowerUserAccess
```
### `mlflow`
```
qc-cli mlflow open Open a presigned MLflow UI URL
qc-cli mlflow upload-metrics [job-name] Upload completed training metrics
```
`mlflow upload-metrics` defaults to the last submitted training job. It creates or recovers the job's MLflow run, imports `training_metrics.json` from the SageMaker model artifact, and records successful upload in `.qc-cli.json`. Use `--force` to upload the metrics again.
### `upload`
```
@@ -197,9 +191,7 @@ qc-cli ai-hub profile [--model-id ID]
qc-cli ai-hub download [--model-id ID] [--output PATH]
```
`ai-hub upload` optimizes to ONNX, quantizes, validates, and profiles. When `aihub.target_runtime` is not `onnx`, it
also compiles the quantized model to that deployment runtime. The initial ONNX optimization gives external models
Workbench provenance and applies compiler optimization passes before quantization.
`ai-hub upload` optimizes to ONNX, quantizes, validates, and profiles. When `aihub.target_runtime` is not `onnx`, it also compiles the quantized model to that deployment runtime. The initial ONNX optimization gives external models Workbench provenance and applies compiler optimization passes before quantization.
Resume behavior:
@@ -213,11 +205,7 @@ Resume behavior:
When a step runs in the current command, `upload` passes its returned model ID directly to the next step. When a step is skipped, the next step resolves the needed model ID from `.qc-cli.json`. This avoids re-running earlier AI Hub jobs when you only need to continue from a later step.
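The hand-off described above can be sketched roughly as follows. This is an illustrative sketch, not the CLI's actual implementation: the function name `run_pipeline`, the flat `state` mapping from step name to model ID, and the `skip` set are all assumptions made for the example, and the real `.qc-cli.json` layout may differ.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical step signature: takes the previous model ID, returns a new one.
Step = Callable[[Optional[str]], str]


def run_pipeline(
    steps: List[Tuple[str, Step]],
    state: Dict[str, str],
    skip: Set[str],
) -> Optional[str]:
    """Run steps in order, resuming from saved state for skipped steps."""
    current_id: Optional[str] = None
    for name, step in steps:
        if name in skip:
            # Skipped step: the following step resolves its input from the
            # saved state (assumed here to mirror .qc-cli.json).
            if name not in state:
                raise RuntimeError(f"no saved model ID for skipped step {name!r}")
            current_id = state[name]
            continue
        # Step runs now: pass the previous model ID in directly and record
        # the result so a later invocation can resume from it.
        current_id = step(current_id)
        state[name] = current_id
    return current_id
```

With this shape, skipping `optimize` while a saved optimized model ID exists lets `quantize` start from that ID without re-running the earlier AI Hub job.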
`ai-hub optimize` compiles an external model with `--target_runtime onnx`. `ai-hub quantize` uses an explicit
`--model-id`, the last optimized ONNX model, or an explicit/local model source in that order. `ai-hub compile` resolves
model sources in this order: `--model-id`, explicit source options, last quantized model, then the last training job.
For `target_runtime: onnx`, upload treats the quantized ONNX as the final model and skips a redundant second compile.
`ai-hub download` remains separate because downloading is outside the Workbench processing loop.
`ai-hub optimize` compiles an external model with `--target_runtime onnx`. `ai-hub quantize` uses an explicit `--model-id`, the last optimized ONNX model, or an explicit/local model source in that order. `ai-hub compile` resolves model sources in this order: `--model-id`, explicit source options, last quantized model, then the last training job. For `target_runtime: onnx`, upload treats the quantized ONNX as the final model and skips a redundant second compile. `ai-hub download` remains separate because downloading is outside the Workbench processing loop.
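The `ai-hub compile` precedence above amounts to a first-match lookup. The sketch below illustrates only that ordering; the function name, parameter names, and returned labels are hypothetical and are not taken from the CLI source.

```python
from typing import Optional, Tuple


def resolve_compile_source(
    model_id: Optional[str] = None,
    explicit_source: Optional[str] = None,
    last_quantized_model: Optional[str] = None,
    last_training_job: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (kind, value) for the first available source, in priority order."""
    candidates = (
        ("model-id", model_id),                      # --model-id wins
        ("explicit-source", explicit_source),        # explicit source options
        ("last-quantized-model", last_quantized_model),
        ("last-training-job", last_training_job),    # final fallback
    )
    for kind, value in candidates:
        if value:
            return kind, value
    raise ValueError("no model source available for compile")
```

Returning the kind alongside the value makes it easy to log which source was chosen, which helps when a stale saved model ID is picked up unexpectedly.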
AI Hub authentication currently uses the local `qai-hub` SDK configuration. A planned follow-up is to support AWS Systems Manager Parameter Store `SecureString` for team-managed tokens, where `config.yaml` stores only a parameter name such as `/qc-cli/aihub/token`, AWS KMS encrypts the token at rest, and the CLI retrieves it at runtime with `ssm:GetParameter` plus `kms:Decrypt` permissions.
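For the planned Parameter Store approach, token retrieval could look roughly like the sketch below. It assumes a boto3 SSM client is passed in; `fetch_aihub_token` is a hypothetical helper name, and nothing here reflects code that exists in the CLI today.

```python
def fetch_aihub_token(ssm_client, parameter_name: str) -> str:
    """Read a SecureString parameter and return its decrypted value.

    `ssm_client` is expected to behave like boto3.client("ssm").
    WithDecryption=True asks SSM to decrypt the value with KMS, which is
    why the caller needs kms:Decrypt in addition to ssm:GetParameter.
    """
    response = ssm_client.get_parameter(Name=parameter_name, WithDecryption=True)
    return response["Parameter"]["Value"]
```

With boto3 installed and credentials configured, a call might be `fetch_aihub_token(boto3.client("ssm"), "/qc-cli/aihub/token")`. Passing the client in keeps the helper easy to test with a stub.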
@@ -238,9 +226,7 @@ Current behavior:
6. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
7. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
Training scripts can include a `training_metrics.json` file in the SageMaker model directory. When present, the
explicit metrics upload command logs its ordered metrics to the associated MLflow run using each epoch as the MLflow
step and stores the JSON as a run artifact:
Training scripts can include a `training_metrics.json` file in the SageMaker model directory. When present, the explicit metrics upload command logs its ordered metrics to the associated MLflow run using each epoch as the MLflow step and stores the JSON as a run artifact:
```json
{
@@ -252,10 +238,7 @@ step and stores the JSON as a run artifact:
}
```
Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative, unique, and
strictly increasing. If the file is missing, the command uploads the final metrics reported by SageMaker and continues
model registration without per-epoch history. A malformed metrics artifact still fails the upload command without
affecting the trained model or model registration.
Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative, unique, and strictly increasing. If the file is missing, the command uploads the final metrics reported by SageMaker and continues model registration without per-epoch history. A malformed metrics artifact still fails the upload command without affecting the trained model or model registration.
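The validation rules above can be expressed as a small check. This is a sketch under stated assumptions: it takes already-parsed `(step, metrics)` pairs rather than reading `training_metrics.json`, because the file's exact schema is not shown here, and `validate_history` is a hypothetical name.

```python
import math
from typing import Iterable, Mapping, Optional, Tuple


def validate_history(history: Iterable[Tuple[int, Mapping[str, float]]]) -> None:
    """Raise ValueError if any step or metric violates the stated rules."""
    previous_step: Optional[int] = None
    for step, metrics in history:
        # Steps must be non-negative integers (bool is excluded explicitly
        # because it is a subclass of int in Python).
        if isinstance(step, bool) or not isinstance(step, int) or step < 0:
            raise ValueError(f"invalid step: {step!r}")
        # Strictly increasing also guarantees uniqueness.
        if previous_step is not None and step <= previous_step:
            raise ValueError(f"step {step} does not follow {previous_step}")
        previous_step = step
        for name, value in metrics.items():
            if not isinstance(name, str) or not name:
                raise ValueError(f"invalid metric name: {name!r}")
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            # math.isfinite rejects NaN and positive or negative infinity.
            if not is_number or not math.isfinite(value):
                raise ValueError(f"metric {name!r} is not a finite number: {value!r}")
```

Failing fast on the first violation matches the described behavior, where a malformed metrics artifact fails the upload command rather than logging partial history.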
Future release aliases such as `v1` or `production` can point at a selected deployable artifact.