include-metrics-from-training (#6)

Reviewed-on: #6
This commit was merged in pull request #6.
2026-06-12 18:23:25 +00:00
parent 522ddc74e2
commit a1ffbb77c5
13 changed files with 785 additions and 116 deletions

@@ -105,7 +105,11 @@ mlflow:
tracking_server_name: your-tracking-server-name
```
When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. `train status` finalizes that run once the job reaches a terminal state and registers completed model artifacts as experiment model versions using the `experiment-latest` MLflow alias. An experiment version is an immutable trained-source artifact; it records that training produced a model, not that the model is better than earlier versions or ready for release.
When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. Metric upload through
`train start --upload-metrics` or `mlflow upload-metrics` finalizes that run and registers completed model artifacts
as experiment model versions using the `experiment-latest` MLflow alias. `train status` reads SageMaker status only.
An experiment version is an immutable trained-source artifact; it records that training produced a model, not that
the model is better than earlier versions or ready for release.
To open the managed SageMaker MLflow UI, request a fresh presigned URL:
@@ -128,9 +132,14 @@ qc-cli init --force Overwrite an existing config file
### `mlflow`
```
qc-cli mlflow open Open a presigned MLflow UI URL in a browser
qc-cli mlflow open Open a presigned MLflow UI URL
qc-cli mlflow upload-metrics [job-name] Upload completed training metrics
```
`mlflow upload-metrics` defaults to the last submitted training job. It creates or recovers the job's MLflow run,
imports `training_metrics.json` from the SageMaker model artifact, and records successful upload in `.qc-cli.json`.
Use `--force` to upload the metrics again.
### `infra`
```
@@ -163,6 +172,7 @@ Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads de
```
qc-cli train start Submit a SageMaker training job
qc-cli train start --upload-metrics Submit, wait, and upload metrics
qc-cli train status [job-name] Show job status; defaults to the last submitted job
qc-cli train list List recent training jobs
qc-cli train list --limit 3 Show a custom number of recent jobs
@@ -170,6 +180,8 @@ qc-cli train list --limit 3 Show a custom number of recent jobs
`train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.
`train start --upload-metrics` checks SageMaker every 30 seconds by default, then uploads metrics after completion. Use `--poll-interval <seconds>` to choose another positive interval. Stopping the local command does not stop the SageMaker job.
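The polling behavior described above can be sketched as follows. This is a minimal illustration, not the CLI's actual code: `wait_for_training_job`, `get_status`, and the injectable `sleep` are hypothetical names chosen here, and the terminal status set assumes the values SageMaker reports in `TrainingJobStatus`.

```python
import time
from typing import Callable

# Assumed terminal values of SageMaker's TrainingJobStatus field.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(
    get_status: Callable[[], str],
    poll_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it returns a terminal status, then return it.

    get_status is a hypothetical callable that returns the job's current
    status string; sleep is injectable so the loop can be tested quickly.
    """
    if poll_interval <= 0:
        raise ValueError("poll_interval must be a positive number of seconds")
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Interrupting this wait only ends the local loop; the remote job
        # keeps running because nothing here asks SageMaker to stop it.
        sleep(poll_interval)
```

With boto3, `get_status` could wrap `describe_training_job(TrainingJobName=...)["TrainingJobStatus"]` on a SageMaker client; metrics would be uploaded only if the returned status is `Completed`.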
The expected output artifact is SageMaker's `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.
### `ai-hub`
@@ -216,13 +228,34 @@ The CLI uses neutral experiment naming for trained artifacts and reserves releas
Current behavior:
1. `qc-cli train start` submits a SageMaker training job.
2. `qc-cli train status` finalizes the MLflow run after the job reaches a terminal state.
3. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
2. `qc-cli train status` reads and displays SageMaker status only; it does not contact MLflow.
3. `qc-cli train start --upload-metrics` polls every 30 seconds by default, then uploads per-epoch metrics after completion.
4. `qc-cli mlflow upload-metrics [job-name]` uploads or retries metrics for an existing completed job.
5. The metrics upload workflow finalizes the MLflow run and, when `mlflow.register_trained_models` is enabled, registers the SageMaker `model.tar.gz` as a new MLflow model version with:
- `qc_cli.stage=experiment`
- `qc_cli.artifact_kind=trained_source`
- `qc_cli.source=sagemaker`
4. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
5. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
6. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
7. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
Training scripts can include a `training_metrics.json` file in the SageMaker model directory. When present, the
explicit metrics upload command logs its ordered metrics to the associated MLflow run using each epoch as the MLflow
step and stores the JSON as a run artifact:
```json
{
"schema_version": 1,
"steps": [
{"step": 0, "metrics": {"val.precision": 0.72, "val.recall": 0.68}}
],
"summary": {"summary.best_epoch": 0}
}
```
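To illustrate how a file of this shape maps onto per-step MLflow metrics, here is a small sketch that flattens the `steps` array into `(name, value, step)` points. The helper name `iter_metric_points` is hypothetical and this is not the CLI's implementation; each yielded point could be passed to `mlflow.log_metric(name, value, step=step)` inside an active run.

```python
from typing import Iterator, Tuple


def iter_metric_points(document: dict) -> Iterator[Tuple[str, float, int]]:
    """Yield (metric_name, value, step) for every metric in document["steps"].

    Assumes the document follows the shape shown above: a "steps" list whose
    entries carry an integer "step" and a "metrics" mapping.
    """
    for entry in document["steps"]:
        step = entry["step"]
        for name, value in entry["metrics"].items():
            yield name, float(value), step


example = {
    "schema_version": 1,
    "steps": [
        {"step": 0, "metrics": {"val.precision": 0.72, "val.recall": 0.68}}
    ],
    "summary": {"summary.best_epoch": 0},
}

for name, value, step in iter_metric_points(example):
    # In a real upload these would go to mlflow.log_metric(name, value, step=step).
    print(f"step={step} {name}={value}")
```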
Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative and strictly
increasing (and therefore unique). If the file is missing, the command uploads the final metrics reported by SageMaker
and continues with model registration, without per-epoch history. A malformed metrics artifact fails the upload
command but does not affect the trained model or model registration.
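Training scripts may want to check these rules before writing the file. The following is a minimal sketch of such a check, under assumptions: `validate_training_metrics` is a hypothetical helper, it treats steps as integers, and it does not inspect `summary` or `schema_version` because the rules above do not cover them.

```python
import math
from numbers import Real


def validate_training_metrics(document: dict) -> None:
    """Raise ValueError if document["steps"] breaks the stated rules."""
    steps = document.get("steps")
    if not isinstance(steps, list):
        raise ValueError("'steps' must be a list")

    previous_step = -1
    for entry in steps:
        if not isinstance(entry, dict):
            raise ValueError("each step entry must be an object")
        step = entry.get("step")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(step, bool) or not isinstance(step, int) or step < 0:
            raise ValueError(f"step must be a non-negative integer, got {step!r}")
        # Strictly increasing steps are also unique.
        if step <= previous_step:
            raise ValueError(f"step {step} is not greater than {previous_step}")
        previous_step = step

        metrics = entry.get("metrics")
        if not isinstance(metrics, dict):
            raise ValueError(f"step {step} is missing a 'metrics' object")
        for name, value in metrics.items():
            if not isinstance(name, str) or not name:
                raise ValueError(f"step {step} has an empty or non-string metric name")
            if isinstance(value, bool) or not isinstance(value, Real) or not math.isfinite(value):
                raise ValueError(f"metric {name!r} at step {step} is not a finite number")
```

Calling this on the parsed JSON just before writing `training_metrics.json` surfaces problems during training rather than at upload time.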
Future release aliases such as `v1` or `production` can point at a selected deployable artifact.