include-metrics-from-training (#6)

Reviewed-on: #6
2026-06-12 18:23:25 +00:00
parent 522ddc74e2
commit a1ffbb77c5
13 changed files with 785 additions and 116 deletions
--- a/README.md
+++ b/README.md
@@ -105,7 +105,11 @@ mlflow:
  tracking_server_name: your-tracking-server-name
 ```

-When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. `train status` finalizes that run once the job reaches a terminal state and registers completed model artifacts as experiment model versions using the `experiment-latest` MLflow alias. An experiment version is an immutable trained-source artifact; it records that training produced a model, not that the model is better than earlier versions or ready for release.
+When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. Metric upload through
+`train start --upload-metrics` or `mlflow upload-metrics` finalizes that run and registers completed model artifacts
+as experiment model versions using the `experiment-latest` MLflow alias. `train status` reads SageMaker status only.
+An experiment version is an immutable trained-source artifact; it records that training produced a model, not that
+the model is better than earlier versions or ready for release.

 To open the managed SageMaker MLflow UI, request a fresh presigned URL:

@@ -128,9 +132,14 @@ qc-cli init --force          Overwrite an existing config file
 ### `mlflow`

 ```
-qc-cli mlflow open  Open a presigned MLflow UI URL in a browser
+qc-cli mlflow open                       Open a presigned MLflow UI URL
+qc-cli mlflow upload-metrics [job-name] Upload completed training metrics
 ```

+`mlflow upload-metrics` defaults to the last submitted training job. It creates or recovers the job's MLflow run,
+imports `training_metrics.json` from the SageMaker model artifact, and records a successful upload in `.qc-cli.json`.
+Use `--force` to upload the metrics again.
+
 ### `infra`

 ```
@@ -163,6 +172,7 @@ Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads de

 ```
 qc-cli train start              Submit a SageMaker training job
+qc-cli train start --upload-metrics  Submit, wait, and upload metrics
 qc-cli train status [job-name]  Show job status; defaults to the last submitted job
 qc-cli train list               List recent training jobs
 qc-cli train list --limit 3     Show a custom number of recent jobs
@@ -170,6 +180,8 @@ qc-cli train list --limit 3     Show a custom number of recent jobs

 `train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.

+`train start --upload-metrics` checks SageMaker every 30 seconds by default, then uploads metrics after completion. Use `--poll-interval <seconds>` to choose another positive interval. Stopping the local command does not stop the SageMaker job.
+
 The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.

 ### `ai-hub`
@@ -216,13 +228,34 @@ The CLI uses neutral experiment naming for trained artifacts and reserves releas
 Current behavior:

 1. `qc-cli train start` submits a SageMaker training job.
-2. `qc-cli train status` finalizes the MLflow run after the job reaches a terminal state.
-3. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
+2. `qc-cli train status` reads and displays SageMaker status only; it does not contact MLflow.
+3. `qc-cli train start --upload-metrics` polls every 30 seconds by default, then uploads per-epoch metrics after completion.
+4. `qc-cli mlflow upload-metrics [job-name]` uploads or retries metrics for an existing completed job.
+5. The metrics upload workflow finalizes the MLflow run and, when `mlflow.register_trained_models` is enabled, registers the SageMaker `model.tar.gz` as a new MLflow model version with:
   - `qc_cli.stage=experiment`
   - `qc_cli.artifact_kind=trained_source`
   - `qc_cli.source=sagemaker`
-4. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
-5. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
+6. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
+7. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
+
+Training scripts can include a `training_metrics.json` file in the SageMaker model directory. When present, the
+explicit metrics upload command logs the file's ordered metrics to the associated MLflow run, using each epoch as the MLflow
+step and stores the JSON as a run artifact:
+
+```json
+{
+  "schema_version": 1,
+  "steps": [
+    {"step": 0, "metrics": {"val.precision": 0.72, "val.recall": 0.68}}
+  ],
+  "summary": {"summary.best_epoch": 0}
+}
+```
+
+Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative, unique, and
+strictly increasing. If the file is missing, the command uploads the final metrics reported by SageMaker and proceeds
+with model registration, without per-epoch history. A malformed metrics artifact fails the upload command but does not
+affect the trained model or model registration.

 Future release aliases such as `v1` or `production` can point at a selected deployable artifact.
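
The `training_metrics.json` rules described in this change can be checked locally before a training script writes the file. The following is a minimal Python sketch of those rules, not qc-cli's own validation code: `validate_training_metrics` is a hypothetical helper, and it assumes that steps are integers and that booleans do not count as numbers, neither of which the README states explicitly.

```python
import json
import math


def validate_training_metrics(doc):
    """Return a list of problems found in a parsed training_metrics.json document.

    Hypothetical helper that mirrors the README rules; an empty list means
    no problems were found.
    """
    if not isinstance(doc, dict) or not isinstance(doc.get("steps"), list):
        return ["'steps' must be a list"]

    problems = []
    previous_step = None
    for index, entry in enumerate(doc["steps"]):
        if not isinstance(entry, dict):
            problems.append(f"steps[{index}]: entry must be an object")
            continue

        step = entry.get("step")
        # Assumption: steps are integers. bool is a subclass of int, so reject it.
        if isinstance(step, bool) or not isinstance(step, int) or step < 0:
            problems.append(f"steps[{index}]: step must be a non-negative integer")
            continue
        # Strictly increasing steps are also unique.
        if previous_step is not None and step <= previous_step:
            problems.append(f"steps[{index}]: step {step} is not strictly increasing")
        previous_step = step

        metrics = entry.get("metrics")
        if not isinstance(metrics, dict):
            problems.append(f"steps[{index}]: 'metrics' must be an object")
            continue
        for name, value in metrics.items():
            if not isinstance(name, str) or not name:
                problems.append(f"steps[{index}]: metric names must be non-empty strings")
            elif isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"steps[{index}]: metric {name!r} is not a number")
            elif isinstance(value, float) and not math.isfinite(value):
                problems.append(f"steps[{index}]: metric {name!r} is not finite")
    return problems


# The example document from the README passes the checks.
example = json.loads("""
{
  "schema_version": 1,
  "steps": [
    {"step": 0, "metrics": {"val.precision": 0.72, "val.recall": 0.68}}
  ],
  "summary": {"summary.best_epoch": 0}
}
""")
print(validate_training_metrics(example))  # prints []
```

A training script could run a check like this just before writing the file into the SageMaker model directory (`/opt/ml/model`), so that a malformed file is caught during training rather than at upload time.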