WIP

2026-06-12 11:42:26 -04:00
parent 522ddc74e2
commit 2d4d377051
8 changed files with 390 additions and 38 deletions
--- a/README.md
+++ b/README.md
@@ -164,12 +164,15 @@ Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads de
 ```
 qc-cli train start              Submit a SageMaker training job
 qc-cli train status [job-name]  Show job status; defaults to the last submitted job
+qc-cli train wait [job-name]    Wait for completion and finalize MLflow tracking
 qc-cli train list               List recent training jobs
 qc-cli train list --limit 3     Show a custom number of recent jobs
 ```

 `train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.

+`train wait` checks SageMaker every 30 seconds by default. Use `--poll-interval <seconds>` to choose another positive interval. Stopping the local command does not stop the SageMaker job.
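
The polling behaviour described for `train wait` could be sketched roughly as follows. This is a minimal illustration, not the CLI's actual code: `wait_for_job`, `get_status`, and the injectable `sleep` are hypothetical names, and the terminal set assumes SageMaker's `TrainingJobStatus` values.

```python
import time
from typing import Callable

# Terminal values of SageMaker's TrainingJobStatus field
# ("InProgress" and "Stopping" are non-terminal).
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status, then return that status.

    Hypothetical helper: `get_status` is any callable that returns the
    current job status string.
    """
    if poll_interval <= 0:
        raise ValueError("poll interval must be positive")
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Only the local loop sleeps here; interrupting it leaves the
        # remote training job untouched.
        sleep(poll_interval)
```

With boto3, `get_status` might wrap `describe_training_job(TrainingJobName=...)["TrainingJobStatus"]` on a SageMaker client. Passing `sleep` in makes the loop easy to test without real delays.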
+
 The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.

 ### `ai-hub`
@@ -216,7 +219,7 @@ The CLI uses neutral experiment naming for trained artifacts and reserves releas
 Current behavior:

 1. `qc-cli train start` submits a SageMaker training job.
-2. `qc-cli train status` finalizes the MLflow run after the job reaches a terminal state.
+2. `qc-cli train status` or `qc-cli train wait` finalizes the MLflow run after the job reaches a terminal state. `train wait` blocks and polls every 30 seconds by default.
 3. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
   - `qc_cli.stage=experiment`
   - `qc_cli.artifact_kind=trained_source`
@@ -224,6 +227,20 @@ Current behavior:
 4. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
 5. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.

+Training scripts can include a `training_metrics.json` file in the SageMaker model directory. During finalization, the CLI logs the metrics in order to the associated MLflow run, using each entry's `step` (the epoch) as the MLflow step, and stores the JSON file as a run artifact:
+
+```json
+{
+  "schema_version": 1,
+  "steps": [
+    {"step": 0, "metrics": {"val.precision": 0.72, "val.recall": 0.68}}
+  ],
+  "summary": {"summary.best_epoch": 0}
+}
+```
+
+Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative and strictly increasing (so no step may repeat). A missing or malformed `training_metrics.json` produces a warning but does not block model registration.
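
The validation rules above could be checked along these lines. This is a sketch under assumptions, not the CLI's implementation: `validate_metrics` is a hypothetical name, steps are assumed to be integers, and each entry in `steps` is assumed to be a JSON object.

```python
import math


def _is_finite_number(value) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    # Integers are always finite; only floats can be NaN or infinite.
    return not isinstance(value, float) or math.isfinite(value)


def validate_metrics(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    previous_step = -1
    for index, entry in enumerate(doc.get("steps", [])):
        step = entry.get("step")
        # Assumption: steps are integers, as in the example document.
        if isinstance(step, bool) or not isinstance(step, int) or step < 0:
            problems.append(f"steps[{index}]: step must be a non-negative integer")
        elif step <= previous_step:
            problems.append(f"steps[{index}]: step must be strictly increasing")
        else:
            previous_step = step
        for name, value in entry.get("metrics", {}).items():
            if not isinstance(name, str) or not name:
                problems.append(f"steps[{index}]: metric name must be a non-empty string")
            elif not _is_finite_number(value):
                problems.append(f"steps[{index}]: metric {name!r} must be a finite number")
    return problems
```

A caller could log each returned problem as a warning and continue with model registration, which matches the non-blocking behaviour described above.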
+
 Future release aliases such as `v1` or `production` can point at a selected deployable artifact.

 Example future metadata: