another update

2026-06-12 12:17:02 -04:00
parent 53e886a535
commit 5211d0af14
6 changed files with 278 additions and 89 deletions
--- a/README.md
+++ b/README.md
@@ -128,9 +128,14 @@ qc-cli init --force          Overwrite an existing config file
 ### `mlflow`

 ```
-qc-cli mlflow open  Open a presigned MLflow UI URL in a browser
+qc-cli mlflow open                       Open a presigned MLflow UI URL
+qc-cli mlflow upload-metrics [job-name] Upload completed training metrics
 ```

+`mlflow upload-metrics` defaults to the last submitted training job. It creates or recovers the job's MLflow run,
+imports `training_metrics.json` from the SageMaker model artifact, and records a successful upload in `.qc-cli.json`.
+Use `--force` to upload the metrics again.
+
 ### `infra`

 ```
@@ -163,7 +168,7 @@ Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads de

 ```
 qc-cli train start              Submit a SageMaker training job
-qc-cli train start --wait       Submit, wait, and finalize MLflow tracking
+qc-cli train start --upload-metrics  Submit, wait, and upload metrics
 qc-cli train status [job-name]  Show job status; defaults to the last submitted job
 qc-cli train list               List recent training jobs
 qc-cli train list --limit 3     Show a custom number of recent jobs
@@ -171,7 +176,7 @@ qc-cli train list --limit 3     Show a custom number of recent jobs

 `train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.

-`train start --wait` checks SageMaker every 30 seconds by default. Use `--poll-interval <seconds>` to choose another positive interval. Stopping the local command does not stop the SageMaker job.
+`train start --upload-metrics` checks SageMaker every 30 seconds by default, then uploads metrics after completion. Use `--poll-interval <seconds>` to choose another positive interval. Stopping the local command does not stop the SageMaker job.

 The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.

@@ -219,15 +224,19 @@ The CLI uses neutral experiment naming for trained artifacts and reserves releas
 Current behavior:

 1. `qc-cli train start` submits a SageMaker training job.
-2. `qc-cli train status` or `qc-cli train start --wait` finalizes the MLflow run after the job reaches a terminal state. `--wait` polls every 30 seconds by default.
-3. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
+2. `qc-cli train status` finalizes the MLflow run and registers completed model artifacts.
+3. `qc-cli train start --upload-metrics` polls every 30 seconds by default, then uploads per-epoch metrics after completion.
+4. `qc-cli mlflow upload-metrics [job-name]` uploads or retries metrics for an existing completed job.
+5. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
   - `qc_cli.stage=experiment`
   - `qc_cli.artifact_kind=trained_source`
   - `qc_cli.source=sagemaker`
-4. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
-5. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
+6. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
+7. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.

-Training scripts can include a `training_metrics.json` file in the SageMaker model directory. During finalization, the CLI logs its ordered metrics to the associated MLflow run using each epoch as the MLflow step and stores the JSON as a run artifact:
+Training scripts can include a `training_metrics.json` file in the SageMaker model directory. The explicit metrics
+upload command (`qc-cli mlflow upload-metrics`) logs the file's ordered metrics to the associated MLflow run, using
+each epoch as the MLflow step, and stores the JSON as a run artifact:

 ```json
 {
@@ -239,7 +248,9 @@ Training scripts can include a `training_metrics.json` file in the SageMaker mod
 }
 ```

-Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative, unique, and strictly increasing. Missing or malformed metrics produce a warning but do not block model registration.
+Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative, unique, and
+strictly increasing. A missing or malformed metrics artifact fails the upload command without affecting the trained
+model or model registration.

 Future release aliases such as `v1` or `production` can point at a selected deployable artifact.
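
The validation rules stated in the last hunk (non-empty metric names, finite values, non-negative and strictly increasing steps) could be checked roughly as follows. This is a minimal sketch, not the CLI's implementation: the function name `validate_metric_points` and the in-memory shape (a sequence of `(step, metrics)` pairs) are assumptions made for illustration, since the diff does not show the full `training_metrics.json` schema. Treating steps as integers is also an assumption.

```python
import math


def validate_metric_points(points):
    """Return a list of error strings for a sequence of (step, metrics) pairs.

    Hypothetical helper: `points` is an assumed in-memory shape, not the
    actual layout of training_metrics.json.
    """
    errors = []
    previous_step = None
    for index, (step, metrics) in enumerate(points):
        # Steps must be non-negative (integers assumed; bool is excluded
        # because it is a subclass of int in Python).
        if isinstance(step, bool) or not isinstance(step, int) or step < 0:
            errors.append(f"entry {index}: step must be a non-negative integer")
        else:
            # Strictly increasing steps are automatically unique.
            if previous_step is not None and step <= previous_step:
                errors.append(
                    f"entry {index}: step {step} is not greater than {previous_step}"
                )
            previous_step = step
        for name, value in metrics.items():
            if not isinstance(name, str) or not name:
                errors.append(f"entry {index}: metric name must be a non-empty string")
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            # Only floats can be NaN or infinite; ints are always finite.
            if not is_number or (isinstance(value, float) and not math.isfinite(value)):
                errors.append(f"entry {index}: value for {name!r} must be a finite number")
    return errors
```

An empty result means the points satisfy the stated rules. A caller following the README's new behavior would fail the upload command on a non-empty result while leaving the trained model and model registration untouched.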