another update

2026-06-12 12:17:02 -04:00
parent 53e886a535
commit 5211d0af14
6 changed files with 278 additions and 89 deletions

@@ -128,9 +128,14 @@ qc-cli init --force Overwrite an existing config file
### `mlflow`
```
qc-cli mlflow open Open a presigned MLflow UI URL in a browser
qc-cli mlflow open Open a presigned MLflow UI URL
qc-cli mlflow upload-metrics [job-name] Upload completed training metrics
```
`mlflow upload-metrics` defaults to the last submitted training job. It creates or recovers the job's MLflow run,
imports `training_metrics.json` from the SageMaker model artifact, and records successful upload in `.qc-cli.json`.
Use `--force` to upload the metrics again.
### `infra`
```
@@ -163,7 +168,7 @@ Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads de
```
qc-cli train start Submit a SageMaker training job
qc-cli train start --wait Submit, wait, and finalize MLflow tracking
qc-cli train start --upload-metrics Submit, wait, and upload metrics
qc-cli train status [job-name] Show job status; defaults to the last submitted job
qc-cli train list List recent training jobs
qc-cli train list --limit 3 Show a custom number of recent jobs
@@ -171,7 +176,7 @@ qc-cli train list --limit 3 Show a custom number of recent jobs
`train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.
`train start --wait` checks SageMaker every 30 seconds by default. Use `--poll-interval <seconds>` to choose another positive interval. Stopping the local command does not stop the SageMaker job.
`train start --upload-metrics` checks SageMaker every 30 seconds by default, then uploads metrics after completion. Use `--poll-interval <seconds>` to choose another positive interval. Stopping the local command does not stop the SageMaker job.
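The polling behavior described above can be sketched as a small loop. This is an illustrative sketch, not the CLI's actual implementation: `wait_for_job`, `get_status`, and the injectable `sleep` are hypothetical names, and the terminal statuses are the ones SageMaker reports in `TrainingJobStatus`.

```python
import time
from typing import Callable

# Terminal values of SageMaker's TrainingJobStatus field.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it returns a terminal status, then return it.

    get_status is a hypothetical callable supplied by the caller; sleep is
    injectable so the loop can be exercised without real delays.
    """
    if poll_interval <= 0:
        raise ValueError("poll interval must be positive")
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Interrupting this loop only stops local polling; the remote job
        # keeps running.
        sleep(poll_interval)
```

With boto3, `get_status` could wrap `client.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]`. A metrics upload would then run only when the returned status is `Completed`.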
The expected output artifact is SageMaker's `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.
@@ -219,15 +224,19 @@ The CLI uses neutral experiment naming for trained artifacts and reserves releas
Current behavior:
1. `qc-cli train start` submits a SageMaker training job.
2. `qc-cli train status` or `qc-cli train start --wait` finalizes the MLflow run after the job reaches a terminal state. `--wait` polls every 30 seconds by default.
3. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
2. `qc-cli train status` finalizes the MLflow run and registers completed model artifacts.
3. `qc-cli train start --upload-metrics` polls every 30 seconds by default, then uploads per-epoch metrics after completion.
4. `qc-cli mlflow upload-metrics [job-name]` uploads or retries metrics for an existing completed job.
5. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
- `qc_cli.stage=experiment`
- `qc_cli.artifact_kind=trained_source`
- `qc_cli.source=sagemaker`
4. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
5. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
6. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
7. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
Training scripts can include a `training_metrics.json` file in the SageMaker model directory. During finalization, the CLI logs its ordered metrics to the associated MLflow run using each epoch as the MLflow step and stores the JSON as a run artifact:
Training scripts can include a `training_metrics.json` file in the SageMaker model directory. The explicit metrics
upload command logs its ordered metrics to the associated MLflow run using each epoch as the MLflow step and stores
the JSON as a run artifact:
```json
{
@@ -239,7 +248,9 @@ Training scripts can include a `training_metrics.json` file in the SageMaker mod
}
```
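Logging per-epoch metrics with the epoch as the MLflow step might look roughly like the sketch below. It is not the CLI's code. The in-memory record shape (a list of dicts with `step` and `metrics` keys) and the function name `log_epoch_metrics` are assumptions made for illustration, since this diff does not show the full layout of `training_metrics.json`. The `client` argument is expected to behave like `mlflow.tracking.MlflowClient`, whose `log_metric` and `log_artifact` methods are the only calls used.

```python
from typing import Any, Dict, List, Optional


def log_epoch_metrics(
    client: Any,
    run_id: str,
    records: List[Dict[str, Any]],
    json_path: Optional[str] = None,
) -> int:
    """Log every metric with its epoch as the step; return the number logged.

    records uses a hypothetical shape:
        [{"step": 0, "metrics": {"loss": 1.2}}, ...]
    """
    logged = 0
    for record in records:
        for name, value in record["metrics"].items():
            # MlflowClient.log_metric accepts step as a keyword argument.
            client.log_metric(run_id, name, value, step=record["step"])
            logged += 1
    if json_path is not None:
        # Keep the raw JSON file alongside the run as an artifact.
        client.log_artifact(run_id, json_path)
    return logged
```

Passing the client in, rather than constructing it inside the function, keeps the sketch usable against a real tracking server or a stub in tests.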
Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative, unique, and strictly increasing. Missing or malformed metrics produce a warning but do not block model registration.
Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative, unique, and
strictly increasing. A missing or malformed metrics artifact fails the upload command without affecting the trained
model or model registration.
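The validation rules above could be checked along these lines. This is a minimal sketch under stated assumptions, not the CLI's validator: the record shape (a list of dicts with `step` and `metrics` keys) and the name `validate_metrics` are hypothetical, and requiring integer steps is an added assumption based on MLflow steps being integers.

```python
import math
from numbers import Real
from typing import Any, Dict, List


def validate_metrics(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return records unchanged if valid; raise ValueError otherwise.

    records uses a hypothetical shape:
        [{"step": 0, "metrics": {"loss": 1.2}}, ...]
    """
    previous_step = -1
    for record in records:
        step = record["step"]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(step, bool) or not isinstance(step, int) or step < 0:
            raise ValueError(f"step must be a non-negative integer: {step!r}")
        # Strictly increasing steps are also unique.
        if step <= previous_step:
            raise ValueError(f"steps must be strictly increasing: {step!r}")
        previous_step = step
        for name, value in record["metrics"].items():
            if not isinstance(name, str) or not name:
                raise ValueError(f"metric name must be a non-empty string: {name!r}")
            if (
                isinstance(value, bool)
                or not isinstance(value, Real)
                or not math.isfinite(value)
            ):
                raise ValueError(f"metric {name!r} must be a finite number: {value!r}")
    return records
```

Raising an exception, instead of logging a warning, matches the new behavior in which a malformed metrics artifact fails the upload command while leaving the trained model and its registration untouched.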
Future release aliases such as `v1` or `production` can point at a selected deployable artifact.