include-metrics-from-training (#6)
Reviewed-on: #6
This commit was merged in pull request #6.
45
README.md
@@ -105,7 +105,11 @@ mlflow:
   tracking_server_name: your-tracking-server-name
 ```
 
-When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. `train status` finalizes that run once the job reaches a terminal state and registers completed model artifacts as experiment model versions using the `experiment-latest` MLflow alias. An experiment version is an immutable trained-source artifact; it records that training produced a model, not that the model is better than earlier versions or ready for release.
+When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. Metric upload through
+`train start --upload-metrics` or `mlflow upload-metrics` finalizes that run and registers completed model artifacts
+as experiment model versions using the `experiment-latest` MLflow alias. `train status` reads SageMaker status only.
+An experiment version is an immutable trained-source artifact; it records that training produced a model, not that
+the model is better than earlier versions or ready for release.
 
 To open the managed SageMaker MLflow UI, request a fresh presigned URL:
 
@@ -128,9 +132,14 @@ qc-cli init --force Overwrite an existing config file
 ### `mlflow`
 
 ```
-qc-cli mlflow open                        Open a presigned MLflow UI URL in a browser
+qc-cli mlflow open                        Open a presigned MLflow UI URL
+qc-cli mlflow upload-metrics [job-name]   Upload completed training metrics
 ```
 
+`mlflow upload-metrics` defaults to the last submitted training job. It creates or recovers the job's MLflow run,
+imports `training_metrics.json` from the SageMaker model artifact, and records successful upload in `.qc-cli.json`.
+Use `--force` to upload the metrics again.
+
 ### `infra`
 
 ```
@@ -163,6 +172,7 @@ Uploads use `s3.bucket` and `s3.data_prefix` from `config.yaml`. File uploads de
 
 ```
 qc-cli train start                    Submit a SageMaker training job
+qc-cli train start --upload-metrics   Submit, wait, and upload metrics
 qc-cli train status [job-name]        Show job status; defaults to the last submitted job
 qc-cli train list                     List recent training jobs
 qc-cli train list --limit 3           Show a custom number of recent jobs
@@ -170,6 +180,8 @@ qc-cli train list --limit 3 Show a custom number of recent jobs
 
 `train start` uses `s3://<bucket>/<data_prefix>/` as the training channel and writes outputs under `s3://<bucket>/<model_prefix>/`. If `sagemaker.training.source_dir` is set, the CLI packages that directory, uploads it beside the job output prefix, and passes `sagemaker_program`/`sagemaker_submit_directory` to the SageMaker container.
 
+`train start --upload-metrics` checks SageMaker every 30 seconds by default, then uploads metrics after completion. Use `--poll-interval <seconds>` to choose another positive interval. Stopping the local command does not stop the SageMaker job.
+
 The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.
 
 ### `ai-hub`
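Note on the hunk above: the wait-then-upload behavior described for `train start --upload-metrics` can be pictured as a plain polling loop. The following is a minimal Python sketch under stated assumptions, not the CLI's actual code. The names `wait_for_terminal_state`, `get_status`, and `sleep` are hypothetical; the terminal status strings are the SageMaker `TrainingJobStatus` values `Completed`, `Failed`, and `Stopped`.

```python
import time
from typing import Callable

# SageMaker training job statuses after which the job no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_state(
    get_status: Callable[[], str],
    poll_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal state, then return it.

    get_status is a hypothetical callable that returns the job's current
    status string. Interrupting this loop only stops the local wait; it
    does nothing to the remote job.
    """
    if poll_interval <= 0:
        raise ValueError("poll interval must be a positive number of seconds")
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_interval)
```

Passing `sleep` as a parameter is only a testing convenience; in this sketch a caller would start the metrics upload when the returned status is `Completed`.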
@@ -216,13 +228,34 @@ The CLI uses neutral experiment naming for trained artifacts and reserves releas
 Current behavior:
 
 1. `qc-cli train start` submits a SageMaker training job.
-2. `qc-cli train status` finalizes the MLflow run after the job reaches a terminal state.
-3. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
+2. `qc-cli train status` reads and displays SageMaker status only; it does not contact MLflow.
+3. `qc-cli train start --upload-metrics` polls every 30 seconds by default, then uploads per-epoch metrics after completion.
+4. `qc-cli mlflow upload-metrics [job-name]` uploads or retries metrics for an existing completed job.
+5. The metrics upload workflow finalizes the MLflow run and, when `mlflow.register_trained_models` is enabled, registers the SageMaker `model.tar.gz` as a new MLflow model version with:
    - `qc_cli.stage=experiment`
    - `qc_cli.artifact_kind=trained_source`
    - `qc_cli.source=sagemaker`
-4. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
-5. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
+6. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
+7. AI Hub upload commands create deployable derived artifacts from a trained-source experiment or local ONNX model.
 
+Training scripts can include a `training_metrics.json` file in the SageMaker model directory. When present, the
+explicit metrics upload command logs its ordered metrics to the associated MLflow run using each epoch as the MLflow
+step and stores the JSON as a run artifact:
+
+```json
+{
+  "schema_version": 1,
+  "steps": [
+    {"step": 0, "metrics": {"val.precision": 0.72, "val.recall": 0.68}}
+  ],
+  "summary": {"summary.best_epoch": 0}
+}
+```
+
+Metric names must be non-empty strings, values must be finite numbers, and steps must be non-negative, unique, and
+strictly increasing. If the file is missing, the command uploads the final metrics reported by SageMaker and continues
+model registration without per-epoch history. A malformed metrics artifact still fails the upload command without
+affecting the trained model or model registration.
+
 Future release aliases such as `v1` or `production` can point at a selected deployable artifact.
 
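Note on the `training_metrics.json` rules added in the last hunk: the constraints on metric names, values, and steps can be checked in a few lines. The following is a minimal Python sketch under stated assumptions, not the validator shipped in this commit. The function name `validate_training_metrics` and the returned tuple layout are hypothetical; only the `steps`, `step`, and `metrics` keys come from the README example.

```python
import math
from typing import Any


def validate_training_metrics(doc: dict[str, Any]) -> list[tuple[int, str, float]]:
    """Check the per-step metrics and return (step, name, value) tuples in order.

    Raises ValueError on the first rule violation: metric names must be
    non-empty strings, values finite numbers, and steps non-negative and
    strictly increasing (which also makes them unique).
    """
    steps = doc.get("steps")
    if not isinstance(steps, list):
        raise ValueError("'steps' must be a list")

    points: list[tuple[int, str, float]] = []
    previous_step = -1
    for entry in steps:
        if not isinstance(entry, dict):
            raise ValueError("each step entry must be an object")
        step = entry.get("step")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(step, bool) or not isinstance(step, int) or step < 0:
            raise ValueError(f"invalid step: {step!r}")
        if step <= previous_step:
            raise ValueError(f"steps must be strictly increasing, got {step}")
        previous_step = step

        metrics = entry.get("metrics")
        if not isinstance(metrics, dict):
            raise ValueError(f"step {step}: 'metrics' must be an object")
        for name, value in metrics.items():
            if not isinstance(name, str) or not name:
                raise ValueError(f"step {step}: metric names must be non-empty strings")
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"step {step}: {name} is not a number")
            if isinstance(value, float) and not math.isfinite(value):
                raise ValueError(f"step {step}: {name} is not finite")
            points.append((step, name, value))
    return points
```

Each returned tuple maps onto one `mlflow.log_metric(name, value, step=step)` call, which is one way to log each epoch as the MLflow step; whether the CLI logs metrics this way is not shown on this page.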