Mlflow implementation (#2)
Reviewed-on: #2
This commit was merged in pull request #2.
57
README.md
@@ -78,9 +78,13 @@ To provision an MLflow tracking server, set:
```yaml
mlflow:
  mode: create
  tracking_server_name: your-tracking-server-name
  experiment_name: qc-cli-training
  registered_model_name: qc-cli-model
  register_trained_models: true
```

In `create` mode, the CLI derives the tracking server name from `infra.stack_name`, so you do not need to set `tracking_server_name`.

To use an existing MLflow tracking server, set:

```yaml
@@ -89,6 +93,16 @@ mlflow:
  tracking_server_name: your-tracking-server-name
```

When MLflow is enabled, `train start` creates an MLflow run for the SageMaker job. `train status` finalizes that run once the job reaches a terminal state. When `mlflow.register_trained_models` is enabled, it also registers the artifacts of completed jobs as experiment model versions and points the `experiment-latest` MLflow alias at the newest one. An experiment version is an immutable trained-source artifact: it records that training produced a model, not that the model is better than earlier versions or ready for release.
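
The finalization step described above can be pictured as a small status mapping. The following is a minimal sketch, not the CLI's actual code: the function name `mlflow_status_for` and the mapping itself are assumptions made for illustration. It relies only on the status names that SageMaker reports for training jobs and that MLflow uses for runs.

```python
from typing import Optional

# SageMaker training job statuses that will not change again.
TERMINAL_JOB_STATUSES = {"Completed", "Failed", "Stopped"}

# Assumed mapping from a terminal SageMaker status to an MLflow run status.
JOB_TO_RUN_STATUS = {
    "Completed": "FINISHED",
    "Failed": "FAILED",
    "Stopped": "KILLED",
}


def mlflow_status_for(job_status: str) -> Optional[str]:
    """Return the MLflow run status to record, or None while the job is still active."""
    if job_status not in TERMINAL_JOB_STATUSES:
        # "InProgress" and "Stopping" are not terminal, so the run stays open.
        return None
    return JOB_TO_RUN_STATUS[job_status]


print(mlflow_status_for("InProgress"))
print(mlflow_status_for("Completed"))
```

Returning `None` for non-terminal states mirrors the README's wording that the run is finalized only once the job reaches a terminal state.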

To open the managed SageMaker MLflow UI, request a fresh presigned URL:

```bash
qc-cli infra mlflow-url --config config.yaml
```

This works for `mode: create` and for `mode: existing` when the existing server is managed by Amazon SageMaker. In `create` mode, the command uses the CLI-managed tracking server name. In `existing` mode, it uses `mlflow.tracking_server_name`. If the existing MLflow server is external to SageMaker, open it with that server's own URL instead.
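
The mode-dependent name lookup described above might look roughly like the sketch below. Everything here is hypothetical: the function `resolve_tracking_server_name`, the `-mlflow` suffix used in `create` mode, and the error messages are illustrative assumptions, since the README does not state how the name is derived from `infra.stack_name`. A real implementation would presumably pass the resolved name to SageMaker's `CreatePresignedMlflowTrackingServerUrl` API.

```python
from typing import Any, Dict


def resolve_tracking_server_name(config: Dict[str, Any]) -> str:
    """Pick the tracking server name a presigned-URL request would use."""
    mlflow_cfg = config.get("mlflow", {})
    mode = mlflow_cfg.get("mode")
    if mode == "create":
        # Hypothetical naming rule: the real derivation from
        # infra.stack_name is not documented in the README.
        return f"{config['infra']['stack_name']}-mlflow"
    if mode == "existing":
        name = mlflow_cfg.get("tracking_server_name")
        if not name:
            # Assumption: the name is mandatory when pointing at an existing server.
            raise ValueError("mlflow.tracking_server_name must be set in existing mode")
        return name
    raise ValueError(f"unsupported mlflow.mode: {mode!r}")


create_cfg = {"infra": {"stack_name": "qc-demo"}, "mlflow": {"mode": "create"}}
existing_cfg = {"mlflow": {"mode": "existing", "tracking_server_name": "team-server"}}
print(resolve_tracking_server_name(create_cfg))
print(resolve_tracking_server_name(existing_cfg))
```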

## Commands

### `init`
@@ -106,6 +120,7 @@ qc-cli infra setup Deploy the CDK stack
qc-cli infra setup --no-bootstrap                           Deploy without running CDK bootstrap
qc-cli infra setup --cloudformation-execution-policy <arn>  Set CDK bootstrap execution policy ARN
qc-cli infra status                                         Show CDK stack/resource status
qc-cli infra mlflow-url                                     Print a presigned MLflow UI URL
qc-cli infra destroy                                        Destroy stack, retaining S3 data
qc-cli infra destroy --yes                                  Destroy stack without confirmation
qc-cli infra destroy --delete-bucket-data                   Destroy stack and delete S3 data
@@ -140,6 +155,46 @@ qc-cli train list --limit 3 Show a custom number of recent jobs

The expected output artifact is SageMaker’s `model.tar.gz`, normally containing the trained model file your container writes to `/opt/ml/model`.

## Model lifecycle

The CLI uses neutral experiment naming for trained artifacts and reserves release terminology for an explicit promotion step.

Current behavior:

1. `qc-cli train start` submits a SageMaker training job.
2. `qc-cli train status` finalizes the MLflow run after the job reaches a terminal state.
3. If the job completed and `mlflow.register_trained_models` is enabled, the SageMaker `model.tar.gz` is registered as a new MLflow model version with:
   - `qc_cli.stage=experiment`
   - `qc_cli.artifact_kind=trained_source`
   - `qc_cli.source=sagemaker`
4. The MLflow alias `experiment-latest` points at the most recently registered experiment version.
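
The registration rule in steps 3 and 4 can be sketched with a toy in-memory registry. This is an illustration under stated assumptions, not the CLI's implementation: `InMemoryRegistry` and `finalize` are hypothetical names, and a real registry would be MLflow's, not a dictionary. Only the tag keys, tag values, and the alias name come from the README.

```python
from typing import Dict, Optional

EXPERIMENT_ALIAS = "experiment-latest"


def experiment_tags() -> Dict[str, str]:
    """Tags the README lists for a newly registered experiment version."""
    return {
        "qc_cli.stage": "experiment",
        "qc_cli.artifact_kind": "trained_source",
        "qc_cli.source": "sagemaker",
    }


class InMemoryRegistry:
    """Toy stand-in for a model registry: versions are append-only, aliases move."""

    def __init__(self) -> None:
        self.versions: Dict[int, Dict[str, str]] = {}
        self.aliases: Dict[str, int] = {}

    def register(self, tags: Dict[str, str]) -> int:
        version = len(self.versions) + 1
        self.versions[version] = dict(tags)
        return version


def finalize(registry: InMemoryRegistry, job_status: str,
             register_trained_models: bool) -> Optional[int]:
    """Register a version only for completed jobs when registration is enabled."""
    if job_status != "Completed" or not register_trained_models:
        return None
    version = registry.register(experiment_tags())
    # The alias always follows the most recently registered experiment version.
    registry.aliases[EXPERIMENT_ALIAS] = version
    return version


registry = InMemoryRegistry()
first = finalize(registry, "Completed", True)
skipped = finalize(registry, "Failed", True)
second = finalize(registry, "Completed", True)
print(first, skipped, second, registry.aliases[EXPERIMENT_ALIAS])
```

Earlier versions stay in the registry untouched; only the alias moves, which matches the description of experiment versions as immutable.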

Planned AI Hub extension:

1. AI Hub compile or quantize will create deployable derived artifacts from a trained-source experiment.
2. Derived artifacts will keep lineage back to the source experiment version instead of replacing it.
3. Release aliases such as `v1` or `production` will point at the selected deployable artifact.

Example future metadata:

```text
qc-cli-model version 12
  qc_cli.stage=experiment
  qc_cli.artifact_kind=trained_source
  qc_cli.source=sagemaker

qc-cli-model-aihub version 3
  qc_cli.stage=ai_hub_compiled
  qc_cli.artifact_kind=deployable
  qc_cli.parent_registered_model_name=qc-cli-model
  qc_cli.parent_model_version=12
  qc_cli.runtime=tflite
  qc_cli.quantization=int8
  qc_cli.target_device=Samsung Galaxy S25
```

In that flow, `experiment-latest` remains a training convenience alias. Release selection is a separate promotion decision based on the derived artifact, not on the experiment name.
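
Because this extension is only planned, any code for it is speculative. The sketch below simply shows how the lineage tags from the example metadata could be assembled and how a promotion check might separate deployable artifacts from trained sources. The helper names `derived_tags` and `is_releasable`, and the promotion rule itself, are assumptions.

```python
from typing import Dict


def derived_tags(parent_name: str, parent_version: int, runtime: str,
                 quantization: str, target_device: str) -> Dict[str, str]:
    """Build tags for a derived artifact that point back at its source version."""
    return {
        "qc_cli.stage": "ai_hub_compiled",
        "qc_cli.artifact_kind": "deployable",
        "qc_cli.parent_registered_model_name": parent_name,
        "qc_cli.parent_model_version": str(parent_version),
        "qc_cli.runtime": runtime,
        "qc_cli.quantization": quantization,
        "qc_cli.target_device": target_device,
    }


def is_releasable(tags: Dict[str, str]) -> bool:
    """Assumed promotion rule: release aliases only target deployable artifacts."""
    return tags.get("qc_cli.artifact_kind") == "deployable"


compiled = derived_tags("qc-cli-model", 12, "tflite", "int8", "Samsung Galaxy S25")
source = {"qc_cli.stage": "experiment", "qc_cli.artifact_kind": "trained_source"}
print(is_releasable(compiled), is_releasable(source))
```

Keeping the parent model name and version as tags leaves the source experiment version in place, so a release alias can be traced back to the training run that produced it.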

## AWS permissions required

The IAM user or role running the CLI needs: