# Determined — Open-Source ML Training Platform with GPU Scheduling

> A self-hosted machine learning platform that handles distributed training, hyperparameter tuning, experiment tracking, and GPU cluster management in one integrated system.

## Quick Use

```bash
# Install the CLI
pip install determined

# Deploy a local cluster
det deploy local cluster-up

# Submit a training experiment
det experiment create config.yaml model_dir/
```

## Introduction

Determined is an open-source ML training platform that unifies distributed training, hyperparameter search, resource scheduling, and experiment tracking. It lets ML teams share GPU clusters efficiently while maintaining reproducibility across experiments.

## What Determined Does

- Schedules and manages distributed training jobs across GPU clusters
- Runs hyperparameter searches with adaptive algorithms (ASHA, PBT)
- Tracks experiments with metrics, checkpoints, and configuration versioning
- Provides fault-tolerant training with automatic checkpointing and resume
- Manages multi-tenant GPU allocation with fair-share scheduling

## Architecture Overview

Determined runs a master service that accepts experiment submissions and schedules them on agent nodes. Each agent manages one or more GPUs and launches training containers. The master tracks experiment state in PostgreSQL, serves the Web UI, and coordinates distributed training via Horovod or PyTorch DDP. A resource pool abstraction supports on-prem, cloud, and Kubernetes backends.

## Self-Hosting & Configuration

- Deploy locally: `det deploy local cluster-up` for development
- Deploy on Kubernetes with the official Helm chart for production
- Configure resource pools in `master.yaml` to define GPU allocation policies (see the sketch in the Examples section below)
- Set up cloud auto-scaling for AWS or GCP to provision GPUs on demand
- Access the Web UI at the master address for experiment monitoring

## Key Features

- State-of-the-art hyperparameter search with early stopping (ASHA) and population-based training
- Fault-tolerant training: experiments resume automatically from the last checkpoint after preemption
- Distributed training with zero code changes using PyTorch DDP or Horovod
- Fair-share GPU scheduling across teams and projects
- Integrated experiment tracking with metric visualization and a model registry

## Comparison with Similar Tools

- **MLflow** — experiment tracking and model registry, but no built-in GPU scheduling or distributed training
- **Kubeflow** — Kubernetes ML toolkit; more modular, but requires assembling multiple components
- **Ray Train** — distributed training library with no integrated scheduler or experiment management
- **Weights & Biases** — SaaS experiment tracking with no compute scheduling
- **ClearML** — similar scope, but a different scheduling architecture and an open-core model

## FAQ

**Q: Does Determined work with PyTorch and TensorFlow?**
A: Yes. It has first-class support for both via its Trial API, or via the Core API for minimal code changes (a sketch appears in the Examples section below).

**Q: Can I run Determined on Kubernetes?**
A: Yes. The Helm chart deploys the master and agents as Kubernetes resources, using the cluster's GPU nodes.

**Q: How does fault tolerance work?**
A: Determined periodically checkpoints training state. If a node fails or a job is preempted, training resumes from the last checkpoint automatically.

**Q: Is there a hosted version?**
A: HPE (which acquired Determined AI) offers a managed version, but the open-source platform is fully self-hostable.
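## Examples

The sketches in this section are illustrative, not excerpts from the official docs. To show what the `config.yaml` passed to `det experiment create config.yaml model_dir/` might contain, here is a minimal experiment configuration using the adaptive ASHA searcher mentioned under Key Features. The experiment name, the hyperparameter, and the `model_def:DemoTrial` entrypoint are made up, and the exact searcher schema varies across Determined versions, so verify field names against https://docs.determined.ai/ before use.

```yaml
# config.yaml — illustrative sketch of an adaptive ASHA search; field names
# follow the general Determined schema but should be checked per version
name: asha-demo                    # hypothetical experiment name
entrypoint: model_def:DemoTrial    # hypothetical Trial class inside model_dir/
hyperparameters:
  learning_rate:
    type: log                      # sample the learning rate on a log scale
    minval: 1.0e-4
    maxval: 1.0e-1
    base: 10
searcher:
  name: adaptive_asha              # ASHA early-stops unpromising trials
  metric: validation_loss          # metric the trial reports and ASHA ranks by
  smaller_is_better: true
  max_trials: 16                   # hyperparameter configurations to explore
  max_length:                      # some versions require a length budget here;
    batches: 1000                  # newer releases configure trial length differently
resources:
  slots_per_trial: 1               # GPUs allocated to each trial
```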
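For the `master.yaml` resource pools mentioned under Self-Hosting & Configuration, a hedged fragment might look like the following. The pool names are hypothetical, the AWS provisioner settings are elided, and the available scheduler types and fields should be confirmed in the cluster configuration reference.

```yaml
# master.yaml fragment — illustrative resource pools with fair-share scheduling
resource_manager:
  type: agent                      # static agents (vs. the Kubernetes backend)
  scheduler:
    type: fair_share               # share GPUs fairly across jobs and teams
  default_compute_resource_pool: default
resource_pools:
  - pool_name: default             # hypothetical on-prem pool
    description: Static on-prem GPU agents
  - pool_name: aws-dynamic         # hypothetical auto-scaling cloud pool
    provider:
      type: aws                    # remaining provisioner fields elided; see docs
```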
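Finally, the FAQ points to the Core API as the minimal-code-change integration path. Below is a rough Python sketch of reporting metrics through it: the training loop and loss value are placeholders, and the reporting methods shown should be double-checked against the Core API reference for your version.

```python
# Hedged sketch of the Core API integration path mentioned in the FAQ.
# The training loop is a placeholder; the reporting calls are the point.
import determined as det


def train(core_context: det.core.Context) -> None:
    for step in range(1, 101):
        loss = 1.0 / step  # stand-in for a real training step
        # Reported metrics appear in the Web UI's experiment tracking.
        core_context.train.report_training_metrics(
            steps_completed=step,
            metrics={"loss": loss},
        )


if __name__ == "__main__":
    # det.core.init() wires up metric reporting, checkpointing, and
    # preemption handling when run under a Determined experiment.
    with det.core.init() as core_context:
        train(core_context)
```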
## Sources

- https://github.com/determined-ai/determined
- https://docs.determined.ai/