Configs · May 17, 2026 · 3 min read

Determined — Open-Source ML Training Platform with GPU Scheduling

A self-hosted machine learning platform that handles distributed training, hyperparameter tuning, experiment tracking, and GPU cluster management in one integrated system.

Ready for agents

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and raw content so agents can evaluate compatibility, risk, and next steps.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Install: Single
Trust: Established
Entry: Determined Overview
Universal CLI command:
npx tokrepo install 41dad2e0-51a8-11f1-9bc6-00163e2b0d79

Introduction

Determined is an open-source ML training platform that unifies distributed training, hyperparameter search, resource scheduling, and experiment tracking. It lets ML teams share GPU clusters efficiently while maintaining reproducibility across experiments.

What Determined Does

  • Schedules and manages distributed training jobs across GPU clusters
  • Runs hyperparameter searches with adaptive algorithms (ASHA, PBT)
  • Tracks experiments with metrics, checkpoints, and configuration versioning
  • Provides fault-tolerant training with automatic checkpoint and resume
  • Manages multi-tenant GPU allocation with fair-share scheduling

Architecture Overview

Determined runs a master service that accepts experiment submissions and schedules them on agent nodes. Each agent manages one or more GPUs and launches training containers. The master tracks experiment state in PostgreSQL, serves the Web UI, and coordinates distributed training via Horovod or PyTorch DDP. A resource pool abstraction supports on-prem, cloud, and Kubernetes backends.
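What the master actually schedules is described by an experiment configuration. The sketch below shows the main pieces a submission ties together: the searcher, per-trial GPU resources, and the hyperparameter space. Field names follow Determined's experiment config schema, but the experiment name, entrypoint class, and concrete values are illustrative assumptions, not from this page:

```yaml
name: mnist-asha-example          # illustrative experiment name
entrypoint: model_def:MNISTTrial  # hypothetical trial class in model_def.py
resources:
  slots_per_trial: 4              # GPUs per trial (enables distributed training)
searcher:
  name: adaptive_asha             # early-stopping hyperparameter search
  metric: validation_loss
  smaller_is_better: true
  max_trials: 16
  max_length:
    batches: 1000
hyperparameters:
  learning_rate:
    type: double
    minval: 1.0e-4
    maxval: 1.0e-1
```

Submitting this file to the master creates trials that agents schedule onto available GPUs.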

Self-Hosting & Configuration

  • Deploy locally: det deploy local cluster-up for development
  • Deploy on Kubernetes with the official Helm chart for production
  • Configure resource pools in master.yaml to define GPU allocation policies
  • Set up cloud auto-scaling for AWS or GCP to provision GPUs on demand
  • Access the Web UI at the master address for experiment monitoring
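The resource-pool step above can be sketched as a fragment of master.yaml. The structure mirrors Determined's master configuration, but the pool names and the split between a fair-share training pool and a priority auxiliary pool are illustrative assumptions:

```yaml
resource_manager:
  type: agent                      # static on-prem agents (vs. Kubernetes/cloud)
  default_compute_resource_pool: gpu-pool
resource_pools:
  - pool_name: gpu-pool            # shared training pool (name is illustrative)
    scheduler:
      type: fair_share             # balance GPU time across users and projects
  - pool_name: aux-pool            # notebooks, shells, and other auxiliary tasks
    scheduler:
      type: priority
```

Agents register with a pool at startup, and experiments can target a pool explicitly or fall back to the default.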

Key Features

  • State-of-the-art hyperparameter search with early stopping (ASHA) and population-based training
  • Fault-tolerant training: experiments resume automatically from last checkpoint on preemption
  • Distributed training with zero code changes using PyTorch DDP or Horovod
  • Fair-share GPU scheduling across teams and projects
  • Integrated experiment tracking with metric visualization and model registry
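The early-stopping idea behind ASHA can be illustrated with a toy successive-halving loop: evaluate all candidate configurations on a small budget, keep only the best fraction, and retry the survivors with a doubled budget. This is a self-contained illustration of the principle, not Determined's ASHA implementation (which promotes trials asynchronously):

```python
def successive_halving(configs, evaluate, rungs=3, keep_fraction=0.5):
    """Toy successive halving: score every surviving config at the
    current budget, keep the top fraction, double the budget, repeat."""
    budget = 1
    survivors = list(configs)
    while len(survivors) > 1 and rungs > 0:
        # Higher score is better in this sketch.
        scored = [(evaluate(cfg, budget), cfg) for cfg in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, int(len(scored) * keep_fraction))
        survivors = [cfg for _, cfg in scored[:keep]]
        budget *= 2  # promoted configs earn more training budget
        rungs -= 1
    return survivors[0]

# Hypothetical objective: learning rates near 0.1 score best.
def fake_eval(cfg, budget):
    return -abs(cfg["lr"] - 0.1) / budget

candidates = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 0.5)]
best = successive_halving(candidates, fake_eval)  # -> {"lr": 0.1}
```

Because weak configurations are discarded after cheap evaluations, most of the compute budget is spent on the promising ones.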

Comparison with Similar Tools

  • MLflow — experiment tracking and model registry but no built-in GPU scheduling or distributed training
  • Kubeflow — Kubernetes ML toolkit, more modular but requires assembling multiple components
  • Ray Train — distributed training library, no integrated scheduler or experiment management
  • Weights & Biases — SaaS experiment tracking, no compute scheduling
  • ClearML — similar scope but different scheduling architecture and open-core model

FAQ

Q: Does Determined work with PyTorch and TensorFlow? A: Yes. Both are supported first-class, either through the Trial API or through the Core API when minimal code changes are preferred.

Q: Can I run Determined on Kubernetes? A: Yes. The Helm chart deploys the master and agents as Kubernetes resources, using the cluster's GPU nodes.

Q: How does fault tolerance work? A: Determined periodically checkpoints training state. If a node fails or a job is preempted, training resumes from the last checkpoint automatically.
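The checkpoint-then-resume pattern behind that answer can be shown with a minimal toy loop: persist the training state periodically, and on restart reload the last saved state instead of beginning at step zero. This is a sketch of the pattern only; Determined manages checkpoints through its own APIs and storage backends:

```python
import json
import os
import tempfile

def train(steps, ckpt_path, fail_at=None):
    """Toy training loop: checkpoint state every step, resume from the
    checkpoint file if one exists, optionally simulate a preemption."""
    state = {"step": 0, "loss": 100.0}
    if os.path.exists(ckpt_path):  # resume from the last checkpoint
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated preemption")
        state["step"] += 1
        state["loss"] *= 0.9  # pretend the loss improves each step
        with open(ckpt_path, "w") as f:  # persist progress
            json.dump(state, f)
    return state

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(10, path, fail_at=5)  # "node failure" after 5 completed steps
except RuntimeError:
    pass
final = train(10, path)  # restarts from step 5, not step 0
```

In Determined, the master detects the failed trial and reschedules it; the rescheduled container performs the equivalent of the second `train` call automatically.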

Q: Is there a hosted version? A: HPE (which acquired Determined AI) offers a managed version, but the open-source platform is fully self-hostable.

