Configs · May 17, 2026 · 3 min read

Determined — Open-Source ML Training Platform with GPU Scheduling

A self-hosted machine learning platform that handles distributed training, hyperparameter tuning, experiment tracking, and GPU cluster management in one integrated system.

Ready for agents

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and raw content so agents can evaluate compatibility, risk, and next steps.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Install: Single
Trust: Established
Entry: Determined Overview
Universal CLI command:
npx tokrepo install 41dad2e0-51a8-11f1-9bc6-00163e2b0d79

Introduction

Determined is an open-source ML training platform that unifies distributed training, hyperparameter search, resource scheduling, and experiment tracking. It lets ML teams share GPU clusters efficiently while maintaining reproducibility across experiments.

What Determined Does

  • Schedules and manages distributed training jobs across GPU clusters
  • Runs hyperparameter searches with adaptive algorithms (ASHA, PBT)
  • Tracks experiments with metrics, checkpoints, and configuration versioning
  • Provides fault-tolerant training with automatic checkpoint and resume
  • Manages multi-tenant GPU allocation with fair-share scheduling

Architecture Overview

Determined runs a master service that accepts experiment submissions and schedules them on agent nodes. Each agent manages one or more GPUs and launches training containers. The master tracks experiment state in PostgreSQL, serves the Web UI, and coordinates distributed training via Horovod or PyTorch DDP. A resource pool abstraction supports on-prem, cloud, and Kubernetes backends.
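What the master actually schedules is described by an experiment configuration. The sketch below shows the main pieces a submission ties together: the searcher, per-trial GPU resources, and the hyperparameter space. Field names follow Determined's experiment config schema, but the experiment name, entrypoint class, and concrete values are illustrative assumptions, not from this page:

```yaml
name: mnist-asha-example          # illustrative experiment name
entrypoint: model_def:MNISTTrial  # hypothetical trial class in model_def.py
resources:
  slots_per_trial: 4              # GPUs per trial (enables distributed training)
searcher:
  name: adaptive_asha             # early-stopping hyperparameter search
  metric: validation_loss
  smaller_is_better: true
  max_trials: 16
  max_length:
    batches: 1000
hyperparameters:
  learning_rate:
    type: double
    minval: 1.0e-4
    maxval: 1.0e-1
```

Submitting this file to the master creates trials that agents schedule onto available GPUs.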

Self-Hosting & Configuration

  • Deploy locally: det deploy local cluster-up for development
  • Deploy on Kubernetes with the official Helm chart for production
  • Configure resource pools in master.yaml to define GPU allocation policies
  • Set up cloud auto-scaling for AWS or GCP to provision GPUs on demand
  • Access the Web UI at the master address for experiment monitoring
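The resource-pool step above can be sketched as a fragment of master.yaml. The structure mirrors Determined's master configuration, but the pool names and the split between a fair-share training pool and a priority auxiliary pool are illustrative assumptions:

```yaml
resource_manager:
  type: agent                      # static on-prem agents (vs. Kubernetes/cloud)
  default_compute_resource_pool: gpu-pool
resource_pools:
  - pool_name: gpu-pool            # shared training pool (name is illustrative)
    scheduler:
      type: fair_share             # balance GPU time across users and projects
  - pool_name: aux-pool            # notebooks, shells, and other auxiliary tasks
    scheduler:
      type: priority
```

Agents register with a pool at startup, and experiments can target a pool explicitly or fall back to the default.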

Key Features

  • State-of-the-art hyperparameter search with early stopping (ASHA) and population-based training
  • Fault-tolerant training: experiments resume automatically from last checkpoint on preemption
  • Distributed training with zero code changes using PyTorch DDP or Horovod
  • Fair-share GPU scheduling across teams and projects
  • Integrated experiment tracking with metric visualization and model registry
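The early-stopping idea behind ASHA can be illustrated with a toy successive-halving loop: evaluate all candidate configurations on a small budget, keep only the best fraction, and retry the survivors with a doubled budget. This is a self-contained illustration of the principle, not Determined's ASHA implementation (which promotes trials asynchronously):

```python
def successive_halving(configs, evaluate, rungs=3, keep_fraction=0.5):
    """Toy successive halving: score every surviving config at the
    current budget, keep the top fraction, double the budget, repeat."""
    budget = 1
    survivors = list(configs)
    while len(survivors) > 1 and rungs > 0:
        # Higher score is better in this sketch.
        scored = [(evaluate(cfg, budget), cfg) for cfg in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, int(len(scored) * keep_fraction))
        survivors = [cfg for _, cfg in scored[:keep]]
        budget *= 2  # promoted configs earn more training budget
        rungs -= 1
    return survivors[0]

# Hypothetical objective: learning rates near 0.1 score best.
def fake_eval(cfg, budget):
    return -abs(cfg["lr"] - 0.1) / budget

candidates = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 0.5)]
best = successive_halving(candidates, fake_eval)  # -> {"lr": 0.1}
```

Because weak configurations are discarded after cheap evaluations, most of the compute budget is spent on the promising ones.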

Comparison with Similar Tools

  • MLflow — experiment tracking and model registry but no built-in GPU scheduling or distributed training
  • Kubeflow — Kubernetes ML toolkit, more modular but requires assembling multiple components
  • Ray Train — distributed training library, no integrated scheduler or experiment management
  • Weights & Biases — SaaS experiment tracking, no compute scheduling
  • ClearML — similar scope but different scheduling architecture and open-core model

FAQ

Q: Does Determined work with PyTorch and TensorFlow? A: Yes. Both are supported first-class, either through the Trial API or through the Core API when minimal code changes are preferred.

Q: Can I run Determined on Kubernetes? A: Yes. The Helm chart deploys the master and agents as Kubernetes resources, using the cluster's GPU nodes.

Q: How does fault tolerance work? A: Determined periodically checkpoints training state. If a node fails or a job is preempted, training resumes from the last checkpoint automatically.
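The checkpoint-then-resume pattern behind that answer can be shown with a minimal toy loop: persist the training state periodically, and on restart reload the last saved state instead of beginning at step zero. This is a sketch of the pattern only; Determined manages checkpoints through its own APIs and storage backends:

```python
import json
import os
import tempfile

def train(steps, ckpt_path, fail_at=None):
    """Toy training loop: checkpoint state every step, resume from the
    checkpoint file if one exists, optionally simulate a preemption."""
    state = {"step": 0, "loss": 100.0}
    if os.path.exists(ckpt_path):  # resume from the last checkpoint
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated preemption")
        state["step"] += 1
        state["loss"] *= 0.9  # pretend the loss improves each step
        with open(ckpt_path, "w") as f:  # persist progress
            json.dump(state, f)
    return state

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(10, path, fail_at=5)  # "node failure" after 5 completed steps
except RuntimeError:
    pass
final = train(10, path)  # restarts from step 5, not step 0
```

In Determined, the master detects the failed trial and reschedules it; the rescheduled container performs the equivalent of the second `train` call automatically.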

Q: Is there a hosted version? A: HPE (which acquired Determined AI) offers a managed version, but the open-source platform is fully self-hostable.

