Configs · April 26, 2026 · 1 min read

LLM Foundry — LLM Training Code for Foundation Models by Databricks

An open-source library for training, fine-tuning, and evaluating large language models, built on the Composer training library by MosaicML/Databricks.

Introduction

LLM Foundry is the training codebase behind Databricks' foundation models including DBRX. It wraps the Composer distributed training library with LLM-specific configurations, data pipelines, and evaluation harnesses, providing a production-grade starting point for training language models.

What LLM Foundry Does

  • Provides ready-to-use training scripts for GPT-style and MPT-style language models
  • Supports pretraining from scratch, continued pretraining, and supervised fine-tuning
  • Includes a streaming data pipeline for efficient training on large datasets
  • Integrates model evaluation via the Eval Gauntlet and lm-evaluation-harness
  • Handles multi-GPU and multi-node distributed training through Composer

Architecture Overview

LLM Foundry builds on MosaicML Composer for distributed training orchestration and Streaming for data loading. Model architectures are defined as modular Composer models. Training configurations are specified in YAML files that control model size, parallelism, data sources, and optimization settings.
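As a sketch, a pretraining YAML might look like the following. Key names follow the style of the example configs shipped in the repository (`scripts/train/yamls/`), but treat the exact schema as illustrative and check it against your installed version; the S3 bucket is hypothetical.

```yaml
# Illustrative pretraining config in the style of llm-foundry's example YAMLs.
model:
  name: mpt_causal_lm
  d_model: 768
  n_heads: 12
  n_layers: 12
  max_seq_len: 2048

tokenizer:
  name: EleutherAI/gpt-neox-20b

train_loader:
  name: text
  dataset:
    remote: s3://my-bucket/my-dataset/   # hypothetical bucket
    split: train
    shuffle: true

optimizer:
  name: decoupled_adamw
  lr: 3.0e-4

scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba   # durations are given in batches ("ba")

max_duration: 10000ba
global_train_batch_size: 256
```

One YAML file like this drives the whole run: model size, data source, optimizer, and schedule are all declared here rather than in code.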

Self-Hosting & Configuration

  • Install via pip from PyPI or clone the repository
  • Define training configurations in YAML covering model, data, optimizer, and scheduler
  • Requires NVIDIA GPUs with CUDA; supports FSDP and tensor parallelism
  • Use the MosaicML Streaming library for efficient data loading from cloud storage
  • Integrates with Weights & Biases and MLflow for experiment tracking
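A minimal install-and-launch sketch follows. The script and YAML paths follow the layout of the repository's README (`scripts/train/train.py` and the example configs under `train/yamls/`), but verify them against the version you clone:

```shell
# Install from PyPI (or clone the repo and `pip install -e .`)
pip install llm-foundry

# Clone the repository to get the example training configs
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry/scripts

# Launch training via Composer's distributed launcher;
# the YAML is one of the shipped example pretraining configs
composer train/train.py train/yamls/pretrain/mpt-125m.yaml
```

The `composer` launcher handles process spawning per GPU, so the same command scales from one GPU to a full node.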

Key Features

  • Battle-tested code used to train production foundation models (DBRX, MPT)
  • Flash Attention 2 and ALiBi positional encodings for efficient long-context training
  • MosaicML Streaming enables resumable, multi-worker data loading from S3 or GCS
  • Built-in MCLI integration for launching training on MosaicML Cloud
  • Modular architecture makes it straightforward to add custom model components

Comparison with Similar Tools

  • LlamaFactory — high-level fine-tuning with a web UI; LLM Foundry targets from-scratch pretraining workflows
  • Megatron-LM — NVIDIA's parallelism framework; LLM Foundry uses Composer/FSDP and is more accessible for medium-scale training
  • Axolotl — fine-tuning focused; LLM Foundry covers the full pretraining-to-evaluation cycle
  • NeMo — NVIDIA's end-to-end platform; LLM Foundry is lighter and more modular
  • litgpt — Lightning AI's GPT training toolkit; LLM Foundry has deeper integration with the Databricks ecosystem

FAQ

Q: Can I use LLM Foundry without MosaicML Cloud? A: Yes. It runs on any machine with PyTorch and CUDA. Cloud integration is optional.

Q: What model sizes can I train? A: From small models on a single GPU to 100B+ parameter models across multi-node clusters using FSDP.
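For multi-GPU sharding, a small FSDP block in the training YAML is typically all that is needed. The key names below follow the repository's example configs; verify them against your version:

```yaml
# Illustrative FSDP settings for a llm-foundry training YAML.
fsdp_config:
  sharding_strategy: FULL_SHARD   # shard params, grads, and optimizer state
  mixed_precision: PURE
  activation_checkpointing: true  # trade compute for memory on large models
```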

Q: Does it support LoRA fine-tuning? A: Yes, through PEFT integration. Configure LoRA parameters in the training YAML.
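As a hedged sketch, LoRA settings sit inside the model section of the fine-tuning YAML via a PEFT config block. The field names below mirror PEFT's LoRA parameters and the style of llm-foundry's fine-tuning examples, but the exact schema depends on your version:

```yaml
# Illustrative LoRA fine-tuning fragment; check your version's schema.
model:
  name: hf_causal_lm
  pretrained_model_name_or_path: mosaicml/mpt-7b
  peft_config:
    peft_type: LORA
    r: 16
    lora_alpha: 32
    lora_dropout: 0.05
    target_modules: ["Wqkv"]   # attention projection module in MPT
```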

Q: What datasets are supported? A: Any text dataset in JSON, JSONL, or HuggingFace format. The Streaming library also supports custom formats.
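For supervised fine-tuning, a JSONL file of prompt/response pairs is the common input shape. The `prompt`/`response` field names below follow the convention used in LLM Foundry's fine-tuning examples, but check the data docs for your version. A minimal sketch using only the Python standard library:

```python
import json
import tempfile

# Write a tiny SFT dataset in JSONL form: one JSON object per line.
examples = [
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"prompt": "Name a large language model.", "response": "MPT-7B"},
]

path = tempfile.mktemp(suffix=".jsonl")
with open(path, "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read it back line by line, as a data loader would.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))          # 2
print(loaded[0]["prompt"])  # What is 2 + 2?
```

Because each line is an independent JSON object, JSONL files can be streamed and sharded without parsing the whole file, which is why the format pairs well with the Streaming library.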

