# LLM Foundry — LLM Training Code for Foundation Models by Databricks

> An open-source library for training, fine-tuning, and evaluating large language models, built on the Composer training library by MosaicML/Databricks.

## Quick Use

```bash
pip install llm-foundry

# Launch a training run from a clone of the repository: the composer
# launcher takes the training script plus a YAML configuration file
composer scripts/train/train.py scripts/train/yamls/pretrain/mpt-125m.yaml
```

## Introduction

LLM Foundry is the training codebase behind Databricks' foundation models, including DBRX. It wraps the Composer distributed training library with LLM-specific configurations, data pipelines, and evaluation harnesses, providing a production-grade starting point for training language models.

## What LLM Foundry Does

- Provides ready-to-use training scripts for GPT-style and MPT-style language models
- Supports pretraining from scratch, continued pretraining, and supervised fine-tuning
- Includes a streaming data pipeline for efficient training on large datasets
- Integrates model evaluation via the Eval Gauntlet and lm-evaluation-harness
- Handles multi-GPU and multi-node distributed training through Composer

## Architecture Overview

LLM Foundry builds on MosaicML Composer for distributed training orchestration and Streaming for data loading. Model architectures are defined as modular Composer models. Training configurations are specified in YAML files that control model size, parallelism, data sources, and optimization settings.

## Self-Hosting & Configuration

- Install via pip from PyPI or clone the repository
- Define training configurations in YAML covering model, data, optimizer, and scheduler (an illustrative sketch appears after the FAQ below)
- Requires NVIDIA GPUs with CUDA; supports FSDP and tensor parallelism
- Use the MosaicML Streaming library for efficient data loading from cloud storage
- Integrates with Weights & Biases and MLflow for experiment tracking

## Key Features

- Battle-tested code used to train production foundation models (DBRX, MPT)
- Flash Attention 2 and ALiBi positional encodings for efficient long-context training
- MosaicML Streaming enables resumable, multi-worker data loading from S3 or GCS
- Built-in MCLI integration for launching training on MosaicML Cloud
- Modular architecture makes it straightforward to add custom model components

## Comparison with Similar Tools

- **LlamaFactory** — high-level fine-tuning with a web UI; LLM Foundry targets from-scratch pretraining workflows
- **Megatron-LM** — NVIDIA's parallelism framework; LLM Foundry uses Composer/FSDP and is more accessible for medium-scale training
- **Axolotl** — fine-tuning focused; LLM Foundry covers the full pretraining-to-evaluation cycle
- **NeMo** — NVIDIA's end-to-end platform; LLM Foundry is lighter and more modular
- **litgpt** — Lightning AI's GPT training toolkit; LLM Foundry has deeper integration with the Databricks ecosystem

## FAQ

**Q: Can I use LLM Foundry without MosaicML Cloud?**
A: Yes. It runs on any machine with PyTorch and CUDA. Cloud integration is optional.

**Q: What model sizes can I train?**
A: From small models on a single GPU to 100B+ parameter models across multi-node clusters using FSDP.

**Q: Does it support LoRA fine-tuning?**
A: Yes, through PEFT integration. Configure LoRA parameters in the training YAML (see the LoRA sketch below).

**Q: What datasets are supported?**
A: Any text dataset in JSON, JSONL, or HuggingFace format. The Streaming library also supports custom formats.
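## Example Pretraining Config (Sketch)

The YAML below illustrates the shape of a pretraining configuration, loosely modeled on the `scripts/train/yamls/pretrain/mpt-125m.yaml` example in the repository. Exact keys and defaults vary by release, and the dataset paths here are hypothetical placeholders, so treat this as an illustrative sketch rather than a canonical config.

```yaml
# Illustrative pretraining config, loosely modeled on the repo's
# scripts/train/yamls/pretrain examples. Keys may differ by version.
max_seq_len: 2048

model:
  name: mpt_causal_lm        # MPT-style decoder defined in LLM Foundry
  d_model: 768
  n_heads: 12
  n_layers: 12
  max_seq_len: ${max_seq_len}

tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: ${max_seq_len}

train_loader:
  name: text                 # streaming text dataloader
  dataset:
    remote: s3://my-bucket/my-dataset/   # hypothetical bucket path
    local: /tmp/my-dataset               # local cache for streamed shards
    split: train
    shuffle: true
    max_seq_len: ${max_seq_len}

optimizer:
  name: decoupled_adamw
  lr: 6.0e-4

scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba            # warm up over 100 batches

max_duration: 4800ba         # duration in batches; epochs/tokens also work
global_train_batch_size: 256

# Shard parameters across GPUs with PyTorch FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
```

Passing a file like this to `composer scripts/train/train.py` launches the run; Composer handles the distributed orchestration described above.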
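## LoRA Fine-Tuning Fragment (Sketch)

For LoRA fine-tuning via PEFT, the LoRA parameters are nested under the model section of the training YAML. The fragment below is a minimal sketch assuming a HuggingFace causal LM backbone; the `peft_config` fields mirror HF PEFT's `LoraConfig`, and the base model and target module names shown are examples only and depend on the model architecture.

```yaml
# Illustrative LoRA fragment (hypothetical model and module choices).
model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  peft_config:
    peft_type: LORA
    task_type: CAUSAL_LM
    r: 16
    lora_alpha: 32
    lora_dropout: 0.05
    target_modules:
      - Wqkv               # attention projection; names vary by model
```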
## Sources

- https://github.com/mosaicml/llm-foundry