# ChatGLM — Open Bilingual Chat Model by Tsinghua KEG

> ChatGLM is a family of open bilingual language models from Tsinghua University that support English and Chinese conversation, code generation, and tool use, with variants optimized for consumer GPU deployment.

## Install

```bash
pip install transformers torch
```

## Quick Use

```bash
python -c "from transformers import AutoTokenizer, AutoModel; t = AutoTokenizer.from_pretrained('THUDM/chatglm3-6b', trust_remote_code=True); m = AutoModel.from_pretrained('THUDM/chatglm3-6b', trust_remote_code=True).half().cuda().eval(); print(m.chat(t, 'Hello')[0])"
```

## Introduction

ChatGLM is an open-source bilingual (English/Chinese) language model series developed by the KEG Lab at Tsinghua University and Zhipu AI. Built on the General Language Model (GLM) architecture, it provides competitive chat, reasoning, and code generation capabilities while remaining small enough to run on a single consumer GPU.

## What ChatGLM Does

- Generates fluent bilingual text for conversation, summarization, and translation
- Supports function calling and tool-use patterns for agent workflows
- Runs INT4-quantized inference on GPUs with as little as 6 GB of VRAM
- Ships with a web demo and CLI for interactive chat out of the box
- Serves as a base model for supervised fine-tuning and RLHF

## Architecture Overview

ChatGLM uses a prefix-decoder transformer with rotary position embeddings and multi-query attention. The model is pre-trained on a balanced English-Chinese corpus and aligned via RLHF. Later versions (GLM-4) add a longer context window and vision capabilities. Quantized variants use GPTQ or native INT4 weight compression to reduce memory requirements.
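The VRAM figures quoted above can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative, not an exact profile: the 6.2-billion parameter count is an approximation, and it counts weights only, ignoring the KV cache, activations, and framework overhead.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# ChatGLM-6B has roughly 6.2 billion parameters.
fp16_gib = weight_memory_gib(6.2e9, 16)  # ~11.5 GiB: weights alone exceed a 12 GB card
int4_gib = weight_memory_gib(6.2e9, 4)   # ~2.9 GiB: fits in 6 GB of VRAM with
                                         # headroom for the KV cache and activations
```

This is why the `.half()` (FP16) load in the quick-use snippet wants a larger GPU, while the INT4-quantized variant targets 6 GB cards.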
## Self-Hosting & Configuration

- Clone the repo and install dependencies with `pip install -r requirements.txt`
- Download weights from the Hugging Face Hub or the Tsinghua mirror
- Launch a Gradio web demo with `python web_demo.py` or a CLI chat with `python cli_demo.py`
- Enable INT4 quantization by loading with `AutoModel.from_pretrained(...).quantize(4)`
- Deploy an OpenAI-compatible API server using the included `openai_api.py` script

## Key Features

- Strong bilingual performance in both English and Chinese
- Runs on consumer hardware with INT4 quantization (6 GB VRAM for the 6B model)
- OpenAI-compatible API server included for drop-in integration
- Supports P-Tuning v2 and LoRA for efficient domain adaptation
- Active model family with regular upgrades (ChatGLM2, ChatGLM3, GLM-4)

## Comparison with Similar Tools

- **LLaMA** — English-centric; larger ecosystem but weaker Chinese support
- **Qwen** — Alibaba's bilingual model; similar size range with a different architecture
- **Baichuan** — another Chinese-first LLM; emphasizes longer context
- **Yi** — 01.AI's bilingual model; newer, with a different training-data mix
- **Mistral** — high performance per parameter; primarily English

## FAQ

**Q: Which ChatGLM version should I use?**
A: Use the latest available version (GLM-4 or ChatGLM3-6B) for the best performance. Older versions remain available for reproducibility.

**Q: Can I fine-tune ChatGLM on my own data?**
A: Yes. The repository includes P-Tuning v2 scripts, and the model works with standard LoRA tools such as PEFT and LLaMA-Factory.

**Q: Is it suitable for production APIs?**
A: The included OpenAI-compatible server handles moderate traffic. For high-throughput serving, use vLLM or TGI with the ChatGLM model.

**Q: What license governs commercial use?**
A: ChatGLM models are released under a custom license that permits commercial use with attribution. Check the specific model card for details.
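Because the bundled server speaks the standard OpenAI chat protocol, any client can talk to it by POSTing a JSON body to `/v1/chat/completions`. The helper below is a hypothetical sketch of building and validating that body; the field names follow the OpenAI chat format, and the default model name and parameter values here are assumptions, not values taken from `openai_api.py`.

```python
import json

def chat_request(messages, model="chatglm3-6b", temperature=0.8, max_tokens=1024):
    """Hypothetical helper: build the JSON body for POST /v1/chat/completions."""
    allowed_roles = {"system", "user", "assistant"}
    for m in messages:
        if m.get("role") not in allowed_roles:
            raise ValueError(f"unsupported role: {m.get('role')!r}")
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

# Bilingual prompts work the same way; ensure_ascii=False keeps Chinese readable.
body = chat_request([{"role": "user", "content": "用一句话介绍清华大学"}])
payload = json.dumps(body, ensure_ascii=False)
```

A client would send `payload` to wherever the server is listening (check the script's `--port` or source for the actual default) with any HTTP library, or point the official `openai` client's `base_url` at the server.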
## Sources

- https://github.com/THUDM/ChatGLM-6B
- https://github.com/THUDM/GLM-4