Scripts · Apr 6, 2026 · 2 min read

Promptfoo — LLM Eval & Red-Team Testing Framework

Open-source framework for evaluating and red-teaming LLM applications. Test prompts across models, detect jailbreaks, measure quality, and catch regressions. 5,000+ GitHub stars.

TL;DR
Open-source framework for testing prompts across models, detecting jailbreaks, and catching LLM quality regressions.
§01

What it is

Promptfoo is an open-source CLI and library for systematically evaluating LLM outputs. It runs your prompts against multiple models, scores the outputs with configurable assertions (exact match, contains, LLM-graded rubrics, semantic similarity), and surfaces regressions. The red-teaming module generates adversarial inputs to test jailbreak resistance and safety guardrails.

Promptfoo targets AI engineers, product teams, and security researchers who need repeatable LLM testing. It replaces manual prompt testing with automated evaluation pipelines that run in CI/CD.

§02

How it saves time or tokens

Manual prompt testing means running inputs one by one and eyeballing outputs. Promptfoo automates this with batch evaluation and structured scoring. You define test cases once and re-run them every time you change a prompt, model, or system configuration. The comparison view shows side-by-side results across models, making it obvious which performs better.
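
A minimal sketch of that workflow: put two prompt variants in one config, and each eval run scores both against the same test cases side by side (prompt wording and test values here are illustrative):

# promptfooconfig.yaml -- comparing two prompt variants (illustrative)
prompts:
  - 'Summarize this support ticket: {{ticket}}'
  - 'Summarize this support ticket in one sentence: {{ticket}}'

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      ticket: 'Customer reports login failures after the 2.3 update.'
    assert:
      - type: contains
        value: 'login'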

§03

How to use

  1. Install Promptfoo via npm.
  2. Create a configuration file with prompts, providers, and test cases.
  3. Run the eval and open the results viewer.

# Install the CLI globally
npm install -g promptfoo

# Initialize a config
promptfoo init

# Run evaluation
promptfoo eval

# Open results in browser
promptfoo view
§04

Example

# promptfooconfig.yaml
prompts:
  - 'Summarize this article in 3 bullet points: {{article}}'

providers:
  - openai:gpt-4o
  - anthropic:claude-sonnet-4-20250514

tests:
  - vars:
      article: 'The Federal Reserve held interest rates steady...'
    assert:
      - type: contains
        value: 'interest rate'
      - type: llm-rubric
        value: 'Output should contain exactly 3 bullet points'
      - type: cost
        threshold: 0.01
§05

Common pitfalls

  • LLM-graded assertions (llm-rubric) consume additional tokens for the grading call. Use deterministic assertions (contains, regex) where possible and reserve LLM grading for subjective quality checks.
  • Running evaluations against many providers in parallel can hit rate limits. Configure concurrency limits in the config file or on the command line (see the sketch after this list).
  • Red-team tests may produce harmful content in outputs. Run red-team evaluations in isolated environments and restrict access to the results.
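
One way to cap parallelism, assuming your promptfoo version supports the evaluateOptions.maxConcurrency config key and the -j/--max-concurrency CLI flag (both appear in current docs, but verify against your release):

# promptfooconfig.yaml -- cap parallel provider calls
evaluateOptions:
  maxConcurrency: 4

# or per run, on the command line:
promptfoo eval -j 4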

Frequently Asked Questions

Can Promptfoo test any LLM provider?

Yes. Promptfoo supports OpenAI, Anthropic, Google, AWS Bedrock, Azure, Ollama, and any OpenAI-compatible endpoint. You can also define custom providers using scripts. This makes it easy to compare outputs across different models and providers.
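
As a sketch, a providers block mixing hosted models with a local OpenAI-compatible server might look like this (the endpoint URL is a placeholder, and apiBaseUrl is the key promptfoo's docs use for OpenAI-compatible endpoints; verify for your version):

# promptfooconfig.yaml excerpt (endpoint URL is a placeholder)
providers:
  - openai:gpt-4o
  - anthropic:claude-sonnet-4-20250514
  - id: openai:chat:local-model      # any OpenAI-compatible server
    config:
      apiBaseUrl: 'http://localhost:8080/v1'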

How does the red-team feature work?

Promptfoo's red-team module generates adversarial inputs designed to trigger jailbreaks, prompt injection, and safety bypass attempts. It runs these inputs against your application and scores whether the model maintained its safety guardrails. Results highlight specific vulnerabilities.
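
A hedged sketch of the workflow, assuming your release ships the promptfoo redteam subcommands (check promptfoo redteam --help for the exact set):

# Scaffold a red-team config interactively
promptfoo redteam init

# Generate adversarial test cases and run them against your target
promptfoo redteam run

# Inspect flagged vulnerabilities in the browser
promptfoo view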

Can I run Promptfoo in CI/CD?

Yes. Promptfoo has a CLI that exits with a non-zero code if any test assertion fails. You can run it as a step in GitHub Actions, GitLab CI, or any pipeline. The JSON output format enables integration with reporting tools.
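
For example, a minimal GitHub Actions step might look like this (workflow structure and secret names are illustrative, not an official template):

# .github/workflows/llm-tests.yml (illustrative)
name: llm-tests
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}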

How does Promptfoo compare to LangSmith for evaluation?

Promptfoo is open source, runs locally, and focuses on batch evaluation with assertions. LangSmith is a SaaS platform with tracing, monitoring, and annotation features. Promptfoo is better for CI/CD-integrated testing; LangSmith is better for production observability and human annotation workflows.

Does Promptfoo support custom scoring functions?

Yes. You can write custom assertion functions in JavaScript or Python that receive the LLM output and return a pass/fail result with a score. This lets you implement domain-specific quality checks beyond the built-in assertion types.
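
As a minimal Python sketch: point an assertion at a file, and have the file expose a get_assert(output, context) hook that returns a pass/score/reason dict (the file name is hypothetical; treat the exact hook signature as an assumption for your version):

# score_check.py (hypothetical file name)
def get_assert(output: str, context) -> dict:
    """Pass if the output stays under 50 words; score rewards brevity."""
    words = len(output.split())
    return {
        "pass": words <= 50,
        "score": max(0.0, 1.0 - words / 100),
        "reason": f"{words} words",
    }

# referenced from promptfooconfig.yaml as:
#   assert:
#     - type: python
#       value: file://score_check.py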


Source & Thanks

Created by Promptfoo. Licensed under MIT.

promptfoo — ⭐ 5,000+

Thanks for bringing test-driven development to AI applications.
