TOKREPO · AI Arsenal

Stable

Content Moderation: The Moderation Stack for UGC Platforms

Ten picks for the trust & safety lead of a community, forum, short-video app, or marketplace with real user-generated content. Five layers in order: text → image → video → audio → human-in-the-loop with appeals and false-positive feedback. Open infrastructure before vendor lock-in. Multilingual by default. The model proposes; a human still decides the borderline cases; every decision (and every appeal) is logged so you can retrain a model on your own mistakes next quarter.

10 resources

About this pack

What's in this pack

This is the rig the trust & safety lead at a UGC platform (community, forum, short-video app, marketplace, comments system, anywhere users post) would actually assemble in 2026. It is explicitly not a wrapper around one moderation API. Single-vendor moderation looks fine on slide one and fails on slide eight: per-call pricing breaks at scale, English-only models miss much of the harm on a multilingual platform, vendors rarely publish their false-positive rates, and your appeal queue ends up as a Google Form on a Notion page.

The stack is organized in five deliberate layers because that is how the work actually flows when a user uploads a 30-second video with a caption, a thumbnail, and background audio:

Text layer — captions, comments, titles, bios, DMs. Highest volume, lowest cost per check, fastest false-positive feedback loop.
Image layer — thumbnails, profile photos, message attachments. Lower volume than text, much higher per-item risk.
Video layer — by far the most expensive per item. You do not classify every frame; you sample.
Audio layer — background music, voiceover, voice messages. Almost always reduced to text first, then handed to the text layer.
Appeal + feedback layer — the human-in-the-loop queue, the dashboard the moderator opens at 9am, and the pipeline that turns every overturned decision into a labelled retraining example.

Three principles run through every pick:

Open infrastructure first. Every layer has a self-hostable option. Vendor APIs are a fine first line for English text, but they cannot become the only line — multilingual coverage, pricing risk, and the inability to inspect their decisions all break at scale.
The model proposes; a human decides borderline calls. A confidence threshold splits traffic into auto-allow / auto-block / human-review queues. The middle queue is where your trust & safety team earns its salary.
Log every decision and every appeal. False-positive rate is the metric you cannot fake. Without an immutable record of "model said block, human said allow on appeal", you cannot improve the model and you cannot defend a decision to a regulator.

No accuracy percentages are quoted in this pack. Any vendor or open-source project that publishes a single number — "99.2% accurate!" — is selling you a benchmark, not a moderation system. Accuracy on your content, in your languages, against your policy is the only number that matters, and you have to measure it yourself.

Install in this order (text → image → video → audio → appeal + false-positive analysis)

NeMo Guardrails — Programmable Safety for LLM Applications (id: 4258) — start the text layer here. NVIDIA's open framework lets you compose input rails, output rails, and topic rails declaratively (in Colang), and call any OpenAI-compatible LLM as the classifier. For UGC text moderation specifically, the value is that the same rail definitions cover user-generated content, your own outbound copy, and your bot's replies. One policy, three surfaces.
llm-guard — Secure LLM Inputs & Outputs (id: 3103) — a sibling text-layer scanner that runs as a Python library and ships with pre-built scanners for toxicity, prompt injection, PII, and banned-topic detection. Pair it with NeMo: NeMo is the orchestrator and policy language, llm-guard is the deep bench of pre-built detectors you point each rail at. Two libraries together cover both "how do I define a moderation policy" and "what specific classifiers do I run inside it".
Guardrails AI — Validate LLM Outputs in Production (id: 773) — schema-validation for any AI-generated artifact you produce yourself (auto-generated descriptions, moderator-suggestion text, AI-written safety responses to users). Keeps your own AI from inventing policy categories or recommending actions that don't exist. The moderation system itself needs guardrails before you can ship one.
Presidio — Detect and Anonymize PII (id: 3106) — the PII detection and redaction layer. UGC platforms see phone numbers, emails, addresses, and government IDs leak constantly. Presidio runs locally, ships with English recognizers by default, can be extended to other languages by configuring additional NLP models and recognizers, and gives you both detection ("this comment contains a phone number") and anonymisation ("replace with <PHONE> before sending to any downstream service"). Wire it in before anything user-generated reaches an external moderation API.
OpenCV — Open-Source Computer Vision Library (id: 1828) — the image layer workhorse. Even when you eventually call a cloud vision API for NSFW or violence classification, OpenCV is what you use to preprocess: resize to model input, strip EXIF, downsample, and extract perceptual hashes (pHash) so duplicate-image abuse gets caught once and blocked forever. Cheap, local, fast.
FFmpeg — The Universal Multimedia Processing Toolkit (id: 1157) — the video layer foundation. You do not run a classifier on every frame of every video — that is a budget bonfire. You use FFmpeg to extract keyframes (every 1-2 seconds is a defensible default), separate the audio track for the next step, and downscale to whatever resolution your image classifier actually needs. FFmpeg is also where you run scene-change detection so your samples land on visually distinct frames, not 24 near-identical copies.
Whisper — OpenAI Speech-to-Text (id: 105) — the audio layer. Background music, voiceover, voice messages, livestream audio — almost all audio moderation reduces to "transcribe it, then run the text layer". Whisper runs locally, handles many languages out of the box, and is the open default. For real-time / livestream audio, the same model lineage runs in faster server-side variants.
HumanLayer — Approval Loops for Coding Agents (id: 3036) — the human-in-the-loop appeal layer. Originally built for agent approval flows, the same primitive ("pause, route to a human, resume on decision") is exactly the queue model a moderator needs. The model classifies, anything above the auto-block threshold gets blocked, anything below auto-allow gets allowed, and everything in the middle band — plus every user appeal — lands in HumanLayer's queue with full context.
doccano — Open-Source Text Annotation Tool for Machine Learning (id: 4763) — the false-positive analysis layer. Every overturned moderation decision (model said block, human said allow on appeal) is a labelled training example you already paid for. doccano gives the trust & safety team a clean labelling UI, exports to standard formats, and feeds your next fine-tuning round. The platforms that improve over time are the ones that turn their own appeal queue into training data.
Discourse — Open Source Community Forum Platform (id: 1745) — the UGC platform itself, if you don't already have one. Worth including because Discourse ships with a mature moderation surface (flagging, trust levels, post review queues, user silence/suspension, audit log) that has been hardened by years of real community use. Even if you don't run Discourse in production, read its moderation admin pages — the data model and the workflow are a strong reference for what you should be building.
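
To make the FFmpeg step in the list above more concrete, here is a minimal sketch that only builds the command lines and does not run them. The function names, file paths, output patterns, and the 0.4 scene-change threshold are assumptions for illustration; the flags themselves (the fps filter, the select filter with its scene score, -vn, -ac, -ar) are standard FFmpeg options.

```python
from typing import List


def frame_sampling_cmd(src: str, out_pattern: str, seconds_per_frame: float = 2.0) -> List[str]:
    """Uniform sampling: one frame every `seconds_per_frame` seconds."""
    fps = 1.0 / seconds_per_frame
    return ["ffmpeg", "-i", src, "-vf", f"fps={fps:g}", out_pattern]


def scene_change_cmd(src: str, out_pattern: str, threshold: float = 0.4) -> List[str]:
    """Keep only frames whose scene-change score exceeds `threshold`.

    The threshold is an assumed starting value to tune on your own content.
    Newer FFmpeg builds prefer `-fps_mode vfr` over the older `-vsync vfr`.
    """
    return [
        "ffmpeg", "-i", src,
        "-vf", f"select='gt(scene,{threshold})'",
        "-vsync", "vfr",
        out_pattern,
    ]


def audio_extract_cmd(src: str, out_wav: str) -> List[str]:
    """Drop the video stream and write mono 16 kHz audio for transcription."""
    return ["ffmpeg", "-i", src, "-vn", "-ac", "1", "-ar", "16000", out_wav]


if __name__ == "__main__":
    for cmd in (
        frame_sampling_cmd("upload.mp4", "frames/uniform_%04d.jpg"),
        scene_change_cmd("upload.mp4", "frames/scene_%04d.jpg"),
        audio_extract_cmd("upload.mp4", "audio.wav"),
    ):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run as-is. Passing arguments as a list avoids shell quoting problems with the filter expression.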

How they fit together

  User upload (caption + image / video + audio)
           │
           ▼
  ┌────────────────────────────────────────────────────────┐
  │ Presidio (id 3106) — PII redaction on caption + bio    │
  └────────────────────────────────────────────────────────┘
           │
   ┌───────┴───────────────┬──────────────────┬─────────────┐
   ▼                       ▼                  ▼             ▼
 TEXT LAYER             IMAGE LAYER       VIDEO LAYER   AUDIO LAYER
 NeMo Guardrails        OpenCV            FFmpeg        FFmpeg → demux
  (id 4258)              (id 1828)         (id 1157)         │
 llm-guard               + cloud vision    sample             ▼
  (id 3103)              for NSFW/         keyframes     Whisper
  ⤷ confidence           violence          every 1–2 s    (id 105)
     0–1 score                                │             │
                                              ▼             ▼
                                       (frames → image (text → text
                                        layer)            layer)
           │
           ▼ classification + confidence
  ┌────────────────────────────────────────────────────────┐
  │ Threshold router                                       │
  │   confidence > T_block  → auto-block                   │
  │   confidence < T_allow  → auto-allow                   │
  │   in between OR user appeal → HumanLayer queue (3036)  │
  └────────────────────────────────────────────────────────┘
           │
           ▼
  Moderator decision (allow / block / escalate)
           │
           ▼
  ┌────────────────────────────────────────────────────────┐
  │ Append-only decision log                               │
  │   {item_id, model_label, confidence, human_label,      │
  │    appeal_outcome, moderator_id, timestamp, language}  │
  └────────────────────────────────────────────────────────┘
           │
           ▼
  doccano (id 4763) — label every overturned decision
           │
           ▼
  Next-quarter retraining set

The most important arrow is the loop at the bottom: appeal outcome → doccano → retraining. A moderation stack that doesn't close that loop drifts; one that does compounds.
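
The threshold router and the decision log from the diagram can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names Thresholds, route, DecisionRecord, and append_decision are hypothetical, the record fields mirror the ones listed in the diagram, and a JSON Lines file opened in append mode is only append-only by convention (a production log would need storage that enforces immutability).

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    t_allow: float  # below this confidence: auto-allow
    t_block: float  # above this confidence: auto-block


def route(confidence: float, th: Thresholds, appealed: bool = False) -> str:
    """Split traffic into auto_block / auto_allow / human_review."""
    if appealed:
        return "human_review"  # every user appeal goes to a human
    if confidence > th.t_block:
        return "auto_block"
    if confidence < th.t_allow:
        return "auto_allow"
    return "human_review"


@dataclass
class DecisionRecord:
    item_id: str
    model_label: str
    confidence: float
    human_label: Optional[str]
    appeal_outcome: Optional[str]
    moderator_id: Optional[str]
    timestamp: float
    language: str


def append_decision(path: str, record: DecisionRecord) -> None:
    """Append one decision as a JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


# Thresholds are tuned per policy and per surface, not inside the model.
# The keys and values below are made-up examples.
THRESHOLDS = {
    ("harassment", "comments"): Thresholds(t_allow=0.20, t_block=0.90),
    ("nudity", "kids_feed"): Thresholds(t_allow=0.05, t_block=0.50),
}
```

With these example values, a score of 0.6 lands in human review for comments but is auto-blocked on the stricter surface, which is the per-policy, per-surface tuning the router exists for.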

Tradeoffs you'll hit

False positive vs false negative. Every moderation system trades these off, and the right trade-off depends on what's being moderated. A children's product tolerates more false positives (over-block) to avoid any false negative. An adult-content marketplace tolerates more false negatives in exchange for fewer creator complaints. The threshold router in the diagram above is where you tune this, per-policy, per-surface — not in the model.
Vendor moderation API vs open stack. Cloud vendors (OpenAI Moderation, Perspective, Hive, Sightengine, Clarifai, Amazon Rekognition, Azure Content Safety) ship faster, cover English well, and abstract operations. They also charge per call, refuse to publish per-language error rates, and lock your appeal evidence behind their dashboard. The pragmatic pattern most platforms land on: vendor API as the first-pass filter for English text, open stack (this pack) as the defensible layer underneath that you control. Run both.
Multilingual coverage. Most commercial moderation APIs are quietly English-first — performance on Spanish, French, Portuguese is acceptable; on Chinese, Arabic, Hindi, Vietnamese, Tagalog, and the long tail it varies wildly and is rarely disclosed. Whisper and Presidio are multilingual by design; LLM-based rails (NeMo, llm-guard with an LLM-backed scanner) inherit whatever language coverage the underlying model has, which for frontier models in 2026 is broad but still uneven. Measure per-language false-positive and false-negative rates separately. A 95% overall score that is 99% in English and 70% in Vietnamese is not a 95% system on a platform where 40% of users post in Vietnamese.
Latency budget. Text moderation runs in tens to low hundreds of ms; image moderation in hundreds of ms to seconds; video moderation in seconds to tens of seconds because of frame sampling. Build your write path to optimistically accept and then async-moderate, with an explicit "under review" state visible to the user. Synchronous blocking moderation on upload feels broken even when it's working.
Cost. Vendor moderation APIs at scale cost real money, especially per-frame on video. Open stack on your own GPUs has different economics: high fixed cost (hardware + ops), low marginal cost per call, much higher cost to operate well. Most platforms past Series A end up with a hybrid: vendor for spiky / low-volume surfaces, self-hosted open stack for the steady high-volume surface.
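
The per-language measurement described under "Multilingual coverage" above can be sketched as follows. This is a minimal illustration under stated assumptions: the row shape (language, model_blocked, truly_violating) and both function names are hypothetical, and the ground-truth label is assumed to come from human review of a sample of your own traffic. The second function works through the arithmetic behind the 99% English / 70% Vietnamese example, assuming for simplicity that the other 60% of traffic is English.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[str, bool, bool]  # (language, model_blocked, truly_violating)


def per_language_error_rates(rows: Iterable[Row]) -> Dict[str, Dict[str, Optional[float]]]:
    """False-positive and false-negative rate for each language separately."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for language, model_blocked, truly_violating in rows:
        c = counts[language]
        if truly_violating:
            c["pos"] += 1
            c["fn"] += 0 if model_blocked else 1  # missed violation
        else:
            c["neg"] += 1
            c["fp"] += 1 if model_blocked else 0  # wrongly blocked
    return {
        lang: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for lang, c in counts.items()
    }


def traffic_weighted(score_by_language: Dict[str, float], traffic_share: Dict[str, float]) -> float:
    """Re-weight per-language scores by your own traffic mix."""
    return sum(score_by_language[lang] * share for lang, share in traffic_share.items())


if __name__ == "__main__":
    # 0.6 * 0.99 + 0.4 * 0.70 = 0.874, well short of a headline 95%.
    print(round(traffic_weighted({"en": 0.99, "vi": 0.70}, {"en": 0.6, "vi": 0.4}), 3))
```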

Common pitfalls

Only trusting English-trained models. A model with great English benchmark numbers can underperform badly on Indonesian, Arabic, or Hindi. If your platform is multilingual, your moderation evaluation set must be multilingual too — sampled from your actual traffic, not translated from English seeds.
No appeal channel at all. Users with no way to contest a takedown either churn, post the same content under a new account, or take the complaint to social media. An appeal queue (HumanLayer-style) is not a nice-to-have; it is the only mechanism that surfaces your false-positive rate.
No false-positive log. "Model said block, human said allow on appeal" is the single most valuable data point you can generate. If those decisions are not written to an immutable, queryable store, you are paying for moderation twice (model + human) and learning nothing.
Classifying every video frame. Running a classifier on 30 frames per second of a 60-second video is 1,800 classifier calls per upload. Sample at 1 frame per 1-2 seconds with scene-change detection, log which frame fired, and accept that you will occasionally miss a transient 5-frame violation. The alternative is bankruptcy.
Putting moderation in the synchronous upload path. Users see a spinner, the upload feels broken, and on any vendor outage your entire write path goes down. Accept the post optimistically with an "under review" state, run moderation async, and unpublish only if the verdict is block.
Treating the moderation policy as a model parameter. Policy decisions ("is satire of public figures allowed", "is medical advice in DMs allowed") belong in a written policy document that the trust & safety team owns. The model implements the policy; it does not define it. When you change the policy, you change the rails (NeMo Colang) and the threshold (router), not the underlying classifier weights.
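
The optimistic-accept pattern from the synchronous-upload pitfall above can be illustrated with a small asyncio sketch. Everything here is hypothetical scaffolding: the Post fields, the state names, and fake_classifier stand in for your own data model and moderation call. The point is only that accept_post returns before any classifier has run, and that a classifier failure leaves the post "under_review" instead of failing the upload.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Tuple

Classifier = Callable[[str], Awaitable[str]]  # assumed to return "allow" or "block"


@dataclass
class Post:
    post_id: str
    text: str
    state: str = "under_review"  # shown to the author while moderation runs
    visible: bool = True         # optimistic accept: published immediately


async def moderate(post: Post, classify: Classifier) -> None:
    try:
        verdict = await classify(post.text)
    except Exception:
        # Classifier or vendor outage: the write path is unaffected and the
        # post stays "under_review" for a retry or a human reviewer.
        return
    if verdict == "block":
        post.visible = False
        post.state = "removed"
    elif verdict == "allow":
        post.state = "published"


def accept_post(post_id: str, text: str, classify: Classifier) -> Tuple[Post, asyncio.Task]:
    """Return immediately; moderation is scheduled in the background."""
    post = Post(post_id, text)
    task = asyncio.create_task(moderate(post, classify))
    return post, task


async def fake_classifier(text: str) -> str:
    await asyncio.sleep(0)  # stands in for a network call
    return "block" if "spam" in text else "allow"


async def demo():
    ok, t1 = accept_post("1", "hello world", fake_classifier)
    bad, t2 = accept_post("2", "buy spam now", fake_classifier)
    before = (ok.state, bad.state)  # captured before any moderation has run
    await asyncio.gather(t1, t2)
    return before, (ok.state, ok.visible), (bad.state, bad.visible)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```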

INSTALL · ONE COMMAND

$ tokrepo install pack/content-moderation-stack

hand it to your agent, or paste it into your terminal

What's included

10 resources ready to install

Skill#01

NeMo Guardrails — Programmable Safety for LLM Applications

NeMo Guardrails is an open-source toolkit by NVIDIA for adding programmable guardrails to LLM-based conversational systems. It provides input/output moderation, fact-checking, hallucination detection, jailbreak prevention, and dialog management via a declarative Colang configuration language.

by Script Depot·183 views

$ tokrepo install nemo-guardrails-programmable-safety-llm-applications-e3c9db87

Skill#02

llm-guard — Secure LLM Inputs & Outputs

Harden LLM apps with a scanner pipeline for prompt injection, PII leakage, toxicity, and unsafe output. Install in minutes and gate requests in code.

by Script Depot·222 views

$ tokrepo install llm-guard-secure-llm-inputs-outputs

Skill#03

Guardrails AI — Validate LLM Outputs in Production

Add validation and guardrails to any LLM output. Guardrails AI checks for hallucination, toxicity, PII leakage, and format compliance with 50+ built-in validators.

by Agent Toolkit·386 views

$ tokrepo install guardrails-ai-validate-llm-outputs-production-ffbad589

Skill#04

Presidio — Detect and Anonymize PII

Detect and anonymize PII in text with Microsoft Presidio, then feed sanitized inputs to LLMs to reduce leakage risk. Works via pip or Docker deployments.

by Script Depot·128 views

$ tokrepo install presidio-detect-and-anonymize-pii

Skill#05

OpenCV — Open-Source Computer Vision Library

The most widely used computer vision library with 2500+ optimized algorithms for image and video analysis, object detection, face recognition, and real-time processing across C++, Python, Java, and more.

by Script Depot·194 views

$ tokrepo install opencv-open-source-computer-vision-library-96185804

Skill#06

FFmpeg — The Universal Multimedia Processing Toolkit

FFmpeg is the most powerful and widely used multimedia processing framework. It can decode, encode, transcode, mux, demux, stream, filter, and play almost every audio and video format ever created. Nearly every media application on earth uses FFmpeg.

by Script Depot·220 views

$ tokrepo install ffmpeg-universal-multimedia-processing-toolkit-353248b1

Skill#07

Whisper — OpenAI Speech-to-Text

OpenAI's open-source speech recognition model. Transcribe audio/video to text with word-level timestamps in 99 languages. Essential for subtitle generation.

by OpenAI·398 views

$ tokrepo install whisper-openai-speech-text-eb0f9dd6

Skill#08

HumanLayer — Approval Loops for Coding Agents

HumanLayer adds human approval and delegation loops around coding agents. Use it when autonomous edits need review, escalation, or team signoff.

by HumanLayer·55 views

$ tokrepo install humanlayer-approval-loops-for-coding-agents

Script#09

doccano — Open-Source Text Annotation Tool for Machine Learning

A web-based annotation platform for creating labeled datasets for NLP tasks including text classification, sequence labeling, and sequence-to-sequence problems.

by Script Depot·66 views

$ tokrepo install doccano-open-source-text-annotation-tool-machine-learning-51dcb118

Skill#10

Discourse — Open Source Community Forum Platform

Discourse is a modern, open-source discussion platform built for civilized community conversations. It replaces traditional forums with real-time updates, dynamic threading, and built-in moderation tools.

by Script Depot·236 views

$ tokrepo install discourse-open-source-community-forum-platform-052f0cf1

Frequently asked questions

Hive vs Sightengine vs OpenAI Moderation — which vendor should I start with for the first-pass filter?

Pick the one whose pricing, latency, and language coverage match your actual product, not the one with the best marketing benchmark. OpenAI Moderation is free with the API and reasonable for English text, with limited coverage of other modalities; it is the right zero-cost starting point if you are already an OpenAI customer and your traffic is mostly text. Sightengine and Hive are paid, image- and video-strong, multilingual to varying degrees, and operationally serious; they earn their keep on platforms where image and video are the primary surface. Perspective API (Jigsaw) is text-only, free, focused on toxicity, and weak on many non-English languages — useful as a second opinion on English comments, not as a sole filter. The defensible pattern is: pick one vendor as a first-pass filter for the surface where they are strongest, and put the open stack from this pack as the layer underneath you actually control. Do not bet the platform on any single vendor.

How does the stack handle Chinese, Arabic, and other non-English content?

Unevenly, and you must measure it. Whisper and Presidio are multilingual by design and perform credibly across many languages, though performance varies — Whisper handles tonal languages and Arabic well; Presidio's PII detection is strongest in languages with formal training data (English, Spanish, French, German, Chinese, Japanese) and weaker for low-resource scripts. The LLM-based rails (NeMo Guardrails, llm-guard) inherit whatever language coverage the underlying frontier model has, which for current frontier models is broad but uneven. The two non-negotiable practices: (1) evaluate per-language false-positive and false-negative rates separately on a sample drawn from your actual traffic, not translated English seeds; (2) when a language underperforms a defensible threshold, route that language to a higher proportion of human review until the model improves, rather than papering over the gap.

Video moderation — how much frame sampling is reasonable?

There is no universal answer; the right number depends on the policy and the video duration. A defensible starting point for general UGC: extract keyframes at 1 frame every 1-2 seconds for the first 30 seconds, then 1 frame every 5 seconds beyond that, with scene-change detection (FFmpeg supports this natively) prioritised over uniform sampling so your samples land on visually distinct frames. Combine this with mandatory analysis of the first 3 seconds (thumbnail and opening) and the last 3 seconds (often where bait-and-switch content lives). For high-risk policies (CSAM, severe violence) sample more densely and accept the cost; for low-risk policies (mild profanity in caption) sample less or skip frames entirely and rely on the text layer. The error you most want to avoid is uniform per-frame classification at 30 fps; that scales the bill faster than it scales the safety.
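
The starting schedule described above can be written down as a small function that returns the timestamps to sample. This is a sketch under assumptions: the function name and parameters are hypothetical, the dense step is set to 2 seconds (the upper end of the 1-2 second range), and the first and last 3 seconds are sampled at one frame per second because the text does not specify a density for them. Scene-change detection is not modelled here.

```python
from typing import List


def sample_timestamps(
    duration_s: float,
    dense_step: float = 2.0,     # one frame every 2 s ...
    dense_window: float = 30.0,  # ... for the first 30 s
    sparse_step: float = 5.0,    # then one frame every 5 s
    edge_window: float = 3.0,    # always cover the first and last 3 s
    edge_step: float = 1.0,      # assumed density inside the edge windows
) -> List[float]:
    times = set()

    def add_range(start: float, stop: float, step: float) -> None:
        t = start
        while t < stop:
            times.add(round(t, 3))
            t += step

    add_range(0.0, min(duration_s, dense_window), dense_step)
    add_range(dense_window, duration_s, sparse_step)
    add_range(0.0, min(edge_window, duration_s), edge_step)
    add_range(max(0.0, duration_s - edge_window), duration_s, edge_step)
    return sorted(times)


if __name__ == "__main__":
    stamps = sample_timestamps(60.0)
    # Compare against classifying every frame of a 60 s clip at 30 fps.
    print(len(stamps), "sampled frames vs", 30 * 60, "total frames")
```

For a 60-second upload this schedule yields 25 classifier calls instead of 1,800. Denser settings for high-risk policies are a matter of changing the step parameters.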

How do I protect minors and handle age-restricted content?

This pack does not attempt to be a complete child safety solution because that problem requires legal, operational, and reporting work that no software stack delivers alone. The infrastructure pieces that apply: classify image and video uploads for age-inappropriate content with a strict threshold (a T_block set low enough that even moderately confident detections are blocked or escalated rather than allowed); maintain perceptual hashes (pHash via OpenCV) of confirmed harmful content so re-uploads are blocked at hash match, before any classifier runs; route any potential CSAM to a dedicated escalation queue with specially trained human reviewers (not the general moderation queue) and a clear handoff to the reporting authority for your jurisdiction (in the US, the NCMEC CyberTipline; equivalent bodies exist in other jurisdictions). Tooling does not replace the program: age-appropriateness determination, regional rules, parental consent, and law enforcement reporting are program work. Build the program first; the stack supports it.
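
The hash-match step can be illustrated independently of how the hash is computed. The sketch below assumes each image has already been reduced to a 64-bit perceptual hash stored as an integer; the function names and the max_distance of 6 bits are hypothetical starting points, not recommended values. A linear scan is fine for illustration, but a large blocklist would need an index.

```python
from typing import Iterable


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def matches_blocklist(image_hash: int, blocklist: Iterable[int], max_distance: int = 6) -> bool:
    """True if the hash is within `max_distance` bits of any known-bad hash.

    A small tolerance lets re-encoded or lightly edited re-uploads still match.
    """
    return any(hamming(image_hash, known) <= max_distance for known in blocklist)


if __name__ == "__main__":
    known_bad = [0xF0F0F0F0F0F0F0F0]
    near_copy = known_bad[0] ^ 0b101            # two bits flipped
    unrelated = known_bad[0] ^ ((1 << 64) - 1)  # every bit flipped
    print(matches_blocklist(near_copy, known_bad), matches_blocklist(unrelated, known_bad))
```

Running this check before any classifier call means a confirmed item is blocked on re-upload without spending classifier budget on it.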

What appeal rate and overturn rate should I expect, and what does each tell me?

There is no universally correct number, but two ratios are useful as health signals to track over time. (1) Appeal rate = appeals filed / decisions issued. A very low rate (well under 1%) often means users do not know they can appeal or that the appeal channel is buried, not that the model is perfect. A very high rate (well into double digits) usually means the model is over-blocking, the policy is unclear to users, or the appeal form is being used as a help channel. (2) Overturn rate = appeals resulting in reversal / appeals decided. A high overturn rate means the model is too aggressive at the current threshold and you should retrain on the overturned set or raise T_block. A very low overturn rate sometimes means the model is well-calibrated, and sometimes means the human reviewers are anchoring on the model's verdict; spot-check by having a second reviewer score a random sample blind to the first decision. Track both ratios per-language, per-policy, and per-surface — aggregates hide the disparities that actually matter.
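
Both ratios defined above are easy to compute from the decision log. This is a minimal sketch assuming each logged decision is a dict with a grouping field such as language, a boolean appealed, and an appeal_outcome that is None while pending and otherwise "upheld" or "overturned"; these field values and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional


def appeal_health(decisions: Iterable[dict], group_key: str = "language") -> Dict[str, Dict[str, Optional[float]]]:
    """Appeal rate and overturn rate per group (language, policy, or surface)."""
    totals = defaultdict(lambda: {"decisions": 0, "appeals": 0, "decided": 0, "overturned": 0})
    for d in decisions:
        t = totals[d[group_key]]
        t["decisions"] += 1
        if d["appealed"]:
            t["appeals"] += 1
            outcome = d.get("appeal_outcome")  # None while the appeal is pending
            if outcome is not None:
                t["decided"] += 1
                if outcome == "overturned":
                    t["overturned"] += 1
    return {
        key: {
            # appeals filed / decisions issued
            "appeal_rate": t["appeals"] / t["decisions"],
            # appeals resulting in reversal / appeals decided
            "overturn_rate": t["overturned"] / t["decided"] if t["decided"] else None,
        }
        for key, t in totals.items()
    }
```

Passing group_key="policy" or group_key="surface" gives the other two breakdowns, provided those fields are present in the log.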
