Content Moderation: The Moderation Stack for UGC Platforms
Ten picks for the trust & safety lead of a community, forum, short-video app, or marketplace with real user-generated content. Five layers in order: text → image → video → audio → human-in-the-loop with appeals and false-positive feedback. Open infrastructure before vendor lock-in. Multilingual by default. The model proposes; a human still decides the borderline cases; every decision (and every appeal) is logged so you can retrain a model on your own mistakes next quarter.
What's in this pack
This is the rig the trust & safety lead at a UGC platform (community, forum, short-video app, marketplace, comments system, anywhere users post) would actually assemble in 2026. It is explicitly not a wrapper around one moderation API. Single-vendor moderation looks fine on slide one and fails on slide eight: per-call pricing breaks at scale, English-only models miss much of the harm on a multilingual platform, vendors rarely publish false-positive rates you can verify, and your appeal queue ends up as a Google Form on a Notion page.
The stack is organized in five deliberate layers because that is how the work actually flows when a user uploads a 30-second video with a caption, a thumbnail, and background audio:
- Text layer — captions, comments, titles, bios, DMs. Highest volume, lowest cost per check, fastest false-positive feedback loop.
- Image layer — thumbnails, profile photos, message attachments. Lower volume than text, much higher per-item risk.
- Video layer — by far the most expensive per item. You do not classify every frame; you sample.
- Audio layer — background music, voiceover, voice messages. Almost always reduced to text first, then handed to the text layer.
- Appeal + feedback layer — the human-in-the-loop queue, the dashboard the moderator opens at 9am, and the pipeline that turns every overturned decision into a labelled retraining example.
Three principles run through every pick:
- Open infrastructure first. Every layer has a self-hostable option. Vendor APIs are a fine first line for English text, but they cannot become the only line — multilingual coverage, pricing risk, and the inability to inspect their decisions all break at scale.
- The model proposes; a human decides borderline calls. A confidence threshold splits traffic into auto-allow / auto-block / human-review queues. The middle queue is where your trust & safety team earns its salary.
- Log every decision and every appeal. False-positive rate is the metric you cannot fake. Without an immutable record of "model said block, human said allow on appeal", you cannot improve the model and you cannot defend a decision to a regulator.
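The second principle can be sketched as a small routing function. This is a minimal illustration, not a prescribed implementation: the names `Thresholds`, `route`, `COMMENTS`, and `KIDS_FEED`, and every numeric cut-off, are assumptions made up for the example. The only part taken from the text is the three-way split and the rule that appeals always reach a human.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_ALLOW = "auto_allow"
    AUTO_BLOCK = "auto_block"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class Thresholds:
    """Per-policy, per-surface cut-offs; the values used below are placeholders."""
    t_allow: float  # below this, publish without review
    t_block: float  # above this, block without review

    def __post_init__(self) -> None:
        if not 0.0 <= self.t_allow <= self.t_block <= 1.0:
            raise ValueError("need 0 <= t_allow <= t_block <= 1")


def route(confidence: float, thresholds: Thresholds, is_appeal: bool = False) -> Route:
    """Map a 0-1 violation confidence to a queue. Appeals always go to a human."""
    if is_appeal:
        return Route.HUMAN_REVIEW
    if confidence > thresholds.t_block:
        return Route.AUTO_BLOCK
    if confidence < thresholds.t_allow:
        return Route.AUTO_ALLOW
    return Route.HUMAN_REVIEW


# Hypothetical settings: the stricter surface blocks earlier and auto-allows less.
COMMENTS = Thresholds(t_allow=0.20, t_block=0.95)
KIDS_FEED = Thresholds(t_allow=0.05, t_block=0.60)
```

Keeping the thresholds in data rather than in the classifier means the same score can produce different outcomes on different surfaces without touching the model.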
No accuracy percentages are quoted in this pack. Any vendor or open-source project that publishes a single number — "99.2% accurate!" — is selling you a benchmark, not a moderation system. Accuracy on your content, in your languages, against your policy is the only number that matters, and you have to measure it yourself.
Install in this order (text → image → video → audio → appeal + false-positive analysis)
- NeMo Guardrails — Programmable Safety for LLM Applications (id: 4258) — start the text layer here. NVIDIA's open framework lets you compose input rails, output rails, and topic rails declaratively (in Colang), and call any OpenAI-compatible LLM as the classifier. For UGC text moderation specifically, the value is that the same rail definitions cover user-generated content, your own outbound copy, and your bot's replies. One policy, three surfaces.
- llm-guard — Secure LLM Inputs & Outputs (id: 3103) — a sibling text-layer scanner that runs as a Python library and ships with pre-built scanners for toxicity, prompt injection, PII, and banned-topic detection. Pair it with NeMo: NeMo is the orchestrator and policy language, llm-guard is the deep bench of pre-built detectors you point each rail at. Two libraries together cover both "how do I define a moderation policy" and "what specific classifiers do I run inside it".
- Guardrails AI — Validate LLM Outputs in Production (id: 773) — schema-validation for any AI-generated artifact you produce yourself (auto-generated descriptions, moderator-suggestion text, AI-written safety responses to users). Keeps your own AI from inventing policy categories or recommending actions that don't exist. The moderation system itself needs guardrails before you can ship one.
- Presidio — Detect and Anonymize PII (id: 3106) — the PII detection and redaction layer. UGC platforms see phone numbers, emails, addresses, and government IDs leak constantly. Presidio runs locally, ships with English recognizers by default and can be configured for additional languages, and gives you both detection ("this comment contains a phone number") and anonymisation ("replace with <PHONE> before sending to any downstream service"). Wire it in before anything user-generated reaches an external moderation API.
- OpenCV — Open-Source Computer Vision Library (id: 1828) — the image layer workhorse. Even when you eventually call a cloud vision API for NSFW or violence classification, OpenCV is what you use to preprocess: resize to model input, strip EXIF, downsample, and compute perceptual hashes (pHash, via the img_hash module in opencv-contrib) so duplicate-image abuse gets caught once and blocked forever. Cheap, local, fast.
- FFmpeg — The Universal Multimedia Processing Toolkit (id: 1157) — the video layer foundation. You do not run a classifier on every frame of every video — that is a budget bonfire. You use FFmpeg to extract keyframes (every 1-2 seconds is a defensible default), separate the audio track for the next step, and downscale to whatever resolution your image classifier actually needs. FFmpeg is also where you run scene-change detection so your samples land on visually distinct frames, not 24 near-identical copies.
- Whisper — OpenAI Speech-to-Text (id: 105) — the audio layer. Background music, voiceover, voice messages, livestream audio — almost all audio moderation reduces to "transcribe it, then run the text layer". Whisper runs locally, handles many languages out of the box, and is the open default. For real-time / livestream audio, the same model lineage runs in faster server-side variants.
- HumanLayer — Approval Loops for Coding Agents (id: 3036) — the human-in-the-loop appeal layer. Originally built for agent approval flows, the same primitive ("pause, route to a human, resume on decision") is exactly the queue model a moderator needs. The model classifies, anything above the auto-block threshold gets blocked, anything below auto-allow gets allowed, and everything in the middle band — plus every user appeal — lands in HumanLayer's queue with full context.
- doccano — Open-Source Text Annotation Tool for Machine Learning (id: 4763) — the false-positive analysis layer. Every overturned moderation decision (model said block, human said allow on appeal) is a labelled training example you already paid for. doccano gives the trust & safety team a clean labelling UI, exports to standard formats, and feeds your next fine-tuning round. The platforms that improve over time are the ones that turn their own appeal queue into training data.
- Discourse — Open Source Community Forum Platform (id: 1745) — the UGC platform itself, if you don't already have one. Worth including because Discourse ships with a mature moderation surface (flagging, trust levels, post review queues, user silence/suspension, audit log) that has been hardened by years of real community use. Even if you don't run Discourse in production, read its moderation admin pages — the data model and the workflow are a strong reference for what you should be building.
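The FFmpeg steps described in the list (fixed-interval frame sampling, scene-change sampling, and separating the audio track for Whisper) can be sketched as three command builders. This is an illustrative sketch: the function names, output patterns, the 512-pixel width, and the 0.3 scene threshold are assumptions, not values from this pack. The filters used (`fps`, `scale`, `select` with the `scene` score) are standard FFmpeg filters. The functions only build argument lists; nothing is executed.

```python
from typing import List


def frame_sampling_cmd(src: str, out_pattern: str, every_s: int = 2,
                       width: int = 512) -> List[str]:
    """ffmpeg argv that writes one downscaled frame every `every_s` seconds."""
    if every_s < 1:
        raise ValueError("every_s must be a whole number of seconds >= 1")
    vf = f"fps=1/{every_s},scale={width}:-2"  # -2 keeps aspect ratio, even height
    return ["ffmpeg", "-hide_banner", "-i", src, "-vf", vf, "-q:v", "3", out_pattern]


def scene_change_cmd(src: str, out_pattern: str, threshold: float = 0.3,
                     width: int = 512) -> List[str]:
    """ffmpeg argv that keeps only frames whose scene-change score exceeds threshold."""
    vf = f"select='gt(scene,{threshold})',scale={width}:-2"
    # -fps_mode vfr drops the unselected frames; older FFmpeg builds use "-vsync vfr".
    return ["ffmpeg", "-hide_banner", "-i", src, "-vf", vf,
            "-fps_mode", "vfr", "-q:v", "3", out_pattern]


def audio_extract_cmd(src: str, out_wav: str) -> List[str]:
    """ffmpeg argv that drops video and writes 16 kHz mono audio for transcription."""
    return ["ffmpeg", "-hide_banner", "-i", src, "-vn", "-ac", "1", "-ar", "16000", out_wav]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Building argv lists rather than shell strings avoids quoting problems with user-supplied filenames.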
How they fit together
User upload (caption + image / video + audio)
          │
          ▼
┌────────────────────────────────────────────────────────┐
│ Presidio (id 3106) — PII redaction on caption + bio    │
└────────────────────────────────────────────────────────┘
          │
┌─────────┴─────────┬───────────────────┬─────────────────┐
▼                   ▼                   ▼                 ▼
TEXT LAYER          IMAGE LAYER         VIDEO LAYER       AUDIO LAYER
NeMo Guardrails     OpenCV              FFmpeg            FFmpeg → demux
(id 4258)           (id 1828)           (id 1157)             │
llm-guard           + cloud vision      sample                ▼
(id 3103)           for NSFW/           keyframes         Whisper
 ⤷ confidence       violence            every 1–2 s       (id 105)
   0–1 score                                │                 │
                                            ▼                 ▼
                                        (frames → image   (text → text
                                         layer)            layer)
          │
          ▼  classification + confidence
┌────────────────────────────────────────────────────────┐
│ Threshold router                                       │
│   confidence > T_block → auto-block                    │
│   confidence < T_allow → auto-allow                    │
│   in between OR user appeal → HumanLayer queue (3036)  │
└────────────────────────────────────────────────────────┘
          │
          ▼
Moderator decision (allow / block / escalate)
          │
          ▼
┌────────────────────────────────────────────────────────┐
│ Append-only decision log                               │
│   {item_id, model_label, confidence, human_label,      │
│    appeal_outcome, moderator_id, timestamp, language}  │
└────────────────────────────────────────────────────────┘
          │
          ▼
doccano (id 4763) — label every overturned decision
          │
          ▼
Next-quarter retraining set
The most important arrow is the loop at the bottom: appeal outcome → doccano → retraining. A moderation stack that doesn't close that loop drifts; one that does compounds.
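That loop can be sketched with the decision-log fields from the diagram. The record fields come from the diagram; everything else here is an assumption for illustration: the JSON Lines storage, the function names, the label strings ("allow", "block", "overturned"), and the shape of the rows handed to the annotation tool. The sketch uses an in-memory buffer where a real system would use a file opened in append mode or an append-only table.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator, Optional, TextIO


@dataclass(frozen=True)
class Decision:
    item_id: str
    model_label: str               # "allow" or "block"
    confidence: float
    human_label: Optional[str]     # None if no moderator reviewed it
    appeal_outcome: Optional[str]  # "upheld", "overturned", or None
    moderator_id: Optional[str]
    timestamp: str                 # ISO 8601, UTC
    language: str


def append_decision(log: TextIO, decision: Decision) -> None:
    """Write one JSON line; existing lines are never rewritten."""
    log.write(json.dumps(asdict(decision), ensure_ascii=False) + "\n")


def read_decisions(log: Iterable[str]) -> Iterator[Decision]:
    for line in log:
        if line.strip():
            yield Decision(**json.loads(line))


def is_false_positive(d: Decision) -> bool:
    """Model blocked, but a human allowed it on review or appeal."""
    return d.model_label == "block" and (
        d.human_label == "allow" or d.appeal_outcome == "overturned"
    )


def labelling_rows(decisions: Iterable[Decision]) -> Iterator[dict]:
    """Rows to import into the annotation tool; field names are illustrative."""
    for d in decisions:
        if is_false_positive(d):
            yield {"item_id": d.item_id, "language": d.language,
                   "model_label": d.model_label, "label": "allow"}


# Demo with made-up records.
log = io.StringIO()
append_decision(log, Decision("a1", "block", 0.97, None, "overturned", "mod7",
                              "2026-01-05T09:00:00Z", "vi"))
append_decision(log, Decision("a2", "block", 0.99, "block", None, "mod7",
                              "2026-01-05T09:01:00Z", "en"))
append_decision(log, Decision("a3", "allow", 0.02, None, None, None,
                              "2026-01-05T09:02:00Z", "en"))
log.seek(0)
rows = list(labelling_rows(read_decisions(log)))
```

Only `a1` ends up in `rows`: it is the one record where the model blocked and the appeal was overturned.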
Tradeoffs you'll hit
- False positive vs false negative. Every moderation system trades these off, and the right trade-off depends on what's being moderated. A children's product tolerates more false positives (over-block) to avoid any false negative. An adult-content marketplace tolerates more false negatives in exchange for fewer creator complaints. The threshold router in the diagram above is where you tune this, per-policy, per-surface — not in the model.
- Vendor moderation API vs open stack. Cloud vendors (OpenAI Moderation, Perspective, Hive, Sightengine, Clarifai, Amazon Rekognition, Azure Content Safety) ship faster, cover English well, and abstract operations. They also charge per call, refuse to publish per-language error rates, and lock your appeal evidence behind their dashboard. The pragmatic pattern most platforms land on: vendor API as the first-pass filter for English text, open stack (this pack) as the defensible layer underneath that you control. Run both.
- Multilingual coverage. Most commercial moderation APIs are quietly English-first — performance on Spanish, French, Portuguese is acceptable; on Chinese, Arabic, Hindi, Vietnamese, Tagalog, and the long tail it varies wildly and is rarely disclosed. Whisper and Presidio are multilingual by design; LLM-based rails (NeMo, llm-guard with an LLM-backed scanner) inherit whatever language coverage the underlying model has, which for frontier models in 2026 is broad but still uneven. Measure per-language false-positive and false-negative rates separately. A 95% overall score that is 99% in English and 70% in Vietnamese is not a 95% system on a platform where 40% of users post in Vietnamese.
- Latency budget. Text moderation runs in tens to low hundreds of ms; image moderation in hundreds of ms to seconds; video moderation in seconds to tens of seconds because of frame sampling. Build your write path to optimistically accept and then async-moderate, with an explicit "under review" state visible to the user. Synchronous blocking moderation on upload feels broken even when it's working.
- Cost. Vendor moderation APIs at scale cost real money, especially per-frame on video. Open stack on your own GPUs has different economics: high fixed cost (hardware + ops), low marginal cost per call, much higher cost to operate well. Most platforms past Series A end up with a hybrid: vendor for spiky / low-volume surfaces, self-hosted open stack for the steady high-volume surface.
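The per-language measurement from the multilingual tradeoff can be sketched as follows. This is a minimal example with invented data: `Sample`, `rates_by_language`, and the counts are hypothetical, and the ground-truth labels are assumed to come from human review of a sample of your own traffic.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    language: str
    predicted_block: bool  # what the model decided
    should_block: bool     # ground truth from a human reviewer


def rates_by_language(samples: Iterable[Sample]) -> Dict[str, Dict[str, float]]:
    """False-positive and false-negative rate per language."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for s in samples:
        c = counts[s.language]
        if s.should_block:
            c["pos"] += 1
            c["fn"] += not s.predicted_block  # violating item the model let through
        else:
            c["neg"] += 1
            c["fp"] += s.predicted_block      # benign item the model blocked
    return {
        lang: {
            "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else float("nan"),
            "false_negative_rate": c["fn"] / c["pos"] if c["pos"] else float("nan"),
        }
        for lang, c in counts.items()
    }


# Invented evaluation set: the model is much weaker on the second language.
samples = (
    [Sample("en", False, False)] * 9 + [Sample("en", True, False)] * 1
    + [Sample("en", True, True)] * 10
    + [Sample("vi", False, False)] * 6 + [Sample("vi", True, False)] * 4
    + [Sample("vi", True, True)] * 5 + [Sample("vi", False, True)] * 5
)
report = rates_by_language(samples)
```

Reporting the two rates separately per language makes the gap visible; a single pooled accuracy figure over the same samples would average it away.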
Common pitfalls
- Only trusting English-trained models. A model with great English benchmark numbers can underperform badly on Indonesian, Arabic, or Hindi. If your platform is multilingual, your moderation evaluation set must be multilingual too — sampled from your actual traffic, not translated from English seeds.
- No appeal channel at all. Users with no way to contest a takedown either churn, post the same content under a new account, or take the complaint to social media. An appeal queue (HumanLayer-style) is not a nice-to-have; it is the only mechanism that surfaces your false-positive rate.
- No false-positive log. "Model said block, human said allow on appeal" is the single most valuable data point you can generate. If those decisions are not written to an immutable, queryable store, you are paying for moderation twice (model + human) and learning nothing.
- Classifying every video frame. Running a classifier on 30 frames per second of a 60-second video is 1,800 classifier calls per upload. Sample at 1 frame per 1-2 seconds with scene-change detection, log which frame fired, and accept that you will occasionally miss a transient 5-frame violation. The alternative is bankruptcy.
- Putting moderation in the synchronous upload path. Users see a spinner, the upload feels broken, and on any vendor outage your entire write path goes down. Accept the post optimistically with an "under review" state, run moderation async, and unpublish only if the verdict is block.
- Treating the moderation policy as a model parameter. Policy decisions ("is satire of public figures allowed", "is medical advice in DMs allowed") belong in a written policy document that the trust & safety team owns. The model implements the policy; it does not define it. When you change the policy, you change the rails (NeMo Colang) and the threshold (router), not the underlying classifier weights.
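The asynchronous write path from the pitfalls above can be sketched with `asyncio`. This is one possible reading, under stated assumptions: the three states, the names `Post`, `moderate`, and `handle_upload`, the 5-second timeout, and the choice to leave a post under review when the classifier is unreachable are all illustrative. The classifier is passed in as a plain async callable so the sketch does not depend on any vendor API.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import Awaitable, Callable


class State(Enum):
    UNDER_REVIEW = "under_review"
    PUBLISHED = "published"
    REMOVED = "removed"


@dataclass
class Post:
    post_id: str
    text: str
    state: State = State.UNDER_REVIEW


Classifier = Callable[[str], Awaitable[str]]  # resolves to "allow" or "block"


async def moderate(post: Post, classify: Classifier, timeout_s: float = 5.0) -> None:
    """Runs after the write was acknowledged, never on the request path."""
    try:
        verdict = await asyncio.wait_for(classify(post.text), timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return  # classifier outage: post stays UNDER_REVIEW for a retry or a human
    post.state = State.REMOVED if verdict == "block" else State.PUBLISHED


async def handle_upload(post: Post, classify: Classifier) -> "asyncio.Task[None]":
    """Acknowledge immediately, then moderate in the background."""
    # A real handler would persist `post` here before returning to the user.
    return asyncio.create_task(moderate(post, classify))
```

Because the upload handler returns before the classifier is called, a vendor outage delays the verdict instead of failing the write.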
Frequently asked questions
Hive vs Sightengine vs OpenAI Moderation — which vendor should I start with for the first-pass filter?
Pick the one whose pricing, latency, and language coverage match your actual product, not the one with the best marketing benchmark. OpenAI Moderation is free with the API and reasonable for English text, with limited coverage of other modalities; it is the right zero-cost starting point if you are already an OpenAI customer and your traffic is mostly text. Sightengine and Hive are paid, image- and video-strong, multilingual to varying degrees, and operationally serious; they earn their keep on platforms where image and video are the primary surface. Perspective API (Jigsaw) is text-only, free, focused on toxicity, and weak on many non-English languages — useful as a second opinion on English comments, not as a sole filter. The defensible pattern is: pick one vendor as a first-pass filter for the surface where they are strongest, and put the open stack from this pack as the layer underneath you actually control. Do not bet the platform on any single vendor.
How does the stack handle Chinese, Arabic, and other non-English content?
Unevenly, and you must measure it. Whisper and Presidio are multilingual by design and perform credibly across many languages, though performance varies — Whisper handles tonal languages and Arabic well; Presidio's PII detection is strongest in languages with formal training data (English, Spanish, French, German, Chinese, Japanese) and weaker for low-resource scripts. The LLM-based rails (NeMo Guardrails, llm-guard) inherit whatever language coverage the underlying frontier model has, which for current frontier models is broad but uneven. The two non-negotiable practices: (1) evaluate per-language false-positive and false-negative rates separately on a sample drawn from your actual traffic, not translated English seeds; (2) when a language underperforms a defensible threshold, route that language to a higher proportion of human review until the model improves, rather than papering over the gap.
Video moderation — how much frame sampling is reasonable?
There is no universal answer; the right number depends on the policy and the video duration. A defensible starting point for general UGC: extract keyframes at 1 frame every 1-2 seconds for the first 30 seconds, then 1 frame every 5 seconds beyond that, with scene-change detection (FFmpeg supports this natively) prioritised over uniform sampling so your samples land on visually distinct frames. Combine this with mandatory analysis of the first 3 seconds (thumbnail and opening) and the last 3 seconds (often where bait-and-switch content lives). For high-risk policies (CSAM, severe violence) sample more densely and accept the cost; for low-risk policies (mild profanity in caption) sample less or skip frames entirely and rely on the text layer. The error you most want to avoid is uniform per-frame classification at 30 fps; that scales the bill faster than it scales the safety.
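The starting schedule described above can be written down as a timestamp generator. This sketch covers only the uniform fallback (dense for the first 30 seconds, sparser afterwards, plus the opening and closing 3 seconds); scene-change detection would be layered on top and is not modelled here. The function name, parameter names, and the 1-second step inside the opening and closing windows are assumptions.

```python
from typing import List, Set


def sample_timestamps(duration_s: float, dense_step_s: float = 2.0,
                      dense_until_s: float = 30.0, sparse_step_s: float = 5.0,
                      edge_s: float = 3.0, edge_step_s: float = 1.0) -> List[float]:
    """Timestamps (in seconds) at which to grab a frame for classification."""
    if duration_s <= 0:
        return []
    picks: Set[float] = set()

    def add_range(start: float, stop: float, step: float) -> None:
        t = start
        while t < stop:  # stop is exclusive
            if 0 <= t < duration_s:
                picks.add(round(t, 3))
            t += step

    add_range(0.0, min(dense_until_s, duration_s), dense_step_s)        # first 30 s
    add_range(dense_until_s, duration_s, sparse_step_s)                 # remainder
    add_range(0.0, min(edge_s, duration_s), edge_step_s)                # opening
    add_range(max(0.0, duration_s - edge_s), duration_s, edge_step_s)   # ending
    return sorted(picks)
```

With these defaults a 60-second clip yields 25 timestamps, against 1,800 frames for per-frame classification at 30 fps. For high-risk policies, lower `dense_step_s` and `sparse_step_s` rather than switching to every frame.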
How do I protect minors and handle age-restricted content?
This pack does not attempt to be a complete child safety solution because that problem requires legal, operational, and reporting work that no software stack delivers alone. The infrastructure pieces that apply: classify image and video uploads for age-inappropriate content with a strict threshold (much closer to T_block than to T_allow); maintain perceptual hashes (pHash via OpenCV) of confirmed harmful content so re-uploads are blocked at hash match, before any classifier runs; route any potential CSAM to a dedicated escalation queue with specially trained human reviewers (not the general moderation queue) and a clear handoff to the legal jurisdiction's reporting authority (in the US, NCMEC CyberTipline; equivalent bodies exist in other jurisdictions). Tooling does not replace the program — appropriate-age determination, regional rules, parental consent, and law enforcement reporting are program work. Build the program first; the stack supports it.
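The hash-match step can be illustrated with a Hamming-distance lookup. This sketch assumes 64-bit perceptual hashes have already been computed upstream and are stored as integers; the function names and the 6-bit tolerance are hypothetical, and the linear scan is only suitable for small blocklists.

```python
from typing import Iterable, Optional


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two perceptual hashes."""
    return bin(a ^ b).count("1")


def match_blocklist(candidate: int, blocklist: Iterable[int],
                    max_distance: int = 6) -> Optional[int]:
    """Return the closest blocked hash within max_distance bits, else None.

    Callers should test the result with `is not None`, since 0 is a valid hash.
    """
    best: Optional[int] = None
    best_distance = max_distance + 1
    for known in blocklist:
        distance = hamming(candidate, known)
        if distance < best_distance:
            best, best_distance = known, distance
    return best
```

A small distance tolerance is what lets a re-encoded or lightly cropped re-upload still match; the right tolerance depends on the hash in use and has to be tuned against your own data.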
What appeal rate and overturn rate should I expect, and what does each tell me?
There is no universally correct number, but two ratios are useful as health signals to track over time. (1) Appeal rate = appeals filed / decisions issued. A very low rate (well under 1%) often means users do not know they can appeal or that the appeal channel is buried, not that the model is perfect. A very high rate (well into double digits) usually means the model is over-blocking, the policy is unclear to users, or the appeal form is being used as a help channel. (2) Overturn rate = appeals resulting in reversal / appeals decided. A high overturn rate means the model is too aggressive at the current threshold and you should retrain on the overturned set or raise T_block. A very low overturn rate sometimes means the model is well-calibrated, and sometimes means the human reviewers are anchoring on the model's verdict; spot-check by having a second reviewer score a random sample blind to the first decision. Track both ratios per-language, per-policy, and per-surface — aggregates hide the disparities that actually matter.
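Both ratios, broken down by language, policy, and surface, can be computed as below. The `Event` record, its field names, and the demo counts are invented for illustration; the two formulas are the ones defined in the answer above.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    language: str
    policy: str
    surface: str
    appealed: bool
    reversed_on_appeal: Optional[bool]  # None if not appealed or not yet decided


Segment = Tuple[str, str, str]


def appeal_health(events: Iterable[Event]) -> Dict[Segment, Dict[str, Optional[float]]]:
    """Appeal rate and overturn rate per (language, policy, surface)."""
    decisions: Counter = Counter()
    appeals: Counter = Counter()
    decided: Counter = Counter()
    overturned: Counter = Counter()
    for e in events:
        seg = (e.language, e.policy, e.surface)
        decisions[seg] += 1
        if e.appealed:
            appeals[seg] += 1
            if e.reversed_on_appeal is not None:
                decided[seg] += 1
                overturned[seg] += e.reversed_on_appeal
    return {
        seg: {
            "appeal_rate": appeals[seg] / total,
            "overturn_rate": overturned[seg] / decided[seg] if decided[seg] else None,
        }
        for seg, total in decisions.items()
    }


# Invented counts for two segments.
en = ("en", "spam", "comments")
vi = ("vi", "spam", "comments")
events = (
    [Event(*en, False, None)] * 96 + [Event(*en, True, None)] * 1
    + [Event(*en, True, True)] * 1 + [Event(*en, True, False)] * 2
    + [Event(*vi, False, None)] * 14
    + [Event(*vi, True, True)] * 3 + [Event(*vi, True, False)] * 3
)
health = appeal_health(events)
```

In this made-up data the pooled numbers would look unremarkable, while the second segment shows a 30% appeal rate and half of its appeals overturned.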