Apache NiFi — Visual Dataflow Automation & Integration Platform
Apache NiFi is a powerful dataflow management system that lets you design, control, and monitor data pipelines through a drag-and-drop web interface. Built for enterprise data routing, transformation, and system mediation with provenance tracking and guaranteed delivery.
Installation with prior review
This asset requires review. The copied prompt requests a dry run, shows the writes it would make, and continues only after confirmation.
npx -y tokrepo@latest install 45f70684-39ec-11f1-9bc6-00163e2b0d79 --target codex
Do a dry run first, confirm the writes, then run this command.
What it is
Apache NiFi was originally developed by the NSA and donated to the Apache Software Foundation. It automates the movement of data between disparate systems through a visual, flow-based programming interface. NiFi excels at complex enterprise integration scenarios where data provenance, backpressure, and guaranteed delivery are non-negotiable.
NiFi provides a web-based drag-and-drop interface for designing dataflow pipelines, routes and transforms data between hundreds of source and destination systems, and tracks full data provenance from origin to destination.
How it saves time or tokens
NiFi removes much of the need to write custom ETL code for data integration tasks. Instead of coding data pipelines in Python or Java, you drag processors onto a canvas, connect them, and configure routing rules through the UI. Changes to a flow take effect without restarting NiFi. Backpressure means downstream slowdowns are handled automatically, and data provenance tracking provides an audit trail for compliance.
How to use
- Download and start NiFi:
wget https://downloads.apache.org/nifi/2.1.0/nifi-2.1.0-bin.zip
unzip nifi-2.1.0-bin.zip && cd nifi-2.1.0
./bin/nifi.sh start
# Access UI at https://localhost:8443/nifi
# On first start, a generated username and password are written to logs/nifi-app.log
- Create your first dataflow by dragging processors onto the canvas.
- Common pipeline pattern:
GetFile -> SplitText -> EvaluateJsonPath -> PutDatabaseRecord
# Reads files, splits into records, extracts fields, writes to database
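The pattern above can be sketched in plain Python to show what each stage contributes. This is only an analogy under stated assumptions: it does not use any NiFi API, the function names are hypothetical, the file content is passed in as a string in place of GetFile, the JSON fields (id, user.name) are invented sample data, and SQLite stands in for the target database.

```python
import json
import sqlite3


def split_text(content: str) -> list[str]:
    """SplitText analog: one record per non-empty line."""
    return [line for line in content.splitlines() if line.strip()]


def evaluate_json_path(record: str) -> dict:
    """EvaluateJsonPath analog: extract two hypothetical fields from a JSON line."""
    doc = json.loads(record)
    return {"id": doc["id"], "name": doc["user"]["name"]}


def put_database_record(conn: sqlite3.Connection, rows: list[dict]) -> int:
    """PutDatabaseRecord analog: insert the extracted fields into a table."""
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users (id, name) VALUES (:id, :name)", rows)
    conn.commit()
    return len(rows)


def run_pipeline(file_content: str, conn: sqlite3.Connection) -> int:
    """Chain the stages in the same order as the flow; returns rows written."""
    records = split_text(file_content)
    rows = [evaluate_json_path(record) for record in records]
    return put_database_record(conn, rows)


if __name__ == "__main__":
    # Invented sample input standing in for a file picked up by GetFile.
    sample = '{"id": 1, "user": {"name": "ada"}}\n{"id": 2, "user": {"name": "lin"}}\n'
    connection = sqlite3.connect(":memory:")
    written = run_pipeline(sample, connection)
    print(written, connection.execute("SELECT id, name FROM users ORDER BY id").fetchall())
```

In NiFi each stage would be a configured processor and the hand-off between stages would be a queued connection rather than a function call.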
Example
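A simplified, illustrative XML sketch of a NiFi flow that fetches and transforms API data (this layout is for explanation only and is not NiFi's actual flow definition format):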
<!-- NiFi flow snippet: API to Database -->
<processors>
<processor>
<name>Fetch API Data</name>
<type>InvokeHTTP</type>
<config>
<property name="HTTP Method">GET</property>
<property name="Remote URL">https://api.example.com/data</property>
<property name="Schedule">5 min</property>
</config>
</processor>
<processor>
<name>Transform JSON</name>
<type>JoltTransformJSON</type>
</processor>
<processor>
<name>Write to PostgreSQL</name>
<type>PutDatabaseRecord</type>
</processor>
</processors>
Related on TokRepo
- Automation tools — More data pipeline and automation tools on TokRepo.
- Database tools — Browse database integration tools.
Common pitfalls
- Running NiFi with default heap settings can cause OutOfMemoryError under load. Raise the -Xms and -Xmx JVM arguments (the java.arg.* entries) in conf/bootstrap.conf to match your data volume.
- Leaving backpressure thresholds at values that do not fit your workload can let queues grow until memory or disk runs short. Review the object-count and data-size thresholds on every connection (the defaults are 10,000 FlowFiles and 1 GB).
- NiFi's default single-user authentication is not suitable for production. Configure LDAP, OpenID Connect, or client certificate authentication before deploying.
Frequently asked questions
NiFi tracks every event that happens to every piece of data (FlowFile): creation, modification, routing, cloning, and delivery. You can trace any byte from its origin to its final destination, which is critical for compliance and debugging.
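The idea of an event trail per FlowFile can be illustrated with a toy model. This is a minimal sketch, not NiFi's provenance repository or API: the classes ToyFlowFile, ProvenanceEvent, and ProvenanceLog are hypothetical, and the event type strings are used here only as labels.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_next_id = count(1)


@dataclass
class ToyFlowFile:
    """Hypothetical stand-in for a FlowFile: some content plus a unique id."""
    content: bytes
    id: int = field(default_factory=lambda: next(_next_id))


@dataclass(frozen=True)
class ProvenanceEvent:
    flowfile_id: int
    event_type: str              # label such as "CREATE", "CLONE", "SEND"
    component: str               # name of the step that produced the event
    parent_id: Optional[int] = None


class ProvenanceLog:
    """Append-only event list with a parent map for tracing clones."""

    def __init__(self) -> None:
        self.events: list[ProvenanceEvent] = []
        self.parents: dict[int, int] = {}

    def record(self, ff: ToyFlowFile, event_type: str, component: str,
               parent: Optional[ToyFlowFile] = None) -> None:
        parent_id = parent.id if parent is not None else None
        if parent_id is not None:
            self.parents[ff.id] = parent_id
        self.events.append(ProvenanceEvent(ff.id, event_type, component, parent_id))

    def lineage(self, flowfile_id: int) -> list[ProvenanceEvent]:
        """Events of the FlowFile and all of its ancestors, in recorded order."""
        ids = set()
        current: Optional[int] = flowfile_id
        while current is not None:
            ids.add(current)
            current = self.parents.get(current)
        return [event for event in self.events if event.flowfile_id in ids]


if __name__ == "__main__":
    log = ProvenanceLog()
    original = ToyFlowFile(b"raw")
    log.record(original, "CREATE", "GetFile")
    original.content = original.content.upper()
    log.record(original, "CONTENT_MODIFIED", "Transform")
    copy = ToyFlowFile(original.content)
    log.record(copy, "CLONE", "Route", parent=original)
    log.record(copy, "SEND", "PutDatabaseRecord")
    for event in log.lineage(copy.id):
        print(event.flowfile_id, event.event_type, event.component)
```

Tracing the delivered copy walks back through the clone to the original's creation, which is the kind of origin-to-destination question the provenance view answers.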
Each connection between processors has configurable thresholds for object count and data size. When a downstream processor falls behind and a queue reaches either threshold, NiFi stops scheduling the upstream processor until the queue drains, which keeps queues from growing without bound.
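The two thresholds can be modeled with a small bounded queue. This is a simplified sketch, not NiFi code: ToyConnection and its methods are hypothetical names, and the check happens per item here, whereas NiFi decides whether to schedule the upstream processor at all.

```python
from collections import deque
from typing import Optional


class ToyConnection:
    """Queue between two steps with object-count and data-size thresholds."""

    def __init__(self, max_objects: int, max_bytes: int) -> None:
        self.max_objects = max_objects
        self.max_bytes = max_bytes
        self.queue: deque = deque()
        self.queued_bytes = 0

    def backpressure_applied(self) -> bool:
        # Reaching either threshold is enough to pause the upstream side.
        return (len(self.queue) >= self.max_objects
                or self.queued_bytes >= self.max_bytes)

    def offer(self, payload: bytes) -> bool:
        """Upstream tries to enqueue; False means it should wait."""
        if self.backpressure_applied():
            return False
        self.queue.append(payload)
        self.queued_bytes += len(payload)
        return True

    def poll(self) -> Optional[bytes]:
        """Downstream takes one item, which may release backpressure."""
        if not self.queue:
            return None
        payload = self.queue.popleft()
        self.queued_bytes -= len(payload)
        return payload


if __name__ == "__main__":
    conn = ToyConnection(max_objects=3, max_bytes=1000)
    print([conn.offer(b"x") for _ in range(5)])  # upstream is refused once the queue is full
    conn.poll()                                  # downstream catches up by one item
    print(conn.offer(b"x"))                      # upstream may run again
```

Because the check runs before an item is accepted, the size threshold behaves as a soft limit in this sketch: the last accepted item can push the queue past it, after which further offers are refused.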
NiFi supports clustering for horizontal scaling and high availability. ZooKeeper manages cluster coordination, and the flow design is replicated across all nodes. Data is distributed across the cluster for parallel processing.
NiFi ships with 300+ processors covering HDFS, S3, Kafka, databases (JDBC), HTTP APIs, FTP, SFTP, email, Elasticsearch, Solr, and many more. Custom processors can be built in Java.
NiFi handles both batch and streaming data. It processes FlowFiles as they arrive, making it suitable for near-real-time use cases. For true event streaming with strict ordering, pair NiFi with Apache Kafka.
Similar assets
Apache Kafka — Distributed Event Streaming Platform
Apache Kafka is the open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, and mission-critical applications. Trillions of messages per day at LinkedIn, Netflix, Uber.
Apache Airflow — Programmatic Workflow Orchestration Platform
Apache Airflow is the industry-standard platform for authoring, scheduling, and monitoring data workflows. Define DAGs in Python to orchestrate ETL pipelines, ML training, data processing, and any complex workflow with dependencies.
Apache Camel — Enterprise Integration Framework for Java
Apache Camel is an open-source integration framework that implements the Enterprise Integration Patterns. It provides a routing and mediation engine with connectors for over 300 protocols and data formats, enabling developers to integrate systems using a concise Java or YAML DSL.
Apache SkyWalking — Distributed APM & Observability Platform
Apache-licensed APM platform unifying distributed tracing, metrics, logs, and eBPF profiling for microservices and service meshes.