Introduction
pandas is the cornerstone of data science in Python. It provides fast, flexible data structures — DataFrame (2D table) and Series (1D array) — designed to make working with structured and time series data intuitive and efficient. Whether you are cleaning messy CSV files, analyzing financial data, or preparing datasets for machine learning, pandas is almost always the first tool you reach for.
With over 48,000 GitHub stars and as a dependency of virtually every Python data project, pandas is one of the most important libraries in the scientific Python ecosystem.
What pandas Does
pandas provides a rich set of operations for data manipulation: reading data from CSV, Excel, SQL, JSON, and Parquet files; filtering rows and selecting columns; handling missing values; grouping and aggregating; merging and joining datasets; reshaping (pivot, melt); and time series analysis. All operations are optimized for performance using NumPy arrays under the hood.
Architecture Overview
```
      [Data Sources]
  CSV, Excel, SQL, JSON, Parquet,
  HDF5, Feather, HTML tables
            |
      [pd.read_*()]
            |
    [DataFrame / Series]
    NumPy-backed columns
      with labeled axes
            |
    +---------+----------+
    |         |          |
[Select] [Transform] [Aggregate]
 .loc[]   .apply()    .groupby()
 .iloc[]  .map()      .agg()
 .query() .fillna()   .pivot_table()
          .merge()    .resample()
            |
        [Output]
  .to_csv(), .to_parquet(),
  .to_sql(), .plot()
```
Quick Start
```python
import pandas as pd

# Read and explore data
df = pd.read_csv("sales.csv", parse_dates=["date"])
df.info()  # prints a summary; returns None, so no print() wrapper
print(df.describe())

# Data cleaning
df = df.dropna(subset=["revenue"])
df["category"] = df["category"].str.strip().str.lower()
df["year"] = df["date"].dt.year

# Analysis: monthly revenue and order counts ("ME" = month-end frequency)
monthly = (
    df.groupby(pd.Grouper(key="date", freq="ME"))
      .agg({"revenue": "sum", "orders": "count"})
      .rename(columns={"orders": "order_count"})
)

# Merge datasets
products = pd.read_csv("products.csv")
result = df.merge(products, on="product_id", how="left")

# Export
result.to_parquet("output.parquet", index=False)
```
Key Features
- DataFrame — labeled 2D data structure with mixed column types
- IO Tools — read/write CSV, Excel, SQL, JSON, Parquet, HDF5, and more
- Missing Data — intelligent handling of NaN values across operations
- GroupBy — split-apply-combine operations for aggregation
- Merge/Join — SQL-like joins between DataFrames
- Reshaping — pivot, melt, stack, unstack for data transformation
- Time Series — date range generation, resampling, rolling windows
- Vectorized Ops — fast NumPy-backed operations without loops
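The reshaping features listed above can be sketched with a small self-contained example (the city/quarter data here is made up for illustration):

```python
import pandas as pd

# Long-format data: one row per (city, quarter) pair (hypothetical values)
long_df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95],
})

# pivot: long -> wide (one column per quarter, one row per city)
wide = long_df.pivot(index="city", columns="quarter", values="revenue")

# melt: wide -> long again (the inverse reshape)
back = wide.reset_index().melt(id_vars="city", value_name="revenue")
```

`pivot` raises on duplicate index/column pairs; use `pivot_table` with an aggregation function when duplicates are possible.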
Comparison with Similar Tools
| Feature | pandas | Polars | R data.table | DuckDB | Spark |
|---|---|---|---|---|---|
| Language | Python | Python/Rust | R | SQL/Python | Python/Scala |
| Speed | Moderate | Very Fast | Fast | Fast | Fast (distributed) |
| Memory | High | Efficient | Efficient | Efficient | Spread across cluster |
| API Style | Method chain | Method chain | Bracket syntax | SQL | DataFrame API |
| Max Data Size | RAM-limited | RAM-limited | RAM-limited | Larger-than-RAM | Cluster-scale |
| Ecosystem | Dominant | Growing | R ecosystem | Growing | Enterprise |
| Learning Curve | Moderate | Low | Moderate | Low (SQL) | High |
FAQ
Q: pandas vs Polars — which should I use? A: For new projects with large datasets, Polars is faster and more memory-efficient. For existing codebases, scikit-learn integration, and maximum ecosystem compatibility, pandas remains the safe choice. pandas 2.0+ has improved performance significantly with Arrow-backed dtypes.
Q: How do I speed up slow pandas operations? A: Avoid iterrows() (use vectorized operations instead), use categorical dtypes for string columns, read Parquet instead of CSV, and consider using pandas with the PyArrow backend for better performance.
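A minimal sketch of the first two tips, with a toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "qty": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# Slow pattern: row-by-row Python iteration
totals_slow = [row.price * row.qty for row in df.itertuples()]

# Fast pattern: vectorized column arithmetic (runs in compiled NumPy code)
df["total"] = df["price"] * df["qty"]

# Categorical dtype stores each unique string once plus integer codes,
# which cuts memory for low-cardinality string columns
df["region"] = df["region"].astype("category")
```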
Q: Can pandas handle big data? A: pandas is limited to data that fits in RAM. For larger datasets, use chunked reading (chunksize parameter), Dask for parallel pandas, or switch to Polars or DuckDB.
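Chunked reading can be sketched as follows; an in-memory buffer stands in for a large file on disk:

```python
import io
import pandas as pd

# Simulate a large CSV (real code would pass a file path to read_csv)
csv_data = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is in memory at a time
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=4):
    total += chunk["value"].sum()
```

Aggregations that decompose over chunks (sums, counts) work naturally this way; operations needing the whole dataset at once (sorting, global medians) are better served by Dask, Polars, or DuckDB.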
Q: How do I handle time series data? A: Use pd.to_datetime() to parse dates, set the date column as index, then use .resample() for aggregation, .rolling() for moving averages, and .shift() for lagged features.
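The three methods from the answer above, sketched on a short synthetic daily series:

```python
import pandas as pd

# Hypothetical daily series with a DatetimeIndex
idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# resample: aggregate daily values into 3-day sums
three_day = s.resample("3D").sum()

# rolling: 2-day moving average (first value is NaN — not enough history)
moving_avg = s.rolling(window=2).mean()

# shift: lag the series by one day, e.g. for lagged ML features
lagged = s.shift(1)
```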
Sources
- GitHub: https://github.com/pandas-dev/pandas
- Documentation: https://pandas.pydata.org
- Created by Wes McKinney
- License: BSD 3-Clause