Scripts · Apr 12, 2026 · 3 min read

pandas — Powerful Data Analysis and Manipulation for Python

pandas is the essential data analysis library for Python. It provides DataFrame and Series data structures for efficient manipulation of tabular data, time series, and structured datasets with an expressive API for filtering, grouping, joining, and reshaping.

Script Depot · Community
Quick Use

Use it first, then decide how deep to go

The block below shows what to run first: install the package, then do a quick sanity-check analysis on a CSV file.

# Install pandas
pip install pandas

# Quick data analysis
python3 -c "
import pandas as pd

# Read CSV
df = pd.read_csv('data.csv')

# Basic exploration
print(df.head())
print(df.describe())
print(df.groupby('category')['value'].mean())
"

Introduction

pandas is the cornerstone of data science in Python. It provides fast, flexible data structures — DataFrame (2D table) and Series (1D array) — designed to make working with structured and time series data intuitive and efficient. Whether you are cleaning messy CSV files, analyzing financial data, or preparing datasets for machine learning, pandas is almost always the first tool you reach for.

With over 48,000 GitHub stars and as a dependency of virtually every Python data project, pandas is one of the most important libraries in the scientific Python ecosystem.

What pandas Does

pandas provides a rich set of operations for data manipulation: reading data from CSV, Excel, SQL, JSON, and Parquet files; filtering rows and selecting columns; handling missing values; grouping and aggregating; merging and joining datasets; reshaping (pivot, melt); and time series analysis. All operations are optimized for performance using NumPy arrays under the hood.
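Those operations can be sketched on a small in-memory frame (the column names here are illustrative, standing in for a file read with pd.read_csv):

```python
import pandas as pd

# A tiny frame standing in for data loaded from disk
df = pd.DataFrame({
    "category": ["a", "b", "a", "b"],
    "value": [10.0, None, 30.0, 40.0],
})

# Handle missing values, then filter rows and select columns
df["value"] = df["value"].fillna(0.0)
subset = df.loc[df["value"] > 5, ["category", "value"]]

# Group and aggregate
totals = df.groupby("category")["value"].sum()
print(totals)
```

Each step returns a new Series or DataFrame, so operations compose naturally into pipelines.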

Architecture Overview

[Data Sources]
CSV, Excel, SQL, JSON, Parquet,
HDF5, Feather, HTML tables
        |
   [pd.read_*()]
        |
   [DataFrame / Series]
   NumPy-backed columns
   with labeled axes
        |
+-------+-------+-------+
|       |       |       |
[Select] [Transform] [Aggregate]
.loc[]   .apply()    .groupby()
.iloc[]  .map()      .agg()
.query() .fillna()   .pivot_table()
         .merge()    .resample()
        |
   [Output]
   .to_csv(), .to_parquet(),
   .to_sql(), .plot()
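The selection stage in the diagram maps to three access styles; a minimal example with made-up labels and values:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 3], "y": ["p", "q", "r"]},
    index=["r1", "r2", "r3"],
)

# Label-based selection with .loc
a = df.loc["r2", "x"]      # 2

# Position-based selection with .iloc
b = df.iloc[0, 1]          # "p"

# Expression-based filtering with .query
c = df.query("x >= 2")     # rows r2 and r3
```

Mixing label-based and position-based access in one chain is a common source of bugs; picking one style per pipeline keeps intent clear.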

Usage & Configuration

import pandas as pd

# Read and explore data
df = pd.read_csv("sales.csv", parse_dates=["date"])
df.info()  # prints its summary directly; wrapping it in print() adds a stray "None"
print(df.describe())

# Data cleaning
df = df.dropna(subset=["revenue"])
df["category"] = df["category"].str.strip().str.lower()
df["year"] = df["date"].dt.year

# Analysis
monthly = (df
    .groupby(pd.Grouper(key="date", freq="ME"))
    .agg({"revenue": "sum", "orders": "count"})
    .rename(columns={"orders": "order_count"})
)

# Merge datasets
products = pd.read_csv("products.csv")
result = df.merge(products, on="product_id", how="left")

# Export
result.to_parquet("output.parquet", index=False)

Key Features

  • DataFrame — labeled 2D data structure with mixed column types
  • IO Tools — read/write CSV, Excel, SQL, JSON, Parquet, HDF5, and more
  • Missing Data — intelligent handling of NaN values across operations
  • GroupBy — split-apply-combine operations for aggregation
  • Merge/Join — SQL-like joins between DataFrames
  • Reshaping — pivot, melt, stack, unstack for data transformation
  • Time Series — date range generation, resampling, rolling windows
  • Vectorized Ops — fast NumPy-backed operations without loops
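The reshaping bullet in action: a short melt/pivot round trip on a toy frame (the city and year values are invented for illustration):

```python
import pandas as pd

wide = pd.DataFrame({
    "city": ["NY", "LA"],
    "2023": [100, 80],
    "2024": [110, 90],
})

# Wide -> long: one row per (city, year) pair
long = wide.melt(id_vars="city", var_name="year", value_name="sales")

# Long -> wide again
back = long.pivot(index="city", columns="year", values="sales")
```

The long form suits groupby and plotting; the wide form suits side-by-side comparison.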

Comparison with Similar Tools

Feature          pandas         Polars         R data.table    DuckDB            Spark
Language         Python         Python/Rust    R               SQL/Python        Python/Scala
Speed            Moderate       Very fast      Fast            Fast              Distributed
Memory           High           Efficient      Efficient       Efficient         Distributed
API style        Method chain   Method chain   Bracket syntax  SQL               DataFrame API
Max data size    RAM-limited    RAM-limited    RAM-limited     Larger-than-RAM   Cluster-scale
Ecosystem        Dominant       Growing        R ecosystem     Growing           Enterprise
Learning curve   Moderate       Low            Moderate        Low (SQL)         High

FAQ

Q: pandas vs Polars — which should I use? A: For new projects with large datasets, Polars is faster and more memory-efficient. For existing codebases, scikit-learn integration, and maximum ecosystem compatibility, pandas remains the safe choice. pandas 2.0+ has improved performance significantly with Arrow-backed dtypes.

Q: How do I speed up slow pandas operations? A: Avoid iterrows() (use vectorized operations instead), use categorical dtypes for string columns, read Parquet instead of CSV, and consider using pandas with the PyArrow backend for better performance.
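To illustrate the iterrows() point, a sketch contrasting the row loop with the vectorized form (the price/qty columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Slow pattern to avoid: Python-level loop over rows
# for _, row in df.iterrows():
#     row["price"] * row["qty"]

# Fast: one vectorized operation over whole columns
df["total"] = df["price"] * df["qty"]

# Categorical dtype stores repeated strings as compact integer codes
df["tier"] = pd.Series(["a", "b", "a"]).astype("category")
```

The vectorized version dispatches to NumPy in C rather than looping in Python, which is where most of the speedup comes from.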

Q: Can pandas handle big data? A: pandas is limited to data that fits in RAM. For larger datasets, use chunked reading (chunksize parameter), Dask for parallel pandas, or switch to Polars or DuckDB.
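A minimal chunked-reading sketch; here io.StringIO stands in for a large file on disk, and only one chunk is held in memory at a time:

```python
import io

import pandas as pd

# A file-like CSV with a single column "v" holding 0..9
csv = io.StringIO("v\n" + "\n".join(str(i) for i in range(10)))

# Process the file in chunks of 4 rows, keeping only a running sum
total = 0
for chunk in pd.read_csv(csv, chunksize=4):
    total += chunk["v"].sum()
```

This pattern works for any aggregation that can be accumulated incrementally; operations that need the whole dataset at once (sorting, joins) are better served by Dask, Polars, or DuckDB.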

Q: How do I handle time series data? A: Use pd.to_datetime() to parse dates, set the date column as index, then use .resample() for aggregation, .rolling() for moving averages, and .shift() for lagged features.
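A small sketch of those three calls on synthetic daily data:

```python
import pandas as pd

# Six consecutive daily observations
idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Aggregate daily values into 3-day buckets
by3 = s.resample("3D").sum()     # [6, 15]

# 2-day moving average (first value is NaN)
ma = s.rolling(window=2).mean()

# Lag by one day, e.g. for building lagged features
lag = s.shift(1)
```

resample() needs a datetime index (or a key= via pd.Grouper), which is why parsing dates up front matters.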
