Scripts · Apr 12, 2026 · 3 min read

pandas — Powerful Data Analysis and Manipulation for Python

pandas is the essential data analysis library for Python. It provides DataFrame and Series data structures for efficient manipulation of tabular data, time series, and structured datasets with an expressive API for filtering, grouping, joining, and reshaping.

TL;DR
pandas provides DataFrame and Series for efficient tabular data manipulation, filtering, grouping, and analysis in Python.
§01

What it is

pandas is the foundational data analysis library for Python. It provides two primary data structures: DataFrame (2D labeled table) and Series (1D labeled array). With pandas you can read data from CSV, Excel, SQL, JSON, and Parquet files, then filter, group, join, pivot, and reshape it using an expressive API.

pandas is for anyone working with structured data in Python: data analysts, data scientists, backend engineers processing logs, and researchers analyzing experimental results. If your data fits in memory and has rows and columns, pandas is the standard tool.
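A minimal sketch of the two structures, using made-up data: a DataFrame built from a dict of columns, where selecting a single column yields a Series.

```python
import pandas as pd

# A DataFrame is a 2D labeled table; each column is a Series
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "score": [95, 88, 72],
})

col = df["score"]          # selecting one column yields a Series
print(type(col).__name__)  # Series
print(df.shape)            # (3, 2)
```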

§02

How it saves time or tokens

This workflow provides ready-to-run pandas snippets for common data tasks. Instead of searching documentation for the right method signature, you get copy-paste code for reading files, filtering rows, grouping aggregations, and merging datasets. Each snippet is a self-contained operation you can adapt to your data.

§03

How to use

  1. Install pandas:
pip install pandas
  2. Read your data into a DataFrame:
import pandas as pd

# From CSV
df = pd.read_csv('data.csv')

# From Excel
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')

# From SQL
from sqlalchemy import create_engine
engine = create_engine('sqlite:///app.db')
df = pd.read_sql('SELECT * FROM users', engine)
  3. Explore and manipulate:
# Basic exploration
df.shape          # (rows, cols)
df.dtypes         # column types
df.describe()     # summary statistics

# Filter rows
active = df[df['status'] == 'active']

# Group and aggregate
by_country = df.groupby('country')['revenue'].sum().sort_values(ascending=False)
§04

Example

import pandas as pd

# Load sales data
df = pd.read_csv('sales.csv', parse_dates=['date'])

# Monthly revenue by product category
monthly = (
    df.assign(month=df['date'].dt.to_period('M'))
    .groupby(['month', 'category'])['amount']
    .sum()
    .unstack(fill_value=0)
)

# Top 5 customers by total spend
top_customers = (
    df.groupby('customer_id')['amount']
    .sum()
    .nlargest(5)
)

print(monthly)
print(top_customers)
§05


Common pitfalls

  • Using iterrows() for row-by-row processing is slow. Prefer vectorized operations such as whole-column arithmetic, boolean indexing, and the .str/.dt accessors. Note that df['col'].apply() still calls a Python function once per element, so it is only a modest improvement over iterrows().
  • Chained assignment (df[df['x'] > 0]['y'] = 1) triggers a SettingWithCopyWarning and may not modify the original DataFrame. Use df.loc[df['x'] > 0, 'y'] = 1 instead.
  • Loading large CSV files without specifying dtypes wastes memory. Use the dtype parameter or read in chunks with chunksize for files that approach your RAM limit.
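The chained-assignment pitfall above can be sketched with toy data: a single .loc call does the filter and the assignment in one step, so it modifies the original DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, 3], "y": [0, 0, 0]})

# Correct: one .loc call filters and assigns in place,
# avoiding the SettingWithCopyWarning from chained indexing
df.loc[df["x"] > 0, "y"] = 1

print(df["y"].tolist())  # [0, 1, 1]
```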

Frequently Asked Questions

What is the difference between a DataFrame and a Series?

A DataFrame is a 2D table with labeled rows and columns. A Series is a single column (1D array) with labels. When you select one column from a DataFrame, you get a Series. When you select multiple columns, you get a DataFrame.
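The bracket syntax makes this distinction concrete: a single column name returns a Series, while a list of column names (even a one-element list) returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

s = df["a"]      # single name -> Series (1D)
sub = df[["a"]]  # list of names -> DataFrame (2D)

print(type(s).__name__, type(sub).__name__)  # Series DataFrame
```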

How large a dataset can pandas handle?

pandas works well with datasets that fit in memory. For most machines, this means up to a few gigabytes. For larger datasets, consider using Dask (pandas-like API with parallel processing), Polars (Rust-based DataFrame library), or reading data in chunks.
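A sketch of chunked reading, simulating a large file with an in-memory buffer so the example is self-contained: each chunk is a small DataFrame, and a running aggregate avoids holding the whole file in memory.

```python
import io
import pandas as pd

# Stand-in for a large file on disk
csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n")

# Process two rows at a time instead of loading everything at once
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["value"].sum()

print(total)  # 15
```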

How do I merge two DataFrames?

Use pd.merge(df1, df2, on='key_column', how='left') for SQL-style joins. The how parameter accepts left, right, inner, and outer. For concatenating DataFrames vertically, use pd.concat([df1, df2]).
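A short left-join sketch with made-up users and orders tables: every user row is kept, users with multiple orders produce multiple rows, and users with no orders get NaN in the order columns.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ada", "Grace", "Linus"]})
orders = pd.DataFrame({"user_id": [1, 1, 3], "amount": [10.0, 20.0, 5.0]})

# Left join: user 1 appears twice, user 2 has NaN amount
merged = pd.merge(users, orders, on="user_id", how="left")

print(merged.shape)  # (4, 3)
```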

Can pandas read from databases directly?

Yes. Use pd.read_sql() with a SQLAlchemy engine or database connection. It supports any database with a SQLAlchemy dialect: PostgreSQL, MySQL, SQLite, SQL Server, and more.

How do I handle missing values in pandas?

Use df.isna() to detect missing values, df.dropna() to remove rows with missing values, and df.fillna(value) to replace them. For time series, df.interpolate() fills gaps using interpolation methods.
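The four methods in one sketch, on a toy Series with a single gap:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.isna().sum())            # 1
print(s.fillna(0).tolist())      # [1.0, 0.0, 3.0]
print(s.dropna().tolist())       # [1.0, 3.0]
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0]
```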
