pandas — Powerful Data Analysis and Manipulation for Python
pandas is the essential data analysis library for Python. It provides DataFrame and Series data structures for efficient manipulation of tabular data, time series, and structured datasets with an expressive API for filtering, grouping, joining, and reshaping.
What it is
pandas provides two primary data structures: DataFrame (a 2D labeled table) and Series (a 1D labeled array). With pandas you can read data from CSV, Excel, SQL, JSON, and Parquet sources, then filter, group, join, pivot, and reshape it using an expressive API.
pandas is for anyone working with structured data in Python: data analysts, data scientists, backend engineers processing logs, and researchers analyzing experimental results. If your data fits in memory and has rows and columns, pandas is the standard tool.
How it saves time or tokens
This workflow provides ready-to-run pandas snippets for common data tasks. Instead of searching the documentation for the right method signature, you get copy-paste code for reading files, filtering rows, computing grouped aggregations, and merging datasets. Each snippet is a self-contained operation you can adapt to your data.
How to use
- Install pandas:
pip install pandas
- Read your data into a DataFrame:
import pandas as pd
# From CSV
df = pd.read_csv('data.csv')
# From Excel (reading .xlsx requires an engine such as openpyxl)
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
# From SQL
from sqlalchemy import create_engine
engine = create_engine('sqlite:///app.db')
df = pd.read_sql('SELECT * FROM users', engine)
- Explore and manipulate:
# Basic exploration
df.shape # (rows, cols)
df.dtypes # column types
df.describe() # summary statistics
# Filter rows
active = df[df['status'] == 'active']
# Group and aggregate
by_country = df.groupby('country')['revenue'].sum().sort_values(ascending=False)
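The steps above can be tried without any files on disk. This is a minimal sketch, assuming a small made-up dataset that has the status, country, and revenue columns used in the snippets; the user_id column and all values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """user_id,status,country,revenue
1,active,US,120.0
2,inactive,DE,80.0
3,active,DE,200.0
4,active,US,50.0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, cols)
print(df.dtypes)  # column types

# Filter rows with a boolean mask
active = df[df['status'] == 'active']

# Group and aggregate, largest total first
by_country = df.groupby('country')['revenue'].sum().sort_values(ascending=False)
print(by_country)
```

Swapping io.StringIO(csv_text) for a file path gives the same result on real data.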
Example
import pandas as pd
# Load sales data
df = pd.read_csv('sales.csv', parse_dates=['date'])
# Monthly revenue by product category
monthly = (
    df.assign(month=df['date'].dt.to_period('M'))
    .groupby(['month', 'category'])['amount']
    .sum()
    .unstack(fill_value=0)
)
# Top 5 customers by total spend
top_customers = (
    df.groupby('customer_id')['amount']
    .sum()
    .nlargest(5)
)
print(monthly)
print(top_customers)
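To run the example without a sales.csv file, the same pipeline can be fed from an in-memory string. This is a sketch, assuming hypothetical rows with the date, category, customer_id, and amount columns the example expects; nlargest is reduced to 2 because the sample has only three customers.

```python
import io

import pandas as pd

# Hypothetical rows standing in for sales.csv
csv_text = """date,category,customer_id,amount
2024-01-05,books,c1,20.0
2024-01-20,games,c2,60.0
2024-02-03,books,c1,15.0
2024-02-10,books,c3,30.0
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

# One row per month, one column per category; missing combinations become 0
monthly = (
    df.assign(month=df['date'].dt.to_period('M'))
    .groupby(['month', 'category'])['amount']
    .sum()
    .unstack(fill_value=0)
)

# Largest total spend first
top_customers = df.groupby('customer_id')['amount'].sum().nlargest(2)

print(monthly)
print(top_customers)
```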
Common pitfalls
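- Using iterrows() for row-by-row processing is slow. Prefer vectorized operations such as column arithmetic, boolean indexing, or built-in Series methods. Note that df['col'].apply() still calls a Python function once per element, so it is not vectorized and is best kept as a fallback.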
- Chained assignment (df[df['x'] > 0]['y'] = 1) writes to a temporary copy, so it can trigger a SettingWithCopyWarning and leave the original DataFrame unchanged. Use df.loc[df['x'] > 0, 'y'] = 1 instead.
- Loading large CSV files without specifying dtypes can waste memory, because the inferred types are often wider than the data needs. Use the dtype parameter, or read in chunks with chunksize for files that approach your RAM limit.
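The pitfalls above can be illustrated on a small frame. This is a sketch, not a benchmark: the x, y, id, and flag columns and their values are assumptions chosen for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({'x': [-1, 2, 3], 'y': [0, 0, 0]})

# Pitfall 1: a Python-level loop over rows...
doubled_loop = [row['x'] * 2 for _, row in df.iterrows()]
# ...versus one vectorized expression over the whole column
doubled_vec = df['x'] * 2

# Pitfall 2: assign through a single .loc call instead of chained indexing
df.loc[df['x'] > 0, 'y'] = 1

# Pitfall 3: declare compact dtypes up front instead of relying on inference
csv_text = "id,flag\n1,0\n2,1\n3,1\n"
inferred = pd.read_csv(io.StringIO(csv_text))
compact = pd.read_csv(io.StringIO(csv_text), dtype={'id': 'int32', 'flag': 'int8'})

print(inferred.dtypes)
print(compact.dtypes)
```

On three rows the difference is negligible; the loop cost and the memory savings only become visible on large frames.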
Frequently Asked Questions
What is the difference between a DataFrame and a Series?
A DataFrame is a 2D table with labeled rows and columns. A Series is a single column (1D array) with labels. When you select one column from a DataFrame, you get a Series. When you select multiple columns, you get a DataFrame.
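The selection rule can be checked directly. This is a small sketch with hypothetical name, age, and city columns:

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'age': [30, 41], 'city': ['Oslo', 'Lima']})

one_column = df['age']             # single label -> Series
two_columns = df[['name', 'age']]  # list of labels -> DataFrame
still_a_frame = df[['age']]        # one-element list -> one-column DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(still_a_frame).__name__)
```

The third case shows that the result type follows the kind of key (label versus list), not the number of columns returned.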
How much data can pandas handle?
pandas works well with datasets that fit in memory. For most machines, this means up to a few gigabytes. For larger datasets, consider using Dask (pandas-like API with parallel processing), Polars (Rust-based DataFrame library), or reading data in chunks.
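Reading in chunks can be sketched as follows. The CSV content, the amount column, and the chunk size of 2 are all hypothetical; a real file would use a far larger chunk size.

```python
import io

import pandas as pd

csv_text = "amount\n10\n20\n30\n40\n50\n"

total = 0
rows = 0
# With chunksize set, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk['amount'].sum()
    rows += len(chunk)

print(total, rows)
```

This pattern suits aggregations that can be accumulated piece by piece, such as sums and counts; operations that need the whole dataset at once do not fit it as easily.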
How do I merge or join two DataFrames?
Use pd.merge(df1, df2, on='key_column', how='left') for SQL-style joins. The how parameter accepts left, right, inner, and outer, among others. For concatenating DataFrames vertically, use pd.concat([df1, df2]).
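A short sketch of both operations, assuming two hypothetical tables, orders and customers, that share a customer_id key:

```python
import pandas as pd

# Hypothetical tables sharing a customer_id key
orders = pd.DataFrame({'customer_id': [1, 2, 4], 'amount': [50, 75, 20]})
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ann', 'Bo', 'Cy']})

# Left join keeps every order; an order with no matching customer
# gets a missing value in the name column
left = pd.merge(orders, customers, on='customer_id', how='left')

# Inner join keeps only keys present in both tables
inner = pd.merge(orders, customers, on='customer_id', how='inner')

# Vertical concatenation; ignore_index renumbers the rows from 0
stacked = pd.concat([orders, orders], ignore_index=True)

print(left)
print(inner)
print(len(stacked))
```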
Can pandas read from SQL databases?
Yes. Use pd.read_sql() with a SQLAlchemy engine or database connection. It supports any database with a SQLAlchemy dialect: PostgreSQL, MySQL, SQLite, SQL Server, and more.
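For a quick test without a database server, pd.read_sql() also accepts a standard-library sqlite3 connection. This is a sketch, assuming a hypothetical users table with id and status columns; for other databases a SQLAlchemy engine would take the place of conn.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a hypothetical users table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (id INTEGER, status TEXT)')
conn.executemany(
    'INSERT INTO users VALUES (?, ?)',
    [(1, 'active'), (2, 'inactive'), (3, 'active')],
)
conn.commit()

# params keeps values out of the SQL string
df = pd.read_sql('SELECT * FROM users WHERE status = ?', conn, params=('active',))
conn.close()

print(df)
```

Passing values through params rather than formatting them into the query avoids quoting mistakes and SQL injection.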
How do I handle missing values?
Use df.isna() to detect missing values, df.dropna() to remove rows with missing values, and df.fillna(value) to replace them. For time series, df.interpolate() fills gaps in numeric columns by interpolation (linear by default).
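The four methods can be compared on one small frame. This is a sketch with hypothetical temp and city columns; interpolate is applied to the numeric column only, since interpolating text is not meaningful.

```python
import pandas as pd

df = pd.DataFrame({
    'temp': [20.0, float('nan'), 24.0],
    'city': ['Oslo', None, 'Lima'],
})

missing_per_column = df.isna().sum()     # count of missing values per column
complete_rows = df.dropna()              # drop rows with any missing value
filled = df.fillna({'temp': 0.0, 'city': 'unknown'})  # per-column replacements
interpolated = df['temp'].interpolate()  # linear by default

print(missing_per_column)
print(interpolated.tolist())
```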