Introduction
pandas is the cornerstone of data science in Python. It provides fast, flexible data structures — DataFrame (2D table) and Series (1D array) — designed to make working with structured and time series data intuitive and efficient. Whether you are cleaning messy CSV files, analyzing financial data, or preparing datasets for machine learning, pandas is almost always the first tool you reach for.
With over 48,000 GitHub stars and as a dependency of virtually every Python data project, pandas is one of the most important libraries in the scientific Python ecosystem.
What pandas Does
pandas provides a rich set of operations for data manipulation: reading data from CSV, Excel, SQL, JSON, and Parquet files; filtering rows and selecting columns; handling missing values; grouping and aggregating; merging and joining datasets; reshaping (pivot, melt); and time series analysis. All operations are optimized for performance using NumPy arrays under the hood.
Architecture Overview
```
      [Data Sources]
  CSV, Excel, SQL, JSON, Parquet,
  HDF5, Feather, HTML tables
            |
      [pd.read_*()]
            |
    [DataFrame / Series]
    NumPy-backed columns
      with labeled axes
            |
    +---------+----------+
    |         |          |
[Select] [Transform] [Aggregate]
 .loc[]   .apply()    .groupby()
 .iloc[]  .map()      .agg()
 .query() .fillna()   .pivot_table()
          .merge()    .resample()
            |
        [Output]
  .to_csv(), .to_parquet(),
  .to_sql(), .plot()
```
Quick Start
```python
import pandas as pd

# Read and explore data
df = pd.read_csv("sales.csv", parse_dates=["date"])
df.info()  # prints a summary; returns None, so no print() wrapper
print(df.describe())

# Data cleaning
df = df.dropna(subset=["revenue"])
df["category"] = df["category"].str.strip().str.lower()
df["year"] = df["date"].dt.year

# Analysis: monthly revenue and order counts ("ME" = month-end frequency)
monthly = (
    df.groupby(pd.Grouper(key="date", freq="ME"))
      .agg({"revenue": "sum", "orders": "count"})
      .rename(columns={"orders": "order_count"})
)

# Merge datasets
products = pd.read_csv("products.csv")
result = df.merge(products, on="product_id", how="left")

# Export
result.to_parquet("output.parquet", index=False)
```
Key Features
- DataFrame — labeled 2D data structure with mixed column types
- IO Tools — read/write CSV, Excel, SQL, JSON, Parquet, HDF5, and more
- Missing Data — intelligent handling of NaN values across operations
- GroupBy — split-apply-combine operations for aggregation
- Merge/Join — SQL-like joins between DataFrames
- Reshaping — pivot, melt, stack, unstack for data transformation
- Time Series — date range generation, resampling, rolling windows
- Vectorized Ops — fast NumPy-backed operations without loops
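The reshaping features listed above can be sketched with a small self-contained example (the city/quarter data here is made up for illustration):

```python
import pandas as pd

# Long-format data: one row per (city, quarter) pair (hypothetical values)
long_df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95],
})

# pivot: long -> wide (one column per quarter, one row per city)
wide = long_df.pivot(index="city", columns="quarter", values="revenue")

# melt: wide -> long again (the inverse reshape)
back = wide.reset_index().melt(id_vars="city", value_name="revenue")
```

`pivot` raises on duplicate index/column pairs; use `pivot_table` with an aggregation function when duplicates are possible.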
Comparison with Similar Tools
| Feature | pandas | Polars | R data.table | DuckDB | Spark |
|---|---|---|---|---|---|
| Language | Python | Python/Rust | R | SQL/Python | Python/Scala |
| Speed | Moderate | Very Fast | Fast | Fast | Fast (distributed) |
| Memory | High | Efficient | Efficient | Efficient | Spread across cluster |
| API Style | Method chain | Method chain | Bracket syntax | SQL | DataFrame API |
| Max Data Size | RAM-limited | RAM-limited | RAM-limited | Larger-than-RAM | Cluster-scale |
| Ecosystem | Dominant | Growing | R ecosystem | Growing | Enterprise |
| Learning Curve | Moderate | Low | Moderate | Low (SQL) | High |
FAQ
Q: pandas vs Polars — which should I use? A: For new projects with large datasets, Polars is faster and more memory-efficient. For existing codebases, scikit-learn integration, and maximum ecosystem compatibility, pandas remains the safe choice. pandas 2.0+ has improved performance significantly with Arrow-backed dtypes.
Q: How do I speed up slow pandas operations? A: Avoid iterrows() (use vectorized operations instead), use categorical dtypes for string columns, read Parquet instead of CSV, and consider using pandas with the PyArrow backend for better performance.
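A minimal sketch of the first two tips, with a toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "qty": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# Slow pattern: row-by-row Python iteration
totals_slow = [row.price * row.qty for row in df.itertuples()]

# Fast pattern: vectorized column arithmetic (runs in compiled NumPy code)
df["total"] = df["price"] * df["qty"]

# Categorical dtype stores each unique string once plus integer codes,
# which cuts memory for low-cardinality string columns
df["region"] = df["region"].astype("category")
```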
Q: Can pandas handle big data? A: pandas is limited to data that fits in RAM. For larger datasets, use chunked reading (chunksize parameter), Dask for parallel pandas, or switch to Polars or DuckDB.
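Chunked reading can be sketched as follows; an in-memory buffer stands in for a large file on disk:

```python
import io
import pandas as pd

# Simulate a large CSV (real code would pass a file path to read_csv)
csv_data = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is in memory at a time
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=4):
    total += chunk["value"].sum()
```

Aggregations that decompose over chunks (sums, counts) work naturally this way; operations needing the whole dataset at once (sorting, global medians) are better served by Dask, Polars, or DuckDB.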
Q: How do I handle time series data? A: Use pd.to_datetime() to parse dates, set the date column as index, then use .resample() for aggregation, .rolling() for moving averages, and .shift() for lagged features.
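The three methods from the answer above, sketched on a short synthetic daily series:

```python
import pandas as pd

# Hypothetical daily series with a DatetimeIndex
idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# resample: aggregate daily values into 3-day sums
three_day = s.resample("3D").sum()

# rolling: 2-day moving average (first value is NaN — not enough history)
moving_avg = s.rolling(window=2).mean()

# shift: lag the series by one day, e.g. for lagged ML features
lagged = s.shift(1)
```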
Sources
- GitHub: https://github.com/pandas-dev/pandas
- Documentation: https://pandas.pydata.org
- Created by Wes McKinney
- License: BSD 3-Clause