# pandas — Powerful Data Analysis and Manipulation for Python

> pandas is the essential data analysis library for Python. It provides DataFrame and Series data structures for efficient manipulation of tabular data, time series, and structured datasets, with an expressive API for filtering, grouping, joining, and reshaping.

## Quick Use

```bash
# Install pandas
pip install pandas

# Quick data analysis
python3 -c "
import pandas as pd

# Read CSV
df = pd.read_csv('data.csv')

# Basic exploration
print(df.head())
print(df.describe())
print(df.groupby('category')['value'].mean())
"
```

## Introduction

pandas is the cornerstone of data science in Python. It provides fast, flexible data structures — DataFrame (a 2D table) and Series (a 1D array) — designed to make working with structured and time series data intuitive and efficient. Whether you are cleaning messy CSV files, analyzing financial data, or preparing datasets for machine learning, pandas is almost always the first tool you reach for. With tens of thousands of GitHub stars and a place as a dependency of virtually every Python data project, pandas is one of the most important libraries in the scientific Python ecosystem.

## What pandas Does

pandas provides a rich set of operations for data manipulation: reading data from CSV, Excel, SQL, JSON, and Parquet files; filtering rows and selecting columns; handling missing values; grouping and aggregating; merging and joining datasets; reshaping (pivot, melt); and time series analysis. All operations are optimized for performance using NumPy arrays under the hood.
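A minimal, self-contained sketch of these core operations — filtering, missing-value handling, grouping, and merging — using small in-memory DataFrames (the column names and values here are illustrative, standing in for data you would normally read from a file):

```python
import pandas as pd

# Illustrative in-memory data (stands in for a CSV read)
sales = pd.DataFrame({
    "product_id": [1, 2, 1, 3, 2],
    "category": ["toys", "books", "toys", "books", "books"],
    "value": [10.0, None, 30.0, 25.0, 15.0],
})
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "name": ["blocks", "novel", "atlas"],
})

# Handle missing values: fill NaN with the column mean
sales["value"] = sales["value"].fillna(sales["value"].mean())

# Filter rows and select columns
toys = sales.loc[sales["category"] == "toys", ["product_id", "value"]]

# Group and aggregate: total value per category
totals = sales.groupby("category")["value"].sum()

# Merge: SQL-style left join on the shared key
joined = sales.merge(products, on="product_id", how="left")

print(totals)
print(joined[["name", "value"]])
```

Because every step operates on whole columns at once, no explicit Python loops are needed — this is the vectorized style the library is designed around.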
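The time series side can be sketched the same way — a short, hedged example (the dates and values are made up) showing resampling, rolling windows, and lagged features:

```python
import pandas as pd

# Illustrative daily series: values 0..9 over ten consecutive days
idx = pd.date_range("2024-01-01", periods=10, freq="D")
ts = pd.Series(range(10), index=idx, dtype="float64")

# Resample: aggregate daily values into 5-day sums
five_day = ts.resample("5D").sum()

# Rolling window: 3-day moving average (first two entries are NaN)
ma3 = ts.rolling(window=3).mean()

# Shift: previous day's value as a lagged feature
lagged = ts.shift(1)

print(five_day)
```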
## Architecture Overview

```
            [Data Sources]
   CSV, Excel, SQL, JSON, Parquet,
     HDF5, Feather, HTML tables
                 |
            [pd.read_*()]
                 |
         [DataFrame / Series]
  NumPy-backed columns with labeled axes
                 |
         +-------+-------+
         |       |       |
    [Select] [Transform] [Aggregate]
    .loc[]    .apply()    .groupby()
    .iloc[]   .map()      .agg()
    .query()  .fillna()   .pivot_table()
    .merge()  .resample()
                 |
             [Output]
  .to_csv(), .to_parquet(), .to_sql(), .plot()
```

## Usage Example

```python
import pandas as pd

# Read and explore data
df = pd.read_csv("sales.csv", parse_dates=["date"])
print(df.info())
print(df.describe())

# Data cleaning
df = df.dropna(subset=["revenue"])
df["category"] = df["category"].str.strip().str.lower()
df["year"] = df["date"].dt.year

# Analysis: monthly revenue and order counts
monthly = (
    df.groupby(pd.Grouper(key="date", freq="ME"))  # "ME" = month end (pandas >= 2.2; use "M" on older versions)
    .agg({"revenue": "sum", "orders": "count"})
    .rename(columns={"orders": "order_count"})
)

# Merge datasets
products = pd.read_csv("products.csv")
result = df.merge(products, on="product_id", how="left")

# Export
result.to_parquet("output.parquet", index=False)
```

## Key Features

- **DataFrame** — labeled 2D data structure with mixed column types
- **IO Tools** — read/write CSV, Excel, SQL, JSON, Parquet, HDF5, and more
- **Missing Data** — intelligent handling of NaN values across operations
- **GroupBy** — split-apply-combine operations for aggregation
- **Merge/Join** — SQL-like joins between DataFrames
- **Reshaping** — pivot, melt, stack, unstack for data transformation
- **Time Series** — date range generation, resampling, rolling windows
- **Vectorized Ops** — fast NumPy-backed operations without loops

## Comparison with Similar Tools

| Feature | pandas | Polars | R data.table | DuckDB | Spark |
|---|---|---|---|---|---|
| Language | Python | Python/Rust | R | SQL/Python | Python/Scala |
| Speed | Moderate | Very fast | Fast | Fast | Distributed |
| Memory | High | Efficient | Efficient | Efficient | Distributed |
| API Style | Method chain | Method chain | Bracket syntax | SQL | DataFrame API |
| Max Data Size | RAM-limited | RAM-limited | RAM-limited | Larger-than-RAM | Cluster-scale |
| Ecosystem | Dominant | Growing | R ecosystem | Growing | Enterprise |
| Learning Curve | Moderate | Low | Moderate | Low (SQL) | High |

## FAQ

**Q: pandas vs Polars — which should I use?**

A: For new projects with large datasets, Polars is faster and more memory-efficient. For existing codebases, scikit-learn integration, and maximum ecosystem compatibility, pandas remains the safe choice. pandas 2.0+ has improved performance significantly with Arrow-backed dtypes.

**Q: How do I speed up slow pandas operations?**

A: Avoid `.iterrows()` (use vectorized operations instead), use categorical dtypes for string columns, read Parquet instead of CSV, and consider the PyArrow backend for better performance.

**Q: Can pandas handle big data?**

A: pandas is limited to data that fits in RAM. For larger datasets, use chunked reading (the `chunksize` parameter), use Dask for parallel pandas, or switch to Polars or DuckDB.

**Q: How do I handle time series data?**

A: Use `pd.to_datetime()` to parse dates, set the date column as the index, then use `.resample()` for aggregation, `.rolling()` for moving averages, and `.shift()` for lagged features.

## Sources

- GitHub: https://github.com/pandas-dev/pandas
- Documentation: https://pandas.pydata.org
- Created by Wes McKinney
- License: BSD 3-Clause

---

Source: https://tokrepo.com/en/workflows/1005b785-366d-11f1-9bc6-00163e2b0d79
Author: Script Depot