Pandas: The Definitive Deep Dive
Pandas provides the high-level tools needed for data manipulation, analysis, and cleaning. It offers relational-style operations similar to SQL (filtering, joining, grouping), plus first-class support for labeled axes and complex time-series operations.
Phase 1: Foundations (Structure & Selection)
1. The Series & DataFrame Anatomy
- Index: Labels for the rows.
- Columns: Labels for the columns.
- Data: The underlying NumPy/Arrow buffer.
```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df.index)  # RangeIndex(start=0, stop=2, step=1)
```

2. Precise Selection (loc, iloc, at, iat)
- loc / at: Label-based. Use at for a single value (faster).
- iloc / iat: Position-based. Use iat for a single value (faster).
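A minimal sketch of the four accessors side by side, on a tiny frame with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

# Label-based vs position-based access to the same cells
label_val = df.loc['x', 'B']   # row label 'x', column 'B'
pos_val = df.iloc[0, 1]        # first row, second column
fast_label = df.at['y', 'A']   # scalar fast path, by label
fast_pos = df.iat[1, 0]        # scalar fast path, by position
```

Note that loc/iloc also accept slices and masks; at/iat only ever address a single scalar, which is why they can skip the alignment machinery.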
```python
# Select the value at the first row, column 'A'
value = df.at[0, 'A']
```

Phase 2: Intermediate (Transformation & Aggregation)
3. Multi-Indexing (Hierarchical Data)
Multi-indexing allows you to store data with multiple dimensions in a 2D structure.
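As a self-contained sketch, with hypothetical City/Year sales data (the column names and values are made up):

```python
import pandas as pd

sales = pd.DataFrame({
    'City': ['London', 'London', 'Paris', 'Paris'],
    'Year': [2022, 2023, 2022, 2023],
    'Sales': [100, 120, 90, 95],
}).set_index(['City', 'Year'])

london = sales.loc['London']                      # all years for one outer label
paris_2023 = sales.loc[('Paris', 2023), 'Sales']  # a single cell via a full key tuple
```

Selecting by the outer level drops it from the result, leaving Year as the remaining index.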
```python
df.set_index(['City', 'Year'], inplace=True)

# Select all rows for 'London'
london_data = df.loc['London']
```

4. Pivot Tables & Reshaping
- pivot_table: Summarize data (like Excel).
- melt: Convert wide data to long format.
- stack / unstack: Pivot indices to columns and vice versa.
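A round-trip sketch of melt and pivot_table on a small invented wide table (Month/ProdA/ProdB are illustrative names):

```python
import pandas as pd

wide = pd.DataFrame({'Month': ['Jan', 'Feb'], 'ProdA': [10, 20], 'ProdB': [5, 8]})

# Wide -> long: one row per (Month, Product) pair
long_df = wide.melt(id_vars='Month', var_name='Product', value_name='Sales')

# Long -> wide again, aggregating with a pivot table
back = long_df.pivot_table(values='Sales', index='Month', columns='Product', aggfunc='sum')
```

Long format is what group-by and plotting libraries usually expect; pivot_table brings it back to a matrix view for reporting.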
```python
# Aggregating data with a pivot table
pivot = df.pivot_table(values='Sales', index='Month', columns='Product', aggfunc='sum')
```

5. Time Series & Resampling
Pandas excels at handling time-based data.
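A runnable sketch with synthetic daily data (constant sales of 1 per day, purely for illustration):

```python
import pandas as pd

# Two weeks of synthetic daily sales, starting on a Monday
idx = pd.date_range('2024-01-01', periods=14, freq='D')
daily = pd.Series(1, index=idx, name='Sales')

# 'W' buckets the data into calendar weeks ending on Sunday
weekly = daily.resample('W').sum()
```

Each weekly bucket here sums seven daily values, so the result has two rows of 7.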
```python
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Resample to weekly frequency and sum
weekly_sales = df['Sales'].resample('W').sum()
```

Phase 3: Expert (Performance & Memory)
6. Extension Dtypes & Nullability
Historically, Pandas used NaN for missing data, which forced numeric columns to float. Extension Dtypes (like Int64, boolean) allow for nullable integers/booleans.
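The coercion is easy to see side by side (a tiny made-up series):

```python
import pandas as pd

coerced = pd.Series([1, None, 3])                  # NaN forces float64
nullable = pd.Series([1, None, 3], dtype='Int64')  # stays integer; missing shows as <NA>
```

With the nullable dtype, the non-missing values remain true integers and the gap is represented by pd.NA rather than NaN.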
```python
# Using the nullable integer type
df['id'] = df['id'].astype('Int64')
```

7. Vectorized String Operations (.str)
Prefer the .str accessor over apply for string manipulation: it chains cleanly and propagates missing values instead of raising.
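A small sketch of that NaN-safety (invented names; an equivalent apply with plain str methods would raise on the None):

```python
import pandas as pd

names = pd.Series(['  alice ', 'Bob', None])

# .str methods chain and pass missing values through untouched
clean = names.str.strip().str.upper()
```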
```python
df['Name_Clean'] = df['Name'].str.strip().str.upper()
```

8. Handling Out-of-Memory Data
For datasets larger than RAM:
- Chunking: Process the file in pieces.
- usecols & dtype: Only load the columns you need and specify their types during read_csv.
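The chunking pattern can be sketched like this; an in-memory StringIO stands in for a file too large to load at once, and the running total is an arbitrary example aggregate:

```python
import io
import pandas as pd

# Stand-in for a huge CSV on disk
csv = io.StringIO('id,score\n1,0.5\n2,0.7\n3,0.9\n')

total = 0.0
for chunk in pd.read_csv(csv, chunksize=2):  # DataFrames of up to 2 rows each
    total += chunk['score'].sum()
```

Only one chunk is ever in memory at a time, so peak usage is bounded by chunksize rather than file size.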
```python
# Load only two columns, with compact types
df = pd.read_csv('big.csv', usecols=['id', 'score'], dtype={'id': 'int32', 'score': 'float32'})
```

Phase 4: Senior Architect (The Internal Engine)
9. Cython & Numba Integration
For custom logic that can't be vectorized, use Numba to JIT-compile your Python code.
```python
import numba

@numba.njit
def fast_calc(arr):
    # Compiled element-wise loop over the raw array
    return arr * arr + 10

# Pass the underlying NumPy array, not a jitted scalar via .apply:
# .apply still pays per-row Python call overhead
df['Result'] = fast_calc(df['Value'].to_numpy())
```

10. Apache Arrow Backend (Pandas 2.0+)
Pandas now supports Apache Arrow as a backend. It is significantly faster and more memory-efficient than NumPy for strings and timestamps.
```python
# Using the Arrow backend when reading a CSV
df = pd.read_csv('data.csv', engine='pyarrow', dtype_backend='pyarrow')
```