
🐼 Pandas: The Definitive Deep Dive

Pandas provides high-level tools for data manipulation, analysis, and cleaning. It covers much of what SQL does — filtering, joins, group-by aggregation — inside Python, and adds first-class support for labeled axes and time-series operations.


🟢 Phase 1: Foundations (Structure & Selection)

1. The Series & DataFrame Anatomy

  • Index: Labels for the rows.
  • Columns: Labels for the columns.
  • Data: The underlying NumPy/Arrow buffer.
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df.index) # RangeIndex(start=0, stop=2, step=1)

2. Precise Selection (loc, iloc, at, iat)

  • loc / at: Label-based. Use at for a single value (faster).
  • iloc / iat: Position-based. Use iat for a single value (faster).
# Select value at first row, 'A' column
value = df.at[0, 'A']
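One subtlety worth a sketch: loc slices are inclusive of the end label, while iloc follows Python's usual end-exclusive convention (the data below is illustrative):

```python
import pandas as pd

tbl = pd.DataFrame({'A': [10, 20, 30]}, index=['x', 'y', 'z'])

# loc: label-based, and the slice INCLUDES the end label 'y'
print(tbl.loc['x':'y', 'A'].tolist())   # [10, 20]

# iloc: position-based, end-exclusive like ordinary Python slicing
print(tbl.iloc[0:2, 0].tolist())        # [10, 20]
```

Both slices select the same two rows here, but only because the inclusive label slice and the exclusive position slice happen to line up.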

🟡 Phase 2: Intermediate (Transformation & Aggregation)

3. Multi-Indexing (Hierarchical Data)

Multi-indexing allows you to store data with multiple dimensions in a 2D structure.

df.set_index(['City', 'Year'], inplace=True)
# Select all data for 'London'
london_data = df.loc['London'] 
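A self-contained sketch of the pattern above (the City/Year/Sales data is illustrative): selecting an outer level with loc, a specific pair with a tuple, and an inner level with .xs.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'City': ['London', 'London', 'Paris'],
    'Year': [2022, 2023, 2023],
    'Sales': [100, 120, 90],
}).set_index(['City', 'Year'])

# Outer level: all rows for 'London'
london = sales_df.loc['London']

# A specific (City, Year) pair via a tuple
print(sales_df.loc[('Paris', 2023), 'Sales'])  # 90

# Cross-section of an inner level with .xs
print(sales_df.xs(2023, level='Year'))
```

.xs is handy precisely because plain loc only addresses the outermost level directly.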

4. Pivot Tables & Reshaping

  • pivot_table: Summarize data (like Excel).
  • melt: Convert wide data to long format.
  • stack / unstack: Pivot indices to columns and vice versa.
# Aggregating data with Pivot Table
pivot = df.pivot_table(values='Sales', index='Month', columns='Product', aggfunc='sum')
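To see melt and pivot_table as inverse reshapes, here is a minimal round trip (column names Month/Widgets/Gadgets are made up for the sketch):

```python
import pandas as pd

wide = pd.DataFrame({
    'Month': ['Jan', 'Feb'],
    'Widgets': [10, 20],
    'Gadgets': [30, 40],
})

# melt: wide -> long, one row per (Month, Product) observation
long = wide.melt(id_vars='Month', var_name='Product', value_name='Sales')

# pivot_table: long -> wide again, aggregating duplicates with sum
back = long.pivot_table(values='Sales', index='Month',
                        columns='Product', aggfunc='sum')
print(back)
```

The long format is what most plotting and group-by workflows want; the wide format is what humans read.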

5. Time Series & Resampling

Pandas excels at handling time-based data.

df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Resample to Weekly Frequency and sum
weekly_sales = df['Sales'].resample('W').sum()
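A runnable sketch of the resampling above, using synthetic daily data (the dates are arbitrary; note that the plain 'W' alias buckets weeks ending on Sundays, i.e. it is shorthand for 'W-SUN'):

```python
import pandas as pd

# Two weeks of daily "sales" of 1 unit, starting on a Monday
idx = pd.date_range('2024-01-01', periods=14, freq='D')
daily = pd.Series(1, index=idx)

# 'W' buckets end on Sundays by default, so each bucket holds 7 days
weekly = daily.resample('W').sum()
print(weekly)   # two buckets of 7 each
```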

🟠 Phase 3: Expert (Performance & Memory)

6. Extension Dtypes & Nullability

Historically, Pandas used the float NaN for missing data, which forced integer columns containing missing values to upcast to float. Extension dtypes (like Int64 and boolean — note the capital letters) allow for genuinely nullable integers/booleans, with missing values represented as pd.NA.

# Using the nullable integer type
df['id'] = df['id'].astype('Int64')
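The upcast problem and its fix in one sketch:

```python
import pandas as pd

s = pd.Series([1, 2, None])
print(s.dtype)                  # float64: the missing value forced an upcast

s_nullable = s.astype('Int64')  # capital 'I' selects the nullable extension type
print(s_nullable.dtype)         # Int64
# The integers stay integers; the missing slot becomes pd.NA
```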

7. Vectorized String Operations (.str)

Prefer the .str accessor over apply for string manipulation: it chains cleanly, propagates missing values instead of raising, and keeps the intent readable.

df['Name_Clean'] = df['Name'].str.strip().str.upper()
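A small sketch showing both the chaining and the missing-value handling (data is illustrative):

```python
import pandas as pd

raw = pd.Series(['  alice  ', 'Bob', None])

# Chained .str ops skip missing values instead of raising
clean = raw.str.strip().str.upper()

# Vectorized pattern match; na=False treats missing entries as non-matching
mask = clean.str.contains('BOB', na=False)
print(mask.tolist())   # [False, True, False]
```

An apply with a lambda would crash on the None entry unless you guarded for it explicitly.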

8. Handling Out-of-Memory Data

For datasets larger than RAM:

  1. Chunking: Process the file in pieces.
  2. usecols & dtype: Only load the columns you need and specify their types during read_csv.
# Load only 2 columns as specific types
df = pd.read_csv('big.csv', usecols=['id', 'score'], dtype={'id': 'int32', 'score': 'float32'})
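The chunking strategy from step 1 can be sketched like this: pass chunksize to read_csv to get an iterator of DataFrames, aggregate each piece, and keep only the running result (an in-memory StringIO stands in for the big file):

```python
import io
import pandas as pd

# Stand-in for a file too large to load at once
csv_data = io.StringIO("id,score\n1,0.5\n2,0.7\n3,0.9\n4,0.1\n")

# chunksize yields DataFrames of at most 2 rows each
total = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['score'].sum()   # aggregate per piece, discard the piece

print(total)
```

Peak memory is bounded by the chunk size, not the file size, as long as each chunk's result is reduced before the next one is read.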

🔴 Phase 4: Senior Architect (The Internal Engine)

9. Cython & Numba Integration

For custom logic that can’t be vectorized, use Numba to JIT-compile your Python code.

import numba
import numpy as np

@numba.njit
def fast_calc(arr):
    # The explicit loop is fine: Numba compiles it to machine code
    out = np.empty_like(arr)
    for i in range(arr.shape[0]):
        out[i] = arr[i] * arr[i] + 10
    return out

# Pass the whole NumPy array once; calling a jitted function
# element-by-element via .apply would pay Python call overhead
# on every row and negate the JIT speedup
df['Result'] = fast_calc(df['Value'].to_numpy())

10. Apache Arrow Backend (Pandas 2.0+)

Pandas 2.0+ supports Apache Arrow as an alternative backend to NumPy. For string- and timestamp-heavy data in particular, Arrow-backed columns are typically faster and markedly more memory-efficient than their NumPy object-dtype equivalents.

# Using the Arrow backend for reading
df = pd.read_csv('data.csv', engine='pyarrow', dtype_backend='pyarrow')