Module 3: Pandas & DataFrames (The Advanced Spreadsheet)

Course ID: PY-103
Subject: The Automated Spreadsheet

If NumPy is the Calculator, then Pandas is the Excel of Python. It’s the primary tool we use to load, clean, and analyze data in professional projects.

🏗️ Step 1: The “Manual Cleaning” Problem

Imagine you have a CSV file with 1 million orders.

Some orders are missing a Price.
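Some orders have an invalid Date (for example, a day that does not exist).
Some rows are duplicates.

Fixing a million rows by hand is not realistic. Pandas lets you find and fix all three problems with a few lines of code.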
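
Before fixing anything, it helps to measure how bad the mess is. The sketch below is a minimal illustration, assuming a tiny made-up `orders` table in place of the 1-million-row CSV. The column names and values are hypothetical, and treating "a date that cannot be parsed" as the bad-date case is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical mini version of the orders file (4 rows instead of 1 million)
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "price": [19.99, np.nan, 5.00, 5.00],
    "date": ["2024-01-05", "2024-01-06", "not a date", "not a date"],
})

# Count orders with a missing price
missing_prices = int(orders["price"].isna().sum())

# Count dates that cannot be parsed (errors="coerce" turns them into NaT)
parsed_dates = pd.to_datetime(orders["date"], errors="coerce")
bad_dates = int(parsed_dates.isna().sum())

# Count rows that are exact copies of an earlier row
duplicate_rows = int(orders.duplicated().sum())

print("Missing prices:", missing_prices)
print("Bad dates:", bad_dates)
print("Duplicate rows:", duplicate_rows)
```

The same three checks work unchanged on a table loaded with `pd.read_csv`, no matter how many rows it has.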

🏗️ Step 2: The DataFrame (The “Super Spreadsheet”)

A DataFrame is the core of Pandas.

It’s like a single sheet in an Excel workbook.
It has Columns (The Features) and Rows (The Records).
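
To make the rows-and-columns idea concrete, here is a small sketch that builds a DataFrame from a dictionary and inspects it. The `shoes` table and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical product table: each dictionary key becomes a column,
# and each position in the lists becomes a row
shoes = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "price": [50.0, 40.0, 70.0],
})

print(shoes.shape)          # (number of rows, number of columns)
print(list(shoes.columns))  # the column names

# Selecting one column gives a Series
prices = shoes["price"]
print(type(prices).__name__)

# Selecting one row by position gives that single record
first_record = shoes.iloc[0]
print(first_record["color"], first_record["price"])
```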

🧩 The Analogy: The Magic Spreadsheet

Imagine you have a spreadsheet that could answer any question instantly.

You say: “Hey spreadsheet, what is the average price of all the red shoes?”
And it answers in 0.1 seconds.

That is Pandas! It gives you powerful commands like df.groupby('color')['price'].mean(), which computes the average price for every color in a giant dataset with one line of code.
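
Here is what the red-shoes question could look like in practice. This is a sketch on a made-up four-row `shoes` table, not real data. It shows two ways to ask: group by color to get every average at once, or filter down to the red rows first.

```python
import pandas as pd

# Hypothetical shoe inventory
shoes = pd.DataFrame({
    "color": ["red", "blue", "red", "blue"],
    "price": [50.0, 40.0, 70.0, 60.0],
})

# Option 1: average price for every color at once
avg_by_color = shoes.groupby("color")["price"].mean()
print(avg_by_color)

# Option 2: average price of the red shoes only
red_avg = shoes.loc[shoes["color"] == "red", "price"].mean()
print("Red average:", red_avg)
```

Grouping is the better fit when you want a summary for every category. Filtering is simpler when you only care about one.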

🏗️ Step 3: Imputation & Filtering (The “Cleanup Crew”)

Pandas makes it easy to handle messy data.

🧩 The Analogy: The Quality Control Inspector

Imputation: Filling in missing values (shown as NaN) with a reasonable substitute, such as the average of the rest of the column.
Filtering: Throwing away rows that don’t belong (e.g., negative prices).
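
The two cleanup jobs can be sketched together on a small made-up `items` table (the names and prices are hypothetical). In this sketch, filtering runs first so that the invalid negative price does not drag down the average used for imputation. That ordering is a design choice, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical price list with one missing value and one impossible value
items = pd.DataFrame({
    "item": ["hat", "scarf", "gloves", "socks"],
    "price": [20.0, np.nan, -5.0, 10.0],
})

# Filtering: drop rows with a negative price, but keep the missing one for now
keep = (items["price"] >= 0) | items["price"].isna()
valid = items[keep].copy()

# Imputation: fill the missing price with the mean of the remaining prices
valid["price"] = valid["price"].fillna(valid["price"].mean())

print(valid)
```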

🧪 Step 4: Python Practice (Cleaning Your First Table)

Run this code to see how easy it is to clean data with Pandas.

import pandas as pd
import numpy as np

# 1. Create a messy table
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'age': [25, 30, np.nan, 25], # Charlie is missing his age!
    'price': [100, 200, 150, 100]
}

df = pd.DataFrame(data)

# 2. Clean it up!
# - Remove duplicates (The second 'Alice')
df = df.drop_duplicates()

# - Fill missing age with the average (Mean)
df['age'] = df['age'].fillna(df['age'].mean())

# 3. View the clean table
print("Clean Table:")
print(df)

# 4. Ask a question!
print(f"\nAverage Price: ${df['price'].mean():.2f}")  # two decimals, like a price tag

🥅 Module 3 Review

DataFrame: A 2D table (like a spreadsheet) for your data.
Series: A single column of a DataFrame.
Grouping: The ability to summarize data based on a category (like ‘color’).
Handling NaNs: Finding missing values (NaN) and either filling them in (imputation) or removing the affected rows.

:::tip Slow Learner Note
You don’t need to learn every Pandas command at once. Just remember: If you would do it in Excel, you should do it in Pandas!
:::