Module 3: Pandas & DataFrames (The Advanced Spreadsheet)
Course ID: PY-103
Subject: The Automated Spreadsheet
If NumPy is the Calculator, then Pandas is the Excel of Python. It’s the primary tool we use to load, clean, and analyze data in professional projects.
🏗️ Step 1: The “Manual Cleaning” Problem
Imagine you have a CSV file with 1 million orders.
- Some orders are missing a Price.
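- Some orders have an invalid Date (for example, text that is not a real date).
- Some rows are duplicates.
Fixing these problems by hand, row by row, is not realistic at this size. This is the problem Pandas solves.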
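Before fixing anything, it helps to count how many rows have each problem. The sketch below uses a tiny made-up table in place of the real file; the column names (`order_id`, `price`, `date`) are assumptions for illustration, and a "bad" date is treated here as any value that cannot be parsed as a date.

```python
import pandas as pd

# Hypothetical miniature of the orders file (column names are assumptions).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "price": [19.99, None, 5.00, 5.00],
    "date": ["2024-01-05", "2024-01-06", "not a date", "not a date"],
})

# Count rows where the price is missing.
missing_price = int(orders["price"].isna().sum())

# Try to parse every date; values that cannot be parsed become NaT.
parsed_dates = pd.to_datetime(orders["date"], errors="coerce")
bad_dates = int(parsed_dates.isna().sum())

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(orders.duplicated().sum())

print("Missing prices:", missing_price)
print("Bad dates:", bad_dates)
print("Duplicate rows:", duplicate_rows)
```

The same three checks work unchanged on a table with a million rows, which is the point of using Pandas instead of scrolling through a spreadsheet.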
🏗️ Step 2: The DataFrame (The “Super Spreadsheet”)
A DataFrame is the core of Pandas.
- It’s like a single sheet in an Excel workbook.
- It has Columns (The Features) and Rows (The Records).
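Here is a minimal sketch of what "columns and rows" look like in code. The shoe data is made up for illustration.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column.
shoes = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "price": [50.0, 40.0, 70.0],
})

print(shoes.shape)          # (number of rows, number of columns)
print(list(shoes.columns))  # the column names (the features)

prices = shoes["price"]     # one column, returned as a Series
first_row = shoes.iloc[0]   # one row (a record), selected by position

print(prices)
print(first_row)
```

Selecting with `shoes["price"]` gives you a column, while `shoes.iloc[0]` gives you a row. Most day-to-day Pandas work is built from these two moves.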
🧩 The Analogy: The Magic Spreadsheet
Imagine you have a spreadsheet that could answer questions about your data on demand.
- You say: “Hey spreadsheet, what is the average price of all the red shoes?”
- And it answers almost immediately.
That is Pandas! It gives you powerful commands like df.groupby('color')['price'].mean(), which computes the average price for every color in one line of code, even on a very large dataset. The entry for 'red' is the answer to the question above.
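To make the red-shoes question concrete, here is a small sketch. The table and its values are invented for illustration; only the `groupby` pattern comes from the lesson.

```python
import pandas as pd

# Hypothetical shoe catalog.
shoes = pd.DataFrame({
    "color": ["red", "blue", "red", "blue"],
    "price": [50.0, 40.0, 70.0, 60.0],
})

# Average price for every color at once.
avg_by_color = shoes.groupby("color")["price"].mean()
print(avg_by_color)

# Pick out the answer for red shoes only.
red_avg = avg_by_color["red"]

# Same answer without grouping: filter the rows first, then average.
red_avg_direct = shoes.loc[shoes["color"] == "red", "price"].mean()

print(red_avg, red_avg_direct)
```

Use `groupby` when you want the answer for every category, and a filter when you only care about one.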
🏗️ Step 3: Imputation & Filtering (The “Cleanup Crew”)
Pandas makes it easy to handle messy data.
🧩 The Analogy: The Quality Control Inspector
- Imputation: Filling in missing values (shown as NaN in Pandas) with an estimate, such as the average of the rest of the column.
- Filtering: Throwing away rows that don’t belong (e.g., negative prices).
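Both jobs can be sketched in a few lines. The table below is made up, and the rule "a price below zero is invalid" is an assumption taken from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical items: one missing price and one negative price.
items = pd.DataFrame({
    "item": ["A", "B", "C", "D"],
    "price": [10.0, np.nan, -5.0, 20.0],
})

# Filtering: keep rows whose price is non-negative or still missing.
keep = (items["price"] >= 0) | items["price"].isna()
clean = items[keep].copy()

# Imputation: fill the missing price with the mean of the remaining prices.
clean["price"] = clean["price"].fillna(clean["price"].mean())

print(clean)
```

This sketch filters before it imputes so that the invalid negative price does not pull the average down. With the order reversed, the filled-in value would be based on bad data.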
🧪 Step 4: Python Practice (Cleaning Your First Table)
Run this code to see how easy it is to clean data with Pandas.
import pandas as pd
import numpy as np
# 1. Create a messy table
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'age': [25, 30, np.nan, 25],  # Charlie is missing his age!
    'price': [100, 200, 150, 100]
}
df = pd.DataFrame(data)
# 2. Clean it up!
# - Remove duplicates (The second 'Alice')
df = df.drop_duplicates()
# - Fill missing age with the average (Mean)
df['age'] = df['age'].fillna(df['age'].mean())
# 3. View the clean table
print("Clean Table:")
print(df)
# 4. Ask a question!
print(f"\nAverage Price: ${df['price'].mean()}")
🥅 Module 3 Review
- DataFrame: A 2D table (like a spreadsheet) for your data.
- Series: A single column of a DataFrame.
- Grouping: The ability to summarize data based on a category (like ‘color’).
- Handling NaNs: Filling in (imputing) or removing missing values so they don’t distort your results.
:::tip Slow Learner Note
You don’t need to learn every Pandas command at once. Just remember: If you would do it in Excel, you should do it in Pandas!
:::