Module 3: Pandas & DataFrames (The Advanced Spreadsheet)

📚 Module 3: Pandas & DataFrames

Course ID: PY-103
Subject: The Automated Spreadsheet

If NumPy is the Calculator, then Pandas is the Excel of Python. It’s the primary tool we use to load, clean, and analyze data in professional projects.


🏗️ Step 1: The “Manual Cleaning” Problem

Imagine you have a CSV file with 1 million orders.

  • Some orders are missing a Price.
  • Some orders have an invalid Date (one that is impossible or in the wrong format).
  • Some rows are duplicates.
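
Before fixing anything, it helps to count how many rows are affected. Here is a minimal sketch, assuming a tiny made-up `orders` table standing in for the 1-million-order CSV. The column names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the big orders CSV (values made up for illustration)
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "price": [19.99, 5.00, 5.00, np.nan],
    "date": ["2024-01-05", "2024-01-06", "2024-01-06", "not a date"],
})

# Count orders with a missing price
missing_prices = orders["price"].isna().sum()

# Count dates that cannot be parsed; errors="coerce" turns them into NaT
parsed_dates = pd.to_datetime(orders["date"], errors="coerce")
bad_dates = parsed_dates.isna().sum()

# Count rows that are exact copies of an earlier row
duplicate_rows = orders.duplicated().sum()

print("Missing prices:", missing_prices)
print("Unparseable dates:", bad_dates)
print("Duplicate rows:", duplicate_rows)
```

The same three checks work unchanged on a table with a million rows, which is the point: you describe the problem once instead of scrolling through the file.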

🏗️ Step 2: The DataFrame (The “Super Spreadsheet”)

A DataFrame is the core of Pandas.

  • It’s like a single sheet in an Excel workbook.
  • It has Columns (The Features) and Rows (The Records).
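
To make rows and columns concrete, here is a small sketch using a made-up `shoes` table (the data is hypothetical). It shows how to check the size of a DataFrame and how to pull out one column or one row.

```python
import pandas as pd

# Made-up shoe inventory: each dictionary key becomes a column,
# and each position in the lists becomes a row
shoes = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "price": [50.0, 40.0, 70.0],
})

print(shoes.shape)          # (number of rows, number of columns)
print(list(shoes.columns))  # the column names (the features)

# Selecting one column returns a Series
prices = shoes["price"]
print(type(prices).__name__)

# Selecting one row by position returns that single record
first_record = shoes.iloc[0]
print(first_record["color"])
```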

🧩 The Analogy: The Magic Spreadsheet

Imagine you have a spreadsheet that could answer any question instantly.

  • You say: “Hey spreadsheet, what is the average price of all the red shoes?”
  • And it answers in 0.1 seconds.

That is Pandas! It gives you powerful commands like df.groupby('color')['price'].mean(), which computes the average price for every color in a giant dataset with one line of code.
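
The red-shoes question can be sketched as follows, assuming a small made-up table with `color` and `price` columns (hypothetical data). Grouping answers the question for every color at once, while filtering first answers it for red only.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "color": ["red", "blue", "red", "blue"],
    "price": [50.0, 40.0, 70.0, 60.0],
})

# Average price for every color at once (one result per color)
avg_by_color = df.groupby("color")["price"].mean()
print(avg_by_color)

# Only the red shoes: keep the matching rows, then average their prices
red_avg = df.loc[df["color"] == "red", "price"].mean()
print(red_avg)
```

Both approaches agree on the red average; grouping is handy when you want the whole summary, and filtering when you care about one category.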


🏗️ Step 3: Imputation & Filtering (The “Cleanup Crew”)

Pandas makes it easy to handle messy data.

🧩 The Analogy: The Quality Control Inspector

  • Imputation: Filling in the missing values (shown as NaN or “None”) with a sensible stand-in, such as the Average of the rest of the column.
  • Filtering: Throwing away rows that don’t belong (e.g., negative prices).
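
The two ideas can be sketched together. This is one possible approach, assuming a small made-up `orders` table (hypothetical items and prices); other strategies, such as filling with the median, are just as valid.

```python
import numpy as np
import pandas as pd

# Made-up orders: one missing price and one impossible (negative) price
orders = pd.DataFrame({
    "item": ["hat", "scarf", "gloves", "socks"],
    "price": [20.0, np.nan, -5.0, 10.0],
})

# Filtering: drop rows with a negative price, but keep missing ones for now
keep = orders["price"].isna() | (orders["price"] >= 0)
valid = orders[keep].copy()

# Imputation: fill the missing price with the mean of the remaining prices
valid["price"] = valid["price"].fillna(valid["price"].mean())

print(valid)
```

Filtering comes first here so that the bad value (-5.0) does not drag down the average used to fill the gap.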

🧪 Step 4: Python Practice (Cleaning Your First Table)

Run this code to see how easy it is to clean data with Pandas.

import pandas as pd
import numpy as np

# 1. Create a messy table
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'age': [25, 30, np.nan, 25], # Charlie is missing his age!
    'price': [100, 200, 150, 100]
}

df = pd.DataFrame(data)

# 2. Clean it up!
# - Remove duplicates (The second 'Alice')
df = df.drop_duplicates()

# - Fill missing age with the average (Mean)
df['age'] = df['age'].fillna(df['age'].mean())

# 3. View the clean table
print("Clean Table:")
print(df)

# 4. Ask a question!
print(f"\nAverage Price: ${df['price'].mean():.2f}")  # two decimal places, like money

🥅 Module 3 Review

  1. DataFrame: A 2D table (like a spreadsheet) for your data.
  2. Series: A single column of a DataFrame.
  3. Grouping: The ability to summarize data based on a category (like ‘color’).
  4. Handling NaNs: Fixing missing data by filling it in (for example, with the column average) instead of ignoring it.

:::tip Slow Learner Note
You don’t need to learn every Pandas command at once. Just remember: If you would do it in Excel, you should do it in Pandas!
:::