🚀 Vectorization (NumPy & Polars)

In data engineering, vectorization is one of the most important performance optimizations. Instead of looping over elements in Python, you express an operation on a whole array or column, and the library runs it in compiled code at “C-speed.”

🏗️ 1. Why is Python Slow for Data?

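Python is interpreted and dynamically typed. A standard for loop over a large list pays these costs on every iteration:

Type Checking: The interpreter inspects each element's type at runtime before it can pick the right operation.
Reference Counting: Every object the loop touches has its reference count incremented and decremented.
Boxing/Unboxing: Each number is a full heap-allocated Python object, so values are unwrapped for arithmetic and the result is wrapped in a new object.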
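The boxing overhead can be made visible by comparing how much memory a list of Python floats uses against a NumPy array holding the same values. This is a rough sketch: the variable names are illustrative, and the exact object sizes depend on the Python build.

```python
import sys

import numpy as np

values = [float(i) for i in range(1000)]
array = np.array(values, dtype=np.float64)

# Each Python float is a separate heap object carrying a type pointer
# and a reference count in addition to the 8-byte value itself.
boxed_bytes = sys.getsizeof(values[0])

# The list stores pointers to those objects, so the total footprint is
# the list's own buffer plus every boxed float.
list_total_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# The NumPy array stores raw 8-byte floats back to back.
array_bytes = array.nbytes

print("one boxed float:", boxed_bytes, "bytes")
print("list of floats: ", list_total_bytes, "bytes")
print("float64 array:  ", array_bytes, "bytes")
```

The array needs exactly 8 bytes per element, while the list pays for a pointer and a full object per element.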

🛠️ 2. NumPy: The Foundation

NumPy uses fixed-type arrays (like float64) and executes operations in highly optimized C loops.

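import timeit

import numpy as np

# Slow Python Loop: the interpreter handles one element at a time
def slow_sum(data):
    total = 0.0
    for x in data:
        total += x
    return total

# Fast Vectorized NumPy: a single call into a compiled summation loop
def fast_sum(data):
    return np.sum(data)

data = np.random.rand(1_000_000)

# Both versions agree up to floating-point rounding
assert np.isclose(slow_sum(data), fast_sum(data))

# fast_sum is typically 10-100x faster; the exact ratio depends on hardware
print("loop: ", timeit.timeit(lambda: slow_sum(data), number=3))
print("numpy:", timeit.timeit(lambda: fast_sum(data), number=3))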
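Vectorization is not limited to reductions like a sum. Element-wise logic with a condition can usually be rewritten as well, for example with np.where. The sketch below uses hypothetical names (scale_loop, scale_vectorized, a 100 threshold) purely for illustration.

```python
import numpy as np

def scale_loop(prices, rate):
    # Per-element Python loop with a branch
    out = []
    for p in prices:
        out.append(p * (1 + rate) if p > 100 else p)
    return out

def scale_vectorized(prices, rate):
    # Same logic as one array expression: the comparison builds a boolean
    # mask and np.where picks a value per element inside compiled code.
    prices = np.asarray(prices, dtype=np.float64)
    return np.where(prices > 100, prices * (1 + rate), prices)

prices = [50.0, 150.0, 200.0, 99.0]
loop_result = scale_loop(prices, 0.1)
vec_result = scale_vectorized(prices, 0.1)

# Both approaches produce the same values
print(np.allclose(loop_result, vec_result))
```

Note that np.where evaluates both branches for every element before selecting, which is fine for cheap arithmetic but worth keeping in mind for expensive expressions.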

📦 3. Polars: The Modern Pipeline

Polars is a DataFrame library written in Rust. It stores data in the Apache Arrow columnar memory format and is often significantly faster than pandas, especially on large datasets and multi-core machines.

Key Features:

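Lazy Evaluation: Queries are built as a plan and optimized (for example, by pushing filters down toward the data source) before anything executes.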
Multithreading: Parallel processing by default.
Vectorized Engine: Uses SIMD (Single Instruction Multiple Data) instructions.

import polars as pl

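# scan_csv returns a LazyFrame, so nothing is read until collect();
# this lets Polars push the filter down into the CSV read itself
lazy_df = pl.scan_csv("large_data.csv")

# Vectorized transformation in Polars
result = (
    lazy_df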
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.col("score").mean())
    .collect()
)

🚦 4. SIMD & Memory Layout

Vectorized libraries leverage:

SIMD: Processing multiple data points in one CPU instruction.
Contiguous Memory: Storing data in a flat buffer for better CPU cache locality.
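Memory layout can be inspected directly in NumPy through an array's flags and strides. The following is a minimal sketch with an arbitrary 3x4 array; it shows that a row-major array is contiguous, a column slice of it is not, and np.ascontiguousarray copies the slice into a flat buffer.

```python
import numpy as np

# A freshly created 2-D array is stored row by row in one flat buffer
matrix = np.arange(12, dtype=np.float64).reshape(3, 4)
print(matrix.flags["C_CONTIGUOUS"])  # True
print(matrix.strides)                # (32, 8): 32 bytes to the next row, 8 to the next column

# A column slice is a view that skips a whole row (32 bytes) between elements
column = matrix[:, 1]
print(column.flags["C_CONTIGUOUS"])  # False
print(column.strides)                # (32,)

# Copying into a contiguous buffer puts the elements back to back
packed = np.ascontiguousarray(column)
print(packed.flags["C_CONTIGUOUS"])  # True
print(packed.strides)                # (8,)
```

Operations on the packed copy read memory sequentially, which is friendlier to the CPU cache and to SIMD loads than striding across rows.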