Vectorization (NumPy & Polars)
In data engineering, vectorization is often the single most important performance optimization. It lets Python run numerical operations at "C speed" by moving the per-element loop out of the interpreter and into compiled code.
1. Why Is Python Slow for Data?
Python is interpreted and dynamically typed. A standard for loop over a large list involves:
- Type Checking for every element.
- Reference Counting overhead.
- Boxing/Unboxing of Python objects.
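The per-element overhead listed above can be made visible with a small sketch. It assumes NumPy is installed and a 64-bit CPython build; the variable names are illustrative, and the exact byte counts are implementation details that may differ on other interpreters.

```python
import sys
import numpy as np

n = 1_000

# A Python list stores pointers to separately allocated float objects ("boxed" values).
py_values = [float(i) for i in range(n)]
boxed_bytes = sum(sys.getsizeof(v) for v in py_values)  # the float objects themselves
pointer_bytes = sys.getsizeof(py_values)                 # the list's table of pointers

# A NumPy float64 array stores the raw 8-byte values back to back in one buffer.
np_values = np.arange(n, dtype=np.float64)
raw_bytes = np_values.nbytes

print(f"list:  {boxed_bytes + pointer_bytes} bytes for {n} floats")
print(f"array: {raw_bytes} bytes for {n} floats")

# Each list element carries its own type information; the array has a single dtype.
print(type(py_values[0]), np_values.dtype)
```

Because the array has one dtype for all elements, a compiled loop over it does not need to check types or adjust reference counts element by element.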
2. NumPy: The Foundation
NumPy uses fixed-type arrays (like float64) and executes operations in highly optimized C loops.
import numpy as np

# Slow: a pure-Python loop handles every element through the interpreter
def slow_sum(data):
    total = 0
    for x in data:
        total += x
    return total

# Fast: the loop runs inside NumPy's compiled C code
def fast_sum(data):
    return np.sum(data)

data = np.random.rand(1_000_000)
# fast_sum(data) is typically 10-100x faster than slow_sum(data)
3. Polars: The Modern Pipeline
Polars is a DataFrame library written in Rust. It stores data in the Apache Arrow columnar memory format and is often significantly faster than pandas, particularly on multi-core machines.
Key Features:
- Lazy Evaluation: Optimized query execution.
- Multithreading: Parallel processing by default.
- Vectorized Engine: Uses SIMD (Single Instruction Multiple Data) instructions.
import polars as pl

# read_csv loads the whole file eagerly; pl.scan_csv would start lazily instead
df = pl.read_csv("large_data.csv")

# Vectorized transformation in Polars
result = (
    df.lazy()
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.col("score").mean())
    .collect()
)
4. SIMD & Memory Layout
Vectorized libraries leverage:
- SIMD: Processing multiple data points in one CPU instruction.
- Contiguous Memory: Storing data in a flat buffer for better CPU cache locality.
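As a rough illustration of the contiguous-memory point, the following sketch inspects how NumPy lays out an array in memory. It assumes NumPy is installed; the array shape and variable names are illustrative only.

```python
import numpy as np

# A 2-D float64 array is stored as one flat, row-major (C-order) buffer.
matrix = np.arange(12, dtype=np.float64).reshape(3, 4)
print(matrix.flags["C_CONTIGUOUS"])  # True
print(matrix.strides)                # (32, 8): 32 bytes to the next row, 8 to the next column

# Taking every other column yields a view that skips over elements in the buffer.
view = matrix[:, ::2]
print(view.flags["C_CONTIGUOUS"])    # False
print(view.strides)                  # (32, 16)

# np.ascontiguousarray copies the view into a fresh, densely packed buffer.
packed = np.ascontiguousarray(view)
print(packed.flags["C_CONTIGUOUS"])  # True
print(packed.strides)                # (16, 8)
```

Densely packed buffers like `matrix` and `packed` are the layout that lets a compiled loop read neighboring values from the same cache line and feed them to SIMD instructions; a strided view such as `view` still works, but adjacent elements are further apart in memory.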