🚀 Vectorization (NumPy & Polars)

In data engineering, vectorization is often one of the most effective performance optimizations. It moves per-element loops out of the Python interpreter and into compiled code, so numerical operations run at "C speed."


🏗️ 1. Why is Python Slow for Data?

Python is interpreted and dynamically typed. A standard for loop over a large list involves:

  • Type Checking for every element.
  • Reference Counting overhead.
  • Boxing/Unboxing of Python objects.
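The boxing overhead listed above can be made visible by comparing the size of one Python float object with the size of one element in a NumPy array. This is a minimal sketch; the variable names are illustrative, and the exact object size depends on the Python build.

```python
import sys
import numpy as np

n = 1_000
py_list = [float(i) for i in range(n)]      # each element is a separate Python object
arr = np.arange(n, dtype=np.float64)        # one flat buffer of raw 8-byte floats

# Size of a single boxed float (value plus type pointer and reference count).
# Typically 24 bytes on 64-bit CPython, but this is build-dependent.
boxed_bytes = sys.getsizeof(py_list[0])

# Size of a single unboxed float64 inside the array.
raw_bytes = arr.itemsize

print(f"boxed float: {boxed_bytes} bytes, float64 element: {raw_bytes} bytes")
print(f"array payload: {arr.nbytes} bytes for {n} elements")
```

Every boxed float also has to be type-checked and reference-counted each time the loop touches it, which is the work a typed array avoids.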

🛠️ 2. NumPy: The Foundation

NumPy uses fixed-type arrays (like float64) and executes operations in highly optimized C loops.

import numpy as np

# Slow Python Loop
def slow_sum(data):
    total = 0
    for x in data:
        total += x
    return total

# Fast Vectorized NumPy
def fast_sum(data):
    return np.sum(data)

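data = np.random.rand(1_000_000)
# On an array this size, fast_sum is typically 10-100x faster than slow_sum.
# The exact ratio depends on hardware; compare the two with timeit.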
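Vectorization is not limited to reductions such as a sum. The sketch below applies the same idea to element-wise arithmetic and filtering. The function names and the temperature data are hypothetical and chosen only for illustration.

```python
import numpy as np

def loop_scale(values, factor, offset):
    """Element-by-element version: one interpreter iteration per value."""
    out = []
    for v in values:
        out.append(v * factor + offset)
    return out

def vectorized_scale(values, factor, offset):
    """Vectorized version: the whole array is processed in compiled loops."""
    return values * factor + offset

celsius = np.array([0.0, 37.0, 100.0])

looped = loop_scale(celsius, 1.8, 32.0)
vectorized = vectorized_scale(celsius, 1.8, 32.0)

# A boolean mask replaces an explicit "if" inside a loop.
hot = vectorized[vectorized > 90.0]
```

Both functions produce the same values. The vectorized one expresses the computation once for the whole array instead of once per element.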

📦 3. Polars: The Modern Pipeline

Polars is a DataFrame library written in Rust. It stores data in the Apache Arrow columnar memory format and is often substantially faster than pandas, particularly on large datasets and multi-core machines.

Key Features:

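  • Lazy Evaluation: Builds a query plan and optimizes it (for example, by pushing filters and column selection down to the data source) before running it.
  • Multithreading: Parallel processing across CPU cores by default.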
  • Vectorized Engine: Uses SIMD (Single Instruction Multiple Data) instructions.

import polars as pl

# scan_csv starts a lazy query, so the filter can be applied while the file is read
# Vectorized transformation in Polars
result = (
    pl.scan_csv("large_data.csv")
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.col("score").mean())
    .collect()
)

🚦 4. SIMD & Memory Layout

Vectorized libraries leverage:

  • SIMD: Processing multiple data points in one CPU instruction.
  • Contiguous Memory: Storing data in a flat buffer for better CPU cache locality.
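The memory layout point can be inspected directly in NumPy through an array's strides and flags. This is a small sketch with illustrative variable names. It shows that a column slice of a row-major array is not contiguous, and that copying it produces a flat buffer.

```python
import numpy as np

# A 3x4 row-major (C-order) array of float64: rows sit next to each other in memory.
arr = np.arange(12, dtype=np.float64).reshape(3, 4)
print(arr.strides)                    # bytes to step per axis: (32, 8)
print(arr.flags["C_CONTIGUOUS"])      # True

# Taking one column yields a view that jumps 32 bytes between elements.
col = arr[:, 0]
print(col.strides)                    # (32,)
print(col.flags["C_CONTIGUOUS"])      # False

# Copying into a contiguous buffer packs the elements 8 bytes apart.
packed = np.ascontiguousarray(col)
print(packed.strides)                 # (8,)
print(packed.flags["C_CONTIGUOUS"])   # True
```

Operations on the packed array read memory sequentially, which tends to suit CPU caches and SIMD loads better than strided access.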