Step 4: Storage and File Formats

In data engineering, choosing the right file format can save thousands of dollars in storage costs and hours of query time.


📊 CSV vs. Parquet

| Feature | CSV | Apache Parquet |
| --- | --- | --- |
| Type | Row-based | Columnar |
| Size | Large (plain text) | Small (compressed) |
| Speed | Slow to read | Fast for analytics |
| Schema | No built-in schema | Includes metadata/schema |

🛠️ Code Example: CSV to Parquet

This script demonstrates how to convert a file to Parquet to improve performance.

```python
import pandas as pd

# 1. Load a CSV
df = pd.read_csv('massive_data.csv')

# 2. Save as Parquet
# Requires 'pyarrow' or 'fastparquet' to be installed
df.to_parquet('optimized_data.parquet', compression='snappy')

# 3. Compare read speed
# Parquet is much faster for analytics because it reads only the columns you need!
```

🥅 Your Goal

  • Understand why columnar storage is better for analytics: queries typically touch only a few columns out of many.
  • Install pyarrow and update your weather data script to save its output as Parquet.