Step 4: Storage and File Formats
In data engineering, choosing the right file format can save thousands of dollars in storage costs and hours of query time.
📊 CSV vs. Parquet
| Feature | CSV | Apache Parquet |
|---|---|---|
| Type | Row-based | Columnar |
| Size | Large (Plain Text) | Small (Compressed) |
| Speed | Slow to read | Fast for analytics |
| Schema | No built-in schema | Includes metadata/schema |
🛠️ Code Example: CSV to Parquet
This script demonstrates how to convert a file to Parquet to improve performance.
```python
import pandas as pd

# 1. Load a CSV
df = pd.read_csv('massive_data.csv')

# 2. Save as Parquet
# Requires 'pyarrow' or 'fastparquet' installed
df.to_parquet('optimized_data.parquet', compression='snappy')

# 3. Compare read speed
# Parquet will be much faster because it only reads the columns you need!
```

🥅 Your Goal
- Understand that columnar storage is better for analytics because you typically query only a few columns out of many.
- Install `pyarrow` and convert your weather data script to save as Parquet.