Step 4: Storage and File Formats

In data engineering, choosing the right file format can save thousands of dollars in storage costs and hours of query time.


📊 CSV vs. Parquet

| Feature | CSV | Apache Parquet |
| --- | --- | --- |
| Type | Row-based | Columnar |
| Size | Large (plain text) | Small (compressed) |
| Speed | Slow to read | Fast for analytics |
| Schema | No built-in schema | Includes metadata/schema |

🛠️ Code Example: CSV to Parquet

This script demonstrates how to convert a file to Parquet to improve performance.

```python
import pandas as pd

# 1. Load a CSV
df = pd.read_csv('massive_data.csv')

# 2. Save as Parquet
# Requires 'pyarrow' or 'fastparquet' to be installed
df.to_parquet('optimized_data.parquet', compression='snappy')

# 3. Compare read speed
# Parquet is much faster for analytics because it reads only the columns you need!
```

🥅 Your Goal

  • Understand why columnar storage is better for analytics: queries typically touch only a few columns out of many.
  • Install pyarrow and update your weather data script to save its output as Parquet.