Embedding vs. Referencing

In MongoDB, the most critical design decision is: "Should this data be part of the same document (embedding), or should it be linked via an ID (referencing)?"

Unlike relational schema design, where normalization is the usual goal, MongoDB design is governed by the rule: "Data that is accessed together should be stored together."


1. Embedding (Denormalization)

Embedding stores related data within a single document as nested objects or arrays.

Advantages

  • Read Performance: You can retrieve all related data with a single query against one document (no joins).
  • Atomicity: A write to a single document is atomic, so the parent and its embedded children change together or not at all.
  • Simplicity: No need for $lookup or complex application-side logic.

Disadvantages

  • 16 MB Limit: If the embedded array grows without bound (e.g., millions of comments), the document will exceed the maximum BSON document size.
  • Data Duplication: If you embed "Author Info" in every "Post" document, updating the author's name requires updating every post.
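
The duplication cost in the last bullet can be sketched as a fan-out write. This is a minimal sketch, assuming a PyMongo collection of posts in which each document embeds an `author` sub-document with `id` and `name` fields; that schema and the helper name `rename_embedded_author` are hypothetical and chosen only for illustration.

```python
def rename_embedded_author(posts, author_id, new_name):
    """Rewrite the embedded author name in every post by that author.

    `posts` is assumed to be a PyMongo collection (hypothetical schema:
    each post embeds {"author": {"id": ..., "name": ...}}).
    """
    result = posts.update_many(
        {"author.id": author_id},             # match every post embedding this author
        {"$set": {"author.name": new_name}},  # overwrite each duplicated copy
    )
    return result.modified_count
```

With referencing, the same rename would typically be a single `update_one` on the author's own document, which is the trade-off the bullet describes.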

Use Case: One-to-Few

A User and their Addresses. A user typically has only 1-3 addresses, and you almost always want to see the address when looking at the user.

// User Document (Embedded)
{
  "_id": ObjectId("..."),
  "username": "jdoe",
  "addresses": [
    { "type": "home", "city": "NYC", "zip": "10001" },
    { "type": "work", "city": "SF", "zip": "94105" }
  ]
}
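
Because the addresses live inside the user document, changing one of them is a single-document write. The following is a minimal sketch, assuming a PyMongo collection that holds documents shaped like the one above; the helper name `update_address_zip` is hypothetical.

```python
def update_address_zip(users, username, address_type, new_zip):
    """Change the zip of one embedded address in a single atomic write.

    `users` is assumed to be a PyMongo collection of user documents.
    """
    result = users.update_one(
        # The array condition selects which element the positional `$` refers to.
        {"username": username, "addresses.type": address_type},
        # `addresses.$` targets the first array element matched by the filter.
        {"$set": {"addresses.$.zip": new_zip}},
    )
    return result.modified_count == 1
```

A call such as `update_address_zip(db.users, "jdoe", "home", "10002")` would touch only the home address and leave the rest of the document unchanged.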

2. Referencing (Normalization)

Referencing links documents together using _id values, similar to foreign keys in SQL, except that MongoDB does not enforce referential integrity for you.

Advantages

  • Scalability: Avoids the 16 MB document limit, because child documents live in their own collection.
  • Consistency: Update a piece of data once, and the change is visible everywhere it's referenced.
  • Smaller Documents: Frequently updated documents stay small, which leads to better cache utilization.

Disadvantages

  • Performance: Requires multiple queries or a $lookup (join) stage, which is more expensive than reading a single document.
  • Lack of Atomicity: Updating two referenced documents atomically requires a multi-document transaction, which adds overhead.
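
As an illustration of the `$lookup` option mentioned above, here is a minimal sketch of an aggregation pipeline, assuming a `comments` collection whose documents store their parent's `_id` in a `post_id` field (the Post and Comments use case in this section). The helper name `post_with_comments_pipeline` is hypothetical.

```python
def post_with_comments_pipeline(post_id):
    """Build an aggregation pipeline that joins comments onto one post."""
    return [
        {"$match": {"_id": post_id}},       # select the parent post first
        {
            "$lookup": {
                "from": "comments",          # collection holding the child documents
                "localField": "_id",         # field on the post
                "foreignField": "post_id",   # parent reference stored on each comment
                "as": "comments",            # name of the output array on the post
            }
        },
    ]
```

It could be run with `list(db.posts.aggregate(post_with_comments_pipeline(post_id)))`. Note that the joined array ends up inside a single result document, which is itself subject to the 16 MB limit, so this form suits posts with a modest number of comments.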

Use Case: One-to-Many (Large)

A Post and its Comments. A popular post might have 100,000 comments. Embedding them all could push the document past the 16 MB limit, and would slow down every read of the post well before that point.

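# PyMongo: fetching a post and its referenced comments
from pymongo import MongoClient
from bson import ObjectId

# Connects to a local mongod on the default port and selects the blog_db database.
db = MongoClient().blog_db

# 1. Fetch the post ("650c..." is an elided placeholder; substitute a full 24-character hex ObjectId)
post_id = ObjectId("650c...")
post = db.posts.find_one({"_id": post_id})

# find_one returns None when nothing matches, so fail early with a clear message.
if post is None:
    raise LookupError(f"No post found with _id {post_id}")

# 2. Fetch the comments separately (parent referencing: each comment stores its post's _id)
comments = list(db.comments.find({"post_id": post_id}))

print(f"Post: {post['title']}, Comments Count: {len(comments)}")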
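
Loading every comment into a list is rarely what a page view needs when a post has many of them. One possible approach is to page through the referenced comments by `_id`. The sketch below assumes the same `comments` collection and `post_id` field as the snippet above; the helper name `fetch_comment_page` and the default page size are hypothetical.

```python
def fetch_comment_page(comments, post_id, page_size=20, after_id=None):
    """Return one page of comments for a post, ordered by _id.

    `comments` is assumed to be a PyMongo collection. Pass the `_id` of the
    last comment from the previous page as `after_id` to get the next page.
    """
    query = {"post_id": post_id}
    if after_id is not None:
        # Resume strictly after the last comment of the previous page.
        query["_id"] = {"$gt": after_id}
    cursor = comments.find(query).sort("_id", 1).limit(page_size)
    return list(cursor)
```

Since default ObjectId values begin with a creation timestamp, sorting by `_id` returns comments in roughly the order they were inserted.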

3. Choosing the Right Strategy

The best strategy depends on the cardinality of the relationship and the access patterns.

| Relationship Type | Cardinality | Strategy |
|-------------------|-------------|----------|
| One-to-One | 1:1 | Embed (unless security or audit requirements call for separation) |
| One-to-Few | 1:N (small) | Embed |
| One-to-Many | 1:N (large) | Reference (parent reference) |
| Many-to-Many | N:M | Reference (array of IDs on one or both sides) |
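
The many-to-many row can be illustrated with an array of IDs kept on one side. This is a minimal sketch, assuming hypothetical `students` and `courses` collections in which each student document carries a `course_ids` array; none of these names come from the text above.

```python
def courses_for_student(db, student_id):
    """Resolve a student's course references with one $in query."""
    # Project only the array of references; the rest of the student is not needed.
    student = db.students.find_one({"_id": student_id}, {"course_ids": 1})
    if student is None:
        return []
    return list(db.courses.find({"_id": {"$in": student.get("course_ids", [])}}))


def students_in_course(db, course_id):
    """Reverse lookup: match students whose course_ids array contains course_id."""
    return list(db.students.find({"course_ids": course_id}))
```

Keeping the array on only one side means a single write per enrollment. Storing it on both sides would make each direction a direct lookup, at the cost of keeping two arrays in sync.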

Summary

Embedding provides high performance for read-heavy operations where data is small. Referencing provides flexibility and scale for large, complex datasets. Always monitor your document sizes as your application grows!
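
One way to monitor document sizes is to ask the server for the largest documents in a collection. The sketch below assumes MongoDB 4.4 or later, where the `$bsonSize` aggregation operator is available; the helper name `largest_documents_pipeline` and the output field `size_bytes` are hypothetical.

```python
def largest_documents_pipeline(limit=5):
    """Build a pipeline listing the largest documents in a collection by BSON size."""
    return [
        # $bsonSize returns the size in bytes of the document when encoded as BSON.
        {"$project": {"size_bytes": {"$bsonSize": "$$ROOT"}}},
        {"$sort": {"size_bytes": -1}},  # largest first
        {"$limit": limit},
    ]
```

It could be run as `list(db.posts.aggregate(largest_documents_pipeline()))`. The pipeline reads every document in the collection, so it is better suited to an occasional check than to a hot path.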