Embedding vs. Referencing
In MongoDB, the most critical design decision is: "Should this data be part of the same document (Embedding), or should it be linked via an ID (Referencing)?"
Unlike SQL, where normalization is the goal, MongoDB design is governed by the rule: "Data that is accessed together should be stored together."
1. Embedding (Denormalization)
Embedding stores related data within a single document as nested objects or arrays.
Advantages
- Read Performance: You can retrieve all related data in a single I/O operation (no joins).
- Atomicity: Updates to the parent and its embedded children are atomic.
- Simplicity: No need for $lookup or complex application-side logic.
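The atomicity advantage can be illustrated with one write that changes both a parent field and an embedded child. This is a minimal sketch, assuming `users` is a PyMongo collection (or any object exposing `update_one`) and that documents follow the `username`/`addresses` shape used in this section. The function name `add_address` and the `updated_at` field are hypothetical.

```python
from datetime import datetime, timezone


def add_address(users, username, address):
    """Append an embedded address and bump a timestamp in one write.

    Both changes target the same document, so MongoDB applies them
    atomically without needing a multi-document transaction.
    """
    return users.update_one(
        {"username": username},
        {
            # Embedded child: push a new element onto the addresses array.
            "$push": {"addresses": address},
            # Parent field (hypothetical): record when the user last changed.
            "$set": {"updated_at": datetime.now(timezone.utc)},
        },
    )
```

With a real connection, a call might look like `add_address(db.users, "jdoe", {"type": "work", "city": "SF", "zip": "94105"})`.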
Disadvantages
- 16 MB Limit: If the embedded array grows without bound (e.g., millions of comments), the document will eventually exceed the 16 MB BSON document size limit.
- Data Duplication: If you embed "Author Info" in every "Post" document, updating the author's name requires updating every post.
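The duplication cost shows up as a fan-out write. The sketch below assumes `posts` is a PyMongo collection whose documents embed a copy of the author as `{"author": {"_id": ..., "name": ...}}`. That document shape and the function name `rename_author` are illustrative assumptions, not something this section defines.

```python
def rename_author(posts, author_id, new_name):
    """Rewrite the embedded author name in every post by that author.

    Returns the number of post documents that were modified.
    """
    result = posts.update_many(
        {"author._id": author_id},            # every post embedding this author
        {"$set": {"author.name": new_name}},  # overwrite the duplicated name
    )
    return result.modified_count
```

Each matching post is updated individually, so the more posts an author has, the more documents a single rename has to touch.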
Use Case: One-to-Few
A User and their Addresses. A user typically has only 1-3 addresses, and you almost always want to see the address when looking at the user.
```javascript
// User Document (Embedded)
{
  "_id": ObjectId("..."),
  "username": "jdoe",
  "addresses": [
    { "type": "home", "city": "NYC", "zip": "10001" },
    { "type": "work", "city": "SF", "zip": "94105" }
  ]
}
```
2. Referencing (Normalization)
Referencing links documents together using _id values, similar to Foreign Keys in SQL.
Advantages
- Scalability: Avoids the 16MB document limit.
- Consistency: Update a piece of data once, and it reflects everywhere it's referenced.
- Smaller Documents: Leads to better cache utilization for high-frequency updates.
Disadvantages
- Performance: Requires multiple queries or a $lookup (JOIN) stage, which is computationally more expensive.
- Lack of Atomicity: Updating two referenced documents atomically requires a multi-document transaction (more overhead).
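For the $lookup option mentioned above, the join can be written as an aggregation pipeline. This is a sketch, assuming a `posts` collection and a `comments` collection whose documents carry a `post_id` field, as in the post and comments use case. The helper name `post_with_comments_pipeline` is hypothetical.

```python
def post_with_comments_pipeline(post_id):
    """Build an aggregation pipeline that joins comments onto one post."""
    return [
        # Narrow to the single post first so the join runs only once.
        {"$match": {"_id": post_id}},
        {
            "$lookup": {
                "from": "comments",         # collection to join against
                "localField": "_id",        # field on the post
                "foreignField": "post_id",  # field on each comment
                "as": "comments",           # name of the output array
            }
        },
    ]
```

The pipeline would be run with something like `db.posts.aggregate(post_with_comments_pipeline(post_id))`. Each result is the post document with a `comments` array attached, so it is still subject to the document size limit.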
Use Case: One-to-Many (Large)
A Post and its Comments. A popular post might have 100,000 comments. Embedding them all could exceed the 16 MB limit and would slow down every read of the post.
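A quick back-of-the-envelope calculation shows why this matters. The 200-byte average comment size below is an assumed figure for illustration only, and the estimate ignores BSON's per-element overhead, so the real ceiling would be somewhat lower.

```python
BSON_LIMIT_BYTES = 16 * 1024 * 1024  # MongoDB's maximum BSON document size


def max_embedded_items(avg_item_bytes, base_doc_bytes=0):
    """Rough upper bound on how many items fit before the 16 MB limit."""
    return (BSON_LIMIT_BYTES - base_doc_bytes) // avg_item_bytes


# Assumed average of 200 bytes per comment (illustrative only).
print(max_embedded_items(200))  # 83886
```

Under that assumption, a post document would hit the limit well before 100,000 embedded comments.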
```python
# PyMongo: fetching a post and its referenced comments
from pymongo import MongoClient
from bson import ObjectId

db = MongoClient().blog_db

# 1. Fetch the post ("650c..." is an elided placeholder, not a complete ObjectId)
post_id = ObjectId("650c...")
post = db.posts.find_one({"_id": post_id})

# 2. Fetch the comments separately (parent referencing: each comment stores post_id)
comments = list(db.comments.find({"post_id": post_id}))

print(f"Post: {post['title']}, Comments Count: {len(comments)}")
```
3. Choosing the Right Strategy
The best strategy depends on the cardinality of the relationship and the access patterns.
| Relationship Type | Cardinality | Strategy |
|---|---|---|
| One-to-One | 1:1 | Embed (unless security/audit requires separation) |
| One-to-Few | 1:N (small) | Embed |
| One-to-Many | 1:N (large) | Reference (Parent Reference) |
| Many-to-Many | N:M | Reference (Array of IDs on both or one side) |
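The many-to-many row can be sketched with an array of IDs on one side. The example assumes two PyMongo collections, `students` and `courses`, where each student document holds a `course_ids` array. These collection and field names are hypothetical and chosen only for illustration.

```python
def courses_for_student(students, courses, student_id):
    """Resolve a student's course_ids array into full course documents."""
    # Fetch only the array of referenced IDs from the student document.
    student = students.find_one({"_id": student_id}, {"course_ids": 1})
    if student is None:
        return []
    course_ids = student.get("course_ids", [])
    # One extra query resolves every reference at once via $in.
    return list(courses.find({"_id": {"$in": course_ids}}))
```

Storing the array on only one side keeps writes simple. Storing it on both sides makes lookups cheap in either direction, at the cost of keeping two arrays in sync.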
Summary
Embedding provides high performance for read-heavy operations where data is small. Referencing provides flexibility and scale for large, complex datasets. Always monitor your document sizes as your application grows!
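One way to monitor document sizes is an aggregation that ranks documents by their BSON size. This sketch assumes MongoDB 4.4 or later, where the `$bsonSize` operator is available. The helper name `largest_documents_pipeline` and the `size_bytes` field are hypothetical.

```python
def largest_documents_pipeline(limit=5):
    """Build a pipeline listing the largest documents in a collection."""
    return [
        # Compute each document's total BSON size in bytes.
        {"$project": {"size_bytes": {"$bsonSize": "$$ROOT"}}},
        # Largest documents first.
        {"$sort": {"size_bytes": -1}},
        {"$limit": limit},
    ]
```

It could be run periodically with something like `db.posts.aggregate(largest_documents_pipeline())` to spot documents drifting toward the 16 MB limit.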