Understanding Database Internals with PostgreSQL: From Seq Scan to Index Scan
When working with databases in real-world systems, one of the most common performance issues is slow queries. Many developers rely on trial-and-error fixes, but understanding how the database actually executes queries is what truly differentiates an average engineer from a strong one.
In this article, we’ll walk through key database internals using PostgreSQL — covering execution plans, index usage, and core architectural concepts.
Getting Started: What is an Execution Plan?
When you run a query, the database doesn’t execute it directly. Instead, it goes through:
SQL → Parser → Optimizer → Execution Engine
The optimizer decides how to execute your query efficiently.
To see this decision, we use:
EXPLAIN SELECT * FROM test."Employee";
Sequential Scan (Seq Scan)
Example output:
Seq Scan on "Employee" (cost=0.00..1736.00 rows=100000 width=27)
What does this mean?
Seq Scan → Database scans entire table row by row
cost → Estimated cost in the planner's arbitrary units, shown as startup cost .. total cost (not actual time)
rows → Estimated number of rows returned
width → Estimated average row size in bytes
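The total cost above can even be reproduced by hand. As a rough sketch (assuming the table occupies about 736 pages, which is consistent with the estimates shown), the planner prices a sequential scan as pages read plus per-row CPU work:

```sql
-- How many pages and rows does the planner think the table has?
SELECT relpages, reltuples
FROM pg_class
WHERE relname = 'Employee';

-- With the default settings seq_page_cost = 1.0 and cpu_tuple_cost = 0.01,
-- a seq scan over ~736 pages and 100000 rows is priced at roughly:
--   736 * 1.0 + 100000 * 0.01 = 1736.00
-- which lines up with cost=0.00..1736.00 in the plan above.
```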
When does this happen?
When querying the entire table
When no useful index exists
When the optimizer thinks scanning everything is cheaper
👉 Key Insight:
Sequential scan is not inherently bad: when a query needs most or all of a table, reading it sequentially is the cheapest plan.
Index Scan (Faster Lookup)
Example:
EXPLAIN SELECT * FROM test."Employee" WHERE id = 10;
Output:
Index Scan using "Employee_pkey" on "Employee"
(cost=0.29..8.31 rows=1 width=27)
What’s happening?
Database uses a B-Tree index
It directly jumps to the matching row
Avoids scanning entire table
👉 Key Insight:
Index scans are useful when fetching a small subset of data
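The same effect applies to columns other than the primary key. As a sketch, assuming the table has a name column (hypothetical here):

```sql
-- Hypothetical: add a secondary B-Tree index on a non-key column
CREATE INDEX employee_name_idx ON test."Employee"(name);

EXPLAIN SELECT * FROM test."Employee" WHERE name = 'Alice';
-- With an index and a selective predicate, you should typically see
-- an Index Scan (or Bitmap Index Scan) instead of a Seq Scan.
```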
Why Index Is Not Always Used
Even if an index exists:
SELECT * FROM test."Employee";
👉 Still results in Seq Scan
Why?
Because:
Query needs 100% of rows
Sequential read is faster than repeated index lookups
👉 Golden Rule:
Seq Scan → large data access
Index Scan → selective queries
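You can watch the optimizer switch between these two plans as selectivity changes. A sketch, assuming id values run from 1 to 100000:

```sql
-- Highly selective: typically an Index Scan
EXPLAIN SELECT * FROM test."Employee" WHERE id < 10;

-- Low selectivity (most of the table): typically a Seq Scan,
-- because reading pages sequentially beats ~90000 index lookups
EXPLAIN SELECT * FROM test."Employee" WHERE id < 90000;
```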
Importance of ANALYZE
Right after a table is created or loaded, PostgreSQL may show inaccurate estimates:
rows=850 (incorrect)
After running:
ANALYZE test."Employee";
It becomes:
rows=1 (accurate)
Why?
Because ANALYZE:
Collects table statistics
Helps optimizer make better decisions
👉 Without it:
You may get inefficient query plans
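To compare the optimizer's estimates against reality, use EXPLAIN ANALYZE, which actually executes the query and reports both:

```sql
EXPLAIN ANALYZE SELECT * FROM test."Employee" WHERE id = 10;
-- The output shows estimated and actual figures side by side, e.g.:
--   Index Scan ... (cost=0.29..8.31 rows=1 ...) (actual time=... rows=1 loops=1)
-- A large gap between estimated rows and actual rows is a strong hint
-- that ANALYZE needs to run.
```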
VACUUM and MVCC
PostgreSQL uses MVCC (Multi-Version Concurrency Control).
What happens during UPDATE/DELETE?
Old rows are not removed immediately
New versions are created
Old versions become “dead tuples”
Why VACUUM is needed?
VACUUM test."Employee";
Cleans dead rows
Prevents table bloat
Updates visibility map
👉 Important for:
Performance
Index-only scans
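Dead tuples accumulating can be observed directly through the statistics views. A sketch using the standard pg_stat_user_tables view (the salary column in the UPDATE is hypothetical):

```sql
-- Create some dead tuples, then inspect the counters
UPDATE test."Employee" SET salary = salary + 1;  -- hypothetical column

SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'Employee';
-- n_dead_tup grows after UPDATE/DELETE and drops back after VACUUM.
```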
Autovacuum: Do You Need to Run It Manually?
PostgreSQL has autovacuum, which automatically runs:
VACUUM
ANALYZE
But manual execution is needed when:
Bulk data insert/update
Query performance issues
Debugging execution plans
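To check whether autovacuum is enabled and keeping up, these standard settings and views are worth inspecting:

```sql
-- Is autovacuum enabled at all?
SHOW autovacuum;

-- When did vacuum/analyze last run, manually or automatically?
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'Employee';
```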
Index Only Scan (Advanced)
Sometimes, PostgreSQL can avoid reading the table entirely:
EXPLAIN SELECT id FROM test."Employee";
👉 Expected:
Index Only Scan
But you may still see Seq Scan
Because:
The visibility map is not up to date, so an index-only scan would still have to visit the heap
VACUUM hasn't run yet to set the all-visible bits
👉 Fix:
VACUUM test."Employee";
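After vacuuming, re-running the same query should show the cheaper plan (the exact choice still depends on table size and statistics):

```sql
EXPLAIN SELECT id FROM test."Employee";
-- Typically: Index Only Scan using "Employee_pkey" on "Employee",
-- with no heap access needed for pages marked all-visible.
```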
Primary Key and Index
When you define:
id PRIMARY KEY
PostgreSQL automatically creates:
CREATE UNIQUE INDEX "Employee_pkey" ON "Employee"(id);
👉 Meaning:
Primary key = constraint + index
Used for fast lookups
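You can confirm the implicit index through the pg_indexes catalog view:

```sql
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'test' AND tablename = 'Employee';
-- Shows "Employee_pkey" with a CREATE UNIQUE INDEX definition,
-- even though you never created that index explicitly.
```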
Core Database Architecture Concepts
To truly understand databases, you should know:
1. Storage (Heap & Pages)
Data stored in pages (8 KB blocks by default)
Tables = collection of pages
2. MVCC
Enables high concurrency
Readers don't block writers (and vice versa)
Requires VACUUM to reclaim dead row versions
3. Query Optimizer
Cost-based decision making
Chooses best execution plan
4. Indexing
B-Tree (default)
Improves lookup performance
Comes with write overhead
5. Joins
Nested Loop
Hash Join
Merge Join
6. WAL (Write-Ahead Logging)
Ensures durability
Used for crash recovery
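Several of these concepts are directly observable. For example, every row's physical location (page number, tuple offset) is exposed via the hidden ctid column, and the current WAL position via pg_current_wal_lsn():

```sql
-- (page, offset) of a few rows: rows sharing the same page number
-- live in the same 8 KB block
SELECT ctid, id FROM test."Employee" LIMIT 5;

-- Current write-ahead log insert position
SELECT pg_current_wal_lsn();
```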
Key Takeaways
EXPLAIN is your best debugging tool
Indexes are not always faster
Statistics drive optimizer decisions
VACUUM and ANALYZE are critical for performance
Understanding internals helps in real production scenarios
Final Thoughts
Learning database internals is not about memorizing commands — it’s about building intuition:
“Why is the database choosing this plan?”
Once you start thinking like the optimizer, debugging performance issues becomes much easier.
If you’re exploring database internals, start experimenting:
Insert large datasets
Compare execution plans
Add/remove indexes
Observe behavior
That’s where real learning happens.