Postgres bloat and autovacuum tuning for faster performance
Postgres bloat and autovacuum tuning: learn how to spot table and index bloat, adjust autovacuum safely, and plan maintenance to restore speed.

What Postgres bloat is and why autovacuum matters
Bloat is the extra, wasted space inside Postgres tables and indexes that builds up over time. It usually follows a steady diet of UPDATEs and DELETEs. Postgres doesn’t overwrite rows in place. It writes a new row version and leaves the old one behind until cleanup happens.
That wasted space hurts performance in a few predictable ways. Queries have to read more pages from disk and memory to find the same amount of real data. Indexes get bigger, so lookups and joins take more work. Writes can slow down too because Postgres has to maintain larger indexes and touch more pages. Backups and restores also take longer because there’s simply more to copy.
Autovacuum is the background worker that keeps this under control. It mainly does two jobs:
- Cleans up dead row versions so their space can be reused (VACUUM)
- Updates planner statistics so Postgres can choose good query plans (ANALYZE)
One important detail: autovacuum usually won’t shrink the physical file on disk. It makes space reusable inside the table, but your disk usage might not drop. To return space to the OS, you generally need heavier maintenance (like VACUUM FULL or a table rebuild), and those options have real tradeoffs.
A simple example: an AI-generated admin panel that “updates” user records on every page load can create a steady stream of dead rows. Autovacuum can keep up for a while. Then a traffic spike hits, it falls behind, and suddenly common queries feel sluggish even though the dataset size hasn’t changed much.
Autovacuum tuning isn’t a one-time switch. The right settings depend on your write rate, table size, and how sensitive you are to background work. The safest approach is iterative: measure, make small changes, watch the impact, then adjust.
Common signs your database is suffering from bloat
Bloat often shows up as “nothing changed, but everything got slower.” The business data might look stable, yet Postgres has to read and scan more pages to answer the same queries. That extra work turns into more IO, more CPU, and more waiting.
Signals that commonly point to bloat include:
- Queries that were fine last week now drag, especially ones that used to do quick index scans.
- Disk usage keeps climbing even though real business data isn’t growing at the same pace.
- Higher read latency and more cache misses (more reads from disk instead of memory).
- Autovacuum runs often, but you don’t see the usual recovery afterward.
- Backups and restores take longer, and replication lag appears during busy periods.
One subtle clue is when the query plan is still “right” (it still uses the expected index), but timing is much worse. That can happen when the index is bloated and larger than it needs to be, so Postgres needs more page reads to walk it.
Another common pattern: table scans get slower even though row counts look similar. Updates and deletes leave dead tuples behind, and when cleanup falls behind, the table grows in pages. Postgres then has more pages to search through and more visibility checks to do.
A practical example: a small SaaS app adds an “update last_seen” write on every request. Reads start timing out during peak hours, disk use jumps, and autovacuum looks busy all day. The root issue isn’t one bad query. It’s the database doing extra work because dead rows and oversized indexes have piled up.
If you see two or more of these signs at the same time, measure bloat before changing settings. Autovacuum tweaks help most when you can confirm which tables and indexes are actually growing.
Quick checklist: confirm bloat before you tune
Before you change anything, confirm you’re actually dealing with bloat and not a slow query plan, bad indexes, or a transaction that’s blocking cleanup. Tuning works best when you can point to a few specific tables and indexes.
Start by getting quick, plain numbers you can compare week to week. Focus on:
- The largest tables and whether they’re actually seeing heavy writes (cold tables rarely explain bloat).
- Autovacuum history per table: when it last ran, how often it runs, and whether it’s taking longer over time.
- High dead tuple counts, especially on frequently updated tables.
- Large indexes that sit on high-churn columns (indexes can bloat even when the table looks fine).
- Long-running transactions, since they can prevent vacuum from removing dead tuples.
If you want a fast first pass, these queries usually give enough signal to decide what to investigate next:
-- Biggest tables
SELECT relname, pg_total_relation_size(relid) AS bytes
FROM pg_catalog.pg_statio_user_tables
ORDER BY bytes DESC
LIMIT 5;
-- Dead tuples and last vacuum/autovacuum
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
-- Biggest indexes
SELECT indexrelname, pg_relation_size(indexrelid) AS bytes
FROM pg_stat_user_indexes
ORDER BY bytes DESC
LIMIT 10;
-- Long-running transactions (can block cleanup)
SELECT pid, now() - xact_start AS xact_age, state
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC
LIMIT 10;
A practical way to read the results: if one events-like table is huge, has a rising dead tuple count, and autovacuum runs but never catches up, that table is a tuning candidate. If dead tuples are high across many tables and you also see a transaction running for hours, fix the long transaction first. Autovacuum can’t outwork a transaction that never ends.
How to measure table and index bloat (practical methods)
Start with what Postgres already tracks. Bloat often correlates with lots of dead rows (dead tuples) and heavy churn (updates and deletes) on the same tables.
1) Use built-in stats to find dead tuples and churn
These views are a fast, low-risk first pass. Look for tables where n_dead_tup is high, and where n_tup_upd and n_tup_del keep climbing.
-- Tables with lots of dead tuples
SELECT
schemaname,
relname,
n_live_tup,
n_dead_tup,
last_autovacuum,
last_vacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
-- Churn (writes) that tends to create bloat
SELECT
schemaname,
relname,
n_tup_ins,
n_tup_upd,
n_tup_del,
n_tup_hot_upd
FROM pg_stat_user_tables
ORDER BY (n_tup_upd + n_tup_del) DESC
LIMIT 20;
A simple rule of thumb: if dead tuples are a large share of live tuples and autovacuum doesn’t run often (or takes a long time), you likely have table bloat and should investigate autovacuum settings and blockers.
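To put a number on that rule of thumb, you can compute the dead-tuple share directly from the same stats view (the 10,000-row floor here is an arbitrary filter to skip tiny tables):

```sql
-- Dead tuples as a share of each table's total rows
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000          -- arbitrary floor; skip tiny tables
ORDER BY dead_pct DESC NULLS LAST
LIMIT 20;
```

Tables with a persistently high dead_pct despite recent autovacuum runs are the ones worth investigating first.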
2) Estimate bloat without heavy math
Keep it practical. Compare how big the relation is on disk versus how many live rows it has. If a table is huge but live rows aren’t, the gap is often bloat or wide rows. You can confirm later with deeper tools.
Also check whether the table grew quickly and never shrank. Postgres doesn’t automatically return space to the OS after normal deletes and updates, which is why bloat and autovacuum tuning are usually discussed together.
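A minimal size-versus-rows comparison can be pulled from the same stats view. There is no single "right" ratio, so read this relative to similar tables in your own database:

```sql
-- Rough size-vs-rows comparison; a large gap relative to
-- similar tables hints at bloat (or just wide rows)
SELECT relname,
       pg_size_pretty(pg_relation_size(relid)) AS heap_size,
       n_live_tup
FROM pg_stat_user_tables
ORDER BY pg_relation_size(relid) DESC
LIMIT 10;
```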
3) Check index size vs usefulness
Indexes can bloat too, and unused indexes are a common hidden tax. Compare size to how often Postgres scans the index.
SELECT
s.schemaname,
s.relname AS table_name,
s.indexrelname AS index_name,
pg_relation_size(s.indexrelid) AS index_bytes,
s.idx_scan
FROM pg_stat_user_indexes s
ORDER BY pg_relation_size(s.indexrelid) DESC;
If an index is large and idx_scan stays near zero under real traffic, it’s a candidate for removal or redesign. Confirm it isn’t required for constraints, uniqueness, or a rare but critical query before you touch it.
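If you do decide to drop one, a concurrent drop avoids blocking writes. The index name below is hypothetical; substitute your own candidate after confirming it backs no constraint:

```sql
-- Hypothetical index name. CONCURRENTLY avoids blocking writes,
-- but cannot run inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS public.events_old_status_idx;
```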
4) When to use deeper inspection tools
If you need real bloat numbers, use pgstattuple (or pgstattuple_approx for large tables). It can report dead space and tuple density, but it can be heavier than stats views, so run it during low traffic.
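A minimal pgstattuple pass looks like this; the table name is a placeholder, and the full scan variant reads the entire relation, so prefer the approximate version on big tables:

```sql
-- Requires the extension: CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Exact numbers (scans the whole table):
SELECT * FROM pgstattuple('public.events');

-- Cheaper, sampled estimate for large tables:
SELECT * FROM pgstattuple_approx('public.events');
```

The dead_tuple_percent and free_percent columns give you the real bloat figure the stats views can only hint at.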
5) Write down a baseline
Before changing anything, record a snapshot: top tables by n_dead_tup, top indexes by size, and a few key query timings. After tuning and maintenance, you’ll know what actually improved.
Why autovacuum falls behind in real apps
Autovacuum is Postgres’s background cleanup crew. It removes dead rows left behind by updates and deletes, and keeps statistics fresh so plans stay sane. It decides when to run using thresholds that are partly fixed and partly based on table size.
For each table, Postgres waits until the number of changed rows crosses a trigger like:
- a base number (autovacuum_vacuum_threshold)
- plus a percentage of the table (autovacuum_vacuum_scale_factor)
That works fine for many tables. But high-update tables (sessions, events, job queues) can create dead rows faster than those triggers fire, especially if the scale factor is too high. The table grows, the trigger grows with it, and vacuum keeps arriving late.
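A quick worked check makes the problem concrete. Under the shipped defaults (threshold 50, scale factor 0.2), a 10-million-row table doesn't trigger autovacuum until roughly 50 + 0.2 × 10,000,000 ≈ 2,000,050 dead rows have accumulated. You can estimate each table's trigger point from the catalog (reltuples is itself an estimate, so treat the output as approximate):

```sql
-- Estimate each table's autovacuum trigger under the default settings
-- (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2)
SELECT c.relname,
       c.reltuples::bigint AS est_rows,
       (50 + 0.2 * c.reltuples)::bigint AS est_vacuum_trigger
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relkind = 'r'
ORDER BY est_vacuum_trigger DESC
LIMIT 10;
```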
Real-world blockers that stop cleanup
The most common reason vacuum can’t remove dead rows is a long transaction. If any transaction stays open, Postgres must keep old row versions around, even if everyone else is done with them. This can happen with idle-in-transaction sessions (a tab left open), a stuck worker, or a batch job that wraps too much work in one transaction.
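Idle-in-transaction sessions are easy to spot in pg_stat_activity; the timeout shown afterward is an optional guardrail, and the 10-minute value is an assumption to adjust for your app:

```sql
-- Sessions sitting idle inside an open transaction
SELECT pid,
       now() - xact_start AS xact_age,
       left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_age DESC;

-- Optional guardrail (10 minutes is an arbitrary example value):
-- ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
```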
Even when vacuum is allowed to run, it can be too gentle. Autovacuum has cost limits (how much IO it’s allowed to use before it sleeps). On busy systems, those limits are often conservative, so vacuum makes slow progress while your app keeps generating more dead rows.
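You can check the current throttling settings before deciding whether vacuum is actually being held back:

```sql
-- Current autovacuum throttling and capacity settings.
-- For autovacuum_vacuum_cost_limit, -1 means "inherit vacuum_cost_limit".
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_cost_limit',
               'autovacuum_vacuum_cost_delay',
               'autovacuum_max_workers');
```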
Hidden causes that make it worse
A few patterns regularly push autovacuum behind: large batch updates/deletes that touch most rows at once, hot tables with frequent UPDATEs (even tiny ones), idle-in-transaction connections holding snapshots, long-running reports that keep a transaction open, and bursty traffic where vacuum can’t catch up between spikes.
When you’re troubleshooting bloat, start by finding two things: the table that churns the most and the longest transaction that never seems to end. Fix those first, then tune settings.
Step-by-step: tune autovacuum safely (start small)
The safest way to tune autovacuum is to change one thing, on one table, and watch it for a full day. Most autovacuum pain comes from a few high-churn tables (sessions, events, logs, queue tables), not the whole database.
Pick a table that gets lots of updates or deletes and is showing slowdowns. A good candidate usually has a fast-growing dead tuple count and frequent writes during normal traffic.
A small, controlled sequence that works well:
- Pick one high-churn table and record a baseline: row count, dead tuples, and how often autovacuum runs.
- Lower the per-table vacuum and analyze scale factors so maintenance runs sooner (before the table gets huge). Keep the change modest.
- If vacuum is too slow, adjust capacity carefully: increase autovacuum_max_workers a bit, or raise autovacuum_vacuum_cost_limit so vacuums make progress without crushing the app.
- Confirm ANALYZE is keeping plans accurate by watching for fewer surprise plan changes and steadier query times.
- Monitor for a full day of real traffic and compare: dead tuples, vacuum frequency, query latency, and CPU/IO.
A practical per-table tweak often looks like this:
ALTER TABLE public.events SET (
autovacuum_vacuum_scale_factor = 0.02,
autovacuum_analyze_scale_factor = 0.01,
autovacuum_vacuum_threshold = 1000,
autovacuum_analyze_threshold = 1000
);
Then validate with stats you already have:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_all_tables
WHERE schemaname = 'public'
ORDER BY n_dead_tup DESC
LIMIT 10;
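One advantage of per-table settings is that they're easy to undo. If the change doesn't help after a day of real traffic, revert to the global defaults:

```sql
-- Roll the table back to the global autovacuum defaults
ALTER TABLE public.events RESET (
    autovacuum_vacuum_scale_factor,
    autovacuum_analyze_scale_factor,
    autovacuum_vacuum_threshold,
    autovacuum_analyze_threshold
);
```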
Example: an app writes 5 million event rows a day and deletes old ones hourly. With default settings, vacuum starts too late and dead rows pile up. By lowering the scale factors for just events, autovacuum runs more often, query time stays steadier, and you avoid risky, heavy maintenance during peak hours.
If you inherited an AI-generated app where write patterns are messy (retry loops, duplicate inserts, update storms), it’s worth fixing the root cause. Otherwise you end up tuning autovacuum to compensate for broken behavior.
Step-by-step: reclaim space and speed without breaking uptime
Start with the least disruptive tools. Most of the time you can get noticeable wins without blocking writes or taking the app down.
1) Run a normal VACUUM (usually safe, often enough)
A standard VACUUM removes dead row versions so Postgres can reuse that space inside the table. It also updates visibility info so index scans and sequential scans can get faster. It doesn’t shrink the file on disk, but it often stops the bleeding and improves query speed.
If you can, do it in smaller batches (one large table at a time) during low traffic. Example:
VACUUM (ANALYZE) public.big_table;
2) Rebuild bloated indexes (when queries stay slow)
If table VACUUM helped only a little, indexes may be the main issue. Bloated indexes make reads slower and can increase random IO. For production systems, prefer a concurrent rebuild so reads and writes keep running:
REINDEX INDEX CONCURRENTLY public.big_table_some_idx;
Expect it to take time and use extra temporary disk space while the new index is built.
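One caveat worth checking afterward: if a concurrent rebuild is interrupted, it can leave an invalid index behind that takes space without serving queries. A quick catalog check finds them:

```sql
-- List invalid indexes left behind by interrupted concurrent builds
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
```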
3) Use VACUUM FULL only when you accept downtime
VACUUM FULL rewrites the whole table and can return disk space to the OS. It’s disruptive because it takes an exclusive lock on the table. That means inserts, updates, and often reads will block.
Use it only when you have a clear maintenance window and the disk space is truly urgent.
4) Safer options for very large tables
If a table is huge and you need space back without extended blocking, plan a maintenance window and use one of these approaches: rebuild the worst indexes first with REINDEX CONCURRENTLY, consider partitioning or archiving old data and dropping old partitions, or use a table rewrite strategy (copy into a new table and swap) when you can control cutover.
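The copy-and-swap strategy can be sketched roughly as below. This is a simplified sketch, not a complete recipe: it assumes writes to the table are paused for the duration of the copy, and foreign keys, views, and replication referencing the old table all need separate handling.

```sql
-- Sketch only: assumes writes to public.events are paused during the copy,
-- and that no foreign keys or views reference the table.
BEGIN;
CREATE TABLE public.events_new (LIKE public.events INCLUDING ALL);
INSERT INTO public.events_new SELECT * FROM public.events;  -- compact copy, no dead rows
ALTER TABLE public.events RENAME TO events_old;
ALTER TABLE public.events_new RENAME TO events;
COMMIT;
-- After verifying the swap: DROP TABLE public.events_old;
```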
5) Verify you actually improved things
Don’t trust vibes. Compare before and after with simple checks:
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
Also re-check sizes and query timing. If bloat returns quickly, the root cause is often app behavior (huge updates, missing indexes, long transactions).
A maintenance schedule that keeps bloat from coming back
A good maintenance plan is boring on purpose. It makes bloat a routine chore instead of a surprise outage. The goal isn’t a one-time win, but a steady rhythm that matches how your app writes data.
Start by choosing maintenance windows using real traffic. Look at your busiest 30 to 60 minutes each day and avoid them. If you don’t have graphs, use simple clues: checkout times, daily batch jobs, marketing sends, and support tickets that spike.
A simple weekly rhythm that works for many teams:
- Review the top 5 largest tables and indexes. Check dead tuples and recent vacuum times.
- Run manual VACUUM (ANALYZE) on the few hottest tables if autovacuum is regularly behind.
- Check for long-running transactions that block vacuum.
Once a month, add one heavier pass when traffic is lowest. Autovacuum should handle normal churn. Manual work is for exceptions: bursty tables, large delete jobs, and schema changes.
For the monthly checklist, keep it short and measurable:
- Compare table and index growth to last month.
- Check whether vacuum duration is trending up.
- Decide if any table needs custom autovacuum settings (thresholds, scale factors, cost limits).
- Schedule a targeted REINDEX (or a controlled rebuild) only when you can tolerate the extra load.
Track a few metrics over time: dead tuples (n_dead_tup), table size vs total including toast, index sizes, and vacuum/analyze timestamps. When you ship new features, plan ahead. Any new high-write table should start with sensible autovacuum settings and a place in your weekly review.
Common mistakes that make bloat and performance worse
Most bloat problems get worse because the “fix” is applied too broadly, too fast. You’ll get better results if you change one thing, measure it, then change the next.
1) Over-tuning autovacuum and creating IO storms
Turning autovacuum settings way up across the whole database can spike disk reads and writes. That can slow down user queries, increase latency, and still not solve the worst bloat if the real problem is only a few hot tables.
A safer approach is to tune per-table for the handful of tables that churn the most (sessions, event logs, queues), then watch the impact.
2) Using VACUUM FULL on a busy table without a plan
VACUUM FULL can reclaim space, but it rewrites the table and typically blocks normal access while it runs. On a busy production table, that can look like an outage.
If you need big space recovery, plan a maintenance window, consider alternative approaches, and confirm you have enough free disk for the rewrite.
3) Ignoring long transactions and blaming autovacuum
Autovacuum can’t remove dead rows if a long-running transaction is still holding old snapshots open. You’ll see dead tuples pile up even if autovacuum runs constantly.
Common causes are forgotten admin sessions, background jobs that never commit, or app code that opens a transaction and then does slow work.
4) Only changing global settings instead of fixing hotspots
Global autovacuum settings are a blunt tool. Bloat usually concentrates in specific tables and indexes, and those often need their own thresholds and cost settings.
If you tune globally, you can waste resources vacuuming cold tables while the real problem table keeps growing.
5) Measuring once, changing five things, and losing your baseline
If you adjust multiple knobs at once, you can’t tell what helped. Keep a simple baseline (table sizes, dead tuples, vacuum/analyze timing, query latency) and change one variable at a time.
Example: from slow queries to stable performance in a week
A startup launches an activity feed: likes, comments, and “seen” markers. Overnight, writes jump from a few dozen per second to a few thousand. The app still works, but pages that used to load in 200 ms now take 2 to 5 seconds.
Within two days, a few things stand out: indexes are growing fast, disk usage keeps climbing even after old rows are “deleted,” and on-call starts getting low-disk alerts. Query latency also becomes spiky. It’s fine for 10 minutes, then suddenly slow.
What we found
The hot spot is one table that changes constantly, like activity_events. It has frequent updates (status flags) and deletes (retention cleanup). Autovacuum is running, but it’s always late because the table hits its threshold quickly and then keeps changing while vacuum is trying to catch up. Dead tuples pile up, indexes bloat, and simple queries end up reading far more pages than they should.
What we changed (without risky “big bang” maintenance)
We fixed it with small, targeted steps:
- Set per-table autovacuum thresholds so this one high-churn table vacuums earlier and more often.
- Increased autovacuum work for that table (cost limit or workers) so it finishes before the next spike.
- Ran a targeted reindex on the worst-bloated indexes during a low-traffic window.
- Scheduled a weekly maintenance window for the few tables that churn the most.
- Adjusted retention jobs to delete in smaller batches so vacuum can keep pace.
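The batched retention delete in the last step can be sketched like this. The table and column names are hypothetical, and the batch size of 5,000 is an example to tune; run it in a loop (for example from a cron job) until it deletes zero rows:

```sql
-- Delete old rows in bounded batches so autovacuum can keep pace.
-- Hypothetical schema: events(id, created_at); 90-day retention assumed.
DELETE FROM public.events
WHERE id IN (
    SELECT id
    FROM public.events
    WHERE created_at < now() - interval '90 days'
    ORDER BY id
    LIMIT 5000
);
```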
By the end of the week, disk growth flattened and stayed predictable. Query latency stopped swinging wildly, and the activity feed became steady again.
Next steps: keep it stable and fix the root cause
Once bloat is under control, the goal changes: keep it from coming back. The safest way is to make one change at a time and watch what happens. That keeps you from trading one problem for another.
Start with the worst offender (one table or one queue table that gets lots of updates) and make a single, clear change. For example: lower thresholds and scale factors for that table, raise the cost limit so vacuum can finish, reduce churn in the app by avoiding needless rewrites, or add alerting for dead tuples, vacuum lag, and long transactions.
After each change, track a few numbers for a week: table growth, dead tuples, autovacuum frequency, and p95 query time for the endpoints that hit that table. If you can’t show improvement, roll it back and try the next idea.
Write a small runbook your team can repeat:
- What metric triggers action (example: dead tuples over X% or vacuum lag over Y hours)
- What you change first (table-level settings before global settings)
- How you verify success (two metrics and one user-facing check)
- What to avoid (manual VACUUM during peak, long transactions)
- Who owns the decision and when to escalate
If your app was AI-generated, look for code patterns that create constant churn: chatty writes (updating rows every request), missing indexes on foreign keys, and bad session handling that leaves transactions open. A common example is a prototype that writes last_seen on every page load.
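One low-risk fix for that last_seen pattern is to make the write conditional, so a row is only rewritten when the value has meaningfully changed. The schema and the 5-minute window here are assumptions for illustration:

```sql
-- Hypothetical schema: users(id, last_seen). Only rewrite the row
-- if it hasn't been touched in the last 5 minutes (arbitrary window),
-- turning most page loads into zero-write requests.
UPDATE users
SET last_seen = now()
WHERE id = $1
  AND (last_seen IS NULL OR last_seen < now() - interval '5 minutes');
```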
When you keep tuning but performance still drifts, it’s usually a product behavior problem, not a vacuum problem. If you’re dealing with an AI-generated codebase that’s hammering Postgres in odd ways (broken auth flows, retry loops, queue jobs that never stop), FixMyMess at fixmymess.ai can help diagnose and repair those code paths so the database stops fighting a losing battle.