How to Diagnose and Fix MySQL Replication Lag

By banditz

Wednesday, January 7, 2026 • 4 min read

MySQL SHOW REPLICA STATUS showing high Seconds_Behind_Master

It starts with a support ticket. A user updated their profile but the change isn’t showing. They refresh — old data is back. Your application reads from a MySQL replica that’s lagging behind the primary, returning stale data.

You SSH into the replica:

SHOW REPLICA STATUS\G

You find the line that matters:

Seconds_Behind_Master: 347

Nearly 6 minutes behind. Every read from this server returns data that’s 6 minutes old. For a profile update that’s annoying. For e-commerce inventory, a customer is buying a product that sold out 5 minutes ago.

How Replication Works

Three components:

Binlog on the primary. Every write (INSERT, UPDATE, DELETE, DDL) is recorded in the binary log.

IO thread on the replica. Connects to the primary, reads binlog events, writes them to the local relay log.

SQL thread on the replica. Reads relay log events and executes them. This is where data changes on the replica.

Lag happens at two points: IO thread can’t fetch fast enough (network), or SQL thread can’t replay fast enough (most common).

Step 1: Read SHOW REPLICA STATUS

SHOW REPLICA STATUS\G

On MySQL before 8.0.22, use SHOW SLAVE STATUS\G. Note that SHOW REPLICA STATUS also renames the output fields: Seconds_Behind_Master becomes Seconds_Behind_Source, Read_Master_Log_Pos becomes Read_Source_Log_Pos, and so on. This article uses the older names throughout.

Critical fields:

Replica_IO_Running — must be Yes. If No, check network, credentials, and whether the primary’s binlog was purged.

Replica_SQL_Running — must be Yes. If No, check Last_SQL_Error.

Seconds_Behind_Master — the lag. Approximate.

Read_Master_Log_Pos — how far IO thread has read.

Exec_Master_Log_Pos — how far SQL thread has executed. Gap = queued work.

Relay_Log_Space — unprocessed data size. Large and growing = SQL thread can’t keep up.
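If you check these fields from a script rather than by eye, the `\G` (vertical) output is easy to parse. A minimal sketch, assuming you already captured the command's text output (all values come back as strings; convert as needed):

```python
def parse_vertical(output: str) -> dict:
    """Parse mysql's \\G (vertical) output into a dict of field -> value.

    Lines without a colon (like the '*** 1. row ***' separator) are skipped.
    All values are returned as strings.
    """
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

status = parse_vertical(
    "*************************** 1. row ***************************\n"
    "Replica_IO_Running: Yes\n"
    "Replica_SQL_Running: Yes\n"
    "Seconds_Behind_Master: 347\n"
)
```

From here, `status["Replica_SQL_Running"] != "Yes"` is an immediate red flag, and the position fields can be compared numerically after an `int()` conversion.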

Step 2: IO Thread or SQL Thread?

Check the primary’s position:

-- On primary:

SHOW MASTER STATUS\G

If replica’s Read_Master_Log_Pos is close to primary, the IO thread is fine. SQL thread is the bottleneck.

If Read_Master_Log_Pos is far behind: network bandwidth, slow primary storage, or SSL overhead on the IO thread.

In practice, the large majority of replication lag comes from the SQL thread. Let's fix it.
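The triage above reduces to two comparisons. A rough sketch (the 1 MB fetch tolerance is an arbitrary assumption, and binlog positions are only comparable when both threads are on the same binlog file, which this sketch assumes):

```python
def bottleneck(primary_pos: int, read_pos: int, exec_pos: int,
               fetch_slack: int = 1_000_000) -> str:
    """Rough triage of which replication thread is behind.

    primary_pos: position from SHOW MASTER STATUS on the primary
    read_pos:    Read_Master_Log_Pos on the replica (IO thread progress)
    exec_pos:    Exec_Master_Log_Pos on the replica (SQL thread progress)
    fetch_slack: bytes of tolerance before blaming the IO thread (assumption)
    """
    if primary_pos - read_pos > fetch_slack:
        return "io"    # not fetching fast enough: network, SSL, slow storage
    if read_pos - exec_pos > 0:
        return "sql"   # relay log has queued events the SQL thread hasn't replayed
    return "none"
```

For the lagging replica in this article, `read_pos` would be close to `primary_pos` while `exec_pos` trails far behind, so the function returns `"sql"`.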

Step 3: Why the SQL Thread Is Slow

Cause 1: Single-threaded replay.

Before MySQL 8.0.27, replication replays events on a single thread by default. The primary processes thousands of concurrent transactions across 32 cores, but the replica replays them one at a time.

SHOW VARIABLES LIKE 'replica_parallel_workers';

If 0 or 1, you’re single-threaded.

Fix — enable parallel replication:

STOP REPLICA;

SET GLOBAL replica_parallel_workers = 4;

SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';

SET GLOBAL replica_preserve_commit_order = ON;

START REPLICA;

In my.cnf:

[mysqld]

replica_parallel_workers = 4

replica_parallel_type = LOGICAL_CLOCK

replica_preserve_commit_order = ON

Start with 4 workers. Going above 8-16 rarely helps.

Cause 2: Missing indexes on the replica.

With row-based replication, each row change is found by primary key on the replica. No primary key = full table scan per change. Multiply by thousands of changes per second and the SQL thread grinds.

Check what SQL thread is doing:

SHOW PROCESSLIST;

If you see slow UPDATE or DELETE from the system user, check that table’s indexes:

SHOW CREATE TABLE the_table;

Add missing primary keys.

Cause 3: Large transactions.

A single transaction updating 500,000 rows lands in the binlog as one huge transaction. The replica must replay it as a single unit, blocking everything queued behind it.

Fix on the application side — batch operations:

-- Instead of:

UPDATE orders SET status = 'archived' WHERE created_at < '2024-01-01';

-- Do:

UPDATE orders SET status = 'archived'

WHERE created_at < '2024-01-01' AND status != 'archived'

LIMIT 1000;

Loop until no rows affected. Each batch commits separately.
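The loop is trivial to script. A sketch using a DB-API-style callable (connection setup omitted; the `execute` function here is an assumption standing in for `cursor.execute` plus `commit`, returning the affected row count, as DB-API's `cursor.rowcount` does):

```python
def archive_in_batches(execute, batch_size: int = 1000,
                       max_batches: int = 100_000) -> int:
    """Repeat a LIMIT-ed UPDATE until it affects no more rows.

    `execute(sql, params)` must run one statement, commit, and return the
    number of rows affected. Each batch commits separately, so the replica
    replays small transactions instead of one massive one.
    """
    sql = ("UPDATE orders SET status = 'archived' "
           "WHERE created_at < '2024-01-01' AND status != 'archived' "
           "LIMIT %s")
    total = 0
    for _ in range(max_batches):  # safety cap against a runaway loop
        affected = execute(sql, (batch_size,))
        total += affected
        if affected < batch_size:  # partial or empty batch: nothing left
            break
    return total
```

A short sleep between batches further softens the load on the SQL thread if the replica is already struggling.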

Cause 4: DDL operations.

ALTER TABLE on a large table blocks the SQL thread for its entire duration. Monitor lag during DDL — it’s expected and temporary.

Step 4: Monitoring That Works

Seconds_Behind_Master lies. It is computed against the relay log, so it can read 0 between event bursts, or while the IO thread itself is behind, even when the replica is minutes stale.

Use pt-heartbeat from Percona Toolkit:

On primary:

pt-heartbeat --update --database heartbeat --create-table

On replica:

pt-heartbeat --monitor --database heartbeat

Writes timestamps every second. True lag, no estimation.

Alert thresholds:

  • < 1s — acceptable
  • 1-10s — watch
  • 10-60s — investigate
  • > 60s — critical
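Wiring those tiers into an alerting script is a one-function job. A sketch (the thresholds are the article's suggestions; tune them to your application's tolerance):

```python
def lag_severity(seconds: float) -> str:
    """Map measured replication lag to the alert tiers above."""
    if seconds < 1:
        return "acceptable"
    if seconds < 10:
        return "watch"
    if seconds < 60:
        return "investigate"
    return "critical"
```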

The Diagnostic Sequence

  1. SHOW REPLICA STATUS — threads running? What’s the error?
  2. Compare positions — IO thread keeping up?
  3. SHOW PROCESSLIST — what’s the SQL thread doing?
  4. Check parallel replication — workers > 1?
  5. Check table indexes — primary keys present?
  6. Check for DDL — ALTER TABLE in progress?

Most lag resolves with parallel replication and proper indexing. The single-threaded SQL thread is by far the most common cause, and it's the easiest fix: one config change. (MySQL 8.0.27 finally made parallel workers the default; older deployments still run single-threaded.)

Step-by-Step Guide

1

Check SHOW REPLICA STATUS

Run SHOW REPLICA STATUS on the replica. Key fields are Seconds_Behind_Master, Replica_IO_Running, Replica_SQL_Running, and Relay_Log_Space. If IO thread is No the replica cannot connect. If SQL thread is No check Last_SQL_Error.

2

Identify IO thread vs SQL thread bottleneck

Compare Read_Master_Log_Pos on replica with primary position. If close the IO thread is fine and SQL thread is the bottleneck. If Read_Master_Log_Pos is behind check network bandwidth.

3

Enable parallel replication

Default is single-threaded replay on older versions. Set replica_parallel_workers to 4 or 8, replica_parallel_type to LOGICAL_CLOCK, and replica_preserve_commit_order to ON, then run STOP REPLICA and START REPLICA to apply.

4

Check for missing indexes on replica

Row-based replication finds matching rows by primary key. Missing indexes cause full table scans per row change. Check SHOW PROCESSLIST for slow SQL thread queries and add missing indexes.

5

Set up proper monitoring

Use pt-heartbeat from Percona Toolkit for accurate measurement since Seconds_Behind_Master can be misleading. Alert on lag exceeding your application tolerance threshold.

Frequently Asked Questions

Is Seconds_Behind_Master always accurate?
No. It can briefly show zero when idle between events even if behind. Use pt-heartbeat for accurate measurement by writing timestamps to a table on the primary and comparing on the replica.
Can queries on the replica cause lag?
Yes. Read queries holding locks that conflict with the SQL thread cause it to wait. More common with MyISAM. InnoDB has row-level locking but DDL and metadata locks can still block.
ROW or STATEMENT binlog format?
ROW is recommended. It is deterministic and avoids inconsistency from non-deterministic functions. It generates more data but consistency is worth it.
How to handle replication errors?
Check Last_SQL_Error. For duplicate key errors you can skip the offending event with sql_replica_skip_counter, but investigate the cause first — skipping can hide data drift. For data inconsistency use pt-table-checksum and pt-table-sync from Percona Toolkit.
banditz

Bug bounty research at javahack team

Freeland Research Bug Bounty
