mysql从库报错1236 could not find next log,但直接start slave就修复了

原创

大大刺猬

发布于 2026-04-09 18:23:34

2400

导读

mysql主从环境遇到报错:could not find next log 完整信息为Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event 'm3306.019749' at 154, the last event read from '/data/mysql_3306/mysqllog/binlog/m3306.019750' at 4705, the last byte read from '/data/mysql_3306/mysqllog/binlog/m3306.019750' at 4705.''

可参考如下:

(root@127.0.0.1) [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: m3306.019751
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay.039504
                Relay_Log_Pos: 237
        Relay_Master_Log_File: m3306.019751
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4
              Relay_Log_Space: 5187
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event 'm3306.019749' at 154, the last event read from '/data/mysql_3306/mysqllog/binlog/m3306.019750' at 4705, the last byte read from '/data/mysql_3306/mysqllog/binlog/m3306.019750' at 4705.'
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 677003306
                  Master_UUID: 5532adf0-3224-11f1-9038-525400b8c565
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 260409 15:37:55
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 5532adf0-3224-11f1-9038-525400b8c565:10-12
            Executed_Gtid_Set: 5532adf0-3224-11f1-9038-525400b8c565:10-12,
7a0ace1b-3224-11f1-9fb1-525400b8c565:1-9
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

简单分析与解决方法

从报错来看, 是从库无法获取到主库的下一个binlog(m3306.019751). 我们可以登录主库确认binlog是否存在, 在index 文件中是否有正确记录该binlog的名字. 发现文件是存在的,且记录正常. 其实该环境不存在被人工修改的可能, 所以是概率性BUG?

由于从库需要的内容,主库都存在, 故我们直接启动IO线程:start slave, 然后查看主从状态:show slave status, 发现就正常了. 后续也没出现相关报错, 仿佛之前只是IO_thread开的一个玩笑.

其实3年前, 在某客户现场做保障还是啥的时候,也遇到过这个, 当时是凌晨, 客户反馈了这个问题, 也是直接start slave就恢复了, 后续也没时间去分析(其实是忘了)

由于是从库获取不到主库的binlog,不属于网络之类的故障, 所以也不会自动重连.

查看源码逻辑

我们得先知道下逻辑, 才好调试. 这里涉及到3个线程:

从库slave_io线程, 最先看到的报错就是它抛出来的.
主库Binlog Dump, 找不到binlog本质上是主库的Binlog Dump干的,干活不利,该查
主库的连接, 导致binary log切换的连接线程,这才是"罪魁祸首"? 本次虽然是自动切换的, 但做切换的实际上就是连接线程.

由于各线程干的活比较多, 故只看和本案例有关的分支.

从库slave_io逻辑

io线程逻辑比较简单, 就是读event, 所以我们就不看了

handle_slave_io
	while (!io_slave_killed(thd,mi))
		event_len= read_event(mysql, mi, &suppress_warnings)
		THD_STAGE_INFO(thd, stage_queueing_master_event_to_the_relay_log);
		event_buf= (const char*)mysql->net.read_pos + 1;
		queue_event(mi, event_buf, event_len)
		....
			(Rotate_log_event::do_update_pos)

我们发现slave_io线程并没有抛出could not find next log的选项, 故退出该报错是主库发送过来的, 从描述来看也更像是主库发的.

主库Binlog Dump逻辑

主库的Binlog Dump是主动发送event(binlog)给从库的, 其逻辑如下:

com_binlog_dump # 开始同步数据 (其实还有个COM_REGISTER_SLAVE的,但我们省略了)
	mysql_binlog_send {Binlog_sender sender;sender.run();} # 定义一个sender
	Binlog_sender::run() # 开始了
		fake_rotate_event(log_file, start_pos) # 构造一个rotate_event,
		file= open_binlog_file(&log_cache, log_file, &m_errmsg) # 打开一个binlog
		THD_STAGE_INFO(m_thd, stage_sending_binlog_event_to_slave) # 就是之前讲的Sending binlog event to slave
		send_binlog(&log_cache, start_pos) # 发送一个binlog的event
		set_last_file(log_file); # 标记为最后一个binlog文件
		THD_STAGE_INFO(m_thd,stage_finished_reading_one_binlog_switching_to_next_binlog);
		mysql_bin_log.lock_index(); /* 给index file加个锁, 对index文件的操作都有这个,防止一起整 */
		mysql_bin_log.find_next_log(&m_linfo, 0) /* 失败就set_fatal_error("could not find next log");break*/
		mysql_bin_log.unlock_index()
		file= -1; /* 失败就没得这一步了 */
		if (file > 0){
			my_snprintf(error_text, sizeof(error_text),
				"%s; the first event '%s' at %lld, "
				"the last event read from '%s' at %lld, "
				"the last byte read from '%s' at %lld.",
				m_errmsg,m_start_file, m_start_pos, m_last_file, m_last_pos,
				log_file, my_b_tell(&log_cache));
		}

从逻辑上来看, 发送完一个binlog之后, 就会调用find_next_log找下一个binlog继续发送, 如果找不到,就会报错"could not find next log".

和我们看到的报错完全一样, 所以问题就在于find_next_log为啥会失败. 我们查看find_next_log的逻辑得到如下:

  if (!my_b_inited(&index_file))
  {   
      error= LOG_INFO_IO;
      goto err;
  }

  
  if ((length=my_b_gets(&index_file, fname, FN_REFLEN)) <= 1)
  {
    error = !index_file.error ? LOG_INFO_EOF : LOG_INFO_IO;
    goto err;
  }


  if(normalize_binlog_name(full_fname, fname, is_relay_log))
  {
    error= LOG_INFO_EOF;
    goto err;
  }

也就是有3种情况会导致find_next_log失败:

index_file未初始化.
从index_file中获取到的文件名大小是0/1 (EOF/\n).
名字不正常.

很明显,看起来就第二种可能性最大. 也就是很大的可能是index_file中还没更新最新的binlog文件信息.

对index_file的操作会加lock_index(), 所以不可能是多个线程同时修改导致的, 也就是说逻辑起码没得问题. 要逻辑有问题早就出现大量的这个报错了.

主库切换日志逻辑

虽然我们怀疑是index file未更新导致的"could not find next log", 但还是要讲证据的, 我们来看看主库轮转日志的时候都在干什么.

日志切换有2种:

手动切换(比如:flush binary logs),
达到max_binlog_size大小了,就自动切换.

为了尽可能的还原, 我们查看第二种的逻辑(实际上两者是一样的, 但模拟的时候还是按Binlog上限的情况), 整理后得到逻辑如下:

rotate_and_purge
rotate
	new_file_impl
		mysql_mutex_lock(&LOCK_index) # 获取index锁
		generate_new_name # 生成新的binlog文件名
		Rotate_log_event # 写rotate event
		close # 关闭旧的binlog文件
		open_index_file # 打开index文件
		open_binlog # 打开新的binlog文件, 打开的时候就顺便写点format_desc_event之类的
			flush_io_cache(&log_file)
			mysql_file_sync(log_file.file, MYF(MY_WME))
			add_log_to_index # 把新的log加到index中去
				open_crash_safe_index_file # 打开一个safe_index文件 
				copy_file(&index_file, &crash_safe_index_file, 0) # 把index的考到safe_index里面去
				my_b_write(&crash_safe_index_file, log_name, log_name_len) # 把新的binlog名字写到safe_index
				my_b_write(&crash_safe_index_file, (uchar*) "\n", 1) # 来个换行符
				flush_io_cache(&crash_safe_index_file) # flush
				mysql_file_sync(crash_safe_index_file.file, MYF(MY_WME) # sync 调用的my_sync
				close_crash_safe_index_file # 关闭safe_index
				move_crash_safe_index_file_to_index_file # 把safe_index改为index, 就是rename
					my_rename(crash_safe_index_file_name, index_file_name, MYF(MY_WME))
						if (rename(from,to)) # MyFlags=MY_WME, 如果是MY_SYNC_DIR才会sync,前提是rename失败...
						else if (MyFlags & MY_SYNC_DIR)
							if (my_sync_dir_by_file(from, MyFlags) ||
							   (strcmp(from, to) &&
							   my_sync_dir_by_file(to, MyFlags)))
		mysql_mutex_unlock(&LOCK_index) # 释放index锁
		

# 附my_sync_dir_by_file
int my_sync_dir_by_file(const char *file_name MY_ATTRIBUTE((unused)),
                        myf my_flags MY_ATTRIBUTE((unused)))
{
#ifdef __linux__
  char dir_name[FN_REFLEN];
  size_t dir_name_length;
  dirname_part(dir_name, file_name, &dir_name_length);
  return my_sync_dir(dir_name, my_flags);
#else
  return 0;
#endif
}

我们发现这逻辑相当严谨, 对index的修改也会先获取相关锁. 操作的文件也都做了fsync.

但并不是直接修改的index_file,而是先拷贝一个safe_index_file,然后把新的binlog信息往safe_index_file里面写, 最后再把safe_index_file改为index_file. 这就存在一个可能: Binlog Dump线程读取到的index file还是指向的之前旧的index file.

我们知道, 文件夹(目录)也是特殊的文件, 而rename操作本质上是修改的文件夹(目录)信息, 那么rename是否会自动做fsync呢?应该不会, 不然旧不需要后面的my_sync_dir_by_file了....

也就是说可能是mysql的bug,即rename之后未对目录做fsync. 该问题不好复现, 已提相关BUG: https://bugs.mysql.com/bug.php?id=120238

strace分析

光看代码比较枯燥, 我们就使用strace来看看.

从库的trace信息

我们先看下从库的trace信息,命令参考如下:

strace -e trace=fsync,fdatasync,sync,msync,open,close,rename,write,sendto,recvfrom -s 300 -f -p 18695

结果如下:(我这里是模拟出相关报错的trace信息)

[pid 18722] recvfrom(50, "D\0\0\3\0\357\0eY\327i\"*@Z(A\0\0\0\251\t\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\1\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\322\313P\320G\0\0\4\0\357\0eY\327i\2*@Z(D\0\0\0\355\t\0\0\10\0\3\0\0\0\0\0\0\0\0\0\0\32\0\0\0\0\0\0\1 \0\240U\0\0\0\0\6\3std\4!\0!\0!\0\0BEGIN\256S\377\2217\0\0\5\0\357\0eY\327i\23*@Z(4\0\0\0!\n\0\0\0\0l\0\0\0\0\0\1\0\3db1\0\tt20260409\0\1\374\1\2\0014\354\260\317\371\7\0\6\0\357\0eY\327i\36*@Z(\366\7\0\0\27\22\0\0\0\0l\0\0\0\0\0\1\0\2\0\1\377\376\320\7xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 16384, MSG_DONTWAIT, NULL, NULL) = 2289
[pid 18722] write(51, "eY\327i\"*@Z(A\0\0\0\251\t\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\1\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\322\313P\320", 65) = 65
[pid 18722] write(51, "eY\327i\2*@Z(D\0\0\0\355\t\0\0\10\0\3\0\0\0\0\0\0\0\0\0\0\32\0\0\0\0\0\0\1 \0\240U\0\0\0\0\6\3std\4!\0!\0!\0\0BEGIN\256S\377\221", 68) = 68
[pid 18723] open(".//db.opt", O_RDONLY <unfinished ...>
[pid 18722] write(51, "eY\327i\23*@Z(4\0\0\0!\n\0\0\0\0l\0\0\0\0\0\1\0\3db1\0\tt20260409\0\1\374\1\2\0014\354\260\317", 52 <unfinished ...>
[pid 18723] <... open resumed>)         = -1 ENOENT (No such file or directory)
[pid 18722] <... write resumed>)        = 52
[pid 18722] write(51, "eY\327i\36*@Z(\366\7\0\0\27\22\0\0\0\0l\0\0\0\0\0\1\0\2\0\1\377\376\320\7xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 2038) = 2038
[pid 18722] write(51, "eY\327i\20*@Z(\37\0\0\0006\22\0\0\0\0,\0\0\0\0\0\0\0:\241xj", 31) = 31
[pid 18722] sendto(50, "\25\0\0\0\3576\22\0\0\0\0\0\0m3306.019752", 25, MSG_DONTWAIT, NULL, 0) = 25
[pid 18722] recvfrom(50, 0x7f9b0c010d60, 16384, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18723] open("./db1/t20260409.frm", O_RDONLY) = 53
[pid 18723] close(53)                   = 0
[pid 18723] fsync(4)                    = 0
[pid 18707] fsync(12)                   = 0
[pid 18703] fsync(45)                   = 0
[pid 18703] fsync(15)                   = 0
[pid 18703] fsync(12)                   = 0
[pid 18712] fsync(4)                    = 0
[pid 18698] fsync(4)                    = 0
[pid 18712] fsync(4)                    = 0
[pid 18698] fsync(4)                    = 0
[pid 18722] recvfrom(50, ".\0\0\1\0\357\0eY\327i\4*@Z(+\0\0\0a\22\0\0\0\0\4\0\0\0\0\0\0\0m3306.019753\363}-P\354\0\0\2\377\324\4#HY000could not find next log; the first event 'm3306.019752' at 2408, the last event read from '/data/mysql_3306/mysqllog/binlog/m3306.019752' at 4705, the last byte read from '/data/mysql_3306/mysqllog/binlog/m3306.019752' at 4705.", 16384, MSG_DONTWAIT, NULL, NULL) = 290
[pid 18722] close(53)                   = 0
[pid 18722] write(51, "JY\327i\4\254K\f,+\0\0\0]\n\0\0@\0\4\0\0\0\0\0\0\0relay.039510\3\325\302O", 43) = 43
[pid 18722] fdatasync(51)               = 0
[pid 18722] close(51)                   = 0
[pid 18722] close(47)                   = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.index", O_RDWR|O_CREAT, 0640) = 47
[pid 18722] fdatasync(47)               = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.~rec~", O_RDWR|O_CREAT, 0640) = 51
[pid 18722] close(51)                   = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.~rec~", O_RDWR|O_CREAT, 0640) = 51
[pid 18722] write(51, "/data/mysql_3308/mysqllog/relay/relay.039510\n", 45) = 45
[pid 18722] fdatasync(51)               = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.039510", O_RDWR|O_CREAT|O_APPEND, 0640) = 53
[pid 18722] write(53, "\376binJY\327i\17\254K\f,w\0\0\0{\0\0\0@\0\4\0005.7.27-log\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\238\r\0\10\0\22\0\4\4\4\4\22\0\0_\0\4\32\10\0\0\0\10\10\10\2\0\0\0\n\n\n**\0\0224\0\1u`\274\212JY\327i#\254K\f,G\0\0\0\302\0\0\0\300\0\1\0\0\0\0\0\0\0U2\255\3602$\21\361\2208RT\0\270\305e\1\0\0\0\0\0\0\0\n\0\0\0\0\0\0\0\r\0\0\0\0\0\0\0\304dp:", 194) = 194
[pid 18722] fdatasync(53)               = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.index_crash_safe", O_RDWR|O_CREAT, 0640) = 54
[pid 18722] write(54, "/data/mysql_3308/mysqllog/relay/relay.039508\n/data/mysql_3308/mysqllog/relay/relay.039509\n", 90) = 90
[pid 18722] write(54, "/data/mysql_3308/mysqllog/relay/relay.039510\n", 45) = 45
[pid 18722] fdatasync(54)               = 0
[pid 18722] close(54)                   = 0
[pid 18722] close(47)                   = 0
[pid 18722] rename("/data/mysql_3308/mysqllog/relay/relay.index_crash_safe", "/data/mysql_3308/mysqllog/relay/relay.index") = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.index", O_RDWR|O_CREAT, 0640) = 47
[pid 18722] fdatasync(47)               = 0
[pid 18722] close(51)                   = 0
[pid 18722] write(53, "eY\327i\4*@Z(+\0\0\0a\22\0\0\0\0\4\0\0\0\0\0\0\0m3306.019753\363}-P", 43) = 43
[pid 18723] open("/data/mysql_3308/mysqllog/relay/relay.039509", O_RDONLY) = 51
[pid 18723] close(51)                   = 0
[pid 18722] write(2, "2026-04-09T07:47:03.855690Z 1 [ERROR] Error reading packet from server for channel '': could not find next log; the first event 'm3306.019752' at 2408, the last event read from '/data/mysql_3306/mysqllog/binlog/m3306.019752' at 4705, the last byte read from '/data/mysql_3306/mysqllog/binlog/m3306.01"..., 335) = 335
[pid 18722] write(2, "2026-04-09T07:47:03.855736Z 1 [ERROR] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event 'm3306.019752' at 2408, the last event read from '/data/mysql_3306/mysqllog/binlog/m3306.019752' at 4705, the last byte read f"..., 380) = 380
[pid 18722] write(2, "2026-04-09T07:47:03.855762Z 1 [Note] Slave I/O thread exiting for channel '', read up to log 'm3306.019753', position 4\n", 120) = 120
[pid 18722] sendto(50, "\1\0\0\0\1", 5, MSG_DONTWAIT, NULL, 0) = 5
[pid 18722] close(50)                   = 0
[pid 18723] fsync(4 <unfinished ...>
[pid 18722] +++ exited with 0 +++
[pid 18723] <... fsync resumed>)        = 0
[pid 18723] open("/data/mysql_3308/mysqllog/relay/relay.~rec~", O_RDWR|O_CREAT, 0640) = 50
[pid 18723] write(50, "/data/mysql_3308/mysqllog/relay/relay.039508\n", 45) = 45
[pid 18723] fdatasync(50)               = 0
[pid 18723] open("/data/mysql_3308/mysqllog/relay/relay.index_crash_safe", O_RDWR|O_CREAT, 0640) = 51
[pid 18723] write(51, "/data/mysql_3308/mysqllog/relay/relay.039509\n/data/mysql_3308/mysqllog/relay/relay.039510\n", 90) = 90
[pid 18723] close(51)                   = 0
[pid 18723] close(47)                   = 0
[pid 18723] rename("/data/mysql_3308/mysqllog/relay/relay.index_crash_safe", "/data/mysql_3308/mysqllog/relay/relay.index") = 0
[pid 18723] open("/data/mysql_3308/mysqllog/relay/relay.index", O_RDWR|O_CREAT, 0640) = 47
[pid 18723] fdatasync(47)               = 0
[pid 18723] close(50)                   = 0
[pid 18707] fsync(12)                   = 0
[pid 18703] fsync(45)                   = 0
[pid 18703] fsync(12)                   = 0
[pid 18712] fsync(4)                    = 0
[pid 18698] fsync(4)                    = 0
[pid 18712] fsync(4)                    = 0
[pid 18698] fsync(4)                    = 0

我们发现"#HY000could not find next log; the first event 'm3306."是主库发送过来的, 也验证了我们上面查看的源码逻辑. 从库的逻辑比较简单, 我们就不继续看了.

主库的trace信息

关键还得看主库的trace信息, 我们使用如下命令:

strace -e trace=fsync,fdatasync,sync,msync,open,close,rename,write,sendto,recvfrom -s 300 -f -p 10747

我们跟踪的信息比较多, 有刷盘的fsync,fdatasync,sync,msync, 有文件的open,close,rename,还有网络的sendto,recvfrom

-s 300表示表大小可达300字节(我想看更多的信息)

-f 包含子线程

-p 10747 mysql的进程id

于是我们得到如下信息:

[pid 11493] recvfrom(51, "/\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 11493] recvfrom(51, "\3insert into t20260409 values(repeat('x',2000))", 47, MSG_DONTWAIT, NULL, NULL) = 47
[pid 11493] fsync(4)                    = 0
[pid 11493] write(33, "\205s\327i\"*@Z(A\0\0\0\262\f\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\4\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0#\215\31\252\205s\327i\2*@Z(G\0\0\0\371\f\0\0\10\0\3\0\0\0\0\0\0\0\3\0\0\32\0\0\0\0\0\0\1 \0\240U\0\0\0\0\6\3std\4!\0!\0!\0db1\0BEGIN\204\360\210c\205s\327i\23*@Z(4\0\0\0-\r\0\0\0\0m\0\0\0\0\0\1\0\3db1\0\tt20260409\0\1\374\1\2\1 \344\v\22\205s\327i\36*@Z(\366\7\0\0#\25\0\0\0\0m\0\0\0\0\0\1\0\2\0\1\377\376\320\7xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 2257) = 2257
[pid 11493] fdatasync(33)               = 0
[pid 11511] sendto(52, "D\0\0\1\0\357\0\205s\327i\"*@Z(A\0\0\0\262\f\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\4\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0#\215\31\252J\0\0\2\0\357\0\205s\327i\2*@Z(G\0\0\0\371\f\0\0\10\0\3\0\0\0\0\0\0\0\3\0\0\32\0\0\0\0\0\0\1 \0\240U\0\0\0\0\6\3std\4!\0!\0!\0db1\0BEGIN\204\360\210c7\0\0\3\0\357\0\205s\327i\23*@Z(4\0\0\0-\r\0\0\0\0m\0\0\0\0\0\1\0\3db1\0\tt20260409\0\1\374\1\2\1 \344\v\22\371\7\0\4\0\357\0\205s\327i\36*@Z(\366\7\0\0#\25\0\0\0\0m\0\0\0\0\0\1\0\2\0\1\377\376\320\7xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 2292, MSG_DONTWAIT, NULL, 0) = 2292
[pid 10773] recvfrom(52, "\25\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 10773] recvfrom(52, "\357B\25\0\0\0\0\0\0m3306.019755", 21, MSG_DONTWAIT, NULL, NULL) = 21
[pid 11493] fsync(4)                    = 0
[pid 11493] close(53)                   = 0
[pid 11493] write(33, "\205s\327i\4*@Z(+\0\0\0m\25\0\0\0\0\4\0\0\0\0\0\0\0m3306.019756\273\265\314\271", 43) = 43
[pid 11493] fdatasync(33)               = 0
[pid 11493] close(33)                   = 0
[pid 11493] close(3)                    = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.index", O_RDWR|O_CREAT, 0640) = 3
[pid 11493] fdatasync(3)                = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.~rec~", O_RDWR|O_CREAT, 0640) = 33
[pid 11493] close(33)                   = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.~rec~", O_RDWR|O_CREAT, 0640) = 33
[pid 11493] write(33, "/data/mysql_3306/mysqllog/binlog/m3306.019756\n", 46) = 46
[pid 11493] fdatasync(33)               = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.019756", O_WRONLY|O_CREAT, 0640) = 53
[pid 11493] write(53, "\376bin\205s\327i\17*@Z(w\0\0\0{\0\0\0\1\0\4\0005.7.27-log\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\238\r\0\10\0\22\0\4\4\4\4\22\0\0_\0\4\32\10\0\0\0\10\10\10\2\0\0\0\n\n\n**\0\0224\0\0011\33\"\247\205s\327i#*@Z(\37\0\0\0\232\0\0\0\200\0\0\0\0\0\0\0\0\0\373\2\245\250", 154) = 154
[pid 11493] fdatasync(53)               = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.index_crash_safe", O_RDWR|O_CREAT, 0640) = 54
[pid 11493] write(54, "/data/mysql_3306/mysqllog/binlog/m3306.019735\n/data/mysql_3306/mysqllog/binlog/m3306.019736\n/data/mysql_3306/mysqllog/binlog/m3306.019737\n/data/mysql_3306/mysqllog/binlog/m3306.019738\n/data/mysql_3306/mysqllog/binlog/m3306.019739\n/data/mysql_3306/mysqllog/binlog/m3306.019740\n/data/mysql_3306/mysqllo"..., 966) = 966
[pid 11493] write(54, "/data/mysql_3306/mysqllog/binlog/m3306.019756\n", 46) = 46
[pid 11493] fdatasync(54)               = 0
[pid 11493] close(54)                   = 0
[pid 11493] close(3)                    = 0
[pid 11493] rename("/data/mysql_3306/mysqllog/binlog/m3306.index_crash_safe", "/data/mysql_3306/mysqllog/binlog/m3306.index") = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.index", O_RDWR|O_CREAT, 0640) = 3
[pid 11493] fdatasync(3)                = 0
[pid 11493] close(33)                   = 0
[pid 11493] sendto(51, "\7\0\0\1\0\1\0\2\0\0\0", 11, MSG_DONTWAIT, NULL, 0) = 11
[pid 11511] close(49 <unfinished ...>
[pid 11493] recvfrom(51,  <unfinished ...>
[pid 11511] <... close resumed>)        = 0
[pid 11493] <... recvfrom resumed>0x7f172800a250, 4, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 11511] open("/data/mysql_3306/mysqllog/binlog/m3306.019756", O_RDONLY) = 33
[pid 11511] sendto(52, ".\0\0\1\0\357\0\205s\327i\4*@Z(+\0\0\0m\25\0\0\0\0\4\0\0\0\0\0\0\0m3306.019756\273\265\314\271.\0\0\2\0\357\0\0\0\0\0\4*@Z(+\0\0\0\0\0\0\0 \0\4\0\0\0\0\0\0\0m3306.019756\322\23\231\223z\0\0\3\0\357\0\205s\327i\17*@Z(w\0\0\0{\0\0\0\0\0\4\0005.7.27-log\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\238\r\0\10\0\22\0\4\4\4\4\22\0\0_\0\4\32\10\0\0\0\10\10\10\2\0\0\0\n\n\n**\0\0224\0\0011\33\"\247\"\0\0\4\0\357\0\205s\327i#*@Z(\37\0\0\0\232\0\0\0\200\0\0\0\0\0\0\0\0\0\373\2\245\250", 264, MSG_DONTWAIT, NULL, 0) = 264
[pid 10759] fsync(12)                   = 0
[pid 10755] fsync(30)                   = 0
[pid 10755] fsync(12)                   = 0
[pid 10764] fsync(4)                    = 0
[pid 10750] fsync(4)                    = 0
[pid 10764] fsync(4)                    = 0
[pid 10750] fsync(4)                    = 0

简单整理整理后得到如下逻辑:

11493 客户端连接,执行SQL语句的
11511 Binlog Dump线程
10773 Ack_receiver

11493(客户端连接对应的服务端线程)
	发送insert语句
	写Binlog,并刷盘
11511(Binlog Dump线程)
	发送binlog给从库(就是那个insert语句的)
10773(Ack_receiver)
	OK
11493(客户端连接对应的服务端线程)
	写rotate_event到binlog
	fdatasync该binlog
	关闭该Binlog文件
	关闭index file文件(根据fd可在/proc/PID/fd下面找到各fd对应的文件,要提前查看)
	打开index file
	fdatasync(index file)
	打开m3306.~rec~并关闭(测试)
	打开m3306.~rec~
	把新生成的binlog信息写到m3306.~rec~
	fdatasync(fdatasync)
	打开index_crash_safe文件
	把index file的信息写到index_crash_safe文件中
	把新生成的binlog信息写到index_crash_safe
	fdatasync(index_crash_safe),关闭index_crash_safe, 关闭index file
	rename(index_crash_safe, index_file) 
	打开index file文件
	fdatasync(index file)
	关闭m3306.~rec~
11511(Binlog Dump线程)
	关闭旧的binlog文件
	打开新的binlog文件
	发送新的binlog中的event给从库

整体逻辑和上面源码分析的一致(只是少了点细节), 我们发现rename("/data/mysql_3306/mysqllog/binlog/m3306.index_crash_safe", "/data/mysql_3306/mysqllog/binlog/m3306.index") = 0 之后确实没有做相关目录的fsync操作.

gdb分析

根据最上面源码的分析,我们就可以对其进行断点分析, 但结果和上面的一致, 只是多了更多的细节而已, 比如generate_new_name 就是生成新的binlog名字给new_name.(长度是512字节). 量太多不再演示, 有兴趣的读者可自行断点验证.

复现方法

我们无法直接复现出来, 但我们可以在close_crash_safe_index_file和move_crash_safe_index_file_to_index_file打断点, 等close完, 我们手动修改safe_index文件,然后再continue. 但这种方法恢复的时候需要手动修改index文件,并且flush下(重新打开index file)

gdb /soft/mysql_3306/mysqlbase/mysql/bin/mysqld 
set non-stop on
set target-async on
set pagination off
break MYSQL_BIN_LOG::close_crash_safe_index_file
break MYSQL_BIN_LOG::move_crash_safe_index_file_to_index_file
run --defaults-file=/data/mysql_3306/conf/mysql_3306.cnf --basedir=/soft/mysql_3306/mysqlbase/mysql --datadir=/data/mysql_3306/mysqldata --plugin-dir=/soft/mysql_3306/mysqlbase/mysql/lib/plugin --user=mysql --log-error=/data/mysql_3306/mysqllog/dblogs/mysql3306.err --open-files-limit=123456 --pid-file=/data/mysql_3306/run/mysql.pid --socket=/data/mysql_3306/run/mysql.sock --port=3306

# 启动的时候也会生成新的binlog, 所以要跳过两次断点
continue
continue

# 触发binlog切换
continue 之后手动修改/data/mysql_3306/mysqllog/binlog/m3306.index_crash_safe, 删除新增的内容
continue 之后再观察从库的io_thread状态

之前测试过程,发现修改binlog checksum参数后flush binary logs触发了一次could not find next log, 但为偶然, 无法(稳定)复现

总结

mysql主从是从库连上之后,就只有主库发送event(从库顶多ack),连报错信息都是主库发过去的.

mysql轮转日志的时候实际上是生成一个新的文件,然后将binlog信息写入新的文件, 再将新的文件覆盖到旧的文件, 整个过程,除了rename外都做了fsync/fdatasync (fdatasync不更新元数据信息), 所以本次的"could not find next log"可能问题就出在这. 当然解决方法很简单, 直接执行start slave就能恢复, 毕竟只是fake.

虽然说了这么多, 但其实并没有完全复现.

参考

https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-binary-logs

https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-storage-fs#s-storage-ext4

https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-xfs

https://community.ibm.com/community/user/discussion/file-directory-metadata-fsync-issues-on-failover

https://www.percona.com/blog/mysql-replication-got-fatal-error-1236-causes-and-cures/

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

mysql