
mysql主从环境遇到报错:could not find next log 完整信息为Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event 'm3306.019749' at 154, the last event read from '/data/mysql_3306/mysqllog/binlog/m3306.019750' at 4705, the last byte read from '/data/mysql_3306/mysqllog/binlog/m3306.019750' at 4705.''
可参考如下:
(root@127.0.0.1) [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 127.0.0.1
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: m3306.019751
Read_Master_Log_Pos: 4
Relay_Log_File: relay.039504
Relay_Log_Pos: 237
Relay_Master_Log_File: m3306.019751
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 4
Relay_Log_Space: 5187
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event 'm3306.019749' at 154, the last event read from '/data/mysql_3306/mysqllog/binlog/m3306.019750' at 4705, the last byte read from '/data/mysql_3306/mysqllog/binlog/m3306.019750' at 4705.'
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 677003306
Master_UUID: 5532adf0-3224-11f1-9038-525400b8c565
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 260409 15:37:55
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 5532adf0-3224-11f1-9038-525400b8c565:10-12
Executed_Gtid_Set: 5532adf0-3224-11f1-9038-525400b8c565:10-12,
7a0ace1b-3224-11f1-9fb1-525400b8c565:1-9
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)从报错来看, 是从库无法获取到主库的下一个binlog(m3306.019751). 我们可以登录主库确认binlog是否存在, 在index 文件中是否有正确记录该binlog的名字. 发现文件是存在的,且记录正常. 其实该环境不存在被人工修改的可能, 所以是概率性BUG?
由于从库需要的内容,主库都存在, 故我们直接启动IO线程:start slave, 然后查看主从状态:show slave status, 发现就正常了. 后续也没出现相关报错, 仿佛之前只是IO_thread开的一个玩笑.
其实3年前, 在某客户现场做保障还是啥的时候,也遇到过这个, 当时是凌晨, 客户反馈了这个问题, 也是直接start slave就恢复了, 后续也没时间去分析(其实是忘了)
由于是从库获取不到主库的binlog,不属于网络之类的故障, 所以也不会自动重连.
我们得先知道下逻辑, 才好调试. 这里涉及到3个线程:
由于各线程干的活比较多, 故只看和本案例有关的分支.
io线程逻辑比较简单, 就是读event, 所以我们就不看了
handle_slave_io
while (!io_slave_killed(thd,mi))
event_len= read_event(mysql, mi, &suppress_warnings)
THD_STAGE_INFO(thd, stage_queueing_master_event_to_the_relay_log);
event_buf= (const char*)mysql->net.read_pos + 1;
queue_event(mi, event_buf, event_len)
....
(Rotate_log_event::do_update_pos)我们发现slave_io线程并没有抛出could not find next log的选项, 故退出该报错是主库发送过来的, 从描述来看也更像是主库发的.
主库的Binlog Dump是主动发送event(binlog)给从库的, 其逻辑如下:
com_binlog_dump # 开始同步数据 (其实还有个COM_REGISTER_SLAVE的,但我们省略了)
mysql_binlog_send {Binlog_sender sender;sender.run();} # 定义一个sender
Binlog_sender::run() # 开始了
fake_rotate_event(log_file, start_pos) # 构造一个rotate_event,
file= open_binlog_file(&log_cache, log_file, &m_errmsg) # 打开一个binlog
THD_STAGE_INFO(m_thd, stage_sending_binlog_event_to_slave) # 就是之前讲的Sending binlog event to slave
send_binlog(&log_cache, start_pos) # 发送一个binlog的event
set_last_file(log_file); # 标记为最后一个binlog文件
THD_STAGE_INFO(m_thd,stage_finished_reading_one_binlog_switching_to_next_binlog);
mysql_bin_log.lock_index(); /* 给index file加个锁, 对index文件的操作都有这个,防止一起整 */
mysql_bin_log.find_next_log(&m_linfo, 0) /* 失败就set_fatal_error("could not find next log");break*/
mysql_bin_log.unlock_index()
file= -1; /* 失败就没得这一步了 */
if (file > 0){
my_snprintf(error_text, sizeof(error_text),
"%s; the first event '%s' at %lld, "
"the last event read from '%s' at %lld, "
"the last byte read from '%s' at %lld.",
m_errmsg,m_start_file, m_start_pos, m_last_file, m_last_pos,
log_file, my_b_tell(&log_cache));
}从逻辑上来看, 发送完一个binlog之后, 就会调用find_next_log找下一个binlog继续发送, 如果找不到,就会报错"could not find next log".
和我们看到的报错完全一样, 所以问题就在于find_next_log为啥会失败. 我们查看find_next_log的逻辑得到如下:
if (!my_b_inited(&index_file))
{
error= LOG_INFO_IO;
goto err;
}
if ((length=my_b_gets(&index_file, fname, FN_REFLEN)) <= 1)
{
error = !index_file.error ? LOG_INFO_EOF : LOG_INFO_IO;
goto err;
}
if(normalize_binlog_name(full_fname, fname, is_relay_log))
{
error= LOG_INFO_EOF;
goto err;
}也就是有3种情况会导致find_next_log失败:
很明显,看起来就第二种可能性最大. 也就是很大的可能是index_file中还没更新最新的binlog文件信息.
对index_file的操作会加lock_index(), 所以不可能是多个线程同时修改导致的, 也就是说逻辑起码没得问题. 要逻辑有问题早就出现大量的这个报错了.
虽然我们怀疑是index file未更新导致的"could not find next log", 但还是要讲证据的, 我们来看看主库轮转日志的时候都在干什么.
日志切换有2种:
为了尽可能的还原, 我们查看第二种的逻辑(实际上两者是一样的, 但模拟的时候还是按Binlog上限的情况), 整理后得到逻辑如下:
rotate_and_purge
rotate
new_file_impl
mysql_mutex_lock(&LOCK_index) # 获取index锁
generate_new_name # 生成新的binlog文件名
Rotate_log_event # 写rotate event
close # 关闭旧的binlog文件
open_index_file # 打开index文件
open_binlog # 打开新的binlog文件, 打开的时候就顺便写点format_desc_event之类的
flush_io_cache(&log_file)
mysql_file_sync(log_file.file, MYF(MY_WME))
add_log_to_index # 把新的log加到index中去
open_crash_safe_index_file # 打开一个safe_index文件
copy_file(&index_file, &crash_safe_index_file, 0) # 把index的考到safe_index里面去
my_b_write(&crash_safe_index_file, log_name, log_name_len) # 把新的binlog名字写到safe_index
my_b_write(&crash_safe_index_file, (uchar*) "\n", 1) # 来个换行符
flush_io_cache(&crash_safe_index_file) # flush
mysql_file_sync(crash_safe_index_file.file, MYF(MY_WME) # sync 调用的my_sync
close_crash_safe_index_file # 关闭safe_index
move_crash_safe_index_file_to_index_file # 把safe_index改为index, 就是rename
my_rename(crash_safe_index_file_name, index_file_name, MYF(MY_WME))
if (rename(from,to)) # MyFlags=MY_WME, 如果是MY_SYNC_DIR才会sync,前提是rename失败...
else if (MyFlags & MY_SYNC_DIR)
if (my_sync_dir_by_file(from, MyFlags) ||
(strcmp(from, to) &&
my_sync_dir_by_file(to, MyFlags)))
mysql_mutex_unlock(&LOCK_index) # 释放index锁
# 附my_sync_dir_by_file
int my_sync_dir_by_file(const char *file_name MY_ATTRIBUTE((unused)),
myf my_flags MY_ATTRIBUTE((unused)))
{
#ifdef __linux__
char dir_name[FN_REFLEN];
size_t dir_name_length;
dirname_part(dir_name, file_name, &dir_name_length);
return my_sync_dir(dir_name, my_flags);
#else
return 0;
#endif
}我们发现这逻辑相当严谨, 对index的修改也会先获取相关锁. 操作的文件也都做了fsync.
但并不是直接修改的index_file,而是先拷贝一个safe_index_file,然后把新的binlog信息往safe_index_file里面写, 最后再把safe_index_file改为index_file. 这就存在一个可能: Binlog Dump线程读取到的index file还是指向的之前旧的index file.
我们知道, 文件夹(目录)也是特殊的文件, 而rename操作本质上是修改的文件夹(目录)信息, 那么rename是否会自动做fsync呢?应该不会, 不然旧不需要后面的my_sync_dir_by_file了....
也就是说可能是mysql的bug,即rename之后未对目录做fsync. 该问题不好复现, 已提相关BUG: https://bugs.mysql.com/bug.php?id=120238
光看代码比较枯燥, 我们就使用strace来看看.
我们先看下从库的trace信息,命令参考如下:
strace -e trace=fsync,fdatasync,sync,msync,open,close,rename,write,sendto,recvfrom -s 300 -f -p 18695结果如下:(我这里是模拟出相关报错的trace信息)
[pid 18722] recvfrom(50, "D\0\0\3\0\357\0eY\327i\"*@Z(A\0\0\0\251\t\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\1\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\322\313P\320G\0\0\4\0\357\0eY\327i\2*@Z(D\0\0\0\355\t\0\0\10\0\3\0\0\0\0\0\0\0\0\0\0\32\0\0\0\0\0\0\1 \0\240U\0\0\0\0\6\3std\4!\0!\0!\0\0BEGIN\256S\377\2217\0\0\5\0\357\0eY\327i\23*@Z(4\0\0\0!\n\0\0\0\0l\0\0\0\0\0\1\0\3db1\0\tt20260409\0\1\374\1\2\0014\354\260\317\371\7\0\6\0\357\0eY\327i\36*@Z(\366\7\0\0\27\22\0\0\0\0l\0\0\0\0\0\1\0\2\0\1\377\376\320\7xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 16384, MSG_DONTWAIT, NULL, NULL) = 2289
[pid 18722] write(51, "eY\327i\"*@Z(A\0\0\0\251\t\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\1\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\322\313P\320", 65) = 65
[pid 18722] write(51, "eY\327i\2*@Z(D\0\0\0\355\t\0\0\10\0\3\0\0\0\0\0\0\0\0\0\0\32\0\0\0\0\0\0\1 \0\240U\0\0\0\0\6\3std\4!\0!\0!\0\0BEGIN\256S\377\221", 68) = 68
[pid 18723] open(".//db.opt", O_RDONLY <unfinished ...>
[pid 18722] write(51, "eY\327i\23*@Z(4\0\0\0!\n\0\0\0\0l\0\0\0\0\0\1\0\3db1\0\tt20260409\0\1\374\1\2\0014\354\260\317", 52 <unfinished ...>
[pid 18723] <... open resumed>) = -1 ENOENT (No such file or directory)
[pid 18722] <... write resumed>) = 52
[pid 18722] write(51, "eY\327i\36*@Z(\366\7\0\0\27\22\0\0\0\0l\0\0\0\0\0\1\0\2\0\1\377\376\320\7xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 2038) = 2038
[pid 18722] write(51, "eY\327i\20*@Z(\37\0\0\0006\22\0\0\0\0,\0\0\0\0\0\0\0:\241xj", 31) = 31
[pid 18722] sendto(50, "\25\0\0\0\3576\22\0\0\0\0\0\0m3306.019752", 25, MSG_DONTWAIT, NULL, 0) = 25
[pid 18722] recvfrom(50, 0x7f9b0c010d60, 16384, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18723] open("./db1/t20260409.frm", O_RDONLY) = 53
[pid 18723] close(53) = 0
[pid 18723] fsync(4) = 0
[pid 18707] fsync(12) = 0
[pid 18703] fsync(45) = 0
[pid 18703] fsync(15) = 0
[pid 18703] fsync(12) = 0
[pid 18712] fsync(4) = 0
[pid 18698] fsync(4) = 0
[pid 18712] fsync(4) = 0
[pid 18698] fsync(4) = 0
[pid 18722] recvfrom(50, ".\0\0\1\0\357\0eY\327i\4*@Z(+\0\0\0a\22\0\0\0\0\4\0\0\0\0\0\0\0m3306.019753\363}-P\354\0\0\2\377\324\4#HY000could not find next log; the first event 'm3306.019752' at 2408, the last event read from '/data/mysql_3306/mysqllog/binlog/m3306.019752' at 4705, the last byte read from '/data/mysql_3306/mysqllog/binlog/m3306.019752' at 4705.", 16384, MSG_DONTWAIT, NULL, NULL) = 290
[pid 18722] close(53) = 0
[pid 18722] write(51, "JY\327i\4\254K\f,+\0\0\0]\n\0\0@\0\4\0\0\0\0\0\0\0relay.039510\3\325\302O", 43) = 43
[pid 18722] fdatasync(51) = 0
[pid 18722] close(51) = 0
[pid 18722] close(47) = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.index", O_RDWR|O_CREAT, 0640) = 47
[pid 18722] fdatasync(47) = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.~rec~", O_RDWR|O_CREAT, 0640) = 51
[pid 18722] close(51) = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.~rec~", O_RDWR|O_CREAT, 0640) = 51
[pid 18722] write(51, "/data/mysql_3308/mysqllog/relay/relay.039510\n", 45) = 45
[pid 18722] fdatasync(51) = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.039510", O_RDWR|O_CREAT|O_APPEND, 0640) = 53
[pid 18722] write(53, "\376binJY\327i\17\254K\f,w\0\0\0{\0\0\0@\0\4\0005.7.27-log\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\238\r\0\10\0\22\0\4\4\4\4\22\0\0_\0\4\32\10\0\0\0\10\10\10\2\0\0\0\n\n\n**\0\0224\0\1u`\274\212JY\327i#\254K\f,G\0\0\0\302\0\0\0\300\0\1\0\0\0\0\0\0\0U2\255\3602$\21\361\2208RT\0\270\305e\1\0\0\0\0\0\0\0\n\0\0\0\0\0\0\0\r\0\0\0\0\0\0\0\304dp:", 194) = 194
[pid 18722] fdatasync(53) = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.index_crash_safe", O_RDWR|O_CREAT, 0640) = 54
[pid 18722] write(54, "/data/mysql_3308/mysqllog/relay/relay.039508\n/data/mysql_3308/mysqllog/relay/relay.039509\n", 90) = 90
[pid 18722] write(54, "/data/mysql_3308/mysqllog/relay/relay.039510\n", 45) = 45
[pid 18722] fdatasync(54) = 0
[pid 18722] close(54) = 0
[pid 18722] close(47) = 0
[pid 18722] rename("/data/mysql_3308/mysqllog/relay/relay.index_crash_safe", "/data/mysql_3308/mysqllog/relay/relay.index") = 0
[pid 18722] open("/data/mysql_3308/mysqllog/relay/relay.index", O_RDWR|O_CREAT, 0640) = 47
[pid 18722] fdatasync(47) = 0
[pid 18722] close(51) = 0
[pid 18722] write(53, "eY\327i\4*@Z(+\0\0\0a\22\0\0\0\0\4\0\0\0\0\0\0\0m3306.019753\363}-P", 43) = 43
[pid 18723] open("/data/mysql_3308/mysqllog/relay/relay.039509", O_RDONLY) = 51
[pid 18723] close(51) = 0
[pid 18722] write(2, "2026-04-09T07:47:03.855690Z 1 [ERROR] Error reading packet from server for channel '': could not find next log; the first event 'm3306.019752' at 2408, the last event read from '/data/mysql_3306/mysqllog/binlog/m3306.019752' at 4705, the last byte read from '/data/mysql_3306/mysqllog/binlog/m3306.01"..., 335) = 335
[pid 18722] write(2, "2026-04-09T07:47:03.855736Z 1 [ERROR] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event 'm3306.019752' at 2408, the last event read from '/data/mysql_3306/mysqllog/binlog/m3306.019752' at 4705, the last byte read f"..., 380) = 380
[pid 18722] write(2, "2026-04-09T07:47:03.855762Z 1 [Note] Slave I/O thread exiting for channel '', read up to log 'm3306.019753', position 4\n", 120) = 120
[pid 18722] sendto(50, "\1\0\0\0\1", 5, MSG_DONTWAIT, NULL, 0) = 5
[pid 18722] close(50) = 0
[pid 18723] fsync(4 <unfinished ...>
[pid 18722] +++ exited with 0 +++
[pid 18723] <... fsync resumed>) = 0
[pid 18723] open("/data/mysql_3308/mysqllog/relay/relay.~rec~", O_RDWR|O_CREAT, 0640) = 50
[pid 18723] write(50, "/data/mysql_3308/mysqllog/relay/relay.039508\n", 45) = 45
[pid 18723] fdatasync(50) = 0
[pid 18723] open("/data/mysql_3308/mysqllog/relay/relay.index_crash_safe", O_RDWR|O_CREAT, 0640) = 51
[pid 18723] write(51, "/data/mysql_3308/mysqllog/relay/relay.039509\n/data/mysql_3308/mysqllog/relay/relay.039510\n", 90) = 90
[pid 18723] close(51) = 0
[pid 18723] close(47) = 0
[pid 18723] rename("/data/mysql_3308/mysqllog/relay/relay.index_crash_safe", "/data/mysql_3308/mysqllog/relay/relay.index") = 0
[pid 18723] open("/data/mysql_3308/mysqllog/relay/relay.index", O_RDWR|O_CREAT, 0640) = 47
[pid 18723] fdatasync(47) = 0
[pid 18723] close(50) = 0
[pid 18707] fsync(12) = 0
[pid 18703] fsync(45) = 0
[pid 18703] fsync(12) = 0
[pid 18712] fsync(4) = 0
[pid 18698] fsync(4) = 0
[pid 18712] fsync(4) = 0
[pid 18698] fsync(4) = 0我们发现"#HY000could not find next log; the first event 'm3306."是主库发送过来的, 也验证了我们上面查看的源码逻辑. 从库的逻辑比较简单, 我们就不继续看了.
关键还得看主库的trace信息, 我们使用如下命令:
strace -e trace=fsync,fdatasync,sync,msync,open,close,rename,write,sendto,recvfrom -s 300 -f -p 10747我们跟踪的信息比较多, 有刷盘的fsync,fdatasync,sync,msync, 有文件的open,close,rename,还有网络的sendto,recvfrom
-s 300表示表大小可达300字节(我想看更多的信息)
-f 包含子线程
-p 10747 mysql的进程id
于是我们得到如下信息:
[pid 11493] recvfrom(51, "/\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 11493] recvfrom(51, "\3insert into t20260409 values(repeat('x',2000))", 47, MSG_DONTWAIT, NULL, NULL) = 47
[pid 11493] fsync(4) = 0
[pid 11493] write(33, "\205s\327i\"*@Z(A\0\0\0\262\f\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\4\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0#\215\31\252\205s\327i\2*@Z(G\0\0\0\371\f\0\0\10\0\3\0\0\0\0\0\0\0\3\0\0\32\0\0\0\0\0\0\1 \0\240U\0\0\0\0\6\3std\4!\0!\0!\0db1\0BEGIN\204\360\210c\205s\327i\23*@Z(4\0\0\0-\r\0\0\0\0m\0\0\0\0\0\1\0\3db1\0\tt20260409\0\1\374\1\2\1 \344\v\22\205s\327i\36*@Z(\366\7\0\0#\25\0\0\0\0m\0\0\0\0\0\1\0\2\0\1\377\376\320\7xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 2257) = 2257
[pid 11493] fdatasync(33) = 0
[pid 11511] sendto(52, "D\0\0\1\0\357\0\205s\327i\"*@Z(A\0\0\0\262\f\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\4\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0#\215\31\252J\0\0\2\0\357\0\205s\327i\2*@Z(G\0\0\0\371\f\0\0\10\0\3\0\0\0\0\0\0\0\3\0\0\32\0\0\0\0\0\0\1 \0\240U\0\0\0\0\6\3std\4!\0!\0!\0db1\0BEGIN\204\360\210c7\0\0\3\0\357\0\205s\327i\23*@Z(4\0\0\0-\r\0\0\0\0m\0\0\0\0\0\1\0\3db1\0\tt20260409\0\1\374\1\2\1 \344\v\22\371\7\0\4\0\357\0\205s\327i\36*@Z(\366\7\0\0#\25\0\0\0\0m\0\0\0\0\0\1\0\2\0\1\377\376\320\7xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 2292, MSG_DONTWAIT, NULL, 0) = 2292
[pid 10773] recvfrom(52, "\25\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 10773] recvfrom(52, "\357B\25\0\0\0\0\0\0m3306.019755", 21, MSG_DONTWAIT, NULL, NULL) = 21
[pid 11493] fsync(4) = 0
[pid 11493] close(53) = 0
[pid 11493] write(33, "\205s\327i\4*@Z(+\0\0\0m\25\0\0\0\0\4\0\0\0\0\0\0\0m3306.019756\273\265\314\271", 43) = 43
[pid 11493] fdatasync(33) = 0
[pid 11493] close(33) = 0
[pid 11493] close(3) = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.index", O_RDWR|O_CREAT, 0640) = 3
[pid 11493] fdatasync(3) = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.~rec~", O_RDWR|O_CREAT, 0640) = 33
[pid 11493] close(33) = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.~rec~", O_RDWR|O_CREAT, 0640) = 33
[pid 11493] write(33, "/data/mysql_3306/mysqllog/binlog/m3306.019756\n", 46) = 46
[pid 11493] fdatasync(33) = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.019756", O_WRONLY|O_CREAT, 0640) = 53
[pid 11493] write(53, "\376bin\205s\327i\17*@Z(w\0\0\0{\0\0\0\1\0\4\0005.7.27-log\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\238\r\0\10\0\22\0\4\4\4\4\22\0\0_\0\4\32\10\0\0\0\10\10\10\2\0\0\0\n\n\n**\0\0224\0\0011\33\"\247\205s\327i#*@Z(\37\0\0\0\232\0\0\0\200\0\0\0\0\0\0\0\0\0\373\2\245\250", 154) = 154
[pid 11493] fdatasync(53) = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.index_crash_safe", O_RDWR|O_CREAT, 0640) = 54
[pid 11493] write(54, "/data/mysql_3306/mysqllog/binlog/m3306.019735\n/data/mysql_3306/mysqllog/binlog/m3306.019736\n/data/mysql_3306/mysqllog/binlog/m3306.019737\n/data/mysql_3306/mysqllog/binlog/m3306.019738\n/data/mysql_3306/mysqllog/binlog/m3306.019739\n/data/mysql_3306/mysqllog/binlog/m3306.019740\n/data/mysql_3306/mysqllo"..., 966) = 966
[pid 11493] write(54, "/data/mysql_3306/mysqllog/binlog/m3306.019756\n", 46) = 46
[pid 11493] fdatasync(54) = 0
[pid 11493] close(54) = 0
[pid 11493] close(3) = 0
[pid 11493] rename("/data/mysql_3306/mysqllog/binlog/m3306.index_crash_safe", "/data/mysql_3306/mysqllog/binlog/m3306.index") = 0
[pid 11493] open("/data/mysql_3306/mysqllog/binlog/m3306.index", O_RDWR|O_CREAT, 0640) = 3
[pid 11493] fdatasync(3) = 0
[pid 11493] close(33) = 0
[pid 11493] sendto(51, "\7\0\0\1\0\1\0\2\0\0\0", 11, MSG_DONTWAIT, NULL, 0) = 11
[pid 11511] close(49 <unfinished ...>
[pid 11493] recvfrom(51, <unfinished ...>
[pid 11511] <... close resumed>) = 0
[pid 11493] <... recvfrom resumed>0x7f172800a250, 4, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 11511] open("/data/mysql_3306/mysqllog/binlog/m3306.019756", O_RDONLY) = 33
[pid 11511] sendto(52, ".\0\0\1\0\357\0\205s\327i\4*@Z(+\0\0\0m\25\0\0\0\0\4\0\0\0\0\0\0\0m3306.019756\273\265\314\271.\0\0\2\0\357\0\0\0\0\0\4*@Z(+\0\0\0\0\0\0\0 \0\4\0\0\0\0\0\0\0m3306.019756\322\23\231\223z\0\0\3\0\357\0\205s\327i\17*@Z(w\0\0\0{\0\0\0\0\0\4\0005.7.27-log\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\238\r\0\10\0\22\0\4\4\4\4\22\0\0_\0\4\32\10\0\0\0\10\10\10\2\0\0\0\n\n\n**\0\0224\0\0011\33\"\247\"\0\0\4\0\357\0\205s\327i#*@Z(\37\0\0\0\232\0\0\0\200\0\0\0\0\0\0\0\0\0\373\2\245\250", 264, MSG_DONTWAIT, NULL, 0) = 264
[pid 10759] fsync(12) = 0
[pid 10755] fsync(30) = 0
[pid 10755] fsync(12) = 0
[pid 10764] fsync(4) = 0
[pid 10750] fsync(4) = 0
[pid 10764] fsync(4) = 0
[pid 10750] fsync(4) = 0简单整理整理后得到如下逻辑:
11493 客户端连接,执行SQL语句的
11511 Binlog Dump线程
10773 Ack_receiver
11493(客户端连接对应的服务端线程)
发送insert语句
写Binlog,并刷盘
11511(Binlog Dump线程)
发送binlog给从库(就是那个insert语句的)
10773(Ack_receiver)
OK
11493(客户端连接对应的服务端线程)
写rotate_event到binlog
fdatasync该binlog
关闭该Binlog文件
关闭index file文件(根据fd可在/proc/PID/fd下面找到各fd对应的文件,要提前查看)
打开index file
fdatasync(index file)
打开m3306.~rec~并关闭(测试)
打开m3306.~rec~
把新生成的binlog信息写到m3306.~rec~
fdatasync(fdatasync)
打开index_crash_safe文件
把index file的信息写到index_crash_safe文件中
把新生成的binlog信息写到index_crash_safe
fdatasync(index_crash_safe),关闭index_crash_safe, 关闭index file
rename(index_crash_safe, index_file)
打开index file文件
fdatasync(index file)
关闭m3306.~rec~
11511(Binlog Dump线程)
关闭旧的binlog文件
打开新的binlog文件
发送新的binlog中的event给从库
整体逻辑和上面源码分析的一致(只是少了点细节), 我们发现rename("/data/mysql_3306/mysqllog/binlog/m3306.index_crash_safe", "/data/mysql_3306/mysqllog/binlog/m3306.index") = 0 之后确实没有做相关目录的fsync操作.
根据最上面源码的分析,我们就可以对其进行断点分析, 但结果和上面的一致, 只是多了更多的细节而已, 比如generate_new_name 就是生成新的binlog名字给new_name.(长度是512字节). 量太多不再演示, 有兴趣的读者可自行断点验证.
我们无法直接复现出来, 但我们可以在close_crash_safe_index_file和move_crash_safe_index_file_to_index_file打断点, 等close完, 我们手动修改safe_index文件,然后再continue. 但这种方法恢复的时候需要手动修改index文件,并且flush下(重新打开index file)
gdb /soft/mysql_3306/mysqlbase/mysql/bin/mysqld
set non-stop on
set target-async on
set pagination off
break MYSQL_BIN_LOG::close_crash_safe_index_file
break MYSQL_BIN_LOG::move_crash_safe_index_file_to_index_file
run --defaults-file=/data/mysql_3306/conf/mysql_3306.cnf --basedir=/soft/mysql_3306/mysqlbase/mysql --datadir=/data/mysql_3306/mysqldata --plugin-dir=/soft/mysql_3306/mysqlbase/mysql/lib/plugin --user=mysql --log-error=/data/mysql_3306/mysqllog/dblogs/mysql3306.err --open-files-limit=123456 --pid-file=/data/mysql_3306/run/mysql.pid --socket=/data/mysql_3306/run/mysql.sock --port=3306
# 启动的时候也会生成新的binlog, 所以要跳过两次断点
continue
continue
# 触发binlog切换
continue 之后手动修改/data/mysql_3306/mysqllog/binlog/m3306.index_crash_safe, 删除新增的内容
continue 之后再观察从库的io_thread状态之前测试过程,发现修改binlog checksum参数后flush binary logs触发了一次could not find next log, 但为偶然, 无法(稳定)复现
mysql主从是从库连上之后,就只有主库发送event(从库顶多ack),连报错信息都是主库发过去的.
mysql轮转日志的时候实际上是生成一个新的文件,然后将binlog信息写入新的文件, 再将新的文件覆盖到旧的文件, 整个过程,除了rename外都做了fsync/fdatasync (fdatasync不更新元数据信息), 所以本次的"could not find next log"可能问题就出在这. 当然解决方法很简单, 直接执行start slave就能恢复, 毕竟只是fake.
虽然说了这么多, 但其实并没有完全复现.
参考
https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-binary-logs
https://community.ibm.com/community/user/discussion/file-directory-metadata-fsync-issues-on-failover
https://www.percona.com/blog/mysql-replication-got-fatal-error-1236-causes-and-cures/
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。