Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewind_single_block causes node to become unresponsive #21

Open
bladedoyle opened this issue Dec 13, 2020 · 0 comments
Open

rewind_single_block causes node to become unresponsive #21

bladedoyle opened this issue Dec 13, 2020 · 0 comments
Labels
Grin-Node Test cases for GRIN-Node ready-for-test Ready to be assigned and be worked on Test-Case New test case

Comments

@bladedoyle
Copy link
Collaborator

bladedoyle commented Dec 13, 2020

Some causes of rewind_single_block were fixed before 5.0.beta2 but the issue remains.

Issue: The node enters a loop of "rewind_single_block" and becomes unresponsive to API and fails to sync new blocks.

Testcase:
Start with an archive node
Call the nodes "get_block" API to fetch each block from 0 to tip.
Observe the node logs for message like 20201213 16:47:41.127 WARN grin_chain::txhashset::txhashset - rewind_single_block: 3 output_pos entries missing for: 000251721c7d at 781244

Expected Result:
no api timeouts
node should not fall behind in sync

Actual Results:
Observe that the node gives api timeouts while that rewind event is happening
Observe that immediately after the rewind event concludes the node reports being out of sync:

...
...
20201213 16:47:41.125 WARN grin_chain::txhashset::txhashset - rewind_single_block: 1 output_pos entries missing for: 00025c08d11f at 781245
20201213 16:47:41.127 WARN grin_chain::txhashset::txhashset - rewind_single_block: 3 output_pos entries missing for: 000251721c7d at 781244
20201213 16:47:45.636 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 1731290747276436, peer_difficulty 1731293920230257, threshold 1212787626 (last 5 blocks), enabling sync
20201213 16:47:48.554 INFO grin_servers::grin::sync::syncer - synchronized at 1731293920230257 @ 1000652 [000393625bde]

Some additional information

While the node is unresponsive:

  1. its doing a lot of disk IO. From iotop:
Current DISK READ:      23.92 M/s | Current DISK WRITE:      42.91 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 288633 be/4 ubuntu     22.27 M/s    0.00 B/s  0.00 % 97.18 % grin [peer_read]
  1. it reports dozens or hundreds of "no peers" messages
20201214 22:49:22.576 WARN grin_servers::grin::sync::syncer - sync: no peers available, disabling sync
20201214 22:49:52.942 WARN grin_servers::grin::sync::syncer - sync: no peers available, disabling sync
20201214 22:50:01.959 WARN grin_servers::grin::sync::syncer - sync: no peers available, disabling sync
  1. The API accepts connections but does not reply after that:
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=3413): Read timed out. (read timeout=20)
  1. After 5 or 10 minutes there is a flood of "rewind_single_block" messages
20201214 22:59:44.788 WARN grin_chain::txhashset::txhashset - rewind_single_block: 1 output_pos entries missing for: 0002ac2a7bd7 at 986478
20201214 22:59:44.788 WARN grin_chain::txhashset::txhashset - rewind_single_block: 1 output_pos entries missing for: 0005580fd69e at 986477
20201214 22:59:44.809 WARN grin_chain::txhashset::txhashset - rewind_single_block: 1 output_pos entries missing for: 0005048edfd7 at 986474
...

It appears that this is the thread and stack trace for the event:

(gdb) thread 69
[Switching to thread 69 (Thread 0x7e9d675e6700 (LWP 288633))]
#0  0x00005648d2915664 in mdb_page_search_root ()
(gdb) where
#0  0x00005648d2915664 in mdb_page_search_root ()
#1  0x00005648d2915975 in mdb_page_search ()
#2  0x00005648d29160b2 in mdb_cursor_set ()
#3  0x00005648d29166e4 in mdb_get ()
#4  0x00005648d28d970b in grin_store::lmdb::Batch::get_ser ()
#5  0x00005648d28e9f0f in grin_chain::store::Batch::get_block_header ()
#6  0x00005648d28f89b5 in grin_chain::pipe::rewind_and_apply_fork ()
#7  0x00005648d28678c5 in grin_chain::txhashset::txhashset::extending_readonly ()
#8  0x00005648d28bd498 in grin_chain::chain::Chain::convert_block_v2 ()
#9  0x00005648d2462817 in <grin_servers::common::adapters::NetToChainAdapter<B,P,V> as grin_p2p::types::ChainAdapter>::get_block ()
#10 0x00005648d28595ab in <grin_p2p::peers::Peers as grin_p2p::types::ChainAdapter>::get_block ()
#11 0x00005648d280986e in <grin_p2p::peer::TrackingAdapter as grin_p2p::types::ChainAdapter>::get_block ()
#12 0x00005648d2837a15 in <grin_p2p::protocol::Protocol as grin_p2p::conn::MessageHandler>::consume ()
#13 0x00005648d282a726 in grin_p2p::conn::poll::{{closure}} ()
#14 0x00005648d2825026 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#15 0x00005648d280a446 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#16 0x00005648d2ac03aa in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
#17 <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
#18 std::sys::unix::thread::Thread::new::thread_start () at library/std/src/sys/unix/thread.rs:87
#19 0x00007f1d6d679609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#20 0x00007f1d6d911293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) 
@marekyggdrasil marekyggdrasil added Grin-Node Test cases for GRIN-Node ready-for-test Ready to be assigned and be worked on Test-Case New test case labels Dec 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Grin-Node Test cases for GRIN-Node ready-for-test Ready to be assigned and be worked on Test-Case New test case
Projects
None yet
Development

No branches or pull requests

2 participants