Fix persistence and add automatic journal flushing #98
…e when writing large data ResizableFile only extended once even when the data to be written to it was larger than that. For example, if the file is currently 1024 B and we're trying to write 1.5 KiB to it, it would get resized to 2048 B and the write would fail (because 2.5 KiB would be required).
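The failure mode in this commit message can be illustrated with a small sketch. The helper name `grown_size` and the doubling policy are assumptions made for illustration; this is not the actual ResizableFile code.

```python
def grown_size(current_size, offset, data_len):
    """Return a file size that can hold data_len bytes written at offset.

    Hypothetical helper: keeps doubling until the write fits,
    instead of doubling only once.
    """
    required = offset + data_len
    new_size = max(current_size, 1)
    while new_size < required:
        new_size *= 2
    return new_size


# A 1024 B file receiving 1.5 KiB at its end needs 2.5 KiB (2560 B) in total.
# A single doubling would stop at 2048 B, which is too small.
size = grown_size(1024, 1024, 1536)
```

With the numbers from the message, the loop doubles twice and ends at 4096 B.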
The FileJournal only stores the hash of the votedForNodeId because the size of the stored value has to be constant. Therefore, FileJournal.votedForNodeId does not return the actual node ID but a proxy object which can be compared against the node ID.
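As a rough sketch of what such a proxy object could look like: the class name `VotedForNodeIdProxy`, the helper `hash_node_id`, and the assumption that node IDs are strings are all hypothetical and not taken from the FileJournal code.

```python
import hashlib


def hash_node_id(node_id):
    """Return a fixed-size (16-byte) MD5 digest of a node ID."""
    return hashlib.md5(str(node_id).encode('utf-8')).digest()


class VotedForNodeIdProxy(object):
    """Hypothetical stand-in for a node ID of which only the hash is stored."""

    def __init__(self, digest):
        self._digest = digest

    def __eq__(self, other):
        if isinstance(other, VotedForNodeIdProxy):
            return self._digest == other._digest
        # Compare against a real node ID by hashing it the same way.
        return self._digest == hash_node_id(other)

    def __ne__(self, other):
        # Python 2 does not derive __ne__ from __eq__.
        return not self.__eq__(other)


stored = VotedForNodeIdProxy(hash_node_id('localhost:4321'))
```

The proxy cannot return the original ID, but `stored == 'localhost:4321'` still answers the only question the election logic needs: whether the vote went to a given node.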
…ed or no file journal is used
Interesting failures. The |
- The test was very sensitive to small timing differences. This updated version is still sensitive, but it should be much better now. I was unable to trigger any failures even under high load.
- o1 should be a follower after o2's first vote request due to the higher term in o2's message and its timeout not having been triggered again.
- Python 2's math.ceil returns a float instead of an int for whatever reason.
Alright, I blame the Travis environment. I'm unable to reproduce it on either 2.7 or 3.6 on my machine. Multiple runs always complete successfully. And everything passed on AppVeyor as well. These failures don't mean that the code is wrong, by the way. Instead, it seems that the test suite's I can think of two workarounds. The first would be to increase the timeouts even more such that the difference between a timeout being triggered by |
The doTicks approach did not work well because it isn't very accurate and doesn't take into account the time spent outside of the actual ticking. So this replaces it with a direct loop which compares against the current time to ensure that exactly the right amount of time has elapsed since the creation of the SyncObj. This test may break again if the time spent inside SyncObj increases greatly; for example, if __init__ takes very long, the start time would not match the corresponding values inside the objects. Each individual tick must also not take too long.
So everything except 3.5 passed on both Travis and AppVeyor, and I still can't reproduce that failure. I'm also not sure what could cause a timing issue with the current code.
Hard week. I'll try to look at this tomorrow. Thanks for the PR!
# We can't be sure whether this node voted in the current term yet, so don't vote until maxTimeout has elapsed.
# Factor 1.1 just to be safe.
# Note that an entire non-flushing cluster should always use the same maximum timeout to ensure that this method works!
self.__voteBlockTime = time.time() + 1.1 * self.__conf.raftMaxTimeout
I can't see any example of a cluster where some nodes have flushing enabled and others don't, so there is no need for this 1.1 constant. Also, I'm not sure this gives any guarantee; it looks like a hack.
I don't see how having such a cluster relates to this change.
You're right that this factor 1.1 is a hack. The entire voteBlockTime thing is a giant hack really. The Raft algorithm requires that the term and vote (and the log) are stored to stable storage before responding to vote requests (or any other message), and using an in-memory journal or disabling flushing violates this requirement.
The Raft algorithm never relies on timing anywhere to guarantee consistency. Clock skew therefore won't affect reliability as long as the algorithm is implemented correctly. But in this particular case, clock skew will affect reliability (because the algorithm's assumptions are violated as mentioned above). For example, if the restarting node's clock is running faster than the rest of the cluster's clocks, it could still vote twice within a term. The factor 1.1 is a hacky fix for this issue in all but the most severe cases (if the clock is running more than 10 % too fast, something's seriously wrong).
In my opinion, using an in-memory journal or not flushing the journal file should not be an option in the first place since it violates a key assumption in Raft. I realise that this has a severe performance impact, but well, that's the cost for reliability. I prefer a slightly slower, reliable solution over a more performant one with nasty timing bugs that are nearly impossible to reproduce or debug.
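To make the mechanism concrete, here is a minimal sketch of such a vote block. The class `VoteGate`, its parameters, and the injectable `clock` are hypothetical names for illustration; the actual change only sets `__voteBlockTime` as shown in the diff.

```python
import time


class VoteGate(object):
    """Hypothetical helper: refuse to vote for a while after a (re)start."""

    def __init__(self, raft_max_timeout, safety_factor=1.1, clock=time.time):
        self._clock = clock
        # Any election this node may have voted in before restarting
        # should be over once the maximum timeout (plus margin) has passed.
        self._block_until = clock() + safety_factor * raft_max_timeout

    def may_vote(self):
        return self._clock() >= self._block_until


# Usage with a controllable clock instead of real time.
now = [100.0]
gate = VoteGate(raft_max_timeout=1.0, clock=lambda: now[0])
blocked_at_start = not gate.may_vote()
now[0] = 101.2
allowed_later = gate.may_vote()
```

The safety factor only compensates for a local clock that runs up to 10 % fast; it does not restore the guarantee that persisting the vote would give.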
OK, but maybe it would be a good idea at least to move it to a config (instead of hardcoding it in place)?
Sounds reasonable. I'll do that.
Looked briefly, left a few comments.
# Specifically, o2 is expected to ignore the messages until 1.1 * timeout, i.e. including the one sent by o1 after 2.5 seconds, except for updating its term.
# Note that o1 has flushing enabled but o2 doesn't!

def tick(o1, o2, startTime, totalTickTime, sleepTime):
Time-specific tests will randomly fail sometimes; you should mock time or not rely on time at all.
Oh, great idea. I'll implement that. 👍
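A minimal sketch of what mocking time could look like, assuming the code under test reads the clock via `time.time()`. The `Countdown` class is a hypothetical stand-in for a time-dependent object, not part of the library. `unittest.mock` is in the standard library on Python 3; on Python 2.7 the `mock` backport package would be needed instead.

```python
import time
from unittest import mock


class Countdown(object):
    """Hypothetical time-dependent object standing in for the code under test."""

    def __init__(self, timeout):
        self._deadline = time.time() + timeout

    def expired(self):
        return time.time() >= self._deadline


def run_with_fake_clock():
    fake_now = [1000.0]
    # Replace time.time with a function the test fully controls.
    with mock.patch('time.time', new=lambda: fake_now[0]):
        countdown = Countdown(2.5)
        before = countdown.expired()
        fake_now[0] += 2.5  # "wait" 2.5 seconds without sleeping
        after = countdown.expired()
    return before, after
```

Because the clock only advances when the test says so, the outcome no longer depends on machine load or on how long each tick takes.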
I finally had some time to benchmark this, and it doesn't look good. With flushing disabled, I managed 14k RPS on my test machine (with a request size of 10 and 3 nodes). With flushing enabled, I managed 40. Yes, you read that right: forty instead of fourteen thousand, i.e. about 350 times slower! This doesn't surprise me too much since flushing is a very expensive operation, although I'd've expected to at least reach a few hundred RPS.
I'm not sure how to improve this. Flushing only every so often doesn't make sense as mentioned in another comment already, and I can't think of anything other than reducing the flushing frequency. If you have any ideas, I'd be happy to hear about them.
As a sidenote, I also tested it with the journal file stored on a tmpfs. Obviously, that doesn't make any sense for actual usage, but it allows for testing the performance of the high-level flushing code. In this test, I'm able to reach the same ~14k RPS. This shows that it's the actual flushing to disk that murders the performance, not my code for calling the flushing. That shouldn't surprise anyone, but I still thought I'd mention it.
I tested this with the existing benchmarking code with a few changes to accommodate the journal file and flushing configuration settings. The patch file is here.
I think that we can try to flush once per multiple writes. But we shouldn't increment lastCommitIdx until the flush.
There isn't any benefit to that over just disabling flushing entirely though. The Raft spec requires that nodes do not respond to RPCs before persisting those values to disk. So if we only flush after a while, we're still in the same situation as master is now: there is no guarantee that a node hasn't voted, so it has to wait, which introduces timing issues into the cluster.
There is a batch mode - we already wait a little before sending commands. If we add another 0.1 sec. delay - that will be much better than reducing overall performance.
Ah, so the |
Ok, that helps, but unfortunately the batches seem to be rather small (less than 10 entries in my tests), so it only provides a fairly small improvement. My test machine now manages about 70 RPS instead of 40.
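The batching idea discussed above (flush once per batch, and do not treat entries as committed before the flush) could be sketched roughly as follows. `BatchedJournal` and its attributes are hypothetical names for illustration, not the FileJournal API.

```python
import os
import tempfile


class BatchedJournal(object):
    """Hypothetical journal that fsyncs once per batch, not once per entry."""

    def __init__(self, path):
        self._file = open(path, 'ab')
        self.appended = 0     # entries handed to the OS, possibly only buffered
        self.durable = 0      # entries known to have reached the disk
        self.flush_count = 0

    def append(self, entry):
        self._file.write(entry)
        self.appended += 1

    def flush(self):
        self._file.flush()
        os.fsync(self._file.fileno())
        # Only now is it safe to acknowledge or commit these entries.
        self.durable = self.appended
        self.flush_count += 1

    def close(self):
        self._file.close()


fd, path = tempfile.mkstemp()
os.close(fd)
journal = BatchedJournal(path)
for _ in range(10):
    journal.append(b'0123456789')
durable_before_flush = journal.durable
journal.flush()
durable_after_flush = journal.durable
journal.close()
os.remove(path)
```

The cost of one fsync is shared by all entries in the batch, so the gain depends directly on the batch size, which matches the small improvement observed with batches of fewer than 10 entries.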
I'm not sure that it's possible to achieve this using the existing batch mode in append_entries. I think you should try the following:
Yeah. It looks like the |
Any further progress on this issue? I am really keen to see this improvement on RPS.
This fixes #84.
A new configuration option, flushJournal, controls whether the journal should be flushed to disk on every write. This is enabled by default when using a file-based journal (i.e. when journalFile is not None). It is an error to specify flushJournal = True for the memory-based journal.
To store the currentTerm and the votedForNodeIdHash in the journal, the header had to be expanded, which also means that the journal file format is now version 2. Migration is performed upon restarting with the upgraded code (and there's also a test for this). Regarding details about votedForNodeIdHash (and why this is an MD5 hash), see my recent comment in #84. The FileJournal now also verifies that the journal file version is supported by the code and throws a RuntimeError if that is not the case; previously, it simply always tried to read it as a version 1 file.
If journal flushing is disabled, the node will not take part in elections until 1.1 times the maximum timeout has elapsed. This is necessary to prevent a node from voting twice in the same election, which could lead to two leaders being elected within a single term. It can however lead to a longer delay before the cluster is operational again after failover.
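As an illustration of a fixed-size header with a version check, here is a minimal sketch. The byte layout (a 4-byte version, an 8-byte term, and a 16-byte MD5 digest, little-endian) and the function names are assumptions; the real journal format may differ, and migration from version 1 is left out.

```python
import struct

# Assumed layout, for illustration only.
VERSION_FIELD = struct.Struct('<I')    # format version, stored first
V2_FIELDS = struct.Struct('<Q16s')     # currentTerm, MD5 of votedForNodeId
CURRENT_VERSION = 2


def pack_header(current_term, voted_for_hash):
    """Serialize a version 2 header; its size never changes."""
    return (VERSION_FIELD.pack(CURRENT_VERSION)
            + V2_FIELDS.pack(current_term, voted_for_hash))


def parse_header(data):
    """Return (currentTerm, votedForNodeIdHash) or raise on unknown versions."""
    (version,) = VERSION_FIELD.unpack_from(data, 0)
    if version != CURRENT_VERSION:
        raise RuntimeError('unsupported journal version: %d' % version)
    return V2_FIELDS.unpack_from(data, VERSION_FIELD.size)


header = pack_header(7, b'\x01' * 16)
term, voted_hash = parse_header(header)
```

Storing a 16-byte digest rather than the node ID itself is what keeps the header size constant regardless of how long the ID is.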
This furthermore fixes a crash in the ResizableFile when performing a big write (compared to the current file size). See commit message of 38d6adf for details.