-
Notifications
You must be signed in to change notification settings - Fork 178
1.4.2patch
- merged to master
- code complete October 9, 2013
- development started October 4, 2013
A Riak CS customer experienced performance problems with the leveldb backend. The data written to leveldb varied in size from 2Mbytes to 30Mbytes, and it was updated (rewritten) often. The leveldb LOG file showed frequent "waiting 2" messages indicating leveldb had stalled write operations waiting upon the Imm write buffer to be flushed.
The user experienced pain was that timeouts were occurring. And there were error messages about fadvise and fdatasync errors in the syslog.
This branch targets two independent problems:
-
PosixMmapFile when used asynchronously by the recovery log has a race condition related to close operations. This caused the syslog error messages. It also implies the recovery log likely has minor corruption at the file end (which is only critical if the file is actually used in a database recovery).
-
The large key/value pairs, overwritten often, resulted in a high number of small, level-0 files. leveldb handles this situation poorly.
NOTE: This branch has not been "cleaned up". It intentionally contains dead code due to changes. It contains debug LOG messages and heavy debug syslog messages. The branch has not been tested in the customer environment yet and any of the "messy" items might be needed to clarify further work. Cleaning will happen before application to Riak 2.0 (and maybe a future 1.4 release).
The PosixMmapFile async problem was partially addressed in a previous patch. That patch corrected PosixMmapFile::UnmapCurrentRegion() to only use async unmapping if the file was a recovery log. The problem that remained (undetected) was that Env::Default()->Schedule() did not place sequential file operations on the same thread. It was therefore possible for an unmap with close operation to process prior to simple unmap operations that precede it.
This branch adds logic to maintain a reference count of active objects related to a file. The PosixMmapFile object is one type of "active object". A background BGCloseInfo object is the other. The reference count is malloc'd in memory and works similar to an std::shared_ptr. The difference is that std::shared_ptr is not guaranteed thread safe and this reference counter is. The object that decrements the reference counter to zero causes the file to close. This guarantees that all pending writes have completed (in any order).
The malloc'd ref_count_ is actually a 2 element uint64_t array. The first element is the reference count. The count is incremented and decremented via atomic operations. The second element is the final file size. It is set upon call to PosixMmapFile::Close(). The file is extended in chunks during write operations to facilitate memory mapped operations. The object that close the file will also adjust the file size is to match the actual count of bytes used.
Files edited for this fix:
-
leveldb/include/leveldb/atomics.h: this file originated in eleveldb for the 1.3 release. It is also part of pending changes to the leveldb 2.0 release. While it is "new" to 1.4.2, it is an already reviewed, production file.
-
leveldb/util/env_posix.cc: logic changes were made to the PosixMmapFile class relating to the use of ref_count_. There was a previous mess of four special purpose background task functions: BGFileCloser(), BGFileCloser2(), BGFileUnmapper(), and BGFileUnmapper2. The mess is now reduced to one, general purpose BGFileUnmapper2 function. The asynchronous close is contained within a new function ReleaseRef() that manages the ref_count_ atomically and also performs the file resize / close when count is zero.