-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathvldb_sample.tex
executable file
·798 lines (690 loc) · 41 KB
/
vldb_sample.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
% THIS IS AN EXAMPLE DOCUMENT FOR VLDB 2012
% based on ACM SIGPROC-SP.TEX VERSION 2.7
% Modified by Gerald Weber <[email protected]>
% Removed the requirement to include *bbl file in here. (AhmetSacan, Sep2012)
% Fixed the equation on page 3 to prevent line overflow. (AhmetSacan, Sep2012)
\documentclass{vldb}
\usepackage{graphicx}
\usepackage{balance} % for \balance command ON LAST PAGE (only there!)
\usepackage{multirow}
\usepackage{tabularx}
\usepackage{colortbl}
\definecolor{table_header}{RGB}{221,221,221}
% Include information below and uncomment for camera ready
\vldbTitle{Deferred secondary index update with LSM trees}
\vldbAuthors{Vladislav Shpilevoy, Konstantin Osipov, Dmitry Volkanov}
\vldbDOI{https://doi.org/TBD}
\begin{document}
% ****************** TITLE ****************************************
\title{Deferred secondary index update with {\ttlit LSM trees}}
% possible, but not really needed or used for PVLDB:
%\subtitle{[Extended Abstract]
%\titlenote{A full version of this paper is available as\textit{Author's Guide to Preparing ACM SIG Proceedings Using \LaTeX$2_\epsilon$\ and BibTeX} at \texttt{www.acm.org/eaddress.htm}}}
% ****************** AUTHORS **************************************
% You need the command \numberofauthors to handle the 'placement
% and alignment' of the authors beneath the title.
%
% For aesthetic reasons, we recommend 'three authors at a time'
% i.e. three 'name/affiliation blocks' be placed beneath the title.
%
% NOTE: You are NOT restricted in how many 'rows' of
% "name/affiliations" may appear. We just ask that you restrict
% the number of 'columns' to three.
%
% Because of the available 'opening page real-estate'
% we ask you to refrain from putting more than six authors
% (two rows with three columns) beneath the article title.
% More than six makes the first-page appear very cluttered indeed.
%
% Use the \alignauthor commands to handle the names
% and affiliations for an 'aesthetic maximum' of six authors.
% Add names, affiliations, addresses for
% the seventh etc. author(s) as the argument for the
% \additionalauthors command.
% These 'additional authors' will be output/set for you
% without further effort on your part as the last section in
% the body of your article BEFORE References or any Appendices.
\numberofauthors{3} % in this sample file, there are a *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.
\author{
% You can go ahead and credit any number of authors here,
% e.g. one 'row of three' or two rows (consisting of one row of three
% and a second row of one, two or three).
%
% The command \alignauthor (no curly braces needed) should
% precede each author name, affiliation/snail-mail address and
% e-mail address. Additionally, tag each line of
% affiliation/address with \affaddr, and tag the
% e-mail address with \email.
%
% 1st. author
\alignauthor
Vladislav Shpilevoy
\affaddr{Lomonosov Moscow State University}\\
\email{[email protected]}
% 2nd. author
\alignauthor
Konstantin Osipov
\affaddr{Tarantool}\\
\email{[email protected]}
% 3rd. author
\alignauthor
Dmitry Volkanov
\affaddr{Lomonosov Moscow State University}\\
\email{[email protected]}
}
\date{24 January 2018}
% Just remember to make sure that the TOTAL number of authors
% is the number that will appear on the first page PLUS the
% number that will appear in the \additionalauthors section.
\maketitle
\begin{abstract}
Log-structured merge-trees (LSM trees) are becoming more and more widely
adopted as a staple data structure in secondary storage database management
systems, ending the half-a-century-long dominance of B-trees. The major
advantage of an LSM tree is that it always writes new data and old data updates
to disk sequentially. It is made possible due to the ability of the LSM tree to store multiple
versions of the same key - it allows LSM tree-based tables to not read and delete
old data explicitly from the primary index on such operations as deletion or
replacement. Old data is deleted by new data during the compaction instead. But
this advantage is nullified by secondary indexes, because replace/delete-like
operations require reading and explicitly deleting old data from each secondary index.
This paper presents an LSM tree modification that allows not reading any indexes
on replacement/deletion even if there are secondary indexes in the table.
Experimental research of the modified LSM tree shows a 1.5-10-times increase
in the write speed on tables with 2-4 non-unique secondary indexes, as compared
with the original LSM tree. Write performance of the new LSM tree grows linearly with
the number of secondary indexes.
\end{abstract}
\section{Introduction}
Log-structured merge-trees were developed for write-intensive tasks in 1990s
and were used in file systems and for creating backups. But their rise to dominance
was hampered by hidden reads.
These are the reads that accompany writes: for example, when it is necessary to delete
old data from a secondary index after an update, or check a unique or a foreign
key constraint. A significant number of reads in a modern database are hidden.
Such reads, when followed by a write, cost nearly nothing in a B-tree, but can
reduce the performance of an LSM tree-based table insomuch that its write
advantage becomes diluted.
With the emergence of solid-state drives (SSDs), the cost of random writes started
having a much heavier impact on the overall performance than the cost of random reads,
and nowadays the LSM tree is considered to be a standard data structure for
databases. For example, it is used in LevelDB, RocksDB, Cassandra, Tarantool,
BigTable, HBase, Riak, MySQL (MyRocks).
However, even SSDs do not help much in one particular widespread case that undermines
the ability of the LSM tree to store multiple versions of the same keys. It has to do
with the LSM tree compaction algorithm, and manifests itself in tables with multiple indexes.
In a primary LSM tree-based index, any replace or delete operation is executed
with no hidden reads because of the ability of the LSM tree to store multiple key versions.
A new tuple is just inserted into memory, and after some compaction, the old
tuples get discarded. But it becomes impossible with secondary indexes: if a table
contains secondary indexes, any update operation must read an old tuple from the
primary index, extract secondary keys from it, and explicitly delete them from each
secondary index. That is, updating such a table produces hidden reads from the
primary index.
Reading from the primary index and explicit deletion of old tuples cannot be
avoided in the classic LSM tree, because it deletes old versions of the key only
if they are equal by key. If a request changes a secondary key in an existing
tuple and does not delete the old one, then it is not deleted from a secondary
index. As a result, the LSM tree regards the new tuple and the old one not as
different versions of the same key, but as two different keys.
This paper describes a new LSM tree compaction algorithm as well as new LSM
tree-based table update and read algorithms. Replace and delete operations on the new
LSM tree do not require any reads regardless of the number of secondary indexes
if all of them are non-unique. Experiments show exponential growth in the number
of requests per second (RPS) with some workload types and some schemas. For example,
on a table with 3 non-unique secondary indexes and with batched replace/delete
requests, the number of RPS increased up to 10 times.
\section{Background and motivation}
This section provides some basic information on the LSM tree: how it can be
stored on disk and in memory; how compaction, updates, and reads work;
which LSM tree parameters are typically worth configuring.
LSM tree is a data structure optimized for write-intensive tasks. It has
multiple levels: level zero is stored in memory, and the remaining levels on disk.
Level zero accumulates updates and is periodically dumped to disk, thus becoming
level one. After being dumped, level zero is cleared and starts collecting new updates.
Using level zero allows writing updates to disk in batches regardless of
their order, keys, and types. Even delete operations are put in level zero as a tuple of a
special type.
When the number of levels grows too large, the LSM tree is compacted: several levels are
merged into a new one. During the compaction, new tuples discard old tuples with the same
key. For example, the delete operation discards all older data and sometimes itself; the replace
operation discards all older data, but never itself.
The LSM tree can store levels in multiple ways. For example, the memory level can be
organized as a B-tree or a red-black tree. Disk levels can be organized as B-trees or sorted
arrays. In this paper, sorted arrays for disk levels and a B+ tree for the memory level
are considered.
\subsection{Compaction}
\begin{figure}
\centering
\includegraphics[width=0.4\textwidth]{compaction_schema}
\caption{LSM tree ranges dump and compaction}
\label{fig:compaction_schema}
\end{figure}
Each level consists of multiple sublevels. Each sublevel is an array sorted by
key (in case of the LSM tree index, a key consists of index parts) and
containing a specific range of key values. During the compaction, some sublevels of each
level are merge-sorted into a new sorted array. It is known that if one key
version is on level $i$ and another is on level $j < i$, then the key version on level
$j$ is newer, so the other one is discarded when compacted.
Compaction is a procedure needed to reduce the number of levels and to discard old tuples
in order to make room for new ones. Compaction frequency is regulated by the LSM tree
parameters called \textbf{level size ratio} and \textbf{maximal sublevel count}. In each pair of
neighboring levels $i$ and $i + 1$, the size of level $i + 1$ must at least be greater than
the size of level $i$ $*$ the level size ratio. If the maximal sublevel count per
level is exceeded or the level size ratio is violated, then compaction merges as many
sublevels as needed to satisfy both limitations.
In practice, sublevels are produced by dumping the memory level and by splitting index
key values into ranges, where each range stores and maintains its own sublevels
independently of the other ranges (see Figure \ref{fig:compaction_schema}).
\subsection{LSM tree-based table structure}
In an LSM tree-based table, each index is an LSM tree sorted by index parts.
It is a common pattern for a table to have one primary index and several
secondary indexes. A primary index is always unique and stores full tuples,
including all indexed and non-indexed fields, as they were inserted by the user.
A secondary index stores only its own key parts and primary index parts to save
memory. Such indexes are called non-covering. Primary index parts are used as a link to
a full tuple in the primary index, when the search uses a secondary index only.
The described pattern is not a unique feature of LSM tree-based tables. For
example, in SQLite, secondary indexes based on a B-tree are non-covering by default,
too. And each SQLite table has a unique primary index, even if a user did not
specify it explicitly.
Since secondary indexes store primary index parts, they are linked to primary
indexes. And if a tuple does not exist in the primary index, it cannot exist in a
secondary index as well. This is the reason why hidden reads are
indispensable - if a tuple is replaced with another secondary key or is deleted
from the primary index, then its old version must be deleted from each secondary
index by the old secondary key. Otherwise, table indexes will not be consistent.
\subsection{Multiple index update problem}
The ability of the LSM tree to manage multiple versions of a key allows it to avoid hidden
reads on such operations as replacement or deletion, which cannot break consistency if
a table has only the primary index and no foreign key constraints. Such operations are known
as \textbf{blind writes} \cite{kai:slimdb}. Blind writes are much faster than regular insert
or update operations that, in the worst-case scenario, read an old tuple from disk in order
to check for duplicates or apply update operations.
A large problem with LSM tree-based tables is that, with secondary indexes defined,
all writes become non-blind. See Figure \ref{fig:inconsistent_example}
for an example of inconsistency if replace is blind and a table has a secondary
index. Here, a table with 4 columns is defined: column 1 is a part of the primary index,
columns 2 and 3 are parts of a secondary index, and column 4 is not indexed. Before being
executed, \textit{Replace\{1, 5, 6, 7\}} must read the old tuple from the
primary index by key \textit{\{1\}}, extract the secondary key
\textit{\{2, 3\}}, delete it from the secondary index, and only then insert the
new tuple. If the old tuple is not deleted, then during the compaction it becomes
garbage with no link to a full tuple in the primary index.
\begin{figure}
\centering
\includegraphics[width=0.3\textwidth]{inconsistent_example}
\caption{LSM tree-based table inconsistency example}
\label{fig:inconsistent_example}
\end{figure}
Since disk access is much slower than memory access, the presence of a secondary index
cancels out the advantage of the LSM tree. In the next section, a new LSM tree update
algorithm is presented, reviving versions and blind writes.
\section{Design and implementation}
In this section, the basics of the new algorithm are described, followed by three sections,
each focusing on a specific part of the algorithm.
The key idea of the algorithm is that deleting old data can be deferred until the
primary index compaction stage. It allows avoiding any hidden reads on replacement
and deletion by primary key, since these operations use hidden reads only to identify
and delete old secondary key versions, not to check consistency. It works on
tables with non-unique secondary indexes only, because it is impossible to break
the consistency of a non-unique secondary index by using delete or replace operations.
Obviously, it does not hold true for unique secondary indexes. For example,
consider a table with columns \textit{\{a, b\}}, a unique primary index on
\textit{\{a\}} and a unique secondary index on \textit{\{b\}}. Let the table contain
two records: \textit{\{1, 1\}} and \textit{\{2, 2\}}. It is impossible now to do
replace \textit{\{1, 2\}}, because it produces duplicates in the secondary index.
Of course, in MySQL, for example, replace would delete both the old records, but
this kind of replacement cannot be deferred, because during the primary index
compaction there is no way of knowing that a new tuple with one primary key
discards an old tuple with another primary key without an explicitly inserted
deletion. This multi-replace operation must delete old records with different
primary keys and is not considered in the paper.
The deferred update algorithm consists of three separate parts:
\begin{enumerate}
\item Updating level zero of all indexes. This is where replacement and deletion by
primary key are inserted into memory and make secondary and primary indexes
malsynchronized.
\item Compaction, which differs for primary and secondary indexes. The primary
index detects tuples to discard and sends them to a secondary index as a new
LSM tree sublevel. A secondary index takes into account sublevels received from
the primary index.
\item Reading a secondary index, which can now see tuples already deleted from the
primary index. Existence of a specific version of a key read from a secondary
index must be checked via a primary index lookup.
\end{enumerate}
\subsection{Deferred update}
Only two operations can be deferred: replacement and deletion by primary key.
First, an algorithm for replacement is presented, and then for deletion, which is
very similar.
Consider a table with one primary index and several non-unique secondary indexes.
Assume the user executes the replace operation. A tuple specified in replace is inserted
into the memory level of each index as is, without preliminarily reading and deleting
the old tuple from secondary indexes.
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{table_after_deferred_update}
\caption{Deferred update of LSM tree-based table}
\label{fig:table_after_deferred_update}
\end{figure}
With this approach, it is possible and valid for some secondary indexes to consider
discarded tuples as non-discarded, like in the example given in Figure
\ref{fig:table_after_deferred_update}: there is a tuple with
version 3, which gets discarded in the primary index by a newer tuple with the same
key and version 5. But secondary keys of these tuples are different, so a
secondary index regards the tuple with version 3 as a separate non-discarded tuple.
Such tuples, visible in a secondary index, but invisible in the primary one, are
called \textbf{dirty}.
Deletion by primary key is inserted only in the primary index memory. It cannot
be inserted in a secondary index, because a corresponding secondary key is
unknown without any hidden reads. In this case, a tuple discarded by the delete
operation becomes dirty in each secondary index.
In the next section, the dirty tuple deletion algorithm is presented as a part of
the new LSM tree compaction procedure.
\subsection{Compaction}
\subsubsection{Primary index}
Compaction is triggered by the same signals as in the classic LSM tree - by
too many sublevels per level, or excessively large difference in the size of
neighboring levels.
Consider the basic compaction, with no deferred updates and dirty tuples.
Assume the levels consist of sublevels belonging to key ranges, and a sublevel
is a sorted array of versioned keys.
Sublevels are compacted via being merge-sorted, where tuples having lower
versions across compacted sublevels get discarded. That is, only tuples with the
highest version among compaction participants remain untouched. Two tuples are considered
to be versions of the same key if they are equal by parts of a compacted LSM tree
index. If there are several versions of a key, then during merge-sort they are compared
in one of the steps, and only one tuple stays. After the merge sort is over,
a new single sublevel replaces the old ones.
Obviously, old tuples are read from disk during the compaction. And these reads
replace hidden reads associated with update operations. Actually, hidden reads are
deferred until the compaction, where they are done sequentially: a sequential read
is far quicker than a random one, both on SSDs and HDDs.
Old tuples that are to be discarded are used to extract dirty secondary keys.
The extracted keys are sorted into the order of a secondary index and are written as
a new sublevel directly into it. Secondary keys are written as deletes, saving the
version of the original tuple. New primary index compaction produces sublevels for
itself and for each secondary index. When a secondary index is being compacted, it takes
into account sublevels created for it by the primary index.
Sorting keys into the order of a secondary index is necessary to observe the LSM tree
index invariant, whereby an index sublevel is an array of tuples sorted by key
parts of the index. So sublevels created for secondary indexes during the primary
index compaction must be sorted into the secondary index order. A sorting method
depends on the implementation, and can be one of these:
\begin{itemize}
\item in-memory quick sort after all tuples have been read from disk;
\item in-memory tree sort, whereby a tree is created for each index prior to the
compaction and is being filled as tuples are being read from disk. A tree can be of
any type: binary, red-black, or B-tree. When the primary index compaction is finished, the
tree is written to a secondary index sublevel as a sorted array. So a tree allowing for a fast
iteration over all keys is preferred (for example, a B+ tree: it maintains a sorted list
of stored values);
\item on-disk merge sort, which can be used when compacting sublevels discards too
many tuples and there are many secondary indexes that may not fit into memory.
The idea is to do an in-memory sort of discarded tuples by batches of some fixed
size and then dump them as sorted arrays into temporary files. When the primary index
compaction is over, these temporary files are merge-sorted into a single sublevel
intended for a secondary index. Merge-sorting the temporary files is done in the same
way as compacting sublevels - files are read and written in parts, without being read
entirely into memory. The described procedure is performed for each secondary index.
\end{itemize}
\subsubsection{Secondary index}
Secondary index compaction is slightly modified to correctly process deletes received
from the primary index. The point is that these deletes have both the same version and
the same key as the dirty tuples for which they are intended, and the compaction
procedure must correctly handle these matching versions and keys.
The only modification of the secondary index compaction is that in case
a delete tuple and a replace tuple have matching versions and keys, the replace
tuple must be discarded, and the delete tuple must remain, unless it is a \textbf{major}
compaction. Compaction is called major if all the sublevels of all levels are compacted.
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{secondary_compaction_example}
\caption{Secondary index compaction example}
\label{fig:secondary_compaction_example}
\end{figure}
Figure \ref{fig:secondary_compaction_example} provides an example of
the secondary index compaction with sublevels received from the primary index.
In the example, the primary and secondary indexes store two tuples:
\textit{\{pk1, sk1\}} and \textit{\{pk1, sk2\}}. In the primary index, different
versions of the key \textit{pk1} are stored. When the primary index is compacted,
the tuple \textit{\{pk1, sk1\}} is discarded and sent to the secondary index as a
delete tuple with the same version. During a secondary index compaction, this
delete tuple with version 1 meets the dirty tuple \textit{\{pk1, sk1\}} with
version 1 - their keys and versions are equal. It is assumed that it is not a major
compaction, so the dirty tuple \textit{\{pk1, sk1\}} is discarded and the delete
tuple remains.
\subsection{Read}
Primary index reads remain unchanged, because dirty tuples can appear in
secondary indexes only. When a secondary index is being read, dirty tuples are
skipped. To check if a tuple is dirty, its primary key and version are looked
up in the primary index. If the primary index contains the same key with the same
version, then the tuple is not dirty and can be returned to the user.
It means that even point reads from a secondary index can lead to multiple
lookups in the primary index in order to exclude dirty tuples.
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{secondary_reading_example}
\caption{Secondary index read example}
\label{fig:secondary_reading_example}
\end{figure}
In Figure \ref{fig:secondary_reading_example}, selecting by secondary key
returns a tuple on the third attempt. First, \textit{\{pk1, sk1\}} is read as a tuple having
the highest known version of this secondary key. It appears to be dirty, because
in the primary index a tuple with the same primary key has a different version and
a different secondary key. Second, \textit{\{pk2, sk1\}} is checked as a tuple with
the next highest version. It is dirty, too: in the primary index, this tuple is already
replaced by \textit{\{pk2, sk2\}}. Finally, \textit{\{pk3, sk1\}} is read, and the
primary index stores a tuple with the same key and the same version, which means
that this tuple is not dirty and can be returned to the user.
The common algorithm is to read secondary-index tuples and look up each one in the
primary index until a pair with matching versions and keys is found.
\subsection{Implementation}
The proposed algorithm is implemented as a patch for Tarantool's on-disk
engine named Vinyl. Tarantool is a DBMS and application server that supports two
storage engines: Memtx (in-memory) and Vinyl (on-disk). Tables in Tarantool are
called \textit{spaces}. Memtx stores data in RAM using B+ trees and is not
considered here.
Vinyl stores indexes as LSM trees. An LSM tree-based index is split into key
ranges. An LSM tree level consists of sublevels that are produced by level zero dumps and
compactions and are represented as files. Ranges are compacted independently of each
other. Dumps and compactions are executed in background threads, while transaction
processing is handled by a single main thread that does not do any costly operations like disk
or network access, delegating these tasks to worker threads instead.
Sublevels consist of pages. A page knows its minimal and maximal keys and minimal
and maximal versions - this information is stored in memory and allows reading
only the needed pages, not entire sublevels. Page size is configurable, and the
greater it is, the less meta information is stored in memory and the larger chunks are
read when searching for a key.
Each Vinyl sublevel has a Bloom filter with several improvements: less hashing
with the same performance \cite{Kirsch:bloom_less_hashing} and blocked Bloom
filters \cite{Putze:bloom_cache_oblivious}.
The main thread consists of coroutines written in C and called \textit{fibers}. When
the main thread wants to perform a long operation, such as dumping or compacting
an LSM tree index, or accessing disk or network, it sends a request to a worker
thread and switches to another fiber, while the original one waits for a response
from the worker thread. A huge advantage of this approach is that there are no locks
on internals, except for inter-thread communication queues.
The implementation of the deferred update algorithm consists of three parts, same as the
algorithm itself: new replaces/deletes, compaction, and reads.
New replaces/deletes are quite simple: they just do not do any hidden reads. Replace is
inserted into level zero of all indexes of a Vinyl space, whereas delete is inserted into the
primary index only.
Compaction is the most complex part of the implementation. As applied to
Vinyl and LSM trees, it works as follows before the deferred update:
\begin{enumerate}
\item The main thread detects that compaction is necessary (for example, the level
size ratio is violated or a level consists of too many sublevels) and sends a
request to a worker thread with the information on which sublevels to compact and in
which files they are stored;
\item The worker thread receives the request and starts compaction, which is done
via merge-sorting sublevel files. Processed files are not deleted or changed
by the worker thread, because they may now be used in the main thread for reads.
Once the worker thread finishes its work, it has one new sublevel file - and the main
thread is notified about it;
\item After a while, one of main-thread fibers notices the finished worker thread,
gets the new sublevel, puts it in the LSM tree, and atomically deletes old sublevels and
their associated files.
\end{enumerate}
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{compaction_implementation}
\caption{Primary index compaction schema}
\label{fig:compaction_implementation}
\end{figure}
After the deferred update is applied, the second and third steps of the compaction
procedure are completely different. See Figure
\ref{fig:compaction_implementation} for an example. When the primary index is
scheduled for compaction, the main thread sends a worker thread the information
not only about the primary index sublevel, but about each secondary index key parts,
which are needed to extract secondary keys from primary index tuples.
While the primary index is being merge-sorted, discarded tuples are put into
in-memory B+ trees created for each secondary index. Once the
primary index compaction is over and a new sublevel is created, the filled
B+ trees are dumped into sublevel files for secondary indexes.
The dump is fast, because a B+ tree can quickly be iterated over due to
links between neighboring keys. At this step, the worker thread has sublevel files
for each table index, and it notifies the main thread about the finished work.
One of main-thread fibers replaces the compacted sublevels with the new one in the
primary index and puts secondary index sublevels into corresponding indexes
at the first level of the LSM tree. The secondary index sublevel received from the primary
index is situated at the first level of the LSM tree, because it can contain any deletes from
any primary index level, which means that it can overlap any sublevel stored on disk.
These deletes must overlap dirty tuples for which they are intended.
The implementation of reading is modified to take the existence of dirty tuples
into account. This is the same thing that occurs during the secondary index compaction -
the index contains both dirty tuples and their deletes with the same keys and
versions. If an iterator sees this match, it skips these versions and all higher ones.
All the other discovered tuples are looked up in the primary index. If the primary index
contains a tuple with the same secondary key and version as a tuple from
the secondary index, then this tuple is not dirty and can be returned to the
user. Otherwise the tuple is dirty - its delete is not generated by the primary
index compaction. The salient feature of this implementation is that sublevels
received from the primary index are indistinguishable from regular sublevels and
are being read together with the others. It allows avoiding lookups in the
primary index if a tuple is dirty and its delete is already in the secondary
index. However, there is a certain overhead associated with reading these
additional sublevels.
There is another way to implement reading, and it has its own benefits and
drawbacks. The key concept is to distinguish between sublevels received
from the primary index and other sublevels and use them for compaction
only, not for reading. It allows reading fewer sublevel files, but necessitates
lookups in the primary index for each tuple even if a dirty tuple's delete is
already received from the primary index.
\section{Mathematical foundations}
In this section, the mathematical complexity of the original update algorithm is
compared with that of the deferred update algorithm. The following known
values are used:
\begin{align*}
N &- \text{total number of tuples on all levels}, \\
b &- \text{zero-level B+ tree branching factor}, \\
r &- \text{level size ratio}, \\
K &- \text{total number of indexes}.
\end{align*}
According to the definition of the LSM tree level size ratio, level $i$ is $r$ times
larger than level $i - 1$, so the total number of levels is $O(log_rN)$
\cite{kai:slimdb}. Let's denote it as $lc$.
The deferred update algorithm allows avoiding reads, and puts new tuples in the
memory level - it is the only part that the old and new algorithms have in comon.
Memory level access complexity depends on the number of tuples, which can be
calculated from $N$ and the number of levels. Denoting the memory (alias zero)
level size as $m$, it can be calculated by using the geometrical progression sum formula:
\begin{gather*}
N = m + m*r + m*r^2 + ... + m*r^{lc}, \\
N = \frac{m(1 - r^{lc+1})}{1 - lc}, \\
m = \frac{N(1 - r)}{1 - r^{lc + 1}}.
\end{gather*}
In the scope of this paper, the memory level is a B+ tree, which has the
search and insertion complexity of $O(log_bm)$. The total complexity of
replace and delete before the deferred update includes point search on
disk - the complexity of this operation is $O(log_rN)$. It is associated
with the number of levels, because the disk index of levels (pages, their minimal and
maximal keys, versions) is stored in memory, and necessary on-disk pages for a
certain key can be found without accesing disk. Disk is only accessed to read
pages that can contain the required key.
Using the found parameters, the complexity of update can be calculated as follows:
\underline{Complexity of replace}:
\begin{displaymath}
O(log_bm) + O(log_r(N - m)) + O(log_bm) * (2K - 1)
\end{displaymath}
Do a hidden read: search for an old tuple in the memory level ($O(log_bm)$) and on
disk ($O(log_r(N - m))$). In the worst-case scenario, the old tuple with the same primary
key is found, so two tuples ($2(K - 1)$) are inserted into the memory level ($O(log_bm)$)
of each secondary index: the old tuple's delete and the new tuple. In the primary
index, the new tuple discards the old one without delete, so only one tuple is inserted
($2(K - 1) + 1 = 2K - 1$).
\underline{Complexity of delete}:
\begin{displaymath}
O(log_bm) + O(log_r(N - m)) + O(log_bm) * K
\end{displaymath}
Do a hidden read: search for an old tuple in the primary index on disk and in memory
($O(log_bm) + O(log_r(N - m))$). In the worst-case scsenario, the old tuple is found,
so a delete tuple is inserted into the memory level of each index ($O(log_bm) * K$).
After the deferred update is implemented, all hidden reads are eliminated.
\underline{Complexity of \textbf{deferred} replace}:
\begin{displaymath}
O(log_bm) * K
\end{displaymath}
A new tuple is inserted into the memory level of all indexes - no deletes, no
hidden reads: they are deferred until compaction. The deferred replace is at
least twice as fast as the classic one, according to the calculations below:
\begin{gather*}
\frac{O(log_bm) + O(log_r(N - m)) + O(log_bm) * (2K - 1)}{O(log_bm) * K} = \\
\frac{O(log_bm * 2K) + O(log_r(N - m))}{O(log_bm) * K} = \\
2 + \frac{O(log_r(N - m))}{O(log_bm) * K}.
\end{gather*}
The level zero size of an LSM tree is bounded above by the RAM size. The
number of indexes is set once when a database schema is created and is
rarely changed afterward, the zero-level B+ tree branching factor is a constant,
so the denominator $O(log_bm) * K$ can be treated as a constant as well.
The numerator contains $N$, which can be huge (billions and more) in
write-intensive tasks, and the larger it is, the larger is the difference between
the speed of the deferred update and the regular one. For example, for the
following estimation:
\begin{align*}
&r = 3, \\
&b = 10, \\
&m = 10^5, \\
&N = 10^8, \\
&K = 5.
\end{align*}
the deferred replace is theoretically $~2.55$ times faster. In reality, the
speed increase is far greater, because disk access is much slower
than memory access, even for SSDs. And the numerator $O(log_r(N - m))$
depends on the speed of disk access: when $N$ is large, most tuples are stored on
disk. That is why it is easy to get up to 10 times speed increase even on small
tables.
\underline{Complexity of \textbf{deferred} delete}:
\begin{displaymath}
O(log_bm)
\end{displaymath}
Delete is inserted into the memory level of the primary index only. Since only the primary
index is used, the deferred delete is at least $K + 1$ times faster than the regular delete,
and the more tuples are stored on disk, the faster the deferred delete gets:
\begin{gather*}
\frac{O(log_bm) + O(log_r(N-m)) + O(log_bm) * K}{O(log_bm)} = \\
\frac{O(log_bm)*(K + 1) + O(log_r(N-m))}{O(log_bm)} = \\
K + 1 + \frac{O(log_r(N-m))}{O(log_bm)}
\end{gather*}
\section{Evaluation}
The implemented algorithm is tested on two benchmarks: Microbench, on which the
difference between the regular update and the deferred one is most pronounced, and
LinkBench. Both benchmarks are done on a single machine with Apple SSD SM0512L,
Intel i7 2.7Ghz, 4 cores.
\subsection{Microbench}
The benchmark compares stock Tarantool vs Tarantool with deferred updates.
Obviously, the largest throughput increase can be achieved with the workload
consisting of many replaces and deletions on a table with non-unique secondary
indexes. The database schema consists of one space on the Vinyl engine, one primary
index on the first field, and 4 secondary indexes, each on one field. In terms of the
SQL syntax, it looks as follows:
\begin{verbatim}
create table test
(field1 unsigned integer primary key,
field2 unsigned integer, field3 unsigned integer,
field4 unsigned integer, field5 unsigned integer);
create not unique index on test(field2);
create not unique index on test(field3);
create not unique index on test(field4);
create not unique index on test(field5);
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{rps_microbench}
\caption{Microbenchmark RPS}
\label{fig:rps_microbench}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{disk_stmt_microbench}
\caption{Microbenchmark disk statement count}
\label{fig:disk_stmt_microbench}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{dump_count_microbench}
\caption{Microbenchmark dump count}
\label{fig:dump_count_microbench}
\end{figure}
The level zero size of an LSM tree is capped at 128 MB. The workload is generated and
sent via network by 4 clients on the same machine that is running Tarantool
(unix sockets are used, minimizing the network overhead). Each client generates
replaces and deletions by primary key in random-sized batches (1-500 tuples).
A half of the tuples are deletions and the other half are replaces. A tuple consists
of 5 fields with the total size of about 50 bytes (including tuple meta information). All
keys range between 1 and 1,000,000.
There are two worker threads for dumping and compacting levels. The level size ratio is
3.5. The maximal number of sublevels per level is 2. The Bloom filter has a 5\% false
positive rate. The page size is 8 KB (about 160 tuples).
The RPS results are presented in Figure \ref{fig:rps_microbench}.
Table \ref{table:rps_microbench} contains some aggregated RPS values that show
that on such a small table with less than 1,000,000 tuples, the number of RPS has
grown 6 times.
Given a much higher number of RPS, the table is filled faster, as shown in Figure
\ref{fig:disk_stmt_microbench}, and the memory level is dumped more frequently, as
shown in Figure \ref{fig:dump_count_microbench}.
\begin{table}
\centering
\caption{Microbenchmark aggregated RPS}
\label{table:rps_microbench}
\begin{tabular}{|c|c|c|} \hline
\cellcolor{table_header}&
\cellcolor{table_header}Deferred update &
\cellcolor{table_header}Regular update\\ \hline
\cellcolor{table_header}Average &112343 r/s &21681 r/s\\ \hline
\cellcolor{table_header}Maximal &132316 r/s &89292 r/s\\ \hline
\cellcolor{table_header}Median &114772 r/s &20097 r/s\\
\hline\end{tabular}
\end{table}
\subsection{LinkBench}
Deferring updates slows down readings, because any key found in a secondary
index must be checked for not being dirty in the primary index. The extent of slowdown
depends on the workload type - the more updates againts new data insertions, the more
tuples become dirty and the slower readings become. Figure \ref{fig:linkbench}
shows the LinkBench \cite{Armstrong:linkbench} results. Most requests in this
LinkBench run were readings (over 70\%), and, given this kind of workload, read
requests are 2 times slower than without deferred updates.
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{linkbench}
\caption{LinkBench RPS}
\label{fig:linkbench}
\end{figure}
\section{Related work}
Accelerating updates is not a new task. There are some existing works
on this topic: some try to reduce the number of hidden reads, others try to
defer some computations until reads. Various solutions are designed not only for
LSM trees, but for B-trees as well.
The problem of updates that take too long predates the emergence of SSDs
and traces back to when B-trees were a very widespread data structure for
HDDs. In the work \cite{Edward:incremental_update}, authors propose a
method of deferring B-tree updates until the updated keys have been read.
Updates are stored in \textit{differential files}, which are used to look up
each key. Differential files were first proposed in 1976 \cite{Lohman:differential_files}
as a method of B-tree multiversion support. It is curious to note that LSM trees,
which were investigated much later, actually consist entirely of differential files.
Differential files allowed storing multiple versions, but they could not reduce
the cost of random reads and writes in B-trees.
In 2014, another LSM tree performance optimization was proposed
\cite{Wang:open_channel_ssd} on the hardware level. According to this paper, SSDs
do not allow fully utilizing I/O capacity, because they provide only one channel
to the operating system even if they have multiple channels that can be accessed
simultaneously. By using custom LevelDB, it was shown that open-channel SSDs
(SDFs) allow for more than 4 times throughput increase. But this optimization is
purely hardware, as the LSM tree is not changed at all, so it will not work for
regular SSDs. However, this method can be combined with deferred updates.
\section{Conclusion}
This paper introduced a new version of the LSM tree that turns replace/delete
operations into blind writes even if multiple indexes exist. The main idea is to
defer old data lookup and deletion until compaction, which eliminates hidden
reads. Several variants of the new compaction algorithm were proposed, and one
of them was implemented. Exponential speed growth was mathematically proven for
deferred updates, and the same results were obtained in experiments.
% The following two commands are all you need in the
% initial runs of your .tex file to
% produce the bibliography for the citations in your paper.
\bibliographystyle{abbrv}
\bibliography{vldb_sample} % vldb_sample.bib is the name of the Bibliography in this case
\end{document}