---
layout: default
title: Dynamic Tensor Rematerialization
---
<h1>Dynamic Tensor Rematerialization (DTR)</h1>
<div style="color:#222;font-size:1rem">Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock</div>
<ul>
<li>Saves memory for neural network training by dynamically discarding and recomputing intermediate results at runtime.</li>
<li>By being smart about what to keep and what to discard, DTR trains larger models under a tight memory budget.</li>
<li>Saves 3x memory for 20% extra compute time - train models 3x as large!</li>
<li>DTR makes no assumptions about the model and does not require a static graph.</li>
<li>DTR can be easily integrated into different deep learning frameworks.</li>
<li>Despite this generality, DTR still achieves an amazing space-time tradeoff!</li>
</ul>
<div> <a href="https://arxiv.org/abs/2006.09616">[paper]</a> <a href="resources/DTR.pptx">[slides]</a> <a href="https://www.youtube.com/watch?v=S9KJ37Sx2XY">[video]</a> </div>
<h2>Adoption:</h2>
<a href="https://github.com/MegEngine/MegEngine/wiki/Reduce-GPU-memory-usage-by-Dynamic-Tensor-Rematerialization">MegEngine</a> <br>
<img src="https://raw.githubusercontent.com/MegEngine/Resource/master/MegEngine%20Meetup/DTR%20docs%20image/dtr-13.png"> <br>
<a href="https://github.com/awslabs/raf/blob/main/docs/wiki/3_dev_guide/pass/Rematerialization.md">Raf</a> <br>
<table>
<tr>
<th>Model</th>
<th>Maximum non-remat batch size</th>
<th>Remat batch size</th>
<th>Throughput relative to non-remat</th>
</tr>
<tr>
<td>BERT-base-mlm</td>
<td>16</td>
<td>32</td>
<td>94.3%</td>
</tr>
<tr>
<td>BERT-large-mlm</td>
<td>8</td>
<td>16</td>
<td>96.5%</td>
</tr>
<tr>
<td>GPT2</td>
<td>8</td>
<td>24</td>
<td>93.2%</td>
</tr>
</table> <br>
<h2>Inspired work:</h2>
<a href="https://arxiv.org/abs/2203.15980">DELTA</a> <br>
<h2>Technical Dive</h2>
<ol start="0">
<li>Gradient checkpointing is a technique for saving memory during deep neural network training (see the first sketch after this list for a minimal example).</li>
<li>Or, more generally, for reverse-mode automatic differentiation.</li>
<li>However, optimal memory planning is NP-complete.</li>
<li>Checkpointing also has to deal with programs with arbitrary control flow.</li>
<li>To combat this, previous work imposed various restrictions that sacrifice performance or usability.</li>
<li>Some works model the program as a stack machine with no heap...</li>
<li>And suffer performance degradation when that assumption is broken!</li>
<li>(For example, NNs with highway connections or branching.)</li>
<li>Other works use an ILP solver, which takes a long time to find the optimal memory plan.</li>
<li>And can only be used for programs/frameworks without control flow, posing problems for real-world adoption.</li>
<li>Additionally, gradient checkpointing couples derivative calculation with memory saving via recomputation.</li>
<li>This adds complexity and limits the range of applications.</li>
<li>DTR tackles the problems above by planning memory greedily at runtime, instead of as a compiler pass.</li>
<li>Essentially, DTR pairs each tensor with a thunk (think lazy evaluation), so every tensor can be recomputed after eviction (see the second sketch after this list).</li>
<li>This solves the control-flow and stack-machine issues, as we do not model the program in any way!</li>
<li>Nevertheless, with a novel cache eviction policy, we are still able to achieve great performance.</li>
<li>Surprisingly, DTR 'accidentally' rediscovers the O(n) time, O(n^0.5) space scheme and the O(n log n) time, O(log n) space scheme.</li>
<li>These are both well-known results from the checkpointing literature.</li>
<li>Furthermore, DTR switches smoothly between normal execution without recomputation, the O(n^0.5) space scheme, and the O(log n) space scheme.</li>
</ol>
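<p>As a point of reference for the gradient checkpointing mentioned at the start of the list, here is a minimal example using PyTorch's <code>torch.utils.checkpoint</code> utility; the model and sizes are illustrative only, not taken from the paper:</p>
<pre><code># Minimal gradient-checkpointing example (illustrative model, not from the paper).
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

x = torch.randn(32, 1024, requires_grad=True)

# Intermediate activations inside `block` are not kept during the forward pass;
# they are recomputed during backward, trading compute time for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
</code></pre>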
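<p>And below is a toy Python sketch, not the actual DTR implementation, of the thunk-and-eviction mechanism described above: each tensor carries a thunk that can rebuild it, and a runtime pool evicts greedily under a memory budget using a score in the spirit of the paper's heuristic (prefer cheap-to-recompute, large, stale tensors). All class names and numbers are illustrative assumptions:</p>
<pre><code>import time

class RematTensor:
    """A value paired with a thunk that can rebuild it after eviction."""
    def __init__(self, pool, compute, parents, cost, size):
        self.pool = pool          # runtime tracking the memory budget
        self.compute = compute    # thunk: rebuilds the value from parent values
        self.parents = parents    # tensors the thunk reads
        self.cost = cost          # (measured) compute cost of the thunk
        self.size = size          # memory occupied when materialized
        self.value = None         # None means "evicted"
        self.last_use = 0.0
        self._materialize()       # DTR executes eagerly; eviction happens later

    def _materialize(self):
        args = [p.get() for p in self.parents]            # may recurse
        self.pool.reserve(self, pinned={self, *self.parents})
        self.value = self.compute(*args)
        self.last_use = time.monotonic()

    def get(self):
        if self.value is None:                            # evicted: rebuild via thunk
            self._materialize()
        self.last_use = time.monotonic()
        return self.value

class Pool:
    """Tracks live tensors against a budget and evicts greedily when full."""
    def __init__(self, budget):
        self.budget = budget
        self.used = 0
        self.live = set()

    def reserve(self, t, pinned):
        # Evict until the new tensor fits (or nothing evictable remains),
        # never touching pinned tensors (the op's arguments and result).
        while self.used + t.size > self.budget and self._evict_one(pinned):
            pass
        self.used += t.size
        self.live.add(t)

    def _evict_one(self, pinned):
        def score(c):
            staleness = time.monotonic() - c.last_use + 1e-9
            return c.cost / (c.size * staleness)          # cheap, big, stale first
        candidates = [c for c in self.live if c not in pinned]
        if not candidates:
            return False
        victim = min(candidates, key=score)
        victim.value = None                               # drop the data, keep the thunk
        self.live.discard(victim)
        self.used -= victim.size
        return True

# Toy usage: a chain of "ops" on plain Python lists standing in for tensors.
pool = Pool(budget=2)
a = RematTensor(pool, lambda: [1.0], [], cost=0.0, size=1)
b = RematTensor(pool, lambda x: [v * 2 for v in x], [a], cost=1.0, size=1)
c = RematTensor(pool, lambda x: [v + 1 for v in x], [b], cost=1.0, size=1)
d = RematTensor(pool, lambda x: [v ** 2 for v in x], [c], cost=1.0, size=1)
print(d.get())   # [9.0]; a and b were evicted along the way to stay in budget
print(b.get())   # [2.0]; b (and a) are transparently recomputed from their thunks
</code></pre>
<p>A real implementation additionally has to pin the arguments of the operator currently executing and to account for evicted neighbors in the cost term; see the paper for the exact heuristic.</p>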