<!DOCTYPE html>
<html>
<head>
<link rel="canonical" href="https://hardmath123.github.io/langevin.html"/>
<link rel="stylesheet" type="text/css" href="/static/base.css"/>
<title>Fortune in Flux - Comfortably Numbered</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<link rel="alternate" type="application/rss+xml" title="Comfortably Numbered" href="/feed.xml" />
<!--
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script>
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$']]}
});
</script>
-->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-Um5gpz1odJg5Z4HAmzPtgZKdTBHZdw8S29IecapCSB31ligYPhHQZMIlWLYQGVoc" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-YNHdsYkH6gMx9y3mRkmcJ2mFUjTd0qNQQvY9VYZgQd7DcN7env35GzlmFaZ23JGp" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-vZTG03m+2yp6N6BNi5iM4rW4oIwk5DfcNdFfxkk9ZWpDriOkXX8voJBFrAO7MpVl" crossorigin="anonymous"></script>
<script>
document.addEventListener("DOMContentLoaded", function() {
renderMathInElement(document.body, {
// customised options
// • auto-render specific keys, e.g.:
delimiters: [
{left: '$$', right: '$$', display: true},
{left: '$', right: '$', display: false},
{left: '\\begin{align}', right: '\\end{align}', display: true},
{left: '\\(', right: '\\)', display: false},
{left: '\\[', right: '\\]', display: true}
],
// • rendering keys, e.g.:
throwOnError : false
});
});
</script>
</head>
<body>
<header id="header">
<script src="static/main.js"></script>
<div>
<a href="/"><span class="left-word">Comfortably</span> <span class="right-word">Numbered</span></a>
</div>
</header>
<article id="postcontent" class="centered">
<section>
<h1>Fortune in Flux</h1>
<center><em><p>The fortune of us that are the moon’s men doth ebb and flow like the sea (Henry IV, Part I)</p>
</em></center>
<h4>Wednesday, July 7, 2021 · 7 min read</h4>
<p>Surely <em>something</em> will happen — which means that if you integrate over all
possible outcomes, the probabilities must add up to 1. This elementary fact
remains true, even if the probabilities of the outcomes change for some reason.
Are you reminded of something? <em>Fluids</em> behave in the same way: no matter how
the particles are distributed, the total quantity of fluid is conserved, which
means that if you integrate over all of space, the masses must add up to a
constant — scale that constant to 1, and the analogy is exact.
Indeed, when we talk about <em>probability density functions</em> and <em>probability
mass functions</em>, we are borrowing fluidic vocabulary.</p>
<p>I get excited about analogies like this, where the physics of one domain
promises to unexpectedly deliver insight into a different domain. Another
example of this is how spring constants and capacitances play the same
mathematical role in the differential equations that model their respective
mechanical and electrical systems. Harry Olson’s wonderful book <a href="https://archive.org/details/DynamicalAnalogies"><em>Dynamical
Analogies</em></a> shows some more
beautifully-illustrated examples of mechanical, electrical, and acoustic
systems with identical dynamics.</p>
<p>Can we get any mileage out of the probability-fluid analogy? It turns out,
<em>yes!</em> We can apply our physical intuition about fluids to design a clever
statistical machine learning algorithm. That’s the subject of this piece. By
the way, you can undoubtedly find better (and more correct) expositions of this
math in other places; my goal here is to emphasize the interaction between
physical intuition and probabilistic applications, bridged so beautifully by
the fluidic analogy.</p>
<p>A little bit about the problem setup: sometimes, when working with data, you
find yourself in a place where you want to sample from a probability
distribution $p(x)$. This might not be so bad if you can compute $p(x)$ exactly
— for example, you can use rejection sampling or something. Often, however,
you can only compute some scaled quantity $q(x)$ such that $p(x) \propto q(x)$
with some unknown constant of proportionality. Commonly, this happens in
Bayesian inference, where you may want to sample from $\pi(x \mid y) =
\pi(x,y)/\pi(y)$ where $y$ might be data you’ve seen and $x$ might be the
parameters to your model. Here, the problem is that the normalizing constant
$\pi(y)=\int \pi(x,y)dx$ is hard to compute because that integral is
intractable in high dimensions.</p>
<p>One famous way to do this kind of sampling is the Metropolis-Hastings
algorithm, which approximates $p(x)$ with the stationary distribution of a Markov
chain. The “miracle” of the algorithm is that the acceptance probability for a
move between two states depends only on the <em>ratio</em> of the probability
densities at those states — the unknown constant simply cancels out!</p>
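<p>To make that “miracle” concrete, here is a minimal sketch of a random-walk
Metropolis sampler in Python (the symmetric-proposal special case of
Metropolis-Hastings; names like <code>log_q</code> and <code>step_size</code>
are mine, not standard):</p>
<pre><code>import numpy as np

def metropolis(log_q, x0, n_steps, step_size=0.5):
    """Random-walk Metropolis: needs only the unnormalized log-density log_q."""
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * np.random.randn(*x.shape)
        # Accept with probability min(1, q(proposal)/q(x)); the unknown
        # normalizing constant cancels in this ratio.
        if log_q(proposal) - log_q(x) > np.log(np.random.rand()):
            x = proposal
        samples.append(x.copy())
    return np.array(samples)</code></pre>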
<p>But now let’s think about this from a fluids perspective. Have unknown
normalizing constants popped up in physics anywhere? Yes! The partition
function $Z$ of the Boltzmann distribution is often unknown. Recall that a system at
thermodynamic equilibrium at temperature $T$ has probability $p_i \propto
e^{-U_i/k_BT}$ of being in state $i$, where $U_i$ is the potential energy of
that state and $k_B$ is a physical constant that you can find, e.g. on
Wikipedia. A curious application of this distribution is working out how thin
the atmosphere is at altitude $h$, by taking $U_h$ to be the gravitational
potential energy at that altitude: because $U_h$ is linear in $h$, we can
instantly conclude that atmospheric density decreases exponentially with
altitude.</p>
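<p>In symbols, with $m$ the mass of a gas molecule:
$$p(h) \propto e^{-U_h/k_B T} = e^{-mgh/k_B T},$$
so the density falls off by a constant factor for every fixed gain in
altitude.</p>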
<p>But returning to machine learning, here is the insight that the fluidic analogy
makes possible: if we set $U(x) = -\log q(x)$, then a particle simulated at some
“temperature” $T=1/k_B$ in potential $U$ would eventually be found at $x$ with
probability density proportional to $e^{-U(x)}=q(x)$ — that is, with density
exactly $p(x)$. So we can use what we know about the dynamics of particles in a
fluid at thermal equilibrium to sample from probability distributions that may
have nothing at all to do with fluids.</p>
<p>What exactly does it mean to “simulate a particle” of this probability fluid?
You can take those words pretty literally, as it turns out. There are two
sources of motion at each timestep: $U$ and $T$. Given potential $U(x)$, we can
derive the force $-\nabla U$ that acts on our particle and infer its velocity,
which in this heavily-damped setting is simply proportional to that force. We’ll
assume that velocity is roughly
independent between timesteps because the particle is repeatedly being bumped
around by other particles and having its velocity reset — so, no need to
track acceleration. By the way, notice that $\nabla \log q(x)$ doesn’t depend
on the unknown normalization constant, because the logarithm turns it into an
additive constant, which disappears under the derivative.</p>
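<p>Explicitly: if $q(x) = Z\,p(x)$ for some unknown constant $Z$, then
$$\nabla \log q(x) = \nabla\big(\log Z + \log p(x)\big) = \nabla \log p(x).$$</p>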
<p>As for the effect of $T$, we can model the thermal motion of the particle as
Brownian motion: that is, at each time, the particle gets bumped to some random
neighboring position drawn independently from a Gaussian that is somehow scaled
with respect to “temperature” $T$. (How do we translate $T=1/k_B$ to this
un-physical world where “joules” and “kelvins” make no sense? Stay tuned…)</p>
<p>This fluidic analogy leads us to what people call the “Unadjusted Langevin
Algorithm.” The algorithm is quite simple to state and implement:</p>
<ol>
<li>Initialize some $x_0$.</li>
<li>Using some small step size $\Delta t$, integrate $\Delta x = \Delta t \cdot
\nabla\log q(x_t) + \sqrt{2\Delta t} \cdot \xi_t$, where you sample $\xi_t
\sim N(0,1)$; the Greek letter $\xi$ is chosen for maximum squiggliness, in
order to act as a visual metaphor for its function. Importantly, this step only
requires you to evaluate the unnormalized density $q$ (see the sketch in code
after this list)!</li>
<li>That’s it — under some (not <em>too</em> mild but still reasonable) assumptions,
snapshots of the trajectory $\{x_t\}$ are distributed according to $p(x)$!</li>
</ol>
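<p>Here is what that looks like as a minimal sketch in Python (the function and
variable names are mine, not standard):</p>
<pre><code>import numpy as np

def langevin_sample(grad_log_q, x0, n_steps, dt=1e-3):
    """Unadjusted Langevin Algorithm, as described above.

    grad_log_q: gradient of the log of the *unnormalized* density q.
    Returns the whole trajectory {x_t}.
    """
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        xi = np.random.randn(*x.shape)  # the thermal kick, xi ~ N(0, I)
        x = x + dt * grad_log_q(x) + np.sqrt(2 * dt) * xi
        trajectory.append(x.copy())
    return np.array(trajectory)</code></pre>
<p>For a standard Gaussian target, for instance, you would pass
<code>grad_log_q = lambda x: -x</code>, and a histogram of the trajectory
should settle into the familiar bell curve.</p>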
<p>Are you convinced? Here’s a sanity check: without the stochastic component
$\sqrt{2\Delta t}\cdot\xi_t$ at each step, $x_t$ deterministically walks
towards a local maximum of $q$ — that is, without the stochastic
component, we’d have invented gradient descent! The stochasticity, however,
makes $x_t$ a random variable instead.</p>
<hr>
<p>The rest of this piece is dedicated to answering the question: where does that
stochastic coefficient $\sqrt{2\Delta t}$ come from? You might have already
worked out that it’s the magnitude of the thermal motion, which depends on $T$.
If that coefficient were too small then, as we just discussed, we would simply
have gradient descent: our particle would be analogous to a marble rolling down a
(syrupy — overdamped!) hill. If the coefficient were too large, then thermal
motion would dominate and the “signal” from the potential $U$ would stop
mattering. So, clearly, the value $\sqrt{2\Delta t}$ is “special” in some
way.</p>
<p>First let me convince you that we want this coefficient to indeed be
proportional to $\sqrt{\Delta t}$, even though it looks dimensionally wrong
when placed in a sum right next to the $\Delta t\cdot\nabla \log q$ term.
What’s going on? It turns out that this “lots of little Gaussian bumps” motion
doesn’t have an intuitive integral calculus. In particular, for Gaussians, it’s
the <em>variance</em> that adds, not the standard deviation. So in time $t$ the
variance of the total distance traveled would be $\sum (\sqrt{\Delta t})^2 =
\sum \Delta t = t$, and therefore the standard deviation would be $\sqrt{t}$,
which is exactly what we want for the algorithm to be invariant to how small
$\Delta t$ is. (For more on this, you could read about “Wiener processes.”)</p>
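<p>Here is a quick numerical check of that scaling (a sketch; the particular
constants are arbitrary): simulate many random walks up to time $t = 1$ with
two different step sizes, and watch the spread at the end come out to
$\sqrt{t}$ either way.</p>
<pre><code>import numpy as np

t, n_paths = 1.0, 10_000
for dt in (1e-2, 1e-3):
    n_steps = int(t / dt)
    # each step is an independent Gaussian with standard deviation sqrt(dt)
    steps = np.sqrt(dt) * np.random.randn(n_paths, n_steps)
    endpoints = steps.sum(axis=1)
    print(dt, endpoints.std())  # ~1.0 == sqrt(t) in both cases</code></pre>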
<p>Now for the “2.” To derive this value, let’s return to our fluidic analogy.
Suppose we wanted to compute the change in probability density at a point $x$
over time, that is, ${\partial p(x,t)}/{\partial t}$, with the understanding
that we will set this time derivative to zero later to reason about the
system’s behavior at equilibrium. By following the fluidic analogy, we can
hallucinate a “probability flow” $\vec{J}(x, t)$ and assert, by the divergence
theorem, that ${\partial p}/{\partial t} = -\nabla \cdot \vec{J}$.</p>
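<p>Spelled out: for any fixed region $V$, conservation says the probability
mass inside $V$ changes only by flow through the boundary, and the divergence
theorem turns that surface integral into a volume integral:
$$\frac{d}{dt}\int_V p\,dV = -\oint_{\partial V} \vec{J}\cdot d\vec{A} = -\int_V \nabla\cdot\vec{J}\,dV.$$
Since $V$ is arbitrary, the integrands must agree pointwise.</p>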
<p>To measure $\vec{J}(x, t)$, we should think about what’s flowing; that is,
about “particles,” whose dynamics are given by the Langevin update rule for
$\Delta x_t$ I wrote out above. The deterministic part is easy to deal with:
from a tiny region of volume $\Delta V$, a mass $p\Delta V$ of particles leaves
with velocity $\nabla\log q$, so the flux is $p\nabla \log q$.</p>
<p>For the stochastic part, notice that the dynamics of the thermal motion are
spatially invariant, i.e. independent of $x$. So, whatever amount diffuses out
of $\Delta V$ is proportional to $p$, and whatever amount diffuses in is
proportional to the density in the neighborhood of $\Delta V$: this means the
net flow is proportional to $-\nabla p$. When we talk about fluids, this
observation is also known as Fick’s Law, and the constant of proportionality is
called the diffusion coefficient $D$.</p>
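<p>In symbols, Fick’s Law for the thermal part of the flux reads
$$\vec{J}_{\text{thermal}} = -D\,\nabla p.$$</p>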
<p>How can we compute $D$? Looking at only the stochastic part, we have $\partial
p/\partial t = D\nabla^2 p$, which we recognize as the heat equation. That is,
for this brief aside, we’re going to think of $p$ representing — not
probability over Bayesian model parameters <em>or</em> fluid density in a potential
well, but — heat in a block of metal — a bonus analogy!</p>
<p>Thankfully, the solution to the heat equation turns out to be well-known: it is
a time-varying Gaussian. And it better be! After all, the stochastic diffusion
is just a sum of lots of independent little Gaussian-sampled steps of variance
$\Delta(\sigma^2) = {2\Delta t}$, which together form a big Gaussian of
variance $\sigma^2 = 2t$, as discussed above. But let’s say we didn’t know the
coefficient was 2. Let’s set it to $a$ instead. Now, if we know that $p(x,t) =
e^{-x^2/(2at)}/\sqrt{2\pi a t}$, then we can literally take the partial
derivatives on both sides of the heat equation and expand to solve for $D$. A
miracle occurs, and many things cancel, leaving us with simply $D=a/2$. It’s
not worth writing out the algebra here: I encourage you to try it yourself,
since it is that rare combination of straightforward and satisfying. But
furthermore, it gives some mechanical (if not physical) insight into the “2”
that appears in the denominator of the result: algebraically, the $-1/2$ power
of $at$ in the normalization constant $1/\sqrt{2\pi a t}$ kicks out some
factors of two that are <em>not</em> balanced by the $-1$ power of $t$ in the
exponential.</p>
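<p>If you’d rather let a computer do the algebra, here is a quick symbolic
check (a sketch using <code>sympy</code>) that the spreading Gaussian solves
the heat equation precisely when $D = a/2$:</p>
<pre><code>import sympy as sp

x, t, a = sp.symbols('x t a', positive=True)
p = sp.exp(-x**2 / (2 * a * t)) / sp.sqrt(2 * sp.pi * a * t)
# heat equation: dp/dt should equal D * d^2p/dx^2 with D = a/2
residual = sp.diff(p, t) - (a / 2) * sp.diff(p, x, 2)
print(sp.simplify(residual))  # prints 0</code></pre>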
<p>In any case, though, we conclude from the above that $\Delta p = (a/2)\Delta t
\cdot \nabla^2 p$, which means the stochastic flux is $-(a/2)\nabla p$. Perhaps it’s
all clear to you by now, but let’s take this over the finish line. Putting
these two components together, $U$ and $T$, we have $\partial p/\partial t =
-\nabla\cdot (p\nabla\log q-(a/2)\nabla p)$. This nice identity is known as the
Fokker-Planck equation (the devoted reader might note that this is the <em>second</em>
reference to Fokker on this blog).</p>
<p>Now, finally, as promised many paragraphs ago, we can reason about this
equation at equilibrium. At equilibrium the distribution is static in time, so
we have $0 = \nabla \cdot (p\nabla \log q -(a/2) \nabla p)$. Assuming suitable
boundary conditions, we can conclude that the argument of the divergence is itself
zero, so $\nabla \log q = (a/2)\nabla p / p = (a/2)\nabla \log p$. Integrating, we have
$\log q + c = (a/2)\log p$, or rather $p \propto q^{2/a}$. This makes sense:
just as expected, for large $a$ the effect of $q$ is wiped out, and for small
$a$ all but the maximum value get squashed to zero. Crucially, however, if we
want $p \propto q$, then we must set $a=2$. Aha! So <em>that’s</em> why the thermal
motion needs to be scaled by $\sqrt{2\Delta t}$. (By the way, that constant of
integration above, $c$, oddly enough encodes that pesky normalization constant
that started off this whole discussion…)</p>
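<p>And to take the whole construction for a spin, here is an end-to-end sanity
check (a sketch, reusing the <code>langevin_sample</code> function from above,
with an arbitrary double-well density I made up for illustration):</p>
<pre><code>import numpy as np

# Unnormalized density q(x) = exp(-(x^2 - 1)^2); we never compute its
# normalizing constant.
log_q = lambda x: -(x**2 - 1)**2
grad_log_q = lambda x: -4 * x * (x**2 - 1)

xs = langevin_sample(grad_log_q, x0=np.zeros(1), n_steps=100_000, dt=1e-3)
xs = xs[20_000:]  # discard burn-in

# Compare the empirical histogram to q, normalized numerically on a grid.
grid = np.linspace(-2.5, 2.5, 201)
q = np.exp(log_q(grid))
q /= q.sum() * (grid[1] - grid[0])
hist, edges = np.histogram(xs, bins=50, range=(-2.5, 2.5), density=True)
centers = (edges[:-1] + edges[1:]) / 2
# the two curves should roughly agree for a well-mixed chain
print(np.abs(hist - np.interp(centers, grid, q)).max())</code></pre>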
<hr>
<p>Okay. I’m glad I got that out of my system; it’s been rattling around for
weeks now! How wonderful it is, that we can start with the simplest axiom of
probability — “<em>something</em> must happen” — and, reasoning only with physical
intuition, reach a sampling algorithm that solves a real problem.</p>
</section>
<div id="comment-breaker">◊ ◊ ◊</div>
</article>
<footer id="footer">
<div>
<ul>
<li><a href="https://github.com/kach">
Github</a></li>
<li><a href="feed.xml">
Subscribe (RSS feed)</a></li>
<li><a href="https://twitter.com/hardmath123">
Twitter</a></li>
<li><a href="https://creativecommons.org/licenses/by-nc/3.0/deed.en_US">
CC BY-NC 3.0</a></li>
</ul>
</div>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-46120535-1', 'hardmath123.github.io');
ga('require', 'displayfeatures');
ga('send', 'pageview');
</script>
</footer>
</body>
</html>