forked from Hardmath123/hardmath123.github.io
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathxkcd-random.html
285 lines (276 loc) · 16.5 KB
/
xkcd-random.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" href="/static/base.css"/>
<title>How Random is xkcd? - Comfortably Numbered</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script>
MathJax.Hub.Config({
tex2jax: {inlineMath: [['($','$)']]}
});
</script>
</head>
<body>
<header id="header">
<link rel="stylesheet" href="/octicons/octicons.css">
<script src="static/cheet.js" charset="utf-8"></script>
<script src="static/main.js"></script>
<h1>
<a href="/"><span class="left-word">Comfortably</span><span class="right-word">Numbered</span></a>
</h1>
<!-- You're reading the source to my blog? What do you think?
Do you have dinner plans? -->
</header>
<article id="postcontent" class="centered">
<section>
<h2>How Random is xkcd?</h2>
<h4>Friday, March 20, 2015 · 6 min read</h4>
<p>Apparently Randall Munroe gets a lot of messages saying that the “random”
button on xkcd is biased.</p>
<blockquote>
<p>2015-03-19 16:47:00 <strong>Hobz</strong> also, Randall, the random button on the xkcd
frontpage is frustratingly un-random</p>
<p>2015-03-19 18:50:52 <strong>~Randall</strong> it’s random.</p>
<p>2015-03-19 18:50:59 <strong>~Randall</strong> people contact me constantly to tell me
that it’s not</p>
<p>2015-03-19 18:51:17 <strong>~Randall</strong> which is a nice illustration of that mental
bias we have</p>
</blockquote>
<p>I thought I would do a little investigating to see just how random xkcd is.</p>
<hr>
<p>Making consistently random numbers (yes, that sounds weird) is really important
in things like cryptography. Unrandom random numbers can cripple an otherwise
secure network. So there’s a surprisingly large amount of work dedicated to
randomness.</p>
<p>There are services like <a href="http://www.random.org">random.org</a> which pride
themselves on randomness, and <a href="http://www.fourmilab.ch/hotbits/">HotBits</a>,
which lets you order random bytes that are generated from radioactive decay. A
lot of applications use <code>/dev/urandom/</code>, which is an OS-level random generator
that uses all sorts of sources of entropy such as network noise, CPU heat, and
the current weather in Kansas.</p>
<p>Unfortunately, it’s <em>really</em> hard to tell whether numbers are random or not.
Of course, patterns <a href="http://boallen.com/random-numbers.html">can creep into random
numbers</a>. But more annoyingly, a
glaringly obvious pattern might just be accidental. My favorite example of this
is the <a href="http://en.wikipedia.org/wiki/Feynman_point">Feynman Point</a>, which is a
series of lots of 9s that appears somewhere in the (very unpredictable) decimal
expansion of pi.</p>
<p><img src="http://imgs.xkcd.com/comics/ayn_random.png" alt="xkcd 1277"></p>
<p>There are a bunch of established ways to test the randomness of a random number
generator (such as the excitingly-named <a href="http://en.wikipedia.org/wiki/Diehard_tests">Diehard
tests</a>). They all test for features
that ostensibly random data should have. For example, a random stream of bits
should have almost as many ones as zeros. Not all tests are that obvious,
though, and statistics can be very slippery and unintuitive when it feels like
it.</p>
<p>NIST (the National Institute of Standards and Technology, who deal with things
like how long an inch is and how to backdoor elliptic curves)
<a href="http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html">publishes</a>
a standard for randomness based on such tests, and distributes software that
runs these tests on datasets.</p>
<p>I wrote a Python program to download 10,000 xkcd-random numbers (yay
<code>requests</code>!), and converted them into bitstrings. Then, I fed them to the NIST
Statistical Test Suite.</p>
<p>The results are below:</p>
<pre><code>
------------------------------------------------------------------------------
RESULTS FOR THE UNIFORMITY OF P-VALUES AND THE PROPORTION OF PASSING SEQUENCES
------------------------------------------------------------------------------
generator is <data/data.xkcd.long>
------------------------------------------------------------------------------
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 P-VALUE PROPORTION STATISTICAL TEST
------------------------------------------------------------------------------
5 8 7 13 9 8 13 12 15 10 0.437274 100/100 Frequency
10 10 11 9 10 16 10 6 8 10 0.759756 99/100 BlockFrequency
5 12 13 10 7 10 10 10 9 14 0.699313 100/100 CumulativeSums
8 3 13 9 9 14 13 8 12 11 0.366918 98/100 CumulativeSums
8 12 11 3 9 8 17 12 9 11 0.224821 100/100 Runs
7 8 8 6 15 9 12 9 15 11 0.437274 99/100 LongestRun
7 8 7 16 0 25 0 25 0 12 0.000000 * 100/100 FFT
3 10 4 19 15 0 18 6 10 15 0.000009 * 100/100 Serial
9 14 10 2 14 8 6 10 16 11 0.080519 100/100 Serial
16 1 5 9 6 0 6 0 10 47 0.000000 * 93/100 * LinearComplexity
</code></pre><p>The important column here is “Proportion”, which shows the pass rate. They’re
all stellar.</p>
<p>If that isn’t convincing, I ran an obviously nonrandom sample for comparison.
This is what NIST’s STS thinks of the first 100,000 bits of Project Gutenberg’s
<a href="http://www.gutenberg.org/dirs/etext98/2ws1610.txt">plaintext version</a> of
<em>Romeo and Juliet</em>:</p>
<pre><code>
------------------------------------------------------------------------------
RESULTS FOR THE UNIFORMITY OF P-VALUES AND THE PROPORTION OF PASSING SEQUENCES
------------------------------------------------------------------------------
generator is <data/data.rnj>
------------------------------------------------------------------------------
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 P-VALUE PROPORTION STATISTICAL TEST
------------------------------------------------------------------------------
95 3 0 1 0 0 1 0 0 0 0.000000 * 25/100 * Frequency
55 14 10 6 6 3 1 1 3 1 0.000000 * 64/100 * BlockFrequency
94 3 1 0 0 1 1 0 0 0 0.000000 * 28/100 * CumulativeSums
93 4 1 1 0 0 0 0 1 0 0.000000 * 30/100 * CumulativeSums
51 7 10 10 4 4 2 5 2 5 0.000000 * 61/100 * Runs
90 8 1 1 0 0 0 0 0 0 0.000000 * 44/100 * LongestRun
92 2 2 2 0 1 0 1 0 0 0.000000 * 23/100 * FFT
100 0 0 0 0 0 0 0 0 0 0.000000 * 0/100 * Serial
100 0 0 0 0 0 0 0 0 0 0.000000 * 0/100 * Serial
14 2 2 7 11 0 4 0 11 49 0.000000 * 95/100 * LinearComplexity
</code></pre><p>Much worse.</p>
<p>I encourage you to play with the STS code. It lets you do all sorts of other
neat things, like testing bitstrings for common “templates” and reporting if
too many are found. It also segfaults all over the place, which is actually
very disturbing considering that it’s technically part of the US government’s
computer security project.</p>
<p>In any case, we’ve established that xkcd’s random generator is reasonably
unpredictable and unbiased. As it happens, they’re using the Mersenne Twister,
which is a well-established pseudorandom generation algorithm.</p>
<hr>
<p>So why does the random number generation appear so biased when we’re idly
refreshing on lazy Sunday nights? Part of it is, of course, human nature. We
like to see patterns everywhere.</p>
<p>But here’s a more concrete, mathematical explanation. The conceptual idea is
that in the beginning, hitting “random” is likelier to hit an unread comic, but
once you’ve seen more and more of them, you get repeats. Let’s try to quantify
this: we’re going to calculate the <em>expected value</em> of the number of times you
need to hit “random” until you have seen every single comic. You may have seen
this problem in the context of “how many times do you need to roll a die until
you have rolled all six faces at least once?”.</p>
<p><a href="http://en.wikipedia.org/wiki/Expected_value">Expected value</a> is the average
value of some random variable if you do an experiment lots of times. For
example, if you roll a die gazillions of time, the average number you’ll get is
($ (1+2+3+4+5+6)/6 = 3.5 $), so that’s the <em>expected</em> value.</p>
<p>We’re going to calculate the expected number of times you hit “random” by
calculating the number of times you need to hit it to get the first, second,
third, and (in general) nth unique comic. Then, because of a useful property of
expected values, we can just add them together until ($ n = 1500 $) (there are
1500 comics published as of right now) to see how long, as of today, this
process would take.</p>
<p>If you’re looking for your ($ n $)th unread comic, each time you hit “random”
you have a ($ 1 - n/1500 $) chance of getting a fresh one. This is a <a href="http://en.wikipedia.org/wiki/Geometric_distribution">geometric
probability distribution</a>,
which is Math for “you keep trying something with a constant probability until
it succeeds”. For geometric probability distributions, the expected value is
one over the probability (though I’m not going to prove it here, this
intuitively makes sense: you would expect to have to roll a die around 6 times
until you get your first 1, or to flip a coin twice until you get your first
heads).</p>
<p>Anyhow, for the nth comic, the expected number of clicks is ($ 1500/(1500-n) $). Adding
these up for each ($ n $), we have this monstrosity:</p>
<p>\[ \sum_{n=1}^{1500} \frac{1500}{n} = \frac{1500}{1500} + \frac{1500}{1499} + \dots + \frac{1500}{2} + \frac{1500}{1} \]</p>
<p>This works out to, on average, 11836 clicks. That’s a lot of clicks.</p>
<p>As common sense dictates, the more times you have clicked “random”, the less
likely it is for you to hit a new comic. And that’s why Randall’s random button
seems biased.</p>
<hr>
<p>One more bit of statistics: if you’ve taken a probability class, you might have
heard of the birthday problem. That is, say you have a party with ($ n $) people.
What is the probability that some pair of people at the party share a birthday?</p>
<p>It turns out that if you have just 23 people, the probability is already 50-50.
This is somewhat counterintuitive; most birthday parties only have one birthday
boy! The fallacy is that the problem isn’t asking if some <em>particular</em> person
shares a birthday with someone else. It’s asking if <em>any</em> two people share a
birthday.</p>
<p>The birthday “paradox” turns out to be important in cryptography, especially
when looking for hash collisions. The number of hashes you need to generate
before you hit a collision is similar to the number of people you need at a
party before some pair shares a birthday—much smaller than what you would
expect.</p>
<p>In terms of xkcd-surfing, this helps answer the question “how many times will I
hit random before I see a repeat?”.</p>
<p>There are plenty of good explanations for the math behind the birthday problem
online (<a href="http://mathworld.wolfram.com/BirthdayProblem.html">Wolfram Mathworld</a>
and <a href="http://en.wikipedia.org/wiki/Birthday_problem">Wikipedia</a>)—but if you
don’t believe the number 23 quoted above, it’s worth spending some time trying
to solve it yourself just to understand what’s really going on (it’s not hard).
I’m just going to dump the formula here without any explanation.</p>
<p>For 1500 comics, the probability that you get a repeat after ($ k $) clicks is:</p>
<p>\[ 1 - \frac{1500!}{(1500-k)!1500^k} \]</p>
<p>Throwing this at WolframAlpha, we see that after only 45 clicks, you have a
50-50 chance of seeing a duplicate comic. Put a different way, <em>there are even
odds that the last 45 comics you have seen contain a duplicate pair somewhere
in there</em>.</p>
<hr>
<p>So we’ve empirically validated that xkcd’s RNG is as close as we can expect for
something statistically random. We’ve also seen two reasons why it feels
biased.</p>
<p>But on a deeper and much more important level, we’ve seen how counterintuitive
and messy the random-number business is, and how statistical facts can trick us
into seeing patterns that aren’t there.</p>
<hr>
<p>P.S. My methodology for these experiments probably not the best, since I have
no formal statistics background. If you want to check out the code used or a
dump of my dataset, leave a comment below and I’ll send it to you.</p>
</section>
<div id="comment-breaker">◊ ◊ ◊</div>
<!--
<section>
<center>(<a id="comment-toggle" href="#comment-toggle">Click to load comments…</a>)</center>
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_shortname = 'comfortablynumbered';
var disqus_identifier = 'xkcd-random.md';
window.addEventListener(
'load',
(function() {
document.getElementById('comment-toggle').addEventListener('click',
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
}),
true);
}),
true
);
</script>
</section>
-->
</article>
<footer id="footer">
<div>
<ul>
<li><a href="http://github.com/Hardmath123">
<span class="octicon octicon-mark-github"></span>
Github</a></li>
<li><a href="feed.xml">
<span class="octicon octicon-rss"></span>
Subscribe (RSS)</a></li>
<li><a href="https://github.com/Hardmath123/hardmath123.github.io/issues/new?title=Hi&body=How's%20it%20going%3F">
<span class="octicon octicon-comment-discussion"></span>
Feedback</a></li>
<li><a href="mailto:contact-at-comfortablynumbered-dot-appspotmail-dot-com">
<span class="octicon octicon-mail-read"></span>
Email me</a></li>
<li><a href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_US">
<span class="octicon octicon-law"></span>
CC BY-NC 3.0</a></li>
<li><a href="appreciation.html">
<span class="octicon octicon-beer"></span>
Other</a></li>
</ul>
</div>
<!--
<span class="sc">ComfortablyNumbered</span> · Hardmath123 (2013) · <a href="feed.xml">RSS Feed</a>
·
<a href="/appreciation.html">
<span class="octicon octicon-beer"></span></a>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_US">
<img
alt="Creative Commons License"
src="http://i.creativecommons.org/l/by-nc/3.0/80x15.png"/>
</a>
-->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-46120535-1', 'hardmath123.github.io');
ga('require', 'displayfeatures');
ga('send', 'pageview');
</script>
</footer>
</body>
</html>