xkcd-random.html

<!DOCTYPE html>
<html>
    <head>
        <link rel="stylesheet" type="text/css" href="/static/base.css"/>
        <title>How Random is xkcd? - Comfortably Numbered</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <meta charset="utf-8"/>
        <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
        <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
        <script>
            MathJax.Hub.Config({
                tex2jax: {inlineMath: [['($','$)']]}
            });
        </script>
    </head>
    <body>
        <header id="header">
            <link rel="stylesheet" href="/octicons/octicons.css">
            <script src="static/cheet.js" charset="utf-8"></script>
            <script src="static/main.js"></script>
            <h1>
                <a href="/"><span class="left-word">Comfortably</span><span class="right-word">Numbered</span></a>
            </h1>
            <!-- You're reading the source to my blog? What do you think?

            Do you have dinner plans? -->
        </header>

        <article id="postcontent" class="centered">
            <section>
                <h2>How Random is xkcd?</h2>
                <h4>Friday, March 20, 2015 &middot; 6 min read</h4>
<p>Apparently Randall Munroe gets a lot of messages saying that the “random”
button on xkcd is biased.</p>
<blockquote>
<p>2015-03-19 16:47:00    <strong>Hobz</strong>    also, Randall, the random button on the xkcd
frontpage is frustratingly un-random</p>
<p>2015-03-19 18:50:52   <strong>~Randall</strong>    it’s random.</p>
<p>2015-03-19 18:50:59   <strong>~Randall</strong>    people contact me constantly to tell me
that it’s not</p>
<p>2015-03-19 18:51:17   <strong>~Randall</strong>    which is a nice illustration of that mental
bias we have</p>
</blockquote>
<p>I thought I would do a little investigating to see just how random xkcd is.</p>
<hr>
<p>Making consistently random numbers (yes, that sounds weird) is really important
in things like cryptography. Unrandom random numbers can cripple an otherwise
secure network. So there’s a surprisingly large amount of work dedicated to
randomness.</p>
<p>There are services like <a href="http://www.random.org">random.org</a> which pride
themselves on randomness, and <a href="http://www.fourmilab.ch/hotbits/">HotBits</a>,
which lets you order random bytes that are generated from radioactive decay. A
lot of applications use <code>/dev/urandom/</code>, which is an OS-level random generator
that uses all sorts of sources of entropy such as network noise, CPU heat, and
the current weather in Kansas.</p>
<p>Unfortunately, it’s <em>really</em> hard to tell whether numbers are random or not.
Of course, patterns <a href="http://boallen.com/random-numbers.html">can creep into random
numbers</a>. But more annoyingly, a
glaringly obvious pattern might just be accidental. My favorite example of this
is the <a href="http://en.wikipedia.org/wiki/Feynman_point">Feynman Point</a>, which is a
series of lots of 9s that appears somewhere in the (very unpredictable) decimal
expansion of pi.</p>
<p><img src="http://imgs.xkcd.com/comics/ayn_random.png" alt="xkcd 1277"></p>
<p>There are a bunch of established ways to test the randomness of a random number
generator (such as the excitingly-named <a href="http://en.wikipedia.org/wiki/Diehard_tests">Diehard
tests</a>). They all test for features
that ostensibly random data should have. For example, a random stream of bits
should have almost as many ones as zeros. Not all tests are that obvious,
though, and statistics can be very slippery and unintuitive when it feels like
it.</p>
<p>NIST (the National Institute of Standards and Technology, who deal with things
like how long an inch is and how to backdoor elliptic curves)
<a href="http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html">publishes</a>
a standard for randomness based on such tests, and distributes software that
runs these tests on datasets.</p>
<p>I wrote a Python program to download 10,000 xkcd-random numbers (yay
<code>requests</code>!), and converted them into bitstrings. Then, I fed them to the NIST
Statistical Test Suite.</p>
<p>The results are below:</p>
<pre><code>
------------------------------------------------------------------------------
RESULTS FOR THE UNIFORMITY OF P-VALUES AND THE PROPORTION OF PASSING SEQUENCES
------------------------------------------------------------------------------
   generator is &lt;data/data.xkcd.long&gt;
------------------------------------------------------------------------------
 C1  C2  C3  C4  C5  C6  C7  C8  C9 C10  P-VALUE  PROPORTION  STATISTICAL TEST
------------------------------------------------------------------------------
  5   8   7  13   9   8  13  12  15  10  0.437274    100/100     Frequency
 10  10  11   9  10  16  10   6   8  10  0.759756     99/100     BlockFrequency
  5  12  13  10   7  10  10  10   9  14  0.699313    100/100     CumulativeSums
  8   3  13   9   9  14  13   8  12  11  0.366918     98/100     CumulativeSums
  8  12  11   3   9   8  17  12   9  11  0.224821    100/100     Runs
  7   8   8   6  15   9  12   9  15  11  0.437274     99/100     LongestRun
  7   8   7  16   0  25   0  25   0  12  0.000000 *  100/100     FFT
  3  10   4  19  15   0  18   6  10  15  0.000009 *  100/100     Serial
  9  14  10   2  14   8   6  10  16  11  0.080519    100/100     Serial
 16   1   5   9   6   0   6   0  10  47  0.000000 *   93/100  *  LinearComplexity
</code></pre><p>The important column here is “Proportion”, which shows the pass rate. They’re
all stellar.</p>
<p>If that isn’t convincing, I ran an obviously nonrandom sample for comparison.
This is what NIST’s STS thinks of the first 100,000 bits of Project Gutenberg’s
<a href="http://www.gutenberg.org/dirs/etext98/2ws1610.txt">plaintext version</a> of
<em>Romeo and Juliet</em>:</p>
<pre><code>
------------------------------------------------------------------------------
RESULTS FOR THE UNIFORMITY OF P-VALUES AND THE PROPORTION OF PASSING SEQUENCES
------------------------------------------------------------------------------
   generator is &lt;data/data.rnj&gt;
------------------------------------------------------------------------------
 C1  C2  C3  C4  C5  C6  C7  C8  C9 C10  P-VALUE  PROPORTION  STATISTICAL TEST
------------------------------------------------------------------------------
 95   3   0   1   0   0   1   0   0   0  0.000000 *   25/100  *  Frequency
 55  14  10   6   6   3   1   1   3   1  0.000000 *   64/100  *  BlockFrequency
 94   3   1   0   0   1   1   0   0   0  0.000000 *   28/100  *  CumulativeSums
 93   4   1   1   0   0   0   0   1   0  0.000000 *   30/100  *  CumulativeSums
 51   7  10  10   4   4   2   5   2   5  0.000000 *   61/100  *  Runs
 90   8   1   1   0   0   0   0   0   0  0.000000 *   44/100  *  LongestRun
 92   2   2   2   0   1   0   1   0   0  0.000000 *   23/100  *  FFT
100   0   0   0   0   0   0   0   0   0  0.000000 *    0/100  *  Serial
100   0   0   0   0   0   0   0   0   0  0.000000 *    0/100  *  Serial
 14   2   2   7  11   0   4   0  11  49  0.000000 *   95/100  *  LinearComplexity
</code></pre><p>Much worse.</p>
<p>I encourage you to play with the STS code. It lets you do all sorts of other
neat things, like testing bitstrings for common “templates” and reporting if
too many are found. It also segfaults all over the place, which is actually
very disturbing considering that it’s technically part of the US government’s
computer security project.</p>
<p>In any case, we’ve established that xkcd’s random generator is reasonably
unpredictable and unbiased. As it happens, they’re using the Mersenne Twister,
which is a well-established pseudorandom generation algorithm.</p>
<hr>
<p>So why does the random number generation appear so biased when we’re idly
refreshing on lazy Sunday nights? Part of it is, of course, human nature. We
like to see patterns everywhere.</p>
<p>But here’s a more concrete, mathematical explanation. The conceptual idea is
that in the beginning, hitting “random” is likelier to hit an unread comic, but
once you’ve seen more and more of them, you get repeats. Let’s try to quantify
this: we’re going to calculate the <em>expected value</em> of the number of times you
need to hit “random” until you have seen every single comic. You may have seen
this problem in the context of “how many times do you need to roll a die until
you have rolled all six faces at least once?”.</p>
<p><a href="http://en.wikipedia.org/wiki/Expected_value">Expected value</a> is the average
value of some random variable if you do an experiment lots of times. For
example, if you roll a die gazillions of time, the average number you’ll get is
($ (1+2+3+4+5+6)/6 = 3.5 $), so that’s the <em>expected</em> value.</p>
<p>We’re going to calculate the expected number of times you hit “random” by
calculating the number of times you need to hit it to get the first, second,
third, and (in general) nth unique comic. Then, because of a useful property of
expected values, we can just add them together until ($ n = 1500 $) (there are
1500 comics published as of right now) to see how long, as of today, this
process would take.</p>
<p>If you’re looking for your ($ n $)th unread comic, each time you hit “random”
you have a ($ 1 - n/1500 $) chance of getting a fresh one. This is a <a href="http://en.wikipedia.org/wiki/Geometric_distribution">geometric
probability distribution</a>,
which is Math for “you keep trying something with a constant probability until
it succeeds”.  For geometric probability distributions, the expected value is
one over the probability (though I’m not going to prove it here, this
intuitively makes sense: you would expect to have to roll a die around 6 times
until you get your first 1, or to flip a coin twice until you get your first
heads).</p>
<p>Anyhow, for the nth comic, the expected number of clicks is ($ 1500/(1500-n) $). Adding
these up for each ($ n $), we have this monstrosity:</p>
<p>\[ \sum_{n=1}^{1500} \frac{1500}{n} = \frac{1500}{1500} + \frac{1500}{1499} + \dots + \frac{1500}{2} + \frac{1500}{1} \]</p>
<p>This works out to, on average, 11836 clicks. That’s a lot of clicks.</p>
<p>As common sense dictates, the more times you have clicked “random”, the less
likely it is for you to hit a new comic. And that’s why Randall’s random button
seems biased.</p>
<hr>
<p>One more bit of statistics: if you’ve taken a probability class, you might have
heard of the birthday problem. That is, say you have a party with ($ n $) people.
What is the probability that some pair of people at the party share a birthday?</p>
<p>It turns out that if you have just 23 people, the probability is already 50-50.
This is somewhat counterintuitive; most birthday parties only have one birthday
boy! The fallacy is that the problem isn’t asking if some <em>particular</em> person
shares a birthday with someone else. It’s asking if <em>any</em> two people share a
birthday.</p>
<p>The birthday “paradox” turns out to be important in cryptography, especially
when looking for hash collisions. The number of hashes you need to generate
before you hit a collision is similar to the number of people you need at a
party before some pair shares a birthday—much smaller than what you would
expect.</p>
<p>In terms of xkcd-surfing, this helps answer the question “how many times will I
hit random before I see a repeat?”.</p>
<p>There are plenty of good explanations for the math behind the birthday problem
online (<a href="http://mathworld.wolfram.com/BirthdayProblem.html">Wolfram Mathworld</a>
and <a href="http://en.wikipedia.org/wiki/Birthday_problem">Wikipedia</a>)—but if you
don’t believe the number 23 quoted above, it’s worth spending some time trying
to solve it yourself just to understand what’s really going on (it’s not hard).
I’m just going to dump the formula here without any explanation.</p>
<p>For 1500 comics, the probability that you get a repeat after ($ k $) clicks is:</p>
<p>\[ 1 - \frac{1500!}{(1500-k)!1500^k} \]</p>
<p>Throwing this at WolframAlpha, we see that after only 45 clicks, you have a
50-50 chance of seeing a duplicate comic. Put a different way, <em>there are even
odds that the last 45 comics you have seen contain a duplicate pair somewhere
in there</em>.</p>
<hr>
<p>So we’ve empirically validated that xkcd’s RNG is as close as we can expect for
something statistically random. We’ve also seen two reasons why it feels
biased.</p>
<p>But on a deeper and much more important level, we’ve seen how counterintuitive
and messy the random-number business is, and how statistical facts can trick us
into seeing patterns that aren’t there.</p>
<hr>
<p>P.S. My methodology for these experiments probably not the best, since I have
no formal statistics background. If you want to check out the code used or a
dump of my dataset, leave a comment below and I’ll send it to you.</p>

            </section>

            <div id="comment-breaker">&loz; &loz; &loz;</div>

<!--
            <section>
                <center>(<a id="comment-toggle" href="#comment-toggle">Click to load comments&hellip;</a>)</center>
                <div id="disqus_thread"></div>
                <script type="text/javascript">
                    var disqus_shortname = 'comfortablynumbered';
                    var disqus_identifier = 'xkcd-random.md';

                    window.addEventListener(
                        'load',
                        (function() {
                            document.getElementById('comment-toggle').addEventListener('click',
                                (function() {
                                    var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
                                    dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
                                    (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
                                }),
                            true);
                        }),
                        true
                    );
                </script>
            </section>
-->
        </article>
        <footer id="footer">
            <div>
                <ul>
                    <li><a href="http://github.com/Hardmath123">
                        <span class="octicon octicon-mark-github"></span>
                        Github</a></li>
                    <li><a href="feed.xml">
                        <span class="octicon octicon-rss"></span>
                        Subscribe (RSS)</a></li>
                    <li><a href="https://github.com/Hardmath123/hardmath123.github.io/issues/new?title=Hi&body=How's%20it%20going%3F">
                        <span class="octicon octicon-comment-discussion"></span>
                        Feedback</a></li>
                    <li><a href="mailto:contact-at-comfortablynumbered-dot-appspotmail-dot-com">
                        <span class="octicon octicon-mail-read"></span>
                        Email me</a></li>
                    <li><a href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_US">
                        <span class="octicon octicon-law"></span>
                        CC BY-NC 3.0</a></li>
                    <li><a href="appreciation.html">
                        <span class="octicon octicon-beer"></span>
                        Other</a></li>
                </ul>
            </div>

            <!--
            <span class="sc">ComfortablyNumbered</span> &middot; Hardmath123 (2013) &middot; <a href="feed.xml">RSS Feed</a>
            &nbsp;&middot;&nbsp;
            <a href="/appreciation.html">
                <span class="octicon octicon-beer"></span></a>
            <a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_US">
                <img
                    alt="Creative Commons License"
                    src="http://i.creativecommons.org/l/by-nc/3.0/80x15.png"/>
            </a>
            -->
            <script>
                (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
                (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
                m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
                })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

                ga('create', 'UA-46120535-1', 'hardmath123.github.io');
                ga('require', 'displayfeatures');
                ga('send', 'pageview');
            </script>
        </footer>

    </body>
</html>