<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://rob-zimmerman.ca/feed.xml" rel="self" type="application/atom+xml" /><link href="https://rob-zimmerman.ca/" rel="alternate" type="text/html" /><updated>2026-05-14T15:31:28-07:00</updated><id>https://rob-zimmerman.ca/feed.xml</id><title type="html">Robert Zimmerman</title><subtitle>personal description</subtitle><author><name>Robert Zimmerman</name><email>robert.zimmerman@alumni.utoronto.ca</email></author><entry><title type="html">Find the Fox with MCMC</title><link href="https://rob-zimmerman.ca/posts/2026-find-the-fox" rel="alternate" type="text/html" title="Find the Fox with MCMC" /><published>2026-01-17T00:00:00-08:00</published><updated>2026-01-17T00:00:00-08:00</updated><id>https://rob-zimmerman.ca/posts/find-the-fox</id><content type="html" xml:base="https://rob-zimmerman.ca/posts/2026-find-the-fox"><![CDATA[<p>$\newcommand{\N}{\mathbb{N}}$
$\newcommand{\R}{\mathbb{R}}$
$\newcommand{\E}{\mathbb{E}}$
$\newcommand{\C}{\mathbb{C}}$
$\renewcommand{\P}{\mathbb{P}}$
$\newcommand{\one}[1]{\boldsymbol{1}_{#1}}$
$\newcommand{\FOX}{\mathrm{FOX}}$
$\newcommand{\XOF}{\mathrm{XOF}}$
$\newcommand{\F}{\mathrm{F}}$
$\renewcommand{\O}{\mathrm{O}}$
$\newcommand{\X}{\mathrm{X}}$</p>

<p>Since it was published in 2024, Alex Cheddar’s book <a href="https://alexcheddar.com/"><i>Find the Fox: the Almost Impossible Word Search</i></a> has become quite popular for its difficulty and novelty. If you haven’t heard of this book, the concept is very simple: it’s a word search where the grid consists solely of the letters $\F$, $\O$, and $\X$. There’s only one word to find: FOX, which appears once among all 200 pages of the book. As is standard in word searches, the string can appear horizontally, vertically, or diagonally, and either forward or reversed. Here’s the first page of the book, extracted from the Amazon preview:</p>

<p><img src="/files/blog/find_the_fox/FtF_amazon.jpg" alt="Find the Fox p.1 (screenshot from Amazon.com)" /></p>

<p>Upon a quick inspection, the letters seem to be fairly uniformly distributed – conditional, of course, on the string FOX not appearing (it’s definitely not on this page). This led me to think about how the book would have been generated (I certainly hope that Mr. Cheddar didn’t write the book<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> manually, letter by letter). Each of the 200 pages consists of a $32 \times 20$ grid. Assuming the letters really are randomly generated, the book can be said to have been sampled from the $\mathrm{Unif}(\{\F,\O,\X\}^{32 \times 20 \times 200})$ distribution conditional on the string $\FOX$ only appearing once. Sampling from this conditional distribution is not trivial. In this post, we’ll develop a method to do so using MCMC.</p>

<p>First of all, we can probably agree that any reasonable method will first sample from the distribution in which the string doesn’t appear anywhere, and then randomly choose a spot to insert it. The essential strategy behind the last part is fairly straightforward: we simply overwrite some $3$-letter segment with $\FOX$, and then ensure that we haven’t inadvertently created additional $\FOX$s in the process. So for now, let’s focus on generating a single grid without any $\FOX$. Call a grid $\FOX$less if it doesn’t contain the string $\FOX$ anywhere (in any direction, forward or reversed), and let $\Omega$ be the set of $\FOX$less $32 \times 20$ grids. We want to sample from the $\text{Unif}(\Omega)$ distribution, which we’ll call $\pi$ (as per tradition).</p>

<h2 id="exact-sampling">Exact Sampling</h2>

<p>In principle, the simplest way to sample from $\pi$ is by rejection sampling: that is, we keep sampling from the $\mathrm{Unif}(\{\F,\O,\X\}^{32 \times 20})$ distribution until we produce a grid in $\Omega$. The only challenge here is coding up a $\FOX$less grid checker, which is more of a programming exercise than anything else. Here’s a way to do it:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span>

<span class="n">letters</span> <span class="o">=</span> <span class="p">(</span><span class="s">"F"</span><span class="p">,</span> <span class="s">"O"</span><span class="p">,</span> <span class="s">"X"</span><span class="p">)</span>
<span class="n">height</span><span class="p">,</span> <span class="n">width</span> <span class="o">=</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">20</span>

<span class="c1"># 8 directions (dr, dc)
</span><span class="n">dirs</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span>
        <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">)]</span>

<span class="k">def</span> <span class="nf">in_bounds</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">c</span><span class="p">):</span>
    <span class="k">return</span> <span class="mi">0</span> <span class="o">&lt;=</span> <span class="n">r</span> <span class="o">&lt;</span> <span class="n">height</span> <span class="ow">and</span> <span class="mi">0</span> <span class="o">&lt;=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">width</span>

<span class="k">def</span> <span class="nf">violates_local</span><span class="p">(</span><span class="n">grid</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">directions</span><span class="p">):</span>
    <span class="c1"># check whether FOX appears in any 3-cell segment that includes (r,c), along any direction
</span>    <span class="k">for</span> <span class="n">dr</span><span class="p">,</span> <span class="n">dc</span> <span class="ow">in</span> <span class="n">directions</span><span class="p">:</span>
        <span class="c1"># the triple can be (r-2dr,r-dr,r), (r-dr,r,r+dr), or (r,r+dr,r+2dr)
</span>        <span class="k">for</span> <span class="n">offset</span> <span class="ow">in</span> <span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">):</span>
            <span class="n">coords</span> <span class="o">=</span> <span class="p">[(</span><span class="n">r</span> <span class="o">+</span> <span class="p">(</span><span class="n">offset</span> <span class="o">+</span> <span class="n">k</span><span class="p">)</span><span class="o">*</span><span class="n">dr</span><span class="p">,</span> <span class="n">c</span> <span class="o">+</span> <span class="p">(</span><span class="n">offset</span> <span class="o">+</span> <span class="n">k</span><span class="p">)</span><span class="o">*</span><span class="n">dc</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)]</span>
            <span class="k">if</span> <span class="nb">all</span><span class="p">(</span><span class="n">in_bounds</span><span class="p">(</span><span class="n">rr</span><span class="p">,</span> <span class="n">cc</span><span class="p">)</span> <span class="k">for</span> <span class="n">rr</span><span class="p">,</span> <span class="n">cc</span> <span class="ow">in</span> <span class="n">coords</span><span class="p">):</span>
                <span class="n">triple</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">grid</span><span class="p">[</span><span class="n">rr</span><span class="p">][</span><span class="n">cc</span><span class="p">]</span> <span class="k">for</span> <span class="n">rr</span><span class="p">,</span> <span class="n">cc</span> <span class="ow">in</span> <span class="n">coords</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">triple</span> <span class="ow">in</span> <span class="p">(</span><span class="n">letters</span><span class="p">,</span> <span class="n">letters</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]):</span>
                    <span class="k">return</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="bp">False</span>

<span class="k">def</span> <span class="nf">is_FOXless</span><span class="p">(</span><span class="n">grid</span><span class="p">):</span>
    <span class="c1"># full scan
</span>    <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">height</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">width</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">violates_local</span><span class="p">(</span><span class="n">grid</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">dirs</span><span class="p">):</span>
                <span class="k">return</span> <span class="bp">False</span>
    <span class="k">return</span> <span class="bp">True</span>
</code></pre></div></div>

<p>The rejection sampler then follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1729</span><span class="p">)</span>

<span class="n">FOXlessgrid</span> <span class="o">=</span> <span class="bp">False</span>

<span class="k">while</span> <span class="ow">not</span> <span class="n">FOXlessgrid</span><span class="p">:</span>
  <span class="n">newgrid</span> <span class="o">=</span> <span class="p">[[</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">letters</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">width</span><span class="p">)]</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">height</span><span class="p">)]</span>
  <span class="n">FOXlessgrid</span> <span class="o">=</span> <span class="n">is_FOXless</span><span class="p">(</span><span class="n">newgrid</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">newgrid</span><span class="p">)</span>  
</code></pre></div></div>

<p>Because the proposal distribution is uniform over all $3^{32 \cdot 20}$ grids, the accepted sample is exactly uniform over the $\FOX$less ones. So in theory, this works! You can try running this if you want, but I wouldn’t recommend it. Why? The problem is the acceptance rate $p_\text{acc}$. Computing the exact acceptance rate is essentially an inclusion-exclusion/transfer-matrix counting problem which explodes combinatorially in $2$ dimensions, but we can compute a rigorous upper bound.</p>

<p>Let $N$ be the number of length-$3$ segments we need to check. Checking whether a grid is $\FOX$less is equivalent to checking every segment along the four “forward” directions (→, ↓, ↘, ↙) for the strings $\FOX$ and $\XOF$, and the number of such segments is</p>

\[N = 20(32 - 2) + (20-2)32 + 2(32-2)(20-2) = 2256. \notag\]

<p>For each segment, we’re forbidding $2$ out of $3^3$ patterns, so the probability that a single segment is forbidden is $2/27$. The expected number of forbidden segments in a random grid is then $2256 \cdot (2/27) = 167.1111\ldots$. For a quick upper bound on the acceptance probability, we can consider only disjoint segments. In each row of length $20$ we can choose $6$ disjoint horizontal triples (covering $18$ cells), which across $32$ rows comes to $192$ independent triples. If the grid is $\FOX$less, then none of these triples can be $\FOX$ or $\XOF$, and hence</p>

\[p_\text{acc} \leq \left(1 - \frac{2}{27}\right)^{192} \approx 3.8251 \times 10^{-7}.\notag\]
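
<p>The arithmetic above is easy to double-check numerically. Here’s a minimal sketch; the variable names are just illustrative choices and aren’t used anywhere else in the post.</p>

```python
height, width = 32, 20

# count length-3 segments along the four "forward" directions
horizontal = height * (width - 2)
vertical = (height - 2) * width
diagonal = 2 * (height - 2) * (width - 2)
n_segments = horizontal + vertical + diagonal

p_forbidden = 2 / 27            # FOX or XOF out of 3**3 equally likely patterns
mu = n_segments * p_forbidden   # expected number of forbidden segments

# disjoint horizontal triples: floor(width / 3) per row, independent of each other
n_disjoint = height * (width // 3)
p_acc_bound = (1 - p_forbidden) ** n_disjoint

print(n_segments, round(mu, 4), n_disjoint, p_acc_bound)
```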

<p>So the acceptance rate is at most about one in 2.6 million, which doesn’t seem horrible if we’re running our code on a peppy, multithreaded processor and we’re willing to wait for a few days. In fact, this bound is misleadingly optimistic — it overstates the probability by a factor of over 230 quadrillion.</p>

<p>To show this, we can invoke <i>Janson’s inequality</i>,<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> which provides exponential upper bounds on the probability that none of a large collection of weakly dependent “bad” events occurs, in terms of their expected count and the sum of their pairwise dependencies. Let $A_i$ be the event that segment $i$ is $\FOX$ or $\XOF$. Then the events $A_i$ and $A_j$ are independent unless segments $i$ and $j$ share at least one cell. Define</p>

\[X := \sum_{i=1}^m \mathbf{1}_{A_i}, \qquad \mu := \E[X] = \sum_{i=1}^m \P(A_i), \qquad \text{and} \qquad \Delta := \sum_{\substack{1 \leq i &lt; j \leq m\\ i \sim j}} \P(A_i \cap A_j),\notag\]

<p>where $i \sim j$ means that segments $i$ and $j$ are distinct and share at least one cell, and $m = N = 2256$ is the total number of segments. The probability that we want is $p_\text{acc} = \P(X = 0)$, which, according to Janson’s inequality,<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> satisfies</p>

\[\P(X = 0) \leq \exp\left(-\frac{\mu^2}{\mu + 2\Delta} \right). \notag\]

<p>How can we compute $\mu$ and $\Delta$? Observe that $\mu$ is simply the expected number of forbidden segments in the random grid, which we computed above as $167.111\ldots$. To compute $\Delta$, note that for each pair $(i,j)$, there are $4$ pattern combinations:</p>

\[C := \{(\FOX, \FOX), (\FOX, \XOF), (\XOF, \FOX), (\XOF, \XOF)\}.\notag\]

<p>For each combination, we have a set of letter constraints on the union of the cells used by the two segments. If the constraints are inconsistent (e.g., the same cell is required to be both $\F$ and $\X$), then that combination has probability $0$. Otherwise, the union involves $m_{ij} \in \{4,5\}$ distinct cells and, because the letters are independent and uniform, the combination has probability $3^{-m_{ij}}$. The four combinations are mutually exclusive, so</p>

\[\mathbb{P}(A_i \cap A_j)
= \sum_{c \in C} \mathbf{1}_{\{ \text{combination } c \text{ is consistent for } (i,j) \}}\, \cdot 3^{-m_{ij}}. \notag\]
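
<p>As a small worked illustration of this formula (the particular segments are chosen only as an example), take the two horizontal segments $i = \{(r,c),(r,c+1),(r,c+2)\}$ and $j = \{(r,c+2),(r,c+3),(r,c+4)\}$, which share only the cell $(r,c+2)$, so that $m_{ij} = 5$. The shared cell is the last letter of segment $i$ and the first letter of segment $j$, so only the combinations $(\FOX, \XOF)$ and $(\XOF, \FOX)$ are consistent, and</p>

\[\P(A_i \cap A_j) = 2 \cdot 3^{-5} = \frac{2}{243}. \notag\]

<p>By contrast, two distinct segments can only share two cells (so that $m_{ij} = 4$) if they are collinear and shifted by one cell. In that case the middle $\O$ of one segment lands on an end cell of the other, which would need to be $\F$ or $\X$, so none of the four combinations is consistent and the pair contributes $0$ to $\Delta$.</p>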

<p>We thus compute:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">combinations</span><span class="p">,</span> <span class="n">product</span>

<span class="c1"># generate all length-3 segments
</span><span class="n">segments</span> <span class="o">=</span> <span class="p">[</span>
    <span class="nb">tuple</span><span class="p">((</span><span class="n">r</span> <span class="o">+</span> <span class="n">k</span><span class="o">*</span><span class="n">dr</span><span class="p">,</span> <span class="n">c</span> <span class="o">+</span> <span class="n">k</span><span class="o">*</span><span class="n">dc</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">height</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">width</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">dr</span><span class="p">,</span> <span class="n">dc</span> <span class="ow">in</span> <span class="n">dirs</span><span class="p">[:</span><span class="mi">4</span><span class="p">]</span>
    <span class="k">if</span> <span class="nb">all</span><span class="p">(</span><span class="n">in_bounds</span><span class="p">(</span><span class="n">r</span> <span class="o">+</span> <span class="n">k</span><span class="o">*</span><span class="n">dr</span><span class="p">,</span> <span class="n">c</span> <span class="o">+</span> <span class="n">k</span><span class="o">*</span><span class="n">dc</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
<span class="p">]</span>

<span class="c1"># index segments by cell
</span><span class="n">cell_to_segs</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">seg</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">segments</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">cell</span> <span class="ow">in</span> <span class="n">seg</span><span class="p">:</span>
        <span class="n">cell_to_segs</span><span class="p">[</span><span class="n">cell</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>

<span class="c1"># get all overlapping segment pairs
</span><span class="n">pairs</span> <span class="o">=</span> <span class="p">{</span>
    <span class="nb">tuple</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">p</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">idxs</span> <span class="ow">in</span> <span class="n">cell_to_segs</span><span class="p">.</span><span class="n">values</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">combinations</span><span class="p">(</span><span class="n">idxs</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="p">}</span>

<span class="c1"># compute probability that two segments are both FOX/XOF
</span><span class="k">def</span> <span class="nf">pair_prob</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">):</span>
    <span class="n">total</span> <span class="o">=</span> <span class="mf">0.0</span>
    <span class="k">for</span> <span class="n">p1</span><span class="p">,</span> <span class="n">p2</span> <span class="ow">in</span> <span class="n">product</span><span class="p">((</span><span class="n">letters</span><span class="p">,</span> <span class="n">letters</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]),</span> <span class="n">repeat</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
        <span class="n">req</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="n">ok</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">cell</span><span class="p">,</span> <span class="n">ch</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">p1</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">cell</span> <span class="ow">in</span> <span class="n">req</span> <span class="ow">and</span> <span class="n">req</span><span class="p">[</span><span class="n">cell</span><span class="p">]</span> <span class="o">!=</span> <span class="n">ch</span><span class="p">:</span>
                <span class="n">ok</span> <span class="o">=</span> <span class="bp">False</span><span class="p">;</span> <span class="k">break</span>
            <span class="n">req</span><span class="p">[</span><span class="n">cell</span><span class="p">]</span> <span class="o">=</span> <span class="n">ch</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">cell</span><span class="p">,</span> <span class="n">ch</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">s2</span><span class="p">,</span> <span class="n">p2</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">cell</span> <span class="ow">in</span> <span class="n">req</span> <span class="ow">and</span> <span class="n">req</span><span class="p">[</span><span class="n">cell</span><span class="p">]</span> <span class="o">!=</span> <span class="n">ch</span><span class="p">:</span>
                <span class="n">ok</span> <span class="o">=</span> <span class="bp">False</span><span class="p">;</span> <span class="k">break</span>
            <span class="n">req</span><span class="p">[</span><span class="n">cell</span><span class="p">]</span> <span class="o">=</span> <span class="n">ch</span>
        <span class="k">if</span> <span class="n">ok</span><span class="p">:</span>
            <span class="n">total</span> <span class="o">+=</span> <span class="mi">3</span> <span class="o">**</span> <span class="p">(</span><span class="o">-</span><span class="nb">len</span><span class="p">(</span><span class="n">req</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">total</span>

<span class="n">Delta</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">pair_prob</span><span class="p">(</span><span class="n">segments</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">segments</span><span class="p">[</span><span class="n">j</span><span class="p">])</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">pairs</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">Delta</span><span class="p">)</span>
</code></pre></div></div>

<p>This gives $\Delta \approx 171.4567$. Hence $p_\text{acc} \leq e^{-54.7545} \approx 1.6613 \times 10^{-24}$. Conclusion: don’t use the rejection sampler.</p>
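
<p>To sanity-check those last numbers, here’s a short sketch that plugs $\mu$ and $\Delta$ into the bound. Note that <code>Delta</code> is simply copied by hand from the output reported above rather than recomputed, so treat it as an assumption of this snippet.</p>

```python
import math

mu = 2256 * 2 / 27    # expected number of forbidden segments
Delta = 171.4567      # copied from the value reported in the text (assumption)

exponent = mu**2 / (mu + 2 * Delta)
janson_bound = math.exp(-exponent)

# the cruder bound from disjoint horizontal triples, for comparison
disjoint_bound = (1 - 2 / 27) ** 192

print(exponent, janson_bound, disjoint_bound / janson_bound)
```

<p>The last printed number is the factor by which the disjoint-triples bound overstates the Janson bound.</p>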

<p>There’s another way to sample exactly from our target distribution, this time using a bit of dynamic programming. To start off, define a scanning order (say row-by-row). At each step, to choose the next letter uniformly, we count how many completions exist if we put $\F$ here, how many exist if we put $\O$, and how many exist if we put $\X$. Then we sample from the three letters with probabilities proportional to those completion counts. Seems simple enough! The catch is that for length-$3$ strings in $8$ directions, whether a choice is $\FOX$less depends on a “neighborhood” of “radius” $2$, and the state we need to remember while sweeping is essentially the previous two rows (across their full widths), plus the last two letters of the current row we’re moving across. The number of states is on the order of $3^{2w}$, where $w$ is the width of the grid. For $w = 20$, it’s hopeless.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>
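
<p>To make the idea concrete in a setting where it <i>is</i> tractable, here’s a sketch of the same completion-counting scheme for a single row, where the only state we need to remember is the last two letters. This is a one-dimensional toy and not the full two-dimensional sampler; the helper names <code>completions</code> and <code>sample_row</code> are hypothetical.</p>

```python
import random
from functools import lru_cache

LETTERS = "FOX"
FORBIDDEN = {"FOX", "XOF"}

@lru_cache(maxsize=None)
def completions(last_two, remaining):
    """Count the ways to append `remaining` letters after a prefix ending in
    `last_two` without ever creating FOX or XOF."""
    if remaining == 0:
        return 1
    total = 0
    for ch in LETTERS:
        if last_two + ch not in FORBIDDEN:
            total += completions((last_two + ch)[-2:], remaining - 1)
    return total

def sample_row(n, rng):
    """Sample a uniformly random FOXless row of length n, letter by letter."""
    row = ""
    for pos in range(n):
        last_two = row[-2:]
        candidates = [ch for ch in LETTERS if last_two + ch not in FORBIDDEN]
        # weight each letter by the number of valid completions it leaves open
        weights = [completions((last_two + ch)[-2:], n - pos - 1) for ch in candidates]
        row += rng.choices(candidates, weights=weights)[0]
    return row

rng = random.Random(1729)
print(sample_row(20, rng))
print(completions("", 20))  # number of FOXless rows of length 20
```

<p>The product of the conditional probabilities telescopes to one over the total count, which is why every $\FOX$less row is equally likely. In two dimensions the same recipe applies, but the memoized state becomes the two-row window described above.</p>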

<h2 id="approximate-sampling">Approximate Sampling</h2>

<p>Instead of trying to sample from $\pi$ directly, what if we start off with <i>some</i> $\FOX$less grid, and then modify it so that it looks like it came from $\pi$? This is where we can exploit MCMC. We will define a symmetric random walk on $\Omega$, where each step makes a tiny local random change but does <i>not</i> violate the constraint. Generating a $\FOX$less grid (call it $G_0 \in \Omega$) is easy for initialization purposes: for example, the grid consisting entirely of $\F$s is perfectly $\FOX$less. Let’s now construct our Markov chain. At the $t$th iteration of our sampler, we’ll do the following:</p>
<ol>
  <li>Choose a cell $(i,j)$ in $G_t$ uniformly at random</li>
  <li>Propose changing the letter in that cell to one of the other two letters (chosen uniformly)</li>
  <li>If the resulting grid is still $\FOX$less (i.e., we didn’t create a new $\FOX$), then accept the move and call the new grid $G_{t+1}$; otherwise, reject the proposed change and set $G_{t+1} = G_t$</li>
</ol>

<p>The resulting chain $\{G_t\}_t$ is obviously time-homogeneous. But is $\pi$ really stationary for this Markov chain? Let $G, G' \in \Omega$ differ in exactly one cell (say the $k$th), where $G'$ is obtained from $G$ by changing that cell from letter $a$ to letter $b$. The probability of moving from $G$ to $G'$ is</p>

\[\begin{align*}
\P(G \to G') &amp;= \P(\mbox{we pick cell $k$}) \cdot \P(\mbox{we propose letter $b$}) \cdot \P(G' \in \Omega)\\
&amp;=\frac{1}{32 \cdot 20} \cdot \frac{1}{2} \cdot 1\\
&amp;=\frac{1}{1280}
\end{align*}\]

<p>and the probability of the reverse move is exactly the same. Since $\pi(G) = 1/\lvert\Omega\rvert$ for any $G \in \Omega$, we have</p>

\[\pi(G) \cdot \P(G \to G') = \pi(G') \cdot \P(G' \to G).\notag\]

<p>So the detailed balance condition is satisfied, and $\pi$ is indeed stationary for our Markov chain.</p>
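
<p>If you’d like to see this verified by brute force, here’s a sketch on a toy $1 \times 4$ grid, where $\Omega$ is small enough to enumerate and the transition probabilities can be computed with exact fractions. The grid size and helper names are assumptions made purely for illustration.</p>

```python
from fractions import Fraction
from itertools import product

LETTERS = "FOX"
n = 4  # a single row of 4 cells, so only horizontal segments exist

def is_foxless(row):
    s = "".join(row)
    return "FOX" not in s and "XOF" not in s

omega = [g for g in product(LETTERS, repeat=n) if is_foxless(g)]
pi = {g: Fraction(1, len(omega)) for g in omega}

def transition_probs(g):
    """One step of the chain from state g, as {next_state: probability}."""
    probs = {}
    for k in range(n):
        for b in LETTERS:
            if b == g[k]:
                continue
            proposal = g[:k] + (b,) + g[k + 1:]
            # a rejected proposal leaves the chain where it is
            target = proposal if is_foxless(proposal) else g
            probs[target] = probs.get(target, Fraction(0)) + Fraction(1, 2 * n)
    return probs

P = {g: transition_probs(g) for g in omega}

# detailed balance: pi(G) P(G, G') == pi(G') P(G', G) for every pair
detailed_balance = all(
    pi[g] * p == pi[h] * P[h].get(g, Fraction(0))
    for g in omega for h, p in P[g].items()
)

# stationarity: pushing pi through one step of the chain returns pi
pi_next = {g: Fraction(0) for g in omega}
for g in omega:
    for h, p in P[g].items():
        pi_next[h] += pi[g] * p
stationary = pi_next == pi

print(len(omega), detailed_balance, stationary)
```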

<p>What about aperiodicity? If we form a graph $\mathcal{G}$ whose vertices are grids in $\Omega$ and draw an edge between vertices $G$ and $G'$ if and only if the grids differ in exactly one cell, then we can view our algorithm as a random walk on $\mathcal{G}$. A random walk on a connected undirected graph is aperiodic if and only if the graph is non-bipartite, and for $\mathcal{G}$ to be non-bipartite, it suffices to find an odd cycle. But that’s easy! Start from the all-$\F$ grid, then change the first cell to an $\O$, then to an $\X$, and then back to an $\F$. All of these grids are clearly in $\Omega$, so we’ve exhibited a cycle of length $3$ in $\mathcal{G}$ (i.e., a triangle), and our Markov chain is aperiodic.</p>

<p>The only tricky bit is irreducibility. In order to guarantee that the law of $G_t$ will actually converge to $\pi$ as $t \to \infty$, we need to show that our chain is irreducible: any $\FOX$less grid should be reachable from any other via $\FOX$less single-cell flips. Equivalently, we need to show that $\mathcal{G}$ is connected. Fortunately, with some care we can prove this. To be general, we’ll prove the result for any grid size.</p>

<p>For some setup, fix integers $h,w \geq 1$ and identify grid cells with coordinates $(r,c)$ where $r \in \{1,\ldots,h\}$ and $c \in \{1,\ldots,w\}$. A length-$3$ line segment is any triple of distinct cells of the form</p>

\[(r,c), (r + \delta_r, c + \delta_c), (r + 2\delta_r, c + 2 \delta_c)\notag\]

<p>where</p>

\[(\delta_r, \delta_c) \in \{(-1,0), (1,0), (0,-1), (0,1), (-1,-1), (-1,1), (1,-1), (1,1)\}\notag\]

<p>and all three cells lie in the grid. Define a <i>$\FOX$less single-cell flip</i> to be an operation that changes the letter in exactly one cell and results in another grid in $\Omega$. We order the cells in row-major order: $(r,c) \prec (r',c')$ if either $r &lt; r'$, or $r = r'$ and $c &lt; c'$. Let $C_1, C_2, \ldots, C_{hw}$ denote the cells in this order, and write $G(C_i)$ for the letter in the $i$th cell of $G$.</p>

<blockquote>
  <p><b>Proposition:</b> For every $G \in \Omega$, there exists a sequence of $\FOX$less single-cell flips that transforms $G$ into the all-$\O$ grid. Consequently, $\mathcal{G}$ is connected.</p>
</blockquote>

<p><i>Proof:</i> We will explicitly construct a $\FOX$less path from an arbitrary $G \in \Omega$ to the all-$\O$ grid. We process cells in row-major order. For $i = 1, 2,\ldots, hw$, at step $i$ we change the value of cell $C_i$ to $\O$ if it’s not already $\O$. Let $G^{(i)}$ denote the grid after step $i$, with $G^{(0)} := G$. It is immediate that $G^{(i)}(C_j) = \O$ for all $j \leq i$ (i.e., this is an invariant). We will show by induction that $G^{(i)} \in \Omega$. We already know that $G^{(0)} \in \Omega$, so suppose that $G^{(i-1)} \in \Omega$ and consider the transition $G^{(i-1)} \to G^{(i)}$, where we set $C_i$ to $\O$. Any newly created forbidden pattern $\FOX$ or $\XOF$ would need to involve $C_i$, since all other cells are unchanged. Moreover, $\O$ must appear in the middle of such a triple. Therefore, if setting $C_i$ to $\O$ creates a forbidden pattern, then $C_i$ must be the center cell of some length-$3$ line segment whose two opposite neighbors have labels $\F$ and $\X$ in $G^{(i)}$.</p>

<p>It thus suffices to show that after the update, $C_i$ cannot have opposite neighbors $\F$ and $\X$ along any of the four lines through $C_i$ (horizontal, vertical, and the two diagonals). Let $C_i = (r_i, c_i)$. Consider any pair of opposite neighbors of $C_i$ along a line segment of length $3$ (when such neighbors exist within the grid). These opposite pairs are:</p>

<ul>
  <li>Horizontal: $(r_i, c_i-1)$ and $(r_i, c_i+1)$</li>
  <li>Vertical: $(r_i-1, c_i)$ and $(r_i+1, c_i)$</li>
  <li>Diagonal NW-SE: $(r_i-1, c_i-1)$ and $(r_i+1, c_i+1)$</li>
  <li>Diagonal NE-SW: $(r_i-1, c_i+1)$ and $(r_i+1, c_i-1)$</li>
</ul>

<p>In every case, one of these two neighbors either lies in a strictly earlier row than $C_i$, or in the same row but an earlier column. Concretely:</p>

<ul>
  <li>In the horizontal pair, $(r_i, c_i-1) \prec (r_i,c_i)$</li>
  <li>In the vertical pair, $(r_i-1, c_i) \prec (r_i,c_i)$</li>
  <li>In the diagonal NW-SE pair, $(r_i-1, c_i-1) \prec (r_i,c_i)$</li>
  <li>In the diagonal NE-SW pair, $(r_i-1, c_i+1) \prec (r_i,c_i)$</li>
</ul>

<p>Thus, whenever a length-$3$ segment centered at $C_i$ exists, at least one of the two opposite neighbors is some $C_j$ with $j &lt; i$. By the induction invariant applied at step $i-1$, that neighbor is already $\O$ in $G^{(i-1)}$, and it remains $\O$ in $G^{(i)}$ since we only changed $C_i$. Therefore, in $G^{(i)}$ every existing opposite-neighbor pair around $C_i$ contains at least one $\O$. In particular, it is impossible for the two opposite neighbors to be $\F$ and $\X$ (in some order). Hence no forbidden pattern can be centered at $C_i$, and so setting $C_i$ to $\O$ cannot create a forbidden pattern. We conclude that the move $G^{(i-1)} \to G^{(i)}$ is $\FOX$less; that is, $G^{(i)} \in \Omega$. This completes the induction step. After $hw$ steps, every cell has been set to $\O$, so $G^{(hw)}$ is the all-$\O$ grid. This proves that every $G \in \Omega$ can be transformed to the all-$\O$ grid through a sequence of $\FOX$less single-cell flips.</p>

<p>Now, take any two grids $G, G' \in \Omega$. By the previous construction, there exists a $\FOX$less path from $G$ to the all-$\O$ grid, and likewise from $G'$ to the all-$\O$ grid. Each step of a path can be undone by another single-cell flip, and every grid along the path lies in $\Omega$, so reversing the second path gives a $\FOX$less path from the all-$\O$ grid to $G'$. Concatenating the first path and the reversed second path yields a $\FOX$less path from $G$ to $G'$. Thus $\mathcal{G}$ is connected. $\square$</p>
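<p>As a sanity check on the argument above, here is a minimal sketch that sweeps a small $\FOX$less grid to the all-$\O$ grid in row-major order and verifies that no intermediate grid contains a $\FOX$. The helper names (<code>has_fox</code>, <code>random_foxless_grid</code>, <code>sweep_to_all_o</code>) and the $6 \times 7$ size are assumptions made for illustration, and the check rescans the whole grid rather than using the local test from the sampler.</p>

```python
import random

LETTERS = ("F", "O", "X")                  # illustrative names, not the post's globals
DIRS = ((0, 1), (1, 0), (1, 1), (1, -1))   # right, down, down-right, down-left

def has_fox(grid):
    """Return True if any length-3 segment reads FOX or XOF."""
    h, w = len(grid), len(grid[0])
    for r in range(h):
        for c in range(w):
            for dr, dc in DIRS:
                r2, c2 = r + 2 * dr, c + 2 * dc
                if r2 not in range(h) or c2 not in range(w):
                    continue  # segment would leave the grid
                triple = (grid[r][c], grid[r + dr][c + dc], grid[r2][c2])
                if triple in (LETTERS, LETTERS[::-1]):
                    return True
    return False

def random_foxless_grid(h, w, flips, rng):
    """Random FOXless grid: start at all-F and reject any flip that creates a FOX."""
    grid = [["F"] * w for _ in range(h)]
    for _ in range(flips):
        r, c = rng.randrange(h), rng.randrange(w)
        old = grid[r][c]
        grid[r][c] = rng.choice(LETTERS)
        if has_fox(grid):
            grid[r][c] = old
    return grid

def sweep_to_all_o(grid):
    """Set cells to O in row-major order; True if every intermediate grid is FOXless."""
    for r in range(len(grid)):
        for c in range(len(grid[0])):
            grid[r][c] = "O"
            if has_fox(grid):
                return False
    return True

rng = random.Random(0)
start = random_foxless_grid(6, 7, 500, rng)
path_ok = sweep_to_all_o(start)
print(path_ok)
```

<p>By the induction argument, the sweep should succeed from any $\FOX$less starting grid, so the printed value should be <code>True</code> whatever the seed.</p>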

<p>So our Markov chain is irreducible and aperiodic, and therefore has a unique stationary distribution. Since $\pi$ is stationary (as we verified from the detailed balance condition), it is the unique stationary distribution, and for any starting state $G_0 \in \Omega$, we have $\mathcal{L}(G_t) \to \pi$ as $t \to \infty$ by the standard finite-state Markov chain convergence theorem.</p>

<p>It remains to actually code up our sampler. We’ll go for $1{\small,}100{\small,}000$ iterations, burn off the first $50{\small,}000$, and thin by keeping only every $50{\small,}000$th state after that. These numbers may seem large compared to what you’d often see in simpler applications (especially in continuous-space settings), but remember that our Markov chain transitions are <i>local</i> moves in an enormous, heavily constrained state space, so mixing is necessarily slow. Changing the large-scale features of a grid requires a lot of accepted local moves, so we’ll need a lot of iterations before the chain forgets its initialization and reaches more typical configurations. We thin aggressively because successive states of the chain are highly correlated: each iteration proposes a change at a single cell, so even an accepted move alters only one of $32 \times 20 = 640$ cells. As a result, many iterations are required before the chain produces a meaningfully different grid, and a large thinning factor gives us approximately independent samples; similarly, a large burn-in period allows us to confidently move past the all-$\F$ initialization grid.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="n">grid</span><span class="p">,</span> <span class="n">lazy_p</span><span class="o">=</span><span class="mf">0.5</span><span class="p">):</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="n">height</span><span class="p">)</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="n">width</span><span class="p">)</span>
    <span class="n">old</span> <span class="o">=</span> <span class="n">grid</span><span class="p">[</span><span class="n">r</span><span class="p">][</span><span class="n">c</span><span class="p">]</span>
    <span class="n">new</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">([</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">letters</span> <span class="k">if</span> <span class="n">x</span> <span class="o">!=</span> <span class="n">old</span><span class="p">])</span>

    <span class="n">grid</span><span class="p">[</span><span class="n">r</span><span class="p">][</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">new</span>
    <span class="k">if</span> <span class="n">violates_local</span><span class="p">(</span><span class="n">grid</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">dirs</span><span class="p">[:</span><span class="mi">4</span><span class="p">]):</span>  <span class="c1"># reject if we created a forbidden triple
</span>        <span class="n">grid</span><span class="p">[</span><span class="n">r</span><span class="p">][</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">old</span>

<span class="k">def</span> <span class="nf">run_chain</span><span class="p">(</span><span class="n">steps</span><span class="o">=</span><span class="mi">1_100_000</span><span class="p">,</span> <span class="n">burn</span><span class="o">=</span><span class="mi">50_000</span><span class="p">,</span> <span class="n">thin</span><span class="o">=</span><span class="mi">50_000</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">seed</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>

    <span class="c1"># start from an easy $\FOX$less state (all F)
</span>    <span class="n">grid</span> <span class="o">=</span> <span class="p">[[</span><span class="s">"F"</span><span class="p">]</span> <span class="o">*</span> <span class="n">width</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">height</span><span class="p">)]</span>

    <span class="n">snapshots</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">steps</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
        <span class="n">step</span><span class="p">(</span><span class="n">grid</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">t</span> <span class="o">&gt;=</span> <span class="n">burn</span> <span class="ow">and</span> <span class="p">(</span><span class="n">t</span> <span class="o">-</span> <span class="n">burn</span><span class="p">)</span> <span class="o">%</span> <span class="n">thin</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
            <span class="n">snapshots</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">row</span><span class="p">)</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">grid</span><span class="p">])</span>

    <span class="k">return</span> <span class="n">snapshots</span>


<span class="n">pages</span> <span class="o">=</span> <span class="n">run_chain</span><span class="p">(</span><span class="n">seed</span><span class="o">=</span><span class="mi">1729</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">pages</span><span class="p">[</span><span class="mi">0</span><span class="p">][:</span><span class="mi">10</span><span class="p">]))</span>
</code></pre></div></div>

<p>This takes about 10 seconds to run and, with the settings above, produces 22 grids (the state at the end of burn-in, plus one every $50{\small,}000$ iterations after that). Some quick MCMC diagnostics: our acceptance rate is a healthy $0.452$, and some acf plots suggest that the autocorrelation of the resulting samples decays reasonably fast:</p>

<p><img src="/files/blog/find_the_fox/FtF_acf.png" alt="Autocorrelation function" /></p>

<p>Each acf above was based on a time series of summary statistics computed from the grids. The first time series, $\{D^{(1)}_t\}$, is the Hamming distance between the $t$th sampled grid $G_t$ and the all-$\F$ grid $G^*$, \(D^{(1)}_t = |\{c \in \{1,\ldots,h\} \times \{1,\ldots,w\}: G_t(c) \neq G^*(c)\}|, \notag\)
while the second, $\{D^{(2)}_t\}$, is an average of six single-cell indicators for the letter $\X$,
\(D^{(2)}_t = \frac{1}{6} \sum_{j=1}^6 \one{G_t(c_j) = \X}, \notag\) where the cells $c_1, \ldots, c_6$ were randomly selected. Both of these summary statistics are sensitive to structural features: the Hamming distance measures how much the configuration of letters has changed as a whole, while the single-cell $\X$-indicators reflect local changes.</p>
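<p>For concreteness, the two summary statistics and a plain sample autocorrelation could be computed along the following lines. This is only a sketch: the function names are made up for illustration, the pages are assumed to be lists of row strings (as returned by <code>run_chain</code>), and the toy input is a set of random $4 \times 5$ pages rather than actual sampler output.</p>

```python
import random

def hamming_to_all_f(page):
    """Number of cells whose letter differs from F (page is a list of row strings)."""
    return sum(ch != "F" for row in page for ch in row)

def x_indicator_mean(page, cells):
    """Average of the indicators that the chosen cells hold the letter X."""
    return sum(page[r][c] == "X" for r, c in cells) / len(cells)

def sample_acf(series, max_lag):
    """Sample autocorrelation of a scalar series at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    var = sum(d * d for d in dev)
    if var == 0:
        return [1.0] + [0.0] * max_lag  # constant series: nothing to correlate
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / var
            for k in range(max_lag + 1)]

# Toy stand-in for sampler output: 30 random 4 x 5 pages (not FOXless in general)
rng = random.Random(1)
toy_pages = [["".join(rng.choice("FOX") for _ in range(5)) for _ in range(4)]
             for _ in range(30)]
all_cells = [(r, c) for r in range(4) for c in range(5)]
cells = rng.sample(all_cells, 6)  # six distinct randomly selected cells

d1 = [hamming_to_all_f(p) for p in toy_pages]
d2 = [x_indicator_mean(p, cells) for p in toy_pages]
acf1 = sample_acf(d1, 5)
acf2 = sample_acf(d2, 5)
print(acf1)
print(acf2)
```

<p>With real output one would pass the thinned snapshots in place of <code>toy_pages</code>; with only a couple of dozen snapshots, the estimates at larger lags are quite noisy.</p>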

<p>The frequencies of the letters $\F$, $\O$, and $\X$ at the last sampled grid are $0.323$, $0.328$, and $0.348$ respectively, which is about what we’d expect. More precisely, each of those is the observed value of the pushforward of $\pi$ under the coordinate projection onto a single cell. The unconditional distribution $\text{Unif}(\{\F,\O,\X\}^{32 \times 20})$ is invariant under any permutation of the letters, but the conditioning event “no $\FOX$ appears” is not: it’s invariant under the swap $\F \leftrightarrow \X$ but not under permutations involving $\O$, since $\O$ is singled out as the middle letter of the forbidden pattern. It follows that $\pi(\F) = \pi(\X)$. On the other hand, there’s no symmetry that forces $\pi(\O)$ to equal the common value of $\pi(\F)$ and $\pi(\X)$; nevertheless, we’d expect it to be reasonably close. Determining the exact values (either for a finite $h \times w$ grid or for the infinite-grid limit) is a very difficult combinatorial problem and not worth the trouble.</p>
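<p>Exact values are out of reach for a $32 \times 20$ grid, but on a tiny grid the symmetry argument can be checked by brute force. The sketch below enumerates all $3^9$ grids of a hypothetical $3 \times 3$ page, keeps the $\FOX$less ones, and computes the exact overall letter frequencies under the uniform distribution on them; the function names are illustrative only.</p>

```python
from collections import Counter
from itertools import product

LETTERS = ("F", "O", "X")
DIRS = ((0, 1), (1, 0), (1, 1), (1, -1))  # right, down, down-right, down-left

def segments(h, w):
    """All length-3 segments of an h-by-w grid, as triples of flat indices."""
    segs = []
    for r in range(h):
        for c in range(w):
            for dr, dc in DIRS:
                r2, c2 = r + 2 * dr, c + 2 * dc
                if r2 in range(h) and c2 in range(w):
                    segs.append(tuple((r + k * dr) * w + (c + k * dc)
                                      for k in range(3)))
    return segs

def letter_frequencies(h, w):
    """Number of FOXless grids and exact letter frequencies averaged over them."""
    segs = segments(h, w)
    counts = Counter()
    n_foxless = 0
    for flat in product(LETTERS, repeat=h * w):
        if any((flat[a], flat[b], flat[c]) in (LETTERS, LETTERS[::-1])
               for a, b, c in segs):
            continue  # contains FOX or XOF somewhere
        n_foxless += 1
        counts.update(flat)
    total = n_foxless * h * w
    return n_foxless, {ch: counts[ch] / total for ch in LETTERS}

n_foxless, freqs = letter_frequencies(3, 3)
print(n_foxless, freqs)
```

<p>The $\F$ and $\X$ frequencies agree exactly, as the swap symmetry requires, while nothing forces the $\O$ frequency to match them. Note that this averages over all cells of the grid, so it mixes boundary and interior cells.</p>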

<h2 id="inserting-a-fox">Inserting a FOX</h2>

<p>Now, finally, what about putting in a $\FOX$? To do this, we can simply sample one of our grids (uniformly at random) and then change a $3$-letter segment to $\FOX$.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> The harder part is ensuring that this doesn’t accidentally create another $\FOX$ nearby via overlapping triples. To do this, we’ll construct a function that picks a random $3$-letter segment uniformly from the grid, and then overwrites it with either $\FOX$ or $\XOF$ (with equal probability). Now, observe that if an arbitrary $3$-letter segment in the grid doesn’t touch any changed cell, then its three letters are exactly the same as before, so it couldn’t suddenly become $\FOX$ if it wasn’t already. Thus, if there are <i>other</i> $\FOX$s after the overwrite, we know that they must be among the length-$3$ segments that intersect the cells that we changed. So we need only enumerate those segments, of which there are a constant number. If, among those segments, we count more than one $\FOX$, we undo the overwrite and try again with another random segment.</p>

<p>Somewhat counterintuitively, most random overwrites of $\FOX$ won’t actually create extra $\FOX$s, because to do so requires a fairly specific local coincidence: the overwritten letters must also complete another length-$3$ pattern passing through one of the modified cells. Since we only alter $3$ cells, the “impact radius” is small, so the probability of collateral $\FOX$s is small as well. You can try this with the first page of <i>Find the Fox</i> above: try to pick any length-$3$ segment at random,<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> change the letters to $\FOX$, and see if that change created any extra $\FOX$s. Chances are that it didn’t!</p>

<p>It’s worth noting that this method does introduce a tiny amount of sampling bias. This is because when we take a valid 1-$\FOX$ grid $H$ and work backwards, it must have arisen from a $\FOX$less grid $G \in \Omega$ together with a segment $s$ such that overwriting $s$ with $\FOX$ or $\XOF$ produces $H$ without creating <i>different</i> $\FOX$/$\XOF$ occurrences. However, different 1-$\FOX$ grids admit different numbers of such “preimages” $(G,s)$, depending on the cells surrounding $s$. Those 1-$\FOX$ grids in which the letters surrounding $\FOX$ restrict the valid preimages are slightly less likely to be produced. In other words, our procedure samples 1-$\FOX$ grids with probability proportional to the number of $\FOX$less configurations that lead to them, rather than uniformly. But practically speaking, the difference seems to be negligible.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">segments</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">height</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">width</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">dr</span><span class="p">,</span> <span class="n">dc</span> <span class="ow">in</span> <span class="n">dirs</span><span class="p">[:</span><span class="mi">4</span><span class="p">]:</span>  <span class="c1"># →, ↓, ↘, ↙
</span>            <span class="n">coords</span> <span class="o">=</span> <span class="p">[(</span><span class="n">r</span> <span class="o">+</span> <span class="n">k</span><span class="o">*</span><span class="n">dr</span><span class="p">,</span> <span class="n">c</span> <span class="o">+</span> <span class="n">k</span><span class="o">*</span><span class="n">dc</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)]</span>
            <span class="k">if</span> <span class="nb">all</span><span class="p">(</span><span class="n">in_bounds</span><span class="p">(</span><span class="n">rr</span><span class="p">,</span> <span class="n">cc</span><span class="p">)</span> <span class="k">for</span> <span class="n">rr</span><span class="p">,</span> <span class="n">cc</span> <span class="ow">in</span> <span class="n">coords</span><span class="p">):</span>
                <span class="n">segments</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="nb">tuple</span><span class="p">(</span><span class="n">coords</span><span class="p">),</span> <span class="p">(</span><span class="n">dr</span><span class="p">,</span> <span class="n">dc</span><span class="p">)))</span>

<span class="k">def</span> <span class="nf">segments_through_cell</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">c</span><span class="p">):</span>
    <span class="c1"># all length-3 segments (4 directions) that include (r,c)
</span>    <span class="n">out</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">dr</span><span class="p">,</span> <span class="n">dc</span> <span class="ow">in</span> <span class="n">dirs</span><span class="p">[:</span><span class="mi">4</span><span class="p">]:</span>
        <span class="k">for</span> <span class="n">off</span> <span class="ow">in</span> <span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">):</span>
            <span class="n">coords</span> <span class="o">=</span> <span class="p">[(</span><span class="n">r</span> <span class="o">+</span> <span class="p">(</span><span class="n">off</span><span class="o">+</span><span class="n">k</span><span class="p">)</span><span class="o">*</span><span class="n">dr</span><span class="p">,</span> <span class="n">c</span> <span class="o">+</span> <span class="p">(</span><span class="n">off</span><span class="o">+</span><span class="n">k</span><span class="p">)</span><span class="o">*</span><span class="n">dc</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)]</span>
            <span class="k">if</span> <span class="nb">all</span><span class="p">(</span><span class="n">in_bounds</span><span class="p">(</span><span class="n">rr</span><span class="p">,</span> <span class="n">cc</span><span class="p">)</span> <span class="k">for</span> <span class="n">rr</span><span class="p">,</span> <span class="n">cc</span> <span class="ow">in</span> <span class="n">coords</span><span class="p">):</span>
                <span class="n">out</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="n">coords</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">out</span>

<span class="k">def</span> <span class="nf">count_fox_in_affected</span><span class="p">(</span><span class="n">grid</span><span class="p">,</span> <span class="n">affected_cells</span><span class="p">):</span>
    <span class="c1"># count FOX/XOF occurrences among segments that intersect affected_cells
</span>    <span class="n">seen</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
    <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span> <span class="ow">in</span> <span class="n">affected_cells</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">seg</span> <span class="ow">in</span> <span class="n">segments_through_cell</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">c</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">seg</span> <span class="ow">in</span> <span class="n">seen</span><span class="p">:</span>
                <span class="k">continue</span>
            <span class="n">seen</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">seg</span><span class="p">)</span>
            <span class="n">triple</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">grid</span><span class="p">[</span><span class="n">rr</span><span class="p">][</span><span class="n">cc</span><span class="p">]</span> <span class="k">for</span> <span class="n">rr</span><span class="p">,</span> <span class="n">cc</span> <span class="ow">in</span> <span class="n">seg</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">triple</span> <span class="ow">in</span> <span class="p">(</span><span class="n">letters</span><span class="p">,</span> <span class="n">letters</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]):</span>
                <span class="n">total</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">total</span>

<span class="k">def</span> <span class="nf">inject_exactly_one_fox</span><span class="p">(</span><span class="n">grid</span><span class="p">,</span> <span class="n">max_tries</span><span class="o">=</span><span class="mi">200000</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_tries</span><span class="p">):</span>
        <span class="p">(</span><span class="n">cells</span><span class="p">,</span> <span class="p">(</span><span class="n">dr</span><span class="p">,</span> <span class="n">dc</span><span class="p">))</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">segments</span><span class="p">)</span>
        <span class="n">pat</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">((</span><span class="n">letters</span><span class="p">,</span> <span class="n">letters</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]))</span>  <span class="c1"># choose FOX or XOF orientation on this segment
</span>
        <span class="n">new</span> <span class="o">=</span> <span class="p">[</span><span class="n">row</span><span class="p">[:]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">grid</span><span class="p">]</span>
        <span class="n">changed</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">rr</span><span class="p">,</span> <span class="n">cc</span><span class="p">),</span> <span class="n">ch</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">cells</span><span class="p">,</span> <span class="n">pat</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">new</span><span class="p">[</span><span class="n">rr</span><span class="p">][</span><span class="n">cc</span><span class="p">]</span> <span class="o">!=</span> <span class="n">ch</span><span class="p">:</span>
                <span class="n">new</span><span class="p">[</span><span class="n">rr</span><span class="p">][</span><span class="n">cc</span><span class="p">]</span> <span class="o">=</span> <span class="n">ch</span>
                <span class="n">changed</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">rr</span><span class="p">,</span> <span class="n">cc</span><span class="p">))</span>

        <span class="c1"># if we didn't change anything, we'd still have 0 occurrences — skip
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">changed</span><span class="p">:</span>
            <span class="k">continue</span>

        <span class="c1"># starting grid has 0; any new occurrence must touch a changed cell
</span>        <span class="k">if</span> <span class="n">count_fox_in_affected</span><span class="p">(</span><span class="n">new</span><span class="p">,</span> <span class="n">changed</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">new</span><span class="p">,</span> <span class="p">{</span><span class="s">"cells"</span><span class="p">:</span> <span class="n">cells</span><span class="p">,</span> <span class="s">"dir"</span><span class="p">:</span> <span class="p">(</span><span class="n">dr</span><span class="p">,</span> <span class="n">dc</span><span class="p">),</span> <span class="s">"pattern"</span><span class="p">:</span> <span class="n">pat</span><span class="p">}</span>

    <span class="k">raise</span> <span class="nb">RuntimeError</span><span class="p">(</span><span class="s">"Failed to inject exactly one FOX; increase max_tries."</span><span class="p">)</span>
</code></pre></div></div>
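<p>To make the sampling bias discussed above concrete, here is a small brute-force sketch on a hypothetical $1 \times 4$ strip (so only horizontal segments exist). It counts, for each strip containing exactly one $\FOX$/$\XOF$, how many combinations of a $\FOX$less strip, a segment, and an orientation produce it. It only counts preimages and ignores the retry loop, so it is an illustration of the idea, not a model of the exact sampling probabilities.</p>

```python
from collections import Counter
from itertools import product

LETTERS = ("F", "O", "X")
PATTERNS = (LETTERS, LETTERS[::-1])   # FOX and XOF
WIDTH = 4                             # toy 1 x 4 strip, chosen for illustration
SEGMENTS = [(i, i + 1, i + 2) for i in range(WIDTH - 2)]

def n_occurrences(strip):
    """Number of segments of the strip that read FOX or XOF."""
    return sum(tuple(strip[i] for i in seg) in PATTERNS for seg in SEGMENTS)

preimages = Counter()
for g in product(LETTERS, repeat=WIDTH):
    if n_occurrences(g) != 0:
        continue  # only FOXless starting strips
    for seg in SEGMENTS:
        for pat in PATTERNS:
            h = list(g)
            for i, ch in zip(seg, pat):
                h[i] = ch          # overwrite the segment with FOX or XOF
            if n_occurrences(h) == 1:
                preimages["".join(h)] += 1

for strip, count in sorted(preimages.items()):
    print(strip, count)
```

<p>Even in this tiny case the counts are not all equal: <code>FOXO</code> has 25 preimages while <code>FOXF</code> has 22, because a starting strip ending in <code>XOF</code> is not $\FOX$less and so can never be a preimage of <code>FOXF</code>.</p>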

<p>Finally, we can create a fresh “book” with as many pages as we’d like. For example, you might like the idea of <i>Find the Fox</i> but think that 200 pages is a bit much. We can easily create a single page with exactly one $\FOX$ in it! And once we solve that one, we can create another one (and continue ad nauseam until we get bored of finding foxes, which I imagine won’t take very long). The code for generating the PDF isn’t very interesting, so I’ll omit it here, but a command-line interface is available at <a href="https://github.com/rob-zimmerman/find-the-fox-mcmc">https://github.com/rob-zimmerman/find-the-fox-mcmc</a> with a number of useful options that you can specify (custom alphabet, grid height and width, number of pages, MCMC controls, etc.). <a href="https://github.com/rob-zimmerman/find-the-fox-mcmc/blob/main/find_the_fox.pdf">Here</a>’s an example of a generated page, and <a href="https://github.com/rob-zimmerman/find-the-fox-mcmc/blob/main/find_the_fox_answer_key.pdf">here</a>’s the solution key with the $\FOX$ coloured in red. Enjoy! If you liked this, please support the author and publisher of <i>Find the Fox</i> and purchase the book! Once you’ve solved it, come back here and generate an iid copy of the book and start again.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Or rather, book<i>s</i>. The <a href="https://alexcheddar.com/find-the-fox-answers-and-solutions/">answer page</a> for the book asks you to input the serial number, which suggests that different printings of the book have different solutions. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Not a typo: <i>Jensen’s</i> inequality is completely different. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>See Theorem 8.1.2 of <a href="https://ocw.mit.edu/courses/18-226-probabilistic-methods-in-combinatorics-fall-2022/mit18_226_f22_lec16-17.pdf">these</a> course notes by Yufei Zhao, for example. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>On the other hand, if you were okay with very narrow or very short grids — say $5$-ish characters wide or high — then this would work pretty nicely, since the running time is linear in the dimension you’re <i>not</i> scanning over. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>Your first thought might instead be to randomly choose a length-$3$ segment that’s one letter away from $\FOX$ (i.e., of the form $\ast \O\X$, $\F\!\ast\!\X$, or $\F\O\ast$ where the wildcard letter doesn’t complete a $\F\O\X$) and simply insert the missing letter. This one-cell change has a smaller “impact radius”, and the conditional acceptance probability will tend to be higher. However, there are two drawbacks to this approach. First, we’d have to scan the grid looking for such near-misses and then choose one at random, which is much more expensive than just choosing an arbitrary length-$3$ segment. Second (and more importantly), this method won’t actually place the $\FOX$ uniformly among all possible segments. Because we only consider segments that already partially resemble $\FOX$, the probability that a given location and orientation is chosen depends on local letter statistics, which ends up biasing our sampler. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p>This is important: if you specifically look for a length-$3$ segment that has two of the three $\FOX$ letters in the correct positions (e.g., $\F\F\X$, $\X\O\X$, etc.) and it’s not on the boundary of the grid, then you can guarantee at least two $\FOX$s with a carefully chosen overwrite. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Robert Zimmerman</name><email>robert.zimmerman@alumni.utoronto.ca</email></author><category term="MCMC" /><category term="graphs" /><category term="probability" /><summary type="html"><![CDATA[Since it was published in 2024, Alex Cheddar's book [Find the Fox: the Almost Impossible Word Search](https://alexcheddar.com/) has become quite popular for its difficulty and novelty. If you haven't heard of this book, the concept is very simple: it's a word search where the grid consists solely of the letters F, O, and X. There's only one word to find: FOX, which appears once among all 200 pages of the book...]]></summary></entry><entry><title type="html">Non-Integer Moments and Distributions</title><link href="https://rob-zimmerman.ca/posts/2024-non-integer-moments" rel="alternate" type="text/html" title="Non-Integer Moments and Distributions" /><published>2024-03-29T00:00:00-07:00</published><updated>2024-03-29T00:00:00-07:00</updated><id>https://rob-zimmerman.ca/posts/non-integer-moments</id><content type="html" xml:base="https://rob-zimmerman.ca/posts/2024-non-integer-moments"><![CDATA[<p>$\newcommand{\N}{\mathbb{N}}$
$\newcommand{\R}{\mathbb{R}}$
$\newcommand{\E}{\mathbb{E}}$
$\newcommand{\C}{\mathbb{C}}$
$\renewcommand{\P}{\mathbb{P}}$
$\newcommand{\one}[1]{\boldsymbol{1}_{#1}}$</p>
<h2 id="the-moment-problem">The Moment Problem</h2>

<p>In probability theory, a rigorous definition of expectation is followed almost immediately by a proof of the law of the unconscious statistician; following that, one is shown the definition of moments. One encounters integer moments throughout probability and statistics, while non-integer moments are substantially rarer. A natural question that arises is whether a distribution is characterized by its integer moments. That is, given a sequence of real numbers $\{\mu_j\} _{j=1}^\infty$, is there (at most) one distribution $F$ with $\mu_j = \int x^j \, \mathrm{d}F$ for every $j \geq 1$? This question is called a <i>moment problem</i>.</p>

<p>One learns by example that the answer is no, in general. Work on the moment problem reaches back to Stieltjes in 1894 (who himself invented the term <i>moment</i>), with precursors in Chebyshev and Markov.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Using the theory of continued fractions, Stieltjes himself solved the moment problem for distributions supported on $(0, \infty)$, ultimately showing that it relied on the positivity of the determinants of what we now call <a href="https://en.wikipedia.org/wiki/Hankel_matrix">Hankel matrices</a> built up from the prescribed moments (in the days before measure theory, Stieltjes’ moments were defined by what we know today as Riemann-Stieltjes integrals). The standard counterexample came in 1963, when Chris Heyde asked about the moment problem for “commonly used distributions in statistics” and presented the famous log-normal family:<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<blockquote>
  <p><b>Example 1:</b> Let $\mu \in \R$ and $\sigma^2 &gt; 0$. If $X$ is a $\text{Lognormal}(\mu, \sigma^2)$ random variable (that is, if $X$ has density $f(x) = \frac{1}{\sqrt{2\pi}\sigma x} \exp\left(-\frac{(\log(x) - \mu)^2}{2\sigma^2}\right)$ for $x &gt; 0$), then for all $\varepsilon \in [-1,1]$, the function $f \cdot (1 + \varepsilon \cdot \sin(2\pi (\log{x} - \mu)/\sigma^2))$ is also a density with the same integer moments as $X$. In the $\text{Lognormal}(0,1)$ case, the perturbation is simply $\sin(2\pi \log{x})$.</p>
</blockquote>
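<p>To see why the sine perturbation is invisible to integer moments, here is a sketch of the computation in the $\text{Lognormal}(0,1)$ case, where the perturbation is $\sin(2\pi \log{x})$. The notation $Z = \log X \sim N(0,1)$ and $\varphi$ for the standard normal density is introduced only for this calculation. For any integer $n$, completing the square gives $e^{nz} \varphi(z) = e^{n^2/2} \varphi(z - n)$, so</p>

\[\begin{align*}
\E[X^n \sin(2\pi \log X)] &amp;= \int_{-\infty}^{\infty} e^{nz} \sin(2\pi z) \, \varphi(z) \, \mathrm{d}z \\
&amp;= e^{n^2/2} \int_{-\infty}^{\infty} \sin(2\pi z) \, \varphi(z - n) \, \mathrm{d}z \\
&amp;= e^{n^2/2} \int_{-\infty}^{\infty} \sin(2\pi u + 2\pi n) \, \varphi(u) \, \mathrm{d}u \\
&amp;= e^{n^2/2} \int_{-\infty}^{\infty} \sin(2\pi u) \, \varphi(u) \, \mathrm{d}u \\
&amp;= 0.
\end{align*}\]

<p>The third line substitutes $u = z - n$, the fourth uses the periodicity of the sine (which is exactly where $n$ being an integer matters), and the last integral vanishes because the integrand is odd. Taking $n = 0$ shows that the perturbed function still integrates to $1$.</p>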

<p>According to Stoyanov<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>, Stieltjes himself showed this for the $\text{Lognormal}(0, 1)$ case in the same 1894 paper as his “moment” introduction (in a non-probabilistic context, of course), so Heyde’s example was apparently something of a rediscovery. Durrett<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> gives us another example:</p>

<blockquote>
  <p><b>Example 2:</b> Let $\lambda \in (0,1)$. If $X$ has density $f(x) = \left(\int \exp(-|x|^\lambda) \, \mathrm{d}x \right)^{-1} \exp\left(-|x|^{\lambda}\right)$ for $x \in \R$, then for all $\varepsilon \in [-1,1]$, the function $f \cdot (1 + \varepsilon \cdot \sin( \tan(\lambda \pi /2) \cdot |x|^\lambda \cdot \mathrm{sgn}(x)))$ is also a density with the same moments as $X$.</p>
</blockquote>

<p>Examples 1 and 2 obviously share a similar theme. It turns out that the idea generalizes to <i>Stieltjes classes</i>:<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> if $f$ is a density whose integer moments exist and $h$ is a non-zero continuous function taking values in $[-1,1]$ such that $\E_{X \sim f}[h(X) \cdot X^n] = 0$ for all $n \in \N$, then for all $\varepsilon \in [-1,1]$, the function $f_\varepsilon = f \cdot (1 + \varepsilon \cdot h)$ is a density with the same integer moments as $f$. The conditions required of $h$ essentially force it to be periodic, as it is in the two examples above.</p>
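<p>A quick numerical sanity check of the Stieltjes-class idea, for Example 1 in the $\text{Lognormal}(0,1)$ case with $h(x) = \sin(2\pi \log{x})$, might look like the sketch below. It estimates moments with a plain trapezoid rule after substituting $z = \log x$; the function names, integration range, and step size are choices made for this illustration.</p>

```python
import math

def lognormal_moment(p, eps, lo=-12.0, hi=16.0, step=0.005):
    """Trapezoid estimate of the p-th moment of f * (1 + eps * sin(2 pi log x)),
    where f is the Lognormal(0, 1) density, after substituting z = log x."""
    n = round((hi - lo) / step)
    terms = []
    for i in range(n + 1):
        z = lo + i * step
        weight = 0.5 if i in (0, n) else 1.0   # trapezoid end-point weights
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        perturb = 1.0 + eps * math.sin(2.0 * math.pi * z)
        terms.append(weight * step * math.exp(p * z) * phi * perturb)
    return math.fsum(terms)

def relative_gap(p, eps=1.0):
    """Relative difference between the perturbed and unperturbed p-th moments."""
    base = lognormal_moment(p, 0.0)
    return abs(lognormal_moment(p, eps) - base) / base

for p in (1, 2, 3, 0.25):
    print(p, relative_gap(p))
```

<p>For integer $p$ the gap should be at the level of floating-point noise, while for $p = 0.25$ it is small but genuinely nonzero. In this particular case the same completing-the-square computation gives a relative gap of $|\varepsilon \sin(2\pi p)| \, e^{-2\pi^2}$, which is roughly $2.7 \times 10^{-9}$ at $p = 0.25$ with $\varepsilon = 1$.</p>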

<h2 id="non-integer-moments">Non-Integer Moments</h2>

<p>Going through the explicit proofs of Examples 1 and 2 in Durrett’s book, I was struck that the equivalence of the moments ultimately relies on the fact that we’re really talking about <i>integer</i> moments here (in Example 1, the integers play into the sine function; in Example 2, they essentially make $x^n \cdot \varepsilon \cdot h \cdot f$ into an odd function that integrates to $0$). Clearly, non-integer moments are a different story. This observation led me to a kind of inverse question: does there exist a distribution which is determined only by its <i>non-integer</i> moments? To put it another way, for $p \in \R$, do there exist positive random variables $X$ and $Y$ such that $\E[X^p] = \E[Y^p]$ if and only if $p \not \in \N$? The answer, as it turns out, is no. In fact, we only need the moments to coincide on some dense subset of $\R$ such as the rationals:</p>

<blockquote>
  <p><b>Proposition:</b> Let $X$ and $Y$ be positive random variables. If $A \subseteq \R$ is dense in $\R$ and $\E[X^p] = \E[Y^p]$ for all $p \in A$, then $\E[X^p] = \E[Y^p]$ for all $p \in \R$.</p>
</blockquote>

<p><i>Proof:</i> Fix $p \in \R$. We have two cases to consider: either $\E[X^p] = \infty$ or $\E[X^p] &lt; \infty$. The first case is fairly easy to handle, while the second is only a bit trickier.</p>

<p><u> Case 1: $\E[X^p] = \infty$</u>.</p>

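<p>Note that $p \neq 0$ in this case, since $\E[X^0] = 1$. Fix a sequence $\{q_j\}_{j=1}^\infty$ such that $q_j \to 0$ and $p + q_j \in A$ lies strictly between $0$ and $p$ for all $j$ (we can do this because $A$ is dense in $\R$; keeping $p + q_j$ between $0$ and $p$ ensures that the exponent $(p + q_j)/p$ lies in $(0,1)$, which the Hölder step requires). We have that</p>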

\[\begin{align*}
\infty &amp;= \E[X^p] &amp;&amp;\\
&amp;= \E[\liminf_{j \to \infty} \, X^{p + q_j}] &amp;\\ 
&amp;\leq \liminf_{j \to \infty} \, \E[ X^{p + q_j}] &amp;&amp;\mbox{by Fatou's lemma}\\
&amp;= \liminf_{j \to \infty} \, \E[ Y^{p + q_j}] &amp;&amp;\mbox{because $p + q_j \in A$}\\
&amp;\leq \liminf_{j \to \infty} \, \E[ Y^p]^{(p + q_j)/p}  &amp;&amp;\mbox{by Hölder's inequality}\\
&amp;= \E[Y^p].
\end{align*}\]

<p>Thus $\E[Y^p] = \infty$, which gives $\E[X^p] = \E[Y^p]$.</p>
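<p>The Hölder step is worth unpacking. Here’s a sketch under an assumption of mine that isn’t spelled out above: the exponent $s = p + q_j$ has the same sign as $p$ and satisfies $|s| \leq |p|$, so that $\theta = s/p \in (0,1]$. Applying Hölder’s inequality to the product $(Y^p)^\theta \cdot 1$ with conjugate exponents $1/\theta$ and $1/(1-\theta)$ gives</p>

\[\begin{align*}
\E[Y^{s}] &amp;= \E[(Y^{p})^{\theta} \cdot 1] &amp;&amp;\\
&amp;\leq \E[Y^{p}]^{\theta} \cdot \E[1]^{1 - \theta} &amp;&amp;\\
&amp;= \E[Y^{p}]^{s/p}. &amp;&amp;
\end{align*}\]

<p>Since $s/p \to 1$ as $j \to \infty$, the right-hand side tends to $\E[Y^p]$ whenever $\E[Y^p] &lt; \infty$, and the bound is trivial when $\E[Y^p] = \infty$.</p>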

<p><u> Case 2: $\E[X^p] &lt; \infty$</u>.</p>

<p>To begin with, let’s show that $\E[Y^p] &lt; \infty$ using essentially the same technique as we used above, but in the contrapositive direction. We may assume that $p \neq 0$, since the case $p = 0$ is trivial. Fix another sequence $\{r_j\}_{j=1}^\infty$ such that $r_j \to 0$ and $p - r_j \in A$ lies strictly between $0$ and $p$ for all $j$. Then</p>

\[\begin{align*}
\E[Y^p] &amp;= \E[ \liminf_{j \to \infty} \, Y^{p - r_j}] &amp;&amp;\\
&amp;\leq \liminf_{j \to \infty} \, \E[Y^{p - r_j}] &amp;&amp;\mbox{by Fatou's lemma}\\
&amp;= \liminf_{j \to \infty} \, \E[X^{p - r_j}]  &amp;&amp;\mbox{because $p - r_j \in A$}\\
&amp;\leq \liminf_{j \to \infty} \, \E[X^p]^{(p - r_j)/p}  &amp;&amp;\mbox{by Hölder's inequality}\\
&amp;= \E[X^p] &amp;&amp;\\
&amp;&lt; \infty. &amp;&amp;
\end{align*}\]

<p>Now, let $\{p_j\}_{j=1}^\infty \subseteq A$ be yet another sequence with $p_j \nearrow p$ as $j \to \infty$. Observe that for any $j \in \N$, we have</p>

\[\begin{align*}
X^{p_j} &amp;= X^{p_j} \cdot \one{X \in (0,1]} + X^{p_j} \cdot \one{X &gt; 1} &amp;&amp;\\
&amp;\leq 1 + X^{p_j} \cdot \one{X &gt; 1} &amp;&amp;\mbox{since $x^{p_j} \leq 1$ when $x \in (0,1]$}\\
&amp;\leq 1 + X^p \cdot \one{X &gt; 1} &amp;&amp;\mbox{since $p_j \leq p$ implies $x^{p_j} \leq x^p$ when $x &gt; 1$}\\
&amp;\leq 1 + X^p. &amp;&amp;
\end{align*}\]

<p>Since $\E[1 + X^p] &lt; \infty$, we see that the $X^{p_j}$ are dominated by an integrable random variable. By the dominated convergence theorem, we get $\E[X^{p_j}] \to \E[X^p]$ as $j \to \infty$. The exact same argument applied to the $Y^{p_j}$ shows us that $\E[Y^{p_j}] \to \E[Y^p]$ as $j \to \infty$. Since $\E[X^{p_j}] = \E[Y^{p_j}]$ for all $j$, the uniqueness of limits (in metric spaces!) gives $\E[X^p] = \E[Y^p]$. $\square$</p>

<p>Taking $A = \R \setminus \N$, which is dense in $\R$, resolves the original question.</p>

<h2 id="back-to-the-moment-problem">Back to the Moment Problem</h2>

<p>The condition $\E[X^p] = \E[Y^p]$ for all $p \in \R$ is very strong. But is it enough to force $X$ and $Y$ to have the same distribution? The answer is yes!</p>

<blockquote>
  <p><b>Theorem 1:</b> Let $X$ and $Y$ be positive random variables. If $A \subseteq \R$ is dense in $\R$ and $\E[X^p] = \E[Y^p] &lt; \infty$ for all $p \in A$, then $X \stackrel{d}{=} Y$.</p>
</blockquote>

<p>The proof follows along the same lines as that of a stronger theorem of Gwo Dong Lin from 1992:<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>

<blockquote>
  <p><b>Theorem 2:</b> Let $X$ and $Y$ be positive random variables, and suppose there exists some $\alpha &gt; 0$ such that $\E[X^\alpha] &lt; \infty$ and $\E[Y^\alpha] &lt; \infty$. Let $\{s_j\}_{j=1}^\infty \subseteq (0, \infty)$ be a sequence of distinct numbers such that $s_j \to s \in (0, \alpha)$. If $\E[X^{s_j}] = \E[Y^{s_j}]$ for all $j$, then $X \stackrel{d}{=} Y$.</p>
</blockquote>

<p>Clearly Theorem 1 is weaker than Theorem 2, so it’s enough to prove the latter; the former then falls out as a corollary. Lin’s proof is quite slick. Here’s a sketch of it:</p>

<p><i>Proof of Theorem 2 (sketch):</i> The assumptions imply that the moment generating functions (mgfs) of $\log{X}$ and $\log{Y}$ viewed as functions over $\C$, that is, the functions $z \mapsto \E[X^z]$ and $z \mapsto \E[Y^z]$, are analytic in the strip $S = \{z \in \C: 0 &lt; \Re(z) &lt; \alpha\}$ (because $\E[X^\alpha], \E[Y^\alpha] &lt; \infty$). Moreover, these functions, which are analytic continuations of their respective mgfs, agree on $S$: they coincide at the distinct points $s_j$, which accumulate at the point $s$ inside $S$, and this is exactly what activates the <a href="https://en.wikipedia.org/wiki/Identity_theorem">identity theorem</a>. A (right)-continuity argument then shows that these functions agree on $\{z \in \C: \Re(z) = 0\}$ as well. Replacing $z$ with $it$ for $t \in \R$, we see that the characteristic functions of $\log{X}$ and $\log{Y}$ agree everywhere. From Lévy’s inversion theorem, we get that $\log{X} \stackrel{d}{=} \log{Y}$, and thus $X \stackrel{d}{=} Y$. $\square$</p>
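<p>To make the link between moments and transforms in the sketch explicit (this is my own unpacking, not quoted from Lin’s paper): for any $z \in \C$ with $0 \leq \Re(z) \leq \alpha$, we have</p>

\[\begin{align*}
\E[X^z] &amp;= \E[e^{z \log{X}}], &amp;&amp;\\
|X^z| &amp;= X^{\Re(z)} \leq 1 + X^{\alpha}. &amp;&amp;
\end{align*}\]

<p>The first line says that $z \mapsto \E[X^z]$ is the mgf of $\log{X}$ evaluated at $z$, and the second shows that it is well-defined on the closed strip because $\E[X^\alpha] &lt; \infty$. Setting $z = it$ recovers the characteristic function $t \mapsto \E[e^{it \log{X}}]$ of $\log{X}$.</p>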

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Shohat, James Alexander, and Jacob David Tamarkin. The problem of moments. Vol. 1. American Mathematical Society (RI), 1950. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Heyde, Chris C. “On a property of the lognormal distribution.” Journal of the Royal Statistical Society Series B: Statistical Methodology 25.2 (1963): 392-393. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>Stoyanov, Jordan. “Stieltjes classes for moment-indeterminate probability distributions.” Journal of Applied Probability 41.A (2004): 281-294. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>Durrett, Rick. Probability: theory and examples. Vol. 49. Cambridge University Press, 2019. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>Lin, Gwo Dong. “Characterizations of distributions via moments.” Sankhyā: The Indian Journal of Statistics, Series A (1992): 128-132. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Robert Zimmerman</name><email>robert.zimmerman@alumni.utoronto.ca</email></author><category term="probability" /><summary type="html"><![CDATA[Does there exist a distribution which is determined only by its non-integer moments? To put it another way, for $p \geq 0$, do there exist random variables $X$ and $Y$ supported on $(0, \infty)$ such that $\mathbb{E}[X^p] = \mathbb{E}[Y^p]$ if and only if $p \not \in \mathbb{N}$?]]></summary></entry><entry><title type="html">Limits of Self-Normalized Random Variables</title><link href="https://rob-zimmerman.ca/posts/2023-self-normalized-rvs" rel="alternate" type="text/html" title="Limits of Self-Normalized Random Variables" /><published>2023-04-24T00:00:00-07:00</published><updated>2023-04-24T00:00:00-07:00</updated><id>https://rob-zimmerman.ca/posts/limits-normalized-sequences</id><content type="html" xml:base="https://rob-zimmerman.ca/posts/2023-self-normalized-rvs"><![CDATA[<p>$\newcommand{\N}{\mathbb{N}}$
$\newcommand{\E}{\mathbb{E}}$
$\renewcommand{\P}{\mathbb{P}}$
I recently tweeted something very silly (redundant information, I know). The <a href="https://twitter.com/mr_roberts_z/status/1650471367299440641">tweet in question</a> asked the following:</p>

<blockquote>
  <p>Let $X_0$ be supported on some nonempty $A \subseteq \N_{&gt;0}$ with $\P(X_0 = k) = p_{0,k}$ and $\E[X_0] &lt; \infty$. For each $n \geq 1$, recursively define $X_n$ on $\N_{&gt;0}$ by $\P(X_n = k) = c_n \cdot k \cdot p_{n-1,k}$, where $c_n$ is a normalizing constant. Then, as $n \to \infty$,</p>
</blockquote>

<p>…then what? What happens to these random variables in the pointwise limit? I remember asking myself this years ago when I first learned about discrete random variables, and, given the simplicity of the formulation, I was surprised when a search for an answer yielded nothing.</p>

<p>Anyway, the <a href="https://twitter.com/mr_roberts_z/status/1650471533293322240">answer</a> I tweeted was perhaps somewhat unsatisfying. This post is an attempt to do a little better.</p>

<p>Of course, the normalizing constant here is simply $\left(\sum_{k \in A} k \cdot p_{n-1,k} \right)^{-1} = \E[X_{n-1}]^{-1}$. So one way to look at this construction is to view the mass that $X_n$ places on $k$ as the corresponding (normalized) summand in the expectation of $X_{n-1}$. Observing that $p_{1,k} = \E[X_0]^{-1} \cdot k \cdot p_{0,k}$, we might suspect that this identity holds more generally provided that all moments exist,<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> and in fact an easy induction argument shows this is true:</p>

<blockquote>
  <p><b>Proposition:</b> For all $n \geq 1$, we have $p_{n,k} = \E[X_0^n]^{-1} \cdot k^n \cdot p_{0,k}$, provided that $\E[X_0^n] &lt; \infty$.</p>
</blockquote>

<p><i>Proof:</i> The base case $n=1$ is obvious. Now, assume that $p_{n-1,k} = \E[X_0^{n-1}]^{-1} \cdot k^{n-1} \cdot p_{0,k}$. Then, simply plugging this in, cancelling out normalizing constants and manipulating gives us</p>

\[\begin{align*} 
p_{n,k} &amp;= \left( \sum_{j \in A} j \cdot p_{n-1, j} \right)^{-1} \cdot k \cdot p_{n-1, k} \\
&amp;= \left( \sum_{j \in A} j \cdot \left[\E[X_0^{n-1}]^{-1} \cdot j^{n-1} \cdot p_{0,j} \right] \right)^{-1} \cdot k \cdot \E[X_0^{n-1}]^{-1} \cdot k^{n-1} \cdot p_{0,k} \\
&amp;= \left( \sum_{j \in A} j \cdot \left[ j^{n-1} \cdot p_{0,j} \right] \right)^{-1} \cdot k  \cdot k^{n-1} \cdot p_{0,k} \\
&amp;= \left( \sum_{j \in A} j^{n} \cdot p_{0,j} \right)^{-1} \cdot  k^{n} \cdot p_{0,k} \\
&amp;= \E[X_0^n]^{-1} \cdot k^n \cdot p_{0,k},
\end{align*}\]

<p>as desired. $\square$</p>
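<p>The Proposition is easy to check by brute force. Below is a minimal sketch in Python using exact rational arithmetic; the helper names <code>size_bias</code> and <code>closed_form</code> and the starting distribution on $A = \{1, 2, 5\}$ are my own illustrative choices.</p>

```python
from fractions import Fraction

def size_bias(pmf):
    """One step of the recursion: P(X_n = k) = c_n * k * P(X_{n-1} = k)."""
    mean = sum(k * p for k, p in pmf.items())  # this is 1 / c_n
    return {k: k * p / mean for k, p in pmf.items()}

def closed_form(pmf0, n):
    """The Proposition's formula: k**n * p_{0,k} / E[X_0**n]."""
    nth_moment = sum(k ** n * p for k, p in pmf0.items())
    return {k: k ** n * p / nth_moment for k, p in pmf0.items()}

# Hypothetical starting distribution on A = {1, 2, 5}, for illustration only.
pmf0 = {1: Fraction(1, 2), 2: Fraction(1, 3), 5: Fraction(1, 6)}

pmf = dict(pmf0)
for n in range(1, 6):
    pmf = size_bias(pmf)
    # Exact equality is meaningful here because Fraction avoids rounding.
    assert pmf == closed_form(pmf0, n)
```

<p>Iterating the recursion $n$ times and plugging into the closed form give the same probabilities for each $n$ tried, as the induction argument says they should.</p>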

<p>We now return to the original question: what can we say about $X_n$ as $n \to \infty$? The answer to this, as it turns out, depends solely on whether $A$ is infinite or not. If $A$ is infinite, then for any fixed $k \in A$ we have</p>

\[\begin{align}
p_{n,k} &amp;= \frac{k^n \cdot p_{0,k}}{ \sum_{j \in A} j^n \cdot p_{0,j}} \nonumber \\
&amp;= \left(\sum_{\substack{i \in A \\ i &lt; k}} \left(\frac{i}{k} \right)^n \cdot \frac{p_{0,i}}{p_{0,k}} + 1 +  \sum_{\substack{j \in A \\ j &gt; k}} \left(\frac{j}{k} \right)^n \cdot \frac{p_{0,j}}{p_{0,k}}\right)^{-1} \nonumber\\
&amp;&lt; \left(\sum_{\substack{i \in A \\ i &lt; k}} \left(\frac{i}{k} \right)^n \cdot \frac{p_{0,i}}{p_{0,k}} + 1 +  \underbrace{\left(\frac{k+1}{k} \right)^n \cdot \sum_{\substack{j \in A \\ j &gt; k}} \frac{p_{0,j}}{p_{0,k}}}_{\to \, \infty}\right)^{-1} \label{eq:ineq}\\
&amp; \xrightarrow{n \to \infty} 0 \nonumber.
\end{align}\]

<p>That is, $\P(X_n = k) \xrightarrow{n \to \infty} 0$. Observe that this works because</p>

\[\begin{equation*} \sum_{\substack{j \in A, \\ j &gt; k}} p_{0,j}&gt; 0, \end{equation*}\]

<p>which follows from our assumption that $A$ is infinite. Thus, for any fixed $k \in A$ we have $\P(X_n \leq k) \xrightarrow{n \to \infty} 0$, or equivalently $\P(X_n &gt; k) \xrightarrow{n \to \infty} 1$, and we see that $X_n$ does not converge to a random variable as $n \to \infty$.</p>

<p>If $A$ is finite, then the situation is quite different. In this case, let $k^* = \max A$. Then we find that</p>

\[\begin{align*}
p_{n,k^*} &amp;=  \left( \sum_{\substack{i \in A \\ i \neq k^*}} \left(\frac{i}{k^*} \right)^n \cdot \frac{p_{0,i}}{p_{0,k^*}} + 1 \right)^{-1}\\
&amp;= \left( \underbrace{\sum_{\substack{i \in A \\ i &lt; k^*}} \left(\frac{i}{k^*} \right)^n \cdot \frac{p_{0,i}}{p_{0,k^*}}}_{\to \, 0} + 1 \right)^{-1}\\
&amp;\xrightarrow{n \to \infty} 1.
\end{align*}\]

<p>That is, $\P(X_n = k^* ) \xrightarrow{n \to \infty} 1$. We can also see this with an approach from the other direction. For any $k \in A$ with $k &lt; k^* $, we return to the situation of $\eqref{eq:ineq}$:</p>

\[\begin{equation*} p_{n,k} \leq \left(\sum_{\substack{i \in A \\ i &lt; k}} \left(\frac{i}{k} \right)^n \cdot \frac{p_{0,i}}{p_{0,k}} + 1 +  \left(\frac{k+1}{k} \right)^n \cdot \sum_{\substack{j \in A \\ j &gt; k}} \frac{p_{0,j}}{p_{0,k}}\right)^{-1} \xrightarrow{n \to \infty} 0. \end{equation*}\]

<p>While the second sum in the parentheses is finite this time, we can be sure it’s non-empty because it must include a term for $k^*$ (the inequality is not strict this time!).</p>

<p>We can summarize our answer to the silly tweet question in a neat statement:</p>

<blockquote><b>Theorem:</b> Let $X_0$ be supported on some non-empty $A \subseteq \N_{&gt;0}$ with $\P(X_0 = k) = p_{0,k}$ and $\E[X_0^n] &lt; \infty$ for all $n \in \N$. For each $n \geq 1$, recursively define $X_n$ on $\N_{&gt;0}$ by $\P(X_n = k) = c_n \cdot k \cdot p_{n-1,k}$, where $c_n$ is a normalizing constant. Then, as $n \to \infty$, the following holds:<br /><br />

1) If $|A| = \infty$, then $X_n$ does not converge.<br /> 
2) If $|A| &lt; \infty$, then $X_n$ converges to a point mass at $\max A$. </blockquote>
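<p>Both halves of the Theorem can be seen numerically. The sketch below is only illustrative: the function <code>pmf_after</code> is my own helper, the finite example is uniform on $\{1, 2, 3\}$, and the “infinite” example takes $p_{0,k} \propto 2^{-k}$ truncated at $k = 400$, which is an approximation rather than a genuinely infinite support.</p>

```python
def pmf_after(n, weights):
    """Return P(X_n = k) = k**n * p_{0,k} / sum_j j**n * p_{0,j} as a dict.

    `weights` maps k to p_{0,k}; it need not be normalized, because the
    normalizing constant cancels in the ratio.
    """
    scaled = {k: (k ** n) * w for k, w in weights.items()}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}

# Finite support: uniform on A = {1, 2, 3} (an arbitrary illustrative choice).
finite = {1: 1.0, 2: 1.0, 3: 1.0}

# Stand-in for infinite support: p_{0,k} proportional to 2**(-k), truncated
# at k = 400. The truncation is an assumption made so the sums are finite.
geometric = {k: 0.5 ** k for k in range(1, 401)}

for n in (1, 10, 40):
    mass_at_max = pmf_after(n, finite)[3]
    dist = pmf_after(n, geometric)
    mass_up_to_10 = sum(dist[k] for k in range(1, 11))
    print(n, mass_at_max, mass_up_to_10)
```

<p>As $n$ grows, the mass at $\max A = 3$ in the finite example climbs towards $1$, while the mass that the geometric example places on $\{1, \ldots, 10\}$ drains away towards $0$.</p>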

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>This assumption was missing from the original tweet (and this post). Thanks, <a href="https://yanbotang.github.io/">Yanbo</a>, for noticing! <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Robert Zimmerman</name><email>robert.zimmerman@alumni.utoronto.ca</email></author><category term="probability" /><summary type="html"><![CDATA[I recently tweeted something very silly (redundant information, I know). The [tweet in question](https://twitter.com/mr_roberts_z/status/1650471367299440641) asked the following: Let $X_0$ be supported on some nonempty $A \subseteq \mathbb{N}_{>0}$ with $\mathbb{P}(X_0 = k) = p_{0,k}$ and $\mathbb{E}[X_0] < \infty$. For each $n \geq 1$, recursively define $X_n$ on $\mathbb{N}_{>0}$ by $\mathbb{P}(X_n = k) = c_n \cdot k \cdot p_{n-1,k}$, where $c_n$ is a normalizing constant. Then, as $n \to \infty$ ...then what?]]></summary></entry><entry><title type="html">Self-Independence by Ancillarity and Completeness</title><link href="https://rob-zimmerman.ca/posts/2022-self-independence-ancillarity" rel="alternate" type="text/html" title="Self-Independence by Ancillarity and Completeness" /><published>2022-07-14T00:00:00-07:00</published><updated>2022-07-14T00:00:00-07:00</updated><id>https://rob-zimmerman.ca/posts/independence-basu</id><content type="html" xml:base="https://rob-zimmerman.ca/posts/2022-self-independence-ancillarity"><![CDATA[<p>\(\newcommand{\N}{\mathbb{N}}\)
\(\newcommand{\E}{\mathbb{E}}\)
\(\newcommand{\R}{\mathbb{R}}\)
\(\newcommand{\bX}{\mathbf{X}}\)
\(\renewcommand{\P}{\mathbb{P}}\)
Back in 2020, I taught <a href="https://rob-zimmerman.github.io/teaching/STA261">STA261</a> for the first time. The first part of that course deals with statistics (i.e., functions of random samples, not the subject as a whole!) and I chose to provide a light introduction to completeness because of how elegant the <a href="https://en.wikipedia.org/wiki/Lehmann%E2%80%93Scheff%C3%A9_theorem">Lehmann-Scheffé theorem</a> and related results in point estimation are down the road, despite the unintuitive definition of completeness itself (I’ve kept up this choice in my subsequent offerings of the course). By the same token, I also introduced ancillarity in order to apply the extremely slick <a href="https://en.wikipedia.org/wiki/Basu%27s_theorem">Basu’s theorem</a> to several nice problems.</p>

<p>The first assessment in that course (in the form of an open-book online quiz; this was during lockdown, after all) was intended to test the students’ understanding of these concepts; with lots of difficulty, I managed to come up with several original questions. One of them was meant to be on the trickier side:</p>

<blockquote>
  <p>Let $X_1,\ldots,X_n \stackrel{iid}{\sim} F_\theta$, where the $X_i$’s have finite first moments. Suppose that there exist some $j,k \in \{1,\ldots,n\}$ such that the statistic $S(\bX) = X_j$ is ancillary for $\theta$ and the statistic $T(\bX) = X_k$ is complete for the family $\{F_\theta: \theta \in \Theta\}$. Prove that all of the $X_i$’s must be constant with probability 1.</p>
</blockquote>

<p><i>Solution:</i> Let $h(T(\bX)) = T(\bX) - \E[X_j]$, which is free of $\theta$ because $X_j$ is ancillary for $\theta$. Then for any $\theta \in \Theta$, we have $\E[h(T(\bX))] = \E[X_k] - \E[X_j] = 0$, and by completeness it follows that $1 = \P_\theta( h(T(\bX)) = 0) = \P_\theta(X_k = \E[X_j])$ for all $\theta \in \Theta$. That is, $X_k$ is the constant $\E[X_j]$ with probability 1. Since the $X_i$’s are iid, they’re all equal to that same constant with probability 1. $\square$</p>

<p>The question originally had a number of red herrings thrown in, but fortunately I removed those before going live. In a subsequent discussion with pals <a href="https://yanbotang.github.io/">Yanbo</a> and <a href="https://mic-lalancette.github.io/">Michaël</a>, we noticed a connection to a basic result in probability theory:</p>

<blockquote>
  <p><b>Theorem:</b> A random variable $Y$ is independent of itself if and only if $Y$ is almost surely constant.</p>
</blockquote>

<p>One direction of this is quite trivial. The other direction is substantially more difficult and is typically proved by appealing to <a href="https://en.wikipedia.org/wiki/Kolmogorov%27s_zero%E2%80%93one_law">Kolmogorov’s zero-one law</a> (of course one easily shows that \(\P(Y \in A) \in \{0,1\}\) for any event \(A\) — perhaps specializing to \(\P(Y = y) \in \{0,1\}\) for any \(y \in \R\) — but clearly this is not enough to immediately conclude that \(\P(Y = y) = 1\) for some \(y \in \R\), and some topological argument is required). However, if we’re willing to further assume that $Y$ has a finite first moment, then we can produce an elementary proof using the result from the STA261 quiz:</p>

<p>$(\Rightarrow)$: Suppose that $Y$ is independent of itself. Define the “parameter space” $\Theta = \{ \theta \}$, where $\theta \in \R$ is arbitrary. We can vacuously associate $\theta$ to the distribution of $Y$ and write $Y \sim F_\theta$ without any ambiguity, even though the distribution of $Y$ is free of $\theta$. By the last remark, the “statistic” $S(Y) = Y$ is ancillary for $\theta$. Now, let $h: \R \to \R$ be some function such that $\E[h(Y)] = 0$. Since $Y$ is independent of itself, so too is $h(Y)$ and we have $0 = \E[h(Y)] \cdot \E[h(Y)] = \E[h(Y)^2]$, and it follows<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> that $\P(h(Y) = 0) = 1$, and this holds vacuously for all $\theta \in \Theta$. Thus, the “statistic” $T(Y) = Y$ is complete (and obviously sufficient) for the family $\{ F_\theta: \theta \in \Theta \}$. By the result from the quiz with $Y$ in place of each $X_i$, we see that $Y$ must be constant. $\square$</p>

<p>Unfortunately, this will not win the Most Elegant Proof award for that direction of the theorem under the assumption of a finite first moment: since $Y$ is independent of itself, $\E[Y^2] = \E[Y] \cdot \E[Y] = \E[Y]^2$ so that $\text{Var}(Y) = \E[Y^2] - \E[Y]^2 = 0$, and the result follows.</p>
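<p>For a finitely supported $Y$, the theorem can also be checked by brute force. The sketch below is my own illustration (the function name <code>is_self_independent</code> and the two example distributions are hypothetical): it enumerates every pair of events and tests the product rule directly.</p>

```python
from fractions import Fraction
from itertools import combinations

def is_self_independent(pmf):
    """Check P(Y in A and Y in B) == P(Y in A) * P(Y in B) for all events A, B.

    `pmf` maps each value of a finitely supported Y to its probability.
    """
    values = list(pmf)
    # Every subset of the support is an event.
    events = [frozenset(c) for r in range(len(values) + 1)
              for c in combinations(values, r)]

    def prob(event):
        return sum((pmf[v] for v in event), Fraction(0))

    return all(prob(a & b) == prob(a) * prob(b) for a in events for b in events)

constant = {7: Fraction(1)}                    # Y = 7 almost surely
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}  # a fair coin, not constant

print(is_self_independent(constant), is_self_independent(coin))  # True False
```

<p>The fair coin fails already at $A = B = \{0\}$, where $\P(Y \in A \cap B) = 1/2$ but $\P(Y \in A) \cdot \P(Y \in B) = 1/4$.</p>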

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>One could reasonably protest that the argument here is not completely elementary because the statement $\P(X \geq 0) = 1 \implies \left(\E[X] = 0 \iff \P(X = 0) = 1 \right)$ requires some basic measure theory to prove. But it’s still much simpler than Kolmogorov’s zero-one law! <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Robert Zimmerman</name><email>robert.zimmerman@alumni.utoronto.ca</email></author><category term="probability" /><summary type="html"><![CDATA[Back in 2020, I taught [STA261](https://rob-zimmerman.github.io/teaching/STA261) for the first time. The first part of that course deals with statistics (i.e., functions of random samples, not the subject as a whole!) and I chose to provide a light introduction to completeness because of how elegant the [Lehmann-Scheffé theorem](https://en.wikipedia.org/wiki/Lehmann%E2%80%93Scheff%C3%A9_theorem) and related results in point estimation are down the road...]]></summary></entry></feed>