diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md index aeccb68783..2ac4a2d8bb 100644 --- a/book/src/SUMMARY.md +++ b/book/src/SUMMARY.md @@ -12,7 +12,7 @@ - [Multiplying polynomials](./fundamentals/zkbook_multiplying_polynomials.md) - [Fast Fourier transform](./fundamentals/zkbook_fft.md) -# Cryptographic tools +# Cryptographic Tools - [Commitments](./fundamentals/zkbook_commitment.md) - [Polynomial commitments](./plonk/polynomial_commitments.md) @@ -27,55 +27,47 @@ - [Half Gate](./fundamentals/zkbook_2pc/halfgate.md) - [Full Description](./fundamentals/zkbook_2pc/fulldesc.md) - [Fixed-Key-AES Hashes](./fundamentals/zkbook_2pc/fkaes.md) - - [Oblivious Transfer](./fundamentals/zkbook_2pc/ot.md) - [Base OT](./fundamentals/zkbook_2pc/baseot.md) - [OT Extension](./fundamentals/zkbook_2pc/ote.md) - - [Full Protocol](./fundamentals/zkbook_2pc/2pc.md) - -# Proof systems - -- [Overview](./fundamentals/proof_systems.md) -- [zk-SNARKs](./fundamentals/zkbook_plonk.md) -- [Custom constraints](./fundamentals/custom_constraints.md) +- [Proof Systems](./fundamentals/proof_systems.md) + - [zk-SNARKs](./fundamentals/zkbook_plonk.md) # Background on PLONK - [Overview](./plonk/overview.md) -- [Glossary](./plonk/glossary.md) + - [Glossary](./plonk/glossary.md) - [Domain](./plonk/domain.md) -- [Lagrange basis in multiplicative subgroups](./plonk/lagrange.md) -- [Non-interaction with fiat-shamir](./plonk/fiat_shamir.md) +- [Lagrange Basis in Multiplicative Subgroups](./plonk/lagrange.md) +- [Non-Interactivity via Fiat-Shamir](./plonk/fiat_shamir.md) - [Plookup](./plonk/plookup.md) -- [Maller's optimization](./plonk/maller.md) +- [Maller's Optimization](./plonk/maller.md) +- [Zero-Column Approach to Zero-Knowledge](./plonk/zkpm.md) # Kimchi - [Overview](./kimchi/overview.md) - - [Arguments](./kimchi/arguments.md) - - [Custom gates](./kimchi/gates.md) - - [Permutation](./kimchi/permut.md) - - [Lookup](./kimchi/lookup.md) +- [Arguments](./kimchi/arguments.md) +- [Final 
Check](./kimchi/final_check.md) +- [Maller's Optimization for Kimchi](./kimchi/maller_15.md) +- [Lookup Tables](./kimchi/lookup.md) + - [Extended Lookup Tables](./kimchi/extended-lookup-tables.md) +- [Custom Constraints](./kimchi/custom_constraints.md) +- [Custom Gates](./kimchi/gates.md) + - [Foreign Field Addition](./kimchi/foreign_field_add.md) + - [Foreign Field Multiplication](./kimchi/foreign_field_mul.md) + - [Keccak](./kimchi/keccak.md) + # Pickles & Inductive Proof Systems - [Overview](./fundamentals/zkbook_ips.md) - [Accumulation](./pickles/accumulation.md) - [Deferred Computation](./pickles/deferred.md) -- [Passthough & Me-Only](./pickles/passthrough.md) -# RFCs -- [RFC 0: Alternative zero-knowledge](./plonk/zkpm.md) -- [RFC 1: Final check](./plonk/final_check.md) -- [RFC 2: Maller's optimization for kimchi](./plonk/maller_15.md) -- [RFC 3: Plookup integration in kimchi](./rfcs/3-lookup.md) -- [RFC 4: Extended lookup tables](./rfcs/extended-lookup-tables.md) -- [RFC 5: Foreign Field Addition](./rfcs/foreign_field_add.md) -- [RFC 6: Keccak](./rfcs/keccak.md) - -# Specifications +# Technical Specifications - [Poseidon hash](./specs/poseidon.md) - [Polynomial commitment](./specs/poly-commitment.md) diff --git a/book/src/fundamentals/proof_systems.md b/book/src/fundamentals/proof_systems.md index 84b5f8fe97..d281048974 100644 --- a/book/src/fundamentals/proof_systems.md +++ b/book/src/fundamentals/proof_systems.md @@ -1,31 +1,29 @@ -# Overview +# Proof Systems Design Overview Many modern proof systems (and I think all that are in use) are constructed according to the following recipe. 1. You start out with a class of computations. 2. You devise a way to *arithmetize* those computations. That is, to express your computation as a statement about polynomials. - + More specifically, you describe what is often called an "algebraic interactive oracle proof" (AIOP) that encodes your computation. 
An AIOP is a protocol describing an interaction between a prover and a verifier, in which the prover sends the verifier some "polynomial oracles" (basically a black-box function that, given a point, evaluates a polynomial at that point), the verifier sends the prover random challenges, and at the end, the verifier queries the prover's polynomials at points of its choosing and decides whether to accept the proof. 3. An AIOP is an imagined interaction between parties. It is an abstract description of the protocol that will be "compiled" into a SNARK. There are several "non-realistic" aspects about it. One is that the prover sends the verifier black-box polynomials that the verifier can evaluate. These polynomials have degree comparable to the size of the computation being verified. If we implemented these "polynomial oracles" by having the prover really send the $O(n)$-size polynomials (say by sending all their coefficients), then we would not have a zk-SNARK at all, since the verifier would have to read these linearly sized polynomials, so we would lose succinctness, and the polynomials would not be black-box functions, so we may lose zero-knowledge. - + Instead, when we concretely instantiate the AIOP, we have the prover send constant-sized, hiding *polynomial commitments*. Then, in the phase of the AIOP where the verifier queries the polynomials, the prover sends an *opening proof* for the polynomial commitments which the verifier can check, thus simulating the activity of evaluating the prover's polynomials on its own. - + So this is the next step of making a SNARK: instantiating the AIOP with a polynomial commitment scheme of one's choosing. There are several choices here and these affect the properties of the SNARK you are constructing, as the SNARK will inherit the efficiency and setup properties of the polynomial commitment scheme used. -4. An AIOP describes an interactive protocol between the verifier and the prover.
In reality, typically, we also want our proofs to be non-interactive. - +4. An AIOP describes an interactive protocol between the verifier and the prover. In reality, typically, we also want our proofs to be non-interactive. + This is accomplished by what is called the [Fiat--Shamir transformation](). The basic idea is this: all that the verifier is doing is sampling random values to send to the prover. Instead, to generate a "random" value, the prover simulates the verifier by hashing its messages. The resulting hash is used as the "random" challenge. - + At this point we have a fully non-interactive proof. Let's review our steps. - + 1. Start with a computation. - + 2. Translate the computation into a statement about polynomials and design a corresponding AIOP. - - 3. Compile the AIOP into an interactive protocol by having the prover send hiding polynomial commitments instead of polynomial oracles. - - 4. Get rid of the verifier-interaction by replacing it with a hash function. I.e., apply the Fiat--Shamir transform. + 3. Compile the AIOP into an interactive protocol by having the prover send hiding polynomial commitments instead of polynomial oracles. + 4. Get rid of the verifier-interaction by replacing it with a hash function. I.e., apply the Fiat--Shamir transform. diff --git a/book/src/fundamentals/custom_constraints.md b/book/src/kimchi/custom_constraints.md similarity index 98% rename from book/src/fundamentals/custom_constraints.md rename to book/src/kimchi/custom_constraints.md index 038ec9a2f9..e2c7256d68 100644 --- a/book/src/fundamentals/custom_constraints.md +++ b/book/src/kimchi/custom_constraints.md @@ -20,7 +20,7 @@ Then, given such a circuit, PLONK lets you produce proofs for the statement ## Specifying a constraint -Mathematically speaking, a constraint is a multivariate polynomial over the variables $v_{\mathsf{Curr},0}, \dots, v_{\mathsf{Curr}, W+A-1}, v_{\mathsf{Next}, 0}, \dots, v_{\mathsf{Next}, W+A-1}$. 
In other words, there is one variable corresponding to the value of each column in the "current row" and one variable correspond to the value of each column in the "next row". +Mathematically speaking, a constraint is a multivariate polynomial over the variables $v_{\mathsf{Curr},0}, \dots, v_{\mathsf{Curr}, W+A-1}, v_{\mathsf{Next}, 0}, \dots, v_{\mathsf{Next}, W+A-1}$. In other words, there is one variable corresponding to the value of each column in the "current row" and one variable corresponding to the value of each column in the "next row". In Rust, $v_{r, i}$ is written `E::cell(Column::Witness(i), r)`. So, for example, the variable $v_{\mathsf{Next}, 3}$ is written `E::cell(Column::Witness(3), CurrOrNext::Next)`. diff --git a/book/src/rfcs/extended-lookup-tables.md b/book/src/kimchi/extended-lookup-tables.md similarity index 99% rename from book/src/rfcs/extended-lookup-tables.md rename to book/src/kimchi/extended-lookup-tables.md index ed8d29bf7c..a2205ddebd 100644 --- a/book/src/rfcs/extended-lookup-tables.md +++ b/book/src/kimchi/extended-lookup-tables.md @@ -1,6 +1,6 @@ -# RFC: Extended lookup tables +# Extended lookup tables -This RFC proposes an extension to our use of lookup tables using the PLOOKUP +This (old) RFC proposes an extension to our use of lookup tables using the PLOOKUP multiset inclusion argument, so that values within lookup tables can be chosen after the constraint system for a circuit has been fixed.
diff --git a/book/src/plonk/final_check.md b/book/src/kimchi/final_check.md similarity index 100% rename from book/src/plonk/final_check.md rename to book/src/kimchi/final_check.md diff --git a/book/src/rfcs/foreign_field_add.md b/book/src/kimchi/foreign_field_add.md similarity index 100% rename from book/src/rfcs/foreign_field_add.md rename to book/src/kimchi/foreign_field_add.md diff --git a/book/src/kimchi/foreign_field_mul.md b/book/src/kimchi/foreign_field_mul.md new file mode 100644 index 0000000000..4569be8055 --- /dev/null +++ b/book/src/kimchi/foreign_field_mul.md @@ -0,0 +1,1316 @@ +# Foreign Field Multiplication + +This document is an original RFC explaining how we constrain foreign field multiplication (i.e. non-native field multiplication) in the Kimchi proof system. +For a more recent RFC, see [FFmul RFC](https://github.com/o1-labs/rfcs/blob/main/0006-ffmul-revised.md). + +**Changelog** +
| Author(s) | Date | Details |
|-|-|-|
| Joseph Spadavecchia and Anais Querol | October 2022 | Design of foreign field multiplication in Kimchi |
+ +## Overview + +This gate constrains + +$$ +a \cdot b = c \mod f +$$ + +where $a, b, c \in \mathbb{F}_f$, a foreign field with modulus $f$, using the native field $\mathbb{F}_n$ with prime modulus $n$. + +## Approach + +In foreign field multiplication the foreign field modulus $f$ could be bigger or smaller than the native field modulus $n$. When the foreign field modulus is bigger, we need to emulate foreign field multiplication by splitting the foreign field elements up into limbs that fit into the native field element size. When the foreign modulus is smaller, everything can be constrained either in the native field or split up into limbs. + +Since our projected use cases are when the foreign field modulus is bigger (more on this below), we optimize our design around this scenario.
For this case, not only must we split foreign field elements up into limbs that fit within the native field, but we must also operate in a space bigger than the foreign field. This is because we check the multiplication of two foreign field elements by constraining the quotient and remainder of their product. That is, renaming $c$ to $r$, we must constrain + +$$ +a \cdot b = q \cdot f + r, +$$ + +where the maximum size of $q$ and $r$ is $f - 1$ and so we have + +$$ +\begin{aligned} +a \cdot b &\le \underbrace{(f - 1)}_q \cdot f + \underbrace{(f - 1)}_r \\ +&\le f^2 - 1. +\end{aligned} +$$ + +Thus, the maximum size of the multiplication is quadratic in the size of the foreign field. + +**Naïve approach** + +Naïvely, this implies that we have elements of size up to $f^2 - 1$ that we must split into limbs of size at most $n - 1$. For example, if the foreign field modulus is $256$ bits and the native field modulus is $255$ bits, then we'd need $\log_2((2^{256})^2 - 1) \approx 512$ bits and, thus, require $512/255 \approx 3$ native limbs. However, each limb cannot consume all $255$ bits of the native field element because we need space to perform arithmetic on the limbs themselves while constraining the foreign field multiplication. Therefore, we need to choose a limb size that leaves space for performing these computations. + +Later in this document (see the section entitled "Choosing the limb configuration") we determine the optimal number of limbs that reduces the number of rows and gates required to constrain foreign field multiplication. This results in $\ell = 88$ bits as our optimal limb size. In the section about intermediate products we place some upper bounds on the number of bits required when constraining foreign field multiplication with limbs of size $\ell$, thereby proving that the computations can fit within the native field size.
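As a quick numerical sanity check of the quotient-remainder formulation above, here is a sketch in Python (the operand values are arbitrary examples, and secp256k1's modulus, used later in this document, stands in for $f$):

```python
# Check a * b = q * f + r over the integers, with q, r < f.
f = 2**256 - 2**32 - 2**9 - 2**8 - 2**7 - 2**6 - 2**4 - 1  # secp256k1 (example)

a = 1234567890123456789012345678901234567890 % f
b = 9876543210987654321098765432109876543210 % f

q, r = divmod(a * b, f)

assert a * b == q * f + r   # the constrained identity
assert r == (a * b) % f     # r is the foreign field product c
assert q < f and r < f      # both are at most f - 1
assert a * b <= f**2 - 1    # quadratic upper bound on the product
```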
+ +Observe that by combining the naïve approach above with a limb size of $88$ bits, we would require $512/88 \approx 6$ limbs for representing foreign field elements. Each limb is stored in a witness cell (a native field element). However, since each limb is necessarily smaller than the native field element size, it must be copied to the range-check gate to constrain its value. Since Kimchi only supports 7 copyable witness cells per row, this means that only one foreign field element can be stored per row. This means a single foreign field multiplication would consume at least 4 rows (just for the operands, quotient and remainder). This is not ideal because we want to limit the number of rows for improved performance. + +**Chinese remainder theorem** + +Fortunately, we can do much better than this by leveraging the Chinese remainder theorem (CRT for short), as we will now show. The idea is to reduce the number of bits required by constraining our multiplication modulo two coprime moduli: $2^t$ and $n$. By constraining the multiplication with both moduli, the CRT guarantees that the constraints hold with the bigger modulus $2^t \cdot n$. + +The advantage of this approach is that constraining with the native modulus $n$ is very fast, allowing us to speed up our non-native computations. This practically reduces the costs to constraining with limbs over $2^t$, where $t$ is something much smaller than $512$. + +For this to work, we must select a value for $t$ that is big enough. That is, we select $t$ such that $2^t \cdot n > f^2 - 1$. As described later on in this document, by doing so we reduce the number of bits required for our use cases to $264$. With $88$-bit limbs this means that each foreign field element only consumes $3$ witness elements, and that means the foreign field multiplication gate now only consumes $2$ rows. The section entitled "Choosing $t$" describes this in more detail.
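The CRT idea can be sketched numerically (Python; vesta as the native modulus and secp256k1 as the foreign modulus, with arbitrary example operands):

```python
# If a*b - q*f - r ≡ 0 both mod 2^t and mod n, and both sides of
# a*b = q*f + r are below 2^t * n, the identity holds over the integers.
f = 2**256 - 2**32 - 2**9 - 2**8 - 2**7 - 2**6 - 2**4 - 1   # secp256k1 (foreign)
n = 2**254 + 45560315531506369815346746415080538113         # vesta (native)
t = 264

a = 3**200 % f
b = 5**120 % f
q, r = divmod(a * b, f)

assert (a * b - q * f - r) % 2**t == 0  # check with the binary modulus
assert (a * b - q * f - r) % n == 0     # check with the native modulus
assert a * b < 2**t * n                 # bound checks for the CRT argument
assert q * f + r < 2**t * n
```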
+ +**Overall approach** + +Bringing it all together, our approach is to verify that + +$$ +\begin{align} +a \cdot b = q \cdot f + r +\end{align} +$$ + +over $\mathbb{Z}^+$. In order to do this efficiently we use the CRT, which means that the equation holds mod $M = 2^t \cdot n$. For the equation to hold over the integers we must also check that each side of the equation is less than $2^t \cdot n$. + +The overall steps are therefore + +1. Apply CRT to equation (1) + * Check validity with binary modulus $\mod 2^t$ + * Check validity with native modulus $\mod n$ +2. Check each side of equation (1) is less than $M$ + * $a \cdot b < 2^t \cdot n$ + * $q \cdot f + r < 2^t \cdot n$ + +This then implies that + +$$ +a \cdot b = c \mod f, +$$ + +where $c = r$. That is, $a$ multiplied by $b$ is equal to $c$ where $a,b,c \in \mathbb{F}_f$. There is a lot more to each of these steps. That is the subject of the rest of this document. + +**Other strategies** + +Within our overall approach, aside from the CRT, we also use a number of other strategies to reduce the number and degree of constraints. + +* Avoiding borrows and carries +* Intermediate product elimination +* Combining multiplications + +The rest of this document describes each part in detail. + +## Parameter selection + +This section describes important parameters that we require and how they are computed. + +* *Native field modulus* $n$ +* *Foreign field modulus* $f$ +* *Binary modulus* $2^t$ +* *CRT modulus* $2^t \cdot n$ +* *Limb size* in bits $\ell$ + +#### Choosing $t$ + +Under the hood, we constrain $a \cdot b = q \cdot f + r \mod 2^t \cdot n$. Since this is a multiplication in the foreign field $f$, the maximum size of $q$ and $r$ is $f - 1$, so this means + +$$ +\begin{aligned} +a \cdot b &\le (f - 1) \cdot f + (f - 1) \\ +&\le f^2 - 1.
+\end{aligned} +$$ + +Therefore, we need the modulus $2^t \cdot n$ such that + +$$ +2^t \cdot n > f^2 - 1, +$$ + +which is the same as saying, given $f$ and $n$, we must select $t$ such that + +$$ +\begin{aligned} +2^t \cdot n &\ge f^2 \\ +t &\ge 2\log_2(f) - \log_2(n). +\end{aligned} +$$ + +Thus, we have a lower bound on $t$. + +Instead of dynamically selecting $t$ for every $n$ and $f$ combination, we fix a $t$ that will work for the different selections of $n$ and $f$ relevant to our use cases. + +To guide us, from above we know that + +$$ +t_{min} = 2\log_2(f) - \log_2(n) +$$ + +and we know the field moduli for our immediate use cases. + +
```
vesta     = 2^254 + 45560315531506369815346746415080538113 (255 bits)
pallas    = 2^254 + 45560315531419706090280762371685220353 (255 bits)
secp256k1 = 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1 (256 bits)
```
+ +So we can create a table +
| $n$      | $f$         | $t_{min}$ |
| -------- | ----------- | --------- |
| `vesta`  | `secp256k1` | 258       |
| `pallas` | `secp256k1` | 258       |
| `vesta`  | `pallas`    | 255       |
| `pallas` | `vesta`     | 255       |
+ +and know that to cover our use cases we need $t \ge 258$. + +Next, given our native modulus $n$ and $t$, we can compute the *maximum foreign field modulus supported*. Actually, we compute the maximum supported bit length of the foreign field modulus $f=2^m$. + +$$ +\begin{aligned} +2^t \cdot n &\ge f^2 \\ +&\ge (2^m)^2 = 2^{2m} \\ +t + \log_2(n) &\ge \log_2(2^{2m}) = 2m +\end{aligned} +$$ + +So we get + +$$ +m_{max} = \frac{t + \log_2(n)}{2}. +$$ + +With $t=258$ and $\log_2(n) \approx 255$ we have + +$$ +\begin{aligned} +m_{max} &= \frac{258 + 255}{2} = 256.5, +\end{aligned} +$$ + +which is not enough space to handle anything larger than 256-bit moduli. Instead, we will use $t=264$, giving $m_{max} = 259$ bits.
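The $t_{min}$ values in the table can be reproduced directly with exact integer arithmetic rather than real-valued logarithms (Python):

```python
# Smallest t such that 2^t * n >= f^2, for each (native, foreign) pair.
vesta     = 2**254 + 45560315531506369815346746415080538113
pallas    = 2**254 + 45560315531419706090280762371685220353
secp256k1 = 2**256 - 2**32 - 2**9 - 2**8 - 2**7 - 2**6 - 2**4 - 1

def t_min(n, f):
    t = 0
    while 2**t * n < f * f:
        t += 1
    return t

assert t_min(vesta, secp256k1) == 258
assert t_min(pallas, secp256k1) == 258
assert t_min(vesta, pallas) == 255
assert t_min(pallas, vesta) == 255
```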
+ +The above formula is useful for checking the maximum supported bit length of the foreign field modulus, but it is not useful for computing the maximum foreign field modulus itself (because $2^{m_{max}}$ is too coarse). For these checks, we can compute our maximum foreign field modulus more precisely with + +$$ +max_{mod} = \lfloor \sqrt{2^t \cdot n} \rfloor. +$$ + +The max prime foreign field modulus satisfying the above inequality for both Pallas and Vesta is `926336713898529563388567880069503262826888842373627227613104999999999999999607`. + +#### Choosing the limb configuration + +Choosing the right limb size and the right number of limbs is a careful balance between the number of constraints (i.e. the polynomial degree) and the witness length (i.e. the number of rows). Because one limiting factor that we have in Kimchi is the 12-bit maximum for range check lookups, we could be tempted to introduce 12-bit limbs. However, this would mean having many more limbs, which would consume more witness elements and require significantly more rows. It would also increase the polynomial degree by increasing the number of constraints required for the *intermediate products* (more on this later). + +We need to find a balance between the number of limbs and the size of the limbs. The limb configuration is dependent on the value of $t$ and our maximum foreign modulus (as described in the previous section). The larger the maximum foreign modulus, the more witness rows we will require to constrain the computation. In particular, each limb needs to be constrained by the range check gate and, thus, must be in a copyable (i.e. permutable) witness cell. We have at most 7 copyable cells per row and gates can operate on at most 2 rows, meaning that we have an upper bound of 14 limbs per gate (or 7 limbs per row).
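The stated maximum can be verified against $max_{mod} = \lfloor \sqrt{2^t \cdot n} \rfloor$ with integer arithmetic (Python; the primality of the constant is not re-checked here):

```python
import math

vesta  = 2**254 + 45560315531506369815346746415080538113
pallas = 2**254 + 45560315531419706090280762371685220353
t = 264

max_f = 926336713898529563388567880069503262826888842373627227613104999999999999999607

# max_f^2 must fit under 2^t * n for both candidate native moduli
for n in (vesta, pallas):
    assert max_f <= math.isqrt(2**t * n)
    assert max_f * max_f < 2**t * n
```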
+ +As stated above, we want the foreign field modulus to fit in as few rows as possible and we need to constrain operands $a, b$, the quotient $q$ and remainder $r$. Each of these will require cells for each limb. Thus, the number of cells required for these is + +$$ +cells = 4 \cdot limbs +$$ + +It is highly advantageous for performance to constrain foreign field multiplication with the minimal number of gates. This not only helps limit the number of rows, but also helps keep the gate selector polynomial small. Because of all of this, we aim to constrain foreign field multiplication with a single gate (spanning at most $2$ rows). As mentioned above, we have a maximum of 14 permutable cells per gate, so we can compute the maximum number of limbs that fit within a single gate like this. + +$$ +\begin{aligned} +limbs_{max} &= \lfloor cells/4 \rfloor \\ + &= \lfloor 14/4 \rfloor \\ + &= 3 \\ +\end{aligned} +$$ + +Thus, the maximum number of limbs possible in a single gate configuration is 3. + +Using $limbs_{max}=3$ and the $t=264$ that covers our use cases (see the previous section), we are finally able to derive our limb size + +$$ +\begin{aligned} +\ell &= \frac{t}{limbs_{max}} \\ +&= 264/3 \\ +&= 88 +\end{aligned} +$$ + +Therefore, our limb configuration is: + + * Limb size $\ell = 88$ bits + * Number of limbs is $3$ + +## Avoiding borrows + +When we constrain $a \cdot b - q \cdot f = r \mod 2^t$ we want to be as efficient as possible. + +Observe that the expansion of $a \cdot b - q \cdot f$ into limbs would also have subtractions between limbs, requiring our constraints to account for borrowing. Dealing with this would create undesired complexity and increase the degree of our constraints.
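A tiny illustration of the borrowing problem (Python, with toy 8-bit limbs and arbitrary values): limb-wise subtraction can go negative even when the overall difference is positive.

```python
# With 8-bit toy limbs, x - y may need a borrow limb by limb even though x > y.
ell = 8
x, y = 0x0300, 0x00ff

x0, x1 = x & 0xff, x >> ell   # low and high limbs of x
y0, y1 = y & 0xff, y >> ell   # low and high limbs of y

assert x - y > 0              # the full difference is positive...
assert x0 - y0 < 0            # ...but the low-limb difference needs a borrow
```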
+ +In order to avoid the possibility of subtractions we instead use $a \cdot b + q \cdot f'$ where + +$$ +\begin{aligned} +f' &= -f \mod 2^t \\ + &= 2^t - f +\end{aligned} +$$ + +The negated modulus $f'$ becomes part of our gate coefficients and is not constrained because it is publicly auditable. + +Using the substitution of the negated modulus, we now must constrain $a \cdot b + q \cdot f' = r \mod 2^t$. + +> Observe that $f' < 2^t$ since $f < 2^t$ and that $f' > f$ when $f < 2^{t - 1}$. + +## Intermediate products + +This section explains how we expand our constraints into limbs and then eliminate a number of extra terms. + +We must constrain $a \cdot b + q \cdot f' = r \mod 2^t$ on the limbs, rather than as a whole. As described above, each foreign field element $x$ is split into three 88-bit limbs: $x_0, x_1, x_2$, where $x_0$ contains the least significant bits and $x_2$ contains the most significant bits. + +Expanding the left-hand side into limbs we have + +$$ +\begin{aligned} +&(a_0 + a_1 \cdot 2^{\ell} + a_2 \cdot 2^{2\ell}) \cdot (b_0 + b_1 \cdot 2^{\ell} + b_2 \cdot 2^{2\ell}) + (q_0 + q_1 \cdot 2^{\ell} + q_2 \cdot 2^{2\ell}) \cdot (f'_0 + f'_1 \cdot 2^{\ell} + f'_2 \cdot 2^{2\ell}) \\ +&=\\ +&~~~~~ a_0 \cdot b_0 + a_0 \cdot b_1 \cdot 2^{\ell} + a_0 \cdot b_2 \cdot 2^{2\ell} \\ +&~~~~ + a_1 \cdot b_0 \cdot 2^{\ell} + a_1 \cdot b_1 \cdot 2^{2\ell} + a_1 \cdot b_2 \cdot 2^{3\ell} \\ +&~~~~ + a_2 \cdot b_0 \cdot 2^{2\ell} + a_2 \cdot b_1 \cdot 2^{3\ell} + a_2 \cdot b_2 \cdot 2^{4\ell} \\ +&+ \\ +&~~~~~ q_0 \cdot f'_0 + q_0 \cdot f'_1 \cdot 2^{\ell} + q_0 \cdot f'_2 \cdot 2^{2\ell} \\ +&~~~~ + q_1 \cdot f'_0 \cdot 2^{\ell} + q_1 \cdot f'_1 \cdot 2^{2\ell} + q_1 \cdot f'_2 \cdot 2^{3\ell} \\ +&~~~~ + q_2 \cdot f'_0 \cdot 2^{2\ell} + q_2 \cdot f'_1 \cdot 2^{3\ell} + q_2 \cdot f'_2 \cdot 2^{4\ell} \\ +&= \\ +&a_0 \cdot b_0 + q_0 \cdot f'_0 \\ +&+ 2^{\ell} \cdot (a_0 \cdot b_1 + a_1 \cdot b_0 + q_0 \cdot f'_1 + q_1 \cdot f'_0) \\ +&+ 2^{2\ell} \cdot
(a_0 \cdot b_2 + a_2 \cdot b_0 + q_0 \cdot f'_2 + q_2 \cdot f'_0 + a_1 \cdot b_1 + q_1 \cdot f'_1) \\ +&+ 2^{3\ell} \cdot (a_1 \cdot b_2 + a_2 \cdot b_1 + q_1 \cdot f'_2 + q_2 \cdot f'_1) \\ +&+ 2^{4\ell} \cdot (a_2 \cdot b_2 + q_2 \cdot f'_2) \\ +\end{aligned} +$$ + +Since $t = 3\ell$, the terms scaled by $2^{3\ell}$ and $2^{4\ell}$ are multiples of the binary modulus and, thus, congruent to zero $\mod 2^t$. They can be eliminated and we don't need to compute them. So we are left with 3 *intermediate products* that we call $p_0, p_1, p_2$: +
| Term  | Scale       | Product                                                  |
| ----- | ----------- | -------------------------------------------------------- |
| $p_0$ | $1$         | $a_0b_0 + q_0f'_0$                                       |
| $p_1$ | $2^{\ell}$  | $a_0b_1 + a_1b_0 + q_0f'_1 + q_1f'_0$                    |
| $p_2$ | $2^{2\ell}$ | $a_0b_2 + a_2b_0 + q_0f'_2 + q_2f'_0 + a_1b_1 + q_1f'_1$ |
+ +So far, we have introduced these checked computations to our constraints + +> 1. Computation of $p_0, p_1, p_2$ + +## Constraining $\mod 2^t$ + +Let's call $p := ab + qf' \mod 2^t$. Remember that our goal is to constrain that $p - r = 0 \mod 2^t$ (recall that any bits more significant than the 264th are ignored $\mod 2^t$). Decomposing that claim into limbs, that means + +$$ +\begin{align} +2^{2\ell}(p_2 - r_2) + 2^{\ell}(p_1 - r_1) + p_0 - r_0 = 0 \mod 2^t. +\end{align} +$$ + +We face two challenges + +* Since $p_0, p_1, p_2$ are roughly $2\ell$ bits each, the left-hand side of the equation above does not fit in $\mathbb{F}_n$ +* The subtraction of the remainder's limbs $r_0$ and $r_1$ could require borrowing + +For the moment, let's not worry about the possibility of borrows and instead focus on the first problem. + +## Combining multiplications + +The first problem is that our native field is too small to constrain $2^{2\ell}(p_2 - r_2) + 2^{\ell}(p_1 - r_1) + p_0 - r_0 = 0 \mod 2^t$.
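Before addressing it, the intermediate-product elimination can be checked numerically (Python; $f' = 2^t - f$ as above, and the operand values are arbitrary examples):

```python
# Verify that a*b + q*f' collapses to p0 + 2^ℓ·p1 + 2^{2ℓ}·p2 ≡ r (mod 2^t).
ell, t = 88, 264
f = 2**256 - 2**32 - 2**9 - 2**8 - 2**7 - 2**6 - 2**4 - 1  # secp256k1 (example)
fp = 2**t - f                                              # negated modulus f'

a, b = 3**200 % f, 5**120 % f
q, r = divmod(a * b, f)

def limbs(x):  # three 88-bit limbs, least significant first
    return [(x >> (ell * i)) & (2**ell - 1) for i in range(3)]

a0, a1, a2 = limbs(a); b0, b1, b2 = limbs(b)
q0, q1, q2 = limbs(q); fp0, fp1, fp2 = limbs(fp)

p0 = a0*b0 + q0*fp0
p1 = a0*b1 + a1*b0 + q0*fp1 + q1*fp0
p2 = a0*b2 + a2*b0 + a1*b1 + q0*fp2 + q2*fp0 + q1*fp1

# The 2^{3ℓ} and 2^{4ℓ} terms vanish mod 2^t, so only p0, p1, p2 remain
assert (p0 + 2**ell*p1 + 2**(2*ell)*p2) % 2**t == (a*b + q*fp) % 2**t
assert (p0 + 2**ell*p1 + 2**(2*ell)*p2 - r) % 2**t == 0
```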
We could split this up by multiplying $a \cdot b$ and $q \cdot f'$ separately and create constraints that carefully track borrows and carries between limbs. However, a more efficient approach is to combine the whole computation together and accumulate all the carries and borrows in order to reduce their number. + +The trick is to assume a space large enough to hold the computation, view the outcome in binary and then split it up into chunks that fit in the native modulus. + +To this end, it helps to know how many bits these intermediate products require. On the left side of the equation, $p_0$ is at most $2\ell + 1$ bits. We can compute this by substituting the maximum possible binary values (all bits set to 1) into $p_0 = a_0b_0 + q_0f'_0$ like this + +$$ +\begin{aligned} +\mathsf{maxbits}(p_0) &= \log_2(\underbrace{(2^{\ell} - 1)}_{a_{0}} \underbrace{(2^{\ell} - 1)}_{b_{0}} + \underbrace{(2^{\ell} - 1)}_{q_{0}} \underbrace{(2^{\ell} - 1)}_{f'_{0}}) \\ +&= \log_2(2(2^{2\ell} - 2^{\ell + 1} + 1)) \\ +&= \log_2(2^{2\ell + 1} - 2^{\ell + 2} + 2). +\end{aligned} +$$ + +So $p_0$ fits in $2\ell + 1$ bits. Similarly, $p_1$ needs at most $2\ell + 2$ bits and $p_2$ is at most $2\ell + 3$ bits. +
| Term     | $p_0$       | $p_1$       | $p_2$       |
| -------- | ----------- | ----------- | ----------- |
| **Bits** | $2\ell + 1$ | $2\ell + 2$ | $2\ell + 3$ |
+ +The diagram below shows the left-hand side of the zero-sum equality from equation (2). That is, the value $p - r$. Let's look at how the different bits of $p_0, p_1, p_2, r_0, r_1$ and $r_2$ impact it.
+ +```text +0 L 2L 3L 4L +|-------------|-------------|-------------|-------------|-------------| + : +|--------------p0-----------:-| 2L + 1 + : + |-------------:-p1------------| 2L + 2 + p10➚ : p11➚ + |----------------p2-------------| 2L + 3 + : +|-----r0------| : + : + |-----r1------| + : + |-----r2------| +\__________________________/ \______________________________/ + ≈ h0 ≈ h1 +``` + +Within our native field modulus we can fit up to $2\ell + \delta < \log_2(n)$ bits, for small values of $\delta$ (but sufficient for our case). Thus, we can only constrain approximately half of $p - r$ at a time. In the diagram above the vertical line at 2L bisects $p - r$ into two $\approx2\ell$ bit values: $h_0$ and $h_1$ (the exact definition of these values follow). Our goal is to constrain $h_0$ and $h_1$. + +## Computing the zero-sum halves: $h_0$ and $h_1$ + +Now we can derive how to compute $h_0$ and $h_1$ from $p$ and $r$. + +The direct approach would be to bisect both $p_0$ and $p_1$ and then define $h_0$ as just the sum of the $2\ell$ lower bits of $p_0$ and $p_1$ minus $r_0$ and $r_1$. Similarly $h_1$ would be just the sum of upper bits of $p_0, p_1$ and $p_2$ minus $r_2$. However, each bisection requires constraints for the decomposition and range checks for the two halves. Thus, we would like to avoid bisections as they are expensive. + +Ideally, if our $p$'s lined up on the $2\ell$ boundary, we would not need to bisect at all. However, we are unlucky and it seems like we must bisect both $p_0$ and $p_1$. Fortunately, we can at least avoid bisecting $p_0$ by allowing it to be summed into $h_0$ like this + +$$ +h_0 = p_0 + 2^{\ell}\cdot p_{10} - r_0 - 2^{\ell}\cdot r_1 +$$ + +Note that $h_0$ is actually greater than $2\ell$ bits in length. This may not only be because it contains $p_0$ whose length is $2\ell + 1$, but also because adding $p_{10}$ may cause an overflow. 
The maximum length of $h_0$ is computed by substituting in the maximum possible binary value of $2^{\ell} - 1$ for the added terms and $0$ for the subtracted terms of the above equation. + +$$ +\begin{aligned} +\mathsf{maxbits}(h_0) &= \log_2(\underbrace{(2^{\ell} - 1)(2^{\ell} - 1) + (2^{\ell} - 1)(2^{\ell} - 1)}_{p_0} + 2^{\ell} \cdot \underbrace{(2^{\ell} - 1)}_{p_{10}}) \\ +&= \log_2(2^{2\ell + 1} - 2^{\ell + 2} + 2 + 2^{2\ell} - 2^\ell) \\ +&= \log_2( 3\cdot 2^{2\ell} - 5 \cdot 2^\ell +2 ) \\ +\end{aligned} +$$ + +which is $2\ell + 2$ bits. + +N.b. This computation assumes correctly sized values for $r_0$ and $r_1$, which we assure by range checks on the limbs. + +Next, we compute $h_1$ as + +$$ +h_1 = p_{11} + p_2 - r_2 +$$ + +The maximum size of $h_1$ is computed as + +$$ +\begin{aligned} +\mathsf{maxbits}(h_1) &= \mathsf{maxbits}(p_{11} + p_2) +\end{aligned} +$$ + +In order to obtain the maximum value of $p_{11}$, we define $p_{11} := \lfloor p_1/2^\ell \rfloor$. Since the maximum value of $p_1$ was $2^{2\ell+2}-2^{\ell+3}+4$, the maximum value of $p_{11}$ is $2^{\ell+2}-8$. For $p_2$, the maximum value was $6\cdot 2^{2\ell} - 12 \cdot 2^\ell + 6$, and thus: + +$$ +\begin{aligned} +\mathsf{maxbits}(h_1) &= \log_2(\underbrace{2^{\ell+2}-8}_{p_{11}} + \underbrace{6\cdot 2^{2\ell} - 12 \cdot 2^\ell + 6}_{p_2}) \\ +&= \log_2(6\cdot 2^{2\ell} - 8 \cdot 2^\ell - 2) \\ +\end{aligned} +$$ + +which is $2\ell + 3$ bits. +
| Term     | $h_0$       | $h_1$       |
| -------- | ----------- | ----------- |
| **Bits** | $2\ell + 2$ | $2\ell + 3$ |
+ +Thus far we have the following constraints +> 2. Composition of $p_{10}$ and $p_{11}$ results in $p_1$ +> 3. Range check $p_{11} \in [0, 2^{\ell + 2})$ +> 4. Range check $p_{10} \in [0, 2^{\ell})$ + +For the next step we would like to constrain $h_0$ and $h_1$ to zero. Unfortunately, we are not able to do this!
+ +* Firstly, as defined $h_0$ may not be zero because we have not bisected it precisely at $2\ell$ bits, but have allowed it to contain the full $2\ell + 2$ bits. Recall that these two additional bits are because $p_0$ is at most $2\ell + 1$ bits long, but also because adding $p_{10}$ increases it to $2\ell + 2$. These two additional bits are not part of the first $2\ell$ bits of $p - r$ and, thus, are not necessarily zero. That is, they belong to the second $2\ell$ bits (i.e. $h_1$). + +* Secondly, whilst the highest $\ell + 3$ bits of $p - r$ would wrap to zero $\mod 2^t$, when placed into the smaller $2\ell + 3$ bit $h_1$ in the native field, this wrapping does not happen. Thus, $h_1$'s $\ell + 3$ highest bits may be nonzero. + +We can deal with this non-zero issue by computing carry witness values. + +## Computing carry witness values $v_0$ and $v_1$ + +Instead of constraining $h_0$ and $h_1$ to zero, there must be satisfying witnesses $v_0$ and $v_1$ such that the following constraints hold. +> 5. There exists $v_0$ such that $h_0 = v_0 \cdot 2^{2\ell}$ +> 6. There exists $v_1$ such that $h_1 = v_1 \cdot 2^{\ell} - v_0$ + +Here $v_0$ is the top two bits of $h_0$'s $2\ell + 2$ bits, i.e., the result of adding the highest bit of $p_0$ and any possible carry bit from the operation of $h_0$. Similarly, $v_1$ corresponds to the highest $\ell + 3$ bits of $h_1$. It looks like this + +```text
0             L             2L            3L            4L
|-------------|-------------|-------------|-------------|-------------|
                            :
|--------------h0-----------:--| 2L + 2
                            : ↖v0
                            :-------------h1-------------| 2L + 3
                            : \____________/
                            :      v1➚
``` + +Remember we only need to prove the first $3\ell$ bits of $p - r$ are zero, since everything is $\mod 2^t$ and $t = 3\ell$. It may not be clear how this approach proves the $3\ell$ bits are indeed zero because within $h_0$ and $h_1$ there are bits that are nonzero. The key observation is that these bits are too high for $\mod 2^t$.
+
+By making the argument with $v_0$ and $v_1$ we are proving that $h_0$ is something where the $2\ell$ least significant bits are all zeros and that $h_1 + v_0$ is something where the $\ell$ least significant bits are also zeros. Any nonzero bits after $3\ell$ do not matter, since everything is $\mod 2^t$.
+
+All that remains is to range check $v_0$ and $v_1$
+> 7. Range check $v_0 \in [0, 2^2)$
+> 8. Range check $v_1 \in [0, 2^{\ell + 3})$
+
+## Subtractions
+
+When unconstrained, computing $u_0 = p_0 + 2^{\ell} \cdot p_{10} - (r_0 + 2^{\ell} \cdot r_1)$ could require borrowing. Fortunately, we have the constraint that the $2\ell$ least significant bits of $u_0$ are `0` (i.e. $u_0 = 2^{2\ell} \cdot v_0$), which means borrowing cannot happen.
+
+Borrowing is prevented because when the $2\ell$ least significant bits of the result are `0` it is not possible for the minuend to be less than the subtrahend. We can prove this by contradiction.
+
+Let
+
+* $x = p_0 + 2^{\ell} \cdot p_{10}$
+* $y = r_0 + 2^{\ell} \cdot r_1$
+* the $2\ell$ least significant bits of $x - y$ be `0`
+
+Suppose that borrowing occurs, that is, that $x < y$. Recall that the length of $x$ is at most $2\ell + 2$ bits. Therefore, since $x < y$ the top two bits of $x$ must be zero and so we have
+
+$$
+x - y = x_{2\ell} - y,
+$$
+
+where $x_{2\ell}$ denotes the $2\ell$ least significant bits of $x$.
+
+Recall also that the length of $y$ is $2\ell$ bits. We know this because limbs of the result are each constrained to be in $[0, 2^{\ell})$. So the result of this subtraction is $2\ell$ bits. Since the $2\ell$ least significant bits of the subtraction are `0` this means that
+
+$$
+\begin{aligned}
+x - y & = 0 \\
+x &= y,
+\end{aligned}
+$$
+
+which is a contradiction with $x < y$.
+
+## Costs
+
+Range checks should be the dominant cost, so let's see how many we have.
+
+Range check (3) requires two range checks for $p_{11} = p_{111} \cdot 2^\ell + p_{110}$
+ * a) $p_{110} \in [0, 2^\ell)$
+ * b) $p_{111} \in [0, 2^2)$
+
+Range check (8) requires two range checks and a decomposition check that is merged in (6).
+ * a) $v_{10} \in [0, 2^{\ell})$
+ * b) $v_{11} \in [0, 2^3)$
+
+The range checks on $p_0, p_1$ and $p_2$ follow from the range checks on $a,b$ and $q$.
+
+So we have 3.a, 3.b, 4, 7, 8.a, 8.b.
+
+| Range check | Gate type(s)                                     | Witness                   | Rows |
+| ----------- | ------------------------------------------------ | ------------------------- | ---- |
+| 7           | $(v_0 - 3)(v_0 - 2)(v_0 - 1)v_0$                 | $v_0$                     | < 1  |
+| 3.a         | $(p_{111} - 3)(p_{111} - 2)(p_{111} - 1)p_{111}$ | $p_{111}$                 | < 1  |
+| 8.b         | degree-8 constraint or plookup                   | $v_{11}$                  | 1    |
+| 3.b, 4, 8.a | `multi-range-check`                              | $p_{10}, p_{110}, v_{10}$ | 4    |
+
+So we have 1 multi-range-check, 1 single-range-check and 2 low-degree range checks. This consumes just over 5 rows.
+
+## Use CRT to constrain $a \cdot b - q \cdot f - r \equiv 0 \mod n$
+
+Until now we have constrained the equation $\mod 2^t$, but remember that our application of the CRT means that we must also constrain the equation $\mod n$. We are leveraging the fact that if the identity holds for all moduli in $\mathcal{M} = \{n, 2^t\}$, then it holds for $\mathtt{lcm} (\mathcal{M}) = 2^t \cdot n = M$.
+
+Thus, we must check $a \cdot b - q \cdot f - r \equiv 0 \mod n$, which is over $\mathbb{F}_n$.
+
+This gives us equality $\mod 2^t \cdot n$ as long as the divisors are coprime. That is, as long as $\mathsf{gcd}(2^t, n) = 1$. Since the native modulus $n$ is prime, this is true.
+
+Thus, performing this check is simple.
We compute
+
+$$
+\begin{aligned}
+a_n &= a \mod n \\
+b_n &= b \mod n \\
+q_n &= q \mod n \\
+r_n &= r \mod n \\
+f_n &= f \mod n
+\end{aligned}
+$$
+
+using our native field modulus with constraints like this
+
+$$
+\begin{aligned}
+a_n &= 2^{2\ell} \cdot a_2 + 2^{\ell} \cdot a_1 + a_0 \\
+b_n &= 2^{2\ell} \cdot b_2 + 2^{\ell} \cdot b_1 + b_0 \\
+q_n &= 2^{2\ell} \cdot q_2 + 2^{\ell} \cdot q_1 + q_0 \\
+r_n & = 2^{2\ell} \cdot r_2 + 2^{\ell} \cdot r_1 + r_0 \\
+f_n &= 2^{2\ell} \cdot f_2 + 2^{\ell} \cdot f_1 + f_0 \\
+\end{aligned}
+$$
+
+and then constrain
+
+$$
+a_n \cdot b_n - q_n \cdot f_n - r_n = 0 \mod n.
+$$
+
+Note that we do not use the negated foreign field modulus here.
+
+This requires a single constraint of the form
+
+> 9. $a_n \cdot b_n - q_n \cdot f_n = r_n$
+
+with all of the terms expanded into the limbs according to the above equations. The values $a_n, b_n, q_n, f_n$ and $r_n$ do not need to be in the witness.
We know that $q$ must be less than $f$, so that is our first check, leaving us with
+
+$$
+\begin{aligned}
+(f - 1) \cdot f + r &< f^2 \\
+r &< f^2 - (f - 1) \cdot f = f
+\end{aligned}
+$$
+
+Therefore, to check $q \cdot f + r < 2^t \cdot n$, we need to check
+* $q < f$
+* $r < f$
+
+This should come as no surprise, since that is how we parameterized $2^t \cdot n$ earlier on. Note that by checking $q < f$ we assure correctness, while checking $r < f$ assures our solution is unique (sometimes referred to as canonical).
+
+Next, we must perform the same checks for the left hand side (i.e., $a \cdot b < 2^t \cdot n$). Since $a$ and $b$ must be less than the foreign field modulus $f$, this means checking
+* $a < f$
+* $b < f$
+
+So we have
+
+$$
+\begin{aligned}
+a \cdot b &\le (f - 1) \cdot (f - 1) = f^2 - 2f + 1 \\
+\end{aligned}
+$$
+
+Since $2^t \cdot n \ge f^2$ we have
+
+$$
+\begin{aligned}
+&f^2 - 2f + 1 < f^2 \le 2^t \cdot n \\
+&\implies
+a \cdot b < 2^t \cdot n
+\end{aligned}
+$$
+
+### Bound checks
+
+To perform the above range checks we use the *upper bound check* method described in the [Foreign Field Addition RFC](https://github.com/o1-labs/proof-systems/blob/master/book/src/rfcs/ffadd.md#upper-bound-check).
+
+The upper bound check is as follows. We must constrain $0 \le q < f$ over the positive integers, so
+
+$$
+\begin{aligned}
+2^t \le q &+ 2^t < f + 2^t \\
+2^t - f \le q &+ 2^t - f < 2^t \\
+\end{aligned}
+$$
+
+Remember $f' = 2^t - f$ is our negated foreign field modulus. Thus, we have
+
+$$
+\begin{aligned}
+f' \le q &+ f' < 2^t \\
+\end{aligned}
+$$
+
+So to check $q < f$ we just need to compute $q' = q + f'$ and check $f' \le q' < 2^t$
+
+Observe that
+
+$$
+0 \le q \implies f' \le q'
+$$
+
+and that
+
+$$
+q' < 2^t \implies q < f
+$$
+
+So we only need to check
+
+- $0 \le q$
+- $q' < 2^t$
+
+The first check is always true since we are operating over the positive integers and $q \in \mathbb{Z^+}$.
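As a quick numeric illustration of this trick, the sketch below (Python, with assumed example parameters: $\ell = 88$, $t = 264$ and the Ed25519 prime standing in for $f$) shows that $q' = q + f'$ lands in $[f', 2^t)$ exactly when $0 \le q < f$:

```python
# Illustrative upper bound check (assumed example parameters).
t = 264
f = 2**255 - 19            # example foreign field modulus (an assumption)
fp = 2**t - f              # negated foreign field modulus f'

def bound_check(q):
    """Return True iff q' = q + f' lands in [f', 2^t), i.e. iff 0 <= q < f."""
    qp = q + fp
    return fp <= qp < 2**t

assert bound_check(0)          # smallest valid quotient
assert bound_check(f - 1)      # largest valid quotient
assert not bound_check(f)      # q = f overflows: q' = 2^t fails q' < 2^t
```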
Observe that the second check also constrains that $q < 2^t$, since $f \le 2^{259} < 2^t$ and thus
+
+$$
+\begin{aligned}
+q' &< 2^t \\
+q + f' &< 2^t \\
+q &< 2^t - (2^t - f) = f \\
+q &< 2^t
+\end{aligned}
+$$
+
+Therefore, to constrain $q < f$ we only need constraints for
+
+- $q' = q + f'$
+- $q' < 2^t$
+
+and we don't require an additional multi-range-check for $q < 2^t$.
+
+### Cost of bound checks
+
+This section analyzes the structure and costs of bounds checks for foreign field addition and foreign field multiplication.
+
+#### Addition
+
+In our foreign field addition design the operands $a$ and $b$ do not need to be less than $f$. The field overflow bit $\mathcal{o}$ for foreign field addition is at most 1. That is, $a + b = \mathcal{o} \cdot f + r$, where $r$ is allowed to be greater than $f$. Therefore,
+
+$$
+(f + a) + (f + b) = 1 \cdot f + (f + a + b)
+$$
+
+These can be chained $k$ times as desired. The final result is
+
+$$
+r = (\underbrace{f + \cdots + f}_{k} + a_1 + b_1 + \cdots + a_k + b_k)
+$$
+
+Since the bit length of $r$ increases logarithmically with the number of additions, in Kimchi we must only check that the final $r$ in the chain is less than $f$ to constrain the entire chain.
+
+> **Security note:** In order to defer the $r < f$ check to the end of any chain of additions, it is extremely important to consider the potential impact of wraparound in $\mathbb{F_n}$. That is, we need to consider whether the addition of a large chain of elements greater than the foreign field modulus could wrap around. If this could happen then the $r < f$ check could fail to detect an invalid witness. Below we will show that this is not possible in Kimchi.
+>
+> Recall that our foreign field elements are comprised of 3 limbs of 88-bits each that are each represented as native field elements in our proof system. In order to wrap around and circumvent the $r < f$ check, the highest limb would need to wrap around.
This means that an attacker would need to perform about $k \approx n/2^{\ell}$ additions of elements greater than the foreign field modulus. Since Kimchi's native moduli (Pallas and Vesta) are 255-bits, the attacker would need to provide a witness for about $k \approx 2^{167}$ additions. This length of witness is greater than Kimchi's maximum circuit (resp. witness) length. Thus, it is not possible for the attacker to generate a false proof by causing wraparound with a large chain of additions.
+
+In summary, for foreign field addition in Kimchi it is sufficient to only bound check the last result $r'$ in a chain of additions (and subtractions)
+
+- Compute bound $r' = r + f'$ with addition gate (2 rows)
+- Range check $r' < 2^t$ (4 rows)
+
+#### Multiplication
+
+In foreign field multiplication, the situation is unfortunately different, and we must check that each of $a, b, q$ and $r$ are less than $f$. We cannot adopt the strategy from foreign field addition where the operands are allowed to be greater than $f$ because the bit length of $r$ would increase linearly with the number of multiplications. That is,
+
+$$
+(a_1 + f) \cdot (a_2 + f) = 1 \cdot f + \underbrace{f^2 + (a_1 + a_2 - 1) \cdot f + a_1 \cdot a_2}_{r}
+$$
+
+and after a chain of $k$ multiplications we have
+
+$$
+r = f^k + \ldots + a_1 \cdots a_k
+$$
+
+where $r > f^k$ quickly overflows our CRT modulus $2^t \cdot n$. For example, assuming our maximum foreign modulus of $f = 2^{259}$ and either of Kimchi's native moduli (i.e. Pallas or Vesta), $f^k > 2^t \cdot n$ for $k > 2$. That is, an overflow is possible for a chain of greater than 1 foreign field multiplication. Thus, we must check $a, b, q$ and $r$ are less than $f$ for each multiplication.
+
+Fortunately, in many situations the input operands may already be checked either as inputs or as outputs of previous operations, so these checks may not be required for each multiplication operation.
+
+Thus, the $q'$ and $r'$ checks (or equivalently $q$ and $r$) are our main focus because they must be done for every multiplication.
+
+- Compute bound $q' = q + f'$ with addition gate (2 rows)
+- Compute bound $r' = r + f'$ with addition gate (2 rows)
+- Range check $q' < 2^t$ (4 rows)
+- Range check $r' < 2^t$ (4 rows)
+
+This costs 12 rows per multiplication. In a subsequent section, we will reduce it to 8 rows.
+
+### 2-limb decomposition
+
+Due to the limited number of permutable cells per gate, we do not have enough cells for copy constraining $q'$ and $r'$ (or $q$ and $r$) to their respective range check gadgets. To address this issue, we must decompose $q'$ into 2 limbs instead of 3, like so
+
+$$
+q' = q'_{01} + 2^{2\ell} \cdot q'_2
+$$
+
+and
+
+$$
+q'_{01} = q'_0 + 2^{\ell} \cdot q'_1
+$$
+
+Thus, $q'$ is decomposed into two limbs $q'_{01}$ (at most $2\ell$ bits) and $q'_2$ (at most $\ell$ bits).
+
+Note that $q'$ must be range checked by a `multi-range-check` gadget. To do this the `multi-range-check` gadget must
+
+- Store a copy of the limbs $q'_0, q'_1$ and $q'_2$ in its witness
+- Range check that they are $\ell$ bits each
+- Constrain that $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$ (this is done with a special mode of the `multi-range-check` gadget)
+- Have copy constraints for $q'_{01}$ and $q'_2$ as outputs of the `ForeignFieldMul` gate and inputs to the `multi-range-check` gadget
+
+Note that the foreign field multiplication gate computes $q'$ from $q$, which is already in the witness, and that $q'_{01}$ and $q'_2$ have copy constraints to a `multi-range-check` gadget that fully constrains their decomposition from $q'$. Therefore, the `ForeignFieldMul` gate does not need to store an additional copy of $q'_0$ and $q'_1$.
+
+### An optimization
+
+Since the $q < f$ and $r < f$ checks must be done for each multiplication it makes sense to integrate them into the foreign field multiplication gate. By doing this we can save 4 rows per multiplication.
+
+Doing this doesn't require adding a lot more witness data because the operands for the bound computations $q' = q + f'$ and $r' = r + f'$ are already present in the witness of the multiplication gate. We only need to store the bounds $q'$ and $r'$ in permutable witness cells so that they may be copied to multi-range-check gates to check they are each less than $2^t$.
+
+To constrain $x + f' = x'$, the equation we use is
+
+$$
+x + 2^t = \mathcal{o} \cdot f + x',
+$$
+
+where $x$ is the original value, $\mathcal{o}=1$ is the field overflow bit and $x'$ is the remainder and our desired addition result (i.e. the bound). Rearranging things we get
+
+$$
+x + 2^t - f = x',
+$$
+
+which is just
+
+$$
+x + f' = x'.
+$$
+
+Recall from the section "Avoiding borrows" that $f'$ is often larger than $f$. At first this seems like it could be a problem, because in multiplication each operand must be less than $f$. However, that bound was only required because the size of a multiplication result is quadratic in the size of $f$ (we use the CRT, which requires the bound $a \cdot b < 2^t \cdot n$). For addition the result is much smaller, and we require neither the CRT nor the assumption that the operands are smaller than $f$. Thus, we have plenty of space in $\ell$-bit limbs to perform our addition.
+
+So, the equation we need to constrain is
+
+$$
+x + f' = x'.
+$$ + +We can expand the left hand side into the 2 limb format in order to obtain 2 intermediate sums + +$$ +\begin{aligned} +s_{01} = x_{01} + f_{01}' \\ +s_2 = x_2 + f'_2 \\ +\end{aligned} +$$ + +where $x_{01}$ and $f'_{01}$ are defined like this + +$$ +\begin{aligned} +x_{01} = x_0 + 2^{\ell} \cdot x_1 \\ +f'_{01} = f'_0 + 2^{\ell} \cdot f'_1 \\ +\end{aligned} +$$ + +and $x$ and $f'$ are defined like this + +$$ +\begin{aligned} +x = x_{01} + 2^{2\ell} \cdot x_2 \\ +f' = f'_{01} + 2^{2\ell} \cdot f'_2 \\ +\end{aligned} +$$ + +Going back to our intermediate sums, the maximum bit length of sum $s_{01}$ is computed from the maximum bit lengths of $x_{01}$ and $f'_{01}$ + +$$ +\underbrace{(2^{\ell} - 1) + 2^{\ell} \cdot (2^{\ell} - 1)}_{x_{01}} + \underbrace{(2^{\ell} - 1) + 2^{\ell} \cdot (2^{\ell} - 1)}_{f'_{01}} = 2^{2\ell+ 1} - 2, +$$ + +which means $s_{01}$ is at most $2\ell + 1$ bits long. + +Similarly, since $x_2$ and $f'_2$ are less than $2^{\ell}$, the max value of $s_2$ is + +$$ +(2^{\ell} - 1) + (2^{\ell} - 1) = 2^{\ell + 1} - 2, +$$ + +which means $s_2$ is at most $\ell + 1$ bits long. + +Thus, we must constrain + +$$ +s_{01} + 2^{2\ell} \cdot s_2 - x'_{01} - 2^{2\ell} \cdot x'_2 = 0 \mod 2^t. +$$ + +The accumulation of this into parts looks like this. + +```text +0 L 2L 3L=t 4L +|-------------|-------------|-------------|-------------|-------------| + : +|------------s01------------:-| 2L + 1 + : ↖w01 + |------s2-----:-| L + 1 + : ↖w2 + : +|------------x'01-----------| + : + |------x'2----| + : +\____________________________/ + ≈ z01 \_____________/ + ≈ z2 +``` + +The two parts are computed with + +$$ +\begin{aligned} +z_{01} &= s_{01} - x'_{01} \\ +z_2 &= s_2 - x'_2. 
+\end{aligned}
+$$
+
+Therefore, there are two carry bits $w_{01}$ and $w_2$ such that
+
+$$
+\begin{aligned}
+z_{01} &= 2^{2\ell} \cdot w_{01} \\
+z_2 + w_{01} &= 2^{\ell} \cdot w_2
+\end{aligned}
+$$
+
+In this scheme $x'_{01}, x'_2, w_{01}$ and $w_2$ are witness data, whereas $s_{01}$ and $s_2$ are formed from a constrained computation of witness data $x_{01}, x_2$ and constraint system public parameter $f'$. Note that due to carrying, witness $x'_{01}$ and $x'_2$ can be different than the values $s_{01}$ and $s_2$ computed from the limbs.
+
+Thus, each bound addition $x + f'$ requires the following witness data
+
+- $x_{01}, x_2$
+- $x'_{01}, x'_2$
+- $w_{01}, w_2$
+
+where $f'$ is baked into the gate coefficients. The following constraints are needed
+
+- $2^{2\ell} \cdot w_{01} = s_{01} - x'_{01}$
+- $2^{\ell} \cdot w_2 = s_2 + w_{01} - x'_2$
+- $x'_{01} \in [0, 2^{2\ell})$
+- $x'_2 \in [0, 2^{\ell})$
+- $w_{01} \in [0, 2)$
+- $w_2 \in [0, 2)$
+
+Due to the limited number of copyable witness cells per gate, we are currently only performing this optimization for $q$.
+
+The witness data is
+
+- $q_0, q_1, q_2$
+- $q'_{01}, q'_2$
+- $q'_{carry01}, q'_{carry2}$
+
+The checks are
+
+1. $q_0 \in [0, 2^{\ell})$
+2. $q_1 \in [0, 2^{\ell})$
+3. $q'_0 = q_0 + f'_0$
+4. $q'_1 = q_1 + f'_1$
+5. $s_{01} = q'_0 + 2^{\ell} \cdot q'_1$
+6. $q'_{01} \in [0, 2^{2\ell})$
+7. $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$
+8. $q'_{carry01} \in [0, 2)$
+9. $2^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01}$
+10. $q_2 \in [0, 2^{\ell})$
+11. $s_2 = q_2 + f'_2$
+12. $q'_{carry2} \in [0, 2)$
+13. $2^{\ell} \cdot q'_{carry2} = s_2 + q'_{carry01} - q'_2$
+
+
+Checks (1) - (5) assure that $s_{01}$ is at most $2\ell + 1$ bits, whereas checks (10) - (11) assure that $s_2$ is at most $\ell + 1$ bits. Altogether they comprise a single `multi-range-check` of $q_0, q_1$ and $q_2$.
However, as noted above, we do not have enough copyable cells to output $q'_0, q'_1$ and $q'_2$ to the `multi-range-check` gadget. Therefore, we adopt a strategy where the 2 limbs $q'_{01}$ and $q'_2$ are output to the `multi-range-check` gadget, where the decomposition $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$ is constrained and then $q'_0, q'_1$ and $q'_2$ are range checked.
+
+Although $q_0, q_1$ and $q_2$ are not range checked directly, this is safe because, as shown in the "Bound checks" section, range-checking that $q' \in [0, 2^t)$ also constrains that $q \in [0, 2^t)$. Therefore, the updated checks are
+
+1. $q_0 \in [0, 2^{\ell})$ `multi-range-check`
+2. $q_1 \in [0, 2^{\ell})$ `multi-range-check`
+3. $q'_0 = q_0 + f'_0$ `ForeignFieldMul`
+4. $q'_1 = q_1 + f'_1$ `ForeignFieldMul`
+5. $s_{01} = q'_0 + 2^{\ell} \cdot q'_1$ `ForeignFieldMul`
+6. $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$ `multi-range-check`
+7. $q'_{carry01} \in [0, 2)$ `ForeignFieldMul`
+8. $2^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01}$ `ForeignFieldMul`
+9. $q_2 \in [0, 2^{\ell})$ `multi-range-check`
+10. $s_2 = q_2 + f'_2$ `ForeignFieldMul`
+11. $q'_{carry2} \in [0, 2)$ `ForeignFieldMul`
+12. $2^{\ell} \cdot q'_{carry2} = s_2 + q'_{carry01} - q'_2$ `ForeignFieldMul`
+
+Note that we don't need to range-check $q'_{01}$ is at most $2\ell + 1$ bits because it is already implicitly constrained by the `multi-range-check` gadget constraining that $q'_0, q'_1$ and $q'_2$ are each at most $\ell$ bits and that $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$. Furthermore, since constraining the decomposition is already part of the `multi-range-check` gadget, we do not need to do it here also.
+
+To simplify things further, we can combine some of these checks.
Recall our checked computations for the intermediate sums
+
+$$
+\begin{aligned}
+s_{01} &= q_{01} + f'_{01} \\
+s_2 &= q_2 + f'_2 \\
+\end{aligned}
+$$
+
+where $q_{01} = q_0 + 2^{\ell} \cdot q_1$ and $f'_{01} = f'_0 + 2^{\ell} \cdot f'_1$. These do not need to be separate constraints, but are instead part of existing ones.
+
+Checks (10) and (12) can be combined into a single constraint $2^{\ell} \cdot q'_{carry2} = (q_2 + f'_2) + q'_{carry01} - q'_2$. Similarly, checks (3) - (5) and (8) can be combined into $2^{2\ell} \cdot q'_{carry01} = q_{01} + f'_{01} - q'_{01}$ with $q_{01}$ and $f'_{01}$ further expanded. The reduced constraints are
+
+1. $q_0 \in [0, 2^{\ell})$ `multi-range-check`
+2. $q_1 \in [0, 2^{\ell})$ `multi-range-check`
+3. $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$ `multi-range-check`
+4. $q'_{carry01} \in [0, 2)$ `ForeignFieldMul`
+5. $2^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01}$ `ForeignFieldMul`
+6. $q_2 \in [0, 2^{\ell})$ `multi-range-check`
+7. $q'_{carry2} \in [0, 2)$ `ForeignFieldMul`
+8. $2^{\ell} \cdot q'_{carry2} = s_2 + q'_{carry01} - q'_2$ `ForeignFieldMul`
+
+Finally, there is one more optimization that we will exploit. This optimization relies on the observation that for bound addition the second carry bit $q'_{carry2}$ is always zero. This may be obscure, so we will prove it by contradiction. To simplify our work we rename some variables by letting $x_0 = q_{01}$ and $x_1 = q_2$. Thus, $q'_{carry2}$ being non-zero corresponds to a carry in $x_1 + f'_1$.
+
+> **Proof:** To get a carry in the highest limbs $x_1 + f'_1$ during bound addition, we need
+>
+> $$
+> 2^{\ell} < x_1 + \phi_0 + f'_1 \le 2^{\ell} - 1 + \phi_0 + f'_1
+> $$
+>
+> where $2^{\ell} - 1$ is the maximum possible size of $x_1$ (before it overflows) and $\phi_0$ is the overflow bit from the addition of the least significant limbs $x_0$ and $f'_0$.
This means
+>
+> $$
+> 2^{\ell} - \phi_0 - f'_1 < x_1 < 2^{\ell}
+> $$
+>
+> We cannot allow $x$ to overflow the foreign field, so we also have
+>
+> $$
+> x_1 < (f - x_0)/2^{2\ell}
+> $$
+>
+> Thus,
+>
+> $$
+> 2^{\ell} - \phi_0 - f'_1 < (f - x_0)/2^{2\ell} = f/2^{2\ell} - x_0/2^{2\ell}
+> $$
+>
+> Since $x_0/2^{2\ell} = \phi_0$ we have
+>
+> $$
+> 2^{\ell} - \phi_0 - f'_1 < f/2^{2\ell} - \phi_0
+> $$
+>
+> so
+>
+> $$
+> 2^{\ell} - f'_1 < f/2^{2\ell}
+> $$
+>
+> Notice that $f/2^{2\ell} = f_1$. Now we have
+>
+> $$
+> 2^{\ell} - f'_1 < f_1 \\
+> \Longleftrightarrow \\
+> f'_1 > 2^{\ell} - f_1
+> $$
+>
+> However, this is a contradiction with the definition of our negated foreign field modulus limb $f'_1 = 2^{\ell} - f_1$. $\blacksquare$
+
+We have proven that $q'_{carry2}$ is always zero, which allows us to simplify our constraints. We now have
+
+1. $q_0 \in [0, 2^{\ell})$ `multi-range-check`
+2. $q_1 \in [0, 2^{\ell})$ `multi-range-check`
+3. $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$ `multi-range-check`
+4. $q'_{carry01} \in [0, 2)$ `ForeignFieldMul`
+5. $2^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01}$ `ForeignFieldMul`
+6. $q_2 \in [0, 2^{\ell})$ `multi-range-check`
+7. $q'_2 = s_2 + q'_{carry01}$ `ForeignFieldMul`
+
+In other words, we have eliminated constraint (7) and removed $q'_{carry2}$ from the witness.
+
+Since we already needed to range-check $q$ or $q'$, the total number of new constraints added is 4: 3 added to `ForeignFieldMul` and 1 added to the `multi-range-check` gadget for constraining the decomposition of $q'_{01}$.
+
+This saves 2 rows per multiplication.
+
+## Chaining multiplications
+
+Due to the limited number of cells accessible to gates, we are not able to chain multiplications into multiplications. We can chain foreign field additions into foreign field multiplications, but currently do not support chaining multiplications into additions (though there is a way to do it).
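The 2-limb bound addition above can be reproduced with plain big-integer arithmetic. This Python sketch is illustrative only (it assumes $\ell = 88$ and uses the Ed25519 prime as a stand-in for $f$); it confirms that the second carry never occurs, so $q'_2$ fits in a single limb:

```python
# Illustrative 2-limb bound addition q' = q + f' (assumed example parameters).
ell = 88
t = 3 * ell
f = 2**255 - 19                              # example foreign modulus (an assumption)
fp = 2**t - f                                # negated foreign field modulus f'
fp01, fp2 = fp % 2**(2*ell), fp >> (2*ell)   # limbs f'_01 and f'_2

def bound_limbs(q):
    """Compute q' = q + f' limb-wise and return (q'_01, q'_2)."""
    q01, q2 = q % 2**(2*ell), q >> (2*ell)
    s01 = q01 + fp01                 # intermediate sum of the low 2*ell bits
    s2 = q2 + fp2                    # intermediate sum of the top limb
    carry01 = s01 >> (2*ell)         # carry out of the low 2*ell bits
    qp01 = s01 % 2**(2*ell)
    qp2 = s2 + carry01               # the second carry is provably zero, so
    assert carry01 in (0, 1)         # q'_2 fits in a single ell-bit limb ...
    assert qp2 < 2**ell
    assert qp01 + 2**(2*ell) * qp2 == q + fp   # ... and q' recomposes exactly
    return qp01, qp2

# Works for any valid quotient 0 <= q < f
for q in (0, 1, f // 2, f - 1):
    bound_limbs(q)
```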
+
+## Constraining the computation
+
+Now we collect all of the checks that are required to constrain foreign field multiplication
+
+### 1. Range constrain $a$
+
+If the $a$ operand has not been constrained to $[0, f)$ by any previous foreign field operations, then we constrain it like this
+- Compute bound $a' = a + f'$ with addition gate (2 rows) `ForeignFieldAdd`
+- Range check $a' \in [0, 2^t)$ (4 rows) `multi-range-check`
+
+### 2. Range constrain $b$
+
+If the $b$ operand has not been constrained to $[0, f)$ by any previous foreign field operations, then we constrain it like this
+- Compute bound $b' = b + f'$ with addition gate (2 rows) `ForeignFieldAdd`
+- Range check $b' \in [0, 2^t)$ (4 rows) `multi-range-check`
+
+### 3. Range constrain $q$
+
+The quotient $q$ is constrained to $[0, f)$ for each multiplication as part of the multiplication gate
+- Compute bound $q' = q + f'$ with `ForeignFieldMul` constraints
+- Range check $q' \in [0, 2^t)$ (4 rows) `multi-range-check`
+- Range check $q \ge 0$ `ForeignFieldMul` (implicit by storing `q` in witness)
+
+### 4. Range constrain $r$
+
+The remainder $r$ is constrained to $[0, f)$ for each multiplication using an external addition gate.
+- Compute bound $r' = r + f'$ with addition gate (2 rows) `ForeignFieldAdd`
+- Range check $r' \in [0, 2^t)$ (4 rows) `multi-range-check`
+
+### 5. Compute intermediate products
+
+Compute and constrain the intermediate products $p_0, p_1$ and $p_2$ as:
+
+- $p_0 = a_0 \cdot b_0 + q_0 \cdot f'_0$ `ForeignFieldMul`
+- $p_1 = a_0 \cdot b_1 + a_1 \cdot b_0 + q_0 \cdot f'_1 + q_1 \cdot f'_0$ `ForeignFieldMul`
+- $p_2 = a_0 \cdot b_2 + a_2 \cdot b_0 + a_1 \cdot b_1 + q_0 \cdot f'_2 + q_2 \cdot f'_0 + q_1 \cdot f'_1$ `ForeignFieldMul`
+
+where each of them is an element of about $2\ell$ bits in length.
+
+### 6.
Native modulus checked computations
+
+Compute and constrain the native modulus values, which are used to check the constraints modulo $n$ in order to apply the CRT
+
+- $a_n = 2^{2\ell} \cdot a_2 + 2^{\ell} \cdot a_1 + a_0 \mod n$
+- $b_n = 2^{2\ell} \cdot b_2 + 2^{\ell} \cdot b_1 + b_0 \mod n$
+- $q_n = 2^{2\ell} \cdot q_2 + 2^{\ell} \cdot q_1 + q_0 \mod n$
+- $r_n = 2^{2\ell} \cdot r_2 + 2^{\ell} \cdot r_1 + r_0 \mod n$
+- $f_n = 2^{2\ell} \cdot f_2 + 2^{\ell} \cdot f_1 + f_0 \mod n$
+
+### 7. Decompose middle intermediate product
+
+Check that $p_1 = 2^{\ell} \cdot p_{11} + p_{10}$:
+
+- $p_1 = 2^\ell \cdot p_{11} + p_{10}$ `ForeignFieldMul`
+- Range check $p_{10} \in [0, 2^\ell)$ `multi-range-check`
+- Range check $p_{11} \in [0, 2^{\ell+2})$
+  - $p_{11} = p_{111} \cdot 2^\ell + p_{110}$ `ForeignFieldMul`
+  - Range check $p_{110} \in [0, 2^\ell)$ `multi-range-check`
+  - Range check $p_{111} \in [0, 2^2)$ with a degree-4 constraint `ForeignFieldMul`
+
+### 8. Zero sum for multiplication
+
+Now we have to constrain the zero sum
+
+$$
+(p_0 - r_0) + 2^{\ell}(p_1 - r_1) + 2^{2\ell}(p_2 - r_2) = 0 \mod 2^t
+$$
+
+We constrain the first and the second halves as
+
+- $v_0 \cdot 2^{2\ell} = p_0 + 2^\ell \cdot p_{10} - r_0 - 2^\ell \cdot r_1$ `ForeignFieldMul`
+- $v_1 \cdot 2^{\ell} = (p_{111} \cdot 2^\ell + p_{110}) + p_2 - r_2 + v_0$ `ForeignFieldMul`
+
+And some more range checks
+
+- Check that $v_0 \in [0, 2^2)$ with a degree-4 constraint `ForeignFieldMul`
+- Check that $v_1 \in [0, 2^{\ell + 3})$
+  - Check $v_1 = v_{11} \cdot 2^{\ell} + v_{10}$ `ForeignFieldMul`
+  - Check $v_{11} \in [0, 2^3)$ `ForeignFieldMul`
+  - Check $v_{10} < 2^\ell$ with range constraint `multi-range-check`
+
+To check that $v_{11} \in [0, 2^3)$ (i.e. that $v_{11}$ is at most 3 bits long) we first range-check $v_{11} \in [0, 2^{12})$ with a 12-bit plookup. This means there can be no higher bits set beyond the 12 bits of $v_{11}$.
Next, we scale $v_{11}$ by $2^9$ in order to move the highest $12 - 3 = 9$ bits beyond the $12$th bit. Finally, we perform a 12-bit plookup on the resulting value. That is, we have + +- Check $v_{11} \in [0, 2^{12})$ with a 12-bit plookup (to prevent any overflow) +- Check $\mathsf{scaled}_{v_{11}} = 2^9 \cdot v_{11}$ +- Check $\mathsf{scaled}_{v_{11}}$ is a 12-bit value with a 12-bit plookup + +Kimchi's plookup implementation is extremely flexible and supports optional scaling of the lookup target value as part of the lookup operation. Thus, we do not require two witness elements, two lookup columns, nor the $\mathsf{scaled}_{v_{11}} = 2^9 \cdot v_{11}$ custom constraint. Instead we can just store $v_{11}$ in the witness and define this column as a "joint lookup" comprised of one 12-bit plookup on the original cell value and another 12-bit plookup on the cell value scaled by $2^9$, thus, yielding a 3-bit check. This eliminates one plookup column and reduces the total number of constraints. + +### 9. Native modulus constraint + +Using the checked native modulus computations we constrain that + +$$ +a_n \cdot b_n - q_n \cdot f_n - r_n = 0 \mod n. +$$ + +### 10. Compute intermediate sums + +Compute and constrain the intermediate sums $s_{01}$ and $s_2$ as: + +- $s_{01} = q_{01} + f'_{01}$ +- $s_2 = q_2 + f_2'$ +- $q_{01} = q_0 + 2^{\ell} \cdot q_1$ +- $f'_{01} = f'_0 + 2^{\ell} \cdot f'_1$ + +### 11. Decompose the lower quotient bound + +Check that $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$. + +Done by (3) above with the `multi-range-check` on $q'$ +- $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$ +- Range check $q'_0 \in [0, 2^\ell)$ +- Range check $q'_1 \in [0, 2^\ell)$ + +### 12. 
Zero sum for quotient bound addition
+
+We constrain that
+
+$$
+s_{01} - q'_{01} + 2^{2\ell} \cdot (s_2 - q'_2) = 0 \mod 2^t
+$$
+
+by constraining the two halves
+
+* $2^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01}$
+* $2^{\ell} \cdot q'_{carry2} = s_2 + q'_{carry01} - q'_2$
+
+We also need a couple of small range checks
+
+- Check that $q'_{carry01}$ is boolean `ForeignFieldMul`
+- Check that $q'_{carry2}$ is boolean `ForeignFieldMul`
+
+# Layout
+
+Based on the constraints above, we need the following 12 values copied from the range check gates.
+
+```
+a0, a1, a2, b0, b1, b2, q0, q1, q2, r0, r1, r2
+```
+Since we need 12 copied values for the constraints, they must span 2 rows.
+
+The $q < f$ bound limbs $q'_{01}$ and $q'_2$ must be in copyable cells so they can be range-checked. Similarly, the limbs of the operands $a$, $b$ and the result $r$ must all be in copyable cells. This leaves only 2 remaining copyable cells and, therefore, we cannot compute and output $r' = r + f'$. It must be deferred to an external `ForeignFieldAdd` gate with the $r$ cells copied as an argument.
+
+NB: the $f$ and $f'$ values are publicly visible in the gate coefficients.
+
+|            | Curr                                     | Next             |
+| ---------- | ---------------------------------------- | ---------------- |
+| **Column** | `ForeignFieldMul`                        | `Zero`           |
+| 0          | $a_0$ (copy)                             | $r_0$ (copy)     |
+| 1          | $a_1$ (copy)                             | $r_1$ (copy)     |
+| 2          | $a_2$ (copy)                             | $r_2$ (copy)     |
+| 3          | $b_0$ (copy)                             | $q'_{01}$ (copy) |
+| 4          | $b_1$ (copy)                             | $q'_2$ (copy)    |
+| 5          | $b_2$ (copy)                             | $p_{10}$ (copy)  |
+| 6          | $v_{10}$ (copy)                          | $p_{110}$ (copy) |
+| 7          | $v_{11}$ (scaled plookup)                |                  |
+| 8          | $v_0$                                    |                  |
+| 9          | $q_0$                                    |                  |
+| 10         | $q_1$                                    |                  |
+| 11         | $q_2$                                    |                  |
+| 12         | $q'_{carry01}$                           |                  |
+| 13         | $p_{111}$                                |                  |
+| 14         |                                          |                  |
+
+# Checked computations
+
+As described above, foreign field multiplication has the following intermediate computations
+
+1. $p_0 = a_0b_0 + q_0f'_0$
+2. $p_1 = a_0b_1 + a_1b_0 + q_0f'_1 + q_1f'_0$
+3. 
$p_2 = a_0b_2 + a_2b_0 + a_1b_1 + q_0f'_2 + q_2f'_0 + q_1f'_1$.
+
+For the $q$ bound addition we must also compute
+
+1. $s_{01} = q_{01} + f'_{01}$
+2. $s_2 = q_2 + f_2'$
+3. $q_{01} = q_0 + 2^{\ell} \cdot q_1$
+4. $f'_{01} = f'_0 + 2^{\ell} \cdot f'_1$
+
+> Note the equivalence
+>
+> $$
+> \begin{aligned}
+> s_{01} &= q_{01} + f'_{01} \\
+> &= q_0 + 2^{\ell} \cdot q_1 + f'_0 + 2^{\ell} \cdot f'_1 \\
+> &= q_0 + f'_0 + 2^{\ell} \cdot (q_1 + f'_1) \\
+> &= q'_0 + 2^{\ell} \cdot q'_1
+> \end{aligned}
+> $$
+>
+> where $q'_0 = q_0 + f'_0$ and $q'_1 = q_1 + f'_1$ can be done with checked computations.
+
+Next, for applying the CRT we compute
+
+1. $a_n = 2^{2\ell} \cdot a_2 + 2^{\ell} \cdot a_1 + a_0 \mod n$
+2. $b_n = 2^{2\ell} \cdot b_2 + 2^{\ell} \cdot b_1 + b_0 \mod n$
+3. $q_n = 2^{2\ell} \cdot q_2 + 2^{\ell} \cdot q_1 + q_0 \mod n$
+4. $r_n = 2^{2\ell} \cdot r_2 + 2^{\ell} \cdot r_1 + r_0 \mod n$
+5. $f_n = 2^{2\ell} \cdot f_2 + 2^{\ell} \cdot f_1 + f_0 \mod n$
+
+# Checks
+
+In total we require the following checks
+
+1. $p_{111} \in [0, 2^2)$
+2. $v_{10} \in [0, 2^{\ell})$ `multi-range-check`
+3. $p_{110} \in [0, 2^{\ell})$ `multi-range-check`
+4. $p_{10} \in [0, 2^{\ell})$ `multi-range-check`
+5. $p_{11} = 2^{\ell} \cdot p_{111} + p_{110}$
+6. $p_1 = 2^{\ell} \cdot p_{11} + p_{10}$
+7. $v_0 \in [0, 2^2)$
+8. $2^{2\ell} \cdot v_0 = p_0 + 2^{\ell} \cdot p_{10} - r_0 - 2^{\ell} \cdot r_1$
+9. $v_{11} \in [0, 2^{3})$
+10. $v_1 = 2^{\ell} \cdot v_{11} + v_{10}$
+11. $2^{\ell} \cdot v_1 = v_0 + p_{11} + p_2 - r_2$
+12. $a_n \cdot b_n - q_n \cdot f_n = r_n$
+13. $q'_0 \in [0, 2^{\ell})$ `multi-range-check`
+14. $q'_1 \in [0, 2^{\ell})$ `multi-range-check`
+15. $q'_2 \in [0, 2^{\ell})$ `multi-range-check`
+16. $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$ `multi-range-check`
+17. $q'_{carry01} \in [0, 2)$
+18. $2^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01}$
+19. 
$q'_2 = s_2 + q'_{carry01}$
+
+# Constraints
+
+These checks can be condensed into the minimal number of constraints as follows.
+
+First, we have the range-check corresponding to (1) as a degree-4 constraint
+
+**C1:** $p_{111} \cdot (p_{111} - 1) \cdot (p_{111} - 2) \cdot (p_{111} - 3)$
+
+Checks (2) - (4) are all handled by a single `multi-range-check` gadget.
+
+**C2:** `multi-range-check` $v_{10}, p_{10}, p_{110}$
+
+Next (5) and (6) can be combined into
+
+**C3:** $2^{\ell} \cdot (2^{\ell} \cdot p_{111} + p_{110}) + p_{10} = p_1$
+
+Now we have the range-check corresponding to (7) as another degree-4 constraint
+
+**C4:** $v_0 \cdot (v_0 - 1) \cdot (v_0 - 2) \cdot (v_0 - 3)$
+
+Next we have check (8)
+
+**C5:** $2^{2\ell} \cdot v_0 = p_0 + 2^{\ell} \cdot p_{10} - r_0 - 2^{\ell} \cdot r_1$
+
+Up next, check (9) is a 3-bit range check
+
+**C6:** Plookup $v_{11}$ (12-bit plookup scaled by $2^9$)
+
+Now checks (10) and (11) can be combined into
+
+**C7:** $2^{\ell} \cdot (v_{11} \cdot 2^{\ell} + v_{10}) = p_2 + p_{11} + v_0 - r_2$
+
+Next, for our use of the CRT, we must constrain that $a \cdot b = q \cdot f + r \mod n$. Thus, check (12) is
+
+**C8:** $a_n \cdot b_n - q_n \cdot f_n = r_n$
+
+Next we must constrain the quotient bound addition.
+
+Checks (13) - (16) are all combined into a `multi-range-check` gadget
+
+**C9:** `multi-range-check` $q'_0, q'_1, q'_2$ and $q'_{01} = q'_0 + 2^{\ell} \cdot q'_1$.
+
+Check (17) is a carry bit boolean check
+
+**C10:** $q'_{carry01} \cdot (q'_{carry01} - 1)$
+
+Next, check (18) is
+
+**C11:** $2^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01}$
+
+Finally, check (19) is
+
+**C12:** $q'_2 = s_2 + q'_{carry01}$
+
+The `Zero` gate has no constraints and is just used to hold values required by the `ForeignFieldMul` gate. 
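As a sanity check, the CRT identity behind check (12) / constraint **C8** can be exercised with concrete numbers. The following Python sketch is only an illustration: it assumes a limb size of $\ell = 88$ and uses stand-in moduli (secp256k1's prime as the foreign modulus $f$, and a generic 255-bit prime as the native modulus $n$; the actual native modulus is a Pasta curve prime).

```python
import random

ell = 88                  # assumed limb bit-length
t = 3 * ell               # binary modulus 2^t, t = 264
f = 2**256 - 2**32 - 977  # foreign modulus (secp256k1 base field, for illustration)
n = 2**255 - 19           # stand-in native modulus; the real one is a Pasta prime

def limbs(x):
    """Split x < 2^t into three ell-bit limbs (x0, x1, x2)."""
    mask = 2**ell - 1
    return x & mask, (x >> ell) & mask, x >> (2 * ell)

def recompose_mod_n(x):
    """Recombine the limbs of x modulo the native field, as in a_n, b_n, ..."""
    x0, x1, x2 = limbs(x)
    return (2**(2 * ell) * x2 + 2**ell * x1 + x0) % n

random.seed(1)
a, b = random.randrange(f), random.randrange(f)
q, r = divmod(a * b, f)   # witness: a*b = q*f + r holds over the integers

a_n, b_n, q_n, f_n, r_n = map(recompose_mod_n, (a, b, q, f, r))

# Check (12) / C8: the native-modulus leg of the CRT
assert (a_n * b_n - q_n * f_n - r_n) % n == 0
# The 2^t leg also holds, since the equation holds over the integers
assert (a * b - q * f - r) % 2**t == 0
```

Because $a \cdot b - q \cdot f - r$ vanishes over the integers, it vanishes modulo both $n$ and $2^t$; the gate argues the converse direction using the size bounds established by the range checks.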
+
+# External checks
+
+The following checks must be done with other gates to ensure the soundness of the foreign field multiplication
+
+- Range check input
+  - `multi-range-check` $a$
+  - `multi-range-check` $b$
+- Range check witness data
+  - `multi-range-check` $q'$ and check $q_{01}' = q'_0 + 2^{\ell} q'_1$
+  - `multi-range-check` $\ell$-bit limbs: $p_{10}, p_{110}, p_{111}$
+- Range check output
+  - `multi-range-check` either $r$ or $r'$
+- Compute and constrain $r'$, the bound on $r$
+  - `ForeignFieldAdd` $r + 2^t = 1 \cdot f + r'$
+
+Copy constraints must connect the above witness cells to their respective input cells within the corresponding external check gates' witnesses.
diff --git a/book/src/rfcs/keccak.md b/book/src/kimchi/keccak.md
similarity index 88%
rename from book/src/rfcs/keccak.md
rename to book/src/kimchi/keccak.md
index 457c247b97..9c00ad6d7c 100644
--- a/book/src/rfcs/keccak.md
+++ b/book/src/kimchi/keccak.md
@@ -1,4 +1,4 @@
-# RFC: Keccak
+# Keccak Gate
 
 The Keccak gadget is composed of 3 circuit gates (`Xor16`, `Rot64`, and `Zero`)
 
@@ -45,9 +45,9 @@ Kimchi framework that dictated those approaches.
 
 ## Rot64
 
-It is clear from the definition of the rotation gate that its constraints are complete 
+It is clear from the definition of the rotation gate that its constraints are complete
 (meaning that honest instances always satisfy the constraints). It is left to be proven
-that the proposal is sound. In this section, we will give a proof that as soon as we 
+that the proposal is sound. In this section, we will give a proof that as soon as we
 perform the range checks on the excess and shifted parts of the input, only one possible
 assignment satisfies the constraints. This means that there is no dishonest instance that
 can make the constraints pass. 
We will also give an example where one could find wrong @@ -55,10 +55,10 @@ rotation witnesses that would satisfy the constraints if we did not check the ra ### Necessity of range checks -First of all, we will illustrate the necessity of range-checks with a simple example. -For the sake of readability, we will use some toy field lengths. In particular, let us -assume that our words have 4 bits, meaning all of the elements between `0x0` and `0xF`. -Next, we will be using the native field $\mathbb{F}_{32}$. +First of all, we will illustrate the necessity of range-checks with a simple example. +For the sake of readability, we will use some toy field lengths. In particular, let us +assume that our words have 4 bits, meaning all of the elements between `0x0` and `0xF`. +Next, we will be using the native field $\mathbb{F}_{32}$. As we will later see, this choice of field lengths is not enough to perform any 4-bit rotation, since the operations in the constraints would overflow the native field. @@ -85,52 +85,50 @@ $$ $$ We can easily check that the proposed values of the shifted `0b1010=10` and the excess -`0b1=1` values satisfy the above constraint because $26 = 1 \cdot 16 + 10$ and $11 = 1 + 10$. +`0b1=1` values satisfy the above constraint because $26 = 1 \cdot 16 + 10$ and $11 = 1 + 10$. Now, the question is: _can we find another value for excess and shifted, such that their addition results in an incorrect rotated word?_ The answer to this question is yes, due to __diophantine equations__. We basically want to find $x,y$ such that $26 = x \cdot 16 + y (\text{ mod } 32)$. The solution to this equation is: $$ \begin{align*} -\forall k \in [0..31]: & \\ -x &= k \\ +\forall k \in [0 \ldots 31]: x &= k \ \land \\ y &= 26 - 16 \cdot k \end{align*} $$ We chose these word and field lengths to better understand the behaviour of the solution. Here, we can see two "classes" of evaluations. 
-- If we choose an even $k$, then $y$ will have the following shape: +- If we choose an even $k$, then $y$ will have the following shape: - $$26 - 16 \cdot (2 \cdot n) \iff 26 - 32n \equiv_{32} 26 $$ - Meaning, if $x = 2n$ then $y = 26$. - If on the other hand, we chose an odd $k$, then $y$ will have the following shape instead: - $$26 - 16 \cdot (2 \cdot n + 1) \iff 26 - 32n - 16 \equiv_{32} 26 - 16 = 10$$ - - Meaning, if $x = 2n+1$ then $y = 10$. + - Meaning, if $x = 2n+1$ then $y = 10$. Thus, possible solutions to the diophantine equation are: $$ \begin{align*} -x &= 0, 1, 2, 3, 4, 5... \\ -y &= 26, 10, 26, 10, 26, 10... +x &= 0, 1, 2, 3, 4, 5 \ldots \\ +y &= 26, 10, 26, 10, 26, 10 \ldots \end{align*} $$ -Note that our valid witness is part of that set of solutions, meaning $x=1$ and $y=10$. Of course, we can also find another dishonest instantiation such as $x=0$ and $y=26$. Perhaps one could think that we do not need to worry about this case, because the resulting rotation word would be $0+26=26$, and if we later use that result as an input to a subsequent gate such as XOR, the value $26$ would not fit and at some point the constraint system would complain. Nonetheless, we still have other solutions to worry about, such as $(x=3, y=10)$ or $(x=5, y=10)$, since they would result in a rotated word that would fit in the word length of 4 bits, yet would be incorrect (not equal to $11$). +Note that our valid witness is part of that set of solutions, meaning $x=1$ and $y=10$. Of course, we can also find another dishonest instantiation such as $x=0$ and $y=26$. Perhaps one could think that we do not need to worry about this case, because the resulting rotation word would be $0+26=26$, and if we later use that result as an input to a subsequent gate such as XOR, the value $26$ would not fit and at some point the constraint system would complain. 
Nonetheless, we still have other solutions to worry about, such as $(x=3, y=10)$ or $(x=5, y=10)$, since they would result in a rotated word that would fit in the word length of 4 bits, yet would be incorrect (not equal to $11$). -All of the above incorrect solutions differ in one thing: they have different bit lengths. This means that we need to range check the values for the excess and shifted witnesses to make sure they have the correct length. +All of the above incorrect solutions differ in one thing: they have different bit lengths. This means that we need to range check the values for the excess and shifted witnesses to make sure they have the correct length. ### Sufficiency of range checks -In the following, we will give a proof that performing range checks for these values is not only necessary but also sufficient to prove that the rotation gate is sound. In other words, we will prove there are no two possible solutions of the decomposition constraint that have the correct bit lengths. Now, for the sake of robustness, we will consider 64-bit words and fields with at least twice the bit length of the words (as is our case). +In the following, we will give a proof that performing range checks for these values is not only necessary but also sufficient to prove that the rotation gate is sound. In other words, we will prove there are no two possible solutions of the decomposition constraint that have the correct bit lengths. Now, for the sake of robustness, we will consider 64-bit words and fields with at least twice the bit length of the words (as is our case). We will proceed by __contradiction__. 
Suppose there are two different solutions to the following diophantic equation: $$ \begin{align*} -\forall k \in \mathbb{F}_n: & \\ -x &= k \\ +\forall k \in \mathbb{F}_n: x &= k \ \land \\ y &= w \cdot 2^r - 2^{64} \cdot k \end{align*} $$ @@ -139,7 +137,7 @@ where $k$ is a parameter to instantiate the solutions, $w$ is the word to be rot Then, that means that there are two different solutions, $(0\leq x=a<2^r, 0\leq y=b<2^{64})$ and $(0\leq x=a'<2^r, 0\leq y=b'<2^{64})$ with at least $a \neq a'$ or $b \neq b'$. We will show that this is impossible. -If both are solutions to the same equation, then: +If both are solutions to the same equation, then: $$ \begin{align*} w \cdot 2^r &= a \cdot 2^{64} + b \\ @@ -149,18 +147,22 @@ $$ means that $a \cdot 2^{64} + b = a'\cdot 2^{64} + b'$. Moving terms to the left side, we have an equivalent equality: $2^{64}(a-a') + (b-b')=0 \mod{n}$. There are three cases to consider: - $a = a'$ and $b \neq b'$: then $(b - b') \equiv_n 0$ and this can only happen if $b' = b + kn$. But since $n > 2^{64}$, then $b'$ cannot be smaller than $2^{64}$ as it was assumed. CONTRADICTION. - + - $b = b'$ and $a \neq a'$: then $2^{64}(a - a') \equiv_n 0$ and this can only happen if $a' = a + kn$. But since $n > 2^r$, then $a'$ cannot be smaller than $2^r$ as it was assumed. CONTRADICTION. -- $a\neq a'$ and $b \neq b'$: then we have something like $2^{64} \alpha + \beta \equiv_n 0$. +- $a\neq a'$ and $b \neq b'$: then we have something like $2^{64} \alpha + \beta \equiv_n 0$. - This means $\beta \equiv_n -2^{64} \alpha = k \cdot n - 2^{64} \alpha$ for any $k$. - - According to the assumption, both $0\leq a<2^r$ and $0\leq a'<2^r$. This means, the difference $\alpha:=(a - a')$ lies anywhere in between the following interval: - - $$1 - 2^r \leq \alpha \leq 2^r - 1$$ + - According to the assumption, both $0\leq a<2^r$ and $0\leq a'<2^r$. 
This means, the difference $\alpha:=(a - a')$ lies anywhere in between the following interval: + $$1 - 2^r \leq \alpha \leq 2^r - 1$$ - We plug in this interval to the above equation to obtain the following interval for $\beta$: - - $$k\cdot n - 2^{64}(1-2^r)\leq \beta \leq k\cdot n - 2^{64}(2^r - 1) $$ + $$k\cdot n - 2^{64}(1-2^r)\leq \beta \leq k\cdot n - 2^{64}(2^r - 1) $$ - We look at this interval from both sides of the inequality: $\beta \geq kn - 2^{64} + 2^{64+r}$ and $\beta \leq kn + 2^{64} - 2^{64+r}$ and we wonder if $kn - 2^{64} + 2^{64+r} \leq kn + 2^{64} - 2^{64+r}$ is at all possible. We rewrite as follows: - - $$2^{64+r} - 2^{64} \leq 2^{64} - 2^{64+r}$$ - - $$2\cdot2^{64+r} \leq 2\cdot2^{64} $$ - - $$2^{64+r} \leq 2^{64} $$ - - But this can only happen if $r\leq 0$, which is impossible since we assume $0 < r < 64$.CONTRADICTION. + $$ + \begin{align*} + 2^{64+r} - 2^{64} &\leq 2^{64} - 2^{64+r}\\ + 2\cdot2^{64+r} &\leq 2\cdot2^{64} \\ + 2^{64+r} &\leq 2^{64} + \end{align*} + $$ + - But this can only happen if $r\leq 0$, which is impossible since we assume $0 < r < 64$. CONTRADICTION. - EOP. diff --git a/book/src/kimchi/lookup.md b/book/src/kimchi/lookup.md index 9989ac5557..a261672716 100644 --- a/book/src/kimchi/lookup.md +++ b/book/src/kimchi/lookup.md @@ -1,3 +1,318 @@ -## Lookup +# Plookup in Kimchi -TO-DO \ No newline at end of file +In 2020, [plookup](https://eprint.iacr.org/2020/315.pdf) showed how to create lookup proofs. Proofs that some witness values are part of a [lookup table](https://en.wikipedia.org/wiki/Lookup_table). Two years later, an independent team published [plonkup](https://eprint.iacr.org/2022/086) showing how to integrate Plookup into Plonk. + +This document specifies how we integrate plookup in kimchi. It assumes that the reader understands the basics behind plookup. 
+
+## Overview
+
+We integrate plookup in kimchi with the following differences:
+
+* we snake-ify the sorted table instead of wrapping it around (see later)
+* we allow fixed-ahead-of-time linear combinations of columns of the queries we make
+* we only use a single table (XOR) at the moment of this writing
+* we allow several lookups (or queries) to be performed within the same row
+* zero-knowledgeness is added in a specific way (see later)
+
+The following sections explain the protocol in more detail.
+
+### Recap on the grand product argument of plookup
+
+As per the plookup paper, the prover will have to compute three vectors:
+
+* $f$, the (secret) **query vector**, containing the witness values that the prover wants to prove are part of the lookup table.
+* $t$, the (public) **lookup table**.
+* $s$, the (secret) concatenation of $f$ and $t$, sorted by $t$ (where elements are listed in the order they are listed in $t$).
+
+Essentially, plookup proves that all the elements in $f$ are indeed in the lookup table $t$ if and only if the following multisets are equal:
+
+* $\{(1+\beta)f, \text{diff}(t)\}$
+* $\text{diff}(\text{sorted}(f, t))$
+
+where $\text{diff}$ is a new set derived by applying a "randomized difference" between every successive pair of a vector. For example:
+
+* $f = \{5, 4, 1, 5\}$
+* $t = \{1, 4, 5\}$
+* $\{\color{blue}{(1+\beta)f}, \color{green}{\text{diff}(t)}\} = \{\color{blue}{(1+\beta)5, (1+\beta)4, (1+\beta)1, (1+\beta)5}, \color{green}{1+\beta 4, 4+\beta 5}\}$
+* $\text{diff}(\text{sorted}(f, t)) = \{1+\beta 1, 1+\beta 4, 4+\beta 4, 4+\beta 5, 5+\beta 5, 5+\beta 5\}$
+
+> Note: This assumes that the lookup table is a single column. You will see in the next section how to address lookup tables with more than one column. 
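The toy example above can be verified mechanically: for a random $\beta$, the two collections must contain the same field elements as multisets. A minimal Python sketch, over an assumed toy prime field:

```python
import random

p = 2**31 - 1  # assumed toy prime field, for illustration only

def diff(v, beta):
    """Randomized difference of successive pairs: v_i + beta * v_{i+1}."""
    return [(v[i] + beta * v[i + 1]) % p for i in range(len(v) - 1)]

f = [5, 4, 1, 5]                # query vector
t = [1, 4, 5]                   # lookup table
# concatenation sorted by t (valid because every element of f is in t)
s = sorted(f + t, key=t.index)  # -> [1, 1, 4, 4, 5, 5, 5]

beta = random.randrange(1, p)
lhs = [(1 + beta) * x % p for x in f] + diff(t, beta)
rhs = diff(s, beta)
assert sorted(lhs) == sorted(rhs)  # equal as multisets
```

Since the two multisets agree as polynomials in $\beta$, the check passes for every choice of $\beta$; a cheating prover would be caught with overwhelming probability at a random $\beta$.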
+
+The equality between the multisets can be proved with the permutation argument of plonk, which would look like enforcing constraints on the following accumulator:
+
+* init: $\mathsf{acc}_0 = 1$
+* final: $\mathsf{acc}_n = 1$
+* for every $0 < i \leq n$:
+  $$
+  \mathsf{acc}_i = \mathsf{acc}_{i-1} \cdot \frac{(\gamma + (1+\beta) f_{i-1}) \cdot (\gamma + t_{i-1} + \beta t_i)}{(\gamma + s_{i-1} + \beta s_{i})(\gamma + s_{n+i-1} + \beta s_{n+i})}
+  $$
+
+Note that the plookup paper uses a slightly different equation to make the proof work. It is possible that the proof would work with the above equation, but for simplicity let's just use the equation published in plookup:
+
+$$
+\mathsf{acc}_i = \mathsf{acc}_{i-1} \cdot \frac{(1+\beta) \cdot (\gamma + f_{i-1}) \cdot (\gamma(1 + \beta) + t_{i-1} + \beta t_i)}{(\gamma(1+\beta) + s_{i-1} + \beta s_{i})(\gamma(1+\beta) + s_{n+i-1} + \beta s_{n+i})}
+$$
+
+> Note: in plookup $s$ is longer than $n$ ($|s| = |f| + |t|$), and thus it needs to be split into multiple vectors to enforce the constraint at every $i \in [[0;n]]$. This leads to the two terms in the denominator as shown above, so that the degree of $\gamma (1 + \beta)$ is equal in the numerator and denominator.
+
+### Lookup tables
+
+Kimchi uses a single **lookup table** at the moment of this writing: the XOR table. The XOR table for values of 1 bit is the following:
+
+
+| l | r | o |
+| --- | --- | --- |
+| 1 | 0 | 1 |
+| 0 | 1 | 1 |
+| 1 | 1 | 0 |
+| 0 | 0 | 0 |
+
+Kimchi, however, uses the XOR table for values of $4$ bits, which has $2^{8}$ entries.
+
+Note: the $(0, 0, 0)$ **entry** is at the very end on purpose (as it will be used as a dummy entry for rows of the witness that don't care about lookups).
+
+### Querying the table
+
+The plookup paper handles a vector of lookups $f$ which we do not have. So the first step is to create such a vector from the witness columns (or registers). 
To do this, we define the following objects:
+
+* a **query** tells us what registers, in what order, and scaled by how much, are part of a query
+* a **query selector** tells us which rows are using the query. It is pretty much the same as a [gate selector]().
+
+Let's go over the first item in this section.
+
+For example, the following **query** tells us that we want to check if $r_0 \oplus r_2 = 2\cdot r_1$
+
+| l | r | o |
+| :---: | :---: | :---: |
+| 1, $r_0$ | 1, $r_2$ | 2, $r_1$ |
+
+The grand product argument for the lookup constraint will look like this at this point:
+
+$$
+\mathsf{acc}_i = \mathsf{acc}_{i-1} \cdot \frac{(1+\beta) \cdot {\color{green}(\gamma + w_0(g^i) + j \cdot w_2(g^i) + j^2 \cdot 2 \cdot w_1(g^i))} \cdot (\gamma(1 + \beta) + t_{i-1} + \beta t_i)}{(\gamma(1+\beta) + s_{i-1} + \beta s_{i})(\gamma(1+\beta) + s_{n+i-1} + \beta s_{n+i})}
+$$
+
+Not all rows need to perform queries into a lookup table. We will use a query selector in the next section to make the constraints work with this in mind.
+
+### Query selector
+
+The associated **query selector** tells us on which rows the query into the XOR lookup table occurs.
+
+| row | query selector |
+| :---: | :------------: |
+| 0 | 1 |
+| 1 | 0 |
+
+
+Both the (XOR) lookup table and the query are built-ins in kimchi. The query selector is derived from the circuit at setup time. Currently only the ChaCha gates make use of the lookups.
+
+With the selectors, the grand product argument for the lookup constraint has the following form:
+
+$$
+\mathsf{acc}_i = \mathsf{acc}_{i-1} \cdot \frac{(1+\beta) \cdot {\color{green}\mathsf{query}} \cdot (\gamma(1 + \beta) + t_{i-1} + \beta t_i)}{(\gamma(1+\beta) + s_{i-1} + \beta s_{i})}
+$$
+
+where $\color{green}{\mathsf{query}}$ is constructed so that a dummy query ($0 \oplus 0 = 0$) is used on rows that don't have a query. 
+
+$$
+\begin{align}
+\mathsf{query} := &\ \mathsf{selector} \cdot (\gamma + w_0(g^i) + j \cdot w_2(g^i) + j^2 \cdot 2 \cdot w_1(g^i)) + \\
+&\ (1- \mathsf{selector}) \cdot (\gamma + 0 + j \cdot 0 + j^2 \cdot 0)
+\end{align}
+$$
+
+### Supporting multiple queries
+
+Since we would like to allow multiple table lookups per row, we define multiple **queries**, where each query is associated with a **lookup selector**.
+
+At the moment of this writing, the `ChaCha` gates all perform $4$ queries in a row. Thus, $4$ is trivially the largest number of queries that happen in a row.
+
+**Important**: to make constraints work, this means that each row must make $4$ queries. Potentially some or all of them are dummy queries.
+
+For example, the `ChaCha0`, `ChaCha1`, and `ChaCha2` gates will jointly apply the following 4 XOR queries on the current and following rows:
+
+| l | r | o | - | l | r | o | - | l | r | o | - | l | r | o |
+| :---: | :---: | :----: | --- | :---: | :---: | :----: | --- | :---: | :---: | :----: | --- | :---: | :----: | :----: |
+| 1, $r_3$ | 1, $r_7$ | 1, $r_{11}$ | - | 1, $r_4$ | 1, $r_8$ | 1, $r_{12}$ | - | 1, $r_5$ | 1, $r_9$ | 1, $r_{13}$ | - | 1, $r_6$ | 1, $r_{10}$ | 1, $r_{14}$ |
+
+which you can understand as checking, for the current and following row, that
+
+$$
+\begin{align}
+r_3 \oplus r_7 &= r_{11}\\
+r_4 \oplus r_8 &= r_{12}\\
+r_5 \oplus r_9 &= r_{13}\\
+r_6 \oplus r_{10} &= r_{14}
+\end{align}
+$$
+
+The `ChaChaFinal` gate also performs $4$ (somewhat similar) queries into the XOR lookup table. In total this is $8$ different queries that could be associated with $8$ selector polynomials.
+
+### Grouping queries by query patterns
+
+Associating each query with a selector polynomial is not necessarily efficient. 
To summarize:
+
+* the `ChaCha0`, `ChaCha1`, and `ChaCha2` gates in total make $4$ queries into the XOR table
+* the `ChaChaFinal` gate makes another $4$ different queries into the XOR table
+
+Using the previous section's method, we'd have to use $8$ different lookup selector polynomials, one for each of the $8$ different queries. Since there are only $2$ use-cases, we can simply group them by **query patterns** to reduce the number of lookup selector polynomials to $2$.
+
+The grand product argument for the lookup constraint looks like this now:
+
+$$
+\mathsf{acc}_i = \mathsf{acc}_{i-1} \cdot \frac{{\color{green}(1+\beta)^4 \cdot \mathsf{query}} \cdot (\gamma(1 + \beta) + t_{i-1} + \beta t_i)}{(\gamma(1+\beta) + s_{i-1} + \beta s_{i})\times \ldots}
+$$
+
+where $\color{green}{\mathsf{query}}$ is constructed as:
+
+$$
+\begin{align}
+\mathsf{query} = &\ \mathsf{selector}_1 \cdot \mathsf{pattern}_1 + \\
+&\ \mathsf{selector}_2 \cdot \mathsf{pattern}_2 + \\
+&\ (1 - \mathsf{selector}_1 - \mathsf{selector}_2) \cdot (\gamma + 0 + j \cdot 0 + j^2 \cdot 0)^4
+\end{align}
+$$
+
+where, for example, the first pattern for the `ChaCha0`, `ChaCha1`, and `ChaCha2` gates looks like this:
+
+$$
+\begin{align}
+\mathsf{pattern}_1 = &\ (\gamma + w_3(g^i) + j \cdot w_7(g^i) + j^2 \cdot w_{11}(g^i)) \cdot \\
+&\ (\gamma + w_4(g^i) + j \cdot w_8(g^i) + j^2 \cdot w_{12}(g^i)) \cdot \\
+&\ (\gamma + w_5(g^i) + j \cdot w_9(g^i) + j^2 \cdot w_{13}(g^i)) \cdot \\
+&\ (\gamma + w_6(g^i) + j \cdot w_{10}(g^i) + j^2 \cdot w_{14}(g^i))
+\end{align}
+$$
+
+Note that there are now $4$ dummy queries, and they only appear when none of the lookup selectors are active.
+If a pattern uses fewer than $4$ queries, it has to be padded with dummy queries as well.
+
+Finally, note that the denominator of the grand product argument is incomplete in the formula above.
+Since the numerator has degree $5$ in $\gamma (1 + \beta)$, the denominator must match too. 
+
+This is achieved by having a longer $s$, and referring to it $5$ times.
+The denominator thus becomes $\prod_{k=1}^{5} (\gamma (1+\beta) + s_{kn+i-1} + \beta s_{kn+i})$.
+
+## Back to the grand product argument
+
+There are two things that we haven't touched on:
+
+* The vector $t$ representing the **combined lookup table** (after its columns have been combined with a joint combiner $j$). The **non-combined lookup table** is fixed at setup time and derived based on the lookup tables used in the circuit (for now only one, the XOR lookup table, can be used in the circuit).
+* The vector $s$ representing the sorted multiset of both the queries and the lookup table. This is created by the prover and sent as a commitment to the verifier.
+
+The first vector $t$ is quite straightforward to think about:
+
+* if it is smaller than the domain (of size $n$), then we can repeat the last entry enough times to make the table of size $n$.
+* if it is larger than the domain, then we can either increase the domain or split the vector in two (or more) vectors. This is most likely what we will have to do to support multiple lookup tables later.
+
+What about the second vector $s$?
+
+## The sorted vector $s$
+
+We said earlier that in the original plookup paper the size of $s$ is equal to $|s| = |f|+|t|$, where $f$ encodes queries, and $t$ encodes the lookup table.
+With our multi-query approach, the second vector $s$ is of the size
+
+$$n \cdot |\#\text{queries}| + |\text{lookup\_table}|$$
+
+That is, it contains the $n$ elements of each **query vector** (the actual values being looked up, after being combined with the joint combinator, that's $4$ per row), as well as the elements of our lookup table (after being combined as well).
+
+Because the vector $s$ is larger than the domain size $n$, it is split into several vectors of size $n$. Specifically, in the plonkup paper, $s$ is split into two halves, which are then interpolated as $h_1$ and $h_2$. 
+The plookup denominator in vector form is
+$$
+\big(\gamma(1+\beta) + s_{i-1} + \beta s_{i}\big)\big(\gamma(1+\beta)+s_{n+i-1} + \beta s_{n+i}\big)
+$$
+which, when interpolated into $h_1$ and $h_2$, becomes
+$$
+\big(\gamma(1+\beta) + h_1(x) + \beta h_1(g \cdot x)\big)\big(\gamma(1+\beta) + h_2(x) + \beta h_2(g \cdot x)\big)
+$$
+
+Since one has to compute the difference of every contiguous pair, the last element of the first half is replicated as the first element of the second half ($s_{n-1} = s_{n}$).
+Hence, a separate constraint must be added to enforce that continuity on the interpolated polynomials $h_1$ and $h_2$:
+
+$$L_{n-1}(X)\cdot\big(h_1(X) - h_2(g \cdot X)\big) \equiv 0$$
+
+which is equivalent to checking that $h_1(g^{n-1}) = h_2(1)$.
+
+## The sorted vector $s$ in kimchi
+
+Since this vector is known only to the prover, and is evaluated as part of the protocol, zero-knowledge must be added to the corresponding polynomials (in the case of the plookup approach, $h_1(X), h_2(X)$). To do this in kimchi, we use the same technique as with the other prover polynomials: we randomize the last evaluations (or rows, on the domain) of the polynomial.
+
+This means two things for the lookup grand product argument:
+
+1. We cannot use the wrap-around trick to make sure that the list is split in two correctly (enforced by $L_{n-1}(h_1(x) - h_2(g \cdot x)) = 0$, which is equivalent to $h_1(g^{n-1}) = h_2(1)$ in the plookup paper)
+2. We have even less space to store an entire query vector. This is actually convenient, as the witness also has some zero-knowledge rows at the end that should not be part of the queries anyway.
+
+The first problem can be solved in two ways:
+
+* **Zig-zag technique**. By reorganizing $s$ to alternate its values between the columns. 
For example, $h_1 = (s_0, s_2, s_4, \cdots)$ and $h_2 = (s_1, s_3, s_5, \cdots)$ so that you can simply write the denominator of the grand product argument as
+  $$(\gamma(1+\beta) + h_1(x) + \beta h_2(x))(\gamma(1+\beta)+ h_2(x) + \beta h_1(x \cdot g))$$
+  This approach is taken by the [plonkup](https://eprint.iacr.org/2022/086) paper.
+* **Snake technique**. By reorganizing $s$ as a snake. This is what is currently implemented in kimchi.
+
+The snake technique rearranges $s$ into the following shape:
+
+```
+                       __       _
+ s_0     |  s_{2n-1} |   |  |    |
+ ...     |  ...      |   |  |    |
+ s_{n-1} |  s_n      |   |  |    |
+ ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾    ‾‾‾     ‾
+    h1         h2         h3    ...
+```
+
+Assuming that for now we have only one bend and two polynomials $h_1(X), h_2(X)$, the denominator has the following form:
+
+$$\big(\gamma(1+\beta) + h_1(x) + \beta h_1(x \cdot g)\big)\big(\gamma(1+\beta)+ h_2(x \cdot g) + \beta h_2(x)\big)$$
+
+and the snake doing a U-turn is constrained via $s_{n-1} = s_n$, enforced by the following equation:
+
+$$L_{n-1} \cdot (h_1(x) - h_2(x)) = 0$$
+
+In practice, $s$ will have more sections than just two.
+Assume that we have $k$ sections in total, then the denominator generalizes to
+
+$$
+\prod_{i=1}^k \big(\gamma(1+\beta) + h_i(x \cdot g^{\delta_{0,\ i \text{ mod } 2}}) + \beta h_i(x \cdot g^{\delta_{1,\ i \text{ mod } 2}})\big)
+$$
+
+where $\delta_{i,j}$ is the Kronecker delta: the first exponent $\delta_{0,\ i \text{ mod } 2}$ is $1$ exactly when $i$ is even, the second $\delta_{1,\ i \text{ mod } 2}$ is $1$ exactly when $i$ is odd, and each is $0$ otherwise.
+
+Similarly, the U-turn constraints now become
+
+$$
+\begin{align*}
+L_{n-1}(X) \cdot (h_2(X) - h_1(X)) &\equiv 0\\
+\color{green}L_{0}(X) \cdot (h_3(X) - h_2(X)) &\color{green}\equiv 0\\
+\color{green}L_{n-1}(X) \cdot (h_4(X) - h_3(X)) &\color{green}\equiv 0\\
+\ldots
+\end{align*}
+$$
+
+In our concrete case with $4$ simultaneous lookups, the vector $s$ has to be split into $k = 5$ sections --- each denominator term in the accumulator accounts for the $4$ queries ($f$) plus $1$ table consistency check ($t$). 
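As an illustration, here is a simplified Python sketch of the snake arrangement. It assumes (as in the plookup split discussed earlier) that the element at each bend is replicated into the next column, and that odd-indexed columns are read bottom-up:

```python
def snake_columns(s, n):
    """Fold the sorted vector s into height-n columns h_1, h_2, ...,
    replicating the boundary element at each U-turn (sketch)."""
    cols, i = [], 0
    while i + n <= len(s):
        cols.append(s[i:i + n])
        i += n - 1  # the bend element is shared with the next column
    # odd-indexed columns run bottom-up
    return [c[::-1] if j % 2 else c for j, c in enumerate(cols)]

s = [0, 0, 1, 2, 2, 3, 4, 5, 5, 6]  # sorted, length 3*(n-1) + 1 with n = 4
h = snake_columns(s, 4)             # h[0]=[0,0,1,2], h[1]=[4,3,2,2], h[2]=[4,5,5,6]

# U-turn checks mirroring L_{n-1}(h_2 - h_1) = 0 and L_0(h_3 - h_2) = 0
assert h[0][-1] == h[1][-1]  # bottom of the first U-turn
assert h[1][0] == h[2][0]    # top of the second U-turn
```

The boundary assertions are exactly what the Lagrange-selector constraints $L_{n-1}$ and $L_0$ enforce on the interpolated polynomials.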
+
+## Unsorted $t$ in $s$
+
+Note that at setup time, $t$ cannot be sorted lexicographically as it is not combined yet. Since $s$ must be sorted by $t$ (in other words, the sorting of $s$ must follow the elements of $t$), there are two solutions:
+
+1. Both the prover and the verifier can sort the combined $t$ lexicographically, so that $s$ can be sorted lexicographically too, using typical sorting algorithms.
+2. The prover can directly sort $s$ by $t$, so that the verifier doesn't have to do any pre-sorting and can just rely on the commitment of the columns of $t$ (which the prover can evaluate in the protocol).
+
+We take the second approach.
+However, this must be done carefully since the combined $t$ entries can repeat. For some $i, l$ such that $i \neq l$, we might have
+
+$$
+t_0[i] + j \cdot t_1[i] + j^2 \cdot t_2[i] = t_0[l] + j \cdot t_1[l] + j^2 \cdot t_2[l]
+$$
+
+For example, if $f = \{1, 2, 2, 3\}$ and $t = \{2, 1, 2, 3\}$, then $\text{sorted}(f, t) = \{2, 2, 2, 1, 1, 2, 3, 3\}$ would be a way of correctly sorting the combined vector $s$. At the same time $\text{sorted}(f, t) = \{ 2, 2, 2, 2, 1, 1, 3, 3 \}$ is incorrect since it does not have a second block of $2$s, and thus such an $s$ is not sorted by $t$.
+
+
+## Recap
+
+So to recap, to create the sorted polynomials $h_i$, the prover:
+
+1. creates a large query vector which contains the concatenation of the $4$ per-row (combined with the joint combinator) queries (that might contain dummy queries) for all rows
+2. creates the (combined with the joint combinator) table vector
+3. sorts all of that into a big vector $s$
+4. divides that vector $s$ into as many $h_i$ vectors as necessary, following the snake method
+5. interpolates these $h_i$ vectors into $h_i$ polynomials
+6. commits to them, and evaluates them as part of the protocol. 
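The sort-by-$t$ rule with repeated table entries can be sketched in Python. This is only one valid strategy (it assumes every query value appears in the table): each query value is attached to the first occurrence of that value in $t$.

```python
from collections import Counter

def sorted_by_table(f, t):
    """Arrange the multiset f ∪ t in the order given by t, attaching each
    query value to the first matching table entry (sketch)."""
    remaining = Counter(f)
    s = []
    for v in t:
        s.append(v)                  # the table entry itself
        while remaining[v] > 0:      # then every pending query of that value
            s.append(v)
            remaining[v] -= 1
    return s

f = [1, 2, 2, 3]
t = [2, 1, 2, 3]
# reproduces the document's example, including the second (singleton) block of 2s
assert sorted_by_table(f, t) == [2, 2, 2, 1, 1, 2, 3, 3]
```

Note how the second occurrence of $2$ in $t$ still contributes its own block, which is what distinguishes a correctly sorted $s$ from the rejected $\{2, 2, 2, 2, 1, 1, 3, 3\}$.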
diff --git a/book/src/plonk/maller_15.md b/book/src/kimchi/maller_15.md
similarity index 100%
rename from book/src/plonk/maller_15.md
rename to book/src/kimchi/maller_15.md
diff --git a/book/src/kimchi/permut.md b/book/src/kimchi/permut.md
index 3d62293eae..cc82360ef6 100644
--- a/book/src/kimchi/permut.md
+++ b/book/src/kimchi/permut.md
@@ -1,4 +1 @@
-## Permutation
-
-TO-DO
-
+# Permutation
diff --git a/book/src/kimchi/zkpm.md b/book/src/kimchi/zkpm.md
new file mode 100644
index 0000000000..947a334538
--- /dev/null
+++ b/book/src/kimchi/zkpm.md
@@ -0,0 +1 @@
+# Zero-Column Approach to Zero-Knowledge
diff --git a/book/src/pickles/passthrough.md b/book/src/pickles/passthrough.md
deleted file mode 100644
index cc182c4e51..0000000000
--- a/book/src/pickles/passthrough.md
+++ /dev/null
@@ -1 +0,0 @@
-# Passthrough and Me-Only
diff --git a/book/src/plonk/fiat_shamir.md b/book/src/plonk/fiat_shamir.md
index fd486f6aba..f1e2b264ef 100644
--- a/book/src/plonk/fiat_shamir.md
+++ b/book/src/plonk/fiat_shamir.md
@@ -1,4 +1,4 @@
-# non-interaction with fiat-shamir
+# Non-Interactivity via Fiat-Shamir
 
 So far we've talked about an interactive protocol between a prover and a verifier. The zero-knowledge proof was also in the honest verifier zero-knowledge (HVZK) model, which is problematic.
@@ -15,7 +15,7 @@ This is important as our technique to transform an interactive protocol to a non
 
 The whole idea is to replace the verifier by a random oracle, which in practice is a hash function. Note that by doing this, we remove potential leaks that can happen when the verifier acts dishonestly.
 
-Initially the Fiat-Shamir transformation was only applied to sigma protocols, named after the greek letter $\Sigma$ due to its shape resembling the direction of messages (prover sends a commit to a verifier, verifier sends a challenge to a prover, prover sends the final proof to a verifier). 
+Initially the Fiat-Shamir transformation was only applied to sigma protocols, named after the Greek letter $\Sigma$ due to its shape resembling the direction of messages (the prover sends a commitment to the verifier, the verifier sends a challenge to the prover, and the prover sends the final proof to the verifier). A $Z$ would have made more sense but here we are.
 
 ## Generalization of Fiat-Shamir
 
@@ -27,6 +27,6 @@ This is simple: every verifier move can be replaced by a hash of the transcript
 
 While we use a hash function for that, a different construction called the [duplex construction](https://keccak.team/sponge_duplex.html) is particularly useful in such situations, as it allows one to continuously absorb the transcript and produce challenges, while automatically authenticating the fact that a challenge was produced.
 
-[Merlin](https://merlin.cool/) is a standardization of such a construction using the [Strobe protocol framework](https://strobe.sourceforge.io/) (a framework to make use of a duplex construction). 
-Note that the more recent [Xoodyak](https://keccak.team/xoodyak.html) (part of NIST's lightweight competition) could have been used for this as well. 
+[Merlin](https://merlin.cool/) is a standardization of such a construction using the [Strobe protocol framework](https://strobe.sourceforge.io/) (a framework to make use of a duplex construction).
+Note that the more recent [Xoodyak](https://keccak.team/xoodyak.html) (part of NIST's lightweight competition) could have been used for this as well. Note also that Mina uses none of these standards; instead, it simply uses Poseidon (see the section on Poseidon). 
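To make the "replace every verifier move by a hash of the transcript" idea concrete, here is a minimal Python sketch of a transcript. SHA-256 is a stand-in here; a real implementation would use a duplex sponge such as Poseidon, as Mina does.

```python
import hashlib

class Transcript:
    """Minimal Fiat-Shamir transcript sketch (not a real implementation)."""
    def __init__(self, label: bytes):
        self.state = label

    def absorb(self, message: bytes) -> None:
        # append the prover's message to the running transcript
        self.state += b"\x00" + message

    def challenge(self, p: int) -> int:
        # a verifier move becomes a hash of the whole transcript so far
        digest = hashlib.sha256(self.state).digest()
        self.absorb(digest)  # record that a challenge was squeezed
        return int.from_bytes(digest, "big") % p

p = 2**255 - 19  # example prime
tr = Transcript(b"example-protocol")
tr.absorb(b"commitment-1")
c1 = tr.challenge(p)
tr.absorb(b"commitment-2")
c2 = tr.challenge(p)
assert c1 != c2          # challenges are bound to the evolving transcript
assert 0 <= c1 < p
```

Because every challenge is a deterministic function of all messages absorbed before it, the prover cannot grind a message after seeing the challenge it influences, which is what makes the non-interactive protocol sound in the random-oracle model.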
diff --git a/book/src/plonk/glossary.md b/book/src/plonk/glossary.md index 4880894fb4..6769ac5f83 100644 --- a/book/src/plonk/glossary.md +++ b/book/src/plonk/glossary.md @@ -1,9 +1,9 @@ # Glossary -* size = number of rows -* columns = number of variables per rows -* cell = a pair (row, column) -* witness = the values assigned in all the cells -* gate = polynomials that act on the variables in a row -* selector vector = a vector of values 1 or 0 (not true for constant vector I think) that toggles gates and variables in a row -* gadget = a series of contiguous rows with some specific gates set (via selector vectors) +* Size: number of rows +* Columns: number of variables per row +* Cell: a pair (row, column) +* Witness: the values assigned in all the cells +* Gate: polynomials that act on the variables in a row +* Selector vector: a vector of values 1 or 0 (not true for constant vector I think) that toggles gates and variables in a row +* Gadget: a series of contiguous rows with some specific gates set (via selector vectors) diff --git a/book/src/plonk/maller.md b/book/src/plonk/maller.md index b2d97c10c8..d5e0d26965 100644 --- a/book/src/plonk/maller.md +++ b/book/src/plonk/maller.md @@ -1,4 +1,4 @@ -# Maller optimization to reduce proof size +# Maller's Optimization to Reduce Proof Size In the PLONK paper, they make use of an optimization from Mary Maller in order to reduce the proof size. @@ -19,7 +19,7 @@ Let's see the protocol where _Prover_ wants to prove to _Verifier_ that $$\forall x \in \mathbb{F}, \; h_1(x)h_2(x) - h_3(x) = 0$$ -given commitments of $h_1, h_2, h_3$, +given commitments of $h_1, h_2, h_3$, ![maller 1](../img/maller_1.png) @@ -37,7 +37,7 @@ Note right of Verifier: verifies that \n h1(s)h2(s) - h3(s) = 0 ``` --> -A shorter proof exists.
Essentially, if the verifier already has the opening `h1(s)`, they can reduce the problem to showing that $$ \forall x \in \mathbb{F}, \; L(x) = h_1(s)h_2(x) - h_3(x) = 0$$ diff --git a/book/src/plonk/zkpm.md b/book/src/plonk/zkpm.md index a4a1ba291c..84f46f3ec0 100644 --- a/book/src/plonk/zkpm.md +++ b/book/src/plonk/zkpm.md @@ -19,7 +19,9 @@ Let $f(X)$ be an interpolation polynomial of degree $n-1$ such that $f(h_i) = v_ **Proof sketch.** Recall that the interpolation polynomial is -$f(X) = \sum_{j = 1}^n \prod_{k \neq j} \frac{(X-h_k)}{(h_j-h_k)} v_j$ +$$ +f(X) = \sum_{j = 1}^n \prod_{k \neq j} \frac{(X-h_k)}{(h_j-h_k)} v_j +$$ With $V_{w+1}, \ldots, V_{w+k}$ as random variables, we have, $f(X) = a_{w+1} V_{w+1} + a_{w+2} V_{w+2} + \ldots + a_{w+k} V_{w+k} + a$ @@ -32,16 +34,22 @@ The idea here is to set the last $k$ evaluations to be uniformly random elements **Modified permutation polynomial.** Specifically, set $z(X)$ as follows. -$z(X) = L_1(X) + \sum_{i = 1}^{\blue{n-k-2}} \left(L_{i+1} \prod_{j=1}^i \mathsf{frac}_{i,j} \right) + \blue{t_1 L_{n-k}(X) + \ldots + t_k L_{n}(X) }$ +$$ +z(X) = L_1(X) + \sum_{i = 1}^{\blue{n-k-2}} \left(L_{i+1} \prod_{j=1}^i \mathsf{frac}_{i,j} \right) + \blue{t_1 L_{n-k}(X) + \ldots + t_k L_{n}(X) } +$$ From Lemma 1, the above $z(X)$ has the desired zero knowledge property when $k$ evaluations are revealed. However, we need to modify the other parts of the protocol so that the last $k$ elements are not subject to the permutation evaluation, since they will no longer satisfy the permutation check. Specifically, we will need to modify the permutation polynomial to disregard those random elements, as follows. 
-$ \begin{aligned} & t(X) = \\ - & (a(X)b(X)q_M(X) + a(X)q_L(X) + b(X)q_R(X) + c(X)q_O(X) + PI(X) + q_C(X)) \frac{1}{z_H(X)} \\ - &+ ((a(X) + \beta X + \gamma)(b(X) + \beta k_1 X + \gamma)(c(X) + \beta k_2X + \gamma)z(X) \blue{(X-h_{n-k}) \ldots (X-h_{n-1})(X-h_n)} ) \frac{\alpha}{z_{H}(X)} \\ - & - ((a(X) + \beta S_{\sigma1}(X) + \gamma)(b(X) + \beta S_{\sigma2}(X) + \gamma)(c(X) + \beta S_{\sigma3}(X) + \gamma)z(X\omega) \blue{(X-h_{n-k}) \ldots (X-h_{n-1})(X-h_n)}) \frac{\alpha}{z_{H}(X)} \\ - & + (z(X)-1)L_1(X) \frac{\alpha^2}{z_H(X)} \\ - & + \blue{(z(X)-1)L_{n-k}(X) \frac{\alpha^3}{z_H(X)} } \end{aligned} $ +$$ +\begin{aligned} & t(X) = \\ + & \Big(a(X)b(X)q_M(X) + a(X)q_L(X) + b(X)q_R(X) + c(X)q_O(X) + PI(X) + q_C(X)\Big) \frac{1}{z_H(X)} \\ + &+ \Big((a(X) + \beta X + \gamma)(b(X) + \beta k_1 X + \gamma)(c(X) + \beta k_2X + \gamma)z(X)\\ + &\qquad\qquad\qquad\times{\blue{(X-h_{n-k}) \ldots (X-h_{n-1})(X-h_n)}} \Big) \frac{\alpha}{z_{H}(X)} \\ + & - \Big((a(X) + \beta S_{\sigma1}(X) + \gamma)(b(X) + \beta S_{\sigma2}(X) + \gamma)(c(X) + \beta S_{\sigma3}(X) + \gamma)z(X\omega)\\ + &\qquad\qquad\qquad\times{\blue{(X-h_{n-k}) \ldots (X-h_{n-1})(X-h_n)}}\Big) \frac{\alpha}{z_{H}(X)} \\ + & + \Big(z(X)-1\Big)\cdot L_1(X) \frac{\alpha^2}{z_H(X)} \\ + & + \blue{\Big(z(X)-1\Big)\cdot L_{n-k}(X) \frac{\alpha^3}{z_H(X)} } \end{aligned} +$$ **Modified permutation checks.** To recall, the permutation check was originally as follows. For all $h \in H$, @@ -50,7 +58,6 @@ $ \begin{aligned} & t(X) = \\ = Z(\omega h)[(a(h) + \beta S_{\sigma1}(h) + \gamma)(b(h) + \beta S_{\sigma2}(h) + \gamma)(c(h) + \beta S_{\sigma3}(h) + \gamma)]$ - The modified permutation checks, which ensure that the check is performed on all the values except the last $k$ elements in the witness polynomials, are as follows.
* For all $h \in H$, $L_1(h)(Z(h) - 1) = 0$ diff --git a/book/src/rfcs/3-lookup.md b/book/src/rfcs/3-lookup.md deleted file mode 100644 index 6f256e6a6f..0000000000 --- a/book/src/rfcs/3-lookup.md +++ /dev/null @@ -1,291 +0,0 @@ -# RFC: Plookup in kimchi - -In 2020, [plookup](https://eprint.iacr.org/2020/315.pdf) showed how to create lookup proofs: proofs that some witness values are part of a [lookup table](https://en.wikipedia.org/wiki/Lookup_table). Two years later, an independent team published [plonkup](https://eprint.iacr.org/2022/086) showing how to integrate Plookup into Plonk. - -This document specifies how we integrate plookup in kimchi. It assumes that the reader understands the basics behind plookup. - -## Overview - -We integrate plookup in kimchi with the following differences: - -* we snake-ify the sorted table instead of wrapping it around (see later) -* we allow fixed-ahead-of-time linear combinations of columns of the queries we make -* we only use a single table (XOR) at the moment of this writing -* we allow several lookups (or queries) to be performed within the same row -* zero-knowledgeness is added in a specific way (see later) - -The following sections explain the protocol in more detail. - -### Recap on the grand product argument of plookup - -As per the Plookup paper, the prover will have to compute three vectors: - -* $f$, the (secret) **query vector**, containing the witness values that the prover wants to prove are part of the lookup table. - -* $t$, the (public) **lookup table**. -* $s$, the (secret) concatenation of $f$ and $t$, sorted by $t$ (where elements are listed in the order they are listed in $t$).
- -Essentially, plookup proves that all the elements in $f$ are indeed in the lookup table $t$ if and only if the following multisets are equal: - -* $\{(1+\beta)f, \text{diff}(t)\}$ -* $\text{diff}(\text{sorted}(f, t))$ - -where $\text{diff}$ is a new multiset derived by applying a "randomized difference" between every successive pair of a vector. For example: - -* $f = \{5, 4, 1, 5\}$ -* $t = \{1, 4, 5\}$ -* $\{\color{blue}{(1+\beta)f}, \color{green}{\text{diff}(t)}\} = \{\color{blue}{(1+\beta)5, (1+\beta)4, (1+\beta)1, (1+\beta)5}, \color{green}{1+\beta 4, 4+\beta 5}\}$ -* $\text{diff}(\text{sorted}(f, t)) = \{1+\beta 1, 1+\beta 4, 4+\beta 4, 4+\beta 5, 5+\beta 5, 5+\beta 5\}$ - -> Note: This assumes that the lookup table is a single column. You will see in the next section how to address lookup tables with more than one column. - -The equality between the multisets can be proved with the permutation argument of plonk, which would look like enforcing constraints on the following accumulator: - -* init: $acc_0 = 1$ -* final: $acc_n = 1$ -* for every $0 < i \leq n$: - $$ - acc_i = acc_{i-1} \cdot \frac{(\gamma + (1+\beta) f_{i-1})(\gamma + t_{i-1} + \beta t_i)}{(\gamma + s_{i-1} + \beta s_{i})} - $$ - -Note that the plookup paper uses a slightly different equation to make the proof work. I believe the proof would work with the above equation, but for simplicity let's just use the equation published in plookup: - -$$ -acc_i = acc_{i-1} \cdot \frac{(1+\beta)(\gamma + f_{i-1})(\gamma(1 + \beta) + t_{i-1} + \beta t_i)}{(\gamma(1+\beta) + s_{i-1} + \beta s_{i})} -$$ - -> Note: in plookup $s$ is too large, and so needs to be split into multiple vectors to enforce the constraint at every $i \in [[0;n]]$. We ignore this for now. - -### Lookup tables - -Kimchi uses a single **lookup table** at the moment of this writing; the XOR table.
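As a numeric sanity check of the multiset equality above, here is a sketch over exact rationals instead of field elements (the helper names are made up for illustration; it assumes a duplicate-free single-column table $t$ with every query value present in $t$):

```python
import random
from fractions import Fraction

def diff(v, beta):
    # "Randomized difference" of successive pairs: v_i + beta * v_{i+1}.
    return [v[i] + beta * v[i + 1] for i in range(len(v) - 1)]

def multisets_equal(f, t, trials=5):
    # Compare {(1+beta)*f, diff(t)} and diff(sorted(f, t)) as multisets by
    # checking that products of (gamma + element) agree at random points.
    s = sorted(f + t, key=t.index)  # concatenation of f and t, in t's order
    for _ in range(trials):
        beta = Fraction(random.randint(1, 10**6))
        gamma = Fraction(random.randint(1, 10**6))
        lhs = 1
        for x in f:
            lhs *= gamma + (1 + beta) * x
        for d in diff(t, beta):
            lhs *= gamma + d
        rhs = 1
        for d in diff(s, beta):
            rhs *= gamma + d
        if lhs != rhs:
            return False
    return True

# The example from the text: every element of f appears in t.
assert multisets_equal([5, 4, 1, 5], [1, 4, 5])
```

The product comparison at random $\beta, \gamma$ is what the grand product accumulator enforces when it starts and ends at $1$.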
The XOR table for values of 1 bit is the following: - - -| l | r | o | -| --- | --- | --- | -| 1 | 0 | 1 | -| 0 | 1 | 1 | -| 1 | 1 | 0 | -| 0 | 0 | 0 | - -Kimchi, on the other hand, uses the XOR table for values of 4 bits, which has $2^{8}$ entries. - -Note: the (0, 0, 0) **entry** is at the very end on purpose (as it will be used as the dummy entry for rows of the witness that don't care about lookups). - -### Querying the table - -The plookup paper handles a vector of lookups $f$ which we do not have. So the first step is to create such a table from the witness columns (or registers). To do this, we define the following objects: - -* a **query** tells us what registers, in what order, and scaled by how much, are part of a query -* a **query selector** tells us which rows are using the query. It is pretty much the same as a [gate selector](). - -Let's go over the first item in this section. - -For example, the following **query** tells us that we want to check if $r_0 \oplus r_2 = 2r_1$ - -| l | r | o | -| :---: | :---: | :---: | -| 1, r0 | 1, r2 | 2, r1 | - -The grand product argument for the lookup constraint will look like this at this point: - -$$ -acc_i = acc_{i-1} \cdot \frac{\color{green}{(1+\beta)(\gamma + w_0(g^i) + j \cdot w_2(g^i) + j^2 \cdot 2 \cdot w_1(g^i))}(\gamma(1 + \beta) + t_{i-1} + \beta t_i)}{(\gamma(1+\beta) + s_{i-1} + \beta s_{i})} -$$ - -Not all rows need to perform queries into a lookup table. We will use a query selector in the next section to make the constraints work with this in mind. - -### Query selector - -The associated **query selector** tells us on which rows the query into the XOR lookup table occurs. - -| row | query selector | -| :---: | :------------: | -| 0 | 1 | -| 1 | 0 | - - -Both the (XOR) lookup table and the query are built-ins in kimchi. The query selector is derived from the circuit at setup time. Currently only the ChaCha gates make use of the lookups.
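For concreteness, the 4-bit XOR table with its dummy $(0, 0, 0)$ entry placed last can be generated with a short sketch (illustrative, not kimchi's actual table-building code):

```python
# All (l, r, l xor r) triples for 4-bit operands, with (0, 0, 0) moved to the
# very end so it can serve as the dummy entry for rows that perform no lookup.
xor_table = [(l, r, l ^ r) for l in range(16) for r in range(16) if (l, r) != (0, 0)]
xor_table.append((0, 0, 0))

assert len(xor_table) == 2**8  # 256 entries, as stated above
assert xor_table[-1] == (0, 0, 0)
```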
- -The grand product argument for the lookup constraint looks like this now: - -$$ -acc_i = acc_{i-1} \cdot \frac{\color{green}{(1+\beta) \cdot query} \cdot (\gamma(1 + \beta) + t_{i-1} + \beta t_i)}{(\gamma(1+\beta) + s_{i-1} + \beta s_{i})} -$$ - -where $\color{green}{query}$ is constructed so that a dummy query ($0 \oplus 0 = 0$) is used on rows that don't have a query. - -$$ -\begin{align} -query = &\ selector \cdot (\gamma + w_0(g^i) + j \cdot w_2(g^i) + j^2 \cdot 2 \cdot w_1(g^i)) + \\ -&\ (1- selector) \cdot (\gamma + 0 + j \cdot 0 + j^2 \cdot 0) -\end{align} -$$ - -### Queries, not query - -Since we allow multiple queries per row, we define multiple **queries**, where each query is associated with a **lookup selector**. - -At the moment of this writing, the `ChaCha` gates all perform $4$ queries in a row. Thus, $4$ is trivially the largest number of queries that happen in a row. - -**Important**: to make constraints work, this means that each row must make 4 queries. (Potentially some or all of them are dummy queries.) - -For example, the `ChaCha0`, `ChaCha1`, and `ChaCha2` gates will apply the following 4 XOR queries on the current and following rows: - -| l | r | o | - | l | r | o | - | l | r | o | - | l | r | o | -| :---: | :---: | :----: | --- | :---: | :---: | :----: | --- | :---: | :---: | :----: | --- | :---: | :----: | :----: | -| 1, r3 | 1, r7 | 1, r11 | - | 1, r4 | 1, r8 | 1, r12 | - | 1, r5 | 1, r9 | 1, r13 | - | 1, r6 | 1, r10 | 1, r14 | - -which you can understand as checking for the current and following row that - -* $r_3 \oplus r_7 = r_{11}$ -* $r_4 \oplus r_8 = r_{12}$ -* $r_5 \oplus r_9 = r_{13}$ -* $r_6 \oplus r_{10} = r_{14}$ - -The `ChaChaFinal` gate also performs $4$ (somewhat similar) queries in the XOR lookup table. In total this is 8 different queries that could be associated with 8 selector polynomials. - -### Grouping queries by queries pattern - -Associating each query with a selector polynomial is not necessarily efficient.
To summarize: - -* the `ChaCha0`, `ChaCha1`, and `ChaCha2` gates make $4$ queries into the XOR table -* the `ChaChaFinal` gate makes $4$ different queries into the XOR table - -Using the previous section's method, we'd have to use $8$ different lookup selector polynomials, one for each of the $8$ different queries. Since there are only $2$ use-cases, we can simply group them by **queries patterns** to reduce the number of lookup selector polynomials to $2$. - -The grand product argument for the lookup constraint looks like this now: - -$$ -acc_i = acc_{i-1} \cdot \frac{\color{green}{(1+\beta)^4 \cdot query} \cdot (\gamma(1 + \beta) + t_{i-1} + \beta t_i)}{(\gamma(1+\beta) + s_{i-1} + \beta s_{i})} -$$ - -where $\color{green}{query}$ is constructed as: - -$$ -\begin{align} -query = &\ selector_1 \cdot pattern_1 + \\ -&\ selector_2 \cdot pattern_2 + \\ -&\ (1 - selector_1 - selector_2) \cdot (\gamma + 0 + j \cdot 0 + j^2 \cdot 0)^4 -\end{align} -$$ - -where, for example, the first pattern for the `ChaCha0`, `ChaCha1`, and `ChaCha2` gates looks like this: - -$$ -\begin{align} -pattern_1 = &\ (\gamma + w_3(g^i) + j \cdot w_7(g^i) + j^2 \cdot w_{11}(g^i)) \cdot \\ -&\ (\gamma + w_4(g^i) + j \cdot w_8(g^i) + j^2 \cdot w_{12}(g^i)) \cdot \\ -&\ (\gamma + w_5(g^i) + j \cdot w_9(g^i) + j^2 \cdot w_{13}(g^i)) \cdot \\ -&\ (\gamma + w_6(g^i) + j \cdot w_{10}(g^i) + j^2 \cdot w_{14}(g^i)) \\ -\end{align} -$$ - -Note: - -* there are now 4 dummy queries, and they only appear when none of the lookup selectors are active -* if a pattern uses fewer than 4 queries, it has to be padded with dummy queries as well - -## Back to the grand product argument - -There are two things that we haven't touched on: - -* The vector $t$ representing the **combined lookup table** (after its columns have been combined with a joint combiner $j$).
The **non-combined lookup table** is fixed at setup time and derived based on the lookup tables used in the circuit (for now only one, the XOR lookup table, can be used in the circuit). -* The vector $s$ representing the sorted multiset of both the queries and the lookup table. This is created by the prover and sent as a commitment to the verifier. - -The first vector $t$ is quite straightforward to think about: - -* if it is smaller than the domain (of size $n$), then we can repeat the last entry enough times to make the table of size $n$. -* if it is larger than the domain, then we can either increase the domain or split the vector in two (or more) vectors. This is most likely what we will have to do to support multiple lookup tables later. - -What about the second vector? - -## The sorted vector $s$ - -The second vector $s$ is of size - -$$n \cdot |\text{queries}| + |\text{lookup\_table}|$$ - -That is, it contains the $n$ elements of each **query vector** (the actual values being looked up, after being combined with the joint combinator, that's $4$ per row), as well as the elements of our lookup table (after being combined as well). - -Because the vector $s$ is larger than the domain size $n$, it is split into several vectors of size $n$. Specifically, in the plonkup paper, $s$ is split into two halves (which are then interpolated as $h_1$ and $h_2$):
- $$ -acc_i = acc_{i-1} \cdot \frac{\color{green}{(1+\beta)^4 \cdot query} \cdot (\gamma(1 + \beta) + t_{i-1} + \beta t_i)}{(\gamma(1+\beta) + s_{i-1} + \beta s_{i})(\gamma(1+\beta)+s_{n+i-1} + \beta s_{n+i})} -$$ - -Since you must compute the difference of every contiguous pair, the last element of the first half is replicated as the first element of the second half ($s_{n-1} = s_{n}$), and a separate constraint enforces that continuity on the interpolated polynomials $h_1$ and $h_2$: - -$$L_{n-1}(h_1(x) - h_2(g \cdot x)) = 0$$ - -which is equivalent to checking that - -$$h_1(g^{n-1}) = h_2(1)$$ - -## The sorted vector $s$ in kimchi - -Since this vector is known only by the prover, and is evaluated as part of the protocol, zero-knowledge must be added to the polynomial. To do this in kimchi, we use the same technique as with the other prover polynomials: we randomize the last evaluations (or rows, on the domain) of the polynomial. - -This means two things for the lookup grand product argument: - -1. we cannot use the wrap around trick to make sure that the list is split in two correctly (enforced by $L_{n-1}(h_1(x) - h_2(g \cdot x)) = 0$ which is equivalent to $h_1(g^{n-1}) = h_2(1)$ in the plookup paper) -2. we have even less space to store an entire query vector. This is actually fine, as the witness also has some zero-knowledge rows at the end that should not be part of the queries anyway. - -The first problem can be solved in two ways: - -* **Zig-zag technique**. By reorganizing $s$ to alternate its values between the columns. For example, $h_1 = (s_0, s_2, s_4, \cdots)$ and $h_2 = (s_1, s_3, s_5, \cdots)$ so that you can simply write the denominator of the grand product argument as - $$(\gamma(1+\beta) + h_1(x) + \beta h_2(x))(\gamma(1+\beta)+ h_2(x) + \beta h_1(x \cdot g))$$ - this is what the [plonkup](https://eprint.iacr.org/2022/086) paper does. -* **Snake technique**. By reorganizing $s$ as a snake.
This is what is done in kimchi currently. - -The snake technique rearranges $s$ into the following shape: - -``` - _ _ - | | | | | - | | | | | - |_| |_| | -``` - -so that the denominator becomes the following equation: - -$$(\gamma(1+\beta) + h_1(x) + \beta h_1(x \cdot g))(\gamma(1+\beta)+ h_2(x \cdot g) + \beta h_2(x))$$ - -and the snake doing a U-turn is constrained via something like - -$$L_{n-1} \cdot (h_1(x) - h_2(x)) = 0$$ - -If there's an $h_3$ (because the table is very large, for example), then you'd have something like: - -$$(\gamma(1+\beta) + h_1(x) + \beta h_1(x \cdot g))(\gamma(1+\beta)+ h_2(x \cdot g) + \beta h_2(x))\color{green}{(\gamma(1+\beta)+ h_3(x) + \beta h_3(x \cdot g))}$$ - -with the added U-turn constraint: - -$$L_{0} \cdot (h_2(x) - h_3(x)) = 0$$ - -## Unsorted $t$ in $s$ - -Note that at setup time, $t$ cannot be sorted as it is not combined yet. Since $s$ needs to be sorted by $t$ (in other words, not sorted, but sorted following the elements of $t$), there are two solutions: - -1. both the prover and the verifier can sort the combined $t$, so that $s$ can be sorted via the typical sorting algorithms -2. the prover can sort $s$ by $t$, so that the verifier doesn't have to do any sorting and can just rely on the commitment of the columns of $t$ (which the prover can evaluate in the protocol). - -We do the second one, but there is an edge-case: the combined $t$ entries can repeat. -For some $i, l$ such that $i \neq l$, we might have - -$$ -t_0[i] + j t_1[i] + j^2 t_2[i] = t_0[l] + j t_1[l] + j^2 t_2[l] -$$ - -For example, if $f = \{1, 2, 2, 3\}$ and $t = \{2, 1, 2, 3\}$, then $\text{sorted}(f, t) = \{2, 2, 2, 1, 1, 2, 3, 3\}$ would be one way of sorting things out. But $\text{sorted}(f, t) = \{ 2, 2, 2, 2, 1, 1, 3, 3 \}$ would be incorrect. - - -## Recap - -So to recap, to create the sorted polynomials $h_i$, the prover: - -1. 
creates a large query vector which contains the concatenation of the $4$ per-row (combined with the joint combinator) queries (that might contain dummy queries) for all rows -2. creates the (combined with the joint combinator) table vector -3. sorts all of that into a big vector $s$ -4. divides that vector $s$ into as many $h_i$ vectors as necessary, following the snake method -5. interpolates these $h_i$ vectors into $h_i$ polynomials -6. commits to them and evaluates them as part of the protocol.
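The "sorted by $t$" ordering used in step 3 — including the repeated-entry edge case discussed above — can be sketched as follows (an illustration, not kimchi's implementation; it assumes every query value appears in $t$):

```python
from collections import Counter

def sort_by_table(f, t):
    # Walk the table t in order; after each table entry, emit the matching
    # query values from f. Repeated table entries keep t's order, so the
    # result follows t rather than grouping all equal values together.
    remaining = Counter(f)
    s = []
    for v in t:
        s.append(v)
        s.extend([v] * remaining.pop(v, 0))
    return s

# The edge case from the text: this yields the valid ordering
# {2, 2, 2, 1, 1, 2, 3, 3}, not the incorrect grouped one.
assert sort_by_table([1, 2, 2, 3], [2, 1, 2, 3]) == [2, 2, 2, 1, 1, 2, 3, 3]
```

Keeping repeated table entries in $t$'s own order is what lets the verifier rely on the committed columns of $t$ without re-sorting anything.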