Redo FSE Table Description section #3810

120 changes: 70 additions & 50 deletions doc/zstd_compression_format.md
@@ -1069,72 +1069,92 @@ from `0` to the last present one (included)
on a normalized scale of `1 << Accuracy_Log` .
Note that there must be two or more symbols with nonzero probability.

It's a bitstream which is read forward, in __little-endian__ fashion.
It's not necessary to know bitstream exact size,
it will be discovered and reported by the decoding process.
FSE tables are decoded from a bitstream which is read forward, in
__little-endian__ fashion. It is not necessary to know the exact size of
the bitstream; it will be discovered and reported by the decoding process.

The bitstream starts by reporting on which scale it operates.
Let's `low4Bits` designate the lowest 4 bits of the first byte :
Let `low4Bits` designate the lowest 4 bits of the first byte :
`Accuracy_Log = low4Bits + 5`.
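
As an illustration only, the forward little-endian read and the `Accuracy_Log` header could be sketched as follows. The helper names `read_bits` and `read_accuracy_log` are hypothetical and not part of the specification; the sketch assumes bit 0 of the stream is the least significant bit of the first byte.

```python
def read_bits(data: bytes, bit_pos: int, count: int):
    """Read `count` bits starting at absolute bit offset `bit_pos`.

    Assumption: bit 0 of the stream is the least significant bit of data[0],
    and earlier bits of a field are its less significant bits.
    Returns (value, new_bit_pos).
    """
    value = 0
    for i in range(count):
        byte_index, bit_index = divmod(bit_pos + i, 8)
        value |= ((data[byte_index] >> bit_index) & 1) << i
    return value, bit_pos + count


def read_accuracy_log(data: bytes):
    """Return (Accuracy_Log, bit position of the first probability field)."""
    low4_bits, bit_pos = read_bits(data, 0, 4)
    return low4_bits + 5, bit_pos
```

A field may straddle a byte boundary; the reader simply continues with the low bits of the next byte.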

Then follows each symbol value, from `0` to last present one.
The number of bits used by each field is variable.
It depends on :

- Remaining probabilities + 1 :
__example__ :
Presuming an `Accuracy_Log` of 8,
and presuming 100 probabilities points have already been distributed,
the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
is the smallest integer `T` that satisfies `(1 << T) > N`.
For each encoded symbol, minimum bit usage must be determined.
To do this, first compute
`Cumulative_Prob = Sum of Effective_Prob for all previously decoded symbols`
and `Max_Prob_Value = (1 << Accuracy_Log) - Cumulative_Prob + 1`.

Then, compute `Prob_Base_Size` as the position of the highest `1` bit in
`Max_Prob_Value` (that is, the largest integer `T` that satisfies
`(1 << T) <= Max_Prob_Value`), and then compute
`Large_Prob_Range = Max_Prob_Value - (1 << Prob_Base_Size) + 1` and
`Large_Prob_Start = (1 << Prob_Base_Size) - Large_Prob_Range`.
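
The four quantities above can be computed in a few lines. This is a minimal sketch; the function name `prob_field_params` is an assumption made for illustration, and `int.bit_length() - 1` is used to obtain the position of the highest set bit.

```python
def prob_field_params(accuracy_log: int, cumulative_prob: int):
    """Return (Max_Prob_Value, Prob_Base_Size, Large_Prob_Range, Large_Prob_Start)."""
    max_prob_value = (1 << accuracy_log) - cumulative_prob + 1
    # Position of the highest 1 bit: largest T with (1 << T) <= max_prob_value.
    prob_base_size = max_prob_value.bit_length() - 1
    large_prob_range = max_prob_value - (1 << prob_base_size) + 1
    large_prob_start = (1 << prob_base_size) - large_prob_range
    return max_prob_value, prob_base_size, large_prob_range, large_prob_start
```

For an `Accuracy_Log` of 8 with 100 points already distributed, `prob_field_params(8, 100)` gives `(157, 7, 30, 98)`.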

To decode a symbol probability, read `Prob_Base_Size` bits from the bitstream
as `Initial_Prob_Value`. If `Initial_Prob_Value >= Large_Prob_Start`, then
the decoder must read one additional bit from the bitstream. If that
bit is `1`, then compute
`Prob_Value = Initial_Prob_Value + Large_Prob_Range`
If the extra bit is `0`, or there is no extra bit, then
`Prob_Value = Initial_Prob_Value`.
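
A sketch of this read, assuming a hypothetical callable `read_bits(n)` that returns the next `n` bits of the stream as an integer (the callable and the function name are illustrative, not defined by the specification):

```python
from typing import Callable


def read_prob_value(read_bits: Callable[[int], int],
                    prob_base_size: int,
                    large_prob_range: int,
                    large_prob_start: int) -> int:
    """Decode one Prob_Value using the variable-length scheme described above."""
    initial = read_bits(prob_base_size)
    # The extra bit is only consumed when the initial value is in the long range.
    if initial >= large_prob_start and read_bits(1) == 1:
        return initial + large_prob_range
    return initial
```

The short-circuit `and` matters here: values below `Large_Prob_Start` must not consume the additional bit.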

If `Prob_Value` is `0`, then the probability is a special "less than 1" value.
The effect of the "less than 1" probability on the distribution table is
described in the [next section]. In this case, `Effective_Prob` for the
symbol is `1`.

If `Prob_Value` is non-zero, then the probability is computed as
`Probability = Prob_Value - 1`, and `Effective_Prob = Probability`.

If the `Probability` is zero, then it is followed by at least one repeat
count in the bitstream. Each repeat count is `2` bits. Each time a repeat
count of `3` is encountered, another repeat count is read until a repeat
count other than `3` is encountered. Once all repeat counts have been read,
compute `Symbol_Repeat_Count = Sum of all repeat counts for the symbol`.
The probability of the next `Symbol_Repeat_Count` symbols after the decoded
symbol is zero, and decoding of the probability values of those symbols is
skipped. The `Effective_Prob` of all skipped symbols is also `0`.
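
The repeat-count loop might look like the following sketch. As before, `read_bits(n)` is a hypothetical callable returning the next `n` bits, and the function name is chosen for illustration.

```python
from typing import Callable


def read_symbol_repeat_count(read_bits: Callable[[int], int]) -> int:
    """Sum 2-bit repeat counts until one other than 3 is read."""
    total = 0
    while True:
        repeat = read_bits(2)
        total += repeat
        if repeat != 3:
            return total
```

A sequence of repeat counts `3, 3, 1` therefore means that the next 7 symbols have a probability of zero.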

If, after decoding a probability, the sum of all decoded `Effective_Prob`
equals `1 << Accuracy_Log`, then decoding of symbols is completed and the
probability of all remaining symbols is `0`.

After decoding probabilities, further decoding resumes at the byte after
the last byte used or partially used by probability decoding. If the last
byte was partially used, then any unused bits are ignored.
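
Putting the steps together, a complete decoder for the table description could be sketched as below. This is an illustrative sketch under stated assumptions, not the reference implementation: the function name, the `max_symbol` parameter, and the use of `-1` to mark the "less than 1" probability are all choices made for this example, and validation of `Accuracy_Log` against per-table limits is omitted.

```python
def decode_fse_probabilities(data: bytes, max_symbol: int):
    """Return (probabilities, bytes_consumed); -1 marks a "less than 1" symbol."""
    bit_pos = 0

    def read_bits(count: int) -> int:
        # Forward read, least significant bit of each byte first (assumption).
        nonlocal bit_pos
        value = 0
        for i in range(count):
            byte_index, bit_index = divmod(bit_pos, 8)
            if byte_index >= len(data):
                raise ValueError("truncated FSE table description")
            value |= ((data[byte_index] >> bit_index) & 1) << i
            bit_pos += 1
        return value

    accuracy_log = read_bits(4) + 5
    total = 1 << accuracy_log
    cumulative = 0  # Cumulative_Prob
    probs = []
    while cumulative < total:
        max_prob_value = total - cumulative + 1
        base_size = max_prob_value.bit_length() - 1
        large_range = max_prob_value - (1 << base_size) + 1
        large_start = (1 << base_size) - large_range
        value = read_bits(base_size)
        if value >= large_start and read_bits(1) == 1:
            value += large_range
        if value == 0:
            probs.append(-1)   # "less than 1": Effective_Prob is 1
            cumulative += 1
            continue
        prob = value - 1
        probs.append(prob)
        cumulative += prob
        if prob == 0:
            # Zero probability: read 2-bit repeat counts until one is not 3.
            while True:
                repeat = read_bits(2)
                probs.extend([0] * repeat)
                if repeat != 3:
                    break
    # The last decoded entry is always non-zero, so an overlong list means a
    # non-zero probability was assigned outside the valid symbol range.
    if len(probs) > max_symbol + 1:
        raise ValueError("non-zero probability outside the valid symbol range")
    # Unused bits in the last byte are ignored.
    return probs, (bit_pos + 7) // 8
```

For example, `decode_fse_probabilities(bytes([0x10, 0x3F]), 1)` decodes an `Accuracy_Log` of 5 with two symbols of probability 16 each, consuming 2 bytes.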

- Value decoded : small values use 1 less bit :
__example__ :
Presuming values from 0 to 157 (inclusive) are possible,
255-157 = 98 values are remaining in an 8-bits field.
They are used this way :
first 98 values (hence from 0 to 97) use only 7 bits,
values from 98 to 157 use 8 bits.
This is achieved through this scheme :
If this process results in a non-zero probability for a value outside of the
valid range of values that the FSE table is defined for, then the data is
considered corrupted, even if the value is not used.

| Value read | Value decoded | Number of bits used |
| ---------- | ------------- | ------------------- |
| 0 - 97 | 0 - 97 | 7 |
| 98 - 127 | 98 - 127 | 8 |
| 128 - 225 | 0 - 97 | 7 |
| 226 - 255 | 128 - 157 | 8 |
In the case of offset codes, a decoder may reject a frame that encodes a
non-zero probability for an offset code that exceeds the decoder's
maximum allowed offset code. A compliant encoder should not emit non-zero
probabilities for offset codes that are greater than the highest offset
code value present in the data stream.

Symbols probabilities are read one by one, in order.
__Example__ :

Probability is obtained from Value decoded by following formula :
`Proba = value - 1`
Presuming an `Accuracy_Log` of 8,
and presuming 100 probability points have already been distributed,
`Max_Prob_Value` is `(1 << 8) - 100 + 1 == 157`.
Therefore, `Prob_Base_Size` is `floor(log2(157)) == 7` bits.

It means value `0` becomes negative probability `-1`.
`-1` is a special probability, which means "less than 1".
Its effect on distribution table is described in the [next section].
For the purpose of calculating total allocated probability points, it counts as one.
Presuming values from 0 to 157 (inclusive) are possible,
`Large_Prob_Range` is `157 - (1 << 7) + 1 == 30` and
`Large_Prob_Start` is `(1 << 7) - 30 == 98`.

[next section]:#from-normalized-distribution-to-decoding-tables
This results in the following mapping of input bits to probability:

When a symbol has a __probability__ of `zero`,
it is followed by a 2-bits repeat flag.
This repeat flag tells how many probabilities of zeroes follow the current one.
It provides a number ranging from 0 to 3.
If it is a 3, another 2-bits repeat flag follows, and so on.
| First 7 bits | Additional bit | Decoded value | Total bits used |
| ------------ | ------------- | -------------- | --------------- |
| 0 - 97 | Not used | 0 - 97 | 7 |
| 98 - 127 | 0 | 98 - 127 | 8 |
| 98 - 127 | 1 | 128 - 157 | 8 |

When last symbol reaches cumulated total of `1 << Accuracy_Log`,
decoding is complete.
If the last symbol makes cumulated total go above `1 << Accuracy_Log`,
distribution is considered corrupted.
If this process results in a non-zero probability for a value outside of the
valid range of values that the FSE table is defined for, even if that value is
not used, then the data is considered corrupted.

Then the decoder can tell how many bytes were used in this process,
and how many symbols are present.
The bitstream consumes a round number of bytes.
Any remaining bit within the last byte is just unused.
[next section]:#from-normalized-distribution-to-decoding-tables

#### From normalized distribution to decoding tables
