-
Notifications
You must be signed in to change notification settings - Fork 3.3k
Automatic detection of encoded data using CyberChef Magic
CyberChef v8 introduces the 'Magic' operation, designed to automatically detect how your data is encoded and which operations can be used to decode it. A number of methods are used to achieve this.
Many common data encoding schemes, such as Base64, Hexadecimal and Gzip, have predictable structures that can be detected using pattern matching techniques. Regular expressions have been written for all operations where this is the case. They are each run over the data and any matches are recorded. In some cases, multiple regular expressions are written for the same operation where different arguments can be applied, for example Base64 using a non-standard alphabet.
Example regular expression for Base64 data using the standard alphabet:
/^(?:[A-Z\d+/]{4})+(?:[A-Z\d+/]{2}==|[A-Z\d+/]{3}=)?$/i
Example regular expression for Base64 data using the y64 alphabet:
/^(?:[A-Z\d._]{4}){5,}(?:[A-Z\d._]{2}--|[A-Z\d._]{3}-)?$/i
For every pattern that matches, the corresponding operation is speculatively executed to determine what the output looks like. Various metrics are collected for each of these possible branches to determine whether they look like valid data or not. Each branch is also checked for further pattern matches, meaning that data under multiple levels of encoding can be unwrapped recursively. The maximum number of levels of recursion is controlled by the 'Depth' argument.
The methods used to detect how "valid" the data looks are as follows, ranging from simple to more complex techniques:
In many file formats, a magic number is included to allow trivial detection of the file type. If a magic byte sequence is found in a branch, it increases the likelihood that the correct decoding sequence has been found.
UTF-8 data has a well-defined structure which can be easily tested for. The presence of valid UTF-8 data may suggest that a valid decoding sequence has been found.
Shannon Entropy, in the context of information theory, is a measure of the rate at which information is produced by a source of data. It can be used, in a broad sense, to detect whether data is likely to be structured or unstructured. If a branch results in data with high entropy, it is possible that it has simply output unstructured, random garbage. Branches resulting in lower entropy data are ranked higher as they are more likely to contain repeating structures.
On average, the English language contains more "e"s than any other letter. In fact, given a long enough sample, the relative frequency of each character is very predictable, to the extent that we can consider any text not roughly matching these frequencies as unlikely to be English.
This set of frequencies can be expanded to include all possible bytes, incorporating punctuation, numbers, symbols and other formatting characters. To generate a set of accurate "truth data", the English language Wikipedia dump was downloaded, wiki syntax was stripped out, then the byte frequencies were calculated. The resulting values assume a character encoding of UTF-8.
For every branch created by the Magic operation, the byte frequencies for the output are calculated and then compared to this truth data using Pearson's chi-squared goodness of fit test. This process tells us how closely the branch's output matches the English language and therefore hopefully gives us an idea of how likely it is that we have found correctly decoded data, if that data includes a reasonably high proportion of English text.
Truth data was also generated for all other languages supported by Wikipedia. By default, only the top 38 languages are checked (based on the most popular languages used on the Internet, as listed on W3 Techs), however if 'Extensive language support' is selected, all 245 languages are supported.
The above methods have been optimised to run reasonably quickly over most types of input, however there are some methods which take considerably longer to run due to their high branching factor. These can be turned on by enabling the 'Intensive mode' argument.
For each branch, the data is converted into a number of different character encodings. If this conversion results in different data, a new branch is created and metrics are calculated as described above. This can help to detect the correct character encodings for mojibake (garbled data represented in the wrong encoding). Over 40 character encodings are currently supported.
Single byte XORs are carried out over every branch, creating 255 further branches to analyse. Bit rotates are also calculated, resulting in another 7 branches.
As well as being available as a standalone operation, CyberChef runs the 'Magic' operation automatically in a background thread whenever the Output is changed. If it manages to find an operation or set of operations that can help decode the data, the magic icon will be displayed in the Output pane. Hovering over this icon shows which operations are most likely to help and a snippet of what they will produce. Clicking the icon will append those operations to your recipe.