4.2 String Melody
I might as well reveal this terrible secret before we start: Tamgu's raison d'être is to manipulate strings. Everything else stems from this primary need. Yet most programming languages seem to treat character strings as a peripheral problem. String handling is often heavy, unintuitive, or downright cryptic (Perl????). The management of encodings in particular is a source of acute suffering for most programmers.
Even if most people agree on Unicode today, the fact remains that Unicode can take many disturbing forms. Most web pages are encoded in UTF8, while Windows prefers UTF16, and Linux and Mac OS root for UTF32.
So the C++ type "std::wstring" is 16 bits on Windows and 32 bits on Linux... For a standard type, this makes a lot of sense.
The result is quite frustrating: calculating the size of a string can be very complicated.
Let's take a simple example: "👌🏾".
Its size should be one. Unfortunately, in UTF8 its internal representation is actually 8 bytes:
[240,159,145,140,240,159,143,190]
In UTF16, we might hope to end up with a representation closer to 1. Unfortunately, this is not the case. Emojis have codes whose values exceed 65535, so the UTF16 representation of this character is made up of four 16-bit numbers:
[55357, 56396, 55356, 57342]
So will UTF32 finally provide us with the solution? Unfortunately not, this emoji character is actually made up of two Unicode characters: "👌" and its color: "🏾":
[128076,127998]
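These three lists can be reproduced with a few lines of Python. This is only a sketch for checking the numbers: the helper name `to_surrogates` is ours and is not part of any library, and it spells out the standard arithmetic that splits a code above 65535 into two 16-bit values.

```python
import struct

emoji = "\U0001F44C\U0001F3FE"  # the hand sign followed by its skin-tone modifier

# UTF32 view: one number per Unicode code point.
code_points = [ord(ch) for ch in emoji]

# UTF8 view: one number per byte.
utf8_bytes = list(emoji.encode("utf-8"))

# UTF16 view: read the little-endian byte stream as 16-bit units.
raw16 = emoji.encode("utf-16-le")
utf16_units = list(struct.unpack("<%dH" % (len(raw16) // 2), raw16))


def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a UTF16 surrogate pair (hypothetical helper)."""
    if code_point <= 0xFFFF:
        return [code_point]          # fits in a single 16-bit unit
    offset = code_point - 0x10000    # 20 significant bits remain
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return [high, low]


by_hand = [unit for cp in code_points for unit in to_surrogates(cp)]

print(code_points)             # [128076, 127998]
print(utf8_bytes)              # [240, 159, 145, 140, 240, 159, 143, 190]
print(utf16_units)             # [55357, 56396, 55356, 57342]
print(by_hand == utf16_units)  # True
```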
This explains why it takes 8 UTF8 bytes and 4 UTF16 numbers to represent a single character on the screen. We say to ourselves: okay, all this is very complicated, but fortunately most programming languages take it into account. Right?
Let's take the following example in Python:
u="👌🏾"
l = len(u)
Unfortunately, "l" is 2... And on the screen, you only see one character... A potentially wonderful source of incomprehensible bugs. Python does not understand that "🏾" is a complement to "👌".
Tamgu aims to harmonize the eyes and the code. If only one character appears on the screen, only one character must be counted when traversing the string. Otherwise, we might spend more time wondering about the origin of the "🏾" than coding our algorithm.
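To make the idea concrete, here is a rough Python sketch of counting what the eye sees. It is not Tamgu's algorithm and it is far from the full Unicode segmentation rules: we simply assume that skin-tone modifiers, variation selectors, combining marks and zero-width joiners stick to the preceding character. The name `visible_chars` is ours.

```python
import unicodedata

SKIN_TONES = range(0x1F3FB, 0x1F400)       # modifiers U+1F3FB..U+1F3FF
VARIATION_SELECTORS = range(0xFE00, 0xFE10)
ZWJ = 0x200D                               # zero-width joiner


def visible_chars(text):
    """Group code points into approximate on-screen characters (simplified)."""
    clusters = []
    join_next = False  # True right after a zero-width joiner
    for ch in text:
        cp = ord(ch)
        attaches = (
            cp in SKIN_TONES
            or cp in VARIATION_SELECTORS
            or cp == ZWJ
            or unicodedata.combining(ch) != 0
        )
        if clusters and (attaches or join_next):
            clusters[-1] += ch   # glue to the previous character
        else:
            clusters.append(ch)  # start a new on-screen character
        join_next = cp == ZWJ
    return clusters


u = "\U0001F44C\U0001F3FE"
print(len(u))                 # 2: Python counts code points
print(len(visible_chars(u)))  # 1: what appears on the screen
```

Flags, Hangul syllables built from separate jamos and other subtleties are deliberately left out of this sketch.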
Alas, emojis have invaded the world. It took three thousand years to move from hieroglyphics to the alphabet and it took less than ten years to return to a writing system that the Egyptians would have loved. So we can no longer shrug and let the programmer get away with weird string lengths.
First, Tamgu can automatically detect whether a string is in UTF8. In other words, when it traverses a string of bytes, it can immediately tell the composite bytes of UTF8 from the others. In this way, it is possible to handle hybrid strings without the system grinding its teeth. The only difficulty in this case is to correctly identify the type of Latin encoding. Here, on the other hand, human intervention is mandatory, but Tamgu still provides the means to convert any Latin encoding into UTF8. Tamgu also knows how to manage UTF16 under Windows, because here too, some characters can be composite. Finally, it knows how to handle multiple emojis whatever the initial encoding.
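The detection of UTF8 sequences inside a hybrid byte string can be sketched in Python as follows. This is our own illustration, not Tamgu's implementation: the function `decode_hybrid` and its `fallback` parameter are hypothetical names, and we assume that any byte that does not start a valid UTF8 sequence is Latin-1.

```python
def decode_hybrid(raw, fallback="latin-1"):
    """Decode bytes that mix UTF8 sequences with single-byte Latin characters."""
    out = []
    i = 0
    while i < len(raw):
        lead = raw[i]
        if lead < 0x80:                # plain ASCII
            out.append(chr(lead))
            i += 1
            continue
        # Expected sequence length, deduced from the lead byte.
        if 0xC2 <= lead <= 0xDF:
            n = 2
        elif 0xE0 <= lead <= 0xEF:
            n = 3
        elif 0xF0 <= lead <= 0xF4:
            n = 4
        else:
            n = 0                      # cannot start a UTF8 sequence
        chunk = raw[i:i + n]
        decoded = None
        if n and len(chunk) == n:
            try:
                decoded = chunk.decode("utf-8")  # strict validation
            except UnicodeDecodeError:
                decoded = None
        if decoded is None:
            # Not UTF8 here: treat this single byte as a Latin character.
            out.append(raw[i:i + 1].decode(fallback))
            i += 1
        else:
            out.append(decoded)
            i += n
    return "".join(out)


# "é" in UTF8 (2 bytes), a space, then "é" in Latin-1 (1 byte)
print(decode_hybrid(b"\xc3\xa9 \xe9"))  # é é
```

The ambiguity mentioned above remains: two Latin bytes that happen to form a valid UTF8 sequence will be read as UTF8, which is why the choice of Latin encoding still needs a human.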
Tamgu actually offers two types of string variables, which allows you to choose the underlying encoding. If you want to manipulate UTF8 or Latin strings, in other words if you want to manipulate byte strings, you can choose "string".
If, on the other hand, you want to manipulate Unicode strings, you can use the "ustring" type. It should be noted that "ustring" is encoded in UTF16 under Windows and in UTF32 elsewhere. Generally, this encoding is invisible to the programmer, but it has an impact on the speed with which the different characters of this string are accessed.
Note: To speed up processing, we use "check_large_char" implemented with AVX instructions to detect if direct access is possible (See: Accelerated processing of strings).
If you want to discover the underlying encoding of your platform, "bytes" is the right method for you. The "ord" method returns the Unicode codes of your character string.
Warning: in the case of Linux or Mac OS, these two methods return the same result.
It is the use of the "bytes" method that allowed us to extract the following underlying representation for: "👌🏾":
UTF8:[240,159,145,140,240,159,143,190] (everywhere)
UTF16:[55357, 56396, 55356, 57342] (Windows)
UTF32:[128076,127998] (Unix)
Tamgu has remained faithful to the majority of programming languages in this world by using "[]" to access the characters in a string. As you will see in the next descriptions, we have taken the liberty of innovating a little.
string s = "👌🏾: This is an example.";
You can of course use numerical indexes.
s[0] returns "👌🏾".
s[1] returns ":"
s[2] returns " "
s[3] returns "T".
You can use negative indexes:
s[-1] returns "."
s[-2] returns "e"
s[-3] returns "l" etc...
You can also use multiple indexes:
s[0:2] returns "👌🏾:"
s[3:7] returns "This".
You can also omit one of the indexes:
s[:2] returns "👌🏾:"
s[3:] returns "This is an example."
This is very similar to what Python offers.
Except that Tamgu goes a little further. You can also use strings as indexes:
s["T":] returns "This is an example."
s["T": "x"] returns "This is an ex"
And we can also combine alphabetical indexes with numerical indexes. But in this case, the numerical index indicates the number of characters after the substring.
s["T":10] returns "This is an " (the "T" plus the 10 characters that follow it, trailing space included)
Finally, we can use negative alphabetical indexes. In this case, Tamgu searches for the substring backward, starting at the end.
s["i":10] returns "is is an ex"
s[-"i":10] returns "is an examp"
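For readers who think in Python, the behaviour of these alphabetical indexes can be approximated with the sketch below. The function `between` and its parameters are our own invention, and returning an empty string when nothing is found is an assumption on our part, not something the examples above tell us.

```python
def between(s, start, end=None, from_end=False):
    """Approximate s[start:end] where start is a substring.

    end may be None (up to the end of s), a substring (included in the
    result) or an integer (number of characters kept after start).
    from_end=True mimics the negative form: start is searched backward.
    """
    pos = s.rfind(start) if from_end else s.find(start)
    if pos == -1:
        return ""                 # assumption: nothing found, nothing returned
    after = pos + len(start)
    if end is None:
        return s[pos:]
    if isinstance(end, int):
        return s[pos:after + end]
    stop = s.find(end, after)
    if stop == -1:
        return ""                 # same assumption as above
    return s[pos:stop + len(end)]


s = "👌🏾: This is an example."
print(between(s, "T"))                     # This is an example.
print(between(s, "T", "x"))                # This is an ex
print(between(s, "i", 10))                 # is is an ex
print(between(s, "i", 10, from_end=True))  # is an examp
```

Note that Python counts code points here, so an emoji with a modifier inside the counted range would consume two of the 10 characters.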
Tamgu also offers regular expressions. These come in two flavors: treg and preg. Behind these barbaric names are in fact two types of expressions:
Tamgu regular expressions (r"...")
Posix regular expressions (p"...")
You can use these expressions directly as indexes.
I will not present the pregs here: they are based on the traditional regular expressions of the Unix world and are implemented with the "std::regex" library in C++.
tregs are regular expressions specific to Tamgu, compiled in the form of finite state automata. Their formalism is very simple and is based on the following guidelines:
%a: means an alphabetical character
%c: means a lower-case alphabetical character
%C: a capital alphabetical character
%d: means a number (a digit)
%e: means an emoji
%H: means a Korean character (Hangul)
%p: means a punctuation
%r: means a carriage return
%s: means a space
%x: a hexadecimal digit
?: any character
%?: the '?' character itself
%%: the '%' character itself
\xFFFF: Unicode code in hexadecimal
{...}: character disjunction (no space in between since these will be interpreted as actual characters)
[...]: character sequence in a disjunction
+, *: Kleene operators
~: the negation
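To get a feel for this formalism, here is a small Python sketch that translates a subset of it into Python's "re" syntax. Everything in it is an approximation of ours: the function `treg_to_re` is hypothetical, the classes are limited to ASCII where noted, and emojis, Hangul, "\xFFFF" codes, "{...}", "[...]" and "~" are simply rejected.

```python
import re

# Assumed equivalents for the character classes (approximations)
CLASSES = {
    "a": r"[^\W\d_]",         # alphabetical character
    "c": "[a-z]",             # lower case (ASCII only here)
    "C": "[A-Z]",             # upper case (ASCII only here)
    "d": r"\d",               # digit
    "p": r"[!-/:-@\[-`{-~]",  # ASCII punctuation
    "r": r"\r",               # carriage return
    "s": r"\s",               # space (any whitespace here)
    "x": "[0-9A-Fa-f]",       # hexadecimal digit
}


def treg_to_re(pattern):
    """Translate a small subset of the formalism above into a Python regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in CLASSES:
                out.append(CLASSES[nxt])
            elif nxt in "?%":
                out.append(re.escape(nxt))  # the character itself
            else:
                raise ValueError("unsupported class: %" + nxt)
            i += 2
        elif ch in "{}[]~":
            raise ValueError("unsupported operator: " + ch)
        elif ch == "?":
            out.append(".")                 # any character
            i += 1
        elif ch in "+*":
            out.append(ch)                  # Kleene operators
            i += 1
        else:
            out.append(re.escape(ch))       # literal character
            i += 1
    return "".join(out)


print(treg_to_re("%d+"))  # \d+
print(re.search(treg_to_re("%C%a+"), "This is an example.").group(0))  # This
```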
string s = "Yooo:10 Wesdenesday, 20 Saturday.";
s[r"%d+"] returns "10"
s[-r"%d+"] returns "20"
s[r'%a+day'] returns "Wesdenesday"
s[-r'%a+day'] returns "Saturday"
s[p'\w+day'] returns "Wesdenesday"
s[-p'\w+day'] returns "Saturday"
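For comparison, the same first and last matches take a bit more ceremony with Python's "re" module. The helpers `first_match` and `last_match` are our own names, and taking the last element of all the matches is only an approximation of a backward search: it gives the same answers on this example, nothing more is claimed.

```python
import re

s = "Yooo:10 Wesdenesday, 20 Saturday."


def first_match(pattern, text):
    """Return the first match of pattern in text, or "" if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""


def last_match(pattern, text):
    """Return the last match of pattern in text, or "" if there is none."""
    matches = re.findall(pattern, text)
    return matches[-1] if matches else ""


print(first_match(r"\d+", s))     # 10
print(last_match(r"\d+", s))      # 20
print(first_match(r"\w+day", s))  # Wesdenesday
print(last_match(r"\w+day", s))   # Saturday
```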
Tamgu, unlike Python, considers strings to be "modifiable". It goes further than that. Not only is it possible to modify a character in a string, but this string can also be modified via multiple indexes.
string s = "👌🏾: This is an example.";
s["T": "x"] = "000";
//Now "s" contains: 👌🏾: 000ample. (the span "This is an ex" has been replaced)
//You can also use regular expressions
s = "👌🏾: This is an example.";
s[r"%C%a+"] = "Ceci"; //can match "This"
//Now "s" contains: 👌🏾: Ceci is an example.
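By way of contrast, here is how the regular expression replacement looks in Python, where strings cannot be modified in place. This is only a sketch: we use "[A-Z][a-z]+" as a rough ASCII stand-in for the pattern above, and "re.sub" builds a brand new string that is then bound to the same name.

```python
import re

s = "👌🏾: This is an example."

# Python strings are immutable: re.sub returns a new string,
# which we assign back to "s". count=1 replaces the first match only.
s = re.sub(r"[A-Z][a-z]+", "Ceci", s, count=1)

print(s)  # 👌🏾: Ceci is an example.
```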
We will end with the "in" operator, which is particularly useful for detecting the presence of a substring in a string.
In the same way as the "[...]" operator, "in" accepts without hesitation all forms of querying.
if ("is" in s) ... //Is the string "is" found in s?
if (r"%d+" in s) ... //Are there any numbers in the string?
if (p"\w+day" in s) ... //Do we have the name of a day in s?
The result of "in" can also be stored in a string vector (svector), which then receives every match:
svector vs;
string s = "👌🏾: This is an example.";
vs = "is" in s; //we look for all the occurrences of "is" in s
//vs contains ["is","is"], not very informative but possible
s = "Yooo:10 Wesdenesday, 20 Saturday.";
vs = r"%d+" in s; //we look for ALL the numbers in s
// vs contains ["10", "20"]
vs = p"\w+day" in s; //we search for ALL the day names in s
// vs contains ["Wesdenesday", "Saturday"]
If we replace "svector" with "ivector" (vector of integers), we will get all the positions of the substrings in s.
ivector iv;
string s = "👌🏾: This is an example.";
iv = "is" in s; //we look for the positions of all the "is" in s
//iv contains [5,8]
s = "Yooo:10 Wesdenesday, 20 Saturday.";
iv = r"%d+" in s; //we look for ALL the numbers in s
//iv contains [5,7,21,23]
iv = p"\w+day" in s; //we search for ALL the day names in s
//iv contains [8,19,24,32]
In the last two cases, where a regular expression is used, iv contains the beginning and the end of each substring.
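The same lists can be obtained in Python with "re.finditer", as in the sketch below (the helper names `occurrences`, `spans` and `starts` are ours). The last line is the interesting one: Python counts the emoji as two code points, so it reports [6, 9] where the character-based positions given above are [5, 8].

```python
import re


def occurrences(pattern, text):
    """All the substrings matching pattern, in order."""
    return [m.group(0) for m in re.finditer(pattern, text)]


def spans(pattern, text):
    """Flat list of the beginning and end of each match."""
    flat = []
    for m in re.finditer(pattern, text):
        flat.extend([m.start(), m.end()])
    return flat


def starts(sub, text):
    """Starting positions of a plain substring."""
    return [m.start() for m in re.finditer(re.escape(sub), text)]


s = "Yooo:10 Wesdenesday, 20 Saturday."
print(occurrences(r"\d+", s))     # ['10', '20']
print(occurrences(r"\w+day", s))  # ['Wesdenesday', 'Saturday']
print(spans(r"\d+", s))           # [5, 7, 21, 23]
print(spans(r"\w+day", s))        # [8, 19, 24, 32]

# Positions counted in code points, not in on-screen characters
print(starts("is", "👌🏾: This is an example."))  # [6, 9]
```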