Code for efficiently counting the number of chars in a UTF-8 encoded string.

Broadly, UTF-8 encodes chars as a “leading” byte which begins the char, followed by some number (possibly 0) of continuation bytes.

The leading byte can have a number of bit-patterns (with the specific pattern indicating how many continuation bytes follow), but the continuation bytes are always in the format 0b10XX_XXXX (where the Xs can take any value). That is, the most significant bit is set, and the second most significant bit is unset.
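The bit pattern described above can be tested with a single mask-and-compare. The following is a minimal sketch; the helper name `is_continuation_byte` is hypothetical and chosen only for illustration, not taken from the code this module documents.

```rust
/// Hypothetical helper: returns `true` if `b` has the form `0b10XX_XXXX`,
/// i.e. the most significant bit is set and the second most significant
/// bit is unset.
fn is_continuation_byte(b: u8) -> bool {
    // Keep only the top two bits and compare them against `10`.
    (b & 0b1100_0000) == 0b1000_0000
}
```

For instance, U+00E9 is encoded in UTF-8 as the two bytes `0xC3 0xA9`: the first is a leading byte and the second is a continuation byte, while any ASCII byte such as `b'a'` is a leading byte with no continuation bytes after it.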

To count the number of chars, we can just count the bytes in the string that are not continuation bytes, and this can be done many bytes at a time fairly easily.
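As a rough illustration of the idea, the sketch below counts non-continuation bytes first one byte at a time, then eight bytes at a time by packing them into a `u64`. Both function names are hypothetical, and the word-at-a-time variant uses a simple mask plus `count_ones`, which is an assumption made for brevity and not necessarily the technique the actual implementation uses.

```rust
/// Hypothetical reference version: one byte at a time.
fn count_chars_bytewise(s: &str) -> usize {
    s.as_bytes()
        .iter()
        // A byte is a non-continuation byte unless its top two bits are `10`.
        .filter(|&&b| (b & 0b1100_0000) != 0b1000_0000)
        .count()
}

/// Hypothetical word-at-a-time version: eight bytes per step.
fn count_chars_wordwise(s: &str) -> usize {
    // A word with the lowest bit of every byte set.
    const LSB: u64 = 0x0101_0101_0101_0101;

    let mut chunks = s.as_bytes().chunks_exact(8);
    let mut total = 0usize;

    for chunk in &mut chunks {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        // Byte order does not matter here because we only count bytes.
        let w = u64::from_le_bytes(buf);

        // For each byte, move `!bit7` and `bit6` down to bit 0 and OR them.
        // The result is 1 exactly when the byte is NOT of the form
        // `0b10XX_XXXX`. Bits shifted in from neighbouring bytes land in
        // positions other than bit 0 and are discarded by the mask.
        let non_continuation = ((!w >> 7) | (w >> 6)) & LSB;
        total += non_continuation.count_ones() as usize;
    }

    // Handle the final 0..=7 bytes one at a time.
    total += chunks
        .remainder()
        .iter()
        .filter(|&&b| (b & 0b1100_0000) != 0b1000_0000)
        .count();

    total
}
```

Both functions should agree with `s.chars().count()` for any `&str`, since a `&str` is always valid UTF-8 and every char contributes exactly one non-continuation byte.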

Note: Because the term “leading byte” can sometimes be ambiguous (for example, it could also refer to the first byte of a slice), we’ll often use the term “non-continuation byte” to refer to these bytes in the code.