A character type.
The char type represents a single character. More specifically, since
‘character’ isn’t a well-defined concept in Unicode, char is a ‘Unicode
scalar value’.
This documentation describes a number of methods and trait implementations on the
char type. For technical reasons, there is additional, separate
documentation in the std::char module as well.
Validity
A char is a ‘Unicode scalar value’, which is any ‘Unicode code point’
other than a surrogate code point. This has a fixed numerical definition:
code points are in the range 0 to 0x10FFFF, inclusive.
Surrogate code points, used by UTF-16, are in the range 0xD800 to 0xDFFF.
No char may be constructed, whether as a literal or at runtime, that is not a
Unicode scalar value:
// Undefined behaviour
let _ = unsafe { char::from_u32_unchecked(0x110000) };
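When a value is not already known to be a Unicode scalar value, the checked char::from_u32 conversion avoids the undefined behaviour above by returning None. A minimal sketch follows; the helper name checked_scalar is hypothetical and only wraps the standard call:

```rust
/// Hypothetical helper: converts a raw code point to a `char`.
/// Returns `None` for surrogates (0xD800..=0xDFFF) and for values
/// above 0x10FFFF, because `char::from_u32` rejects both.
fn checked_scalar(code_point: u32) -> Option<char> {
    char::from_u32(code_point)
}
```

For example, checked_scalar(0x61) yields Some('a'), while checked_scalar(0xD800) and checked_scalar(0x110000) both yield None.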
USVs are also the exact set of values that may be encoded in UTF-8. Because
char values are USVs and str values are valid UTF-8, it is safe to store
any char in a str or read any character from a str as a char.
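The round trip between char and UTF-8 text can be sketched as follows. The helper name round_trip is hypothetical; it relies only on String::push, str::chars and char::len_utf8 from the standard library:

```rust
/// Hypothetical helper: stores one `char` in a `String` as UTF-8 and
/// reads it back. Returns `None` only if the string does not decode to
/// exactly one `char` of the expected encoded length.
fn round_trip(c: char) -> Option<char> {
    let mut s = String::new();
    s.push(c); // encodes `c` as 1 to 4 bytes of UTF-8

    // The number of bytes written matches `char::len_utf8`.
    if s.len() != c.len_utf8() {
        return None;
    }

    let mut chars = s.chars();
    let first = chars.next();
    // Exactly one `char` should come back out.
    if chars.next().is_some() {
        return None;
    }
    first
}
```

Every char, including the boundary values '\u{D7FF}' and '\u{E000}' on either side of the surrogate gap, should come back unchanged.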
The gap in valid char values is understood by the compiler, so in the
example below the two ranges are understood to cover the whole range of
possible char values and there is no error for a non-exhaustive match.
let c: char = 'a';
match c {
    '\0' ..= '\u{D7FF}' => false,
    '\u{E000}' ..= '\u{10FFFF}' => true,
};
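The same gap shows up when iterating over a range of char values: the iterator skips the surrogate code points. As an illustrative sketch (both function names are hypothetical), the number of possible char values can be counted directly and compared with the arithmetic from the ranges given earlier:

```rust
/// Counts every possible `char` by iterating the full inclusive range.
/// Ranges of `char` step over the surrogate gap automatically.
fn count_all_chars() -> usize {
    ('\0'..='\u{10FFFF}').count()
}

/// The same number derived from the definitions: all code points
/// (0 to 0x10FFFF) minus the surrogates (0xD800 to 0xDFFF).
fn expected_scalar_count() -> usize {
    let code_points: usize = 0x10FFFF + 1;
    let surrogates: usize = 0xDFFF - 0xD800 + 1;
    code_points - surrogates
}
```

Both functions give 1,112,064, which is 0x110000 code points minus the 2,048 surrogates.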
All USVs are valid char values, but not all of them represent a real
character. Many USVs are not currently assigned to a character, but may be
in the future (“reserved”); some will never be a character
(“noncharacters”); and some may be given different meanings by different
users (“private use”).
Representation
char is always four bytes in size. This is a different representation than
a given character would have as part of a String. For example:
let v = vec!['h', 'e', 'l', 'l', 'o'];
// five elements times four bytes for each element
assert_eq!(20, v.len() * std::mem::size_of::<char>());
let s = String::from("hello");
// five elements times one byte per element
assert_eq!(5, s.len() * std::mem::size_of::<u8>());
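The one-byte-per-element figure above holds only because "hello" is ASCII. In general a char takes one to four bytes when encoded as UTF-8, as reported by char::len_utf8. A small sketch comparing the two representations, where the helper name storage_sizes is hypothetical:

```rust
/// Hypothetical helper: returns the bytes `text` occupies as UTF-8
/// and the bytes the same characters occupy as a slice of `char`.
fn storage_sizes(text: &str) -> (usize, usize) {
    // Sum of the per-character UTF-8 lengths (1 to 4 bytes each);
    // this equals `text.len()`.
    let utf8_bytes: usize = text.chars().map(char::len_utf8).sum();
    // Every `char` takes `size_of::<char>()` bytes, regardless of value.
    let char_bytes = text.chars().count() * std::mem::size_of::<char>();
    (utf8_bytes, char_bytes)
}
```

For "hello" this gives (5, 20); for a single four-byte character such as '\u{1F4A3}' the two sizes are equal, (4, 4).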
As always, remember that a human intuition for ‘character’ might not map to Unicode’s definitions. For example, despite looking similar, the ‘é’ character is one Unicode code point while ‘é’ is two Unicode code points:
let mut chars = "é".chars();
// U+00e9: 'latin small letter e with acute'
assert_eq!(Some('\u{00e9}'), chars.next());
assert_eq!(None, chars.next());
let mut chars = "é".chars();
// U+0065: 'latin small letter e'
assert_eq!(Some('\u{0065}'), chars.next());
// U+0301: 'combining acute accent'
assert_eq!(Some('\u{0301}'), chars.next());
assert_eq!(None, chars.next());
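A consequence worth spelling out is that the two strings also compare as unequal, since string comparison looks at the encoded code points and applies no normalization. The sketch below writes both forms with escapes so that the difference is visible in the source; the helper name same_code_points is hypothetical:

```rust
/// Hypothetical helper: true if both strings consist of exactly the
/// same sequence of `char` values. No Unicode normalization is applied.
fn same_code_points(a: &str, b: &str) -> bool {
    a.chars().eq(b.chars())
}

/// Returns (number of `char`s, number of UTF-8 bytes) for `s`.
fn code_points_and_bytes(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}
```

With the precomposed form "\u{00e9}" and the decomposed form "e\u{0301}", same_code_points returns false even though both render alike, and code_points_and_bytes reports (1, 2) and (2, 3) respectively.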
This means that the contents of the first string above will fit into a
char while the contents of the second string will not. Trying to create
a char literal with the contents of the second string gives an error:
error: character literal may only contain one codepoint: 'é'
let c = 'é';
^^^
Another implication of the 4-byte fixed size of a char is that
per-char processing can end up using a lot more memory:
let s = String::from("love: ❤️");
let v: Vec<char> = s.chars().collect();
assert_eq!(12, std::mem::size_of_val(&s[..]));
assert_eq!(32, std::mem::size_of_val(&v[..]));
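When random access to individual characters is not needed, one way to avoid the larger buffer is to iterate over the string lazily instead of collecting into a Vec. This is only a sketch of that idea, and the helper name count_non_ascii is hypothetical:

```rust
/// Hypothetical helper: counts the non-ASCII `char`s in `s` by walking
/// the UTF-8 data in place. No intermediate `Vec<char>` is allocated;
/// each `char` is decoded on the fly by the `chars()` iterator.
fn count_non_ascii(s: &str) -> usize {
    s.chars().filter(|c| !c.is_ascii()).count()
}
```

For the string used above, written with escapes as "love: \u{2764}\u{FE0F}", the result is 2: the heart and the variation selector that follows it.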