15 Facts About UTF-8

1.

UTF-8 is a variable-width character encoding used for electronic communication.

FactSnippet No. 1,567,753
2.

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte code units.
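As a rough illustration of the one-to-four-byte range, the Python sketch below encodes one sample character from each length class. The sample characters and the helper name `utf8_length` are chosen here for illustration only; the code point count is derived from the Unicode range minus the surrogate block.

```python
# One sample character per UTF-8 length class (illustrative choices).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # U+0041, U+00E9, U+20AC, U+1F600


def utf8_length(ch: str) -> int:
    """Return how many one-byte code units UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")

# 0x110000 code points in total, minus the 2,048 surrogates (U+D800..U+DFFF),
# which are not valid characters on their own.
valid_code_points = 0x110000 - (0xDFFF - 0xD800 + 1)
```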

FactSnippet No. 1,567,754
3.

UTF-8 was designed as a superior alternative to UTF-1, a proposed variable-width encoding with partial ASCII compatibility which lacked some features including self-synchronization and fully ASCII-compatible handling of characters such as slashes.
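Self-synchronization can be sketched in a few lines of Python: continuation bytes always have the bit pattern 10xxxxxx, so a reader dropped into the middle of a stream can skip forward to the next character boundary. The helper name `next_char_start` is hypothetical and the sample string is arbitrary.

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the first index at or after pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "a\u20acb".encode("utf-8")  # bytes: 61 e2 82 ac 62

# Landing inside the three-byte euro sign (index 2) resynchronizes at "b".
resync = next_char_start(data, 2)

# ASCII-compatible handling: the byte for "/" (0x2F) never appears inside a
# multi-byte sequence, because every such byte has its high bit set.
euro_bytes = "\u20ac".encode("utf-8")
```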

FactSnippet No. 1,567,755
4.

Many early UTF-8 decoders ignored incorrect bits and accepted overlong encodings. Carefully crafted invalid UTF-8 could make such decoders either skip or create ASCII characters such as NUL, slash, or quotes.
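To make the risk concrete, the sketch below contrasts a deliberately lenient two-byte decoder with Python's built-in strict decoder. `naive_two_byte_decode` is hypothetical and stands in for a decoder that does no range checks; it is not taken from any real product.

```python
from typing import Optional


def naive_two_byte_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: merges payload bits with no validity checks."""
    lead, cont = data
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


def strict_decode(data: bytes) -> Optional[str]:
    """Decode with Python's strict UTF-8 codec; return None for invalid input."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


overlong_slash = b"\xc0\xaf"  # overlong two-byte form of "/" (U+002F)
overlong_nul = b"\xc0\x80"    # overlong two-byte form of NUL (U+0000)

# The lenient decoder "creates" ASCII characters that a byte-level filter
# looking for 0x2F or 0x00 would never have seen in the input.
leaked_slash = naive_two_byte_decode(overlong_slash)
leaked_nul = naive_two_byte_decode(overlong_nul)
```

A validating decoder, such as the strict codec wrapped by `strict_decode`, rejects both sequences outright.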

FactSnippet No. 1,567,756
5.

Invalid UTF-8 has been used to bypass security validations in high-profile products including Microsoft's IIS web server and Apache's Tomcat servlet container.

FactSnippet No. 1,567,757
6.

The Unicode Standard neither requires nor recommends a byte order mark (BOM) for UTF-8. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and refuses to correctly interpret UTF-8 unless the first character is a BOM.
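One way to cope with both kinds of software is a codec that tolerates the BOM on input. The sketch below uses Python's standard `utf-8-sig` codec, which writes a BOM when encoding and strips one, if present, when decoding; the sample text is arbitrary.

```python
import codecs

text = "hello"

# Encoding with utf-8-sig prepends the three-byte BOM (EF BB BF).
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

# The plain utf-8 codec keeps the BOM as a U+FEFF character in the result.
decoded_plain = with_bom.decode("utf-8")

# utf-8-sig strips a leading BOM and also accepts input that has none.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_sig_no_bom = without_bom.decode("utf-8-sig")
```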

FactSnippet No. 1,567,758
7.

UTF-8 is the recommendation from the WHATWG for HTML and DOM specifications, and the Internet Mail Consortium recommends that all e-mail programs be able to display and create mail using UTF-8.

FactSnippet No. 1,567,759
8.

UTF-8 has been the most common encoding for the World Wide Web since 2008.

FactSnippet No. 1,567,760
9.

For local text files, UTF-8 usage is lower, and many legacy single-byte encodings remain in use.

FactSnippet No. 1,567,761
10.

The primary cause of lower UTF-8 adoption in local text files is editors that do not display or write UTF-8 unless the first character in a file is a byte order mark, making it impossible for other software to use UTF-8 without being rewritten to ignore the byte order mark on input and add it on output.

FactSnippet No. 1,567,762
11.

UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993.

FactSnippet No. 1,567,763
12.

CESU-8 encoding can result from converting UTF-16 data with supplementary characters to UTF-8, using conversion methods that assume UCS-2 data, meaning they are unaware of four-byte UTF-16 supplementary characters.
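The effect can be reproduced with a small Python sketch that imitates a UCS-2-minded converter: it walks the UTF-16 code units one at a time and encodes each as if it were a complete character. The function name `cesu8_encode` is hypothetical, and the `surrogatepass` error handler is used only so that Python will emit the lone surrogates.

```python
import struct


def cesu8_encode(text: str) -> bytes:
    """Encode each UTF-16 code unit separately, as a UCS-2-only converter would."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for (unit,) in struct.iter_unpack(">H", utf16):
        # surrogatepass lets the UTF-8 codec emit surrogate halves as 3 bytes each.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


supplementary = "\U0001F600"  # a character outside the Basic Multilingual Plane

standard = supplementary.encode("utf-8")  # four bytes: f0 9f 98 80
cesu = cesu8_encode(supplementary)        # six bytes: ed a0 bd ed b8 80
```

For text that stays within the Basic Multilingual Plane, both routes produce identical bytes, which is why the difference often goes unnoticed.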

FactSnippet No. 1,567,764
13.

All known Modified UTF-8 implementations treat the surrogate pairs as in CESU-8.

FactSnippet No. 1,567,765
14.

In normal usage, Java supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter.

FactSnippet No. 1,567,766
15.

Extensions have been created to allow any byte sequence that is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to reserved code points, and transforming those code points back to error bytes to output UTF-8.
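Python's `surrogateescape` error handler is one example of this approach: each undecodable byte 0x80..0xFF is mapped to a lone surrogate in U+DC80..U+DCFF, and mapped back to the same byte when encoding. The sample bytes below are arbitrary.

```python
# Latin-1 style bytes that are not valid UTF-8.
raw = b"caf\xe9 \xff ok"

# Each error byte becomes a reserved code point: 0xE9 -> U+DCE9, 0xFF -> U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns those code points back into the
# original error bytes, so the round trip is lossless.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

Without the handler on output, the strict codec refuses to encode the lone surrogates, which keeps the escaped bytes from leaking into data that claims to be valid UTF-8.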

FactSnippet No. 1,567,767