15 Facts About UTF-8

UTF-8 is a variable-width character encoding used for electronic communication.

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte code units.
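
A minimal sketch of the one-to-four-byte range, using Python's built-in `utf-8` codec. The sample characters and the helper name `utf8_length` are chosen here for illustration only; the code point count is derived by subtracting the 2,048 surrogate code points from the full Unicode range.

```python
# Total code points (U+0000..U+10FFFF) minus the surrogate range (U+D800..U+DFFF).
VALID_CODE_POINTS = 0x110000 - (0xDFFF - 0xD800 + 1)


def utf8_length(ch: str) -> int:
    """Return how many one-byte code units UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# Illustrative samples: one character from each UTF-8 length class.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({utf8_length(ch)} byte(s))")

print(f"{VALID_CODE_POINTS:,} valid code points")
```

ASCII characters stay at one byte, which is what makes existing ASCII text valid UTF-8 without conversion.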

UTF-8 was designed as a superior alternative to UTF-1, a proposed variable-width encoding with partial ASCII compatibility which lacked some features including self-synchronization and fully ASCII-compatible handling of characters such as slashes.
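
Self-synchronization can be sketched as follows: continuation bytes always have the bit pattern `10xxxxxx`, so a reader dropped at an arbitrary byte offset can back up to the start of the current character. The function name `char_start` is hypothetical, and the sketch assumes the input is valid UTF-8.

```python
def char_start(data: bytes, index: int) -> int:
    """Back up from an arbitrary byte offset to the start of the enclosing character.

    Assumes `data` is valid UTF-8 and 0 <= index < len(data).
    """
    # Continuation bytes look like 10xxxxxx; ASCII and lead bytes never do.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


data = "a€b".encode("utf-8")  # bytes: 61 e2 82 ac 62
print(char_start(data, 3))    # offset 3 is inside "€", which starts at offset 1
print(char_start(data, 4))    # offset 4 is "b", already a character start
```

Because at most three continuation bytes can follow a lead byte, the loop never steps back more than three positions on valid input.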

Many early UTF-8 decoders accepted malformed input such as overlong encodings. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL, slash, or quotes.
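
One way this can happen is an overlong encoding, where an ASCII character is written in more bytes than necessary. The sketch below contrasts a deliberately naive decoder (`lenient_decode_two_byte`, a hypothetical function written for this illustration, not taken from any real product) with Python's built-in strict decoder.

```python
OVERLONG_SLASH = b"\xc0\xaf"  # overlong two-byte form of "/" (U+002F)
OVERLONG_NUL = b"\xc0\x80"    # overlong two-byte form of NUL (U+0000)


def lenient_decode_two_byte(seq: bytes) -> str:
    """Hypothetical naive decoder: masks payload bits, never checks for overlong forms."""
    lead, cont = seq
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


def strict_decode(seq: bytes):
    """Decode with Python's validating UTF-8 codec; return None for invalid input."""
    try:
        return seq.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(repr(lenient_decode_two_byte(OVERLONG_SLASH)))  # the naive decoder yields '/'
print(strict_decode(OVERLONG_SLASH))                  # the strict decoder rejects it: None
```

A filter that scans the raw bytes for `/` before a lenient decode would miss the slash entirely, which is the general shape of the validation bypass.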

Invalid UTF-8 has been used to bypass security validations in high-profile products including Microsoft's IIS web server and Apache's Tomcat servlet container.

Although the Unicode Standard neither requires nor recommends a byte order mark (BOM) for UTF-8, there was and still is software that always inserts a BOM when writing UTF-8 and refuses to correctly interpret UTF-8 unless the first character is a BOM.

UTF-8 is the recommendation from the WHATWG for HTML and DOM specifications, and the Internet Mail Consortium recommends that all e-mail programs be able to display and create mail using UTF-8.

UTF-8 has been the most common encoding for the World Wide Web since 2008.

UTF-8 usage is lower for local text files, where many legacy single-byte encodings remain in use.

10.

The primary cause of the lower UTF-8 usage in local text files is editors that do not display or write UTF-8 unless the first character in a file is a byte order mark (BOM). Other software then cannot use UTF-8 without being rewritten to ignore the BOM on input and add it on output.
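
As a rough illustration of ignoring the mark on input and adding it on output, Python's standard `utf-8-sig` codec does both. The sample string is arbitrary; this is a sketch of one way to interoperate with BOM-dependent tools, not a recommendation to write BOMs in general.

```python
import codecs

text = "héllo"

with_bom = text.encode("utf-8-sig")  # prepends the three BOM bytes EF BB BF
without_bom = text.encode("utf-8")   # plain UTF-8, no BOM

print(with_bom.startswith(codecs.BOM_UTF8))  # True

# Decoding with "utf-8-sig" strips a leading BOM if present and accepts input without one.
print(with_bom.decode("utf-8-sig") == text)     # True
print(without_bom.decode("utf-8-sig") == text)  # True

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
print(repr(with_bom.decode("utf-8")[0]))
```

Reading with `utf-8-sig` and writing with plain `utf-8` is a common way to accept files from BOM-writing editors without propagating the mark.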

11.

UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993.

12.

CESU-8 encoding can result from converting UTF-16 data with supplementary characters to UTF-8, using conversion methods that assume UCS-2 data, meaning they are unaware of four-byte UTF-16 supplementary characters.
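
The conversion path described above can be sketched as follows: a supplementary character is first split into its UTF-16 surrogate pair, and each surrogate is then encoded as if it were an ordinary three-byte character. The function name `cesu8_encode` is hypothetical; Python has no built-in CESU-8 codec, so the sketch relies on the standard `surrogatepass` error handler to encode the individual surrogates.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: supplementary characters become two 3-byte surrogates.

    Assumes `text` contains no lone surrogates.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            # Encode each surrogate separately, as a UCS-2-only converter would.
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


print("😀".encode("utf-8").hex(" "))  # f0 9f 98 80 (standard UTF-8, 4 bytes)
print(cesu8_encode("😀").hex(" "))    # ed a0 bd ed b8 80 (CESU-8, 6 bytes)
```

Characters in the Basic Multilingual Plane encode identically in both forms, so the difference only appears once supplementary characters are present.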

13.

All known Modified UTF-8 implementations treat the surrogate pairs as in CESU-8.

14.

In normal usage, Java supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter.

15.

Extensions have been created to allow any byte sequence that is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to reserved code points, and transforming those code points back to error bytes to output UTF-8.
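
Python's `surrogateescape` error handler (introduced by PEP 383) is one example of such an extension: each undecodable byte 0x80 to 0xFF is mapped to a code point in U+DC80 to U+DCFF and mapped back on encoding. The sample bytes below are arbitrary Latin-1-style data chosen for illustration.

```python
raw = b"caf\xe9 \xff ok"  # not valid UTF-8: 0xE9 lacks continuation bytes, 0xFF is never valid

# Invalid bytes become lone surrogates instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler turns those code points back into the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True
```

The round trip is lossless, but the intermediate string contains lone surrogates, so encoding it with the default strict handler raises `UnicodeEncodeError`.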
