UTF-8 is a variable-width character encoding used for electronic communication.
FactSnippet No. 1,567,753 |
UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte code units.
FactSnippet No. 1,567,754 |
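The one-to-four-byte layout can be illustrated with a small sketch. The helper name `encode_code_point` is hypothetical, chosen only for this example; in practice Python's built-in `str.encode("utf-8")` does the same job. The count 1,112,064 is the 1,114,112 code points from U+0000 to U+10FFFF minus the 2,048 surrogates.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# All code points (U+0000..U+10FFFF) minus the surrogate range (U+D800..U+DFFF).
VALID_CODE_POINTS = 0x110000 - 0x800

if __name__ == "__main__":
    for ch in ("A", "é", "€", "😀"):
        print(f"U+{ord(ch):04X}", encode_code_point(ord(ch)).hex(" "))
```

Comparing the helper's output with `ch.encode("utf-8")` for the same character is a quick way to check the bit layout.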
UTF-8 was designed as a superior alternative to UTF-1, a proposed variable-width encoding with partial ASCII compatibility which lacked some features including self-synchronization and fully ASCII-compatible handling of characters such as slashes.
FactSnippet No. 1,567,755 |
Carefully crafted invalid UTF-8 could make lenient decoders either skip or create ASCII characters such as NUL, slash, or quotes.
FactSnippet No. 1,567,756 |
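One way this can happen is through an overlong sequence, where an ASCII character is spelled with more bytes than necessary. The sketch below assumes a deliberately unsafe two-byte decoder (`lenient_decode_two_byte`, a hypothetical name) that skips the overlong check, and contrasts it with Python's built-in strict decoder.

```python
from typing import Optional

# Two-byte form whose payload bits spell U+002F "/"; this is invalid UTF-8.
OVERLONG_SLASH = b"\xc0\xaf"


def lenient_decode_two_byte(data: bytes) -> str:
    """Decode a two-byte sequence WITHOUT rejecting overlong forms (unsafe)."""
    lead, cont = data
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


def strict_decode(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(repr(lenient_decode_two_byte(OVERLONG_SLASH)))  # a slash appears
    print(strict_decode(OVERLONG_SLASH))                  # rejected
```

A filter that scans the raw bytes for `/` before handing them to such a lenient decoder would never see the slash, which is the kind of gap a strict decoder closes.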
Invalid UTF-8 has been used to bypass security validations in high-profile products including Microsoft's IIS web server and Apache's Tomcat servlet container.
FactSnippet No. 1,567,757 |
There was, and still is, software that always inserts a byte order mark (BOM) when writing UTF-8 and refuses to correctly interpret UTF-8 unless the first character is a BOM.
FactSnippet No. 1,567,758 |
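As a small illustration of tolerant BOM handling, Python's standard `utf-8-sig` codec strips a leading BOM on input if one is present and writes one on output, while the plain `utf-8` codec leaves it in the text as U+FEFF. The variable names below are only for this example.

```python
import codecs

plain = "hello".encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain          # b"\xef\xbb\xbf" + b"hello"

# The plain codec keeps the BOM as a leading U+FEFF character.
kept = with_bom.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM and also accepts input without one.
stripped = with_bom.decode("utf-8-sig")
no_bom_ok = plain.decode("utf-8-sig")

# Encoding with "utf-8-sig" always prepends the BOM.
written = "hello".encode("utf-8-sig")

if __name__ == "__main__":
    print(repr(kept), repr(stripped), repr(no_bom_ok), written)
```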
UTF-8 has been the most common encoding for the World Wide Web since 2008.
FactSnippet No. 1,567,760 |
UTF-8 usage in local text files is lower, and many legacy single-byte encodings remain in use.
FactSnippet No. 1,567,761 |
The primary cause is editors that do not display or write UTF-8 unless the first character in a file is a byte order mark, making it impossible for other software to use UTF-8 without being rewritten to ignore the byte order mark on input and add it on output.
FactSnippet No. 1,567,762 |
UTF-8 was first officially presented at the USENIX conference in San Diego, held from January 25 to 29, 1993.
FactSnippet No. 1,567,763 |
CESU-8 encoding can result from converting UTF-16 data with supplementary characters to UTF-8, using conversion methods that assume UCS-2 data, meaning they are unaware of four-byte UTF-16 supplementary characters.
FactSnippet No. 1,567,764 |
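The effect can be reproduced with a short sketch. The function `cesu8_encode` is a hypothetical helper written for this example, not a standard library API; it splits each supplementary character into its UTF-16 surrogate pair and encodes each surrogate as its own three-byte sequence, which is what a UCS-2-minded converter effectively does.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: supplementary characters become two
    three-byte sequences, one per UTF-16 surrogate (illustrative helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets Python emit the three-byte form
                # of a lone surrogate, which strict UTF-8 forbids.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    emoji = "\U0001F600"
    print(emoji.encode("utf-8").hex(" "))   # standard UTF-8: four bytes
    print(cesu8_encode(emoji).hex(" "))     # CESU-8: six bytes
```

A strict UTF-8 decoder rejects the six-byte form, so data produced this way does not round-trip through standard UTF-8 tooling.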
All known Modified UTF-8 implementations treat the surrogate pairs as in CESU-8.
FactSnippet No. 1,567,765 |
In normal usage, the Java language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter.
FactSnippet No. 1,567,766 |
Extensions have been created to allow any byte sequence that is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to reserved code points, and transforming those code points back to error bytes to output UTF-8.
FactSnippet No. 1,567,767 |
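Python's `surrogateescape` error handler is one example of such a scheme: each undecodable byte 0x80 to 0xFF is mapped to a lone surrogate code point U+DC80 to U+DCFF and mapped back on encoding. The sample bytes below are made up for illustration.

```python
# Latin-1 style bytes mixed into otherwise ASCII data; not valid UTF-8.
raw = b"caf\xe9 \xff ok"

# Each invalid byte 0xNN becomes the code point U+DCNN instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns those code points back into bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for c in text])
    print(round_tripped == raw)
```

Without the handler on output, encoding `text` raises `UnicodeEncodeError`, since lone surrogates are not valid in UTF-8.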