Unicode character codes
Unicode is an international standard that describes how computers can encode most of the characters that are used around the world. Each character has been given a unique number, called a code point, that is often written as U+ followed by the number in hexadecimal notation with at least 4 digits. The characters A, ñ and ♥ have the code points U+0041, U+00F1 and U+2665 respectively, which correspond to 65, 241 and 9829 in decimal notation.
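In many programming languages you can look up a character's code point directly. A small sketch in Python, using the built-in `ord` function on the three example characters above:

```python
# Print the code point of each character in both the U+ form and decimal.
for ch in "Añ♥":
    cp = ord(ch)  # ord returns the code point as a decimal integer
    print(f"{ch}  U+{cp:04X}  (decimal {cp})")
```

Running this prints U+0041 (65), U+00F1 (241) and U+2665 (9829), matching the values given above.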
HTML is the most widely used language for creating web pages. Some characters have a special meaning in HTML so they cannot be written as-is if we want them to be visible on the page. Other characters are just hard to type on some keyboards. Fortunately, there are alternative ways of writing characters in HTML that let you use the Unicode number instead.
The number is simply written in decimal between &# and ; (a semicolon). If an x is written in front of the number, the number is interpreted as hexadecimal. A few characters, but far from all, also have a short name that can be used in a similar way. For example, the character < has code point U+003C (decimal: 60) and the name lt (short for less than), which allows us to write it in any of the following ways: &#60; &#x3C; &lt;
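Building these references is a matter of string formatting. A minimal sketch (the helper name `html_refs` is made up for illustration):

```python
def html_refs(ch):
    """Return the decimal and hexadecimal character references for ch."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"

print(html_refs("<"))  # ('&#60;', '&#x3C;')
print(html_refs("♥"))  # ('&#9829;', '&#x2665;')
```

Both forms refer to the same character; the named form (such as &lt;) exists only for the small set of characters that have been given names.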
UTF-8 is a popular encoding for Unicode. Like all other UTF encodings, it can encode any character in the Unicode standard. Characters are encoded as a series of 1-4 bytes. Most characters used in the Western world require only one or two bytes, and the digits 0-9 and the letters of the English alphabet require only one byte.
Characters in the range U+0000 – U+007F are encoded the same way as with ASCII, which is the encoding that most character encodings are derived from. This means that there is no difference between UTF-8 and ASCII for texts that only use these characters.
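This compatibility is easy to verify: encoding a pure-ASCII string with either encoding yields byte-for-byte identical output. A quick check in Python:

```python
text = "Hello"  # contains only characters in the range U+0000 - U+007F
utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")
print(utf8_bytes == ascii_bytes)  # True: UTF-8 and ASCII agree on this range
```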
Characters that require more than one byte start with a byte where the most significant bits are a series of ones, one for each byte used to encode the character, followed by a zero. In all the other bytes, the two most significant bits are 10. The remaining bits are used to represent the Unicode number.
| Code points (hex) | Byte sequence |
|---|---|
| 0 – 7F | 0xxxxxxx |
| 80 – 7FF | 110xxxxx 10xxxxxx |
| 800 – FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 10000 – 10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
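The table above translates directly into bit operations. A sketch of a UTF-8 encoder for a single code point (the function name `utf8_encode` is made up; real programs would use the language's built-in encoder, shown alongside for comparison):

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 bytes, following the four-row table."""
    if cp <= 0x7F:                      # 1 byte: 0xxxxxxx
        return bytes([cp])
    elif cp <= 0x7FF:                   # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp <= 0xFFFF:                  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                               # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

for ch in "Añ♥":
    print(ch, utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```

For example, ♥ (U+2665) falls in the three-byte row and encodes as the bytes E2 99 A5, the same result Python's built-in encoder produces.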
UTF-16 uses one or two 16-bit values to encode one character. Characters in the range U+0000 – U+FFFF use one 16-bit value where all of the bits are used to represent the Unicode number. This includes most of the characters that are used in all modern languages.
Characters in the range U+10000 – U+10FFFF use two 16-bit values with the initial bit patterns 110110 and 110111. The remaining 20 bits are used to represent the Unicode number minus 10000 (hexadecimal). This works because the 16-bit values that start with these bit patterns, U+D800 – U+DFFF, are reserved and do not represent any valid Unicode character.
| Code points (hex) | 16-bit values |
|---|---|
| 0 – FFFF | xxxxxxxxxxxxxxxx |
| 10000 – 10FFFF | 110110xxxxxxxxxx 110111xxxxxxxxxx |
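The two-value case is usually called a surrogate pair. A sketch of the computation described above (the helper name `utf16_units` is made up for illustration):

```python
def utf16_units(cp):
    """Return the 16-bit code unit(s) for a code point, per the table above."""
    if cp <= 0xFFFF:
        return [cp]                 # one 16-bit value, used as-is
    v = cp - 0x10000                # the 20-bit value that gets split
    high = 0xD800 | (v >> 10)       # 110110xxxxxxxxxx (top 10 bits)
    low = 0xDC00 | (v & 0x3FF)      # 110111xxxxxxxxxx (bottom 10 bits)
    return [high, low]

print([hex(u) for u in utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
```

For example, U+1F600 becomes the pair D83D DE00, while U+2665 is stored directly as the single value 2665.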
UTF-32 uses a fixed size of 32 bits to store each character. This often makes it easier to work with text in computer programs than UTF-8 and UTF-16, which use a variable number of bytes per character, but a disadvantage is that it requires more memory. UTF-32 is therefore most often used internally by programs but seldom for long-term storage on, for example, a hard drive. Only 21 bits are needed to represent all Unicode characters, so the 11 most significant bits are always zero.
| Code points (hex) | 32-bit value |
|---|---|
| 0 – 10FFFF | 00000000000xxxxxxxxxxxxxxxxxxxxx |
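The size trade-off is easy to see by encoding the same short string with all three encodings (the little-endian variants are used here so no byte-order mark is added):

```python
text = "Añ♥"  # one 1-byte, one 2-byte and one 3-byte character in UTF-8
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(text.encode(enc)), "bytes")
```

UTF-8 and UTF-16 both need 6 bytes for this string, while UTF-32 always needs 4 bytes per character, 12 in total.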