Source code editor What Is Ajax
↑
MySQL supports 70+ collations for 30+ character sets. This section indicates which character sets MySQL supports. There is one subsection for each group of related character sets. For each character set, the allowable collations are listed.
You can always list the available character sets and their default collations with the SHOW CHARACTER SET
statement:
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset | Description | Default collation |
+----------+-----------------------------+---------------------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci |
| dec8 | DEC West European | dec8_swedish_ci |
| cp850 | DOS West European | cp850_general_ci |
| hp8 | HP West European | hp8_english_ci |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci |
| latin1 | cp1252 West European | latin1_swedish_ci |
| latin2 | ISO 8859-2 Central European | latin2_general_ci |
| swe7 | 7bit Swedish | swe7_swedish_ci |
| ascii | US ASCII | ascii_general_ci |
| ujis | EUC-JP Japanese | ujis_japanese_ci |
| sjis | Shift-JIS Japanese | sjis_japanese_ci |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci |
| tis620 | TIS620 Thai | tis620_thai_ci |
| euckr | EUC-KR Korean | euckr_korean_ci |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci |
| greek | ISO 8859-7 Greek | greek_general_ci |
| cp1250 | Windows Central European | cp1250_general_ci |
| gbk | GBK Simplified Chinese | gbk_chinese_ci |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci |
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci |
| utf8 | UTF-8 Unicode | utf8_general_ci |
| ucs2 | UCS-2 Unicode | ucs2_general_ci |
| cp866 | DOS Russian | cp866_general_ci |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci |
| macce | Mac Central European | macce_general_ci |
| macroman | Mac West European | macroman_general_ci |
| cp852 | DOS Central European | cp852_general_ci |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci |
| cp1251 | Windows Cyrillic | cp1251_general_ci |
| cp1256 | Windows Arabic | cp1256_general_ci |
| cp1257 | Windows Baltic | cp1257_general_ci |
| binary | Binary pseudo charset | binary |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+
In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.
MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.
ucs2
(UCS-2 Unicode) collations:
ucs2_bin
ucs2_czech_ci
ucs2_danish_ci
ucs2_esperanto_ci
ucs2_estonian_ci
ucs2_general_ci
(default)
ucs2_hungarian_ci
ucs2_icelandic_ci
ucs2_latvian_ci
ucs2_lithuanian_ci
ucs2_persian_ci
ucs2_polish_ci
ucs2_roman_ci
ucs2_romanian_ci
ucs2_slovak_ci
ucs2_slovenian_ci
ucs2_spanish2_ci
ucs2_spanish_ci
ucs2_swedish_ci
ucs2_turkish_ci
ucs2_unicode_ci
utf8
(UTF-8 Unicode) collations:
utf8_bin
utf8_czech_ci
utf8_danish_ci
utf8_esperanto_ci
utf8_estonian_ci
utf8_general_ci
(default)
utf8_hungarian_ci
utf8_icelandic_ci
utf8_latvian_ci
utf8_lithuanian_ci
utf8_persian_ci
utf8_polish_ci
utf8_roman_ci
utf8_romanian_ci
utf8_slovak_ci
utf8_slovenian_ci
utf8_spanish2_ci
utf8_spanish_ci
utf8_swedish_ci
utf8_turkish_ci
utf8_unicode_ci
The MySQL implementation of UCS-2 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of UCS-2 values. Other database systems might use little-ending byte order or a BOM, in which case conversion of UCS-2 values will need to be performed when transferring data between those systems and MySQL.
Note that in the ucs2_roman_ci
and utf8_roman_ci
collations, I
and J
compare as equals, and U
and V
compare as equals.
The ucs2_esperanto_ci
and utf8_esperanto_ci
collations were added in MySQL 5.0.13. The ucs2_hungarian_ci
and utf8_hungarian_ci
collations were added in MySQL 5.0.19.
MySQL implements the utf8_unicode_ci
collation according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. The following discussion uses utf8_unicode_ci
, but it is also true for ucs2_unicode_ci
.
Currently, the utf8_unicode_ci
collation has only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo.
The most significant feature in utf8_unicode_ci
is that it supports expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages ‘Я
’ is equal to ‘ss
’.
utf8_general_ci
is a legacy collation that does not support expansions. It can make only one-to-one comparisons between characters. This means that comparisons for the utf8_general_ci
collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci
.
For example, the following equalities hold in both utf8_general_ci
and utf8_unicode_ci
:
Д = A Ц = O Ь = U
A difference between the collations is that this is true for utf8_general_ci
:
Я = s
Whereas this is true for utf8_unicode_ci
:
Я = ss
MySQL implements language-specific collations for the utf8
character set only if the ordering with utf8_unicode_ci
does not work well for a language. For example, utf8_unicode_ci
works fine for German and French, so there is no need to create special utf8
collations for these two languages.
utf8_general_ci
also is satisfactory for both German and French, except that ‘Я
’ is equal to ‘s
’, and not to ‘ss
’. If this is acceptable for your application, then you should use utf8_general_ci
because it is faster. Otherwise, use utf8_unicode_ci
because it is more accurate.
utf8_swedish_ci
, like other utf8
language-specific collations, is derived from utf8_unicode_ci
with additional language rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:
Ь = Y < Ц
The utf8_spanish_ci
and utf8_spanish2_ci
collations correspond to modern Spanish and traditional Spanish, respectively. In both collations, ‘с
’ (n-tilde) is a separate letter between ‘n
’ and ‘o
’. In addition, for traditional Spanish, ‘ch
’ is a separate letter between ‘c
’ and ‘d
’, and ‘ll
’ is a separate letter between ‘l
’ and ‘m
’
Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.
ascii
(US ASCII) collations:
ascii_bin
ascii_general_ci
(default)
cp850
(DOS West European) collations:
cp850_bin
cp850_general_ci
(default)
dec8
(DEC Western European) collations:
dec8_bin
dec8_swedish_ci
(default)
hp8
(HP Western European) collations:
hp8_bin
hp8_english_ci
(default)
latin1
(cp1252 West European) collations:
latin1_bin
latin1_danish_ci
latin1_general_ci
latin1_general_cs
latin1_german1_ci
latin1_german2_ci
latin1_spanish_ci
latin1_swedish_ci
(default)
latin1
is the default character set. MySQL's latin1
is the same as the Windows cp1252
character set. This means it is the same as the official ISO 8859-1
or IANA (Internet Assigned Numbers Authority) latin1
, except that IANA latin1
treats the code points between 0x80
and 0x9f
as “undefined,” whereas cp1252
, and therefore MySQL's latin1
, assign characters for those positions. For example, 0x80
is the Euro sign. For the “undefined” entries in cp1252
, MySQL translates 0x81
to Unicode 0x0081
, 0x8d
to 0x008d
, 0x8f
to 0x008f
, 0x90
to 0x0090
, and 0x9d
to 0x009d
.
The latin1_swedish_ci
collation is the default that probably is used by the majority of MySQL customers. Although it is frequently said that it is based on the Swedish/Finnish collation rules, there are Swedes and Finns who disagree with this statement.
The latin1_german1_ci
and latin1_german2_ci
collations are based on the DIN-1 and DIN-2 standards, where DIN stands for Deutsches Institut fьr Normung (the German equivalent of ANSI). DIN-1 is called the “dictionary collation” and DIN-2 is called the “phone book collation.”
latin1_german1_ci
(dictionary) rules:
Д = A Ц = O Ь = U Я = s
latin1_german2_ci
(phone-book) rules:
Д = AE Ц = OE Ь = UE Я = ss
In the latin1_spanish_ci
collation, ‘с
’ (n-tilde) is a separate letter between ‘n
’ and ‘o
’.
macroman
(Mac West European) collations:
macroman_bin
macroman_general_ci
(default)
swe7
(7bit Swedish) collations:
swe7_bin
swe7_swedish_ci
(default)
MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, Poland, and Serbia (Latin).
cp1250
(Windows Central European) collations:
cp1250_bin
cp1250_croatian_ci
cp1250_czech_cs
cp1250_general_ci
(default)
cp852
(DOS Central European) collations:
cp852_bin
cp852_general_ci
(default)
keybcs2
(DOS Kamenicky Czech-Slovak) collations:
keybcs2_bin
keybcs2_general_ci
(default)
latin2
(ISO 8859-2 Central European) collations:
latin2_bin
latin2_croatian_ci
latin2_czech_cs
latin2_general_ci
(default)
latin2_hungarian_ci
macce
(Mac Central European) collations:
macce_bin
macce_general_ci
(default)
South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.
armscii8
(ARMSCII-8 Armenian) collations:
armscii8_bin
armscii8_general_ci
(default)
cp1256
(Windows Arabic) collations:
cp1256_bin
cp1256_general_ci
(default)
geostd8
(GEOSTD8 Georgian) collations:
geostd8_bin
geostd8_general_ci
(default)
greek
(ISO 8859-7 Greek) collations:
greek_bin
greek_general_ci
(default)
hebrew
(ISO 8859-8 Hebrew) collations:
hebrew_bin
hebrew_general_ci
(default)
latin5
(ISO 8859-9 Turkish) collations:
latin5_bin
latin5_turkish_ci
(default)
The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.
cp1257
(Windows Baltic) collations:
cp1257_bin
cp1257_general_ci
(default)
cp1257_lithuanian_ci
latin7
(ISO 8859-13 Baltic) collations:
latin7_bin
latin7_estonian_cs
latin7_general_ci
(default)
latin7_general_cs
The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, Ukrainian, and Serbian (Cyrillic) languages.
cp1251
(Windows Cyrillic) collations:
cp1251_bin
cp1251_bulgarian_ci
cp1251_general_ci
(default)
cp1251_general_cs
cp1251_ukrainian_ci
cp866
(DOS Russian) collations:
cp866_bin
cp866_general_ci
(default)
koi8r
(KOI8-R Relcom Russian) collations:
koi8r_bin
koi8r_general_ci
(default)
koi8u
(KOI8-U Ukrainian) collations:
koi8u_bin
koi8u_general_ci
(default)
The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters. See Section 10.10.7.1, “The cp932
Character Set”, for additional information about the cp932
and sjis
character sets.
For answers to some common questions and problems relating support for Asian character sets in MySQL, see Section A.11, “MySQL 5.0 FAQ — MySQL Chinese, Japanese, and Korean Character Sets”.
big5
(Big5 Traditional Chinese) collations:
big5_bin
big5_chinese_ci
(default)
cp932
(SJIS for Windows Japanese) collations:
cp932_bin
cp932_japanese_ci
(default)
eucjpms
(UJIS for Windows Japanese) collations:
eucjpms_bin
eucjpms_japanese_ci
(default)
euckr
(EUC-KR Korean) collations:
euckr_bin
euckr_korean_ci
(default)
gb2312
(GB2312 Simplified Chinese) collations:
gb2312_bin
gb2312_chinese_ci
(default)
gbk
(GBK Simplified Chinese) collations:
gbk_bin
gbk_chinese_ci
(default)
sjis
(Shift-JIS Japanese) collations:
sjis_bin
sjis_japanese_ci
(default)
tis620
(TIS620 Thai) collations:
tis620_bin
tis620_thai_ci
(default)
ujis
(EUC-JP Japanese) collations:
ujis_bin
ujis_japanese_ci
(default)
Why is cp932
needed?
In MySQL, the sjis
character set corresponds to the Shift_JIS
character set defined by IANA, which supports JIS X0201 and JIS X0208 characters. (See http://www.iana.org/assignments/character-sets.)
However, the meaning of “SHIFT JIS” as a descriptive term has become very vague and it often includes the extensions to Shift_JIS
that are defined by various vendors.
For example, “SHIFT JIS” used in Japanese Windows environments is a Microsoft extension of Shift_JIS
and its exact name is Microsoft Windows Codepage : 932
or cp932
. In addition to the characters supported by Shift_JIS
, cp932
supports extension characters such as NEC special characters, NEC selected — IBM extended characters, and IBM extended characters.
Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
MySQL automatically converts character sets.
Character sets are converted via Unicode (ucs2
).
The sjis
character set does not support the conversion of these extension characters.
There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The MySQL cp932
character set is designed to solve these problems. It is available as of MySQL 5.0.3.
Because MySQL supports character set conversion, it is important to separate IANA Shift_JIS
and cp932
into two different character sets because they provide different conversion rules.
How does cp932
differ from sjis
?
The cp932
character set differs from sjis
in the following ways:
cp932
supports NEC special characters, NEC selected — IBM extended characters, and IBM selected characters.
Some cp932
characters have two different code points, both of which convert to the same Unicode code point. When converting from Unicode back to cp932
, one of the code points must be selected. For this “round trip conversion,” the rule recommended by Microsoft is used. (See http://support.microsoft.com/kb/170559/EN-US/.)
The conversion rule works like this:
If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
If the character is in both IBM selected characters and NEC selected — IBM extended characters, use the code point of IBM extended characters.
The table shown at http://www.microsoft.com/globaldev/reference/dbcs/932.htm provides information about the Unicode values of cp932
characters. For cp932
table entries with characters under which a four-digit number appears, the number represents the corresponding Unicode (ucs2
) encoding. For table entries with an underlined two-digit value appears, there is a range of cp932
character values that begin with those two digits. Clicking such a table entry takes you to a page that displays the Unicode value for each of the cp932
characters that begin with those digits.
The following links are of special interest. They correspond to the encodings for the following sets of characters:
NEC special characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_87.htm
NEC selected — IBM extended characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_ED.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_EE.htm
IBM selected characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_FA.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FB.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FC.htm
Starting from version 5.0.3, cp932
supports conversion of user-defined characters in combination with eucjpms
, and solves the problems with sjis
/ujis
conversion. For details, please refer to http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.
For some characters, conversion to and from ucs2
is different for sjis
and cp932
. The following tables illustrate these differences.
Conversion to ucs2
:
sjis /cp932 Value | sjis -> ucs2 Conversion | cp932 -> ucs2 Conversion |
5C | 005C | 005C |
7E | 007E | 007E |
815C | 2015 | 2015 |
815F | 005C | FF3C |
8160 | 301C | FF5E |
8161 | 2016 | 2225 |
817C | 2212 | FF0D |
8191 | 00A2 | FFE0 |
8192 | 00A3 | FFE1 |
81CA | 00AC | FFE2 |
Conversion from ucs2
:
ucs2 value | ucs2 -> sjis Conversion | ucs2 -> cp932 Conversion |
005C | 815F | 5C |
007E | 7E | 7E |
00A2 | 8191 | 3F |
00A3 | 8192 | 3F |
00AC | 81CA | 3F |
2015 | 815C | 815C |
2016 | 8161 | 3F |
2212 | 817C | 3F |
2225 | 3F | 8161 |
301C | 8160 | 3F |
FF0D | 3F | 817C |
FF3C | 3F | 815F |
FF5E | 3F | 8160 |
FFE0 | 3F | 8191 |
FFE1 | 3F | 8192 |
FFE2 | 3F | 81CA |
Users of any Japanese character sets should be aware that using --character-set-client-handshake
(or --skip-character-set-client-handshake
) has an important effect. See Section 5.2.2, “Command Options”.