I Don't Understand Collation? (Mysql, Rdbms, Character Sets)

I don't understand Collation? (Mysql, RDBMS, Character sets)

The main point of a database collation is determining how data is sorted and compared.

Case sensitivity of string comparisons

SELECT "New York" = "NEW YORK";` 

will return true for a case insensitive collation; false for a case sensitive one.

Which collation does which can be told by the _ci and _cs suffix in the collation's name. _bin collations do binary comparisons (strings must be 100% identical).

Comparison of umlauts/accented characters

the collation also determines whether accented characters are treated as their latin base counterparts in string comparisons.

SELECT "Düsseldorf" =  "Dusseldorf";
SELECT "Èclair" = "Eclair";

will return true in the former case; false in the latter. You will need to read each collation's description to find out which is which.

String sorting

The collation influences the way strings are sorted.

For example,

  • Umlauts Ä Ö Ü are at the end of the alphabet in the finnish/swedish alphabet latin1_swedish_ci

  • they are treated as A O U in German DIN-1 sorting (latin_german1_ci)

  • and as AE OE UE in German DIN-2 sorting (latin_german2_ci). ("phone book" sorting)

  • In latin1_spanish_ci, "ñ" (n-tilde) is a separate letter between "n" and "o".

These rules will result in different sort orders when non-latin characters are used.

Using collations at runtime

You have to choose a collation for your table and columns, but if you don't mind the performance hit, you can force database operations into a certain collation at runtime using the COLLATE keyword.

This will sort table by the name column using German DIN-2 sorting rules:

SELECT name
FROM table
ORDER BY name COLLATE latin1_german2_ci;

Using COLLATE at runtime will have performance implications, as each column has to be converted during the query. So think twice before applying this do large data sets.

MySQL Reference:

  • Character Sets and Collations That MySQL Supports
  • Examples of the Effect of Collation
  • Collation issues

Character set and collation in database

$ mysql ... --default-character-set=utf8 < tables.sql
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

MySQL needs to know what encoding the data you're sending it is in. To do this, you need to set the connection encoding. When connecting to MySQL from a client, you usually run a SET NAMES query or use an equivalent call on your API of choice to do so. On the command line, the --default-character-set option does this. It needs to be set to whatever encoding your file is in.

How to change the CHARACTER SET (and COLLATION) throughout a database?

change database collation:

ALTER DATABASE <database_name> CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

change table collation:

ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

change column collation:

ALTER TABLE <table_name> MODIFY <column_name> VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;


What do the parts of utf8mb4_0900_ai_ci mean?

3 bytes -- utf8
4 bytes -- utf8mb4 (new)
v4.0 --   _unicode_
v5.20 -- _unicode_520_
v9.0 -- _0900_ (new)
_bin      -- just compare the bits; don't consider case folding, accents, etc
_ci -- explicitly case insensitive (A=a) and implicitly accent insensitive (a=á)
_ai_ci -- explicitly case insensitive and accent insensitive
_as (etc) -- accent-sensitive (etc)
_bin         -- simple, fast
_general_ci -- fails to compare multiletters; eg ss=ß, somewhat fast
... -- slower
_0900_ -- (8.0) much faster because of a rewrite

More info:

  • What are the differences between utf8_general_ci and utf8_unicode_ci?
  • What's the difference between utf8_general_ci and utf8_unicode_ci?
  • How to change collation of database, table, column?
  • What's the difference between utf8_general_ci and utf8_unicode_ci?

Mysql why use utf8_slovenian_ci?

Is it because Slovenia needs this encoding to assimilate certain letters in their alphabet?

Probably not because UTF-8 already covers all imaginable characters. The collation probably contains sorting and comparison rules specific to Slovenia. (See here for an overview what a collation does and how sorting rules can vary.)

If I"m using english I can omit this can't I?

you should be able to use one of the _general collations if you don't need any country specific rules.

Difference between database, table, column collation

MySQL's character sets and collations can be interpreted as a top-down list of prioritized items. The topmost is least priority and the bottommost is most priority.

Order of precedence with topmost being least precedence:

  • Server collation
  • Connection-specific collation
  • Database collation
  • Table collation
  • Column collation
  • Query collation (using CAST or CONVERT)

The server collation is set by the server, which is set either inside of my.cnf or when the server was built from source code. By default, this will usually be latin1 or utf8, depending on your platform.

The connection-specific collation is set by the client using a query like SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';. Most clients don't set a connection-specific collation, so the server will use its own default as explained above.

The database collation is set during database creation, or manually by updating it later. If you don't specify one, it will use the next higher-level collation, which would either be the connection-specific or the server collation.

The table collation is the same as the database collation, except if left blank, it will use the database as its default, then connection-specific, and then finally the server's collation.

The column collation uses the table's collation as its default, and if there is no collation set, it will then follow up the chain to find a collation to use, stopping at server if all of the others weren't set.

The query collation is specified in the query by using CAST or CONVERT, but otherwise will use the next available collation in the chain. There's no way to set this unless you use a function.

Please also refer to the manual page Character Set Support.



Related Topics



Leave a reply



Submit