I don't understand Collation? (Mysql, RDBMS, Character sets)
The main point of a database collation is determining how data is sorted and compared.
Case sensitivity of string comparisons
SELECT "New York" = "NEW YORK";`
will return true for a case insensitive collation; false for a case sensitive one.
Which collation does which can be told by the _ci
and _cs
suffix in the collation's name. _bin
collations do binary comparisons (strings must be 100% identical).
Comparison of umlauts/accented characters
the collation also determines whether accented characters are treated as their latin base counterparts in string comparisons.
SELECT "Düsseldorf" = "Dusseldorf";
SELECT "Èclair" = "Eclair";
will return true in the former case; false in the latter. You will need to read each collation's description to find out which is which.
String sorting
The collation influences the way strings are sorted.
For example,
Umlauts
Ä Ö Ü
are at the end of the alphabet in the finnish/swedish alphabetlatin1_swedish_ci
they are treated as
A O U
in German DIN-1 sorting (latin_german1_ci
)and as
AE OE UE
in German DIN-2 sorting (latin_german2_ci
). ("phone book" sorting)In
latin1_spanish_ci
, "ñ" (n-tilde) is a separate letter between "n" and "o".
These rules will result in different sort orders when non-latin characters are used.
Using collations at runtime
You have to choose a collation for your table and columns, but if you don't mind the performance hit, you can force database operations into a certain collation at runtime using the COLLATE
keyword.
This will sort table
by the name
column using German DIN-2 sorting rules:
SELECT name
FROM table
ORDER BY name COLLATE latin1_german2_ci;
Using COLLATE
at runtime will have performance implications, as each column has to be converted during the query. So think twice before applying this do large data sets.
MySQL Reference:
- Character Sets and Collations That MySQL Supports
- Examples of the Effect of Collation
- Collation issues
Character set and collation in database
$ mysql ... --default-character-set=utf8 < tables.sql
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MySQL needs to know what encoding the data you're sending it is in. To do this, you need to set the connection encoding. When connecting to MySQL from a client, you usually run a SET NAMES
query or use an equivalent call on your API of choice to do so. On the command line, the --default-character-set
option does this. It needs to be set to whatever encoding your file is in.
How to change the CHARACTER SET (and COLLATION) throughout a database?
change database collation:
ALTER DATABASE <database_name> CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
change table collation:
ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
change column collation:
ALTER TABLE <table_name> MODIFY <column_name> VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
What do the parts of utf8mb4_0900_ai_ci
mean?
3 bytes -- utf8
4 bytes -- utf8mb4 (new)
v4.0 -- _unicode_
v5.20 -- _unicode_520_
v9.0 -- _0900_ (new)
_bin -- just compare the bits; don't consider case folding, accents, etc
_ci -- explicitly case insensitive (A=a) and implicitly accent insensitive (a=á)
_ai_ci -- explicitly case insensitive and accent insensitive
_as (etc) -- accent-sensitive (etc)
_bin -- simple, fast
_general_ci -- fails to compare multiletters; eg ss=ß, somewhat fast
... -- slower
_0900_ -- (8.0) much faster because of a rewrite
More info:
- What are the differences between utf8_general_ci and utf8_unicode_ci?
- What's the difference between utf8_general_ci and utf8_unicode_ci?
- How to change collation of database, table, column?
- What's the difference between utf8_general_ci and utf8_unicode_ci?
Mysql why use utf8_slovenian_ci?
Is it because Slovenia needs this encoding to assimilate certain letters in their alphabet?
Probably not because UTF-8 already covers all imaginable characters. The collation probably contains sorting and comparison rules specific to Slovenia. (See here for an overview what a collation does and how sorting rules can vary.)
If I"m using english I can omit this can't I?
you should be able to use one of the _general
collations if you don't need any country specific rules.
Difference between database, table, column collation
MySQL's character sets and collations can be interpreted as a top-down list of prioritized items. The topmost is least priority and the bottommost is most priority.
Order of precedence with topmost being least precedence:
- Server collation
- Connection-specific collation
- Database collation
- Table collation
- Column collation
- Query collation (using
CAST
orCONVERT
)
The server collation is set by the server, which is set either inside of my.cnf
or when the server was built from source code. By default, this will usually be latin1
or utf8
, depending on your platform.
The connection-specific collation is set by the client using a query like SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
. Most clients don't set a connection-specific collation, so the server will use its own default as explained above.
The database collation is set during database creation, or manually by updating it later. If you don't specify one, it will use the next higher-level collation, which would either be the connection-specific or the server collation.
The table collation is the same as the database collation, except if left blank, it will use the database as its default, then connection-specific, and then finally the server's collation.
The column collation uses the table's collation as its default, and if there is no collation set, it will then follow up the chain to find a collation to use, stopping at server if all of the others weren't set.
The query collation is specified in the query by using CAST
or CONVERT
, but otherwise will use the next available collation in the chain. There's no way to set this unless you use a function.
Please also refer to the manual page Character Set Support.
Related Topics
MySQL Select Query String Matching
How to Grant Read Access for a User to a Database in SQL Server
Finding Simultaneous Events in a Database Between Times
How to Pull a List of Id's from a SQL Table as a Comma-Separated Values String
Conditional Where Clause with Case Statement in Oracle
Select Random Row for Each Group
SQL Server , Split a Time Duration Row by 24 Hour Period
How to Use Regular Expression in SQL Server
SQL - Stored Procedure with Select Statement Using in (@Variable_Commadelimitedlistofids)
SQL to Transpose Row Pairs to Columns in Ms Access Database
How to Use Elasticsearch to Get Join Functionality as in SQL
Having Transaction in All Queries
Sane/Fast Method to Pass Variable Parameter Lists to SQLserver2008 Stored Procedure
Error Detection from Powershell Invoke-Sqlcmd Not Always Working