Better Unicode support for MySQL (including emoji)

When it comes to character support I think the only thing that should ever be used is Unicode. That's right, I said it. However, when it comes to support in MySQL, things get a little bit murky.

I never had much of an issue using plain ol' utf8_general_ci. However, when I tried to add language support for Gothic (tested because it's rare), I ran into a serious issue:

Incorrect string value: '\xF0\x90\x8C\xB0\xF0\x90...'
Source: brainstuck.com

Son of a …

This happens because MySQL's utf8 character set doesn't fully support UTF-8: it only stores characters of up to 3 bytes, while Gothic letters and emoji need 4. If you want something more realistic, you're going to need at least MySQL 5.5.3, and you're going to have to use utf8mb4 instead of regular utf8. Yes, seriously.
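To see why the Gothic text trips up utf8, it helps to check how many bytes each character needs in UTF-8. This is a minimal Python sketch, nothing MySQL-specific; the helper names utf8_width and fits_mysql_utf8 are made up for illustration.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character fits in MySQL's 3-byte utf8 (hypothetical helper)."""
    return all(utf8_width(ch) <= 3 for ch in text)


gothic_ahsa = "\U00010330"  # GOTHIC LETTER AHSA, the first character in the error above
print(gothic_ahsa.encode("utf-8").hex())  # f0908cb0

# ASCII letter, e-acute, euro sign, Gothic letter
print(utf8_width("a"), utf8_width("\u00e9"), utf8_width("\u20ac"), utf8_width(gothic_ahsa))  # 1 2 3 4

print(fits_mysql_utf8("caf\u00e9"))       # True
print(fits_mysql_utf8("hi \U0001F600"))   # False, the emoji needs 4 bytes
```

The f0908cb0 bytes are exactly the \xF0\x90\x8C\xB0 that MySQL complained about.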

Make sure you read through all of this before trying anything, because there are edge cases, especially with indices (indexes), which you may need to consider.

Also back up your data first.

Updating Database

I'll be working under the assumption that you want your entire database to be utf8mb4. If you don't, you'll have to adjust a bit, but seriously consider joining the 21st century if you're not using Unicode. I'm also assuming you want case insensitive text, which is what most people want in most cases; if you don't, replace utf8mb4_unicode_ci with utf8mb4_bin.

Update the default character set and collation for your database:

ALTER DATABASE `mydatabase` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Updating Tables

First we need to change the table's default character set. That way, when you add new columns in the future, you don't need to spell out the character set each time:

ALTER TABLE `mytable1` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Updating Table Columns

You can convert one column at a time, which may be what you want if you need different character sets for your CHAR, VARCHAR, and TEXT columns. Here's how you do that:

ALTER TABLE `mytable1` CHANGE `mycolumn` `mycolumn` VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '';

Obviously you're going to want to make sure that you're converting to the same column type, length, and so on. The above is an example only, and if you copy/paste it, you may screw up your column schema. Essentially you're just using ALTER TABLE CHANGE on the column in order to change the character set to utf8mb4 and the collation to utf8mb4_unicode_ci.

If, on the other hand, you just want to change the entire table at once, you can do:

ALTER TABLE `mytable1` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Updating Table Indices

When changing the character set, you may run into this on InnoDB:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

or this on MyISAM:

ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes

Ah, CRAP!

Oh snippy snap, snap, there are solutions:

If you're using InnoDB on MySQL 5.6.3 or higher, you can enable innodb_large_prefix in your MySQL config file (see the MySQL manual for details; it also needs the Barracuda file format and ROW_FORMAT=DYNAMIC or COMPRESSED on the table). If you aren't, you can take a few steps to work it out the old way:

  1. Make note of the conflicting index. It's likely to be on something like a VARCHAR(255), or an index across multiple columns which includes a VARCHAR. Write down the index name, type, and which column(s) it crosses.
  2. In my own scenario, I had a lot of indexes which included some sort of VARCHAR(254) and an ID which was BINARY(20). Now it seems like 254 + 20 = 274, and hey, that's less than 767 (or 1000), so what's the deal?

    Well, not so fast there, Professor. For a character column, MySQL doesn't count the bytes you actually store; it counts the maximum number of bytes each character could take (wait, what?). So if the column is VARCHAR(254) and it's utf8, the potential length is (254 * 3) = 762 bytes, and with utf8mb4 it's (254 * 4) = 1016 bytes. So really the length of the key you're trying to create is ((254 * 4) + 20) = 1036 bytes.

    With the 767-byte limit, InnoDB only allows a maximum of 255 characters for a column in an index with utf8, and 191 characters with utf8mb4.

    So if you need the entire column indexed, you aren't going to want to change the character set for that column (or columns). Instead, I recommend changing all the others one by one (as seen in the Updating Table Columns section) rather than trying to convert the entire table.

    However, if you do not need the entire column to be indexed (and in certain cases I did not), drop the index:

    ALTER TABLE `mytable1` DROP INDEX `theindex`;

    Then recreate it with the offending column(s) limited to 191:

    ALTER TABLE `mytable1` ADD INDEX `theindex` (`mycolumn` (191));

    or if across multiple columns (assuming mycolumnb is not utf8 for example):

    ALTER TABLE `mytable1` ADD INDEX `theindex` (`mycolumn` (191),`mycolumnb`);

As long as the indices are the same, and in the same column order, you should receive the same benefits for the indices without worrying about redoing your queries.
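The arithmetic behind those numbers can be sketched in a few lines of Python. This is only a back-of-the-envelope model under the assumption that the key length is characters times the maximum bytes per character (it ignores any per-column overhead), and the function names are made up for illustration.

```python
# Maximum bytes MySQL reserves per character for each character set.
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4, "binary": 1}


def key_bytes(columns):
    """Worst-case key length for a list of (charset, indexed_length) pairs."""
    return sum(MAX_BYTES_PER_CHAR[charset] * length for charset, length in columns)


def max_prefix_chars(charset, limit_bytes=767):
    """Longest indexable prefix, in characters, under a byte limit."""
    return limit_bytes // MAX_BYTES_PER_CHAR[charset]


# VARCHAR(254) plus BINARY(20), as in the scenario above.
print(key_bytes([("utf8", 254), ("binary", 20)]))     # 782
print(key_bytes([("utf8mb4", 254), ("binary", 20)]))  # 1036

# How many characters fit under the 767-byte limit.
print(max_prefix_chars("utf8"), max_prefix_chars("utf8mb4"))  # 255 191

# The VARCHAR column alone, once its prefix is cut to 191 characters.
print(key_bytes([("utf8mb4", 191)]))  # 764
```

With the prefix cut to 191 characters the utf8mb4 column needs 764 bytes, which is why 191 is the magic number in the ALTER TABLE statements above.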

Additional Notes and Considerations

If a column is not being used for search and case insensitivity isn't an issue, I suggest using BINARY and VARBINARY instead of CHAR or VARCHAR. Not only is comparison vastly faster, but there's also less to worry about as far as character set issues go, i.e. they don't matter. VARBINARY length is also literal (counted in bytes), so the UTF-8 limitations described in the index section of this post do not apply, and you can get the full width for your index.

Additionally instead of using TEXT, use BLOB, for the same reasons, but also realise the same limitations apply, such as no fulltext searching.

In summation, if you don't need case insensitivity and you don't need fulltext search, consider BINARY, VARBINARY, and BLOB over CHAR, VARCHAR, and TEXT. It'll be a lot easier to deal with when it comes to Unicode.
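One thing to keep in mind with the binary types: since the length is literal, VARBINARY(n) counts bytes rather than characters, and comparison is byte-for-byte. Here's a small Python sketch of what that means for the data you store. It assumes your application writes UTF-8, and fits_varbinary is a made-up helper, not a MySQL function.

```python
def fits_varbinary(text: str, size: int) -> bool:
    """True if text, encoded as UTF-8, fits in a VARBINARY(size) column."""
    return len(text.encode("utf-8")) <= size


name = "\U00010330" * 6  # six Gothic letters, 4 bytes each
print(len(name), len(name.encode("utf-8")))  # 6 24
print(fits_varbinary(name, 20))              # False
print(fits_varbinary("plain ascii", 20))     # True

# Byte-for-byte comparison is case sensitive.
print("Foo".encode("utf-8") == "foo".encode("utf-8"))  # False
```

So six characters can overflow a VARBINARY(20), and you'll want to size binary columns for the encoded bytes, not the character count.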

You can learn more about this on my MySQL performance, using case insensitive columns post.

Database Connections

Depending on your programming language, you may need to specify which character set to use when connecting (in most cases you can also set this in the configuration; see the Configuration section below). This can usually be done by sending this query right after connecting:

SET NAMES utf8mb4;

Configuration

You can edit your my.cnf (or my.ini on Windows) and make these changes to the appropriate sections of the configuration file (applicable to MySQL 5.6, older versions may need adjusted configuration):

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

MySQL performance, using case insensitive columns

When it comes to matching in queries, if you only need exact comparison (not case insensitive comparison, and not search), you should definitely consider using binary columns. They're much faster. Let's just jump right into benchmarks (run on MySQL 5.6.17 x64 Linux), then I'll talk more about why this is good and scenarios where it can be done.

In this example, I use utf8mb4, which is what you probably should use if you wish to have better Unicode support (learn more here). Keep in mind though that the character set can be essentially anything, and these results are always in favor of comparing against case sensitive (or binary) columns.

Case Sensitive

SELECT BENCHMARK(500000000, _utf8mb4'foo' < _utf8mb4'zot' COLLATE utf8mb4_bin);

Result

1 row in set (5.88 sec)

Case Insensitive

SELECT BENCHMARK(500000000, _utf8mb4'foo' < _utf8mb4'zot' COLLATE utf8mb4_unicode_ci);

Result

1 row in set (16.27 sec)

As you can see, case sensitive was nearly three (2.77) times faster than case insensitive.
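For the curious, here's where the 2.77 comes from, plus a rough per-iteration cost. This is just arithmetic on the two timings above in Python; the per-iteration figures include whatever overhead BENCHMARK() itself adds, so treat them as ballpark numbers.

```python
ITERATIONS = 500_000_000
binary_sec = 5.88       # utf8mb4_bin run
unicode_ci_sec = 16.27  # utf8mb4_unicode_ci run

# How many times slower the case insensitive comparison was.
ratio = unicode_ci_sec / binary_sec

# Average time per BENCHMARK() iteration, in nanoseconds.
ns_per_compare_bin = binary_sec / ITERATIONS * 1e9
ns_per_compare_ci = unicode_ci_sec / ITERATIONS * 1e9

print(f"{ratio:.2f}x")  # 2.77x
print(f"{ns_per_compare_bin:.2f} ns vs {ns_per_compare_ci:.2f} ns")  # 11.76 ns vs 32.54 ns
```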

My suggestion to you is that unless you need the column to be case insensitive for comparison or search, you should certainly consider using the case sensitive (binary) collation of your character set. For example, utf8_unicode_ci would become utf8_bin and utf8mb4_unicode_ci would become utf8mb4_bin.

For CHAR and VARCHAR, other character sets also have case sensitive (binary) collations, and if not you can alter the column to be VARCHAR BINARY:

ALTER TABLE `mytable1` CHANGE `mycolumn` `mycolumn` VARCHAR(254) BINARY NOT NULL;

However, I don't think that's ideal. Instead use VARBINARY, and in fact here's a little chart of which type to change to:

Text type     Binary type
CHAR          BINARY
VARCHAR       VARBINARY
TEXT          BLOB
MEDIUMTEXT    MEDIUMBLOB
LONGTEXT      LONGBLOB

If you're not using fulltext search on MyISAM you might just want to consider switching your TEXT columns to BLOB anyway to avoid character set issues with Unicode.

Properly escaping MySQL queries in PHP

I'm on various boards and such and from time to time people run into issues where they're trying to insert something into MySQL via a raw query and they inevitably run into that pesky apostrophe and the query dies.

Then almost always someone comes along to tell them that they need to use addslashes().

This is wrong.

Ideally you really want to use prepared statements (mysqli and PDO extensions), but let's assume for now you're throwing caution to the wind and you're going to do it the old fashioned way.

If you're using the mysql extension, you should use mysql_real_escape_string() around all of your variables which are not cast as integers. But actually, you shouldn't be using this function at all, because mysql_* is deprecated, way deprecated (and removed entirely in PHP 7). Instead you should be using…

mysqli which is faster, better, sexier, everything you want in a wom… extension. In this case we have the more logical name mysqli_escape_string() (an alias of mysqli_real_escape_string()), or you can use the back-to-goofiness-again method in the mysqli class, $mysqli->real_escape_string(), and it works the same way.

One issue is that with both of the above functions you have to actually be connected to the server to use them. That's because they escape based on your connection's character set and some other stuff.
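Why the connection's character set matters can be seen with the classic multi-byte example. This Python sketch mimics a byte-level, addslashes()-style escape (naive_escape is a made-up stand-in, not PHP's actual implementation) and then decodes the result the way a server talking GBK would read it.

```python
def naive_escape(raw: bytes) -> bytes:
    """Backslash-escape quotes byte by byte, ignoring the character set."""
    return raw.replace(b"\\", b"\\\\").replace(b"'", b"\\'")


payload = b"\xbf'"  # a stray lead byte followed by a quote
escaped = naive_escape(payload)
print(escaped.hex())  # bf5c27

# In GBK, 0xBF 0x5C is one valid two-byte character, so the backslash is
# swallowed and the quote is left unescaped.
as_gbk = escaped.decode("gbk")
print(len(as_gbk), as_gbk.endswith("'"), "\\" in as_gbk)  # 2 True False
```

An escape function that knows the connection's character set can avoid this, which is why the real ones want a live connection.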

However, assuming you're not too worried about potential Unicode issues (I've yet to have any, supporting Serbian and Hungarian), you can always make your own function to escape based on what MySQL requires:

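// Escapes the characters MySQL treats specially inside a quoted string.
// The backslash comes first so the backslashes added later aren't escaped again.
function escape($string) {
    return str_replace(array('\\', "\0", "\n", "\r", "'", '"', "\x1a"),
                       array('\\\\', '\\0', '\\n', '\\r', "\\'", '\\"', '\\Z'), $string);
}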

But there's always a potential danger in doing things yourself, and I actually don't have proof the above is faster than the escape functions that require a connection, so ideally just use prepared statements.
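To illustrate what prepared statements buy you, here's a minimal sketch using Python's built-in sqlite3 module as a stand-in, so it runs without a MySQL server. The table and column names are made up, and SQLite is not MySQL, but the placeholder idea is the same one mysqli and PDO offer in PHP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")

# The value is passed separately from the SQL text, so the apostrophe never
# gets a chance to break the query and no manual escaping is needed.
name = "O'Reilly"
conn.execute("INSERT INTO people (name) VALUES (?)", (name,))

row = conn.execute("SELECT name FROM people WHERE name = ?", (name,)).fetchone()
print(row[0])  # O'Reilly
conn.close()
```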