Setting up UTF-8


This article is intended for people who already know what UTF-8 is and why it's a good choice for web applications, but aren't entirely sure how to set everything up in Apache, MySQL, PHP and their markup (HTML, XHTML etc.) to get things rolling. If you're not familiar with character sets and Unicode, the following article by Joel Spolsky is a good place to start. You may also enjoy reading the Character Sets, Encoding and UTF-8 articles on Wikipedia.

Apache


Web browsers need to know which character set to use for the documents you send them. Your server needs to include charset=utf-8 in the Content-Type HTTP header so clients will handle UTF-8 encoded documents correctly. There are a variety of ways to do this; a simple approach is to set UTF-8 as the default character set. Here are a couple of options. Add the one of your choice to httpd.conf or your root .htaccess file:

    # make utf-8 the default charset (applies to text/plain and text/html responses)
    AddDefaultCharset UTF-8

    # or specifically for php, html, xml, javascript etc.
    AddCharset UTF-8 .php
    AddCharset UTF-8 .html
    AddCharset UTF-8 .xml
    AddCharset UTF-8 .js


Figure 1
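
If you can't edit the Apache config at all, the same header can be sent from a script. The following is only a minimal sketch: content_type_header() is a hypothetical helper name, not part of PHP, and the default MIME type is an assumption you should adjust to what you actually serve.

```php
<?php
// Hypothetical helper: builds the Content-Type header line with a charset.
function content_type_header(string $mime = 'text/html', string $charset = 'utf-8'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mime, $charset);
}

// header() only works before any output has been sent to the client.
if (!headers_sent()) {
    header(content_type_header());
}
```

A header sent this way applies to that one response only, so the Apache directives above remain the simpler choice when you control the server.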

MySQL


Setting the default server character set and collation to UTF-8 in your MySQL config may not be enough: if a client (PHP etc.) requests a different character set, MySQL will oblige. As long as your clients request the character set you want, this is fine.

    [client]
    default-character-set = utf8

    [mysql]
    default-character-set = utf8

    [mysqld]
    # older servers also accepted default-character-set here; it was removed
    # from the server options in MySQL 5.5, so use character-set-server
    character-set-server = utf8
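
On the client side, a PHP script can request the character set explicitly when it connects. This is a rough sketch using the mysqli extension; connect_utf8() is a hypothetical helper name, and the host, credentials and database name are placeholders you would supply yourself.

```php
<?php
// Hypothetical helper: opens a mysqli connection and requests utf8 for it.
// All four arguments are placeholders supplied by the caller.
function connect_utf8(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);
    if ($link->connect_error) {
        throw new RuntimeException('Connect failed: ' . $link->connect_error);
    }

    // Sets the connection character set, much like issuing SET NAMES utf8,
    // and also lets real_escape_string() know which charset is in use.
    if (!$link->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $link->error);
    }

    return $link;
}
```

Calling it would look like $db = connect_utf8('localhost', 'user', 'secret', 'mydb'); with your own values in place of those placeholders.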


On the other hand, if you want to ensure MySQL always stores and retrieves data in the UTF-8 charset regardless of what a client requests, add the following directive to the [mysqld] section of your MySQL config file.

    skip-character-set-client-handshake


You may also want to read the related documentation on the MySQL web site. Use the following queries to see the server variables related to character sets:

    SHOW VARIABLES LIKE 'character_set%';
    SHOW VARIABLES LIKE 'collation%';


PHP


If your scripts will contain multi-byte characters, be sure to save them as UTF-8. Any decent text editor can do this for you (I like Notepad++). In your PHP config, set default_charset and mbstring.internal_encoding to UTF-8.

    default_charset = "UTF-8"
    mbstring.internal_encoding = UTF-8
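
If you can't change php.ini, roughly the same settings can be applied at the top of a script. This is only a sketch of that idea. Note that on PHP 5.6 and later the mbstring.internal_encoding ini setting is deprecated and default_charset is used instead, so the mbstring call below is guarded and may be unnecessary on your version.

```php
<?php
// Runtime equivalent of the default_charset ini setting.
ini_set('default_charset', 'UTF-8');

// Only touch mbstring if the extension is actually loaded.
if (function_exists('mb_internal_encoding')) {
    mb_internal_encoding('UTF-8');
}
```

Put this in a file that every request includes, such as a bootstrap script, so the settings are in place before any output is generated.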


Now that your web application supports UTF-8, you'll have to adjust the way you validate user-supplied input. What you need to do is check for well-formed UTF-8. I wrote the following PHP function based on this W3C regular expression. One of the modifications I made was to accept only code points up to U+FFFF, because higher planes are not currently supported by MySQL (I believe they are silently rejected?).

The reason I break the string into smaller chunks is to keep memory usage down: if the regular expression needs too much memory (more than the stack size), the Apache process will be terminated.

    function valid_utf8( $string ) {

        // well-formed UTF-8 up to U+FFFF (4-byte sequences are not accepted)
        $pattern = '%^(?:'.
            '[\x09\x0A\x0D\x20-\x7E]|'.             # ascii
            '[\xC2-\xDF][\x80-\xBF]|'.              # non-overlong 2-byte
            '\xE0[\xA0-\xBF][\x80-\xBF]|'.          # excluding overlongs
            '[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|'.   # straight 3-byte
            '\xED[\x80-\x9F][\x80-\xBF]'.           # excluding surrogates
        ')*$%D';

        // validate in 2000-character chunks to keep memory usage down
        $chunks = ceil( mb_strlen($string, 'UTF-8') / 2000 );
        for ( $i = 0; $i < $chunks; $i++ ) {

            $chunk = mb_substr($string, $i * 2000, 2000, 'UTF-8');
            if ( preg_match($pattern, $chunk) !== 1 ) return false;

        }

        return true;
    }
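
As a cross-check, the same idea can be sketched with PCRE's built-in UTF-8 validation, which is triggered by the u modifier. This is an alternative, not the function above: is_bmp_utf8() is a hypothetical name, and unlike valid_utf8() it does not reject ASCII control characters. It only checks that the input is well-formed and contains nothing above U+FFFF.

```php
<?php
// Hypothetical alternative: well-formed UTF-8 limited to U+0000..U+FFFF.
function is_bmp_utf8(string $s): bool
{
    // With the u modifier, preg_match() returns false on malformed UTF-8
    // (including overlong forms and encoded surrogates).
    if (preg_match('//u', $s) !== 1) {
        return false;
    }

    // Reject any code point outside the Basic Multilingual Plane.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $s) === 0;
}

var_dump(is_bmp_utf8("h\xC3\xA9llo"));      // valid 2-byte sequence
var_dump(is_bmp_utf8("\xC3\x28"));          // broken continuation byte
var_dump(is_bmp_utf8("\xF0\x9F\x98\x80"));  // 4-byte sequence above U+FFFF
```

Because the empty pattern does no backtracking, this variant should not need the chunking described above, though it is worth testing with your own input sizes.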


Markup


When it comes to your markup, there are primarily two things that need to be done. The first is encoding documents in UTF-8 and the second is declaring the UTF-8 character set in the document itself: a meta tag in the <head> for HTML and XHTML, or the XML declaration for XML.

    <!-- for html, xhtml etc. -->
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />


    <?xml version="1.0" encoding="utf-8"?>
    <!-- for xml: the declaration must be the very first thing in the document -->


JavaScript


JavaScript versions 1.3 and newer support Unicode and use it internally, so there is usually nothing special that needs to be done on that front. You should encode your JavaScript files in UTF-8 though, and be sure charset=utf-8 is included in the Content-Type header when serving these files. If you want to be really thorough you could also add charset="utf-8" to your script tags, although the HTTP header is really all you need.

For more information, check out the related article on the Mozilla Developer Center. A script tag with the charset attribute looks like this:

    <script type="text/javascript" charset="utf-8" src="/path/to/script"></script>