UTF8 character set isn't used when connecting to MySQL



addons.mozilla.org Graveyard
Public Pages
8 years ago
2 years ago


(Reporter: clouserw, Assigned: davedash)




(1 attachment, 1 obsolete attachment)



8 years ago
Bug 503502 found that we aren't setting the correct character set when we connect to MySQL.  MySQL defaults to latin1 (which is probably what we're getting), and we should be running "SET NAMES 'utf8'" when our connection fires up.  An example:

mysql> show variables like 'character_set_%';
| Variable_name            | Value                      |
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 

mysql> select tags from text_search_summary where match(tags) against('海');
| tags     |
| á,v,海 | 
1 row in set (0.00 sec)

mysql> SET NAMES 'UTF8';
Query OK, 0 rows affected (0.00 sec)

mysql> select tags from text_search_summary where match(tags) against('海');
Empty set (0.00 sec)

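The second query comes back empty because our ft_min_word_len is 2: with the right encoding 海 is a single character, which is below that minimum, while without the right encoding 海 is counted as 3 characters (one per byte), so it matched the first time.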
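To make the length difference concrete, here is a rough Python sketch. It only models the character count that is compared against ft_min_word_len; MySQL's actual full-text parser does more than this, and the helper names (char_count, long_enough) are made up for illustration.

```python
# Simplified model: count characters the way each connection charset would,
# then compare against the minimum word length quoted in this bug.
FT_MIN_WORD_LEN = 2  # value from the bug report


def char_count(raw: bytes, charset: str) -> int:
    """Number of characters seen when the raw bytes are read as `charset`."""
    return len(raw.decode(charset))


def long_enough(raw: bytes, charset: str, min_len: int = FT_MIN_WORD_LEN) -> bool:
    """True if the word would meet the minimum length under `charset`."""
    return char_count(raw, charset) >= min_len


word = "海".encode("utf-8")  # three bytes: e6 b5 b7

# Read as latin1, each byte is its own character, so the word passes.
print(char_count(word, "latin-1"), long_enough(word, "latin-1"))
# Read as utf8, it is one character and falls below the minimum.
print(char_count(word, "utf-8"), long_enough(word, "utf-8"))
```

Python's "latin-1" codec is used here as a stand-in for MySQL's latin1; the two are not guaranteed to agree on every byte value.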

Before this bug gets fixed we need some serious testing, and we need to consider how the change affects both the existing data and future data.  Completely untested, but Sergey's last comment on http://bugs.mysql.com/bug.php?id=28581 could help migrate existing data.
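As a rough illustration of why existing data needs care (this is not the procedure from the linked MySQL bug, which isn't reproduced here): UTF-8 bytes that get read back as latin1 turn into several unrelated characters, and reversing that reinterpretation recovers the original text. The function names below are hypothetical, and Python's "latin-1" codec only approximates MySQL's latin1.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate UTF-8 bytes being interpreted as latin1 characters."""
    return text.encode("utf-8").decode("latin-1")


def repair_from_latin1(garbled: str) -> str:
    """Undo the misreading: recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "海"
garbled = misread_as_latin1(original)

print(len(original), len(garbled))            # one character became three
print(repair_from_latin1(garbled) == original)
```

Any real migration would have to be tested against a backup first, since rows that were stored correctly must not be converted a second time.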


8 years ago
Duplicate of this bug: 522176
Blocks: 519531
mysql> SELECT count(localized_string) FROM translations WHERE char_length(localized_string) <> length(localized_string)
    -> ;
| count(localized_string) |
|                   36208 | 
1 row in set (3.31 sec)

SELECT tag_text FROM tags WHERE char_length(tag_text) <> length(tag_text)
can get us the affected tags.
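The same test (character length differs from byte length) can be sketched in Python, assuming the text is encoded as UTF-8. The sample tags below are invented for illustration and are not taken from the database.

```python
def has_multibyte(text: str) -> bool:
    """True when the UTF-8 byte length differs from the character length."""
    return len(text) != len(text.encode("utf-8"))


# Hypothetical sample data, standing in for rows of the tags table.
sample_tags = ["firefox", "café", "海", "toolbar"]

affected = [tag for tag in sample_tags if has_multibyte(tag)]
print(affected)  # only the tags containing non-ASCII characters
```

Pure ASCII strings have equal byte and character lengths, so they are never flagged.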
Assignee: nobody → dd
Created attachment 406345 [details] [diff] [review]
utf8 awesome

This patch does the following: 
* Forces all cake connections to use utf8 charset
* Migration script (utf8.sql, will be renamed on commit to have a commit number)
Attachment #406345 - Flags: review?(clouserw)
Created attachment 406381 [details] [diff] [review]
utf8 awesome!

Covering the rest of the tables.  Anything missed can be done at a later time.  We should make sure the data is backed up, and that we log when we run this script, just to cover our bases.

Whether this lands in 5.2 depends on QA.  QA will have to do fairly thorough coverage of the site to make sure we've checked every text string coming from the DB for weird entities.
Attachment #406345 - Attachment is obsolete: true
Attachment #406381 - Flags: review?(clouserw)
Attachment #406345 - Flags: review?(clouserw)


8 years ago
Attachment #406381 - Flags: review?(clouserw) → review+
QA will take this patch for 5.2, and we'll run our Selenium testcases (search.html, search2.html, searchapi.html), plus our Litmus test suite and ad-hoc testing.
Last Resolved: 8 years ago
Resolution: --- → FIXED
Summary: UTF character set isn't used when connecting to MySQL → UTF8 character set isn't used when connecting to MySQL
Verified FIXED; we ran the above, and didn't notice anything amiss.


8 years ago
Blocks: 525827
Product: addons.mozilla.org → addons.mozilla.org Graveyard