Closed Bug 702291 Opened 14 years ago Closed 14 years ago

Collation error on Basket

Categories

(Data & BI Services Team :: DB: MySQL, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wenzel, Assigned: nmaul)

References

Details

Basket's database seems to contain some latin1-encoded fields, which should never happen[tm]. Here's a traceback we've been getting: ----- OperationalError at /subscriptions/subscribe/ (1267, "Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='") /data/www/django/basket.mozilla.com/basket/apps/subscriptions/handlers.py in create def delete(self, request): return rc.NOT_IMPLEMENTED def create(self, request): try: email = request.POST.get('email', '') >>> s = Subscriber.objects.get(email=email) ... except Subscriber.DoesNotExist: try: s = Subscriber(email=email) s.full_clean() s.save() except ValidationError, e:
Jake, did you make a new database when we launched the recent version of basket? This is old code, so it's pointing to a database issue. Can you check that the database collation is utf8?
(In reply to James Long (:jlongster) from comment #1) > Can you check that the database collation is utf8? .. and field collation too. Somewhat strangely, both entire MySQL databases as well as the individual fields inside them have collations associated with them.
db: tm-c01-master01 mysql> use basket_mozilla_com; Database changed mysql> show variables like "collation_database"; +--------------------+-------------------+ | Variable_name | Value | +--------------------+-------------------+ | collation_database | latin1_swedish_ci | +--------------------+-------------------+ 1 row in set (0.00 sec) @mpressman: would we be able to change/alter the collation to utf8 easily?
fwiw, the email on these errors contains Unicode, such as: teshikata2000\uff20yahoo.co.jp
Getting more errors. We need this looked into ASAP.
Priority: -- → P1
I'm afraid to touch this myself. A database has a default charset for new tables. This is what Jason posted in comment 3, I think. A table has a default charset (for new rows? or rows that don't specify? not sure). This can be seen in "show create table <table>". A column has a charset, which is the same as the table default unless specified otherwise. This is also shown in "show create table <table>", except if it's the same as the default then it's not specifically listed. I'm not at all certain of what to do here: http://dev.mysql.com/doc/refman/5.1/en/alter-table.html There are choices here, and it's not obvious to me how to proceed. Do we need to actually *convert* the data in the table to a different character set? I think probably not, but I'm not certain. Do we just need to tell MySQL that the existing data is already utf8, and not actually alter the data, just the column/table definitions? Maybe, but I don't know what "teshikata2000\uff20yahoo.co.jp" will end up looking like... Maybe it would still be "teshikata2000\uff20yahoo.co.jp", but in Unicode (that is, not actually translate the code in the middle). If anyone has more/better info on this, please share. :)
Another note / clarification: If you've been feeding the database UTF-8 data from the start (regardless of what MySQL thought it was), it should be possible to do something like this: ALTER TABLE t1 CHANGE c1 c1 BLOB; ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8; That is, convert the affected column to an intermediary binary type, then to utf8. If the affected column is TEXT, then use BLOB as the intermediary. If it's VARCHAR, then use VARBINARY. If it's CHAR, then use BINARY. Once all the affected columns are changed, it would be a good idea to change the table default as well, so any new columns get made properly: ALTER TABLE tbl_name DEFAULT CHARACTER SET utf8; Once all the tables are done, it might be good to do the same for the DB as a whole, so new tables have the proper default: ALTER DATABASE db_name DEFAULT CHARACTER SET utf8;
According to this django bug: https://code.djangoproject.com/ticket/11950 "... Django always uses utf-8 to talk to the database." I would assume that all data in the database is UTF-8 compatible. If we've ever entered non-latin-1 data, it should be utf-8 encoded. All other data in latin-1 is represented the same way in UTF-8, right? So you're suggestion in comment 7 seems like the way to go. Without access to the database it's hard to verify it though. Can you do a quick select on the Subscriber table and see what the data looks like? I don't understand how there could be any UTF-8 data in there currently if the encoding is latin-1. Has this been a problem from the start or can you think of something that happened when we updated the code (did we recreate the database)? Note that the error is happening when trying to *insert* new data, so we don't need to focus on recovering any current data that's not encoded correctly.
Here's a snippet for converting mysql tables to utf-8: http://michaelangela.wordpress.com/2008/05/07/snippet-to-update-django-tables-to-utf8/ Seems like the way to go.
It's mildly out of scope here, but if there's *any* way we can set up our database server configs so that by default, we never, ever, create a database that's not UTF-8, that would be very much appreciated. It's the umpteenth time we had to deal with this issue in production databases :( It is almost always caused by the person setting up the DB not picking any charset (I can't blame them), so MySQL shrugs and uses latin1-swedish. Broken-by-default is frustrating and stupid. Matt Pressman, do you know if MySQL has such a default setting?
Seems that you can start up MySQL with such an option (the sanest place for config!): http://dev.mysql.com/doc/refman/5.1/en/charset-applications.html
I filed bug 703286 to get the default settings fixed, so I stop scope-creeping this bug.
See Also: → 703286
Assigning this to IT because we need to run the commands in comment 9 (or something like them) to convert basket's database to UTF-8.
Assignee: jlong → server-ops-database
Component: Webdev → Server Operations: Database
QA Contact: webdev → cshields
I'm good with this... very nice to see an example specifically mentioning Django apps, so it feels like someone already did the homework. I'll do this in just a minute.
Assignee: server-ops-database → nmaul
This is completed. [root@mradm02 basket]# python26 manage.py shell Python 2.6.5 (r265:79063, Feb 28 2011, 21:55:45) Type "copyright", "credits" or "license" for more information. IPython 0.10 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object'. ?object also works, ?? prints more. In [1]: from django.db import connection In [2]: cursor = connection.cursor() In [3]: cursor.execute('SHOW TABLES') Out[3]: 21L In [4]: results=[] In [5]: for row in cursor.fetchall(): results.append(row) ...: In [6]: for row in results: cursor.execute('ALTER TABLE %s CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;' % (row[0])) ...: Out[6]: 0L Out[6]: 0L Out[6]: 0L Out[6]: 48L Out[6]: 3L Out[6]: 0L Out[6]: 48L Out[6]: 25L Out[6]: 16L Out[6]: 31L Out[6]: 1L Out[6]: 2L Out[6]: 1890201L Out[6]: 36L Out[6]: 0L Out[6]: 0L Out[6]: 0L Out[6]: 1L Out[6]: 1530320L Out[6]: 1219243L Out[6]: 1L In [7]: quit() Do you really want to exit ([y]/n)? y I this converts all the columns to utf8, then also changes the default for each of the tables to utf8. Afterwards I also did this: mysql> alter database basket_mozilla_com default character set utf8; Query OK, 1 row affected (0.00 sec) mysql> show variables like "c%_database"; +------------------------+-----------------+ | Variable_name | Value | +------------------------+-----------------+ | character_set_database | utf8 | | collation_database | utf8_general_ci | +------------------------+-----------------+ 2 rows in set (0.00 sec)
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.