Closed Bug 554210 Opened 10 years ago Closed 10 years ago

[k] UnicodeDecodeError when search results contain non-ascii characters

Categories

(support.mozilla.org :: Search, defect, P1, blocker)

defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: paulc, Assigned: paulc)

References

()

Details

Attachments

(8 files, 2 obsolete files)

Mostly this applies to search results in other languages. The problem appears to be related to MySQL's storage of data as latin1 encoded, which Django/Python has trouble displaying.
So far considered solutions:

1. encoding/decoding data coming from the database and leaving the schema intact

2. converting the affected columns to utf8 (tiki_pages.pageName, .data, and tiki_comments.title, .data)

3. adding two columns for each of those tables holding duplicate data encoded in utf8.
Attached file unicode duplicates
I'm documenting this partly so we know what has been attempted, and partly because someone may have an idea that I've missed.

Until now it looks like option 1. doesn't offer a good solution, since MySQL's understanding of latin1 is different from python's understanding.
At first I found a page saying MySQL's latin1 is actually cp1252:
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
But cp1252, while different from latin1, doesn't seem to work for non-roman characters, e.g. Russian.


For option 2, MySQL treats utf8 characters which should technically be different as equal (á and a, for example). This causes duplicate keys in the wiki articles and due to how the internal links work in the wiki syntax, we can't drop the unique key on pageName. We could consider rename the duplicates for wiki pages, since there are just a few.
I've attached a SQL file which illustrates the utf8 problem, and should be easy to understand through the comments. Just run it on any database. On my machine it produced consistent duplicate results.

Option 3 needs further investigation, to verify that the Tiki configuration works with the schema changes. We would need to update the Tiki code to save this data in two columns.
Attached file list of KB duplicates (obsolete) —
(In reply to comment #2)
> Option 3 needs further investigation, to verify that the Tiki configuration
> works with the schema changes. We would need to update the Tiki code to save
> this data in two columns.

Here are some results on this:
* the modifications necessary to the tiki code are not too complicated, there are 4 files updating the pageName or data fields in tiki_pages.
* fortunately, conversion in php is not necessary
* unfortunately, saving data in two different encodings doesn't work in the same query, so this would increase the number of queries on tiki-editpage by at least 3. This code piece summarizes why the increase:
---
    $this->query("SET NAMES utf8");
    $result = $this->query("UPDATE tiki_pages SET dataUTF=? WHERE pageNameUTF=?", array($data, $name));
    $this->query("SET NAMES latin1");
---
This increase will be even more visible on page renames (roughly double the number of queries).

What about another approach: 
* leave the tiki code completely as is
* have a separate script run under cron, only looking at wiki pages changed or created after its previous run, and syncing the UTF columns to their latin1 counterparts
* we would schedule it to finish before a new search indexing operation starts (or even trigger the reindexing operation?)

FTR, creating the two columns and converting the data for all wiki pages using pure SQL didn't take long (<1m).


Thoughts?
Duplicate of this bug: 557446
Assignee: paulc → nobody
Component: General → Search
QA Contact: general → kb-software
Assignee: nobody → paulc
Update on this: so we've agreed that the duplicates will not be a problem and we can go ahead and convert the encoding to utf8. I spent time today looking at ways to convert without potentially losing data and haven't found any.
See e.g.
http://www.mysqlperformanceblog.com/2009/03/17/converting-character-sets/
http://www.haidongji.com/2008/11/11/convert-character-set-to-utf8-in-mysql/

So far I've yet to try a way that works without any MySQL warnings (data loss). Hope to do better than this latter one :-)
http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html
For tiki_pages, the conversion process went pretty well. Some observations:

On a database dump from March 17, MySQL showed:
* 5 warnings for `pageName`
* 1 warning for `data` field
* 269 warnings for `description` field. However, this field may be entirely unused (was used by old sphinx XML indexer). Need to investigate.

With the exception of `description`, I double checked each of the values with warnings and the data looks properly converted. In each case there were trailing invalid characters in the latin1 encoding itself (tested with command line mysql).

Testing the newly converted utf8 table with the Tiki codebase, however, I see malformed data. It looks like we can't mix encodings in PHP without running "SET NAMES charset" before querying a table.

One possible solution is making the entire Tiki database use utf8. Since most of the data is either wiki pages or forum posts, the rest should be less hassle.
Attached file summary of warnings
I've finally converted the entire database to utf8. Here is a summary of the warnings along with some details.

Although I have not done a thorough investigation of each warning, I haven't seen any case where the characters were valid in the initial latin1 encoding, i.e. all the errors I saw came from attempting to store utf8 data in latin1 encoding in the first place.

I have not included the warnings received for tiki_pages.description or tiki_comments.description, since I think we are not using these columns any longer. We used to use them for old search, but need to double check.

Note: most of the warnings in tables related to tiki_pages come from storing the same page titles in other tables (e.g. tiki_objects, tiki_links, tiki_history).

We also have a somewhat large number of warnings in tiki_comments, but very few are actual contents. Most of them are related to anonymous usernames. (in_reply_to, message_id, userName columns).

IMHO, the results are really good.
Attached patch SQL to convert entire DB (obsolete) — Splinter Review
This is the giant SQL that can be run on the database to convert all the tables to UTF8. Last time I checked, it took 1h+ to run on my machine.

Assumptions:
* there are no duplicates
* the script runs uninterrupted from start to finish (stopping and restarting it will likely throw errors)

It does the following:
* starts with all tables with data in them, drops any text/varchar/char keys, converts everything to blob and then to utf8
* for empty tables, there's no data to worry about so it just sets the table/columns encoding to utf8
* drops the 4 unused search tables
* (optional) last line changes the database encoding (this is commented out)

James: on that note, the script can be split up into converting (1) tables with data and (2) empty tables, if you wish.


For duplicates, there are 11 wiki_pages colliding in my local db copy (from march 17), and these cause problems with unique/primary keys in the following tables:
* tiki_history
* tiki_links
* tiki_pages
* tiki_stats

Renaming them through tiki-rename_page.php would take care of all the collisions *except* those in tiki_stats (which we could probably erase).
Attachment #438611 - Flags: review?(james)
I should add that I didn't write all of that SQL. This perl script helped *a lot*:
http://www.pablowe.net/convert_charset
And this is the much much shorter code change necessary for Tiki to run on utf8.
Attachment #438617 - Flags: review?(james)
Priority: -- → P1
Severity: normal → blocker
Here is a better list of duplicates, has page_id, locale, and title pairs.
Attachment #434121 - Attachment is obsolete: true
Attached file delete duplicates
This helped me delete duplicates. I ran it on Monday's dump, before the conversion SQL and then the conversion ran uninterrupted.
The internet was offline, so I spent some time finding and commenting out all the writes to tiki_stats. Looked through the code for "tiki_stats", "stats", and "statslib". I also asked Cheng, and we don't use this table for metrics.

I don't think we should drop the table, but I truncated it and see no problems.
Attachment #439282 - Flags: review?(james)
Comment on attachment 438611 [details] [diff] [review]
SQL to convert entire DB

This is extremely long, but it worked as intended, as far as I can tell. (I've been looking around in French, Russian and Japanese.) So far it looks like this has worked! Fantastic that you eventually figured it out. 

(I did have to delete the duplicates to make it work, so we need to get those names changed or pages delete ASAP, including on stage.)

r+. When you file a bug for IT to run this, be sure to specify support-stage and support-stage-new. And maybe we should get the DBs resynced from prod before this is run, so we're sure we're testing up-to-date data.
Attachment #438611 - Flags: review?(james) → review+
Comment on attachment 438617 [details] [diff] [review]
make tiki db connection use utf8

WFM.
Attachment #438617 - Flags: review?(james) → review+
(In reply to comment #15)
> When you file a bug for IT to run this, be sure to specify support-stage
> and support-stage-new. And maybe we should get the DBs resynced from prod
> before this is run, so we're sure we're testing up-to-date data.
Now that the databases are synced per bug 559717, we should rename the wiki pages. I should probably file the IT bug for running this SQL after we do that.
Forgot to add this to the Sphinx conf, as well, which results in some interesting search results.
Attachment #439623 - Flags: review?(paulc)
Comment on attachment 439623 [details] [diff] [review]
add SET NAMES to sphinx.conf

WFM. I'll patch up Kitsune
Attachment #439623 - Flags: review?(paulc) → review+
r66067 in trunk for the sphinx conf.
Depends on: 559997
r66180 on trunk, landed attachment 438617 [details] [diff] [review] -- use utf8 on db connections.
Blocks: 560759
Same as attachment 438611 [details] [diff] [review], except it also deals with the remaining duplicates.

Assumes that all the approved articles in attachment 439135 [details] have been renamed.
Attachment #438611 - Attachment is obsolete: true
Attachment #440403 - Flags: feedback?(james)
Last week I tried to rename or remove the articles mentioned here, but it didn't work, only threw an error. The important locales had changed the names already. But reading through this bug I'm wondering, is it still needed to remove those remaining pairs?
(In reply to comment #24)
> is it still needed to
> remove those remaining pairs?

Yes. We can not have duplicates in the database. Page names have a UNIQUE index, and so must be unique. The definition of "unique" changed a little.

If they can't be removed via the web site, we'll need to drop or rename the pages directly in the database. I'd prefer to minimize the number of times we have to do that, as I'm afraid of leaving orphaned data.
This is done from the dev side. Bug 560759 is for the SUMO team to rename duplicates on production, and we'll file an IT bug for the relevant scripts to run on production when push time comes around.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Attachment #439282 - Flags: review?(james)
Attachment #440403 - Flags: feedback?(james)
You need to log in before you can comment on or make changes to this bug.