Closed Bug 554210 Opened 15 years ago Closed 15 years ago

[k] UnicodeDecodeError when search results contain non-ascii characters

Tracking

(Not tracked)

Status:

VERIFIED FIXED

Milestone:

2.0

People

(Reporter: paulc, Assigned: paulc)

References

(
URL
)

Details

Attachments

(8 files, 2 obsolete files)

unicode duplicates 15 years ago Paul Craciunoiu [:paulc] 785 bytes, text/plain		Details
list of KB duplicates 15 years ago Paul Craciunoiu [:paulc] 985 bytes, text/plain		Details
summary of warnings 15 years ago Paul Craciunoiu [:paulc] 62.65 KB, text/plain		Details
SQL to convert entire DB 15 years ago Paul Craciunoiu [:paulc] 103.83 KB, patch	jsocol : review+	Details \| Diff \| Splinter Review
make tiki db connection use utf8 15 years ago Paul Craciunoiu [:paulc] 500 bytes, patch	jsocol : review+	Details \| Diff \| Splinter Review
better list of duplicates 15 years ago Paul Craciunoiu [:paulc] 3.43 KB, text/plain		Details
delete duplicates 15 years ago Paul Craciunoiu [:paulc] 2.04 KB, text/plain		Details
stop writing to tiki_stats 15 years ago Paul Craciunoiu [:paulc] 5.21 KB, patch		Details \| Diff \| Splinter Review
add SET NAMES to sphinx.conf 15 years ago James Socol [:jsocol, :james] 741 bytes, patch	paulc : review+	Details \| Diff \| Splinter Review
SQL to convert entire db + erase remaining duplicates first 15 years ago Paul Craciunoiu [:paulc] 104.17 KB, patch		Details \| Diff \| Splinter Review

Paul Craciunoiu [:paulc]

Assignee

Description

•

15 years ago

Mostly this applies to search results in other languages. The problem appears to be related to MySQL's storage of data as latin1 encoded, which Django/Python has trouble displaying.

Paul Craciunoiu [:paulc]

Assignee

Comment 1

•

15 years ago

So far considered solutions: 1. encoding/decoding data coming from the database and leaving the schema intact 2. converting the affected columns to utf8 (tiki_pages.pageName, .data, and tiki_comments.title, .data) 3. adding two columns for each of those tables holding duplicate data encoded in utf8.

Paul Craciunoiu [:paulc]

Assignee

Comment 2

•

15 years ago

Attached file unicode duplicates — Details

I'm documenting this partly so we know what has been attempted, and partly because someone may have an idea that I've missed. Until now it looks like option 1. doesn't offer a good solution, since MySQL's understanding of latin1 is different from python's understanding. At first I found a page saying MySQL's latin1 is actually cp1252: http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html But cp1252, while different from latin1, doesn't seem to work for non-roman characters, e.g. Russian. For option 2, MySQL treats utf8 characters which should technically be different as equal (á and a, for example). This causes duplicate keys in the wiki articles and due to how the internal links work in the wiki syntax, we can't drop the unique key on pageName. We could consider rename the duplicates for wiki pages, since there are just a few. I've attached a SQL file which illustrates the utf8 problem, and should be easy to understand through the comments. Just run it on any database. On my machine it produced consistent duplicate results. Option 3 needs further investigation, to verify that the Tiki configuration works with the schema changes. We would need to update the Tiki code to save this data in two columns.

Paul Craciunoiu [:paulc]

Assignee

Comment 3

•

15 years ago

Attached file list of KB duplicates (obsolete) — Details

Paul Craciunoiu [:paulc]

Assignee

Comment 4

•

15 years ago

(In reply to comment #2) > Option 3 needs further investigation, to verify that the Tiki configuration > works with the schema changes. We would need to update the Tiki code to save > this data in two columns. Here are some results on this: * the modifications necessary to the tiki code are not too complicated, there are 4 files updating the pageName or data fields in tiki_pages. * fortunately, conversion in php is not necessary * unfortunately, saving data in two different encodings doesn't work in the same query, so this would increase the number of queries on tiki-editpage by at least 3. This code piece summarizes why the increase: --- $this->query("SET NAMES utf8"); $result = $this->query("UPDATE tiki_pages SET dataUTF=? WHERE pageNameUTF=?", array($data, $name)); $this->query("SET NAMES latin1"); --- This increase will be even more visible on page renames (roughly double the number of queries). What about another approach: * leave the tiki code completely as is * have a separate script run under cron, only looking at wiki pages changed or created after its previous run, and syncing the UTF columns to their latin1 counterparts * we would schedule it to finish before a new search indexing operation starts (or even trigger the reindexing operation?) FTR, creating the two columns and converting the data for all wiki pages using pure SQL didn't take long (<1m). Thoughts?

Stephen Donner [:stephend] Not actively reading bugmail

Updated

•

15 years ago

Assignee: paulc → nobody

Component: General → Search

QA Contact: general → kb-software

Stephen Donner [:stephend] Not actively reading bugmail

Updated

•

15 years ago

Assignee: nobody → paulc

James Socol [:jsocol, :james]

Updated

•

15 years ago

Blocks: 554206

Paul Craciunoiu [:paulc]

Assignee

Comment 6

•

15 years ago

Update on this: so we've agreed that the duplicates will not be a problem and we can go ahead and convert the encoding to utf8. I spent time today looking at ways to convert without potentially losing data and haven't found any. See e.g. http://www.mysqlperformanceblog.com/2009/03/17/converting-character-sets/ http://www.haidongji.com/2008/11/11/convert-character-set-to-utf8-in-mysql/ So far I've yet to try a way that works without any MySQL warnings (data loss). Hope to do better than this latter one :-) http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html

Paul Craciunoiu [:paulc]

Assignee

Comment 7

•

15 years ago

For tiki_pages, the conversion process went pretty well. Some observations: On a database dump from March 17, MySQL showed: * 5 warnings for `pageName` * 1 warning for `data` field * 269 warnings for `description` field. However, this field may be entirely unused (was used by old sphinx XML indexer). Need to investigate. With the exception of `description`, I double checked each of the values with warnings and the data looks properly converted. In each case there were trailing invalid characters in the latin1 encoding itself (tested with command line mysql). Testing the newly converted utf8 table with the Tiki codebase, however, I see malformed data. It looks like we can't mix encodings in PHP without running "SET NAMES charset" before querying a table. One possible solution is making the entire Tiki database use utf8. Since most of the data is either wiki pages or forum posts, the rest should be less hassle.

Paul Craciunoiu [:paulc]

Assignee

Comment 8

•

15 years ago

Attached file summary of warnings — Details

I've finally converted the entire database to utf8. Here is a summary of the warnings along with some details. Although I have not done a thorough investigation of each warning, I haven't seen any case where the characters were valid in the initial latin1 encoding, i.e. all the errors I saw came from attempting to store utf8 data in latin1 encoding in the first place. I have not included the warnings received for tiki_pages.description or tiki_comments.description, since I think we are not using these columns any longer. We used to use them for old search, but need to double check. Note: most of the warnings in tables related to tiki_pages come from storing the same page titles in other tables (e.g. tiki_objects, tiki_links, tiki_history). We also have a somewhat large number of warnings in tiki_comments, but very few are actual contents. Most of them are related to anonymous usernames. (in_reply_to, message_id, userName columns). IMHO, the results are really good.

Paul Craciunoiu [:paulc]

Assignee

Comment 9

•

15 years ago

Attached patch SQL to convert entire DB (obsolete) — Details — Splinter Review

This is the giant SQL that can be run on the database to convert all the tables to UTF8. Last time I checked, it took 1h+ to run on my machine. Assumptions: * there are no duplicates * the script runs uninterrupted from start to finish (stopping and restarting it will likely throw errors) It does the following: * starts with all tables with data in them, drops any text/varchar/char keys, converts everything to blob and then to utf8 * for empty tables, there's no data to worry about so it just sets the table/columns encoding to utf8 * drops the 4 unused search tables * (optional) last line changes the database encoding (this is commented out) James: on that note, the script can be split up into converting (1) tables with data and (2) empty tables, if you wish. For duplicates, there are 11 wiki_pages colliding in my local db copy (from march 17), and these cause problems with unique/primary keys in the following tables: * tiki_history * tiki_links * tiki_pages * tiki_stats Renaming them through tiki-rename_page.php would take care of all the collisions *except* those in tiki_stats (which we could probably erase).

Attachment #438611 - Flags: review?(james)

Paul Craciunoiu [:paulc]

Assignee

Comment 10

•

15 years ago

I should add that I didn't write all of that SQL. This perl script helped *a lot*: http://www.pablowe.net/convert_charset

Paul Craciunoiu [:paulc]

Assignee

Comment 11

•

15 years ago

Attached patch make tiki db connection use utf8 — Details — Splinter Review

And this is the much much shorter code change necessary for Tiki to run on utf8.

Attachment #438617 - Flags: review?(james)

James Socol [:jsocol, :james]

Updated

•

15 years ago

Priority: -- → P1

James Socol [:jsocol, :james]

Updated

•

15 years ago

Severity: normal → blocker

Paul Craciunoiu [:paulc]

Assignee

Comment 12

•

15 years ago

Attached file better list of duplicates — Details

Here is a better list of duplicates, has page_id, locale, and title pairs.

Attachment #434121 - Attachment is obsolete: true

Paul Craciunoiu [:paulc]

Assignee

Comment 13

•

15 years ago

Attached file delete duplicates — Details

This helped me delete duplicates. I ran it on Monday's dump, before the conversion SQL and then the conversion ran uninterrupted.

Paul Craciunoiu [:paulc]

Assignee

Comment 14

•

15 years ago

Attached patch stop writing to tiki_stats — Details — Splinter Review

The internet was offline, so I spent some time finding and commenting out all the writes to tiki_stats. Looked through the code for "tiki_stats", "stats", and "statslib". I also asked Cheng, and we don't use this table for metrics. I don't think we should drop the table, but I truncated it and see no problems.

Attachment #439282 - Flags: review?(james)

James Socol [:jsocol, :james]

Comment 15

•

15 years ago

Comment on attachment 438611 [details] [diff] [review] SQL to convert entire DB This is extremely long, but it worked as intended, as far as I can tell. (I've been looking around in French, Russian and Japanese.) So far it looks like this has worked! Fantastic that you eventually figured it out. (I did have to delete the duplicates to make it work, so we need to get those names changed or pages delete ASAP, including on stage.) r+. When you file a bug for IT to run this, be sure to specify support-stage and support-stage-new. And maybe we should get the DBs resynced from prod before this is run, so we're sure we're testing up-to-date data.

Attachment #438611 - Flags: review?(james) → review+

James Socol [:jsocol, :james]

Comment 16

•

15 years ago

Comment on attachment 438617 [details] [diff] [review] make tiki db connection use utf8 WFM.

Attachment #438617 - Flags: review?(james) → review+

Paul Craciunoiu [:paulc]

Assignee

Comment 17

•

15 years ago

(In reply to comment #15) > When you file a bug for IT to run this, be sure to specify support-stage > and support-stage-new. And maybe we should get the DBs resynced from prod > before this is run, so we're sure we're testing up-to-date data. Now that the databases are synced per bug 559717, we should rename the wiki pages. I should probably file the IT bug for running this SQL after we do that.

James Socol [:jsocol, :james]

Comment 18

•

15 years ago

Attached patch add SET NAMES to sphinx.conf — Details — Splinter Review

Forgot to add this to the Sphinx conf, as well, which results in some interesting search results.

Attachment #439623 - Flags: review?(paulc)

Paul Craciunoiu [:paulc]

Assignee

Comment 19

•

15 years ago

Comment on attachment 439623 [details] [diff] [review] add SET NAMES to sphinx.conf WFM. I'll patch up Kitsune

Attachment #439623 - Flags: review?(paulc) → review+

James Socol [:jsocol, :james]

Comment 20

•

15 years ago

r66067 in trunk for the sphinx conf.

Paul Craciunoiu [:paulc]

Assignee

Updated

•

15 years ago

Depends on: 559997

Stephen Donner [:stephend] Not actively reading bugmail

Comment 21

•

15 years ago

Is the traceback from http://support-stage-new.mozilla.com/en-US/search?q=netscape the same as this bug?

Paul Craciunoiu [:paulc]

Assignee

Comment 22

•

15 years ago

r66180 on trunk, landed attachment 438617 [details] [diff] [review] -- use utf8 on db connections.

Paul Craciunoiu [:paulc]

Assignee

Updated

•

15 years ago

Blocks: 560759

Paul Craciunoiu [:paulc]

Assignee

Comment 23

•

15 years ago

Attached patch SQL to convert entire db + erase remaining duplicates first — Details — Splinter Review

Same as attachment 438611 [details] [diff] [review], except it also deals with the remaining duplicates. Assumes that all the approved articles in attachment 439135 [details] have been renamed.

Attachment #438611 - Attachment is obsolete: true

Attachment #440403 - Flags: feedback?(james)

Kadir Topal [:atopal]

Comment 24

•

15 years ago

Last week I tried to rename or remove the articles mentioned here, but it didn't work, only threw an error. The important locales had changed the names already. But reading through this bug I'm wondering, is it still needed to remove those remaining pairs?

James Socol [:jsocol, :james]

Comment 25

•

15 years ago

(In reply to comment #24) > is it still needed to > remove those remaining pairs? Yes. We can not have duplicates in the database. Page names have a UNIQUE index, and so must be unique. The definition of "unique" changed a little. If they can't be removed via the web site, we'll need to drop or rename the pages directly in the database. I'd prefer to minimize the number of times we have to do that, as I'm afraid of leaving orphaned data.

Paul Craciunoiu [:paulc]

Assignee

Comment 26

•

15 years ago

This is done from the dev side. Bug 560759 is for the SUMO team to rename duplicates on production, and we'll file an IT bug for the relevant scripts to run on production when push time comes around.

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Stephen Donner [:stephend] Not actively reading bugmail

Comment 27

•

15 years ago

Verified FIXED on http://support-stage-new.mozilla.com/zh-CN/search?a=1&q=键盘快捷键 and others.

URL: http://support-stage-new.mozilla.com/...

Status: RESOLVED → VERIFIED

James Socol [:jsocol, :james]

Updated

•

15 years ago

Attachment #439282 - Flags: review?(james)

James Socol [:jsocol, :james]

Updated

•

15 years ago

Attachment #440403 - Flags: feedback?(james)

You need to log in before you can comment on or make changes to this bug.