Closed Bug 485386 Opened 16 years ago Closed 16 years ago

newsearch XML vs DB encoding causes issues

Categories

(support.mozilla.org :: General, defect, P1)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: paulc, Assigned: ecooper)

Details

(Whiteboard: sumo_only)

Attachments

(2 files)

As per bug 483811 comment 6, the MySQL DB articles are encoded in Latin-1, while the XML encoding that newsearch currently uses is UTF-8. This creates problems, as you can see in the menu reference article results: "Preferencesâ�?�¦" should actually be "Preferences…" (the three dots are *one* character that apparently doesn't exist in UTF-8). I'm sure this character is not the only one. We should either write a text processor to handle these differences, or change the encoding of one of them (most likely the XML side). Currently, the newsearch indexer simply forces the DB encoding to UTF-8 before querying the DB (SET NAMES 'utf-8'), so the articles already contain bad characters right out of the DB queries. I don't have much experience with this, so I'd really like someone's help to figure out the best approach.
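The garbling described above can be reproduced with a minimal sketch. It assumes the cause is UTF-8 bytes being decoded as Latin-1 somewhere between the DB and the XML output, which the thread does not confirm. The horizontal ellipsis (U+2026) is representable in UTF-8 as three bytes, so a charset mismatch is a more likely explanation than a missing character. The function names are hypothetical.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_latin1_misread(garbled: str) -> str:
    """Undo the misread by reversing the two steps (only works if nothing was lost)."""
    return garbled.encode("latin-1").decode("utf-8")


original = "Preferences\u2026"        # ends with U+2026 HORIZONTAL ELLIPSIS
ellipsis_bytes = "\u2026".encode("utf-8")  # three bytes: E2 80 A6
garbled = misread_utf8_as_latin1(original)  # one character becomes three
restored = repair_latin1_misread(garbled)
```

The round trip only recovers the text when every byte survived the wrong decode, so fixing the encoding at the source is preferable to repairing output after the fact.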
My understanding is that the XML spec uses UTF-8 as the default encoding, so I think it may be a good idea to preserve that. Are the DB articles really encoded in Latin-1? I thought TikiWiki uses UTF-8 by default (my experience is that it does). Moreover, we have articles with content in double-byte languages. How does that work if they are in Latin-1? I think we should standardize on UTF-8 and just remove characters like that triple dot that don't exist in UTF-8. How did that get entered, then, if TikiWiki uses UTF-8?
Maybe TikiWiki does use UTF-8 (that is, maybe our DB is set to that collation). But then we still need an answer to how those characters are being displayed and stored in the database. I'll look into it, but this issue is acting really weird, at least on my local copy :)
Seems like the first step is to figure out which side isn't using UTF-8. As Nelson says, that's the standard and should definitely be preserved. About the "..." character, are there other examples of characters that mess up the summaries? If not, we should just remove the "..." characters and call it a day.
I should be able to post a fix for this sometime tonight.
Currently I'm having issues with multibyte characters getting mangled when using mb_substr or mb_strcut for the description. As a result, the last character of the description ends up corrupted. I'm going to take another look tomorrow morning.
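The truncation problem above is the kind that appears when a string is cut at a byte offset that falls inside a multibyte character. The following is a sketch in Python rather than the PHP used by the indexer, and it is not the fix from the patch. `truncate_utf8` is a hypothetical helper that cuts at a byte limit and drops any trailing partial character.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return text cut to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The input is valid UTF-8, so the only invalid bytes after slicing are a
    # truncated sequence at the very end; errors="ignore" discards exactly that.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


sample = "日本語"  # three characters, three bytes each in UTF-8

# Naive cut: 4 bytes is one full character plus one stray lead byte.
naive = sample.encode("utf-8")[:4].decode("utf-8", errors="replace")

# Boundary-aware cut: the stray byte is dropped instead of being corrupted.
safe = truncate_utf8(sample, 4)
```

Cutting by character count instead of byte count avoids the issue entirely, provided the string has already been decoded with the correct charset.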
Assignee: nobody → smirkingsisyphus
Target Milestone: 1.0 → 1.0.2
This patch should solve the search encoding problem, broken entities, and malformed entities. Please QA extra hard, but as far as I can tell, everything is valid and working. After doing a bit of research, I decided to drop the more complicated data into CDATA blocks. From what I read, the Sphinx XML parser is a custom solution, but it seems to handle CDATA blocks properly. This solves the broken entities and all the other issues with the multibyte entities. Forcing UTF-8 via mb_convert_encoding ensures the data is UTF-8. The fixes in this patch supersede the changes made in bug 483811. Remember, after patching, you need to delete lastindexingtime.txt and lastindexingtime.txt-f and reindex.
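The CDATA approach described above can be sketched as follows. This is an illustration in Python, not the patch itself. The element names `document` and `content` are placeholders and are not taken from the newsearch indexer or the Sphinx schema. The one case CDATA cannot hold verbatim is the terminator `]]>`, which the sketch splits across two sections.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any embedded ']]>' terminator."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def build_document(doc_id: int, body: str) -> bytes:
    """Build a small UTF-8 XML document (hypothetical element names)."""
    xml = (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        f'<document id="{doc_id}"><content>{to_cdata(body)}</content></document>'
    )
    return xml.encode("utf-8")  # force UTF-8 bytes on output


# Markup, ampersands, a multibyte character, and a literal ']]>' all survive.
sample = 'Tools > Options\u2026 & "tricky" ]]> <b>markup</b>'
parsed = ET.fromstring(build_document(1, sample))
round_tripped = parsed.find("content").text
```

CDATA removes the need to escape `<`, `>`, and `&`, but it does not make invalid bytes or disallowed control characters acceptable, so the text still has to be valid UTF-8 before it is wrapped.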
Attachment #370115 - Flags: review?(paul.craciunoiu)
Attachment #370115 - Flags: review?(laura)
Comment on attachment 370115 (Force UTF-8 and maintain valid entites and XML): Great! Thanks Eric. I like the approach. Seems to produce valid XML - or at least valid enough for the expat parser to work. The encoding turns out fine, tested on the "Menu Reference" and a few other pages in the "bookmark" search result. Now I can go in and do the other bug fixes :)
Attachment #370115 - Flags: review?(paul.craciunoiu) → review+
Blocks: 483811
Blocks: 405028
Laura, can you please review Eric's patch? It's kind of blocking some other newsearch bugs for me.
Severity: normal → critical
Attachment #370115 - Flags: review?(laura) → review+
r24356/r24357. To avoid confusion, have any IT bugs for reindexing staging been opened in the past (similar to bug 460213)?
Priority: -- → P1
Sorry for stealing your thunder, Eric, but I filed bug 488374, just so we could potentially have more time to test.
Depends on: 488374
Eric, I think this can be marked fixed, since you landed the patch.
(In reply to comment #12)
> Eric, I think this can be marked fixed, since you landed the patch.
Except that the case in bug 483811 is still happening on https://support-stage.mozilla.org/tiki-newsearch.php?locale=en&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author=, probably because reindexing hasn't been re-run on staging (bug 488374). I think we should leave this bug open until that's been fixed.
While there is a problem reindexing the forums, all encoding issues for articles (and up to the point where the forum indexing failed) should be resolved. Feel free to begin poking around for initial testing.
For whatever reason, HTML entities weren't converted before reaching the indexer for the forums like they are for the wiki page indexer. I have no idea why such an inconsistency exists. This will fix the problem. I used the data from <https://bug488374.bugzilla.mozilla.org/attachment.cgi?id=373881> in a forum post as a test. Paul, if you're feeling up to it, feel free to review this as well.
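The step described above, converting HTML entities to real characters before text reaches the indexer, might look like the following sketch. It is written in Python for illustration and is not the forum indexer patch. `normalize_for_index` is a hypothetical name, and `html.unescape` is the standard-library call doing the work.

```python
import html


def normalize_for_index(raw: str) -> str:
    """Convert named and numeric HTML entities to the characters they stand for."""
    return html.unescape(raw)


# Named entities, a numeric entity, and plain text in one forum-style string.
forum_post = "Preferences&hellip; &amp; caf&eacute; &#8230;"
indexed_text = normalize_for_index(forum_post)
```

Applying the same normalization to both the wiki and forum paths keeps search summaries consistent, whichever indexer produced them.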
Attachment #373973 - Flags: review?(laura)
Attachment #373973 - Flags: review?(laura) → review+
r24715/r24716
Staging has been reindexed. To verify, just check to make sure locales like de, es, ja, etc., are all clean of broken entities.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Whiteboard: sumo_only