Closed
Bug 485386
Opened 16 years ago
Closed 16 years ago
newsearch XML vs DB encoding causes issues
Categories
(support.mozilla.org :: General, defect, P1)
Tracking
(Not tracked)
VERIFIED
FIXED
1.0.2
People
(Reporter: paulc, Assigned: ecooper)
Details
(Whiteboard: sumo_only)
Attachments
(2 files)
9.27 KB, patch | paulc: review+ | laura: review+
401 bytes, patch | laura: review+
As per bug 483811 comment 6, the mysql DB articles are encoded in Latin-1, while the XML encoding that newsearch currently uses is UTF-8. This creates problems, as you can see in the menu reference article results:
"Preferencesâ�?�¦" should actually be "Preferences…" (the three dots are *one* character that doesn't exist in UTF-8 apparently). I'm sure this character is not the only one.
We should either do a text processor to handle these differences, or change the encoding of one of them (most likely the XML side).
Currently, the newsearch indexer simply forces the DB encoding to UTF-8 before querying the DB (SET NAMES 'utf-8'), so the articles already contain bad characters right out of the DB queries.
I don't have much experience with this, so I'd really like someone's help to figure out the best approach.
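For reference, the kind of garbling quoted above is what typically appears when UTF-8 bytes are read with a single-byte charset. The following is a minimal Python sketch of that effect, not the indexer's PHP code; using Windows-1252 as the "wrong" charset is an assumption made for illustration.

```python
# Illustration only: UTF-8 bytes misread with a single-byte charset.
original = "Preferences…"              # ends with U+2026 HORIZONTAL ELLIPSIS

# In UTF-8 the ellipsis is one character stored as three bytes (E2 80 A6).
utf8_bytes = original.encode("utf-8")

# Assumption: the reader wrongly treats those bytes as Windows-1252,
# so each of the three bytes becomes its own character.
garbled = utf8_bytes.decode("cp1252")

# If the misreading happened only once, it can be reversed.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

If the same misreading is applied a second time, each of the three junk characters expands again, which may explain longer runs of junk like the one quoted in the description.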
Comment 1•16 years ago
My understanding is that the XML spec uses UTF-8 as default encoding. So I think it may be a good idea to preserve that.
Are the DB articles really encoded in Latin-1? I thought TikiWiki uses UTF-8 by default (my experience is that it does). Moreover, we have articles with content in double-byte languages. How does that work if they are in Latin-1?
I think we should standardize to use UTF-8 and just remove characters like that triple dot that doesn't exist in UTF-8. How did that get entered then if TikiWiki uses UTF-8?
Comment 2•16 years ago (Reporter)
Maybe TikiWiki does use UTF-8 (that is, maybe our DB is set to that collation). But then we still need an answer to how those characters are being displayed and stored in the database. I'll look into it, but this issue is acting really weird, at least on my local copy :)
Comment 3•16 years ago
Seems like the first step is to figure out which side isn't using UTF-8. As Nelson says, that's the standard and should definitely be preserved.
About the "..." character, are there other examples of characters that mess up the summaries? If not, we should just remove the "..." characters and call it a day.
Comment 4•16 years ago (Assignee)
I should be able to post a fix for this sometime tonight.
Comment 5•16 years ago (Assignee)
Currently I'm having issues with multibyte (mb) characters getting split when using mb_substr or mb_strcut for the description, which corrupts the last character of the description.
I'm going to take another look tomorrow morning.
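The general failure mode, independent of the PHP functions named above, is that cutting a UTF-8 string at a byte offset can land in the middle of a character. A minimal Python sketch of the problem and one way around it follows; `truncate_utf8` is a hypothetical helper, not something from the patch.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    cut = text.encode("utf-8")[:max_bytes]
    # The input was valid, so only the tail can be an incomplete sequence;
    # errors="ignore" drops that partial character instead of corrupting it.
    return cut.decode("utf-8", errors="ignore")


sample = "日本語"  # 3 characters, 9 bytes in UTF-8

# Naive byte cut: the 4th byte is the first byte of the second character,
# so the result ends with a replacement character.
naive = sample.encode("utf-8")[:4].decode("utf-8", errors="replace")

# Boundary-aware cut: the partial character is dropped.
safe = truncate_utf8(sample, 4)

print(repr(naive))
print(repr(safe))
```

Truncating by character count instead of byte count avoids the issue as well, at the cost of a less predictable byte length.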
Assignee: nobody → smirkingsisyphus
Updated•16 years ago
Target Milestone: 1.0 → 1.0.2
Comment 6•16 years ago (Assignee)
This patch should solve the search encoding problem, broken entities, and malformed entities. Please QA extra hard, but as far as I can tell, everything is valid and working.
After doing a bit of research, I decided to drop the more complicated data into CDATA blocks. From what I read, the sphinx XML parser is a custom solution, but it seems to handle CDATA blocks properly. This solves the broken entities and all the other issues with the mb entities.
Forcing UTF-8 via mb_convert_encoding ensures the data is UTF-8.
The fixes in this patch supersede the changes made in bug 483811.
Remember, after patching, you need to delete lastindexingtime.txt and lastindexingtime.txt-f and reindex.
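The CDATA approach described in this comment can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the PHP from the attachment: the `doc` and `body` element names and the `to_cdata` helper are hypothetical, and Python's expat-based ElementTree stands in for the Sphinx XML parser.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in a CDATA section so '&' and '<' need no escaping."""
    # "]]>" would close the section early, so split it across two sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Text with an HTML entity, markup, a bare ampersand, a non-ASCII character
# and a stray CDATA terminator, all of which would break naive XML output.
sample = "Tools &rarr; Options… <b>bold</b> & a stray ]]> marker"

xml_doc = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<doc><body>" + to_cdata(sample) + "</body></doc>"
)

# Encode explicitly as UTF-8 so the bytes match the XML declaration.
parsed = ET.fromstring(xml_doc.encode("utf-8"))
body_text = parsed.find("body").text

print(body_text == sample)
```

Inside CDATA the parser treats `&rarr;` as literal text, so entities that XML does not define no longer make the document malformed.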
Attachment #370115 - Flags: review?(paul.craciunoiu)
Updated•16 years ago (Assignee)
Attachment #370115 - Flags: review?(laura)
Comment 7•16 years ago (Reporter)
Comment on attachment 370115
Force UTF-8 and maintain valid entities and XML
Great! Thanks Eric. I like the approach. Seems to produce valid XML - or at least valid enough for the expat parser to work. The encoding turns out fine, tested on the "Menu Reference" and a few other pages in the "bookmark" search result.
Now I can go in and do the other bug fixes :)
Attachment #370115 - Flags: review?(paul.craciunoiu) → review+
Comment 8•16 years ago (Reporter)
Laura, can you please review Eric's patch? It's kind of blocking some other newsearch bugs for me.
Severity: normal → critical
Updated•16 years ago
Attachment #370115 - Flags: review?(laura) → review+
Comment 9•16 years ago (Assignee)
r24356/r24357
To avoid confusion, have any IT bugs for reindexing staging been opened in the past (similar to bug 460213)?
Updated•16 years ago
Priority: -- → P1
Sorry for stealing your thunder, Eric, but I filed bug 488374, just so we could potentially have more time to test.
Comment 12•16 years ago (Reporter)
Eric, I think this can be marked fixed, since you landed the patch.
Updated•16 years ago
(In reply to comment #12)
> Eric, I think this can be marked fixed, since you landed the patch.
Except that the case in bug 483811 is still happening on https://support-stage.mozilla.org/tiki-newsearch.php?locale=en&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author=, probably because reindexing hasn't been re-run on staging (bug 488374).
I think we should leave this bug open until that's been fixed.
Comment 14•16 years ago (Assignee)
While there is a problem reindexing the forums, all encoding issues for articles (and up to the point where the forum indexing failed) should be resolved.
Feel free to begin poking around for initial testing.
Comment 15•16 years ago (Assignee)
For whatever reason, html entities weren't converted before reaching the indexer for the forums like they are for the wiki page indexer.
I have no idea why such an inconsistency exists. This will fix the problem. Using the data from <https://bug488374.bugzilla.mozilla.org/attachment.cgi?id=373881> in a forum post as a test.
Paul, if you're feeling up to it, feel free to review this as well.
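As an illustration of the step described in this comment, converting HTML entities to real characters before text reaches the indexer might look like the sketch below. It is written in Python rather than the patch's PHP, and the sample post text is made up for the example.

```python
import html

# Hypothetical forum post text as it might be stored, with HTML entities.
raw_post = "Einstellungen&hellip; f&uuml;r &quot;Lesezeichen&quot; &amp; Chronik"

# Decode named and numeric entities into the characters they stand for,
# so the indexer sees plain UTF-8 text instead of entity references.
decoded = html.unescape(raw_post)

print(decoded)
```

Decoding first matters because most HTML entity names, such as `&hellip;`, are not defined in plain XML and would otherwise be rejected or indexed as literal text.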
Attachment #373973 - Flags: review?(laura)
Updated•16 years ago
Attachment #373973 - Flags: review?(laura) → review+
Comment 16•16 years ago (Assignee)
r24715/r24716
Comment 17•16 years ago (Assignee)
Staging has been reindexed.
To verify, just check to make sure locales like de, es, ja, etc., are all clean of broken entities.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
[1] DE: https://support-stage.mozilla.org/tiki-newsearch.php?locale=de&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author
[2] JA: https://support-stage.mozilla.org/tiki-newsearch.php?locale=ja&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author
[3] ES: https://support-stage.mozilla.org/tiki-newsearch.php?locale=es&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author
[4] EN: https://support-stage.mozilla.org/tiki-newsearch.php?locale=en&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author
I'll keep testing, too, but marking this as verified fixed.
Status: RESOLVED → VERIFIED
Updated•16 years ago
Whiteboard: sumo_only