Closed Bug 485386 Opened 16 years ago Closed 16 years ago

newsearch XML vs DB encoding causes issues

Tracking

(Not tracked)

Status:

VERIFIED FIXED

Milestone:

1.0.2

People

(Reporter: paulc, Assigned: ecooper)

Details

(Whiteboard: sumo_only)

Attachments

(2 files)

Force UTF-8 and maintain valid entites and XML (16 years ago, E. Cooper [:ecooper], 9.27 KB, patch; paulc: review+, laura: review+)
Make extra sure HTML entities are entities during forum indexing (16 years ago, E. Cooper [:ecooper], 401 bytes, patch; laura: review+)

Paul Craciunoiu [:paulc]

Reporter

Description

•

16 years ago

As per bug 483811 comment 6, the MySQL DB articles are encoded in Latin-1, while the XML encoding that newsearch currently uses is UTF-8. This creates problems, as you can see in the menu reference article results: "Preferencesâ�?�¦" should actually be "Preferences…" (the three dots are *one* character that doesn't exist in UTF-8 apparently). I'm sure this character is not the only one. We should either write a text processor to handle these differences, or change the encoding of one of them (most likely the XML side). Currently, the newsearch indexer simply forces the DB connection encoding to UTF-8 before querying the DB (SET NAMES 'utf-8'), so the articles already contain bad characters right out of the DB queries. I don't have much experience with this, so I'd really like someone's help to figure out the best approach.
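For readers following along, here is a minimal Python sketch of the kind of mismatch described above. It is an illustration, not the indexer's code. It assumes the stored bytes are UTF-8 and are being read as a single-byte charset (Windows-1252 here). Note that the ellipsis does have a UTF-8 encoding (three bytes, E2 80 A6); it is ISO-8859-1 that lacks it.

```python
# Illustrative only: reproduce the kind of mojibake described in this report.
original = "Preferences\u2026"  # "Preferences…" (U+2026 HORIZONTAL ELLIPSIS)

# In UTF-8 the ellipsis is three bytes: E2 80 A6.
utf8_bytes = original.encode("utf-8")

# Assumption: something in the pipeline reads those bytes as Windows-1252.
# The single ellipsis then becomes three unrelated characters.
garbled = utf8_bytes.decode("cp1252")

# The damage is reversible as long as no bytes were dropped along the way.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

If the round trip in the last step restores the text, the data itself is probably intact and only the declared encoding at one hop is wrong.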

Nelson Ko [:nkoth]

Comment 1

•

16 years ago

My understanding is that the XML spec uses UTF-8 as the default encoding, so I think it may be a good idea to preserve that. Are the DB articles really encoded in Latin-1? I thought TikiWiki uses UTF-8 by default (in my experience it does). Moreover, we have articles with content in double-byte languages. How does that work if they are in Latin-1? I think we should standardize on UTF-8 and just remove characters like that triple dot that don't exist in UTF-8. How did that get entered, then, if TikiWiki uses UTF-8?

Paul Craciunoiu [:paulc]

Reporter

Comment 2

•

16 years ago

Maybe TikiWiki does use UTF-8 (that is, maybe our DB is set to that collation). But then we still need an answer to how those characters are being displayed and stored in the database. I'll look into it, but this issue is acting really weird, at least on my local copy :)

David Tenser [:djst]

Comment 3

•

16 years ago

Seems like the first step is to figure out which side isn't using UTF-8. As Nelson says, that's the standard and should definitely be preserved. About the "..." character, are there other examples of characters that mess up the summaries? If not, we should just remove the "..." characters and call it a day.

E. Cooper [:ecooper]

Assignee

Comment 4

•

16 years ago

I should be able to post a fix for this sometime tonight.

E. Cooper [:ecooper]

Assignee

Comment 5

•

16 years ago

Currently I'm having issues with multibyte characters getting split when using mb_substr or mb_strcut to build the description, which corrupts the last character of the description. I'm going to take another look tomorrow morning.
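The comment does not say what caused the split, so the following Python sketch only shows the general failure mode: cutting encoded text at a byte offset that falls inside a multibyte character. It does not reproduce the behavior of mb_substr or mb_strcut; the function names and the sample string are made up for illustration.

```python
def truncate_bytes_naive(text: str, max_bytes: int) -> str:
    """Cut the UTF-8 encoding at a byte offset, ignoring character boundaries."""
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="replace")


def truncate_bytes_safe(text: str, max_bytes: int) -> str:
    """Cut at the same byte budget but drop a trailing partial character."""
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


sample = "ブックマーク"  # six characters, three bytes each in UTF-8

# Seven bytes cover two whole characters plus one byte of the third.
naive = truncate_bytes_naive(sample, 7)  # ends in U+FFFD, the corrupted tail
safe = truncate_bytes_safe(sample, 7)    # ends cleanly after two characters

print(repr(naive))
print(repr(safe))
```

Truncating by character count on an already decoded string avoids the problem entirely; the byte-budget variant is shown because summaries are often limited by size.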

Assignee: nobody → smirkingsisyphus

Laura Thomson :laura

Updated

•

16 years ago

Target Milestone: 1.0 → 1.0.2

E. Cooper [:ecooper]

Assignee

Comment 6

•

16 years ago

Attached patch: Force UTF-8 and maintain valid entites and XML

This patch should solve the search encoding problem, broken entities, and malformed entities. Please QA extra hard, but as far as I can tell, everything is valid and working. After doing a bit of research, I decided to put the more complicated data into CDATA blocks. From what I read, the Sphinx XML parser is a custom solution, but it seems to handle CDATA blocks properly. This solves the broken entities and all the other issues with the multibyte entities. Forcing UTF-8 via mb_convert_encoding ensures the data is UTF-8. The fixes in this patch relegate the changes made in bug 483811. Remember, after patching, you need to delete lastindexingtime.txt and lastindexingtime.txt-f and then reindex.
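As a rough illustration of the CDATA approach (not the patch itself, which is PHP and is not reproduced here), the Python sketch below wraps arbitrary text in a CDATA section and emits the document as UTF-8. The element names `doc` and `body` and the helper names are hypothetical and are not the indexer's actual schema; `str.encode("utf-8")` stands in for the role mb_convert_encoding plays in the patch.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any embedded ']]>' terminator."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def build_document(body: str) -> bytes:
    """Build a tiny UTF-8 XML document; <doc> and <body> are made-up names."""
    xml = (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        "<doc><body>" + to_cdata(body) + "</body></doc>"
    )
    return xml.encode("utf-8")


# Markup, a bare ampersand, an HTML entity, and a CDATA terminator all survive.
raw_body = "Tools > Preferences\u2026 &hellip; <b>bold</b> & ]]> more"
parsed = ET.fromstring(build_document(raw_body))

print(parsed.find("body").text == raw_body)
```

Inside CDATA the parser treats `&` and `<` as literal text, which is why entity problems disappear; the one sequence that still needs care is `]]>`, handled here by closing and reopening the section.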

Attachment #370115 - Flags: review?(paul.craciunoiu)

E. Cooper [:ecooper]

Assignee

Updated

•

16 years ago

Attachment #370115 - Flags: review?(laura)

Paul Craciunoiu [:paulc]

Reporter

Comment 7

•

16 years ago

Comment on attachment 370115: Force UTF-8 and maintain valid entites and XML

Great! Thanks Eric. I like the approach. Seems to produce valid XML - or at least valid enough for the expat parser to work. The encoding turns out fine, tested on the "Menu Reference" and a few other pages in the "bookmark" search result. Now I can go in and do the other bug fixes :)

Attachment #370115 - Flags: review?(paul.craciunoiu) → review+

Paul Craciunoiu [:paulc]

Reporter

Updated

•

16 years ago

Blocks: 483811

Chris Ilias [:cilias]

Updated

•

16 years ago

Blocks: 405028

Paul Craciunoiu [:paulc]

Reporter

Comment 8

•

16 years ago

Laura, can you please review Eric's patch? It's kind of blocking some other newsearch bugs for me.

Severity: normal → critical

Laura Thomson :laura

Updated

•

16 years ago

Attachment #370115 - Flags: review?(laura) → review+

E. Cooper [:ecooper]

Assignee

Comment 9

•

16 years ago

r24356/r24357

To avoid confusion, have any IT bugs for reindexing staging been opened in the past (similar to bug 460213)?

Laura Thomson :laura

Updated

•

16 years ago

Priority: -- → P1

Stephen Donner [:stephend] Not actively reading bugmail

Comment 10

•

16 years ago

Sorry for stealing your thunder, Eric, but I filed bug 488374, just so we could potentially have more time to test.

E. Cooper [:ecooper]

Assignee

Updated

•

16 years ago

Depends on: 488374

Paul Craciunoiu [:paulc]

Reporter

Comment 12

•

16 years ago

Eric, I think this can be marked fixed, since you landed the patch.

Stephen Donner [:stephend] Not actively reading bugmail

Updated

•

16 years ago

URL: http://support-stage.mozilla.org/tiki... → https://support-stage.mozilla.org/tik...

Stephen Donner [:stephend] Not actively reading bugmail

Comment 13

•

16 years ago

(In reply to comment #12)
> Eric, I think this can be marked fixed, since you landed the patch.

Except that the case in bug 483811 is still happening on https://support-stage.mozilla.org/tiki-newsearch.php?locale=en&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author=, probably because reindexing hasn't been re-run on staging (bug 488374). I think we should leave this bug open until that's been fixed.

E. Cooper [:ecooper]

Assignee

Comment 14

•

16 years ago

While there is a problem reindexing the forums, all encoding issues for articles (and up to the point where the forum indexing failed) should be resolved. Feel free to begin poking around for initial testing.

E. Cooper [:ecooper]

Assignee

Comment 15

•

16 years ago

Attached patch: Make extra sure HTML entities are entities during forum indexing

For whatever reason, HTML entities weren't converted before reaching the indexer for the forums like they are for the wiki page indexer. I have no idea why such an inconsistency exists. This will fix the problem. I used the data from <https://bug488374.bugzilla.mozilla.org/attachment.cgi?id=373881> in a forum post as a test. Paul, if you're feeling up to it, feel free to review this as well.
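The patch contents are not shown in this thread, so the following is only a sketch of one way such a conversion can work, written in Python rather than the indexer's PHP. It assumes the problem is that named HTML entities are undefined in plain XML; `normalize_for_xml` and the `post` element are hypothetical names.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def normalize_for_xml(fragment: str) -> str:
    """Decode HTML entities to characters, then escape only what XML requires."""
    return escape(html.unescape(fragment))


post = "Bookmarks &amp; Lesezeichen&hellip; caf&eacute;"

# Named HTML entities such as &hellip; are not defined in plain XML,
# so an expat-based parser rejects the raw text.
try:
    ET.fromstring("<post>" + post + "</post>")
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# After normalization only &amp;, &lt; and &gt; remain as entities.
clean = ET.fromstring("<post>" + normalize_for_xml(post) + "</post>").text

print(raw_parses)
print(clean)
```

Decoding first and escaping second keeps the step idempotent for text that was already plain, and it means the forum and wiki paths can share the same normalization.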

Attachment #373973 - Flags: review?(laura)

Laura Thomson :laura

Updated

•

16 years ago

Attachment #373973 - Flags: review?(laura) → review+

E. Cooper [:ecooper]

Assignee

Comment 16

•

16 years ago

r24715/r24716

E. Cooper [:ecooper]

Assignee

Comment 17

•

16 years ago

Staging has been reindexed. To verify, just check to make sure locales like de, es, ja, etc., are all clean of broken entities.
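For a quick pass over result text, a heuristic scan along these lines may help. It is a sketch in Python, not the check that was actually used for verification; the patterns are assumptions about what broken output tends to look like and will miss some cases and flag some legitimate text.

```python
import re

# Heuristic patterns only; these are assumptions, not an exhaustive list.
SUSPECT = re.compile(
    r"\ufffd"                  # U+FFFD left behind by a lossy decode
    r"|\u00e2\u20ac"           # "â€": UTF-8 punctuation read as Windows-1252
    r"|\u00c3[\u0080-\u00bf]"  # "Ã" + one more character: accented letters misread
    r"|&amp;#?\w+;"            # double-escaped entity such as "&amp;hellip;"
)


def find_suspects(text: str) -> list[str]:
    """Return every substring that looks like encoding or entity damage."""
    return SUSPECT.findall(text)


print(find_suspects("Lesezeichen verwalten\u2026"))
print(find_suspects("Preferences\u00e2\u20ac\u00a6"))
print(find_suspects("caf\u00c3\u00a9 &amp;hellip;"))
```

An empty list for a page of results is a reasonable sign the locale is clean, though a manual look at a few Japanese and German results is still worthwhile.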

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Stephen Donner [:stephend] Not actively reading bugmail

Comment 18

•

16 years ago

[1] DE: https://support-stage.mozilla.org/tiki-newsearch.php?locale=de&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author
[2] JA: https://support-stage.mozilla.org/tiki-newsearch.php?locale=ja&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author
[3] ES: https://support-stage.mozilla.org/tiki-newsearch.php?locale=es&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author
[4] EN: https://support-stage.mozilla.org/tiki-newsearch.php?locale=en&q=bookmark&where=d&sa=&filter_lang=1&l=en&en_too=1&lastmodif=0&type=0&author

I'll keep testing, too, but marking this as verified fixed.

Status: RESOLVED → VERIFIED

David Tenser [:djst]

Updated

•

16 years ago

Whiteboard: sumo_only
