Closed Bug 483811 Opened 11 years ago Closed 11 years ago

Need to unescape in KB articles' summaries in search results

Categories

(support.mozilla.org :: General, defect, P1, major)

x86
All

Tracking

(Not tracked)

VERIFIED DUPLICATE of bug 485386

People

(Reporter: stephend, Assigned: paulc)

References

()

Details

Attachments

(1 file)

Steps to Reproduce:

1. Load http://support-stage.mozilla.org/tiki-newsearch.php?where=d&locale=en-US&q=bookmark&sa=
2. Look at the third entry, "How to change the Latest Headlines feed

The quot;Latest Headlines quot; item is a Live Bookmark that is included by default in Firefox. Instead of changing the feed, you can delete it, and add another live bookmark. Right-clickHold dow..."

Expected Results:

We should escape unescape quot;Latest Headlines quot; so that it appears as "Latest Headlines"
Eric, this is the most important "new search is broken" bug to fix at this point. Giving it your highest prio for 1.1.
Assignee: nobody → smirkingsisyphus
Priority: -- → P1
Target Milestone: 1.0 → 1.1
It's not that they need decoded, it's because they're malformed entities...or that seems to be the case. As far as I can tell, utf8entities() defined at /scripts/indexer.php:9 replaces all '&' to ' '. This makes all encoded html entities invalid. 

Nelson can you clear up line 11 from the indexer? I get the rest (removing the first 31 ascii characters in the table), but why "&", ">", "<"? Were they causing problems with the indexer? Am I missing something simple?
The indexing process needs valid XML, which is why entities are being killed. Currently investigating a working solution.
So, we can 'fix' the indexer or add something into smarty to salvage malformed html entities.

I spent sometime over the weekend playing with the indexer, and that's my preferred approach. I'm just not sure I'll have time. Input?
Target Milestone: 1.1 → 1.0
Eric: I can help out with this too if you can't get to it by tomorrow and want help :)
Blocks: 405028
Assignee: smirkingsisyphus → paul.craciunoiu
This fix comes in two flavors (or steps)!

Part 1. Fixed future indexing of encoding for '&<>' in XML.

Part 2. For this to be verifiable on staging, lines
173 of indexer.php and
151 of indexer-forums.php
(the lastModif conditional)
must be commented out, so that old (invalid) descriptions are replaced by new ones.

As long as production hasn't come in contact with indexing yet, there will be no need to do that there. Note that this only fixes the three HTML characters "&<>".

r23784 / r23785


However, while testing this locally, I noticed that certain characters don't encode well in utf-8. For example, on
http://support-stage.mozilla.org/tiki-newsearch.php?where=d&locale=en-US&q=menu+reference&sa=
There are a few odd characters after the word "Preferences". This turns out to go away if the lines for the MySQL character set are commented out in the indexer. However, then the XML encoding needs to be changed...

So we either account for those cases (who knows how many?), or ideally change the encoding to a superset of both Latin-1 and UTF-8, though I'm not sure there is one.

My experience in this is limited, so I'd like some help. I've played around lots though, but didn't find any solution. Ideas?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reopening, since the bug isn't fixed. When looking at the URL you included above, I see a lot of weird things:

- The first result is linkified
- Still lots of weird characters in the results, as you point out
- Also filed bug 485326 for the "Table Of Contents" text appearing in the summaries

Please also ask for review. I know we're on a tight schedule, but seeing how easy it is to create massive regressions, I'd feel much more comfortable if someone reviews the patches before we check them in.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #7)
> Reopening, since the bug isn't fixed. When looking at the URL you included
> above, I see a lot of weird things:
> 
> - The first result is linkified
Could you define linkified?
> - Still lots of weird characters in the results, as you point out
Definitely. But this is not about escaping, as much as doing proper summaries and proper encoding. It's not in the scope of this bug.
> - Also filed bug 485326 for the "Table Of Contents" text appearing in the
> summaries
And neither is this.
> 
> Please also ask for review. I know we're on a tight schedule, but seeing how
> easy it is to create massive regressions, I'd feel much more comfortable if
> someone reviews the patches before we check them in.
Okay, sorry about that. It'll attach the patch when I wake up, so that at least it will be here. Maybe Eric could look over it at least to double check.

As a side note, I think there's a lot more to be done for newsearch, even more than it seems.
(In reply to comment #8)
> (In reply to comment #7)
> > Reopening, since the bug isn't fixed. When looking at the URL you included
> > above, I see a lot of weird things:
> > 
> > - The first result is linkified
> Could you define linkified?

Interesting, it's no longer linkified. I meant that the whole summary was part of the article link.

> > - Still lots of weird characters in the results, as you point out
> Definitely. But this is not about escaping, as much as doing proper summaries
> and proper encoding. It's not in the scope of this bug.

Is there a bug filed for that?

> As a side note, I think there's a lot more to be done for newsearch, even more
> than it seems.

Yeah; sadly I think you're right.
(In reply to comment #9)
> Interesting, it's no longer linkified. I meant that the whole summary was part
> of the article link.
It still is linkified for me, all of them are. I thought this was part of the design... :) Should it be removed?
> 
> > Definitely. But this is not about escaping, as much as doing proper summaries
> > and proper encoding. It's not in the scope of this bug.
> 
> Is there a bug filed for that?
No. I'll file one right now.
I've already committed this, but it's very likely that the fix won't be pushed. So there's still time for a review. In any case, it'll serve as a reference for reverting if need be.
Attachment #369547 - Flags: review?(nelson)
(In reply to comment #10)
> (In reply to comment #9)
> > Interesting, it's no longer linkified. I meant that the whole summary was part
> > of the article link.
> It still is linkified for me, all of them are. I thought this was part of the
> design... :) Should it be removed?

Yes, it should, but when I reported the problem, the summary was also blue (as in the link color). And it was only for the first search result (don't ask me why..). Anyway, not a problem anymore.
(In reply to comment #12)
So is this bug fixed? Or what should it focus on now?
Comment on attachment 369547 [details] [diff] [review]
Patch for XML encoding of "&<>"

This is fine, could have been done a bit less verbosely with str_replace but meh.
Attachment #369547 - Flags: review?(nelson) → review+
(In reply to comment #14)
> (From update of attachment 369547 [details] [diff] [review])
> This is fine, could have been done a bit less verbosely with str_replace but
> meh.
True, I just reused the same function that was there...
Target Milestone: 1.0 → 1.1
Depends on: 485386
Target Milestone: 1.1 → 1.0.2
Should this be closed now?
I think that bug 485386 is taking care of this.
I'm marking this as a dupe of bug 485386, which is root of this bug and relegates the patch in this bug.
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 485386
OK, verified duplicate
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.