Closed Bug 657153 Opened 9 years ago Closed 9 years ago

Yahoo search is broken for non-ascii search requests

Categories

(Tech Evangelism Graveyard :: English US, defect, major)

Tracking

(firefox6+ unaffected, firefox7+ unaffected, firefox8+ unaffected)

RESOLVED FIXED

People

(Reporter: unghost, Unassigned)

Details

(Whiteboard: [Yahoo broken by our removing Accept-Charset header])

Attachments

(3 files)

STR:
1) Open http://search.yahoo.com/
2) Type search request in Russian (for example "виски")

Expected results:
List of search results for "виски"

Actual results:
All non-ASCII characters are displayed as ????

Last good build:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011-05-08-03-mozilla-central/

First broken build:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011-05-09-03-mozilla-central/

Regression range:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a8f07cad55e2&tochange=9e31df64bfd7

I suspect that Bug 572652 is the culprit.
FWIW, faking user agent as Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1) helps.
Duplicate of this bug: 657156
Email sent to Yahoo! asking them to investigate. I assume this would be a Firefox 6 change if it stays on track.
Note to Yahoo!:
Accept-Charset already had a constant value and all Firefox builds had the same support for encodings, so paying attention to Accept-Charset was already useless before the header was removed.
Whiteboard: Dao nominated without comment
(In reply to comment #5)
> Note to Yahoo!:
> Accept-Charset already had a constant value and all Firefox builds had the
> same support for encodings, so paying attention to Accept-Charset was
> already useless before the header was removed.
That's untrue. "intl.charset.default" is localizable, and some localized builds of Firefox actually changed the value. For example, Japanese Firefox 4 sends "Accept-Charset: Shift_JIS,utf-8;q=0.7,*;q=0.7".
If it was just a constant, it would not be a problem from http-fingerprint's perspective in the first place.
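To make the structure of that header value concrete, here is a minimal sketch of how a server might read it. The function name `parse_accept_charset` is hypothetical and only for illustration; it is not how Yahoo! or any particular server handles the header.

```python
def parse_accept_charset(value):
    """Split an Accept-Charset value into (charset, q) pairs, highest q first.

    Hypothetical helper for illustration only. A charset without an
    explicit q parameter gets the HTTP default weight of 1.0.
    """
    prefs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, *params = [part.strip() for part in item.split(";")]
        q = 1.0
        for param in params:
            key, _, raw = param.partition("=")
            if key.strip().lower() == "q":
                q = float(raw)
        prefs.append((name, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda pair: -pair[1])


# The Japanese Firefox 4 value quoted above:
print(parse_accept_charset("Shift_JIS,utf-8;q=0.7,*;q=0.7"))
```

With the Japanese value, Shift_JIS ranks first; with the ISO-8859-1 value quoted later in this bug, ISO-8859-1 would. That difference between localized builds is what makes the header non-constant.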
Kev, any response here?
Whiteboard: Dao nominated without comment → [Yahoo broken by our removing Accept-Charset header]
Looks like Yahoo! Web search (http://search.yahoo.com/) is fixed for non-ascii search requests.
Non-ascii search in Yahoo! Image Search (http://images.search.yahoo.com) and Yahoo! Video Search (http://video.search.yahoo.com/) is still broken, though.
I'm going to back out bug 572652 as the risk vs reward seems off. Backing out bug 572652 shouldn't have any adverse effects on those that already changed server side to ignore / not use the header. 

Marking as unaffected for 6 due to the backout in bug 572652. I have not backed it out on Aurora or central, so this will crop up again there; those versions are still affected.
bc, is there any way for the spider to help in testing this if we turned it back on in aurora and beta?
chofmann: is there a header I could look for in the response that would indicate the presence of the problem?
OK, so this might be tricky. There may be some ideas on how to test over in https://bugzilla.mozilla.org/show_bug.cgi?id=572652

One idea would be to send Accept-Charset in one request, then not in the next request, and then look at the diffs to try to detect broken content.

Then we can do more evangelizing to the list of sites that are sending different responses.
chofmann: I tried your idea using XMLHttpRequest to send a default GET to a site followed by a GET with Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 as described in bug 572652. Looking at the content it is very difficult to filter out the differences due to changing session/tracking crap but it appears the charset in the response headers may be the ticket. Spidering the yahoo home page 1 level deep quickly gives:

charset differs: http://images.search.yahoo.com/images : charset1 = ISO-8859-1, charset2 = UTF-8
charset differs: http://video.search.yahoo.com/video : charset1 = ISO-8859-1, charset2 = UTF-8

which matches what Alexander said in comment 8. The scan is still running so I don't have a full answer. If this is sufficient we could do yahoo deeper or more of the top sites, or piggyback this on the userhook script used in crash testing, or Tomcat can use his VMs to collect data. Let me know what you think.
we have contacted yahoo and they are working on fixes, so I think scanning yahoo deeper isn't as valuable as doing a broader scan across sites using a good top site list and/or crash urls.
Ok. I've started a scan with the alexa top-1m list 1 level deep on 1 vm in the colo and will let it run for a while. I doubt that we want it to scan all 1m sites, but I'll let you know before I kill it off.
Attached file charset log
This is the result of comparing 55000+ pages on the top 389 sites using a Nightly build on Windows XP:

1. using XHR with the default headers to GET the page and response headers and parsing out the charset= from the Content-Type header to obtain charset1
2. using XHR with the header Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 to GET the page and response headers and parsing out the charset= from the Content-Type header to obtain charset2.
3. output a charset differs line if charset1 != charset2.
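
The three steps above can be sketched roughly as follows. This is not the spider script used for the scan: it is a minimal stand-in that uses Python's urllib instead of XHR, and the names `fetch_content_type`, `charset_from_content_type` and `charset_differs_line` are hypothetical. Only the Accept-Charset value and the output line format are taken from this bug.

```python
import re
import urllib.request

# Accept-Charset value described in bug 572652.
ACCEPT_CHARSET = "ISO-8859-1,utf-8;q=0.7,*;q=0.7"

_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)


def fetch_content_type(url, accept_charset=None, timeout=30):
    """GET url and return its Content-Type response header (or None).

    Assumption: urllib stands in for XHR; it sends no Accept-Charset
    header unless one is passed in explicitly.
    """
    headers = {"Accept-Charset": accept_charset} if accept_charset else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def charset_from_content_type(header_value):
    """Parse the charset= parameter out of a Content-Type value."""
    if not header_value:
        return None
    match = _CHARSET_RE.search(header_value)
    return match.group(1) if match else None


def charset_differs_line(url, content_type1, content_type2):
    """Return a 'charset differs' log line, or None if the charsets agree."""
    charset1 = charset_from_content_type(content_type1)
    charset2 = charset_from_content_type(content_type2)
    # Charset names are case-insensitive, so compare them lowercased.
    if (charset1 or "").lower() != (charset2 or "").lower():
        return (f"charset differs: {url} : "
                f"charset1 = {charset1}, charset2 = {charset2}")
    return None


def scan(url):
    """Steps 1-3 for a single URL (performs two network requests)."""
    content_type1 = fetch_content_type(url)
    content_type2 = fetch_content_type(url, ACCEPT_CHARSET)
    return charset_differs_line(url, content_type1, content_type2)
```

Calling `scan(url)` for each spidered URL and printing the non-None results would produce a log in the same shape as the attached one.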

I believe most of these results are false positives. I don't think this approach is sufficient since most of the pages appear to display properly with both Namoroka and Nightly. Looking at the results in the browser, it appears in most cases we respect the charset specified in <meta http-equiv="Content-Type" content="text/html; charset=BLAH">

A notable exception pointed out in this bug is http://video.search.yahoo.com/, which Nightly treats as ISO-8859-1 even though the http-equiv specifies UTF-8.

Perhaps a different approach would be more informative:

1. get charset1 as above.
2. get charset3 from http-equiv="Content-Type"
3. output a charset differs line if charset3 exists and charset1 != charset3
4. if charset3 does not exist then get charset2 as above and output a charset differs line if charset1 != charset2
(In reply to Bob Clary [:bc:] from comment #16)

I screwed up the alternative approach.

> 1. get charset4 from document.characterSet
> 2. get charset3 from http-equiv="Content-Type"
> 3. output a charset differs line if charset3 exists and charset4 != charset3
> 4. if charset3 does not exist then get charset1 and charset2 as above and output a charset differs line if charset1 != charset2
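
A rough sketch of the corrected steps, under some assumptions: document.characterSet (charset4) is only available inside the browser, so it is passed in as a plain argument here, and the names `MetaCharsetParser`, `meta_charset` and `charset_check` are hypothetical. The http-equiv charset (charset3) is read from the markup with Python's html.parser.

```python
import re
from html.parser import HTMLParser

_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)


class MetaCharsetParser(HTMLParser):
    """Record the charset from the first <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        match = _CHARSET_RE.search(attributes.get("content") or "")
        if match:
            self.charset = match.group(1)


def meta_charset(html):
    """Return charset3 (the http-equiv charset), or None if absent."""
    parser = MetaCharsetParser()
    parser.feed(html)
    parser.close()
    return parser.charset


def charset_check(url, charset4, html, charset1=None, charset2=None):
    """Apply steps 1-4; return a 'charset differs' line or None.

    charset4 is what the browser reports as document.characterSet
    (an input here, since this sketch runs outside the browser).
    charset1 and charset2 are the response-header charsets obtained
    without and with Accept-Charset, used only when no http-equiv exists.
    """
    charset3 = meta_charset(html)
    if charset3 is not None:
        if (charset4 or "").lower() != charset3.lower():
            return (f"charset differs: {url} : "
                    f"charset4 = {charset4}, charset3 = {charset3}")
        return None
    if (charset1 or "").lower() != (charset2 or "").lower():
        return (f"charset differs: {url} : "
                f"charset1 = {charset1}, charset2 = {charset2}")
    return None
```

For a page whose markup declares UTF-8 but which the browser treats as ISO-8859-1 (the situation described above for Yahoo! Video Search), `charset_check` returns a line with charset4 = ISO-8859-1 and charset3 = UTF-8.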
Non-ascii search in Yahoo! Image Search (http://images.search.yahoo.com) is fixed. Yahoo! Video Search is still broken for non-ascii search.
Looks like non-ascii search in Yahoo! Video Search is fixed (tested on Mozilla/5.0 (X11; Linux i686; rv:10.0a1) Gecko/20111101 Firefox/10.0a1 ID:20111101031108)
Marking this bug as fixed.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Tech Evangelism → Tech Evangelism Graveyard