Closed Bug 171813 Opened 22 years ago Closed 22 years ago

Universal auto detector doesn't work well on sohu news page

Categories

(Core :: Internationalization, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

VERIFIED FIXED

People

(Reporter: ji, Assigned: shanjian)

References

()

Details

(Keywords: intl, topembed+)

Attachments

(2 files)

Build: 09/30 trunk and branch build
OS: Simplified Chinese XP

On the Sohu news page (http://news.sohu.com/15/95/news203459515.shtml), there is a
promotion frame which doesn't contain a charset meta tag (please refer to the URL
in the URL field). With the Universal auto detector turned on, this promotion frame
is displayed garbled, while it is displayed correctly with the Chinese/Simplified
Chinese auto detector turned on.
It does not seem to be related to the frame:
when the page http://www.sol.sohu.com/promotion/sol.htm is loaded separately,
auto-detect Universal detects it as windows-1252, while auto-detect Chinese and
Simplified Chinese detect it as gb2312.
QA Contact: ruixu → ylong
yes, it's not related to the frame, you can click on the URL in the URL field to
see that. The page happens to be in a subframe of the news page.
Removed the subframe from summary to clarify.
Keywords: intl
Summary: Universal auto detector doesn't work well on a subframe of sohu news page → Universal auto detector doesn't work well on sohu news page
I think it's related to the fix for bug 162894: with the 09-08 trunk build on
Mac 10.1.5, auto-detect Universal detects that page as gb18030.
Added topembed, since Sohu is one of the most popular sites in China.
Keywords: topembed
I see the same problem with IE6
Checked the 09/11 branch build, which doesn't contain the fix for bug 162894: the
promotion page is displayed correctly there.
Bob, which system are you using?
IE 6 on my Simplified Chinese XP with auto selection turned on can display that
page correctly.
US W2K with region set to US.
Attached file Saved problematic page
The existing code is too conservative in declaring a charset. Now that the
multibyte probers compete with latin1, and our latin1 prober is not that
conservative, the multibyte probers should report more confidence.

Roy, could you r=?
chris, could you sr?
Status: NEW → ASSIGNED
roy is on vacation this week.  maybe ftang can r=?
What body of data do we use to tune these ratios?
For each charset, I calculate 2 ratios based on a large collection of text. One
is called the "ideal distribution ratio" (IDR), the other the "random distribution
ratio" (RDR). If the text does not belong to the charset, the ratio should be close
to RDR. Typical Chinese text should have a ratio close to IDR. Chinese has the
closest IDR and RDR: they are 3.79 and 0.157. As you see, the difference is very
big. For the calculation, I currently use 50% of IDR, which is 1.90 for Chinese.
That is too conservative, especially for a small amount of text. If I use 25%, the
ratio is 0.95, which is six times bigger than RDR. I think that is good enough.
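
The arithmetic in the comment above can be checked with a small sketch. The IDR and RDR values are the ones quoted for Chinese; the function names, the SURE_YES cap, and the way an observed ratio is turned into a confidence are hypothetical and are not taken from the patch.

```python
# Values quoted in the comment above for Chinese.
IDR = 3.79    # "ideal distribution ratio": typical Chinese text
RDR = 0.157   # "random distribution ratio": text not in this charset

SURE_YES = 0.99  # hypothetical upper bound on confidence (assumption)


def typical_ratio(idr: float, fraction: float) -> float:
    """Threshold used for the calculation: a chosen fraction of the IDR."""
    return idr * fraction


def confidence(observed_ratio: float, threshold: float) -> float:
    """Hypothetical mapping: observed ratio relative to the threshold, capped."""
    return min(observed_ratio / threshold, SURE_YES)


old_threshold = typical_ratio(IDR, 0.50)   # about 1.90
new_threshold = typical_ratio(IDR, 0.25)   # about 0.95
margin_over_random = new_threshold / RDR   # about 6 times the RDR

# A short text whose observed ratio falls between the two thresholds.
observed = 1.2
print(round(confidence(observed, old_threshold), 2))
print(round(confidence(observed, new_threshold), 2))
```

Under these assumptions, a short sample with an observed ratio of 1.2 scores about 0.63 against the 50% threshold but reaches the cap against the 25% threshold, which is the kind of gain in confidence the multibyte prober needs to win against latin1.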
I feel scared about the patch; it looks like pure guessing about the numbers. Is
there a way we can generate those numbers from some formula? Or from data? How can
we be sure that changing those numbers won't fix some pages but break others?
Some parameters should be adjusted over time; that is within my original
expectation. And those numbers are not based on pure guessing. IDR and RDR are
calculated from collected data. The guessing happens in choosing a ratio to use
based on these 2 values. For Japanese and Korean, 25% of IDR has been used for
quite a long time. Using Chinese as an example, even 25% of IDR is 6 times bigger
than RDR. That is to say, the sampled text has the characteristic 6 times stronger
than random text. Isn't that a very strong indication?
On second thought, it might be a good idea to eliminate this guess and put
everything into the formula.
As I was looking for a new formula to calculate confidence, I found that my
original approach makes more sense to me. Given the development of each language
and each language's own characteristics, some manual tuning is necessary. Since
the characteristic we calculate here is very strong, it makes sense to me to
keep the existing simple logic. I would like to stick with my patch.
ok, should we run your new patch through some test suite and make sure the new
values work better before we land it into the tree?
I think we should land the fix into the branch as well, since it's a regression in
the Universal auto detector for Simplified Chinese detection.
Comment on attachment 101281 [details] [diff] [review]
patch, adjust some detecting parameters.

rs=ftang for the numbers shanjian tweaked.
Attachment #101281 - Flags: review+
let's land this into trunk and ask QA to run the full set of autodetect tests
(yeah, all the test cases) before requesting that this land into the branch.
johnny, could you sr?
dan, could you sr?
Keywords: topembed → topembed+
Comment on attachment 101281 [details] [diff] [review]
patch, adjust some detecting parameters.

sr=jst
Attachment #101281 - Flags: superreview+
Blocks: 157673
fix checked in. 
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Verified that the page http://www.sol.sohu.com/promotion/sol.htm displays fine on
the 11-27 trunk build / Win2k under charset gb18030 with auto-detect Universal.
Status: RESOLVED → VERIFIED
No longer blocks: 157673
Depends on: 180372
Flags: in-testsuite+
Bug 545658 is about the test case for this bug failing with the HTML5 parser. The HTML5 parser currently, by design, runs the chardet over the first 512 bytes of the content. It appears that's not enough in this case.

The sohu.com site now sends a charset meta, so it's not a concrete problem anymore. Previously, in my cursory testing, I had seen reasonable real-world results with running the chardet over only the first 512 bytes.

Should I adjust the HTML5 parser to keep feeding the chardet (and possibly renavigate to the page in midparse) or should I change the test case?
512 bytes seems way too small for HTML charset detection. I would expect that in most typical HTML pages that is not even going to reach the <body>.
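
To illustrate the concern about the 512-byte window, here is a minimal sketch using a made-up page (not the actual sohu.com markup): an ASCII-only head longer than 512 bytes, followed by a body encoded as gb2312. The helper name and the sample content are assumptions for illustration only.

```python
def non_ascii_in_prefix(data: bytes, limit: int = 512) -> int:
    """Count bytes >= 0x80 within the first `limit` bytes of `data`."""
    return sum(1 for b in data[:limit] if b >= 0x80)


# Hypothetical page: the head is pure ASCII and longer than 512 bytes.
head = (
    b"<html><head><title>promo</title><style>"
    + b"td { font-size: 12px; } " * 30
    + b"</style></head>"
)
body = "<body>" + "搜狐新闻" * 20 + "</body></html>"
page = head + body.encode("gb2312")

# Only multibyte content gives a detector evidence to work with.
print(len(head) > 512)
print(non_ascii_in_prefix(page))             # bytes seen in the 512-byte window
print(non_ascii_in_prefix(page, len(page)))  # bytes seen over the whole page
```

In this sample the first 512 bytes contain no non-ASCII bytes at all, so a detector limited to that window has no multibyte evidence, while the full page carries 160 such bytes.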