Closed
Bug 171813
Opened 22 years ago
Closed 22 years ago
Universal auto detector doesn't work well on sohu news page
Categories
(Core :: Internationalization, defect)
VERIFIED
FIXED
People
(Reporter: ji, Assigned: shanjian)
Details
(Keywords: intl, topembed+)
Attachments
(2 files)
2.51 KB, text/html
3.66 KB, patch (ftang: review+, jst: superreview+)
Build: 09/30 trunk and branch build
OS: Simplified Chinese XP
On the Sohu news page (http://news.sohu.com/15/95/news203459515.shtml), there is a promotion frame that doesn't contain a charset meta tag (please refer to the URL in the URL field). With the Universal auto detector turned on, this promotion frame is displayed garbled, while it is displayed correctly with the Chinese or Simplified Chinese auto detector turned on.
Comment 1•22 years ago
It doesn't seem to be related to the frame: when loading the page http://www.sol.sohu.com/promotion/sol.htm separately, auto-detect Universal detects it as windows-1252, while auto-detect Chinese and Simplified Chinese detect it as gb2312.
QA Contact: ruixu → ylong
Yes, it's not related to the frame; you can click on the URL in the URL field to see that. The page happens to be in a subframe of the news page.
Removed the subframe from summary to clarify.
Keywords: intl
Summary: Universal auto detector doesn't work well on a subframe of sohu news page → Universal auto detector doesn't work well on sohu news page
Comment 4•22 years ago
I think it's related to the fix for bug 162894: on the 09-08 trunk build on Mac 10.1.5, auto-detect Universal detects that page as gb18030.
Added topembed, since Sohu is one of the most popular sites in China.
Keywords: topembed
Checked the 09/11 branch build, which doesn't contain the fix for bug 162894: the promotion page is displayed correctly.
Bob, which system are you using? IE 6 on my Simplified Chinese XP with auto selection turned on can display that page correctly.
Comment 10•22 years ago
Assignee
Comment 11•22 years ago
Assignee
Comment 12•22 years ago
The existing code is too conservative in declaring a charset. Now that the multibyte probers compete with latin1, and our latin1 prober is not that conservative, the multibyte probers should report more confidence. Roy, could you r=? Chris, could you sr?
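To illustrate why a conservative multibyte prober can lose to a less conservative latin1 prober, here is a minimal sketch. It assumes the detector simply picks whichever prober reports the highest confidence; the `pick_charset` helper and the confidence numbers are made up for illustration and are not taken from the patch.

```python
def pick_charset(confidences):
    """Return the charset whose (hypothetical) prober reports the highest confidence."""
    return max(confidences, key=confidences.get)

# Made-up confidences for a short Simplified Chinese page.
before = {"gb18030": 0.40, "windows-1252": 0.50}  # multibyte prober too conservative
after = {"gb18030": 0.80, "windows-1252": 0.50}   # multibyte prober more confident

print(pick_charset(before))
print(pick_charset(after))
```

With the first set of numbers the latin1 guess wins even though the page is Chinese; raising the multibyte confidence flips the result.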
Status: NEW → ASSIGNED
Comment 13•22 years ago
roy is on vacation this week. maybe ftang can r=?
Comment 14•22 years ago
What body of data do we use to tune these ratios?
Assignee
Comment 15•22 years ago
For each charset, I calculate 2 ratios based on a large collection of text. One is called the "ideal distribution ratio" (IDR), the other the "random distribution ratio" (RDR). If the text does not belong to the charset, the ratio should be close to RDR. A typical Chinese text should have a ratio close to IDR. Chinese has the closest IDR and RDR: 3.79 and 0.157. As you can see, the difference is very big. For the calculation, I currently use 50% of IDR, which is 1.90 for Chinese. That is too conservative, especially for a small amount of text. If I use 25%, the ratio is 0.95, which is six times bigger than RDR. I think that is good enough.
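The arithmetic in this comment can be checked with a short sketch. The IDR and RDR values are the ones quoted above for Chinese; the `threshold` helper and the way the fraction is applied are assumptions for illustration, not the detector's actual code.

```python
# Values quoted in the comment above for Chinese.
IDR = 3.79   # ideal distribution ratio
RDR = 0.157  # random distribution ratio

def threshold(idr, fraction):
    """Hypothetical helper: the ratio a text must reach, as a fraction of IDR."""
    return idr * fraction

old = threshold(IDR, 0.50)  # roughly 1.90, the current setting
new = threshold(IDR, 0.25)  # roughly 0.95, the proposed setting

print("50%% of IDR = %.2f, %.1fx RDR" % (old, old / RDR))
print("25%% of IDR = %.2f, %.1fx RDR" % (new, new / RDR))
```

Even at 25% of IDR, the required ratio stays about six times above what random text would produce.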
Comment 16•22 years ago
I feel scared about the patch; it looks like pure guessing about the numbers. Is there a way we can generate those numbers from some formula, or from data? How can we be sure that changing those numbers won't fix one page and break others?
Assignee
Comment 17•22 years ago
Some parameters should be adjusted over time; that is within my original expectation. And those numbers are not based on pure guessing. IDR and RDR are calculated from collected data. The guessing happens in choosing a ratio to use based on these 2 values. For Japanese and Korean, 25% of IDR has been used for quite a long time. Using Chinese as an example, even 25% of IDR is 6 times bigger than RDR. That is to say, the sampled text shows the characteristic 6 times more strongly than random text. Isn't that a very strong indication?
Assignee
Comment 18•22 years ago
On second thought, it might be a good idea to eliminate this guess and put everything into the formula.
Assignee
Comment 19•22 years ago
As I was looking for a new formula to calculate confidence, I found that my original approach makes more sense to me. Given how each language develops and each language's own characteristics, some manual tuning is necessary. Since the characteristic we calculate here is very strong, it makes sense to me to keep the existing simple logic. I would like to stick with my patch.
Comment 20•22 years ago
OK, should we run your new patch through some test suite and make sure the new values work better before we land it in the tree?
Reporter
Comment 21•22 years ago
I think we should land the fix on the branch as well, since it's a regression in the Universal auto detector's Simplified Chinese detection.
Comment 22•22 years ago
Comment on attachment 101281 (patch, adjust some detecting parameters): rs=ftang for the numbers shanjian tweaked.
Attachment #101281 - Flags: review+
Comment 23•22 years ago
Let's land this on the trunk and ask QA to run the full set of autodetect tests (yes, all the test cases) before requesting that this land on the branch.
Assignee
Comment 24•22 years ago
johnny, could you sr?
Assignee
Comment 25•22 years ago
dan, could you sr?
Updated•22 years ago
Comment 26•22 years ago
Comment on attachment 101281 (patch, adjust some detecting parameters): sr=jst
Attachment #101281 - Flags: superreview+
Assignee
Comment 27•22 years ago
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Comment 28•22 years ago
Verified that the page http://www.sol.sohu.com/promotion/sol.htm displays fine on the 11-27 trunk build / Win2k, under charset gb18030 with auto-detect Universal.
Status: RESOLVED → VERIFIED
Updated•16 years ago
Flags: in-testsuite+
Comment 29•14 years ago
Bug 545658 is about the test case for this bug failing with the HTML5 parser. The HTML5 parser currently, by design, runs the chardet over the first 512 bytes of the content. It appears that's not enough in this case. The sohu.com site now sends a charset meta, so it's not a concrete problem anymore. Previously, in my cursory testing, I had seen reasonable real-world results with running the chardet over only the first 512 bytes. Should I adjust the HTML5 parser to keep feeding the chardet (and possibly renavigate to the page in midparse) or should I change the test case?
Comment 30•14 years ago
512 bytes seems way too small for HTML charset detection. I would expect that in most typical HTML pages that is not even going to reach the <body>.
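The concern about a 512-byte window can be illustrated with a small sketch. The page below is entirely hypothetical: an ASCII-only head padded with script tags, followed by Chinese body text encoded as gb2312. It only counts non-ASCII bytes in the prefix and does not run any real detector.

```python
# Hypothetical page: an ASCII-only <head> followed by Chinese body text.
head = ("<html><head><title>promo</title>\n"
        + "<script>/* padding */</script>\n" * 20
        + "</head>\n")
body = "<body>" + "\u65b0\u95fb" * 50 + "</body></html>"  # the word for "news", repeated
page = (head + body).encode("gb2312")

def high_bytes(data):
    """Count non-ASCII bytes, the only bytes a multibyte prober can learn from."""
    return sum(1 for b in data if b >= 0x80)

prefix = page[:512]  # what a 512-byte detection window would see
print(len(head), high_bytes(prefix), high_bytes(page))
```

Here the head alone is longer than 512 bytes, so the prefix contains no multibyte evidence at all, even though the full page has plenty.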