Closed Bug 171813 Opened 23 years ago Closed 23 years ago

Universal auto detector doesn't work well on sohu news page

Status:

VERIFIED FIXED

People

(Reporter: ji, Assigned: shanjian)

References

(URL)

Details

(Keywords: intl, topembed+)

Attachments

(2 files)

Saved problematic page (23 years ago, Yuying Long, 2.51 KB, text/html)
patch, adjust some detecting parameters. (23 years ago, Shanjian Li, 3.66 KB, patch; ftang: review+, jst: superreview+)

Reporter

Description

•

23 years ago

Build: 09/30 trunk and branch builds
OS: Simplified Chinese XP

On the Sohu news page (http://news.sohu.com/15/95/news203459515.shtml), there is a promotion frame that doesn't contain a charset meta tag (please refer to the URL in the URL field). With the Universal auto-detector turned on, this promotion frame is displayed garbled, while it is displayed correctly with the Chinese or Simplified Chinese auto-detector turned on.

Yuying Long

Comment 1

•

23 years ago

It doesn't seem to be related to the frame: when the page http://www.sol.sohu.com/promotion/sol.htm is loaded separately, auto-detect Universal detects it as windows-1252, while auto-detect Chinese and Simplified Chinese detect it as gb2312.

QA Contact: ruixu → ylong

Reporter

Comment 2

•

23 years ago

Yes, it's not related to the frame; you can click on the URL in the URL field to see that. The page just happens to be in a subframe of the news page.

Reporter

Comment 3

•

23 years ago

Removed the subframe from summary to clarify.

Keywords: intl

Summary: Universal auto detector doesn't work well on a subframe of sohu news page → Universal auto detector doesn't work well on sohu news page

Yuying Long

Comment 4

•

23 years ago

I think it's related to the fix for bug 162894: with the 09-08 trunk build on Mac 10.1.5, auto-detect Universal detects that page as gb18030.

Reporter

Comment 5

•

23 years ago

Added topembed, since Sohu is one of the most popular sites in China.

Keywords: topembed

bobj

Comment 6

•

23 years ago

I see the same problem with IE6

Reporter

Comment 7

•

23 years ago

Checked the 09/11 branch build, which doesn't contain the fix for bug 162894: the promotion page is displayed correctly there.

Reporter

Comment 8

•

23 years ago

Bob, which system are you using? IE 6 on my Simplified Chinese XP with auto selection turned on can display that page correctly.

bobj

Comment 9

•

23 years ago

US W2K with region set to US.

Yuying Long

Comment 10

•

23 years ago

Attached file: Saved problematic page

Shanjian Li

Assignee

Comment 11

•

23 years ago

Attached patch: patch, adjust some detecting parameters.

Shanjian Li

Assignee

Comment 12

•

23 years ago

The existing code is too conservative in declaring a charset. Now that we have competition from latin1, and our latin1 prober is not that conservative, the multibyte probers should report more confidence. Roy, could you r=? Chris, could you sr?

Status: NEW → ASSIGNED

bobj

Comment 13

•

23 years ago

Roy is on vacation this week. Maybe ftang can r=?

bobj

Comment 14

•

23 years ago

What body of data do we use to tune these ratios?

Shanjian Li

Assignee

Comment 15

•

23 years ago

For each charset, I calculate two ratios based on a large collection of text. One is called the "ideal distribution ratio" (IDR), the other the "random distribution ratio" (RDR). If the text does not belong to the charset, the ratio should be close to RDR. A typical Chinese text should have a ratio close to IDR. Of all the charsets, Chinese has the closest IDR and RDR: 3.79 and 0.157. As you can see, the difference is still very big. For the calculation, I currently use 50% of IDR, which is 1.90 for Chinese. That is too conservative, especially for a small amount of text. If I use 25%, the ratio is 0.95, which is six times bigger than RDR. I think that is good enough.
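The arithmetic in this comment can be reproduced with a short sketch. It uses only the IDR and RDR figures quoted above for Chinese; the function names `threshold` and `separation` are made up for illustration and are not taken from the patch.

```python
# Figures quoted in the comment for Chinese.
IDEAL_RATIO = 3.79    # "ideal distribution ratio" (IDR)
RANDOM_RATIO = 0.157  # "random distribution ratio" (RDR)


def threshold(fraction: float, ideal: float = IDEAL_RATIO) -> float:
    """Return the working ratio: a chosen fraction of IDR."""
    return fraction * ideal


def separation(fraction: float,
               ideal: float = IDEAL_RATIO,
               random: float = RANDOM_RATIO) -> float:
    """Return how many times larger the working ratio is than RDR."""
    return threshold(fraction, ideal) / random


# Compare the old setting (50% of IDR) with the proposed one (25% of IDR).
for fraction in (0.50, 0.25):
    print(f"{fraction:.0%} of IDR: ratio {threshold(fraction):.4f}, "
          f"{separation(fraction):.2f} times RDR")
```

Even at 25% of IDR the working ratio stays roughly six times RDR, which is the margin the comment relies on.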

Frank Tang

Comment 16

•

23 years ago

I feel scared about the patch; it looks like pure guessing about the numbers. Is there a way we can generate those numbers from some formula or from data? How can we be sure that changing those numbers won't fix some pages but break others?

Shanjian Li

Assignee

Comment 17

•

23 years ago

Some parameters should be adjusted over time; that is within my original expectation. And those numbers are not based on pure guessing. IDR and RDR are calculated from collected data. The guessing happens in choosing a ratio to use based on these two values. With Japanese and Korean, 25% of IDR has been used for quite a long time. Using Chinese as an example, even 25% of IDR is six times bigger than RDR. That is to say, the sampled text shows the characteristic six times more strongly than random text. Isn't that a very strong indication?
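To see why a lower working ratio helps on small samples, here is a hypothetical confidence function. The formula (observed ratio divided by the working ratio, then clamped), the constants `SURE_YES` and `SURE_NO`, and the sample counts are all assumptions made for illustration; none of them is taken from the patch.

```python
SURE_YES = 0.99  # assumed upper bound on reported confidence
SURE_NO = 0.01   # assumed lower bound on reported confidence


def confidence(frequent: int, total: int, typical_ratio: float) -> float:
    """Map the observed frequent/other character ratio to a confidence.

    `frequent` counts characters from the charset's most common set,
    `total` counts all multibyte characters seen so far.
    """
    if total <= 0 or frequent <= 0:
        return SURE_NO
    other = total - frequent
    if other == 0:
        return SURE_YES
    observed = frequent / other
    return max(SURE_NO, min(SURE_YES, observed / typical_ratio))


# A small, made-up sample: 40 frequent characters out of 100.
print(confidence(40, 100, typical_ratio=1.90))  # working ratio at 50% of IDR
print(confidence(40, 100, typical_ratio=0.95))  # working ratio at 25% of IDR
```

Under these assumptions, halving the working ratio doubles the confidence reported for the same sample, which is what lets a multibyte prober compete with a less conservative latin1 prober.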

Shanjian Li

Assignee

Comment 18

•

23 years ago

On second thought, it might be a good idea to eliminate this guess and put everything into the formula.

Shanjian Li

Assignee

Comment 19

•

23 years ago

As I was looking for a new formula to calculate confidence, I found that my original approach makes more sense to me. Given how each language develops and each language's own characteristics, some manual tuning is necessary. Since the characteristic we calculate here is very strong, it makes sense to me to keep the existing simple logic. I would like to stick with my patch.

Frank Tang

Comment 20

•

23 years ago

OK, should we run your new patch through some test suite and make sure the new values work better before we land it in the tree?

Reporter

Comment 21

•

23 years ago

I think we should land the fix on the branch as well, since it's a regression in the Universal auto-detector's Simplified Chinese detection.

Frank Tang

Comment 22

•

23 years ago

Comment on attachment 101281: patch, adjust some detecting parameters.

rs=ftang for shanjian's number tweak.

Attachment #101281 - Flags: review+

Frank Tang

Comment 23

•

23 years ago

Let's land this on the trunk and ask QA to run the full set of autodetect tests (yeah, all the test cases) before requesting that this land on the branch.

Shanjian Li

Assignee

Comment 24

•

23 years ago

johnny, could you sr?

Shanjian Li

Assignee

Comment 25

•

23 years ago

dan, could you sr?

Judson Valeski

Updated

•

23 years ago

Keywords: topembed → topembed+

Johnny Stenback (:jst)

Comment 26

•

23 years ago

Comment on attachment 101281: patch, adjust some detecting parameters.

sr=jst

Attachment #101281 - Flags: superreview+

Yuying Long

Updated

•

23 years ago

Blocks: 157673

Shanjian Li

Assignee

Comment 27

•

23 years ago

fix checked in.

Status: ASSIGNED → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Yuying Long

Comment 28

•

23 years ago

Verified that the page http://www.sol.sohu.com/promotion/sol.htm displays fine on the 11-27 trunk build / Win2k under charset gb18030 with auto-detect Universal.

Status: RESOLVED → VERIFIED

Frank Tang

Updated

•

22 years ago

No longer blocks: 157673

Frank Tang

Updated

•

22 years ago

Depends on: 180372

Simon Montagu :smontagu

Updated

•

17 years ago

Flags: in-testsuite+

Henri Sivonen (:hsivonen)

Comment 29

•

15 years ago

Bug 545658 is about the test case for this bug failing with the HTML5 parser. The HTML5 parser currently, by design, runs the chardet over the first 512 bytes of the content. It appears that's not enough in this case. The sohu.com site now sends a charset meta, so it's not a concrete problem anymore. Previously, in my cursory testing, I had seen reasonable real-world results with running the chardet over only the first 512 bytes. Should I adjust the HTML5 parser to keep feeding the chardet (and possibly renavigate to the page in midparse) or should I change the test case?

Simon Montagu :smontagu

Comment 30

•

15 years ago

512 bytes seems way too small for HTML charset detection. I would expect that in most typical HTML pages that is not even going to reach the <body>.
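The concern raised in the last two comments can be checked on any saved page with a few lines of Python. This is only a rough sketch: `prefix_report` is a hypothetical helper, the page below is synthetic, and the 512-byte window is the figure quoted above, not a value read from the parser source.

```python
import re

WINDOW = 512  # leading bytes fed to the detector, per the comment above


def prefix_report(page: bytes, window: int = WINDOW) -> dict:
    """Summarize what a detector limited to the first `window` bytes sees."""
    match = re.search(rb"<body\b", page, re.IGNORECASE)
    body_offset = match.start() if match else None
    return {
        # Bytes >= 0x80 are the only evidence a multibyte prober can use.
        "non_ascii_in_window": sum(1 for b in page[:window] if b >= 0x80),
        "body_offset": body_offset,
        "body_in_window": body_offset is not None and body_offset < window,
    }


# Synthetic page: an ASCII-only <head> padded past 512 bytes, followed by
# a <body> holding four bytes in the GB2312 lead/trail byte range.
head = (b"<html><head><title>t</title><script>"
        + b"/* padding */" * 50
        + b"</script></head>")
page = head + b"<body>\xcb\xd1\xba\xfc</body></html>"

report = prefix_report(page)
print(report)
```

When `non_ascii_in_window` is zero, the window contains nothing that distinguishes gb2312 from any other ASCII-compatible encoding, whatever the tuning of the probers.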
