Closed Bug 90581 Opened 23 years ago Closed 23 years ago

universal charset detector does not work in mail/news

Categories

(MailNews Core :: Internationalization, defect)

Other
Other
defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED
mozilla0.9.4

People

(Reporter: ezh, Assigned: shanjian)

Details

(Keywords: intl, Whiteboard: need a=)

Attachments

(9 files, 1 obsolete file)

1. Set character coding to Autodetect->Russian.
2. Open this news group.
3. Go through the messages in the list.

Some are displayed as:

????? ? ??? -?? ?????????. ? ???? ????????? ?????? ??????? ? MSIE ???????? ?
http:// ?????????. ??? ??? ?????????? ???????? ?????????????? ?????????. ???

Now change manually the codepage to KOI8-R. Now it looks as it should.
Could you check if this is a generic Cyrillic auto detection problem?
Could you try this?
* Copy the KOI8-R text to HTML composer and save as KOI8-R
* Remove META charset tag.
* Open the HTML file in browser with Cyrillic auto detection ON.
Keywords: intl
CCed to Marina.
* Copy the KOI8-R text to HTML composer and save as KOI8-R
* Remove META charset tag.
* Open the HTML file in browser with Cyrillic auto detection ON.

After all those steps, with Russian auto detection ON, the encoding points to Cyrillic-koi8-r.
Eugene, what build are you using?
First, I want to say I got the same result for the test as Marina did.

I tried the 20010711 build under Win98 and now see the same problem with 2001071308
under Linux (since my Windows computer died, I am using Linux RH 6.2 for the first
time on my emergency P133 :) ).
Hmmm, maybe this helps.

While I was scrolling up and down the messages (I must say Mozilla is very slow
on a P133, especially mail/news), a window popped up saying "Unknown Error
804b0001".

Maybe it was some server error but after this the message loaded (but anyway
with this bug).
Netscape PR1 has auto detect for "All" which I think supports cyrillic detection 
too, cc to shanjian.
Eugene, could you try Netscape 6.1 PR1 and use the "All" detector and see if you 
can still reproduce the problem?
Status: NEW → ASSIGNED
Can Marina do this, please? I have 6.1 PR1 installed, but it starts every time
with an error... :(

Sorry, but I do not have time now for installing 6.1 PR1. :(
I looked into Nscp6.0 rtm (2001-11-08 build), and with autodetect "All" Russian
sites are detected with no problem.
Whichever charset detector is used for mail/news, the results may not be
satisfactory. I believe we are still feeding the charset detector line by line
and asking for a result line by line. Some big changes need to happen before we
can fix this kind of problem. There is a bug filed against this problem. It is
assigned to somebody outside the I18n group.
Compared to 6rtm this is a regression. The same newsgroup in 6.0 has no problem
detecting the Russian encoding of news articles when Autodetect "All" is on. I
don't have to reload a message and manually correct the encoding to koi8-r, as I
have to do with 6.1 (2001-07-16 branch).
Marina, which detection did you use, "Russian" or "All"?

The libmime problem is bug 12481.
Target Milestone: --- → Future
I used "All" in both cases: 6.0 and 6.1. With Autodetect set to "All" in 6.0 I
have no problem with this newsgroup (regsoft.com) detecting koi8-r, though it
doesn't work with today's branch build.
Marina, could you create a local file (.txt or .html) with text of the news
message? That way, we can see if the problem is mail/news specific.
Attached file another try..
I used today's branch build on Windows and both "Russian" and "All" detected the
attachments as KOI8-R. So it's mail/news specific.
Depends on: 12481
Yep, I also think so. While browsing I had no problem with detecting codepage.
Marina, could you copy the news message to local and attach it to this bug, thanks?
QA Contact: ji → marina
Libmime gets its data line by line. So I divided the HTML file into separate
files, each containing only one line. Those files are detected correctly by the
browser. I used the "Russian" detector for the test.
So I think the amount of input data given to the detector is not really causing
the problem.
I noticed that when I view the Russian message, the very first line of the first
message is shown correctly, but the following lines and other messages are
displayed as question marks. In the debugger, the charset "koi8-r" is returned by
the detector for the first line, but an empty string is returned as the charset
for the following lines.
I did another test by doubling each line in the mail. The first line was
detected as "koi8-r", but the second line, which is exactly the same string as
the first line, was not detected and got an empty string for a charset.
So I suspect some kind of internal state is messed up for nsIStringCharsetDetector.
Shanjian, please check if anything is wrong in nsIStringCharsetDetector.
Assignee: nhotta → shanjian
Status: ASSIGNED → NEW
Summary: Cyrillic is not autodetected → Cyrillic is not autodetected by nsIStringCharsetDetector
Some time between 6.01 and now, mail/news started reusing the old detector
instead of creating a new one. (That's perfectly reasonable for performance
reasons.) However, the XPCOM string detector does not work this way yet. The fix
is simple, but I need to check all detectors to make sure we fix all similar
problems.
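The symptom described above (first line detected, the identical second line returning an empty charset) is what a reused detector with a stale "done" flag would produce. The following is only a minimal Python sketch of that failure mode, not the actual Mozilla code: the class `LineDetector`, its `reset_on_each_call` switch, and the hard-coded "koi8-r" answer are all hypothetical stand-ins, and `done` merely plays the role that `mDone` has in the patch discussion.

```python
class LineDetector:
    """Toy stand-in for a reusable string charset detector (hypothetical)."""

    def __init__(self, reset_on_each_call):
        self.reset_on_each_call = reset_on_each_call
        self.done = False  # plays the role of mDone; name is illustrative

    def detect(self, line):
        if self.reset_on_each_call:
            # The kind of fix discussed: clear state before each new string.
            self.done = False
        if self.done:
            # A detector that already reported ignores all further input.
            return ""
        if any(b >= 0x80 for b in line):
            self.done = True
            return "koi8-r"  # hard-coded: real detection logic is out of scope
        return ""


line = "привет".encode("koi8_r")

stale = LineDetector(reset_on_each_call=False)
print([stale.detect(line), stale.detect(line)])

fresh = LineDetector(reset_on_each_call=True)
print([fresh.detect(line), fresh.detect(line)])
```

Feeding the same line twice to the reused instance without a reset reproduces the "doubled line" test from the earlier comment; with the reset, both calls report a charset.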
Status: NEW → ASSIGNED
Attached patch Proposed patch
I checked psm detector, and it does not have this problem. So we are done with 
mozilla tree. But I need to check universal detector. 
frank, could you review my fix?
Whiteboard: need r/sr, since 7/19
So other detectors in mozilla do not need the change?
Could it land to the NN6.1 trunk, since it's a big regression for cyrillic
languages?
cc to jenm
I think the real solution is to obsolete nsIStringCharsetDetector and force
all code to use nsICharsetDetector instead. There is no way we can get a good
detection result with nsIStringCharsetDetector, since the data we provide to it
is too small.
I agree it's more efficient to feed more data to the detector (see bug 12481).
But in the data (the attachment of 07/17/01 17:45), the first line, with 12
characters, is detected correctly by the Russian detector.
I heard that some other detection module, used for a search server, works
with small amounts of data (e.g. a user's search query string). I think auto
detection with a small amount of data is not impossible.
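As a rough illustration of why a short string can still carry enough signal, here is a toy Python heuristic. It is not the algorithm used by any Mozilla detector; the function name `guess_cyrillic_charset` is made up for this sketch. It relies only on a property of the two encodings: in KOI8-R the lowercase Russian letters occupy bytes 0xC0-0xDF, while in windows-1251 they occupy 0xE0-0xFF, and ordinary Russian text is mostly lowercase.

```python
def guess_cyrillic_charset(data):
    """Guess between KOI8-R and windows-1251 from byte ranges alone.

    Toy heuristic: whichever range holds more bytes is assumed to be the
    lowercase range. Returns "" when the input gives no usable signal.
    """
    low = sum(1 for b in data if 0xC0 <= b <= 0xDF)
    high = sum(1 for b in data if 0xE0 <= b <= 0xFF)
    if low == high:
        return ""
    return "koi8-r" if low > high else "windows-1251"


sample = "Привет, мир"  # 11 characters, comparable to the 12-character line
print(guess_cyrillic_charset(sample.encode("koi8_r")))
print(guess_cyrillic_charset(sample.encode("cp1251")))
```

A heuristic this crude fails on all-caps text and on other Cyrillic encodings, which is one reason feeding more data to a proper detector remains preferable.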

*** Bug 58236 has been marked as a duplicate of this bug. ***
naoki, can you review this one? I talked with frank about similar change to 
universal detector, and he gave me r= for that. So I don't think he will have 
any objection for this bug. Since he is on vacation, can you give me r= instead?
r=nhotta
Why was mDone moved from private to protected?
Keywords: nsBranch
In the beginning of my patch, you see "mDone" is initialized. So this variable 
has to be declared as "protected" in order for it to be accessed there.
chris, can you sr this one? 
sr=waterson
fix checked in. 
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
I still see this problem in the newsgroup. I can reproduce it with the same
news article with the 2001-08-22 build: going to the above server with
Autodetect set to All (Russian), the encoding of the article is detected as
Western; I can correct the display manually. Viewing in the browser has no
problem: with Autodetect set to All (or Russian) the encoding is detected as
KOI8-R in the browser window. Eugene, do you see this happening? Reopening.
Status: RESOLVED → REOPENED
Keywords: nsBranch → nsbranch
Resolution: FIXED → ---
Attached patch proposed patch (obsolete) — Splinter Review
There are 2 problems in the charset detector code. 1. Mail assumes the string
charset detector uses the same name as its counterpart in the browser. 2. When
resetting, mAvailable should not be cleared.

This patch is for the universal charset detector; the same patch will be checked
in to the commercial tree to fix detector "All". The universal detector works far
better than "All", because the 3rd party code always reports something wrong.
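To make problem 1 concrete, here is a small Python sketch of a name-based lookup that silently fails when the string detector is registered under a different name than its browser counterpart. Everything in it is an assumption for illustration: the prefixes `DOC_DETECTOR_BASE` and `STRING_DETECTOR_BASE` are placeholder strings rather than the real contract ID values, the dictionary stands in for component registration, and the detector names are used only as example keys.

```python
# Placeholder prefixes; the real contract ID strings are not given in this bug.
DOC_DETECTOR_BASE = "example/charsetdetect?type="
STRING_DETECTOR_BASE = "example/stringcharsetdetect?type="


def make_registry(string_detector_name):
    """Build a toy component registry with one detector of each kind."""
    return {
        DOC_DETECTOR_BASE + "universal_charset_detector": "document detector",
        STRING_DETECTOR_BASE + string_detector_name: "string detector",
    }


def string_detector_for(registry, selected_name):
    """Look up the string detector by reusing the browser detector's name,
    which is the assumption mail is described as making."""
    return registry.get(STRING_DETECTOR_BASE + selected_name)


mismatched = make_registry("universal_string_charset_detector")
print(string_detector_for(mismatched, "universal_charset_detector"))

matched = make_registry("universal_charset_detector")
print(string_detector_for(matched, "universal_charset_detector"))
```

With mismatched names the lookup finds nothing, so no detection happens at all; once both sides agree on the name the same lookup succeeds.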
Status: REOPENED → ASSIGNED
Nominate this one for branch. 
Roy, can you review my patch?
Target Milestone: Future → mozilla0.9.4
adding nsbranch+
Keywords: nsbranch → nsbranch+
/r=yokoyama
shanjian: thanks for correcting 
NS_STRCDETECTOR_CONTRACTID_BASE "universal_string_charset_detector"

Attachment #47400 - Flags: review+
chris, could you sr this one? thanks.
Whiteboard: need r/sr, since 7/19 → need sr/a
Comment on attachment 47400
proposed patch

Um, does anyone ever actually _read_ the value of mAvailable?
That's right. mAvailable is used to show whether the 3rd party language module
has been initialized correctly or not. Since the 3rd party detector has been
removed from the mozilla tree, this flag should be removed as well.
(I copied the patch from the commercial tree without careful examination. Sorry.)
A new patch will come soon.
Whiteboard: need sr/a → need r/sr/a
Comment on attachment 48623
update my patch (remove unused mAvailable).

sr=waterson
Attachment #48623 - Flags: superreview+
Comment on attachment 48623
update my patch (remove unused mAvailable).

looks good
/r=yokoyama
Attachment #48623 - Flags: review+
Comment on attachment 47400
proposed patch

marking as obsolete
Attachment #47400 - Attachment is obsolete: true
Whiteboard: need r/sr/a → need a=
fix checked in to trunk. 
update summary
Summary: Cyrillic is not autodetected by nsIStringCharsetDetector → universal charset detector does not work in mail/news
fix checked in to branch. 
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 23 years ago
Resolution: --- → FIXED
I am using the 2001-09-13-03-0.9.4 build. I still see the problem in mail/news,
but if I perform the following steps with the browser it works fine:
* Copy the KOI8-R text to HTML composer and save as KOI8-R
* Remove META charset tag.
* Open the HTML file in browser with Cyrillic auto detection ON
and it points to Cyrillic-koi8-r. In mail, when there is no MIME, it is still
pointing to Western... I am reopening. Shanjian, any suggestions?

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This bug is about universal charset detector. For Cyrillic detector, please open 
another bug. 
Status: REOPENED → RESOLVED
Closed: 23 years ago → 23 years ago
Resolution: --- → FIXED
I have problems using the Universal detector to detect a Shift_JIS attachment.
I'll attach the mails to the report.
This is not a Cyrillic problem; I had problems with detecting Japanese as well.
shirley/marina,
Using the original testcase, I verified that the problem is at least resolved
in my local tree. The problem you are experiencing might be a different one, so
please file a new bug for those problems. By the way, for email problems, if
you can cc me a copy, it will be much easier for me to reproduce the problem.
thanks.
Shanjian, i would open a new bug but wouldn't the problem be the same:
Autodetect set to all is not working in Mail/news?
I guess the new problem is in the attachment auto-detect.
Not only that: I have Cyrillic messages that have no attachment but are still
not detected as Cyrillic.
marina, I used your original testcase, which is a newsgroup in Russian. It
didn't work for me before my patch, and now it works well. That's why I believe
the problem I want to address has been fixed. If that is not the behavior you
observed, we should reopen this bug. If you observe the problem using another
testcase, such as a mail attachment, it is very likely to be a different problem.
(BTW, I could not get any news in Russian using the url provided. So I just
subscribed to a newsgroup called "fido7.www.station.ru", which can be found on
almost any news server.)

Does that sound reasonable to you?
If you still feel confused, send me a mail testcase which does not work for you,
and let me find out if this is a new problem or not. I can file a new bug or
reopen this one after I see what's happening.

Shanjian, I think that the only confusion (after clearing cache) was that the
checkmark after setting to Universal is still pointing to Western even though
the display of the message is correct. The same would be true for Autodetect All
(or Russian or Japanese): the display is correct after reselecting, but the
checkmark doesn't show the right encoding. I am verifying this as fixed because
you are right, the original problem is gone.
Status: RESOLVED → VERIFIED
Filed bug 99630 for the problem in sjis attachment auto detect.
Product: MailNews → Core
Product: Core → MailNews Core