Closed Bug 101995 Opened 23 years ago Closed 9 years ago

Need a generic way to guess at charset (layout depends on cache/etc)

Component: Core :: Internationalization
Type: defect
Priority: Not set
Severity: normal
Status: RESOLVED WONTFIX
Reporter: alecf; Assigned: smontagu
Attachments: (2 files, 3 obsolete files)

Right now in 

http://lxr.mozilla.org/seamonkey/source/content/html/document/src/nsHTMLDocument.cpp#425
We use a whole bunch of methods to guess at the character set:
  - default/fallback of ISO-8859-1
  - http header
  - user default
  - requested by the document
  - user forced
  - cache
  - bookmarks (this also brings in rdf)
  - parent frame
  - autodetect
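
The list above is effectively a first-match fallback chain. As a hypothetical C++ sketch (not the actual nsHTMLDocument code — the function and names here are invented for illustration), each "guesser" either produces a charset or passes, and the document takes the first answer, falling back to the default:

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch of the fallback chain: each guesser returns an
// empty string when it cannot determine a charset; the first non-empty
// answer wins, with ISO-8859-1 as the final default.
std::string GuessCharset(
    const std::vector<std::function<std::string()>>& guessers) {
  for (const auto& guess : guessers) {
    std::string charset = guess();
    if (!charset.empty())
      return charset;  // first source that knows the charset wins
  }
  return "ISO-8859-1";  // default/fallback
}
```

In the real code the sources are consulted in a fixed priority order; the sketch just shows the first-match shape of the logic.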

The problem is that some of these (http header, cache, bookmarks) bring in
modules that are totally irrelevant, and that embedders may not want. We need a
mechanism so that a component can register as a "charset guesser" or something similar.

My initial plan is to have a category that each component will register
themselves in, and in nsHTMLDocument, we'll go through each member of the
category until we find a match.
reassigning to me, I just wanted to get the default owner.. putting the original
author of some of this, jbetak, in as well.
Assignee: yokoyama → alecf
Priority: -- → P2
Target Milestone: --- → mozilla0.9.5
Adding this to the dependency tracker, because if we fix this, we can eliminate
the following dependencies in content:
- necko
- nkcache
- bookmarks

and possibly more (we won't know until we try :))

*** This bug has been marked as a duplicate of 100107 ***
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
cc'ing ftang and shanjian
argh, reopening.. I did NOT mean to mark this a dupe
Status: RESOLVED → REOPENED
Depends on: 100107
Resolution: DUPLICATE → ---
There are a few issues that I've thought about since posting this:
- How often is this routine called, i.e. how performant should this be? I'm
thinking that this affects document loading and that we should probably
enumerate the category just once (the first time), and then cache the results.
- How will we handle ordering? I'm thinking I'll just sort the keys in the
category, and when components register as members of this category, they'll just
prefix themselves appropriately. (i.e. maybe "100 bookmarks" for bookmarks, "200
cache" for cache, etc, so we have room for expansion)
- The "charset guesser" interface will have to have some way to provide context
to the guesser - for now I'll probably just provide enough context such that we
can handle the two most important cases (bookmarks and cache) and then figure
out how to do harder ones (like http headers, which will need to know what
channel it came in on, etc)
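
The ordering idea in the second point can be illustrated with a small hypothetical sketch (the key strings and function are invented for the example): registering components under numerically prefixed category keys means a plain lexicographic sort yields the iteration order, with gaps (e.g. "150 ...") left for future entries. Note this relies on the numeric prefixes having equal width.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical illustration of the proposed ordering scheme: category
// keys like "100 bookmarks" and "200 cache" sort lexicographically
// into the intended consultation order, as long as the numeric
// prefixes are zero-padded to the same width.
std::vector<std::string> OrderCategoryEntries(std::vector<std::string> keys) {
  std::sort(keys.begin(), keys.end());
  return keys;
}
```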
No longer depends on: 100107
Blocks: 100107
Status: REOPENED → ASSIGNED
I have an alternative approach for your consideration. Instead of
using an out-of-band channel to transfer charset info, we can simply use the HTTP data
stream to perform this task. The idea is to add an internal tag before the real HTML
data, just like the invention of the meta tag. It would look something like:
<MOZ-META Charset="Shift-JIS" CharsetSource="HTTP-Header">
Such tags are solely internal and must appear before the real HTML data, and they
should be optional to the parser.
When a charset is found in the HTTP header, the network module can just send such a string
before any real data. The parser will be notified when it is parsing the data.
Similar stories apply for the cache and bookmarks. (We still need to leave user default, user
forced, meta charset, parent frame, and autodetect in place.) Those tags will be
stored in the cache, and that will give the parser more information about how the
charset was obtained. Besides charset information, we could probably put more
things there, or even extend it to be a general out-of-band information exchange
method. That might help break other inter-module dependencies.
An interesting idea, but personally I think that places an unnecessary burden
on the HTTP protocol to insert HTML-specific code into the stream. For example,
what if I retrieve a text file over HTTP? We can't insert the tag into that
stream because we won't be creating an HTML document. That means that the HTTP
protocol has to have special knowledge of the "text/html" type so that it knows
that it is appropriate to insert this tag.

And besides, that doesn't even solve the problem of getting the information from
bookmarks or the cache.

Personally, I think that we need to create such a mechanism anyway for
bookmarks/cache; we might as well make it accommodate HTTP.

However, HTTP is a lower priority, because there are other dependencies on
necko/http as it is.
Keywords: intl
QA Contact: andreasb → ylong
"Content-Type" is available in the HTTP header, so that's not a problem. The cache is the
easiest to handle: we just need to include those internal tags. Bookmarks are
something I have less knowledge of. I hope there is an existing mechanism for them
to send such hints to the network module and let it do the work.
I don't understand your objection to the category idea. You've provided another
suggestion, which is fine, but I don't see why it's better than using categories.

The HTML-specific idea is OK, but it doesn't address all cases, and (as I said
earlier) it makes such modules have unnecessary dependencies on HTML, not to
mention adding an extra burden on the parser to be aware of data that might be
in the stream before the actual file... Also, what would we do in the case of a
text file which is encoded in some particular character set? I also object to it
from an architectural perspective: instead of clearly separating the
stream and its encoding, we are inserting the encoding into the stream. I think
an out-of-band solution is actually more architecturally sound.

Also, cache and bookmarks can't get involved in the stream:
- the cache isn't always involved - sometimes the data is coming over the net, but
the code (as it stands today) will look to the cache in case the newer version
of the file does not have an associated encoding.
- there's no way for bookmarks to get involved in the stream without adding
extra hooks into necko, and it seems weird to rearchitect necko to add this
functionality.
Again, what is the objection to the category method? I'm the one who will be
implementing this, so it's not extra work for you :)
Sorry for the misunderstanding. I have no objection to your category idea. I
just wanted to mention an alternative approach for your consideration. It is entirely
up to you to analyze the pros and cons of each approach and use the one that you
believe is better.
Priority: P2 → P1
Target Milestone: mozilla0.9.5 → mozilla0.9.6
Priority: P1 → P2
Target Milestone: mozilla0.9.6 → mozilla0.9.7
Summary: Need a generic way to guess at charset → Need a generic way to guess at charset (layout depends on bookmarks)
*sigh* pushing out to 0.9.8, I just have too many little bugs that are more
important.
Target Milestone: mozilla0.9.7 → mozilla0.9.8
*** Bug 111336 has been marked as a duplicate of this bug. ***
bad news: gotta move these bugs to 0.9.9, I have too much on my plate right now.
Target Milestone: mozilla0.9.8 → mozilla0.9.9
Priority: P2 → P3
Target Milestone: mozilla0.9.9 → mozilla1.1
Keywords: topembed
topembed-, embed
keep working this. take for later releases
-triage team (chofmann, cathleen, marcia)
Keywords: topembed → embed, topembed-
Priority: P3 → P2
woah, 1.1 came up quick. Throwing these bugs over to 1.2.. little or no work has
been done and there is no way that these bugs can be fixed in 1.1.
Target Milestone: mozilla1.1alpha → mozilla1.2alpha
Target Milestone: mozilla1.2alpha → mozilla1.2beta
many embedders don't use our bookmarks impl.
Keywords: topembed- → topembed+
ok, here's the investigation I've done:

Functions called to guess the charset, and data they need (in parentheses)

TryUserForcedCharset (muCV, dcInfo)
  muCV <- docshell <- aContainer
  dcInfo <- docshell <- aContainer
TryHintCharset (muCV)
  muCV <- docshell <- aContainer
TryParentCharset (dcInfo)
  dcInfo <- docshell <- aContainer
TryChannelCharset (channel)
  aChannel
TryBookmarkCharset (urlspec)
  urlSpec <- url <- aChannel
  ** url is needed later for NS_NewHTMLContentSink **
TryCacheCharset (cacheDescriptor)
  cacheDescriptor <- cacheToken <- cachingChan <- httpChannel <- aChannel
  ** cacheDescriptor is later used to set the charset **
TryDefaultCharset (muCV)
  muCV <- docshell <- aContainer
UseWeakDocType (none)

later, if it is a post,
GetPrevDocCharset (muCV)
  muCV <- docshell <- aContainer

all data eventually comes from aContainer, aChannel:
docshell used by 5 consumers (2x by one) - almost all 
muCV used by 4 consumers, can also come from parent, can be used later to set
the previous doc's charset
dcInfo used by 2 consumers
url/urlspec used by 1 consumer
httpChannel used by 1 consumer

so here's the thought:

a new interface
nsICharsetDetector : nsISupports
{
    // @returns true if the handler will handle it later. If so, then the
    // resulting closure object should be passed to postCharset() with the
    // detected charset
    boolean detectCharset(in nsIDocShell aDocShell, in nsIChannel aChannel, out
nsISupports aClosure);

    void postCharset(in AString aCharset, in nsISupports aClosure);
};

so in the case of the cache and the bookmarks method, postCharset() will go back
and set the charset of the cache entry or the bookmark entry. The URL might also
come out of that.

Each charset detector will implement the above interface via a service object of
some kind. Initially, I'll just have a static, ordered list of contractIDs to
try, but eventually I'd like to get to the point where we're using categories.
The trick there is to make sure we order the categories correctly.
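
The two-phase detect/post contract sketched in the interface above can be mocked up in plain C++ (the names here are illustrative, not the real XPCOM API): a resolver that cannot answer up front may still ask to be told the final charset so it can record it for the next load — which is exactly how the cache and bookmark cases are meant to work.

```cpp
#include <string>

// Hypothetical mock of the detect/post two-phase flow. A cache-like
// resolver has nothing stored on the first load, so it fails to detect
// but requests a callback; once postCharset() reports the final answer,
// subsequent loads resolve immediately from the stored value.
struct CacheLikeResolver {
  bool wantCharset = false;  // set when we want the postCharset() callback
  std::string stored;        // what the "cache" remembers for next time

  // Phase 1: resolve if we already know; otherwise request a callback.
  bool DetectCharset(std::string& aCharset) {
    if (!stored.empty()) {
      aCharset = stored;
      return true;
    }
    wantCharset = true;
    return false;
  }

  // Phase 2: the document reports the charset that was finally chosen.
  void PostCharset(const std::string& aCharset) { stored = aCharset; }
};
```

This mirrors the comment in the proposed IDL: `detectCharset` returns whether the handler wants to be called later, and `postCharset` closes the loop with the detected charset.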
Attached patch prep work, new interface (obsolete) — Splinter Review
ok, this patch:
- cleans up nsHTMLDocument to move all the charset guessing code into a
function GuessCharset()
- adds the new interface
- makes the bookmark service implement the new interface, which prompted me to
make some string changes to the bookmark service

Reviews?
Attached patch prep work, new interface (obsolete) — Splinter Review
oops, let's try that again... here's the actual patch.
Attachment #103815 - Attachment is obsolete: true
These bugs missed the 1.2 train. Moving these out to 1.3alpha
Target Milestone: mozilla1.2beta → mozilla1.3alpha
alec, 
I went through the code. Most of it is just reorganization. Your new
interface is not used anywhere yet; I understand that this is preparation work
only. I didn't have much to say about the code, and I didn't spot any problems.
I do have some suggestions about naming and your interface design.

1, "Charset detector" is a name we already use heavily for "real" charset detection, i.e.
given a binary stream, detecting the encoding of that stream. So you might
want to use a different word for this purpose. I suggest CharsetResolver in this
context.
2, I would suggest using "AString" instead of "ACString" in your interface
design. We use UTF-16 for charset representation in most places, so
there is no need for this conversion. This could matter if non-ASCII
characters ever get into a character set name, though that is very unlikely.

3, In the detectCharset function (as I mentioned in 1, I suggest renaming it
resolveCharset or queryCharset), I would suggest adding an additional
argument so that the charset source can be returned. When I was fine-tuning charset
resolution, I found it is sometimes necessary to discriminate between charsets even when they
come from the same source. For example, the charset from the parent was later
categorized as kCharsetFromParentForced and kCharsetFromParentFrame. This may
happen with other categories as well. Using bookmarks as another example: a charset a user
manually set for a page (a forced charset) and a charset detected by the computer
might need to be treated differently. When such information needs to be recorded
by bookmarks, your interface should be able to transfer it back to the caller.

4, I didn't quite understand how "postDetection" is supposed to be used. Our
charset auto-detection sends a notification to the webshell/parser if it detects a
new charset at a late stage. Is this function aimed at that purpose as well?

 
shanjian,
Thanks for your review...

from what I understand, charset NAMES are always ASCII, at least the way we use
them in mozilla. In fact, there are many places where we do raw ASCII/UCS2
conversion of charset names so I doubt that non-ASCII charset names would even
work in our product as it stands today. But that subject is beyond the scope
of this bug.

(In fact one of my other footprint tasks is to convert the charset conversion
APIs over to using ASCII (or at least UTF8) charset names (using nsACString&) -
I'll need to talk with you a little about that later, but in the meantime I
think this is the right thing to do.)

and so, addressing the rest of your review:
1) I like Resolver, I'll switch to that.
2) (covered above) - I'd like to leave it as it is and assume ASCII
3) I'll add the "source" output, as you describe. 
4) oh yeah I didn't explain that well. postDetection is going to be used by some
of the other resolvers - it basically lets the resolver clean up after the
detection. So after the initial detection happens, certain resolvers will want
to be notified of the charset that they were just asked about. For example, the
cache might fail to detect a charset, but it is interested in the charset of the
URL so that it has it for next time. Basically, it is needed to maintain the
current behavior when we switch to this new mechanism.


Target Milestone: mozilla1.3alpha → mozilla1.4alpha
mass moving lower risk 1.4alpha stuff to 1.4beta
Target Milestone: mozilla1.4alpha → mozilla1.4beta
Missed the 1.4b train; will this be targeted for final 1.4 or 1.5 alpha?
Please re-set the milestone accordingly.
Target Milestone: mozilla1.4beta → ---
ok, I've finally gotten to this. This is the minimal work necessary. What this
does now is add a new interface (nsICharsetResolver) as described above, and
does the absolute minimum to make nsHTMLDocument::TryBookmarks() call through
this interface rather than the nsIBookmarks service. This removes dependencies on
appcomps but of course introduces a common dependency on chardet. It also adds
a dependency on htmlparser (for nsIParser.h, for kCharsetFromBookmarks).

This will get us out of our current predicament, and lays the groundwork for
making this way more generic.

Someone care to review?
Attachment #103816 - Attachment is obsolete: true
I should add that obviously the kCharsetFromBookmarks dependency is temporary
for now. I haven't decided if these constants belong as part of the
nsICharsetResolver interface (which would suck, because that would mean
nsICharsetResolver would have implicit knowledge of all possible resolvers).

For now we'll just continue to depend on the C++ #defines.
oh, and that kQuote/kSpace thing is because nsIParser.h defines two global
variables that conflict in name (they are 'const PRUnichar' in nsIParser.h and
'const char[]' in nsBookmarksService.cpp)

Comment on attachment 123328 [details] [diff] [review]
new patch - does minimal work to remove dependencies

requesting r / sr - brendan/shanjian please see my previous comments (starting
with comment 25) - they explain some of the oddities of this patch.
Attachment #123328 - Flags: superreview?(brendan)
Attachment #123328 - Flags: review?(shanjian)
Comment on attachment 123328 [details] [diff] [review]
new patch - does minimal work to remove dependencies

>+++ intl/chardet/public/Makefile.in	14 May 2003 21:36:24 -0000
>@@ -31,6 +31,7 @@ MODULE		= chardet
> XPIDLSRCS       = \
>                 nsIDocumentCharsetInfo.idl \
>                 nsIDocCharset.idl \
>+				nsICharsetResolver.idl \

Wrong tabstops -- use 8-space equivalent tabs, no spaces.

>+    /**
>+     * requestCharset
>+     * called to resolve the charset of an existing docshell.
>+     * If the charset cannot be resolved, but the implementation is
>+     * still curious what the final charset turned out to be, it can
>+     * set wantCharset to true.
>+     * If so, notifyResolvedCharset will be called with the resulting
>+     * closure
>+     * 
>+     * @param docShell    the docshell the document is being loaded in
>+     * @param channel     the channel the document is coming in from
>+     * @param charsetSource a unique number which can be stored by the
>+     *                    caller to remember which resolver actually
>+     *                    resolved the charset.
>+     * @param wantCharset gets set to true if notifyResolvedCharset should be
>+     *                    called with the given closure object

Is the wantCharset machinery really necessary? Seems unused, and "when in
doubt, leave it out" applies.

Where's the mozilla/browser patch?

/be
Yes, it's needed for the cache.... I wouldn't have written it if it wasn't going
to be used. :)

It's actually the mechanism that the cache uses to store the charset of the page
so a page doesn't need to be re-detected when loaded from the cache.

Note that this fix only fixes up bookmarks - the goal is to remove all these
dependencies (Such as the cache) from this charset detection code - this one
hunk of code is responsible for a number of ugly dependencies.


oops, brendan please see my previous comments - I loaded another page before the
bugmail went out.
Comment on attachment 123328 [details] [diff] [review]
new patch - does minimal work to remove dependencies

I talked with Jan on IRC and he says that nsIBookmarksService is forked from
mozilla to phoenix.

The difference will be that the QI will fail, and TryBookmarksCharset() will
return early.
Comment on attachment 123328 [details] [diff] [review]
new patch - does minimal work to remove dependencies

sr=brendan@mozilla.org with the makefile tab fix.

I didn't realize there was more coming, including cache-related changes.  In
this bug, or a new one?

/be
Attachment #123328 - Flags: superreview?(brendan) → superreview+
well, I was originally planning on doing the global architecture changes for
this bug - there are a bunch of unnecessary dependencies (see comment 0 and
comment 2) - that's why this patch is just a quick hack to get rid of the
bookmarks dependency...but it lays the groundwork for future work

short answer: yes, in this bug.
*** Bug 201821 has been marked as a duplicate of this bug. ***
Comment on attachment 123328 [details] [diff] [review]
new patch - does minimal work to remove dependencies

trying a new reviewer.. jshin, can you give me a review here? There is no
functionality loss here, just an architecture change to reduce dependencies.
Attachment #123328 - Flags: review?(shanjian) → review?(jshin)
Comment on attachment 123328 [details] [diff] [review]
new patch - does minimal work to remove dependencies

r=jshin. Sorry for the delay. Because shanjian did a thorough review of
the interface and this is the minimal patch, I don't have anything to add. A _brief_
test with the patch didn't show any sign of regression, either.
Attachment #123328 - Flags: review?(jshin) → review+
Summary: Need a generic way to guess at charset (layout depends on bookmarks) → Need a generic way to guess at charset (layout depends on cache/etc)
Alec, this looks done, are there any barriers to checking it in?
oh, sorry. the bookmarks stuff already landed. cache and http come next.
So I'm looking at the API this bug introduced, and I don't see how callees could
possibly set charsetSource in general. That number needs to fit into our
hierarchy of nsIParser charset sources, and that's a private interface. So an
embedder who wants to set charsets from bookmarks cannot use it.

Also, given that HTMLdocument has to instantiate the resolver to call it, it at
least knows what contract it's instantiating and therefore can decide for itself
what the charset source should be.

Other than that, this patch leaves us depending on the fact that bookmarks are
RDF, which may not be true in an embedding app....
ccing vlad too, since bookmarks are involved.
so my original plan (which I never followed through on, sorry) was to make a
category of charset resolvers, which would be iterated in some order, asking
each one to resolve the charset. What you see here was the first step.

obviously right now we're just instantiating the interface

the sourceCharset was definitely something I hadn't quite worked out. I'm not
even sure it's used anywhere; I think it might just be recorded for posterity.
Other options include just keeping a pointer to the charset resolver rather than
some magic number which indicates the source.
The charset source is heavily used in HTML document, parser, etc.  That's how we
ensure that higher-priority sources (eg user override, HTTP header, etc) don't
get clobbered by lower-priority sources (eg bookmarks).  There is no way to use
a pointer to the resolver for this purpose.

If we had a category, we would need to somehow order the things within the
category and somehow interleave them with the internal charset sources we have.
 In any case, we wouldn't want to have the charset source in the API itself.
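
The "don't clobber" rule described here can be sketched in a few lines of hypothetical C++ (the enum values are invented for the example; the real constants are the kCharsetFrom* values in nsIParser.h):

```cpp
// Hypothetical illustration of the charset-source priority rule: a
// newly proposed charset only wins if its source outranks the one
// already recorded, so e.g. a user override is never clobbered by a
// bookmark hint. The numeric values here are made up for the sketch.
enum CharsetSource {
  kSourceNone = 0,
  kSourceFromBookmarks = 2,
  kSourceFromHTTPHeader = 5,
  kSourceFromUserForced = 9,
};

// Returns true (and updates aCurrent) only when the new source has
// strictly higher priority than the source already in effect.
bool MaybeSetCharsetSource(CharsetSource aNew, CharsetSource& aCurrent) {
  if (aNew <= aCurrent)
    return false;  // lower or equal priority: don't clobber
  aCurrent = aNew;
  return true;
}
```

This is why the source must be comparable against the internal hierarchy: a resolver pointer alone carries no ordering.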
Attached patch Possible change for discussion (obsolete) — Splinter Review
The goal here is to allow embedders to make use of this. This means no
nsIParser dependency, no RDF dependency....
Blocks: 284660
Comment on attachment 176175 [details] [diff] [review]
Possible change for discussion

I like it.. simple and to the point :)
Attachment #176175 - Attachment is obsolete: true
Attachment #176181 - Flags: superreview?(alecf)
Attachment #176181 - Flags: review?(benjamin)
Attachment #176181 - Flags: review?(benjamin) → review+
Comment on attachment 176181 [details] [diff] [review]
Same but compiling [checked in]

sr=alecf
Attachment #176181 - Flags: superreview?(alecf) → superreview+
Comment on attachment 176181 [details] [diff] [review]
Same but compiling [checked in]

Checked this in.
Attachment #176181 - Attachment description: Same but compiling → Same but compiling [checked in]
QA Contact: amyy → i18n
Bug 700490 is proposing backing out the patches that landed here.
Assignee: alecf → smontagu
Keywords: embed, intl, topembed+
OS: Windows 2000 → All
Priority: P2 → --
Hardware: x86 → All
Attachment #176181 - Flags: checkin+
Per bug 700490 and events since then.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX