Closed Bug 235090 Opened 20 years ago Closed 20 years ago

CSS stylesheets labeled as "UTF-16" don't load

Categories

(Core :: CSS Parsing and Computation, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: VERIFIED FIXED
Target Milestone: mozilla1.7final

People

(Reporter: dbaron, Assigned: jshin1987)

References

(URL: http://www.w3.org/Style/Examples/010/utf-16-correct.html)

Details

(Keywords: intl, verified1.7)

Attachments

(2 files, 4 obsolete files)

http://www.w3.org/Style/Examples/010/utf-16-correct.html should be green.

The CSS parser doesn't support "UTF-16"/"utf-16" encoding.  This is mainly the
result of bug 68738.  This needs to be fixed in the uconv code rather than in the
HTML parser, since the parser is not uconv's only consumer.
(And while we're here it might make sense to add the name ucs2 to
charsetalias.properties if that's not deprecated.)
(In reply to comment #0)
> http://www.w3.org/Style/Examples/010/utf-16-correct.html should be green.
> 
> The CSS parser doesn't support "UTF-16"/"utf-16" encoding.  This is mainly the
> result of bug 68738. 

  Did you mean the result of the patch for bug 68738 or the result of bug 68738
having been fixed only partially?  

The encoding of the stylesheet at
http://www.w3.org/Style/Examples/010/utf-16-correct.css is correctly detected as
UTF-16LE when it's loaded into a browser window by itself. Nonetheless, Mozilla
fails the test, so there's something to fix. I guess our CSSLoader doesn't
recognize it correctly.


> add the name ucs2 to charsetalias.properties

  Originally, I guess the file was meant to include aliases registered with the
IANA. We made an exception last year with 'iso8859_1' (or something like that)
because we found some broken sites using JSP/servlets that emit that name. Do you
know of sites using 'ucs2'?
 
> This needs to be fixed in the uconv code rather than in the
> HTML parser, since the parser is not uconv's only consumer.

  I guess you meant we need to add a BOM-detection method to uconv and make it
available to our CSS loader and (JS) script loader as well as our html/xml
parser (bug 68738 comment #20 and a few comments before that). Or, do you think
we need a separate UTF-16 decoder that looks for a BOM to determine the
endianness? The latter wouldn't help Mozilla pass the test given here although
it could be useful for other cases.
 
Keywords: intl
@charset "UTF-16"; should be valid, but it causes the CSS loader to request (via
a converter in netwerk/base/src/) a unicode decoder for it.  I think we should
have a separate decoder for "UTF-16" that looks for a BOM -- then we could
discard the hack in the HTML parser that drops any explicit "UTF-16".  And I
think it would fix this case -- the @charset rule causes an explicit "UTF-16"
decoder to be requested.
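(For illustration, the consumer side of that request looks roughly like the sketch below. It uses the NS_UNICODEDECODER_CONTRACTID_BASE contract-ID scheme that shows up in the patch later in this bug; the helper name and the exact headers are assumptions, and this is not claimed to be the CSS loader's actual code path.)

#include "nsCOMPtr.h"
#include "nsString.h"
#include "nsIUnicodeDecoder.h"  // assumed home of NS_UNICODEDECODER_CONTRACTID_BASE

// Hypothetical helper: ask uconv for a decoder by charset name. Today this
// fails for "UTF-16" because no decoder is registered under that name,
// which is what this bug is about.
static nsresult
GetDecoderFor(const char * aCharset, nsIUnicodeDecoder ** aResult)
{
  nsCAutoString contractId(NS_UNICODEDECODER_CONTRACTID_BASE);
  contractId.Append(aCharset);                   // e.g. "...charset=UTF-16"

  nsresult rv;
  nsCOMPtr<nsIUnicodeDecoder> decoder = do_CreateInstance(contractId.get(), &rv);
  if (NS_FAILED(rv))
    return rv;                                   // no component for "UTF-16"

  NS_ADDREF(*aResult = decoder);
  return NS_OK;
}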
(In reply to comment #3)
> @charset "UTF-16"; should be valid, but it causes the CSS loader to request (via
> a converter in netwerk/base/src/) a unicode decoder for it.  
> I think we should have a separate decoder for "UTF-16" that looks for a BOM

To recognize '@charset "UTF-16"', we need to know that the file is in UTF-16LE.
In your example, the http header comes with just 'text/css'. A 'UTF-16' decoder
with a built-in BOM detector is useful only when we know in advance that it's
UTF-16 but don't know whether it's LE or BE.  If we know that the file is in
UTF-16LE, we just have to call our UTF-16LE decoder directly.


> -- then we could discard the hack in the HTML parser that drops any 
> explicit "UTF-16".  

You meant 

http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&root=/cvsroot&subdir=mozilla/htmlparser/src&command=DIFF_FRAMESET&file=nsParser.cpp&rev2=3.271&rev1=3.270

(that was bug 98214). However, that's valid because by the time we reach there,
we already know it's one of the byte-oriented encodings, so 'UTF-16' in the meta
tag can't be right.


So... We look at the BOM in the CSSLoader.  We use this to "parse" the @charset
rule.  If this says UTF-16, we just ask uconv for a decoder for that.

I _could_ hack the CSSLoader to just go with what the BOM said if the @charset
was UTF-16.  But the point is that this should not be needed... I think what
David is suggesting is the following:

1)  Have a decoder registered for "UTF-16"
2)  Have this decoder look at the first char, and if it's a BOM instantiate the
    appropriate UTF-16[BL]E decoder.  Then proxy all the rest of the decoder
    stuff to it.  If there is no BOM, assume UTF-16BE or something.

This shouldn't be very much code, really...  I could put this logic in the
CSSLoader, like I said, but the point is that this sort of thing should really
be opaque to uconv consumers.  Consider the case of data coming in with a
"Content-Type: text/css; charset=UTF-16" -- in that case the CSSLoader really
shouldn't even look at the data; it should just ask uconv to deal and uconv
should deal....
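(A rough sketch of that shape, not the actual patch: a decoder registered for "UTF-16" that sniffs the BOM on the first Convert() call and then proxies to an endian-specific decoder. The class name is invented, the other nsIUnicodeDecoder methods are omitted, and it assumes the endian-specific decoders strip a leading BOM themselves, as the patch in this bug has them do.)

// Illustrative sketch only; error handling kept minimal.
class SketchUTF16Decoder : public nsBasicDecoderSupport
{
public:
  NS_IMETHOD Convert(const char * aSrc, PRInt32 * aSrcLength,
                     PRUnichar * aDest, PRInt32 * aDestLength)
  {
    if (!mInner) {
      if (*aSrcLength < 2) {
        *aDestLength = 0;
        return NS_OK_UDEC_MOREINPUT;             // need two bytes to sniff
      }
      PRUint8 b0 = PRUint8(aSrc[0]);
      PRUint8 b1 = PRUint8(aSrc[1]);
      if (b0 == 0xFE && b1 == 0xFF)
        mInner = new nsUTF16BEToUnicode();       // big-endian BOM
      else if (b0 == 0xFF && b1 == 0xFE)
        mInner = new nsUTF16LEToUnicode();       // little-endian BOM
      else
        mInner = new nsUTF16BEToUnicode();       // no BOM: pick a default
    }
    // Proxy everything (including the BOM; the inner decoder strips it).
    return mInner->Convert(aSrc, aSrcLength, aDest, aDestLength);
  }

private:
  nsCOMPtr<nsIUnicodeDecoder> mInner;
  // GetMaxLength() and Reset() would forward to mInner as well.
};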
Attached patch patch (Splinter Review)
I completely forgot that bz had added BOM detection to nsCSSLoader.cpp in bug
218915. With this patch, 'LE' or 'BE' is appended to 'UTF-16' or 'UTF-32' if
the endianness is known from bz's code.
Assignee: dbaron → jshin
Status: NEW → ASSIGNED
Sorry, I hadn't read the bug mail from bz when I uploaded my patch. I've just
read it. I'm not sure about adding UTF-16 and UTF-32 decoders that look for a
BOM, although it seems cleaner that way.
Attached patch alternative (uconv patch) (obsolete) — Splinter Review
This works for the case given in the URL field of this bug, but it breaks some
other cases at http://jshin.dyndns.org/i18n/utftest/index.html
Without the patch, all cases at the page worked, but with the patch, UTF-16LE
(with BOM) labelled as UTF-16 (the 9th case from the top) doesn't get rendered
correctly. Actually, this should be all right even with this patch ....
Moreover, UTF-16BE (with BOM) tagged as UTF-16 in the http c-t header works
fine.... There's something funny going on here. The asymmetry between BE and LE
is something I have to look into. (I don't claim to understand
|UTF16ConvertToUnicode| in nsUCS2BEtoUnicode.cpp) 
 
What this patch really breaks is a UTF-16LE page without BOM but tagged as
UTF-16 in the http header (the 4th case from the bottom). To my surprise, a
UTF-16BE page without BOM but tagged as UTF-16 in the http header (the 10th
from the bottom) is rendered correctly. 

In some cases, we have to 'override http header' and rely on nsParser for
'UTF-16' (and 'UTF-32'), which defeats the purpose of adding 'UTF-16' decoder
in a sense.
Comment on attachment 141962 [details] [diff] [review]
alternative (uconv patch)


>@@ -1041,14 +1041,19 @@ static const nsModuleComponentInfo compo

>+  },
>+  { 
>+    DECODER_NAME_BASE "UTF-16" , NS_UTF16TOUNICODE_CID, 
>+    NS_UNICODEDECODER_CONTRACTID_BASE "UTF-16",
>+    nsUTF16BEToUnicodeConstructor ,

Please ignore the patch. The typo in the above (UTF16BEToUnicode instead of
UTF16ToUnicode) gave me the illusion that it worked (at least partially).
Attachment #141962 - Attachment is obsolete: true
Attached patch uconv patch (obsolete) — Splinter Review
With this, documents with BOM and tagged as UTF-16 get rendered correctly.
However, documents without BOM and tagged as UTF-16 (the endian of which used
to be detected correctly) put Mozilla into a sorta infinite loop.  For them,
nsUTF16ToUnicode::Convert() fails, but that failure doesn't seem to be handled
properly by callers.

It appears that we need some heuristics in parser and elsewhere even if we add
'UTF-16' decoder.
Oh, fun.  In order:

(In reply to comment #7)
> I'm not sure about adding UTF-16 and UTF-32 decoders that look for a
> BOM, although it seems cleaner that way.

It's absolutely necessary to handle a charset of "UTF-16" in an HTTP header, no?
 In fact, how does the parser manage to do it right now?

(In reply to comment #8)
> In some cases, we have to 'override http header' and rely on nsParser for
> 'UTF-16' (and 'UTF-32'), which defeats the purpose of adding 'UTF-16' decoder
> in a sense.

Um... What are those cases?  We should not be doing that short of an explicit
user override via the view menu, no?


(In reply to comment #10)
> However, documents without BOM and tagged as UTF-16 (the endian of which used
> to be detected correctly) put Mozilla into a sorta infinite loop.  For them,
> nsUTF16ToUnicode::Convert() fails, but that failure doesn't seem to be handled
> properly by callers.

Fix the callers?  ;)

> It appears that we need some heuristics in parser and elsewhere even if we add
> 'UTF-16' decoder.

Why can't we move those heuristics into the 'UTF-16' decoder?  Something like:

1)  If BOM, use that
2)  If no BOM, check whether first or second byte is non-null and go with that
3)  If both are null or both are non-null, assume UTF-16BE (network endianness)
    or something.  Really, if we hit this case, the document is all screwed up
    anyway.

The goal here is to have as little knowledge about this stuff outside uconv as
humanly possible.
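(Restating that heuristic as code, purely as a sketch of the idea under discussion; what to do in the ambiguous case is exactly the open question.)

// Guess the endianness of BOM-less data claimed to be "UTF-16" from the
// null-byte pattern of its first two bytes. Sketch only.
enum UTF16Endianness { eUTF16_BigEndian, eUTF16_LittleEndian, eUTF16_Unknown };

static UTF16Endianness
GuessUTF16Endianness(PRUint8 aByte0, PRUint8 aByte1)
{
  // ASCII-range characters look like 0x00 0xNN in UTF-16BE
  // and 0xNN 0x00 in UTF-16LE.
  if (aByte0 == 0x00 && aByte1 != 0x00)
    return eUTF16_BigEndian;
  if (aByte0 != 0x00 && aByte1 == 0x00)
    return eUTF16_LittleEndian;
  // Both null or both non-null: ambiguous. Fall back to network order (BE),
  // native order, or an error; that choice is what the thread is debating.
  return eUTF16_Unknown;
}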
(In reply to comment #11)
> (In reply to comment #7)
> > I'm not sure about adding UTF-16 and UTF-32 decoders that look for a
> > BOM, although it seems cleaner that way.
> 
> It's absolutely necessary to handle a charset of "UTF-16" in an HTTP header, no?
>  In fact, how does the parser manage to do it right now?

  Apparently, 'UTF-16' coming from http header is ignored because we don't have
a decoder for 'UTF-16'. So, we end up relying on the parser. See bug 68738 and
the reference to XML there. Before bug 68738, we used to alias UTF-16 to
UTF-16BE so that documents tagged as UTF-16 (even if they're UTF-16LE) were
always interpreted as UTF-16BE. 


> (In reply to comment #8)
> > In some cases, we have to 'override http header' and rely on nsParser for
> > 'UTF-16' (and 'UTF-32'), which defeats the purpose of adding 'UTF-16' decoder
> > in a sense.
> 
> Um... What are those cases?  We should not be doing that short of an explicit
> user override via the view menu, no?

By overriding, I meant not the full scale override but just determining the
endian (big or little) as my first patch did for '@charset.....' Those cases are
what I wrote about below : documents without BOM but tagged as UTF-16. Of
course, this 'guessing' based on the first several bytes doesn't work for plain
text.

> (In reply to comment #10)
> > However, documents without BOM and tagged as UTF-16 (the endian of which used
> > to be detected correctly) put Mozilla into a sorta infinite loop.  For them,
> > nsUTF16ToUnicode::Convert() fails, but that failure doesn't seem to be handled
> > properly by callers.
> 
> Fix the callers?  ;)

  I didn't even bother to check what the callers are. Anyway, even if the
callers are fixed, the document in question still wouldn't be rendered. 
 
> > It appears that we need some heuristics in parser and elsewhere even if we add
> > 'UTF-16' decoder.
> 
> Why can't we move those heuristics into the 'UTF-16' decoder?  Something like:
> 
> 1)  If BOM, use that
> 2)  If no BOM, check whether first or second byte is non-null and go with that
> 3)  If both are null or both are non-null, assume UTF-16BE (network endianness)
>     or something.  Really, if we hit this case, the document is all screwed up
>     anyway.

3) doesn't work because the majority of such a case (both byte non-null) is
likely to be UTF-16LE even though big-endian is the network byte order. If we ever
do this, it'd be better to go with the native endian.  2) doesn't work well,
either. Think about a plain text document beginning with characters like U+hh00
(where 'hh' is not '00'). Besides, I don't think #2 and #3 are what encoding
converters are supposed to do. Encoding converters are not just for html, css,
xml but also for plain text. On top of that, BOM is a normative part of the
definition of UTF-16 and UTF-32 so that we can/should use it to detect the
endian. Other document-format-specific heuristics should be done elsewhere. 

> The goal here is to have as little knowledge about this stuff outside uconv as
> humanly possible.

 However, some stuff rightly belongs elsewhere.
> If we ever do this, it'd be better to go with the native endian.

From a standards-compliance point of view, I guess (I have to check, though)
that the network byte order is the 'right' thing.
(In reply to comment #12)
> Apparently, 'UTF-16' coming from http header is ignored because we don't
> have a decoder for 'UTF-16'.

That's a bug, no?

> So, we end up relying on the parser.

Which breaks when we're not parsing HTML/XML, as here....  I assume the
scriptloader currently has the same exact issues as the CSSLoader here if
someone bothers to test.

Which is why we just want to fix this inside uconv instead of making all
consumers write code to maybe handle this.

> Before bug 68738, we used to alias UTF-16 to UTF-16BE

Yes, I know.  That was totally wacky.  ;)

> By overriding, I meant not the full scale override but just determining the
> endian ...[snipped]... Of course, this 'guessing' based on the first several
> bytes doesn't work for plain text.

Sure it does.  It works wonderfully.  Just look for null bytes or not... that
will work for ascii-compatible plaintext.  For other plaintext, there is no way
to solve the problem anyway.  The point is that since both CSS and HTML have
ascii-compatible syntax, it would work for them.  Same for JS, more or less.

>   I didn't even bother to check what the callers are. Anyway, even if the
> callers are fixed, the document in question still wouldn't be rendered. 

Unless we detect the endianness in uconv.

> > 3)  If both are null or both are non-null

> 3) doesn't work because the majority of such a case (both byte non-null) is
> likely to be UTF-16LE even though big-endian is the network byte order.

Fine.  Like I said, I have no strong preference for what to do here.  Native
endian is ok.

> 2) doesn't work well, either. Think about a plain text document beginning with
> characters like U+hh00 (where 'hh' is not '00').

This is already broken in today's world, no?

> Besides, I don't think #2 and #3 are what encoding converters are supposed to
> do.

Encoding converters are supposed to take an encoding and convert.  In this case
the encoding being given is "UTF-16".  I think the fact that there is in fact no
such encoding is a uconv-internal issue that consumers are largely ignorant
about and should be able to be ignorant about.

> Encoding converters are not just for html, css, xml but also for plain text

And your suggestion is?  Note that plaintext is currently broken anyway and we
see no way to fix it.  Why not fix the cases that we can?  Frankly, I suspect
that BOM-less stuff marked UTF-16 will be reasonably rare... and most of it will
be HTML, which we _will_ detect correctly.

> On top of that, BOM is a normative part of the definition of UTF-16 and UTF-32
> so that we can/should use it to detect the endian. Other
> document-format-specific heuristics should be done elsewhere.

I can deal with just having the BOM stuff in uconv and some hacks in the HTML
parser, I suppose, if this is very important.  But the BOM stuff _really_ needs
to be in uconv.  And we need to fix the current bug whereby the HTML parser
ignores the HTTP header charset....
 
>  However, some stuff rightly belongs elsewhere.

I'm not sure I see the reasoning there, as I said.  The point of uconv is to be
useful, no?  Right now it's creating a _lot_ of extra work for all consumers
like this....

There is one more angle to all this, btw.  As things stand, the "UTF-16"
document will get flagged as BE or LE by the parser and that's what its
resulting charset will be.  With these changes, its charset will be "UTF-16", so
we probably need an encoder too, not just a decoder (the encoder can just
convert to native endianness, I guess).
(In reply to comment #14)
> (In reply to comment #12)
> > Apparently, 'UTF-16' coming from http header is ignored because we don't
> > have a decoder for 'UTF-16'.
> 
> That's a bug, no?

 Yes, in a certain sense, but we can make up for that down the road (in the case
of html/xml, in effect, we do that...). Instead of 'ignoring' it, we can put it in
an 'intermediate' state that needs to be resolved to UTF-16LE or UTF-16BE ...

> > So, we end up relying on the parser.
> 
> Which breaks when we're not parsing HTML/XML, as here....  I assume the
> scriptloader currently has the same exact issues as the CSSLoader here if
> someone bothers to test.

I guess so. There are at least three consumers that can be relieved of their own
'endian' detection, which is a strong point of the proposal here that 'UTF-16'
be added to the list of encodings supported in uconv, as you wrote. Alternatively,
the common endianness detection could be factored out into a single function
accessible from all three consumers. See bug 68738 comment #20. (BTW, I guess we
can make the endianness detection always on for UTF-16 and UTF-32 even though the
universal detector can be turned on and off at the user's will.)

 
> > this 'guessing' based on the first several bytes doesn't work for plain text.

> Sure it does.  It works wonderfully.  Just look for null bytes or not... that
> will work for ascii-compatible plaintext.  For other plaintext, there is no way
> to solve the problem anyway.  The point is that since both CSS and HTML have
> ascii-compatible syntax, it would work for them.  Same for JS, more or less.

By 'ascii-compatible', you meant that the first few 'characters' are within the
ASCII portion of Unicode, didn't you? That usage deviates from the common use
of the term (UTF-8 and most legacy encodings are ascii-compatible, but UTF-16 is
not). However, that's not important because I understand what you meant :-)
Anyway, I'd not say 'wonderfully' because there are cases for which this doesn't
work as we discussed below. BTW, by 'plain text', I meant 'text/plain' (not
text/*) ;-)

 
> >   I didn't even bother to check what the callers are. Anyway, even if the
> > callers are fixed, the document in question still wouldn't be rendered. 
> 
> Unless we detect the endianness in uconv.

 Or, callers (or callers of callers) can fix that up ;-)
 
> > > 3)  If both are null or both are non-null
> 
> > 3) doesn't work because the majority of such a case (both byte non-null) is
> > likely to be UTF-16LE even though big-endian is the network byte order.
> 
> Fine.  Like I said, I have no strong preference for what to do here.  Native
> endian is ok.

 If we do that, we might have to return something different from NS_OK to let
callers distinguish this case from others (say, NS_OK_BOM_DETECTION_UNSURE)
in case they want to.
 
> > 2) ... Think about a plain text document beginning with
> > characters like U+hh00 (where 'hh' is not '00').
> 
> This is already broken in today's world, no?
.....
> And your suggestion is?  Note that plaintext is currently broken anyway a

 Yes, it is. See the above reference to my comment in bug 68738.
  
> > On top of that, BOM is a normative part of the definition of UTF-16 
> > and UTF-32 so that we can/should use it to detect the endian. Other
> > document-format-specific heuristics should be done elsewhere.
> 
> I can deal with just having the BOM stuff in uconv and some hacks in the HTML
> parser, I suppose, if this is very important.  

  I don't know. It doesn't sound right that we apply format-dependent heuristics
in uconv, but I can't say it's very important, either. 

> But the BOM stuff _really_ needs to be in uconv.  And we need to fix the 
> current bug whereby the HTML parser ignores the HTTP header charset....

 Yes, that appears to be a reasonable way to divvy things up. 

> >  However, some stuff rightly belongs elsewhere.
> 
> I'm not sure I see the reasoning there, as I said.  The point of uconv is to be
> useful, no?  Right now it's creating a _lot_ of extra work for all consumers
> like this....

  That's why I suggested factoring them out in bug 68738 and here. I admit that
it seems to be an unnecessary 'division of labor' based on a pedantic
distinction between what constitutes UTF-16/32 (BOM) and what doesn't. On the
other hand, it can be argued that callers can have a finer control that way than
otherwise and can satisfy format-specific requirements (if any).

> we probably need an encoder too, not just a decoder (the encoder can just
> convert to native endianness, I guess).

Actually, there is an encoder (I didn't know there was one), but I think we can
remove it. Not all decoders have to be accompanied by the corresponding
encoders. Unless we want to let users export/save documents in UTF-16, we don't
need an encoder (utf-16.notForOutgoing is added in my patch.)  Wait.... I'm
wondering what we do with 'Save Page As..' It seems like we fetch documents from
the network again even though we have them fresh in the cache (because we can't
reverse the decoding). ('view-source' does the same....) In that case, we don't
need an encoder.

  Anyway, my position is not carved in stone. Let's just get some other
opinions. It's unfortunate that smontagu is on the road. He may drop by a net
cafe and pitch in.

 BTW, I wasn't fond of UTF-16* as a character encoding for 'text/*' and now I'm
even less so.
> See bug 68738 comment #20.

Note that doing that involves giving all consumers special knowledge that UTF-16
and UTF-32 are the encodings that have an endianness (and hence need this
function called). What if we add support for another encoding with an endianness?
Do we have to go and fix all consumers?  That's why I'd prefer to just put it
all in uconv....

> By 'ascii-compatible', you meant ...[snipped]

Yep.  I meant "could be reencoded as ASCII without loss of data".

> we might have to return something different from NS_OK to let callers
> distinguish this case

Sure.

> It doesn't sound right that we apply format-dependent heuristics in uconv

That would be very wrong, indeed.  I'm not suggesting that.  That is, we should
not look for '@', '<', or anything like that.  Just examine the pattern of null
bytes in the first four bytes of the data, I would say. That's enough to get
this right in 99% of cases, probably.  The rest are cases with a useless
encoding listed, without a BOM, _and_ with ambiguous data.  People doing that
deserve whatever they get.  ;)

> On the other hand, it can be argued that callers can have a finer control that
> way than otherwise and can satisfy format-specific requirements (if any)

They can already do that.  The HTML parser does....  We're just adding a way for
them _not_ to have to do that if they don't want to.  What the HTML parser is
currently doing would still be possible if someone really wishes.

> Wait.... I'm wondering what we do with 'Save Page As..'

Depends on how you save.  If it's saving as "original HTML" we refetch from the
cache.  But "web page, complete" serializes the DOM, into the document encoding.
 That's where problems would come in.  Same issue in composer and plaintext
copy/paste and anything else that actually uses the various DOM serializers we have.

I found the UTF-16 encoder broken, so I fixed it. It uses the native endian
with a BOM.  I added bz's heuristic #2 (but it inspects only the first two bytes;
checking two more bytes doesn't seem to buy us much once we decide to take that
road). As for #3, we can add it, but do we really have to? 

I got rid of obsolete code. BTW, I'd love to rename nsUCS2BEtoUnicode.* and
nsUnicodeToUCS2BE.* to nsUTF16toUnicode.* and nsUnicodeToUTF16.* if somebody
can 'operate' on the cvs repository to preserve the revision history.
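(For illustration, "native endian with BOM" on the encoding side boils down to something like the sketch below; the real change goes through the nsUnicodeToUTF16* CopyData/Convert machinery, which is elided here, and the helper name is invented.)

#include <string.h>

// Sketch: emit UTF-16 in the machine's native byte order, preceded by a BOM.
// aDest must have room for (aLen + 1) * sizeof(PRUnichar) bytes.
static void
EncodeUTF16NativeWithBOM(const PRUnichar * aSrc, PRInt32 aLen, char * aDest)
{
  const PRUnichar bom = 0xFEFF;                  // stored in native byte order
  memcpy(aDest, &bom, sizeof(bom));
  memcpy(aDest + sizeof(bom), aSrc, aLen * sizeof(PRUnichar));
}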
Attachment #141964 - Attachment is obsolete: true
(In reply to comment #16)
> > See bug 68738 comment #20.
> 
> Note that doing that involves giving all consumers special 
> knowledge that UTF-16 and UTF-32 are the encodings that have an 
> endianness... 

 Yes.

> What if we add support for another encoding with an endianness?

Gee, what's the chance of somebody conjuring up yet another endian-dependent
UTF? UTF-64, UTF-128? That might well be called GTF-64 or GTF-128, for 'Galactic
code Transformation Format' :-)


> > we might have to return something different from NS_OK to let callers to
> > distinguish this case
> 
> Sure.

I added a new return code, NS_OK_UDEC_NOBOMFOUND. So as not to hide/shadow return
codes like NS_OK_UDEC_MOREINPUT, it's only returned where we'd otherwise return NS_OK.
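(A hypothetical caller, to show what the new code is for: NS_OK_UDEC_NOBOMFOUND still counts as success, it just tells the caller that the endianness was assumed rather than read from a BOM. This is not code from the patch.)

// Hypothetical consumer sketch; the updated length values are discarded
// here for brevity.
static nsresult
DecodeChunk(nsIUnicodeDecoder * aDecoder,
            const char * aSrc, PRInt32 aSrcLen,
            PRUnichar * aDest, PRInt32 aDestLen,
            PRBool * aSawBOM)
{
  nsresult rv = aDecoder->Convert(aSrc, &aSrcLen, aDest, &aDestLen);
  // NS_OK_UDEC_NOBOMFOUND is a success code, so NS_SUCCEEDED(rv) still holds;
  // *aSawBOM is only meaningful when the conversion succeeded.
  *aSawBOM = (rv != NS_OK_UDEC_NOBOMFOUND);
  return rv;
}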
 
> > It doesn't sound right that we apply format-dependent heuristics in uconv
> 
> That would be very wrong, indeed.  I'm not suggesting that. 

OK. I'm sold.


> What the HTML parser is
> currently doing would still be possible if someone really wishes.

  I think we have to keep that code (and your code in nsCSSLoader.cpp) even with
this patch. 


> > Wait.... I'm wondering what we do with 'Save Page As..'
> 
> Depends on how you save. .... "web page, complete" serializes the 
> DOM, into the document encoding.
>  That's where problems would come in.  

  Thanks for the explanation.
> As for #3, we can add it, but do we really have to? 

Well, what does your code do right now?  Just return an error and completely
fail to decode?  That may be fine too...

> I think we have to keep that code (and your code in nsCSSLoader.cpp) even with
> this patch. 

The CSSLoader code probably needs to stay (since it's solving a different
problem).  The parser code I don't know enough about to comment.
(In reply to comment #19)
> > As for #3, we can add it, but do we really have to? 
> 
> Well, what does your code do right now?  Just return an error and completely
> fail to decode?  That may be fine too...

  Yes, it does, but callers don't like it :-)

> > I think we have to keep that code (and your code in nsCSSLoader.cpp) even wi
 
> The CSSLoader code probably needs to stay (since it's solving a different
> problem).  The parser code I don't know enough about to comment.

The same is true of the parser code. The 'BOM' detection is one of 7 or 8
prioritized encoding 'detection' methods. It's useful when we just get
'text/html' from http.  You can try my test cases mentioned in comment #8.
Attached patch uconv patch with a bit more code reduction (obsolete)

Basically the same as attachment 142043 [details] [diff] [review]. In addition to this patch, I'll 'cvs
remove' nsUCS2LEToUnicode.* and nsUnicodeToUCS2LE.*.
Attachment #142043 - Attachment is obsolete: true
Comment on attachment 142235 [details] [diff] [review]
uconv patch  with a bit more code reduction

asking for r/sr (smontagu is on the road..) 
bz, if you can review, feel free to do so.
Attachment #142235 - Flags: superreview?(jst)
Attachment #142235 - Flags: review?(dbaron)
Comment on attachment 142235 [details] [diff] [review]
uconv patch  with a bit more code reduction

bz has been following this bug and apparently  can review now.
Attachment #142235 - Flags: superreview?(jst) → superreview?(bzbarsky)
I can, but it'll take me a bit (maybe up to a week).
I won't get to this for at least another week; more likely not until three weeks
from now, at best.
Comment on attachment 142235 [details] [diff] [review]
uconv patch  with a bit more code reduction

This bug has become an i18n bug (the patch adds a UTF-16 encoder/decoder). I'm
changing the r/sr request accordingly, hoping smontagu can catch up on his bug
mail on the road.
Attachment #142235 - Flags: superreview?(bzbarsky)
Attachment #142235 - Flags: superreview?(blizzard)
Attachment #142235 - Flags: review?(smontagu)
Attachment #142235 - Flags: review?(dbaron)
Comment on attachment 142235 [details] [diff] [review]
uconv patch  with a bit more code reduction

Could you document some of the random pointer math you're using?
Attachment #142235 - Flags: superreview?(blizzard) → superreview-
Comment on attachment 142235 [details] [diff] [review]
uconv patch  with a bit more code reduction


I hardly added any new pointer-trickery. Did you mean this? I wish I understood
this so that I could add comments. It's ftang's code and it somehow works. 
Simon, can you decipher it?

>+static nsresult
>+UTF16ConvertToUnicode(PRUint8& aState, PRUint8& aData, const char * aSrc,
>+                      PRInt32 * aSrcLength, PRUnichar * aDest,
>+                      PRInt32 * aDestLength)

>-  if((1 == mState) && (src < srcEnd))
>+  if((1 == aState) && (src < srcEnd))
>   {
>     if(dest >= destEnd)
>       goto error;
>     if(src>=srcEnd)
>       goto done;
>     char tmpbuf[2];
>     PRUnichar * up = (PRUnichar*) &tmpbuf[0];
>-    tmpbuf[0]= mData;
>+    tmpbuf[0]= aData;
>     tmpbuf[1]= *src++;
>     *dest++ = *up;
>   }
>   
>   copybytes = (destEnd-dest)*2;
>   if(copybytes > (0xFFFE & (srcEnd-src)))
>       copybytes = 0xFFFE & (srcEnd-src);
>   memcpy(dest,src,copybytes);
>   src +=copybytes;
>   dest +=(copybytes/2);
Comment on attachment 142235 [details] [diff] [review]
uconv patch  with a bit more code reduction

I made a patch with more verbose comments, but it's on a machine that's not
reachable at the moment. I'll upload it later.
Attachment #142235 - Flags: review?(smontagu)
Attachment #142235 - Attachment is obsolete: true
This is safe enough to get in for 1.7final.
Target Milestone: --- → mozilla1.7final
Comment on attachment 144438 [details] [diff] [review]
uconv patch (with more detailed comments)

asking for r/sr
Attachment #144438 - Flags: superreview?(blizzard)
Attachment #144438 - Flags: review?(smontagu)
Nominating as a blocker for 1.7 now that it's been decided that 1.7 is gonna be a
long-lived branch.
smontagu and blizzard, can you review?
Flags: blocking1.7?
Comment on attachment 144438 [details] [diff] [review]
uconv patch (with more detailed comments)

>Index: intl/uconv/ucvlatin/Makefile.in
> 		nsUCS2BEToUnicode.cpp \

I assume you left it as this filename to preserve the CVS history or something?

>-		nsUCS2LEToUnicode.cpp \

Does that need CVS removing?

>Index: intl/uconv/ucvlatin/nsUCS2BEToUnicode.cpp

>+UTF16ConvertToUnicode(PRUint8& aState, PRUint8& aData, const char * aSrc,

>+    // Eleminate BOM (0xFEFF). Note that different endian case is taken

"Eliminate"

>+SwapBytes(PRUnichar *aDest, PRInt32 aLen)

>+     char tmp = *(p + 1);
>+     *(p + 1) = *p;
>+     *p = tmp;

Could you use something like
http://lxr.mozilla.org/seamonkey/source/xpcom/io/nsIStreamBufferAccess.idl#117
here?  Would that be faster/slower/same-ish?
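(For comparison, the usual arithmetic version of a 16-bit swap looks like the sketch below; whether the helper linked above is implemented this way, and whether it is actually faster here, is left open.)

// Shift-and-mask byte swap of 16-bit units, as an alternative to the
// char-temp version quoted above. Sketch only.
static void
SwapBytesSketch(PRUnichar * aDest, PRInt32 aLen)
{
  for (PRInt32 i = 0; i < aLen; ++i) {
    PRUnichar u = aDest[i];
    aDest[i] = PRUnichar((u << 8) | (u >> 8));   // exchange high and low bytes
  }
}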

>+nsUTF16BEToUnicode::Convert(const char * aSrc, PRInt32 * aSrcLength,

>+    // Process nsUTF16DiffEndianToUnicode. The 'same endian' case with the
>+    // leading BOM will be taken care of by |UTF16ConvertToUnicode|.

The first sentence of that comment is rather meaningless... Just say "remove
the BOM if we're little-endian" or something?

>       if(0xFFFE == *((PRUnichar*)aSrc)) {

Hrm... I guess we're assuming that we're passed a source length of at least
two, so we're not looking at bogus data here?  I guess we made the same
assumption in UTF16ConvertToUnicode().  Add asserts to that effect, and maybe
file a followup bug on eliminating that assumption?  

>+        // eleminate BOM (on LE machines, BE BOM is 0xFFFE)

"eliminate"

>+nsUTF16LEToUnicode::Convert(const char * aSrc, PRInt32 * aSrcLength,

Same comments.

>+nsUTF16ToUnicode::Convert(const char * aSrc, PRInt32 * aSrcLength,

Again, this assumes aSrcLength is at least two... assert and deal in the
followup?

>+    return (rv == NS_OK) ? (mFoundBOM ? rv : NS_OK_UDEC_NOBOMFOUND) : rv; 

  return (rv == NS_OK && !mFoundBOM) ? NS_OK_UDEC_NOBOMFOUND : rv;

maybe?

>Index: intl/uconv/ucvlatin/nsUCS2BEToUnicode.h

>+class nsUTF16ToUnicodeBase : public nsBasicDecoderSupport

> protected:
>   NS_IMETHOD GetMaxLength(const char * aSrc, PRInt32 aSrcLength, 
>       PRInt32 * aDestLength);

Should that really be protected?  I'd think it would be public, as would
Reset()....

>+class nsUTF16ToUnicode : public nsUTF16ToUnicodeBase

>+  nsUTF16ToUnicode() { Reset(); }

Do you need that?  I'd think the Reset() call in the constructor of the
superclass would call our Reset(), since it's a virtual method...

>Index: intl/uconv/ucvlatin/nsUnicodeToUCS2BE.cpp
> NS_IMETHODIMP nsUnicodeToUTF16BE::CopyData(char* aDest, const PRUnichar*
>+#ifdef IS_BIG_ENDIAN
>     //UnicodeToUTF16SameEndian
>     ::memcpy(aDest, (void*) aSrc, aLen * 2);

Weird indent (outdent by 2?)

>+    SwapBytes(aDest, aSrc, aLen);

Again.

> NS_IMETHODIMP nsUnicodeToUTF16LE::CopyData(char* aDest, const PRUnichar* 

And in this function too.

>Index: intl/uconv/ucvlatin/nsUnicodeToUCS2BE.h

>+// XXX In theory, we have to check the endianness at run-time because some
>+// modern RISC processors can be run at both LE and BE. 

I suspect we have all sorts of code that would fail to deal with that....

One more thought.  The pattern

#ifdef IS_BIG_ENDIAN
#else
#endif

should probably become

#ifdef IS_BIG_ENDIAN
#elif defined IS_LITTLE_ENDIAN
#else
  #error "Unknown endianness"
#endif

or something and similarly for starting with |#ifdef IS_LITTLE_ENDIAN|.

sr=bzbarsky with those nits.  I'd be happy to have that count as r+sr, but this
is all intl code, so I'm not sure that will fly...  If you need to convert this
sr= to an r=, feel free to do so.

And thanks for doing this!
Attachment #144438 - Flags: superreview?(blizzard) → superreview+
Comment on attachment 144438 [details] [diff] [review]
uconv patch (with more detailed comments)

Thanks for sr.

I'm just turning bz's sr to r+sr and asking for a1.7
Attachment #144438 - Flags: review?(smontagu)
Attachment #144438 - Flags: review+
Attachment #144438 - Flags: approval1.7?
Comment on attachment 144438 [details] [diff] [review]
uconv patch (with more detailed comments)

r=smontagu as well. Sorry for the delay: I still have a huge backlog of
bugmail.

>-  if(copybytes > (0xFFFE & (srcEnd-src)))
>-      copybytes = 0xFFFE & (srcEnd-src);
>+  // if |srcEnd-src| is odd, we copy one fewer bytes.
>+  if(copybytes > (~1 & (srcEnd - src)))
>+      copybytes = ~1 & (srcEnd - src);

Wow. Using 0xFFFE in the context of UTF-16, BOMs, etc. to mean something
totally different deserves some kind of prize for obfuscation. Kudos to you for
working out what the hell was going on.

I think this section could be made a lot clearer and simpler, but that can wait
for a later iteration.
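(Spelling that out once more as a sketch, for anyone else reading the diff: the mask just rounds the source byte count down to an even number so only whole 16-bit units are copied, leaving any trailing odd byte for the next call; the old 0xFFFE mask happened to cap the copy below 64K as well. This is a restatement, not the patch, and the types follow the surrounding NSPR code.)

#include <string.h>

// Restatement of the copy step in UTF16ConvertToUnicode, not the patch itself.
static void
CopyWholeUTF16Units(const char *& src, const char * srcEnd,
                    PRUnichar *& dest, PRUnichar * destEnd)
{
  PRInt32 copybytes = PRInt32(destEnd - dest) * 2;    // output room, in bytes
  PRInt32 srcbytes  = PRInt32((srcEnd - src) & ~1);   // even number of input bytes
  if (copybytes > srcbytes)
    copybytes = srcbytes;                             // copy whole units only
  memcpy(dest, src, copybytes);
  src  += copybytes;                                  // a leftover odd byte waits
  dest += copybytes / 2;                              // for the next Convert() call
}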
Got into the trunk (1.8alpha). Marking as fixed.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Comment on attachment 144438 [details] [diff] [review]
uconv patch (with more detailed comments)

a=asa (on behalf of drivers) for checkin to 1.7
Attachment #144438 - Flags: approval1.7? → approval1.7+
patch checked into 1.7 branch
Keywords: fixed1.7
Flags: blocking1.7?
verified on 1.7
Status: RESOLVED → VERIFIED
Keywords: fixed1.7 → verified1.7