Closed Bug 235090 Opened 22 years ago Closed 22 years ago

CSS stylesheets labeled as "UTF-16" don't load

Categories

(Core :: CSS Parsing and Computation, defect)

Product:

Component:

Type:

defect

Priority:

Not set

Severity:

normal

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla1.7final

People

(Reporter: dbaron, Assigned: jshin1987)

References

(
URL
)

Details

(Keywords: intl, verified1.7)

Attachments

(2 files, 4 obsolete files)

patch 22 years ago Jungshik Shin 4.76 KB, patch		Details \| Diff \| Splinter Review
alternative (uconv patch) 22 years ago Jungshik Shin 26.74 KB, patch		Details \| Diff \| Splinter Review
uconv patch 22 years ago Jungshik Shin 23.68 KB, patch		Details \| Diff \| Splinter Review
uconv patch (with utf-16 en/decoder) 22 years ago Jungshik Shin 34.13 KB, patch		Details \| Diff \| Splinter Review
uconv patch with a bit more code reduction 22 years ago Jungshik Shin 34.97 KB, patch	blizzard : superreview-	Details \| Diff \| Splinter Review
uconv patch (with more detailed comments) 22 years ago Jungshik Shin 36.87 KB, patch	jshin1987 : review+ bzbarsky : superreview+ asa : approval1.7+	Details \| Diff \| Splinter Review

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Reporter

Description

•

22 years ago

http://www.w3.org/Style/Examples/010/utf-16-correct.html should be green. The CSS parser doesn't support "UTF-16"/"utf-16" encoding. This is mainly the result of bug 68738. This needs to be fixed in the uconv code rather than the HTML parser, which is not its only consumer.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Reporter

Comment 1

•

22 years ago

(And while we're here it might make sense to add the name ucs2 to charsetalias.properties if that's not deprecated.)

Assignee

Comment 2

•

22 years ago

(In reply to comment #0) > http://www.w3.org/Style/Examples/010/utf-16-correct.html should be green. > > The CSS parser doesn't support "UTF-16"/"utf-16" encoding. This is mainly the > result of bug 68738. Did you mean the result of the patch for bug 68738 or the result of bug 68738 having been fixed only partially? The encoding of the stylesheet at http://www.w3.org/Style/Examples/010/utf-16-correct.css is correctly detected as UTF-16LE when it's loaded into a browser window by itself. Nonetheless, Mozilla fails the test so that there's something to fix. I guess our cssloader doesn't recognize it correctly. > add the name ucs2 to charsetalias.properties Originally, I guess the file was meant to include aliases registered with the IANA. We made an exception last year with 'iso8859_1' (or something like that) because we found some broken sites using JSP/servlet to emit that name. Do you have sites using 'ucs2'? > This needs to be fixed in the uconv code rather than the > HTML parser, which is not its only consumer. I guess you meant we need to add a BOM-detection method to uconv and make it available to our CSS loader and (JS) script loader as well as our html/xml parser (bug 68738 comment #20 and a few comments before that). Or, do you think we need a separate UTF-16 decoder that looks for a BOM to determine the endianness? The latter wouldn't help Mozilla pass the test given here although it could be useful for other cases.

Keywords: intl

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Reporter

Comment 3

•

22 years ago

@charset "UTF-16"; should be valid, but it causes the CSS loader to request (via a converter in netwerk/base/src/) a unicode decoder for it. I think we should have a separate decoder for "UTF-16" that looks for a BOM -- then we could discard the hack in the HTML parser that drops any explicit "UTF-16". And I think it would fix this case -- the @charset rule causes an explicit "UTF-16" decoder to be requested.

Assignee

Comment 4

•

22 years ago

(In reply to comment #3) > @charset "UTF-16"; should be valid, but it causes the CSS loader to request (via > a converter in netwerk/base/src/) a unicode decoder for it. > I think we should have a separate decoder for "UTF-16" that looks for a BOM To recognize '@charset "UTF-16"', we need to know that the file is in UTF-16LE. In your example, the http header comes with just 'text/css'. 'UTF-16' decoder with a built-in BOM detector is useful only when we know in advance that it's UTF-16 but don't know whether it's LE or BE. If we know that the file is in UTF-16LE, we just have to call our UTF-16LE decoder directly. > -- then we could discard the hack in the HTML parser that drops any > explicit "UTF-16". You meant http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&root=/cvsroot&subdir=mozilla/htmlparser/src&command=DIFF_FRAMESET&file=nsParser.cpp&rev2=3.271&rev1=3.270 (that was bug 98214). However, that's valid because by the time we reach there, we already know it's one of byte-oriented encodings so that 'UTF-16' in meta tag can't be right.

Boris Zbarsky [:bzbarsky]

Comment 5

•

22 years ago

So... We look at the BOM in the CSSLoader. We use this to "parse" the @charset rule. If this says UTF-16, we just ask uconv for a decoder for that. I _could_ hack the CSSLoader to just go with what the BOM said if the @charset was UTF-16. But the point is that this should not be needed... I think what David is suggesting is the following: 1) Have a decoder registered for "UTF-16" 2) Have this decoder look at the first char, and if it's a BOM instantiate the appropriate UTF-16[BL]E decoder. Then proxy all the rest of the decoder stuff to it. If there is no BOM, assume UTF-16BE or something. This shouldn't be very much code, really... I could put this logic in the CSSLoader, like I said, but the point is that this sort of thing should really be opaque to uconv consumers. Consider the case of data coming in with a "Content-Type: text/css; charset=UTF-16" -- in that case the CSSLoader really shouldn't even look at the data; it should just ask uconv to deal and uconv should deal....

Assignee

Comment 6

•

22 years ago

Attached patch patch — Details — Splinter Review

I completely forgot that bz had added BOM detection to nsCSSLoader.cpp in bug 218915. With this patch, 'LE' or 'BE' is appended to 'UTF-16' or 'UTF-32' if the endian is known from bz's code.

Assignee: dbaron → jshin

Status: NEW → ASSIGNED

Assignee

Comment 7

•

22 years ago

Sorry I hadn't read the bug mail from bz when I uploaded my patch. I've just read it. I'm not sure whether adding UTF-16 and UTF-32 decoders that looks for BOM although it seems to be cleaner that way.

Assignee

Comment 8

•

22 years ago

Attached patch alternative (uconv patch) (obsolete) — Details — Splinter Review

This works for the case given in the URL field of this bug, but it breaks some other cases at http://jshin.dyndns.org/i18n/utftest/index.html Without the patch, all cases at the page worked, but with the patch, UTF-16LE (with BOM) labelled as UTF-16 (the 9th case from the top) doesn't get rendered correctly. Actually, this should be all right even with this patch .... Moreover, UTF-16BE (with BOM) tagged as UTF-16 in the http c-t header works fine.... There's something funny going on here. The asymmetry between BE and LE is something I have to look into. (I don't claim to understand |UTF16ConvertToUnicode| in nsUCS2BEtoUnicode.cpp) What this patch really breaks is a UTF-16LE page without BOM but tagged as UTF-16 in the http header (the 4th case from the bottom). To my surprise, a UTF-16BE page without BOM but tagged as UTF-16 in the http header (the 10th from the bottom) is rendered correctly. In some cases, we have to 'override http header' and rely on nsParser for 'UTF-16' (and 'UTF-32'), which defeats the purpose of adding 'UTF-16' decoder in a sense.

Assignee

Comment 9

•

22 years ago

Comment on attachment 141962 [details] [diff] [review] alternative (uconv patch) >@@ -1041,14 +1041,19 @@ static const nsModuleComponentInfo compo >+ }, >+ { >+ DECODER_NAME_BASE "UTF-16" , NS_UTF16TOUNICODE_CID, >+ NS_UNICODEDECODER_CONTRACTID_BASE "UTF-16", >+ nsUTF16BEToUnicodeConstructor , Please, ignore the patch. The typo in the above (UTF16BEToUnicode instead of UTF16ToUnicode) gave me an illusion that it worked (at least partially).

Attachment #141962 - Attachment is obsolete: true

Assignee

Comment 10

•

22 years ago

Attached patch uconv patch (obsolete) — Details — Splinter Review

With this, documents with BOM and tagged as UTF-16 get rendered correctly. However, documents without BOM and tagged as UTF-16 (the endian of which used to be detected correctly) put Mozilla to sorta infinite loop. For them, nsUTF16ToUnicode::Convert() fails, but that failure doens't seem to be handled properly by callers. It appears that we need some heuristics in parser and elsewhere even if we add 'UTF-16' decoder.

Boris Zbarsky [:bzbarsky]

Comment 11

•

22 years ago

Oh, fun. In order: (In reply to comment #7) > I'm not sure whether adding UTF-16 and UTF-32 decoders that looks for > BOM although it seems to be cleaner that way. It's absolutely necessary to handle a charset of "UTF-16" in an HTTP header, no? In fact, how does the parser manage to do it right now? (In reply to comment #8) > In some cases, we have to 'override http header' and rely on nsParser for > 'UTF-16' (and 'UTF-32'), which defeats the purpose of adding 'UTF-16' decoder > in a sense. Um... What are those cases? We should not be doing that short of an explicit user override via the view menu, no? (In reply to comment #10) > However, documents without BOM and tagged as UTF-16 (the endian of which used > to be detected correctly) put Mozilla to sorta infinite loop. For them, > nsUTF16ToUnicode::Convert() fails, but that failure doens't seem to be handled > properly by callers. Fix the callers? ;) > It appears that we need some heuristics in parser and elsewhere even if we add > 'UTF-16' decoder. Why can't we move those heuristics into the 'UTF-16' decoder? Something like: 1) If BOM, use that 2) If no BOM, check whether first or second byte is non-null and go with that 3) If both are null or both are non-null, assume UTF-16BE (network endianness) or something. Really, if we hit this case, the document is all screwed up anyway. The goal here is to have as little knowledge about this stuff outside uconv as humanly possible.

Assignee

Comment 12

•

22 years ago

(In reply to comment #11) > (In reply to comment #7) > > I'm not sure whether adding UTF-16 and UTF-32 decoders that looks for > > BOM although it seems to be cleaner that way. > > It's absolutely necessary to handle a charset of "UTF-16" in an HTTP header, no? > In fact, how does the parser manage to do it right now? Apparently, 'UTF-16' coming from http header is ignored because we don't have a decoder for 'UTF-16'. So, we end up relying on the parser. See bug 68738 and the reference to XML there. Before bug 68738, we used to alias UTF-16 to UTF-16BE so that documents tagged as UTF-16 (even if they're UTF-16LE) were always interpreted as UTF-16BE. > (In reply to comment #8) > > In some cases, we have to 'override http header' and rely on nsParser for > > 'UTF-16' (and 'UTF-32'), which defeats the purpose of adding 'UTF-16' decoder > > in a sense. > > Um... What are those cases? We should not be doing that short of an explicit > user override via the view menu, no? By overriding, I meant not the full scale override but just determining the endian (big or little) as my first patch did for '@charset.....' Those cases are what I wrote about below : documents without BOM but tagged as UTF-16. Of course, this 'guessing' based on the first several bytes doesn't work for plain text. > (In reply to comment #10) > > However, documents without BOM and tagged as UTF-16 (the endian of which used > > to be detected correctly) put Mozilla to sorta infinite loop. For them, > > nsUTF16ToUnicode::Convert() fails, but that failure doens't seem to be handled > > properly by callers. > > Fix the callers? ;) I didn't even bother to check what the callers are. Anyway, even if the callers are fixed, the document in question still wouldn't be rendered. > > It appears that we need some heuristics in parser and elsewhere even if we add > > 'UTF-16' decoder. > > Why can't we move those heuristics into the 'UTF-16' decoder? Something like: > > 1) If BOM, use that > 2) If no BOM, check whether first or second byte is non-null and go with that > 3) If both are null or both are non-null, assume UTF-16BE (network endianness) > or something. Really, if we hit this case, the document is all screwed up > anyway. 3) doesn't work because the majority of such a case (both byte non-null) is likely to be UTF-16LE no matter BigEndian is the network byte order. If we ever do this, it'd be better to go with the native endian. 2) doesn't work well, either. Think about a plain text document beginning with characters like U+hh00 (where 'hh' is not '00'). Besides, I don't think #2 and #3 are what encoding converters are supposed to do. Encoding converters are not just for html, css, xml but also for plain text. On top of that, BOM is a normative part of the definition of UTF-16 and UTF-32 so that we can/should use it to detect the endian. Other document-format-specific heuristics should be done elsewhere. > The goal here is to have as little knowledge about this stuff outside uconv as > humanly possible. However, some stuffs rightly belong elsewhere.

Assignee

Comment 13

•

22 years ago

> If we ever do this, it'd be better to go with the native endian. From the standard compliance point of view, I guess (I have to check, though) that the network byte order is the 'right' thing.

Boris Zbarsky [:bzbarsky]

Comment 14

•

22 years ago

(In reply to comment #12) > Apparently, 'UTF-16' coming from http header is ignored because we don't > have a decoder for 'UTF-16'. That's a bug, no? > So, we end up relying on the parser. Which breaks when we're not parsing HTML/XML, as here.... I assume the scriptloader currently has the same exact issues as the CSSLoader here if someone bothers to test. Which is why we just want to fix this inside uconv instead of making all consumers write code to maybe handle this. > Before bug 68738, we used to alias UTF-16 to UTF-16BE Yes, I know. That was totally wacky. ;) > By overriding, I meant not the full scale override but just determining the > endian ...[snipped]... Of course, this 'guessing' based on the first several > bytes doesn't work for plain text. Sure it does. It works wonderfully. Just look for null bytes or not... that will work for ascii-compatible plaintext. For other plaintext, there is no way to solve the problem anyway. The point is that since both CSS and HTML have ascii-compatible syntax, it would work for them. Same for JS, more or less. > I didn't even bother to check what the callers are. Anyway, even if the > callers are fixed, the document in question still wouldn't be rendered. Unless we detect the endianness in uconv. > > 3) If both are null or both are non-null > 3) doesn't work because the majority of such a case (both byte non-null) is > likely to be UTF-16LE no matter BigEndian is the network byte order. Fine. Like I said, I have no strong preference for what to do here. Native endian is ok. > 2) doesn't work well, either. Think about a plain text document beginning with > characters like U+hh00 (where 'hh' is not '00'). This is already broken in today's world, no? > Besides, I don't think #2 and #3 are what encoding converters are supposed to > do. Encoding converters are supposed to take an encoding and convert. In this case the encoding being given is "UTF-16". I think the fact that there is in fact no such encoding is a uconv-internal issue that consumers are largely ignorant about and should be able to be ignorant about. > Encoding converters are not just for html, css, xml but also for plain text And your suggestion is? Note that plaintext is currently broken anyway and we see no way to fix it. Why not fix the cases that we can? Frankly, I suspect that BOM-less stuff marked UTF-16 will be reasonably rare... and most of it will be HTML, which we _will_ detect correctly. > On top of that, BOM is a normative part of the definition of UTF-16 and UTF-32 > so that we can/should use it to detect the endian. Other > document-format-specific heuristics should be done elsewhere. I can deal with just having the BOM stuff in uconv and some hacks in the HTML parser, I suppose, if this is very important. But the BOM stuff _really_ needs to be in uconv. And we need to fix the current bug whereby the HTML parser ignores the HTTP header charset.... > However, some stuffs rightly belong elsewhere. I'm not sure I see the reasoning there, as I said. The point of uconv is to be useful, no? Right now it's creating a _lot_ of extra work for all consumers like this.... There is one more angle to all this, btw. As things stand, the "UTF-16" document will get flagged as BE or LE by the parser and that's what it's resulting charset will be. With these changes, its charset will be "UTF-16", so we probably need an encoder too, not just a decoder (the encoder can just convert to native endianness, I guess).

Assignee

Comment 15

•

22 years ago

(In reply to comment #14) > (In reply to comment #12) > > Apparently, 'UTF-16' coming from http header is ignored because we don't > > have a decoder for 'UTF-16'. > > That's a bug, no? Yes, in a certain sense, but we can make that up down the road (in case of html/xml, in effect, we do that...) . Instead of 'ignoring', we can put it in an 'intermediate' state that needs to be turned to UTF-16LE or UTF-16BE ... > > So, we end up relying on the parser. > > Which breaks when we're not parsing HTML/XML, as here.... I assume the > scriptloader currently has the same exact issues as the CSSLoader here if > someone bothers to test. I guess so. There are at least three consumers that can be relieved of their own 'endian' detection, which is a strong point of the proposal here that 'UTF-16' be added to the list of encodings supported in uconv as you wrote. However, the common endian detection can be factorized into a single function accessible from all three consumers. See bug 68738 comment #20. (btw, I guess we can make the endian detection always on for UTF-16 and UTF-32 even though the universal detector can be on and off at users' will) > > this 'guessing' based on the first several bytes doesn't work for plain text. > Sure it does. It works wonderfully. Just look for null bytes or not... that > will work for ascii-compatible plaintext. For other plaintext, there is no way > to solve the problem anyway. The point is that since both CSS and HTML have > ascii-compatible syntax, it would work for them. Same for JS, more or less. By 'ascii-compatible', you meant that the first few 'characters' are within the ASCII portion of the Unicode, didn't you? That usage deviates from a common use of the term (UTF-8 and most legacy encodings are ascii-compatible, but UTF-16 is not). However, that's not important because I understand what you meant :-) Anyway, I'd not say 'wonderfully' because there are cases for which this doesn't work as we discussed below. BTW, by 'plain text', I meant 'text/plain' (not text/*) ;-) > > I didn't even bother to check what the callers are. Anyway, even if the > > callers are fixed, the document in question still wouldn't be rendered. > > Unless we detect the endianness in uconv. Or, callers (or callers of callers) can fix that up ;-) > > > 3) If both are null or both are non-null > > > 3) doesn't work because the majority of such a case (both byte non-null) is > > likely to be UTF-16LE no matter BigEndian is the network byte order. > > Fine. Like I said, I have no strong preference for what to do here. Native > endian is ok. If we do that, we might have to return something different from NS_OK to let callers to distinguish this case from others (say, NS_OK_BOM_DETECTION_UNSURE) in case they want to. > > 2) ... Think about a plain text document beginning with > > characters like U+hh00 (where 'hh' is not '00'). > > This is already broken in today's world, no? ..... > And your suggestion is? Note that plaintext is currently broken anyway a Yes, it is. See the above reference to my comment in bug 68738. > > On top of that, BOM is a normative part of the definition of UTF-16 > > and UTF-32 so that we can/should use it to detect the endian. Other > > document-format-specific heuristics should be done elsewhere. > > I can deal with just having the BOM stuff in uconv and some hacks in the HTML > parser, I suppose, if this is very important. I don't know. It doesn't sound right that we apply format-dependent heuristics in uconv, but I can't say it's very important, either. > But the BOM stuff _really_ needs to be in uconv. And we need to fix the > current bug whereby the HTML parser ignores the HTTP header charset.... Yes, that appears to be a reasonable way to divvy things up. > > However, some stuffs rightly belong elsewhere. > > I'm not sure I see the reasoning there, as I said. The point of uconv is to be > useful, no? Right now it's creating a _lot_ of extra work for all consumers > like this.... That's why I suggested factoring them out in bug 68738 and here. I admit that it seems to be an unnecessary 'division of labors' based on a pedantic distinction between what constitutes UTF-16/32 (BOM) and what doesn't. On the other hand, it can be argued that callers can have a finer control that way than otherwise and can satisfy format-specific requirements (if any) > we probably need an encoder too, not just a decoder (the encoder can just > convert to native endianness, I guess). Actually, there is an encoder (I didn't know there is), but I think we can remove it. Not all decoders have to be accompanied by the corresponding encoders. Unless we want to let users export/save documents in UTF-16, we don't need an encoder (utf-16.notForOutgoing is added in my patch.) Wait.... I'm wondering what we do with 'Save Page As..' It seems like we get documents from the network again even though we have them afresh in the cache (because we can't reverse the decoding). ('view-source' does the same....) In that case, we don't need an encoder. Anyway, my position is not carved on a stone. Let's just get some other opinions. It's unfortunate that smontagu is on the road. He may drop by a net cafe and pitch in. BTW, I wasn't fond of UTF-16* as a character encoding for 'text/*' and now I'm now even less so.

Boris Zbarsky [:bzbarsky]

Comment 16

•

22 years ago

> See bug 68738 comment #20. Note that doing that involves giving all consumers special knowledge that UTF-16 and UTF-32 are the encodings that have an endianness... (and hence need this function called) what if we add support for another encoding with an endianness? Do we have to go and fix all consumers? That's why I'd prefer to just put it all in uconv.... > By 'ascii-compatible', you meant ...[snipped] Yep. I meant "could be reencoded as ASCII without loss of data". > we might have to return something different from NS_OK to let callers to > distinguish this case Sure. > It doesn't sound right that we apply format-dependent heuristics in uconv That would be very wrong, indeed. I'm not suggesting that. That is, we should not look for '@', '<', or anything like that. Just examine the pattern of null bytes in the first four bytes of the data, I would say. That's enough to get this right in 99% of cases, probably. The rest are cases with a useless encoding listed, without a BOM, _and_ with ambiguous data. People doing that deserve whatever they get. ;) > On the other hand, it can be argued that callers can have a finer control that > way than otherwise and can satisfy format-specific requirements (if any) They can already do that. The HTML parser does.... We're just adding a way for them _not_ to have to do that if they don't want to. What the HTML parser is currently doing would still be possible if someone really wishes. > Wait.... I'm wondering what we do with 'Save Page As..' Depends on how you save. If it's saving as "original HTML" we refetch from the cache. But "web page, complete" serializes the DOM, into the document encoding. That's where problems would come in. Same issue in composer and plaintext copy/paste and anything else that actually uses the various DOM serializers we have.

Assignee

Comment 17

•

22 years ago

Attached patch uconv patch (with utf-16 en/decoder) (obsolete) — Details — Splinter Review

I found the UTF-16 encoder broken so that I fixed it. It uses the native endian with BOM. I added bz's heuritics #2 (but it inspects only the first two bytes. Checking two more bytes doesn't seem to buy us much once we decide to take that road). As for #3, we can add it, but do we really have to? I got rid of obsolete codes. BTW, I'd love to rename nsUCS2BEtoUnicode.* and nsUnicodeToUCS2BE.* to nsUTF16toUnicode.* and nsUnicodeToUTF16.* if somebody can 'operate' on the cvs repository to preserve the revision history.

Attachment #141964 - Attachment is obsolete: true

Assignee

Comment 18

•

22 years ago

(In reply to comment #16) > > See bug 68738 comment #20. > > Note that doing that involves giving all consumers special > knowledge that UTF-16 and UTF-32 are the encodings that have an > endianness... Yes. > what if we add support for another encoding with an endianness? Gee, what's the chance of somebody conjuring up yet another endian-dependent UTF? UTF-64, UTF-128? That might well be called GTF-64 or GTF-128 for 'Galatic code Transformation Format' :-) > > we might have to return something different from NS_OK to let callers to > > distinguish this case > > Sure. I added a new return code NS_OK_UDEC_NOBOMFOUND. Not to hide/shadow return codes like NS_OK_UDEC_MOREINPUT, it's only returned when we'd NS_OK otherwise. > > It doesn't sound right that we apply format-dependent heuristics in uconv > > That would be very wrong, indeed. I'm not suggesting that. OK. I'm sold. > What the HTML parser is > currently doing would still be possible if someone really wishes. I think we have to keep that code (and your code in nsCSSLoader.cpp) even with this patch. > > Wait.... I'm wondering what we do with 'Save Page As..' > > Depends on how you save. .... "web page, complete" serializes the > DOM, into the document encoding. > That's where problems would come in. Thanks for the explanation.

Boris Zbarsky [:bzbarsky]

Comment 19

•

22 years ago

> As for #3, we can add it, but do we really have to? Well, what does your code do right now? Just return an error and completely fail to decode? That may be fine too... > I think we have to keep that code (and your code in nsCSSLoader.cpp) even with > this patch. The CSSLoader code probably needs to stay (since it's solving a different problem). The parser code I don't know enough about to comment.

Assignee

Comment 20

•

22 years ago

(In reply to comment #19) > > As for #3, we can add it, but do we really have to? > > Well, what does your code do right now? Just return an error and completely > fail to decode? That may be fine too... Yes, it does, but callers don't like it :-) > > I think we have to keep that code (and your code in nsCSSLoader.cpp) even wi > The CSSLoader code probably needs to stay (since it's solving a different > problem). The parser code I don't know enough about to comment. The same is true of the parser code. The 'BOM' detection is one of 7 or 8 priortized encoding 'detection' methods. It's useful when we just get 'text/html' from http. You can try my test cases mentioned in comment #8.

Assignee

Comment 21

•

22 years ago

Attached patch uconv patch with a bit more code reduction (obsolete) — Details — Splinter Review

basically the same as attachment 142043 [details] [diff] [review]. In addition to this patch, I'll 'cvs remove' nsUCS2LEToUnicode.* and nsUnicodeToUCS2LE.*.

Attachment #142043 - Attachment is obsolete: true

Assignee

Comment 22

•

22 years ago

Comment on attachment 142235 [details] [diff] [review] uconv patch with a bit more code reduction asking for r/sr (smontagu is on the road..) bz, if you can review, feel free to do so.

Attachment #142235 - Flags: superreview?(jst)

Attachment #142235 - Flags: review?(dbaron)

Assignee

Comment 23

•

22 years ago

Comment on attachment 142235 [details] [diff] [review] uconv patch with a bit more code reduction bz has been following this bug and apparently can review now.

Attachment #142235 - Flags: superreview?(jst) → superreview?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 24

•

22 years ago

I can, but it'll take me a bit (maybe up to a week).

Boris Zbarsky [:bzbarsky]

Comment 25

•

22 years ago

I won't get to this for at least another week; more likely until three weeks from now at best.

Assignee

Comment 26

•

22 years ago

Comment on attachment 142235 [details] [diff] [review] uconv patch with a bit more code reduction This bug has become an i18n bug (the patch adds UTF-16 en/decoder). I'm changing r/sr request accordingly hoping smontagu can catch up his bug mail on the road.

Attachment #142235 - Flags: superreview?(bzbarsky)

Attachment #142235 - Flags: superreview?(blizzard)

Attachment #142235 - Flags: review?(smontagu)

Attachment #142235 - Flags: review?(dbaron)

Christopher Blizzard (:blizzard)

Comment 27

•

22 years ago

Comment on attachment 142235 [details] [diff] [review] uconv patch with a bit more code reduction Could you document some of the random pointer math you're using?

Attachment #142235 - Flags: superreview?(blizzard) → superreview-

Assignee

Comment 28

•

22 years ago

Comment on attachment 142235 [details] [diff] [review] uconv patch with a bit more code reduction I hardly added any new pointer-trickery. Did you mean this? I wish I understood this so that I could add comments. It's ftang's code and it somehow works. Simon, can you decipher it? >+static nsresult >+UTF16ConvertToUnicode(PRUint8& aState, PRUint8& aData, const char * aSrc, >+ PRInt32 * aSrcLength, PRUnichar * aDest, >+ PRInt32 * aDestLength) >- if((1 == mState) && (src < srcEnd)) >+ if((1 == aState) && (src < srcEnd)) > { > if(dest >= destEnd) > goto error; > if(src>=srcEnd) > goto done; > char tmpbuf[2]; > PRUnichar * up = (PRUnichar*) &tmpbuf[0]; >- tmpbuf[0]= mData; >+ tmpbuf[0]= aData; > tmpbuf[1]= *src++; > *dest++ = *up; > } > > copybytes = (destEnd-dest)*2; > if(copybytes > (0xFFFE & (srcEnd-src))) > copybytes = 0xFFFE & (srcEnd-src); > memcpy(dest,src,copybytes); > src +=copybytes; > dest +=(copybytes/2);

Assignee

Comment 29

•

22 years ago

Comment on attachment 142235 [details] [diff] [review] uconv patch with a bit more code reduction I made a patch with more verbose comment, but it's not on a machine not reachable at the moment. I'll upload it later.

Attachment #142235 - Flags: review?(smontagu)

Assignee

Comment 30

•

22 years ago

Attached patch uconv patch (with more detailed comments) — Details — Splinter Review

Attachment #142235 - Attachment is obsolete: true

Assignee

Comment 31

•

22 years ago

This is safe enough to get in for 1.7final.

Target Milestone: --- → mozilla1.7final

Assignee

Comment 32

•

22 years ago

Comment on attachment 144438 [details] [diff] [review] uconv patch (with more detailed comments) asking for r/sr

Attachment #144438 - Flags: superreview?(blizzard)

Attachment #144438 - Flags: review?(smontagu)

Assignee

Comment 33

•

22 years ago

nominating as a blocker to 1.7 now that it's decided that 1.7 is gonna be a long lived branch. smontagu and blizzard, can you review?

Flags: blocking1.7?

Boris Zbarsky [:bzbarsky]

Comment 34

•

22 years ago

Comment on attachment 144438 [details] [diff] [review] uconv patch (with more detailed comments) >Index: intl/uconv/ucvlatin/Makefile.in > nsUCS2BEToUnicode.cpp \ I assume you left it as this filename to preserve the CVS history or something? >- nsUCS2LEToUnicode.cpp \ Does that need CVS removing? >Index: intl/uconv/ucvlatin/nsUCS2BEToUnicode.cpp >+UTF16ConvertToUnicode(PRUint8& aState, PRUint8& aData, const char * aSrc, >+ // Eleminate BOM (0xFEFF). Note that different endian case is taken "Eliminate" >+SwapBytes(PRUnichar *aDest, PRInt32 aLen) >+ char tmp = *(p + 1); >+ *(p + 1) = *p; >+ *p = tmp; Could you use something like http://lxr.mozilla.org/seamonkey/source/xpcom/io/nsIStreamBufferAccess.idl#117 here? Would that be faster/slower/same-ish? >+nsUTF16BEToUnicode::Convert(const char * aSrc, PRInt32 * aSrcLength, >+ // Process nsUTF16DiffEndianToUnicode. The 'same endian' case with the >+ // leading BOM will be taken care of by |UTF16ConvertToUnicode|. The first sentence of that comment is rather meaningless... Just say "remove the BOM if we're little-endian" or something? > if(0xFFFE == *((PRUnichar*)aSrc)) { Hrm... I guess we're assuming that we're passed a source length of at least two, so we're not looking at bogus data here? I guess we made the same assumption inUTF16ConvertToUnicode(). Add asserts to that effect, and maybe file a followup bug on eliminating that assumption? >+ // eleminate BOM (on LE machines, BE BOM is 0xFFFE) "eliminate" >+nsUTF16LEToUnicode::Convert(const char * aSrc, PRInt32 * aSrcLength, Same comments. >+nsUTF16ToUnicode::Convert(const char * aSrc, PRInt32 * aSrcLength, Again, this assumes aSrcLength is at least two... assert and deal in the followup? >+ return (rv == NS_OK) ? (mFoundBOM ? rv : NS_OK_UDEC_NOBOMFOUND) : rv; return (rv == NS_OK && !mFoundBOM) ? NS_OK_UDEC_NOBOMFOUND : rv; maybe? >Index: intl/uconv/ucvlatin/nsUCS2BEToUnicode.h >+class nsUTF16ToUnicodeBase : public nsBasicDecoderSupport > protected: > NS_IMETHOD GetMaxLength(const char * aSrc, PRInt32 aSrcLength, > PRInt32 * aDestLength); Should that really be protected? I'd think it would be public, as would Reset().... >+class nsUTF16ToUnicode : public nsUTF16ToUnicodeBase >+ nsUTF16ToUnicode() { Reset(); } Do you need that? I'd think the Reset() call in the constructor of the superclass would call our Reset(), since it's a virtual method... >Index: intl/uconv/ucvlatin/nsUnicodeToUCS2BE.cpp > NS_IMETHODIMP nsUnicodeToUTF16BE::CopyData(char* aDest, const PRUnichar* >+#ifdef IS_BIG_ENDIAN > //UnicodeToUTF16SameEndian > ::memcpy(aDest, (void*) aSrc, aLen * 2); Weird indent (outdent by 2?) >+ SwapBytes(aDest, aSrc, aLen); Again. > NS_IMETHODIMP nsUnicodeToUTF16LE::CopyData(char* aDest, const PRUnichar* And in this function too. >Index: intl/uconv/ucvlatin/nsUnicodeToUCS2BE.h >+// XXX In theory, we have to check the endianness at run-time because some >+// modern RISC processors can be run at both LE and BE. I suspect we have all sorts of code that would fail to deal with that.... One more thought. The pattern #ifdef IS_BIG_ENDIAN #else #endif should probably become #ifdef IS_BIG_ENDIAN #elif defined LITTLE_ENDIAN #else #error "Unknown endianness" #endif or something and similarly for starting with |#ifdef IS_LITTLE_ENDIAN|. sr=bzbarsky with those nits. I'd be happy to have that count as r+sr, but this is all intl code, so I'm not sure that will fly... If you need to convert this sr= to an r=, feel free to do so. And thanks for doing this!

Attachment #144438 - Flags: superreview?(blizzard) → superreview+

Assignee

Comment 35

•

22 years ago

Comment on attachment 144438 [details] [diff] [review] uconv patch (with more detailed comments) Thanks for sr. I'm just turning bz's sr to r+sr and asking for a1.7

Attachment #144438 - Flags: review?(smontagu)

Attachment #144438 - Flags: review+

Attachment #144438 - Flags: approval1.7?

Simon Montagu :smontagu

Comment 36

•

22 years ago

Comment on attachment 144438 [details] [diff] [review] uconv patch (with more detailed comments) r=smontagu as well. Sorry for the delay: I still have a huge backlog of bugmail. >- if(copybytes > (0xFFFE & (srcEnd-src))) >- copybytes = 0xFFFE & (srcEnd-src); >+ // if |srcEnd-src| is odd, we copy one fewer bytes. >+ if(copybytes > (~1 & (srcEnd - src))) >+ copybytes = ~1 & (srcEnd - src); Wow. Using 0xFFFE in the context of UTF-16, BOMs, etc. to mean something totally different deserves some kind of prize of obfuscation. Kudos to you for working out what the hell was going on. I think this section could be made a lot clearer and simpler, but that can wait for a later iteration.

Assignee

Comment 37

•

22 years ago

got into the trunk (1.8alpha). marking as fixed

Status: ASSIGNED → RESOLVED

Closed: 22 years ago

Resolution: --- → FIXED

Asa Dotzler [:asa]

Comment 38

•

22 years ago

Comment on attachment 144438 [details] [diff] [review] uconv patch (with more detailed comments) a=asa (on behalf of drivers) for checkin to 1.7

Attachment #144438 - Flags: approval1.7? → approval1.7+

Assignee

Comment 39

•

22 years ago

patch checked into 1.7 branch

Keywords: fixed1.7

Asa Dotzler [:asa]

Updated

•

22 years ago

Flags: blocking1.7?

Tracy Walker [:tracy]

Comment 40

•

21 years ago

verified on 1.7

Status: RESOLVED → VERIFIED

Keywords: fixed1.7 → verified1.7

You need to log in before you can comment on or make changes to this bug.