123389 - wrong format data for text/html type in cut-and-paste [X clipboard]

Reporter

Description

•

24 years ago

From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:0.9.7) Gecko/20020131 BuildID: 2001122204 In the composer window, when text is copied or cut and paste it to the self, "text/html" type is choosen for clipboard selection. Here, the text data in a X selection event is encoded in plain UCS2 string(always in a big-endian) and the format field of the event sets to 8. This shouldn't be working if the owner and the requestor are in different endian machines to each other. At least, the format should be 16, and let X take care of byte-swapping. Moreover, the plain UCS2 should not be a right data for text/html type, but it should be html code beginning with "<HTML>". According to RFC2854 (which defines text/html) the encoding must be iso8859-1 in absence of explicit encoding specifiers. Both selection getter(when the composer delivers) and selection requester(when it receives) should be modified to properly convert the cut-n-paste text to/from its html code, with the charset information explicitly specified. Reproducible: Always Steps to Reproduce: 1.run "mozilla -compose" with a debugger on Linux or Solaris which has widget/src/gtk/nsClipboard.cpp debuggable built. 2. In the debugger, monitor cliboardData in either nsClipboard:SelectionGetCB or nsClipboard:SelectionReceiver when its atom type is of "text/html". 3. Try to cut-and-paste any text in the composer window, and find out that the data in the clipboard is encoded in plain UCS2 with big-endian and the format is 8. Actual Results: The cut-and-paste text in the composer uses text/html data type but is encoded in plain UCS-2. Expected Results: It should be html code with charset information explictly given.

Kathleen :Brade

Comment 1

•

24 years ago

-->pavlov for X clipboard issue

Assignee: syd → pavlov

Keywords: intl

Summary: wrong format data for text/html type in cut-and-paste → wrong format data for text/html type in cut-and-paste [X clipboard]

Stuart Parmenter

Updated

•

24 years ago

Target Milestone: --- → Future

Louie Zhao

Assignee

Comment 2

•

23 years ago

Attached patch first working patch (obsolete) — Details — Splinter Review

This is a patch for this bug. According to RFC 2854, the "text/html" data should have "charset" info in it to indicate what kind of encoding it use. In Copy&Paste, when other app (such as StarOffice) requests for "text/html", mozilla puts data using UCS2 encoding into Clipboard, and there is no "charset" info in it. That is to say, it's not standard "text/html", so StarOffice can't interpret it. When mozilla paste, it consider the "text/html" data from clipboard is encoded in UCS2, but actually it can be other encodings. This patch adds HTML header which contains "charset" info when copy "text/html" data to clipboard. The "charset" is always "UTF-8" because "text/html" data will be converted to UTF-8 before putting to clipboard. When pasting , it first abstract the "charset" info from "text/html" data and then convert data into UCS2 encoding. With this patch, mozilla can Copy&Paste with other application who use standard "text/html" format.

Kathleen :Brade

Comment 3

•

23 years ago

In nsCopySupport.cpp, is there some reason you insert the info in reverse order rather than Appending it in the actual order we want it? It seems like appending would be more efficient as well as more intuitive. In nsClipboard.cpp, I prefer to see local variable declared near where they are first used. I would move the following block to the block with "nsAutoString platformCharset": > nsresult rv; > PRInt32 outUnicodeLen; > PRUnichar *unicodeData; Speaking of "unicodeData" it should be initialized to null; does the compiler give a warning about this? Is "rv" ever used? Why not check it? In GetHTMLCharset, this line is not needed and should be removed: > return; I'd avoid using the same variable name for the strings. It will lead to confusion to name them the same. Also, I'm not sure if it's more efficient to just declare it inline (NS_LITERAL_CSTRING) or actually name it as you have done (since you are using those only once each).

Louie Zhao

Assignee

Comment 4

•

23 years ago

Attached patch second working patch (obsolete) — Details — Splinter Review

Thank brade, your comment is very helpful, and I have updated my patch according to your suggestion, Can you spare some time to check this one and give the patch "review" ? Thanks a lot.

Attachment #90929 - Attachment is obsolete: true

Kathleen :Brade

Comment 5

•

23 years ago

I sent Louie some e-mail with more comments on his 2nd patch; in particular I'm concerned about a lack of error checking in the 2nd file (2 places)

Louie Zhao

Assignee

Comment 6

•

23 years ago

Attached patch third working patch — Details — Splinter Review

brade: In this patch, the string operation code is concise. The charset converting code is much like other similar parts in nsClipboard.cpp, so the checking may be not neccessary because other codes don't do checking and work fine. But adding the checking code to "do_GetService" and "GetUnicodeDecoder" will make code more stable as you suggested. So this patch includes the checking code. Thank you for taking time to check the patch when you are busy!

Attachment #91077 - Attachment is obsolete: true

Kathleen :Brade

Comment 7

•

23 years ago

I'd prefer to see outUnicodeLen initialized (to 0?) since no return value is checked here: decoder->GetMaxLength(data, numberOfBytes, &outUnicodeLen); r=brade but I'd like Akkana to also review and check for Unix paste issues which she is more knowledgeable about.

Assignee: pavlov → Louie.Zhao

Akkana Peck

Comment 8

•

23 years ago

One fairly serious problem is that it makes mozilla incompatible with earlier versions of itself on the same machine. - Run composer before this patch - Simultaneously run composer with this patch (to a different profile) - Try to copy/paste between them. It fails in both directions. Is this inevitable, or can it be fixed? It means that we won't work with other mozilla-based apps on the same platform (if they're using earlier versions of the libraries, e.g. 1.0) and that we won't work with any app that has implemented html copy/paste to be compatible with mozilla 1.0. That's a serious step, so let's not take it unless we really have to.

Akkana Peck

Comment 9

•

23 years ago

CC some other people who should be aware of this. Joe, in particular, is working on something that might directly relate to this code. Joe, can you take a look?

Boris Zbarsky [:bzbarsky]

Comment 10

•

23 years ago

Question. Is the "text/html" string specifying the data flavor an opaque string or a MIME type? If the latter, would things break horribly all over if we did something like "text/html; charset=UCS-2" or something like that?

Joe Francis

Comment 11

•

23 years ago

I'm not sure we can do this. Our text/html copies are not documents, but fragments. Internally we keep some context info in a seperate data flavor, and our html editor makes use of this at paste time. Eventually we will also ship this context info outside the app for CF_HTML support on windows. What does it mean, for instance, to wrap an <html> tag around an <li>? <html> is not a legal parent for <li>. Also, I don't understand the claim that we always go to utf8 on the clipboard. We certainly dont on our internal clipboard (ie, if you copy and paste within mozilla, it's ucs2 all the way through).

Louie Zhao

Assignee

Comment 12

•

23 years ago

It's really hard to guarantee normal COPY&PASTE for mozilla with both "early mozilla" and other app which using standard "text/html". If mozilla use UCS-2 encoding for COPY&PASTE instead of standard "text/html" in order to interact with other mozilla-based app, it will lose opportunity to interact with other unix application which may be more important.

Akkana Peck

Comment 13

•

23 years ago

> it will lose opportunity to interact with other > unix application Louie: What other Unix applications now support html copy/paste, and is there a standard? I was under the impression that the standards hadn't been written yet. Since mozilla is one of the first (if not the first) to handle this, I have to assume that in the absence of a standard, other apps will write their html clipboard to make them interact with the currently released versions of mozilla rather with some future version they don't know about yet. But if there really is a standard that other apps will be following, then of course we should follow it too.

Louie Zhao

Assignee

Comment 14

•

23 years ago

when any application want to get html data from clipboard, they send request with data flavor "text/html" to the clipboard. "text/html" is defined by RFC2854. So both when mozilla get "text/html" from the clipboard and when mozilla provide "text/html" to clipboard, it should following the RFC2854. Now there are so few applications supporting HTML copy&paste under Linux. StarOffice is one I can find. When requesting "text/html" from StarOffice, the data accords with RFC2854 (actually the data is always UTF-8 encoded , and always has <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">. Using utf-8 is really recommended by RFC2854).

Kathleen :Brade

Comment 15

•

23 years ago

RFC2854 is informational; it isn't a real "standard" as one can tell from the introductory paragraph: >Status of this Memo > > This memo provides information for the Internet community. It does > not specify an Internet standard of any kind. Shouldn't we instead be referencing W3C standards? but back to this bug... Joe--in comment 11 you ask about UTF8 vs UCS2; I think Louie.Zhao is asking about the external clipboard since he wants to enable copy/paste of data from StarOffice. I'd like to see us address Akkana's comment 8 since it is an important issue. What happens if the patch is changed from utf8 to ucs2? Does that maintain compatibility with other mozilla apps? Does that work with StarOffice or does StarOffice really only support UTF8?

Akkana Peck

Comment 16

•

23 years ago

Putting "text/html; charset=utf-8" and ucs2 both on the clipboard sounds like the best approach, so we're specific about what we're offering. We should probably offer and accept both, store only one in the actual system selection, and be able to convert on the fly if a client asks for something else. I don't think our current clipboard classes have any code to do charset conversion yet, but that should be straightforward to write. That leaves the question remains what we should assume as a default if another program puts text/html (with no charset specified) on the clipboard and someone pastes into mozilla. We know what older mozilla builds put on the clipboard (which is relevant for someone copying in galeon or another embedded app and pasting into a newer mozilla, for example); apparently star office specifies utf-8 explicitly, so that won't be a problem. Cc'ing others who have had an interest in the unix clipboard issue.

Christopher Blizzard (:blizzard)

Comment 17

•

23 years ago

We need to be specifying the charset for the type. This has been a long outstanding problem. What other apps do is only marginally important. In fact, other apps have just assumed UCS2 since mozilla has been doing the wrong thing for so long.

Louie Zhao

Assignee

Comment 18

•

23 years ago

> We should probably offer and accept both, > store only one in the actual system selection, > and be able to convert on the fly if a client asks for something else. mozilla can offer both easily, but the request from paster( whether mozilla or other app ) is the same -- asking for "text/html", that is to say, mozilla can't tell who is asking for the data. So mozilla can't choose UCS-2 or charset-specific data according to the client. I am not sure whether one can judge whether data is UCS-2 encoded without any other imformation. If answer is yes, mozilla can accept both, that is, mozilla can paste from mozilla-base app and other app like StarOffice. I think that is the best situation we can reach. > What happens if the patch is changed from utf8 to ucs-2? > Does that maintain compatibility with other mozilla apps? > Does that work with StarOffice or does StarOffice really only support UTF8? I have tried this before, StarOffice can't recoganize ucs2 even I include "charset = ucs2" in the data.

Hidetoshi Tajima

Reporter

Comment 19

•

23 years ago

shouldn't we stop using "text/html" for selection type until we see it standard? Then, when a new composer is a selection requstor, it gets an array of the target types from the "TARGETS" property owned by the currenct selection owner. When the old composer is the owner, it has "text/html" in the list, but the new mozilla would not choose it, but pick up some other type, such as "UTF8_STRING", "COMPOUND_TEXT", or "STRING". When the new composer is the owner, it will not put "text/html" in the "TARGET" properly, then the requestor would ask some other type, instead. If it really works, this could be a workaround to stop doing wrong thing any longer while keeping compatibility with earlier versions of composer and other copycats of them, and it should work with StarOffice too, since it supports "UTF8_STRING" target as well. For long run, I'd prefer to take a patch like Loui's to make things right.

Louie Zhao

Assignee

Comment 20

•

23 years ago

About Tajima's idea: "UTF8_STRING" / "COMPOUND_TEXT" / "STRING" is for pure text, which has no HTML tag. When you copy some HTML and paste to textbox, the request type is UTF8_STRING, what you get is only pure text. So, these types can't be used in composer to copy-paste HTML.

Louie Zhao

Assignee

Comment 21

•

23 years ago

Attached patch new patch compitable with old-Mozilla and StarOffice (obsolete) — Details — Splinter Review

With help of StarOffice guys, I made a new patch which can make mozilla compitable both with old-version mozilla(or mozilla-based app) and with app like StarOffice when copy-paste. New mozilla can 1. copy mozilla's HTML to old-version mozilla's composer 2. copy mozilla's HTML to StarOffice 3. copy old-version mozilla's HTML to mozilla's composer 4. copy StarOffice's HTML to mozilla's composer 5. of course mozilla's inner HTML copy-paste Working process: "text/html" can be encoded UTF-16[which I think impossible before :-(].But it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)). StarOffice checks the BOM to decide whether pasted content use UTF-16 encoding. When mozilla copy, no need to convert to any other encoding, just add BOM before the copy content. (If add HTML info like <HTML>, <HEAD>, <...charset=utf-16>, it can be standard "text/html". But then, old mozilla will insert HTML info more than once.) when mozilla paste, it first check BOM. If BOM exits(mean UTF-16 encoding), just stripping off BOM is OK(inner copy-paste). If no BOM, assume it's iso8859-1 encoding. Then try to find "charset". If has, convert from "charset" to UCS2. If can't find, assume data was from old-version mozilla because it has no BOM and charset. After completing encode converting, check if there is HTML tag: <BODY> & </BODY>. If yes, strip off HTML info, only remain content in <BODY>..</BODY>. If no, data is from mozilla, don't need do anything. Imperfection: because this is a solution to cooperate app using different data format, it can't be perfect. Two things should be mentioned: 1. when copy mozilla's HTML to old-version mozilla's composer each time, there will be an extra char, that's BOM, because old-version mozilla's think BOM is also content. This is not a big thing, so we can ignore. 2. when paste from other app which use "text/html" without putting "charset" in, mozilla will think it's from old-version mozilla because it has no BOM and charset. This problem is for other app don't abide by "text/html" standard. mozilla can deal with standard "text/html" well. Until standard for Unix HTML copy-paste comes out, we can use this solution.

Louie Zhao

Assignee

Comment 22

•

23 years ago

Attached patch same as previous patch, using diff -u (obsolete) — Details — Splinter Review

same as the previous patch using diff -u to generate

Attachment #95089 - Attachment is obsolete: true

Akkana Peck

Comment 23

•

23 years ago

Adding pinkerton since I'm about to ask a question regarding nonUnix platform clipboards.

Akkana Peck

Comment 24

•

23 years ago

Comment on attachment 95350 [details] [diff] [review] same as previous patch, using diff -u I'm still reviewing the nsClipboard.cpp details, but I have a few questions: >Index: layout/base/src/nsCopySupport.cpp >=================================================================== >+ >+ PRUnichar prefix = 0xFEFF; >+ buffer.Insert(prefix, 0); Please add a comment here explaining why this is here. Your explanation in the bug and in the other file are clear, but someone reading this file wouldn't know where to look. It looks like we're inserting this char on all platforms. Are nonunix platforms going to see the extra character in pasted html? Other questions: if we add this special character, are we locking ourselves in to adding it forever? The extra character added when pasting in to old versions of composer might be something we can live with, since that will be a relatively rare occurrence. We shouldn't see problems pasting into text controls in an older mozilla, right? since the text control should accept only plaintext. But I hope we aren't locking ourselves into always having to add the special character -- it's okay as a temporary kludge, but wouldn't be a good longterm solution. >+ PRBool found; >+ found = FindInReadable(NS_LITERAL_STRING("<BODY>"), start, end); >+ if (!found) { >+ htmlStr.BeginReading(start); >+ htmlStr.EndReading(end); >+ found = FindInReadable(NS_LITERAL_STRING("<body>"), start, end); Can't you do this in one pass, using nsCaseInsensitiveCStringComparator as an argument to FindInReadable? And that way we'll also get less common variants like Body. I've only skimmed the rest of the nsClipboard code at this point, but it looks straightforward.

Louie Zhao

Assignee

Comment 25

•

23 years ago

Attached patch patch following akanna's advise (obsolete) — Details — Splinter Review

>Please add a comment here explaining why this is here. it's quite reasonable, i add comment on why add BOM >It looks like we're inserting this char on all platforms. I don't know much about non-unix platform. But I guess windows doesn't need this char. I move the code of adding BOM to nsClipboard.cpp, so this solution only concern unix-platform. >Can't you do this in one pass, using nsCaseInsensitiveCStringComparator >as an argument to FindInReadable? It's great! use CaseInsensitiveFindInReadable instead FindInReadable >But I hope we aren't locking ourselves into always >having to add the special character -- it's okay as >a temporary kludge, but wouldn't be a good longterm solution. This solution won't lock ourselves into always having to add the special character because: 1. if don't consider the interaction with other app, the solution only add 1 char when copy, and minus 1 char when paste. That is to say, the solution doesn't change much most of the time. 2. adding BOM in front of UTF-16 document is recommended. Many app have adapted this, such as StarOffice on solaris, and ultraedit on windows. 3. When we reach a agreement on the standard of unix copy-paste, mozilla and other app can use the standard. If the standard is UCS2(without BOM), we can easily get rid of the BOM; if the standard is "text/html", we can just wrap HTML info around the current content. In fact, now mozilla must deal with multiple format in order to be more compitable.

Attachment #95350 - Attachment is obsolete: true

Akkana Peck

Comment 26

•

23 years ago

I'm finding that when I paste html from an older build's editor into an editor with the patch, or when I copy any html in a build with the patch, I get this assert: ###!!! ASSERTION: nsDependentString must wrap only null-terminated strings: '!*aEndPtr', file ../../../dist/include/string/nsDependentString.h, line 86 It works fine in both directions, but maybe you need a Substring() or something? I'm not sure why nsDependantString requires null termination when it also takes a length argument ... oh, well. The extra character does show up when pasting into the older build, but I think we can live with that (it's a fairly uncommon case, anyway, and composer does show it and doesn't have any problems with deleting it). Thanks for the reassurance that other apps do this and it's at least somewhat standard.

Boris Zbarsky [:bzbarsky]

Comment 27

•

23 years ago

nsDependantString requires null termination because it inherits from nsAFlatString, which promises that .get() will return a pointer to a null- terminated buffer. It can take a length for efficiency reasons (to avoid an strlen()).

Louie Zhao

Assignee

Comment 28

•

23 years ago

Attached patch new patch (obsolete) — Details — Splinter Review

using nsString instead nsDependentString to avoid "ASSERTION:..." previous "adding BOM" will replace the first char, now it's ok.

Attachment #95536 - Attachment is obsolete: true

Johnny Stenback (:jst)

Comment 29

•

23 years ago

Ouch, that'll cost us... How about using nsDependentSubstring in stead?

Louie Zhao

Assignee

Comment 30

•

23 years ago

Attached patch using substring — Details — Splinter Review

using substring is great, it cost less and has no "ASSERTION...".

Attachment #95841 - Attachment is obsolete: true

Louie Zhao

Assignee

Comment 31

•

23 years ago

Attached patch final patch — Details — Splinter Review

after discussion with akkana, we agree to get rid of "GetHMTLBody", which should be done by composer, not nsClipboard.cpp

Akkana Peck

Comment 32

•

23 years ago

Comment on attachment 96126 [details] [diff] [review] final patch This one looks good! r=akkana

Attachment #96126 - Flags: review+

Akkana Peck

Comment 33

•

23 years ago

I don't think we're changing anything with regard to html paste into IM, but just in case, cc'ing Joe.

Louie Zhao

Assignee

Comment 34

•

23 years ago

seeking sr=

Johnny Stenback (:jst)

Comment 35

•

23 years ago

Comment on attachment 96126 [details] [diff] [review] final patch + PRUnichar prefix = 0xFEFF; + memcpy(buffer, &prefix, 2); + memcpy(buffer + 2, clipboardData, dataLength); In stead of using the literal 2 above, use sizeof(prefix) so that it's more obvious what this code does. + nsMemory::Free(NS_REINTERPRET_CAST(char*, clipboardData)); + clipboardData = NS_REINTERPRET_CAST(char*, buffer); + dataLength = dataLength + 2; Same here, use sizeof(prefix) and not 2. - In ConvertHTMLtoUCS2( + if (charset.Equals(NS_LITERAL_STRING("UTF-16"))) {//current mozilla + outUnicodeLen = dataLength / 2 - 1; Please add parens around "dataLength / 2", i.e.: + outUnicodeLen = (dataLength / 2) - 1; to make this clearer. + *unicodeData = NS_REINTERPRET_CAST(PRUnichar*, + nsMemory::Alloc((outUnicodeLen + 1) * sizeof(PRUnichar))); In stead of using the literal 1 here, use sizeof('\0') to indicate that you're adding space for the null terminator. Same thing here: + else if (charset.Equals(NS_LITERAL_STRING("OLD-MOZILLA"))) {// old mozilla + outUnicodeLen = dataLength / 2; + *unicodeData = NS_REINTERPRET_CAST(PRUnichar*, + nsMemory::Alloc((outUnicodeLen + 1) * sizeof(PRUnichar))); And also here: + // converting + decoder->GetMaxLength(data, dataLength, &outUnicodeLen); + // |outUnicodeLen| is number of chars + if (outUnicodeLen) { + *unicodeData = NS_REINTERPRET_CAST(PRUnichar*, + nsMemory::Alloc((outUnicodeLen + 1) * sizeof(PRUnichar))); - In GetHTMLCharset(): + nsACString::const_iterator start, end, valueStart, valueEnd; + + htmlStr.BeginReading(start); + htmlStr.EndReading(end); + htmlStr.BeginReading(valueStart); + htmlStr.BeginReading(valueEnd); I believe this would be faster: + nsACString::const_iterator start, end; + + htmlStr.BeginReading(start); + htmlStr.EndReading(end); + + nsACString::const_iterator valueStart(start), valueEnd(start); Further down: + if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("\""), start, end)) + valueEnd = start; What's the point of doing a case-insensitive find when looking for a quote character? Just do: + if (FindCharInReadable('"', start, end)) + valueEnd = start; - Nit of the day: + if ( !charsetStr.IsEmpty() ) { Loose the spaces inside the outer parens (for consistency with the rest of your code, and the surrounding code). sr=jst if you fix the above.

Attachment #96126 - Flags: superreview+

Louie Zhao

Assignee

Comment 36

•

23 years ago

Attached patch patch after jst's comment — Details — Splinter Review

Attachment #96126 - Attachment is obsolete: true

Louie Zhao

Assignee

Updated

•

23 years ago

Whiteboard: branchOEM

Pete Zha

Updated

•

23 years ago

Attachment #96126 - Attachment is obsolete: false

Pete Zha

Comment 37

•

23 years ago

checked in trunk

Status: NEW → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Jim Dunn

Comment 38

•

23 years ago

ok for oem branch

Whiteboard: branchOEM → branchOEM+

Pete Zha

Comment 39

•

23 years ago

checked in NETSCAPE_7_0_OEM_BRANCH (a=jdunn)

Whiteboard: branchOEM+ → branchOEM+ fixedOEM

timeless

Updated

•

23 years ago

Whiteboard: branchOEM+ fixedOEM → fixedOEM

timeless

Updated

•

23 years ago

Keywords: fixedOEM

Whiteboard: fixedOEM

Myk Melez [:myk] [@mykmelez]

Updated

•

21 years ago

Product: Browser → Seamonkey

first working patch 23 years ago Louie Zhao 4.35 KB, patch		Details \| Diff \| Splinter Review
second working patch 23 years ago Louie Zhao 4.23 KB, patch		Details \| Diff \| Splinter Review
third working patch 23 years ago Louie Zhao 4.43 KB, patch		Details \| Diff \| Splinter Review
new patch compitable with old-Mozilla and StarOffice 23 years ago Louie Zhao 7.43 KB, patch		Details \| Diff \| Splinter Review
same as previous patch, using diff -u 23 years ago Louie Zhao 7.94 KB, patch		Details \| Diff \| Splinter Review
patch following akanna's advise 23 years ago Louie Zhao 8.94 KB, patch		Details \| Diff \| Splinter Review
new patch 23 years ago Louie Zhao 8.93 KB, patch		Details \| Diff \| Splinter Review
using substring 23 years ago Louie Zhao 8.95 KB, patch		Details \| Diff \| Splinter Review
final patch 23 years ago Louie Zhao 8.24 KB, patch	akkzilla : review+ jst : superreview+	Details \| Diff \| Splinter Review
patch after jst's comment 23 years ago Louie Zhao 8.22 KB, patch		Details \| Diff \| Splinter Review