Closed Bug 123389 Opened 23 years ago Closed 22 years ago

wrong format data for text/html type in cut-and-paste [X clipboard]

Categories

(SeaMonkey :: Composer, defect)

All
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Future

People

(Reporter: tajima, Assigned: Louie.Zhao)

Details

(Keywords: fixedOEM, intl)

Attachments

(4 files, 6 obsolete files)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:0.9.7) Gecko/20020131
BuildID:    2001122204

In the composer window, when text is copied or cut and paste it to the self,
"text/html" type is choosen for clipboard selection. Here,
the text data in a X selection event is encoded in plain UCS2 
string(always in a big-endian) and the format field of the event sets 
to 8.

This shouldn't be working if the owner and the requestor are
in different endian machines to each other. At least, the format
should be 16, and let X take care of byte-swapping. 

Moreover, the plain UCS2 should not be a right data for text/html 
type, but it should be html code beginning with "<HTML>". According to RFC2854
(which defines text/html) the encoding must be iso8859-1 
in absence of explicit encoding specifiers.

Both selection getter(when the composer delivers) and selection requester(when
it receives) should be modified to properly convert
the cut-n-paste text to/from its html code, with the charset 
information explicitly specified.

Reproducible: Always
Steps to Reproduce:
1.run "mozilla -compose" with a debugger on Linux or Solaris which 
has widget/src/gtk/nsClipboard.cpp debuggable built.
2. In the debugger, monitor cliboardData in either 
nsClipboard:SelectionGetCB or nsClipboard:SelectionReceiver when its 
atom type is of "text/html".
3. Try to cut-and-paste any text in the composer window, and find out 
that the data in the clipboard is encoded in plain UCS2 with 
big-endian and the format is 8.

Actual Results:  The cut-and-paste text in the composer uses text/html data type
but is encoded in plain UCS-2.

Expected Results:  It should be html code with charset information explictly given.
-->pavlov for X clipboard issue
Assignee: syd → pavlov
Keywords: intl
Summary: wrong format data for text/html type in cut-and-paste → wrong format data for text/html type in cut-and-paste [X clipboard]
Target Milestone: --- → Future
Attached patch first working patch (obsolete) — Splinter Review
This is a patch for this bug.

According to RFC 2854, the "text/html" data should have "charset" info in it to
indicate what kind of encoding it use. 

In Copy&Paste, when other app (such as StarOffice) requests for "text/html",
mozilla puts data using UCS2 encoding into Clipboard, and there is no "charset"
info in it. That is to say, it's not standard "text/html", so StarOffice can't
interpret it.

When mozilla paste, it consider the "text/html" data from clipboard is encoded
in UCS2, but actually it can be other encodings. 

This patch adds HTML header which contains "charset" info when copy "text/html"
data to clipboard. The "charset" is always "UTF-8" because "text/html" data
will be converted to UTF-8 before putting to clipboard.

When pasting , it first abstract the "charset" info from "text/html" data and
then convert data into UCS2 encoding.

With this patch, mozilla can Copy&Paste with other application who use standard
"text/html" format.
In nsCopySupport.cpp, is there some reason you insert the info in reverse order
rather than Appending it in the actual order we want it?  It seems like
appending would be more efficient as well as more intuitive.

In nsClipboard.cpp, I prefer to see local variable declared near where they are
first used.  I would move the following block to the block with "nsAutoString
platformCharset":
>     nsresult rv;
>     PRInt32 outUnicodeLen;
>     PRUnichar *unicodeData;


Speaking of "unicodeData" it should be initialized to null; does the compiler
give a warning about this?

Is "rv" ever used? Why not check it?

In GetHTMLCharset, this line is not needed and should be removed:
>   return;

I'd avoid using the same variable name for the strings.  It will lead to
confusion to name them the same.  Also, I'm not sure if it's more efficient to
just declare it inline (NS_LITERAL_CSTRING) or actually name it as you have done
(since you are using those only once each).
Attached patch second working patch (obsolete) — Splinter Review
Thank brade, your comment is very helpful, and I have updated my patch
according to your suggestion, Can you spare some time to check this one and
give the patch "review" ? Thanks a lot.
Attachment #90929 - Attachment is obsolete: true
I sent Louie some e-mail with more comments on his 2nd patch; in particular I'm
concerned about a lack of error checking in the 2nd file (2 places)
brade:

  In this patch, the string operation code is concise.

  The charset converting code is much like other similar parts in
nsClipboard.cpp, so the checking may be not neccessary because other codes
don't do checking and work fine. But adding the checking code to
"do_GetService" and "GetUnicodeDecoder" will make code more stable as you
suggested. So this patch includes the checking code.

  Thank you for taking time to check the patch when you are busy!
Attachment #91077 - Attachment is obsolete: true
I'd prefer to see outUnicodeLen initialized (to 0?) since no return value is
checked here:
  decoder->GetMaxLength(data, numberOfBytes, &outUnicodeLen);

r=brade but I'd like Akkana to also review and check for Unix paste issues which
she is more knowledgeable about.
Assignee: pavlov → Louie.Zhao
One fairly serious problem is that it makes mozilla incompatible with earlier
versions of itself on the same machine.

- Run composer before this patch
- Simultaneously run composer with this patch (to a different profile)
- Try to copy/paste between them.  It fails in both directions.

Is this inevitable, or can it be fixed?  It means that we won't work with other
mozilla-based apps on the same platform (if they're using earlier versions of
the libraries, e.g. 1.0) and that we won't work with any app that has
implemented html copy/paste to be compatible with mozilla 1.0.  That's a serious
step, so let's not take it unless we really have to.
CC some other people who should be aware of this.  Joe, in particular, is
working on something that might directly relate to this code.  Joe, can you take
a look?
Question.  Is the "text/html" string specifying the data flavor an opaque string
or a MIME type?  If the latter, would things break horribly all over if we did
something like "text/html; charset=UCS-2" or something like that?
I'm not sure we can do this.  Our text/html copies are not documents, but 
fragments.  Internally we keep some context info in a seperate data flavor, and 
our html editor makes use of this at paste time.  Eventually we will also ship 
this context info outside the app for CF_HTML support on windows.  What does it 
mean, for instance, to wrap an <html> tag around an <li>?  <html> is not a legal 
parent for <li>.

Also, I don't understand the claim that we always go to utf8 on the clipboard.  
We certainly dont on our internal clipboard (ie, if you copy and paste within 
mozilla, it's ucs2 all the way through).
It's really hard to guarantee normal COPY&PASTE for mozilla with both "early
mozilla" and other app which using standard "text/html". If mozilla use UCS-2
encoding for COPY&PASTE instead of standard "text/html" in order to interact
with other mozilla-based app, it will lose opportunity to interact with other
unix application which may be more important. 
> it will lose opportunity to interact with other
> unix application

Louie: What other Unix applications now support html copy/paste, and is there a
standard?  I was under the impression that the standards hadn't been written yet.
Since mozilla is one of the first (if not the first) to handle this, I have to
assume that in the absence of a standard, other apps will write their html
clipboard to make them interact with the currently released versions of mozilla
rather with some future version they don't know about yet.  But if there really
is a standard that other apps will be following, then of course we should follow
it too.
when any application want to get html data from clipboard, they send request
with data flavor "text/html" to the clipboard. "text/html" is defined by
RFC2854. So both when mozilla get "text/html" from the clipboard and when
mozilla provide "text/html" to clipboard, it should following the RFC2854.

Now there are so few applications supporting HTML copy&paste under Linux.
StarOffice is one I can find. When requesting "text/html" from StarOffice, the
data accords with RFC2854 (actually the data is always UTF-8 encoded , and
always has <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">.
Using utf-8 is really recommended by RFC2854).
RFC2854 is informational; it isn't a real "standard" as one can tell from the
introductory paragraph:

>Status of this Memo
>
>   This memo provides information for the Internet community.  It does
>   not specify an Internet standard of any kind.

Shouldn't we instead be referencing W3C standards?

but back to this bug...

Joe--in comment 11 you ask about UTF8 vs UCS2; I think Louie.Zhao is asking
about the external clipboard since he wants to enable copy/paste of data from
StarOffice.

I'd like to see us address Akkana's comment 8 since it is an important issue. 
What happens if the patch is changed from utf8 to ucs2?  Does that maintain
compatibility with other mozilla apps?  Does that work with StarOffice or does
StarOffice really only support UTF8?
Putting "text/html; charset=utf-8" and ucs2 both on the clipboard sounds like
the best approach, so we're specific about what we're offering.  We should
probably offer and accept both, store only one in the actual system selection,
and be able to convert on the fly if a client asks for something else.  I don't
think our current clipboard classes have any code to do charset conversion yet,
but that should be straightforward to write.

That leaves the question remains what we should assume as a default if another
program puts text/html (with no charset specified) on the clipboard and someone
pastes into mozilla.  We know what older mozilla builds put on the clipboard
(which is relevant for someone copying in galeon or another embedded app and
pasting into a newer mozilla, for example); apparently star office specifies
utf-8 explicitly, so that won't be a problem.

Cc'ing others who have had an interest in the unix clipboard issue.
We need to be specifying the charset for the type.  This has been a long
outstanding problem.  What other apps do is only marginally important.  In fact,
other apps have just assumed UCS2 since mozilla has been doing the wrong thing
for so long.
> We should probably offer and accept both, 
> store only one in the actual system selection, 
> and be able to convert on the fly if a client asks for something else.

mozilla can offer both easily, but the request from paster( whether mozilla or
other app ) is the same -- asking for "text/html", that is to say, mozilla can't
tell who is asking for the data. So mozilla can't choose UCS-2 or
charset-specific data according to the client.

I am not sure whether one can judge whether data is UCS-2 encoded without any
other imformation. If answer is yes, mozilla can accept both, that is, mozilla
can paste from mozilla-base app and other app like StarOffice. I think that is
the best situation we can reach.

> What happens if the patch is changed from utf8 to ucs-2?  
> Does that maintain compatibility with other mozilla apps?  
> Does that work with StarOffice or does StarOffice really only support UTF8?

I have tried this before, StarOffice can't recoganize ucs2 even I include
"charset = ucs2" in the data.
shouldn't we stop using "text/html" for selection type
until we see it standard?

Then, when a new composer is a selection requstor, it gets an array of the
target types from the "TARGETS" property owned by the currenct selection owner.
When the old composer is the owner, it has "text/html" in the list, but the new
mozilla would not choose it, but pick up some other type, such as "UTF8_STRING",
"COMPOUND_TEXT", 
or "STRING".

When the new composer is the owner, it will not put "text/html" in the "TARGET"
properly, then the requestor would ask some other
type, instead.

If it really works, this could be a workaround to stop doing wrong thing any
longer while keeping compatibility with earlier versions of composer and other
copycats of them, and it should work with StarOffice too, since it supports
"UTF8_STRING" target as well.
For long run, I'd prefer to take a patch like Loui's to make
things right.
About Tajima's idea: "UTF8_STRING" / "COMPOUND_TEXT" / "STRING" is for pure
text, which has no HTML tag. When you copy some HTML and paste to textbox, the
request type is UTF8_STRING, what you get is only pure text. So, these types
can't be used in composer to copy-paste HTML. 
With help of StarOffice guys, I made a new patch which can make mozilla
compitable both with old-version mozilla(or mozilla-based app) and with app
like StarOffice when copy-paste. New mozilla can

1. copy mozilla's HTML to old-version mozilla's composer
2. copy mozilla's HTML to StarOffice
3. copy old-version mozilla's HTML to mozilla's composer
4. copy StarOffice's HTML to mozilla's composer
5. of course mozilla's inner HTML copy-paste

Working process:

"text/html" can be encoded UTF-16[which I think impossible before :-(].But it
is recommended that documents transmitted as UTF-16 always begin with a
ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte
Order Mark (BOM)). StarOffice checks the BOM to decide whether pasted content
use UTF-16 encoding.

When mozilla copy, no need to convert to any other encoding, just add BOM
before the copy content. (If add HTML info like <HTML>, <HEAD>,
<...charset=utf-16>, it can be standard "text/html". But then, old mozilla will
insert HTML info more than once.)

when mozilla paste, it first check BOM. If BOM exits(mean UTF-16 encoding),
just stripping off BOM is OK(inner copy-paste). If no BOM, assume it's
iso8859-1 encoding. Then try to find "charset". If has, convert from "charset"
to UCS2. If can't find, assume data was from old-version mozilla because it has
no BOM and charset.

After completing encode converting, check if there is HTML tag: <BODY> &
</BODY>. If yes, strip off HTML info, only remain content in <BODY>..</BODY>.
If no, data is from mozilla, don't need do anything.

Imperfection:

because this is a solution to cooperate app using different data format, it
can't be perfect. Two things should be mentioned:
1. when copy mozilla's HTML to old-version mozilla's composer each time, there
will be an extra char, that's BOM, because old-version mozilla's think BOM is
also content. This is not a big thing, so we can ignore.
2. when paste from other app which use "text/html" without putting "charset"
in, mozilla will think it's from old-version mozilla because it has no BOM and
charset. This problem is for other app don't abide by "text/html" standard.
mozilla can deal with standard "text/html" well.

Until standard for Unix HTML copy-paste comes out, we can use this solution.
same as the previous patch
using diff -u to generate
Attachment #95089 - Attachment is obsolete: true
Adding pinkerton since I'm about to ask a question regarding nonUnix platform
clipboards.
Comment on attachment 95350 [details] [diff] [review]
same as previous patch, using diff -u

I'm still reviewing the nsClipboard.cpp details, but I have a few questions:

>Index: layout/base/src/nsCopySupport.cpp
>===================================================================
>+
>+  PRUnichar prefix = 0xFEFF;
>+  buffer.Insert(prefix, 0);

Please add a comment here explaining why this is here.	Your explanation in the
bug and in the other file are clear, but someone reading this file wouldn't
know where to look.

It looks like we're inserting this char on all platforms.  Are nonunix
platforms going to see the extra character in pasted html?

Other questions: if we add this special character, are we locking ourselves in
to adding it forever?  The extra character added when pasting in to old
versions of composer might be something we can live with, since that will be a
relatively rare occurrence.  We shouldn't see problems pasting into text
controls in an older mozilla, right? since the text control should accept only
plaintext.  But I hope we aren't locking ourselves into always having to add
the special character -- it's okay as a temporary kludge, but wouldn't be a
good longterm solution.

>+  PRBool found;
>+  found = FindInReadable(NS_LITERAL_STRING("<BODY>"), start, end);
>+  if (!found) {
>+    htmlStr.BeginReading(start);
>+    htmlStr.EndReading(end);
>+    found = FindInReadable(NS_LITERAL_STRING("<body>"), start, end);

Can't you do this in one pass, using nsCaseInsensitiveCStringComparator as an
argument to FindInReadable?  And that way we'll also get less common variants
like Body.

I've only skimmed the rest of the nsClipboard code at this point, but it looks
straightforward.
Attached patch patch following akanna's advise (obsolete) — Splinter Review
>Please add a comment here explaining why this is here.
it's quite reasonable, i add comment on why add BOM

>It looks like we're inserting this char on all platforms.
I don't know much about non-unix platform. But I guess 
windows doesn't need this char. I move the code of adding BOM to
nsClipboard.cpp, so this solution only concern unix-platform.

>Can't you do this in one pass, using nsCaseInsensitiveCStringComparator
>as an argument to FindInReadable?
It's great! use CaseInsensitiveFindInReadable instead FindInReadable

>But I hope we aren't locking ourselves into always
>having to add the special character -- it's okay as
>a temporary kludge, but wouldn't be a good longterm solution.
This solution won't lock ourselves into always having to add the special
character because:

1. if don't consider the interaction with other app, the solution only add 1
char when copy, and minus 1 char when paste. That is to say, the solution
doesn't change much most of the time.

2. adding BOM in front of UTF-16 document is recommended. Many app have adapted
this, such as StarOffice on solaris, and ultraedit on windows.

3. When we reach a agreement on the standard of unix copy-paste, mozilla and
other app can use the standard. If the standard is UCS2(without BOM), we can
easily get rid of the BOM; if the standard is "text/html", we can just wrap
HTML info around the current content.

In fact, now mozilla must deal with multiple format in order to be more
compitable.
Attachment #95350 - Attachment is obsolete: true
I'm finding that when I paste html from an older build's editor into an editor
with the patch, or when I copy any html in a build with the patch, I get this
assert:

###!!! ASSERTION: nsDependentString must wrap only null-terminated strings:
'!*aEndPtr', file ../../../dist/include/string/nsDependentString.h, line 86

It works fine in both directions, but maybe you need a Substring() or something?
 I'm not sure why nsDependantString requires null termination when it also takes
a length argument ...  oh, well.

The extra character does show up when pasting into the older build, but I think
we can live with that (it's a fairly uncommon case, anyway, and composer does
show it and doesn't have any problems with deleting it).  Thanks for the
reassurance that other apps do this and it's at least somewhat standard.
nsDependantString requires null termination because it inherits from 
nsAFlatString, which promises that .get() will return a pointer to a null-
terminated buffer.  It can take a length for efficiency reasons (to avoid an 
strlen()).
Attached patch new patch (obsolete) — Splinter Review
using nsString instead nsDependentString to avoid "ASSERTION:..."
previous "adding BOM" will replace the first char, now it's ok.
Attachment #95536 - Attachment is obsolete: true
Ouch, that'll cost us... How about using nsDependentSubstring in stead?
Attached patch using substringSplinter Review
using substring is great, it cost less and has no "ASSERTION...".
Attachment #95841 - Attachment is obsolete: true
Attached patch final patchSplinter Review
after discussion with akkana, we agree to get rid of "GetHMTLBody", which
should be done by composer, not nsClipboard.cpp
Comment on attachment 96126 [details] [diff] [review]
final patch

This one looks good!  r=akkana
Attachment #96126 - Flags: review+
I don't think we're changing anything with regard to html paste into IM, but
just in case, cc'ing Joe.
seeking sr=
Comment on attachment 96126 [details] [diff] [review]
final patch

+	   PRUnichar prefix = 0xFEFF;
+	   memcpy(buffer, &prefix, 2);
+	   memcpy(buffer + 2, clipboardData, dataLength);

In stead of using the literal 2 above, use sizeof(prefix) so that it's more
obvious what this code does.

+	   nsMemory::Free(NS_REINTERPRET_CAST(char*, clipboardData));
+	   clipboardData = NS_REINTERPRET_CAST(char*, buffer);
+	   dataLength = dataLength + 2;

Same here, use sizeof(prefix) and not 2.

- In ConvertHTMLtoUCS2(

+  if (charset.Equals(NS_LITERAL_STRING("UTF-16"))) {//current mozilla
+    outUnicodeLen = dataLength / 2 - 1;

Please add parens around "dataLength / 2", i.e.:

+    outUnicodeLen = (dataLength / 2) - 1;

to make this clearer.


+    *unicodeData = NS_REINTERPRET_CAST(PRUnichar*,
+		    nsMemory::Alloc((outUnicodeLen + 1) * sizeof(PRUnichar)));

In stead of using the literal 1 here, use sizeof('\0') to indicate that you're
adding space for the null terminator.

Same thing here:

+  else if (charset.Equals(NS_LITERAL_STRING("OLD-MOZILLA"))) {// old mozilla
+    outUnicodeLen = dataLength / 2;
+    *unicodeData = NS_REINTERPRET_CAST(PRUnichar*,
+		    nsMemory::Alloc((outUnicodeLen + 1) * sizeof(PRUnichar)));

And also here:

+    // converting
+    decoder->GetMaxLength(data, dataLength, &outUnicodeLen);
+    // |outUnicodeLen| is number of chars
+    if (outUnicodeLen) {
+      *unicodeData = NS_REINTERPRET_CAST(PRUnichar*,
+		      nsMemory::Alloc((outUnicodeLen + 1) *
sizeof(PRUnichar)));

- In GetHTMLCharset():

+  nsACString::const_iterator start, end, valueStart, valueEnd;
+
+  htmlStr.BeginReading(start);
+  htmlStr.EndReading(end);
+  htmlStr.BeginReading(valueStart);
+  htmlStr.BeginReading(valueEnd);

I believe this would be faster:

+  nsACString::const_iterator start, end;
+
+  htmlStr.BeginReading(start);
+  htmlStr.EndReading(end);
+
+  nsACString::const_iterator valueStart(start), valueEnd(start);

Further down:

+      if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("\""), start, end))
+	 valueEnd = start;

What's the point of doing a case-insensitive find when looking for a quote
character? Just do:

+      if (FindCharInReadable('"', start, end))
+	 valueEnd = start;

- Nit of the day:

+    if ( !charsetStr.IsEmpty() ) {

Loose the spaces inside the outer parens (for consistency with the rest of your
code, and the surrounding code).

sr=jst if you fix the above.
Attachment #96126 - Flags: superreview+
Attachment #96126 - Attachment is obsolete: true
Whiteboard: branchOEM
Attachment #96126 - Attachment is obsolete: false
checked in trunk
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
ok for oem branch
Whiteboard: branchOEM → branchOEM+
checked in NETSCAPE_7_0_OEM_BRANCH (a=jdunn)
Whiteboard: branchOEM+ → branchOEM+ fixedOEM
Whiteboard: branchOEM+ fixedOEM → fixedOEM
Keywords: fixedOEM
Whiteboard: fixedOEM
Product: Browser → Seamonkey
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: