Closed Bug 1497580 Opened Last year Closed Last year

Improve GTK nsClipboard code (make text/html produce UTF-8)

Categories

(Core :: Widget: Gtk, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla64

People

(Reporter: evilpie, Assigned: evilpie)

References

Details

Attachments

(2 files)

Hi Karl!

I have some question about nsClipboard::SelectionGetEvent. I was investigating nsPrimitiveHelpers::CreateDataFromPrimitive fixing in that function.
https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/widget/gtk/nsClipboard.cpp#574-604

1) Why 8 bits? CreateDataFromPrimitive can return char or char16_t.
2) What's up with all that crazy code for text/html? Is that really necessary? This also seems to assume char16_t.
Flags: needinfo?(karlt)
> 2) What's up with all that crazy code for text/html? Is that really necessary?

X11 usually works with UTF-8.  The comment seems to claim that a recommendation permits UCS2 for text/html if the data starts with a BOM.  I don't find that in a quick google search, but for a convention for UCS2 files.  Any X11 usage probably pre-dates UTF-8.

A quick search for UTF-16, BOM, and UCS2 in GTK code does not find any matches, and so I suspect conversion to UTF-8 would provide better compatibility with other apps.  If you'd like to change that, check LibreOffice for compat.  Also check compat with another Mozilla app (or process), or at least check that the Gecko code receiving HTML works with UTF-8.

> This also seems to assume char16_t.

The format (char or char16_t) depends on the flavor passed to nsPrimitiveHelpers.  char16_t is what is used for text/html.

1) Why 8 bits? CreateDataFromPrimitive can return char or char16_t.

It's just sending a stream of bytes, which, from the byte-count aDataLen parameter, seems to be what nsPrimitiveHelpers is providing.

https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/widget/nsPrimitiveHelpers.cpp#74
Flags: needinfo?(karlt)
(In reply to Karl Tomlinson (:karlt) from comment #1)
> > 2) What's up with all that crazy code for text/html? Is that really necessary?
> 
> X11 usually works with UTF-8.  The comment seems to claim that a
> recommendation permits UCS2 for text/html if the data starts with a BOM.  I
> don't find that in a quick google search, but for a convention for UCS2
> files.  Any X11 usage probably pre-dates UTF-8.
> 
> A quick search for UTF-16, BOM, and UCS2 in GTK code does not find any
> matches, and so I suspect conversion to UTF-8 would provide better
> compatibility with other apps.  If you'd like to change that, check
> LibreOffice for compat.  Also check compat with another Mozilla app (or
> process), or at least check that the Gecko code receiving HTML works with
> UTF-8.
> 
Great, I will have a look. Is there anything that produces text/html so I could check pasting into Firefox?
> > This also seems to assume char16_t.
> 
> The format (char or char16_t) depends on the flavor passed to
> nsPrimitiveHelpers.  char16_t is what is used for text/html.
> 
> 1) Why 8 bits? CreateDataFromPrimitive can return char or char16_t.
> 
> It's just sending a stream of bytes, which, from the byte-count aDataLen
> parameter, seems to be what nsPrimitiveHelpers is providing.
> 
> https://searchfox.org/mozilla-central/rev/
> 80ac71c1c54af788b32e851192dfd2de2ec18e18/widget/nsPrimitiveHelpers.cpp#74
I am mostly just wondering how other apps are supposed to interpret that data, but I don't actually know what other kind of MIME types that we support would be requested by other applications.
Seems like Libre Office can handle UTF-8 encoded text/html, but Firefox can't ...
(In reply to Tom Schuster [:evilpie] from comment #2)
> (In reply to Karl Tomlinson (:karlt) from comment #1)
> > A quick search for UTF-16, BOM, and UCS2 in GTK code does not find any
> > matches, and so I suspect conversion to UTF-8 would provide better
> > compatibility with other apps.  If you'd like to change that, check
> > LibreOffice for compat.  Also check compat with another Mozilla app (or
> > process), or at least check that the Gecko code receiving HTML works with
> > UTF-8.
> > 
> Great, I will have a look. Is there anything that produces text/html so I
> could check pasting into Firefox?

I'm guessing that LibreOffice (editing HTML) and/or Chromium produce text/html, but I don't know TBH.

Note that the "STRING" etc block above almost does the same thing (with unnecessary "New" string allocations), but "text/unicode" would need to be changed.

> I am mostly just wondering how other apps are supposed to interpret that
> data, but I don't actually know what other kind of MIME types that we
> support would be requested by other applications.

|selectionTarget| indicates how the data will be interpreted.
Images have already been handled above, and so the common other one would be text/plain.  URLs may be another - I don't know the mime type for that.
(In reply to Tom Schuster [:evilpie] from comment #3)
> Seems like Libre Office can handle UTF-8 encoded text/html, but Firefox
> can't ...

I guess that is this case:
https://searchfox.org/mozilla-central/rev/65f9687eb192f8317b4e02b0b791932eff6237cc/widget/gtk/nsClipboard.cpp#676

I guess ideally Mozilla should present this as "text/html;charset=utf-8".  The variation from text/html would need to be performed via a new gtk_target_list_add() special case in nsClipboard::SetData().
(In reply to Karl Tomlinson (:karlt) from comment #5)
> I guess ideally Mozilla should present this as "text/html;charset=utf-8". 

Although that seems unnecessary if the charset may be in the html.
Maybe ConvertHTMLtoUCS2() should just assume utf-8 if no charset.
Using gtk_targets_include_text actually leads to a small behavior difference,
because gtk_targets_include_text also includes text/plain.

But gtk_selection_data_set_text seems to correctly convert UTF-8 to plain text.
Attachment #9016040 - Attachment description: Bug 1497580 - Start cleaning up GTK clipboard code. r?karlt → Bug 1497580 - use gtk_target_list_add_text_targets for text/unicode clipboard data r?karlt
Keywords: leave-open
Pushed by evilpies@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/80b93ae22c27
use gtk_target_list_add_text_targets for text/unicode clipboard data r=karlt
I just implemented the idea of adding a `<meta http-equiv="content-type" content="text/html; charset=utf-8">` to all text/html data. At least some version of WebKit seems to do the same: https://github.com/clbr/webkitfltk/blob/b5d1ddf4649ee9d7498a7b78fc48ac76d4adde1b/Source/WebCore/platform/gtk/PasteboardHelper.cpp#L142-L146. 

This approach works quite well, even with previous Firefox version. I don't think we can get away with not adding the prefix and always assuming UTF-8, because that would basically break other (older) Gecko based software, as those would not be able to detect the encoding in that case.
Assignee: nobody → evilpies
I verified that we can still copy from Firefox to an older version of Firefox without this patch.
LibreOffice also still works. Talking to some GTK people on IRC they are also happy about UTF-8 instead of wrongly declared UCS2.
Pushed by evilpies@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/89f4156c07a5
Save text/html as UTF-8 to the clipboard. r=karlt
Backout by dluca@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7f30799c0c32
Backed out changeset 89f4156c07a5 for CL failures in dom/base/test/test_bug166235.html
I wonder what is going on:

[task 2018-10-13T19:47:23.002Z] 19:47:23     INFO - GECKO(3395) | [3395, Main Thread] ###!!! ASSERTION: nsTDependentString must wrap only null-terminated strings. You are probably looking for nsTDependentSubstring.: 'this->mData[substring_type::mLength] == 0', file /builds/worker/workspace/build/src/obj-firefox/dist/include/nsTString.h, line 491
[task 2018-10-13T19:47:23.003Z] 19:47:23     INFO - GECKO(3395) | #01: nsTString<char>::AssertValidDependentString() [xpcom/string/nsTString.h:489]
[task 2018-10-13T19:47:23.003Z] 19:47:23     INFO - 
[task 2018-10-13T19:47:23.004Z] 19:47:23     INFO - GECKO(3395) | #02: nsClipboard::GetData(nsITransferable*, int) [xpcom/string/nsTDependentString.h:66]

So I assume this has something to do with the Substring(data, prefixLen).EqualsLiteral(kHTMLMarkupPrefix)? But doesn't Substring return an nsTDependentSubstring already https://searchfox.org/mozilla-central/rev/26b40a44691e0710838130b614c2f2662bc91eec/xpcom/string/nsTDependentSubstring.h#134.
Flags: needinfo?(evilpies)
Pushed by evilpies@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/bb2acd046eae
Save text/html as UTF-8 to the clipboard. r=karlt
I seriously have no idea what was going on there, I just went with strncmp for now, that should work at least.
Flags: needinfo?(evilpies)
There were also mochitest perma failures on test_bug333064.html

Log link: https://treeherder.mozilla.org/logviewer.html#?job_id=205979304&repo=autoland&lineNumber=12653
Pushed by evilpies@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/695366afa355
Save text/html as UTF-8 to the clipboard. r=karlt
Status: NEW → RESOLVED
Closed: Last year
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Summary: Clarify nsClipboard::SelectionGetEvent → Improve GTK nsClipboard code (make text/html produce UTF-8)
You need to log in before you can comment on or make changes to this bug.