Closed Bug 1708695 Opened 3 years ago Closed 3 years ago

MUC display names not properly UTF-8 decoded (double decoded?)

Categories

(Chat Core :: XMPP, defect)

Tracking

(thunderbird_esr78 unaffected, thunderbird89 fixed)

RESOLVED FIXED
90 Branch

People

(Reporter: freaktechnik, Assigned: freaktechnik)

References

(Regression)

Details

(Keywords: regression)

Attachments

(3 files)

Attached image xmpp-utf8-broken.png

STR:

  • Join an XMPP MUC where people have display names with UTF-8 characters

Expected result:
The participants list and tooltips correctly show the UTF-8 characters.

Actual result:
The UTF-8 characters appear double encoded or similar.

This might also be an issue on the UI side and not in the protocol. But because it affects both the participants list and tooltips I'm assuming it's on the protocol layer.

Attached image xmpp-utf8-ok.png

What's different about those two images? Are they different rooms or does this only happen sometimes? (Or is this a regression?)

Keywords: regression

Bug 1647252 was going to be my guess unfortunately. :(

It looks like the SAX parser in there does some Unicode stuff manually, maybe it is interfering.

Any chance you could grab some protocol logs to get the raw XML?

Regressed by: 1647252

c-c:

[30.04.21, 16:12:36 MESZ] LOG   (@ prpl-jabber: _logReceivedData resource:///modules/xmpp-xml.jsm:402)
received:
<presence xmlns="jabber:client" from="tatoeba@chat.tatoeba.org/Étienne" to="freaktechnik@jabber.lugs.ch/30547725431619791947512145" xml:lang="en" id="5edeae8670514a6189037da4e2a1ded6">
 <c xmlns="http://jabber.org/protocol/caps" ver="ZyB1liM9c9GvKOnvl61+5ScWcqw=" node="https://poez.io" hash="sha-1"/>
 <x xmlns="vcard-temp:x:update">
  <photo xmlns="vcard-temp:x:update"/>
 </x>
 <idle xmlns="urn:xmpp:idle:1" since="2021-04-13T10:09:14.786178+00:00"/>
 <occupant-id xmlns="urn:xmpp:occupant-id:0" id="wNZPCZIVQ51D/heZQpOHi0ZgHXAEQonNPaLdyzLxHWs="/>
 <x xmlns="http://jabber.org/protocol/muc#user">
  <item xmlns="http://jabber.org/protocol/muc#user" jid="ejls@ejls.fr/poezio-8kIy" affiliation="member" role="participant"/>
 </x>
</presence>

[30.04.21, 16:12:36 MESZ] DEBUG (@ prpl-jabber: onPresenceStanza resource:///modules/xmpp-base.jsm:2192)
Received presence stanza for tatoeba@chat.tatoeba.org/Étienne

release:

[4/30/21, 4:10:19 PM GMT+2] DEBUG (@ prpl-jabber: _logReceivedData resource:///modules/xmpp-xml.jsm:390)
received:
<presence xmlns="jabber:client" from="tatoeba@chat.tatoeba.org/Étienne" to="freaktechnik@jabber.lugs.ch/Thunderbird" xml:lang="en" id="5edeae8670514a6189037da4e2a1ded6">
 <c xmlns="http://jabber.org/protocol/caps" ver="ZyB1liM9c9GvKOnvl61+5ScWcqw=" node="https://poez.io" hash="sha-1"/>
 <x xmlns="vcard-temp:x:update">
  <photo xmlns="vcard-temp:x:update"/>
 </x>
 <idle xmlns="urn:xmpp:idle:1" since="2021-04-13T10:09:14.786178+00:00"/>
 <occupant-id xmlns="urn:xmpp:occupant-id:0" id="wNZPCZIVQ51D/heZQpOHi0ZgHXAEQonNPaLdyzLxHWs="/>
 <x xmlns="http://jabber.org/protocol/muc#user">
  <item xmlns="http://jabber.org/protocol/muc#user" jid="ejls@ejls.fr/poezio-8kIy" affiliation="member" role="participant"/>
 </x>
</presence>

[4/30/21, 4:10:19 PM GMT+2] DEBUG (@ prpl-jabber: onPresenceStanza resource:///modules/xmpp-base.jsm:2196)
Received presence stanza for tatoeba@chat.tatoeba.org/Étienne

Ping -- any idea what could be going on here? I suspect adding a unit test for this might help debug it.

Flags: needinfo?(remotenonsense)

Adding this decode step before writing to the parser fixes it:

new TextDecoder().decode(new Uint8Array(Array.from(data, c => c.charCodeAt(0))))
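
As a minimal sketch of what that one-liner does, assuming `data` is a byte string in which each character's code is one raw byte from the socket. The helper name `decodeByteString` and the sample input are made up for illustration and are not taken from the patch:

```javascript
// Hypothetical helper; the name is illustrative, not the one used in the patch.
function decodeByteString(data) {
  // In a byte string every character carries one raw byte (0-255) as its char code.
  const bytes = Uint8Array.from(data, c => c.charCodeAt(0));
  // TextDecoder defaults to UTF-8, so multi-byte sequences are joined back into one character.
  return new TextDecoder().decode(bytes);
}

// "É" is U+00C9, which UTF-8 encodes as the two bytes 0xC3 0x89.
const raw = "\u00C3\u0089tienne";
const decoded = decodeByteString(raw);
console.log(decoded); // Étienne
```

Plain ASCII input passes through unchanged, because every ASCII byte is already a complete UTF-8 sequence.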

IRC uses the scriptable unicode converter (see irc.jsm#805)

We seem to do something similar in the IRC code: https://searchfox.org/comm-central/source/chat/protocols/irc/irc.jsm#825-837 (I thought this happened in the socket code). So it seems the old SAX parser handled going from raw bytes (as UTF-8) -> Unicode, while the new one assumes it is getting Unicode-decoded text already?

(In reply to Patrick Cloke [:clokep] from comment #8)

So it seems the old SAX parser handled going from raw bytes (as UTF-8) -> Unicode, while the new one assumes it is getting Unicode-decoded text already?

Right, the old SAX parser gets a BinaryString (string in IDL) in OnDataAvailable and emits a UTF8 string (AString in IDL) in startElement/endElement.
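
To illustrate the mismatch, here is a small sketch of how a nickname looks when its UTF-8 bytes are passed along as if each byte were already a character. That this is exactly what the new parser received is an assumption based on the comments above, and the variable names are illustrative:

```javascript
const nick = "\u00C9tienne"; // "Étienne"
// The UTF-8 bytes as they would arrive on the wire.
const bytes = new TextEncoder().encode(nick);
// A byte string maps every byte to one UTF-16 code unit without decoding.
const byteString = String.fromCharCode(...bytes);

console.log(nick.length, byteString.length); // 7 8
console.log(byteString === nick); // false
// The single character "É" has turned into the two code units 0xC3 and 0x89.
console.log(byteString.charCodeAt(0).toString(16), byteString.charCodeAt(1).toString(16)); // c3 89
```

A parser that treats `byteString` as finished text would show the first two code units as separate characters, which would match the garbled names in the screenshot.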

(In reply to Martin Giger [:freaktechnik] from comment #7)

Adding this decode step before writing to the parser fixes it:

new TextDecoder().decode(new Uint8Array(Array.from(data, c => c.charCodeAt(0))))

I think this is the right fix, please send a patch, thanks. Can put it in onDataReceived of xmpp-session.jsm

Flags: needinfo?(remotenonsense)
Assignee: nobody → martin
Status: NEW → ASSIGNED

(In reply to Ping Chen (:rnons) from comment #9)

I think this is the right fix, please send a patch, thanks. Can put it in onDataReceived of xmpp-session.jsm

I decided to put it in xmpp-xml.jsm since that is actually tested and I didn't want to explore xmpp-session to figure out how to test it for this patch.

Target Milestone: --- → 90 Branch

Pushed by thunderbird@calypsoblue.org:
https://hg.mozilla.org/comm-central/rev/6e3047fcecc7
Decode byte string to UTF-8 for XMPP parser. r=clokep

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Comment on attachment 9219770 [details]
Bug 1708695 - Decode byte string to UTF-8 for XMPP parser. r=clokep

[Approval Request Comment]
Regression caused by (bug #): 1647252
User impact if declined: unreadable user display names
Testing completed (on c-c, etc.): tests included in patch
Risk to taking this patch (and alternatives if risky):

Attachment #9219770 - Flags: approval-comm-beta?

Comment on attachment 9219770 [details]
Bug 1708695 - Decode byte string to UTF-8 for XMPP parser. r=clokep

[Triage Comment]
Approved for beta

Attachment #9219770 - Flags: approval-comm-beta? → approval-comm-beta+