Closed Bug 1303723 Opened 8 years ago Closed 8 years ago

XMPP xml error on receiving misencoded utf-8 characters

Categories

(Chat Core :: XMPP, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(thunderbird52? affected)

RESOLVED WORKSFORME
Tracking Status
thunderbird52 ? affected

People

(Reporter: muzzlerator, Assigned: aleth)

Details

Attachments

(7 files, 2 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0
Build ID: 20160823121617

Steps to reproduce:

Try to connect to a XMPP server with user that has 400+ contacts in the roster.


Actual results:

Unexpected data error and repeated attempts to connect.

Debug log shows a XML error, caused by long roster data line:

[19.09.2016 13:07:01] DEBUG (@ prpl-jabber: Socket.onTransportStatus resource:///modules/socket.jsm:553)
onTransportStatus(STATUS_RECEIVING_FROM)
[19.09.2016 13:07:01] ERROR (@ prpl-jabber: XMPPSession.prototype.onXMLError resource:///modules/xmpp-session.jsm:236)
parse-fatal-error: Ошибка синтаксического анализа XML: некорректно
Адрес: about:blank
Строка 1, символ 33419:
<iq from='xxxxxxxxxx' to='xxxxxxxxxxxxx/Thunderbird' id='661985c0-a565-4a08-bb7c-b478ecf89ff8' type='result'><query xmlns='jabber:iq:roster'><item [16385_characters_including_utf8 long line skipped]
[19.09.2016 13:07:01] LOG   (@ prpl-jabber: Socket.sendString resource:///modules/socket.jsm:255)
Sending:
</stream:stream>
[19.09.2016 13:07:01] LOG   (@ prpl-jabber: Socket.disconnect resource:///modules/socket.jsm:185)
Disconnect

Other IMs like PSI+, QIP work well.


Expected results:

Normal operation
OS: Unspecified → Windows 7
Hardware: Unspecified → x86_64
I think there was a bug fixed recently about long rosters, but this looks different.
Component: Instant Messaging → XMPP
Product: Thunderbird → Chat Core
Version: 49 Branch → 49
But some reformatting shows that the XML looks well:

<iq from='xxx@xxx.xx' to='xxx@xxx.xx/Thunderbird' id='846bd66e-9a46-453c-92fe-784227960ab8' type='result'>
<query xmlns='jabber:iq:roster'>
<item subscription='both' name='_some_name_in_utf8' jid='yyy@xxx.xx'><group>CO</group></item>
[more_389_well_formatted_items]
</query>
</iq>
Component: XMPP → Instant Messaging
Product: Chat Core → Thunderbird
Version: 49 → 49 Branch
How can I get more verbose debug log?
Component: Instant Messaging → XMPP
Product: Thunderbird → Chat Core
Version: 49 Branch → 49
Are these lines

[16385_characters_including_utf8 long line skipped]
[more_389_well_formatted_items]

something you are inserting yourself during copy/paste or is that actually in the debug log? I've never seen it before.
Yes, i tried to compact message a bit.
Sorry for misinformation.

Looks like the reason is not in the length, but handling utf-8 encoding.
The error points in the middle of name in this record

<item subscription='both' name='ÐкаÑеÑинбÑÑг, ÐикÑлова 46' jid='yyy@xxx.ru'><group>ÐпÑеки</group></item>
(In reply to Dmitry from comment #5)
> The error points in the middle of name in this record
> 
> <item subscription='both' name='ÐкаÑеÑинбÑÑг, ÐикÑлова
> 46' jid='yyy@xxx.ru'><group>ÐпÑеки</group></item>

That looks pretty unreadable on bugzilla too. Can you attach the copy in a file? Ideally together with what the correct utf8 name would be.

Also, if the problem is encoding, you should probably also see it when sending/receiving that name in a message in a conversation. Have you tried if that works?
Attached file Debug log, as is
A bit strange.
The XMPP server produce data encoded in utf8.
The debug log contains same data, but in windows-1251 misencoded as iso-8859-1. (System's codepage is windows-1251)

And the data in the debug log is broken a bit

As is:
<item subscription='both' name='ÐаменÑк-УÑалÑÑкий, ÐобÑолÑбоÐ' jid='apt025@yyyyyyy.ru'>

Reencoded:
<item subscription='both' name='Каменск-Уральский, Добролюбо�' jid='apt025@yyyyyyy.ru'>

Must be:
<item subscription='both' name='Каменск-Уральский, Добролюбова' jid='apt025@yyyyyyy.ru'>
Sorry, 
<item subscription='both' name='Каменск-Уральский, Добролюбова 3' jid='apt025@yyyyyyy.ru'>
May be data processed by chunks, and chunk ended on first byte of utf8 character?
(In reply to Dmitry from comment #9)
> A bit strange.
> The XMPP server produce data encoded in utf8.

Strange, generally utf8 is what TB chat uses as well.

> The debug log contains same data, but in windows-1251 misencoded as
> iso-8859-1. (System's codepage is windows-1251)

Are you on Windows?
Yes, windows 7 64bit
Could you please check if
* sending "Каменск-Уральский, Добролюбова" in a XMPP chat message works fine
Summary: XMPP xml error while receiving long roster → XMPP xml error on receiving misencoded utf-8 characters
It looks to me like the existing code fixes the encoding to utf-8 on send, but not when receiving.
Status: UNCONFIRMED → NEW
Ever confirmed: true
I think the XMPP changes here are required and I'm surprised we've not run into this kind of problem before, see https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Guide/Streams#A_note_about_Unicode_strings_versus_nsIInputStream . But once that change is made, try sending the unicode name from this bug in an XMPP MUC and the fatal parse error shows up even on OSX (before making that change, it doesn't, for reasons I don't fully understand). To fix that required the changes on the socket. There is something a bit crazy about the current setup where the socket converts a stream to a string and then XMPP converts the string back to a stream of course.
Attachment #8793065 - Flags: feedback?(clokep)
Assignee: nobody → aleth
Status: NEW → ASSIGNED
Dmitry, are you able to test this patch and see if it fixes the issue for you?
Flags: needinfo?(muzzlerator)
(The socket.jsm change was based on https://developer.mozilla.org/en-US/docs/Reading_textual_data#Reading_strings and assumes the incoming text data is utf-8.)
I'm glad to test the patch, but I don't know how.
Where can I get some instructions how to do it?
(There is no such file in installed windows version, and I doubt that I could recompile it from source.)
(In reply to Dmitry from comment #19)
> I'm glad to test the patch, but I don't know how.
> Where can I get some instructions how to do it?
> (There is no such file in installed windows version, and I doubt that I
> could recompile it from source.)

You could download a Thunderbird daily build from https://ftp.mozilla.org/pub/thunderbird/nightly/latest-comm-central/ . Look for the omni.ja file. This is actually a zip file (with an unusual extension) which will contain the files mentioned in the patch. If you have the ability to modify these, you should be able to test the patch without compiling your own TB.
OK thanks. Now undo the changes to socket.jsm but keep the changes to xmpp-session.jsm. What do you get then?
Flags: needinfo?(muzzlerator)
Roster data shown as utf-8 misencoded as iso-8859-1
I'd really like to know where the iso-8859-1 is coming from :-( If the server is sending utf-8, I would have thought the WIP patch should work.
Done, the error still the same. 
But misencoding in the debug log changed from windows-1251 -> iso8895-1 to utf-8 -> iso-8859-1.
BTW misencoding affects only roster data, other data shows properly.
(In reply to Dmitry from comment #25)
> BTW misencoding affects only roster data, other data shows properly.

I'm not sure what you mean by "other data". There should be no distinction between roster data and the rest, as what is roster data can only be identified after successful parsing. If incoming messages using similar characters don't have similar encoding issues, that would point at a server-side encoding issue.
PSI+ and QIP work normally...
How can I get more verbose debug info?
May be write extended debug data from socket.jsm directly into debug file?
Two chunks from same debug log:

[19.09.2016 18:19:43] LOG   (@ prpl-jabber: XMPPParser.prototype._logReceivedData resource:///modules/xmpp-xml.jsm:325)
received:
<iq xmlns="jabber:client" from="xxxxxx.ru" to="mosin_d@xxxxxx.ru/Thunderbird" id="28de9638-ea90-4992-897e-71d023248a58" type="result">
 <query xmlns="http://jabber.org/protocol/disco#items">
  <item xmlns="http://jabber.org/protocol/disco#items" jid="conference.xxxxxx.ru"/>
  <item xmlns="http://jabber.org/protocol/disco#items" jid="pubsub.xxxxxx.ru"/>
  <item xmlns="http://jabber.org/protocol/disco#items" jid="vjud.xxxxxx.ru"/>
  <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" node="announce" name="Объявления"/>
  <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Конфигурация" node="config"/>
  <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Управление пользователями" node="user"/>
  <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Подключённые пользователи" node="online users"/>
  <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Все пользователи" node="all users"/>
  <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Исходящие s2s-соединения" node="outgoing s2s"/>
  <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Работающие узлы" node="running nodes"/>
  <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Остановленные узлы" node="stopped nodes"/>
 </query>
</iq>


[19.09.2016 18:19:43] DEBUG (@ prpl-jabber: Socket.onTransportStatus resource:///modules/socket.jsm:560)
onTransportStatus(STATUS_RECEIVING_FROM)
[19.09.2016 18:19:43] ERROR (@ prpl-jabber: XMPPSession.prototype.onXMLError resource:///modules/xmpp-session.jsm:326)
parse-fatal-error: Ошибка синтаксического анализа XML: некорректно
Адрес: about:blank
Строка 1, символ 33420:
<iq from='mosin_d@xxxxxx.ru' to='mosin_d@xxxxxx.ru/Thunderbird' id='6aa573f3-17f0-4395-83ce-6c6dc67268a2' type='result'><query xmlns='jabber:iq:roster'><item subscription='both' name='СÑÑеÑÑÑ, ÐÑджоникидзе 52' jid='apt084@yyyyyyy.ru'><group>ÐпÑеки</group></item><item subscription='both' name='ФаÑÑкÑина ÐаÑалÑÑ ÐикÑоÑовна' jid='farukshina_n@yyyyyyy.ru'><group>CO</group></item><item subscription='both' name='ÐеÑезовÑкий, СпоÑÑÐ¸Ð²Ð½Ð°Ñ 8' jid='apt195@yyyyyyy.ru'><group>ÐпÑеки</group></item><item subscription='both' name='ÐаменÑк-УÑалÑÑкий, ÐобÑолÑбоÐ' jid='apt025@yyyyyyy.ru'>
(In reply to Dmitry from comment #27)
> PSI+ and QIP work normally...

I didn't mean to say TB didn't need fixing, it obviously does. It's a matter of figuring out what exactly the problem is. At the moment it's very confusing

> How can I get more verbose debug info?
> May be write extended debug data from socket.jsm directly into debug file?

You've already modified socket.jsm, you can put additional this.DEBUG() or this.LOG() calls whereever you like.
It's really server's fault! This line is from PSI+ XML console

<item subscription="both" name="Каменск-Уральский, Добролюбо�" jid="apt025@yyyyyyy.ru">

Server sent broken data, and PSI+ somehow handles it
Ejabberd cuts "name" to some limit, and when it cuts at half of utf-8 character - oops we have a problem...
(In reply to Dmitry from comment #31)
> Ejabberd cuts "name" to some limit, and when it cuts at half of utf-8
> character - oops we have a problem...

That's interesting. Thanks for investigating! Since XMPP should always use utf-8, it would probably be a good idea to file an ejabberd bug for this (https://github.com/processone/ejabberd/issues), assuming your server isn't running some ancient version.
(In reply to Dmitry from comment #30)
> It's really server's fault! This line is from PSI+ XML console
> 
> <item subscription="both" name="Каменск-Уральский, Добролюбо�"
> jid="apt025@yyyyyyy.ru">
> 
> Server sent broken data, and PSI+ somehow handles it

I don't understand how it's possible to handle it correctly, if half a character is missing.

However, we should find a way to not fail completely with a fatal parse error. Does some version of the WIP patch here help with that at least?
(In reply to aleth [:aleth] from comment #22)
> OK thanks. Now undo the changes to socket.jsm but keep the changes to
> xmpp-session.jsm. What do you get then?
Ouch.. It's looks like xmpp-session.jsm left unpatched, the patch was rejected...
Attached file xmpp-session.jsm.rej (obsolete) —
(In reply to Dmitry from comment #35)
> Created attachment 8793442 [details]
> xmpp-session.jsm.rej

You're doing something strange when applying the patch. Those rejected changes are for socket.jsm.
To apply the patch file above, you can use hg ("hg import ...") or git ("git apply ..."), or do it by hand.
(In reply to aleth [:aleth] from comment #32)
> That's interesting. Thanks for investigating! Since XMPP should always use
> utf-8, it would probably be a good idea to file an ejabberd bug for this
> (https://github.com/processone/ejabberd/issues), assuming your server isn't
> running some ancient version.
The XMPP server is not under my control, but I'll try to get more info
Patched again, paying a bit more attention.
Patched only xmpp-session.jsm and both socket.jsm and xmpp-session.jsm show same result.
Session succesfully started, roster appears, but misencoded :(
May be I did something wrong again, I'll try to patch a fresh nightly build a bit later
Attached image patched_again.PNG (obsolete) —
BTW, a conversation displayed properly
(In reply to Dmitry from comment #40)
> BTW, a conversation displayed properly
No :(
TB    -> PSI+ - OK
PSI+  -> TB   - Wrong
Attached image conversation.PNG
Attachment #8793453 - Attachment is obsolete: true
(In reply to Dmitry from comment #41)
> (In reply to Dmitry from comment #40)
> > BTW, a conversation displayed properly
> No :(
> TB    -> PSI+ - OK
> PSI+  -> TB   - Wrong

That's probably because TB always sends utf-8...
Here's another patch for you to try - I don't know if it will make a difference.
Attachment #8793442 - Attachment is obsolete: true
After this patch:
Error: Received unexpected data


[25.09.2016 0:00:28] DEBUG (@ prpl-jabber: Socket.onTransportStatus resource:///modules/socket.jsm:560)
onTransportStatus(STATUS_RECEIVING_FROM)
[25.09.2016 0:00:28] ERROR (@ prpl-jabber: XMPPSession.prototype.onXMLError resource:///modules/xmpp-session.jsm:335)
parse-fatal-error: XML Parsing Error: not well-formed
Location: about:blank
Line Number 1, Column 37032:
<iq from=
Comment on attachment 8793065 [details] [diff] [review]
Ensure incoming UTF-8 socket data remains encoded correctly

Review of attachment 8793065 [details] [diff] [review]:
-----------------------------------------------------------------

This doesn't look crazy, but I'm confused how we could have not run into this before...
Attachment #8793065 - Flags: feedback?(clokep) → feedback+
(In reply to Patrick Cloke [:clokep] from comment #46)

> I'm confused how we could have not run into this before...

XMPP requires using UTF8 everywhere in its spec, so receiving non-UTF8 data means a broken server (if a broken client sends junk, the server is supposed to disconnect it). So that's probably why we haven't encountered it yet.
(In reply to Florian Quèze [:florian] [:flo] from comment #47)
> (In reply to Patrick Cloke [:clokep] from comment #46)
> 
> > I'm confused how we could have not run into this before...
> 
> XMPP requires using UTF8 everywhere in its spec, so receiving non-UTF8 data
> means a broken server (if a broken client sends junk, the server is supposed
> to disconnect it). So that's probably why we haven't encountered it yet.

Yes, that's exactly what seems to be happening here.

If we insist on UTF8 conversion (as the WIP patch does) the advantage is the account no longer chokes on a fatal parse error and disconnects. Unfortunately the disadvantage is the result may still be suboptimal (see screenshots provided).

I'm not completely sure what the right thing to do here is (or if there is some better third way I'm missing).
(In reply to Patrick Cloke [:clokep] from comment #46)
> but I'm confused how we could have not run into this before...
I'm "lucky" 

> XMPP requires using UTF8 everywhere in its spec, so receiving non-UTF8 data
But ejabberd sends data in UTF8 (sometimes with broken symbols)
How can I get a raw XML received from server? 
Or full XML conversation between server and TB.
(I'm not very experienced programmer)

BTW, does the TB send xml:lang in its queries?
And what language used as default?


I wonder, how can other im clients (like psi+, qip, miranda, miranda-ng) handle this weird situation.
All of them succesfully work with this particular xmpp server.
(In reply to Dmitry from comment #49)
> > XMPP requires using UTF8 everywhere in its spec, so receiving non-UTF8 data
> But ejabberd sends data in UTF8 (sometimes with broken symbols)

Comment 30 shows it's not valid UTF-8. I think your guess in comment 31 is likely correct - ejabberd enforces a length limit in bytes rather than characters.

I filed an ejabberd bug asking for feedback, https://github.com/processone/ejabberd/issues/1312

In principle TB is doing the right thing by disconnecting with a fatal parse error, because the RFC says:
"If one party to an XML stream detects that the other party has attempted to send XML data with an encoding other than UTF-8, it MUST close the stream with a stream error, which SHOULD be <unsupported-encoding/> (Section 4.9.3.22), although some existing implementations send <bad-format/> (Section 4.9.3.1) instead."
https://xmpp.org/rfcs/rfc6120.html#xml-encoding

That doesn't mean it wouldn't be nice to do better ;)

> How can I get a raw XML received from server? 

I'm not sure what you mean. Do you mean the binary data?

> Or full XML conversation between server and TB.

That's what the debug log gives you. Of course the log is utf-8 text so what you see depends on the encoding/decoding assumptions made.

> BTW, does the TB send xml:lang in its queries?

No.
After upgrade ejabberd, and tinkering with configs etc:

1. Wrong encoding - ejabberd's fault. 
Encoding has been fixed after changing language from 'ru' to 'en' in it's config. 

2. Broken utf-8 characters - openldap's fault, it breaks some characters for an unknown reason.

Sorry for wasting your time.
(In reply to Dmitry from comment #51)
> After upgrade ejabberd, and tinkering with configs etc:
> 
> 1. Wrong encoding - ejabberd's fault. 
> Encoding has been fixed after changing language from 'ru' to 'en' in it's
> config. 
> 
> 2. Broken utf-8 characters - openldap's fault, it breaks some characters for
> an unknown reason.
> 
> Sorry for wasting your time.

Thanks for letting us know!
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: