Closed
Bug 1303723
Opened 8 years ago
Closed 8 years ago
XMPP xml error on receiving misencoded utf-8 characters
Categories
(Chat Core :: XMPP, defect)
Tracking
(thunderbird52? affected)
RESOLVED
WORKSFORME
People
(Reporter: muzzlerator, Assigned: aleth)
Details
Attachments
(7 files, 2 obsolete files)
User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0 Build ID: 20160823121617 Steps to reproduce: Try to connect to a XMPP server with user that has 400+ contacts in the roster. Actual results: Unexpected data error and repeated attempts to connect. Debug log shows a XML error, caused by long roster data line: [19.09.2016 13:07:01] DEBUG (@ prpl-jabber: Socket.onTransportStatus resource:///modules/socket.jsm:553) onTransportStatus(STATUS_RECEIVING_FROM) [19.09.2016 13:07:01] ERROR (@ prpl-jabber: XMPPSession.prototype.onXMLError resource:///modules/xmpp-session.jsm:236) parse-fatal-error: Ошибка синтаксического анализа XML: некорректно Адрес: about:blank Строка 1, символ 33419: <iq from='xxxxxxxxxx' to='xxxxxxxxxxxxx/Thunderbird' id='661985c0-a565-4a08-bb7c-b478ecf89ff8' type='result'><query xmlns='jabber:iq:roster'><item [16385_characters_including_utf8 long line skipped] [19.09.2016 13:07:01] LOG (@ prpl-jabber: Socket.sendString resource:///modules/socket.jsm:255) Sending: </stream:stream> [19.09.2016 13:07:01] LOG (@ prpl-jabber: Socket.disconnect resource:///modules/socket.jsm:185) Disconnect Other IMs like PSI+, QIP work well. Expected results: Normal operation
Assignee | ||
Comment 1•8 years ago
|
||
I think there was a bug fixed recently about long rosters, but this looks different.
Component: Instant Messaging → XMPP
Product: Thunderbird → Chat Core
Version: 49 Branch → 49
But some reformatting shows that the XML looks well: <iq from='xxx@xxx.xx' to='xxx@xxx.xx/Thunderbird' id='846bd66e-9a46-453c-92fe-784227960ab8' type='result'> <query xmlns='jabber:iq:roster'> <item subscription='both' name='_some_name_in_utf8' jid='yyy@xxx.xx'><group>CO</group></item> [more_389_well_formatted_items] </query> </iq>
Component: XMPP → Instant Messaging
Product: Chat Core → Thunderbird
Version: 49 → 49 Branch
Assignee | ||
Updated•8 years ago
|
Component: Instant Messaging → XMPP
Product: Thunderbird → Chat Core
Version: 49 Branch → 49
Assignee | ||
Comment 4•8 years ago
|
||
Are these lines [16385_characters_including_utf8 long line skipped] [more_389_well_formatted_items] something you are inserting yourself during copy/paste or is that actually in the debug log? I've never seen it before.
Yes, i tried to compact message a bit. Sorry for misinformation. Looks like the reason is not in the length, but handling utf-8 encoding. The error points in the middle of name in this record <item subscription='both' name='ÐкаÑеÑинбÑÑг, ÐикÑлова 46' jid='yyy@xxx.ru'><group>ÐпÑеки</group></item>
Assignee | ||
Comment 6•8 years ago
|
||
(In reply to Dmitry from comment #5) > The error points in the middle of name in this record > > <item subscription='both' name='ÐкаÑеÑинбÑÑг, ÐикÑлова > 46' jid='yyy@xxx.ru'><group>ÐпÑеки</group></item> That looks pretty unreadable on bugzilla too. Can you attach the copy in a file? Ideally together with what the correct utf8 name would be. Also, if the problem is encoding, you should probably also see it when sending/receiving that name in a message in a conversation. Have you tried if that works?
A bit strange. The XMPP server produce data encoded in utf8. The debug log contains same data, but in windows-1251 misencoded as iso-8859-1. (System's codepage is windows-1251) And the data in the debug log is broken a bit As is: <item subscription='both' name='ÐаменÑк-УÑалÑÑкий, ÐобÑолÑбоÐ' jid='apt025@yyyyyyy.ru'> Reencoded: <item subscription='both' name='Каменск-Уральский, Добролюбо�' jid='apt025@yyyyyyy.ru'> Must be: <item subscription='both' name='Каменск-Уральский, Добролюбова' jid='apt025@yyyyyyy.ru'>
Reporter | ||
Comment 10•8 years ago
|
||
Sorry, <item subscription='both' name='Каменск-Уральский, Добролюбова 3' jid='apt025@yyyyyyy.ru'>
Reporter | ||
Comment 11•8 years ago
|
||
May be data processed by chunks, and chunk ended on first byte of utf8 character?
Assignee | ||
Comment 12•8 years ago
|
||
(In reply to Dmitry from comment #9) > A bit strange. > The XMPP server produce data encoded in utf8. Strange, generally utf8 is what TB chat uses as well. > The debug log contains same data, but in windows-1251 misencoded as > iso-8859-1. (System's codepage is windows-1251) Are you on Windows?
Reporter | ||
Comment 13•8 years ago
|
||
Yes, windows 7 64bit
Assignee | ||
Comment 14•8 years ago
|
||
Could you please check if * sending "Каменск-Уральский, Добролюбова" in a XMPP chat message works fine
Summary: XMPP xml error while receiving long roster → XMPP xml error on receiving misencoded utf-8 characters
Assignee | ||
Comment 15•8 years ago
|
||
It looks to me like the existing code fixes the encoding to utf-8 on send, but not when receiving.
Assignee | ||
Updated•8 years ago
|
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee | ||
Comment 16•8 years ago
|
||
I think the XMPP changes here are required and I'm surprised we've not run into this kind of problem before, see https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Guide/Streams#A_note_about_Unicode_strings_versus_nsIInputStream . But once that change is made, try sending the unicode name from this bug in an XMPP MUC and the fatal parse error shows up even on OSX (before making that change, it doesn't, for reasons I don't fully understand). To fix that required the changes on the socket. There is something a bit crazy about the current setup where the socket converts a stream to a string and then XMPP converts the string back to a stream of course.
Attachment #8793065 -
Flags: feedback?(clokep)
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → aleth
Status: NEW → ASSIGNED
Assignee | ||
Comment 17•8 years ago
|
||
Dmitry, are you able to test this patch and see if it fixes the issue for you?
Flags: needinfo?(muzzlerator)
Assignee | ||
Comment 18•8 years ago
|
||
(The socket.jsm change was based on https://developer.mozilla.org/en-US/docs/Reading_textual_data#Reading_strings and assumes the incoming text data is utf-8.)
Assignee | ||
Updated•8 years ago
|
status-thunderbird52:
--- → affected
tracking-thunderbird52:
--- → ?
Reporter | ||
Comment 19•8 years ago
|
||
I'm glad to test the patch, but I don't know how. Where can I get some instructions how to do it? (There is no such file in installed windows version, and I doubt that I could recompile it from source.)
Assignee | ||
Comment 20•8 years ago
|
||
(In reply to Dmitry from comment #19) > I'm glad to test the patch, but I don't know how. > Where can I get some instructions how to do it? > (There is no such file in installed windows version, and I doubt that I > could recompile it from source.) You could download a Thunderbird daily build from https://ftp.mozilla.org/pub/thunderbird/nightly/latest-comm-central/ . Look for the omni.ja file. This is actually a zip file (with an unusual extension) which will contain the files mentioned in the patch. If you have the ability to modify these, you should be able to test the patch without compiling your own TB.
Reporter | ||
Comment 21•8 years ago
|
||
Assignee | ||
Comment 22•8 years ago
|
||
OK thanks. Now undo the changes to socket.jsm but keep the changes to xmpp-session.jsm. What do you get then?
Flags: needinfo?(muzzlerator)
Reporter | ||
Comment 23•8 years ago
|
||
Roster data shown as utf-8 misencoded as iso-8859-1
Assignee | ||
Comment 24•8 years ago
|
||
I'd really like to know where the iso-8859-1 is coming from :-( If the server is sending utf-8, I would have thought the WIP patch should work.
Reporter | ||
Comment 25•8 years ago
|
||
Done, the error still the same. But misencoding in the debug log changed from windows-1251 -> iso8895-1 to utf-8 -> iso-8859-1. BTW misencoding affects only roster data, other data shows properly.
Assignee | ||
Comment 26•8 years ago
|
||
(In reply to Dmitry from comment #25) > BTW misencoding affects only roster data, other data shows properly. I'm not sure what you mean by "other data". There should be no distinction between roster data and the rest, as what is roster data can only be identified after successful parsing. If incoming messages using similar characters don't have similar encoding issues, that would point at a server-side encoding issue.
Reporter | ||
Comment 27•8 years ago
|
||
PSI+ and QIP work normally... How can I get more verbose debug info? May be write extended debug data from socket.jsm directly into debug file?
Reporter | ||
Comment 28•8 years ago
|
||
Two chunks from same debug log: [19.09.2016 18:19:43] LOG (@ prpl-jabber: XMPPParser.prototype._logReceivedData resource:///modules/xmpp-xml.jsm:325) received: <iq xmlns="jabber:client" from="xxxxxx.ru" to="mosin_d@xxxxxx.ru/Thunderbird" id="28de9638-ea90-4992-897e-71d023248a58" type="result"> <query xmlns="http://jabber.org/protocol/disco#items"> <item xmlns="http://jabber.org/protocol/disco#items" jid="conference.xxxxxx.ru"/> <item xmlns="http://jabber.org/protocol/disco#items" jid="pubsub.xxxxxx.ru"/> <item xmlns="http://jabber.org/protocol/disco#items" jid="vjud.xxxxxx.ru"/> <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" node="announce" name="Объявления"/> <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Конфигурация" node="config"/> <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Управление пользователями" node="user"/> <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Подключённые пользователи" node="online users"/> <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Все пользователи" node="all users"/> <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Исходящие s2s-соединения" node="outgoing s2s"/> <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Работающие узлы" node="running nodes"/> <item xmlns="http://jabber.org/protocol/disco#items" jid="xxxxxx.ru" name="Остановленные узлы" node="stopped nodes"/> </query> </iq> [19.09.2016 18:19:43] DEBUG (@ prpl-jabber: Socket.onTransportStatus resource:///modules/socket.jsm:560) onTransportStatus(STATUS_RECEIVING_FROM) [19.09.2016 18:19:43] ERROR (@ prpl-jabber: XMPPSession.prototype.onXMLError resource:///modules/xmpp-session.jsm:326) parse-fatal-error: Ошибка синтаксического анализа XML: некорректно Адрес: about:blank Строка 1, символ 33420: <iq from='mosin_d@xxxxxx.ru' to='mosin_d@xxxxxx.ru/Thunderbird' id='6aa573f3-17f0-4395-83ce-6c6dc67268a2' type='result'><query xmlns='jabber:iq:roster'><item subscription='both' name='СÑÑеÑÑÑ, ÐÑджоникидзе 52' jid='apt084@yyyyyyy.ru'><group>ÐпÑеки</group></item><item subscription='both' name='ФаÑÑкÑина ÐаÑалÑÑ ÐикÑоÑовна' jid='farukshina_n@yyyyyyy.ru'><group>CO</group></item><item subscription='both' name='ÐеÑезовÑкий, СпоÑÑÐ¸Ð²Ð½Ð°Ñ 8' jid='apt195@yyyyyyy.ru'><group>ÐпÑеки</group></item><item subscription='both' name='ÐаменÑк-УÑалÑÑкий, ÐобÑолÑбоÐ' jid='apt025@yyyyyyy.ru'>
Assignee | ||
Comment 29•8 years ago
|
||
(In reply to Dmitry from comment #27) > PSI+ and QIP work normally... I didn't mean to say TB didn't need fixing, it obviously does. It's a matter of figuring out what exactly the problem is. At the moment it's very confusing > How can I get more verbose debug info? > May be write extended debug data from socket.jsm directly into debug file? You've already modified socket.jsm, you can put additional this.DEBUG() or this.LOG() calls whereever you like.
Reporter | ||
Comment 30•8 years ago
|
||
It's really server's fault! This line is from PSI+ XML console <item subscription="both" name="Каменск-Уральский, Добролюбо�" jid="apt025@yyyyyyy.ru"> Server sent broken data, and PSI+ somehow handles it
Reporter | ||
Comment 31•8 years ago
|
||
Ejabberd cuts "name" to some limit, and when it cuts at half of utf-8 character - oops we have a problem...
Assignee | ||
Comment 32•8 years ago
|
||
(In reply to Dmitry from comment #31) > Ejabberd cuts "name" to some limit, and when it cuts at half of utf-8 > character - oops we have a problem... That's interesting. Thanks for investigating! Since XMPP should always use utf-8, it would probably be a good idea to file an ejabberd bug for this (https://github.com/processone/ejabberd/issues), assuming your server isn't running some ancient version.
Assignee | ||
Comment 33•8 years ago
|
||
(In reply to Dmitry from comment #30) > It's really server's fault! This line is from PSI+ XML console > > <item subscription="both" name="Каменск-Уральский, Добролюбо�" > jid="apt025@yyyyyyy.ru"> > > Server sent broken data, and PSI+ somehow handles it I don't understand how it's possible to handle it correctly, if half a character is missing. However, we should find a way to not fail completely with a fatal parse error. Does some version of the WIP patch here help with that at least?
Reporter | ||
Comment 34•8 years ago
|
||
(In reply to aleth [:aleth] from comment #22) > OK thanks. Now undo the changes to socket.jsm but keep the changes to > xmpp-session.jsm. What do you get then? Ouch.. It's looks like xmpp-session.jsm left unpatched, the patch was rejected...
Reporter | ||
Comment 35•8 years ago
|
||
Assignee | ||
Comment 36•8 years ago
|
||
(In reply to Dmitry from comment #35) > Created attachment 8793442 [details] > xmpp-session.jsm.rej You're doing something strange when applying the patch. Those rejected changes are for socket.jsm. To apply the patch file above, you can use hg ("hg import ...") or git ("git apply ..."), or do it by hand.
Reporter | ||
Comment 37•8 years ago
|
||
(In reply to aleth [:aleth] from comment #32) > That's interesting. Thanks for investigating! Since XMPP should always use > utf-8, it would probably be a good idea to file an ejabberd bug for this > (https://github.com/processone/ejabberd/issues), assuming your server isn't > running some ancient version. The XMPP server is not under my control, but I'll try to get more info
Reporter | ||
Comment 38•8 years ago
|
||
Patched again, paying a bit more attention. Patched only xmpp-session.jsm and both socket.jsm and xmpp-session.jsm show same result. Session succesfully started, roster appears, but misencoded :( May be I did something wrong again, I'll try to patch a fresh nightly build a bit later
Reporter | ||
Comment 39•8 years ago
|
||
Reporter | ||
Comment 40•8 years ago
|
||
BTW, a conversation displayed properly
Reporter | ||
Comment 41•8 years ago
|
||
(In reply to Dmitry from comment #40) > BTW, a conversation displayed properly No :( TB -> PSI+ - OK PSI+ -> TB - Wrong
Reporter | ||
Comment 42•8 years ago
|
||
Attachment #8793453 -
Attachment is obsolete: true
Assignee | ||
Comment 43•8 years ago
|
||
(In reply to Dmitry from comment #41) > (In reply to Dmitry from comment #40) > > BTW, a conversation displayed properly > No :( > TB -> PSI+ - OK > PSI+ -> TB - Wrong That's probably because TB always sends utf-8...
Assignee | ||
Comment 44•8 years ago
|
||
Here's another patch for you to try - I don't know if it will make a difference.
Assignee | ||
Updated•8 years ago
|
Attachment #8793442 -
Attachment is obsolete: true
Reporter | ||
Comment 45•8 years ago
|
||
After this patch: Error: Received unexpected data [25.09.2016 0:00:28] DEBUG (@ prpl-jabber: Socket.onTransportStatus resource:///modules/socket.jsm:560) onTransportStatus(STATUS_RECEIVING_FROM) [25.09.2016 0:00:28] ERROR (@ prpl-jabber: XMPPSession.prototype.onXMLError resource:///modules/xmpp-session.jsm:335) parse-fatal-error: XML Parsing Error: not well-formed Location: about:blank Line Number 1, Column 37032: <iq from=
Comment 46•8 years ago
|
||
Comment on attachment 8793065 [details] [diff] [review] Ensure incoming UTF-8 socket data remains encoded correctly Review of attachment 8793065 [details] [diff] [review]: ----------------------------------------------------------------- This doesn't look crazy, but I'm confused how we could have not run into this before...
Attachment #8793065 -
Flags: feedback?(clokep) → feedback+
Comment 47•8 years ago
|
||
(In reply to Patrick Cloke [:clokep] from comment #46) > I'm confused how we could have not run into this before... XMPP requires using UTF8 everywhere in its spec, so receiving non-UTF8 data means a broken server (if a broken client sends junk, the server is supposed to disconnect it). So that's probably why we haven't encountered it yet.
Assignee | ||
Comment 48•8 years ago
|
||
(In reply to Florian Quèze [:florian] [:flo] from comment #47) > (In reply to Patrick Cloke [:clokep] from comment #46) > > > I'm confused how we could have not run into this before... > > XMPP requires using UTF8 everywhere in its spec, so receiving non-UTF8 data > means a broken server (if a broken client sends junk, the server is supposed > to disconnect it). So that's probably why we haven't encountered it yet. Yes, that's exactly what seems to be happening here. If we insist on UTF8 conversion (as the WIP patch does) the advantage is the account no longer chokes on a fatal parse error and disconnects. Unfortunately the disadvantage is the result may still be suboptimal (see screenshots provided). I'm not completely sure what the right thing to do here is (or if there is some better third way I'm missing).
Reporter | ||
Comment 49•8 years ago
|
||
(In reply to Patrick Cloke [:clokep] from comment #46) > but I'm confused how we could have not run into this before... I'm "lucky" > XMPP requires using UTF8 everywhere in its spec, so receiving non-UTF8 data But ejabberd sends data in UTF8 (sometimes with broken symbols) How can I get a raw XML received from server? Or full XML conversation between server and TB. (I'm not very experienced programmer) BTW, does the TB send xml:lang in its queries? And what language used as default? I wonder, how can other im clients (like psi+, qip, miranda, miranda-ng) handle this weird situation. All of them succesfully work with this particular xmpp server.
Assignee | ||
Comment 50•8 years ago
|
||
(In reply to Dmitry from comment #49) > > XMPP requires using UTF8 everywhere in its spec, so receiving non-UTF8 data > But ejabberd sends data in UTF8 (sometimes with broken symbols) Comment 30 shows it's not valid UTF-8. I think your guess in comment 31 is likely correct - ejabberd enforces a length limit in bytes rather than characters. I filed an ejabberd bug asking for feedback, https://github.com/processone/ejabberd/issues/1312 In principle TB is doing the right thing by disconnecting with a fatal parse error, because the RFC says: "If one party to an XML stream detects that the other party has attempted to send XML data with an encoding other than UTF-8, it MUST close the stream with a stream error, which SHOULD be <unsupported-encoding/> (Section 4.9.3.22), although some existing implementations send <bad-format/> (Section 4.9.3.1) instead." https://xmpp.org/rfcs/rfc6120.html#xml-encoding That doesn't mean it wouldn't be nice to do better ;) > How can I get a raw XML received from server? I'm not sure what you mean. Do you mean the binary data? > Or full XML conversation between server and TB. That's what the debug log gives you. Of course the log is utf-8 text so what you see depends on the encoding/decoding assumptions made. > BTW, does the TB send xml:lang in its queries? No.
Reporter | ||
Comment 51•8 years ago
|
||
After upgrade ejabberd, and tinkering with configs etc: 1. Wrong encoding - ejabberd's fault. Encoding has been fixed after changing language from 'ru' to 'en' in it's config. 2. Broken utf-8 characters - openldap's fault, it breaks some characters for an unknown reason. Sorry for wasting your time.
Assignee | ||
Comment 52•8 years ago
|
||
(In reply to Dmitry from comment #51) > After upgrade ejabberd, and tinkering with configs etc: > > 1. Wrong encoding - ejabberd's fault. > Encoding has been fixed after changing language from 'ru' to 'en' in it's > config. > > 2. Broken utf-8 characters - openldap's fault, it breaks some characters for > an unknown reason. > > Sorry for wasting your time. Thanks for letting us know!
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
You need to log in
before you can comment on or make changes to this bug.
Description
•