115.1 hangs randomly "not responding" when automatically checking multiple accounts for new messages. Workaround stagger mail check interval (pop)
Categories
(MailNews Core :: Networking: POP, defect)
Tracking
(thunderbird_esr128 affected)
Tracking | Status | |
---|---|---|
thunderbird_esr128 | --- | affected |
People
(Reporter: johnbast, Unassigned)
References
Details
(Keywords: hang, perf, Whiteboard: version 115)
Attachments
(5 files)
For a description see bug #1847137.
v128 version 128 seems to reduce the number of hangs, but it still becomes unresponsive in the same way. No slowness at that time. I have not been using v128.1.0 long enough.
There are multiple pop3 and imap accounts.
Presumably there is a race condition causing the nonresponsiveness. Making the code thread safe could probably solve such problems at the root.
To be clear: this is a hang, not a slowness issue. It does not recover without killing the process.
Updated•6 months ago
|
Comment 2•6 months ago
|
||
There are multiple pop3 and imap accounts.
How many of each?
For a description see bug #1847137.
v128 version 128 seems to reduce the number of hangs, but it still becomes unresponsive in the same way. No slowness at that time. I have not been using v128.1.0 long enough.
So this started in version 115 or prior
Workaround stagger mail check interval
Please describe in detail what you did to reduce the condition.
Presumably there is a race condition causing the nonresponsiveness. Making the code thread safe could probably solve such problems at the root.
Please don't speculate unless you have technical knowledge of the code for making such a statement - lest it add confusion.
Comment 3•6 months ago
|
||
(In reply to John from comment #1)
To be clear: this is a hang, not a slowness issue. It does not recover without killing the process.
Should it be of some interest, something similar began happening in my TB 128esr just after upgrading from 115 (I'm feeding multiple frequently checked pop3 accounts into one local Inbox) on macos 14.4: but in my case, simply waiting (a lot) is solving.
(In reply to Wayne Mery (:wsmwk) from comment #2)
There are multiple pop3 and imap accounts.
How many of each?
From prefs.js:
count of .type", "imap");
is 8
count of .type", "pop3");
is 17
mail.server.server#
last number for # is 27
mail.account.account#
last number for # is 43
Workaround stagger mail check interval
Please describe in detail what you did to reduce the condition.
I didn't do anything with intend to reduce the number of occurrences.
However, since my last post, there were about 6 hangs. That's the same level as in previous versions.
Comment 5•6 months ago
|
||
Issac,
Sorry to bother you again, but need some more info from you. Have you updated to TB 128 yet? If you have, are you able to check mail without staggering the check mail intervals? If so, are you seeing any major slow-downs or hangs as are reported here? I remember that you have a lot of POP3 accounts that you check for new mail on a timed basis.
Thank a lot (in advance)! --gene
Network (tcp, udp, app layer) timeouts, errors and such could be a factor. Can someone explain how the network traffic is performed?
Are DNS lookups blocking or asynchronous? Is DNS cached at program level? Does it use Windows DNSAPI? With or without synchronous (e.g. DnsQuery_A) or asynchronous (DnsQueryEx)?
Is network traffic blocking? How are tcp resends, tcp resets, timeouts handled?
In software, hangs can occur when there is a lock that is not released (e.g. one lock waiting for a release). Are locks used? Does it use callbacks or threads?
Are retrieval cycles done one after another (no preset times) or are they done at certain time intervals?
How are overlapping requests handled? pop3 queries can be set to query once per minute, if the network is unavailable for longer than a minute or longer than the duration of a full retrieval cycle, does that stack requests or are new queries halted until a full retrieve cycle is performed? Or is this done some other way?
I think clearing this up can help getting this problem fixed.
I captured a large amount of network traffic and noticed that one of the servers is returning * BYE
from port 143 (imap) after establishing the connection. It sends this BYE reply because it has run out of disk space.
Thunderbird sends a 15 capability
request after receiving a FIN ACK packet from the server. It stops sending after the server sends a RST ACK. There are two near simultaneous connections with the same sequence of 8 packets. Theoretically, it could be querying the same account twice or two accounts on the same server, that is not visible in this network traffic. However, I think there are two connections for the same account because the request number of capability is the same.
Earlier connections more or less look the same and often have the same problem.
I wonder if these connections are handled and finished correctly by Thunderbird.
Because there appear to be two near simultaneous requests, one request may be locking or blocking data the other request does not expect. Or if there is no locking, there probably should be, because the mail client needs time to finish before another request attempts to access or change the same data. Also see this.
Comment 8•6 months ago
|
||
TB just uses the network facilities provided by firefox (mozilla) and does not roll its own. POP3 uses async JS and network calls which is single thread and callback based. IMAP uses separate threads in c++ for for each mailbox/tcp-ip connection, up to 5 thread (default) per server.
It sends this BYE reply because it has run out of disk space.
I assume this is an IMAP server sending the bye? When TB starts up, it does make 2 connections to the imap server. One is used for "folder discovery" and the other just does a "select" on Inbox. These run in parallel in their own threads.
There is a setting in "Server settings"/"Advanced" where you can set the max number of "cached" IMAP connections (equal to imap threads) that can occur for an imap account/server. It defaults to 5.
How do you know the server is out of disk space? Can you see the server logs? What type of server(s) are the problem occurring with? E.g., dovecot, cyrus, gmail (or homegrown)? Both imap and POP3?
Anyhow, if you can provide a IMAP:4 log it might show something. See this for details: https://wiki.mozilla.org/MailNews:Logging
Another possibility is if you can provide me with a temporary test account on the problem server(s) so I can see for myself what is happening.
Comment 9•6 months ago
•
|
||
If you record the IMAP:4 log, it might be better to "disable" other IMAP accounts so only the info for a "problem" account is included. To disable the others, uncheck/unselect, in Server settings, "Check for new messages at startup" and "Check for new messages every X minutes", then restart TB and record the log.
Edit: Don't click in any of the "disabled" accounts while recording the log. This would cause data to still be recorded for the "disabled" accounts.
Reporter | ||
Comment 10•6 months ago
|
||
(In reply to gene smith from comment #8)
Thanks for the information. After the disk problem was fixed and the imap server was restarted, hangs haven't occurred again.
The IMAP server sends * BYE
appended by a text message explaining that it's out of disk space . There is no pop account on the server. The server runs on a local network. It's not accessible from internet and the licenced software is not available.
It should be possible to emulate the server network behaviour though. I will attempt to find out if it is possible to anonymize and share part of the capture. There are capture replay tools.
It took 92 minutes after starting the capture until the hang occurred. I haven't counted them, but it seems that many connections were set up by Thunderbird, too many to count. Intervals are not always as expected: sometimes there is one at 27 seconds, when 60 seconds is expected.
Comment 11•6 months ago
|
||
Based on what you describe (and your email addr) I suspect you are running the local/proprietary server for PM. A few years ago I installed a "trial" version of that for another bug. Of course the trial period is long expired.
It should be possible to emulate the server network behaviour though. I will attempt to find out if it is possible to anonymize and share part of the capture. There are capture replay tools.
It would probably be better to share the IMAP:4 log recorded by TB rather than the capture from "replay tools". You can redact anything in the IMAP:4 log that you want, no problem.
Reporter | ||
Comment 12•6 months ago
|
||
(In reply to gene smith from comment #11)
It should be possible to emulate the server network behaviour though. I will attempt to find out if it is possible to anonymize and share part of the capture. There are capture replay tools.
It would probably be better to share the IMAP:4 log recorded by TB rather than the capture from "replay tools". You can redact anything in the IMAP:4 log that you want, no problem.
The hang has not occurred since the out of disk space problem was solved. The network capture is all I have at the moment when the hang did happen.
I have found another way to have the server greet with * BYE
, and the network package sequence results are similar. Just this server is requested, no other servers (imap or pop3). There was no hang in this situation. I cannot keep the server running in this state, I need mail to work.
Issues found:
- client does not process
* BYE
at greeting level before sending IMAP capability request (see RFC 9051 3.4 flow diagram) MOZ_LOG
output is garbled: original CR LF line endings in communication lines are transformed to CR CR LF in the log.
Reporter | ||
Comment 13•6 months ago
|
||
MOZ_LOG=IMAP:4 binary
[] -> replaced strings
Updated•6 months ago
|
Comment 14•6 months ago
|
||
I've never seen a server allow a connection and then immediately respond with "Bye". There is code in TB to respond to BYE but I don't think it is doing the right thing at the "greeting" point. It should just treat this like a unreachable or timed out connection request to the server rather than just continuing on and attempt a "capability" (over and over like in your log).
I think it is, but is the server dropping the connection after sending the "Bye"? https://www.rfc-editor.org/rfc/rfc3501.html#section-7.1.5 says:
4) as one of three possible greetings at connection startup,
indicating that the server is not willing to accept a
connection from this client. The server closes the
connection immediately.
I would say it probably is since TB goes ahead and tries to do the capability but then sees that the network connection is down and tries again.
Reporter | ||
Comment 15•6 months ago
|
||
The Wireshark screenshot shows one tcp packet stream, edited to show server and client packets.
Reporter | ||
Comment 16•6 months ago
|
||
I marked the LOG file expressly as binary so it arrives in its original form without changing line endings.
The server side does what you quote from section 7.1.5. The chart (probably incorrectly) shows a client logout, which is not consistent with this text.
The packet from the server after BYE has FIN ACK (see comment 7), that means that the server is ending the connection. From the Thunderbird LOG I think I can see that there are two threads started that use the same tcp connection. I think one sends an ACK as a response to FIN ACK the other thread doesn't check or know and sends a capability request.
Updated•6 months ago
|
Comment 17•6 months ago
•
|
||
Each imap thread handles one connection. What comment 15 image shows is TB making the connection successfully and then server responds with BYE then the server closes the connection (sends FIN). The mozilla framework then ACKs the FIN but TB doesn't know about this. Then TB tries to send the CAPABILITY on the same now closed connection and the server responds with RST. The fact that TB doesn't know about a connection down in IMAP code until something is attempted to be sent was somewhat surprising to me when I learned about it some time back; but, it "simplifies" the IMAP handling since it becomes pretty much synchronous (code in thread sends imap command and waits for a response).
Anyhow, I'll try to find a fix for the "unexpected" BYE response for the initial greeting.
I'm think you are saying now that you have fixed the server so there is no initial BYE that TB is now working OK again?
I marked the LOG file expressly as binary so it arrives in its original form without changing line endings.
I've never noticed a problem with line endings using my editors (vim or gvim). Looking with a binary "hexdump" program at logs I record here, I see the lines that are imap request or responses end with "0d0a" (as required by RFC) while the other lines end with just "0a" which is the standard linux/unix line ending. I'm not seeing "0d0d0a" in log files I produce here on linux.
Anyhow, I'll set your log attachment back to binary and see what I see.
:
Ok, I see the 0d0d0a on imap lines but don't know why. File still looks OK in gvim.
Edit: P/S: Tried another editor that I seldom use (kate) and it shows a blank line after each line with the 0d0d0a.
Reporter | ||
Comment 18•6 months ago
|
||
Password protected ZIP file, can be opened using a ZIP decompression tool.
Windows does not support this encryption format.
Contents not for publication.
Contains an edited version of packet traffic between Thunderbird and an IMAP server that is out of disk space and greets with * BYE
. Some time after the last exchange, Thunderbird hanged and did not recover. MAC, IP, greeting, were anonymized, checksums were recalculated.
Reporter | ||
Comment 19•6 months ago
|
||
The logging line ending problem is likely due to log construction error. The original packets from server and client have the correct RFC compliant CR LF (0x0D 0x0A) line ending. One scenario is that it is assumed that line ending is just one character long (Linux: \n
or 0x0A). On Windows, if the log writes 0x0D 0x0A at position stringlen-1, it would create a bad line ending of 0x0D 0x0D 0x0A. This is of course cosmetic and probably unrelated to actual network traffic.
Comment 20•6 months ago
|
||
I've started a "try" build here (https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=8c7c4f22e8da9c153380513e9d764cad227cbeed) based on the current TB daily. I'm assuming you are running windows or linux so, if it finishes OK, you can click the green B beside the platform you have and down below, under the "artifacts" column, you can find an installer for windows or a target.tar.bz2 for linux you can unzip and run "thunderbird" inside.
The fix now handles the possibility of a BYE response to the greeting. I've been unable to find a way to cause my test servers (cyrus and dovecot) to send the BYE in the initial greeting, so I'm asking you to test this with your server since you say you can cause BYE greeting to occur.
I've tried to simulate the BYE but the server doesn't disconnect so my test isn't really valid. You should just see a pop-up message about the BYE and then TB should do nothing more until you select a folder or another check for new messages occurs. It should not just keep trying do CAPABILITY requests as shown in your .pcap trace.
Comment 21•6 months ago
|
||
John, I re-ran the try build for all platforms since I don't know what you are using. Again, click the green "B" by your platform and look under "Artifacts and Debugging" to find the appropriate installation file:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=a13d081957bb4496ca12a678848e2d49dbb0bfaa
Reporter | ||
Comment 22•6 months ago
|
||
Win32 client offline installer (or zip) is preferred. I will have to image and reconfigure the server and client environments.
It's not difficult to create a server that just sends * BYE
and then disconnects in for example Python, but in one I tried the packet sequence is not the same (introduces PSH) and the timing is off. In the 'hang' packets, the IMAP server was extremely quick in sending * BYE
and then immediately the FIN, ACK packet. Wireshark shows the same time in millionth of a second.
Reporter | ||
Comment 23•6 months ago
|
||
Attached are server scripts to simulate the start of an imap connection.
Some time after running a test account that connects to 'imap' at 127.0.0.2 to the simple Python server and a ws2 server in C for Windows, Thunderbird v128.1 (regular version) hangs again. The attached server scripts really are quick & very dirty and are not functional or well behaved. It will accept a connection and return a * BYE
line. Logs were not turned on yet. It's hard to find a causal relationship to something that has happened before the hang occurs hours later. Running out of resources? Deadlock?
The C file is based on winsock examples on msdn. It can be compiled using Mingw or llvm, using the Windows C runtime. VC probably works if the ws2 lib pragma is set. Python script tested on version 3.8.
Comment 24•6 months ago
|
||
Thanks for the "hangtools". Since you suggested it in comment 22, last night I went ahead and created by own "bye" server by modifying an example "echo" server I found written in C and it seems to work. I need to test some more today. While working with it, I realized what I think you are doing/seeing. I originally though that TB was just immediately reconnecting as soon as the BYE is received and when the server disconnects (FIN or RST). But what I now think you are seeing and have been trying to tell me is that you are setting the time between new message checks in TB to 1 minute and you are saying after a couple hours later and many connect tries, TB locks up. Looking at timestamps in your trace file, the connection request are about 1 minute apart. (FWIW, in comment 10 you did mention 60 seconds as expected interval but I didn't pick up on that as the check message time.)
Anyhow, based on what I think I understand, I don't think the "try" build will make a difference.
Also, this doesn't seem to have anything to do with POP3 (clearing info request from Isaac) and I think the bug title/summary at the top needs to be modified.
Comment 25•6 months ago
•
|
||
I'm now using your python server and it's working better than my own. It's producing the same network and imap log activity as your attachments.
Your pcap trace attachment took over 24 hours to cause TB to lock up. At another place (comment 10) you said the lock up occurs in 92 minutes.
In comment 12 you said:
I have found another way to have the server greet with * BYE, and the network package sequence results are similar. Just this server is requested, no other servers (imap or pop3). There was no hang in this situation. I cannot keep the server running in this state, I need mail to work.
So I'm wondering how long and under what server setup the lockup/hang occurs? Does it only occur with your "real" (PM?) server and will it also happen with the python (only sends BYE) server? What is the "another way".
Possibly I need to go into the TB "biff" code and set the check for new mail timer down to something very short, like maybe 10 or 5 seconds, since in "server setting" you can only set it to X minutes.
Edit: I hardcoded the "biff" time to 1 sec. So far, still running. Hitting the "bye" server every second on the 2 connections.
Comment 26•6 months ago
|
||
Some time after running a test account that connects to 'imap' at 127.0.0.2 to the simple Python server and a ws2 server in C for Windows, Thunderbird v128.1 (regular version) hangs again.
I need to improve my reading skills ;)
But still wondering how long is "some time"?
Reporter | ||
Comment 27•6 months ago
|
||
Using the tools did seem to be a catalyst for the hangs. It was hours (didn't measure) after starting to use the fake servers and testing all sorts of things, including compiling and testing the tools themselves, the Thunderbird hangs happened twice. I can't say what triggered it. It's unclear. In the past with v115.x, it took hours to days and sometimes it went on for a week without a hang. That's 'some' time, unfortunately. 'Some' as in: not clear and unspecified. BTW, I used "get messages" often during tests.
In comment 12, I mentioned that a * BYE
greeting was not handled, I did not say that this is what creates a hang. I don't know what does that. It may be some type of network problem, such as reuse of client ports, the speed of the server sending * BYE
and next the arrival in the same millionth[^1] of a second the FIN ACK packet, or some other timing issue, or a combination of reasons, etc. etc. The reason I named the ZIP hangtools.zip is to prevent people from attempting to use for production. It's solely intended to assist in possibly recreating the hang bug.
[^1]: again, according to Wireshark, I don't know if it has a precision of 1 or more millionths of second. The scripts I put together are not fast, Python sockets here is actually rather slow. For speed, it's probably necessary to use low level packet programming.
Comment 28•6 months ago
•
|
||
I ran 128.1.0 with your python server for about 15 hours with biff time hardcoded to 1 second so there were a lot of connect attempts, BYEs and CAPABILITYs. I never saw TB hang and was able to access other accounts in the profile. Since I can't seem to duplicate it here, have you tried the 32-bit windows build pointed to in comment 21? The 32 bit windows build is here:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/JGAboXfuT6yreioxeehXiw/runs/1/artifacts/public/build/install/sea/target.installer.exe
This is under the heading Windows 2012 opt B
-- click the green B and look under "Artifacts and Debugging" to see target.installer.exe
Reporter | ||
Comment 29•6 months ago
|
||
I will try, but it won't be soon before I can attempt to set up an environment to simulate the situation. It causes a long downtime for critical systems.
Comment 30•6 months ago
|
||
Ignore this is you already know this or it's not feasible.
You can install the "try" build along side your release TB build. Then you might just create a new TB profile with an IMAP account for your "bye" server. Do this by starting "try" TB with the -p option, e.g. run tb with <path-to-try-tb>/thunderbird.exe -p
, create the new profile and setup the "bye" account (just leave the outgoing/SMTP setting as is since not sending mail). You probably need to change the server's listening port to something other than 143 like maybe 1143 if real server and "bye" server on same ip address/interface. This will allow you to run your real server on port 143, the "bye" server on port 1143 along with release (128) TB instance on your default profile and the "try" tb instance on the new test profile (with server address set to 1148 if needed). This shouldn't interfere with your normal/critical environment which will just keep running.
Reporter | ||
Comment 31•6 months ago
|
||
I have run tests between virtual machines. One VM is the IMAP server that says bye (exact copy, but the network had to be reconfigured), the other VM is Windows 11 with a completely fresh Thunderbird v128.1.1 install - not previously installed, new profile. It ran for more than a day. So far, Thunderbird did not hang in this situation. The host-only adapter seemingly cannot be captured for local traffic on it's own IP, so I will need to run the test servers (Python, C-runtime) on another test server.
There are various differences between the test setup and the actual work environment: VM's have virtual network interfaces, no DNS, no other accounts. Captured traffic between virtual machines sometimes shows that the traffic on dual Thunderbird threads happen one after another. It seems the thread is delayed.
Next, I will test with the target.installer build, talking to the IMAP server.
Reporter | ||
Comment 32•6 months ago
|
||
Network capture of traffic from daily build a13d081957bb4496ca12a678848e2d49dbb0bfaa (ip 10.10.10.6) to imap server (ip 10.10.10.3) that greets with * BYE
message. Both systems are virtual machines using local traffic. The IMAP server is an image copy of the actual server. The pcap file has been anonymized. Checksums of payload were rewritten.
The client stops sending and sends an ACK as a response to FIN and then sends its own FIN which the server responds to with ACK, gracefully ending the communication.
MOZ_LOG/MOZ_LOG_FILE log files remained zero size.
Comment 33•6 months ago
|
||
So that seems to fix it, right?
However, I would expect something to appear in the IMAP:4 log ??
Reporter | ||
Comment 34•6 months ago
|
||
It seems to fix the BYE greeting handling and to me the network traffic looks as it should. I would expect it to provide better networking.
Because replicating the hang problem in a virtual machine environment failed - even with an exact image copy of the server, it does not tell us if that problem will be fixed as well.
Comment 35•4 months ago
|
||
(In reply to Giacomo Mazzini from comment #3)
(In reply to John from comment #1)
To be clear: this is a hang, not a slowness issue. It does not recover without killing the process.
Should it be of some interest, something similar began happening in my TB 128esr just after upgrading from 115 (I'm feeding multiple frequently checked pop3 accounts into one local Inbox) on macos 14.4: but in my case, simply waiting (a lot) is solving.
Thanks. This would be a new issue, likely unrelated to this bug report.
Updated•3 months ago
|
Comment 36•3 months ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #35)
(In reply to Giacomo Mazzini from comment #3)
(In reply to John from comment #1)
To be clear: this is a hang, not a slowness issue. It does not recover without killing the process.
Should it be of some interest, something similar began happening in my TB 128esr just after upgrading from 115 (I'm feeding multiple frequently checked pop3 accounts into one local Inbox) on macos 14.4: but in my case, simply waiting (a lot) is solving.
Thanks. This would be a new issue, likely unrelated to this bug report.
I'm seeing this as well (general unresponsiveness of the whole pc while thunderbird starts up and checks multiple accounts for messages). I don't know when it started but I know it's very recent. I'm on 1.28.04 on Windows 11 and still only use POP.
Description
•