Open Bug 1911950 Opened 6 months ago Updated 3 months ago

115.1 hangs randomly "not responding" when automatically checking multiple accounts for new messages. Workaround stagger mail check interval (pop)

Categories

(MailNews Core :: Networking: POP, defect)

Thunderbird 116
Unspecified
All
defect

Tracking

(thunderbird_esr128 affected)

UNCONFIRMED
Tracking Status
thunderbird_esr128 --- affected

People

(Reporter: johnbast, Unassigned)

References

Details

(Keywords: hang, perf, Whiteboard: version 115)

Attachments

(5 files)

2.97 MB, application/octet-stream
Details
67.71 KB, image/png
Details
595.85 KB, application/x-zip-compressed
Details
1.61 KB, application/x-zip-compressed
Details
49.01 KB, application/x-zip-compressed
Details

For a description see bug #1847137.

v128 version 128 seems to reduce the number of hangs, but it still becomes unresponsive in the same way. No slowness at that time. I have not been using v128.1.0 long enough.

There are multiple pop3 and imap accounts.

Presumably there is a race condition causing the nonresponsiveness. Making the code thread safe could probably solve such problems at the root.

Keywords: perf

To be clear: this is a hang, not a slowness issue. It does not recover without killing the process.

Blocks: tb128found

There are multiple pop3 and imap accounts.

How many of each?

For a description see bug #1847137.
v128 version 128 seems to reduce the number of hangs, but it still becomes unresponsive in the same way. No slowness at that time. I have not been using v128.1.0 long enough.

So this started in version 115 or prior

Workaround stagger mail check interval

Please describe in detail what you did to reduce the condition.

Presumably there is a race condition causing the nonresponsiveness. Making the code thread safe could probably solve such problems at the root.

Please don't speculate unless you have technical knowledge of the code for making such a statement - lest it add confusion.

No longer blocks: tb128found
Component: Networking → Networking: POP
Flags: needinfo?(johnbast)
Keywords: hang
See Also: → 1847137
Whiteboard: version 115
Version: Thunderbird 128 → Thunderbird 116

(In reply to John from comment #1)

To be clear: this is a hang, not a slowness issue. It does not recover without killing the process.

Should it be of some interest, something similar began happening in my TB 128esr just after upgrading from 115 (I'm feeding multiple frequently checked pop3 accounts into one local Inbox) on macos 14.4: but in my case, simply waiting (a lot) is solving.

(In reply to Wayne Mery (:wsmwk) from comment #2)

There are multiple pop3 and imap accounts.

How many of each?

From prefs.js:
count of .type", "imap"); is 8
count of .type", "pop3"); is 17
mail.server.server# last number for # is 27
mail.account.account# last number for # is 43

Workaround stagger mail check interval

Please describe in detail what you did to reduce the condition.

I didn't do anything with intend to reduce the number of occurrences.

However, since my last post, there were about 6 hangs. That's the same level as in previous versions.

Flags: needinfo?(johnbast)

Issac,
Sorry to bother you again, but need some more info from you. Have you updated to TB 128 yet? If you have, are you able to check mail without staggering the check mail intervals? If so, are you seeing any major slow-downs or hangs as are reported here? I remember that you have a lot of POP3 accounts that you check for new mail on a timed basis.
Thank a lot (in advance)! --gene

Flags: needinfo?(isaacribeiro)

Network (tcp, udp, app layer) timeouts, errors and such could be a factor. Can someone explain how the network traffic is performed?

Are DNS lookups blocking or asynchronous? Is DNS cached at program level? Does it use Windows DNSAPI? With or without synchronous (e.g. DnsQuery_A) or asynchronous (DnsQueryEx)?

Is network traffic blocking? How are tcp resends, tcp resets, timeouts handled?

In software, hangs can occur when there is a lock that is not released (e.g. one lock waiting for a release). Are locks used? Does it use callbacks or threads?

Are retrieval cycles done one after another (no preset times) or are they done at certain time intervals?

How are overlapping requests handled? pop3 queries can be set to query once per minute, if the network is unavailable for longer than a minute or longer than the duration of a full retrieval cycle, does that stack requests or are new queries halted until a full retrieve cycle is performed? Or is this done some other way?

I think clearing this up can help getting this problem fixed.

I captured a large amount of network traffic and noticed that one of the servers is returning * BYE from port 143 (imap) after establishing the connection. It sends this BYE reply because it has run out of disk space.

Thunderbird sends a 15 capability request after receiving a FIN ACK packet from the server. It stops sending after the server sends a RST ACK. There are two near simultaneous connections with the same sequence of 8 packets. Theoretically, it could be querying the same account twice or two accounts on the same server, that is not visible in this network traffic. However, I think there are two connections for the same account because the request number of capability is the same.

Earlier connections more or less look the same and often have the same problem.

I wonder if these connections are handled and finished correctly by Thunderbird.

Because there appear to be two near simultaneous requests, one request may be locking or blocking data the other request does not expect. Or if there is no locking, there probably should be, because the mail client needs time to finish before another request attempts to access or change the same data. Also see this.

TB just uses the network facilities provided by firefox (mozilla) and does not roll its own. POP3 uses async JS and network calls which is single thread and callback based. IMAP uses separate threads in c++ for for each mailbox/tcp-ip connection, up to 5 thread (default) per server.

It sends this BYE reply because it has run out of disk space.

I assume this is an IMAP server sending the bye? When TB starts up, it does make 2 connections to the imap server. One is used for "folder discovery" and the other just does a "select" on Inbox. These run in parallel in their own threads.
There is a setting in "Server settings"/"Advanced" where you can set the max number of "cached" IMAP connections (equal to imap threads) that can occur for an imap account/server. It defaults to 5.

How do you know the server is out of disk space? Can you see the server logs? What type of server(s) are the problem occurring with? E.g., dovecot, cyrus, gmail (or homegrown)? Both imap and POP3?

Anyhow, if you can provide a IMAP:4 log it might show something. See this for details: https://wiki.mozilla.org/MailNews:Logging
Another possibility is if you can provide me with a temporary test account on the problem server(s) so I can see for myself what is happening.

If you record the IMAP:4 log, it might be better to "disable" other IMAP accounts so only the info for a "problem" account is included. To disable the others, uncheck/unselect, in Server settings, "Check for new messages at startup" and "Check for new messages every X minutes", then restart TB and record the log.
Edit: Don't click in any of the "disabled" accounts while recording the log. This would cause data to still be recorded for the "disabled" accounts.

(In reply to gene smith from comment #8)

Thanks for the information. After the disk problem was fixed and the imap server was restarted, hangs haven't occurred again.

The IMAP server sends * BYE appended by a text message explaining that it's out of disk space . There is no pop account on the server. The server runs on a local network. It's not accessible from internet and the licenced software is not available.

It should be possible to emulate the server network behaviour though. I will attempt to find out if it is possible to anonymize and share part of the capture. There are capture replay tools.

It took 92 minutes after starting the capture until the hang occurred. I haven't counted them, but it seems that many connections were set up by Thunderbird, too many to count. Intervals are not always as expected: sometimes there is one at 27 seconds, when 60 seconds is expected.

Based on what you describe (and your email addr) I suspect you are running the local/proprietary server for PM. A few years ago I installed a "trial" version of that for another bug. Of course the trial period is long expired.

It should be possible to emulate the server network behaviour though. I will attempt to find out if it is possible to anonymize and share part of the capture. There are capture replay tools.

It would probably be better to share the IMAP:4 log recorded by TB rather than the capture from "replay tools". You can redact anything in the IMAP:4 log that you want, no problem.

(In reply to gene smith from comment #11)

It should be possible to emulate the server network behaviour though. I will attempt to find out if it is possible to anonymize and share part of the capture. There are capture replay tools.

It would probably be better to share the IMAP:4 log recorded by TB rather than the capture from "replay tools". You can redact anything in the IMAP:4 log that you want, no problem.

The hang has not occurred since the out of disk space problem was solved. The network capture is all I have at the moment when the hang did happen.

I have found another way to have the server greet with * BYE, and the network package sequence results are similar. Just this server is requested, no other servers (imap or pop3). There was no hang in this situation. I cannot keep the server running in this state, I need mail to work.

Issues found:

  • client does not process * BYE at greeting level before sending IMAP capability request (see RFC 9051 3.4 flow diagram)
  • MOZ_LOG output is garbled: original CR LF line endings in communication lines are transformed to CR CR LF in the log.
Attached file moz_log

MOZ_LOG=IMAP:4 binary
[] -> replaced strings

Attachment #9419531 - Attachment mime type: application/octet-stream → text/plain

I've never seen a server allow a connection and then immediately respond with "Bye". There is code in TB to respond to BYE but I don't think it is doing the right thing at the "greeting" point. It should just treat this like a unreachable or timed out connection request to the server rather than just continuing on and attempt a "capability" (over and over like in your log).

I think it is, but is the server dropping the connection after sending the "Bye"? https://www.rfc-editor.org/rfc/rfc3501.html#section-7.1.5 says:

         4) as one of three possible greetings at connection startup,
            indicating that the server is not willing to accept a
            connection from this client.  The server closes the
            connection immediately.

I would say it probably is since TB goes ahead and tries to do the capability but then sees that the network connection is down and tries again.

Attached image 7919.png

The Wireshark screenshot shows one tcp packet stream, edited to show server and client packets.

I marked the LOG file expressly as binary so it arrives in its original form without changing line endings.

The server side does what you quote from section 7.1.5. The chart (probably incorrectly) shows a client logout, which is not consistent with this text.

The packet from the server after BYE has FIN ACK (see comment 7), that means that the server is ending the connection. From the Thunderbird LOG I think I can see that there are two threads started that use the same tcp connection. I think one sends an ACK as a response to FIN ACK the other thread doesn't check or know and sends a capability request.

Attachment #9419531 - Attachment mime type: text/plain → application/octet-stream

Each imap thread handles one connection. What comment 15 image shows is TB making the connection successfully and then server responds with BYE then the server closes the connection (sends FIN). The mozilla framework then ACKs the FIN but TB doesn't know about this. Then TB tries to send the CAPABILITY on the same now closed connection and the server responds with RST. The fact that TB doesn't know about a connection down in IMAP code until something is attempted to be sent was somewhat surprising to me when I learned about it some time back; but, it "simplifies" the IMAP handling since it becomes pretty much synchronous (code in thread sends imap command and waits for a response).
Anyhow, I'll try to find a fix for the "unexpected" BYE response for the initial greeting.

I'm think you are saying now that you have fixed the server so there is no initial BYE that TB is now working OK again?

I marked the LOG file expressly as binary so it arrives in its original form without changing line endings.

I've never noticed a problem with line endings using my editors (vim or gvim). Looking with a binary "hexdump" program at logs I record here, I see the lines that are imap request or responses end with "0d0a" (as required by RFC) while the other lines end with just "0a" which is the standard linux/unix line ending. I'm not seeing "0d0d0a" in log files I produce here on linux.
Anyhow, I'll set your log attachment back to binary and see what I see.
:
Ok, I see the 0d0d0a on imap lines but don't know why. File still looks OK in gvim.
Edit: P/S: Tried another editor that I seldom use (kate) and it shows a blank line after each line with the 0d0d0a.

Attached file port143an.zip

Password protected ZIP file, can be opened using a ZIP decompression tool.
Windows does not support this encryption format.
Contents not for publication.

Contains an edited version of packet traffic between Thunderbird and an IMAP server that is out of disk space and greets with * BYE. Some time after the last exchange, Thunderbird hanged and did not recover. MAC, IP, greeting, were anonymized, checksums were recalculated.

The logging line ending problem is likely due to log construction error. The original packets from server and client have the correct RFC compliant CR LF (0x0D 0x0A) line ending. One scenario is that it is assumed that line ending is just one character long (Linux: \n or 0x0A). On Windows, if the log writes 0x0D 0x0A at position stringlen-1, it would create a bad line ending of 0x0D 0x0D 0x0A. This is of course cosmetic and probably unrelated to actual network traffic.

I've started a "try" build here (https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=8c7c4f22e8da9c153380513e9d764cad227cbeed) based on the current TB daily. I'm assuming you are running windows or linux so, if it finishes OK, you can click the green B beside the platform you have and down below, under the "artifacts" column, you can find an installer for windows or a target.tar.bz2 for linux you can unzip and run "thunderbird" inside.
The fix now handles the possibility of a BYE response to the greeting. I've been unable to find a way to cause my test servers (cyrus and dovecot) to send the BYE in the initial greeting, so I'm asking you to test this with your server since you say you can cause BYE greeting to occur.
I've tried to simulate the BYE but the server doesn't disconnect so my test isn't really valid. You should just see a pop-up message about the BYE and then TB should do nothing more until you select a folder or another check for new messages occurs. It should not just keep trying do CAPABILITY requests as shown in your .pcap trace.

John, I re-ran the try build for all platforms since I don't know what you are using. Again, click the green "B" by your platform and look under "Artifacts and Debugging" to find the appropriate installation file:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=a13d081957bb4496ca12a678848e2d49dbb0bfaa

Win32 client offline installer (or zip) is preferred. I will have to image and reconfigure the server and client environments.

It's not difficult to create a server that just sends * BYE and then disconnects in for example Python, but in one I tried the packet sequence is not the same (introduces PSH) and the timing is off. In the 'hang' packets, the IMAP server was extremely quick in sending * BYE and then immediately the FIN, ACK packet. Wireshark shows the same time in millionth of a second.

Attached file hangtools.zip

Attached are server scripts to simulate the start of an imap connection.

Some time after running a test account that connects to 'imap' at 127.0.0.2 to the simple Python server and a ws2 server in C for Windows, Thunderbird v128.1 (regular version) hangs again. The attached server scripts really are quick & very dirty and are not functional or well behaved. It will accept a connection and return a * BYE line. Logs were not turned on yet. It's hard to find a causal relationship to something that has happened before the hang occurs hours later. Running out of resources? Deadlock?

The C file is based on winsock examples on msdn. It can be compiled using Mingw or llvm, using the Windows C runtime. VC probably works if the ws2 lib pragma is set. Python script tested on version 3.8.

Thanks for the "hangtools". Since you suggested it in comment 22, last night I went ahead and created by own "bye" server by modifying an example "echo" server I found written in C and it seems to work. I need to test some more today. While working with it, I realized what I think you are doing/seeing. I originally though that TB was just immediately reconnecting as soon as the BYE is received and when the server disconnects (FIN or RST). But what I now think you are seeing and have been trying to tell me is that you are setting the time between new message checks in TB to 1 minute and you are saying after a couple hours later and many connect tries, TB locks up. Looking at timestamps in your trace file, the connection request are about 1 minute apart. (FWIW, in comment 10 you did mention 60 seconds as expected interval but I didn't pick up on that as the check message time.)

Anyhow, based on what I think I understand, I don't think the "try" build will make a difference.

Also, this doesn't seem to have anything to do with POP3 (clearing info request from Isaac) and I think the bug title/summary at the top needs to be modified.

Flags: needinfo?(isaacribeiro)

I'm now using your python server and it's working better than my own. It's producing the same network and imap log activity as your attachments.

Your pcap trace attachment took over 24 hours to cause TB to lock up. At another place (comment 10) you said the lock up occurs in 92 minutes.

In comment 12 you said:

I have found another way to have the server greet with * BYE, and the network package sequence results are similar. Just this server is requested, no other servers (imap or pop3). There was no hang in this situation. I cannot keep the server running in this state, I need mail to work.

So I'm wondering how long and under what server setup the lockup/hang occurs? Does it only occur with your "real" (PM?) server and will it also happen with the python (only sends BYE) server? What is the "another way".

Possibly I need to go into the TB "biff" code and set the check for new mail timer down to something very short, like maybe 10 or 5 seconds, since in "server setting" you can only set it to X minutes.
Edit: I hardcoded the "biff" time to 1 sec. So far, still running. Hitting the "bye" server every second on the 2 connections.

Some time after running a test account that connects to 'imap' at 127.0.0.2 to the simple Python server and a ws2 server in C for Windows, Thunderbird v128.1 (regular version) hangs again.

I need to improve my reading skills ;)
But still wondering how long is "some time"?

Using the tools did seem to be a catalyst for the hangs. It was hours (didn't measure) after starting to use the fake servers and testing all sorts of things, including compiling and testing the tools themselves, the Thunderbird hangs happened twice. I can't say what triggered it. It's unclear. In the past with v115.x, it took hours to days and sometimes it went on for a week without a hang. That's 'some' time, unfortunately. 'Some' as in: not clear and unspecified. BTW, I used "get messages" often during tests.

In comment 12, I mentioned that a * BYE greeting was not handled, I did not say that this is what creates a hang. I don't know what does that. It may be some type of network problem, such as reuse of client ports, the speed of the server sending * BYE and next the arrival in the same millionth[^1] of a second the FIN ACK packet, or some other timing issue, or a combination of reasons, etc. etc. The reason I named the ZIP hangtools.zip is to prevent people from attempting to use for production. It's solely intended to assist in possibly recreating the hang bug.

[^1]: again, according to Wireshark, I don't know if it has a precision of 1 or more millionths of second. The scripts I put together are not fast, Python sockets here is actually rather slow. For speed, it's probably necessary to use low level packet programming.

I ran 128.1.0 with your python server for about 15 hours with biff time hardcoded to 1 second so there were a lot of connect attempts, BYEs and CAPABILITYs. I never saw TB hang and was able to access other accounts in the profile. Since I can't seem to duplicate it here, have you tried the 32-bit windows build pointed to in comment 21? The 32 bit windows build is here:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/JGAboXfuT6yreioxeehXiw/runs/1/artifacts/public/build/install/sea/target.installer.exe
This is under the heading Windows 2012 opt B -- click the green B and look under "Artifacts and Debugging" to see target.installer.exe

I will try, but it won't be soon before I can attempt to set up an environment to simulate the situation. It causes a long downtime for critical systems.

Ignore this is you already know this or it's not feasible.

You can install the "try" build along side your release TB build. Then you might just create a new TB profile with an IMAP account for your "bye" server. Do this by starting "try" TB with the -p option, e.g. run tb with <path-to-try-tb>/thunderbird.exe -p, create the new profile and setup the "bye" account (just leave the outgoing/SMTP setting as is since not sending mail). You probably need to change the server's listening port to something other than 143 like maybe 1143 if real server and "bye" server on same ip address/interface. This will allow you to run your real server on port 143, the "bye" server on port 1143 along with release (128) TB instance on your default profile and the "try" tb instance on the new test profile (with server address set to 1148 if needed). This shouldn't interfere with your normal/critical environment which will just keep running.

I have run tests between virtual machines. One VM is the IMAP server that says bye (exact copy, but the network had to be reconfigured), the other VM is Windows 11 with a completely fresh Thunderbird v128.1.1 install - not previously installed, new profile. It ran for more than a day. So far, Thunderbird did not hang in this situation. The host-only adapter seemingly cannot be captured for local traffic on it's own IP, so I will need to run the test servers (Python, C-runtime) on another test server.

There are various differences between the test setup and the actual work environment: VM's have virtual network interfaces, no DNS, no other accounts. Captured traffic between virtual machines sometimes shows that the traffic on dual Thunderbird threads happen one after another. It seems the thread is delayed.

Next, I will test with the target.installer build, talking to the IMAP server.

Network capture of traffic from daily build a13d081957bb4496ca12a678848e2d49dbb0bfaa (ip 10.10.10.6) to imap server (ip 10.10.10.3) that greets with * BYE message. Both systems are virtual machines using local traffic. The IMAP server is an image copy of the actual server. The pcap file has been anonymized. Checksums of payload were rewritten.

The client stops sending and sends an ACK as a response to FIN and then sends its own FIN which the server responds to with ACK, gracefully ending the communication.

MOZ_LOG/MOZ_LOG_FILE log files remained zero size.

So that seems to fix it, right?
However, I would expect something to appear in the IMAP:4 log ??

It seems to fix the BYE greeting handling and to me the network traffic looks as it should. I would expect it to provide better networking.

Because replicating the hang problem in a virtual machine environment failed - even with an exact image copy of the server, it does not tell us if that problem will be fixed as well.

(In reply to Giacomo Mazzini from comment #3)

(In reply to John from comment #1)

To be clear: this is a hang, not a slowness issue. It does not recover without killing the process.

Should it be of some interest, something similar began happening in my TB 128esr just after upgrading from 115 (I'm feeding multiple frequently checked pop3 accounts into one local Inbox) on macos 14.4: but in my case, simply waiting (a lot) is solving.

Thanks. This would be a new issue, likely unrelated to this bug report.

Summary: 115.1 hangs randomly "not responding" when automatically checking multiple accounts for new messages. Workaround stagger mail check interval → 115.1 hangs randomly "not responding" when automatically checking multiple accounts for new messages. Workaround stagger mail check interval (pop)

(In reply to Wayne Mery (:wsmwk) from comment #35)

(In reply to Giacomo Mazzini from comment #3)

(In reply to John from comment #1)

To be clear: this is a hang, not a slowness issue. It does not recover without killing the process.

Should it be of some interest, something similar began happening in my TB 128esr just after upgrading from 115 (I'm feeding multiple frequently checked pop3 accounts into one local Inbox) on macos 14.4: but in my case, simply waiting (a lot) is solving.

Thanks. This would be a new issue, likely unrelated to this bug report.

I'm seeing this as well (general unresponsiveness of the whole pc while thunderbird starts up and checks multiple accounts for messages). I don't know when it started but I know it's very recent. I'm on 1.28.04 on Windows 11 and still only use POP.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: