Implement TCP keepalive for the IMAP protocol, to detect if the server dropped the connection or the machine was suspended for longer than the server's (application-layer) IMAP timeout
Categories
(MailNews Core :: Networking: IMAP, enhancement)
Tracking
(Not tracked)
People
(Reporter: mail, Assigned: mail)
References
Details
Attachments
(1 file, 9 obsolete files)
9.79 KB, patch | jorgk-bmo: review+ | Details | Diff | Splinter Review
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0
Steps to reproduce:
This patch implements TCP keepalive for the IMAP protocol, as it is already used for
HTTP.
This way, stalled IMAP connections are detected relatively quickly.
Three new preferences are added, mirroring those in the
"network.http.tcp_keepalive" namespace:
- "mail.imap.tcp_keepalive.enabled",
- "mail.imap.tcp_keepalive.idle_time",
- "mail.imap.tcp_keepalive.retry_interval".
For the last two, setting either of them to -1 means using the relevant value
from the "network.tcp.keepalive" namespace.
As in the HTTP case, the feature is enabled by default if the system supports it.
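To make the description concrete, here is a minimal sketch of how such preferences could be applied to a socket, assuming Necko's nsISocketTransport keepalive API (SetKeepaliveEnabled / SetKeepaliveVals); the helper name and the fallback handling are illustrative only, not the patch itself:

#include "mozilla/Preferences.h"
#include "nsISocketTransport.h"

// Hypothetical helper (not from the patch): apply the three prefs to a socket.
nsresult ApplyImapKeepalivePrefs(nsISocketTransport *aTransport) {
  bool enabled =
      mozilla::Preferences::GetBool("mail.imap.tcp_keepalive.enabled", true);
  nsresult rv = aTransport->SetKeepaliveEnabled(enabled);
  NS_ENSURE_SUCCESS(rv, rv);
  if (!enabled)
    return NS_OK;

  int32_t idleTimeS =
      mozilla::Preferences::GetInt("mail.imap.tcp_keepalive.idle_time", -1);
  int32_t retryIntervalS =
      mozilla::Preferences::GetInt("mail.imap.tcp_keepalive.retry_interval", -1);
  // -1 keeps the nsSocketTransportService defaults (network.tcp.keepalive.*);
  // a full implementation would handle the two values independently.
  if (idleTimeS < 0 || retryIntervalS < 0)
    return NS_OK;
  return aTransport->SetKeepaliveVals(idleTimeS, retryIntervalS);
}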
Comment 1•6 years ago
Hi Maciej, when you submit a patch, you need to draw some attention to it, so request feedback or review.
Comment 2•6 years ago
Comment 3 (Assignee)•6 years ago
Thanks Jorg for your guidance.
With TCP keepalives enabled, when a system resumes from suspend or hibernation, Thunderbird notices almost immediately
that its IMAP connections are no longer valid and reconnects.
Previously, it took a good few minutes to do so.
Also, TCP keepalives have been enabled for years for HTTP, where (in Thunderbird) a stalled connection isn't
that much of an issue (it isn't a web browser, after all), but not for IMAP, where a stalled IDLE connection means
users silently stop getting new mail notifications.
Comment 4•6 years ago
Keepalive allows the TCP connection to stay around, so on the next contact that connection can be reused (avoiding the TCP handshake), which is slightly faster and consumes fewer resources under certain conditions.
Comment 5•6 years ago
Sounds like a useful thing to have.
Comment 6•6 years ago
I haven't tried applying and testing the patch, but let me ask some questions first:
So do I understand that this will send a keepalive packet (an ACK with no data, I think) every 10m (600s) by default on each IMAP connection?
The patch also defines a retry interval, I think defaulting to 1m (possibly 1s). I assume if the keepalive gets no response after 1m, the keepalive packet is sent again? How many retries before the connection is deemed to be closed? (Maybe this is network.tcp.keepalive.probe_count, default 4?)
I haven't looked in detail at what happens during/after suspend and hibernate, but I assume that typically all connections are dropped by the server if the suspended time is long enough. IMAP servers typically time out and drop after 30m. When the computer resumes, will it immediately try to send the keepalive, or might it take up to 10m before it is sent?
I think there is a bug that specifically addresses the problem of a delayed new mail response after a resume, but I can't find it. I typically don't see a big problem with this. I often suspend this laptop and will see new emails fairly quickly on resume. But I just noticed I have the check for new mail at 1m, so that's probably why.
When the keepalive determines that the connection is gone, is the connection re-established in the background without killing and re-constructing a new nsImapProtocol thread?
Comment 7•6 years ago
Bug 1196662 covers the situation where the new mail response after resume is not so good.
This may help with the resume problem, but I'm not sure about just keeping imap connections alive with this. I don't think imap servers will care about the transport-layer ACK keepalive. They need an application-layer keepalive, probably like imap NOOP, to avoid a timeout and disconnect.
Comment 8 (Assignee)•6 years ago
So do I understand that this will send a keepalive packet (an ACK with no data, I think)
every 10m (600s) by default on each IMAP connection?
The TCP keepalive packet is sent only on an idle IMAP TCP connection (not to be confused with the
IMAP IDLE command), that is, a connection that has had no TCP-level data exchanged for at least 10
minutes (by default).
Every time some data gets transferred at the connection's TCP level, this timer starts counting from zero again.
So TCP keepalives don't do anything unless the connection is completely dormant for at least
these 10 minutes.
The patch also defines a retry interval, I think defaulting to 1m (possibly 1s).
The patch allows a user to force a particular TCP keepalive retry interval for IMAP connections.
By default, the nsSocketTransportService default is used, as specified in the
"network.tcp.keepalive.retry_interval" preference. This preference defaults to 1s.
I assume if the keepalive gets no response after 1m, the keepalive packet is sent again?
How many retries before the connection is deemed to be closed?
(Maybe this is network.tcp.keepalive.probe_count, default 4?)
Yes, this setting determines how many probes are sent in the worst case where there is no
response at all from the server (the probes are one retry interval apart).
There is an important caveat to this setting, however - it is only customizable on Linux.
Windows and Mac OS X have this value hardcoded at 10 and 8 probes, respectively.
Please refer to netwerk/base/nsSocketTransportService2.h around the "kDefaultTCPKeepCount" variable.
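For reference, this is roughly the OS-level mechanism being driven indirectly, shown as a standalone Linux sketch (plain BSD socket options, not code from the patch; the helper name is made up):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Standalone Linux illustration: the three knobs discussed above map to
// TCP_KEEPIDLE (idle time), TCP_KEEPINTVL (retry interval) and TCP_KEEPCNT
// (probe count - the one Windows/macOS hardcode at 10 and 8, respectively).
int EnableTcpKeepalive(int fd, int idleTimeS, int retryIntervalS, int probeCount) {
  int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
    return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idleTimeS, sizeof(idleTimeS)) != 0)
    return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &retryIntervalS, sizeof(retryIntervalS)) != 0)
    return -1;
  return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probeCount, sizeof(probeCount));
}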
I haven't looked in detail at what happens during/after suspend and hibernate, but I
assume that typically all connections are dropped by the server if the suspended time is long enough.
That's right.
BTW, connections can be dropped not only by the IMAP server but also by NAT routers or stateful
firewalls that sit between the client and the server.
IMAP servers typically time out and drop after 30m.
When the computer resumes, will it immediately try to send the keepalive, or might it take up to 10m before it is sent?
At least Linux doesn't seem to take into account the time the system was suspended when calculating
a TCP connection's keepalive idle time.
That's why I personally use an idle time of 1 minute instead of the nsSocketTransportService default of 10 minutes.
I often suspend this laptop and will see new emails fairly quickly on resume.
But I just noticed I have the check for new mail at 1m, so that's probably why.
Even if you poll the IMAP server every minute, you still might need to wait out the whole IMAP protocol timeout
if a NAT router or a stateful firewall between you and the server has already dropped this connection record.
When the keepalive determines that the connection is gone,
is the connection re-established in the background without killing and re-constructing a new nsImapProtocol thread?
The situation here is the same as if the IMAP server had just closed the connection for any reason.
In practice, a failing TCP keepalive check should happen so infrequently (by default, it can't happen more often than
about every 10 minutes, even in the worst case) that the resource usage shouldn't be a concern here IMHO.
I don't think imap servers will care about the transport-layer ACK keepalive.
They need an application-layer keepalive, probably like imap NOOP, to avoid a timeout and disconnect.
This patch doesn't touch the actual application-layer IMAP protocol, just the TCP transport layer.
If a user had their machine suspended for longer than the server's (application-layer) IMAP timeout, the server
is going to drop the connection anyway and they can't do anything about it.
This patch only makes the user's Thunderbird detect this situation more quickly.
Comment 9•6 years ago
Gene, can you please check it out for us. Formally the patch looks good to me.
Comment 10•6 years ago
I applied the patch, but when I run tb it immediately MOZ_ASSERTs here:
https://searchfox.org/comm-central/source/mozilla/netwerk/base/nsSocketTransport2.cpp#3341
MOZ_ASSERT only has an effect when the build is done with DEBUG defined, so maybe the reporter tested with a non-debug build.
Apparently nsImapMockChannel::OnTransportStatus() is not called on the "socket thread".
I found a place in nsImapProtocol.cpp that runs on the "socket thread", but the place I found, nsImapProtocol::OnInputStreamReady(), is used for imap IDLE functionality; calling NotifyTransportConnected() from there avoids the MOZ_ASSERT but breaks subsequent imap commands, so emails aren't accessed.
To get OnInputStreamReady() to be called you have to enable AsyncWait(), like is done for imap IDLE. This causes later Read()s to be non-blocking, which doesn't work for normal imap commands. Quickly canceling the AsyncWait(), as is done for EndIdle(), doesn't help.
Maybe there is an existing place to call NotifyTransportConnected() that is on the "socket thread" that I didn't find. Perhaps another way is to define a new "socket thread" callback function that's not so problematic. But I don't know which function might be possible or work for this.
Comment 11•6 years ago
Gene, thanks for looking. Feel free to mark this f- if it crashes. You don't have to debug it, that's up to the patch author.
Comment 12 (Assignee)•6 years ago
MOZ_ASSERT only has an effect when the build is done with DEBUG defined, so maybe the reporter tested with a non-debug build.
I thought my build had asserts enabled, but it looks like only some of them were compiled in.
Will investigate and update the patch.
Comment 13•6 years ago
I think to enable MOZ_ASSERTs you need the "ac_add_options --enable-debug" item shown here in the mozconfig file:
Comment 14 (Assignee)•6 years ago
Updated the patch so it now applies keepalive settings from the socket thread.
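The rough shape of that approach, as a hedged reconstruction rather than the patch verbatim (the m_proxy member and class name follow later comments in this bug; the forwarding logic is assumed): a sink registered on the socket transport receives OnTransportStatus() on the socket thread, which is where the keepalive options can be applied once the TCP connection is up.

#include "nsCOMPtr.h"
#include "nsISocketTransport.h"
#include "nsITransport.h"

class nsImapTransportEventSink final : public nsITransportEventSink {
 public:
  NS_DECL_THREADSAFE_ISUPPORTS

  // Called by Necko on the socket thread.
  NS_IMETHOD OnTransportStatus(nsITransport *aTransport, nsresult aStatus,
                               int64_t aProgress, int64_t aProgressMax) override {
    if (aStatus == NS_NET_STATUS_CONNECTED_TO) {
      nsCOMPtr<nsISocketTransport> st = do_QueryInterface(aTransport);
      if (st)
        ApplyTCPKeepalive(st);  // safe here: we are on the socket thread
    }
    // Forward to the original sink (proxied back to the main thread).
    return m_proxy ? m_proxy->OnTransportStatus(aTransport, aStatus, aProgress,
                                                aProgressMax)
                   : NS_OK;
  }

 private:
  virtual ~nsImapTransportEventSink() = default;
  nsresult ApplyTCPKeepalive(nsISocketTransport *aTransport);
  nsCOMPtr<nsITransportEventSink> m_proxy;
};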
Comment 15•6 years ago
Thanks, looks good. I've never understood "sinks" and similar concepts that are used in the tb code, so I can learn something from your patch.
I applied the patch and it seems to run with no problems now. I will test it ASAP and verify that the keepalives are actually working and can be configured as expected.
My only minor complaint is that the string "idle" occurs a lot in the patch. Since imap IDLE is also used a lot in this nsImapProtocol.cpp file, I would personally prefer that the string "idle" in your patch be replaced with maybe "quiet" or "dormant" or something similar, if possible.
Also, just curious, are you making the patch with non-hg tools, possibly git? Unless the rules have changed, I think Jorg will want the patch created with hg.
Comment 16•6 years ago
Comment 17•6 years ago
Can you put the brace on the previous line, here and everywhere else.
Jorg, is this a requirement for new code? The old/original code in the imap source files doesn't do this but keeps opening/closing braces in the same column for the most part. I personally prefer them in the same column, non-K&R style, but it does add somewhat to the line count of the ridiculously long nsImapProtocol.cpp file!
Comment 18•6 years ago
Well, the patch has some on the same line and some not, so consistency would be nice. I prefer them on the same line, and that's now also the standard in M-C. M-C recently reformatted 10,000 files automatically.
Comment 19 (Assignee)•6 years ago
My only minor complaint is that the string "idle" occurs a lot in the patch.
Since imap IDLE is also used a lot in this nsImapProtocol.cpp file,
I would personally prefer that the string "idle" in your patch be replaced
with maybe "quiet" or "dormant" or something similar, if possible.
I understand your concern; however, both OSes (TCP_KEEPIDLE, etc.), nsISocketTransport
and nsSocketTransportService use the term "idle" almost exclusively, so if we invented our
own terminology here we would stand out from everything else.
However, a comment could be added instead here and there saying something like
"this is the TCP keepalive idle time, not to be confused with the IMAP IDLE command".
Also, just curious, are you making the patch with non-hg tools, possibly git?
Unless the rules have changed, I think Jorg will want the patch created with hg.
Yes, the patch was made by git, but it is in the "unified diff" format, which should
also be importable by Mercurial, as far as I can see from its docs.
If this turns out to be a problem, please let me know.
Thanks for the contribution.
Thanks for your review.
Can you put the brace on the previous line, here and everywhere else.
Will do - I used this style since something like 95% of "if" or "else" statements
in this file (nsImapProtocol.cpp) use it (that is, braces on new lines).
This function is only ever used once, so please inline it.
Will do, but with a heavy heart - nsImapProtocol is already a giant class and
LoadImapUrlInternal() is already well past one screen in length.
All the reference management in this (unnecessary) function is incorrect.
We never leave a function like this.
It works because it compensates for the fact that you started with a raw pointer.
This function is based on net_NewTransportEventSinkProxy() in netwerk/base/nsTransportUtils.cpp,
which also checks for memory allocation failure.
Comment 20•6 years ago
Well, I don't want to get involved in the "where to put the brace" discussion. Sure, most mailnews/ code has it on the next line, but your patch is inconsistent in that regard.
Lots of old code still null-checks after new, but I rip it out wherever I see it. Here are examples of the new style:
https://searchfox.org/mozilla-central/rev/b3ac60ff061c7891e77c26b73b61804aa1a8f682/netwerk/base/nsPACMan.cpp#823
https://searchfox.org/mozilla-central/rev/b3ac60ff061c7891e77c26b73b61804aa1a8f682/netwerk/base/nsSimpleURI.cpp#745
OK, I can see net_NewTransportEventSinkProxy() from around 2003 is similar, but it has three callers.
Comment 21•6 years ago
If you inline the function, you get:
nsCOMPtr<nsITransportEventSink> sinkMC = do_QueryInterface(m_mockChannel);
if (sinkMC) {
nsCOMPtr<nsIThread> thread = do_GetMainThread();
RefPtr<nsImapTransportEventSink> sink = new nsImapTransportEventSink;
nsresult rv = sink->SetSink(sinkMC, thread);
NS_ENSURE_SUCCESS(rv, rv);
sink->SetTCPKeepalive(&m_TCPKeepalive);
m_transport->SetEventSink(sink, nullptr);
}
Is that terrible?
Comment 22•6 years ago
Hmm, SetSink() has one caller and so does SetTCPKeepalive(). You'd end up with:
if (sinkMC) {
nsCOMPtr<nsIThread> thread = do_GetMainThread();
RefPtr<nsImapTransportEventSink> sink = new nsImapTransportEventSink;
nsresult rv = net_NewTransportEventSinkProxy(getter_AddRefs(sink->m_proxy), sinkMC, thread);
NS_ENSURE_SUCCESS(rv, rv);
sink->m_TCPKeepalive = m_TCPKeepalive;
m_transport->SetEventSink(sink, nullptr);
}
if you have
protected:
friend class nsImapProtocol;
nsCOMPtr<nsITransportEventSink> m_proxy;
nsImapProtocol::TCPKeepalive m_TCPKeepalive;
Comment 23 (Assignee)•6 years ago
Updated the patch implementing review remarks.
Comment 24•6 years ago
Comment 25•6 years ago
I tried the v0 patch and it worked as advertised. I will test again in more detail, but will probably wait for the requested "clamp" change.
I checked that the keepalive packet timing works on a quiet imap connection and that, after dropping packets on a localhost connection using the ip6tables command, the specified number of "probes" occur at the set time interval. (It took me a while to realize that my localhost connection to my dovecot server uses ipv6 and not ipv4, so I had to use ip6tables instead of iptables to block the keepalives.)
Comment 26•6 years ago
Don't worry about the clamped thing. Either way, the value will end up in the interval between 1 and max.
Comment 27 (Assignee)•6 years ago
Thanks for testing, Gene.
Here is an updated patch using std::{min,max} instead of clamp.
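For illustration, the std::min/std::max equivalent of a clamp into the 1-to-max interval mentioned in comment 26 (a sketch; the function and bound names are hypothetical):

#include <algorithm>
#include <cstdint>

// Hypothetical names: clamp a pref value into [1, aMax] without std::clamp.
int32_t ClampKeepaliveValue(int32_t aValue, int32_t aMax) {
  return std::max<int32_t>(1, std::min(aValue, aMax));
}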
Comment 28•6 years ago
The patch works as I would expect. However, I have a few comments and suggestions that I think are valid:
- The keepalive pref settings are essentially global to all imap connections. But the 3 parameters are obtained and stored twice for every connection (typically up to 5 connections per account), in the main nsImapProtocol object and in the new nsImapTransportEventSink object, both called m_TCPKeepalive. Since the 3 parameters are actually common to all connection and sink objects, I think just a single global/static instance of struct TCPKeepalive could suffice.
I modified the patch to see if it would work, and it does. So the two m_TCPKeepalive variables are replaced with a single gTCPKeepalive that is initialized in nsImapProtocol::GlobalInitialization(). See attached diff. I just commented out the m_TCPKeepalive's and replaced them with the static gTCPKeepalive.
- Should the keepalive be turned on by default? I think this may surprise some users. Possibly it should be disabled by default, allowing users who need it to switch it on.
- Changes to the keepalive parameters require a tb restart except, I think, for connections made while running. Is this OK? Seems to be the case for most low-level prefs.
- For maximum flexibility, keepalive should ideally be independently configurable on an account (server) basis. The parameters could be added to the "Advanced" server settings page. (Of course, my suggestion for common var storage in item 1 would no longer apply.)
- I still need to check the effect of keepalive on other problems like getting new mail after resume, loss of connection during IDLE etc. and see if this fixes any old bugs.
- I haven't tested this on windows or OSX, just linux.
Comment 29•6 years ago
I had left the keepalive initialization in the nsImapProtocol constructor. Removed it with an #if 0 ... #endif block. Still works OK.
Comment 30•6 years ago
Re per-server: yes, that would be best in case we need/want UI for it; it's rather difficult otherwise. With a per-server pref it can be tucked away under the Advanced settings of the server.
Comment 31 (Assignee)•6 years ago
I modified the patch to see if it would work, and it does.
So the two m_TCPKeepalive variables are replaced with a single gTCPKeepalive that is initialized in
nsImapProtocol::GlobalInitialization()
This will now require a restart of TB in order for any change to these preferences to get detected and applied.
When these preferences were read in the nsImapProtocol constructor, they applied automatically to every new connection
to the IMAP server.
And my experience shows TB creates new server connections pretty often on user actions like navigating through
IMAP folders in the UI.
Should the keepalive be turned on by default?
TCP keepalives for HTTP are enabled by default, so if there were any problem with them on any platform it
would have been noticed long ago.
Also, it is much better to terminate stalled connections now than to let them linger around, since they
can only cause problems, for example silently breaking new email notifications.
Comment 32 (Assignee)•6 years ago
BTW. There is no UI to configure TCP keepalives for HTTP (other than about:config, of course).
Comment 33•6 years ago
I'm watching this from far away. When you're done, please request review.
Comment 34•6 years ago
This will now require a restart of TB in order for any change to these preferences to get detected and applied.
MSS, yes, the way I changed it, even a new connection got the original KA prefs. I commented out reading the prefs in GlobalInitialization() and put it back into the nsImapProtocol constructor, while still using the static var. Now it again allows an on-the-fly KA pref change to be applied to a new connection w/o a restart. Anyhow, just an idea to reduce some duplicated storage vars for each connection.
I set my KA time at 30s and, with 5 connections all in IDLE state, all except the INBOX connection still time out and go away after a while (~30m) with the KA still happening. This is because the imap server decides that the connections have been quiet for long enough that it disconnects (it doesn't care about the KA segments). This is a problem I have been working on: re Bug 468490 (Tb makes no attempt to keep alive an IDLE connection as recommended in the IDLE RFC.)
The INBOX connection stays connected because the check for new mail occurs on it every 10m. If I block the KAs with ip6tables, after the retries occur and the connection RSTs, tb doesn't attempt to re-establish an INBOX connection until the next 10m event occurs. So the KAs don't seem to affect this either.
Now, with 5 IDLE connections, including INBOX, I hibernated for over 30m. When I resumed, all the connections were gone (FIN'd or RST'd as seen in wireshark, and no connections listed via lsof -- it appears that tb actually closes the connections on hibernate, per wireshark). Since there were no connections, the KA was, of course, not occurring at resume. But after a while the 10m "check for new mail" occurred and created two connections (INBOX and the folder I had selected). So I don't see that KA had any effect on this. The same thing would have happened with or without the KA. (After the 2 new connections were created after resume and went IDLE, the KAs started again as expected.)
So, other than maybe helping with connection drops due to firewalls or routers, I don't see that the keepalive feature fixes any other outstanding issues. However, it may expose hooks into the stack that are not available currently to help with these problems.
(In reply to Maciej S. Szmigiero from comment #3)
With TCP keepalives enabled, when a system resumes from suspend or hibernation, Thunderbird notices almost immediately
that its IMAP connections are no longer valid and reconnects.
Previously, it took a good few minutes to do so.
MSS, I wonder why you see an improvement when using the KAs? You said below that you set the KA time to 1m. What time do you have the "check for new mail" timer set to?
Comment 35 (Assignee)•6 years ago
I commented out reading the prefs in GlobalInitialization() and put it back into the nsImapProtocol constructor,
while still using the static var.
Will nsImapProtocol always be constructed on the main UI (or some other, always-the-same) thread?
If not, these static variables may need locking (or changing into atomic ones).
I hibernated for over 30m. When I resumed, all the connections were gone
Let me guess - you were connected using WiFi or some other connection that
loses IP addresses temporarily during a suspend - resume cycle?
Then the network watcher in TB will terminate all IMAP connections anyway.
Note that this won't happen if you are connected via cable - even when using DHCP -
as these connections typically survive a suspend - resume cycle intact
(I can't speak for every distro here, however - mine is Gentoo with netifrc + dhcpcd).
AFAIK, TB on Linux can't otherwise detect that the system has been resumed from suspend.
MSS, I wonder why you see an improvement when using the KAs?
What time do you have the "check for new mail" timer set to?
I personally use a keepalive time of 1 minute (with a slightly larger inter-probe interval of 5 seconds).
The patch uses 10 minutes by default to be consistent with what HTTP uses by default
(the "network.http.tcp_keepalive.long_lived_idle_time" preference).
See also note [1] below.
My biff ("check for new mail") interval is 15 minutes, as I have a large number of IMAP folders and
don't want to hammer the IMAP server with frequent polls
(at the same time I want them to be refreshed from time to time).
With keepalives enabled, after a "suspend, then kill all connections on the IMAP server,
then wait a few minutes for their records to get cleaned, then resume" cycle, TB IMAP connections
are dropped in under a minute, as they are supposed to be.
With keepalives disabled, after the above cycle they are still there; they are only cleaned after
biff time arrives and then a 100-second IMAP response timer expires (probably the "mailnews.tcptimeout" setting).
You are right that broken IMAP connections don't seem to be reconnected until the biff time arrives or
the user performs some UI action, like selecting a different IMAP folder (see also note [2] below).
However, this is a different problem from detecting their brokenness (which is what the functionality
provided by this patch is about).
Ultimately, in order for TB to reconnect a broken connection it must first learn that it is actually broken.
On Linux, you can see current TB IMAP connections, together with their remaining keepalive timers, by
issuing the following command (assuming your IMAP server listens at port 993):
ss --tcp -pno 'dport = 993'
This will print lines like:
"ESTAB 0 0 1.2.3.4:12345 5.6.7.8:993 users:(("thunderbird",pid=6789,fd=1234)) timer:(keepalive,55sec,0)"
[1]: It would be great if the Linux kernel timed the TCP keepalive time against the suspend-ticking BOOTTIME clock
instead of raw jiffies, just like, for example, in-kernel IPSec security associations are currently timed.
Then the keepalive time of 10 minutes would also trigger almost immediately after a longer sleep.
Unfortunately, looking at the relevant kernel TCP code, it looks like the jiffies-based timing is integrated
quite deeply, so this isn't that easy to change.
[2]: Making the biff timer use the BOOTTIME clock would also improve things somewhat, as it would trigger
an immediate mail check after every longer suspend.
This also may need some significant work - I wrote about this a week ago in bug 1359403.
Comment 36•6 years ago
(In reply to Maciej S. Szmigiero from comment #35)
I commented out reading the prefs in GlobalInitialization() and put it back into the nsImapProtocol constructor,
while still using the static var.
Will nsImapProtocol always be constructed on the main UI (or some other, always-the-same) thread?
If not, these static variables may need locking (or changing into atomic ones).
I think it is always constructed on the UI thread, but I'm not 100% sure. Seems like you could also have a problem if you wanted to change all KA params but a new thread/connection got created after only one was changed in the config editor. Couldn't this occur with either method? (Maybe that's why you say "changing into atomic ones"?)
I hibernated for over 30m. When I resumed, all the connections were gone
Let me guess - you were connected using WiFi or some other connection that
loses IP addresses temporarily during a suspend - resume cycle?
Then the network watcher in TB will terminate all IMAP connections anyway.
I'm not familiar with the "network watcher in TB". Might this be somewhere in the mozilla "netwerk" code?
Note that this won't happen if you are connected via cable - even when using DHCP -
as these connections typically survive a suspend - resume cycle intact
(I can't speak for every distro here, however - mine is Gentoo with netifrc + dhcpcd).
Well, you are right; I currently only have a wifi connection to the outside world (the wifi router connects to a cable modem). However, the only way I have tested this so far is using a local dovecot imap server that connects via localhost, so it's on the same computer as tb and no wifi or cables are involved. Would that also cause the imap connection to be dropped on hibernate? (It appears that tb is initiating the disconnects with IMAP close commands.)
AFAIK, TB on Linux can't otherwise detect that the system has been resumed from suspend.
MSS, I wonder why you see an improvement when using the KAs?
What time do you have the "check for new mail" timer set to?
I personally use a keepalive time of 1 minute (with a slightly larger inter-probe interval of 5 seconds).
The patch uses 10 minutes by default to be consistent with what HTTP uses by default
(the "network.http.tcp_keepalive.long_lived_idle_time" preference).
See also note [1] below.
My biff ("check for new mail") interval is 15 minutes, as I have a large number of IMAP folders and
don't want to hammer the IMAP server with frequent polls
(at the same time I want them to be refreshed from time to time).
With keepalives enabled, after a "suspend, then kill all connections on the IMAP server,
then wait a few minutes for their records to get cleaned, then resume" cycle, TB IMAP connections
are dropped in under a minute, as they are supposed to be.
I don't understand what you are describing here. Why are you killing connections on the IMAP server? What does "cycle TB IMAP connections" mean? Are you suspending the IMAP server or the box running tb?
With keepalives disabled, after the above cycle they are still there; they are only cleaned after
biff time arrives and then a 100-second IMAP response timer expires (probably the "mailnews.tcptimeout" setting).
After I hibernate and resume, I don't see any KA packets sent (as I expect, since the connections are gone). However, I do notice that the re-connect hasn't taken 10m (my biff time) each time I have checked (2-3 times), but maybe more like 100s. I need to review what the tcptimeout 100s time is all about. Maybe that's what is bringing it back.
You are right that broken IMAP connections don't seem to be reconnected until the biff time arrives or
the user performs some UI action, like selecting a different IMAP folder (see also note [2] below).
However, this is a different problem from detecting their brokenness (which is what the functionality
provided by this patch is about).
Ultimately, in order for TB to reconnect a broken connection it must first learn that it is actually broken.
On Linux, you can see current TB IMAP connections, together with their remaining keepalive timers, by
issuing the following command (assuming your IMAP server listens at port 993):
ss --tcp -pno 'dport = 993'
My dovecot server is on port 130. I use lsof for this; however, it doesn't seem to support the keepalive stat. Didn't know about ss, thanks!
This will print lines like:
"ESTAB 0 0 1.2.3.4:12345 5.6.7.8:993 users:(("thunderbird",pid=6789,fd=1234)) timer:(keepalive,55sec,0)"
[1]: It would be great if the Linux kernel timed the TCP keepalive time against the suspend-ticking BOOTTIME clock
instead of raw jiffies, just like, for example, in-kernel IPSec security associations are currently timed.
Then the keepalive time of 10 minutes would also trigger almost immediately after a longer sleep.
Unfortunately, looking at the relevant kernel TCP code, it looks like the jiffies-based timing is integrated
quite deeply, so this isn't that easy to change.
[2]: Making the biff timer use the BOOTTIME clock would also improve things somewhat, as it would trigger
an immediate mail check after every longer suspend.
This also may need some significant work - I wrote about this a week ago in bug 1359403.
Comment 37•6 years ago
Not sure if it proves anything, but I ssh'd to my wifi router (wirelessly, not via ethernet) and hibernated linux (fedora). Let it sit for a few minutes and resumed. When it came back, the connection was still there with a perfectly functional ssh shell.
I always just hibernate (to disk) rather than suspend to ram. Do you think that could make a difference?
Comment 38•6 years ago
Also checked the "lo" interface. I ssh'd to localhost and hibernated/resumed and the connection was kept. Then I realized a better test would be to just "telnet localhost 130" to the imap server, log in, and do some imap commands. This worked too, before and after h/r, with no loss of connection.
So it seems that any loss of connection from tb that occurs during hibernate must be caused by tb itself.
Complete segue...
(In reply to Jorg K (GMT+1) from comment #33)
I'm watching this from far away.
Maybe the moon or mars? :-)
Comment 39 (Assignee)•6 years ago
Seems like you could also have a problem if you wanted to change all KA params
but a new thread/connection got created after only one was changed in the config editor.
Here I am not sure there is even a mechanism in Mozilla that forces a set of config
parameters to change atomically.
Maybe that's why you say "changing into atomic ones"?
Atomic variables will solve the threading issue but will do nothing for the
"changing all preferences at the same time" thing.
However, I don't think that changing one at a time can really cause issues -
HTTP also sees separate KA preference changes and nobody seems to complain about this.
I'm not familiar with the "network watcher in TB". Might this be somewhere in the mozilla "netwerk" code?
Yes, the Linux implementation is in "netwerk/system/linux/nsNotifyAddrListener_Linux.cpp".
It will send an NS_NETWORK_LINK_DATA_DOWN event when there is no non-loopback, running interface with
an IP address in the system (even momentarily).
nsIOService will convert this into an NS_IOSERVICE_GOING_OFFLINE_TOPIC ("network:offline-about-to-go-offline")
event, which will cause TB's nsMsgAccountManager to close all cached mail server connections
(via CloseCachedConnections()).
However, the only way I have tested this so far is using a local dovecot imap server that connects
via localhost, so it's on the same computer as tb and no wifi or cables are involved.
Would that also cause the imap connection to be dropped on hibernate?
The connection-closing code doesn't differentiate between local and remote mail servers, so yes,
this connection should get closed, too.
Why are you killing connections on the IMAP server?
Since normally the IMAP server will drop a connection only after it hits a 30-minute timeout,
during testing I kill the connection manually on the server so I don't have to wait half an hour for
it to time out by itself.
What does "cycle TB IMAP connections" mean?
That was supposed to mean that I am performing a cycle of operations consisting of:
- Suspending the TB machine,
- Killing all the connections that it had opened on the IMAP server,
- Waiting a few minutes for their (the connections') records to get cleaned,
- Resuming the TB machine.
After completing the above cycle of operations, TB IMAP connections are dropped in under a minute
(with keepalives enabled).
Are you suspending the IMAP server or the box running tb?
I am suspending the machine that is running TB.
The IMAP server is running all the time.
Not sure if it proves anything, but I ssh'd to my wifi router
When it came back, the connection was still there with a perfectly functional ssh shell.
The "network watcher in TB" only closes TB's mail server connections; it doesn't affect other
programs, like an ssh client.
I always just hibernate (to disk) rather than suspend to ram. Do you think that could make a difference?
I don't think so.
Comment 40•6 years ago
Ok, I think I see what you mean. The NS_NETWORK_LINK_DATA_DOWN event occurs on linux suspend because my wifi interfaces go down. This causes the imap connections to close even though they are not connected on the wifi interface but, in my setup, on the loopback interface.
I made a temporary modification to the checkLink() function to test this:
if (prevLinkUp != mLinkUp) {
// UP/DOWN status changed, send appropriate UP/DOWN event
printf("gds: mLinkUp=%d(bool)\n", mLinkUp);
if (mLinkUp)
SendEvent(NS_NETWORK_LINK_DATA_UP);
//SendEvent(mLinkUp ? NS_NETWORK_LINK_DATA_UP : NS_NETWORK_LINK_DATA_DOWN);
}
With the "down" event inhibited, I can hibernate and resume and tb takes right off where it left off with no closed connections like before. I completely missed your point that any interface going down will cause imap disconnections even if imap is on another interface.
One more dumb question: How exactly are you "killing" your imap server connections? Are you using something like "tcpkill" program? Or, and I just looked at "man ss", the -K option of ss?
Comment 41•6 years ago
I just did this, tell me if it's valid:
- Reduced the number of cached connections for the account to 1.
- Disabled the check for new mail completely.
- With one connection active, blocked the imap server response with ip6tables.
- Selected INBOX.
- Keepalives timed out and tb sent an RST to the server.
- Tb was quiet for a long time (abt. 1h) and no re-connect occurred.
- Selected another folder; a new connection occurs and the KAs start back up.
I see no effect of the KA with the above sequence(*). Now do this:
- Same as above but enable check for new mail and set it to 60s.
- After the keepalive timed out, waited >10m and still no re-connect.
- Click from INBOX to another folder, connection occurs.
Still no effect of the KA(*). Tried again but with KAs disabled:
- Check for new mail still enabled with time at 60s.
- No keepalives, but still block the imap server response with ip6tables.
- Click from INBOX to another folder and tb waits 1m40s (with several re-transmissions) while the spinner spins.
- A new connection is established for the mailbox and the spinner stops after about 1m40s.
So without KA the connection occurs, but it is delayed 1m40s compared to with KAs.
(*) So now I do see the effect of KAs!
I guess, if you assume the imap server's responses or tb's requests are blocked and there is no network notification to tb of a disconnection (FIN or RST), then the keepalives avoid the delay I observed when clicking a folder (or that you would observe when biff checks new mail). But if the server or an intermediate node drops the connection, would it send tb a FIN or RST so tb knows the connection is gone and avoids doing the multiple retries that result in a 1m40s delay? If so, this would have the same effect as a KA, it seems.
I haven't tried suspend/resume again but, now that I sort of understand, I don't think it will be different. The KAs will just start back after resume, but only if no wireless interface exists, at least on linux. Do windows or OSX have the same issue with imap connections going away on resume when any interface is wireless?
Comment 42•6 years ago
I just checked with gdb and it appears that the nsImapProtocol constructor is only called from one place, in this thread:
(gdb) info thr
Id Target Id Frame
* 1 Thread 0x7ffff7fad740 (LWP 21931) "thunderbird" 0x00007ffff6e00ddb in poll () from /lib64/libc.so.6
<many more threads...>
And it is called from here:
Note: A "thunderbird" thread is created for each nsImapProtocol instance, one for each imap connection. But new ones are created from thread ID 1 only. Not sure, is this also considered the "UI" thread? (I don't see any other thread in the "info thr" list having the name "User Interface" or "UI" that I see mentioned so much in the code comments.)
Comment 43 (Assignee)•6 years ago
How exactly are you "killing" your imap server connections?
I just looked at "man ss", the -K option of ss?
Yes, I was using the "--kill" (or "-K") argument of ss.
After the keepalive timed out, waited >10m and still no re-connect.
As I have written above, that's another issue:
However, this is a different problem from detecting their brokenness (which is what the functionality
provided by this patch is about).
Ultimately, in order for TB to reconnect a broken connection it must first learn that it is actually broken.
Without keepalives, TB will only learn that the (idle) connection is broken after 30 minutes + the IMAP response timer value.
So without KA the connection occurs, but it is delayed 1m40s compared to with KAs.
That's the 100-second IMAP response timer (probably the "mailnews.tcptimeout" setting).
But if the server or an intermediate node drops the connection,
would it send tb a FIN or RST so tb knows the connection is gone and
avoids doing the multiple retries that result in a 1m40s delay?
The TB machine won't receive these packets if it is sleeping (suspended or hibernated) and/or
a NAT router or a stateful firewall that is between the client and the server has dropped the connection record.
Do windows or OSX have the same issue with imap connections going away on resume when any interface is wireless?
It seems that at least Windows has special suspend/resume notifiers in TB.
I don't know about their exact behaviour, however.
I just checked with gdb and it appears that the nsImapProtocol constructor is only called from one place, in this thread:
But new ones are created from thread ID 1 only.
Not sure, is this also considered the "UI" thread?
If you aren't 100% sure, then it is much cheaper to add potentially unnecessary synchronization mechanisms
(like a mutex or atomic variables) than to risk having to track down very subtle threading races.
Comment 44•6 years ago
MSS, will the attached diff do what you suggest about locking? It uses mon(mLock) before the gTCPKeepalive elements are set in the constructor. The lock is released when mon goes out of scope. This is done at several other places in the file.
I am also skipping setting the idle time and interval if KA is disabled.
I don't know of a way to make the config editor set multiple parameters at the same time.
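For readers unfamiliar with the pattern, "mon(mLock)" is an RAII lock guard; a minimal sketch assuming Mozilla's MutexAutoLock (the exact guard type and the declarations below are assumptions mirroring the diff under discussion):

#include <cstdint>
#include "mozilla/Mutex.h"

// Assumed names from the diff:
extern mozilla::Mutex mLock;
struct TCPKeepalive { bool enabled; int32_t idleTimeS; int32_t retryIntervalS; };
extern TCPKeepalive gTCPKeepalive;

void StoreKeepalivePrefs(bool aEnabled, int32_t aIdleTimeS, int32_t aRetryIntervalS) {
  mozilla::MutexAutoLock mon(mLock);  // acquires mLock
  gTCPKeepalive.enabled = aEnabled;
  gTCPKeepalive.idleTimeS = aIdleTimeS;
  gTCPKeepalive.retryIntervalS = aRetryIntervalS;
}  // mon goes out of scope here, releasing mLock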
Comment 45•6 years ago
(In reply to Maciej S. Szmigiero from comment #43)
After the keepalive timed out, waited >10m and still no re-connect.
As I have written above, that's another issue:
However, this is a different problem from detecting their brokenness (which is what the functionality
provided by this patch is about).
Ultimately, in order for TB to reconnect a broken connection it must first learn that it is actually broken.
Without keepalives, TB will only learn that the (idle) connection is broken after 30 minutes + the IMAP response timer value.
Right: without KAs, with everything idle and an undetected broken connection, if the user clicks on another folder and tb tries to use the broken connection, there will be a 100s wait (with spinner) before the connection is restored. Yes, that is the 100s tcptimeout value doing this. I set it to 50 and it took 50 sec instead.
I see that this can be helpful to avoid the 100s wait when the user does an action and the cached connection is broken.
I was hoping that the KAs would somehow magically bring back an IDLE connection that is broken with no user action. Even after 30 minutes, tb currently does nothing to actually refresh the IDLE connections like it should. It should send DONE and go back into IDLE at the 29m point, right before most servers close the connection. This is Bug 468490 that I mentioned before. (There is a 29m timeout for each connection but it doesn't actually do this.)
So without KA the connection occurs, but it is delayed 1m40s compared to with KAs.
That's the 100-second IMAP response timer (probably the "mailnews.tcptimeout" setting).
But if the server or an intermediate node drops the connection,
would it send tb a FIN or RST so tb knows the connection is gone and
avoids doing the multiple retries that result in a 1m40s delay?
The TB machine won't receive these packets if it is sleeping (suspended or hibernated) and/or
a NAT router or a stateful firewall that is between the client and the server has dropped the connection record.
Of course, when sleeping tb won't receive anything. I was asking if the NAT router or stateful firewall will send a FIN or RST to the connection endpoints (tb and the imap server) if it decides to drop a connection. Or does it just delete the connection and stop passing data in both directions for that connection? Probably not relevant, just curious. (My guess is that it runs at the ip layer and doesn't have any info on the tcp layer that sends ACK, FIN, RST, etc.)
Do windows or OSX have the same issue with imap connections going away on resume when any interface is wireless?
It seems that at least Windows has special suspend/resume notifiers in TB.
I don't know about their exact behaviour, however.
This reminds me of Bug 758848 that I haven't gotten back to in a while (my comment is at Bug 758848 comment 11). It adds the notifier for linux. I attached a (very) preliminary working patch, but my skills in this area are also very lacking.
I just checked with gdb and it appears that the nsImapProtocol constructor is only called from one place, in this thread:
But new ones are created from thread ID 1 only.
Not sure, is this also considered the "UI" thread?
If you aren't 100% sure, then it is much cheaper to add potentially unnecessary synchronization mechanisms
(like a mutex or atomic variables) than to risk having to track down very subtle threading races.
I'm just not sure if thread 1 is the UI thread. But I am sure that the nsImapProtocol constructor is only called from thread 1. Anyhow, I went ahead and added the locks like you suggested. See the attached diff above.
Anyhow, I think I now understand the utility of the KAs. It's probably OK to enable them by default, and I might make the following suggestions for default settings:
enabled: true
idle time: 100s. This will have a similar effect to tcptimeout, but it will occur in the background and usually avoid the spinner (the moz default is 600s, probably too long).
retry interval: 5s. Seems more appropriate than the default 1s from the mozilla parameter.
Also, keeping it common to all accounts is probably OK.
Comment 46 (Assignee)•6 years ago
Will the attached diff do what you suggest about locking?
It uses mon(mLock) before the gTCPKeepalive elements are set in the constructor.
Reads of these variables seem to be done without holding any lock (unless it is taken somewhere else);
also, using a shared lock that protects other things as well will unnecessarily serialize those things with
keepalive variable accesses.
I would use std::atomic<Type> variables with relaxed memory order instead.
On x86 they compile to just an ordinary memory store and load.
Just make sure to read a particular preference into a temporary variable, then do a
global_variable.store(temp_variable, std::memory_order_relaxed) to store it,
while using temp_variable = global_variable.load(std::memory_order_relaxed) for reading
(don't use temp_variable = global_variable, as this will emit an unnecessary barrier because it
uses the highest memory order possible).
I see STL atomics being used already in the Mozilla code, so I guess they are OK to use.
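Putting that advice together, a sketch of the suggested shape (the gTCPKeepalive name follows this thread; the member and function names are assumed):

#include <atomic>
#include <cstdint>

struct TCPKeepalive {
  std::atomic<bool> enabled;
  std::atomic<int32_t> idleTimeS;
  std::atomic<int32_t> retryIntervalS;
};
static TCPKeepalive gTCPKeepalive;

// Writer (e.g. the constructor, after reading the prefs into temporaries):
void StoreKeepalivePrefs(bool aEnabled, int32_t aIdleTimeS, int32_t aRetryIntervalS) {
  gTCPKeepalive.enabled.store(aEnabled, std::memory_order_relaxed);
  gTCPKeepalive.idleTimeS.store(aIdleTimeS, std::memory_order_relaxed);
  gTCPKeepalive.retryIntervalS.store(aRetryIntervalS, std::memory_order_relaxed);
}

// Reader (socket thread): one relaxed load, then use the local copy.
bool IsKeepaliveEnabled() {
  return gTCPKeepalive.enabled.load(std::memory_order_relaxed);
}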
I was asking if the NAT router or stateful firewall will send a FIN or RST to
the connection endpoints (tb and imap server) if it decides to drop a connection.
Or does it just delete the connection and stop passing data in both directions for that connection?
It will just forget the connection record.
Comment 47•6 years ago
MSS, thanks for the info on <atomic>. I went ahead and updated the patch (with your name on it). I pulled and updated to the latest HEAD m-c/c-c versions and re-built before making the patch, so it should be OK. I think I have done what you suggested with the KA vars. I also took the liberty of setting the defaults for enabled, idle and retryInterval to true, 100 and 5, respectively, in mailnews.js.
Comment 48 (Assignee)•6 years ago
Thanks Gene.
It's almost there:
1)
+nsresult
+nsImapTransportEventSink::ApplyTCPKeepalive(nsISocketTransport *aTransport)
+{
+  nsresult rv;
+  if (gTCPKeepalive.enabled) {
This value should be loaded into a temporary variable using something like
kaEnabled = gTCPKeepalive.enabled.load(std::memory_order_relaxed);
then this temporary (kaEnabled) should be used everywhere in this function instead
of direct gTCPKeepalive.enabled accesses (which can emit an unnecessary barrier).
2) The patch now also deletes a blank line in nsImapProtocol.h; was this intentional?
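A sketch of how the function might look with that remark applied, building on the atomic struct sketched after comment 46 (hedged reconstruction; the -1 fallback and exact flow are assumptions, only ApplyTCPKeepalive, kaEnabled and gTCPKeepalive come from this thread):

nsresult
nsImapTransportEventSink::ApplyTCPKeepalive(nsISocketTransport *aTransport)
{
  // One relaxed load per pref; the temporaries are used from here on.
  bool kaEnabled = gTCPKeepalive.enabled.load(std::memory_order_relaxed);
  nsresult rv = aTransport->SetKeepaliveEnabled(kaEnabled);
  NS_ENSURE_SUCCESS(rv, rv);
  if (!kaEnabled)
    return NS_OK;

  int32_t idleTimeS = gTCPKeepalive.idleTimeS.load(std::memory_order_relaxed);
  int32_t retryIntervalS =
      gTCPKeepalive.retryIntervalS.load(std::memory_order_relaxed);
  if (idleTimeS < 0 || retryIntervalS < 0)
    return NS_OK;  // assumed: -1 keeps the Necko defaults
  return aTransport->SetKeepaliveVals(idleTimeS, retryIntervalS);
}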
Comment 49•6 years ago
I wondered about that too. Also, possibly where it is passed to a function:
rv = aTransport->SetKeepaliveEnabled(gTCPKeepalive.enabled); <---- ???
I also noticed the blank line change in the diff after I sent the patch. I'll put it back in the next patch.
Thanks again!
Comment 50 (Assignee)•6 years ago
I wondered about that too. Also, possibly where it is passed to a function:
rv = aTransport->SetKeepaliveEnabled(gTCPKeepalive.enabled); <---- ???
Yes, everywhere in that function (ApplyTCPKeepalive) it should refer to a temporary variable.
Comment 51•6 years ago
Comment 52•6 years ago
Yes, everywhere in that function (ApplyTCPKeepalive) it should refer to a temporary variable.
Hmm, can you explain why? Is gTCPKeepalive.enabled going to change? What does "emit an unnecessary barrier" mean? And what's the .load(std::memory_order_relaxed) about? This is part of educating the reviewer. Yes, you might wish for a more qualified one, but there isn't one, so it becomes your job to qualify him ;-)
Comment 53•6 years ago
I just learned this from MSS in comment 46 and googled. It just ensures that a static var is not read in one thread while it is being set in another thread, so you get consistent data. I think this is only actually possible when the vars are not aligned on a word boundary. MSS says if you use the "relaxed" attribute it doesn't really add any code. Without the relaxed attribute there is a "mutex" wait regardless. But I could be wrong...
Comment 54•6 years ago
MSS, in the GlobalInit function, probably these should use .store()?
+ gTCPKeepalive.enabled = false;
+ gTCPKeepalive.idleTimeS = -1;
+ gTCPKeepalive.retryIntervalS = -1;
Comment 55•6 years ago
This incorporates the comments since the v8 patch.
Comment 56 (Assignee)•6 years ago
Hmm, can you explain why? Is gTCPKeepalive.enabled going to change?
Every time an atomic variable is accessed, (at least) a memory read is needed.
If the value is copied into a temporary variable, then the compiler can get away
with doing just one memory read and then keeping the value in a register
(if it has one currently available).
What does "emit an unnecessary barrier" mean?
See the two loads from a std::atomic<bool> flag, just like gTCPKeepalive.enabled is, here:
https://godbolt.org/z/SRosde
The first one uses the default, strongest memory order (memory_order_seq_cst),
the second one uses the memory_order_relaxed order.
You can see two extra "dmb" ("Data Memory Barrier") instructions in the function
that does its load using the default (memory_order_seq_cst) memory order.
This is on ARM; on x86 there is actually no difference (for loads; for stores there
is one) because x86 has an unusually strict memory order of its own for backward
compatibility reasons.
And what's the .load(std::memory_order_relaxed) about?
memory_order_relaxed means that the load is just thread-safe (no partial results possible,
for example), without giving any of the extra ordering guarantees that higher memory orders do.
You can read more about this, for example, here:
https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync
MSS says if you use the "relaxed" attribute it doesn't really add any code.
Yes, on x86 "relaxed" operations are basically just plain memory operations;
you can see it here:
https://godbolt.org/z/Zvvkei
For comparison, there is also a "setValDefault()" function there that uses the default
(memory_order_seq_cst) memory order.
You can see that even on x86 this memory order requires a memory barrier instruction
("mfence") to be emitted.
In the GlobalInit function, probably these should use .store()?
Yes, but since this is executed just once, not much is lost by using the highest memory order.
This incorporates the comments since the v8 patch.
Looks good to me.
Comment 57•6 years ago
Comment 58 (Assignee)•6 years ago
Maciej, our process is that if someone other than the original author makes changes to a patch,
we get the original author to approve those changes by adding feedback+.
Thanks for the reminder, hope it is OK now.
Comment 59•6 years ago
Sorry for the delay here. Lots of other things to do. I'll get to it.
Comment 60•6 years ago
Comment 61 (Assignee)•6 years ago
Can you explain these values? Initially we had -1 here.
Gene suggested these values at the end of comment 45.
The 100-second idle time is for consistency with the "mailnews.tcptimeout" preference,
while the 5-second keepalive retry interval was adopted because the default 1-second one
from network.tcp.keepalive.* seems a little bit too aggressive.
A retry interval of 5 seconds will disconnect a broken connection in 20 seconds (plus the idle time),
which seems like a good compromise between being resilient to transient line conditions and
not delaying the detection unnecessarily.
Comment 62•6 years ago
OK. If you address the four nits, then we're ready to go.
Comment 63 (Assignee)•6 years ago
Comment 64•6 years ago
Comment 65 (Assignee)•6 years ago
Wasn't that conditional on bVal before?
Yes and no - it wasn't in the original patch, but Gene introduced it in the version from
comment 44.
I thought your comment was about reverting this, but either way will work OK.
Comment 66•6 years ago
Comment 67•6 years ago
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/58389d4c917e
Add imlementation of TCP keeplives for IMAP connections. r=GeneSmith,jorgk
Comment 68•6 years ago
Landed with a few white-space tweaks. I also restored the if (bVal) stuff here:
https://hg.mozilla.org/comm-central/rev/58389d4c917e#l2.203
I trust that you agree.
Comment 69 (Assignee)•6 years ago
I also restored the if (bVal) stuff
I trust that you agree.
No problem.
Thank you all here for your input and cooperation!