Closed Bug 1048487 Opened 10 years ago Closed 10 years ago

[email/backend] node-net.js's end() method does not actually close the socket, leading to connection leaks until the other end closes the conn

Categories

(Firefox OS Graveyard :: Gaia::E-Mail, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(blocking-b2g:2.0+, b2g-v2.0 fixed, b2g-v2.1 fixed)

RESOLVED FIXED
2.1 S2 (15aug)
blocking-b2g 2.0+
Tracking Status
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: asuth, Assigned: mcav)

References

Details

Attachments

(1 file)

:mcav noticed this and mentioned on IRC, logging the bug to ensure it doesn't fall off our radar. (Highly unlikely given the problem, but good hygiene :) Besides obviously fixing the issue I think we want to do the following on this bug: - have the staleness unit tests assert/otherwise verify that we have actually told the FawltySocket to close! - uplift to v2.0. I was going to fold this into the save correctness thing uplift, but I'm now thinking that's too sketchy. It's probably also worth doing the following on a separate bug for uplift reasons: - have net-main.js and all of its friends have a 'default' case that throws if they see a message they don't understand. - ensure our router logic has a try/catch mechanism in place with a hook or something so that if any of our router logic throws that we end up with the GELAM test in question failing. This may currently fall into an 'onerror' black hole, unsure. Speculatively assigning to :mcav because he's ahead of schedule on stuff. :)
See Also: → 1052178
Target Milestone: --- → 2.1 S2 (15aug)
Comment on attachment 8471183 [details] [review] Link to Github pull-request: https://github.com/mozilla-b2g/gaia-email-libs-and-more/pull/327 r=asuth This will be really handy for testing cronsync on a 60-second interval with servers like yahoo.com where they don't tell you it's a connection limit and we end up thinking it's a bad password.
Attachment #8471183 - Flags: review?(bugmail) → review+
Comment on attachment 8471183 [details] [review] Link to Github pull-request: https://github.com/mozilla-b2g/gaia-email-libs-and-more/pull/327 Requesting 2.0 approval; this is a tiny patch that mitigates situations where Gaia Email could eat too many connections (because we failed to close them properly before), causing their email server to reject new connections. [Approval Request Comment] [Bug caused by] (feature/regressing bug #): none [User impact] if declined: Email may hold more connections open than necessary, causing users to potentially run into their email providers' connection limits. [Testing completed]: Unit test added. [Risk to taking this patch] (and alternatives if risky): Low. [String changes made]: None.
Attachment #8471183 - Flags: approval-gaia-v2.0?
Is this a 2.0 regression or a long standing issue we are trying to fix here? If its the latter, I would prefer this ride the trains however low risk given we do not want any unwanted code change in 2.0 at this point else please nominate this as a blocker.
Flags: needinfo?(m)
[Blocking Requested - why for this release]: This is a longstanding issue that has been regressed by patches uplifted to v2.0 to improve the sync problem since we now aggressively close idle connections. We could back out those uplifts but that would make periodic sync more unreliable and would generally be silly given the simplicity of this fix. It is a very important part of fixing email's periodic sync problems which we are already treating as 2.0 blockers. Bug 1018828 was our bug for tracking this overall, but it became a meta bug, so we sharded the 2.0 blocker status to the actual implementation fixes. Bug 1042221 is the last remaining blocker for this right now. For hygiene reasons I thought it better to have this be its own bug with proper uplift rather than cramming it into that (imminent) uplift aggregate. If it helps the blocking analysis, many servers implement connection limits. Because the baseline IMAP standard has no concept of saying "hey, you're trying to make too many connections", connection limits will usually manifest as authentication errors. When the UI encounters an authentication error it pops up an effectively modal dialog prompting the user to re-enter their password which is arguably quite confusing to the user and pretty horrid UX. (We have bugs on improving some of that, but they are less simple and involve strings, etc. etc.)
Blocks: 1018828, 1044179
blocking-b2g: --- → 2.0?
:asuth, thanks for the input in commen t#6. Can you also help shed some light on how risky this change may be ? Also, any risk of plausible fallouts ? I am also adding qawanted here so this gets some testing even if we end up blocking this. Could you give some guidance to QA in areas they should testing ?
Keywords: qawanted
No risk. The relevant change in its entirety is this: NetSocket.prototype.end = function() { if (this.destroyed) return; - this._sendMessage('end'); + this._sendMessage('close'); this.destroyed = true; this._unregisterWithRouter(); }; The problem is 'end' is the wrong string and results in us not actually closing the TCP connection because the other side is expecting 'close'. The new control flow path from correctly sending 'close' is simple and safe. Even if it was not safe and failed and threw an exception, we: - would be no worse off than now where we are leaking an exception - the exception would do no harm to the app because this all happens as part of a message event; throwing an exception will simply bubble up to the native call and be logged to the window's "onerror" handler. (Versus if we had some kind of important stuff happening on the stack that could be interrupted by a propagated exception.) There's 2 things that can be tested/verified: 1) The existing basic smoke-test that the email app works and that the uplift didn't go wrong and break everything. 2) Verifying that we actually close the connection by running "adb shell netstat" and ensuring that we are closing connections. An open connection looks like: tcp 0 0 192.168.1.17:40570 98.138.227.202:993 ESTABLISHED A closed connection looks like: tcp 0 0 192.168.1.17:40569 98.138.227.202:993 TIME_WAIT For IMAP the foreign address with a port of 993 is how you know it's IMAP. The actual steps for this would be: - Use https://wiki.mozilla.org/Gaia/Email/SecretDebugMode to set the sync interval to 60 seconds. Don't set it to 20 seconds or we'll never stale-kill the connection. - Launch the email app / don't close it. This ensures that the email app doesn't close itself when periodic sync completes. - Watch that we only have at most one ESTABLISHED connection at a time, and sometimes (when the stale-killing) happens, we have none. TIME_WAIT means we closed the connection and it's just a TCP thing. (see http://superuser.com/questions/173535/what-are-close-wait-and-time-wait-states) if interested.
(In reply to Andrew Sutherland [:asuth] from comment #8) > - would be no worse off than now where we are leaking an exception er, leaking a "connection".
(In reply to Andrew Sutherland [:asuth] from comment #8) > There's 2 things that can be tested/verified: > 1) The existing basic smoke-test that the email app works and that the > uplift didn't go wrong and break everything. I checked our smoke test team's Flame 2.1 report today and yesterday and found no email related bugs. > 2) Verifying that we actually close the connection by running "adb shell > netstat" and ensuring that we are closing connections. On today's 2.0, with email set to sync every 60 seconds, running 'adb shell netstat' I observed instances of Foreign Addresses with port 933 increase every 30 seconds or so and they're all ESTABLISHED. In other words I observed the bug reproduce. On today's 2.1 with the same email configurations, I observed at most 2 instances of Foreign Addresses with port 933, and when there are 2 instances/connections, one of them will be in TIME_WAIT state and the other will be in ESTABLISHED state. The TIME_WAIT one disappears several seconds later, and the ESTABLISHED one becomes TIME_WAIT. I successfully received an email I sent to myself within the period. Nothing seems out of ordinary. Tested with a gmail account on the following: Device: Flame 2.0 BuildID: 20140812092813 Gaia: 88e774c606520c9ce349b467866e200e8db31af4 Gecko: d0f5b392dc93 Version: 32.0 (2.0) Firmware V123 User Agent Mozilla/5.0 (Mobile; rv:32.0) Gecko/32.0 Firefox/32.0 Device: Flame 2.1 Master BuildID: 20140812125211 Gaia: 9f35fca9d818b26c06aa6b7e5c0bef25886f8f20 Gecko: 8f84f8a3a49c Version: 34.0a1 (2.1 Master) Firmware V123 User Agent Mozilla/5.0 (Mobile; rv:33.0) Gecko/33.0 Firefox/33.0
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(jmitchell)
Keywords: qawanted
Thanks Andrew and Pi Wei for the thorough attention to ensuring this change is stable! Blocking+, land away.
blocking-b2g: 2.0? → 2.0+
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(jmitchell)
Attachment #8471183 - Flags: approval-gaia-v2.0?
Flags: needinfo?(m)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: