Closed Bug 1563790 Opened 5 months ago Closed 4 months ago

Reduce BITS no progress timeout

Categories

(Toolkit :: Application Update, enhancement, P2)

Desktop
Windows
enhancement

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 69+ fixed
firefox68 - wontfix
firefox69 + fixed
firefox70 + fixed

People

(Reporter: agashlin, Assigned: agashlin)

References

Details

Attachments

(2 files)

If BITS is blocked by a transient error that should not prevent network access, we may not be failing BITS jobs fast enough in order to fall back in nsIncrementalDownload. Bug 1558070 is the best example of that, BITS can get confused with VPNs and think there is no network connection available, for unclear reasons.

The SetNoProgressTimeout method allows us to lower the timeout from the default of 14 days to something more aggressive, say 5 minutes.

Some issues to consider:

  1. Testing is required, but it's possible that the timeout only applies at the time when the job resumes and hits a transient error, so this might not actually solve the problem if BITS thinks there is no network available and thus doesn't retry.
  2. On the other hand, if the timeout is checked before trying to resume, then if the network (or download server) really isn't available temporarily, we'll fail out of BITS quickly, rather than having a job that will auto-resume when the network comes back.
  3. System state such as Game Mode also stops the transfer, if this lasts too long then the job will also fail (maybe? testing needed) and we'll fall back, which undoes a little of the benefit. Firefox would have to be running at the time so this shouldn't be surprising, though.
  4. If the computer has been off, or the user logged out, the job may fail when logging in (again, need to test to find out if this is the case).
  5. There will be some complication for the update agent, which should be using background jobs that should be able to wait until the network is available. This can be handled by making the timeout longer for background jobs (BITS "normal" priority), I think.

If it does actually fix the issue, this is much simpler than having Firefox try to figure out if a job isn't making progress. BITS offers this functionality and we should use it.

For comparison, Omaha sets the timeout to 0 for foreground downloads (Firefox currently uses foreground) and 5 minutes for background ("normal" priority). We may want to also set the timeout very low for foreground downloads; a network should be available if Firefox is in use (barring Work Offline), so if BITS can't make any progress then it's likely a problem with BITS.

This will be very simple to implement, but as mentioned above a good deal of testing is required to evaluate if it even solves the problem, and what the costs will be, before a shipping decision can be made.

Assignee: nobody → agashlin

I went ahead and merged the latest m-c to oak so the builds with your patch will have an update. The builds with your patch are located here
https://archive.mozilla.org/pub/firefox/nightly/2019/07/2019-07-23-21-01-46-oak/

This seems to be working, with the current patch, which turns the foreground job "no progress timeout" down to 0, according to the testing in bug 1558070 comment 32.

I've confirmed this as well with my own testing, the easiest way I found to generate a transient error is to set up Notepad to be recognized by "Game Mode".

Testing procedure:

  1. Set Game Mode on Notepad
    1. open Notepad
    2. While it is focused Win+G (may need to hit Win again after this if in a VM as the host Windows will also get the Win+G and thus eats the Win keyup)
    3. under the settings gear check "Remember this is a game"
    4. close the Game Mode interface, leave Notepad running for later
  2. Disable network (to prevent updating early)
  3. Run the fresh browser once to set app.update.log true, browser.shell.checkDefaultBrowser false, Close
  4. Enable network
  5. Run firefox.exe -jsconsole
  6. Quickly switch focus to Notepad before Firefox has gotten a chance to start the download

I get the following log, showing that it falls back properly:

AUS:SVC Downloader:downloadUpdate - Starting BITS download with url: https://archive.mozilla.org/pub/firefox/nightly/partials/2019/07/2019-07-24-14-48-51-oak/firefox-oak-70.0a1-win64-en-US-20190723210146-20190724144851.partial.mar, updateDir: C:\ProgramData\Mozilla\updates\DA9A58D799390927\updates\0, filename: update.mar
AUS:SVC Downloader:downloadUpdate - BITS download running. BITS ID: {4661701A-D9C1-4C47-9835-A9FDD3EF724B}
AUS:SVC Downloader:onProgress - progress: 0/-1
AUS:SVC Downloader:onStopRequest - downloader: BITS, status: 2147500037
AUS:SVC Downloader: cancel
AUS:SVC Downloader:onStopRequest - status: 2147500037, current fail: 0, max fail: 10, retryTimeout: 2000
AUS:SVC Downloader:onStopRequest - non-verification failure
AUS:SVC getStatusTextFromCode - transfer error: Failed (unknown reason), default code: 2152398849
AUS:SVC Downloader:onStopRequest - setting state to: download-failed
AUS:SVC Downloader:onStopRequest - BITS download failed. Falling back to nsIIncrementalDownload

Without Game Mode set it uses BITS as usual.


After some discussion with Robert I think it makes more sense to set the timeout to a more conservative 1 hour, this will allow recovery from actually transient conditions. But I think this should be turned down to 5 or 0 seconds when the UI is shown, I think I can add that to the code that sets the monitor poll rate here and here, though there is an irritatingly large amount of plumbing between there and BITS.

Attachment #9080152 - Attachment description: Bug 1563790 - Quickly fail on BITS transient error → Bug 1563790 - Part 1: Expose and require No Progress Timeout

Test report, with Windows 10 Home in a VM:

The above patch seems to work well in my local builds. I can bring up the About window and it will fall back in 5 seconds. If I close About before the 5 seconds (and switch back to Game Mode Notepad), it will wait the full time (1 minute in my tests, 1 hour in the patch above) before falling back. The bits_clients tests are expanded to cover this interface.

I can also stimulate a transient failure by pulling the (virtual) plug on the network during the download. That works, too, though of course then fallback doesn't succeed.

If I start the download from About, then disconnect, it'll wait patiently, but as soon as I open About it will give up as it has already passed the 5 second timeout.

A very slow download (via Netlimiter) does not fail, even if the download progress doesn't change before the timeout, because it never enters the transient error state.

Being powered off for longer than the timeout is handled gracefully by BITS, the download still resumes when the system comes back up.

Sleeping the host system will cause failure, but I think the guest just sees that as the network going down.

There are a browser chrome test failures on oak: https://treeherder.mozilla.org/#/jobs?repo=oak&selectedJob=258253392&revision=d9d914ed48fac6fde75e4a1aeda0536fddede06d

all like this:
TEST-UNEXPECTED-FAIL | toolkit/mozapps/update/tests/browser/browser_aboutDialog_bc_downloading.js | A promise chain failed to handle a rejection: this._request is null - stack: Downloader__maybeStopActiveNotifications@resource://gre/modules/UpdateService.jsm:4802:7

I can easily reproduce this with these tests locally, though I don't get the error in normal operation. I'm looking into the test, but it seems like I need to not assume this._request will be around after the first call.

New patch up. I'll push this up to oak when the last few jobs finish, the local build is working for me.

Hi Adam, I'm wondering about whether this patch is something we'd want to consider uplifting to m-r for the 68.0.2 dot release next week or let it ride with 69 instead? My gut is telling me that some bake time would probably be a good idea? :)

Hi Ryan, I don't think this is a good candidate for the dot release.

Flags: needinfo?(agashlin)

Thanks for the confirmation. We'll aim for 69/68.1esr in that case.

Pushed by agashlin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ad6ae5e8a4f0
Part 1: Expose and require No Progress Timeout r=bytesized,rstrong
https://hg.mozilla.org/integration/autoland/rev/971f34603d3b
Part 2: Lower No Progress Timeout when UI is displayed r=bytesized,rstrong
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

Hi Adam, is this ready for approval requests?

Flags: needinfo?(agashlin)

Comment on attachment 9080152 [details]
Bug 1563790 - Part 1: Expose and require No Progress Timeout

Beta/Release Uplift Approval Request

  • User impact if declined: In some cases (rare VPN configurations), update downloads will be delayed for 2 weeks.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The only risk should be that we will abandon using BITS too early, falling back from a failed BITS download to the incremental downloader is well tested.
  • String changes made/needed:
Flags: needinfo?(agashlin)
Attachment #9080152 - Flags: approval-mozilla-beta?
Attachment #9080503 - Flags: approval-mozilla-beta?

Comment on attachment 9080152 [details]
Bug 1563790 - Part 1: Expose and require No Progress Timeout

Fixes a situation where users on a VPN can get stuck without updates for long periods of time. Approved for 69.0b14.

Attachment #9080152 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9080503 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9080503 - Flags: approval-mozilla-esr68?

Comment on attachment 9080152 [details]
Bug 1563790 - Part 1: Expose and require No Progress Timeout

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Enterprises may be more likely to have special network setups such as custom VPNs that will cause BITS to be retry update download but not recognize the failure is permanent.
  • User impact if declined: Affected users will be delayed from downloading updates for up to two weeks.

It is difficult to estimate how many users might be affected. I have heard of 3 reports so far on Nightly in the vein of bug 1558070, but only one of them was permanent.

  • Fix Landed on Version: 70 (nightly 2019-08-02)
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): BITS failure is expected and handled with a fallback, this patch just makes it happen within a more reasonable timeframe. Tests have been expanded to cover adjusting this parameter of the BITS job.

This could result in an increase in extra downloads, as BITS might give up more easily when it has already partially downloaded the update. But that this will only happen if there has been no progress on the download for a default of 1 hour.

  • String or UUID changes made by this patch:
Attachment #9080152 - Flags: approval-mozilla-esr68?

Comment on attachment 9080152 [details]
Bug 1563790 - Part 1: Expose and require No Progress Timeout

Approved for 68.1esr.

Attachment #9080152 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
Attachment #9080503 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
You need to log in before you can comment on or make changes to this bug.