760577 - Keep the mar file around when staging an update in the background

Assignee

Description

•

12 years ago

There is one failure scenario that the work in bug 307181 introduces:

1. An update is downloaded and staged successfully in the background.  At this point we remove the original mar file.
2. Firefox is restarted, and updater.exe tries to replace the installation dir with the staged dir.  Let's say that this fails.  At this point there is nothing that we can do, since the mar file is no longer around.
3. We need to attempt to update again when Firefox is restarted.

In order to improve this situation, we need to do two things:

1. Keep the mar file around when the update is staged successfully.
2. Teach updater to fall back to a regular update if the replace process fails.

#1 is relatively easy.  #2 is a bit trickier, since we'll know when that happens in ProcessReplaceRequest.  At that point, all of the program state is set up based on the assumption that updater is invoked to perform a replace.  I think the cleanest way to handle this is to launch updater again with a tailored command line argument which makes it perform an old school on-restart update using the mar file preserved in #1.

Please let me know if that makes sense.

(no longer active)

Assignee

Updated

•

12 years ago

Blocks: bgupdates

(no longer active)

Assignee

Updated

•

12 years ago

tracking-firefox15: --- → +

(no longer active)

Assignee

Comment 2

•

12 years ago

Attached patch Part 1: Keep the update files around when an update has been staged — Details — Splinter Review

This is the easy part.  I'm still working on the fallback path in the updater application.

Assignee: nobody → ehsan

Status: NEW → ASSIGNED

Attachment #630376 - Flags: review?(robert.bugzilla)

Robert Strong (they/them - no direct email)

Comment 3

•

12 years ago

Comment on attachment 630376 [details] [diff] [review]
Part 1: Keep the update files around when an update has been staged

Can you add a test to verify the files are left during a background update?

Attachment #630376 - Flags: review?(robert.bugzilla) → review+

(no longer active)

Assignee

Comment 4

•

12 years ago

Yeah, I'll work on a test when I finish this.

(no longer active)

Assignee

Comment 5

•

12 years ago

Attached patch WIP (obsolete) — Details — Splinter Review

This is a work in progress patch which demonstrates how the updater side of things will work.  The logic that I ended up implementing is very simple.  During the replace step, if any errors occur, we just write pending (or pending-service if the service is being used) to the update.status file, and we quit and run the callback app.  The callback notices the pending status, and launches another copy of updater.exe to perform the update without staging.

I decided against fixing the program state up in updater itself and retry it.  That makes updater.exe too complicated, and a lot harder to change in the future.  This keeps things rather simple, but I still need to test it on Windows to see if the performance is acceptable.

So far this works fine on Mac and Linux, and should pass all of the existing tests.  I expect to come up with something review-quality which has also been tested on Windows by tomorrow...

Brian R. Bondy [:bbondy]

Comment 6

•

12 years ago

If you find the performance isn't good you could just relaunch updater.exe directly again with the same command line parameters and a different update.status file.

(no longer active)

Assignee

Comment 7

•

12 years ago

(In reply to Brian R. Bondy [:bbondy] from comment #6)
> If you find the performance isn't good you could just relaunch updater.exe
> directly again with the same command line parameters and a different
> update.status file.

I would really prefer not to do that since the command line will be different, and I want to avoid reimplementing the logic in nsUpdateDriver.cpp in the updater if possible.  FWIW, so far my tests don't show a noticeable difference on a ~2yr old Windows machine (that is, relative to the amount of time that applying the update actually takes!)

(no longer active)

Assignee

Comment 8

•

12 years ago

Attached patch Part 2: Fall back to regular updates if the replacing step fails after we've staged an update — Details — Splinter Review

This version of the patch passes all of the tests on all platforms.  The previous one was failing some tests on Windows.

Attachment #630775 - Attachment is obsolete: true

(no longer active)

Assignee

Comment 9

•

12 years ago

Comment on attachment 631081 [details] [diff] [review]
Part 2: Fall back to regular updates if the replacing step fails after we've staged an update

Actually I think this patch is good for review now.  It seems to work pretty well for me locally.  I've pushed it to Oak to get Nightlies for more testing.

I will also work on an automated test, but I'd like to get this on m-c as soon as possible for more testing on Nightly users.

Attachment #631081 - Attachment description: WIP 2 → Part 2: Fall back to regular updates if the replacing step fails after we've staged an update

Attachment #631081 - Flags: review?(robert.bugzilla)

Robert Strong (they/them - no direct email)

Comment 10

•

12 years ago

(In reply to Ehsan Akhgari [:ehsan] from comment #9)
> Comment on attachment 631081 [details] [diff] [review]
> Part 2: Fall back to regular updates if the replacing step fails after we've
> staged an update
> 
> Actually I think this patch is good for review now.  It seems to work pretty
> well for me locally.
When you say, "It seems to work pretty well for me locally" have you seen any issues and what concerns do you have?

(no longer active)

Assignee

Comment 11

•

12 years ago

(In reply to Robert Strong [:rstrong] (do not email) from comment #10)
> (In reply to Ehsan Akhgari [:ehsan] from comment #9)
> > Comment on attachment 631081 [details] [diff] [review]
> > Part 2: Fall back to regular updates if the replacing step fails after we've
> > staged an update
> > 
> > Actually I think this patch is good for review now.  It seems to work pretty
> > well for me locally.
> When you say, "It seems to work pretty well for me locally" have you seen
> any issues and what concerns do you have?

No.  :-)  Let me rephrase: I think this is the correct patch, and my testings on Mac, Linux and Windows have not shown any problems with it.

(no longer active)

Assignee

Comment 12

•

12 years ago

Attached patch Part 3: Verify the fallback behavior on all the locked file tests — Details — Splinter Review

This patch adds tests similar to all of the existing ones for locked files in face of staging updates to verify that the fallback logic works for all of these cases.

These tests are valuable, but I'll also work on another test which tests the whole cycle.

Attachment #631250 - Flags: review?(robert.bugzilla)

(no longer active)

Assignee

Comment 13

•

12 years ago

Attached patch Part 4: Add full round-trip tests — Details — Splinter Review

These tests do a whole round-trip test of the failure scenario.  Here's what they do:

1. They stage an update in the background, and verify that it succeeds.
2. They lock the installation directory (same way as having a command prompt open in it would for example).
3. They run the aus test app with -process-updates to replace the app, and then they wait to make sure that the fallback kicks in place, and results in a successful update.

There's also a version of the same test using the service to perform the updates.

This should conclude the set of patches I wanted to submit in this bug.

Attachment #631494 - Flags: review?(robert.bugzilla)

(no longer active)

Assignee

Comment 14

•

12 years ago

I also forgot to mention that I tested Win32 and Win64 builds on Oak and the fallback works correctly in every case.

Paul [pwd]

Comment 15

•

12 years ago

Any chance of pinging rstrong so we can get this landed. Updating is currently a huge PiTA and if there's a fix available, we should work to get it landed ASAP.

(no longer active)

Assignee

Comment 16

•

12 years ago

(In reply to Paul [sabret00the] from comment #15)
> Any chance of pinging rstrong so we can get this landed. Updating is
> currently a huge PiTA and if there's a fix available, we should work to get
> it landed ASAP.

We're doing our best.  I hope to get this landed today.

(no longer active)

Assignee

Comment 17

•

12 years ago

FWIW, the test added in this bug fails intermittently, but the good news is that my patch in bug 762032 seems to fix it.

Depends on: 762032

Frank Zimmer

Comment 18

•

12 years ago

I guess you need much higher times to wait.
I have done that manually and deleting for instance 
c:\Programm Files\Nightly\firefox.exe 
sometimes takes up to 10 seconds before the handle is free up and you can delete the executable.

(no longer active)

Assignee

Comment 19

•

12 years ago

(In reply to Frank Zimmer from comment #18)
> I guess you need much higher times to wait.
> I have done that manually and deleting for instance 
> c:\Programm Files\Nightly\firefox.exe 
> sometimes takes up to 10 seconds before the handle is free up and you can
> delete the executable.

With my patches here, we will not be entirely relying on the wait time.

Robert Strong (they/them - no direct email)

Comment 20

•

12 years ago

My main concerns at this time are (the first one being the biggest):
1. if this is a common occurrence
2. the fallback flow is to startup and then apply the update on the next startup which is quite far from being ideal.

What do you think about a telemetry probe specifically for whether this is a common occurrence?

Did you gather data comparing the time it took to apply updates with and without staging the update and if so, could you post it to this bug?

(no longer active)

Assignee

Comment 21

•

12 years ago

(In reply to Robert Strong [:rstrong] (do not email) from comment #20)
> My main concerns at this time are (the first one being the biggest):
> 1. if this is a common occurrence
> 2. the fallback flow is to startup and then apply the update on the next
> startup which is quite far from being ideal.

No, the fallback flow is no worse than before.  Here's what happens:

* I run Firefox.  An update gets downloaded in the background, and gets staged.  The status file will contain "applied".
* I restart, updater tries to rename the installation dir but will not be successful.
* It waits for 1 second, in 100ms chunks.  If the folder is not locked, it gives up and instead of writing a failure message to the log file, it writes "pending" (or "pending-service" if the service is being used), removes the staged directory, and quits.
* Firefox notices that the update is pending, and starts the updater process again.  This happens immediately, and is not noticeable to the user.
* The updater applies the update without using the staged copy.  This is the same update workflow before bug 307181.
* When the updater is done, it quits and Firefox is re-launched normally.

> What do you think about a telemetry probe specifically for whether this is a
> common occurrence?

The telemetry probe for the update status can be used to measure this.  Based on the current data that we have on Windows Nightly 16.a1 users, 2.66% of them are affected by this problem.  So I wouldn't say that this is very common.

> Did you gather data comparing the time it took to apply updates with and
> without staging the update and if so, could you post it to this bug?

Do you mean the time it takes for the update itself to be applied/staged, or the extra time that it takes for the fallback mechanism implemented in these patches to kick in?

Robert Strong (they/them - no direct email)

Comment 22

•

12 years ago

Thanks for the answers and I am glad to hear that it isn't common!

The time for applying a complete vs. staging and a partial vs. staging.

Robert Strong (they/them - no direct email)

Comment 23

•

12 years ago

I ask because at least on my Windows system it seems to take around the same or more amount of time.

(no longer active)

Assignee

Comment 24

•

12 years ago

I can do some time measurements for you tomorrow to have accurate numbers, but in general staging an (complete/partial) update involves first copying the entire installation directory, and then applying the (complete/partial) update in the exact same way as when applying the update without staging, so in both cases staging an update takes longer, by the amount of time required to copy the installation directory.

(no longer active)

Assignee

Comment 25

•

12 years ago

(In reply to Robert Strong [:rstrong] (do not email) from comment #23)
> I ask because at least on my Windows system it seems to take around the same
> or more amount of time.

Yeah, that's expected.  Note that ensure_copy has a fast path for Windows using CopyFile <http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/updater/updater.cpp#558>, and I added that specifically because copying files by reading and writing buffers manually would take a very long time on Windows.  With that optimization, copying the installation directory is reasonably fast on all platforms, but still staging an update will be slightly slower than applying it.

Lawrence Mandel [:lmandel] (use needinfo)

Comment 26

•

12 years ago

(In reply to Ehsan Akhgari [:ehsan] from comment #21)
> > What do you think about a telemetry probe specifically for whether this is a
> > common occurrence?
> 
> The telemetry probe for the update status can be used to measure this. 
> Based on the current data that we have on Windows Nightly 16.a1 users, 2.66%
> of them are affected by this problem.  So I wouldn't say that this is very
> common.

In general Telemetry should not be used to ask questions of the Firefox population. Telemetry opt-in rates are fairly low, the population is not random, and the sample is therefore not guaranteed to be representative of the entire population. What the 2.66% says is that only this percentage of the Telemetry population are affected by this problem.

Robert Strong (they/them - no direct email)

Comment 27

•

12 years ago

Already understood and agreed. The telemetry data should be sufficient for getting a general idea of how bad the problem is in this instance though.

Robert Strong (they/them - no direct email)

Comment 28

•

12 years ago

Comment on attachment 631081 [details] [diff] [review]
Part 2: Fall back to regular updates if the replacing step fails after we've staged an update

>diff --git a/toolkit/mozapps/update/updater/updater.cpp b/toolkit/mozapps/update/updater/updater.cpp
>--- a/toolkit/mozapps/update/updater/updater.cpp
>+++ b/toolkit/mozapps/update/updater/updater.cpp
>@@ -2028,41 +2036,74 @@ UpdateThreadFunc(void *param)
>     }
> 
>     if (rv == OK) {
>       rv = DoUpdate();
>       gArchiveReader.Close();
>     }
>   }
> 
>-  if (rv) {
>-    LOG(("failed: %d\n", rv));
>+  bool reportRealResults = true;
>+  if (sReplaceRequest && rv && !getenv("MOZ_NO_REPLACE_FALLBACK")) {
>+    // When attempting to replace the application, we should fall back
>+    // to non-staged updates in case of a failure.  We do this but
s/but/by/

>+    // setting the status to pending, and quitting and allowing the
>+    // callback application to be run.  The callback application's
setting the status to pending, exiting the updater, and launching the callback application.

or similar

>+    // startup path will see the pending status, and will start the
>+    // updater application again in order to apply the update without
>+    // staging.
>+    // The MOZ_NO_REPLACE_FALLBACK environment variable is used to
>+    // bypass this fallback, and is used in the updater tests.
>+    // The only special thing which we should do here is to remove the
>+    // staged directory as it won't be useful any more.
>+    NS_tchar installDir[MAXPATHLEN];
>+    if (GetInstallationDir(installDir)) {
>+      NS_tchar stageDir[MAXPATHLEN];
>+      NS_tsnprintf(stageDir, sizeof(stageDir)/sizeof(stageDir[0]),
>+#ifdef XP_MACOSX
>+                   NS_T("%s/Updated.app"),
>+#else
>+                   NS_T("%s/updated"),
>+#endif
>+                   installDir);
>+
>+      if (ensure_remove_recursive(stageDir) == 0) {
>+        WriteStatusText(sUsingService ? "pending-service" : "pending");
>+        reportRealResults = false; // pretend success
>+      }
What do you think about trying to apply the update when the recursive remove is unsuccessful? I *think* that is the better behavior.

How can this failure condition be checked via telemetry? I don't think this needs to be done in this bug if it isn't.

Attachment #631081 - Flags: review?(robert.bugzilla) → review+

Robert Strong (they/them - no direct email)

Comment 29

•

12 years ago

Comment on attachment 631250 [details] [diff] [review]
Part 3: Verify the fallback behavior on all the locked file tests

Wow... amazing how many additional tests are being created by just copying the existing tests with a few mods. Looks good though I didn't check if any of these are redundant.

Attachment #631250 - Flags: review?(robert.bugzilla) → review+

Robert Strong (they/them - no direct email)

Comment 30

•

12 years ago

Comment on attachment 631494 [details] [diff] [review]
Part 4: Add full round-trip tests

>diff --git a/toolkit/mozapps/update/updater/updater.cpp b/toolkit/mozapps/update/updater/updater.cpp
>--- a/toolkit/mozapps/update/updater/updater.cpp
>+++ b/toolkit/mozapps/update/updater/updater.cpp
>@@ -2062,16 +2062,17 @@ UpdateThreadFunc(void *param)
>                    NS_T("%s/Updated.app"),
> #else
>                    NS_T("%s/updated"),
> #endif
>                    installDir);
> 
>       if (ensure_remove_recursive(stageDir) == 0) {
>         WriteStatusText(sUsingService ? "pending-service" : "pending");
>+        putenv("MOZ_PROCESS_UPDATES="); // We need to use -process-updates again in the tests
Please expand this comment to include that this code path shouldn't cause a loop since the status code has been set to pending.

>         reportRealResults = false; // pretend success
>       }
>     }
>   }
> 
>   if (reportRealResults) {
>     if (rv) {
>       LOG(("failed: %d\n", rv));

Attachment #631494 - Flags: review?(robert.bugzilla) → review+

Robert Strong (they/them - no direct email)

Comment 31

•

12 years ago

btw: I'm fine with this landing though I'd still like to get startup times with and without an update already staged prior to enabling the pref on Aurora for Windows. If this is a pain I can probably do this sometime next week.

(no longer active)

Assignee

Comment 32

•

12 years ago

(In reply to Robert Strong [:rstrong] (do not email) from comment #28)
> >+      if (ensure_remove_recursive(stageDir) == 0) {
> >+        WriteStatusText(sUsingService ? "pending-service" : "pending");
> >+        reportRealResults = false; // pretend success
> >+      }
> What do you think about trying to apply the update when the recursive remove
> is unsuccessful? I *think* that is the better behavior.

You're right, I'll make that change before landing.

> How can this failure condition be checked via telemetry? I don't think this
> needs to be done in this bug if it isn't.

Hmm, good question.  It's a bit hard since we're pretending that the update is just pending.  I'll think more about this and will file a new bug.

(In reply to Robert Strong [:rstrong] (do not email) from comment #31)
> btw: I'm fine with this landing though I'd still like to get startup times
> with and without an update already staged prior to enabling the pref on
> Aurora for Windows. If this is a pain I can probably do this sometime next
> week.

I'll try to get this information for you today.

(no longer active)

Assignee

Comment 33

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/330f6f82f913
https://hg.mozilla.org/mozilla-central/rev/4a5680646fe1
https://hg.mozilla.org/mozilla-central/rev/3c9425ee3222
https://hg.mozilla.org/mozilla-central/rev/d5f72bc8b49d

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla16

(no longer active)

Assignee

Updated

•

12 years ago

Flags: in-testsuite+

(no longer active)

Assignee

Comment 35

•

12 years ago

I did some measurements.  On average, it took the updater ~15ms to replace the application after the update had been staged, but for applying updates during restart, the average time that I measured was ~3.720s.  Note that the variation in the latter measurement was quite a lot.  The staging process took ~4.310s.

Robert Strong (they/them - no direct email)

Comment 36

•

12 years ago

How was this measured? I'd like to know what the difference is for the user via the time from starting the application, the update is applied or the staged dir is copied, and to the application being relaunched.

(no longer active)

Assignee

Comment 37

•

12 years ago

(In reply to Robert Strong [:rstrong] (do not email) from comment #36)
> How was this measured? I'd like to know what the difference is for the user
> via the time from starting the application, the update is applied or the
> staged dir is copied, and to the application being relaunched.

I instrumented updater.exe to measure the amount of time it takes for its main function to run.

Robert Strong (they/them - no direct email)

Comment 38

•

12 years ago

(In reply to Ehsan Akhgari [:ehsan] from comment #37)
> (In reply to Robert Strong [:rstrong] (do not email) from comment #36)
> > How was this measured? I'd like to know what the difference is for the user
> > via the time from starting the application, the update is applied or the
> > staged dir is copied, and to the application being relaunched.
> 
> I instrumented updater.exe to measure the amount of time it takes for its
> main function to run.
That is good enough for me and thanks!

(no longer active)

Assignee

Comment 39

•

12 years ago

Comment on attachment 630376 [details] [diff] [review]
Part 1: Keep the update files around when an update has been staged

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 307181
User impact if declined: Updates on Windows could fail in a variety of ways.
Testing completed (on m-c, etc.): Locally, m-c and in the test suite
Risk to taking this patch (and alternatives if risky): This is sort of risky but has been baking on m-c for some time without issues.
String or UUID changes made by this patch: none.

Attachment #630376 - Flags: approval-mozilla-aurora?

Alex Keybl [:akeybl]

Comment 40

•

12 years ago

Comment on attachment 630376 [details] [diff] [review]
Part 1: Keep the update files around when an update has been staged

[Triage Comment]
Approving for Aurora 15 in support of landing bug 767125 in the next few days.

Attachment #630376 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

(no longer active)

Assignee

Comment 41

•

12 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/0ed76cdf4abe
https://hg.mozilla.org/releases/mozilla-aurora/rev/23a5d4b7a6b9
https://hg.mozilla.org/releases/mozilla-aurora/rev/45462dd86c4c
https://hg.mozilla.org/releases/mozilla-aurora/rev/073af172958b

status-firefox15: --- → fixed

(no longer active)

Assignee

Updated

•

12 years ago

Depends on: 769246

Ed Morley [:emorley]

Updated

•

12 years ago

Depends on: 773251

Part 1: Keep the update files around when an update has been staged 12 years ago (no longer active) 1.81 KB, patch	robert.strong.bugs : review+ akeybl : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
WIP 12 years ago (no longer active) 11.15 KB, patch		Details \| Diff \| Splinter Review
Part 2: Fall back to regular updates if the replacing step fails after we've staged an update 12 years ago (no longer active) 26.32 KB, patch	robert.strong.bugs : review+	Details \| Diff \| Splinter Review
Part 3: Verify the fallback behavior on all the locked file tests 12 years ago (no longer active) 37.23 KB, patch	robert.strong.bugs : review+	Details \| Diff \| Splinter Review
Part 4: Add full round-trip tests 12 years ago (no longer active) 8.76 KB, patch	robert.strong.bugs : review+	Details \| Diff \| Splinter Review