Closed Bug 760577 Opened 12 years ago Closed 12 years ago

Keep the mar file around when staging an update in the background

Categories

(Toolkit :: Application Update, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla16
Tracking Status
firefox15 + fixed

People

(Reporter: ehsan.akhgari, Assigned: ehsan.akhgari)

References

Details

Attachments

(4 files, 1 obsolete file)

There is one failure scenario that the work in bug 307181 introduces:

1. An update is downloaded and staged successfully in the background.  At this point we remove the original mar file.
2. Firefox is restarted, and updater.exe tries to replace the installation dir with the staged dir.  Let's say that this fails.  At this point there is nothing that we can do, since the mar file is no longer around.
3. We need to attempt to update again when Firefox is restarted.

In order to improve this situation, we need to do two things:

1. Keep the mar file around when the update is staged successfully.
2. Teach updater to fall back to a regular update if the replace process fails.

#1 is relatively easy.  #2 is a bit trickier, since we'll know when that happens in ProcessReplaceRequest.  At that point, all of the program state is set up based on the assumption that updater is invoked to perform a replace.  I think the cleanest way to handle this is to launch updater again with a tailored command line argument which makes it perform an old school on-restart update using the mar file preserved in #1.

Please let me know if that makes sense.
Blocks: bgupdates
This is the easy part.  I'm still working on the fallback path in the updater application.
Assignee: nobody → ehsan
Status: NEW → ASSIGNED
Attachment #630376 - Flags: review?(robert.bugzilla)
Comment on attachment 630376 [details] [diff] [review]
Part 1: Keep the update files around when an update has been staged

Can you add a test to verify the files are left during a background update?
Attachment #630376 - Flags: review?(robert.bugzilla) → review+
Yeah, I'll work on a test when I finish this.
Attached patch WIP (obsolete) — Splinter Review
This is a work in progress patch which demonstrates how the updater side of things will work.  The logic that I ended up implementing is very simple.  During the replace step, if any errors occur, we just write pending (or pending-service if the service is being used) to the update.status file, and we quit and run the callback app.  The callback notices the pending status, and launches another copy of updater.exe to perform the update without staging.

I decided against fixing the program state up in updater itself and retry it.  That makes updater.exe too complicated, and a lot harder to change in the future.  This keeps things rather simple, but I still need to test it on Windows to see if the performance is acceptable.

So far this works fine on Mac and Linux, and should pass all of the existing tests.  I expect to come up with something review-quality which has also been tested on Windows by tomorrow...
If you find the performance isn't good you could just relaunch updater.exe directly again with the same command line parameters and a different update.status file.
(In reply to Brian R. Bondy [:bbondy] from comment #6)
> If you find the performance isn't good you could just relaunch updater.exe
> directly again with the same command line parameters and a different
> update.status file.

I would really prefer not to do that since the command line will be different, and I want to avoid reimplementing the logic in nsUpdateDriver.cpp in the updater if possible.  FWIW, so far my tests don't show a noticeable difference on a ~2yr old Windows machine (that is, relative to the amount of time that applying the update actually takes!)
This version of the patch passes all of the tests on all platforms.  The previous one was failing some tests on Windows.
Attachment #630775 - Attachment is obsolete: true
Comment on attachment 631081 [details] [diff] [review]
Part 2: Fall back to regular updates if the replacing step fails after we've staged an update

Actually I think this patch is good for review now.  It seems to work pretty well for me locally.  I've pushed it to Oak to get Nightlies for more testing.

I will also work on an automated test, but I'd like to get this on m-c as soon as possible for more testing on Nightly users.
Attachment #631081 - Attachment description: WIP 2 → Part 2: Fall back to regular updates if the replacing step fails after we've staged an update
Attachment #631081 - Flags: review?(robert.bugzilla)
(In reply to Ehsan Akhgari [:ehsan] from comment #9)
> Comment on attachment 631081 [details] [diff] [review]
> Part 2: Fall back to regular updates if the replacing step fails after we've
> staged an update
> 
> Actually I think this patch is good for review now.  It seems to work pretty
> well for me locally.
When you say, "It seems to work pretty well for me locally" have you seen any issues and what concerns do you have?
(In reply to Robert Strong [:rstrong] (do not email) from comment #10)
> (In reply to Ehsan Akhgari [:ehsan] from comment #9)
> > Comment on attachment 631081 [details] [diff] [review]
> > Part 2: Fall back to regular updates if the replacing step fails after we've
> > staged an update
> > 
> > Actually I think this patch is good for review now.  It seems to work pretty
> > well for me locally.
> When you say, "It seems to work pretty well for me locally" have you seen
> any issues and what concerns do you have?

No.  :-)  Let me rephrase: I think this is the correct patch, and my testings on Mac, Linux and Windows have not shown any problems with it.
This patch adds tests similar to all of the existing ones for locked files in face of staging updates to verify that the fallback logic works for all of these cases.

These tests are valuable, but I'll also work on another test which tests the whole cycle.
Attachment #631250 - Flags: review?(robert.bugzilla)
These tests do a whole round-trip test of the failure scenario.  Here's what they do:

1. They stage an update in the background, and verify that it succeeds.
2. They lock the installation directory (same way as having a command prompt open in it would for example).
3. They run the aus test app with -process-updates to replace the app, and then they wait to make sure that the fallback kicks in place, and results in a successful update.

There's also a version of the same test using the service to perform the updates.

This should conclude the set of patches I wanted to submit in this bug.
Attachment #631494 - Flags: review?(robert.bugzilla)
I also forgot to mention that I tested Win32 and Win64 builds on Oak and the fallback works correctly in every case.
Any chance of pinging rstrong so we can get this landed. Updating is currently a huge PiTA and if there's a fix available, we should work to get it landed ASAP.
(In reply to Paul [sabret00the] from comment #15)
> Any chance of pinging rstrong so we can get this landed. Updating is
> currently a huge PiTA and if there's a fix available, we should work to get
> it landed ASAP.

We're doing our best.  I hope to get this landed today.
FWIW, the test added in this bug fails intermittently, but the good news is that my patch in bug 762032 seems to fix it.
Depends on: 762032
I guess you need much higher times to wait.
I have done that manually and deleting for instance 
c:\Programm Files\Nightly\firefox.exe 
sometimes takes up to 10 seconds before the handle is free up and you can delete the executable.
(In reply to Frank Zimmer from comment #18)
> I guess you need much higher times to wait.
> I have done that manually and deleting for instance 
> c:\Programm Files\Nightly\firefox.exe 
> sometimes takes up to 10 seconds before the handle is free up and you can
> delete the executable.

With my patches here, we will not be entirely relying on the wait time.
My main concerns at this time are (the first one being the biggest):
1. if this is a common occurrence
2. the fallback flow is to startup and then apply the update on the next startup which is quite far from being ideal.

What do you think about a telemetry probe specifically for whether this is a common occurrence?

Did you gather data comparing the time it took to apply updates with and without staging the update and if so, could you post it to this bug?
(In reply to Robert Strong [:rstrong] (do not email) from comment #20)
> My main concerns at this time are (the first one being the biggest):
> 1. if this is a common occurrence
> 2. the fallback flow is to startup and then apply the update on the next
> startup which is quite far from being ideal.

No, the fallback flow is no worse than before.  Here's what happens:

* I run Firefox.  An update gets downloaded in the background, and gets staged.  The status file will contain "applied".
* I restart, updater tries to rename the installation dir but will not be successful.
* It waits for 1 second, in 100ms chunks.  If the folder is not locked, it gives up and instead of writing a failure message to the log file, it writes "pending" (or "pending-service" if the service is being used), removes the staged directory, and quits.
* Firefox notices that the update is pending, and starts the updater process again.  This happens immediately, and is not noticeable to the user.
* The updater applies the update without using the staged copy.  This is the same update workflow before bug 307181.
* When the updater is done, it quits and Firefox is re-launched normally.

> What do you think about a telemetry probe specifically for whether this is a
> common occurrence?

The telemetry probe for the update status can be used to measure this.  Based on the current data that we have on Windows Nightly 16.a1 users, 2.66% of them are affected by this problem.  So I wouldn't say that this is very common.

> Did you gather data comparing the time it took to apply updates with and
> without staging the update and if so, could you post it to this bug?

Do you mean the time it takes for the update itself to be applied/staged, or the extra time that it takes for the fallback mechanism implemented in these patches to kick in?
Thanks for the answers and I am glad to hear that it isn't common!

The time for applying a complete vs. staging and a partial vs. staging.
I ask because at least on my Windows system it seems to take around the same or more amount of time.
I can do some time measurements for you tomorrow to have accurate numbers, but in general staging an (complete/partial) update involves first copying the entire installation directory, and then applying the (complete/partial) update in the exact same way as when applying the update without staging, so in both cases staging an update takes longer, by the amount of time required to copy the installation directory.
(In reply to Robert Strong [:rstrong] (do not email) from comment #23)
> I ask because at least on my Windows system it seems to take around the same
> or more amount of time.

Yeah, that's expected.  Note that ensure_copy has a fast path for Windows using CopyFile <http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/updater/updater.cpp#558>, and I added that specifically because copying files by reading and writing buffers manually would take a very long time on Windows.  With that optimization, copying the installation directory is reasonably fast on all platforms, but still staging an update will be slightly slower than applying it.
(In reply to Ehsan Akhgari [:ehsan] from comment #21)
> > What do you think about a telemetry probe specifically for whether this is a
> > common occurrence?
> 
> The telemetry probe for the update status can be used to measure this. 
> Based on the current data that we have on Windows Nightly 16.a1 users, 2.66%
> of them are affected by this problem.  So I wouldn't say that this is very
> common.

In general Telemetry should not be used to ask questions of the Firefox population. Telemetry opt-in rates are fairly low, the population is not random, and the sample is therefore not guaranteed to be representative of the entire population. What the 2.66% says is that only this percentage of the Telemetry population are affected by this problem.
Already understood and agreed. The telemetry data should be sufficient for getting a general idea of how bad the problem is in this instance though.
Comment on attachment 631081 [details] [diff] [review]
Part 2: Fall back to regular updates if the replacing step fails after we've staged an update

>diff --git a/toolkit/mozapps/update/updater/updater.cpp b/toolkit/mozapps/update/updater/updater.cpp
>--- a/toolkit/mozapps/update/updater/updater.cpp
>+++ b/toolkit/mozapps/update/updater/updater.cpp
>@@ -2028,41 +2036,74 @@ UpdateThreadFunc(void *param)
>     }
> 
>     if (rv == OK) {
>       rv = DoUpdate();
>       gArchiveReader.Close();
>     }
>   }
> 
>-  if (rv) {
>-    LOG(("failed: %d\n", rv));
>+  bool reportRealResults = true;
>+  if (sReplaceRequest && rv && !getenv("MOZ_NO_REPLACE_FALLBACK")) {
>+    // When attempting to replace the application, we should fall back
>+    // to non-staged updates in case of a failure.  We do this but
s/but/by/

>+    // setting the status to pending, and quitting and allowing the
>+    // callback application to be run.  The callback application's
setting the status to pending, exiting the updater, and launching the callback application.

or similar

>+    // startup path will see the pending status, and will start the
>+    // updater application again in order to apply the update without
>+    // staging.
>+    // The MOZ_NO_REPLACE_FALLBACK environment variable is used to
>+    // bypass this fallback, and is used in the updater tests.
>+    // The only special thing which we should do here is to remove the
>+    // staged directory as it won't be useful any more.
>+    NS_tchar installDir[MAXPATHLEN];
>+    if (GetInstallationDir(installDir)) {
>+      NS_tchar stageDir[MAXPATHLEN];
>+      NS_tsnprintf(stageDir, sizeof(stageDir)/sizeof(stageDir[0]),
>+#ifdef XP_MACOSX
>+                   NS_T("%s/Updated.app"),
>+#else
>+                   NS_T("%s/updated"),
>+#endif
>+                   installDir);
>+
>+      if (ensure_remove_recursive(stageDir) == 0) {
>+        WriteStatusText(sUsingService ? "pending-service" : "pending");
>+        reportRealResults = false; // pretend success
>+      }
What do you think about trying to apply the update when the recursive remove is unsuccessful? I *think* that is the better behavior.

How can this failure condition be checked via telemetry? I don't think this needs to be done in this bug if it isn't.
Attachment #631081 - Flags: review?(robert.bugzilla) → review+
Comment on attachment 631250 [details] [diff] [review]
Part 3: Verify the fallback behavior on all the locked file tests

Wow... amazing how many additional tests are being created by just copying the existing tests with a few mods. Looks good though I didn't check if any of these are redundant.
Attachment #631250 - Flags: review?(robert.bugzilla) → review+
Comment on attachment 631494 [details] [diff] [review]
Part 4: Add full round-trip tests

>diff --git a/toolkit/mozapps/update/updater/updater.cpp b/toolkit/mozapps/update/updater/updater.cpp
>--- a/toolkit/mozapps/update/updater/updater.cpp
>+++ b/toolkit/mozapps/update/updater/updater.cpp
>@@ -2062,16 +2062,17 @@ UpdateThreadFunc(void *param)
>                    NS_T("%s/Updated.app"),
> #else
>                    NS_T("%s/updated"),
> #endif
>                    installDir);
> 
>       if (ensure_remove_recursive(stageDir) == 0) {
>         WriteStatusText(sUsingService ? "pending-service" : "pending");
>+        putenv("MOZ_PROCESS_UPDATES="); // We need to use -process-updates again in the tests
Please expand this comment to include that this code path shouldn't cause a loop since the status code has been set to pending.

>         reportRealResults = false; // pretend success
>       }
>     }
>   }
> 
>   if (reportRealResults) {
>     if (rv) {
>       LOG(("failed: %d\n", rv));
Attachment #631494 - Flags: review?(robert.bugzilla) → review+
btw: I'm fine with this landing though I'd still like to get startup times with and without an update already staged prior to enabling the pref on Aurora for Windows. If this is a pain I can probably do this sometime next week.
(In reply to Robert Strong [:rstrong] (do not email) from comment #28)
> >+      if (ensure_remove_recursive(stageDir) == 0) {
> >+        WriteStatusText(sUsingService ? "pending-service" : "pending");
> >+        reportRealResults = false; // pretend success
> >+      }
> What do you think about trying to apply the update when the recursive remove
> is unsuccessful? I *think* that is the better behavior.

You're right, I'll make that change before landing.

> How can this failure condition be checked via telemetry? I don't think this
> needs to be done in this bug if it isn't.

Hmm, good question.  It's a bit hard since we're pretending that the update is just pending.  I'll think more about this and will file a new bug.

(In reply to Robert Strong [:rstrong] (do not email) from comment #31)
> btw: I'm fine with this landing though I'd still like to get startup times
> with and without an update already staged prior to enabling the pref on
> Aurora for Windows. If this is a pain I can probably do this sometime next
> week.

I'll try to get this information for you today.
Flags: in-testsuite+
I did some measurements.  On average, it took the updater ~15ms to replace the application after the update had been staged, but for applying updates during restart, the average time that I measured was ~3.720s.  Note that the variation in the latter measurement was quite a lot.  The staging process took ~4.310s.
How was this measured? I'd like to know what the difference is for the user via the time from starting the application, the update is applied or the staged dir is copied, and to the application being relaunched.
(In reply to Robert Strong [:rstrong] (do not email) from comment #36)
> How was this measured? I'd like to know what the difference is for the user
> via the time from starting the application, the update is applied or the
> staged dir is copied, and to the application being relaunched.

I instrumented updater.exe to measure the amount of time it takes for its main function to run.
(In reply to Ehsan Akhgari [:ehsan] from comment #37)
> (In reply to Robert Strong [:rstrong] (do not email) from comment #36)
> > How was this measured? I'd like to know what the difference is for the user
> > via the time from starting the application, the update is applied or the
> > staged dir is copied, and to the application being relaunched.
> 
> I instrumented updater.exe to measure the amount of time it takes for its
> main function to run.
That is good enough for me and thanks!
Comment on attachment 630376 [details] [diff] [review]
Part 1: Keep the update files around when an update has been staged

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 307181
User impact if declined: Updates on Windows could fail in a variety of ways.
Testing completed (on m-c, etc.): Locally, m-c and in the test suite
Risk to taking this patch (and alternatives if risky): This is sort of risky but has been baking on m-c for some time without issues.
String or UUID changes made by this patch: none.
Attachment #630376 - Flags: approval-mozilla-aurora?
Comment on attachment 630376 [details] [diff] [review]
Part 1: Keep the update files around when an update has been staged

[Triage Comment]
Approving for Aurora 15 in support of landing bug 767125 in the next few days.
Attachment #630376 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Depends on: 769246
Depends on: 773251
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: