Closed Bug 1736373 Opened 3 years ago Closed 3 years ago

Firefox update not always getting applied due to pingsender triggering updater during shutdown

Categories

(Toolkit :: Application Update, defect, P1)

defect

Tracking

()

RESOLVED FIXED
95 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox93 --- unaffected
firefox94 - unaffected
firefox95 blocking fixed

People

(Reporter: aryx, Assigned: bholley)

References

Details

Attachments

(1 file)

Firefox 95.0a1 20211018095159, updated from 20211018213754. Windows 8.1

Installing Firefox updates both from About Firefox and the update notification became unreliable over the weekend and sometimes fail to apply the patch (restart works without issues, but about:support still reports the old build ID and updates are still found).

:aja reports the same issue.

The first update failure was for me:
To 20211017092521: "Install Failed"
Later: 20211017213754: "The Update could not be installed (patch apply failed)"
There also 2 pending updates to the latest version which has already been installed in a third update attempt.

last-update.log mentioned Last error: 32, err: 7 (sorry, lost the log after the successful update)

Another report of this on Element, a fix or rollback of the change which causes this would be welcome. Both Windows 10 and 8.1 and also both 64- and 32-bit have been reported. Sometimes 3 attempts are required to install successfully.

Flags: needinfo?(ksteuber)
OS: Unspecified → Windows
Hardware: Unspecified → All

This may affect all platforms: multiple reports of getting extra icons in the Dock, and on Linux I got an about:restartrequired error (which we should not be getting with our own Nightlies).

Marking P1/S1 due to potential impact.

Severity: -- → S1
Priority: -- → P1
OS: Windows → All

The macOS dock icon thing may be bug 1731501.

Logging my experience in case it helps:

Just updated Nightly to 95.0a1 (2021-10-18) (64-bit) on MacOS 11.6 (Big Sur, Apple Silicon)

  1. The update hung for a while, with all tabs visible but the main viewport empty and nothing was responsive in the chrome. I had to re-trigger the update via the About modal.
  2. After it updated, the 1Password add-on grumbled that it couldn't verify the browser and also claims FF still needs an update and a restart.
  3. I also see three new Nightlies in my dock, none of which is the Nightly I've marked to keep in the dock.

Quitting Nightly and restarting resulted in the 1P addon working fine again and the Dock showed I was running my expected/pinned nightly. The three special-guest Nightly icons are still in my recent Dock items

Workaround:
Disable Allow Nightly to send technical and interaction data to Mozilla in Settings > Privacy & Security and then restart.

This may be related to bug 1734262.

Bug 1734262 has been identified (thank you Fanolian) as the likely regressor for this change. Could those changes be reverted and the issue be fixed outside the tree to let updates apply reliably?

Flags: needinfo?(bholley)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #7)

Bug 1734262 has been identified (thank you Fanolian) as the likely regressor for this change. Could those changes be reverted and the issue be fixed outside the tree to let updates apply reliably?

FYI the same workaround works for bug 1736438 comment 3 which has landed in Oct 19th (no need to disable the setting before using Save to Pocket button).
So I don't know if this update bug is fixed in Oct 19th too. Perhaps we need to wait for the next update and see.

(In reply to Fanolian from comment #8)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #7)

Bug 1734262 has been identified (thank you Fanolian) as the likely regressor for this change. Could those changes be reverted and the issue be fixed outside the tree to let updates apply reliably?

FYI the same workaround works for bug 1736438 comment 3 which has landed in Oct 19th (no need to disable the setting before using Save to Pocket button).
So I don't know if this update bug is fixed in Oct 19th too. Perhaps we need to wait for the next update and see.

That fix is in the latest nightly. Can you test to see if the issue is resolved?

Flags: needinfo?(Fanolian+BMO)

(In reply to Bobby Holley (:bholley) from comment #9)

That fix is in the latest nightly. Can you test to see if the issue is resolved?

I got this update bug when updating from 18th -> 19th. I haven't got any new updates yet so I cannot test it.
The save-to-pocket issue in bug 1736438 is fixed though.

Flags: needinfo?(Fanolian+BMO)

A new Nightly update is available. Fanolian, does updating with telemetry enabled work for you?

The applied applied on the first attempt for me.

Flags: needinfo?(Fanolian+BMO)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #11)

A new Nightly update is available. Fanolian, does updating with telemetry enabled work for you?

The applied applied on the first attempt for me.

Update with telemetry enabled works for me too. From 20211019095357 to 20211019154506.

Flags: needinfo?(Fanolian+BMO)
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(ksteuber)
Flags: needinfo?(bholley)
Resolution: --- → DUPLICATE

That's great news.

Nick, it seems bad that a TypeError in PingCentre was able to interfere with updates. Can your team do a brief retro and figure out how we can harden against this sort of thing going forward?

Ryan, can you (or someone in relman) keep an eye on our Nightly population over the next week or two and make sure it recovers?

Flags: needinfo?(ryanvm)
Flags: needinfo?(nalexander)

(In reply to Bobby Holley (:bholley) from comment #14)

That's great news.

Nick, it seems bad that a TypeError in PingCentre was able to interfere with updates. Can your team do a brief retro and figure out how we can harden against this sort of thing going forward?

Yes, I can drive something like this in the team.

Flags: needinfo?(nalexander)

The update from 95.0a1 20211019154506 to 20211019192428 failed to apply from About Firefox + clicking its restart button. Third attempt was downloading the Firefox update but close the application and launch it again afterwards which succeeded.

If anybody else still experiences the update issue, please comment.

Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---

Hm, that's concerning — particularly given that the issue seems so intermittent.

One thing that could be happening is that the pingsender background task is running immediately after shutdown, and having it running (on top of libXUL) is interfering with the updater. That could explain the intermittent manifestation, since the background task would be racing against the updater.

I'm not really clear on the constraints of the updater. Could it wait until the background task is done? Could it run in parallel with the background task? Could we avoid sending shutdown pings if we have an update pending?

Flags: needinfo?(nalexander)
Flags: needinfo?(chutten)

I have something similar happening here (openSUSE Tumbleweed/Nightly)

  1. I get the Update available notification
  2. Select Download and restart to update.
  3. Browser closes then immediately reopens with - https://imgur.com/iBzXg1V
  4. Select Restart and the browser closes. I have to restart the browser manually.

It's been happening for a few days.

(In reply to Pulse from comment #18)

I have something similar happening here (openSUSE Tumbleweed/Nightly)

  1. I get the Update available notification
  2. Select Download and restart to update.
  3. Browser closes then immediately reopens with - https://imgur.com/iBzXg1V
  4. Select Restart and the browser closes. I have to restart the browser manually.

It's been happening for a few days.

Forgot to mention - Build:20211019215100

One thing that could be happening is that the pingsender background task is running immediately after shutdown, and having it running (on top of libXUL) is interfering with the updater.

This would explain why I'm still getting about:restartrequired on Linux, which indicates our libraries are out of sync with the executable.

(In reply to Bobby Holley (:bholley) from comment #17)

I'm not really clear on the constraints of the updater. Could it wait until the background task is done? Could it run in parallel with the background task? Could we avoid sending shutdown pings if we have an update pending?

Ping sending at/after shutdown using pingsender (v1 or v2) is known to be Best Effort, but the improvements in data latency it affords are improvements we're not going to want to lose... though only losing it when there's an update pending would be acceptable. Especially since in that case we could go back to pingsender v1 and still get pings out (except for misconfigured or old systems).

Flags: needinfo?(chutten)

bytesized: between the two of us, we'll need to dig into this. Just getting it onto your radar.

Flags: needinfo?(ksteuber)

It's on my radar. It's kind of perplexing, actually. I believe it normally "works" to update Firefox while another instance of Firefox is running (well, except for the Bug 1480452 issue). And if it didn't work, I wouldn't expect to see a SERVICE_COULD_NOT_LOCK_UPDATER error. I'm not really sure how that would be happening.

Flags: needinfo?(ksteuber)

(In reply to Steve Jalim from comment #4)

  1. I also see three new Nightlies in my dock, none of which is the Nightly I've marked to keep in the dock.

[...] The three special-guest Nightly icons are still in my recent Dock items

I filed bug 1736921 on these "three special-guest recent dock items", BTW. (I'm seeing them as well.) I haven't noticed any other updater issues -- just these three special-guests in the dock area.

(In reply to Kirk Steuber (he/him) [:bytesized] from comment #23)

It's on my radar. It's kind of perplexing, actually. I believe it normally "works" to update Firefox while another instance of Firefox is running (well, except for the Bug 1480452 issue). And if it didn't work, I wouldn't expect to see a SERVICE_COULD_NOT_LOCK_UPDATER error. I'm not really sure how that would be happening.

Then perhaps my theory from comment 17 is off-base. It was just a guess anyway.

Taking a step back: we have anecdotal reports of increased updater failures on Nightly this week, and one report that disabling telemetry seemed to have fixed the problem, which circumstantially pointed that the issue could be related to bug 1734262. However, that's not much to go on.

Do we have some sort of telemetry about these update failures from which we could discern an uptick? Do we have machinery to simulate updates such that we could more-reliably reproduce what seems to be an intermittent issue?

Update failures are super concerning, so we need to get to the bottom of this.

I wonder if bug 1736910 comment 4 could be related?

I don't know if this means anything but looking through my .mozilla/firefox/crash reports folder I see 19 crash reports that haven't appeared in about:support and just have a single number as content. Most of these are from yesterday, with one from today, that was at update time.

The file names are all InstallTime20211020093007 with variation on the time/date. the contents are all something like 1634702012

Pretty sure these are related to this bug somehow. I'll check to see if a new report is generated when the update happens later.

(In reply to Pulse from comment #27)

The file names are all InstallTime20211020093007 with variation on the time/date. the contents are all something like 1634702012

I also have these and the files span all the way back to when I first installed Firefox. This doesnt seem right... I have hundreds of these "installtime" files accumulated to this date...

The many InstallTime folders are tracked by bug 1722777.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #29)

The many InstallTime folders are tracked by bug 1722777.

That is a separate issue. This is about "installtime" files accumulating in the Crash Reports folder.

Filed https://bugzilla.mozilla.org/show_bug.cgi?id=1737002

(In reply to Bobby Holley (:bholley) from comment #25)

Do we have some sort of telemetry about these update failures from which we could discern an uptick? Do we have machinery to simulate updates such that we could more-reliably reproduce what seems to be an intermittent issue?

:bytesized and :nalexander will have more context, but we have updater information in Firefox Telemetry (the "update" ping and roughly 100 update_ probes in the "main" ping) and in Glean (for the Background Update Task)

Because I need more excuses to use Looker, I looked a little at what's reported through Glean (wow Looker integration is Pretty Nice. One click from the bottom of a Glean Dictionary page to get to a dashboard is sweet. Then point and click to filter normalized channel on "nightly" and...). According to this plot the only Background Update "final state" difference in the past month is a sudden increase around Sept 30 for NEVER_CHECKED (and to see it I had to ignore the very-common NO_UPDATES_FOUND, DOWNLOADING, and READY_FOR_RESTART states).

Also there are basically no exceptions.

So as far as the background updater is concerned via those two measures, everything's just peachy.

(In reply to Bobby Holley (:bholley) from comment #25)

Do we have some sort of telemetry about these update failures from which we could discern an uptick?

There is a whole bunch of update telemetry available (definitions here and here). If you want to see how common staging and applying failures are, you are probably looking for these ones.

Do we have machinery to simulate updates such that we could more-reliably reproduce what seems to be an intermittent issue?

There are ways to simulates updates. But I imagine that this issue will be just as intermittent in a simulation as outside of one. These are basically the two ways of manually testing update:

  1. Set up a local update server and use Enterprise Policies to force a copy of Firefox to use it.
  2. The Update team has the Oak tree for update testing. Pushing anything to it (with the exception of DONTBUILD patches) causes updates to be generated and served pretty much exactly the same way as Nightly.

Folks, this is clearly a significant issue. Next steps are for Kirk and me to discuss this in our 1:1 (almost right now) and we will report back on next steps. I expect we'll work with bholley to roll back parts of Bug 1734262, since that seems to have caused the regression. That'll give us time to get to the bottom of this.

bholley: seems like the minimal rollback is to revert https://hg.mozilla.org/mozilla-central/rev/e1f824e84f03. Is there more needed? Maybe some additional tests rely on the new background task bits now? Would you rather revert everything in the pingsender ticket?

Flags: needinfo?(nalexander) → needinfo?(bholley)

(In reply to Nick Alexander :nalexander [he/him] from comment #33)

Folks, this is clearly a significant issue. Next steps are for Kirk and me to discuss this in our 1:1 (almost right now) and we will report back on next steps. I expect we'll work with bholley to roll back parts of Bug 1734262, since that seems to have caused the regression. That'll give us time to get to the bottom of this.

Kirk and I talked over this issue. We think that the spawned --backgroundtask pingsender instance and the relaunching firefox process are racing to apply a pending update. Most of the time, the pingsender process will win, and firefox will relaunch into an update in progress. A variety of races and resulting issues might be seen when this happens.

We think the sensible thing is roll back parts of the pingsender ticket, as discussed above, so we have some time to investigate and mitigate. One part of that might be Bug 1737117.

(In reply to Nick Alexander :nalexander [he/him] from comment #34)

(In reply to Nick Alexander :nalexander [he/him] from comment #33)

Folks, this is clearly a significant issue. Next steps are for Kirk and me to discuss this in our 1:1 (almost right now) and we will report back on next steps. I expect we'll work with bholley to roll back parts of Bug 1734262, since that seems to have caused the regression. That'll give us time to get to the bottom of this.

Kirk and I talked over this issue. We think that the spawned --backgroundtask pingsender instance and the relaunching firefox process are racing to apply a pending update. Most of the time, the pingsender process will win, and firefox will relaunch into an update in progress. A variety of races and resulting issues might be seen when this happens.

We think the sensible thing is roll back parts of the pingsender ticket, as discussed above, so we have some time to investigate and mitigate. One part of that might be Bug 1737117.

If arbitrary background tasks will try to install pending updates, that would totally explain the issues here. I'll work up a patch to revert to pingsender1 for now.

Flags: needinfo?(bholley)
Assignee: nobody → bholley
Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/5a0ebbb0210f
Revert to pingsender1. r=chutten
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch

(In reply to Pulsebot from comment #37)

Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d5cc6be9fd91
Revert to pingsender1. r=chutten

== Change summary for alert #32069 (as of Mon, 25 Oct 2021 01:28:15 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
20% ts_paint_webext linux1804-64-shippable-qr e10s stylo webrender-sw 333.33 -> 266.25
19% ts_paint_webext linux1804-64-shippable-qr e10s fission stylo webrender 375.50 -> 305.33
18% ts_paint_webext linux1804-64-shippable-qr e10s stylo webrender 376.08 -> 308.83
17% ts_paint linux1804-64-shippable-qr e10s stylo webrender-sw 325.25 -> 270.75
17% ts_paint linux1804-64-shippable-qr e10s fission stylo webrender 362.30 -> 302.08
... ... ... ... ...
7% startup_about_home_paint_realworld_webextensions linux1804-64-shippable-qr e10s fission stylo webrender 791.75 -> 735.00

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=32069

(In reply to Andra Esanu from comment #40)

(In reply to Pulsebot from comment #37)

Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d5cc6be9fd91
Revert to pingsender1. r=chutten

== Change summary for alert #32069 (as of Mon, 25 Oct 2021 01:28:15 GMT) ==

This reverts the regression reported in bug 1736796, which confirms that the switch to pingsender2 caused the regression.

(In reply to Bobby Holley (:bholley) from comment #14)

Ryan, can you (or someone in relman) keep an eye on our Nightly population over the next week or two and make sure it recovers?

The vast majority of users which submitted telemetry since yesterday use a Nightly build from Saturday or newer - users don't seem to be stuck on the affected versions and eventually succeed with the update installation.

Flags: needinfo?(ryanvm)
Summary: Firefox update not always getting applied → Firefox update not always getting applied due to pingsender triggering updater during shutdown

This may be related.

Similar to https://bugzilla.mozilla.org/show_bug.cgi?id=1736504 which I just updated with a comment
about https://hg.mozilla.org/mozilla-central/rev/c223d66e6148b604c1579863ab9a65906b547cdd
where I see the update destroys the pre-existing directory and then drops the whole process back to
a shell prompt in the XTerm I used to launch nightly.

(In reply to Bobby Holley (:bholley) from comment #14)

That's great news.

Nick, it seems bad that a TypeError in PingCentre was able to interfere with updates. Can your team do a brief retro and figure out how we can harden against this sort of thing going forward?

Ryan, can you (or someone in relman) keep an eye on our Nightly population over the next week or two and make sure it recovers?

Coming back to this years later -- we just found another situation where an early failure could, in some cases, get in the way of applying updates: https://bugzilla.mozilla.org/show_bug.cgi?id=1875502. It doesn't look exactly the same but it's similar, so I wanted to make the link.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: