Closed Bug 1423571 Opened 2 years ago Closed 2 years ago

Unable to perform a complete update to Beta/Devedition 58.0b9 from 58.0b5

Categories

(Release Engineering :: Release Requests, defect)


Tracking

RESOLVED FIXED
Tracking Status
firefox-esr52 --- unaffected
firefox57 --- unaffected
firefox58 blocking fixed
firefox59 + fixed
firefox60 + fixed

People

(Reporter: Callek, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: regression)

Attachments

(1 obsolete file)

[Tracking Requested - why for this release]: Seems to affect a large number of users.

Filing now to get a bug, while we investigate scope and solution.

bogdan_maris reported issues doing an update from en-US 58.0b5 on mac and linux, with a started-right-away background update: https://pastebin.com/uDbm92Py (I'll attach this to the bug soon)

Additionally saw some similar confusing errors with windows 10 (specifically that restarting didn't update): https://pastebin.com/gh61SJib

:garbas investigated telemetry and saw a decently sized spike of errors: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-11-27&keys=__none__!__none__!__none__&max_channel_version=aurora%252F58&measure=UPDATE_DOWNLOAD_CODE_COMPLETE&min_channel_version=null&os=Darwin&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-11-02&table=0&trim=1&use_submission_date=0

Balrog's blob for en-US is in https://irccloud.mozilla.com/pastebin/qOgXXHNL/

I tested locally and was able to reproduce with en-US linux64 where I got the NS_ERROR_FILE_NOT_FOUND while my about window was showing https://irccloud.mozilla.com/file/JrXkXOrd/image.png
Gerry fyi.
Flags: needinfo?(gchang)
Just a little note here: I managed to also reproduce this after an update from 58.0b5 to 58.0b9 using Beta builds on beta channel so this is not restricted to aurora. Here are the logs I got: https://pastebin.com/QHihG120
Summary: Unable to perform a complete update to Devedition 58.0b9 from 58.0b5 → Unable to perform a complete update to Beta/Devedition 58.0b9 from 58.0b5
Matt, can you help provide any insight here from your end?
:rstrong, similarly for you
Flags: needinfo?(robert.strong.bugs)
Flags: needinfo?(mhowell)
The error in those logs looks disturbingly like one that I was seeing while working on bug 1348087 and thought I had fixed. That patch is going to end up backed out from 58, since it continues to raise issues like this one and bug 1420210.
Flags: needinfo?(mhowell)
(In reply to Matt Howell [:mhowell] from comment #4)
> The error in those logs looks disturbingly like one that I was seeing while
> working on bug 1348087 and thought I had fixed. That patch is going to end
> up backed out from 58, since it continues to raise issues like this one and
> bug 1420210.

As per IRC, I believe the backout will fix future betas (e.g. b10), which would be great. It is important to note, though, that it is only part of the fix.

We still have to deal with current betas. Any beta client that has adopted new download code in 1348087 will likely get stuck when trying to apply complete updates to latest. That's b5 as of now since it doesn't have a partial available to b9 but b6, b7, b8, and b9 will be next.

We need to rescue those users. @rstrong, @mhowell, any ideas? Hot fix? Releng can provide any update logs that would be useful in diagnosing the issue as well as help coordinate the solution.
Matt, can we do the beta backout today so it's in 58.0b10?
Flags: needinfo?(mhowell)
Yes, I can do that now, let me just file a separate bug for it.
Flags: needinfo?(mhowell)
I have the backout (for bugs 1348087 and 1416295) ready locally, happy to push to beta with the new bug number if that's easier.
Oh, that would be great, thank you. Bug 1423967 is the one I just filed.
What's the next step here?  Do we expect that partial updates would work?  If so, can we generate partials from e.g. 58.0b3 to 58.0b10, test that, and if that works then generate partial updates to 58.0b10 from all previous 58.0bX builds, and set up a watershed on 58.0b10?
I hope not, as we would have to generate partials again for b11, b12, etc. until all users are on >=58.0b10. That would quickly balloon if we have to constantly backdate partials from >=58.0b3. Generating and testing a partial is not trivial even when you are doing just one.

Can this be fixed another way? Do we know why 1348087 and 1416295 have caused this?
Another idea proposed by callek would be to set up some temporary watersheds for 58 users. So if you are on 58 beta cycle, you would watershed to the latest partial update available. In other words, since we generate partials for most recent three betas, we would create the following rules in order of highest priority:

<58.0 -> latest beta
<=58.0b1 -> b4
<=58.0b4 -> b8
<=58.0b8 -> b10 (where we have the fix)

A couple things I would like to know before we proceed with either the watershed or partial gen solution:

1. what beta version are we first stranding users who are missing partials? b1 (first 58 for devedition), b3 (first 58 for beta) or b5? This bug is filed for at least b5 and iirc what is interesting about that is that we started rescuing b1 users on b5 forced SSL updates.

2. What does this have in common with previously stranded beta1 users? Do forced SSL updates have any impact on complete updates?

mhowell: Thanks for resolving this for >=b10. Any help would be appreciated for users less than that. Thoughts on my questions in comment 11 and here?
Flags: needinfo?(mhowell)
See Also: → 1423856
See Also: 1423856
I don't have much of an idea about what's causing this right now; I thought I had fixed this error before I ever landed anything, and I'm yet to see it repro myself. Because of that it's hard for me to say when this should have started; I don't know why it wouldn't have been in b1 when my changes landed, but I don't know if there might be some other factor introduced later that's making the situation worse.
Flags: needinfo?(mhowell)
matt: okay thanks for poking. Appreciated.

Callek confirmed that b10 and onwards is now fixed. Complete updates work again.

As far as rescuing earlier beta users, here is the latest plan:

B12 gtb is today. Normally we include three partials per release but we can input up to 6. Given that I propose we choose 6 partials: b6-11 for b12. This will rescue b6,b7,8,9 users for free. We should create a watershed on b12 until we are satisfied all users are >= 58b10 (the fixed version).

For <b6, we make another temp watershed for 58b6. This will give b3,4,5 users a partial update to b6 and from there they will update straight to b12. It's unclear if <b3 users are affected. We can follow up there if needed by adding an additional watershed on 58.0b3.

Some notes on the two proposed watersheds:

58.0b6:
  * this should only apply for >=58.0b1 as <=57 are for sure unaffected.

58.0b12
  * this should only apply for >=58.0b6 as earlier versions are covered either by the above watershed or they are unaffected.
  * >=58.0b10 users can ignore this watershed as they can update via complete mar
  * given the two above bullet constraints, perhaps we should be explicit this watershed is for versions: b6, b7, b8, and b9
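
As a rough illustration of the two-watershed plan above, the following Python sketch computes which hops a 58.0bN client would take using partials only. The helper names `partial_sources` and `plan` are hypothetical, and the sketch assumes partials are always built from the betas immediately preceding the target (three for b6, six for b12, as proposed here).

```python
def partial_sources(target_beta, count):
    """Beta numbers that get a partial to target_beta, assuming the
    `count` immediately preceding betas are chosen."""
    first = max(1, target_beta - count)
    return list(range(first, target_beta))

def plan(beta):
    """Return the watershed hops (as beta numbers) for a 58.0b<beta> client.

    An empty list means neither watershed offers this client a partial.
    """
    hops = []
    current = beta
    # First watershed: b3, b4, b5 take a partial to b6.
    if current in partial_sources(6, 3):
        hops.append(6)
        current = 6
    # Second watershed: b6..b11 take a partial to b12.
    if current in partial_sources(12, 6):
        hops.append(12)
    return hops

for beta in (2, 5, 9, 10):
    print(f"58.0b{beta}: {plan(beta)}")
```

Note that b10 and b11 also show a hop to b12 here simply because a partial exists for them; as stated above, they can update via complete mar as well. Clients below b3 get no hop in this sketch, matching the open question about whether they are affected.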
the rules for devedition and beta are scheduled for 24h from now and require signoff.

These should be enacted once we release b12 to everyone.

https://aus4-admin.mozilla.org/rules?product=Firefox&channel=beta
https://aus4-admin.mozilla.org/rules?product=Firefox&channel=aurora
I fished a little ADI data out of crash-stats (derived from blocklist IIRC). This is for Friday December 15th, Windows only, not sure if Firefox or also DevEd, via https://crash-stats.mozilla.com/api/ADI/?end_date=2017-12-15&platforms=Windows&product=Firefox&start_date=2017-12-15&versions=58.0b	
	
 Beta	   ADI
58.0b3	     182
58.0b4	   16513
58.0b5	    7844
58.0b6	   17877
58.0b7	   21031
58.0b8	   32261
58.0b9	   55632
58.0b10	  156974
58.0b11	 1051160

Adding Mac & Linux doesn't make much difference to those numbers. Unfortunately I don't have comparative data for the 57 cycle, and adjusting the url for 57.0b yields < 100 ADI for each version which seems implausible given what we know about long tails of older versions in use.
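
For a rough sense of scale, the table above can be summed per group. This is only a sketch over the Windows-only figures quoted in this comment: the variable names are made up for illustration, and "without the fix" here just means "on a build older than b10", not that every such client actually hit the error.

```python
# ADI per 58.0bN build, copied from the table above (Windows only, 2017-12-15).
ADI = {
    3: 182,
    4: 16513,
    5: 7844,
    6: 17877,
    7: 21031,
    8: 32261,
    9: 55632,
    10: 156974,
    11: 1051160,
}

FIRST_FIXED_BETA = 10  # b10 is the first beta containing the backout

total = sum(ADI.values())
without_fix = sum(n for beta, n in ADI.items() if beta < FIRST_FIXED_BETA)
# Split by which proposed watershed would pick the client up first.
via_b6 = sum(n for beta, n in ADI.items() if beta < 6)
via_b12 = sum(n for beta, n in ADI.items() if 6 <= beta < FIRST_FIXED_BETA)

print(f"total 58 beta ADI:        {total}")
print(f"on builds without fix:    {without_fix} ({without_fix / total:.1%})")
print(f"  routed via b6 watershed:  {via_b6}")
print(f"  routed via b12 watershed: {via_b12}")
```

On these numbers, roughly one in nine 58 beta clients on that day was still on a build older than b10.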
The update rules from comment 14 are now in place. B5 users should receive a partial to b6 and then to b12. Can someone affected confirm this?
Bogdan, could you verify based on Comment #17?
Flags: needinfo?(bogdan.maris)
Okay, I still don't know the root cause of the NS_ERROR_FILE_NOT_FOUND, but this patch avoids the circumstance under which the error shows up: it no longer restarts the downloader whenever the about box is opened, since we're not getting any benefit out of doing that anymore.

To test this, I downloaded 58.0b5 on Windows, reproduced the bug, then modified that build with this patch, and I could no longer reproduce the bug after that. So I do think this successfully patches over the problem, though I still want to understand the real root cause better.
Attachment #8937865 - Flags: review?(robert.strong.bugs)
(In reply to Mike Taylor [:miketaylr] (58 Regression Engineering Owner) from comment #18)
> Bogdan, could you verify based on Comment #17?

I can still reproduce this issue updating from 58.0b3 or 58.0b5 (if I open the about window fast enough); let's hope that the patch Matt has prepared in comment 19 does fix this.

Note that the watershed on 58.0b6 works as planned (beta 58.0b3, 58.0b4, 58.0b5 update to 58.0b6 then to 58.0b12).
Flags: needinfo?(bogdan.maris)
Flags: needinfo?(robert.strong.bugs)
Comment on attachment 8937865 [details] [diff] [review]
Patch - Don't reset background downloads when the about box is opened

Talked with mhowell over irc and Bug 1426487 should fix this bug.
Attachment #8937865 - Attachment is obsolete: true
Attachment #8937865 - Flags: review?(robert.strong.bugs)
Setting 58 status to fixed, since the backout in bug 1423967 landed in 58.
(uh, but maybe Comment #20 is important...)
> > Bogdan, could you verify based on Comment #17?
> 
> I still can reproduce this issue updating from 58.0b3 or 58.0b5 (if I open
> the about window fast enough), lets hope that the patch Matt has prepared in
> comment 19 does fix this. 
> 
> Note that the watershed on 58.0b6 works as planned (beta 58.0b3, 58.0b4,
> 58.0b5 update to 58.0b6 then to 58.0b12).

I think the last thing to do here is verify that this issue can't be reproduced after the patch in #c19. [It's not really clear to me if #c20 reflects a test with that patch or not, but I suspect from the phrasing that it does not.]
Flags: needinfo?(bogdan.maris)
Bogdan, can you re-test with a later build please, or verify that you tested with a build that included the patch in comment 19? Thanks.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #25)
> Bogdan, can you re-test with a later build please, or verify that you tested
> with a build that included the patch in comment 19? Thanks.

Sure!

After testing across platforms (Windows 10 64bit, Ubuntu 16.04 32bit and macOS 10.13.2) using different beta versions that include the backout from bug 1423967 (58.0b10, 58.0b11, 58.0b12, 58.0b13 and 58.0b14), they all successfully updated to latest beta without any issues on "beta" channel.
Flags: needinfo?(bogdan.maris)
Flags: needinfo?(gchang)
Matt, do we think that 59 won't run into the same issues?  Is bug 1426487 expected to get in this week, or should we do the backout from 59 too and get the eventual fix in 60?
Flags: needinfo?(mhowell)
Bug 1426487 includes exactly the patch I had up here, as well as some other changes that should also help, so I do not expect to see this issue again once that lands. I'll redirect the question about the status of bug 1426487 to :rstrong.
Flags: needinfo?(mhowell) → needinfo?(robert.strong.bugs)
11:45 AM <Callek> mhowell: re: 1423571 -- your comment says "once <> lands...." I should note we already went to build for 59 beta 1 for devedition. Is that to say that as of now, mozilla-central (and devedition 59beta1) code would still exhibit the bug?
11:47 AM <mhowell> Callek: oh. then the answer to that question is yes, sorry.
Callek, thanks for catching this, I had not realized the fix didn't land for 59.
I think it would be safer to backout the changes on beta as was done in bug 1423967.
Flags: needinfo?(robert.strong.bugs)
This was fixed for Fx59 via backout on a push directly to mozilla-beta over in bug 1430846. I didn't land the backout on trunk, so we'll need to set Fx60 to affected once those flags are available.
ni myself to do that next week.
Flags: needinfo?(jcristau)
Actually, looks like the 60 flags are already available.
Flags: needinfo?(jcristau)
Bulk change of QA Contact to :jlund, per https://bugzilla.mozilla.org/show_bug.cgi?id=1428483
QA Contact: catlee → jlund
Found a potentially related issue that affects the manual update process on the aurora-cdntest channel from DevEdition 58.0b2: the update check from About Firefox fails and the background update notification is displayed, but the update fails using this method too, and the user is redirected to the Firefox download page. Here is the Browser Console output: https://goo.gl/QxVq8U.
I have to mention that this issue was reproduced on Windows 10 x64 and it does not seem to be locale-related.
I will continue investigating this, in order to find out whether other 58 builds are affected too and whether this is reproducible across platforms.
Said in e-mail too, but repeating here:

The issues from 58.0b2 are not completely unexpected. We created a handful of watershed updates that should update first to 58.0b6, then to 58.0b12, then finally to the latest build. These watersheds are merely to mitigate the update issues we found in that bug, by providing partial updates which should download and apply faster.

However, the failure mode can be hit just as easily while downloading a partial as while downloading a complete; the smaller overall file size simply makes the partials more likely than the completes to update users successfully.

Hope that helps explain things,
(If however, you were getting updates directly to 59.0b2 from 58.0b2 on aurora-cdntest, let me know and I'll have to revisit to know why)
(In reply to Justin Wood (:Callek) from comment #37)
> The issues from 58.0b2 are not completely unexpected. We created a handful
> of watershed updates that should update first to 58.0b6, then to 58.0b12,
> then finally to 58.0b12. 

note: beta-cdntest is missing a watershed so QA had different results:

<StefanG_QA_SV_LV> Hi,
14:52:10 <StefanG_QA_SV_LV> Is there a watershed in place on the beta-cdntest channel when updating from 58.0b6, b7, b8, and b9 to 58.0b12 as noted in https://bugzil.la/1423571#c14 ?
14:52:10 <StefanG_QA_SV_LV> Our tests are going directly to 59.0b3 from 58.0b7 and b9 on beta-cdntest channel.
15:00:27 <jlund> callek implemented the rules but let me check current state
15:00:46 <StefanG_QA_SV_LV> thank you
15:03:31 <jlund> yes, looks like beta-cdntest is different than beta
15:03:43 <jlund> https://irccloud.mozilla.com/file/ogg20HqQ/Screen%20Shot%202018-01-23%20at%2015.02.27.png https://irccloud.mozilla.com/file/4Jy1Jgcp/Screen%20Shot%202018-01-23%20at%2015.02.56.png
15:04:04 <jlund> notice on the first attachment, the 58.0b12 watershed is missing
15:04:21 <StefanG_QA_SV_LV> yes I can see
15:05:10 <StefanG_QA_SV_LV> we are expecting the identical watershed on beta-cdntest as on beta
15:05:20 <jlund> out of curiosity, are you able to update from b7,8,9 to 59.0b3 without issue? If so, it might be proof we don't actually need these watersheds anyway
15:05:30 <StefanG_QA_SV_LV> yes
15:05:33 <jlund> yeah, they probably should be identical.
15:05:37 <StefanG_QA_SV_LV> we are able to update without any issues
I added the missing beta-cdntest watershed rule. Users should now go from 58.0b6->b12 before going to latest

rule:  735
I'm going to resolve this as fixed, we've done the best we can with watersheds at this point
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED