Firefox isn't launched with e10s+sandbox after updating it

RESOLVED FIXED

Status

()

defect
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: m_kato, Assigned: bobowen)

Tracking

({regression})

55 Branch
Unspecified
Windows
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55blocking fixed)

Details

(Whiteboard: sb+)

Attachments

(1 attachment)

After updating Nightly, I cannot sometimes launch Nightly with e10s+content sandbox.  When debugging this, image_path into memory isn't same as process_path.

But I don't know why image_path by GetImageFilePath (GetMappedFileNameW) is *\updated\firefox.exe.

0:008> x
@rsi              process = 0x00000000`0000027c
0000002f`951fec70 process_path = "\Device\HarddiskVolume3\Program Files\Nightly\firefox.exe"
0000002f`951feb18 mem_info = struct _MEMORY_BASIC_INFORMATION
0000002f`951fec50 image_path = "\Device\HarddiskVolume3\Program Files\Nightly\updated\firefox.exe"

0:008> k
Child-SP          RetAddr           Call Site
0000002f`951feaf0 00007ff7`6170b50c firefox!sandbox::GetProcessBaseAddress+0x9f
0000002f`951fecc0 00007ff7`616f37a0 firefox!sandbox::TargetProcess::Create+0x1cc
0000002f`951fedb0 00007ffe`15d8f1ee firefox!sandbox::BrokerServicesBase::SpawnTarget+0x3f0
0000002f`951ff0f0 00007ffe`15f981be xul!mozilla::SandboxBroker::LaunchApp+0x1c6
0000002f`951ff400 00007ffe`15f97ab5 xul!mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunchInternal+0x592
0000002f`951ff620 00007ffe`15f99419 xul!mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunch+0x61
0000002f`951ff750 00007ffe`15f98f49 xul!mozilla::ipc::GeckoChildProcessHost::RunPerformAsyncLaunch+0x45
(Inline Function) --------`-------- xul!mozilla::detail::RunnableMethodArguments<std::vector<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > >,enum base::ProcessArchitecture>::applyImpl+0x1a
(Inline Function) --------`-------- xul!mozilla::detail::RunnableMethodArguments<std::vector<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > >,enum base::ProcessArchitecture>::apply+0x1a
0000002f`951ff7b0 00007ffe`15456ac0 xul!mozilla::detail::RunnableMethodImpl<mozilla::ipc::GeckoChildProcessHost * __ptr64 const,bool (__cdecl mozilla::ipc::GeckoChildProcessHost::*)(std::vector<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > >,enum base::ProcessArchitecture) __ptr64,0,0,std::vector<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > >,enum base::ProcessArchitecture>::Run+0x39
0000002f`951ff800 00007ffe`15456a94 xul!MessageLoop::RunTask+0x24
0000002f`951ff830 00007ffe`1552856c xul!MessageLoop::DeferOrRunPendingTask+0x24
0000002f`951ff860 00007ffe`15796c0a xul!MessageLoop::DoWork+0x314
0000002f`951ff910 00007ffe`15797423 xul!base::MessagePumpForIO::DoRunLoop+0x1a
(Inline Function) --------`-------- xul!base::MessagePumpWin::RunWithDispatcher+0x33
0000002f`951ff940 00007ffe`157976c3 xul!base::MessagePumpWin::Run+0x3f
0000002f`951ff990 00007ffe`15797686 xul!MessageLoop::RunHandler+0x1b
0000002f`951ff9c0 00007ffe`15797571 xul!MessageLoop::Run+0x3e
0000002f`951ffa10 00007ffe`1579748a xul!base::Thread::ThreadMain+0xdd
0000002f`951ffb70 00007ffe`54ad2774 xul!`anonymous namespace'::ThreadFunc+0xa
0000002f`951ffba0 00007ffe`567e0d61 KERNEL32!BaseThreadInitThunk+0x14
0000002f`951ffbd0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Summary: Firefox doesn't launch with e10s+sandbox after updating it → Firefox isn't launched with e10s+sandbox after updating it
Until rebooting Windows, it cannot be launched.
As part of the sandbox launch, the parent process has to patch things in the suspended child process before it releases it.
It has to load the image to calculate where it needs to do that.
If the images are getting out of sync, I suspect that isn't going to end well.

Not sure what \Device\HarddiskVolume3\Program Files\Nightly\updated\firefox.exe is, I guess it's part of the updater.
Given the need to reboot it sounds like there might be some sort of locking problem.

rstrong - any chance this could be a similar issue to bug 1352192, where we launch a child process during update?
Flags: needinfo?(robert.strong.bugs)
The Nightly\updated\firefox.exe is a copy of Nightly\firefox.exe that is then updated while Firefox is running and it shouldn't require a reboot to launch firefox.exe.

With bug 1352192 what happens is one instance updates the installation while another instance is running and when e10s launches a binary it ends up launching the updated binary. This might be related somehow. Any idea how running code could end up referencing the copied firefox.exe?
Flags: needinfo?(robert.strong.bugs)
Reference bug that implemented update staging - bug 307181
So, this appears to be happening after the update has been staged and before the staged update has been moved to the installation directory.
Whiteboard: sb+
[Tracking Requested - why for this release]: Assuming mconley is right, bug 1372222 and bug 1371714 are also this bug. I believe based on that we should block 55b1: I don't think we know how often this is happening to users, but it leaves the browser horribly broken.

Is this a recent regression specifically on windows creator update? Or getting more prevalent with e10s-multi or a more recent sandboxing change?
I haven't experienced this before the creator update, but have seen it twice since - in a week.  I don't know what sandboxing changes recently happened though, so that could have been a coincidence.
The sandbox code, that is getting mismatched paths, landed as part of an update to the chromium sandbox over 2 months ago.

So, it looks like it might be an interaction between our updater, this sandbox code and some change in the Creators update.
I'm just manually updating my laptop, to see if I can reproduce.
CCing Ritu due to the IRC discussion about this potentially being considered a 55b1 blocker.
I am going to track this as a blocker for now until we know more on the frequency of this problem.
(In reply to Ritu Kothari (:ritu) from comment #10)
> I am going to track this as a blocker for now until we know more on the
> frequency of this problem.

Do we have a way of doing this?  We should at least, and as fast as we can, put in some kind of a message/telemetry/whatever when this happens.  After all, we actually make the "terminate content process" call, so we should be able to track when we've made it and if we've ever gotten by that call successfully in the session.  And if not, we should restart with e10s off.

Bob, do you know if we have a way of measuring the frequency/size of the problem?
Flags: needinfo?(bobowencode)
This should add Telemtry when it fails to launch with the error code.
Attachment #8876859 - Flags: review?(jmathies)
Attachment #8876859 - Flags: feedback?(benjamin)
Assignee: nobody → bobowencode
Status: NEW → ASSIGNED
With beta staged rollout (starting with Dawn project), we'll push 55.0b1 and 55.0b2 < 25% of beta population. That should give us some indication of how severe this problem is. 

At the same time, Milan suggested we get in touch with MS to find a good fix.

I will gtb 55.0b1 with Bob's patch to get more telemetry data on sandbox launch failures.
Comment on attachment 8876859 [details] [diff] [review]
Add telemetry for when a sandboxed child process fails to start

r+ a telemetry probe that accumulates process creation failure results.
Attachment #8876859 - Flags: review?(jmathies) → review+
(In reply to Ritu Kothari (:ritu) from comment #13)
> ...
> 
> At the same time, Milan suggested we get in touch with MS to find a good fix.

Include in this conversation somebody that knows how we restart Windows builds after update, and why we'd have the updated/ in the path (comment 0.)  'Cause I can't explain how we do it, so I won't be able to ask the right questions :)
FWIW, I just updated my win10 to creators edition, no trouble at all (so far) launching or browsing on Nightly56 and Beta54 RC.
Comment on attachment 8876859 [details] [diff] [review]
Add telemetry for when a sandboxed child process fails to start

Review of attachment 8876859 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +12477,5 @@
> +    "kind": "enumerated",
> +    "n_values": 50,
> +    "bug_numbers": [1368600],
> +    "cpp_guard": "XP_WIN",
> +    "description": "Error code when a Windows sandboxed process fails to launch."

What does each error code mean? Could you link from the description to a place where the error codes are defined?
(In reply to François Marier [:francois] from comment #18)
> Comment on attachment 8876859 [details] [diff] [review]
> Add telemetry for when a sandboxed child process fails to start
> 
> Review of attachment 8876859 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/Histograms.json
> @@ +12477,5 @@
> > +    "kind": "enumerated",
> > +    "n_values": 50,
> > +    "bug_numbers": [1368600],
> > +    "cpp_guard": "XP_WIN",
> > +    "description": "Error code when a Windows sandboxed process fails to launch."
> 
> What does each error code mean? Could you link from the description to a
> place where the error codes are defined?

Given that this is blocking 55b1 and that Bob won't get to it until tomorrow, I'm ok if this is addressed in a follow-up bug.

datareview+ with the above follow-up work.
Thanks Francois. Hi Ryan, can we please get this landed on m-b soon? I'd like to gtb 55.0b1 in the next hour or two if possible.
Flags: needinfo?(ryanvm)
Attachment #8876859 - Flags: feedback?(benjamin)
Pushed by bobowencode@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e62914155f64
Add telemetry with the error code for when a Windows sandboxed child process fails to start. r=jimm, data-review=francois
Keywords: leave-open
(In reply to Milan Sreckovic [:milan] from comment #15)
> (In reply to Ritu Kothari (:ritu) from comment #13)
> > ...
> > 
> > At the same time, Milan suggested we get in touch with MS to find a good fix.
> 
> Include in this conversation somebody that knows how we restart Windows
> builds after update, and why we'd have the updated/ in the path (comment 0.)
> 'Cause I can't explain how we do it, so I won't be able to ask the right
> questions :)

I imagine rstrong is the best person from an updater point of view.

From looking at the code, I believe the staged update process is as follows (assuming default 64-bit paths for Nightly):
* Copy from "C:\Program Files\Nightly" to "C:\Program Files\Nightly\updated"
* Apply update to "C:\Program Files\Nightly\updated"
* When restart is triggered ... launch updater.exe and shutdown Firefox.
* updater renames "C:\Program Files\Nightly" to "C:\Program Files\Nightly.bak"
* updater renames "C:\Program Files\Nightly.bak\updated" to "C:\Program Files\Nightly"
* updater deletes "C:\Program Files\Nightly.bak"
* updater relaunches Firefox

It appears that somehow the "C:\Program Files\Nightly\updated\firefox.exe" is getting mapped into memory, prior to the rename to "C:Program Files\Nightly\firefox.exe".
Then even though the child process is getting launched using "C:Program Files\Nightly\firefox.exe", the call to GetMappedFileNameW is still returning "C:\Program Files\Nightly\updated\firefox.exe".

I've had no luck in trying to reproduce.
I also tried running the firefox.exe in updated using a different profile and then triggering the restart and it seems to work fine, although the updated dir doesn't get deleted.
I'm really at a loss as to how GetMappedFileNameW could be returning "C:\Program Files\Nightly\updated\firefox.exe", etc.

Staging doesn't save much if any time over applying an update directly to the install directory on Windows. As a matter of fact, staging was broken on Windows for a release and since it falls back to applying directly to the install directory no one noticed it since updates were still successful. If anyone can reproduce this please try setting the app.update.staging preference to false and try to reproduce. Thanks!
Bug 1370576 might also be related. I've also asked the reporter to check if it can be reproduced without staging.
(In reply to Bob Owen (:bobowen) from comment #23)
...
> I've had no luck in trying to reproduce.
> I also tried running the firefox.exe in updated using a different profile
> and then triggering the restart and it seems to work fine, although the
> updated dir doesn't get deleted.

OK, I've managed to reproduce artificially.
rstrong's comment made me realise that the above was probably blocking the renaming and it was just updating in place, which is why the updated dir was still there.

If I run the ...\updated\firefox.exe with a different profile, then close it and then trigger the restart I get the problem.
The mapped image name is being returned as ...\updated\firefox.exe just as in comment 0.

So, it does look like the image is getting mapped and then the mapped name is not being updated correctly when it is renamed.
What is not clear is how the image is getting mapped before the rename - virus checker maybe?

The same steps on my pre-Creators update Win10 VM, does not cause the issue.
Just about to update the VM as well to test.

Clearly turning off the staged update would make these artificial steps impossible, but it would seem that it would fix the real problem as well.
It looks like staging is also causing bug 1370576 and I suspect it is the same root cause.
Can the updater terminate running firefox.exe processes vs. waiting for them to shutdown?
(In reply to Jim Mathies [:jimm] from comment #28)
> Can the updater terminate running firefox.exe processes vs. waiting for them
> to shutdown?
From what I can tell by the logs in bug 1370576 the firefox.exe process has exited when this happens. Also, if something like this were to be done it would probably be best if it were done by the WatchDog code.
Jim, do you have any idea why the copy of firefox.exe in the updated directory is getting mapped even though it isn't launched until after it is in the installation directory? I suspect Windows is doing something under the covers to cause this.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #29)
> (In reply to Jim Mathies [:jimm] from comment #28)
> > Can the updater terminate running firefox.exe processes vs. waiting for them
> > to shutdown?
> From what I can tell by the logs in bug 1370576 the firefox.exe process has
> exited when this happens.
The logs show that the process exited and the updater is able to get an exclusive file lock on firefox.exe so it has exited.
Uplifted the addressed review comments to Beta as well:
https://hg.mozilla.org/releases/mozilla-beta/rev/979d797835ad
(In reply to Bob Owen (:bobowen) from comment #26)
> ...
> The same steps on my pre-Creators update Win10 VM, does not cause the issue.
> Just about to update the VM as well to test.

Update finally finished and the issue is now reproducible with those steps, so it definitely points to a change in the Creators update.
Bob, could you check that this isn't reproducible when the app.update.staging.enabled preference is set to false? Thanks!
(In reply to Bob Owen (:bobowen) from comment #26)
> What is not clear is how the image is getting mapped before the rename -
> virus checker maybe?

I don't have anything beyond the Windows Defender (which is telling me that my apps and browser control is off and that I should do something about it.)
Note: I just landed the patch in Bug 1370576 on mozilla-inbound to disable update staging on Windows.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #34)
> Bob, could you check that this isn't reproducible when the
> app.update.staging.enabled preference is set to false? Thanks!

As I said I'm manually running the firefox.exe in updated before triggering the rest of the update, so it would definitely stop those steps, but I haven't seen the issue during a normal update.
Oh, I thought you reproduced without using artificial steps as well. The reason I was hoping for non artificial steps is because the firefox.exe in updated isn't launched.
I think I might have been two updates behind when I ran into this problem, but I'm not 100% sure of this.  As in, I was still running Wednesday's version on Friday, and did an update from the hamburger menu.
Flags: needinfo?(ryanvm)
Duplicate of this bug: 1371714
This should have been fixed even with staged updates with the patch from bug 1385928.
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Depends on: 1385928
Keywords: leave-open
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.