1368600 - Firefox isn't launched with e10s+sandbox after updating it

Reporter

Description

•

8 years ago

After updating Nightly, I cannot sometimes launch Nightly with e10s+content sandbox. When debugging this, image_path into memory isn't same as process_path. But I don't know why image_path by GetImageFilePath (GetMappedFileNameW) is *\updated\firefox.exe. 0:008> x @rsi process = 0x00000000`0000027c 0000002f`951fec70 process_path = "\Device\HarddiskVolume3\Program Files\Nightly\firefox.exe" 0000002f`951feb18 mem_info = struct _MEMORY_BASIC_INFORMATION 0000002f`951fec50 image_path = "\Device\HarddiskVolume3\Program Files\Nightly\updated\firefox.exe" 0:008> k Child-SP RetAddr Call Site 0000002f`951feaf0 00007ff7`6170b50c firefox!sandbox::GetProcessBaseAddress+0x9f 0000002f`951fecc0 00007ff7`616f37a0 firefox!sandbox::TargetProcess::Create+0x1cc 0000002f`951fedb0 00007ffe`15d8f1ee firefox!sandbox::BrokerServicesBase::SpawnTarget+0x3f0 0000002f`951ff0f0 00007ffe`15f981be xul!mozilla::SandboxBroker::LaunchApp+0x1c6 0000002f`951ff400 00007ffe`15f97ab5 xul!mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunchInternal+0x592 0000002f`951ff620 00007ffe`15f99419 xul!mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunch+0x61 0000002f`951ff750 00007ffe`15f98f49 xul!mozilla::ipc::GeckoChildProcessHost::RunPerformAsyncLaunch+0x45 (Inline Function) --------`-------- xul!mozilla::detail::RunnableMethodArguments<std::vector<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > >,enum base::ProcessArchitecture>::applyImpl+0x1a (Inline Function) --------`-------- xul!mozilla::detail::RunnableMethodArguments<std::vector<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > >,enum base::ProcessArchitecture>::apply+0x1a 0000002f`951ff7b0 00007ffe`15456ac0 xul!mozilla::detail::RunnableMethodImpl<mozilla::ipc::GeckoChildProcessHost * __ptr64 const,bool (__cdecl mozilla::ipc::GeckoChildProcessHost::*)(std::vector<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > >,enum base::ProcessArchitecture) __ptr64,0,0,std::vector<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > >,enum base::ProcessArchitecture>::Run+0x39 0000002f`951ff800 00007ffe`15456a94 xul!MessageLoop::RunTask+0x24 0000002f`951ff830 00007ffe`1552856c xul!MessageLoop::DeferOrRunPendingTask+0x24 0000002f`951ff860 00007ffe`15796c0a xul!MessageLoop::DoWork+0x314 0000002f`951ff910 00007ffe`15797423 xul!base::MessagePumpForIO::DoRunLoop+0x1a (Inline Function) --------`-------- xul!base::MessagePumpWin::RunWithDispatcher+0x33 0000002f`951ff940 00007ffe`157976c3 xul!base::MessagePumpWin::Run+0x3f 0000002f`951ff990 00007ffe`15797686 xul!MessageLoop::RunHandler+0x1b 0000002f`951ff9c0 00007ffe`15797571 xul!MessageLoop::Run+0x3e 0000002f`951ffa10 00007ffe`1579748a xul!base::Thread::ThreadMain+0xdd 0000002f`951ffb70 00007ffe`54ad2774 xul!`anonymous namespace'::ThreadFunc+0xa 0000002f`951ffba0 00007ffe`567e0d61 KERNEL32!BaseThreadInitThunk+0x14 0000002f`951ffbd0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

Makoto Kato [:m_kato]

Reporter

Updated

•

8 years ago

Summary: Firefox doesn't launch with e10s+sandbox after updating it → Firefox isn't launched with e10s+sandbox after updating it

Makoto Kato [:m_kato]

Reporter

Comment 1

•

8 years ago

Until rebooting Windows, it cannot be launched.

Bob Owen (:bobowen)

Assignee

Comment 2

•

8 years ago

As part of the sandbox launch, the parent process has to patch things in the suspended child process before it releases it. It has to load the image to calculate where it needs to do that. If the images are getting out of sync, I suspect that isn't going to end well. Not sure what \Device\HarddiskVolume3\Program Files\Nightly\updated\firefox.exe is, I guess it's part of the updater. Given the need to reboot it sounds like there might be some sort of locking problem. rstrong - any chance this could be a similar issue to bug 1352192, where we launch a child process during update?

Flags: needinfo?(robert.strong.bugs)

Robert Strong (they/them - no direct email)

Comment 3

•

8 years ago

The Nightly\updated\firefox.exe is a copy of Nightly\firefox.exe that is then updated while Firefox is running and it shouldn't require a reboot to launch firefox.exe. With bug 1352192 what happens is one instance updates the installation while another instance is running and when e10s launches a binary it ends up launching the updated binary. This might be related somehow. Any idea how running code could end up referencing the copied firefox.exe?

Flags: needinfo?(robert.strong.bugs)

Robert Strong (they/them - no direct email)

Comment 4

•

8 years ago

Reference bug that implemented update staging - bug 307181

Robert Strong (they/them - no direct email)

Comment 5

•

8 years ago

So, this appears to be happening after the update has been staged and before the staged update has been moved to the installation directory.

Jim Mathies [:jimm]

Updated

•

8 years ago

Whiteboard: sb+

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

8 years ago

Comment 6

•

8 years ago

[Tracking Requested - why for this release]: Assuming mconley is right, bug 1372222 and bug 1371714 are also this bug. I believe based on that we should block 55b1: I don't think we know how often this is happening to users, but it leaves the browser horribly broken. Is this a recent regression specifically on windows creator update? Or getting more prevalent with e10s-multi or a more recent sandboxing change?

status-firefox55: --- → affected

tracking-firefox55: --- → ?

Milan Sreckovic [:milan] (needinfo for best results)

Comment 7

•

8 years ago

I haven't experienced this before the creator update, but have seen it twice since - in a week. I don't know what sandboxing changes recently happened though, so that could have been a coincidence.

Bob Owen (:bobowen)

Assignee

Comment 8

•

8 years ago

The sandbox code, that is getting mismatched paths, landed as part of an update to the chromium sandbox over 2 months ago. So, it looks like it might be an interaction between our updater, this sandbox code and some change in the Creators update. I'm just manually updating my laptop, to see if I can reproduce.

Ryan VanderMeulen [:RyanVM]

Comment 9

•

8 years ago

CCing Ritu due to the IRC discussion about this potentially being considered a 55b1 blocker.

status-firefox53: --- → unaffected

status-firefox54: --- → unaffected

status-firefox-esr52: --- → unaffected

Keywords: regression

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 10

•

8 years ago

I am going to track this as a blocker for now until we know more on the frequency of this problem.

tracking-firefox55: ? → blocking

Milan Sreckovic [:milan] (needinfo for best results)

Comment 11

•

8 years ago

(In reply to Ritu Kothari (:ritu) from comment #10) > I am going to track this as a blocker for now until we know more on the > frequency of this problem. Do we have a way of doing this? We should at least, and as fast as we can, put in some kind of a message/telemetry/whatever when this happens. After all, we actually make the "terminate content process" call, so we should be able to track when we've made it and if we've ever gotten by that call successfully in the session. And if not, we should restart with e10s off. Bob, do you know if we have a way of measuring the frequency/size of the problem?

Flags: needinfo?(bobowencode)

Bob Owen (:bobowen)

Assignee

Comment 12

•

8 years ago

Attached patch Add telemetry for when a sandboxed child process fails to start — Details — Splinter Review

This should add Telemtry when it fails to launch with the error code.

Attachment #8876859 - Flags: review?(jmathies)

Attachment #8876859 - Flags: feedback?(benjamin)

Bob Owen (:bobowen)

Assignee

Updated

•

8 years ago

Assignee: nobody → bobowencode

Status: NEW → ASSIGNED

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 13

•

8 years ago

With beta staged rollout (starting with Dawn project), we'll push 55.0b1 and 55.0b2 < 25% of beta population. That should give us some indication of how severe this problem is. At the same time, Milan suggested we get in touch with MS to find a good fix. I will gtb 55.0b1 with Bob's patch to get more telemetry data on sandbox launch failures.

Jim Mathies [:jimm]

Comment 14

•

8 years ago

Comment on attachment 8876859 [details] [diff] [review] Add telemetry for when a sandboxed child process fails to start r+ a telemetry probe that accumulates process creation failure results.

Attachment #8876859 - Flags: review?(jmathies) → review+

Milan Sreckovic [:milan] (needinfo for best results)

Comment 15

•

8 years ago

(In reply to Ritu Kothari (:ritu) from comment #13) > ... > > At the same time, Milan suggested we get in touch with MS to find a good fix. Include in this conversation somebody that knows how we restart Windows builds after update, and why we'd have the updated/ in the path (comment 0.) 'Cause I can't explain how we do it, so I won't be able to ask the right questions :)

Bob Owen (:bobowen)

Assignee

Comment 16

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=242569bedf9b2873927f47752df1dd1b249ef7cc

Flags: needinfo?(bobowencode)

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 17

•

8 years ago

FWIW, I just updated my win10 to creators edition, no trouble at all (so far) launching or browsing on Nightly56 and Beta54 RC.

François Marier [:francois]

Comment 18

•

8 years ago

Comment on attachment 8876859 [details] [diff] [review] Add telemetry for when a sandboxed child process fails to start Review of attachment 8876859 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/Histograms.json @@ +12477,5 @@ > + "kind": "enumerated", > + "n_values": 50, > + "bug_numbers": [1368600], > + "cpp_guard": "XP_WIN", > + "description": "Error code when a Windows sandboxed process fails to launch." What does each error code mean? Could you link from the description to a place where the error codes are defined?

François Marier [:francois]

Comment 19

•

8 years ago

(In reply to François Marier [:francois] from comment #18) > Comment on attachment 8876859 [details] [diff] [review] > Add telemetry for when a sandboxed child process fails to start > > Review of attachment 8876859 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: toolkit/components/telemetry/Histograms.json > @@ +12477,5 @@ > > + "kind": "enumerated", > > + "n_values": 50, > > + "bug_numbers": [1368600], > > + "cpp_guard": "XP_WIN", > > + "description": "Error code when a Windows sandboxed process fails to launch." > > What does each error code mean? Could you link from the description to a > place where the error codes are defined? Given that this is blocking 55b1 and that Bob won't get to it until tomorrow, I'm ok if this is addressed in a follow-up bug. datareview+ with the above follow-up work.

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 20

•

8 years ago

Thanks Francois. Hi Ryan, can we please get this landed on m-b soon? I'd like to gtb 55.0b1 in the next hour or two if possible.

Flags: needinfo?(ryanvm)

Ryan VanderMeulen [:RyanVM]

Comment 21

•

8 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/4bd58be05340

status-firefox55: affected → fixed

Bob Owen (:bobowen)

Assignee

Updated

•

8 years ago

Attachment #8876859 - Flags: feedback?(benjamin)

Pulsebot

Comment 22

•

8 years ago

Pushed by bobowencode@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/e62914155f64 Add telemetry with the error code for when a Windows sandboxed child process fails to start. r=jimm, data-review=francois

Bob Owen (:bobowen)

Assignee

Updated

•

8 years ago

Keywords: leave-open

Bob Owen (:bobowen)

Assignee

Comment 23

•

8 years ago

(In reply to Milan Sreckovic [:milan] from comment #15) > (In reply to Ritu Kothari (:ritu) from comment #13) > > ... > > > > At the same time, Milan suggested we get in touch with MS to find a good fix. > > Include in this conversation somebody that knows how we restart Windows > builds after update, and why we'd have the updated/ in the path (comment 0.) > 'Cause I can't explain how we do it, so I won't be able to ask the right > questions :) I imagine rstrong is the best person from an updater point of view. From looking at the code, I believe the staged update process is as follows (assuming default 64-bit paths for Nightly): * Copy from "C:\Program Files\Nightly" to "C:\Program Files\Nightly\updated" * Apply update to "C:\Program Files\Nightly\updated" * When restart is triggered ... launch updater.exe and shutdown Firefox. * updater renames "C:\Program Files\Nightly" to "C:\Program Files\Nightly.bak" * updater renames "C:\Program Files\Nightly.bak\updated" to "C:\Program Files\Nightly" * updater deletes "C:\Program Files\Nightly.bak" * updater relaunches Firefox It appears that somehow the "C:\Program Files\Nightly\updated\firefox.exe" is getting mapped into memory, prior to the rename to "C:Program Files\Nightly\firefox.exe". Then even though the child process is getting launched using "C:Program Files\Nightly\firefox.exe", the call to GetMappedFileNameW is still returning "C:\Program Files\Nightly\updated\firefox.exe". I've had no luck in trying to reproduce. I also tried running the firefox.exe in updated using a different profile and then triggering the restart and it seems to work fine, although the updated dir doesn't get deleted.

Robert Strong (they/them - no direct email)

Comment 24

•

8 years ago

I'm really at a loss as to how GetMappedFileNameW could be returning "C:\Program Files\Nightly\updated\firefox.exe", etc. Staging doesn't save much if any time over applying an update directly to the install directory on Windows. As a matter of fact, staging was broken on Windows for a release and since it falls back to applying directly to the install directory no one noticed it since updates were still successful. If anyone can reproduce this please try setting the app.update.staging preference to false and try to reproduce. Thanks!

Robert Strong (they/them - no direct email)

Comment 25

•

8 years ago

Bug 1370576 might also be related. I've also asked the reporter to check if it can be reproduced without staging.

Bob Owen (:bobowen)

Assignee

Comment 26

•

8 years ago

(In reply to Bob Owen (:bobowen) from comment #23) ... > I've had no luck in trying to reproduce. > I also tried running the firefox.exe in updated using a different profile > and then triggering the restart and it seems to work fine, although the > updated dir doesn't get deleted. OK, I've managed to reproduce artificially. rstrong's comment made me realise that the above was probably blocking the renaming and it was just updating in place, which is why the updated dir was still there. If I run the ...\updated\firefox.exe with a different profile, then close it and then trigger the restart I get the problem. The mapped image name is being returned as ...\updated\firefox.exe just as in comment 0. So, it does look like the image is getting mapped and then the mapped name is not being updated correctly when it is renamed. What is not clear is how the image is getting mapped before the rename - virus checker maybe? The same steps on my pre-Creators update Win10 VM, does not cause the issue. Just about to update the VM as well to test. Clearly turning off the staged update would make these artificial steps impossible, but it would seem that it would fix the real problem as well.

Robert Strong (they/them - no direct email)

Comment 27

•

8 years ago

It looks like staging is also causing bug 1370576 and I suspect it is the same root cause.

Jim Mathies [:jimm]

Comment 28

•

8 years ago

Can the updater terminate running firefox.exe processes vs. waiting for them to shutdown?

Robert Strong (they/them - no direct email)

Updated

•

8 years ago

Blocks: 1370576

Robert Strong (they/them - no direct email)

Updated

•

8 years ago

No longer blocks: 1370576

Depends on: 1370576

Robert Strong (they/them - no direct email)

Comment 29

•

8 years ago

(In reply to Jim Mathies [:jimm] from comment #28) > Can the updater terminate running firefox.exe processes vs. waiting for them > to shutdown? From what I can tell by the logs in bug 1370576 the firefox.exe process has exited when this happens. Also, if something like this were to be done it would probably be best if it were done by the WatchDog code.

Robert Strong (they/them - no direct email)

Comment 30

•

8 years ago

Jim, do you have any idea why the copy of firefox.exe in the updated directory is getting mapped even though it isn't launched until after it is in the installation directory? I suspect Windows is doing something under the covers to cause this.

Robert Strong (they/them - no direct email)

Comment 31

•

8 years ago

(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #29) > (In reply to Jim Mathies [:jimm] from comment #28) > > Can the updater terminate running firefox.exe processes vs. waiting for them > > to shutdown? > From what I can tell by the logs in bug 1370576 the firefox.exe process has > exited when this happens. The logs show that the process exited and the updater is able to get an exclusive file lock on firefox.exe so it has exited.

Ryan VanderMeulen [:RyanVM]

Comment 32

•

8 years ago

bugherder uplift

Uplifted the addressed review comments to Beta as well: https://hg.mozilla.org/releases/mozilla-beta/rev/979d797835ad

Bob Owen (:bobowen)

Assignee

Comment 33

•

8 years ago

(In reply to Bob Owen (:bobowen) from comment #26) > ... > The same steps on my pre-Creators update Win10 VM, does not cause the issue. > Just about to update the VM as well to test. Update finally finished and the issue is now reproducible with those steps, so it definitely points to a change in the Creators update.

Robert Strong (they/them - no direct email)

Comment 34

•

8 years ago

Bob, could you check that this isn't reproducible when the app.update.staging.enabled preference is set to false? Thanks!

Milan Sreckovic [:milan] (needinfo for best results)

Comment 35

•

8 years ago

(In reply to Bob Owen (:bobowen) from comment #26) > What is not clear is how the image is getting mapped before the rename - > virus checker maybe? I don't have anything beyond the Windows Defender (which is telling me that my apps and browser control is off and that I should do something about it.)

Robert Strong (they/them - no direct email)

Comment 36

•

8 years ago

Note: I just landed the patch in Bug 1370576 on mozilla-inbound to disable update staging on Windows.

Bob Owen (:bobowen)

Assignee

Comment 37

•

8 years ago

(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #34) > Bob, could you check that this isn't reproducible when the > app.update.staging.enabled preference is set to false? Thanks! As I said I'm manually running the firefox.exe in updated before triggering the rest of the update, so it would definitely stop those steps, but I haven't seen the issue during a normal update.

Robert Strong (they/them - no direct email)

Comment 38

•

8 years ago

Oh, I thought you reproduced without using artificial steps as well. The reason I was hoping for non artificial steps is because the firefox.exe in updated isn't launched.

Milan Sreckovic [:milan] (needinfo for best results)

Comment 39

•

8 years ago

I think I might have been two updates behind when I ran into this problem, but I'm not 100% sure of this. As in, I was still running Wednesday's version on Friday, and did an update from the hamburger menu.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 40

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/e62914155f64

Ryan VanderMeulen [:RyanVM]

Updated

•

8 years ago

Flags: needinfo?(ryanvm)

Bob Owen (:bobowen)

Assignee

Comment 42

•

8 years ago

This should have been fixed even with staged updates with the patch from bug 1385928.

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

Depends on: 1385928

Keywords: leave-open

Resolution: --- → FIXED