Closed Bug 1763625 Opened 2 years ago Closed 1 year ago

Intermittent SUMMARY: ThreadSanitizer: data race /builds/worker/checkouts/gecko/ipc/chromium/src/base/process_util_posix.cc:52:11 in OpenProcessHandle

Categories

(Core :: IPC, defect, P1)

Hardware: All
OS: Linux
Type: defect

Tracking


Status: RESOLVED FIXED
Target Milestone: 112 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox107 --- unaffected
firefox108 --- unaffected
firefox109 --- wontfix
firefox110 --- wontfix
firefox111 --- wontfix
firefox112 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jld)

References

Details

(Keywords: csectype-race, intermittent-failure, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main112+r])

Attachments

(1 file)

This looks familiar.

Group: dom-core-security
Component: DOM: Content Processes → IPC

Based on the stack, the spike in recent failures seems to come from bug 1787718.

Flags: needinfo?(hskupin)
Flags: needinfo?(hskupin)

Hm, ContentParent::Pid()/::GetOsPid() are racy. For example, the stack trace in
https://treeherder.mozilla.org/logviewer?job_id=389272588&repo=autoland&lineNumber=2404 shows a code path where any JS can reach GetOsPid(), which just calls Pid(), and apparently that isn't OK.

jld, could we somehow fix ContentParent::Pid()?
We can back out bug 1787718 for now, but it isn't really the cause; it just makes us call Pid() more often.
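
For readers unfamiliar with the failure mode, here is a minimal, self-contained sketch of the kind of pattern ThreadSanitizer flags here. The class and member names are made up for illustration and are not the actual Gecko code: one thread publishes the child process handle while another reads it with no synchronization.

  // Illustrative only: an unsynchronized read of a process "handle" racing
  // with the write performed on a launcher thread. Compiling with
  // -fsanitize=thread -pthread typically produces a data-race report similar
  // to the one in the log.
  #include <cstdio>
  #include <thread>

  struct FakeProcessHost {
    int mChildProcessHandle = 0;  // written on the launcher thread

    // Analogous to the racy Pid()/GetOsPid() path: a plain read, no lock.
    int Pid() const { return mChildProcessHandle; }
  };

  int main() {
    FakeProcessHost host;
    std::thread launcher([&] { host.mChildProcessHandle = 1234; });  // write
    std::printf("pid = %d\n", host.Pid());                           // racy read
    launcher.join();
    return 0;
  }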

Flags: needinfo?(jld)

Given that there is nothing I can do here, I would suggest that we back out the patch. Aryx, could you please take care of it?

Flags: needinfo?(aryx.bugmail)
Blocks: 1787718
See Also: 1787718

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #3)

Hm, ContentParent::Pid()/::GetOsPid() are racy. For example, the stack trace in
https://treeherder.mozilla.org/logviewer?job_id=389272588&repo=autoland&lineNumber=2404 shows a code path where any JS can reach GetOsPid(), which just calls Pid(), and apparently that isn't OK.

jld, could we somehow fix ContentParent::Pid()?
We can back out bug 1787718 for now, but it isn't really the cause; it just makes us call Pid() more often.

Jed, do you have some feedback for us?

My idea here is to protect mChildProcessHandle with a lock, and then get the pid under that lock (so, maybe add GeckoChildProcessHost::GetChildProcessId). That will avoid racing on the handle or doing a use-after-close (on Windows; on Unix a “handle” is a pid), but the pid might not refer to the process after the lock is released (if the process exits). If the pid is used only for logging and not to interact with the process, that should be an acceptable risk.
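
A rough sketch of that idea, with hypothetical names (GetChildProcessId and mHandleLock are illustrative, not existing APIs, and the real GeckoChildProcessHost is more involved): guard the handle with a mutex and only read the pid while holding it.

  // Sketch of a lock-protected pid accessor; on Unix the "handle" is just a
  // pid, which keeps the example simple.
  #include <mutex>
  #include <sys/types.h>

  class ChildProcessHostSketch {
   public:
    void SetHandle(pid_t aHandle) {
      std::lock_guard<std::mutex> lock(mHandleLock);
      mChildProcessHandle = aHandle;
    }

    // Safe to call from any thread, but the returned pid may be stale if the
    // child has already exited, so callers should only use it for logging.
    pid_t GetChildProcessId() {
      std::lock_guard<std::mutex> lock(mHandleLock);
      return mChildProcessHandle;
    }

   private:
    std::mutex mHandleLock;
    pid_t mChildProcessHandle = 0;
  };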

Flags: needinfo?(jld)

Interestingly, we have been seeing these assertions/crashes even on mozilla-central since November 29th, and it's always the test browser_test_powerMetrics.js that triggers them.

https://treeherder.mozilla.org/logviewer?job_id=398502658&repo=mozilla-central&lineNumber=4234

Florian recently made some changes in this area in bug 1802361, so those probably triggered this particular crash.

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #8)

Interestingly, we have been seeing these assertions/crashes even on mozilla-central since November 29th, and it's always the test browser_test_powerMetrics.js that triggers them.

https://treeherder.mozilla.org/logviewer?job_id=398502658&repo=mozilla-central&lineNumber=4234

If I understand these stacks correctly, they just mean that the "IPC I/O Parent" thread called GeckoChildProcessHost::AsyncLaunch after the main thread called ChromeUtils::RequestProcInfo (which called GeckoChildProcessHost::GetAll).

I don't think these calls themselves should be a problem, so it's probably at the GeckoChildProcessHost level that something needs fixing. That seems aligned with what Jed said in comment 7.

Florian recently made some changes in this area in bug 1802361, so those probably triggered this particular crash.

The only explanation I can see is that maybe the test now takes longer to complete.

Flags: needinfo?(florian)
OS: Unspecified → Linux
Hardware: Unspecified → All

browser_test_powerMetrics.js is causing this crash pretty frequently on Linux. Florian, is this something you could at least investigate?

Flags: needinfo?(florian)

I don't think I can do more than what I wrote in comment 9.
If we can't fix this and the failure rate becomes a problem, we might need to just disable the test on TSan builds.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #7)

My idea here is to protect mChildProcessHandle with a lock, and then get the pid under that lock

Is this something that you could easily do, or was there a problem with this idea?

Flags: needinfo?(florian) → needinfo?(jld)
Flags: needinfo?(jld)

Jed said he had a partial patch for this issue, so let's take it.

Assignee: nobody → jld
Priority: P5 → P1

Silly mistake on my part: the old version of ContentParent::Pid had this:

  if (!mSubprocess || !mSubprocess->GetChildProcessHandle()) {
    return -1;
  }

I accidentally dropped the !mSubprocess check, so the new version dereferences a null pointer when mSubprocess is null. Should be easy enough to fix.
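
For illustration, a sketch of the shape the corrected check could take, with the surrounding types stubbed out (the actual patch may differ):

  // Restore the !mSubprocess test before dereferencing the pointer; -1 stays
  // the "no process" sentinel, matching the old snippet quoted above.
  #include <sys/types.h>

  struct SubprocessStub {
    pid_t GetChildProcessId() { return mPid; }  // lock-protected in the real fix
    pid_t mPid = 0;
  };

  struct ContentParentSketch {
    SubprocessStub* mSubprocess = nullptr;

    pid_t Pid() const {
      if (!mSubprocess) {
        return -1;
      }
      pid_t pid = mSubprocess->GetChildProcessId();
      return pid ? pid : -1;
    }
  };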

Flags: needinfo?(jld)
Group: dom-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch

The patch landed in nightly and beta is affected.
:jld, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox111 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jld)
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Flags: needinfo?(jld)

It's already ridden the train to beta at this point, but for context: the severity is relatively minor and the patch isn't trivial, so I wasn't in favor of uplift.

Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main112+r]
Group: core-security-release