Closed Bug 1741131 Opened 3 years ago Closed 2 years ago

XPCOMSpinEventLoopStack annotation misalignment with main thread status

Categories

(Core :: XPCOM, defect, P2)

defect

Tracking

()

RESOLVED FIXED
96 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox94 --- unaffected
firefox95 --- wontfix
firefox96 --- fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression)

Attachments

(3 files)

In some crash reports we see a misalignment between the annotation and the real status of the main thread. While the annotation suggests to be stuck in some SpinEventLoopUntil, we actually seem to just sit and wait in the main event loop.

Some examples are:

https://crash-stats.mozilla.org/report/index/349cb5a1-4bfd-4c94-ba86-552450211112
https://crash-stats.mozilla.org/report/index/5d4233b5-109a-4ccd-8fa1-e45150211111

Blocks: 1740899
See Also: → 1740895

It is probably worth noting, that these are not random reports. There seems to be a stable relation between the annotation's origin and the nature of the stack, that is: For the same annotation I either always find a SpinEventLoopUntil on the stack (like in bug 1740889) or never (like in bug 1740899).

Nika, all the instances I've seen failing this way so far seemed to be on content processes (though there are also working examples from content processes). Ignoring how crash annotations really work, is there any chance we have a fallible way of reporting changes to them in content processes? Is this going through the parent process somehow?

Flags: needinfo?(nika)

Set release status flags based on info from the regressing bug 1731564

See Also: → 1740896
Flags: needinfo?(jstutte)

:jstutte do you have a severity and priority for this bug?

Severity for end users is very low, but it could help to diagnose other more nasty bugs. If we can't rely on the annotation (we put quite some effort in it), this would be bad for those bugs.

Severity: -- → S3
Flags: needinfo?(jstutte)
Priority: -- → P2

(In reply to Jens Stutte [:jstutte] from comment #1)

Nika, all the instances I've seen failing this way so far seemed to be on content processes (though there are also working examples from content processes). Ignoring how crash annotations really work, is there any chance we have a fallible way of reporting changes to them in content processes? Is this going through the parent process somehow?

It looks like the failures which you're seeing where the stack doesn't line up are all IPCError-browser | ShutDownKill failures, which are caused when a bug causes the parent process to kill the child process during shutdown. In these cases the minidump etc. are generated by the parent process on behalf of the killed child process. I wouldn't be surprised at all if there was a race between when we read the annotations sent by the content process and when we capture the content process' stacks (and I expect that annotations in content processes are sent somewhat asynchronously anyway), so it makes sense that we'd get this annotation out of sync with the captured stack there.

Flags: needinfo?(nika)
Assignee: nobody → jstutte
Status: NEW → ASSIGNED

(In reply to Nika Layzell [:nika] (ni? for response) from comment #5)

I wouldn't be surprised at all if there was a race between when we read the annotations sent by the content process and when we capture the content process' stacks (and I expect that annotations in content processes are sent somewhat asynchronously anyway), so it makes sense that we'd get this annotation out of sync with the captured stack there.

Assuming we cannot force a flush of this information from the content process side when unwinding the stack the best we can do is probably just to document this caveat.

I just learned we can filter on crashstats through proto signature in order to have only "valid" cases. Thanks Randell!

Severity: S3 → S4

(In reply to Jens Stutte [:jstutte] from comment #8)

I just learned we can filter on crashstats through proto signature in order to have only "valid" cases. Thanks Randell!

Be aware that this can filter out valid cases where SpinEventLoopUntil is inlined (which happens on some stacks, it seems).

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/74c6b68d73d1
Document the possible misalignment of the annotation wrt stack status in child processes. r=xpcom-reviewers,nika
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch

See bug 1740899 comment 8:

We see the annotation for "IPC Launch" unexpectedly on the child process despite we know that it lives only in the parent process. The doubt would be that we actually see the parent process' one here.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
No longer blocks: 1740899
See Also: → 1740899
Keywords: leave-open

We suspect to see the parent processes annotation on crash reports from child processes that result from IPC shutdown kills. In order to reduce the confusion we add a prefix for the process type.

The patch landed in nightly and beta is affected.
:jstutte, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

Actually we do not think this is yet fixed at all.

Flags: needinfo?(jstutte)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8c92e2f3d901
Be explicit about the process type in the XPCOMSpinEventLoopStack annotation. r=xpcom-reviewers,nika

Just to follow up here: We have the smoking gun that we actually see the parent process' annotation on crashing child processes if they do not have it.

I assume we should amend our documentation we just changed here in that sense, too.

Actually for the XPCOMSpinEventLoopStack annotation I would consider this to be a feature that lets us peek into the parent process' state when the IPCError-browser | ShutDownKill occurs. So while it feels a bit quirky to see parent process' annotations on child process reports, I hesitate to file a bug for it...

Keywords: leave-open

(In reply to Jens Stutte [:jstutte] from comment #19)

Actually for the XPCOMSpinEventLoopStack annotation I would consider this to be a feature that lets us peek into the parent process' state when the IPCError-browser | ShutDownKill occurs. So while it feels a bit quirky to see parent process' annotations on child process reports, I hesitate to file a bug for it...

It might be worth adding some form of attribute on the crash annotations specifying whether you want to have this behaviour at least, even if we decide to turn it on for XPCOMSpinEventLoopStack. I suppose the optimal situation would be where the source process of an annotation is included in the crash report, so in cases like this we can see both, but I imagine that'll be a lot of work for the crash reporting team.

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/28deb2f4a74a
Amend further the XPCOMSpinEventLoopStack annotation documentation r=xpcom-reviewers,nika
See Also: → 1742836

(In reply to Nika Layzell [:nika] (ni? for response) from comment #21)

It might be worth adding some form of attribute on the crash annotations specifying whether you want to have this behaviour at least, even if we decide to turn it on for XPCOMSpinEventLoopStack. I suppose the optimal situation would be where the source process of an annotation is included in the crash report, so in cases like this we can see both, but I imagine that'll be a lot of work for the crash reporting team.

I filed bug 1742836 for this. It is definitely too confusing to remain just as is, IMHO.

Status: REOPENED → RESOLVED
Closed: 3 years ago2 years ago
Resolution: --- → FIXED
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: