Hang in Nightly with stack sampling; happens when using a Mozilla internal Jenkins Ops tool

Status: NEW

P3
critical
3 years ago
a year ago

People

(Reporter: jrgm, Assigned: jrgm)

Tracking

48 Branch
x86_64
Mac OS X

Firefox Tracking Flags

(firefox48 affected)

Details

(Whiteboard: [necko-backlog])

Attachments

(7 attachments)

(Assignee)

Description

3 years ago
Created attachment 8732479 [details]
NightlyHang.txt

Hi Bill,

This is that hang that I experience when using a Jenkins internal ops tool. I'm attaching a process sample from Activity Monitor.

Some notes:
- E10S is not enabled in this profile. (Although, I used to have it on, and would see similar hangs).
- Nightly entered this hang by initially burning 250% CPU for a few minutes, and then dropped to ~0% CPU, with "Not Responding" showing in Activity Monitor.
- A Quit from Activity Monitor had no effect. I had to Force Quit.
(Assignee)

Comment 1

3 years ago
Created attachment 8732688 [details]
experienced this same type of hang again; here's the process stack sample
(Assignee)

Comment 2

3 years ago
Created attachment 8732694 [details]
I'm on a hot streak; here's another process stack sample in the hung state
(Assignee)

Comment 3

3 years ago
Created attachment 8733702 [details]
and another hang stack sample

Do you have enough information from these four stack samples, or shall I just keep submitting more?
(Assignee)

Comment 4

3 years ago
Created attachment 8734603 [details]
another one bites the dust
(Assignee)

Comment 5

3 years ago
Created attachment 8734611 [details]
And another one gone
(Assignee)

Comment 6

3 years ago
Created attachment 8734612 [details]
And another one gone

Comment 7

3 years ago
It looks like the call to PR_SetPollableEvent is expected to be non-blocking. But in this case we're writing so much data that we block waiting for the queue to empty. This may actually be an NSPR bug, but I'll needinfo Patrick since he probably has a better idea.
Assignee: wmccloskey → nobody
Component: General → Networking
Flags: needinfo?(mcmanus)

Comment 8

3 years ago
Your timing is pretty amazing. This is a dup of bug 698882, which has been open for years and was just merged to mozilla-central an hour ago.

retest when it hits a nightly build?

So yes, PR_SetPollableEvent uses a blocking queue, which is a serious bug, and it can cause a deadlock when tons of events are generated on the socket thread. That rarely happens, but apparently Jenkins makes it happen somehow? In any event, the deadlock should be fixed by bug 698882 whenever it sticks (it has uncovered several unrelated latent bugs and been backed out a few times).
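
For anyone following along, here is a rough Python sketch of the failure mode described above. It is not NSPR or Necko code. It assumes the pollable event is backed by something like a Unix pipe with a bounded buffer, and the names `count_writes_until_full` and `CoalescingWakeup` are made up for illustration. The first function shows that a pipe nobody drains eventually refuses further writes, which is the point where a blocking writer would hang. The class shows one possible way to avoid that by keeping at most one wake-up byte in flight; whether the patch in bug 698882 works this way is not stated in this bug.

```python
import os
import threading


def count_writes_until_full():
    """Write single bytes to an undrained, non-blocking pipe until it is full."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, b"x")
            except BlockingIOError:
                # With a blocking descriptor the writer would wait here
                # until some other thread drained the pipe.
                return written
    finally:
        os.close(read_fd)
        os.close(write_fd)


class CoalescingWakeup:
    """Hypothetical wake-up event that keeps at most one byte in the pipe."""

    def __init__(self):
        self._read_fd, self._write_fd = os.pipe()
        os.set_blocking(self._read_fd, False)
        os.set_blocking(self._write_fd, False)
        self._lock = threading.Lock()
        self._pending = False

    def fileno(self):
        # Lets the object be passed to select/poll on the read side.
        return self._read_fd

    def signal(self):
        """Request a wake-up; return True only if a byte was actually written."""
        with self._lock:
            if self._pending:
                return False  # a wake-up is already queued, nothing to add
            os.write(self._write_fd, b"\0")
            self._pending = True
            return True

    def clear(self):
        """Consume the queued wake-up, if any, so the next signal writes again."""
        with self._lock:
            if self._pending:
                os.read(self._read_fd, 1)
                self._pending = False

    def close(self):
        os.close(self._read_fd)
        os.close(self._write_fd)
```

In this sketch the polling thread would call `clear()` each time the read side becomes readable, so any number of `signal()` calls between two polls costs a single byte and the pipe can never fill up.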
Flags: needinfo?(mcmanus)
(Assignee)

Comment 9

3 years ago
Cool. I'll see if I get this hang again. (Given my recent rate of these hangs, if I don't see it in a week or so, it probably means it's been fixed).
Whiteboard: [necko-active]
Assignee: nobody → jrgm
Flags: needinfo?(jrgm)
(Assignee)

Comment 10

3 years ago
So, I believe I had the same hang in the past two weeks, but I didn't have time right then to capture a trace, as I had a more pressing problem to address. Sorry. If I trigger it again, and I have a bit of time to capture the stack, I will.
Flags: needinfo?(jrgm)
Whiteboard: [necko-active] → [necko-backlog]