Closed Bug 1208957 Opened 9 years ago Closed 8 years ago

Intermittent Assertion failure: 0 == rv, nsprpub/pr/src/pthreads/ptthread.c:288, PROCESS-CRASH | Main app process exited normally | application crashed [@ mozalloc_abort(char const*)]

Categories

(Core :: Security: PSM, defect, P3)

Tracking

RESOLVED FIXED
mozilla55
Iteration:
55.1 - Mar 20
Tracking Status
firefox52 --- wontfix
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: nigelb, Assigned: mrbkap)

References

Details

(Keywords: intermittent-failure, Whiteboard: [psm-intermittent] [e10s-multi:+][stockwell fixed:product])

Attachments

(2 files)

No description provided.
Component: General → Security: PSM
Product: Firefox → Core
Blocks: 1211080
Blocks: 1211082
Blocks: 1219986
Blocks: 1242305
Blocks: 1202325
Blocks: 1202044
Mass whiteboard change to annotate PSM intermittent test failures as [psm-intermittent]. Filter on 31b932bd-1aad-4e29-9f4b-4cd864a3ffdc if that's important to you.
Whiteboard: [psm-intermittent]
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
I'm investigating this for e10s-multi (4 processes) as this appears to happen much more frequently with 4 processes.
Assignee: nobody → mrbkap
Whiteboard: [psm-intermittent] → [psm-intermittent] [e10s-multi:?]
Summary: Intermittent Assertion failure: 0 == rv, nsprpub/pr/src/pthreads/ptthread.c:288 → Intermittent Assertion failure: 0 == rv, nsprpub/pr/src/pthreads/ptthread.c:288, PROCESS-CRASH | Main app process exited normally | application crashed [@ mozalloc_abort(char const*)]
(In reply to Blake Kaplan (:mrbkap) from comment #21)
> I'm investigating this for e10s-multi (4 processes) as this appears to
> happen much more frequently with 4 processes.
Are you sure that this is the same crash as we see on ash? Then we should dupe Bug 1340512 over this one, but to me the two crashes look a bit different (I might be missing something though).
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #22)
> Are you sure that this is the same crash as we see on ash? Then we should
> dupe Bug 1340512 over this one, but to me the two crashes look a bit
> different (I might be missing something though).
It looks like there are at least two crashes. I've seen this one a few times as well.
Iteration: --- → 54.3 - Mar 6
Whiteboard: [psm-intermittent] [e10s-multi:?] → [psm-intermittent] [e10s-multi:+]
Iteration: 54.3 - Mar 6 → 55.1 - Mar 20
This appears to be due to a thread shutdown happening far too late in application shutdown -- in all of the instances of this that I've seen, the main thread is late in shutdown (often in ~nsStringStats, other times running atexit-registered functions). My current strategy is to see whether there is a specific thread that we are leaking too late on OSX and, if so, to fix it. https://treeherder.mozilla.org/#/jobs?repo=try&revision=7299921eb98150d00b78fdfb2107a790456c97d8
Glad to see work already in progress here. Do let me know if I can help with try runs, bisecting data, or looking for patterns.
Whiteboard: [psm-intermittent] [e10s-multi:+] → [psm-intermittent] [e10s-multi:+][stockwell needswork]
:mrbkap, it has been 6 days since your try push. Do you have any updates? Luckily this hasn't increased in frequency, but we still classify it as high frequency and would like to get it fixed soon.
Flags: needinfo?(mrbkap)
I've been working on this pretty much full time. I've been pushing to try and debugging locally. If my current try run [1] doesn't shed more light, I'll probably try to get my hands on a loaner try machine to debug there. [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=01a73b27d7a119d09da4de3fc2347a52391248bf
Flags: needinfo?(mrbkap)
What's happening here is that we have a thread and an associated nsThread (but the thread wasn't started via nsThread::Init!) that is lasting through shutdown. At shutdown, the OS is killing all threads, forcing us to clear out the thread private data, and we're apparently not joining on the thread before unloading NSPR. Because of this, releasing the nsThread (and its related data) ends up causing us to try to re-initialize NSPR thread data, which eventually fails, leading to a fatal assertion. The trick is to figure out which thread it is that we're leaking so late into shutdown and to make sure that we wait for it properly so it has a chance to shut down before the main thread. I hope.
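To illustrate the fix proposed above (wait for the leaked thread so that it has exited before the main thread finishes shutting down), here is a minimal C++ sketch. It is not the Gecko patch: `Watchdog`, `Shutdown`, and `HasExited` are hypothetical names, and `std::thread` stands in for an NSPR thread.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical watchdog, for illustration only (not Gecko's class).
class Watchdog {
 public:
  Watchdog() : thread_([this] { Run(); }) {}

  // Ask the thread to stop, then block until it has fully exited.
  // Safe to call more than once.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      shuttingDown_ = true;
    }
    wakeup_.notify_one();
    if (thread_.joinable()) {
      thread_.join();  // returns only after the thread has terminated
    }
  }

  ~Watchdog() {
    Shutdown();
    assert(!thread_.joinable());
  }

  bool HasExited() const { return exited_.load(); }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!shuttingDown_) {
      // A real watchdog would do its periodic check after each wake-up.
      wakeup_.wait_for(lock, std::chrono::milliseconds(50));
    }
    exited_ = true;
  }

  std::mutex mutex_;
  std::condition_variable wakeup_;
  bool shuttingDown_ = false;
  std::atomic<bool> exited_{false};
  // Declared last so every other member is constructed before Run() starts.
  std::thread thread_;
};
```

Calling `Shutdown()` from the main thread before any global teardown means the watchdog cannot still be alive while the main thread runs destructors or atexit handlers.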
Attachment #8847676 - Flags: review?(wmccloskey)
Attachment #8847677 - Flags: review?(wmccloskey)
Comment on attachment 8847676 [details] Bug 1208957 - Join the watchdog thread to avoid shutdown races. https://reviewboard.mozilla.org/r/120592/#review122678
Attachment #8847676 - Flags: review?(wmccloskey) → review+
Attachment #8847677 - Flags: review?(wmccloskey) → review+
Pushed by mrbkap@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/10a3d094cfc1
Join the watchdog thread to avoid shutdown races. r=billm
https://hg.mozilla.org/integration/autoland/rev/9ba55f98e3bf
No need for a condvar for thread shutdown. r=billm
Whiteboard: [psm-intermittent] [e10s-multi:+][stockwell needswork] → [psm-intermittent] [e10s-multi:+][stockwell fixed]
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
*bows before mrbkap* Can we please get this across all active branches? OSX debug xpcshell hits this 20-30% of the time, so it would be *fantastic* to see it uplifted around.
Comment on attachment 8847676 [details]
Bug 1208957 - Join the watchdog thread to avoid shutdown races.
Approval Request Comment
[Feature/Bug causing the regression]: n/a (I suspect that bug 1323100 might have "caused" this by registering the watchdog thread with the profiler and therefore forcing creation of an nsThread for it but I haven't tested to be sure).
[User impact if declined]: None! This bug should only show up in debug builds (and probably mostly on Treeherder).
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]:
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: n/a
[Is the change risky?]: Despite dealing with threads, this change should not be too risky -- it's moving from a manual condvar to one in the system in order to wait for a thread to clean up after itself.
[String changes made/needed]: n/a
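A small C++ sketch of why waiting via the system (a join) is a stronger guarantee than a hand-rolled "I'm done" signal: a thread can signal completion and still be running its thread-local cleanup afterwards, whereas a join returns only once the thread has completely finished. Everything here is an assumption for illustration (`PerThreadData`, `Worker`, `RunAndJoin` are hypothetical names, and `std::thread` stands in for NSPR threads); it is not the landed patch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Counts how many per-thread objects have been destroyed.
std::atomic<int> gTlsDestroyed{0};

// Stand-in for per-thread bookkeeping that is torn down when a thread exits.
struct PerThreadData {
  ~PerThreadData() { gTlsDestroyed.fetch_add(1); }
  void Touch() {}
};

void Worker() {
  // Constructed on first use in this thread, destroyed during thread exit,
  // which happens after Worker() has already returned.
  thread_local PerThreadData data;
  data.Touch();
}

// Starts a worker, joins it, and reports how many thread-local destructors
// have run by the time join() returns.
int RunAndJoin() {
  gTlsDestroyed = 0;
  std::thread t(Worker);
  t.join();  // thread exit, including thread-local cleanup, is complete here
  assert(!t.joinable());
  return gTlsDestroyed.load();
}
```

A flag set as the last statement of `Worker()` would become visible before `PerThreadData` is destroyed, so a main thread waiting only on that flag could begin tearing down shared state while the worker is still cleaning up.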
Attachment #8847676 - Flags: approval-mozilla-beta?
Attachment #8847676 - Flags: approval-mozilla-aurora?
Hi :mrbkap, According to comment #41, is that OK?
Flags: needinfo?(mrbkap)
Doesn't seem to apply to aurora:
grafting 386317:9ba55f98e3bf "Bug 1208957 - No need for a condvar for thread shutdown. r=billm"
merging js/xpconnect/src/XPCJSContext.cpp
warning: conflicts while merging js/xpconnect/src/XPCJSContext.cpp! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
(use 'hg resolve' and 'hg graft --continue')
(In reply to Gerry Chang [:gchang] from comment #42)
> Hi :mrbkap,
> According to comment #41, is that OK?
Yes, the data shows that the last failure due to this bug was on the 16th (except for a single failure on mozilla-beta on the 17th). There will be some number of failures coming from that branch until this fix eventually merges there. (I'm leaving the ni on me to fix the merge to Aurora.)
Sylvestre, it appears that these patches apply cleanly to Aurora and Beta. I wonder, though, if maybe you didn't apply them in the right order. They need to be applied in the same order as they appear in comment 38 (that is: "Join the watchdog thread..." followed by "No need for a condvar...").
Flags: needinfo?(mrbkap) → needinfo?(sledru)
Attachment #8847677 - Flags: approval-mozilla-beta?
Attachment #8847677 - Flags: approval-mozilla-aurora?
Comment on attachment 8847676 [details] Bug 1208957 - Join the watchdog thread to avoid shutdown races. Fix an intermittent failure. Aurora54+ & Beta53+.
Attachment #8847676 - Flags: approval-mozilla-beta?
Attachment #8847676 - Flags: approval-mozilla-beta+
Attachment #8847676 - Flags: approval-mozilla-aurora?
Attachment #8847676 - Flags: approval-mozilla-aurora+
Attachment #8847677 - Flags: approval-mozilla-beta?
Attachment #8847677 - Flags: approval-mozilla-beta+
Attachment #8847677 - Flags: approval-mozilla-aurora?
Attachment #8847677 - Flags: approval-mozilla-aurora+
OK, we tried with a bot. We should manage the order correctly, sorry.
Flags: needinfo?(sledru)
Comment on attachment 8847676 [details] Bug 1208957 - Join the watchdog thread to avoid shutdown races. This is an extremely frequent issue on ESR52 with OSX debug xpcshell, so it would be wonderful to get it backported there as well.
Attachment #8847676 - Flags: approval-mozilla-esr52?
Attachment #8847677 - Flags: approval-mozilla-esr52?
Comment on attachment 8847676 [details] Bug 1208957 - Join the watchdog thread to avoid shutdown races. fix a race on shutdown, esr52+
Attachment #8847676 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
Attachment #8847677 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
Whiteboard: [psm-intermittent] [e10s-multi:+][stockwell fixed] → [psm-intermittent] [e10s-multi:+][stockwell fixed:product]