Closed
Bug 1387222
Opened 8 years ago
Closed 8 years ago
Intermittent MnH timed out after 1000 seconds after Assertion failure: 0 == rv, at nsprpub/pr/src/pthreads/ptthread.c:870
Categories
(NSPR :: NSPR, defect, P5)
NSPR
NSPR
Tracking
(firefox57 affected)
RESOLVED
WORKSFORME
| Tracking | Status | |
|---|---|---|
| firefox57 | --- | affected |
People
(Reporter: intermittent-bug-filer, Assigned: gbrown)
References
Details
(Keywords: intermittent-failure)
Filed by: rvandermeulen [at] mozilla.com
https://treeherder.mozilla.org/logviewer.html#?job_id=120794056&repo=mozilla-beta
https://queue.taskcluster.net/v1/task/Kq2WBshmQZSeZuutj74FRw/runs/0/artifacts/public/logs/live_backing.log
Quite possibly not the right component for this failure, but I don't know of any better one at this point either.
Comment 1•8 years ago
|
||
I have also seen the same situation with bug 1386089. But here the output for this specific case:
12:39:49 INFO - Assertion failure: 0 == rv, at /home/worker/workspace/build/src/nsprpub/pr/src/pthreads/ptthread.c:870
12:39:49 INFO - Redirecting call to abort() to mozalloc_abort
12:39:49 INFO - Hit MOZ_CRASH() at /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33
12:39:49 INFO - Assertion failure: 0 == rv, at /home/worker/workspace/build/src/nsprpub/pr/src/pthreads/ptthread.c:292
12:39:49 INFO - Redirecting call to abort() to mozalloc_abort
12:39:49 INFO - Hit MOZ_CRASH() at /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33
12:39:49 INFO - [Parent 1621] WARNING: 'NS_FAILED(rr->RetargetDeliveryTo(sts))', file /home/worker/workspace/build/src/dom/fetch/FetchDriver.cpp, line 632
12:39:49 INFO - [Parent 1621] WARNING: 'NS_FAILED(rv)', file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 523
12:39:49 INFO - [Parent 1621] WARNING: Retargeting failed: file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 524
12:39:49 INFO - [Parent 1621] WARNING: 'NS_FAILED(rr->RetargetDeliveryTo(sts))', file /home/worker/workspace/build/src/dom/fetch/FetchDriver.cpp, line 632
12:39:49 INFO - [Parent 1621] WARNING: 'NS_FAILED(rv)', file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 523
12:39:49 INFO - [Parent 1621] WARNING: Retargeting failed: file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 524
12:39:49 INFO - 1501789189642 Marionette DEBUG Received observer notification "xpcom-shutdown"
12:39:49 INFO - 1501789189642 Marionette DEBUG Received observer notification "xpcom-shutdown"
[..]
12:40:53 INFO - Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:160
12:40:53 INFO - #01: PR_GetThreadName[/Users/cltbld/tasks/task_1501788478/build/application/NightlyDebug.app/Contents/MacOS/libnss3.dylib +0x140049]
12:40:53 INFO - #02: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x405a]
12:40:53 INFO - #03: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x3fd7]
We hit the mozalloc_abort three times which then causes a shutdown of Firefox. At least we see a "xpcom-shutdown" observer notification. But then there is another crash due to a too long shutdown. And this seems to come from NSS.
I have no idea either which component this is best. David, could you help us?
Component: Marionette → Security
Flags: needinfo?(dkeeler)
Product: Testing → Firefox
See Also: → 1386089
Version: Version 3 → unspecified
Comment 2•8 years ago
|
||
This seems more like an NSPR issue, or perhaps a pthread issue. A few thoughts, though:
The assertion failure at pthread.c:870 is due to pthread_setspecific failing. The most recent documentation I could find for pthread_setspecific says:
The pthread_setspecific() function shall fail if:
[ENOMEM]
Insufficient memory exists to associate the non-NULL value with the key.
( http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_getspecific.html )
but since we're passing NULL there, either the documentation is wrong or the implementation is wrong.
For the failure at pthread.c:292, we do have a non-NULL value, so maybe we really are OOM-ing. In which case, why aren't we actually exiting after MOZ_CRASH?
PR_GetThreadName is just an accessor - there's no way we're hanging there.
Flags: needinfo?(dkeeler)
| Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
status-firefox57:
--- → affected
Component: Security → NSPR
Product: Firefox → NSPR
Version: unspecified → other
| Comment hidden (Intermittent Failures Robot) |
Comment 5•8 years ago
|
||
Here some more information based on the latest report:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=144442348&lineNumber=5487
16:06:55 INFO - GECKO(835) | Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /builds/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:163
16:06:56 INFO - GECKO(835) | #01: _pt_root [nsprpub/pr/src/pthreads/ptthread.c:219]
16:06:56 INFO -
16:06:56 INFO - GECKO(835) | #02: libsystem_pthread.dylib + 0x405a
16:06:56 INFO -
16:06:56 INFO - GECKO(835) | #03: libsystem_pthread.dylib + 0x3fd7
16:06:56 INFO -
16:13:06 INFO - Buffered messages finished
16:13:06 ERROR - TEST-UNEXPECTED-TIMEOUT | browser/base/content/test/performance/browser_startup_images.js | application timed out after 370 seconds with no output
16:13:06 ERROR - Force-terminating active process(es).
Something else which is questionable is why the harness times out here:
16:13:06 INFO - Determining child pids from psutil
16:13:06 INFO - Found child pids: []
16:13:06 INFO - Killing process: 835
16:13:06 INFO - TEST-INFO | started process screencapture
16:13:06 INFO - TEST-INFO | screencapture: exit 0
16:13:06 INFO - psutil found pid 835 dead
16:29:46 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1510617616/build/venv/bin/python', '-u', '/Users/cltbld/tasks/task_1510617616/build/tests/mochitest/runtests.py', '--total-chunks', '7', '--this-chunk', '5', '--appname=/Users/cltbld/tasks/task_1510617616/build/application/NightlyDebug.app/Contents/MacOS/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=/Users/cltbld/tasks/task_1510617616/build/symbols', '--certificate-path=tests/certs', '--quiet', '--log-raw=/Users/cltbld/tasks/task_1510617616/build/blobber_upload_dir/browser-chrome-chunked_raw.log', '--log-errorsummary=/Users/cltbld/tasks/task_1510617616/build/blobber_upload_dir/browser-chrome-chunked_errorsummary.log', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--sandbox-read-whitelist=/Users/cltbld/tasks/task_1510617616/build', '--log-raw=-', '--flavor=browser', '--chunk-by-runtime']
16:29:46 ERROR - timed out after 1000 seconds of no output
16:29:46 ERROR - Return code: -15
16:29:46 ERROR - No suite end message was emitted by this harness.
Was killing process `835` not successful? Geoff, do you have an idea? Maybe we need some improvements in how the mochitest harness handles the process.
Updated•8 years ago
|
Flags: needinfo?(gbrown)
| Assignee | ||
Comment 6•8 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #5)
> 16:06:55 INFO - GECKO(835) | Hit MOZ_CRASH(Shutdown too long, probably
> frozen, causing a crash.) at
> /builds/worker/workspace/build/src/toolkit/components/terminator/
> nsTerminator.cpp:163
> 16:06:56 INFO - GECKO(835) | #01: _pt_root
> [nsprpub/pr/src/pthreads/ptthread.c:219]
> 16:06:56 INFO -
> 16:06:56 INFO - GECKO(835) | #02: libsystem_pthread.dylib + 0x405a
> 16:06:56 INFO -
> 16:06:56 INFO - GECKO(835) | #03: libsystem_pthread.dylib + 0x3fd7
> 16:06:56 INFO -
> 16:13:06 INFO - Buffered messages finished
> 16:13:06 ERROR - TEST-UNEXPECTED-TIMEOUT |
> browser/base/content/test/performance/browser_startup_images.js |
> application timed out after 370 seconds with no output
> 16:13:06 ERROR - Force-terminating active process(es).
I think this is consistent with bug 1404190.
> Something else which is questionable is why the harness times out here:
>
> 16:13:06 INFO - Determining child pids from psutil
> 16:13:06 INFO - Found child pids: []
> 16:13:06 INFO - Killing process: 835
> 16:13:06 INFO - TEST-INFO | started process screencapture
> 16:13:06 INFO - TEST-INFO | screencapture: exit 0
> 16:13:06 INFO - psutil found pid 835 dead
> 16:29:46 INFO - Automation Error: mozprocess timed out after 1000
> seconds running
> ['/Users/cltbld/tasks/task_1510617616/build/venv/bin/python', '-u',
> '/Users/cltbld/tasks/task_1510617616/build/tests/mochitest/runtests.py',
> '--total-chunks', '7', '--this-chunk', '5',
> '--appname=/Users/cltbld/tasks/task_1510617616/build/application/
> NightlyDebug.app/Contents/MacOS/firefox', '--utility-path=tests/bin',
> '--extra-profile-file=tests/bin/plugins',
> '--symbols-path=/Users/cltbld/tasks/task_1510617616/build/symbols',
> '--certificate-path=tests/certs', '--quiet',
> '--log-raw=/Users/cltbld/tasks/task_1510617616/build/blobber_upload_dir/
> browser-chrome-chunked_raw.log',
> '--log-errorsummary=/Users/cltbld/tasks/task_1510617616/build/
> blobber_upload_dir/browser-chrome-chunked_errorsummary.log',
> '--screenshot-on-fail', '--cleanup-crashes',
> '--marionette-startup-timeout=180',
> '--sandbox-read-whitelist=/Users/cltbld/tasks/task_1510617616/build',
> '--log-raw=-', '--flavor=browser', '--chunk-by-runtime']
> 16:29:46 ERROR - timed out after 1000 seconds of no output
> 16:29:46 ERROR - Return code: -15
> 16:29:46 ERROR - No suite end message was emitted by this harness.
>
> Was killing process `835` not successful? Geoff, do you have an idea? Maybe
> we need some improvements in how the mochitest harness handles the process.
"psutil found pid 835 dead" comes from https://dxr.mozilla.org/mozilla-central/rev/b056526be38e96b3e381b7e90cd8254ad1d96d9d/testing/mochitest/runtests.py#2822 ; I think that is good evidence that the process was killed successfully.
When the process is killed, I would expect runtests.py to complete the proc.wait() at https://dxr.mozilla.org/mozilla-central/rev/b056526be38e96b3e381b7e90cd8254ad1d96d9d/testing/mochitest/runtests.py#2237 ... but I see no evidence for that. I'll file a separate bug for that...see if we can at least get better logging.
Flags: needinfo?(gbrown)
See Also: → 1404190
| Assignee | ||
Comment 7•8 years ago
|
||
No new failure reports for 2+ months.
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
You need to log in
before you can comment on or make changes to this bug.
Description
•