Closed Bug 1387222 Opened 7 years ago Closed 6 years ago

Intermittent MnH timed out after 1000 seconds after Assertion failure: 0 == rv, at nsprpub/pr/src/pthreads/ptthread.c:870

Categories

(NSPR :: NSPR, defect, P5)

defect

Tracking

(firefox57 affected)

RESOLVED WORKSFORME
Tracking Status
firefox57 --- affected

People

(Reporter: intermittent-bug-filer, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure)

Filed by: rvandermeulen [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=120794056&repo=mozilla-beta

https://queue.taskcluster.net/v1/task/Kq2WBshmQZSeZuutj74FRw/runs/0/artifacts/public/logs/live_backing.log

Quite possibly not the right component for this failure, but I don't know of any better one at this point either.
I have also seen the same situation with bug 1386089. But here the output for this specific case:

12:39:49     INFO -  Assertion failure: 0 == rv, at /home/worker/workspace/build/src/nsprpub/pr/src/pthreads/ptthread.c:870
12:39:49     INFO -  Redirecting call to abort() to mozalloc_abort
12:39:49     INFO -  Hit MOZ_CRASH() at /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33
12:39:49     INFO -  Assertion failure: 0 == rv, at /home/worker/workspace/build/src/nsprpub/pr/src/pthreads/ptthread.c:292
12:39:49     INFO -  Redirecting call to abort() to mozalloc_abort
12:39:49     INFO -  Hit MOZ_CRASH() at /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33
12:39:49     INFO -  [Parent 1621] WARNING: 'NS_FAILED(rr->RetargetDeliveryTo(sts))', file /home/worker/workspace/build/src/dom/fetch/FetchDriver.cpp, line 632
12:39:49     INFO -  [Parent 1621] WARNING: 'NS_FAILED(rv)', file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 523
12:39:49     INFO -  [Parent 1621] WARNING: Retargeting failed: file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 524
12:39:49     INFO -  [Parent 1621] WARNING: 'NS_FAILED(rr->RetargetDeliveryTo(sts))', file /home/worker/workspace/build/src/dom/fetch/FetchDriver.cpp, line 632
12:39:49     INFO -  [Parent 1621] WARNING: 'NS_FAILED(rv)', file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 523
12:39:49     INFO -  [Parent 1621] WARNING: Retargeting failed: file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 524
12:39:49     INFO -  1501789189642	Marionette	DEBUG	Received observer notification "xpcom-shutdown"
12:39:49     INFO -  1501789189642	Marionette	DEBUG	Received observer notification "xpcom-shutdown"
[..]
12:40:53     INFO -  Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:160
12:40:53     INFO -  #01: PR_GetThreadName[/Users/cltbld/tasks/task_1501788478/build/application/NightlyDebug.app/Contents/MacOS/libnss3.dylib +0x140049]
12:40:53     INFO -  #02: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x405a]
12:40:53     INFO -  #03: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x3fd7]

We hit the mozalloc_abort three times which then causes a shutdown of Firefox. At least we see a "xpcom-shutdown" observer notification. But then there is another crash due to a too long shutdown. And this seems to come from NSS. 

I have no idea either which component this is best. David, could you help us?
Component: Marionette → Security
Flags: needinfo?(dkeeler)
Product: Testing → Firefox
See Also: → 1386089
Version: Version 3 → unspecified
This seems more like an NSPR issue, or perhaps a pthread issue. A few thoughts, though:

The assertion failure at pthread.c:870 is due to pthread_setspecific failing. The most recent documentation I could find for pthread_setspecific says:

The pthread_setspecific() function shall fail if:

[ENOMEM]
    Insufficient memory exists to associate the non-NULL value with the key. 

( http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_getspecific.html )

but since we're passing NULL there, either the documentation is wrong or the implementation is wrong.

For the failure at pthread.c:292, we do have a non-NULL value, so maybe we really are OOM-ing. In which case, why aren't we actually exiting after MOZ_CRASH?

PR_GetThreadName is just an accessor - there's no way we're hanging there.
Flags: needinfo?(dkeeler)
Component: Security → NSPR
Product: Firefox → NSPR
Version: unspecified → other
Here some more information based on the latest report:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=144442348&lineNumber=5487

16:06:55     INFO - GECKO(835) | Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /builds/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:163
16:06:56     INFO - GECKO(835) | #01: _pt_root [nsprpub/pr/src/pthreads/ptthread.c:219]
16:06:56     INFO - 
16:06:56     INFO - GECKO(835) | #02: libsystem_pthread.dylib + 0x405a
16:06:56     INFO - 
16:06:56     INFO - GECKO(835) | #03: libsystem_pthread.dylib + 0x3fd7
16:06:56     INFO - 
16:13:06     INFO - Buffered messages finished
16:13:06    ERROR - TEST-UNEXPECTED-TIMEOUT | browser/base/content/test/performance/browser_startup_images.js | application timed out after 370 seconds with no output
16:13:06    ERROR - Force-terminating active process(es).

Something else which is questionable is why the harness times out here:

16:13:06     INFO - Determining child pids from psutil
16:13:06     INFO - Found child pids: []
16:13:06     INFO - Killing process: 835
16:13:06     INFO - TEST-INFO | started process screencapture
16:13:06     INFO - TEST-INFO | screencapture: exit 0
16:13:06     INFO - psutil found pid 835 dead
16:29:46     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1510617616/build/venv/bin/python', '-u', '/Users/cltbld/tasks/task_1510617616/build/tests/mochitest/runtests.py', '--total-chunks', '7', '--this-chunk', '5', '--appname=/Users/cltbld/tasks/task_1510617616/build/application/NightlyDebug.app/Contents/MacOS/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=/Users/cltbld/tasks/task_1510617616/build/symbols', '--certificate-path=tests/certs', '--quiet', '--log-raw=/Users/cltbld/tasks/task_1510617616/build/blobber_upload_dir/browser-chrome-chunked_raw.log', '--log-errorsummary=/Users/cltbld/tasks/task_1510617616/build/blobber_upload_dir/browser-chrome-chunked_errorsummary.log', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--sandbox-read-whitelist=/Users/cltbld/tasks/task_1510617616/build', '--log-raw=-', '--flavor=browser', '--chunk-by-runtime']
16:29:46    ERROR - timed out after 1000 seconds of no output
16:29:46    ERROR - Return code: -15
16:29:46    ERROR - No suite end message was emitted by this harness.

Was killing process `835` not successful? Geoff, do you have an idea? Maybe we need some improvements in how the mochitest harness handles the process.
Flags: needinfo?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #5)
> 16:06:55     INFO - GECKO(835) | Hit MOZ_CRASH(Shutdown too long, probably
> frozen, causing a crash.) at
> /builds/worker/workspace/build/src/toolkit/components/terminator/
> nsTerminator.cpp:163
> 16:06:56     INFO - GECKO(835) | #01: _pt_root
> [nsprpub/pr/src/pthreads/ptthread.c:219]
> 16:06:56     INFO - 
> 16:06:56     INFO - GECKO(835) | #02: libsystem_pthread.dylib + 0x405a
> 16:06:56     INFO - 
> 16:06:56     INFO - GECKO(835) | #03: libsystem_pthread.dylib + 0x3fd7
> 16:06:56     INFO - 
> 16:13:06     INFO - Buffered messages finished
> 16:13:06    ERROR - TEST-UNEXPECTED-TIMEOUT |
> browser/base/content/test/performance/browser_startup_images.js |
> application timed out after 370 seconds with no output
> 16:13:06    ERROR - Force-terminating active process(es).

I think this is consistent with bug 1404190.

> Something else which is questionable is why the harness times out here:
> 
> 16:13:06     INFO - Determining child pids from psutil
> 16:13:06     INFO - Found child pids: []
> 16:13:06     INFO - Killing process: 835
> 16:13:06     INFO - TEST-INFO | started process screencapture
> 16:13:06     INFO - TEST-INFO | screencapture: exit 0
> 16:13:06     INFO - psutil found pid 835 dead
> 16:29:46     INFO - Automation Error: mozprocess timed out after 1000
> seconds running
> ['/Users/cltbld/tasks/task_1510617616/build/venv/bin/python', '-u',
> '/Users/cltbld/tasks/task_1510617616/build/tests/mochitest/runtests.py',
> '--total-chunks', '7', '--this-chunk', '5',
> '--appname=/Users/cltbld/tasks/task_1510617616/build/application/
> NightlyDebug.app/Contents/MacOS/firefox', '--utility-path=tests/bin',
> '--extra-profile-file=tests/bin/plugins',
> '--symbols-path=/Users/cltbld/tasks/task_1510617616/build/symbols',
> '--certificate-path=tests/certs', '--quiet',
> '--log-raw=/Users/cltbld/tasks/task_1510617616/build/blobber_upload_dir/
> browser-chrome-chunked_raw.log',
> '--log-errorsummary=/Users/cltbld/tasks/task_1510617616/build/
> blobber_upload_dir/browser-chrome-chunked_errorsummary.log',
> '--screenshot-on-fail', '--cleanup-crashes',
> '--marionette-startup-timeout=180',
> '--sandbox-read-whitelist=/Users/cltbld/tasks/task_1510617616/build',
> '--log-raw=-', '--flavor=browser', '--chunk-by-runtime']
> 16:29:46    ERROR - timed out after 1000 seconds of no output
> 16:29:46    ERROR - Return code: -15
> 16:29:46    ERROR - No suite end message was emitted by this harness.
> 
> Was killing process `835` not successful? Geoff, do you have an idea? Maybe
> we need some improvements in how the mochitest harness handles the process.

"psutil found pid 835 dead" comes from https://dxr.mozilla.org/mozilla-central/rev/b056526be38e96b3e381b7e90cd8254ad1d96d9d/testing/mochitest/runtests.py#2822 ; I think that is good evidence that the process was killed successfully.

When the process is killed, I would expect runtests.py to complete the proc.wait() at https://dxr.mozilla.org/mozilla-central/rev/b056526be38e96b3e381b7e90cd8254ad1d96d9d/testing/mochitest/runtests.py#2237 ... but I see no evidence for that. I'll file a separate bug for that...see if we can at least get better logging.
Flags: needinfo?(gbrown)
See Also: → 1404190
Depends on: 1419121
No new failure reports for 2+ months.
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.