1387222 - Intermittent MnH timed out after 1000 seconds after Assertion failure: 0 == rv, at nsprpub/pr/src/pthreads/ptthread.c:870

Reporter

Description

•

8 years ago

treeherder

Filed by: rvandermeulen [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=120794056&repo=mozilla-beta https://queue.taskcluster.net/v1/task/Kq2WBshmQZSeZuutj74FRw/runs/0/artifacts/public/logs/live_backing.log Quite possibly not the right component for this failure, but I don't know of any better one at this point either.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 1

•

8 years ago

I have also seen the same situation with bug 1386089. But here the output for this specific case: 12:39:49 INFO - Assertion failure: 0 == rv, at /home/worker/workspace/build/src/nsprpub/pr/src/pthreads/ptthread.c:870 12:39:49 INFO - Redirecting call to abort() to mozalloc_abort 12:39:49 INFO - Hit MOZ_CRASH() at /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33 12:39:49 INFO - Assertion failure: 0 == rv, at /home/worker/workspace/build/src/nsprpub/pr/src/pthreads/ptthread.c:292 12:39:49 INFO - Redirecting call to abort() to mozalloc_abort 12:39:49 INFO - Hit MOZ_CRASH() at /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33 12:39:49 INFO - [Parent 1621] WARNING: 'NS_FAILED(rr->RetargetDeliveryTo(sts))', file /home/worker/workspace/build/src/dom/fetch/FetchDriver.cpp, line 632 12:39:49 INFO - [Parent 1621] WARNING: 'NS_FAILED(rv)', file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 523 12:39:49 INFO - [Parent 1621] WARNING: Retargeting failed: file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 524 12:39:49 INFO - [Parent 1621] WARNING: 'NS_FAILED(rr->RetargetDeliveryTo(sts))', file /home/worker/workspace/build/src/dom/fetch/FetchDriver.cpp, line 632 12:39:49 INFO - [Parent 1621] WARNING: 'NS_FAILED(rv)', file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 523 12:39:49 INFO - [Parent 1621] WARNING: Retargeting failed: file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 524 12:39:49 INFO - 1501789189642 Marionette DEBUG Received observer notification "xpcom-shutdown" 12:39:49 INFO - 1501789189642 Marionette DEBUG Received observer notification "xpcom-shutdown" [..] 12:40:53 INFO - Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:160 12:40:53 INFO - #01: PR_GetThreadName[/Users/cltbld/tasks/task_1501788478/build/application/NightlyDebug.app/Contents/MacOS/libnss3.dylib +0x140049] 12:40:53 INFO - #02: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x405a] 12:40:53 INFO - #03: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x3fd7] We hit the mozalloc_abort three times which then causes a shutdown of Firefox. At least we see a "xpcom-shutdown" observer notification. But then there is another crash due to a too long shutdown. And this seems to come from NSS. I have no idea either which component this is best. David, could you help us?

Component: Marionette → Security

Flags: needinfo?(dkeeler)

Product: Testing → Firefox

Comment 2

•

8 years ago

This seems more like an NSPR issue, or perhaps a pthread issue. A few thoughts, though: The assertion failure at pthread.c:870 is due to pthread_setspecific failing. The most recent documentation I could find for pthread_setspecific says: The pthread_setspecific() function shall fail if: [ENOMEM] Insufficient memory exists to associate the non-NULL value with the key. ( http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_getspecific.html ) but since we're passing NULL there, either the documentation is wrong or the implementation is wrong. For the failure at pthread.c:292, we do have a non-NULL value, so maybe we really are OOM-ing. In which case, why aren't we actually exiting after MOZ_CRASH? PR_GetThreadName is just an accessor - there's no way we're hanging there.

Flags: needinfo?(dkeeler)

Comment hidden (Intermittent Failures Robot)

Henrik Skupin [:whimboo][⌚️UTC+2]

Updated

•

8 years ago

status-firefox57: --- → affected

Component: Security → NSPR

Product: Firefox → NSPR

Version: unspecified → other

Henrik Skupin [:whimboo][⌚️UTC+2]

Updated

•

8 years ago

Blocks: 1385605

Comment hidden (Intermittent Failures Robot)

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 5

•

8 years ago

Here some more information based on the latest report: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=144442348&lineNumber=5487 16:06:55 INFO - GECKO(835) | Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /builds/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:163 16:06:56 INFO - GECKO(835) | #01: _pt_root [nsprpub/pr/src/pthreads/ptthread.c:219] 16:06:56 INFO - 16:06:56 INFO - GECKO(835) | #02: libsystem_pthread.dylib + 0x405a 16:06:56 INFO - 16:06:56 INFO - GECKO(835) | #03: libsystem_pthread.dylib + 0x3fd7 16:06:56 INFO - 16:13:06 INFO - Buffered messages finished 16:13:06 ERROR - TEST-UNEXPECTED-TIMEOUT | browser/base/content/test/performance/browser_startup_images.js | application timed out after 370 seconds with no output 16:13:06 ERROR - Force-terminating active process(es). Something else which is questionable is why the harness times out here: 16:13:06 INFO - Determining child pids from psutil 16:13:06 INFO - Found child pids: [] 16:13:06 INFO - Killing process: 835 16:13:06 INFO - TEST-INFO | started process screencapture 16:13:06 INFO - TEST-INFO | screencapture: exit 0 16:13:06 INFO - psutil found pid 835 dead 16:29:46 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1510617616/build/venv/bin/python', '-u', '/Users/cltbld/tasks/task_1510617616/build/tests/mochitest/runtests.py', '--total-chunks', '7', '--this-chunk', '5', '--appname=/Users/cltbld/tasks/task_1510617616/build/application/NightlyDebug.app/Contents/MacOS/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=/Users/cltbld/tasks/task_1510617616/build/symbols', '--certificate-path=tests/certs', '--quiet', '--log-raw=/Users/cltbld/tasks/task_1510617616/build/blobber_upload_dir/browser-chrome-chunked_raw.log', '--log-errorsummary=/Users/cltbld/tasks/task_1510617616/build/blobber_upload_dir/browser-chrome-chunked_errorsummary.log', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--sandbox-read-whitelist=/Users/cltbld/tasks/task_1510617616/build', '--log-raw=-', '--flavor=browser', '--chunk-by-runtime'] 16:29:46 ERROR - timed out after 1000 seconds of no output 16:29:46 ERROR - Return code: -15 16:29:46 ERROR - No suite end message was emitted by this harness. Was killing process `835` not successful? Geoff, do you have an idea? Maybe we need some improvements in how the mochitest harness handles the process.

Henrik Skupin [:whimboo][⌚️UTC+2]

Updated

•

8 years ago

Flags: needinfo?(gbrown)

Geoff Brown [:gbrown]

Assignee

Comment 6

•

8 years ago

(In reply to Henrik Skupin (:whimboo) from comment #5) > 16:06:55 INFO - GECKO(835) | Hit MOZ_CRASH(Shutdown too long, probably > frozen, causing a crash.) at > /builds/worker/workspace/build/src/toolkit/components/terminator/ > nsTerminator.cpp:163 > 16:06:56 INFO - GECKO(835) | #01: _pt_root > [nsprpub/pr/src/pthreads/ptthread.c:219] > 16:06:56 INFO - > 16:06:56 INFO - GECKO(835) | #02: libsystem_pthread.dylib + 0x405a > 16:06:56 INFO - > 16:06:56 INFO - GECKO(835) | #03: libsystem_pthread.dylib + 0x3fd7 > 16:06:56 INFO - > 16:13:06 INFO - Buffered messages finished > 16:13:06 ERROR - TEST-UNEXPECTED-TIMEOUT | > browser/base/content/test/performance/browser_startup_images.js | > application timed out after 370 seconds with no output > 16:13:06 ERROR - Force-terminating active process(es). I think this is consistent with bug 1404190. > Something else which is questionable is why the harness times out here: > > 16:13:06 INFO - Determining child pids from psutil > 16:13:06 INFO - Found child pids: [] > 16:13:06 INFO - Killing process: 835 > 16:13:06 INFO - TEST-INFO | started process screencapture > 16:13:06 INFO - TEST-INFO | screencapture: exit 0 > 16:13:06 INFO - psutil found pid 835 dead > 16:29:46 INFO - Automation Error: mozprocess timed out after 1000 > seconds running > ['/Users/cltbld/tasks/task_1510617616/build/venv/bin/python', '-u', > '/Users/cltbld/tasks/task_1510617616/build/tests/mochitest/runtests.py', > '--total-chunks', '7', '--this-chunk', '5', > '--appname=/Users/cltbld/tasks/task_1510617616/build/application/ > NightlyDebug.app/Contents/MacOS/firefox', '--utility-path=tests/bin', > '--extra-profile-file=tests/bin/plugins', > '--symbols-path=/Users/cltbld/tasks/task_1510617616/build/symbols', > '--certificate-path=tests/certs', '--quiet', > '--log-raw=/Users/cltbld/tasks/task_1510617616/build/blobber_upload_dir/ > browser-chrome-chunked_raw.log', > '--log-errorsummary=/Users/cltbld/tasks/task_1510617616/build/ > blobber_upload_dir/browser-chrome-chunked_errorsummary.log', > '--screenshot-on-fail', '--cleanup-crashes', > '--marionette-startup-timeout=180', > '--sandbox-read-whitelist=/Users/cltbld/tasks/task_1510617616/build', > '--log-raw=-', '--flavor=browser', '--chunk-by-runtime'] > 16:29:46 ERROR - timed out after 1000 seconds of no output > 16:29:46 ERROR - Return code: -15 > 16:29:46 ERROR - No suite end message was emitted by this harness. > > Was killing process `835` not successful? Geoff, do you have an idea? Maybe > we need some improvements in how the mochitest harness handles the process. "psutil found pid 835 dead" comes from https://dxr.mozilla.org/mozilla-central/rev/b056526be38e96b3e381b7e90cd8254ad1d96d9d/testing/mochitest/runtests.py#2822 ; I think that is good evidence that the process was killed successfully. When the process is killed, I would expect runtests.py to complete the proc.wait() at https://dxr.mozilla.org/mozilla-central/rev/b056526be38e96b3e381b7e90cd8254ad1d96d9d/testing/mochitest/runtests.py#2237 ... but I see no evidence for that. I'll file a separate bug for that...see if we can at least get better logging.

Flags: needinfo?(gbrown)

Updated

•

8 years ago

Depends on: 1419121

Geoff Brown [:gbrown]

Assignee

Comment 7

•

8 years ago

No new failure reports for 2+ months.

Assignee: nobody → gbrown

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WORKSFORME

Bugzilla

Intermittent MnH timed out after 1000 seconds after Assertion failure: 0 == rv, at nsprpub/pr/src/pthreads/ptthread.c:870

Categories

(NSPR :: NSPR, defect, P5)

Tracking

(firefox57 affected)

People

(Reporter: intermittent-bug-filer, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure)

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Updated

Updated

Comment 4

Comment 5

Updated

Comment 6

Updated

Comment 7