Closed Bug 1486101 Opened 7 years ago Closed 6 years ago

Intermittent Assertion failure: m_refCount == 1, at /builds/worker/workspace/build/src/js/src/jit/ExecutableAllocator.cpp:51

Categories

(Core :: Networking: HTTP, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla67
Tracking Status
firefox-esr60 --- unaffected
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: valentin)

References

Details

(Keywords: intermittent-failure, regression, Whiteboard: [necko-triaged][stockwell disable-recommended])

Attachments

(2 files)

Filed by: apavel [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=195774502&repo=autoland https://queue.taskcluster.net/v1/task/K6fmvHDiQFCFNgqhq58RLw/runs/0/artifacts/public/logs/live_backing.log https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/K6fmvHDiQFCFNgqhq58RLw/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1&only_show_unexpected=1 [task 2018-08-24T17:05:43.748Z] 17:05:43 INFO - ERROR: GC found live SharedScriptData 0x7f7c65a19e40 with ref count 2 at shutdown [task 2018-08-24T17:05:43.749Z] 17:05:43 INFO - ERROR: GC found live SharedScriptData 0x7f7c65abd000 with ref count 2 at shutdown [task 2018-08-24T17:05:43.750Z] 17:05:43 INFO - ERROR: GC found live SharedScriptData 0x7f7c65e83100 with ref count 2 at shutdown [task 2018-08-24T17:05:43.752Z] 17:05:43 INFO - ERROR: GC found live SharedScriptData 0x7f7c65ae1e20 with ref count 2 at shutdown [task 2018-08-24T17:05:43.753Z] 17:05:43 INFO - ERROR: GC found live SharedScriptData 0x7f7c65bb4dc0 with ref count 2 at shutdown [task 2018-08-24T17:05:43.754Z] 17:05:43 INFO - Assertion failure: m_refCount == 1, at /builds/worker/workspace/build/src/js/src/jit/ExecutableAllocator.cpp:51 [task 2018-08-24T17:05:43.755Z] 17:05:43 INFO - [Parent 1357, IPDL Background] WARNING: 'QuotaClient::IsShuttingDownOnBackgroundThread()', file /builds/worker/workspace/build/src/dom/indexedDB/ActorsParent.cpp, line 21965 [task 2018-08-24T17:05:43.756Z] 17:05:43 INFO - [Parent 1357, IPDL Background] WARNING: 'NS_FAILED(rv)', file /builds/worker/workspace/build/src/dom/indexedDB/ActorsParent.cpp, line 21200 [task 2018-08-24T17:05:43.757Z] 17:05:43 INFO - [Parent 1357, QuotaManager IO] WARNING: 'QuotaClient::IsShuttingDownOnNonBackgroundThread()', file /builds/worker/workspace/build/src/dom/indexedDB/ActorsParent.cpp, line 21336 [task 2018-08-24T17:05:43.759Z] 17:05:43 INFO - [Parent 1357, QuotaManager IO] WARNING: 'NS_FAILED(rv)', file /builds/worker/workspace/build/src/dom/indexedDB/ActorsParent.cpp, line 21200 [task 2018-08-24T17:05:43.760Z] 17:05:43 INFO - 1535130343531 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': storage.local data migration cancelled, unable to open IDB connection: The operation failed for reasons unrelated to the database itself and not covered by any other error code.:: [task 2018-08-24T17:05:43.761Z] 17:05:43 INFO - 1535130343532 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': JSONFile backend is being kept enabled by an unexpected IDBBackend failure: The operation failed for reasons unrelated to the database itself and not covered by any other error code.:: [task 2018-08-24T17:05:43.762Z] 17:05:43 INFO - [Parent 1357, Main Thread] WARNING: 'aRv.Failed()', file /builds/worker/workspace/build/src/dom/indexedDB/IDBDatabase.cpp, line 586 [task 2018-08-24T17:05:44.063Z] 17:05:44 INFO - 1535130344058 Marionette DEBUG Received observer notification xpcom-will-shutdown [task 2018-08-24T17:05:55.009Z] 17:05:55 INFO - REFTEST PROCESS-CRASH | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/svg/smil/repeat/init-repeat-1.svg | application crashed [@ js::jit::ExecutablePool::release(bool)] [task 2018-08-24T17:05:55.010Z] 17:05:55 INFO - Crash dump filename: /tmp/tmpgRWMue.mozrunner/minidumps/0c1146fb-0f3e-8620-aa34-93734a398711.dmp [task 2018-08-24T17:05:55.011Z] 17:05:55 INFO - Operating system: Linux [task 2018-08-24T17:05:55.011Z] 17:05:55 INFO - 0.0.0 Linux 4.4.0-1014-aws #14taskcluster1-Ubuntu SMP Tue Apr 3 10:27:00 UTC 2018 x86_64 [task 2018-08-24T17:05:55.012Z] 17:05:55 INFO - CPU: amd64 [task 2018-08-24T17:05:55.012Z] 17:05:55 INFO - family 6 model 62 stepping 4 [task 2018-08-24T17:05:55.013Z] 17:05:55 INFO - 2 CPUs [task 2018-08-24T17:05:55.014Z] 17:05:55 INFO - [task 2018-08-24T17:05:55.017Z] 17:05:55 INFO - GPU: UNKNOWN [task 2018-08-24T17:05:55.018Z] 17:05:55 INFO - [task 2018-08-24T17:05:55.018Z] 17:05:55 INFO - Crash reason: SIGSEGV [task 2018-08-24T17:05:55.019Z] 17:05:55 INFO - Crash address: 0x0 [task 2018-08-24T17:05:55.020Z] 17:05:55 INFO - Process uptime: not available [task 2018-08-24T17:05:55.020Z] 17:05:55 INFO - [task 2018-08-24T17:05:55.020Z] 17:05:55 INFO - Thread 0 (crashed) [task 2018-08-24T17:05:55.023Z] 17:05:55 INFO - 0 libxul.so!js::jit::ExecutablePool::release(bool) [ExecutableAllocator.cpp:9f1b27308bd6dfaa17c75720660b5bf40cd4ccab : 51 + 0x0] [task 2018-08-24T17:05:55.023Z] 17:05:55 INFO - rax = 0x0000000000000000 rdx = 0x0000000000000000 [task 2018-08-24T17:05:55.024Z] 17:05:55 INFO - rcx = 0x00007f7c7e4172dd rbx = 0x00007f7c63237330 [task 2018-08-24T17:05:55.025Z] 17:05:55 INFO - rsi = 0x00007f7c7e6e6770 rdi = 0x00007f7c7e6e5540 [task 2018-08-24T17:05:55.025Z] 17:05:55 INFO - rbp = 0x00007ffc190b6eb0 rsp = 0x00007ffc190b6ea0 [task 2018-08-24T17:05:55.029Z] 17:05:55 INFO - r8 = 0x00007f7c7e6e6770 r9 = 0x00007f7c7f9c9740 [task 2018-08-24T17:05:55.030Z] 17:05:55 INFO - r10 = 0x0000000000000000 r11 = 0x0000000000000000 [task 2018-08-24T17:05:55.030Z] 17:05:55 INFO - r12 = 0x00007f7c66714000 r13 = 0x00007f7c6689dde0 [task 2018-08-24T17:05:55.030Z] 17:05:55 INFO - r14 = 0x00007f7c66704800 r15 = 0x00007f7c66714688 [task 2018-08-24T17:05:55.030Z] 17:05:55 INFO - rip = 0x00007f7c6fbdcd8e [task 2018-08-24T17:05:55.030Z] 17:05:55 INFO - Found by: given as instruction pointer in context [task 2018-08-24T17:05:55.030Z] 17:05:55 INFO - 1 libxul.so!js::jit::ExecutableAllocator::~ExecutableAllocator() [ExecutableAllocator.cpp:9f1b27308bd6dfaa17c75720660b5bf40cd4ccab : 99 + 0xa] [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - rbx = 0x0000000000000000 rbp = 0x00007ffc190b6ed0 [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - rsp = 0x00007ffc190b6ec0 r12 = 0x00007f7c66714000 [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - r13 = 0x00007f7c6689dde0 r14 = 0x00007f7c66704800 [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - r15 = 0x00007f7c66714688 rip = 0x00007f7c6fbdcebc [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - Found by: call frame info [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - 2 libxul.so!JSRuntime::destroyRuntime() [Utility.h:9f1b27308bd6dfaa17c75720660b5bf40cd4ccab : 561 + 0x8] [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - rbx = 0x00007f7c66704800 rbp = 0x00007ffc190b6f20 [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - rsp = 0x00007ffc190b6ee0 r12 = 0x00007f7c66714000 [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - r13 = 0x00007f7c6689dde0 r14 = 0x00007f7c66722000 [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - r15 = 0x00007f7c66714688 rip = 0x00007f7c700ef0c6 [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - Found by: call frame info [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - 3 libxul.so!js::DestroyContext(JSContext*) [JSContext.cpp:9f1b27308bd6dfaa17c75720660b5bf40cd4ccab : 213 + 0x5] [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - rbx = 0x00007f7c66899000 rbp = 0x00007ffc190b6f60 [task 2018-08-24T17:05:55.032Z] 17:05:55 INFO - rsp = 0x00007ffc190b6f30 r12 = 0x00007f7c6689dd98 [task 2018-08-24T17:05:55.033Z] 17:05:55 INFO - r13 = 0x00007f7c6689dde0 r14 = 0x00007f7c6689dd80 [task 2018-08-24T17:05:55.033Z] 17:05:55 INFO - r15 = 0x00007f7c66722000 rip = 0x00007f7c700412ad [task 2018-08-24T17:05:55.033Z] 17:05:55 INFO - Found by: call frame info [task 2018-08-24T17:05:55.034Z] 17:05:55 INFO - 4 libxul.so!mozilla::CycleCollectedJSContext::~CycleCollectedJSContext() [CycleCollectedJSContext.cpp:9f1b27308bd6dfaa17c75720660b5bf40cd4ccab : 103 + 0x5] [task 2018-08-24T17:05:55.035Z] 17:05:55 INFO - rbx = 0x00007f7c66899000 rbp = 0x00007ffc190b6fa0 [task 2018-08-24T17:05:55.037Z] 17:05:55 INFO - rsp = 0x00007ffc190b6f70 r12 = 0x00007f7c6689dd98 [task 2018-08-24T17:05:55.037Z] 17:05:55 INFO - r13 = 0x00007f7c6689dde0 r14 = 0x00007f7c6689dd80 [task 2018-08-24T17:05:55.038Z] 17:05:55 INFO - r15 = 0x00007f7c6689de48 rip = 0x00007f7c6c6a916f [task 2018-08-24T17:05:55.038Z] 17:05:55 INFO - Found by: call frame info [task 2018-08-24T17:05:55.039Z] 17:05:55 INFO - 5 libxul.so!XPCJSContext::~XPCJSContext() [XPCJSContext.cpp:9f1b27308bd6dfaa17c75720660b5bf40cd4ccab : 919 + 0x14] [task 2018-08-24T17:05:55.040Z] 17:05:55 INFO - rbx = 0x00007f7c66899000 rbp = 0x00007ffc190b6fd0 [task 2018-08-24T17:05:55.040Z] 17:05:55 INFO - rsp = 0x00007ffc190b6fb0 r12 = 0x00007ffc190b7128 [task 2018-08-24T17:05:55.041Z] 17:05:55 INFO - r13 = 0x00007ffc190b7130 r14 = 0x0000000000000000 [task 2018-08-24T17:05:55.041Z] 17:05:55 INFO - r15 = 0x00007f7c66851d00 rip = 0x00007f7c6ced27bb [task 2018-08-24T17:05:55.042Z] 17:05:55 INFO - Found by: call frame info
Component: SVG → JavaScript Engine: JIT
In the last 7 days, there are 31 failures associated with this bug. They occur mostly on linux64 (debug), linux 32 (debug) and some on windows10-64-ccov (debug). Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=196328748&repo=mozilla-central&lineNumber=60046 :sdetar, Can you have a look into this?
Flags: needinfo?(sdetar)
Whiteboard: [stockwell needswork]
This is a symptom of leaking GC things at shutdown, the log has many lines like: [task 2018-08-24T17:05:42.018Z] 17:05:42 INFO - ERROR: GC found live Cell 0x7f7c63d07040 of kind FUNCTION at shutdown [task 2018-08-24T17:05:42.019Z] 17:05:42 INFO - ERROR: GC found live Cell 0x7f7c63d07080 of kind FUNCTION at shutdown So what's going on here is that scripts keep JIT code alive and the executable code allocator then fails an assertion because there are still references to code pools when it's destroyed. Jon, is this one of these "needs a shutdownCollectedEverything() call" bugs? It worries me a bit that this seems to have gotten worse recently :/
Flags: needinfo?(jcoppeard)
Flags: needinfo?(sdetar)
See Also: → 1485985
Maybe we should wait for bug 1485985 comment 6.
Flags: needinfo?(jcoppeard)
See Also: → 1485983
(In reply to Jan de Mooij [:jandem] from comment #3) > Jon, is this one of these "needs a shutdownCollectedEverything() call" bugs? Yes it is. > It worries me a bit that this seems to have gotten worse recently :/ Yeah we need to find out what is leaking GC things, and probably make this a more serious error.
(In reply to Jon Coppeard (:jonco) from comment #5) > > It worries me a bit that this seems to have gotten worse recently :/ > > Yeah we need to find out what is leaking GC things, and probably make this a > more serious error. Jon, can you or someone else please have a look at it? The ranking is quite high in Orange Factor for this bug and bug 1485983. Thanks.
Flags: needinfo?(jcoppeard)
From comment 9 it appears this specific problem is caused by bug 1484496. For the general problem of the embedding causing us to leak GC things at shutdown, there are some things we could do to help pinpoint the source: - dump more information about what the leaked GC things are - make this show up as a build failure like embedding leaks (bug 1407593) - dump the GC heap when this happens - maybe dump the CC graph too
Flags: needinfo?(jcoppeard)
There is a total of 43 failures in the last 7 days. Jon, this still has a high occurrence rate. I saw your comment above, are you working on it or should this be assigned to someone else? Thanks.
Flags: needinfo?(jcoppeard)
Whiteboard: [stockwell needswork] → [stockwell needswork:owner]
(In reply to Andreea Pavel [:apavel] from comment #13) > Jon, this still has a high occurrence rate. I saw your comment above, are > you working on it or should this be assigned to someone else? Jon is on PTO, but AFAIK this is still blocked on the NI from kmag in bug 1485985.
Jan, thanks. Waiting on bug 1485985 to be fixed here.
Flags: needinfo?(jcoppeard)
(In reply to Andreea Pavel [:apavel] from comment #13) Sorry for the slow reply. I'm not actively working on this.
(In reply to Jon Coppeard (:jonco) from comment #19) > (In reply to Andreea Pavel [:apavel] from comment #13) > Sorry for the slow reply. I'm not actively working on this. No problem, already followed progress here in comment #16
Taking over the ni? flag for Kris from bug 1485985.
Flags: needinfo?(kmaglione+bmo)
Henrik are you working on this?
(In reply to Andreea Pavel [:apavel] from comment #31) > Henrik are you working on this? No, this is not my area. As you can see from my last comment we are waiting for feedback from Kris.
Over the last 7 days there are 42 failures on this bug. These happen on linux32, linux64, osx-10-10, windows10-64, windows10-64-ccov, windows7-32. Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=216333338&repo=mozilla-central&lineNumber=25475
There are 154 failures asociated to this bug in the last 7 days. These are occurring on all platforms, on debug build. :sdetar can you take a look at this?
Flags: needinfo?(sdetar)
It seems from the comments this is still waiting for Kris
Flags: needinfo?(sdetar)
Flags: needinfo?(kmaglione+bmo)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
Hi Kris, there is a high failure rate here, can you please take a look at this?
Flags: needinfo?(kmaglione+bmo)
It looks like this is a leak in Firefox code, not a GC issue, but I don't have anything to do with the tests that are triggering the leak.
Component: JavaScript Engine: JIT → General
Flags: needinfo?(kmaglione+bmo)
Product: Core → Firefox
Justin, can you take a look at this?
Flags: needinfo?(dolske)
There are 112 total failures in the last 7 days on linux, osx and windows, all debug build types. Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=218724268&repo=mozilla-central&lineNumber=62465
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
I think the basic problem here is an assertion in the JS engine is causing leaks that we normally ignore to turn into a failure. Of the 113 failures that started in the last week, maybe 1/4 are in firefox-ui-functional-local-e10s, whatever that is, and the rest are in marionette tests. Based on bug 1485985 comment 10, this is a regression from bug 1484373.
Blocks: 1484373
Andrew, from the fix you mean? Kris just mentioned above that he doesn't have anything to do with this.
It looks like this is just the side-effect of a messy shutdown. The log is full of things like: 23:46:10 INFO - JavaScript error: resource:///actors/AboutReaderChild.jsm, line 80: NS_ERROR_UNEXPECTED: 23:46:10 INFO - JavaScript error: chrome://global/content/process-content.js, line 74: NS_ERROR_UNEXPECTED: 23:46:10 INFO - [Child 6612, Main Thread] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file z:/build/build/src/xpcom/threads/nsThread.cpp, line 870 23:46:10 INFO - [Child 1800, Main Thread] WARNING: '!gThread', file z:/build/build/src/xpcom/threads/nsTimerImpl.cpp, line 299 23:46:11 INFO - [Child 7732, Main Thread] WARNING: '!mMainThread', file z:/build/build/src/xpcom/threads/nsThreadManager.cpp, line 457 After which, it's not super surprising that something doesn't get fully cleaned up. At least for the Windows failures, it seems like this is probably a side-effect of the way we're shutting down in response to OS shutdown. We should maybe just hard kill the content processes in those cases.
Update: this has failed 19 times in the last 7 days, only on debug builds, most of them on firefox-ui-functional tests: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-12-28&endday=2019-01-04&tree=all&bug=1486101 Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=219924275&repo=mozilla-central&lineNumber=67368 06:18:57 INFO - ERROR: GC found live SharedScriptData 000001CFB3869B80 with ref count 2 at shutdown 06:18:57 INFO - ERROR: GC found live SharedScriptData 000001CFB59F6D30 with ref count 2 at shutdown 06:18:57 INFO - ERROR: GC found live SharedScriptData 000001CFB196F580 with ref count 2 at shutdown 06:18:57 INFO - Assertion failure: m_refCount == 1, at z:/build/build/src/js/src/jit/ExecutableAllocator.cpp:49 06:20:05 INFO - TEST-UNEXPECTED-ERROR | testing\firefox-ui\tests\functional\sessionstore\test_restore_windows_after_windows_shutdown.py TestWindowsShutdownRegisterRestart.test_manual_restart | IOError: Process still running 70s after quit request 06:20:05 INFO - Traceback (most recent call last): 06:20:05 INFO - File "Z:\task_1546580829\build\venv\lib\site-packages\marionette_harness\marionette_test\testcases.py", line 159, in run 06:20:05 INFO - testMethod() 06:20:05 INFO - File "Z:\task_1546580829\build\tests\firefox-ui\tests\testing\firefox-ui\tests\functional\sessionstore\test_restore_windows_after_windows_shutdown.py", line 44, in test_manual_restart 06:20:05 INFO - self.windows_shutdown_with_variety(restart_by_os=False, expect_restore=True) 06:20:05 INFO - File "Z:\task_1546580829\build\tests\firefox-ui\tests\testing\firefox-ui\tests\functional\sessionstore\session_store_test_case.py", line 310, in windows_shutdown_with_variety 06:20:05 INFO - self.marionette.quit(in_app=True, callback=lambda: self.simulate_os_shutdown()) 06:20:05 INFO - File "Z:\task_1546580829\build\venv\lib\site-packages\marionette_driver\decorators.py", line 26, in _ 06:20:05 INFO - return func(*args, **kwargs) 06:20:05 INFO - File "Z:\task_1546580829\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1128, in quit 06:20:05 INFO - raise IOError(message.format(self.shutdown_timeout)) 06:20:05 INFO - TEST-INFO took 123836ms 06:20:05 ERROR - test_end for testing\firefox-ui\tests\functional\sessionstore\test_restore_windows_after_windows_shutdown.py TestWindowsShutdownRegisterRestart.test_manual_restart logged while not in progress. Logged with data: {"status": "ERROR", "extra": {"class_name": "test_restore_windows_after_windows_shutdown.TestWindowsShutdownRegisterRestart", "method_name": "test_manual_restart"}, "expected": "PASS", "test": "testing\\firefox-ui\\tests\\functional\\sessionstore\\test_restore_windows_after_windows_shutdown.py TestWindowsShutdownRegisterRestart.test_manual_restart", "message": "InvalidSessionIdException: Please start a session", "stack": "Traceback (most recent call last):\n File \"Z:\\task_1546580829\\build\\venv\\lib\\site-packages\\marionette_harness\\marionette_test\\testcases.py\", line 190, in run\n self.tearDown()\n File \"Z:\\task_1546580829\\build\\tests\\firefox-ui\\tests\\testing\\firefox-ui\\tests\\functional\\sessionstore\\session_store_test_case.py\", line 65, in tearDown\n super(SessionStoreTestCase, self).tearDown()\n File \"Z:\\task_1546580829\\build\\venv\\lib\\site-packages\\firefox_puppeteer\\mixins.py\", line 97, in tearDown\n self.marionette.set_context('chrome')\n File \"Z:\\task_1546580829\\build\\venv\\lib\\site-packages\\marionette_driver\\marionette.py\", line 1461, in set_context\n {\"value\": context})\n File \"Z:\\task_1546580829\\build\\venv\\lib\\site-packages\\marionette_driver\\decorators.py\", line 26, in _\n return func(*args, **kwargs)\n File \"Z:\\task_1546580829\\build\\venv\\lib\\site-packages\\marionette_driver\\marionette.py\", line 753, in _send_message\n raise errors.InvalidSessionIdException(\"Please start a session\")\n"} 06:20:05 INFO - 06:20:05 INFO - SUMMARY 06:20:05 INFO - ------- 06:20:05 INFO - passed: 63 06:20:05 INFO - failed: 2 06:20:05 INFO - todo: 2 06:20:05 INFO - 06:20:05 INFO - FAILED TESTS 06:20:05 INFO - ------- 06:20:05 INFO - test_restore_windows_after_windows_shutdown.py test_restore_windows_after_windows_shutdown.TestWindowsShutdownRegisterRestart.test_manual_restart 06:20:05 INFO - test_restore_windows_after_windows_shutdown.py test_restore_windows_after_windows_shutdown.TestWindowsShutdownRegisterRestart.test_manual_restart 06:20:05 INFO - SUITE-END | took 1249s 06:20:06 ERROR - Return code: 10 06:20:06 ERROR - # TBPL FAILURE # 06:20:06 WARNING - setting return code to 2 Andrew, care to have a crack at this? Justin, could you take a look over this or assign it to someone? Thank you
Flags: needinfo?(continuation)
Also note that the background hang monitor doesn't kick in and kills the process after around 60 seconds. But maybe that is not possible because it is blocked itself? But from the testing side we hard kill Firefox after that amount of time in Marionette tests, and here mostly session restore tests seem to cause it.

The failure rate here is still high, 30 total failures in the last 7 days and 304 total failures in the last 30 days.

Andrew/Justin do you have an update here?

Flags: needinfo?(dolske)
Flags: needinfo?(continuation)
Flags: needinfo?(dolske)
Flags: needinfo?(continuation)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

I don't know anything about this test suite, and I'm not sure why this leak is hitting an assertion when we usually don't hit it.

Flags: needinfo?(continuation)

(In reply to Andrew McCreight [:mccr8] from comment #61)

I don't know anything about this test suite, and I'm not sure why this leak is hitting an assertion when we usually don't hit it.

It's not only Firefox ui tests which are causing this assertion and crash. But also lots of mochitests on Linux triggering it. Please see the OrangeFactor history.

06:20:05 INFO - FAILED TESTS
06:20:05 INFO - -------
06:20:05 INFO - test_restore_windows_after_windows_shutdown.py test_restore_windows_after_windows_shutdown.TestWindowsShutdownRegisterRestart.test_manual_restart
06:20:05 INFO - SUITE-END | took 1249s

This test failure is specific for Firefox UI tests and is only failing on Windows 10 64bit and that for ccov debug builds. Maybe something is wrong here with the shutdown for such builds? Marco, is there something you can help with?

Flags: needinfo?(mcastelluccio)
Priority: P5 → P1

Marco, can you please take a look at the above comment?

The only thing I can think of is that shutdown for ccov builds is very slow, as the coverage counters are written out on shutdown. So the timing of the shutdown can be different in ccov builds. I wonder if the test fails in test-verify mode too.

Flags: needinfo?(mcastelluccio)

(In reply to Marco Castelluccio [:marco] from comment #69)

The only thing I can think of is that shutdown for ccov builds is very slow, as the coverage counters are written out on shutdown. So the timing of the shutdown can be different in ccov builds. I wonder if the test fails in test-verify mode too.

At least the Marionette and firefox-ui tests aren't supported yet in test-verify mode. I cannot say something for mochitests.

Also the ccov failures are only a small set compared to normal debug builds which fail way more often.

There are 139 failures associated to this bug in the last 7 days. These are hitting all platforms but only the debug builds.

:Dolske any updates on this?

There are 118 total failures in the last 7 days, Justin can you assign someone to work on this?

We cannot disable it since this fails in different tests.

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

(In reply to Andrew McCreight [:mccr8] from comment #53)

I think the basic problem here is an assertion in the JS engine is causing
leaks that we normally ignore to turn into a failure. Of the 113 failures
that started in the last week, maybe 1/4 are in
firefox-ui-functional-local-e10s, whatever that is, and the rest are in
marionette tests.

firefox-ui tests are also run under marionette. Could this be a webdriver bug?

So I was able to trace down the problem to create a minimized testcase which can be run with Marionette. It seems to be a situation when in a new tab a page load is started, but it is getting closed before the page load has been finished.

I can reproduce this crash kinda easily on MacOS when running the following command with the testcase as attached:

mach marionette test -vv --gecko-log - %testcase.py% --run-until-failure

Firefox should crash after a couple of iterations of this test. If not, repeat the command or additionally use --repeat 100 or such to run more than 30 iterations.

I hope that helps to nail down this problem. Who could actually investigate that?

I hate to keep bouncing this around, but tracking down an ill-defined leak is going to be difficult for the front-end team to assist with.

Andrew, does the reduced test case in comment 86 help? I'm not sure who is best positioned to look at leaks.

Flags: needinfo?(dolske) → needinfo?(continuation)

I'll try to debug this leak.

Flags: needinfo?(continuation)
This modifies test_title.py to be whimboo's minimized test case. It also comments out a bunch of GC assertions so that we don't assert when there's a leak. It also makes the bloat log fatally assert after dumping out the log. With these combination of patches, you can run the following command to make it run until there's a leak, and you'll get some bloat log info: MOZ_DISABLE_CONTENT_SANDBOX=t XPCOM_MEM_BLOAT_LOG=1 ./mach marionette test -vv --gecko-log - testing/marionette/harness/marionette_harness/tests/unit/test_title.py --run-until-failure You can set more env vars to get CC logs, etc.

With this patch, I was able to get a CC log of the case when we leaked. There were 3 non-garbage items in the final CC log. A nsWebBrowser, a nsXPCWrappedJS (nsIStreamListener), and a nsXPCWrappedJS (nsIInterfaceRequestor). The two XPCWJS were in the same object chain. I looked at the JS object being used to implement the WJS and I figured out that it must be a FaviconLoad from FaviconLoader.jsm. It looks like it is being used as some kind of networking listener thing. The bloat log also includes an HttpChannelChild.

On a hunch, I applied the patches from bug 1520062 (which was landed and backed out) and with that I am no longer able to reproduce the leak, so I think the leak is from hitting some error case.

Depends on: 1520062

This bug also seems to have had zero instances on January 23, when bug 1520062 was in the tree, so that's consistent with my theory.

Component: General → Networking: HTTP
Product: Firefox → Core

Thanks a lot Andrew! That sounds promising.

Andrew, thanks. I will duplicate it then to get this off our triage list.

Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE

The test failures are still present and as such we should allow sheriffs to easily classify failures. Otherwise we might also see spreading failures across different bugs. If you don't mind I would like to keep this open, given that it is such a high frequent failure.

Note that we can remove the P1 which should also already help to get it off your triage list.

Status: RESOLVED → REOPENED
Priority: P1 → --
Resolution: DUPLICATE → ---

Oh, I didn't realize, sorry for that. Marking as P2 (to actually remove it from our triage list) which is the same priority as the blocker.

Status: REOPENED → NEW
Priority: -- → P2

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #104)

Thanks a lot Andrew! That sounds promising.

Thanks for the minimized test case. It would have been much harder to investigate without it.

Whiteboard: [stockwell disable-recommended] → [stockwell disable-recommended][necko-triaged]
Whiteboard: [stockwell disable-recommended][necko-triaged] → [necko-triaged]
Whiteboard: [necko-triaged] → [necko-triaged][stockwell needswork:owner]
Blocks: 1485983
Whiteboard: [necko-triaged][stockwell disable-recommended] → [necko-triaged][stockwell needswork:owner]

There are 24 total failures in the last 7 days and 702 total failures in the last 30 days on all platforms, only debug build types.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=229899897&repo=autoland&lineNumber=14338

[task 2019-02-22T14:18:25.055Z] 14:18:25 INFO - TEST-START | toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_idle.html
[task 2019-02-22T14:18:25.074Z] 14:18:25 INFO - GECKO(3526) | ERROR: GC found live Cell 0xf74575e7040 of kind FUNCTION at shutdown
[task 2019-02-22T14:18:25.074Z] 14:18:25 INFO - GECKO(3526) | ERROR: GC found live Cell 0xf74575e7080 of kind FUNCTION at shutdown
[task 2019-02-22T14:18:25.074Z] 14:18:25 INFO - GECKO(3526) | ERROR: GC found live Cell 0xf74575e70c0 of kind FUNCTION at shutdown
[task 2019-02-22T14:18:25.074Z] 14:18:25 INFO - GECKO(3526) | ERROR: GC found live Cell 0xf74575e7100 of kind FUNCTION at shutdown
[task 2019-02-22T14:18:26.485Z] 14:18:26 INFO - GECKO(3526) | ERROR: GC found live SharedScriptData 0x7efc2a670900 with ref count 2 at shutdown
[task 2019-02-22T14:18:26.486Z] 14:18:26 INFO - GECKO(3526) | Assertion failure: m_refCount == 1, at /builds/worker/workspace/build/src/js/src/jit/ExecutableAllocator.cpp:49
[task 2019-02-22T14:18:26.487Z] 14:18:26 INFO - GECKO(3526) | [Parent 3526, Main Thread] WARNING: Cannot set transparency mode on non-popup windows.: file /builds/worker/workspace/build/src/widget/gtk/nsWindow.cpp, line 4059
[task 2019-02-22T14:18:26.487Z] 14:18:26 INFO - GECKO(3526) | [Parent 3526, Main Thread] WARNING: Cannot set transparency mode on non-popup windows.: file /builds/worker/workspace/build/src/widget/gtk/nsWindow.cpp, line 4059
[task 2019-02-22T14:18:26.488Z] 14:18:26 INFO - GECKO(3526) | ++DOMWINDOW == 9 (0x7efc65285000) [pid = 3639] [serial = 128] [outer = 0x7efc7fecf800]
[task 2019-02-22T14:18:26.651Z] 14:18:26 INFO - GECKO(3526) | Console message: [JavaScript Warning: "The character encoding of a framed document was not declared. The document may appear different if viewed without the document framing it." {file: "http://mochi.test:8888/tests/toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_idle.html" line: 0}]
[task 2019-02-22T14:18:26.652Z] 14:18:26 INFO - GECKO(3526) | Console message: Warning: attempting to write 5552 bytes to preference extensions.webextensions.uuids. This is bad for general performance and memory usage. Such an amount of data should rather be written to an external file. This preference will not be sent to any content processes.
[task 2019-02-22T14:18:26.693Z] 14:18:26 INFO - GECKO(3526) | [Parent 3526, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /builds/worker/workspace/build/src/netwerk/protocol/res/SubstitutingProtocolHandler.cpp, line 314
[task 2019-02-22T14:18:26.757Z] 14:18:26 INFO - GECKO(3526) | [Parent 3526, Main Thread] WARNING: Need TabChild to get the nativeWindow from!: file /builds/worker/workspace/build/src/widget/PuppetWidget.cpp, line 1078
[task 2019-02-22T14:18:26.761Z] 14:18:26 INFO - GECKO(3526) | ++DOCSHELL 0x7f98535a5000 == 18 [pid = 3619] [id = {79c099e3-272e-4e07-a27a-fc6c2cf57e7d}]
[task 2019-02-22T14:18:26.763Z] 14:18:26 INFO - GECKO(3526) | ++DOMWINDOW == 55 (0x7f985358e400) [pid = 3619] [serial = 176] [outer = (nil)]
[task 2019-02-22T14:18:26.765Z] 14:18:26 INFO - GECKO(3526) | [Child 3619, Main Thread] WARNING: Fallback to BasicLayerManager: file /builds/worker/workspace/build/src/dom/ipc/TabChild.cpp, line 2586
task 2019-02-22T14:18:37.338Z] 14:18:37 INFO - GECKO(3526) | MEMORY STAT | vsize 1497MB | residentFast 139MB | heapAllocated 14MB
[task 2019-02-22T14:18:37.357Z] 14:18:37 INFO - Buffered messages logged at 14:18:26
[task 2019-02-22T14:18:37.357Z] 14:18:37 INFO - AddTask.js | Entering test testWithRealIdleService
[task 2019-02-22T14:18:37.357Z] 14:18:37 INFO - Extension loaded
[task 2019-02-22T14:18:37.357Z] 14:18:37 INFO - Setting interval to 152
[task 2019-02-22T14:18:37.357Z] 14:18:37 INFO - TEST-PASS | toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_idle.html | Idle status is active - Expected: active, Actual: active
[task 2019-02-22T14:18:37.359Z] 14:18:37 INFO - Buffered messages logged at 14:18:27
[task 2019-02-22T14:18:37.359Z] 14:18:37 INFO - Listener added
[task 2019-02-22T14:18:37.359Z] 14:18:37 INFO - Buffered messages logged at 14:18:37
[task 2019-02-22T14:18:37.360Z] 14:18:37 INFO - TEST-PASS | toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_idle.html | listener fired with the expected state - Expected: idle, Actual: idle
[task 2019-02-22T14:18:37.360Z] 14:18:37 INFO - Listener fired
[task 2019-02-22T14:18:37.363Z] 14:18:37 INFO - TEST-PASS | toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_idle.html | Idle status is idle - Expected: idle, Actual: idle
[task 2019-02-22T14:18:37.365Z] 14:18:37 INFO - TEST-PASS | toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_idle.html | idle
[task 2019-02-22T14:18:37.367Z] 14:18:37 INFO - TEST-PASS | toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_idle.html | test result correct
[task 2019-02-22T14:18:37.373Z] 14:18:37 INFO - AddTask.js | Leaving test testWithRealIdleService
[task 2019-02-22T14:18:37.373Z] 14:18:37 INFO - Buffered messages finished
[task 2019-02-22T14:18:37.374Z] 14:18:37 INFO - TEST-UNEXPECTED-ERROR | toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_idle.html | unexpected-crash-dump-found - This test left crash dumps behind, but we weren't expecting it to!
[task 2019-02-22T14:18:37.376Z] 14:18:37 INFO - Found unexpected crash dump file /tmp/tmp9g3QJx.mozrunner/minidumps/051c3f0d-70e0-f358-553b-891fb40e584a.dmp.
[task 2019-02-22T14:18:37.377Z] 14:18:37 INFO - Found unexpected crash dump file /tmp/tmp9g3QJx.mozrunner/minidumps/051c3f0d-70e0-f358-553b-891fb40e584a.extra.
[task 2019-02-22T14:18:37.377Z] 14:18:37 INFO - TEST-UNEXPECTED-CRASH | toolkit/components/extensions/test/mochitest/test-oop-extensions/test_ext_idle.html | Finished in 12302ms
[task 2019-02-22T14:18:37.378Z] 14:18:37 INFO - {u'runtime': 12302}
[task 2019-02-22T14:18:37.378Z] 14:18:37 INFO - TEST-INFO took 12302ms
[task 2019-02-22T14:18:37.420Z] 14:18:37 INFO - GECKO(3526) | ++DOMWINDOW == 7 (0x7efc65281800) [pid = 3639] [serial = 129] [outer = 0x7efc7fecf800]
[task 2019-02-22T14:18:37.436Z] 14:18:37 INFO - GECKO(3526) | [Parent 3526, Main Thread] WARNING: Cannot set transparency mode on non-popup windows.: file /builds/worker/workspace/build/src/widget/gtk/nsWindow.cpp, line 4059

Andrew are there updates here? Failure rate seems to have decreased starting with ~18th of February.

Flags: needinfo?(continuation)

See comment 100. Bug 1520062 landed and there have been no failures more recently than 2/22.

Flags: needinfo?(continuation)

(In reply to Andrew McCreight [:mccr8] from comment #113)

See comment 100. Bug 1520062 landed and there have been no failures more recently than 2/22.

Yes, that is correct, I posted the comment on the 22nd.

Closing as fixed given that no more failures are present in the last 3 days on trunk. Beta is still affected but has a very low number of intermittents, so maybe a wontfix.

Assignee: nobody → valentin.gosu
Status: NEW → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67

(In reply to Mats Palmgren (:mats) from comment #118)

Fwiw, I still see this assertion fairly often on Try, e.g.:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=83a6688560d512a4095a7497a412cf58c52109e0&selectedJob=243543274

Can you please file a new bug, and then also classify the failure? That way we could see how often it actually fails on try. Thanks.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: