Closed Bug 1371207 Opened 7 years ago Closed 7 years ago

Intermittent test_crash.py TestCrash.test_crash_content_process | application crashed [@ js::ctypes::ConvertToJS]

Categories

(Core :: JavaScript Engine, defect)

55 Branch
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox54 --- unaffected
firefox55 --- affected
firefox56 --- affected

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

References

Details

(Keywords: crash, intermittent-failure, regression)

Crash Data

As seen when running my try build [1] for bug 1223277 there is a recent regression in our content specific unit crash test [2] for Marionette. Here the first frames of the crashing thread 0:

https://treeherder.mozilla.org/logviewer.html#?job_id=104881595&repo=try&lineNumber=2932

[task 2017-06-06T13:35:00.677708Z] 13:35:00     INFO - Crash reason:  SIGSEGV
[task 2017-06-06T13:35:00.678500Z] 13:35:00     INFO - Crash address: 0x8
[task 2017-06-06T13:35:00.679270Z] 13:35:00     INFO - Process uptime: not available
[task 2017-06-06T13:35:00.680007Z] 13:35:00     INFO - 
[task 2017-06-06T13:35:00.681489Z] 13:35:00     INFO -  0  libxul.so!js::ctypes::ConvertToJS [CTypes.cpp:5b605b51a2f8 : 3203 + 0x0]
[task 2017-06-06T13:35:00.682305Z] 13:35:00     INFO -     eip = 0xf3a4a7f9   esp = 0xffcf1ff0   ebp = 0xffcf20f8   ebx = 0xf56305ec
[task 2017-06-06T13:35:00.683051Z] 13:35:00     INFO -     esi = 0x00000000   edi = 0xffcf2028   eax = 0x00000008   ecx = 0xffffff81
[task 2017-06-06T13:35:00.683715Z] 13:35:00     INFO -     edx = 0xffcf2101   efl = 0x00210297
[task 2017-06-06T13:35:00.684484Z] 13:35:00     INFO -     Found by: given as instruction pointer in context
[task 2017-06-06T13:35:00.685233Z] 13:35:00     INFO -  1  libxul.so!js::ctypes::PointerType::ContentsGetter [CTypes.cpp:5b605b51a2f8 : 5342 + 0x22]
[task 2017-06-06T13:35:00.685965Z] 13:35:00     INFO -     eip = 0xf3a4fb68   esp = 0xffcf2100   ebp = 0xffcf2188   ebx = 0xf56305ec
[task 2017-06-06T13:35:00.686829Z] 13:35:00     INFO -     esi = 0x00000000   edi = 0xf716b000
[task 2017-06-06T13:35:00.687495Z] 13:35:00     INFO -     Found by: call frame info
[task 2017-06-06T13:35:00.688319Z] 13:35:00     INFO -  2  libxul.so!js::ctypes::Property<js::ctypes::PointerType::IsPointer, js::ctypes::PointerType::ContentsGetter>::Fun [CallNonGenericMethod.h:5b605b51a2f8 : 100 + 0x12]
[task 2017-06-06T13:35:00.689048Z] 13:35:00     INFO -     eip = 0xf3a4fe1c   esp = 0xffcf2190   ebp = 0xffcf21d8   ebx = 0xf56305ec
[task 2017-06-06T13:35:00.689793Z] 13:35:00     INFO -     esi = 0xda361790   edi = 0xffcf21a8
[task 2017-06-06T13:35:00.690537Z] 13:35:00     INFO -     Found by: call frame info
[task 2017-06-06T13:35:00.691338Z] 13:35:00     INFO -  3  libxul.so!js::InternalCallOrConstruct [jscntxtinlines.h:5b605b51a2f8 : 293 + 0x10]
[task 2017-06-06T13:35:00.692091Z] 13:35:00     INFO -     eip = 0xf3aae987   esp = 0xffcf21e0   ebp = 0xffcf2278   ebx = 0xf56305ec
[task 2017-06-06T13:35:00.692831Z] 13:35:00     INFO -     esi = 0xffcf2240   edi = 0xf716b000
[task 2017-06-06T13:35:00.693564Z] 13:35:00     INFO -     Found by: call frame info

I did a regression test locally and it turns out the changes on bug 1365307 are causing this problem. Maybe the throttling of the HTTP transaction somewhat conflicts with the shutdown of Firefox as triggered by the content crash by the crash reporter (to be enabled with MOZ_CRASHREPORTER_SHUTDOWN env variable)?

To run this test you can use the following command (maybe do an s/def test_/def tst_/ before to only run this single test):

> mach marionette-test testing/marionette/harness/marionette_harness/tests/unit/test_crash.py


[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b605b51a2f8dbad15b671a91ec49abad550a58c&selectedJob=104881595&bugfiler

[2] https://dxr.mozilla.org/mozilla-central/rev/a49112c7a5765802096b3fc298069b9495436107/testing/marionette/harness/marionette_harness/tests/unit/test_crash.py#108
Blocks: 1223277
Component: Marionette → JavaScript Engine
Flags: needinfo?(mcmanus)
Flags: needinfo?(honzab.moz)
Keywords: regression
OS: Unspecified → All
Product: Testing → Core
Hardware: Unspecified → All
Version: Version 3 → 55 Branch
honza will look at it
Flags: needinfo?(mcmanus)
(In reply to Henrik Skupin (:whimboo) from comment #1)
> As seen when running my try build [1] for bug 1223277 there is a recent
> regression in our content specific unit crash test [2] for Marionette. Here

Just to add, I do not have to attach my patch from bug 1223277 to reproduce it locally. I don't know why we do not see this failure on Treeherder, but maybe this is also related to artifact builds I locally do? The frequency of test failures increased a lot over the last days. Means with bug 1365307 landed it happened 1 out of 10 repeats (--repeat 10). But with latest mozilla-central it is close to fail each time I run this test.
Henrik, can you please recheck if this is still a problem after bug 1369496 has landed?

If yes, I'll start looking into it early next week.
Flags: needinfo?(hskupin)
Yes, it's still happening with an artifact build created from changeset https://hg.mozilla.org/mozilla-central/rev/1742b1bdadd1.

Maybe it's also somewhat related to the following assertions?

Assertion failed at /home/worker/workspace/build/src/gfx/cairo/cairo/src/cairo-hash.c:196: hash_table->live_entries == 0
WARNING: YOU ARE LEAKING THE WORLD (at least one JSRuntime and everything alive inside it, that is) AT JS_ShutDown TIME.  FIX THIS!
[Parent 56485] ###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt == 0', file /home/worker/workspace/build/src/xpcom/build/XPCOMInit.cpp, line 1042
[Parent 56485] WARNING: '!compMgr', file /home/worker/workspace/build/src/xpcom/components/nsComponentManagerUtils.cpp, line 63
Leaked URLs:

There are a huge amount of URLs which leak. 

And then I see:

Assertion failure: isEmpty() (failing this assertion means this LinkedList's creator is buggy: it should have removed all this list's elements before the list's destruction), at /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/LinkedList.h:455
#01: __cxa_finalize_ranges[/usr/lib/system/libsystem_c.dylib +0x60178]
#02: exit[/usr/lib/system/libsystem_c.dylib +0x604b2]
Flags: needinfo?(hskupin)
Thanks.  I'll take this bug, but as I said, I won't be able to start with it sooner than early next week.  Hope it's OK.
Assignee: nobody → honzab.moz
Flags: needinfo?(honzab.moz)
Sounds great. Thank you in advance.
I can't run the test locally:

"Timed out waiting for connection on localhost:2828"


To be honest, I don't believe I will be of much help here.  The code added in bug 1365307 doesn't touch any JS or linked list code, even distantly.  If the crash has already existed before, then this is just and only a timing issue or the code restructure by optimization so that the crash just occurs more often.

You should contact people responsible for the code that crashes, not someone who wrote a patch that happens to make it worse by clearly unrelated changes.


I'll try to clone and build the try changeset locally to get exactly the same build yet.  If I still fail to run the test or fail to reproduce the crash I'll return the bug back to the ring.

In the mean time, can you do a try push or a local test that will turn "network.http.throttle.enable" to false?  That disables some parts of the functionality introduced in bug 1365307, just check if there is any effect.

Thanks.
@ /c/Mozilla/src/try-5b605b51a2f8dbad15b671a91ec49abad550a58c
$ ./mach marionette-test testing/marionette/harness/marionette_harness/tests/unit/test_crash.py
 0:00.00 LOG: MainThread INFO Using workspace for temporary data: "c:\Mozilla\src\try-5b605b51a2f8dbad15b671a91ec49abad550a58c"
 0:00.00 LOG: MainThread mozversion INFO application_buildid: 20170613140333
 0:00.00 LOG: MainThread mozversion INFO application_display_name: Nightly
 0:00.00 LOG: MainThread mozversion INFO application_id: {ec8030f7-c20a-464f-9b0e-13a3a9e97384}
 0:00.00 LOG: MainThread mozversion INFO application_name: Firefox
 0:00.00 LOG: MainThread mozversion INFO application_remotingname: firefox
 0:00.00 LOG: MainThread mozversion INFO application_vendor: Mozilla
 0:00.00 LOG: MainThread mozversion INFO application_version: 55.0a1
 0:00.00 LOG: MainThread mozversion INFO platform_buildid: 20170613140333
 0:00.00 LOG: MainThread mozversion INFO platform_version: 55.0a1
 0:00.01 LOG: MainThread INFO Application command: c:/_/obj/try-5b605b51a2f8dbad15b671a91ec49abad550a58c/_obj-browser-debug\dist\bin\firefox.exe -no-remote -marionette -profile c:\users\xmayhe~1\appdata\local\temp\tmpri1vph.mozrunner
Error running mach:

    ['marionette-test', 'testing/marionette/harness/marionette_harness/tests/unit/test_crash.py']

The error occurred in code that was called by the mach command. This is either
a bug in the called code itself or in the way that mach is calling it.

You should consider filing a bug for this issue.

If filing a bug, please include the full output of mach, including this error
message.

The details of the failure are as follows:

IOError: Process has been unexpectedly closed (Exit code: 2147483651) (Reason: Timed out waiting for connection on localhost:2828!)

  File "c:\Mozilla\src\try-5b605b51a2f8dbad15b671a91ec49abad550a58c\testing/marionette/mach_commands.py", line 81, in run_marionette_test
    return run_marionette(tests, topsrcdir=self.topsrcdir, **kwargs)
  File "c:\Mozilla\src\try-5b605b51a2f8dbad15b671a91ec49abad550a58c\testing/marionette/mach_commands.py", line 59, in run_marionette
    failed = MarionetteHarness(MarionetteTestRunner, args=vars(args)).run()
  File "c:\Mozilla\src\try-5b605b51a2f8dbad15b671a91ec49abad550a58c\testing/marionette/harness\marionette_harness\runtests.py", line 72, in run
    runner.run_tests(tests)
  File "c:\Mozilla\src\try-5b605b51a2f8dbad15b671a91ec49abad550a58c\testing/marionette/harness\marionette_harness\runner\base.py", line 827, in run_tests
    self.marionette = self.driverclass(**self._build_kwargs())
  File "c:\Mozilla\src\try-5b605b51a2f8dbad15b671a91ec49abad550a58c\testing/marionette/client\marionette_driver\marionette.py", line 611, in __init__
    self.raise_for_port(timeout=self.startup_timeout)
  File "c:\Mozilla\src\try-5b605b51a2f8dbad15b671a91ec49abad550a58c\testing/marionette/client\marionette_driver\decorators.py", line 28, in _
    m._handle_socket_failure()
  File "c:\Mozilla\src\try-5b605b51a2f8dbad15b671a91ec49abad550a58c\testing/marionette/client\marionette_driver\decorators.py", line 23, in _
    return func(*args, **kwargs)
  File "c:\Mozilla\src\try-5b605b51a2f8dbad15b671a91ec49abad550a58c\testing/marionette/client\marionette_driver\marionette.py", line 697, in raise_for_port
    self.host, self.port))


Giving up.
Assignee: honzab.moz → nobody
(In reply to Henrik Skupin (:whimboo) from comment #5)
> Assertion failure: isEmpty() (failing this assertion means this LinkedList's
> creator is buggy: it should have removed all this list's elements before the
> list's destruction), at
> /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/LinkedList.
> h:455
> #01: __cxa_finalize_ranges[/usr/lib/system/libsystem_c.dylib +0x60178]
> #02: exit[/usr/lib/system/libsystem_c.dylib +0x604b2]

I from time to time am getting this assertion failure too at shutdown when running my local build as a regular browser.  I managed to reveal the element, it's nsSHistory pointing to doc shell with chrome://global/content/backgroundPageThumbs.xhtml as current URI.  The crashing process is the parent process.  I had the net logging on and all transactions were according it released.  So I don't see a connection with bug 1365307.
Honza, thank you for digging into the problem. When I regression tested the crash behavior it indeed started with your push. But maybe it's a side-effect including something else. Especially because the failure rate raised even more afterward, so a combination might be possible.

I will do some more investigation myself in the next days.
(In reply to Honza Bambas (:mayhemer) from comment #10)
> > Assertion failure: isEmpty() (failing this assertion means this LinkedList's
> > creator is buggy: it should have removed all this list's elements before the
> > list's destruction), at
> > /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/LinkedList.
> > h:455
> > #01: __cxa_finalize_ranges[/usr/lib/system/libsystem_c.dylib +0x60178]
> > #02: exit[/usr/lib/system/libsystem_c.dylib +0x604b2]

So this is definitely a different crash and may overlay the underlying problem I was trying to investigate on this bug. I will file a separate bug for this specific crash, and will come back to this one once fixed.
Depends on: 1373614
I was able to reproduce this crash by a try build for another patch I'm working on:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=de0f62d300600e1e6586f72bf516a217499efd9a

It reproduces constantly, so I will try if I can use this method to track down the regression range.
So this may actually only be a side-effect of the crash we force in that test. Because we do not mock the mozcrash module long enough, the original module picks up the minidump files and processes it. This should actually not happen, but is due to delayed minidump file creation. Maybe the delay got introduced by the changes on bug 1369496.

Anyway, on bug 1223277 I added code which extends the lifetime of the mozcrash mock, so that the mock fetches it correctly, and that this crash is no longer visible in the Treeherder logs. This should be fine enough.
Assignee: nobody → hskupin
No longer blocks: 1223277
Status: NEW → ASSIGNED
Depends on: 1223277
It might be fixed by the changes of bug 1223277, but I'm not fully sure. So marking as WFM for now.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.