Closed Bug 980800 Opened 7 years ago Closed 6 years ago

Firefox startup crashes when trying to create a local socket via ctypes [@ pt_SetSocketOption ]

Categories

(Testing Graveyard :: Mozmill, defect)

All
macOS
defect
Not set
critical

Tracking

(firefox28 affected, firefox29 ?, firefox30 ?, firefox31 affected)

RESOLVED WONTFIX
Tracking Status
firefox28 --- affected
firefox29 --- ?
firefox30 --- ?
firefox31 --- affected

People

(Reporter: whimboo, Unassigned)

References

Details

(Keywords: crash, crashreportid, sec-other, Whiteboard: [mozmill][mm-osx-106-3 only])

Crash Data

For a while we have been having problems with our socket connection in Mozmill, which is used to communicate between the Python and extension parts of Jsbridge. Intermittently we see disconnects right after startup, when we initiate the server socket via ctypes and NSS:

https://github.com/mozilla/mozmill/blob/master/jsbridge/jsbridge/extension/resource/modules/Sockets.jsm#L98

As we noticed today, those disconnects are actually caused by Firefox crashing. We have submitted a crash report for now:

Crash report: bp-a51d2cc8-2af4-47b1-aff1-e0a072140307

Crash Reason 	EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE
Crash Address 	0xb500000

Here are the first stack frames (0 to 10):

0 	libnss3.dylib 	pt_SetSocketOption 	
1 	libmozglue.dylib 	arena_malloc 	memory/mozjemalloc/jemalloc.c
2 	XUL 	ffi_call 	js/src/ctypes/libffi/src/x86/ffi64.c
3 	XUL 	js::ctypes::FunctionType::Call 	js/src/ctypes/CTypes.cpp
4 	XUL 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/jscntxtinlines.h
5 	XUL 	js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) 	js/src/vm/Interpreter.cpp
6 	XUL 	js::DirectProxyHandler::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) 	js/src/jsproxy.cpp
7 	XUL 	js::CrossCompartmentWrapper::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) 	js/src/jswrapper.cpp
8 	XUL 	js::Proxy::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) 	js/src/jsproxy.cpp
9 	XUL 	proxy_Call 	js/src/jsproxy.cpp
10 	XUL 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/jscntxtinlines.h

We haven't found a reproducible pattern yet, but we will continue to investigate. For now I will restrict this bug as a security issue, given that the crash accesses an unexpected memory location.

Andreea, can you please check which platforms are affected here? Any startup disconnect for jsbridge should be this problem, which covers most of the failures we see at the moment.
Flags: needinfo?(andreea.matei)
Whiteboard: [mozmill] → [mozmill][qa-automation-blocked]
Crash Signature: [@ pt_SetSocketOption ]
I checked several machines that failed with jsbridge over the last 3 weeks and submitted crash reports, but only one other report, from February 14th on the same machine (mm-osx-109-4), had the same signature. So far only OS X is affected, on 10.9.2.
Flags: needinfo?(andreea.matei)
Given the callstack, the reason is almost certainly that the JS function you passed to C++ has been GCed. See the second big warning here: https://developer.mozilla.org/en-US/docs/Mozilla/js-ctypes/js-ctypes_reference/Callbacks
Sounds like this test code is using ctypes incorrectly, which isn't really a security issue. This should get moved to whatever component is appropriate for this test code, and can be opened up.
Keywords: sec-other
Assignee: nobody → nobody
Component: Libraries → Mozmill
Product: NSS → Testing
QA Contact: hskupin
Version: 3.15.6 → unspecified
Opening this bug up until I have more time to work on it.
Group: core-security
Bobby, so we are defining a NSS object which contains a helper method to call 'PR_SetSocketOption()':
https://github.com/mozilla/mozmill/blob/master/jsbridge/jsbridge/extension/resource/modules/NSS.jsm#L150

When we call that method I don't see that any callback is in use:
https://github.com/mozilla/mozmill/blob/master/jsbridge/jsbridge/extension/resource/modules/Sockets.jsm#L116

So I'm not sure I understand the problem. Could it mean that fd is invalid?
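
To make the sequence in question concrete, here is a rough analogue of the server socket setup using Python's standard `socket` module rather than NSS via ctypes. This is not the Mozmill code: the function name `open_local_server`, the loopback host, port 0, and the choice of options are all assumptions made for illustration. It only shows the general order of operations (create, check the descriptor, set options, bind, listen).

```python
import socket


def open_local_server(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket on the loopback interface.

    Hypothetical helper for illustration only. Port 0 asks the OS for
    any free port; both defaults are assumptions, not Mozmill's values.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # socket.socket() raises on failure, so this guard is mostly
        # illustrative: a closed socket reports -1 as its descriptor.
        if srv.fileno() < 0:
            raise OSError("invalid socket descriptor")
        # Set options only after the descriptor is known to be valid.
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.setblocking(False)
        srv.bind((host, port))
        srv.listen(1)
    except OSError:
        srv.close()
        raise
    return srv


if __name__ == "__main__":
    server = open_local_server()
    print("listening on", server.getsockname())
    server.close()
```

With a high-level wrapper like this, an invalid descriptor surfaces as an exception. A raw ctypes call gives no such protection, so the descriptor and the option structure both have to be right before the call is made.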
Flags: needinfo?(bobbyholley)
Oh yeah, my bad - I read the callstack backwards :P

Yeah, this looks like it should all work. Someone needs to fire up a debugger and figure out what's going on
Flags: needinfo?(bobbyholley)
Bobby, so we should move this back into the NSS component?

Andreea, have we ever seen this crash again? Hopefully with Mozmill 2.0.6 we will see the crash exposed on OS X now. Let's keep an eye on it. It would be good to get some reproducible steps.
(In reply to Henrik Skupin (:whimboo) from comment #7)
> Bobby, so we should move this back into the NSS component?

Hard to say. It's probably not js-ctypes, but it could be either NSS or Mozmill (using NSS incorrectly).

> Andreea, have we ever seen this crash again? Hopefully with Mozmill 2.0.6 we
> will see the crash exposed on OS X now. Let's keep an eye on it. It would be
> good to get some reproducible steps.

Yeah, STR would be good here.
Duplicate of this bug: 993300
Firefox crashed today on our mm-osx-107-4 with the exact same signature (Aurora de).
https://crash-stats.mozilla.com/report/index/b3d18e86-97d9-45de-9166-4f6812140520

It happened with a functional testrun when running this test: tests/functional/restartTests/testAddons_changeTheme/test2.js
Failed twice on mm-osx-107-4, after tests/functional/restartTests/testAddons_changeTheme/test1.js
https://crash-stats.mozilla.com/report/index/0d1f5ab5-834e-44c8-97ba-711902140610

I tried to reproduce it by running the changeTheme tests in a loop (x10) and by running a complete testrun on the affected node.
Fairly low volume, which actually doesn't block us from testing with Mozmill.
Whiteboard: [mozmill][qa-automation-blocked] → [mozmill]
We've had a recent surge in crashes. All of them point to a fault in a XUL library.

Seems FirefoxOS is also experiencing lots of crashes with today's build. They received a large merge from mc with the following pushlog:
http://hg.mozilla.org/integration/b2g-inbound/pushloghtml?fromchange=4355feecf4bd&tochange=f4e8988b3881

Something from that pushlog is likely to be the regressor.
The new app bundle structure for v2 signing landed on m-c yesterday. If we have a much higher crash rate due to this change, we should get this investigated while we have the chance. So maybe bug 1047584 plays a role here.
(In reply to Andrei Eftimie from comment #15)
> Seems FirefoxOS is also experiencing lots of crashes with today's build. They
> received a large merge from mc with the following pushlog:

Wait. What do you mean by Firefox OS here? And which version on which platforms? Any links? Please give more details.
(In reply to Henrik Skupin (:whimboo) from comment #17)
> (In reply to Andrei Eftimie from comment #15)
> > Seems FirefoxOS is also experiencing lots of crashes with today's build. They
> > received a large merge from mc with the following pushlog:
> 
> Wait. What do you mean by Firefox OS here? And which version on which
> platforms? Any links? Please give more details.

Found it: bug 1075387

Not sure if it's clear from the report, but they see crashes with multiple signatures, in different places in their tests, all pointing to libxul.so (which is what we've seen in the last week).
Libxul contains nearly everything, so there is only a minimal relation to this issue. If we see some of those other issues, mark the filed bugs appropriately (cc me) or file new ones. This one is separate.
I couldn't reproduce the crash by running the test about 50 times and the testrun about 5 times on the affected machine. I have put it back online now.
Firefox crashed 9 times on the last beta today with the same signature, all on the same machine (see the link in comment 14).

This may be related to the machine's configuration, but I'm not sure. Given that it always happens on the same test, I'll keep trying to reproduce it.
A good hint with mm-osx-106-3 here. I think it's a good idea to concentrate on it. Btw. does it vary when it crashes? Or is it always happening after the exact same test?
Whiteboard: [mozmill] → [mozmill][mm-osx-106-3 only]
Duplicate of this bug: 1074929
We haven't seen this crash for a long time, and it's unlikely that we would fix it in Mozmill proper given that we are transitioning to Marionette now. Closing as wontfix.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Duplicate of this bug: 1223302
For reference, this is due to mozmill using structure declarations in ctypes that don't actually match the C structures on 64-bit platforms, so PR_SetSocketOption tries to read the new option value from after the end of a heap allocation, which sometimes crashes and otherwise just reads garbage; see bug 1223302 comment #9.  If mozmill is sufficiently end-of-life that that's not worth fixing then this could stay closed.
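
The kind of mismatch described above can be sketched with Python's `ctypes`, which has the same failure mode as js-ctypes. The structure names and fields below are hypothetical stand-ins, not the real NSPR declarations and not Mozmill's definitions. They only show how a declaration that is too narrow produces an allocation smaller than what the C callee expects to read.

```python
import ctypes


# Hypothetical stand-ins; these are NOT the real NSPR declarations.
class NarrowOptionData(ctypes.Structure):
    """Declaration as a binding might incorrectly write it: two 32-bit fields."""
    _fields_ = [("option", ctypes.c_int32),
                ("value", ctypes.c_int32)]


class _OptionValue(ctypes.Union):
    """Union whose widest member is pointer-sized, as a C header might define it."""
    _fields_ = [("flag", ctypes.c_int32),
                ("buffer_size", ctypes.c_size_t)]


class FullOptionData(ctypes.Structure):
    """Layout the C callee is assumed to expect."""
    _fields_ = [("option", ctypes.c_int32),
                ("value", _OptionValue)]


def bytes_short(declared, expected):
    """Return how many bytes the callee could read beyond the caller's allocation."""
    return max(0, ctypes.sizeof(expected) - ctypes.sizeof(declared))


if __name__ == "__main__":
    print("declared size:", ctypes.sizeof(NarrowOptionData))
    print("expected size:", ctypes.sizeof(FullOptionData))
    print("value offsets:", NarrowOptionData.value.offset,
          FullOptionData.value.offset)
    print("shortfall:", bytes_short(NarrowOptionData, FullOptionData))
```

On a typical 64-bit build the pointer-sized union member forces 8-byte alignment, so the expected structure is larger than the declared one and its `value` field sits at a different offset. On a 32-bit build the two layouts in this sketch coincide, which would be consistent with a problem that only shows up on 64-bit platforms.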
Yes, Mozmill reached its EOL a while ago. No further releases will be made. If tests are still using Mozmill and they broke due to changes in Firefox, they will need to be reimplemented using Marionette.
Product: Testing → Testing Graveyard