Closed Bug 946178 Opened 11 years ago Closed 11 years ago

Trees closed due to Marionette Bustage - | test_outgoing_radio_off.js | InvalidResponseException: Could not successfully complete transport of message to Gecko, socket closed? or AttributeError: 'NoneType' object has no attribute 'close'

Categories

(Remote Protocol :: Marionette, defect)

x86
Gonk (Firefox OS)
defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Unassigned)


https://tbpl.mozilla.org/php/getParsedLog.php?id=31435030&tree=B2g-Inbound

b2g_emulator_vm b2g-inbound opt test marionette-webapi on 2013-12-04 01:59:40 PST for push 7ecc3f3f84b3

slave: tst-linux64-ec2-489

TEST-UNEXPECTED-FAIL | test_outgoing_radio_off.js | InvalidResponseException: Could not successfully complete transport of message to Gecko, socket closed?
TEST-UNEXPECTED-FAIL | test_outgoing_badNumber.js | InvalidResponseException: Could not successfully complete transport of message to Gecko, socket closed?
TEST-UNEXPECTED-FAIL | test_outgoing_busy.js | InvalidResponseException: Could not successfully complete transport of message to Gecko, socket closed?

investigating
Summary: Intermittent TEST-UNEXPECTED-FAIL | test_outgoing_radio_off.js | InvalidResponseException: Could not successfully complete transport of message to Gecko, socket closed? → Intermittent TEST-UNEXPECTED-FAIL | test_outgoing_radio_off.js | InvalidResponseException: Could not successfully complete transport of message to Gecko, socket closed? or AttributeError: 'NoneType' object has no attribute 'close'
I spent some time looking into this. The first signs of this problem appeared around 2am, after the push for https://hg.mozilla.org/integration/b2g-inbound/rev/253e97c9c3f0

However, this does not seem to be the cause of the problems: that changeset was not part of the merge from b2g-i hours earlier, yet m-c also shows the problem.

As an example:

05:05 < Tomcat|sheriffduty> https://tbpl.mozilla.org/?tree=B2g-Inbound&rev=263f538a5509 is the b2g-i cset one push after the merge 
05:05 < Tomcat|sheriffduty> marionette test green
05:06 < Tomcat|sheriffduty> https://tbpl.mozilla.org/?rev=9688476c1544 is the cset of the merge - marionette red

Even mozilla-inbound now shows this error. Could this be some kind of infra-related issue that has been hitting us from around 2am until now?

Trees are closed
Severity: normal → blocker
Summary: Intermittent TEST-UNEXPECTED-FAIL | test_outgoing_radio_off.js | InvalidResponseException: Could not successfully complete transport of message to Gecko, socket closed? or AttributeError: 'NoneType' object has no attribute 'close' → Trees closed due to Marionette Bustage - | test_outgoing_radio_off.js | InvalidResponseException: Could not successfully complete transport of message to Gecko, socket closed? or AttributeError: 'NoneType' object has no attribute 'close'
b2g26 appears to be unaffected
Looks like the b2g process crashes while the tests are running. From https://tbpl.mozilla.org/php/getParsedLog.php?id=31439005&full=1&branch=mozilla-inbound#error0 as an example: 

03:49:17     INFO -  12-04 06:46:33.869    45    45 I Gecko   : MARIONETTE LOG: INFO: == Test Start ==
03:49:17     INFO -  12-04 06:46:33.879    45    45 I Gecko   : MobileConnection initialized
03:49:17     INFO -  12-04 06:46:33.899    45    45 I Gecko   : MARIONETTE TEST RESULT:TEST-PASS | test_outgoing_radio_off.js | connection is instanceof [object MozMobileConnection] - true was true, expected true
03:49:17    ERROR -  12-04 06:46:33.989    45    45 F libc    : Fatal signal 11 (SIGSEGV) at 0x0000002d (code=-6)
03:49:17    ERROR -  This usually indicates the B2G process has crashed
03:49:17     INFO -  12-04 06:46:34.408    33    33 I ServiceManager: service 'media.resource_manager' died
03:49:17     INFO -  12-04 06:46:34.499    37    37 I DEBUG   : debuggerd committing suicide to free the zombie!


You see the test starting, it does a check, then suddenly the b2g process crashes. If Marionette itself had a bug, it would display a related error in the logs, and it doesn't. This looks like a change was made somewhere else in the b2g process.
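For anyone triaging similar logs, here is a minimal sketch of the same reading done mechanically: it walks the logcat lines and reports which test last emitted a result before the first fatal signal. The function name and regexes are my own assumptions based on the excerpt above, not part of Marionette or the TBPL log parser.

```python
import re

# Patterns inferred from the log excerpt in this bug (an assumption, not a spec).
FATAL = re.compile(r"Fatal signal (\d+) \((\w+)\)")
TEST_RESULT = re.compile(r"MARIONETTE TEST RESULT:(TEST-[A-Z-]+) \| (\S+) \|")


def last_test_before_crash(lines):
    """Return (test_name, signal_name) for the first fatal signal found.

    test_name is the test that most recently logged a result, or None if
    no result was seen yet. Returns None when there is no fatal signal.
    """
    last_test = None
    for line in lines:
        result = TEST_RESULT.search(line)
        if result:
            last_test = result.group(2)
            continue
        fatal = FATAL.search(line)
        if fatal:
            return last_test, fatal.group(2)
    return None


# Lines copied from the log excerpt above.
SAMPLE_LOG = [
    "12-04 06:46:33.869    45    45 I Gecko   : MARIONETTE LOG: INFO: == Test Start ==",
    "12-04 06:46:33.899    45    45 I Gecko   : MARIONETTE TEST RESULT:TEST-PASS | "
    "test_outgoing_radio_off.js | connection is instanceof "
    "[object MozMobileConnection] - true was true, expected true",
    "12-04 06:46:33.989    45    45 F libc    : Fatal signal 11 (SIGSEGV) at 0x0000002d (code=-6)",
]

if __name__ == "__main__":
    print(last_test_before_crash(SAMPLE_LOG))
```

This only narrows down which test was running. It says nothing about which component crashed.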
Judging from the test and its output, the code triggering the error is likely in this function: http://mxr.mozilla.org/mozilla-central/source/dom/telephony/test/marionette/test_outgoing_radio_off.js#20 but before onradiostatechanged is called (because otherwise we'd get some additional output). So the likely culprit is where it uses mozMobileConnection to do some RIL stuff: either this mozMobileConnection WebAPI call or some RIL-layer code.
tomcat, do we have a good idea of what checkin caused the crash to start happening?
Flags: needinfo?(cbook)
Tomcat's gone for the day, but no, we don't.
Flags: needinfo?(cbook)
FYI, I disabled test_outgoing_radio_off.js on m-c and am waiting for the tests to finish. If that works, it will at least allow us to reopen until someone who knows this test better can investigate.

https://hg.mozilla.org/mozilla-central/rev/b2b20bc6576a
The failure just moved to the next test, so we're still stuck.
https://tbpl.mozilla.org/php/getParsedLog.php?id=31446693&tree=Mozilla-Central
At Clint's suggestion, I diffed sources.xml between a good build and a bad one to see if it's another external repo causing problems. However, the only differences I'm seeing between good and bad builds are the Gecko and Gaia revisions. Are there external repos not covered by sources.xml?
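In case it helps someone repeat the comparison, a rough sketch of that diff follows. It assumes sources.xml is a repo-style manifest with `<project name="..." revision="..."/>` elements. The function names and the sample manifests (including the revision values) are hypothetical placeholders, not data from the actual builds.

```python
import xml.etree.ElementTree as ET


def project_revisions(manifest_xml):
    """Map project name -> revision for every <project> in a manifest."""
    root = ET.fromstring(manifest_xml)
    return {p.get("name"): p.get("revision") for p in root.iter("project")}


def diff_manifests(good_xml, bad_xml):
    """Return {project: (good_revision, bad_revision)} for projects that differ.

    A project present in only one manifest shows up with None on the other side.
    """
    good = project_revisions(good_xml)
    bad = project_revisions(bad_xml)
    return {
        name: (good.get(name), bad.get(name))
        for name in sorted(set(good) | set(bad))
        if good.get(name) != bad.get(name)
    }


# Hypothetical manifests; project names and revisions are placeholders.
GOOD = """<manifest>
  <project name="gaia" path="gaia" revision="aaa111"/>
  <project name="gonk-misc" path="gonk-misc" revision="ccc333"/>
</manifest>"""

BAD = """<manifest>
  <project name="gaia" path="gaia" revision="bbb222"/>
  <project name="gonk-misc" path="gonk-misc" revision="ccc333"/>
</manifest>"""

if __name__ == "__main__":
    for name, (old, new) in diff_manifests(GOOD, BAD).items():
        print(f"{name}: {old} -> {new}")
```

For real files, reading them with `open(path, "rb").read()` and passing the bytes in should work, since `ET.fromstring` accepts bytes. Anything not listed in the manifest is invisible to this comparison, which is exactly the open question above.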
Disabling all tests that use setRadioEnabled.

https://hg.mozilla.org/mozilla-central/rev/9906961b21af
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #10)
> At Clint's suggestion, I diffed sources.xml between a good build and a bad
> one to see if it's another external repo causing problems. However, the only
> differences I'm seeing between good and bad builds are the Gecko and Gaia
> revisions. Are there external repos not covered by sources.xml?

Not that I'm aware of.
Generating new emulator builds on previously-green changesets is producing the same failures, so this is definitely not tied to a particular recent Gecko change.
We're pretty confident at this point that there's an underlying B2G issue here deeper than Gecko. I don't think we want to wait for TPE to wake up before investigating further. Can you please find someone to help?
Flags: needinfo?(overholt)
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #15)
> Disabling all tests that use setRadioEnabled.
> 
> https://hg.mozilla.org/mozilla-central/rev/9906961b21af

Still busted.
https://tbpl.mozilla.org/php/getParsedLog.php?id=31450341&tree=Mozilla-Central
The crashes are reproducible when running the TBPL builds locally, so that rules out an infra issue.
This try job *might* give a crash stack:
https://tbpl.mozilla.org/?tree=Try&rev=5f5b7c060570

I couldn't reproduce the crash locally, but someone who can could apply it for faster results.
Jgriffin applied this locally and it looks like no minidumps are being generated. He verified that crashreporting is enabled.
Depends on: 933203
FYI, the real proof is in the pudding, so we'll see if these tests go green again after the above backout.
The backout worked. Trees reopened at 17:11 MVT.
Flags: needinfo?(overholt)
Oh, turns out that the "gaia-revlink" that we tinderboxprint is only suitable for confusing the crap out of you - that's the push before the backout, but that URL leads to a display which has the summary of whatever happens to be the tip commit prominently featured at the top, the better to confuse you with.
I'm trying to understand the process here. If the backout worked, shouldn't this bug be closed, or are we waiting on re-enabling something?
Marking this as fixed, since the fix for the problem is being worked on in the backout bug 933203.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Testing → Remote Protocol