Closed Bug 1127304 Opened 7 years ago Closed 7 years ago

Flame device fails to start after flashing latest build M-C and M-I

Categories

(Core :: Canvas: WebGL, defect)

ARM
Gonk (Firefox OS)
defect
Not set
blocker

Tracking

()

VERIFIED FIXED
blocking-b2g 2.5+
Tracking Status
b2g-master --- verified

People

(Reporter: RobertC, Unassigned)

References

Details

(Keywords: qablocker, regression, smoketest, Whiteboard: [fromAutomation])

Attachments

(1 file)

Attached file logcat.txt
After flashing the latest mozilla-inbound build, the device fails to start.
We tried restarting the device and running adb reboot, but the device still fails to start.

Device info:
Flame 319MB v18D-1

Regression info from Jenkins sanity runs.

Last working:
Device firmware (base) 	L1TC100118D0
Device firmware (date) 	29 Jan 2015 00:20:38
Device firmware (incremental) 	eng.cltbld.20150129.032025
Device firmware (release) 	4.4.2
Device identifier 	flame
Gaia date 	28 Jan 2015 10:25:55
Gaia revision 	9d2378a9ef09
Gecko build 	20150128235034
Gecko revision 	e5c85f765f2d
Gecko version 	38.0a1

First broken:
Device firmware (base) 	L1TC100118D0
Device firmware (date) 	29 Jan 2015 01:18:49
Device firmware (incremental) 	eng.cltbld.20150129.041838
Device firmware (release) 	4.4.2
Device identifier 	flame
Gaia date 	28 Jan 2015 10:25:55
Gaia revision 	9d2378a9ef09
Gecko build 	20150129005832
Gecko revision 	fc58ed477ccb
Gecko version 	38.0a1

It looks like between these builds there were only gecko changes.
gecko diff:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=e5c85f765f2d&tochange=fc58ed477ccb
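The comparison above can be sketched in a few lines of Python. This is only an illustration, not part of the Jenkins tooling: the helper names `changed_fields` and `pushlog_url` are hypothetical, and only the source-revision fields from the two reports are included (the firmware date and incremental stamps are left out because they change with every build).

```python
from urllib.parse import urlencode

# Subset of the "Last working" / "First broken" reports above.
last_working = {
    "Gaia revision": "9d2378a9ef09",
    "Gecko build": "20150128235034",
    "Gecko revision": "e5c85f765f2d",
    "Gecko version": "38.0a1",
}
first_broken = {
    "Gaia revision": "9d2378a9ef09",
    "Gecko build": "20150129005832",
    "Gecko revision": "fc58ed477ccb",
    "Gecko version": "38.0a1",
}


def changed_fields(good, bad):
    """Return {field: (good_value, bad_value)} for every field that differs."""
    return {key: (good[key], bad.get(key)) for key in good if good[key] != bad.get(key)}


def pushlog_url(repo, good_rev, bad_rev):
    """Build a pushlog URL of the same shape as the one in the report."""
    query = urlencode({"fromchange": good_rev, "tochange": bad_rev})
    return f"http://hg.mozilla.org/{repo}/pushloghtml?{query}"


diff = changed_fields(last_working, first_broken)
url = pushlog_url(
    "integration/mozilla-inbound",
    last_working["Gecko revision"],
    first_broken["Gecko revision"],
)
print(sorted(diff))
print(url)
```

Only the Gecko build and Gecko revision differ in this subset, which matches the observation that Gaia did not change between the two builds.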
Keywords: smoketest
This is now in mozilla-central:

Last good:
Device firmware (base) 	L1TC100118D0
Device firmware (date) 	28 Jan 2015 18:52:23
Device firmware (incremental) 	eng.cltbld.20150128.215212
Device firmware (release) 	4.4.2
Device identifier 	flame
Gaia date 	28 Jan 2015 10:25:55
Gaia revision 	9d2378a9ef09
Gecko build 	20150128183748
Gecko revision 	6bfc0e1c4b29
Gecko version 	38.0a1

First bad:
Device firmware (base) 	L1TC100118D0
Device firmware (date) 	29 Jan 2015 06:43:07
Device firmware (incremental) 	eng.cltbld.20150129.094256
Device firmware (release) 	4.4.2
Device identifier 	flame
Gaia date 	28 Jan 2015 10:25:55
Gaia revision 	9d2378a9ef09
Gecko build 	20150129060452
Gecko revision 	a98d16e6a3b4
Gecko version 	38.0a1

As stated above, after flashing with 20150129060452 the device won't start.
Summary: Flame fails to start after flashing with latest m-i build → Flame device fails to start after flashing latest build M-C and M-I
[Blocking Requested - why for this release]:
Will fail smoketests and automation.

http://hg.mozilla.org/integration/mozilla-inbound/rev/ea243bbbb45c
looking through the pushlogs, could this be the culprit? Bob Owens can you take a look?
blocking-b2g: --- → 3.0?
Flags: needinfo?(bobowen.code)
(In reply to Peter Bylenga [:PBylenga] from comment #3)
> [Blocking Requested - why for this release]:
> Will fail smoketests and automation.
> 
> http://hg.mozilla.org/integration/mozilla-inbound/rev/ea243bbbb45c
> looking through the pushlogs, could this be the culprit? Bob Owens can you
> take a look?

Hmm, all of that patch should be within Windows #ifdefs, so I don't see how it could have affected B2G.
Flags: needinfo?(bobowen.code)
I rebuilt with ea243bbbb45c backed out but that didn't fix the crash.
Device lab is essentially fully down with black-screened phones. I'll dig further to make sure it's from this, but seems highly likely.
We have a busted nightly environment today for FxOS, and QA is completely blocked. Can we get some engineering help to find the culprit, keeping the pushlog from comment #2 in mind?

Note: QA is trying to bisect further, and it looks like http://hg.mozilla.org/integration/mozilla-inbound/rev/ea243bbbb45c may not be the culprit.
Alright, so we crash at:
I/Gecko   (13287): RUNTIME ASSERT: Uninitialized GL function: fEGLImageTargetTexture2D
That comes from GLContext.h's ASSERT_SYMBOL_PRESENT, which is called for this function here:
http://mxr.mozilla.org/mozilla-central/source/gfx/gl/GLContext.h#2354

So we're working with a GLContext that has some uninitialized functions, for some reason.
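As a rough illustration of the failure mode, here is a toy Python model of a table of dynamically loaded entry points with a presence check before each call. It is an analogy only, not Gecko's C++ code: the `SymbolTable` class and its methods are hypothetical, and the real check is the ASSERT_SYMBOL_PRESENT macro in GLContext.h.

```python
class SymbolTable:
    """Toy stand-in for a table of dynamically loaded GL entry points."""

    def __init__(self, required):
        # Every required symbol starts out unloaded (None).
        self._symbols = {name: None for name in required}

    def load(self, available):
        """Fill in each required symbol that the (hypothetical) driver exposes."""
        for name in self._symbols:
            self._symbols[name] = available.get(name)

    def missing(self):
        """List required symbols that are still unloaded."""
        return [name for name, fn in self._symbols.items() if fn is None]

    def call(self, name, *args):
        """Call a symbol, failing loudly if it was never initialized."""
        fn = self._symbols[name]
        if fn is None:
            raise RuntimeError(f"Uninitialized GL function: {name}")
        return fn(*args)


table = SymbolTable(["fBindTexture", "fEGLImageTargetTexture2D"])
# Assumption for the example: the loader only found one of the two symbols.
table.load({"fBindTexture": lambda target, texture: "bound"})
print(table.missing())
```

In this model the error only appears when the missing function is first called, which is why a check at load time (such as `missing()`) points at the cause sooner than a check at the call site.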
An uninitialized GL function points to https://hg.mozilla.org/mozilla-central/rev/176166c0bae9 (bug 1124394), which reworks GLContext.h and .cpp a bit, as the likely cause.
Flags: needinfo?(jgilbert)
I confirm that backing out https://hg.mozilla.org/mozilla-central/rev/176166c0bae9 fixes the issue.
I backed it out and pushed the backout to inbound in https://hg.mozilla.org/integration/mozilla-inbound/rev/b556a1f684ed

I'll merge that backout around before the next b2g nightlies get scheduled in an hour and a half.
In the future, run a DEBUG build if you're hitting runtime asserts. Runtime asserts mean something is seriously messed up. Debug asserts have a good chance of telling us what.
Flags: needinfo?(jgilbert)
The attached logcat has the error. The logcat was attached when the bug was created.
Automation requires the engineering build to be used (Marionette).

I think the engineering build has a limited debug output log?
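For anyone scanning a logcat for this kind of failure, a minimal Python sketch follows. It assumes the log uses the same line layout as the assert quoted earlier in this bug (`I/Gecko   (13287): RUNTIME ASSERT: ...`); the function name `find_runtime_asserts` and the second sample line are made up for illustration.

```python
import re

# Matches lines shaped like: "I/Gecko   (13287): RUNTIME ASSERT: <message>"
ASSERT_RE = re.compile(
    r"^(?P<level>[A-Z])/(?P<tag>\S+)\s*\(\s*(?P<pid>\d+)\):\s*"
    r"RUNTIME ASSERT:\s*(?P<message>.*)$"
)


def find_runtime_asserts(lines):
    """Return (tag, pid, message) for every RUNTIME ASSERT line found."""
    hits = []
    for line in lines:
        match = ASSERT_RE.match(line.strip())
        if match:
            hits.append((match["tag"], int(match["pid"]), match["message"]))
    return hits


sample = [
    "I/Gecko   (13287): RUNTIME ASSERT: Uninitialized GL function: fEGLImageTargetTexture2D",
    "I/Gecko   (13287): some unrelated line",  # hypothetical filler line
]
hits = find_runtime_asserts(sample)
print(hits)
```

To run it against a saved log, pass the open file object instead of `sample`, for example `find_runtime_asserts(open("logcat.txt"))`.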
QAWanted to check the next nightly M-C after 4pm when it's available.
Keywords: qaurgent, qawanted
QA Contact: pcheng
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #14)
> The attached logcat has the error.  the logcat was attached at the creation
> of the bug.
> Automation requires the engineering build to be used (Marionette).
> 
> I think the engineering build has a limited debug output log?

I'm not clear on what an engineering build is. Does it include a full DEBUG build of gecko? That is what we need.
(In reply to Peter Bylenga [:PBylenga] from comment #15)
> QAWanted to check the next nightly M-C after 4pm when it's available.

Verified that issue is fixed on latest nightly M-C after 4pm. Device can be flashed and enters FTU.

Device: Flame (nightly user build, 319MB mem)
BuildID: 20150129160230
Gaia: 8238eeacc7030b2cdbf7ab4eba2f36779b702599
Gecko: 29b05d283b00
Gonk: e7c90613521145db090dd24147afd5ceb5703190
Version: 38.0a1 (3.0 master) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:38.0) Gecko/38.0 Firefox/38.0
Status: NEW → RESOLVED
Closed: 7 years ago
Keywords: qaurgent, qawanted
Resolution: --- → FIXED
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Status: RESOLVED → VERIFIED
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
(In reply to Jeff Gilbert [:jgilbert] from comment #16)
> I'm not clear on what an engineering build is. Does it include a full DEBUG
> build of gecko? That is what we need.

No, we usually don't run DEBUG builds on devices because they are too slow (it's only bearable on high-end devices like the N5, but not on a Flame). If you really need it, I can do a debug build with the patch that was backed out and give you a stack trace.
(In reply to Fabrice Desré [:fabrice] from comment #18)
> No, we usually don't run DEBUG builds on devices because they are too slow
> (it's only bearable on high-end devices like the N5, but not on a Flame). If
> you really need it, I can do a debug build with the patch that was backed out
> and give you a stack trace.

It's not a stack trace I need, it's the additional asserts. Just because it seems to run properly in opt builds doesn't mean it's not broken under the surface. Ignoring debug builds is a huge hit to the confidence of a clean run.

DEBUG+OPT builds are fine. Getting assert coverage is the important part. I really do understand that working with slow devices sucks, but skipping DEBUG builds completely is risky.

Since we've already backed it out, I'm going to be doing these investigations myself anyway, so there's no need for you to duplicate work here.

I want the results from a DEBUG run because problems like this are often caused by situations that our sanity asserts explicitly catch. If that is the case here, I would have been able to recommend a fix immediately, instead of having to slosh this in and out of landing again.
(In reply to Jeff Gilbert [:jgilbert] from comment #19)
> It's not a stack trace I need, it's the additional asserts. Just because it
> seems to run properly in opt builds doesn't mean it's not broken under the
> surface. Ignoring debug builds is a huge hit to the confidence of a clean
> run.
> 
> DEBUG+OPT builds are fine. Getting assert coverage is the important part. I
> really do understand that working with slow devices sucks, but skipping
> DEBUG builds completely is risky.
> 
> Since we've already backed it out, I'm going to be doing these
> investigations myself anyway, so there's no need for you to duplicate work
> here.
> 
> I want the results from a DEBUG run because problems like this are often
> caused by situations that our sanity asserts explicitly catch. If that is the
> case here, I would have been able to recommend a fix immediately, instead of
> having to slosh this in and out of landing again.

Thank you for explaining why you need the DEBUG version run.

Based on what you're asking, it seems like this should be handled on a case-by-case basis; QA can't run DEBUG builds all the time because each run is pretty costly. Automation runs with a partial debug configuration: engineering builds are based on VARIANT=eng, which has some DEBUG=1 settings. I can't recall what levels and where; it's flagged at build time in the build script.

Having said that, I saw:
https://pvtbuilds.mozilla.org/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-flame-kk-eng-debug/
According to the log, the build script contains B2G_DEBUG=1

When asked, we could probably run these.
Ok, great, thanks for the clarification!
We do run automated monkey tests against debug builds from the b2g-inbound branch. These picked up the crash, but there's not a lot of information provided. Perhaps we can make these more useful, or add some sanity tests based on debug builds. Where would we see the information on the additional asserts? Would they be in the logcat?
Flags: needinfo?(jgilbert)
Moving the smoketest blocker bug to the component where the regression came from.
blocking-b2g: 2.5? → 2.5+
Component: General → Canvas: WebGL
Product: Firefox OS → Core
Flags: needinfo?(jgilbert)