Bug 859867 (Closed) - Opened 11 years ago, Closed 11 years ago

Need to check out one of the new linux hardware slaves to troubleshoot B2G emulator problems

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgriffin, Unassigned)

References

Details

Attachments

(2 files)

The new Ubuntu 12.04 hardware slaves do not cooperate with B2G tests:  https://tbpl.mozilla.org/?tree=Cedar&jobname=b2g_ics_armv7a_gecko_emulator_hw

The logs show that the emulator is failing to start.  I'll need to check out one of these new machines in order to figure out what's going on.
Depends on: 859893
talos-linux64-ix-004.test.releng.scl3.mozilla.com has been allocated for this purpose; I'll send you the information on how to connect.
So basically, the problem on these slaves has something to do with compiz. When I logged into this slave and tried to launch the emulator, it just hung hard. I noticed that compiz was taking 100% of one core.

After I killed the emulator and then killed compiz (which caused it to be automatically restarted), I could launch and kill the emulator successfully as many times as I wanted to, and compiz stayed at a reasonable CPU consumption (< 30%).

So either the window manager/Xorg starts out in some bad state, or the first attempt to launch the emulator puts it in one; either way, killing compiz is enough to "fix" it.

I don't really know how to troubleshoot it beyond this, not being an expert on Linux window manager configs.
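For illustration, a minimal sketch (assuming psutil is available on the slave; not something we actually run today) of a helper that would make this kind of CPU spin easy to spot:

import psutil

def busy_processes(threshold=90.0, interval=1.0):
    # Prime the per-process CPU counters, wait one sampling interval,
    # then report anything pinning (close to) a full core, e.g. compiz.
    procs = list(psutil.process_iter(attrs=["pid", "name"]))
    for p in procs:
        try:
            p.cpu_percent(None)
        except psutil.NoSuchProcess:
            pass
    psutil.cpu_percent(interval)
    for p in procs:
        try:
            usage = p.cpu_percent(None)
        except psutil.NoSuchProcess:
            continue
        if usage >= threshold:
            print("%6.1f%%  %5d  %s" % (usage, p.info["pid"], p.info["name"]))

if __name__ == "__main__":
    busy_processes()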
This sounds potentially relevant:

https://bugs.launchpad.net/ubuntu/+source/compiz/+bug/995118

Are the slaves configured as the default, to blank the screen after X seconds of inactivity?
(In reply to Jonathan Griffin (:jgriffin) from comment #3)
> This sounds potentially relevant:
> 
> https://bugs.launchpad.net/ubuntu/+source/compiz/+bug/995118
> 
> Are the slaves configured as the default, to blank the screen after X
> seconds of inactivity?

No, we disable that in 
http://hg.mozilla.org/build/puppet/file/6355249ba9a5/modules/gui/files/gsettings.gschema.override
Blocks: 820235
Logging into this VM later, compiz isn't running at all.  :(
(In reply to Jonathan Griffin (:jgriffin) from comment #6)
> Logging into this VM later, compiz isn't running at all.  :(

Can you try now? It turns out that some of the loaning steps are not documented properly for this platform yet...
jgriffin: did you successfully get access to this machine?
Flags: needinfo?(jgriffin)
Blocks: 850105
No longer blocks: 820235
(In reply to Chris Cooper [:coop] from comment #8)
> jgriffin: did you successfully get access to this machine?

I did; compiz was running again, and the emulator behaved fine. The only time I had any problems was the very first time I launched the emulator, at which point compiz got stuck at 100% CPU and the emulator hung.

It might be worthwhile to reboot this VM (I can't do it since I don't have sudo) and see if the problem recurs.
Flags: needinfo?(jgriffin)
I rebooted the machine. BTW, "sudo reboot" is allowed for cltbld.
I have still not been able to reproduce this problem, except on my very first attempt. After a reboot, the tests still function normally. Is it possible compiz gets into a weird state somehow when the test slave is first connected to buildbot?

The only things I can think of to help debug this are:

1] - add some output to the mozharness script to dump the process list, to see whether these failures are all caused by compiz misbehaving (something like ps -eo pcpu,pid,user,args | sort -k 1 -r | head -10; rough sketch below)

2] - add some debugging output to Marionette on cedar to see if the emulator is dumping any error messages to stdout that we're not seeing
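For concreteness, a rough sketch of 1] (hypothetical Python, not the actual mozharness change):

import subprocess

def log_top_processes(count=10):
    # Equivalent of: ps -eo pcpu,pid,user,args | sort -k 1 -r | head -10
    # In mozharness this would go through the script's own logging rather than print().
    output = subprocess.check_output(["ps", "-eo", "pcpu,pid,user,args"])
    lines = output.decode("utf-8", "replace").splitlines()
    header, rows = lines[0], lines[1:]
    rows.sort(key=lambda line: float(line.split(None, 1)[0]), reverse=True)
    for line in [header] + rows[:count]:
        print(line)

if __name__ == "__main__":
    log_top_processes()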
Depends on: 865837
I've just added some 'ps' logging to mozharness for this, and I've also bumped the emulator launch timeout on cedar from 60s to 180s.  We'll see if either of these give us more info.
emulator launch timeout increased:  https://hg.mozilla.org/projects/cedar/rev/f48348119595
(In reply to Jonathan Griffin (:jgriffin) from comment #12)
> I've just added some 'ps' logging to mozharness for this, and I've also
> bumped the emulator launch timeout on cedar from 60s to 180s.  We'll see if
> either of these give us more info.

This shows that compiz doesn't seem to be the problem:

16:57:55     INFO - %CPU   PID USER     COMMAND
16:57:55     INFO -  1.5  2655 cltbld   /tools/buildbot/bin/python scripts/scripts/b2g_emulator_unittest.py --cfg b2g/emulator_automation_config.py --test-suite crashtest --this-chunk 3 --total-chunks 3 --download-symbols true
16:57:55     INFO -  0.5  2658 root     [flush-8:0]
16:57:55     INFO -  0.4  1117 root     [kipmi0]
16:57:55     INFO -  0.1  2335 cltbld   compiz
16:57:55     INFO -  0.1  1921 root     X :0 -nolisten tcp
16:57:55     INFO -  0.0  2764 cltbld   ps -eo pcpu,pid,user,args
16:57:55     INFO -  0.0  2640 cltbld   /usr/lib/deja-dup/deja-dup/deja-dup-monitor
16:57:55     INFO -  0.0  2633 cltbld   /usr/bin/python /usr/lib/unity-scope-video-remote/unity-scope-video-remote
16:57:55     INFO -  0.0  2605 cltbld   /usr/lib/unity-lens-music/unity-musicstore-daemon
(In reply to Jonathan Griffin (:jgriffin) from comment #13)
> emulator launch timeout increased: 
> https://hg.mozilla.org/projects/cedar/rev/f48348119595

Increasing the timeout didn't help.
Added some more debugging output to cedar: https://hg.mozilla.org/projects/cedar/rev/8d27a357c892
(In reply to Jonathan Griffin (:jgriffin) from comment #16)
> Added some more debugging output to cedar:
> https://hg.mozilla.org/projects/cedar/rev/8d27a357c892

No help here either.

I have to assume this has something to do with a difference in how we're starting a session. I'm using ssh to the console and then invoking the emulator from there. Buildbot seems to be using a remote X session, but I can't quite tell what it's doing.  :rail, can you describe how buildbot invokes commands on the slave?
:rail, can you explain how buildbot connects to and invokes commands on the slave?  It seems like it's setting up a remote X session, but I'd like to know exactly how, so I can try to reproduce it that way.
Flags: needinfo?(rail)
Buildbot (buildslave, actually) starts from a running X session: Unity starts ~/.config/autostart/gnome-terminal.desktop, which starts gnome-terminal, which runs runslave.py (it talks to slavealloc and downloads the buildbot startup file). So, basically, buildbot inherits all environment variables from the existing X session.

See also:
http://hg.mozilla.org/build/puppet/file/e6e380e85cb1/modules/buildslave/manifests/startup/desktop.pp#l16
http://hg.mozilla.org/build/puppet/file/e6e380e85cb1/modules/buildslave/templates/gnome-terminal.desktop.erb
http://hg.mozilla.org/build/puppet/file/e6e380e85cb1/modules/buildslave/files/runslave.py

The best way to reproduce the existing environment would be VNCing and running gnome-terminal.

Let me know (or ping on IRC) if you need more details.
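For what it's worth, a minimal sketch (hypothetical helper, not part of the infra) for checking whether the two environments really match: run it once from the gnome-terminal session that buildbot inherits (via VNC) and once from the ssh session, then diff the two files.

import os
import sys

def dump_env(path):
    # Write the session's environment, sorted, so two dumps diff cleanly.
    with open(path, "w") as f:
        for key in sorted(os.environ):
            f.write("%s=%s\n" % (key, os.environ[key]))

if __name__ == "__main__":
    dump_env(sys.argv[1] if len(sys.argv) > 1 else "env-dump.txt")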
Flags: needinfo?(rail)
I did just that and the problem still doesn't reproduce.  Are there any other differences in how buildbot uses/configures the slave that could be relevant?
Flags: needinfo?(rail)
Assignee: nobody → jgriffin
jgriffin, did you try to reproduce the steps from the logs, including the same environment variables? We can try to reproduce this together; maybe I'll have some ideas...
Flags: needinfo?(rail)
Yes, I did set all the same environment variables, at least the ones that made sense.  If you want to try and reproduce it together, ping me on irc.
I managed to reproduce the issue mentioned in comment 2 (compiz eating 100% of a core).

I straced the compiz process, and it was calling sched_yield() in a loop.

I think the problem is the nvidia driver's default OpenGL yield behavior, which uses sched_yield().

We can change this behaviour by setting the __GL_YIELD env variable to "NOTHING" or "USLEEP". I tried both and they worked.

I'll try to reproduce the same on the slave loaned to jgriffin.
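A minimal sketch of that manual check (glxgears is just a stand-in here for the actual emulator launch command):

import os
import subprocess

# Copy the current environment and add the workaround before launching the
# GL client, so the nvidia driver stops busy-waiting in sched_yield().
env = os.environ.copy()
env["__GL_YIELD"] = "NOTHING"  # "USLEEP" also worked

proc = subprocess.Popen(["glxgears"], env=env)
# ...watch compiz's CPU usage while the client runs, then proc.terminate()...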
I couldn't repro this on my slave (even after a reboot), but that sounds like the problem.  Can we add this env variable to the slave config and see if it fixes this?
I would prefer to set this variable from mozharness to avoid possible issues with desktop Talos tests. I'll prep a patch.
Attached patch: mozharness env (Splinter Review)
Attachment #746956 - Flags: review?(aki)
Comment on attachment 746956 [details] [diff] [review]
mozharness env

I'm going to guess this line needs changing to

env = self.config.get('env', {})

http://hg.mozilla.org/build/mozharness/file/1287327b0f15/scripts/b2g_emulator_unittest.py#l351
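In other words, roughly this (a sketch only, with a hypothetical class standing in for the real mozharness script):

class EmulatorScriptSketch(object):
    """Hypothetical stand-in for the b2g emulator unittest script."""

    def __init__(self, config):
        self.config = config

    def emulator_env(self):
        # Start from the configured env (if any) rather than an empty dict,
        # then layer the __GL_YIELD workaround on top of it.
        env = dict(self.config.get('env', {}))
        env['__GL_YIELD'] = 'NOTHING'
        return env

print(EmulatorScriptSketch({'env': {'DISPLAY': ':0'}}).emulator_env())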
Attachment #746956 - Flags: review?(aki) → review+
Comment on attachment 746956 [details] [diff] [review]
mozharness env

(In reply to Aki Sasaki [:aki] from comment #27)
> Comment on attachment 746956 [details] [diff] [review]
> mozharness env
> 
> I'm going to guess this line needs changing to
> 
> env = self.config.get('env', {})
> 
> http://hg.mozilla.org/build/mozharness/file/1287327b0f15/scripts/
> b2g_emulator_unittest.py#l351

Ah, I missed that point. https://hg.mozilla.org/build/mozharness/rev/f08c264808e7
Attachment #746956 - Flags: checked-in+
Attachment #746956 - Flags: checked-in+ → checked-in-
I backed out this patch because it didn't really help. It turns out that the bug is not 100% reproducible: sometimes compiz behaves correctly, sometimes not. To make it work properly, we need to set the variable *before* compiz starts, as part of X startup. This change may affect desktop Talos results and requires some testing before we land it on all machines. Patch for puppet incoming.
Attached patch: puppet (Splinter Review)
I'm not going to deploy the patch before we test Talos.
Attachment #747431 - Flags: review?(dustin)
Comment on attachment 747431 [details] [diff] [review]
puppet

Review of attachment 747431 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/gui/manifests/init.pp
@@ +65,5 @@
>                  "/etc/X11/edid.bin":
>                      source => "puppet:///modules/${module_name}/edid.bin";
> +                "/etc/X11/Xsession.d/98nvidia":
> +                    content => "export __GL_YIELD=NOTHING\n",
> +                    notify => Service['x11'];

This should *definitely* have a comment with a brief explanation and a pointer to this bug :)

Fine with that change.
Attachment #747431 - Flags: review?(dustin) → review+
I don't think there's any need for this slave any longer; rail, let me know if there's anything left for me to do.
Flags: needinfo?(rail)
I filed bug 895425 to reimage the machine.
Flags: needinfo?(rail)
Assignee: jgriffin → nobody
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard