Closed
Bug 859867
Opened 11 years ago
Closed 11 years ago
Need to check out one of the new linux hardware slaves to troubleshoot B2G emulator problems
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jgriffin, Unassigned)
References
Details
Attachments
(2 files)
mozharness env: 904 bytes, patch (mozilla: review+, rail: checked-in-)
puppet: 1.26 KB, patch (dustin: review+, rail: checked-in+)
The new Ubuntu 12.04 hardware slaves do not cooperate with B2G tests: https://tbpl.mozilla.org/?tree=Cedar&jobname=b2g_ics_armv7a_gecko_emulator_hw The logs show that the emulator is failing to start. I'll need to check out one of these new machines in order to figure out what's going on.
Comment 1•11 years ago
talos-linux64-ix-004.test.releng.scl3.mozilla.com has been allocated for this purpose, I'll send you the information on how to connect.
Reporter
Comment 2•11 years ago
So basically the problem on these slaves has something to do with compiz. When I logged into this slave, and tried to launch the emulator, it just hung hard. I noticed that compiz was taking 100% of 1 core. After I killed the emulator, and killed compiz (which caused it to be automatically restarted), I could launch and kill the emulator successfully as many times as I wanted to, and compiz stayed at a reasonable CPU consumption (< 30%). So either the window manager/Xorg is in some bad state initially, or attempting to launch the emulator puts it in a bad state initially, but killing compiz is enough to "fix" it. I don't really know how to troubleshoot it beyond this, not being an expert at linux window manager configs.
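The triage described above can be sketched as a couple of shell commands (standard procps/coreutils tools; this is a reconstruction of the manual steps from the comment, not a script from this bug):

```shell
#!/bin/sh
# Show the busiest processes; a wedged compiz sits at ~100% of one core.
ps -eo pcpu,pid,user,args | sort -k 1 -nr | head -10

# Killing compiz lets the session manager respawn it in a clean state,
# after which the emulator launches normally per the comment.
pkill -x compiz || true
```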
Reporter
Comment 3•11 years ago
This sounds potentially relevant: https://bugs.launchpad.net/ubuntu/+source/compiz/+bug/995118 Are the slaves configured as the default, to blank the screen after X seconds of inactivity?
Reporter
Comment 4•11 years ago
Also see: https://bugs.launchpad.net/ubuntu/+source/compiz/+bug/969860/comments/96
Comment 5•11 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #3)
> This sounds potentially relevant:
> https://bugs.launchpad.net/ubuntu/+source/compiz/+bug/995118
> Are the slaves configured as the default, to blank the screen after X
> seconds of inactivity?

No, we disable that in http://hg.mozilla.org/build/puppet/file/6355249ba9a5/modules/gui/files/gsettings.gschema.override
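For reference, screen blanking is typically disabled in a gschema override using standard org.gnome.desktop.* keys. This is a sketch of what such a file can contain, not the verbatim contents of the linked override:

```ini
[org.gnome.desktop.session]
idle-delay=uint32 0

[org.gnome.desktop.screensaver]
idle-activation-enabled=false
lock-enabled=false
```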
Reporter
Comment 6•11 years ago
Logging into this VM later, compiz isn't running at all. :(
Comment 7•11 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #6)
> Logging into this VM later, compiz isn't running at all. :(

Can you try now? It turns out that some of the loaning steps are not documented properly for this platform yet...
Comment 8•11 years ago
jgriffin: did you successfully get access to this machine?
Flags: needinfo?(jgriffin)
Updated•11 years ago
Reporter
Comment 9•11 years ago
(In reply to Chris Cooper [:coop] from comment #8)
> jgriffin: did you successfully get access to this machine?

I did, and compiz was running again, and the emulator behaved fine. The only time I had any problems was the very first time I launched the emulator, at which point compiz got stuck at 100% CPU and the emulator hung. It might be worthwhile rebooting this VM (I can't do it since I don't have sudo) and seeing if the problem recurs.
Flags: needinfo?(jgriffin)
Comment 10•11 years ago
I rebooted the machine. BTW, "sudo reboot" is allowed for cltbld.
Reporter
Comment 11•11 years ago
I have still not been able to reproduce this problem, except for the very first time I attempted. After a reboot, the tests still function normally. Is it possible compiz gets in a weird state somehow when the test slave is first connected to buildbot? The only things I can think of to help debug this are:

1. Add some output to the mozharness script to dump the process list, to see if these are all cases caused by compiz misbehaving (something like ps -eo pcpu,pid,user,args | sort -k 1 -r | head -10)
2. Add some debugging output to Marionette on cedar to see if the emulator is dumping any error messages to stdout that we're not seeing
Reporter
Comment 12•11 years ago
I've just added some 'ps' logging to mozharness for this, and I've also bumped the emulator launch timeout on cedar from 60s to 180s. We'll see if either of these give us more info.
Reporter
Comment 13•11 years ago
emulator launch timeout increased: https://hg.mozilla.org/projects/cedar/rev/f48348119595
Reporter
Comment 14•11 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #12)
> I've just added some 'ps' logging to mozharness for this, and I've also
> bumped the emulator launch timeout on cedar from 60s to 180s. We'll see if
> either of these give us more info.

This shows that compiz doesn't seem to be the problem:

16:57:55 INFO - %CPU PID USER COMMAND
16:57:55 INFO - 1.5 2655 cltbld /tools/buildbot/bin/python scripts/scripts/b2g_emulator_unittest.py --cfg b2g/emulator_automation_config.py --test-suite crashtest --this-chunk 3 --total-chunks 3 --download-symbols true
16:57:55 INFO - 0.5 2658 root [flush-8:0]
16:57:55 INFO - 0.4 1117 root [kipmi0]
16:57:55 INFO - 0.1 2335 cltbld compiz
16:57:55 INFO - 0.1 1921 root X :0 -nolisten tcp
16:57:55 INFO - 0.0 2764 cltbld ps -eo pcpu,pid,user,args
16:57:55 INFO - 0.0 2640 cltbld /usr/lib/deja-dup/deja-dup/deja-dup-monitor
16:57:55 INFO - 0.0 2633 cltbld /usr/bin/python /usr/lib/unity-scope-video-remote/unity-scope-video-remote
16:57:55 INFO - 0.0 2605 cltbld /usr/lib/unity-lens-music/unity-musicstore-daemon
Reporter
Comment 15•11 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #13)
> emulator launch timeout increased:
> https://hg.mozilla.org/projects/cedar/rev/f48348119595

Increasing the timeout didn't help.
Reporter
Comment 16•11 years ago
Added some more debugging output to cedar: https://hg.mozilla.org/projects/cedar/rev/8d27a357c892
Reporter
Comment 17•11 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #16)
> Added some more debugging output to cedar:
> https://hg.mozilla.org/projects/cedar/rev/8d27a357c892

No help here either. I have to assume this has something to do with a difference in how we're starting a session. I'm using ssh to the console, and then invoking the emulator from there. Buildbot seems to be using a remote X session, but I can't quite tell what it's doing. :rail, can you describe how buildbot invokes commands on the slave?
Reporter
Comment 18•11 years ago
:rail, can you explain how buildbot connects to and invokes commands on the slave? It seems like it's setting up a remote X session, but I'd like to know how exactly, so I can try to reproduce that way.
Flags: needinfo?(rail)
Comment 19•11 years ago
Buildbot (buildslave, actually) starts from a running X session: Unity starts ~/.config/autostart/gnome-terminal.desktop, which starts gnome-terminal, which runs runslave.py (it talks to slavealloc and downloads the buildbot startup file). So, basically, buildbot inherits all variables from the existing X session. See also:

http://hg.mozilla.org/build/puppet/file/e6e380e85cb1/modules/buildslave/manifests/startup/desktop.pp#l16
http://hg.mozilla.org/build/puppet/file/e6e380e85cb1/modules/buildslave/templates/gnome-terminal.desktop.erb
http://hg.mozilla.org/build/puppet/file/e6e380e85cb1/modules/buildslave/files/runslave.py

The best way to reproduce the existing environment would be VNCing in and running gnome-terminal. Let me know (or ping on IRC) if you need more details.
Flags: needinfo?(rail)
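The startup chain rail describes can be pictured as a desktop autostart entry along these lines (a sketch only; the real file is generated from the linked gnome-terminal.desktop.erb template, and the Exec path here is an assumption):

```ini
# ~/.config/autostart/gnome-terminal.desktop
[Desktop Entry]
Type=Application
Name=Buildslave terminal
# gnome-terminal runs runslave.py inside the logged-in X session, so the
# buildslave inherits DISPLAY and the rest of the session environment.
Exec=gnome-terminal -x /usr/local/bin/runslave.py
X-GNOME-Autostart-enabled=true
```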
Reporter
Comment 20•11 years ago
I did just that and the problem still doesn't reproduce. Are there any other differences in how buildbot uses/configures the slave that could be relevant?
Flags: needinfo?(rail)
Reporter
Updated•11 years ago
Assignee: nobody → jgriffin
Comment 21•11 years ago
jgriffin, did you try to reproduce the steps from the logs, including the same environment variables? We can try to reproduce this together, maybe I'll have some ideas...
Flags: needinfo?(rail)
Reporter
Comment 22•11 years ago
Yes, I did set all the same environment variables, at least the ones that made sense. If you want to try and reproduce it together, ping me on irc.
Comment 23•11 years ago
I managed to reproduce the issue mentioned in comment 2 (compiz eating 100% of CPU). I straced the compiz process and it was calling sched_yield() in a loop. I think the problem is the NVIDIA driver's default OpenGL yield behaviour, which uses sched_yield(). We can change this behaviour by setting the __GL_YIELD env variable to "NOTHING" or "USLEEP". I tried both and they worked. I'll try to reproduce the same on the slave loaned to jgriffin.
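The workaround can be tried by hand before baking it into config. __GL_YIELD is a documented NVIDIA driver environment variable, and compiz only picks it up when (re)started with it set; the --replace invocation below is an assumption about how one would restart it manually, not a command from this bug:

```shell
#!/bin/sh
# Restart compiz from a shell that has the variable set, so the NVIDIA
# driver sleeps instead of busy-looping in sched_yield(). NOTHING and
# USLEEP both worked per this comment.
export __GL_YIELD=NOTHING
if command -v compiz >/dev/null 2>&1; then
    compiz --replace &
fi
```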
Reporter
Comment 24•11 years ago
I couldn't repro this on my slave (even after reboot), but it sounds like the problem. Can we add this env variable to the slave config and see if it fixes this?
Comment 25•11 years ago
I would prefer to set this variable from mozharness to avoid possible issues with desktop Talos tests. I'll prep a patch.
Comment 26•11 years ago
Attachment #746956 - Flags: review?(aki)
Comment 27•11 years ago
Comment on attachment 746956 [details] [diff] [review]
mozharness env

I'm going to guess this line needs changing to

env = self.config.get('env', {})

http://hg.mozilla.org/build/mozharness/file/1287327b0f15/scripts/b2g_emulator_unittest.py#l351
Attachment #746956 - Flags: review?(aki) → review+
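The one-line fix aki suggests amounts to merging the script's configured env instead of starting from a missing key; a hypothetical minimal sketch (build_env and the config shape are illustrative, not the actual mozharness code):

```python
def build_env(config):
    # Start from the configured env if present (aki's suggested
    # config.get('env', {})) rather than assuming the key exists,
    # then inject the compiz workaround variable.
    env = dict(config.get('env', {}))
    env['__GL_YIELD'] = 'NOTHING'
    return env
```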
Comment 28•11 years ago
Comment on attachment 746956 [details] [diff] [review]
mozharness env

(In reply to Aki Sasaki [:aki] from comment #27)
> I'm going to guess this line needs changing to
>
> env = self.config.get('env', {})
>
> http://hg.mozilla.org/build/mozharness/file/1287327b0f15/scripts/b2g_emulator_unittest.py#l351

Ah, I missed that point. https://hg.mozilla.org/build/mozharness/rev/f08c264808e7
Attachment #746956 - Flags: checked-in+
Updated•11 years ago
Attachment #746956 - Flags: checked-in+ → checked-in-
Comment 29•11 years ago
I backed out this patch because it didn't really help. It turns out that the bug is not 100% reproducible. Sometimes compiz behaves correctly, sometimes not. To make it work properly we need to set the variable *before* compiz starts, as a part of X start up. This change may affect desktop Talos results and requires some testing before we land it to all machines. Patch for puppet incoming.
Comment 30•11 years ago
I'm not going to deploy the patch before we test Talos.
Attachment #747431 - Flags: review?(dustin)
Comment 31•11 years ago
Comment on attachment 747431 [details] [diff] [review]
puppet

Review of attachment 747431 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/gui/manifests/init.pp
@@ +65,5 @@
>     "/etc/X11/edid.bin":
>         source => "puppet:///modules/${module_name}/edid.bin";
> +   "/etc/X11/Xsession.d/98nvidia":
> +       content => "export __GL_YIELD=NOTHING\n",
> +       notify => Service['x11'];

This should *definitely* have a comment with a brief explanation pointing to this bug :) Fine with that change.
Attachment #747431 - Flags: review?(dustin) → review+
Comment 32•11 years ago
Comment on attachment 747431 [details] [diff] [review]
puppet

https://hg.mozilla.org/build/puppet/rev/3ea4d1800d7c
Attachment #747431 - Flags: checked-in+
Reporter
Comment 33•11 years ago
I don't think there's any need for this slave any longer; rail, let me know if there's anything left for me to do.
Flags: needinfo?(rail)
Reporter
Updated•11 years ago
Assignee: jgriffin → nobody
Updated•11 years ago
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard