Bug 859867 (Closed) - Opened 11 years ago, Closed 11 years ago

Need to check out one of the new linux hardware slaves to troubleshoot B2G emulator problems

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgriffin, Unassigned)

References

Details

Attachments

(2 files)

The new Ubuntu 12.04 hardware slaves do not cooperate with B2G tests:  https://tbpl.mozilla.org/?tree=Cedar&jobname=b2g_ics_armv7a_gecko_emulator_hw

The logs show that the emulator is failing to start.  I'll need to check out one of these new machines in order to figure out what's going on.
Depends on: 859893
talos-linux64-ix-004.test.releng.scl3.mozilla.com has been allocated for this purpose; I'll send you the information on how to connect.
So basically, the problem on these slaves has something to do with compiz. When I logged into this slave and tried to launch the emulator, it just hung hard. I noticed that compiz was taking 100% of one core.

After I killed the emulator and then killed compiz (which caused it to be automatically restarted), I could launch and kill the emulator successfully as many times as I wanted to, and compiz stayed at a reasonable CPU consumption (< 30%).

So either the window manager/Xorg starts out in some bad state, or the first attempt to launch the emulator puts it in one; either way, killing compiz is enough to "fix" it.

I don't really know how to troubleshoot it beyond this, not being an expert on Linux window manager configs.
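For illustration, a minimal sketch (assuming psutil is available on the slave; not something we actually run today) of a helper that would make this kind of CPU spin easy to spot:

import psutil

def busy_processes(threshold=90.0, interval=1.0):
    # Prime the per-process CPU counters, wait one sampling interval,
    # then report anything pinning (close to) a full core, e.g. compiz.
    procs = list(psutil.process_iter(attrs=["pid", "name"]))
    for p in procs:
        try:
            p.cpu_percent(None)
        except psutil.NoSuchProcess:
            pass
    psutil.cpu_percent(interval)
    for p in procs:
        try:
            usage = p.cpu_percent(None)
        except psutil.NoSuchProcess:
            continue
        if usage >= threshold:
            print("%6.1f%%  %5d  %s" % (usage, p.info["pid"], p.info["name"]))

if __name__ == "__main__":
    busy_processes()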
This sounds potentially relevant:

https://bugs.launchpad.net/ubuntu/+source/compiz/+bug/995118

Are the slaves configured as the default, to blank the screen after X seconds of inactivity?
(In reply to Jonathan Griffin (:jgriffin) from comment #3)
> This sounds potentially relevant:
> 
> https://bugs.launchpad.net/ubuntu/+source/compiz/+bug/995118
> 
> Are the slaves configured as the default, to blank the screen after X
> seconds of inactivity?

No, we disable that in 
http://hg.mozilla.org/build/puppet/file/6355249ba9a5/modules/gui/files/gsettings.gschema.override
Blocks: 820235
Logging into this VM later, compiz isn't running at all.  :(
(In reply to Jonathan Griffin (:jgriffin) from comment #6)
> Logging into this VM later, compiz isn't running at all.  :(

Can you try now? It turns out that some of the loaning steps are not documented properly for this platform yet...
jgriffin: did you successfully get access to this machine?
Flags: needinfo?(jgriffin)
Blocks: 850105
No longer blocks: 820235
(In reply to Chris Cooper [:coop] from comment #8)
> jgriffin: did you successfully get access to this machine?

I did; compiz was running again, and the emulator behaved fine. The only time I had any problems was the very first time I launched the emulator, at which point compiz got stuck at 100% CPU and the emulator hung.

It might be worthwhile to reboot this VM (I can't do it since I don't have sudo) and see if the problem recurs.
Flags: needinfo?(jgriffin)
I rebooted the machine. BTW, "sudo reboot" is allowed for cltbld.
I have still not been able to reproduce this problem, except on my very first attempt. After a reboot, the tests still function normally. Is it possible compiz gets into a weird state somehow when the test slave is first connected to buildbot?

The only things I can think of to help debug this are:

1] - add some output to the mozharness script to dump the process list, to see whether these failures are all caused by compiz misbehaving (something like ps -eo pcpu,pid,user,args | sort -k 1 -r | head -10; rough sketch below)

2] - add some debugging output to Marionette on cedar to see if the emulator is dumping any error messages to stdout that we're not seeing
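For concreteness, a rough sketch of 1] (hypothetical Python, not the actual mozharness change):

import subprocess

def log_top_processes(count=10):
    # Equivalent of: ps -eo pcpu,pid,user,args | sort -k 1 -r | head -10
    # In mozharness this would go through the script's own logging rather than print().
    output = subprocess.check_output(["ps", "-eo", "pcpu,pid,user,args"])
    lines = output.decode("utf-8", "replace").splitlines()
    header, rows = lines[0], lines[1:]
    rows.sort(key=lambda line: float(line.split(None, 1)[0]), reverse=True)
    for line in [header] + rows[:count]:
        print(line)

if __name__ == "__main__":
    log_top_processes()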
Depends on: 865837
I've just added some 'ps' logging to mozharness for this, and I've also bumped the emulator launch timeout on cedar from 60s to 180s.  We'll see if either of these give us more info.
emulator launch timeout increased:  https://hg.mozilla.org/projects/cedar/rev/f48348119595
(In reply to Jonathan Griffin (:jgriffin) from comment #12)
> I've just added some 'ps' logging to mozharness for this, and I've also
> bumped the emulator launch timeout on cedar from 60s to 180s.  We'll see if
> either of these give us more info.

This shows that compiz doesn't seem to be the problem:

16:57:55     INFO - %CPU   PID USER     COMMAND
16:57:55     INFO -  1.5  2655 cltbld   /tools/buildbot/bin/python scripts/scripts/b2g_emulator_unittest.py --cfg b2g/emulator_automation_config.py --test-suite crashtest --this-chunk 3 --total-chunks 3 --download-symbols true
16:57:55     INFO -  0.5  2658 root     [flush-8:0]
16:57:55     INFO -  0.4  1117 root     [kipmi0]
16:57:55     INFO -  0.1  2335 cltbld   compiz
16:57:55     INFO -  0.1  1921 root     X :0 -nolisten tcp
16:57:55     INFO -  0.0  2764 cltbld   ps -eo pcpu,pid,user,args
16:57:55     INFO -  0.0  2640 cltbld   /usr/lib/deja-dup/deja-dup/deja-dup-monitor
16:57:55     INFO -  0.0  2633 cltbld   /usr/bin/python /usr/lib/unity-scope-video-remote/unity-scope-video-remote
16:57:55     INFO -  0.0  2605 cltbld   /usr/lib/unity-lens-music/unity-musicstore-daemon
(In reply to Jonathan Griffin (:jgriffin) from comment #13)
> emulator launch timeout increased: 
> https://hg.mozilla.org/projects/cedar/rev/f48348119595

Increasing the timeout didn't help.
Added some more debugging output to cedar: https://hg.mozilla.org/projects/cedar/rev/8d27a357c892
(In reply to Jonathan Griffin (:jgriffin) from comment #16)
> Added some more debugging output to cedar:
> https://hg.mozilla.org/projects/cedar/rev/8d27a357c892

No help here either.

I have to assume this has something to do with a difference in how we're starting a session. I'm using ssh to the console and then invoking the emulator from there. Buildbot seems to be using a remote X session, but I can't quite tell what it's doing.  :rail, can you describe how buildbot invokes commands on the slave?
:rail, can you explain how buildbot connects to and invokes commands on the slave?  It seems like it's setting up a remote X session, but I'd like to know exactly how, so I can try to reproduce it that way.
Flags: needinfo?(rail)
Buildbot (buildslave, actually) starts from a running X session: Unity starts ~/.config/autostart/gnome-terminal.desktop, which starts gnome-terminal, which runs runslave.py (it talks to slavealloc and downloads the buildbot startup file). So, basically, buildbot inherits all environment variables from the existing X session.

See also:
http://hg.mozilla.org/build/puppet/file/e6e380e85cb1/modules/buildslave/manifests/startup/desktop.pp#l16
http://hg.mozilla.org/build/puppet/file/e6e380e85cb1/modules/buildslave/templates/gnome-terminal.desktop.erb
http://hg.mozilla.org/build/puppet/file/e6e380e85cb1/modules/buildslave/files/runslave.py

The best way to reproduce the existing environment would be VNCing and running gnome-terminal.

Let me know (or ping on IRC) if you need more details.
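For what it's worth, a minimal sketch (hypothetical helper, not part of the infra) for checking whether the two environments really match: run it once from the gnome-terminal session that buildbot inherits (via VNC) and once from the ssh session, then diff the two files.

import os
import sys

def dump_env(path):
    # Write the session's environment, sorted, so two dumps diff cleanly.
    with open(path, "w") as f:
        for key in sorted(os.environ):
            f.write("%s=%s\n" % (key, os.environ[key]))

if __name__ == "__main__":
    dump_env(sys.argv[1] if len(sys.argv) > 1 else "env-dump.txt")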
Flags: needinfo?(rail)
I did just that and the problem still doesn't reproduce.  Are there any other differences in how buildbot uses/configures the slave that could be relevant?
Flags: needinfo?(rail)
Assignee: nobody → jgriffin
jgriffin, did you try to reproduce the steps from the logs, including the same environment variables? We can try to reproduce this together; maybe I'll have some ideas...
Flags: needinfo?(rail)
Yes, I did set all the same environment variables, at least the ones that made sense.  If you want to try and reproduce it together, ping me on irc.
I managed to reproduce the issue mentioned in comment 2 (compiz eating 100% of a core).

I straced the compiz process, and it was calling sched_yield() in a loop.

I think the problem is the nvidia driver's default OpenGL yield behavior, which uses sched_yield().

We can change this behaviour by setting the __GL_YIELD env variable to "NOTHING" or "USLEEP". I tried both and they worked.

I'll try to reproduce the same on the slave loaned to jgriffin.
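A minimal sketch of that manual check (glxgears is just a stand-in here for the actual emulator launch command):

import os
import subprocess

# Copy the current environment and add the workaround before launching the
# GL client, so the nvidia driver stops busy-waiting in sched_yield().
env = os.environ.copy()
env["__GL_YIELD"] = "NOTHING"  # "USLEEP" also worked

proc = subprocess.Popen(["glxgears"], env=env)
# ...watch compiz's CPU usage while the client runs, then proc.terminate()...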
I couldn't repro this on my slave (even after a reboot), but that sounds like the problem.  Can we add this env variable to the slave config and see if it fixes this?
I would prefer to set this variable from mozharness to avoid possible issues with desktop Talos tests. I'll prep a patch.
Attached patch: mozharness env (Splinter Review)
Attachment #746956 - Flags: review?(aki)
Comment on attachment 746956 [details] [diff] [review]
mozharness env

I'm going to guess this line needs changing to

env = self.config.get('env', {})

http://hg.mozilla.org/build/mozharness/file/1287327b0f15/scripts/b2g_emulator_unittest.py#l351
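In other words, roughly this (a sketch only, with a hypothetical class standing in for the real mozharness script):

class EmulatorScriptSketch(object):
    """Hypothetical stand-in for the b2g emulator unittest script."""

    def __init__(self, config):
        self.config = config

    def emulator_env(self):
        # Start from the configured env (if any) rather than an empty dict,
        # then layer the __GL_YIELD workaround on top of it.
        env = dict(self.config.get('env', {}))
        env['__GL_YIELD'] = 'NOTHING'
        return env

print(EmulatorScriptSketch({'env': {'DISPLAY': ':0'}}).emulator_env())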
Attachment #746956 - Flags: review?(aki) → review+
Comment on attachment 746956 [details] [diff] [review]
mozharness env

(In reply to Aki Sasaki [:aki] from comment #27)
> Comment on attachment 746956 [details] [diff] [review]
> mozharness env
> 
> I'm going to guess this line needs changing to
> 
> env = self.config.get('env', {})
> 
> http://hg.mozilla.org/build/mozharness/file/1287327b0f15/scripts/
> b2g_emulator_unittest.py#l351

Ah, I missed that point. https://hg.mozilla.org/build/mozharness/rev/f08c264808e7
Attachment #746956 - Flags: checked-in+
Attachment #746956 - Flags: checked-in+ → checked-in-
I backed out this patch because it didn't really help. It turns out that the bug is not 100% reproducible: sometimes compiz behaves correctly, sometimes not. To make it work properly, we need to set the variable *before* compiz starts, as part of X startup. This change may affect desktop Talos results and requires some testing before we land it on all machines. Patch for puppet incoming.
Attached patch: puppet (Splinter Review)
I'm not going to deploy the patch before we test Talos.
Attachment #747431 - Flags: review?(dustin)
Comment on attachment 747431 [details] [diff] [review]
puppet

Review of attachment 747431 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/gui/manifests/init.pp
@@ +65,5 @@
>                  "/etc/X11/edid.bin":
>                      source => "puppet:///modules/${module_name}/edid.bin";
> +                "/etc/X11/Xsession.d/98nvidia":
> +                    content => "export __GL_YIELD=NOTHING\n",
> +                    notify => Service['x11'];

This should *definitely* have a comment with a brief explanation and a pointer to this bug :)

Fine with that change.
Attachment #747431 - Flags: review?(dustin) → review+
I don't think there's any need for this slave any longer; rail, let me know if there's anything left for me to do.
Flags: needinfo?(rail)
I filed bug 895425 to reimage the machine.
Flags: needinfo?(rail)
Assignee: jgriffin → nobody
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard