Closed Bug 982748 — Opened 10 years ago, Closed 9 years ago

runs on loaner slaves != runs on production boxes

Component: Release Engineering :: General (defect)
Platform: All / Linux
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED INCOMPLETE
Reporter: jmaher
Assignee: Unassigned
Attachments: 3 files

I have been trying to bring up a new talos test which runs a binary that starts up a webserver and measures key, mouse, and rendering latency.  This works fine on loaner slaves, but the difference is that I vnc into them and launch the command myself.

Sticking with Linux, I have run the exact same commands that are run on my try server push, and it passes for me on the loaner slave over and over again.  On the try server it fails.

To top it off, I put in a dumpScreen call to see what is happening, and I always get a black screen from the production box but a valid screen from the loaner box.
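As an extra sanity check (separate from the talos dumpScreen call; this is only a sketch and assumes xdpyinfo and ImageMagick's import are installed on the slave), the display can be inspected over ssh like this:

export DISPLAY=:0
xdpyinfo | head -n 5                        # confirm the X server is reachable at all
import -window root /tmp/screen-check.png   # grab the root window; an all-black PNG means nothing is rendering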

I have been investigating this for a while with no luck.
The loaners are created exactly the same way as the production machines, so I'm not sure what could be going on here.

Can you start the command via ssh, making sure to set DISPLAY properly first?
How do you launch it with the display over ssh?  I have put about three days into this, trying all types of things.  This isn't the first time I have had this problem.
export DISPLAY=:0.0
/run/my/command.sh
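For reference, the same thing as a single ssh invocation (the hostname and user here are placeholders, and /run/my/command.sh is just the stand-in command from above):

ssh cltbld@talos-linux32-ix-001.test.releng.scl3.mozilla.com \
    'export DISPLAY=:0.0; /run/my/command.sh'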

Which display are you vnc'ing into?
Attached file: works on my dev-master
Attached file: fails on a try master
Attached patch: differences.diff
A diff between the two shows nothing relevant.
I did a fresh push to try and my results are the same: https://tbpl.mozilla.org/?tree=Try&rev=4bef01bf1038.

Launching via DISPLAY=:0:0 yields similar failures to the try server; let me debug around that a bit more.
some more data:
FAILED - running after a reboot with display set and via ssh
PASSED - connect via vnc, ssh in set display and run via ssh
PASSED - kill vnc server and connections, ssh in set display and run via ssh

Somehow the act of connecting via VNC does something to the system which allows the tests to run.
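One way to narrow down what the VNC connection changes would be to snapshot the X server state before and after connecting and diff the two (a rough sketch; assumes xset and xrandr are available on the slave):

export DISPLAY=:0
{ xset q; xrandr --query; } > /tmp/xstate-before.txt
# ... connect via VNC, then disconnect ...
{ xset q; xrandr --query; } > /tmp/xstate-after.txt
diff -u /tmp/xstate-before.txt /tmp/xstate-after.txt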
That is great news!

Why :0:0 instead of :0? I'm curious.

What puzzles me is that both the dev-master run and the try run had "DISPLAY=:0" in their logs; however, the former passed and the latter failed. I could not find any differences in the logs, and I had not ssh'ed or VNC'ed into the host after a reboot.
OK, I did :0 for the display and it worked just fine after a reboot with no vnc love.

we have this:
passing -
* vnc to a loaner box, run the commands from try server
* ssh to a loaner box, run the commands from try server
* hook loaner box up to a dev master - run same job as try server
* hook secondary slave up to a dev master - run same job as try server

failing - 
* push to try server
* retrigger on try server
* repeated pushes to try server

The logs between the dev master and try are nearly identical (especially if you remove the timestamps and the build/job ids).
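For anyone who wants to repeat the comparison, a rough way to diff the two logs with the timestamps stripped (the filenames and the leading-timestamp format here are assumptions):

sed -E 's/^[0-9]{2}:[0-9]{2}:[0-9]{2} +//' dev-master.log > dev-master.clean
sed -E 's/^[0-9]{2}:[0-9]{2}:[0-9]{2} +//' try.log        > try.clean
# build/job ids would need a similar substitution
diff -u dev-master.clean try.clean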
(In reply to Joel Maher (:jmaher) from comment #8)
> some more data:
> FAILED - running after a reboot with display set and via ssh
> PASSED - connect via vnc, ssh in set display and run via ssh
> PASSED - kill vnc server and connections, ssh in set display and run via ssh
> 
> Somehow the act of connecting via VNC does something to the system which
> allows the tests to run.

We had similar issues with the gaia-ui tests, which went away when the mesa libs were updated on the slaves in bug 975034.  Might be worth checking out a new slave if your slave pre-dates that change.
on my loaner machine I have mesa 9.2.0:
[root@talos-linux32-ix-001.test.releng.scl3.mozilla.com ~]# sudo dpkg -l | grep mesa
ii  libgl1-mesa-dri                        9.2.0~git20130216.dd599188-0ubuntu0sarvatt~precise                      free implementation of the OpenGL API -- DRI modules
ii  libgl1-mesa-glx                        9.2.0~git20130216.dd599188-0ubuntu0sarvatt~precise                      free implementation of the OpenGL API -- GLX runtime
ii  libglapi-mesa                          9.2.0~git20130216.dd599188-0ubuntu0sarvatt~precise                      free implementation of the GL API -- shared library
ii  libglu1-mesa                           9.0.0-0ubuntu1~precise1                                                 Mesa OpenGL utility library (GLU)
[root@talos-linux32-ix-001.test.releng.scl3.mozilla.com ~]# 


This is a hardware machine, but it could fall prey to the same issues as EC2. From reading bug 975034, we had upgraded to 8.0.4.  Should we consider that for the hardware slaves?

One other question is how we could test this.  Effectively, we could test that it doesn't break anything locally (not a guarantee for production) and then roll it out to production.  Once rolled out, try our jobs in production.

This might be worth looking into here.
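dpkg only shows what is installed; the renderer the running X server actually picks up can be checked with glxinfo (a sketch, assuming mesa-utils is present on the slave):

export DISPLAY=:0
glxinfo | grep -E "OpenGL (vendor|renderer|version)"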
News:
We have found that sometimes we succeed on try:
https://tbpl.mozilla.org/?tree=Try&rev=0bfd96a29f51

I took two slaves from production (one that succeeded and one that failed) and put them on my dev-master.

Once I put both on staging, I actually happened to see the failure on both of them:
http://dev-master1.build.mozilla.org:8041/builders/Ubuntu%20HW%2012.04%20try%20talos%20svgr
A re-trigger succeeded in both!!!!


Maybe we're hitting some sort of race condition?

The machine connects to buildbot at 13:31:09:
2014-03-13 13:31:09-0700 [Broker,client] Connected to dev-master1.srv.releng.scl3.mozilla.com:9041; slave is ready

I don't know whether these background tasks, which happen while we're taking a job, might be related (from /var/log/syslog):
Mar 13 13:31:01 talos-linux32-ix-003 rtkit-daemon[2431]: Successfully limited resources.
Mar 13 13:31:01 talos-linux32-ix-003 rtkit-daemon[2431]: Running.
Mar 13 13:31:01 talos-linux32-ix-003 rtkit-daemon[2431]: Watchdog thread running.
Mar 13 13:31:01 talos-linux32-ix-003 rtkit-daemon[2431]: Canary thread running.
Mar 13 13:31:16 talos-linux32-ix-003 goa[2573]: goa-daemon version 3.4.0 starting [main.c:112, main()]
Mar 13 13:31:32 talos-linux32-ix-003 dbus[762]: [system] Failed to activate service 'org.freedesktop.Avahi': timed out
Mar 13 13:32:01 talos-linux32-ix-003 dbus[762]: [system] Activating service name='com.ubuntu.DeviceDriver' (using servicehelper)
Mar 13 13:32:02 talos-linux32-ix-003 dbus[762]: [system] Successfully activated service 'com.ubuntu.DeviceDriver'
Mar 13 13:32:11 talos-linux32-ix-003 dbus[762]: [system] Activating service name='org.freedesktop.PackageKit' (using servicehelper)
Mar 13 13:32:11 talos-linux32-ix-003 AptDaemon: INFO: Initializing daemon
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.PackageKit: INFO: Initializing PackageKit compat layer
Mar 13 13:32:12 talos-linux32-ix-003 dbus[762]: [system] Successfully activated service 'org.freedesktop.PackageKit'
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.PackageKit: INFO: Initializing PackageKit transaction
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.Worker: INFO: Simulating trans: /org/debian/apt/transaction/79168471b3f94c809b82e08c381821e8
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.Worker: INFO: Processing transaction /org/debian/apt/transaction/79168471b3f94c809b82e08c381821e8
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.PackageKit: INFO: Get updates()
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.Worker: INFO: Finished transaction /org/debian/apt/transaction/79168471b3f94c809b82e08c381821e8
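If this is a race between the slave taking a job and the desktop session finishing startup, one possible mitigation (only a sketch; the 60-second timeout is an arbitrary choice, not something we do today) would be to wait for the display to answer before launching the test:

export DISPLAY=:0
for i in $(seq 1 60); do
    xdpyinfo >/dev/null 2>&1 && break   # stop as soon as the X server responds
    sleep 1
done
xdpyinfo >/dev/null 2>&1 || { echo "display $DISPLAY never became ready" >&2; exit 1; }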
Depends on: 984944
Docker images and taskcluster will solve ALL the things!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Actually, I am not sure we can run all talos jobs in Docker.  The one test which caused me to open this bug did a lot of custom input and capturing.  Still, I think we should keep this closed.
Component: General Automation → General