Closed Bug 982748 — Opened 10 years ago, Closed 9 years ago

runs on loaner slaves != runs on production boxes

Component: Release Engineering :: General (defect)
Platform: All / Linux
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED INCOMPLETE
Reporter: jmaher
Assignee: Unassigned
Attachments: 3 files

I have been trying to bring up a new talos test which runs a binary that starts up a webserver and measures key, mouse, and rendering latency.  This works fine on loaner slaves, but the difference is that I vnc into them and launch the command myself.

Sticking with Linux, I have run the exact same commands that are run on my try server push, and it passes for me on the loaner slave over and over again.  On the try server it fails.

To top it off, I put in a dumpScreen call to see what is happening, and I always get a black screen from the production box but a valid screen from the loaner box.
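As an extra sanity check (separate from the talos dumpScreen call; this is only a sketch and assumes xdpyinfo and ImageMagick's import are installed on the slave), the display can be inspected over ssh like this:

export DISPLAY=:0
xdpyinfo | head -n 5                        # confirm the X server is reachable at all
import -window root /tmp/screen-check.png   # grab the root window; an all-black PNG means nothing is rendering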

I have been investigating this for a while with no luck.
The loaners are created exactly the same way as the production machines, so I'm not sure what could be going on here.

Can you start the command via ssh, making sure to set DISPLAY properly first?
How do you launch it with the display over ssh?  I have put about three days into this, trying all types of things.  This isn't the first time I have had this problem.
export DISPLAY=:0.0
/run/my/command.sh
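For reference, the same thing as a single ssh invocation (the hostname and user here are placeholders, and /run/my/command.sh is just the stand-in command from above):

ssh cltbld@talos-linux32-ix-001.test.releng.scl3.mozilla.com \
    'export DISPLAY=:0.0; /run/my/command.sh'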

Which display are you vnc'ing into?
Attached file: works on my dev-master
Attached file: fails on a try master
Attached patch: differences.diff
A diff between the two shows nothing relevant.
I did a fresh push to try and my results are the same: https://tbpl.mozilla.org/?tree=Try&rev=4bef01bf1038.

Launching via DISPLAY=:0:0 yields similar failures to the try server; let me debug around that a bit more.
some more data:
FAILED - running after a reboot with display set and via ssh
PASSED - connect via vnc, ssh in set display and run via ssh
PASSED - kill vnc server and connections, ssh in set display and run via ssh

Somehow the act of connecting via VNC does something to the system which allows the tests to run.
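One way to narrow down what the VNC connection changes would be to snapshot the X server state before and after connecting and diff the two (a rough sketch; assumes xset and xrandr are available on the slave):

export DISPLAY=:0
{ xset q; xrandr --query; } > /tmp/xstate-before.txt
# ... connect via VNC, then disconnect ...
{ xset q; xrandr --query; } > /tmp/xstate-after.txt
diff -u /tmp/xstate-before.txt /tmp/xstate-after.txt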
That is great news!

Why :0:0 instead of :0? I'm curious.

What puzzles me is that both the dev-master run and the try run had "DISPLAY=:0" in their logs; however, the former passed and the latter failed. I could not find any differences in the logs, and I had not ssh'ed or VNC'ed into the host after a reboot.
OK, I did :0 for the display and it worked just fine after a reboot with no vnc love.

we have this:
passing -
* vnc to a loaner box, run the commands from try server
* ssh to a loaner box, run the commands from try server
* hook loaner box up to a dev master - run same job as try server
* hook secondary slave up to a dev master - run same job as try server

failing - 
* push to try server
* retrigger on try server
* repeated pushes to try server

The logs between the dev master and try are nearly identical (especially if you remove the timestamps and the build/job ids).
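For anyone who wants to repeat the comparison, a rough way to diff the two logs with the timestamps stripped (the filenames and the leading-timestamp format here are assumptions):

sed -E 's/^[0-9]{2}:[0-9]{2}:[0-9]{2} +//' dev-master.log > dev-master.clean
sed -E 's/^[0-9]{2}:[0-9]{2}:[0-9]{2} +//' try.log        > try.clean
# build/job ids would need a similar substitution
diff -u dev-master.clean try.clean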
(In reply to Joel Maher (:jmaher) from comment #8)
> some more data:
> FAILED - running after a reboot with display set and via ssh
> PASSED - connect via vnc, ssh in set display and run via ssh
> PASSED - kill vnc server and connections, ssh in set display and run via ssh
> 
> Somehow the act of connecting via VNC does something to the system which
> allows the tests to run.

We had similar issues with the gaia-ui tests, which went away when the mesa libs were updated on the slaves in bug 975034.  Might be worth checking out a new slave if your slave pre-dates that change.
on my loaner machine I have mesa 9.2.0:
[root@talos-linux32-ix-001.test.releng.scl3.mozilla.com ~]# sudo dpkg -l | grep mesa
ii  libgl1-mesa-dri                        9.2.0~git20130216.dd599188-0ubuntu0sarvatt~precise                      free implementation of the OpenGL API -- DRI modules
ii  libgl1-mesa-glx                        9.2.0~git20130216.dd599188-0ubuntu0sarvatt~precise                      free implementation of the OpenGL API -- GLX runtime
ii  libglapi-mesa                          9.2.0~git20130216.dd599188-0ubuntu0sarvatt~precise                      free implementation of the GL API -- shared library
ii  libglu1-mesa                           9.0.0-0ubuntu1~precise1                                                 Mesa OpenGL utility library (GLU)
[root@talos-linux32-ix-001.test.releng.scl3.mozilla.com ~]# 


This is a hardware machine, but it could fall prey to the same issues as EC2. From reading bug 975034, we had upgraded to 8.0.4.  Should we consider that for the hardware slaves?

One other question is how we could test this.  Effectively, we could test that it doesn't break anything locally (not a guarantee for production) and then roll it out to production.  Once rolled out, try our jobs in production.

This might be worth looking into here.
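dpkg only shows what is installed; the renderer the running X server actually picks up can be checked with glxinfo (a sketch, assuming mesa-utils is present on the slave):

export DISPLAY=:0
glxinfo | grep -E "OpenGL (vendor|renderer|version)"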
News:
We have found that sometimes we succeed on try:
https://tbpl.mozilla.org/?tree=Try&rev=0bfd96a29f51

I took two slaves from production (one that succeeded and one that failed) and put them on my dev-master.

Once I put both on staging, I actually happened to see the failure on both of them:
http://dev-master1.build.mozilla.org:8041/builders/Ubuntu%20HW%2012.04%20try%20talos%20svgr
A re-trigger succeeded in both!!!!


Maybe we're hitting some sort of race condition?

The machine connects to buildbot at 13:31:09:
2014-03-13 13:31:09-0700 [Broker,client] Connected to dev-master1.srv.releng.scl3.mozilla.com:9041; slave is ready

I don't know whether these background tasks, which happen while we're taking a job, might be related (from /var/log/syslog):
Mar 13 13:31:01 talos-linux32-ix-003 rtkit-daemon[2431]: Successfully limited resources.
Mar 13 13:31:01 talos-linux32-ix-003 rtkit-daemon[2431]: Running.
Mar 13 13:31:01 talos-linux32-ix-003 rtkit-daemon[2431]: Watchdog thread running.
Mar 13 13:31:01 talos-linux32-ix-003 rtkit-daemon[2431]: Canary thread running.
Mar 13 13:31:16 talos-linux32-ix-003 goa[2573]: goa-daemon version 3.4.0 starting [main.c:112, main()]
Mar 13 13:31:32 talos-linux32-ix-003 dbus[762]: [system] Failed to activate service 'org.freedesktop.Avahi': timed out
Mar 13 13:32:01 talos-linux32-ix-003 dbus[762]: [system] Activating service name='com.ubuntu.DeviceDriver' (using servicehelper)
Mar 13 13:32:02 talos-linux32-ix-003 dbus[762]: [system] Successfully activated service 'com.ubuntu.DeviceDriver'
Mar 13 13:32:11 talos-linux32-ix-003 dbus[762]: [system] Activating service name='org.freedesktop.PackageKit' (using servicehelper)
Mar 13 13:32:11 talos-linux32-ix-003 AptDaemon: INFO: Initializing daemon
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.PackageKit: INFO: Initializing PackageKit compat layer
Mar 13 13:32:12 talos-linux32-ix-003 dbus[762]: [system] Successfully activated service 'org.freedesktop.PackageKit'
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.PackageKit: INFO: Initializing PackageKit transaction
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.Worker: INFO: Simulating trans: /org/debian/apt/transaction/79168471b3f94c809b82e08c381821e8
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.Worker: INFO: Processing transaction /org/debian/apt/transaction/79168471b3f94c809b82e08c381821e8
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.PackageKit: INFO: Get updates()
Mar 13 13:32:12 talos-linux32-ix-003 AptDaemon.Worker: INFO: Finished transaction /org/debian/apt/transaction/79168471b3f94c809b82e08c381821e8
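If this is a race between the slave taking a job and the desktop session finishing startup, one possible mitigation (only a sketch; the 60-second timeout is an arbitrary choice, not something we do today) would be to wait for the display to answer before launching the test:

export DISPLAY=:0
for i in $(seq 1 60); do
    xdpyinfo >/dev/null 2>&1 && break   # stop as soon as the X server responds
    sleep 1
done
xdpyinfo >/dev/null 2>&1 || { echo "display $DISPLAY never became ready" >&2; exit 1; }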
Depends on: 984944
Docker images and taskcluster will solve ALL the things!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Actually, I am not sure we can run all talos jobs in Docker.  The one test which caused me to open this bug did a lot of custom input and capturing.  Still, I think we should keep this closed.
Component: General Automation → General