Closed Bug 1269784 Opened 6 years ago Closed 6 years ago

run linux64 talos jobs in taskcluster but using hardware, not aws

Categories

(Firefox Build System :: Task Configuration, task, P4)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Unassigned)

Attachments

(2 files)

In bug 1253341, we ran a few experiments on AWS and determined that the noise level was too high to get reliable alerts out to developers in a timely manner.

Moving forward we want to look at hardware options, and have 2 choices:
1) a cloud-based hardware (bare metal) provider; one choice we know of is packet.net.
2) using our existing hardware in a datacenter and running docker+tc-worker.

As discussed in bug 1253341 and on irc, we want to start with packet.net and go from there.  Using the "Tiny 0" instances should be fine; the cpu seems a bit slow, but not worryingly so.  The price is one eighth that of "Tiny 1", which is a bit overkill.

:garndt had indicated he would look briefly at packet.net this week to get an idea of the hurdles; we can then circle back and schedule time to tackle whatever issues remain.
Depends on: 1230652, 1269040, 1269340
:garndt, can you help me outline some next steps or work items for using packet.net and the docker-worker setup?  I am happy to investigate/hack where possible, but some pointers would be good as we venture into the unknown.
Flags: needinfo?(garndt)
I got our trial account set up with my contact at packet.net and started talking with one of their representatives about best practices for setting up an instance.

My next step is to define a server within their interface, spin it up, install our necessary packages and docker, clone docker-worker, and see where that gets me.
Flags: needinfo?(garndt)
Some trial runs were done on packet.net with their lowest-cost server:
https://tools.taskcluster.net/task-inspector/#KQHm6S9zRcaqO1UKz_xEfg/0

Took a little longer than one of the runs I saw on try because it's a lot slower to get the docker image from s3 when you're not in AWS.

I had to make some temporary modifications to docker-worker: https://github.com/taskcluster/docker-worker/pull/229
ok, packet.net trial for the type1 machines:
http://localhost:8000/perf.html#/compare?originalProject=try&originalRevision=c584daa56c64&newProject=try&newRevision=3688e2109fda63355c2e183c4814657bb11b1659&framework=7&showOnlyImportant=0


                Delta % avg delta %     stddev  avg stddev
ix.bb            20.02  0.4              71.59  1.43
c3xlarge.bb      80.78  1.97            205.87  5.02
c3large.tc       35.77  0.89            127.01  3.17
p.net.t1        122.32  2.78            258.64  5.88

Taking the data from the AWS bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1253341#c98), I added p.net.t1; it is pretty noisy.

Shall we try running on type0 with the longer run times?

Or possibly run on our IX machines using tc+docker?

:garndt, you can disable the packet.net machines now
ok, after a few tries on type0 machines.

Next up is running on buildbot IX machines.  I think we need to:
* get 5 machines pulled out of the pool
* figure out how to get tc+docker running on them
* push to try and party
Depends on: 1273200
Component: General → Task Configuration
Priority: -- → P4
Depends on: 1274302
                Delta % avg delta %     stddev  avg stddev
ix.bb           20.02   0.4             71.59   1.43
c3xlarge.bb     80.78   1.97            205.87  5.02
c3large.tc      35.77   0.89            127.01  3.17
p.net.t1        122.32  2.78            258.64  5.88
ix.tc           3.39    0.08             46.66  1.06

!!!  the ix.tc is quite stable:
http://localhost:8000/perf.html#/compare?originalProject=try&originalRevision=b5a085aa6d69&newProject=try&newRevision=80cca75e6d5999ca2ba2840005092a6efcdc11a5&framework=7&showOnlyImportant=0

I am really excited about this!
Unless others have questions/concerns, I would like the next steps here to be to schedule linux64 opt talos jobs as tier-2 with a pool of 20 machines.  These are reporting now to perfherder as 'talos-aws' framework, so the results won't get mixed up- we can compare results over time, look for trends, and most importantly ensure we are catching regressions.

Issues to solve:
* getting tp5o.zip pageset in tooltool and working from mozharness/etc.
* stop hardcoding the framework for all talos to be talos-aws; make it conditional based on some criteria.
* verify tp5o_scroll, and tscrollx work (they were disabled from the early buildbot on c3.2xlarge experiment)
* fix talos_from_code to work with tooltool (similar to first point)
* clean up configs so treeherder names are proper
* figure out how to run the tc-worker/etc. reliably from a fresh boot; right now I ssh in and run a command manually, and the session needs to stay open (including my vpn connection)
* determine if getting 15 more loaners is the right approach, or if we should do this outside of the loaner world.
* reimage the additional machines to ubuntu 14.04
* land changes in taskcluster-worker to support local-worker scenario officially

I am sure there are a few other steps, I will probably pick this up later this week with bugs/needinfo's/etc. as needed.
We also use v4l2loopback to give us graphics for the docker image. I am not sure whether we need physical graphics hardware, but we need to confirm that for all tests.  We have canvasmark and glterrain; those would be interesting.
brief update here-
* I have a docker image that is similar to our desktop-test image, but has nvidia drivers which match the host (hand edited host): https://hub.docker.com/r/elvis314/ubuntu_with_nvidia_driver/
* the host has a window manager and video driver installed: NVIDIA-Linux-x86_64-361.28.run 
* I have hacked up the tc-worker to allow mounting the /tmp/.X11-unix directory as rw
* I have hacked up the tc-worker to allow privileged = true
* I had to edit the test-linux.sh script to use DISPLAY=unix:1.0 instead of :0 when starting Xvfb/window manager
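
As a rough illustration of what the worker changes above amount to, here is a minimal Python sketch that builds the equivalent `docker run` argument list. The function name `build_docker_args` is hypothetical, and expressing this through the docker CLI is an assumption; the actual change was made inside tc-worker, whose configuration is not shown in this bug.

```python
def build_docker_args(image, display="unix:1.0", command=None):
    """Return an argv list for `docker run` with the GL-related options.

    Hypothetical helper: it mirrors the three changes listed above
    (rw mount of the X socket directory, privileged mode, DISPLAY override).
    """
    args = [
        "docker", "run", "--rm",
        "--privileged",                             # privileged = true
        "-v", "/tmp/.X11-unix:/tmp/.X11-unix:rw",   # share the host X sockets
        "-e", "DISPLAY=" + display,                 # unix:1.0 instead of :0
        image,
    ]
    if command:
        args.extend(command)
    return args


if __name__ == "__main__":
    # Print the command line rather than executing it.
    print(" ".join(build_docker_args(
        "elvis314/ubuntu_with_nvidia_driver", command=["glxinfo"])))
```

The list form can be handed to `subprocess.run` without shell quoting issues if one wants to actually launch the container.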

So far this is where I am: the docker image runs jobs, but I am still iterating on getting 'glxinfo' to run successfully inside it. Once that is done and talos runs, I can replicate this at a larger scale (my 5 loaners) and get a few hundred jobs to see whether this is noisy or not.
Some success: I have installed the host driver properly so that the host itself can do DRI and GLX.  This is better, and magically the docker image can match the host with the above changes.

Now we get into problems, and it is easiest to just see the log file:
https://public-artifacts.taskcluster.net/bBd_aWz9QFmZalHrZNZhcQ/0/public/logs/live_backing.log

and the problem:
09:27:55     INFO -  PROCESS | 1231 | [1231] ###!!! ABORT: X_ShmDetach: BadShmSeg (invalid shared segment parameter): file /home/worker/workspace/build/src/toolkit/xre/nsX11ErrorHandler.cpp, line 157
09:27:55     INFO -  PROCESS | 1231 | [1231] ###!!! ABORT: X_ShmDetach: BadShmSeg (invalid shared segment parameter): file /home/worker/workspace/build/src/toolkit/xre/nsX11ErrorHandler.cpp, line 157
09:27:55     INFO -  PROCESS | 1231 | ExceptionHandler::GenerateDump cloned child 1297
09:27:55     INFO -  PROCESS | 1231 | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
09:27:55     INFO -  PROCESS | 1231 | ExceptionHandler::WaitForContinueSignal waiting for continue signal...



I can reproduce this by hand, and hacking around hasn't helped much.  In fact, this happens all the time, sometimes early on, sometimes not.
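
To get a feel for how often this abort shows up across runs, a small Python sketch like the following could count the `###!!! ABORT` lines in a downloaded log. The function name `count_aborts` is hypothetical, and the regular expression is only an assumption based on the two lines quoted above; other abort messages may be formatted differently.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt above (an assumption, not a spec).
ABORT_RE = re.compile(
    r"###!!! ABORT: (?P<request>[^:]+): (?P<error>.*?): "
    r"file (?P<path>[^,]+), line (?P<line>\d+)"
)


def count_aborts(log_text):
    """Count abort lines in a log, keyed by (X request, error text)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ABORT_RE.search(line)
        if match:
            counts[(match.group("request"), match.group("error"))] += 1
    return counts
```

Running this over the logs of many retriggers would show whether the failures are always the same `X_ShmDetach` abort or a mix of errors.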

Running with the same config and just skipping the volume mount of /tmp/.X11-unix (effectively not using glx), I am able to get the tests to run, although there are common errors about GLX missing each time the browser starts, and glxinfo doesn't work (as expected).

So either we need to fix this issue, or remove docker from the equation.  My thinking is that, since we can see some instances of the browser working, this is random/intermittent.  Possibly there is an IO problem, or the host and docker are fighting for the file.  I have a bit more fiddling to do on the host before I call this a lost cause.
oh, good news, some hackery on the host.

I did:
stop x11
wait...
start x11
run test manually, a-ok!
repeat

I bet I don't have the host set up properly, as my manual process for getting X to launch is a bit of random hacking.

I think the x11 stuff is really:
stop Xsession
stop lightdm

then run...and I have green on try:
https://public-artifacts.taskcluster.net/Ia1i4boKT22cebjTphoqNg/0/public/logs/live_backing.log

retriggered to see if it works twice
This script gets the NVIDIA driver installed/updated and working so that we can have proper GL support on our host.

For docker, I have all my work uploaded here:
https://github.com/jmaher/docker3d

The biggest concerns I have with docker + GL are:
1) we have to have the same version of the graphics driver on host+docker
2) we have to run docker in --privileged mode
3) we have to run |xhost +| on the host to disable any access controls to X
4) any slight changes to the host/docker could cause GL to fail- it isn't immediately evident that this is failing, so we would only know by examining log files which only happens when failures occur.
5) I got much higher rates of failures on the test jobs
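
Concerns 1 and 4 could be partly mitigated by a check that runs before each job and fails loudly. Below is a minimal Python sketch of such a check; the function names are hypothetical, and how the host and container driver versions are obtained is deliberately left out, since that is not described in this bug. It relies only on the `direct rendering:` line that `glxinfo` prints.

```python
def direct_rendering_enabled(glxinfo_output):
    """Return True if glxinfo output reports 'direct rendering: Yes'."""
    for line in glxinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "direct rendering":
            return value.strip().lower().startswith("yes")
    return False  # no such line, e.g. glxinfo failed to open the display


def gl_preflight(host_driver, container_driver, glxinfo_output):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if host_driver != container_driver:
        problems.append("driver mismatch: host %s vs docker %s"
                        % (host_driver, container_driver))
    if not direct_rendering_enabled(glxinfo_output):
        problems.append("glxinfo does not report direct rendering")
    return problems
```

A job wrapper could refuse to start talos when the returned list is non-empty, so a broken GL setup shows up as an explicit failure rather than only in the logs.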

Now for the data...using the same methodology as before here are the numbers:
                Delta % avg delta %     stddev  avg stddev
ix.bb            20.02  0.4              71.59  1.43
c3xlarge.bb      80.78  1.97            205.87  5.02
c3large.tc       35.77  0.89            127.01  3.17
p.net.t1        122.32  2.78            258.64  5.88
ix.tc*           16.06  0.37             46.66  1.06
ix.tc.gl**      174.51  2.18            166.37  2.08

* ix.tc previously had different deltas due to not taking the abs() value, the stddev are the same.
** http://localhost:8000/perf.html#/compare?originalProject=try&originalRevision=65ef7d96a008&newProject=try&newRevision=52b6078049d495b6ee3f366ee48d0fce843917ba&framework=7&filter=opt&showOnlyImportant=0
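
The exact methodology behind these columns is not spelled out here, so the following Python sketch is only one plausible reading: per test, take the absolute percentage change between the base and new means and the relative standard deviation of the new runs, then report the sum and the average over all tests. The function name `summarize` and the input layout are assumptions. It also shows why the abs() mentioned in the first footnote matters: signed deltas from different tests cancel each other out.

```python
from statistics import mean, stdev


def summarize(tests):
    """tests maps a test name to (base_runs, new_runs), each a list of numbers.

    Assumed definitions (not confirmed by the bug):
      delta %  = |mean(new) - mean(base)| / mean(base) * 100, per test
      stddev % = stdev(new) / mean(new) * 100, per test
    """
    deltas, spreads = [], []
    for base_runs, new_runs in tests.values():
        base_mean = mean(base_runs)
        new_mean = mean(new_runs)
        deltas.append(abs(new_mean - base_mean) / base_mean * 100.0)
        spreads.append(stdev(new_runs) / new_mean * 100.0)
    count = len(tests)
    return {
        "delta_pct": sum(deltas),
        "avg_delta_pct": sum(deltas) / count,
        "stddev_pct": sum(spreads),
        "avg_stddev_pct": sum(spreads) / count,
    }
```

With one test that moves up by 10% and another that moves down by 10%, the signed sum would be 0 while the absolute sum is 20, which matches the kind of change seen in the ix.tc row.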

Overall the taskcluster experiment on IX without GL was more stable than without taskcluster.  Turning on GL made this a cluster, with more than twice the level of noise that we currently have.

My recommendation for running Talos on taskcluster is that we run without docker on dedicated hardware machines in scl3.
The next step is to get a generic-worker running on linux (bug 1282011).
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
See Also: → 1280440
Product: TaskCluster → Firefox Build System