Closed Bug 1285582 Opened 8 years ago Closed 8 years ago

Xvfb not running in TASKCLUSTER_INTERACTIVE mode

Categories

(Firefox Build System :: Task Configuration, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED
mozilla50

People

(Reporter: ahal, Assigned: ahal)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

In test-linux.sh, we start Xvfb:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/scripts/tester/test-linux.sh#74

However, after starting an interactive shell, running ps -ef | grep Xvfb shows that it is not running. And to back this up tests that need to open Firefox fail with "Cannot open display :0". Running Xvfb in the background and then re-running tests causes them to work.

Here is the relevant log, I'm not sure how to interpret it (I kind of wish taskcluster didn't also dump the contents of the script):
> # run XVfb in the background, if necessary
> if $NEED_XVFB; then
> Xvfb :0 -nolisten tcp -screen 0 1600x1200x24 \
>  ~/artifacts/public/xvfb.log 2>&1 &
> export DISPLAY=:0
> xvfb_pid=$!
> # Only error code 255 matters, because it signifies that no
> # display could be opened. As long as we can open the display
> # tests should work. We'll retry a few times with a sleep before
> # failing.
> retry_count=0
> max_retries=2
> xvfb_test=0
> until [ $retry_count -gt $max_retries ]; do
> xvinfo || xvfb_test=$?
> if [ $xvfb_test != 255 ]; then
> retry_count=$(($max_retries + 1))
> else
> retry_count=$(($retry_count + 1))
> echo "Failed to start Xvfb, retry: $retry_count"
> sleep 2
> fi
> done
> if [ $xvfb_test == 255 ]; then fail "xvfb did not start properly"; fi
> fi
> + true
> + export DISPLAY=:0
> + DISPLAY=:0
> + xvfb_pid=34
> + retry_count=0
> + max_retries=2
> + xvfb_test=0
> + '[' 0 -gt 2 ']'
> + xvinfo
> + Xvfb :0 -nolisten tcp -screen 0 1600x1200x24
Is this only in interactive mode?  I wonder if this is related to bug 1263815

There are some weird things around starting Xvfb, for sure.
How would I test if it's working in non-interactive mode? The test jobs are all green if that says anything..

Though from the log above, it looks like the Xvfb call is succeeding? I guess either it's falling over shortly after or else is returning an error other than 255? I'm not familiar with X display servers or how they work. I've attached the contents of ~/artifacts/public/xvfb.log
Yeah, that's about what I saw in bug 1263815: Xvfb's log just stops, with no actual error message.  I suspect it's segfaulting or some other such joyfulness.
The weird(er) thing is that when I run Xvfb manually I see a very similar log, except everything magically works. Here's a copy pasta from an interactive shell.
(The output there is a bit munged because of Xvfb being run as a background process, but the important thing to note is that it doesn't fall over or anything)
Actually, the major difference between those two outputs is this line found in the first attachment but not the second:
_XSERVTransmkdir: ERROR: euid != 0,directory /tmp/.X11-unix will not be created.
I verified that when running in the context of an interactive shell, we are running as root. Whereas when running in the context of test-linux.sh, we have dropped down to the worker user due to:
https://dxr.mozilla.org/mozilla-central/source/testing/docker/desktop-test/bin/test.sh#15

So perhaps Xvfb just needs to be run as root. But there are still questions I don't know the answer to:

1) How do we run this as root without running the rest of the script as root?
2) This used to work, what regressed it? That fallback to the worker user looks like it's been there for a long time, I think longer than this has been broken..

Finally, when running Xvfb from an interactive shell tests pass and the display session shows up in the interactive displays section. However when connecting to that session, I only see a black screen (when reftests are running I see the tests flicker in and out a bit, but still mostly just a black screen). I think this might be worth filing a separate bug for once the Xvfb issue is fixed.
Given comment 7, do you have any thoughts on how I should proceed?
Flags: needinfo?(dustin)
(In reply to Andrew Halberstadt [:ahal] from comment #7)
> Finally, when running Xvfb from an interactive shell tests pass and the
> display session shows up in the interactive displays section. However when
> connecting to that session, I only see a black screen (when reftests are
> running I see the tests flicker in and out a bit, but still mostly just a
> black screen). I think this might be worth filing a separate bug for once
> the Xvfb issue is fixed.

Nvm about this issue. It happens because x11vnc and XSession fail due to Xvfb not running.. starting those manually gets this working again. Though would still like feedback on how best to run Xvfb as root.
That is surprising, and indeed the changes to run as 'worker' landed about 10 months ago (bug 1199379).

Probably the best way to run Xvfb as root would be to use `sudo` to launch it.  If that doesn't work for whatever reason (we had some obscure issues with pulseaudio, bug 1214809), then you may need to run Xvfb from testing/docker/desktop-test/bin/test.sh before dropping privileges.. which could be even more difficult :(
Flags: needinfo?(dustin)
Is there a way to use sudo without hardcoding the password somewhere? Or is that not important.
Yes, we could just add that command to /etc/sudoers for worker in the desktop-test image.

I could have sworn we had code in place to define sudoers already, but I don't see it, so it would need to be added.
I tried this patch:
https://hg.mozilla.org/try/rev/701e5956d8d6dc10dd07d54107fb2f8a465b6c63

But it still didn't work. All 'sudo' invocations output "sudo: unable to resolve host taskcluster-worker". Apparently that error means /etc/hosts isn't set up properly (why?). But I think this is a red herring anyway as the commands still proceed to work after that error is printed.

So to recap, I was able to run sudo Xvfb without requiring a password, but upon opening an interactive shell Xvfb is still not working. So maybe running as root wasn't the problem after all. I can see no other obvious errors in either stdout or xvfb.log. The only idea I have now is to improve logging in the test-linux.sh script and hope something jumps out.
So I'm dumb..

When using interactive mode, the process running test-linux.sh terminates almost immediately which means so does the Xvfb process it spawns. I was getting tripped up because for some reason I thought this used to work.. but I must have only been testing xpcshell which doesn't require Xvfb, so only thought it used to work.

I guess the solutions are:
1) Run under screen session
2) Start Xvfb in the taskcluster-interactive.sh script

I'm leaning towards 2) because I'm not sure what the implications are (if any) of leaving screen sessions lying around.
Screen sessions won't hurt (Docker will kill them when cleaning up the container).  But I prefer #2 too.
So I actually changed my mind and ended up going with screen, because I couldn't think of a good way to share code between the two. Not that duplicating is that terrible in this case I guess. But since I have the patch, I'll upload so you can take a look. Feel free to r- and I'll just resubmit with the code duplicated in that case.
Another benefit of screen is that we don't need to check to see if Xvfb is already running (in the event that the developer starts multiple interactive shells).
Normally we start Xvfb as a background task, then run the tests from the
same script. However, in interactive mode we were starting Xvfb, the script
would exit, and then we would potentially run tests later on from another
script. Unforunately this meant that Xvfb was dying with the first script
and tests would fail.

This patch runs Xvfb in a screen session so that it will still be available
later on when running an interactive shell.

Review commit: https://reviewboard.mozilla.org/r/64932/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/64932/
Attachment #8771974 - Flags: review?(dustin)
https://reviewboard.mozilla.org/r/64932/#review61896

::: taskcluster/scripts/tester/test-linux.sh:156
(Diff revision 1)
>    # arguments in from our own invocation
>    ${mozharness_bin};
> +  screen -XS xvfb quit
>  fi
> +
> +ps -ef | grep Xvfb

This is a leftover debug statement.. TODO: Remove
Assignee: nobody → ahalberstadt
Priority: -- → P1
https://reviewboard.mozilla.org/r/64932/#review62350

::: taskcluster/scripts/tester/test-linux.sh:78
(Diff revision 1)
> -# run XVfb in the background, if necessary
> +# run Xvfb in the background, if necessary
>  if $NEED_XVFB; then
> -    Xvfb :0 -nolisten tcp -screen 0 1600x1200x24 \
> -       > ~/artifacts/public/xvfb.log 2>&1 &
> +    screen -dmS xvfb Xvfb :0 -nolisten tcp -screen 0 1600x1200x24 \
> +       > ~/artifacts/public/xvfb.log 2>&1
>      export DISPLAY=:0
>      xvfb_pid=$!

Uhoh :(

This is going to get the wrong pid, which means that the cleanup command fails later.  I don't remember why we needed to clean things up (docker's pretty good at killing)

Also, this is going to conflict with jmaher's work on bug 1263815, where I just encouraged him to use the xvfb_pid to see if xvfb started successfully.

I could be convinced that killing xvfb in cleanup isn't necessary, if Joel agrees.

::: taskcluster/scripts/tester/test-linux.sh:153
(Diff revision 1)
>  # In interactive mode, the user will be prompted with options for what to do.
>  if [ "$TASKCLUSTER_INTERACTIVE" != "true" ]; then
>    # run the given mozharness script and configs, but pass the rest of the
>    # arguments in from our own invocation
>    ${mozharness_bin};
> +  screen -XS xvfb quit

This should probably be in cleanup() too?  If this will have the effect of killing xvfb (does it listen to SIGHUP?) then maybe the pid really isn't useful..
(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> Uhoh :(
> 
> This is going to get the wrong pid, which means that the cleanup command
> fails later.  I don't remember why we needed to clean things up (docker's
> pretty good at killing)
> 
> Also, this is going to conflict with jmaher's work on bug 1263815, where I
> just encouraged him to use the xvfb_pid to see if xvfb started successfully.
> 
> I could be convinced that killing xvfb in cleanup isn't necessary, if Joel
> agrees.

Oh right! I'll remove the xvfb_pid stuff for this patch so it's easier to see, but the jury can still be out on whether we take it or not. Cc'ing jmaher so he is aware.

 
> This should probably be in cleanup() too?  If this will have the effect of
> killing xvfb (does it listen to SIGHUP?) then maybe the pid really isn't
> useful..

Sure I can move it to cleanup(), we'll just need a second TASKCLUSTER_INTERACTIVE check. I'm *pretty sure* this kills it. I'll do some more testing to make sure though.
Comment on attachment 8771974 [details]
Bug 1285582 - Make sure Xvfb is running for interactive tasks,

https://reviewboard.mozilla.org/r/64932/#review62910
Attachment #8771974 - Flags: review?(dustin) → review-
Comment on attachment 8771974 [details]
Bug 1285582 - Make sure Xvfb is running for interactive tasks,

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/64932/diff/1-2/
Attachment #8771974 - Flags: review- → review?(dustin)
Comment on attachment 8771974 [details]
Bug 1285582 - Make sure Xvfb is running for interactive tasks,

https://reviewboard.mozilla.org/r/64932/#review63644
Attachment #8771974 - Flags: review?(dustin) → review+
there will be a conflict as on mozilla-inbound I landed a rename to test-linux.sh.
Mercurial can auto-merge a file rename (was able to rebase my change on top of yours locally without manual intervention). So autoland should be fine in this case.
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/70b5b544b245
Make sure Xvfb is running for interactive tasks, r=dustin
https://hg.mozilla.org/mozilla-central/rev/70b5b544b245
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: