Closed Bug 1263815 Opened 4 years ago Closed 4 years ago

Intermittent [test-linux.sh:error] xvfb did not start properly

Categories

(Firefox Build System :: Task Configuration, task, P5)

Tracking

(Not tracked)

RESOLVED FIXED
mozilla51

People

(Reporter: philor, Assigned: garndt)

References

Details

(Keywords: intermittent-failure)

Attachments

(3 files)

Example failed job: https://tools.taskcluster.net/task-inspector/#Fc_6KMY0Sv2VeM79-Ug26A/0

Joel -- should this be given any attention?
Flags: needinfo?(jmaher)
we see this about once/day- while I would love to fix every issue, until this becomes more problematic we could do better (3600 second timeouts with balrog for example).  

My threshold is usually 30 per week before I look into it for patterns.
Flags: needinfo?(jmaher)
Component: General → Task Configuration
Priority: -- → P5
:selena, This problem is getting worse (5-6 times more frequent than 6 weeks ago), can we put this in the queue of things to look into?
Flags: needinfo?(sdeckelmann)
Looking at orange factor, here are something I've noticed:

* Generally low reports prior to May 1st (perhaps some more tests were enabled around this time?)
* Some spikes between May 18th and May 24th
* Big pike June 5th which started a trend of more reports.  Running 7 day average started to noticeably trend upwards since then.

As far as workers, the only thing I'm aware of deploying is a new ami on May 5th that added a statsum client.  Recreating this AMI could have caused packages to be updated.  We also have it auto update high priority patches on boot (like security updates), but I'm not sure if this could have contributed to it.

Looking at desktop-test image updates (at least those indexed by taskcluster), there was an image landing on m-c on 4-5 (4-4 for inbound) and the next one wasn't until 6-8 (6-7 for inbound).
failure:
 https://public-artifacts.taskcluster.net/Zf9Y20ceQROL8HNO8Mbp2g/0/public/test/public/xvfb.log

success:
 https://public-artifacts.taskcluster.net/Q3yVZzInTTquOT894Vj-yw/0/public/test/public/xvfb.log

I'm guessing Xvfb is segfaulting or something like that, as it seems to truncate the logging.  That, or it may just be really slow to start -- we're only giving it four seconds.  So potential solutions include increasing the number of retries waiting for it to start, and adding an outer loop to retry starting Xvfb a few times.
:garndt, want to try some experiments this week?
Flags: needinfo?(garndt)
isn't this something to change in taskcluster/scripts/tester/test-linux.sh?  Sounds like from comment 16 that the possible thing to try is increasing xvfb retries.
Flags: needinfo?(garndt)
Flags: needinfo?(sdeckelmann)
I'm attempting a trial here, I think the changes I made should cause it to try to retry.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=18e90571747053268f15f679f91dfb43c4735d3d
So I retriggered a bunch of jobs in comment 22 without a failure because of xvfb.  it's a small sample, so not definitive but I think at least what's in that try push isn't causing more harm than good.  I'm on PTO for 2 weeks so if someone wants to grab this and improve upon it so something could land before then, that's cool.  Just didn't want to leave this in limbo.
Comment on attachment 8771845 [details]
Bug 1263815 - retry xvfb startup.

https://reviewboard.mozilla.org/r/64820/#review62344

::: taskcluster/scripts/tester/test-linux.sh:76
(Diff revision 1)
> -# run XVfb in the background, if necessary
> +start_xvfb() {
> -if $NEED_XVFB; then
>      Xvfb :0 -nolisten tcp -screen 0 1600x1200x24 \
>         > ~/artifacts/public/xvfb.log 2>&1 &
> -    export DISPLAY=:0
>      xvfb_pid=$!

It occurs to me that we could short-circuit this a little by checking kill -0 $xvfb_pid.  If that fails it means that xvfb has exited, and there's no sense waiting for xvinfo to succeed.

::: taskcluster/scripts/tester/test-linux.sh:97
(Diff revision 1)
>          fi
>      done
> -    if [ $xvfb_test == 255 ]; then fail "xvfb did not start properly"; fi
> +    if [ $xvfb_test == 255 ]; then
> +        return false
> +    else
> +        return true

`return false` -> `return 1`
`return true` -> `return 0`

    foo() {
        return false
    }
    dustin@hopper ~ $ foo
    bash: return: false: numeric argument required

<p>r- for the return issue.  It's up to you if you want to do the <code>kill -0</code> trick or not.</p>
Attachment #8771845 - Flags: review?(dustin) → review-
Summary: Intermittent [test-linux.sh:error] xvfb did not start properly → Intermittent [test-linux.sh:error],[build-linux.sh:error] xvfb did not start properly
Huh, how did that break searching for bug suggestions?
Summary: Intermittent [test-linux.sh:error],[build-linux.sh:error] xvfb did not start properly → Intermittent [test-linux.sh:error] xvfb did not start properly
I haven't had luck in hacking this locally to make it work- I would like to find someone to hack on this.  :garndt, I will be on pto the remainder of this week, could you help find someone to pick this up?
Flags: needinfo?(garndt)
See Also: → 1293889
Flags: needinfo?(garndt)
Attachment #8779753 - Flags: review?(dustin)
disregard that review request.  I need to make another change and test it out on try.  I'm also not sure why I pushed up some changes and it didn't add it as another revision to the already existing patch.  I clearly can't reviewboard.
I think that may need a rebase, too, as IIRC those files aren't massively duplicated anymore.
Assignee: nobody → garndt
Reviewing the tasks here I did not come across that failed because of this issue.  It's a small sample but we can at least be sure that the changes shouldn't at least make existing functionality worse.  At best it helps reduce the rate of these.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=05afa70a10118bf03ccf9d77b988925cf5451932
Comment on attachment 8779753 [details]
Bug 1263815 - retry xvfb startup.

https://reviewboard.mozilla.org/r/70682/#review68198

Oh look, duplicate code. Bonus points if you factor the xvfb code into `testing/docker/recipes/xvfb.sh` then ADD via Dockerfile magic syntax and source it from these respective scripts.
Comment on attachment 8779753 [details]
Bug 1263815 - retry xvfb startup.

https://reviewboard.mozilla.org/r/70682/#review68198

Oh good idea.  I forgot about the recipes inclusion stuff.  I think in the case of the test scripts they use different resolutions, displays, and also start it up in screen so maybe not a complete reuse between build and test.
I suspect parameterizing the resolution argument, and starting in screen everywhere, would be adequate.  From the task-configuration side, I'd like to have build and test using the same script (so it's more of a generic invoke-mozharness script), so eliminating differences is a move in the right direction.  OTOH, if that's out of scope for this bug I can do it later.
This failure is happening a lot those days. Would there be a chance to get this fixed soon?
Comment on attachment 8779753 [details]
Bug 1263815 - retry xvfb startup.

https://reviewboard.mozilla.org/r/70682/#review70228

I cleared the r? earlier based on :gps's comments, but I don't mind landing this as-is, as long as the cleanup of common code happens too.
Attachment #8779753 - Flags: review+
Comment on attachment 8779753 [details]
Bug 1263815 - retry xvfb startup.

https://reviewboard.mozilla.org/r/70682/#review70230

I started a try job here if you want to check that before landing
Comment on attachment 8779753 [details]
Bug 1263815 - retry xvfb startup.

Pushed some changes and reflagging.  This broke out the common bits into a reusable recipe for the docker images.

Try run is here, and I retried some of the builds/tests to see if anything broke.  I do not see anything obvious.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bb9d95346dc3f0c94dd7ef053449a2a6a583279e&exclusion_profile=false&group_state=expanded&filter-tier=1&filter-tier=2&filter-tier=3
Attachment #8779753 - Flags: review+ → review?(dustin)
Comment on attachment 8779753 [details]
Bug 1263815 - retry xvfb startup.

https://reviewboard.mozilla.org/r/70682/#review70238

Looks great

::: testing/docker/recipes/xvfb.sh:21
(Diff revision 3)
> +        kill $xvfb_pid || true
> +        screen -XS xvfb quit || true
> +    fi
> +}
> +
> +# run mozharness in XVfb, if necessary; this is an array to maintain the quoting in the -s argument

I don't know what this comment is referring to..
Attachment #8779753 - Flags: review?(dustin) → review+
Backout by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/db37e1f887d1
Backed out changeset e6e5569551be for valgrind failures because Docker image generation isn't deterministic
This change has run into the same misfortune as bug 1272629 comment 34 regarding lack of determinism in the Docker image building process.

We can't make changes to the base centos6 build image right now because it busts valgrind for unknown reasons. You'll have to perform your hacks in the top-most Docker image until we figure out what's wrong with valgrind. Welcome to the wasp nest.
Comment on attachment 8779753 [details]
Bug 1263815 - retry xvfb startup.

This change moves installing screen into the desktop-build image instead of touching the base image that breaks valgrind.  I also attempted to fix a couple of others issues I noticed with retrying and detecting if it's an interactive task.
Attachment #8779753 - Flags: review+ → review?(dustin)
Comment on attachment 8779753 [details]
Bug 1263815 - retry xvfb startup.

https://reviewboard.mozilla.org/r/70682/#review71222

::: testing/docker/recipes/xvfb.sh:71
(Diff revisions 4 - 5)
>          else
>              retry_count=$(($retry_count + 1))
>              sleep 10
>          fi
>      done
> +    set -e

Were these just for debugging?  I think the -e output has confused a lot of people.
Attachment #8779753 - Flags: review?(dustin) → review+
Comment on attachment 8779753 [details]
Bug 1263815 - retry xvfb startup.

https://reviewboard.mozilla.org/r/70682/#review71222

> Were these just for debugging?  I think the -e output has confused a lot of people.

This was intentional.  It appears that when returning 1 cause the script to exit when running with +e so I just change the flag while I'm retrying.  perhaps I'm not setting this option in the correct way and there is something better to be done.  Another alterantive is not returning a value, but set a global variable.
https://hg.mozilla.org/mozilla-central/rev/6f22ddaf30e7
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Please rebase this for Aurora/Beta uplift as well.
Flags: needinfo?(garndt)
As seen in bug 1298026 this probably didn't completely fix the problem.  Perhaps it reduces the likelihood, but not a complete fix.  Before uplifting anything, I thinks someone will need to dig into why this is failing.  I'm not completely sure yet.
Flags: needinfo?(garndt)
if we don't have time to look into this in the next 1.5 weeks, I would like to get this uplifted to reduce the errors we see on aurora and then we will be merging aurora to beta in 2 more weeks.  We can live with much fewer instances downstream better than waiting for perfection.  I guess if we fix this in 3 weeks, we can uplift to aurora/beta.
Attached patch patch.diffSplinter Review
I think this patch resolves the conflicts with Aurora if you're able to check it out.
Flags: needinfo?(ryanvm)
Thanks!
Flags: needinfo?(ryanvm)
Whiteboard: [checkin-needed-aurora]
Whiteboard: [checkin-needed-aurora]
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.