Closed Bug 1612100 Opened 5 years ago Closed 2 years ago

Intermittent /css/css-shapes/shape-outside/shape-image/gradients/shape-outside-linear-gradient-014.html | Testing /shape-outside-linear-gradient-014.html == shape-outside-linear-gradient-014.html

Categories

(Core :: Layout: Floats, defect, P5)

RESOLVED WORKSFORME

People

(Reporter: intermittent-bug-filer, Assigned: CosminS)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])

Attachments

(1 file)

Filed by: apavel [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=286837691&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Ukf_aeZJTVau8Et6nVCn8g/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Ukf_aeZJTVau8Et6nVCn8g/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1


[task 2020-01-29T11:52:32.805Z] 11:52:32 INFO - TEST-START | /css/css-shapes/shape-outside/shape-image/gradients/shape-outside-linear-gradient-014.html
[task 2020-01-29T11:52:32.805Z] 11:52:32 INFO - PID 7648 | 1580298752799 Marionette INFO Testing http://web-platform.test:8000/css/css-shapes/shape-outside/shape-image/gradients/shape-outside-linear-gradient-014.html == http://web-platform.test:8000/css/css-shapes/shape-outside/shape-image/gradients/reference/shape-outside-linear-gradient-001-ref.html
[task 2020-01-29T11:52:32.879Z] 11:52:32 INFO - PID 7648 | [Child 5608, Main Thread] WARNING: Trying to request nsIHttpChannel from DocumentChannel, this is likely broken: file z:/build/build/src/netwerk/ipc/DocumentChannel.cpp, line 63
[task 2020-01-29T11:52:33.036Z] 11:52:33 INFO - PID 7648 | 1580298753025 Marionette INFO No differences allowed
[task 2020-01-29T11:52:33.133Z] 11:52:33 INFO - TEST-UNEXPECTED-FAIL | /css/css-shapes/shape-outside/shape-image/gradients/shape-outside-linear-gradient-014.html | Testing http://web-platform.test:8000/css/css-shapes/shape-outside/shape-image/gradients/shape-outside-linear-gradient-014.html == http://web-platform.test:8000/css/css-shapes/shape-outside/shape-image/gradients/reference/shape-outside-linear-gradient-001-ref.html
[task 2020-01-29T11:52:33.133Z] 11:52:33 INFO - Found 40000 pixels different, maximum difference per channel 255

Whatever made this failure much more frequent appears to be considerably more recent than that: probably 2020-04-22, around 18:00-22:00 UTC.

In other words, I'm more interested in this range for the increased frequency.

These frequent failures might be related to issues on the Windows machines. Some examples of machines the tests are failing on:
https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/t-win10-64-gpu-s/workers/aws/i-0f0d0ddf12600a870
https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/t-win10-64-gpu-s/workers/aws/i-07500595c40f5831d
https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/t-win10-64-gpu-s/workers/aws/i-066bb79c93076d308
https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/t-win10-64-gpu-s/workers/aws/i-09508deb8b775b311

All the machines from the push where these failures started were later terminated, and the later retriggers are green.
Range 1 and Range 2 of retriggers. The failures here don't look in any way related to https://hg.mozilla.org/integration/autoland/rev/7f4f1d605c69b3a471727810b6fc876187e8211e
Maybe Rob has more info about the state of the Windows machines. There was also a spike of retries on Windows jobs for a couple of days.

Flags: needinfo?(rthijssen)

we had an issue with windows workers during a recent github outage that resulted in a number of machines going idle (doing no work but being included in the number of running workers). i terminated a large number of machines over a period of several hours. i used the script at https://gist.github.com/grenade/63bf380b79b995065cb6530df34725c8 to make the determination as to whether a machine was idle or productive, by querying the taskcluster api to see if the instance had recent task runs associated with it. if the api response indicated that the instance had not run a task in the preceding thirty minutes, it was terminated.
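The idle check described above can be sketched as follows. This is a minimal illustration, not the linked gist: the `is_idle` function, the `(started, resolved)` tuple shape, and the way run timestamps are obtained are all assumptions made for the example. Only the thirty-minute threshold comes from the comment. The sketch treats a run that has not resolved yet as activity, since a check based only on how recently a task started could misclassify a worker that is partway through a long task. That is one hypothetical way such a check could go wrong, not a diagnosis of the actual script.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

# Threshold taken from the comment above: no task in the preceding thirty minutes.
IDLE_THRESHOLD = timedelta(minutes=30)

# Hypothetical record shape: (started, resolved); resolved is None while the run is in progress.
TaskRun = Tuple[datetime, Optional[datetime]]


def is_idle(runs: Iterable[TaskRun], now: datetime,
            threshold: timedelta = IDLE_THRESHOLD) -> bool:
    """Return True if the worker shows no task activity within `threshold` of `now`."""
    for _started, resolved in runs:
        if resolved is None:
            # A run that is still in progress counts as productive work.
            return False
        if now - resolved <= threshold:
            # A run finished recently enough to count as activity.
            return False
    return True


if __name__ == "__main__":
    now = datetime(2020, 4, 23, 12, 0, tzinfo=timezone.utc)
    long_running = [(now - timedelta(hours=2), None)]
    print(is_idle(long_running, now))
```

With this shape, a worker whose only run started two hours ago and is still running is reported as not idle, while a worker with no runs at all is reported as idle.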

since we had a spike of retries, it is apparent that some of those terminations were against machines that must have actually been doing productive work. i include a link to the script above in case anyone wishes to scrutinise it for flaws.

i apologise for the inconvenience caused and can only say that we were in a situation where we had to get rid of a lot of unproductive workers in order to reduce a large task backlog that at the time we were unable to create new capacity for.

Flags: needinfo?(rthijssen)

/cc :jrmuizel

:grenade - can you find out what graphics cards are in the Windows GPU workers? The failures are only happening on windows-qr builds, i.e. with WebRender enabled. The regression range ("range 1" from comment 18) points to either bug 1632239 or bug 1624988, both of which are plausible candidates as they touch WebRender code. The first of the two is specific to Windows and certain graphics cards, so if the workers this is running on have those graphics cards, that might explain the problem.

Flags: needinfo?(rthijssen)

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #21)
> /cc :jrmuizel
>
> :grenade - can you find out what graphics cards are in the Windows GPU workers? The failures are only happening on windows-qr builds, i.e. with WebRender enabled. The regression range ("range 1" from comment 18) points to either bug 1632239 or bug 1624988, both of which are plausible candidates as they touch WebRender code. The first of the two is specific to Windows and certain graphics cards, so if the workers this is running on have those graphics cards, that might explain the problem.

the best way to get current information about worker systems is to create a task on the worker type you are interested in, with commands that will query the system for the information you need. to answer the question above, i created tasks with definitions similar to:

retries: 0
created: '2020-04-28T08:23:55.176Z'
deadline: '2020-05-01T08:43:34.349Z'
expires: '2021-05-02T08:43:34.349Z'
provisionerId: gecko-t
workerType: t-win10-64-gpu-s
priority: highest
tags: {}
scopes: []
payload:
  command:
    - wmic path win32_VideoController get name
    - 'wmic path win32_VideoController get /all /format:table'
    - 'wmic path win32_VideoController get /all /format:list'
    - 'wmic path win32_VideoController get /all /format:csv'
  maxRunTime: 60
extra: {}
metadata:
  name: determine video controller on gecko-t/t-win10-64-gpu-s
  description: |-
    ## query wmic for video controller info
    this task demonstrates how to query wmic for video controller metadata
    - determine the name of the video controller
    - get metadata about video controller in table format
    - get metadata about video controller in list format
    - get metadata about video controller in csv format
  owner: grenade@mozilla.com
  source: 'https://bugzilla.mozilla.org/show_bug.cgi?id=1612100#c21'

these tasks show that the video controllers present are:

  • gecko-t/t-win10-64-gpu-s:
    • NVIDIA Tesla M60
    • Microsoft Basic Display Adapter
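
For anyone collecting this from several worker types, the names listed above can be pulled out of the raw output of the first command (`wmic path win32_VideoController get name`) with a few lines of Python. This is a sketch under assumptions: `parse_wmic_names` is a hypothetical helper, and the sample text is reconstructed from the names listed above rather than copied from a task log. It assumes the output is a `Name` header line followed by one padded value per line.

```python
from typing import List


def parse_wmic_names(output: str) -> List[str]:
    """Extract values from `wmic ... get name` style output (header line, then one value per line)."""
    # Strip the column padding and drop the blank lines wmic emits around its output.
    lines = [line.strip() for line in output.splitlines()]
    lines = [line for line in lines if line]
    # Anything without the expected header is treated as unparseable.
    if not lines or lines[0].lower() != "name":
        return []
    return lines[1:]


if __name__ == "__main__":
    # Assumed sample, built from the controller names reported in this comment.
    sample = (
        "Name                             \r\n"
        "NVIDIA Tesla M60                 \r\n"
        "Microsoft Basic Display Adapter  \r\n"
        "\r\n"
    )
    for name in parse_wmic_names(sample):
        print(name)
```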
Flags: needinfo?(rthijssen)

Thanks!

Jeff, your patch affected AMD cards, so I would imagine that it shouldn't affect the above cards. So maybe it was Glenn's patch (bug 1624988) that caused it? Either way it should be possible to do a try push with the changes backed out and see if the problem still occurs.

... and of course both backouts still have the intermittent failures.

Try push with both changes backed out https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=d7cc8f1d3b30f87303478c7c211e74777cd8e1ba also has the problem. I'm doing more retriggers on the seemingly-unrelated changes.

Assignee: nobody → csabou
Status: NEW → ASSIGNED
Pushed by csabou@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/194abeed612d Update expectations for shape-outside-linear-gradient tests on win qr. r=kats
Keywords: leave-open
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]

James, any idea why the webreftest failure started on a push which seems unrelated? See comment 30.

Flags: needinfo?(james)
Severity: normal → S3
Flags: needinfo?(james)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Keywords: leave-open
Resolution: --- → WORKSFORME