Closed Bug 1770562 Opened 2 years ago Closed 1 year ago

Perma raptor youtube OS X 10.15 WebRender Shippable failing as exception with deadline exceeded

Categories

(Testing :: Raptor, defect, P2)

Tracking

RESOLVED FIXED
Tracking Status
firefox-esr91 --- unaffected
firefox100 --- unaffected
firefox101 --- unaffected
firefox102 --- wontfix
firefox103 --- wontfix
firefox111 --- fixed

People

(Reporter: imoraru, Assigned: aglavic)

Details

(Keywords: intermittent-failure, Whiteboard: [retriggered])

Attachments

(1 file)

The failure has no failure log.

Last green jobs were on May 18th.

Push where ytp-h264-p job failed for the first time.
Push where ytp-v9-p job failed for the first time.

Hi Andrew! Can you please take a look at this?
Thank you!

Flags: needinfo?(aerickson)

Because these started failing on adjacent pushes rather than on the same push, could this be related to the macOS testing machines?

Flags: needinfo?(gmierz2)
Flags: needinfo?(dhouse)
Flags: needinfo?(aerickson)
Summary: Perma OS X 10.15 WebRender Shippable failing as exception with deadline exceeded → Perma raptor youtube OS X 10.15 WebRender Shippable failing as exception with deadline exceeded
Whiteboard: [retriggered]

:dhouse, it looks like the Mac power testing machines are unresponsive, could you take a look?

Flags: needinfo?(gmierz2)
Severity: -- → S3
Priority: -- → P2
Whiteboard: [retriggered] → [retriggered][stockwell needswork:owner]

generic-worker is failing because of errors from taskcluster like:

{
  "code": "ResourceNotFound",
  "message": "Worker pool releng-hardware/gecko-t-osx-1014-power does not exist\n\n---\n\n* method:     registerWorker\n* errorCode:  ResourceNotFound\n* statusCode: 404\n* time:       2022-06-01T15:23:45.161Z",

I'm reconfiguring them so they get connected.
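The error excerpt above is truncated after the `message` field, but that message already names the failing call (`registerWorker`) and the worker pool that could not be found. As a minimal triage sketch, assuming the body is shaped as shown, a few lines of Python can pull those details out of a log. The helper name `summarize_registration_error` is hypothetical and is not part of taskcluster or generic-worker.

```python
import re

# Error body from the comment above. Only the two fields visible in the
# (truncated) excerpt are reproduced here.
error = {
    "code": "ResourceNotFound",
    "message": "Worker pool releng-hardware/gecko-t-osx-1014-power does not exist\n\n---\n\n* method:     registerWorker\n* errorCode:  ResourceNotFound\n* statusCode: 404\n* time:       2022-06-01T15:23:45.161Z",
}


def summarize_registration_error(err: dict) -> dict:
    """Hypothetical helper: extract the missing pool id and the '* key: value' details."""
    message = err.get("message", "")
    # The first line names the worker pool that the lookup failed on.
    match = re.search(r"Worker pool (\S+) does not exist", message)
    # The trailing bullet lines look like "* statusCode: 404".
    details = dict(re.findall(r"^\* (\w+):\s+(.+)$", message, flags=re.MULTILINE))
    return {
        "code": err.get("code"),
        "missing_pool": match.group(1) if match else None,
        **details,
    }


summary = summarize_registration_error(error)
print(summary["missing_pool"])  # releng-hardware/gecko-t-osx-1014-power
print(summary["statusCode"])    # 404
```

Here the extracted pool id still carries the old `1014` OS version, which is the mismatch the rest of this thread tracks down.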

:sparky they're claiming work again now. Please let me know if there are any problems.

Flags: needinfo?(dhouse) → needinfo?(gmierz2)

Thank you very much :dhouse! Will do :)

Flags: needinfo?(gmierz2)

Hi :dhouse, we are still having issues can you investigate?

Flags: needinfo?(dhouse)

:andrej I fixed this. It was an issue with the scopes the workers need to pull from the 1015 queue. I've set up some monitoring to check whether this happens again.
Can you confirm if the mac power testers are working correctly now?

Flags: needinfo?(dhouse) → needinfo?(aglavic)
Assignee: nobody → dhouse

Hi :dhouse

Here is some Mac power data. Is this what you were referring to? (As of posting this, it looks like there has been nothing since mid-May.)

https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightCommonAlerts=0&series=mozilla-central,3370249,1,10&timerange=7776000

Flags: needinfo?(dhouse)

(In reply to Kash [:kshampur] ⌚EST from comment #14)

This task record is on the previous queue "releng-hardware/gecko-t-osx-1014-power"

:andrej can you switch these to run on the pool "releng-hardware/gecko-t-osx-1015-power" ?
If you want, I can just switch them to using the old queue name. Or do you want to rename the queue to remove the macos version like "gecko-t-osx-power"?

Flags: needinfo?(dhouse)

:sparky can you switch these to using the new pool name? Having the OS version hard-coded doesn't seem ideal, so if you want to change the pool name, I can switch it, no problem (or switch back to the 1014 pool name just to make this work).

Flags: needinfo?(gmierz2)

:andrej, could you take care of making the change? We'd need to change these lines of code: https://searchfox.org/mozilla-central/search?q=t-osx-1014-power&path=&case=false&regexp=false

Flags: needinfo?(gmierz2)
Flags: needinfo?(aglavic)
Flags: needinfo?(aglavic)

Got it, I'll get straight on this

Flags: needinfo?(aglavic)

What we are doing:
Switching mac power data to run on osx-1015 from osx-1014

Why:
Generic workers were failing and attempting to pull from osx-1014, so we are setting them to pull from the osx-1015 worker pool

Attachment #9284530 - Attachment description: Bug 1770562 - Perma raptor youtube. r=#perftest → Bug 1770562 - Perma Fail raptor youtube. r=#perftest
Blocks: andrej22H2
Attachment #9284530 - Attachment description: Bug 1770562 - Perma Fail raptor youtube. r=#perftest → Bug 1770562 - Perma raptor youtube. r=#perftest
Attachment #9284530 - Attachment description: Bug 1770562 - Perma raptor youtube. r=#perftest → Bug 1770562 - Use macosx-1015 for power tests instead of macosx-1014. r=#perftest
Pushed by aglavic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/016cbd986226
Use macosx-1015 for power tests instead of macosx-1014. r=dhouse
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 104 Branch

Since nightly and release are affected, beta will likely be affected too.
For more information, please visit auto_nag documentation.

The patch landed in nightly and beta is affected.
:dhouse, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox103 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(dhouse)

:andrej what do you think? I'm re-assigning to you since I do not know if it needs to be uplifted.

Assignee: dhouse → aglavic
Flags: needinfo?(dhouse) → needinfo?(aglavic)

I don't think this needs to be uplifted

Flags: needinfo?(aglavic)

Hi dhouse,
Can you advise on the issue? You mentioned in my Phabricator patch that the worker pool was not registered with Taskcluster but that it should work now after you re-enabled the API key.

Flags: needinfo?(aglavic) → needinfo?(dhouse)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 104 Branch → ---
Whiteboard: [retriggered][stockwell unknown] → [retriggered]

(In reply to Andrej (:andrej) from comment #30)

I got them working again last night: I checked, and my API key had been disabled again. I will look into it more this week; I think some automation is disabling the API key.

No intermittent failures in the past 2 weeks; closing the bug.

Status: REOPENED → RESOLVED
Closed: 2 years ago → 2 years ago
Resolution: --- → FIXED
Flags: needinfo?(dhouse)

Hi Andrej! Can you please take a look at this? It is happening again, last green job was on Feb 10th: range from last green job.

Status: RESOLVED → REOPENED
Flags: needinfo?(aglavic)
Resolution: FIXED → ---

The worker these ran on is offline and gecko-t-osx-1015-power pool lists no workers online. masterwayz, is this something you could look into as I see dhouse is no longer available? Thank you.

Flags: needinfo?(mgoossens)

Will look into this! It will probably be later today, because someone else first needs to fix some Bitbar access permissions for me.
And yes, I'll have to be pinged instead of dhouse.
(leaving NI in place)

masterwayz, any updates on these workers?

I lack the access/context here so passing this along.

Flags: needinfo?(mgoossens) → needinfo?(aerickson)

Michelle, I was able to ssh into these hosts.

Details at https://mozilla-hub.atlassian.net/wiki/spaces/ROPS/pages/179601939/Bitbar+Mac+Hosts.

Please let me know if you have any issues.

Flags: needinfo?(aerickson) → needinfo?(mgoossens)

Ah right, those documents didn't exist yet, thanks for making them!
I will do this once I get access to that SSH key...

Jobs should be running now again!

Flags: needinfo?(mgoossens)

Jobs are picked up again, thank you for the fix.

Status: REOPENED → RESOLVED
Closed: 2 years ago → 1 year ago
Flags: needinfo?(aglavic)
Resolution: --- → FIXED