intermittent funsize partials timeouts

Status: NEW
Type: enhancement
Opened: 3 months ago
Last updated: 2 months ago

People: (Reporter: mtabara, Assigned: sfraser)
Tracking: ({leave-open})
Firefox Tracking Flags: (Not tracked)
Whiteboard: [releaseduty]
Attachments: (5 attachments)

RelMan pointed out that we've started to see funsize jobs failing with this lately. Subsequent reruns eventually go green, as in this case.

We've seen occurrences of the same issue in bug 1411358, but that bug seems too broad to track this.

jcristau also suggested that these timeouts should be less permissive: a normal task usually takes under 10 minutes, yet the timeout is set to 3600 seconds. Tightening it should reduce the end-to-end time we wait to see the jobs go green, since failing runs would be killed and retried sooner. Something similar to what we did in bug 1531072.

Interestingly, something inside make_incremental_update.sh is breaking, but we're not getting the logs. I will update the timeout as suggested and try to improve the reporting, along the lines of the sketch below.
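
A minimal sketch of what better reporting could look like, assuming the partials script is invoked from Python via subprocess; the function name, paths, and the 900s value are illustrative, not the landed patch:

```python
# Sketch only: capture stdout/stderr from make_incremental_update.sh and
# surface them on failure, instead of letting the output disappear.
import logging
import subprocess

log = logging.getLogger(__name__)

def run_make_incremental_update(workdir, from_dir, to_dir, output_mar):
    cmd = ["bash", "make_incremental_update.sh", output_mar, from_dir, to_dir]
    proc = subprocess.run(
        cmd, cwd=workdir,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=900,  # fail fast instead of hanging until the task timeout
    )
    if proc.returncode != 0:
        # Log the script's own output so failures are actually diagnosable.
        log.error("make_incremental_update.sh failed:\n%s", proc.stdout)
        proc.check_returncode()
    return proc.stdout
```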

Reporter updated (3 months ago): Whiteboard: [releaseduty]
Assignee updated (3 months ago): Assignee: nobody → sfraser
Assignee updated (3 months ago): Keywords: leave-open

Comment 3

3 months ago
Pushed by sfraser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/817014bcd372
Improve logging and timeouts in partials generation r=mtabara

Turns out 600s isn't enough for Linux64 ASan partial generation, where the job may take up to 36 minutes. Windows ASan takes ~20 minutes.
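
For illustration only, a per-platform override based on the numbers above might look like the following; the platform keys and second counts are assumptions, not the values that landed:

```python
# Hypothetical per-platform timeout table; only ASan builds get the long
# timeout, everything else keeps a tight one.
MAX_RUN_TIME = {
    "linux64-asan": 2700,   # observed runs up to ~36 minutes
    "win64-asan": 1800,     # observed runs around ~20 minutes
}
DEFAULT_MAX_RUN_TIME = 900  # plenty for the usual <10 minute jobs

def max_run_time(platform):
    return MAX_RUN_TIME.get(platform, DEFAULT_MAX_RUN_TIME)
```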

Attachment #9048364 - Attachment description: Bug 1532236 - longer timeout for asan partial generation, r?tomprince → Bug 1532236 - longer timeout for asan partial generation, r=tomprince

Comment 7

3 months ago
Pushed by nthomas@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1cc8b60d8a6b
longer timeout for asan partial generation, r=tomprince

jlund pointed out that we usually specify parameters like a timeout elsewhere, such as in the kind config; a rough sketch of that direction follows. My fix was only a quick follow-up to resolve the bustage in the next nightly.
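
A sketch of that direction, assuming taskgraph's TransformSequence API; this is not the landed change, just the shape of letting the kind config own the value while the transform only supplies a fallback:

```python
# Sketch: the kind config sets worker.max-run-time; the transform applies a
# default only when the kind config left it unset.
from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

@transforms.add
def default_max_run_time(config, tasks):
    for task in tasks:
        worker = task.setdefault("worker", {})
        # Don't override a timeout the kind config already specified.
        worker.setdefault("max-run-time", 3600)
        yield task
```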

Comment 11

3 months ago
Pushed by sfraser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bc4e03f4ea20
Remove extra newlines from partials logging r=mtabara

https://tools.taskcluster.net/groups/Jj8j2VzfQ-WCPqsqZHPGbA/tasks/MOuYO-Y2QyyvH0ft4ry44A/runs/0/logs/public%2Flogs%2Flive.log had an error in today's nightly, and it looks like it was diffing libxul.so:

2019-03-05 12:15:29,381 - WARNING - target.partial-1.mar: diffing "libxul.so"
2019-03-05 12:24:15,463 - WARNING - target.partial-1.mar: patch "libxul.so.patch" "libxul.so"

Each of the four partials took nine minutes to process that file, so caching appears to be disabled.

https://hg.mozilla.org/mozilla-central/rev/a91196fe2eb3#l2.64 introduced a default level, which was a string, not an int, so the caching has been broken for a while.
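
A minimal reproduction of that failure mode, with illustrative names rather than the actual transform code: a default that arrives as the string "3" never matches an integer comparison, so the cache branch is silently skipped.

```python
# The default level was introduced as a string rather than an int.
level = "3"

if level == 3:              # "3" == 3 is False in Python
    print("caching enabled")
else:
    print("caching silently skipped")

# The fix (comments 15/17): coerce to int in the transform before use.
level = int(level)
assert level == 3
```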

Comment 15

3 months ago
Pushed by sfraser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ce3dfcdb5861
Convert level into integer in partials transform r=mtabara

Comment 17

3 months ago
Pushed by sfraser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f6d891b25f43
Convert level into integer in partials transform r=mtabara
Assignee updated (3 months ago): Flags: needinfo?(sfraser)

Found in postmortem triaging.
sfraser: can we close this?

Flags: needinfo?(sfraser)

There are still a few errors to sort out with this, sorry.

Flags: needinfo?(sfraser)

Reinsert awscli for partials, which is needed for caching. Also update packages and fix the metrics recording.

Comment 23

2 months ago
Pushed by sfraser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7b2ae2ea0495
Reinsert awscli, required for partials caching r=mtabara

Comment 25

2 months ago
Pushed by sfraser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e108f9ad99a5
Reinsert awscli, required for partials caching r=mtabara

https://trello.com/c/roEzdOJx/172-partial-hu-linux-task-failing

Looks like this is failing at least some of the time on the new 67 beta (devedition).

aws-cli isn't found. Do we need to uplift, given this missed the merge, which happened a few hours earlier?

I think we enabled S3 caching for releases between 66 and 67, where previously we had no caching at all; see the scopes on the last 66 beta and devedition 67.0b1. I don't know whether that was deliberate. We did have local caching for releases until bug 1501113.

I think it'll either be a case of increasing the task timeout (even just to 20 minutes) or adding the caching in, as some of the files now take a lot longer to diff than they did at this point last year. The sketch below shows the caching idea.
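
A hedged sketch of the S3 caching idea: key the cached patch on a hash of the (from, to) file pair, so an expensive diff like libxul.so is computed once rather than once per partial. The bucket name, key scheme, and helper names are illustrative, not the production configuration.

```python
# Sketch: check S3 for an existing patch before running the diff.
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "example-partials-cache"  # hypothetical bucket name

def cache_key(from_path, to_path):
    h = hashlib.sha256()
    for path in (from_path, to_path):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return h.hexdigest()

def get_or_make_patch(from_path, to_path, make_patch):
    key = cache_key(from_path, to_path)
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        return obj["Body"].read()              # cache hit: skip the slow diff
    except s3.exceptions.NoSuchKey:
        patch = make_patch(from_path, to_path)  # e.g. the bsdiff step
        s3.put_object(Bucket=BUCKET, Key=key, Body=patch)
        return patch
```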

I'd vote for an uplift.

Flags: needinfo?(sfraser)