Closed Bug 1554964 Opened 5 years ago Closed 4 years ago

Restore daily Chromium updates for Windows

Categories

(Testing :: Raptor, task, P1)

task

Tracking

(firefox75 fixed)

RESOLVED FIXED
mozilla75
Tracking Status
firefox75 --- fixed

People

(Reporter: igoldan, Assigned: onegru)

References

Details

Attachments

(2 files)

Description copy/pasted from Jira:

Investigate why the Chromium updates affected Raptor & fix the issue.

These are the first 2 patches which showed intermittents. Use the associated (unstable) Chromium revisions to work on the fix.

What happened:

Since May 23 we have pinned hardcoded Chromium revisions for the Windows & OSX platforms.

That's because on May 19 the new daily Chromium revisions caused lots & lots of Raptor intermittents on these 2 platforms. Very likely a breaking change landed in the browser, which affected the Raptor test runner.

We didn’t touch the Linux platforms, as the Chromium revisions there are stable.

I managed to reproduce the timeout twice on Windows 10, using Chromium 661171 (one of the 1st unstable revisions). You can see from the devtools logs that from a point on we're not getting the fcp metrics. This is what's causing the timeouts.

I know we disabled metrics like these in the past. Should we consider disabling fcp on Windows as well & removing the pinned Chromium versions?

Flags: needinfo?(rwood)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #2)

I know we disabled metrics like these in the past. Should we consider disabling fcp on Windows as well & removing the pinned Chromium versions?

I say Windows only, because I noticed that Chromium for OSX restabilized. I made a try push in which I removed the pinned versions.

Hmmm... Now that I think about it, we don't have a way of specifying metrics per platform.

Flags: needinfo?(rwood)

Anyway, I pushed to Try the removal of the fcp from Rap-Cr tasks, to consolidate my theory.

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #5)

Anyway, I pushed to Try the removal of the fcp from Rap-Cr tasks, to consolidate my theory.

I see that all pageload tests failed because of this, while all benchmarks ran successfully. Have I missed something?
Still, the success of the benchmarks (which previously failed) confirms there's something wrong when recording the fcp.

Flags: needinfo?(rwood)

Reproducing & debugging this locally didn't reveal anything new, except that at some point window.performance.getEntriesByType("paint") returns an empty array.

Once this happens, the next 10 reattempts become useless. They produce the same output and then the test officially times out & notifies the control server.
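The failure mode described above can be modelled with a small sketch. This is a simplified Python model, not Raptor's actual in-browser code: `get_paint_entries` stands in for `window.performance.getEntriesByType("paint")`, and the function name, the entry shape, and the sample start time are assumptions made for illustration.

```python
from typing import Callable, Optional


def poll_fcp(get_paint_entries: Callable[[], list],
             max_retries: int = 10) -> Optional[float]:
    """Return the first-contentful-paint start time, or None on timeout.

    Hypothetical model: get_paint_entries stands in for
    window.performance.getEntriesByType("paint").
    """
    # One initial attempt plus max_retries reattempts.
    for _ in range(max_retries + 1):
        for entry in get_paint_entries():
            if entry.get("name") == "first-contentful-paint":
                return entry["startTime"]
    # Every attempt saw no fcp entry: the caller would report a timeout.
    return None


# A healthy page exposes the entry; the broken case returns an empty list.
healthy = lambda: [{"name": "first-contentful-paint", "startTime": 412.5}]
broken = lambda: []

print(poll_fcp(healthy))  # 412.5
print(poll_fcp(broken))   # None
```

A real poller would wait between attempts; the delay is left out here to keep the model short. The point is that once the entry list comes back empty, every reattempt sees the same empty list, so the retries only postpone the timeout.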

Given this, I'm not sure how to proceed further. Feels to me like one of the official APIs got broken while under active development; probably something that will be addressed in the following weeks at most.

Dave, thoughts?

Flags: needinfo?(dave.hunt)

Digging through Chromium's commit history, I found that the unstable Windows revision consisted of this patch.

Digging more, I think these are all the patches which landed after the last stable revision.

Made another Try push, where I removed the pinned Chromium revisions. If the OSX platform has stabilized, I'll remove the pin from there.

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #6)

I see that all pageload tests failed because of this, while all benchmarks ran successfully. Have I missed something?
Still, the success of the benchmarks (which previously failed) confirms there's something wrong when recording the fcp.

Why would the benchmarks fail when they're not measuring fcp?

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #7)

Once this happens, the next 10 reattempts become useless. They produce the same output and then the test officially times out & notifies the control server.

Is this just warm page load or does it also affect cold page load?

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #8)

Given this, I'm not sure how to proceed further. Feels to me like one of the official APIs got broken while under active development; probably something that will be addressed in the following weeks at most.

I would suggest reaching out to the Chromium developers to see if this is something they're aware of. I wonder if it's also affecting their page load tests, or if this is even something that may ultimately affect their users. Let's test new builds on a weekly basis to see if this gets fixed in the meantime.

Flags: needinfo?(dave.hunt)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #12)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #6)

I see that all pageload tests failed because of this, while all benchmarks ran successfully. Have I missed something?
Still, the success of the benchmarks (which previously failed) confirms there's something wrong when recording the fcp.

Why would the benchmarks fail when they're not measuring fcp?

That's what I was wondering also. I filed bug 1555654 to look into this. Seems like my patch was wrong. I should have removed fcp from alert_on = also.
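The mismatch described here, where a metric is no longer measured but is still listed under `alert_on`, can be caught with a simple consistency check. This is a minimal sketch, assuming the test settings are comma-separated values in an INI-style section. The `measure` key, the section name, and the metric lists below are illustrative assumptions, not copied from the actual Raptor configuration.

```python
import configparser


def unmeasured_alerts(ini_text: str, section: str) -> set:
    """Return metrics listed under alert_on that are missing from measure."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)

    def split(key: str) -> set:
        # Treat a missing key as an empty list of metrics.
        raw = parser.get(section, key, fallback="")
        return {item.strip() for item in raw.split(",") if item.strip()}

    return split("alert_on") - split("measure")


# Hypothetical test section: fcp was dropped from measure but not alert_on.
SAMPLE = """
[example-pageload-test]
measure = dcf, loadtime
alert_on = fcp, loadtime
"""

print(unmeasured_alerts(SAMPLE, "example-pageload-test"))  # {'fcp'}
```

An empty result means every alerting metric is still being measured.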

(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #12)

Is this just warm page load or does it also affect cold page load?

I only tested warm pageload. I could attempt to reproduce on cold pageloads.

Flags: needinfo?(rwood)

I filed issue 969614, so the Chromium team will take a look at this problem.

Depends on: 1556695

Did a Try push yesterday and noticed that Raptor Chromium remained stable on OSX. Thus, I filed bug 1556695 to remove the pinned version.

:igoldan any updates here?

Flags: needinfo?(igoldan)
Summary: Restore daily Chromium updates for Windows & OSX → Restore daily Chromium updates for Windows

Pinging :igoldan for a status update, thanks!

Priority: P1 → P2
Assignee: igoldan → fstrugariu
Assignee: fstrugariu → alexandru.ionescu

ariakab and I will take this over.

Flags: needinfo?(igoldan)

Chromium responded to the issue:
I've just skimmed through performance.cc and haven't found any suspicious change happening around the date. It may also relate to extension.

Flags: needinfo?(dave.hunt)

(In reply to Alexandru Ionescu :alexandrui from comment #21)

Chromium responded to the issue:
I've just skimmed through performance.cc and haven't found any suspicious change happening around the date. It may also relate to extension.

I had seen that response. I don't have anything to add here, did you mean to needinfo me? I understand we've been able to replicate the issue locally on Windows? Has this helped with identifying the cause?

Flags: needinfo?(dave.hunt)

You were asking above about the state of this, and I was just letting you know. Today I'm going to replicate it locally on the machine we received; hopefully there will be some progress one way or another.

Assignee: alexandru.ionescu → igoldan

Did a new Try push to check the Windows platforms.

Looks like Raptor Chromium got visibly more stable on Windows. We have fewer intermittents than previously.
But I don't think we've got enough to disable the pinned versions.

Flags: needinfo?(dave.hunt)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #24)

Did a new Try push to check the Windows platforms.

Looks like Raptor Chromium got visibly more stable on Windows. We have fewer intermittents than previously.
But I don't think we've got enough to disable the pinned versions.

Do the failures appear similar to what we were seeing before? Can you retrigger the failing jobs to see how frequent the failures are? Perhaps also see if the cold page load jobs are intermittent, as these don't appear to have failed in your push.

Flags: needinfo?(dave.hunt)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #25)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #24)

Did a new Try push to check the Windows platforms.

Looks like Raptor Chromium got visibly more stable on Windows. We have fewer intermittents than previously.
But I don't think we've got enough to disable the pinned versions.

Do the failures appear similar to what we were seeing before?

Yes, at least for the warm page loads.

Can you retrigger the failing jobs to see how frequent the failures are?

Just did that.

Perhaps also see if the cold page load jobs are intermittent, as these don't appear to have failed in your push.

Just did that.

Following up: for each of the failed jobs I made 10 retriggers. The good thing is that none of the retriggers failed. So Chromium is even more stable :)
Also did lots of retriggers for the cold page loads & none failed.
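The retrigger check above amounts to estimating a per-job failure rate. This is a minimal sketch, assuming each job's retrigger outcomes are available as a list of booleans (True for a passing run). The job names and the zero-failure threshold are hypothetical.

```python
from typing import Dict, List


def failure_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Map each job name to the fraction of its runs that failed."""
    return {
        job: runs.count(False) / len(runs)
        for job, runs in results.items()
        if runs  # skip jobs with no recorded runs
    }


def is_stable(results: Dict[str, List[bool]], max_rate: float = 0.0) -> bool:
    """True when every job with runs stays at or below max_rate."""
    rates = failure_rates(results)
    return bool(rates) and all(rate <= max_rate for rate in rates.values())


# Hypothetical outcome: 10 retriggers per job, none failed.
retriggers = {
    "example-warm-pageload": [True] * 10,
    "example-cold-pageload": [True] * 10,
}

print(failure_rates(retriggers))
print(is_stable(retriggers))  # True
```

Ten clean retriggers per job is a small sample, so a result like this shows that failures have become rare rather than that they are gone.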

Flags: needinfo?(dave.hunt)

It sounds like maybe we're stable enough to re-enable the Chromium updates for Windows. Could you take care of this?

Flags: needinfo?(dave.hunt)
Type: defect → task
Priority: P2 → P1
Status: NEW → ASSIGNED
Flags: needinfo?(igoldan)

A month has passed. Before unpinning, I should recheck if things are still stable on Windows.
Did another Try push.

I'm afraid Raptor's Chromium jobs on Windows are once more highly unstable.

Flags: needinfo?(igoldan)
Priority: P1 → P2

We now have the latest Chrome release running on Windows, and Chromium will be running less frequently than before.

Should I reattempt to enable latest Chromium revisions?

Flags: needinfo?(dave.hunt)

(In reply to Ionuț Goldan [:igoldan] from comment #32)

Should I reattempt to enable latest Chromium revisions?

Yes, let's give it a go. Also, I suspect we don't report the Chromium version in the Perfherder data. If not, could you file a bug for that?

Flags: needinfo?(dave.hunt)

Ionuț, I assume you are still working on this bug?

Priority: P2 → P1

Not currently, but this is indeed for 2020/Q1.

Priority: P1 → P2
Assignee: igoldan → nobody
Status: ASSIGNED → NEW
Assignee: nobody → onegru
Priority: P2 → P1
Status: NEW → ASSIGNED
Pushed by igoldan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/114d8b270dbb
Restore daily Chromium updates for Windows r=perftest-reviewers,sparky
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75