Closed Bug 1160013 Opened 9 years ago Closed 9 years ago

ONLY BAD SLAVES HIT Intermittent cache-match.https.html | Cache.matchAll with no matching entries - Test timed out and many more

Categories

(Core :: DOM: Core & HTML, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED WORKSFORME
Tracking Status
firefox39 --- unaffected
firefox40 --- affected
firefox41 --- affected
firefox-esr31 --- unaffected
firefox-esr38 --- unaffected

People

(Reporter: philor, Assigned: bkelly)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

      No description provided.
I ran this in DEBUG mode on try and it's hitting an assert I added in bug 1160147.  I forgot to have the CachePushStreamChild hold the Cache DOM object alive as I intended.  This patch fixes that.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Attachment #8601220 - Flags: review?(amarchesini)
Attachment #8601220 - Flags: review?(amarchesini) → review+
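The actual fix here is a small C++ change in Gecko's IPC code, but the general "keep the owner alive until the async work is done" pattern it describes looks roughly like the TypeScript sketch below.  All names are hypothetical; this is not the real CachePushStreamChild implementation.

  // Conceptual sketch only; names are made up and do not match Gecko's C++.
  class PushStreamHelper {
    // Holding this reference keeps the owning object reachable until the
    // stream has been fully pumped; forgetting to hold it is the kind of
    // bug described in the comment above.
    private owner: object | null;

    constructor(owner: object, private stream: ReadableStream<Uint8Array>) {
      this.owner = owner;
    }

    async pump(): Promise<void> {
      const reader = this.stream.getReader();
      try {
        for (;;) {
          const { done } = await reader.read();
          if (done) {
            break;
          }
          // ...forward the chunk to the other side here...
        }
      } finally {
        // Release the owner only after all of the async work has completed.
        this.owner = null;
      }
    }
  }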
Comment 33 shows this patch is not adequate to fix all errors.  I think it should reduce the frequency of this intermittent, though.
Keywords: leave-open
I think this one has to do with how the test pre-populates the cache.  It's also only hitting on the main thread, so it could be a PBackground actor startup thing.
No longer blocks: 1161055
Actually, it happens in the worker scope too, so it's not a PBackground actor startup thing.
So this test file creates 21 Cache objects, most of which have 12 entries put into them.  In addition, each Cache is deleted before being created.  That works out to roughly 294 operations (21 * (1 delete + 1 open + 12 puts)) being scheduled at about the same time.

It seems plausible Cache is running into a performance problem here.
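Roughly, the setup the test does looks like the sketch below.  This is an illustrative TypeScript approximation, not the actual web-platform-test code; the cache names, URLs, and response bodies are made up.

  // Each cache is deleted, reopened, and then populated, so every cache
  // queues 1 delete + 1 open + N put operations.
  async function populateCache(name: string, entryCount: number): Promise<Cache> {
    await caches.delete(name);
    const cache = await caches.open(name);
    for (let i = 0; i < entryCount; i++) {
      await cache.put(`/resources/entry-${i}`, new Response(`body ${i}`));
    }
    return cache;
  }

  // 21 caches * (1 delete + 1 open + 12 puts) = 294 operations queued in
  // quick succession when the test file starts running.

Each of those operations has to be serviced by the Cache backend, which is why scheduling roughly 294 of them at once can turn into a backlog.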
On my fast desktop I see some ActionRunnables take 100ms to start and 200ms to complete.  200ms * ~300 requests is about 60 seconds, which is right at the test's timeout.  There should be some operation overlap, and that was the worst case I saw, but triggering the timeout seems to be within the realm of possibility.
On the try server during one of these failures I see ActionRunnables taking anywhere from 3 seconds to 10 seconds to execute.  I think this is the culprit.

I will attempt to fix this by implementing the lowest-hanging optimization: keeping the sqlite database connection open between requests.
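The real optimization lives in the Cache backend's C++ code; the sketch below only illustrates the idea in TypeScript, using a made-up DbConnection interface and openConnection() stub rather than any real Gecko or SQLite API.

  interface DbConnection {
    execute(sql: string, params?: unknown[]): Promise<void>;
    close(): Promise<void>;
  }

  // Stub standing in for the expensive open + schema/pragma setup work.
  async function openConnection(path: string): Promise<DbConnection> {
    await new Promise((resolve) => setTimeout(resolve, 100));
    return {
      execute: async () => { /* run the statement against the file at `path` */ },
      close: async () => { /* release the handle */ },
    };
  }

  let cachedConnection: Promise<DbConnection> | null = null;

  // Open the database once and hand the same connection to every Cache
  // operation, instead of paying the open/close cost on each request.
  function getConnection(path: string): Promise<DbConnection> {
    if (cachedConnection === null) {
      cachedConnection = openConnection(path);
    }
    return cachedConnection;
  }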
Depends on: 1134671
Depends on: 1162211
I was unable to conclusively resolve this today.  Try was acting up.  I have some patches which might land and fix it, but I'm not holding my breath.  I will look at it again when I return from PTO next week.
This try build shows fairly promising results for silencing this intermittent:

  https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7257315c37a

Unfortunately those patches introduce a perma-orange on mulet M1:

  https://treeherder.mozilla.org/#/jobs?repo=try&revision=d3347a2c69b7

So this will have to wait until I return next week.  Sorry!
Depends on: 1162342
Summary: Initermittent cache-match.https.html | Cache.matchAll with no matching entries - Test timed out and many more → Intermittent cache-match.https.html | Cache.matchAll with no matching entries - Test timed out and many more
This should be fixed when I land bug 1162342 tomorrow.  One more small patch to write and get reviewed.
Bug 1162342 has landed in m-c.  I've also asked for approval to uplift to Aurora to help silence these failures there as well.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
It seems this is just failing on PGO builds now.  Ryan, do you think it's worth opening a new bug for the PGO failures?  I would do it, but I'm not sure how to make it linkable in Treeherder.
Flags: needinfo?(ryanvm)
I think the fact that they're all on Aurora is more relevant than the PGO part (all Windows opt builds on Aurora are PGO). We aren't seeing trunk PGO failures, so that tells me that something's different about Gecko 40 such that the fix didn't fully work.
Flags: needinfo?(ryanvm)
The failure in comment 182 is on mozilla-inbound.
Gah, so it is! Let's just reopen this bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla41 → ---
Most PGO builds seem to run these tests in roughly the same time as non-PGO builds: about 20 seconds.

In the failure in comment 182, though, the cache-match.https.html test gets progressively slower across its three runs (service worker, window, and dedicated worker scopes):

  sw:     32360ms
  window: 39709ms
  worker: timed out at 60000ms

This is slower than normal to begin with.  It seems the machine might be under additional load or something.
Only two machines are involved in the last four PGO failures:

  comment 179 was t-w864-ix-092 on May 28
  comment 180 was t-w864-ix-092 on June 3
  comment 181 and comment 182 were t-w864-ix-163 on June 15

I'd like to see whether future failures also cluster on the same machines.
That's three for t-w864-ix-163 now.

Ryan, this is starting to look like a problem with a specific machine or two.  What's the best way to address that from an infrastructure point of view?
Flags: needinfo?(ryanvm)
Nice catch! I've disabled the slave for now.
Flags: needinfo?(ryanvm)
Well crap.  Comment 192 is a new machine.
So now all the failures since you disabled that machine are on t-w864-ix-070.  wtf...

Any ideas?  Are these machines getting into a bad state somehow until rebooted?
Flags: needinfo?(ryanvm)
An interesting question which, of course, someone with no access to the machine and no experience in determining what "a bad state" looks like cannot answer.

It seems suggestive that it's also the only machine to hit bug 1057615 in the last dozen days, and since it was just in the middle of hitting some w-p-t-1 failures that had never been seen before, I rebooted it just to see what would happen.
Oh, not never-before-seen after all: it was just hitting bug 1156577, which it is also the only machine to have hit in the last dozen days.
(In reply to Ben Kelly [:bkelly] from comment #197)
> Any ideas?  Are these machines getting into a bad state somehow until
> rebooted?

Disabled t-w864-ix-070.
Flags: needinfo?(ryanvm)
Look who's back!
Yeah, I thought t-w864-ix-070 was disabled.
Disabled isn't a permanent state; that's "decommed."  Now we know that the "bad state" isn't fixed by a reimage.
The last few failures have a fun new process crash in service-workers/fetch-event-async-respond-with.https.html.  This is a new test added in the last couple of weeks.  Maybe these should go in a new bug?
Or maybe the process crash is unrelated.  Four of the last five stacks seem to be from t-w864-ix-163, which looks like the "test host has gone into the weeds" problem.
Keywords: leave-open
Summary: Intermittent cache-match.https.html | Cache.matchAll with no matching entries - Test timed out and many more → ONLY BAD SLAVES HIT Intermittent cache-match.https.html | Cache.matchAll with no matching entries - Test timed out and many more
The failure in comment 214 is an unrelated crash.
We haven't seen this in over a month.  Closing.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Component: DOM → DOM: Core & HTML