testing/web-platform/tests/fetch/stale-while-revalidate/fetch-sw.https.tentative.html fetch count failure
Categories
(Core :: Networking, defect, P3)
Tracking
 | Tracking | Status
---|---|---
firefox68 | --- | affected
People
(Reporter: mayhemer, Unassigned)
References
Details
(Whiteboard: [necko-triaged])
With Nightly, the web platform test fetch-sw.https.tentative.html [1][2] fails because we don't deliver two 'fetch' notifications to the service worker.
This is independent of having bug 1536511 in the tree!
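(For context: the test relies on the second fetch being served from cache while a revalidation request goes out in the background, which only happens when the cached response carries a stale-while-revalidate directive. I'm assuming stale-script.py sends something along these lines; the exact values are not copied from the test:)

Cache-Control: private, max-age=0, stale-while-revalidate=10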
The test does the following:
- registers the SW 'sw-intercept.js' and waits for the 'activated' status; the SW does:
async function broadcast(msg) {
  for (const client of await clients.matchAll()) {
    client.postMessage(msg);
  }
}

self.addEventListener('fetch', event => {
  event.waitUntil(broadcast(event.request.url));
  event.respondWith(fetch(event.request));
});

self.addEventListener('activate', event => {
  self.clients.claim();
});
- the test listens for messages from the SW:
const uri = 'stale-script.py?token=...';
var service_worker_count = 0;
navigator.serviceWorker.addEventListener('message', function once(event) {
  if (event.data.endsWith(uri)) {
    service_worker_count++;
  }
});
- the test does two fetches for the 'uri':
const response = await fetch(uri);
const response2 = await fetch(uri);
- if you remove (comment out) the check for the same Unique-Id (which needs bug 1536511 to pass), the test fails the assertion about service_worker_count a few lines below. (The stale-script.py?query... request just asks the server how many requests it has seen for the 'uri' resource; after two fetches it should return '2'.)
while (true) {
  const revalidation_check = await fetch(`stale-script.py?query&token=` + request_token);
  if (revalidation_check.headers.get('Count') == '2') {
    // The service worker should not see the revalidation request.
    assert_equals(service_worker_count, 2);
    break;
  }
  await wait25ms(test);
}
Adding dumps to the test shows that we really don't hit the 'fetch' event for the second (or first? - I haven't checked which) fetch for the 'uri'. Adding a third fetch request produces two calls to 'fetch' and makes the test pass (with bug 1536511).
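(For reference, that experiment amounted to something like this; a debugging sketch only, not a proposed fix:)

const response = await fetch(uri);
const response2 = await fetch(uri);
// Extra request added only for debugging; with it, two 'fetch' events reach the SW.
const response3 = await fetch(uri);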
No idea whether this is something in the fetch code, the notification code, or the service worker code.
[1] https://searchfox.org/mozilla-central/source/testing/web-platform/tests/fetch/stale-while-revalidate/fetch-sw.https.tentative.html
[2] https://github.com/web-platform-tests/wpt/blob/master/fetch/stale-while-revalidate/fetch-sw.https.tentative.html
Comment 2•6 years ago
The WPT test has a bug. The call to clients.claim() is racing the fetch() commands. It's not enough for the test to wait for 'activated'.
Presumably setupRegistration wants to become something like this:
async function setupRegistrationAndWaitToBeControlled(t, scope) {
  const controlled = new Promise((resolve) => {
    navigator.serviceWorker.oncontrollerchange = () => { resolve(); };
  });
  const reg = await navigator.serviceWorker.register('sw-intercept.js');
  await wait_for_state(t, reg.installing, 'activated');
  await controlled;
  add_completion_callback(_ => reg.unregister());
  return reg;
}
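For illustration, the test body would then consume the helper roughly like this (just a sketch; the promise_test wrapper, the test name, and the scope argument are placeholders, not the actual WPT code):

promise_test(async (test) => {
  // Wait until the page is actually controlled before fetching, so both
  // requests are guaranteed to hit the service worker's 'fetch' handler.
  await setupRegistrationAndWaitToBeControlled(test, 'stale-script.py');
  const uri = 'stale-script.py?token=...';  // token elided as in the original test
  const response = await fetch(uri);
  const response2 = await fetch(uri);
}, 'SW sees both fetches for a stale-while-revalidate resource');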
Reporter
Comment 3•6 years ago
Thanks!
I still can't figure out the right setting for the test's .ini meta file to make it work with the patch for bug 1536511.
I will test your suggested fix tomorrow. I'm wondering whether a PR of the change to WPT and a sync back to our tree is the way to go, or whether we can land a fix like this locally as a workaround (within the patch for bug 1536511).
Comment 4•6 years ago
(In reply to Honza Bambas (:mayhemer) from comment #3)
> I still can't figure out the right setting for the test's .ini meta file to make it work with the patch for bug 1536511.
Assuming my proposed change fixes things, I think you can just delete the .ini file. Unless you think the test should still be failing? It's hard for me to look more deeply, though, with the gazillion RetargetDeliveryTo warnings cluttering the logs.
> I will test your suggested fix tomorrow. I'm wondering whether a PR of the change to WPT and a sync back to our tree is the way to go, or whether we can land a fix like this locally as a workaround (within the patch for bug 1536511).
You can just land the fix as part of your patch here in mozilla-central and the magic upstreaming robot will upstream it for us. The key thing is that we review it and we believe the change is correct as it relates to the (ServiceWorker) spec as it already exists. If spec changes or discussion are needed because we have encountered divergent behavior with another browser, then we would potentially need to do things up on github. In this case it's clear the test is buggy and needs to wait for the "controllerchange" event.
Reporter
Comment 5•6 years ago
Thanks! I incorporated your change into the patch for bug 1536511 and asked for review. We can probably dupe this bug against that one then.
Reporter
Comment 6•6 years ago
Getting puzzled. The test times out now (it did before too, but I somehow believed that was because of its brokenness). There is no output from the test in the log (once you filter out all the NS_FAILED() flood), as if neither of the two checks were even triggered.
Locally (same build, same platform) the test passes for me. It has to be intermittent, and I'm thinking we just spin in the while(true) loop because the 'Count' header never reaches (or jumps over) 2. Even after changing the check to >= 2 (instead of == 2) I'm still getting timeouts...
This is starting to block bug 1536511 from landing.
Comment 7•6 years ago
The test is timing out in non-standard modes:
- The "W" failure is non-e10s mode which can be reproduced by testing with "--disable-e10s"
- The "W-sw-e10s" failure is parent-intercept e10s mode which can be reproduced by testing with "--setpref dom.serviceWorkers.parent_intercept=true"
In both cases right now, the ServiceWorker runs directly in the parent process. (In non-e10s mode this is obviously because there is only one process; in parent-intercept mode it's because we haven't landed ServiceWorkerPrivate remoting yet, so as a stopgap the ServiceWorker runs in the parent.) This means the nsHttpChannel will be created directly. Is it possible this alters the behavior of your patch somehow?
Reporter
Comment 8•6 years ago
Thanks a lot for these details, Andrew. I am able to repro locally with those two switches, but have no idea what the cause is. Any dump added to the test makes it just pass as expected. I decided to just prolong the polling timeout, as I can't spend more time diagnosing the test.
There is, though, one interesting thing I found. I pushed to try with a typo in the test. It failed in the two non-standard modes it was failing in before, now because of the typo. However, in standard mode [1] it passed! That means there is something wrong with the infra when we run these tests in standard mode. I checked the log for TEST-START | /fetch/stale-while-revalidate/fetch-sw.https.tentative.html - it's there, with a TEST-OK result.
Comment 9•6 years ago
(In reply to Honza Bambas (:mayhemer) from comment #8)
> There is, though, one interesting thing I found. I pushed to try with a typo in the test. It failed in the two non-standard modes it was failing in before, now because of the typo. However, in standard mode [1] it passed! That means there is something wrong with the infra when we run these tests in standard mode. I checked the log for TEST-START | /fetch/stale-while-revalidate/fetch-sw.https.tentative.html - it's there, with a TEST-OK result.
It looks like this is your typo:

- await wait25ms(test);
+ await wait250ms(test);

But this only matters if the statement is executed, which it won't be if

const revalidation_check = await fetch(`stale-script.py?query&token=` + request_token);

came back with a 'Count' of 2 on the first try, because the break will terminate the loop.
> Thanks a lot for these details, Andrew. I am able to repro locally with those two switches, but have no idea what the cause is. Any dump added to the test makes it just pass as expected. I decided to just prolong the polling timeout, as I can't spend more time diagnosing the test.
Okay, so as I understand what the test is doing:
- The first fetch goes to the network and hits the python script; count=1 in the python script.
- The second fetch gets served out of cache thanks to stale-while-revalidate, resulting in the network connection being spun up but the fetch doesn't wait for it.
- The test polls the python script because it has no visibility into when the revalidation network request will actually occur.
- This means there's basically a race between the "query" and revalidation traffic.
And as I understand what you're seeing:
- In the passing tests, we never have to "await". The revalidation has happened by the time we run the "query".
- In the failing tests, we do have to "await", suggesting there's a race and that perhaps something bad is happening, like in the absence of explicit cache-control headers, we're deciding to cache the returned "Count" value of 1 forever? (With the results coming from https://github.com/web-platform-tests/wpt/blob/master/fetch/stale-while-revalidate/stale-script.py#L23).
Is that a possible thing, or does that help provide an idea of what else could be going wrong?
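If that caching were indeed the culprit, one way to rule it out would be to bypass the HTTP cache on the probe fetch using the standard Request cache mode; this is just a diagnostic sketch, not something the test currently does:

// Force the probe past the HTTP cache so a stale 'Count: 1' response can never be reused.
const revalidation_check = await fetch(
  `stale-script.py?query&token=` + request_token,
  { cache: 'no-store' });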
Reporter
Comment 10•6 years ago
I changed the delay to 250ms (properly :)) and the failure is gone. I understand now why it fails with such a short poll timeout and only on debug builds: tailing of the revalidation request is the cause; the probe request made every 25ms just delays (keeps on hold) the internal revalidation request every time it's made, because they share the request context. (If I had looked into MOZ_LOG, I would have figured this out sooner...)
I honestly don't know how to solve this in a non-racy way. If we change the tailing algorithm somehow, it may make the 250ms delay insufficient again... One idea is to let the server handle the response asynchronously:
const probe = fetch('stale-script.py?query&count=2&token=' + ...);
const response = await fetch(uri);
const response2 = await fetch(uri);
await probe;
and the 'probe' URL handler on the server would hold the response until two requests for the |uri| have come in.
It's still not clear to me how the counter on the server could have been updated to '2' so fast that we didn't need to loop and hit the typo... that seems very unlikely to me, and that's why I took it as a test infra failure of some kind.
Note that without any timing information (Last-Modified, ETag, Age, max-age) the 'query' response will not be reused from cache.
To conclude:
- I will land the patch for bug 1536511 with a delay changed to 250ms
- I will leave this bug open (swap the dependencies) to figure out the right fix
Reporter
Comment 11•6 years ago
No idea if the Stash is updated when a handler polls it for changes. I don't know Python's BaseManager that well, but that's what is used here.