Open Bug 1417308 Opened 3 years ago Updated 8 months ago

Navigation timing responseStart reported wrong

Categories

(Core :: DOM: Core & HTML, defect, P3)

57 Branch
defect

Tracking

()

People

(Reporter: peter, Assigned: valentin)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: regression, regressionwindow-wanted, Whiteboard: [necko-triaged][webpagetest])

Attachments

(1 file)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36

Steps to reproduce:

I automated Firefox 57 through Browsertime, access the same URL Wikipedia URL multiple times. When loadEventEnd has ended I execute:

(function() {
  var t = window.performance.timing;

  return {
    domainLookupTime: (t.domainLookupEnd - t.domainLookupStart),
    redirectionTime: (t.fetchStart - t.navigationStart),
    serverConnectionTime: (t.connectEnd - t.connectStart),
    serverResponseTime: (t.responseEnd - t.requestStart),
    pageDownloadTime: (t.responseEnd - t.responseStart),
    domInteractiveTime: (t.domInteractive - t.navigationStart),
    domContentLoadedTime: (t.domContentLoadedEventStart - t.navigationStart),
    pageLoadTime: (t.loadEventStart - t.navigationStart),
    frontEndTime: (t.loadEventStart - t.responseEnd),
    backEndTime: (t.responseStart - t.navigationStart)
  };
})();


Actual results:

Backend time (responseStart - navigationStart) is sometimes 0 (but it is actually a new request and nothing is cached). But not always.


Expected results:

The responseStart should have a value different from navigation start. 

Testing the exact same way on Firefox 54 and Chrome 63 doesn't result in a 0. 

I've tested it manually by loading different URLs but haven't been able to reproduce. 

I'll do one change in my setup by clearing the cache through the WebExtension API to make sure it isn't a incompatible thing between Firefox 57/Geckodriver/Selenium and report back.
I've added so all Firefox caches are cleared between runs using a WebExtension and then everything looks ok. I think then Selenium doesn't clear the session as it used to be in Firefox 57. I wonder if the problem is Selenium or Geckodriver then.
Can you please use https://mozilla.github.io/mozregression/ to determine the regression range?
Component: Untriaged → DOM
Product: Firefox → Core
(In reply to Peter Hedenskog from comment #1)
> I've added so all Firefox caches are cleared between runs using a
> WebExtension and then everything looks ok. I think then Selenium doesn't
> clear the session as it used to be in Firefox 57. I wonder if the problem is
> Selenium or Geckodriver then.

Hi David, are you the right person to check Peter's issue and comment? Thanks!
Flags: needinfo?(dburns)
Without being able to see what the selenium code looks like it's a bit difficult to see the true error.

Selenium and Geckodriver doesnt cache anything between session as we make sure to use a fresh profile between sessions. If it looks like it is caching then I suggest making sure you use the #quit command to clear the session. This command will also delete the profile in use.

When you do 

```python
  driver = webdriver.Firefox()
```

It will create a new profile.

hth
Flags: needinfo?(dburns)
Hi Peter,
Could you help us get the regression window following comment 2? Also, see if David's comment 4 helps resolve your issue. Thanks.
Flags: needinfo?(peter)
Priority: -- → P3
Hey sorry for not getting back. 

David: Yep we create new Profile. When I run the exact same code with FF 54 it works like a charm. Just changing the browser to 57 I can see that the Navigation Timing metrics sometimes (not always) is wrong. I wasn't using the latest Geckodriver (used 0.18.0) but I guess that doesn't matter. 

My guess was that it was getting a cached copy of the page but then my guess is wrong maybe. The thing is if I clear the cache between the runs, the metrics is correct. FF 57 is fast but backend time as 0 or 1 ms is probably too fast. When I checked now I can see that also connectEnd, connectStart, responseEnd and requestStart seems affected (they are reported as zero). I guess it doesn't give you any extra info but this is where I get the data:
https://results.sitespeed.io/en.wikipedia.org/2017-11-14-22-34-52/pages/en.wikipedia.org/wiki/Barack_Obama/3.html#browsertime

I have no way to debug it further until the new HAR plugin is available, sorry. I know not reproducible bugs sucks but I thought I should report it anyway so you can keep track of it if more users get the same. 

I've tried to look at the RUM data we have at work to see if I can see the same for real users but at the moment we disregard users with faulty nav timings, need to check if that number increased with the 57 release.

Hmm I'm haven't tried with mozregression but I can do it if you think it can help?
Flags: needinfo?(peter)
Not upgrading geckodriver is fine though it solidifies my thoughts that selenium/geckodriver is not the cause.

With a lot of the speed up work in Firefox for 57 I know there is more caching in general so maybe something has gone "rogue"
Does this site use a service worker with a FetchEvent at all?
No service worker at all. We use HTTP/2.
Its possible we are speculatively loading here:

https://searchfox.org/mozilla-central/source/dom/performance/PerformanceTiming.cpp#95

Another possibility is that these are things like cross-origin <img> requests without Timing-Origin-Allow header, etc.
Adding some necko folks who might be able to help further diagnose what the network stack is doing.
Sorry for all the questions, but are you by any chance running this in AWS or another virtualized environment with a windows operating system?  We've had some issues without time counters being consistent across browser processes in some of these environments.
Yep it's in Docker on AWS. And it happened with the switch to 57.

The headers from the server looks like this:

accept-ranges:bytes
age:80375
backend-timing:D=225027 t=1510765277142761
cache-control:private, s-maxage=0, max-age=0, must-revalidate
content-encoding:gzip
content-language:en
content-length:239442
content-type:text/html; charset=UTF-8
date:Thu, 16 Nov 2017 15:20:53 GMT
last-modified:Wed, 15 Nov 2017 00:18:58 GMT
link:</static/images/project-logos/enwiki.png>;rel=preload;as=image;media=not all and (min-resolution: 1.5dppx),</static/images/project-logos/enwiki-1.5x.png>;rel=preload;as=image;media=(min-resolution: 1.5dppx) and (max-resolution: 1.999999dppx),</static/images/project-logos/enwiki-2x.png>;rel=preload;as=image;media=(min-resolution: 2dppx)
p3p:CP="This is not a P3P policy! See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
server:mw1271.eqiad.wmnet
status:200
strict-transport-security:max-age=106384710; includeSubDomains; preload
vary:Accept-Encoding,Cookie,Authorization
via:1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
x-analytics:ns=0;page_id=534366;WMF-Last-Access=16-Nov-2017;WMF-Last-Access-Global=16-Nov-2017;https=1
x-cache:cp1053 hit/6, cp3031 hit/5, cp3042 hit/204
x-cache-status:hit-front
x-client-ip:84.55.65.82
x-content-type-options:nosniff
x-powered-by:HHVM/3.18.6-dev
x-ua-compatible:IE=Edge
x-varnish:370913959 14487101, 179780583 190590610, 379753910 612690641
Last thing, it isn't consistent and happens only sometimes.
If you could capture an http log when this triggers, that might be helpful:

https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

Honza, what logging values would be useful for this problem?
Flags: needinfo?(honzab.moz)
(In reply to Ben Kelly [:bkelly] from comment #15)
> If you could capture an http log when this triggers, that might be helpful:
> 
> https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
> 
> Honza, what logging values would be useful for this problem?

I think nsHttp:5 would do, for both child and parent (please use env vars).  Valentin, do you know about any other module that should be added?
Flags: needinfo?(honzab.moz) → needinfo?(valentin.gosu)
I can make a hack to turn on the logging, can get back with that tomorrow.

I can reproduce (in Docker) on Mac OS X the same I get on AWS.
Attached file firefox_log.txt.0.zip
I've could reproduce with the log turned on but needed 21 runs to the same URL (sorry for the size). The URL is https://en.wikipedia.org/wiki/Barack_Obama and access number 15 gives the wrong Navigation Timing metrics.
I got this problems also when cache is cleared so it isn't an issue about caching.
This was probably caused by bug 919391 - before it we had negative timings. Now at least it's the same as navigationStart.
There's a high chance this _is_ related to caching. We only set responseStart (from what I can tell) in nsHttpTransaction.

Peter,
When the problem reproduces, can you post the all values in perfomance.timing ?

Thanks
Assignee: nobody → valentin.gosu
Flags: needinfo?(valentin.gosu) → needinfo?(peter)
Whiteboard: [necko-triaged]
(In reply to Valentin Gosu [:valentin] from comment #20)
> This was probably caused by bug 919391 - before it we had negative timings.
> Now at least it's the same as navigationStart.
> There's a high chance this _is_ related to caching. We only set
> responseStart (from what I can tell) in nsHttpTransaction.
> 
> Peter,
> When the problem reproduces, can you post the all values in
> perfomance.timing ?
> 
> Thanks

Yes I'll fix that first thing tomorrow.
Hmm so the moz HTTP log gets deleted if we open a new Firefox right, so the log I provided is of no value? I've made 21 runs (open a new FF each time) and the error happened in of the runs in between. Sorry I missed that.

I've reproduced now and running https://github.com/sitespeedio/browsertime/blob/master/browserscripts/timings/navigationTiming.js gave me:

connectEnd    0
connectStart    0
domComplete    3620
domContentLoadedEventEnd    1743
domContentLoadedEventStart    1742
domInteractive    1720
domLoading    138
domainLookupEnd    0
domainLookupStart    0
fetchStart    0
loadEventEnd    3621
loadEventStart    3620
navigationStart    0
requestStart    0
responseEnd    0
responseStart    0
secureConnectionStart 0
Definitely seems cache related. All of the timestamps that should come from the network are zero.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(peter)
Ok. What's really (In reply to Valentin Gosu [:valentin] from comment #20)
> This was probably caused by bug 919391 - before it we had negative timings.
> Now at least it's the same as navigationStart.
> There's a high chance this _is_ related to caching. We only set
> responseStart (from what I can tell) in nsHttpTransaction.
> 
> Peter,
> When the problem reproduces, can you post the all values in
> perfomance.timing ?
> 
> Thanks

Ok. To be more clear if that's help for the caching issue: I got this behavior using a new profile for each run. I also get this when I use a WebExtension to clear the cache between each runs: https://github.com/sitespeedio/browsertime-extension/blob/master/background.js#L146-L165
(In reply to Peter Hedenskog from comment #24)
> Ok. To be more clear if that's help for the caching issue: I got this
> behavior using a new profile for each run. I also get this when I use a
> WebExtension to clear the cache between each runs:
> https://github.com/sitespeedio/browsertime-extension/blob/master/background.
> js#L146-L165

That's even more interesting :)
I wouldn't expect this to happen with an empty cache, but maybe there's a case where we don't properly set the timings when the request is only from the network. The timings are recorded after onload has fired, right?
(In reply to Valentin Gosu [:valentin] from comment #25)
> (In reply to Peter Hedenskog from comment #24)
> > Ok. To be more clear if that's help for the caching issue: I got this
> > behavior using a new profile for each run. I also get this when I use a
> > WebExtension to clear the cache between each runs:
> > https://github.com/sitespeedio/browsertime-extension/blob/master/background.
> > js#L146-L165
> 
> That's even more interesting :)
> I wouldn't expect this to happen with an empty cache, but maybe there's a
> case where we don't properly set the timings when the request is only from
> the network. The timings are recorded after onload has fired, right?

Yes we wait on onloadEventEnd + 2 seconds before we start.
Could we get some more love on this one :) I'm really interested to know if this also could effect our RUM data?
(In reply to Peter Hedenskog from comment #27)
> Could we get some more love on this one :) I'm really interested to know if
> this also could effect our RUM data?

Thanks for the poke on this. I tried for a bunch of time to reproduce it, but didn't manage. However, I did uncover bug 1429422, as it's easier to reproduce, and it's not clear if can get accurate results on this one while that one is still active.
Depends on: 1429422
Here's a solid repro:

Navigate to facebook.com and run this:

(function() {
  function perfTimingsToNavTimings(p, timeOrigin) {
    const timings = [...Object.entries(p)].filter(e=>typeof e[1] === 'number').map(e=>[e[0], e[1] === 0 ? 0 : e[1] + timeOrigin]);
    const timingsObj = {};
    for (const [key,value] of timings) {
      timingsObj[key] = value;
    }
    return timingsObj;
  }
  a = performance.timing.toJSON();
  b = perfTimingsToNavTimings(performance.getEntriesByType('navigation')[0].toJSON(), performance.timeOrigin);
  diffs = {};
  for (const key of Object.keys(b)) {
    if (a[key] != null)
      diffs[key] = b[key] - a[key]
  }
  console.log({
    'performance.timing': a,
    'PerformanceNavigationTiming': b,
    difference: diffs
  });
})();

What's interesting is that the PerformanceNavigationTiming entry is correct on the responseStart/responseEnd values, but performance.timing is way off.
Duplicate of this bug: 1287221
Whiteboard: [necko-triaged] → [necko-triaged][webpagetest]
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.