Closed Bug 576006 Opened 14 years ago Closed 13 years ago

Implement timing interface for channels

Categories

(Core :: Networking, enhancement, P1)


Tracking


RESOLVED FIXED
mozilla6

People

(Reporter: Biesinger, Assigned: Biesinger)

References

(Blocks 1 open bug)

Details

(Whiteboard: [strings])

Attachments

(1 file, 9 obsolete files)

42.39 KB, patch
Implementing the web timing spec (bug 554045) requires a way for channel consumers to get detailed timing information from a channel.

This can be done using a per-channel nsIHttpActivityObserver, or a special interface that just stores the timing information, either when requested by a load flag or always. I'm leaning towards the special-interface option with a load flag, because that seems simplest to access, and the load flag ensures that the additional PR_Now() calls don't slow things down too much.
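
For illustration, a minimal consumer-side sketch of the load-flag approach. The interface name nsITimedChannel and the getter names here are placeholders for whatever the design ends up being, not a settled API:

    #include <stdio.h>
    #include "prtime.h"
    #include "nsCOMPtr.h"
    #include "nsIChannel.h"

    // Hypothetical consumer of the timing interface (names are placeholders).
    static void ReportChannelTimings(nsIChannel* aChannel)
    {
        nsCOMPtr<nsITimedChannel> timed = do_QueryInterface(aChannel);
        if (!timed) {
            return;  // channel did not record timings (flag not set, or unsupported)
        }
        PRTime asyncOpen = 0, responseStart = 0;
        timed->GetAsyncOpenTime(&asyncOpen);          // placeholder getter names
        timed->GetResponseStartTime(&responseStart);
        printf("first byte after %lld us\n",
               (long long)(responseStart - asyncOpen));
    }
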
Attached patch draft v2 (obsolete) — Splinter Review
Attachment #492884 - Attachment is obsolete: true
The timing data generated using this patch is within 100 μs of the timing indicated by Wireshark, most of the timings within ~40 μs, so I'm fairly happy with the accuracy.

The one exception is DNS. nsHttpChannel::AsyncOpen does DNS prefetching, so I need to come up with a good way to get that timing data.
merged to bf68a4a3ef6a
Attachment #496282 - Attachment is obsolete: true
Attached patch patch (obsolete) — Splinter Review
My rationale for removing the DNS prefetching is as follows:
- In the normal case, the prefetch saves almost no time, because the transaction will be initiated immediately and will trigger the DNS lookup anyway
- In the case where the channel is blocked because there are too many connections for this host, the other connections are for the same host, and therefore those transactions will already have triggered DNS resolution
- In the case where we are blocked because we reached the overall connection limit: that limit has been increased to 256, so it will rarely be hit

And removing this code is necessary to get DNS timings.
Attachment #522071 - Attachment is obsolete: true
Attachment #524790 - Flags: review?(bzbarsky)
(patch relative to cset 0a780111aacb)
Can you just fake the timing report by ignoring the prefetch? It seems dumb to give up the optimization for the sake of a timing interface.
So how stable is the spec we're implementing here?  Olli?

Also, do we actually want PRTime (which can go backwards, etc) or do we want interval times a la nsTimeStamp (which don't increase when the computer is off or asleep, iirc, but promise to be nondecreasing)?
Navigation timing should be quite stable now.
Don't know about Resource timing - at least I doubt it has been reviewed properly.
What about my question about the sorts of times we want here?
"Throughout this work, time is measured in milliseconds since midnight of January 1, 1970 (UTC)."
We'd want time that is non-decreasing and keeps increasing while the machine is asleep, I think.
Though in practice PR_Now() is probably just fine, no?
> Though in practice PR_Now() is probably just fine, no?

Unless the user (or NTP) changes the date on the computer...

But ok.
(In reply to comment #7)
> can you just fake the timing report by ignoring the prefetch? It seems dumb to
> give up the optimization for the sake of a timing interface.

I don't understand why the prefetch is there (in the specific place I'm removing it from). Like I wrote, it gives basically no advantage and complicates the code.

We couldn't fake it in a way that makes sense. We could report the sub-millisecond cache-lookup times, but that seems kinda useless...

(In reply to comment #9)
> Navigation timing should be quite stable now.
> Don't know about Resource timing - at least I doubt it has been reviewed
> properly.

Resource Timing is far from done; Navigation Timing is pretty stable. That said, I suspect that any changes to the spec wouldn't affect this networking-layer code, since it just exposes timings as they correspond to necko-level events.

(In reply to comment #8)
> Also, do we actually want PRTime (which can go backwards, etc) or do we want
> interval times a la nsTimeStamp (which don't increase when the computer is off
> or asleep, iirc, but promise to be nondecreasing)?

We want a timestamp that can be mapped to a wallclock time. I don't think any non-PRTime/PR_Now gives us that... TimeStamp definitely doesn't. What use is it to tell a web developer "500 dimensionless units have passed"? :-)
We can convert timestamp units to ms, for what it's worth.

All I'm saying is that with PRTime the durations can end up negative, since it's measuring what the computer clock measures, not how much time has actually passed.
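
A side-by-side sketch of the two clocks being compared (illustrative only; DoSomeWork is a stand-in for whatever is being timed):

    #include <stdio.h>
    #include "prtime.h"
    #include "mozilla/TimeStamp.h"

    using mozilla::TimeStamp;

    static void DoSomeWork() { /* stand-in for the work being timed */ }

    void MeasureBothWays()
    {
        PRTime wallStart = PR_Now();             // wall clock, microseconds since the epoch
        TimeStamp monoStart = TimeStamp::Now();  // monotonic, dimensionless until converted

        DoSomeWork();

        // Can come out negative if the clock was stepped backwards meanwhile.
        PRTime wallElapsedUs = PR_Now() - wallStart;

        // Always non-negative, and convertible to milliseconds.
        double monoElapsedMs = (TimeStamp::Now() - monoStart).ToMilliseconds();

        printf("wall: %lld us, monotonic: %f ms\n",
               (long long)wallElapsedUs, monoElapsedMs);
    }
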
(In reply to comment #13)
> Like I wrote, it gives basically no advantage and
> complicates the code.

You asserted that it cannot queue except to a known host, but that depends on the connection limits. Mobile in particular has much lower limits, which will create queuing, along with higher RTTs, which will disproportionately benefit from the prefetch.

All this does is prepopulate the cache, which may or may not actually change the state of the cache. I'm not sure why that can't be ignored for the sake of your measurements - especially if you assert it doesn't actually change the timing.
> The timing data generated using this patch is within 100 μs of the timing
> indicated by Wireshark, most of the timings within ~40 μs

Which OSes have you tested on?  Honza tells me he sees crappy (tens of milliseconds) resolution for both PR_Now and PR_IntervalNow on Windows (see bug 534764 comment 58 and bug 534764 comment 63 and 64).
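
A quick way to see the granularity being talked about is a spin loop like this (illustrative only, not part of the patch):

    #include <stdio.h>
    #include "prtime.h"

    // Spin until PR_Now() reports a new value and print the step size.
    // On Windows the step can be tens of milliseconds; on Linux it is much finer.
    int main()
    {
        PRTime start = PR_Now();
        PRTime next = start;
        while (next == start) {
            next = PR_Now();
        }
        printf("PR_Now() step: %lld microseconds\n", (long long)(next - start));
        return 0;
    }
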
(In reply to comment #14)
> We can convert timestamp units to ms, for what it's worth.
> 
> All I'm saying is that with PRTime the durations can end up negative, since
> it's measuring what the computer clock measures, not how much time has actually
> passed.

That is true. But the navigationtiming spec exposes msec-since-epoch, so there's nothing we can do to solve that problem, right?

(In reply to comment #15)
> (In reply to comment #13)
> > Like I wrote, it gives basically no advantage and
> > complicates the code.
> 
> you asserted that it cannot queue except to a known host, but that depends on
> the connection limits. Mobile in particular has much lower limits which will
> create queuing along with higher rtt's which will disproportionately benefit
> from the prefetch.
> 
> All this does is prepopulate the cache, which may or may not actually change
> the state of the cache. I'm not sure why that can't be ignored for the sake of
> your measurements - especially if you assert it doesn't actually change the
> timing.

I didn't assert it doesn't change the timing. With the prefetch code here, the actual timing information for the DNS lookup will be useless because it will just be the time for the cache lookup in almost all cases.

What are the connection limits on mobile?

(Would it make people happier if I readded the prefetching for now and discussed that removal in a separate patch? I would like to get this into Fx5 if possible)

(In reply to comment #16)
> > The timing data generated using this patch is within 100 μs of the timing
> > indicated by Wireshark, most of the timings within ~40 μs
> 
> Which OSes have you tested on?  Honza tells me he sees crappy (tens of
> milliseconds) resolution for both PR_Now and PR_IntervalNow on Windows (see bug
> 534764 comment 58 and bug 534764 comment 63 and 64).

I tested on Linux. Those results are unfortunate :/ But maybe we can fix that in a separate bug?
(In reply to comment #17)

> What are the connection limits on mobile?
> 

4 per host, 6 total.

They are so low because we don't have a concept of idle connections vs. active connections, and a large number of active connections tends to overwhelm the queues on lower-bandwidth mobile devices, while 6 is enough to saturate them in most cases. I did a stack full of measurements on this a little while back. It would be nice to be able to keep a larger idle pool, though.

Desktop queues recover a lot faster in the face of page aborts, etc., so having a potentially large number of connections isn't as big of a deal.

> (Would it make people happier if I readded the prefetching for now and
> discussed that removal in a separate patch? I would like to get this into Fx5
> if possible)
> 

that makes sense to me - I'd love to see the timing interface land but I'm obviously in favor of the prefetch.
> But the navigationtiming spec exposes msec-since-epoch

I think that's a bug in the spec....
Attached patch patch v2 (obsolete) — Splinter Review
OK, readded that prefetch for now.

It might be a bit late to make such a large change to the spec...
Attachment #524790 - Attachment is obsolete: true
Attachment #525255 - Flags: superreview?(bzbarsky)
Attachment #525255 - Flags: review?(jduell.mcbugs)
Attachment #524790 - Flags: review?(bzbarsky)
> It might be a bit late to make such a large change to the spec...

Yeah, I was afraid you'd say that.  The problem is that the current spec is broken-by-design.  Best thing for a spec.  :(
Why is time-from-epoch broken by design? If OSes can't handle that easily, that is a problem in them, not in the spec.
Well, OK, there are leap seconds and such, but in general I don't think using the epoch is that bad.
It's not "bad" except for when it'll silently give you incorrect results due to clock corrections...

Ah, well.  I'll assume that this part of the spec is frozen.
You're free to provide your input to public-web-perf@w3.org...
Target Milestone: --- → mozilla6
On the Windows platform, depending on the hardware, PR_Now() granularity might be anywhere from 15 to 60 milliseconds.

Personally, I prefer having a walltime-like API as well, to see how things run in parallel.

Another option for updating the spec would be to have time 0 at the start of the browser session and count all times, monotonically increasing, from that point.

Having precise timing is critical here, as we would like to use this API for metrics extension(s) and make critical decisions based on the data collected from it.  PR_Now is broken on Windows and I don't know of any trivial fix for it, though it might be doable with some effort.  If we change the spec according to my "other option" suggestion above, the fix would be much simpler, as we could switch TimeStamp and TimeDuration to use the performance counter (QueryPerformanceCounter).
TimeStamp and TimeDuration already don't use PR_Now.

For what it's worth, Microsoft said they already implement something like my proposal from the post in comment 26, last paragraph.  Google is planning to move to something like that as well.  That will involve a single TimeStamp() + PR_Now() sample at navigation start, then measuring everything using TimeDuration.
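
A minimal sketch of that scheme, anchoring the wall clock once at navigation start and deriving everything else from monotonic deltas (the names here are illustrative):

    #include "prtime.h"
    #include "mozilla/TimeStamp.h"

    using mozilla::TimeStamp;

    // Captured exactly once, at navigation start (illustrative globals).
    static PRTime    gNavStartEpochUs;
    static TimeStamp gNavStartStamp;

    void OnNavigationStart()
    {
        gNavStartEpochUs = PR_Now();          // the only wall-clock read
        gNavStartStamp   = TimeStamp::Now();  // the monotonic anchor
    }

    // Map any later event to the ms-since-epoch value the spec wants,
    // without ever re-reading the (possibly adjusted) wall clock.
    double EpochMillisecondsFor(TimeStamp aEvent)
    {
        double navStartMs = static_cast<double>(gNavStartEpochUs) / 1000.0;  // us -> ms
        return navStartMs + (aEvent - gNavStartStamp).ToMilliseconds();
    }
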
Comment on attachment 525255 [details] [diff] [review]
patch v2

Will change the patch to use TimeStamp internally and expose the values as PRTime (channelCreationTime + TimeStamp delta) in the IDL
Attachment #525255 - Attachment is obsolete: true
Attachment #525255 - Flags: superreview?(bzbarsky)
Attachment #525255 - Flags: review?(jduell.mcbugs)
Would it be possible to get relative time somehow? I expect problems correlating channel creation time with moments available to document/docshell.
Relative time in what sense?
Maybe relative to the moment of the request: a certain event happened x ms ago.

I expect this interface to be used to implement some fields of performance.timing, and there will be two sequences of events aligned to different wall-clock moments. I'm afraid it will be hard to correlate the time of channel creation with the time of document creation, or whatever we use as the base in performance.timing.

If resource usage is a concern, it might make sense to limit the time period during which the relative values are available.
What I'm actually going to do is expose both a PRTime and a TimeStamp on the IDL, and you can use that timestamp to calculate relative times as desired. I'm still also exposing PRTime so that firebug and other extensions can make sense of the data, too.
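
So a consumer that wants relative times just subtracts the TimeStamp values it reads off the channel, along these lines (a sketch; the specific timestamps used are only examples):

    #include "mozilla/TimeStamp.h"

    using mozilla::TimeStamp;

    // Relative timing from two TimeStamp values read off the channel,
    // e.g. asyncOpen and responseStart.
    double RequestToFirstByteMs(TimeStamp aAsyncOpen, TimeStamp aResponseStart)
    {
        return (aResponseStart - aAsyncOpen).ToMilliseconds();
    }
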
Attached patch patch with TimeStamp (obsolete) — Splinter Review
Now with TimeStamp
Attachment #527887 - Flags: superreview?(bzbarsky)
Attachment #527887 - Flags: review?(jduell.mcbugs)
(In reply to comment #17)
> > Which OSes have you tested on?  Honza tells me he sees crappy (tens of
> > milliseconds) resolution for both PR_Now and PR_IntervalNow on Windows (see bug
> > 534764 comment 58 and bug 534764 comment 63 and 64).
> 
> I tested on Linux. Those results are unfortunate :/ But maybe we can fix that
> in a separate bug?

bug 539093 will fix the resolution here with the new version of the patch that uses TimeStamp.
Comment on attachment 527887 [details] [diff] [review]
patch with TimeStamp

I'm not familiar with the DOM code, and the existing patch for bug 570341 isn't hooked into this patch's code yet, so in some cases I've had a hard time mapping the timing spec onto the things we're measuring here.  But they're measuring what they say they measure :)

Some issues:

I pushed this patch to try, and it caused xpcshell tests to fail on about 2/3 of platforms with one or more of the following tests failing (and I don't see any of these marked as random orange):

- test_offline_support.js | | false == true  (or application crashed)
- test_download_manager.js | application crashed 
- test_privatebrowsing.js | Download failed (name: download-C, state: 2)
- test_bug_406857.js | Unexpected download state change received

I noticed that in the offline case, we wind up calling SendStatus(STATUS_RESOLVING) but never STATUS_RESOLVED (since mDNSRequest will be null).  But I figured that'd be ok for timing purposes (since the channel fails), and don't see why it'd cause the test to fail.

There's no e10s support in the patch: we can add it in a followup if we can squeeze this into FF 6.  Otherwise, it'd be nice to just add it at the same time.  Most of the accessor functions can be moved into HttpBaseChannel, and then the timing data can be moved over during OnStart/OnDataAvail.

Finally, it would be nice to have an automated test here (I assume TestProtocols.cpp doesn't get run automatically), even if it's just something that opens the same channel twice and makes sure that the correct fields are filled in (in the from-network vs from-cache cases), and that values that must be later/earlier are so.  Again, I can deal with that being in the e10s followup if we're keen to get this into FF 6.

+r with the xpcshell issues investigated.

>+11=Looked up %1$S…

Marking as needing String changes

>diff --git a/netwerk/base/src/nsSocketTransport2.cpp b/netwerk/base/src/nsSocketTransport2.cpp
>--- a/netwerk/base/src/nsSocketTransport2.cpp
>+++ b/netwerk/base/src/nsSocketTransport2.cpp
>@@ -938,24 +938,22 @@ nsSocketTransport::ResolveHost()
>     if (NS_FAILED(rv)) return rv;
> 
>     mResolving = PR_TRUE;
> 
>     PRUint32 dnsFlags = 0;
>     if (mConnectionFlags & nsSocketTransport::BYPASS_CACHE)
>         dnsFlags = nsIDNSService::RESOLVE_BYPASS_CACHE;
> 
>+    SendStatus(STATUS_RESOLVING);
>     rv = dns->AsyncResolve(SocketHost(), dnsFlags, this, nsnull,
>                            getter_AddRefs(mDNSRequest));
>     if (NS_SUCCEEDED(rv)) {
>         SOCKET_LOG(("  advancing to STATE_RESOLVING\n"));
>         mState = STATE_RESOLVING;
>-        // only report that we are resolving if we are still resolving...
>-        if (mResolving)
>-            SendStatus(STATUS_RESOLVING);

Just above this is code that returns early if DNS is done by the proxy server.  I don't see any language in the spec about what to do in the case of resolution by proxy.  Is that worth bringing up in the WG?  Seems like we could just set both to 'fetchStart'?

>+struct TimingStruct {
>+  mozilla::TimeStamp channelCreation;
>+  mozilla::TimeStamp asyncOpen;
>+
>+  mozilla::TimeStamp domainLookupStart;

any reason for the blank line?
Attachment #527887 - Flags: review?(jduell.mcbugs) → review+
Whiteboard: [strings]
(In reply to comment #36)
> I pushed this patch to try, and it caused xpcshell tests to fail on about
> 2/3 of platforms with one or more of the following tests failing (and I
> don't see any of these marked as random orange):

Will look into this...

> There's no e10s support in the patch:  we can add it in a followup if we can
> squeeze this into FF 6.  Otherwise, it'd be nice to just add it at the same
> time.  Most of the accessor functions can be moved into HttpBaseChannel, and
> then the timing data can be moved over during OnStart/OnDataAvail.

Yes, I'd like to get this into FF6 and do e10s as a followup patch.

> Finally, it would be nice to have an automated test here (I assume
> TestProtocols.cpp doesn't get run automatically),

Yes, TestProtocols is a developer tool, not an automated test.

> even if it's just
> something that opens the same channel twice and makes sure that the correct
> fields are filled in (in the from-network vs from-cache cases), and that
> values that must be later/earlier are so.  Again, I can deal with that being
> in the e10s followup if we're keen to get this into FF 6.

Will do in the followup.

> Just above this is code that returns early if DNS is done by the proxy
> server.  I don't see any language in the spec about what to do in the case
> of resolution by proxy.  Is that worth bringing up in the WG?  Seems like
> we could just set both to 'fetchStart'?

Isn't this addressed by step 9 of the processing model?
"If no domain lookup is required, go to step 11."

> >+struct TimingStruct {
> >+  mozilla::TimeStamp channelCreation;
> >+  mozilla::TimeStamp asyncOpen;
> >+
> >+  mozilla::TimeStamp domainLookupStart;
> 
> any reason for the blank line?

I feel that the channel creation/async open timestamps are somewhat separate from the actual networking timings; hence the blank line.
Comment on attachment 527887 [details] [diff] [review]
patch with TimeStamp

Two nits:

1)  PRTime in the idl for PRTimes, please.
2)  [notxpcom] for the timestamp methods.

sr=me with that.
Attachment #527887 - Flags: superreview?(bzbarsky) → superreview+
Just fixes some bitrot from diff context, and some changes that had already landed in TestProtocols.cpp.  Carrying forward my +r and bz's +sr.

Still needs xpcshell failures investigated.
Attachment #527887 - Attachment is obsolete: true
Attachment #534145 - Flags: superreview+
Attachment #534145 - Flags: review+
Merged to trunk (again); this fixes at least one of the xpcshell failures. webbrowserpersist also needs to whitelist the new status code.

Note: I'm not at all sure that webbrowserpersist's implementation of OnStatus is correct, but I don't want to deal with that right now.
Attachment #534145 - Attachment is obsolete: true
Attachment #534161 - Flags: superreview+
Attachment #534161 - Flags: review+
Pushed to try; let's see if this version works:
http://tbpl.mozilla.org/?tree=Try&pusher=cbiesinger@gmail.com
Status: NEW → ASSIGNED
          case NS_NET_STATUS_RESOLVING_HOST:
 +        case nsISocketTransport::STATUS_RESOLVED:
          case NS_NET_STATUS_BEGIN_FTP_TRANSACTION:
          case NS_NET_STATUS_END_FTP_TRANSACTION:
          case NS_NET_STATUS_CONNECTING_TO:
          case NS_NET_STATUS_CONNECTED_TO:
          case NS_NET_STATUS_SENDING_TO:
          case NS_NET_STATUS_RECEIVING_FROM:
          case NS_NET_STATUS_WAITING_FOR:

Might be nice to #define NS_NET_STATUS_RESOLVED_HOST in nsISocketTransport.idl, just to be consistent style-wise?
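
Something like the following, next to the existing macros in nsISocketTransport.idl's %{C++ block (a sketch, assuming the same pattern the other NS_NET_STATUS_* defines use):

    // Mirrors e.g. NS_NET_STATUS_RESOLVING_HOST (sketch).
    #define NS_NET_STATUS_RESOLVED_HOST nsISocketTransport::STATUS_RESOLVED
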
In the list of measured times I'm missing the WAITING phase: the time after we have sent the request and are waiting for the first byte of the response to arrive.  There should probably be a responseEnd property as well.
Honza - this patch is now ready for checkin, could you file a new bug for adding that timing information?

(I assume what you want is the timestamp right after the write() call?)
Attachment #534172 - Attachment is obsolete: true
http://hg.mozilla.org/mozilla-central/rev/17b04197ed58

Filed Bug 658819 for e10s
Filed bug 658822 for the DNS issue from comment 18 et al
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Hrm.  So notxpcom doesn't give us signatures returning TimeStamp?  That's sad.  :(
Ah right... yeah it doesn't. I was hoping it would, too, but I got disappointed :( I guess that's only for functions?
Blocks: 658876
Blocks: 658822
Blocks: 658894
Depends on: 662555
> I was hoping it would, too, but I got disappointed

For what it's worth, bsmedberg says that using notxpcom getFoo() methods instead of attributes would have let this work like we want it to.  Followup bug?
triage: no sec review needed