Last Comment Bug 662555 - Race condition when measuring requestEnd
: Race condition when measuring requestEnd
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Networking: HTTP (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla9
Assigned To: Honza Bambas (:mayhemer)
:
: Patrick McManus [:mcmanus]
Mentors:
Depends on:
Blocks: 570341 576006 658894 673226
  Show dependency treegraph
 
Reported: 2011-06-07 09:47 PDT by Honza Bambas (:mayhemer)
Modified: 2011-08-21 11:27 PDT (History)
7 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
v1 (1.72 KB, patch)
2011-06-18 08:53 PDT, Honza Bambas (:mayhemer)
jduell.mcbugs: review+
Details | Diff | Splinter Review
v1.1 (1.77 KB, patch)
2011-08-02 11:58 PDT, Honza Bambas (:mayhemer)
honzab.moz: review+
Details | Diff | Splinter Review
v2 (1.74 KB, patch)
2011-08-02 12:25 PDT, Honza Bambas (:mayhemer)
jduell.mcbugs: review+
Details | Diff | Splinter Review

Description Honza Bambas (:mayhemer) 2011-06-07 09:47:05 PDT
nsHttpChannel::OnStopRequest calls gHttpHandler->CancelTransaction.  That method posts an event to the socket thread that cancels the transaction asynchronously (by call to nsHttpTransaction::Close).  In that method we stamp the requestEnd time.   Sometimes this event fires later then we read the value of requestEnd time on the main thread, giving us a null value for it.

The time (also more precise) has be stamped at nsHttpTransaction::HandleContent when we detect eof.


In general we should check if the timings are (and can be/have to be) in sync with the activity distributor.
Comment 1 Jason Duell [:jduell] (needinfo me) 2011-06-08 10:46:41 PDT
Would it be good enough to set the requestEnd timestamp in nsHttpChannel::OnStopRequest?  AFAICT we really only care about when the data finished getting to the user, not when the transaction was cancelled asynchronously.
Comment 2 Honza Bambas (:mayhemer) 2011-06-09 09:45:50 PDT
We could do that in OnStopRequest, right, but just to note that this is also an event posted from the socket thread to the main thread.  So, there is some delay.  Question is, do we want to include that delay or not?  All other measures are captured on the socket thread, so why not this one as well?  Also these should be pure-as-can-be network only timings.

Also it shows that we don't protect the timings with a mutex.  I wouldn't actually add a mutex (negative perf effect) but rather (if possible, wouldn't break the web timing API) make the values accessible only after OnStopRequest has been called, e.g. mPending == false.  But it is not for this bug.

Adding a hard block for bug 658894 as it shots leak test down (undetected by try before).
Comment 3 Jason Duell [:jduell] (needinfo me) 2011-06-09 13:59:57 PDT
> We could do that in OnStopRequest, right, but just to note that this is also an event posted from the socket thread to the main thread.  So, there is some delay.  Question is, do we want to include that delay or not?

Well, OnStop is when the channel "ends" from the client's perspective, so it seems like a good place to take the measurement.

>  All other measures are captured on the socket thread, so why not this one as well? 

Some events (DNS resolution, TCP connection, etc) do not get notifiction on the user thread, so we can't measure them from there.  But the timings are ultimately for a user to understand how much time in the network layer their code takes, and that include whatever delay we take in posting OnStop to the main thread.   I'm not too worried about which thread we measure which event on, as long as the metrics are sensible.  And OnStopReq seems the sensible place for requestEnd.
Comment 4 Honza Bambas (:mayhemer) 2011-06-18 08:48:31 PDT
There are 3 points of view IMO:

1. To be precise.  To exactly measure the times we spend by some activities.  We are using this API for our C++ coded telemetry, that has to be precise as it can be, but we can use different ways to get the telemetry, so not a strong argument...

2. To be compatible with other ways we measure.  Mainly with the activity observer that Firebug uses.  NS_HTTP_ACTIVITY_SUBTYPE_RESPONSE_COMPLETE is what Firebug uses for stamping the "response end" time and it is exactly at the place I suggest, [1].  I checked we are compatible for the rest of the notification, which is good.

3. Deliver what Web Timing API wants:  "This attribute (responseEnd) must return the time immediately after the user agent receives the last byte of the current document or immediately before the transport connection is closed, whichever comes first. The document here can be received either from the server, relevant application caches or from local resources."  Also, there is a web version of Firebug that would like to use it.  It wouldn't be good to give different numbers using XPI Firebug and web Firebug.


So, it is clear this needs to be put to EOF detection, the same place where the activity observer notifies.


[1] http://hg.mozilla.org/mozilla-central/annotate/e319657f9c92/netwerk/protocol/http/nsHttpTransaction.cpp#l1125
Comment 5 Honza Bambas (:mayhemer) 2011-06-18 08:53:00 PDT
Created attachment 540246 [details] [diff] [review]
v1

This is a patch that puts the stamp to the right place.  We have to leave it as well in nsHttpTransaction::Close when we get an error state.  In case of an error that method seems to be called before nsHttpChannel::OnStopRequest, so we are safe.
Comment 6 Jason Duell [:jduell] (needinfo me) 2011-06-22 22:35:46 PDT
Comment on attachment 540246 [details] [diff] [review]
v1

Review of attachment 540246 [details] [diff] [review]:
-----------------------------------------------------------------

> This attribute (responseEnd) must return the time immediately 
> after the user agent receives the last byte of the current document 
> or immediately before the transport connection is closed, whichever comes first

You're right.
Comment 7 Honza Bambas (:mayhemer) 2011-06-28 08:01:35 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/edf3a8d6cff3
Comment 8 Christian :Biesinger (don't email me, ping me on IRC) 2011-06-28 09:15:36 PDT
thanks for fixing this! but shouldn't this have a TimingEnabled() check? I guess I forgot that in the original codepath...
Comment 9 Honza Bambas (:mayhemer) 2011-06-28 09:34:03 PDT
(In reply to comment #8)
> thanks for fixing this! but shouldn't this have a TimingEnabled() check? I
> guess I forgot that in the original codepath...

Probably should.  I can do that in a followup.
Comment 10 Joe Drew (not getting mail) 2011-06-28 11:44:14 PDT
Backed out for apparently causing reftest assertion failures on Windows debug: http://hg.mozilla.org/integration/mozilla-inbound/rev/3890c438f22d
Comment 11 Honza Bambas (:mayhemer) 2011-08-02 11:37:28 PDT
I have no idea why this should cause reftest assertions.  Unfortunatelly logs are gone by now.

I believe this was backed out because the original version of the bug 658894 patch was causing assertion failures (changeset fcf08eff68eb) but a new version currently landed on m-c (eb5866601f88) doesn't suffer from it.

I vote for just relanding this.
Comment 12 Honza Bambas (:mayhemer) 2011-08-02 11:58:11 PDT
Created attachment 550146 [details] [diff] [review]
v1.1

Same as the v1 patch just added the TimingEnabled() check.
Comment 13 Honza Bambas (:mayhemer) 2011-08-02 12:09:45 PDT
Hmm... looks like the spec has changed in the mean time:

responseEnd: "This attribute must return the time immediately after the user agent finishes receiving the last byte of the resource from from relevant application caches or from local resources."

There is no more anything about closing the connection.

I will update the patch.
Comment 14 Honza Bambas (:mayhemer) 2011-08-02 12:25:56 PDT
Created attachment 550161 [details] [diff] [review]
v2

This will also fix bug 673226.

http://tbpl.mozilla.org/?tree=Try&rev=07e1908a3b94
Comment 15 Ed Morley [:emorley] 2011-08-20 15:59:09 PDT
Sent to try:
http://tbpl.allizom.org/?tree=Try&usebuildbot=1&rev=626e32292fdc

Assuming all green, will push to inbound after.

I'm happy to fix it this time, but for the future please can you make sure the format (context, commit message, author) is correct:
http://blog.bonardo.net/2010/06/22/so-youre-about-to-use-checkin-needed

Thanks :-)
Comment 17 :Ms2ger (⌚ UTC+1/+2) 2011-08-21 11:27:57 PDT
http://hg.mozilla.org/mozilla-central/rev/95996d0763a6

Note You need to log in before you can comment on or make changes to this bug.