Closed Bug 1132572 Opened 5 years ago Closed 5 years ago

TSan: multiple data races under netwerk/base/nsLoadGroup.cpp:926 TelemetryReportChannel

Categories

(Core :: Networking: HTTP, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox37 --- wontfix
firefox38 --- fixed
firefox39 --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: froydnj, Assigned: valentin)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tsan])

Attachments

(3 files, 3 obsolete files)

[cribbing from decoder's script, hopefully he won't mind]

The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even in this case though, we should try to come up with a fix unless this would cause unacceptable performance issues. Also note that seemingly benign races can possibly be harmful (also depending on the compiler and the architecture) [1].

If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist.

There are a number of races shown in the attached log, all related to how the networking code records time stamps for reporting to telemetry.  I'm not sure how to make these atomic, and locking around recording them seems rather silly...

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
Assignee: nobody → valentin.gosu
nathan, thanks. This one is more interesting..

it shows main thread things accessing channel timing information while the socket thread is changing it. That's a decidedly non-atomic data structure.

valentin said he'll look at it - very likely related to https://bugzilla.mozilla.org/show_bug.cgi?id=1079837
Comment on attachment 8565405 [details] [diff] [review]
Lock nsHttpTransaction.mLock when accessing nsHttpTransaction.mTimings

Review of attachment 8565405 [details] [diff] [review]:
-----------------------------------------------------------------

I think nsHttpTransaction:Timings() should also be locked.

::: netwerk/protocol/http/nsHttpTransaction.h
@@ +148,5 @@
>      }
>      void SetPushedStream(Http2PushedStream *push) { mPushedStream = push; }
>  
> +    // nsHttpChannel needs a lock to access the timing info from the main thread
> +    Mutex& GetMutexRef() { return mLock; }

This isn't a workable api.. callers have to get the mutex, lock it, and then know what data can be accessed. It made the channel code fragile.

I think we are better off with

a] a timings() method that gets a private lock and makes a copy of the structure. that can be used for the "channel is done with the transaction" code path that is make a copy through assignment anyhow

b] nsHttpTransaction accessors for the various members of timingStruct.. so they can handle the locking without doing a copy via method [a]
Attachment #8565405 - Flags: review?(mcmanus) → review-
Attachment #8565405 - Attachment is obsolete: true
Comment on attachment 8567363 [details] [diff] [review]
Fix data race for nsHttpTransaction's timing info

Review of attachment 8567363 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/nsHttpTransaction.cpp
@@ +708,5 @@
>          return NS_BASE_STREAM_CLOSED; // stop iterating
>  
> +    if (trans->TimingEnabled()) {
> +        if (trans->Timings().responseStart.IsNull()) {
> +            trans->SetResponseStart(TimeStamp::Now());

how about SetResponseStartOnce() ? That will avoid getting the lock twice.. which is both overhead and maybe racy.

This comment would apply to all of TimingStruct::Set* methods

@@ +1842,5 @@
>      }
>  }
>  
> +const TimingStruct
> +nsHttpTransaction::Timings() {

nit brace

non nit - I would have to read the asm (or google :)).. but I think this is making two copies of the TminigStruct.. one on the stack and then one into the lvalue.

Can't we just make it nsHttpTransaction::Timings(TimingStruct &outVal) ?

Alternatively, it looks like maybe only nsHttpChannel uses Timings() and it is really always accessing a single member of the structure.. so maybe we just need GetDomainLookupStart(), etc.. and can get rid of Timings() entirely.
Attachment #8567363 - Flags: review?(mcmanus)
Attachment #8567363 - Attachment is obsolete: true
(In reply to Patrick McManus [:mcmanus] from comment #5)
> Comment on attachment 8567363 [details] [diff] [review]
> Fix data race for nsHttpTransaction's timing info
> 
> Review of attachment 8567363 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: netwerk/protocol/http/nsHttpTransaction.cpp
> @@ +708,5 @@
> >          return NS_BASE_STREAM_CLOSED; // stop iterating
> >  
> > +    if (trans->TimingEnabled()) {
> > +        if (trans->Timings().responseStart.IsNull()) {
> > +            trans->SetResponseStart(TimeStamp::Now());
> 
> how about SetResponseStartOnce() ? That will avoid getting the lock twice..
> which is both overhead and maybe racy.
> 
> This comment would apply to all of TimingStruct::Set* methods
> 

I added a default bool onlyIfNull = false parameter to the set methods.

> non nit - I would have to read the asm (or google :)).. but I think this is
> making two copies of the TminigStruct.. one on the stack and then one into
> the lvalue.
> 
> Can't we just make it nsHttpTransaction::Timings(TimingStruct &outVal) ?
> 

According to this it seems it should be just fine: http://stackoverflow.com/a/4986802
Also it's a little less verbose. But I can change it if you prefer it like that :)

> Alternatively, it looks like maybe only nsHttpChannel uses Timings() and it
> is really always accessing a single member of the structure.. so maybe we
> just need GetDomainLookupStart(), etc.. and can get rid of Timings()
> entirely.

It's also accessed in OnStopRequest, where it copies the entire TimingStruct.
Attachment #8568335 - Flags: review?(mcmanus) → review+
(In reply to Valentin Gosu [:valentin] from comment #7)
> 
> According to this it seems it should be just fine:
> http://stackoverflow.com/a/4986802

interesting - thanks for doing the homework
(In reply to Patrick McManus [:mcmanus] from comment #8)
> (In reply to Valentin Gosu [:valentin] from comment #7)
> > 
> > According to this it seems it should be just fine:
> > http://stackoverflow.com/a/4986802
> 
> interesting - thanks for doing the homework

That's a very different use-case from what you're doing here, though.  With a std::vector<>, you really don't want to copy all the allocated memory (and the contained things, if they were more complicated than ints), and moving just a couple pointers around is a lot cheaper.  For TimingStruct, a move is essentially equivalent to a copy (possibly even more expensive, if you null out the members of the TimingStruct you move'd from).
Right, the move operators aren't really useful for a struct of this size, but the return-value-optimizations should ensure that only one copy is performed.
Patrick, is that acceptable overhead, or would you rather we had Getters for every timing field?
Flags: needinfo?(mcmanus)
(In reply to Valentin Gosu [:valentin] from comment #10)
> Right, the move operators aren't really useful for a struct of this size,
> but the return-value-optimizations should ensure that only one copy is
> performed.
> Patrick, is that acceptable overhead, or would you rather we had Getters for
> every timing field?

Answering my own question: In bug 1123920 I need to get requestStart in OnTransportStatus, which gets called several times, so the additional overhead is probably unfortunate there.

Sorry for the extra bugmail :)
Flags: needinfo?(mcmanus)
Attachment #8568335 - Attachment is obsolete: true
Attachment #8568886 - Flags: review?(mcmanus) → review+
https://hg.mozilla.org/mozilla-central/rev/924e81b9fbf5
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Comment on attachment 8568886 [details] [diff] [review]
Fix data race for nsHttpTransaction's timing info

Approving for Aurora as a prereq of bug 1123920. Approval request in bug 1123920 comment 21. Aurora+
Attachment #8568886 - Flags: approval-mozilla-aurora+
Comment on attachment 8576678 [details] [diff] [review]
[Beta] Fix data race for nsHttpTransaction's timing info

Required as prereq for bug 1123920. See bug 1123920 comment 21 for uplift request.
Attachment #8576678 - Flags: approval-mozilla-b2g37?
Attachment #8576678 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
You need to log in before you can comment on or make changes to this bug.