Closed Bug 506805 Opened 11 years ago Closed 11 years ago

Remove locking in AsyncExecuteStatements

Categories

(Toolkit :: Storage, defect)

Tracking

RESOLVED FIXED
mozilla1.9.2a1

People

(Reporter: sdwilsh, Assigned: sdwilsh)

References

Details

(Keywords: dev-doc-complete)

Attachments

(1 file, 2 obsolete files)

Attached patch v1.0 (obsolete) — Splinter Review
We can get away with not locking at all when executing statements asynchronously.  Modern architectures have 32-bit atomic reads and writes, so if we are careful in how we read and write to variables, we can get rid of locking completely.  Doing this results in zero hanging in the location bar (bug 503701), whereas the hang was trivially reproducible before this patch.

I'm still using PR_AtomicSet because I'm paranoid.  The JS guys are relying on atomic read and atomic write of 32-bit sized values already, so I could get rid of those calls if I wanted to.
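For illustration, the pattern the patch adopts (one thread atomically sets a cancellation flag, the background thread polls it with atomic reads) can be sketched with modern C++ std::atomic, a later-era equivalent of PR_AtomicSet. All names here are hypothetical, not the actual patch code:

```cpp
#include <atomic>

// Hypothetical sketch of the lock-free pattern under discussion: the
// calling thread sets a cancellation flag with an atomic write, and the
// background thread polls it with atomic reads. std::atomic supplies the
// atomicity and ordering the patch obtained from PR_AtomicSet on a PRInt32.
struct AsyncExecutionSketch {
  std::atomic<int> mCancelRequested{0};

  // Called from the calling thread.
  void cancel() {
    mCancelRequested.store(1, std::memory_order_release);
  }

  // Called on the background thread; the loop body stands in for stepping
  // the SQL statement one row at a time.
  int executeRows(int totalRows) {
    int processed = 0;
    for (int i = 0; i < totalRows; ++i) {
      if (mCancelRequested.load(std::memory_order_acquire))
        break;  // Cancellation is best-effort, not precise.
      ++processed;
    }
    return processed;
  }
};
```

The acquire/release pairing mirrors what the memory-bus-locking PR_AtomicSet implementations discussed below provide implicitly.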
Attachment #390957 - Flags: review?(bugmail)
Attachment #390957 - Flags: review?(bent.mozilla)
Whiteboard: [needs review asuth][needs review bent]
Is there a summary of the memory models on our supported platforms?  I heart mutexes because they take care of the memory barriers and compiler ramifications for us (most of the time).

For example, shouldn't mState be marked volatile for read purposes?
(In reply to comment #1)
> Is there a summary of the memory models on our supported platforms?  I heart
> mutexes because they take care of the memory barriers and compiler
> ramifications for us (most of the time).
Andreas had mentioned something to me yesterday, but I honestly can't recall where it was.

> For example, shouldn't mState be marked volatile for read purposes?
He also had a reason for why this wasn't needed, but I can't recall either.  Kinda wishing I had had more sleep yesterday so I could remember the answers to your questions :/
We are OK on x86 because it follows the MESI protocol (http://en.wikipedia.org/wiki/MESI_protocol).  We just have to make sure the compiler doesn't re-order instructions (so, should probably use volatile to protect against that).

Looking for other architectures still.
See bug 477187 comment 13 and bug 477187 comment 15 for the research Andreas did on cache coherency.

This patch is OK because it uses the PR_AtomicSet functions, which properly lock the memory bus on x86 and x86-64 via xchgl.  On PPC, it uses lwarx and stwcx., which are safe [1].  On Windows (all versions, AFAICT), it uses InterlockedExchange, which provides the needed memory barrier for this to be safe [2].  Linux uses the same assembly as OS X, so it is safe as well.

I think it's safe to say that PR_AtomicSet handles this sanely so we can safely use it.

[1] http://➡.ws/赽
[2] http://➡.ws/᷊Ủ
My concern was more about the reads and compiler assumptions related to the reads; not that PR_AtomicSet did what it said it would do.

It looks like only mCancelRequested in executeAndProcessStatement would be remotely in danger, but thanks to the function calls being made in the loop the compiler has to assume the value could change during one of those calls.  Even with -O3 the generated code is safe.  This should hold true on architectures like the PowerPC where the value has to be fetched to a local register before comparison.  Having said that, I don't think there's any harm marking mCancelRequested volatile, and there could be some benefit should we ever get a compiler clever enough to pierce the function calls.
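The codegen situation described here can be sketched minimally as follows. All names are hypothetical; in the real patch the opaque call is sqlite3_step, which lives in another translation unit, so the compiler must assume the flag can change across the call and reload it each iteration. volatile makes that reload guarantee explicit rather than incidental:

```cpp
// Hypothetical sketch of the read pattern discussed above. Because the
// loop body calls a function the optimizer cannot see through, the read
// of mCancelRequested cannot be hoisted out of the loop; marking the
// flag volatile guarantees the per-iteration reload even if a smarter
// compiler could pierce the call.
static int gStepsRemaining = 5;  // stand-in for rows left in the result set
static bool stepStatement() { return --gStepsRemaining > 0; }

struct ExecutionSketch {
  volatile int mCancelRequested = 0;

  int run() {
    int steps = 0;
    while (!mCancelRequested) {  // re-read on every iteration
      if (!stepStatement())      // opaque call: flag may change across it
        break;
      ++steps;
    }
    return steps;
  }
};
```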

(I had previously misread the patch and thought there was a place where mState was used twice in logic in a function...)
Changing it to volatile would mean that I have to reinterpret_cast<PRInt32> every time I pass it to PR_AtomicSet (so once), but I can do that if you want me to.
C is underspecified enough to allow all sorts of reordering without volatile or explicit barrier calls. Cc'ing gurus for advice. What's more, gcc hackers keep threatening (IIRC) to change things so as to optimize harder.

/be
Getting comments/feedback/review sooner rather than later would be great.  This bug is the long pole for landing bug 455555 (P1 blocker) for the alpha (freeze tonight).
Comment on attachment 390957 [details] [diff] [review]
v1.0

r=asuth if you mark everything that you atomic set with "volatile".  If the only downside is casts, I think I can live with it.
Attachment #390957 - Flags: review?(bugmail) → review+
Turns out to cast away volatile when passing the variables to PR_AtomicSet, I need to use a const_cast as opposed to reinterpret_cast (at least on gcc).  Pushing that change to the try server to make sure windows is OK with this.

In the future, we could change all this to just use memory barriers, but that's a bigger change than I'd like to do just before a freeze.
Whiteboard: [needs review asuth][needs review bent] → [needs review bent]
Attached patch v1.1 (obsolete) — Splinter Review
Attachment #390957 - Attachment is obsolete: true
Attachment #391431 - Flags: review?(bent.mozilla)
Attachment #390957 - Flags: review?(bent.mozilla)
const_cast works fine on windows.  The current patch also addresses Brendan's concerns in comment 7 (uses volatile now).
Attached patch v2.0Splinter Review
After talking with bent over IRC about this, it turns out the patch was a bit wrong.  We held the lock during execution for a reason: to get the return value right in cancel.  That was added in bug 458998.  Having cancel return anything wasn't really worthwhile, though; you get that information in handleCompletion, and callers couldn't do anything different with it anyway.

So, this patch makes cancel a void function instead of returning a boolean.  Need sr on the API change.  It should be noted that this was how it was originally implemented, and we added the return variable in bug 458998 (admittedly, with poor reasoning on my part).
Attachment #391431 - Attachment is obsolete: true
Attachment #391475 - Flags: superreview?(vladimir)
Attachment #391475 - Flags: review?(bugmail)
Attachment #391475 - Flags: review?(bent.mozilla)
Attachment #391431 - Flags: review?(bent.mozilla)
Whiteboard: [needs review bent] → [needs review bent][needs review asuth][needs sr vlad]
Comment on attachment 391475 [details] [diff] [review]
v2.0

sr=me on that api change.
Attachment #391475 - Flags: superreview?(vladimir) → superreview+
Attachment #391475 - Flags: review?(bugmail) → review+
Flags: in-testsuite-
Whiteboard: [needs review bent][needs review asuth][needs sr vlad] → [needs review bent]
Comment on attachment 391475 [details] [diff] [review]
v2.0

>+, mCancelRequested(false)

I'd go with 0/1 rather than false/true since this is an int.
Attachment #391475 - Flags: review?(bent.mozilla) → review+
http://hg.mozilla.org/projects/places//rev/7f4a5f6f26f1
Keywords: dev-doc-needed
Whiteboard: [needs review bent] → [fixed-in-places]
Target Milestone: --- → mozilla1.9.2a1
Making a variable volatile simply tells the compiler that it must do everything to the variable that you've told it to by the next sequence point, even if it can't see why doing so would matter.  Volatile doesn't specify what instructions it should use, so the compiler could use multiple byte stores to write a 32-bit value, for example.  (If you use volatile sig_atomic_t, that's a magic combination that tells the compiler to ensure partial updates won't be seen.)  And volatile doesn't restrict cache behavior or specify barrier semantics; it leaves you at the mercy of the processor as far as multi-threaded behavior is concerned.

When you call a function the compiler doesn't know about, though, that restricts the rearrangements the compiler can do in the same way that volatile does --- except it applies to *all* reachable objects: the compiler can't predict which ones the unknown function will access, so it has to put all of them in order.  (Andrew makes this point in comment #5.  And returning to an unknown function is adequate, too: it's crossing compilation unit boundaries that makes the difference.)  However, in the presence of link-time code generation, I wonder whether this strategy still applies: the compiler "knows" about the entire program.

PR_AtomicSet seems to be defined in a way that guarantees it'll be treated as "unknown" in this sense, regardless of how you do the translation.  So as far as writes are concerned, if you're using PR_AtomicSet, making the variable volatile doesn't get you anything more.

As Andrew says, the read is the interesting issue.  A read of an ordinary variable in a loop could be hoisted out of the loop; a read of a volatile variable would stay put, but a processor could let the reads hit cache, effectively hoisting the read back out of the loop.  So the read needs to be done in a way which guarantees the processor won't move it around either.  There doesn't seem to be a PR_AtomicGet function.
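The missing read primitive described here could be sketched in modern C++ as an acquire load, which gives exactly the guarantee being asked for: neither the compiler nor the processor may hoist it out of a polling loop. The name is hypothetical; NSPR shipped no such function at the time:

```cpp
#include <atomic>

// Hypothetical "PR_AtomicGet" equivalent: an acquire load that must be
// performed at its program point on every execution, paired with the
// release store done by the writer. This closes the read-side gap that
// PR_AtomicSet alone leaves open.
inline int AtomicGetSketch(const std::atomic<int>& aValue) {
  return aValue.load(std::memory_order_acquire);
}
```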

It's not clear to me why a mutex is expensive here: if it's just protecting a status flag, you should almost never be blocking on it.  It would be nice to know what's actually going on there.  And mutexes make all these stupid memory model questions --- which even people who work with them all the time get subtly wrong from time to time --- go away.
(In reply to comment #17)
> Making a variable volatile simply tells the compiler that it must do
> everything to the variable that you've told it to by the next sequence
> point, even if it can't see why doing so would matter.  Volatile doesn't
> specify what instructions it should use, so the compiler could use multiple
> byte stores to write a 32-bit value, for example.

The compiler is also required not to change the number of accesses to a volatile variable, irrespective of sequence points, even if that would make sense in optimization terms -- as a silly example, in

  extern volatile int n;
  int x() { return n + n; }
  int y() { int N = n;  return N + N; }

x() must read 'n' twice, and y() must read it once.

Embedded and kernel people put a lot of pressure on compiler people to guarantee that the compiler *will* use memory load and store instructions that match the width of the declared type, assuming such instructions exist.  They generally do get that guarantee, but it's not in the standard, except (as Jim mentioned) for sig_atomic_t.

> And volatile doesn't restrict cache behavior or specify barrier
> semantics; it leaves you at the mercy of the processor as far as
> multi-threaded behavior is concerned.

Even single-threaded, there's a really big, nasty gotcha in there: the compiler is entitled to move accesses to non-volatile variables across accesses to volatile variables.  In other words, 'volatile' is not even an optimization barrier, never mind a CPU-level memory barrier.  In this example

  extern volatile int v, u;
  extern int i;

  int e() { return v + u; }
  int f() { return u + v; }
  int g() { return v + i; }
  int h() { return i + v; }

GCC 4.3 reads 'v' before 'u' in e(), and 'u' before 'v' in f(), but reads 'v' before 'i' in both g() and h().  [This is a bad example, as evaluation order is not defined in this case even for volatile variables; I should have copied everything into local variables first.  But it makes the point.]

> When you call a function the compiler doesn't know about, though, that
> restricts the rearrangements the compiler can do in the same way that volatile
> does --- except it applies to *all* reachable objects: the compiler can't
> predict which ones the unknown function will access, so it has to put all of
> them in order.

Right.  But keep in mind what "reachable" means -- the compiler is entitled to assume that function-local and 'static' variables whose addresses are never taken are unreachable to a function not defined in the current translation unit, for instance.

> However, in the presence of link-time code
> generation, I wonder whether this strategy still applies: the compiler "knows"
> about the entire program.

One of the first things compiler people tend to do with link-time analysis is relax this very restriction.  It helps a *lot* with scientific codes, which are overrepresented in SPEC. :-/

> PR_AtomicSet seems to be defined in a way that guarantees it'll be treated as
> "unknown" in this sense, regardless of how you do the translation.

If you meant PR_ATOMIC_SET, which is defined when possible in terms of compiler intrinsics that do (as far as I know) constitute optimization barriers, then yes.  PR_AtomicSet, when used bare, appears to be broken in the presence of link-time optimization -- and so is PR_ATOMIC_SET with recent gcc on a wacky CPU, because pratom.h doesn't fall back to a generic GCC optimization barrier [e.g. asm volatile("" ::: "memory")] if it isn't happy with the __sync_* intrinsics.  I'd call that a NSPR bug.
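The generic GCC optimization barrier mentioned here can be written as an empty asm with a "memory" clobber. It is compiler-specific (GCC/Clang) and emits no CPU instruction at all; it only forbids the compiler from caching or reordering memory accesses across it, which is the fallback pratom.h is missing:

```cpp
// GCC/Clang-only sketch of a pure compiler barrier: the empty asm with a
// "memory" clobber makes the compiler assume all memory may have changed,
// without generating any instruction or CPU-level fence.
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

static int gFlag = 0;

// Example: the barrier forces gFlag to be re-read from memory here rather
// than satisfied from a stale register copy.
static int pollOnce() {
  COMPILER_BARRIER();
  return gFlag;
}
```

Note this is only an optimization barrier; on a weakly ordered CPU a real memory barrier instruction is still needed for cross-thread ordering.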

(The GNU toolchain doesn't have much in the way of link-time optimization *now*, but it is one of the top three major development projects as I write this.)

> So as far as writes are concerned, if you're using PR_AtomicSet, making
> the variable volatile doesn't get you anything more.
> As Andrew says, the read is the interesting issue.  A read of an ordinary
> variable in a loop could be hoisted out of the loop; a read of a volatile
> variable would stay put, but a processor could let the reads hit cache,
> effectively hoisting the read back out of the loop.  So the read needs to be
> done in a way which guarantees the processor won't move it around either. 

Yah.

BTW, I suggest reading http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/memory-barriers.txt;hb=HEAD which discusses all of these issues at length, and knows a lot more about the multiprocessor issues than I do.  Linux developers have a habit of making egregiously incorrect statements about compilers but I don't see any of them in that document.

> It's not clear to me why a mutex is expensive here: if it's just protecting a
> status flag, you should almost never be blocking on it.  It would be nice to
> know what's actually going on there.

A mutex is going to involve at least one exclusive memory bus transaction per lock/unlock pair.  That's a major pipeline stall (I have seen numbers in the range of 30-200 cycles), possibly for all CPUs (depending on how smart the memory controller is).  If it's being frequently updated by multiple threads, you're also playing cache line ping-pong -- although the status flag itself might do that in the absence of mutexes.
(In reply to comment #18)
> A mutex is going to involve at least one exclusive memory bus transaction per
> lock/unlock pair.  That's a major pipeline stall (I have seen numbers in the
> range of 30-200 cycles), possibly for all CPUs (depending on how smart the
> memory controller is).  If it's being frequently updated by multiple threads,
> you're also playing cache line ping-pong -- although the status flag itself
> might do that in the absence of mutexes.

Sure --- but that's not even close to the kind of latencies that (if I understand correctly) they were trying to remove here --- user-visible performance degradation doesn't get improved by switching from mutexes to unprotected memory references.  Either the lock is too coarse-grained, or someone's holding it longer than they need to, or something unexpected is going on.
The lock was being held during sqlite query execution. I think that we should bring the lock back and release it while running queries.
IMO, moving away from pthread_* to atomic ops and/or relying on details
of the memory coherence model, is a mistake from a software verification
point of view.  It (terminally) interferes with the ability of run time
tools such as the Intel Thread Checker, or valgrind --tool=helgrind,
to find races and potential deadlocks in such code.  I view it as trading
a short term benefit (performance) for a long term loss (verifiability).

We should be attempting to move away from doing low level locking with
(or without) lock prefixes etc, and have a threading substrate which
is by default, for production, based only on POSIX pthreads primitives.

Sure, performance is important.  Do we really need to make locking
incredibly cheap?  Is it possible instead to structure systems so
that the rate of lock acquisition/releasing is reduced?

Writing reliable sequential code without runtime tools is kind of doable.
But writing correct threaded code is almost beyond the reach of humans,
and we need to rely on automated testing/verification to an extent that
we don't need to for sequential code.
I feel very strongly that we need to avoid depending on or even thinking about memory consistency models and the associated compiler issues. That way lies madness. Seriously, I have watched people with PhDs who do this stuff for a living make mistakes over and over again, very subtle mistakes that aren't testable, that are very platform-specific, yet have disastrous consequences.

If we absolutely have to do something very clever for performance, we should create new synchronization primitives or synchronizing data structures in XPCOM, verify them using a tool like SPIN, and teach our race detection tools about them. But it seems that here, we don't.
As the code currently/recently stands, the only interesting synchronization thing happening is to support cancellation.  Cancellation only makes sense for non-mutations (so SELECTs) of the database where you no longer care about the results and want to avoid wasted CPU and IO cycles.  For example, because the application is being shutdown or the user input changed to invalidate the previous autocomplete query.

Since cancellation is not (and does not need to be) precise, we could get away without any synchronization primitives.  We just would not have a strong guarantee that cancellation would actually do anything at all.

I agree that mutexes are a better idea for synchronization than homebrew for all the previously discussed reasons.  I think the code as committed works for storage's very weak needs.  If someone sees a glaring deficiency in the current implementation that actively is going to cause a deadlock or wholesale data-loss, I say let's fix it.  Otherwise, I don't think the review burden and possibility of getting further changes wrong is worth it; as roc says, getting this stuff right is hard, and I'm a bit tired out after having reviewed many variations on this patch and its predecessors.  (I'm of course fine if others do the reviews :)
> Since cancellation is not/does not need to be precise, we could get away
> without any synchronization primitives.

This logic is not generally reliable. It is possible for memory other than the variable you actually touch to become corrupted by insufficient synchronization.

In this case it doesn't seem like much work to guard access to these shared variables with a mutex. Can't we do that and not set a bad precedent?

If we need some sort of very efficient cancellation flag, that's sort of thing we could add as an XPCOM primitive. Some abstraction like
class MonotonicFlag {
  PRBool IsRaised();
  void Raise();
};
where the only guarantees are that IsRaised returns false while no-one has called Raise, and that after some point in time after Raise has been called, all threads will return true from IsRaised.
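A hypothetical implementation of that MonotonicFlag sketch, written with a plain mutex so that race detectors like helgrind understand it, might look like this. It assumes C++11 std::mutex and bool rather than the NSPR/XPCOM primitives (PRBool, Mutex) the comment envisions:

```cpp
#include <mutex>

// Hypothetical mutex-based implementation of the MonotonicFlag interface
// sketched above. Keeping the flag behind a leaf mutex sidesteps all
// memory-model questions and stays visible to race-detection tooling.
class MonotonicFlag {
public:
  bool IsRaised() {
    std::lock_guard<std::mutex> lock(mMutex);
    return mRaised;
  }

  void Raise() {
    std::lock_guard<std::mutex> lock(mMutex);
    mRaised = true;  // Monotonic: never cleared once set.
  }

private:
  std::mutex mMutex;
  bool mRaised = false;
};
```

Since the mutex is a leaf in any lock ordering and is held only for the flag access, it cannot participate in a deadlock and its uncontended cost is negligible next to database I/O.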
(In reply to comment #23)
> As the code currently/recently stands, the only interesting synchronization
> thing happening is to support cancellation.  Cancellation only makes sense for
> non-mutations (so SELECTs) of the database where you no longer care about the
> results and want to avoid wasted CPU and IO cycles.  For example, because the
> application is being shutdown or the user input changed to invalidate the
> previous autocomplete query.

Just to clarify: there is a pthreads operation called "cancellation", which is widely considered to be treacherous to use, even in its synchronous form.  But we're not talking about POSIX threads cancellation, right?

We're talking about having one thread be able to communicate with another thread and have it terminate its work early.  The working thread is willing to poll a variable.

What is the argument against simply having a mutex to protect the cancellation flag, and having the worker and notifiers acquire that mutex to set and test the flag?  Such a mutex would be a leaf in the lock ordering, so it couldn't participate in any deadlocks.  And if the worker is doing database I/O, the overhead of that mutex will be dwarfed by statistical noise.  On Linux, at least, you'll get the uncontended path through the locking primitives, which doesn't even make any system calls.
(In reply to comment #20)
> The lock was being held during sqlite query execution.

Ugh. Don't do that.

> I think that we should
> bring the lock back and unlock while running queries.

Yup.

/be
This landed in mozilla-central yesterday morning.  I won't be backing this out because that means I'll have to backout an entire branch merge, but I'll be filing a follow-up shortly to address the concerns raised.
http://hg.mozilla.org/mozilla-central/rev/7f4a5f6f26f1
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [fixed-in-places]
Blocks: 507674