Closed Bug 1131308 Opened 10 years ago Closed 10 years ago

Replace {INCREMENT,DECREMENT}_RECURSION_LEVEL in pldhash.cpp with an RAII class

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla42
Tracking Status
firefox42 --- fixed

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

Attachments

(2 files, 6 obsolete files)

These macros are horrible. RAII will improve things greatly.
Assignee: nobody → continuation
This may be horribly overengineered, but as a preperatory step I encapsulated the recursion level into a class.
Comment on attachment 8562955 [details] [diff] [review] part 1 - Encapsulate recursion level in a class. Review of attachment 8562955 [details] [diff] [review]: ----------------------------------------------------------------- ::: xpcom/glue/pldhash.h @@ +159,5 @@ > + void Increment(); > + void Decrement(); > + > +#ifdef DEBUG > + void MarkImmutable(); I find it strange that some of the methods in this class are DEBUG only, while some are always present but empty when non-DEBUG. I would make all the methods except the zero-arg constructor DEBUG-ONLY, and use |#ifdef DEBUG| at call sites as necessary, as the current code does. That way you don't have to look at the class's innards to see that calls like mRecursionLevel.increment() are DEBUG-only.
Two other possible nits: - Nest RecursionLevel within PLDHashTable? - Define RecursionLevel's methods in pldhash.h for conciseness?
(In reply to Nicholas Nethercote [:njn] from comment #3) > Two other possible nits: > - Nest RecursionLevel within PLDHashTable? Yes. > - Define RecursionLevel's methods in pldhash.h for conciseness? No opinion.
(In reply to Nicholas Nethercote [:njn] from comment #3) > Two other possible nits: > - Nest RecursionLevel within PLDHashTable? Yeah, good point. > - Define RecursionLevel's methods in pldhash.h for conciseness? I was doing this to reduce the size of the header file, given how widely it is included. Maybe that isn't a big deal.
I'm probably not going to work on this soon. This should let us get rid of a couple of gotos in PLDHashTable::Add().
Assignee: continuation → nobody
I'll do it, though I probably won't get to it until May.
Assignee: nobody → n.nethercote
I'm looking at this again, and I think the whole mRecursionLevel field is flawed, as implemented. AIUI, it's in one of these three states: - Mutable(n) - Immutable |Mutable| has a count associated, indicating how many operations are underway. It's safe to have multiple reads going at once, but only one write can be happening at once. |Search| causes the following transitions (before --> during --> after): - Mutable(n) --> Mutable(n+1) --> Mutable(n) - Immutable --> Immutable --> Immutable |Add| and |Remove| cause the following transitions while they're occurring: - Mutable(0) --> Mutable(1) --> Mutable(0) - Mutable(_) --> ABORT! - Immutable --> ABORT! So this will catch bad things like if call Search then concurrently call Add. But it will *miss* bad things like if you call Add and then concurrently call Search. Basically, it's a botched emulation of an rwlock. If we change it to the following states, it should work better: - Read(n) - Write - Immutable |Search| would have these transitions: - Read(n) --> Read(n+1) --> Read(n) - Write --> ABORT! - Immutable --> Immutable --> Immutable |Add| and |Remove| would have these transitions: - Read(0) --> Write --> Read(0) - Read(n) --> ABORT! - Write --> ABORT! - Immutable --> ABORT!
Actually, I think the Read(n)/Write status needs to be separate from the Immutable status. Because currently Finish() thinks everything is fine if the table is immutable, even though it's possible that another thread is in the middle of reading it!
Hm, what I tried to do was to make mRecursionLevel a real recursion counter like it was originally trying to be (preventing Add/Remove from being called inside some other operation). I did add a comment saying that this didn't enforce threadafety. I was thinking we would do the threadsafety checks separately (bug 1164758, which I forgot to file until you reminded me!) since recursion+threadsafety seemed hard to get right with a single counter. But I'm happy with anything!
AFAICT, emulating an rwlock as per comment 8 will provide both recursion and thread-safety checks. Am I missing something?
(In reply to Nicholas Nethercote [:njn] from comment #11) > AFAICT, emulating an rwlock as per comment 8 will provide both recursion and > thread-safety checks. Am I missing something? No, I think it's a great idea, I just meant that I didn't see how it could be done with a single counter like the old code was pretending to do. You just have to have at another variable to track your state, right?
(Or I guess you could use the high bits)
My current code has this: static const uint32_t kIdle = 0; static const uint32_t kRead1 = 1; static const uint32_t kReadMax = UINT32_MAX - 1; static const uint32_t kWrite = UINT32_MAX; Which gives a tri-state tagged union covering Idle, Read(n) and Write. And there's also a bool for tracking mutability separately.
Comment on attachment 8615823 [details] [diff] [review] (part 1) - Improve PLDHashTable's internal checking Review of attachment 8615823 [details] [diff] [review]: ----------------------------------------------------------------- Apologies for letting this review hang around over the weekend. I am also glad to see that there's a part 2, as part 2 is exactly what my first comment was going to be about. :) I am not completely sure that some of the small timeslices in the below examples make sense. I guess the first example is not nearly so problematic (only triggers intermittent oranges, which we already have with the current code?) as the second one; the second one implies that we can actually break the invariants of the class, which seems like bad news. I didn't exhaustively look for concurrency bugs, and it's entirely possible that my examples below are completely bogus. Please review. Looking in my local threading book (_Programming with POSIX Threads_), it looks like what you want is the ability to wait on any pending readers prior to starting a write operation. But that actually involves locks at the datastructure level, which I don't think is the way we want to go. So I'm not quite sure where to go with this. ::: xpcom/glue/pldhash.h @@ +182,5 @@ > + return *this; > + } > + > + bool IsIdle() const { return mState == kIdle; } > + bool IsRead() const { return kRead1 <= mState && mState <= kReadMax; } It seems possible for an assertion using this function to fail, e.g.: T1: enter StartWriteOp, pass at least the IsIdle assert NB: at this point, mState == kIdle. T2: enter StartReadOp T2: do whatever read thing we were going to do T2: enter EndReadOp, call IsRead() T2: test kRead1 <= mState, which is OK T1: resume StartWriteOp T1: set mState = kWrite T1: maybe go and start doing things, maybe yield T2: test mState <= kReadMax, which fails and asserts Maybe that's the point, though? @@ +208,5 @@ > + > + void EndReadOp() > + { > + MOZ_ASSERT(IsRead()); > + MOZ_ASSERT(mState >= kRead1); This assert seems redundant with the IsRead() assert, and seems vulnerable to a similar sequence as described above: T1: enter StartWriteOp, pass assert T2: enter StartReadOp T2: do whatever read thing we were going to do T2: enter EndReadOp, pass IsRead() assert T1: resume StartWriteOp T1: set mState = kWrite T1: maybe exit StartWriteOp and start writing to the table T2: pass mState >= kRead1 assert T2: decrement mState Oops, we are now in a position where we think we have a lot of readers, but nobody is writing to the table.
Attachment #8615823 - Flags: review?(nfroyd)
Comment on attachment 8615825 [details] [diff] [review] (part 2) - Use RAII for PLDHashTable::mChecker operations Review of attachment 8615825 [details] [diff] [review]: ----------------------------------------------------------------- This part seems fine as far as it goes, but I don't know what changes it might need if part 1 gets modified at all. So just f+'ing this for now.
Attachment #8615825 - Flags: review?(nfroyd) → feedback+
Thinking about this a little more, the scenarios described in comment 19 are actually the whole point of the class? i.e. it's OK if these happen, because they indicate that two threads are using the class incorrectly? The second scenario in comment 19 might still be bad, but it's possible I've misunderstood exactly how these checks are intended to function.
Flags: needinfo?(n.nethercote)
Both of your examples start with this: > T1: enter StartWriteOp, pass assert > T2: enter StartReadOp We'll assert on the second line because you can't start a new read while you have an in-progress write. > Thinking about this a little more, the scenarios described in comment 19 are actually the whole point > of the class? i.e. it's OK if these happen, because they indicate that two threads are using the > class incorrectly? Thread-safety is intended to come from outside the class. E.g. if a PLDHashTable is accessed from multiple threads, that external code should use locking of some kind. And if it doesn't, this class will hopefully detect that. It will also detect if you have re-entrancy problems within a single thread, e.g. you remove an element which triggers a destructor which somehow triggers another table modification. Maybe you're concerned that the state checking and modifications done in each of the Start/End functions isn't atomic? For example, it's conceivable that you could have two concurrent writes (which is bad) and if you got really unlucky they could interleave in a way such that Checker would miss it: > table starts in Idle, Writable state > T1: MOZ_ASSERT(IsIdle() && IsWritable()); // passes > T2: MOZ_ASSERT(IsIdle() && IsWritable()); // passes > T1: mState = kWrite; > T2: mState = kWrite; and then if something similar could happen in EndWriteOp. I could argue that since this is just checking code, it's not the end of the world if it possibly misses some cases. (E.g. the handling of Enumerate() is already imperfect. I want to eventually change the API to distinguish a read-only Enumerate() from one that removes elements, which would fix that hole, but one thing at a time.) But that's not the strongest argument I've ever made. It might be possible to (a) combine mState and mIsWritable into a single value, and (b) use the primitives in mfbt/Atomics.h (e.g. exchange()) to atomically do the check-and-update state operation. Are we getting clearer? :)
Flags: needinfo?(n.nethercote) → needinfo?(nfroyd)
I don't have time to reply to your long, excellent comment, but I wanted to try to clear up this bit: (In reply to Nicholas Nethercote [:njn] from comment #22) > Both of your examples start with this: > > > T1: enter StartWriteOp, pass assert > > T2: enter StartReadOp > > > > We'll assert on the second line because you can't start a new read while you have an in-progress > > write. The examples were intended to show that we can pass through the assert in StartWriteOp, i.e. mState == kIdle, but we can get context-switched before we get around to setting mState to kWrite. And then when we come back to the writing thread, the state is much different than when we passed the assert.
Flags: needinfo?(nfroyd)
> The examples were intended to show that we can pass through the assert in > StartWriteOp, i.e. mState == kIdle, but we can get context-switched before > we get around to setting mState to kWrite. I see, yes, reading comprehension fail on my end. So I think we're on the same page now. FWIW, I think the current code has similar problems with code like this: MOZ_ASSERT(mRecursionLevel == 0); INCREMENT_RECURSION_LEVEL(this); Again, the state checking and updating are not atomic.
Attachment #8562955 - Attachment is obsolete: true
This rearrangement is necessary so that we can treat Iterator as a read op and RemovingIterator as a write op for checking purposes. (That's not easy to do if RemovingIterator is a subclass of Iterator.) This also fixes a couple of small problems with RemovingIterator. - Its move constructor was moving |aOther.mTable| instead of |aOther|. This meant that |aOther| wasn't being zeroed out appropriately. Previously this didn't matter, but with RemovingIterator now being treated as a write op, it was triggering checking failures in ~RemovingIterator because it was as if there were two non-zeroed RemovingIterators running concurrently. - test_pldhash_RemovingIterator() was testing Iterator's move constructor instead of RemovingIterator's move constructor, due to a copy/paste mistake(!)
Attachment #8624396 - Flags: review?(nfroyd)
This version has the careful-with-atomicity changes we discussed, which worked out pretty nicely.
Attachment #8624398 - Flags: review?(nfroyd)
Attachment #8615823 - Attachment is obsolete: true
I think this is barely changed from the earlier version that you f+'d.
Attachment #8624399 - Flags: review?(nfroyd)
Attachment #8615825 - Attachment is obsolete: true
Attachment #8624396 - Flags: review?(nfroyd) → review+
Attachment #8624399 - Flags: review?(nfroyd) → review+
Comment on attachment 8624398 [details] [diff] [review] (part 1) - Improve PLDHashTable's internal checking Review of attachment 8624398 [details] [diff] [review]: ----------------------------------------------------------------- Apologies for the delay. This looks great. ::: xpcom/glue/pldhash.h @@ +203,5 @@ > + // (b) check the old value was reasonable. This ensures we don't have > + // interleaving problems. > + // > + // For |mIsWritable| we don't need to be as careful because it can only in > + // transition in one direction (from writable to non-writable). Nit: trailing whitespace.
Attachment #8624398 - Flags: review?(nfroyd) → review+
> Apologies for the delay. This looks great. It's going to need some more work before landing. I did a try run and got some violations in netwerk/cache2/CacheIndex.cpp where we do an iteration that does both lookups and removes. I think it's valid because its all on one thread and the removes all happen at the end, after the lookups. So I might need to tweak how things work a bit. We weren't getting violations in this code before because these patches strengthen the checking of iteration/enumeration.
This is a simplified version of the previous part 0 -- it no longer does the shared base class, but instead just does the small fixes.
This is a slightly different version of the previous part 1 and 2 (now merged for my sanity) that treats RemovingIter as a read op except at the end when the removal step occurs when it temporarily is treated like a write operation.
Attachment #8624398 - Attachment is obsolete: true
Attachment #8624396 - Attachment is obsolete: true
Attachment #8624399 - Attachment is obsolete: true
Comment on attachment 8629841 [details] [diff] [review] (part 0) - Fix minor problems with RemovingIterator Review of attachment 8629841 [details] [diff] [review]: ----------------------------------------------------------------- froydnj: I'm going to carry over r+ from the previous patches because these versions are only slightly different, but I'll leave them here overnight in case you want to look over them one last time.
Attachment #8629841 - Flags: review+
Attachment #8629842 - Flags: review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: