Closed Bug 35817 Opened 24 years ago Closed 20 years ago

optimize in-mem RDF for small # of props leading to large # of targets

Categories

(Core Graveyard :: RDF, defect, P3)

Tracking

(Not tracked)

Status: RESOLVED FIXED
Target Milestone: Future

People

(Reporter: waterson, Assigned: mozilla)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [nsbeta3-])

Attachments

(2 files, 7 obsolete files)

The current in-memory datasource implementation is not well tuned for large
datasources. It needs the following:

1. an arena-based fixed size allocator from which Assertion objects are allocated.

2. each "source" should have a hashtable of "properties" that lead out of it if
there are more than (say) ten distinct properties.

3. each triple <source, property, target> should be hashed for quick lookup if
there are more than (say) ten distinct targets for a <source, property> pair.

Right now, a lot of the operations degenerate into linear-time lookups, which
kills performance in large datasources.
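For illustration, here is a minimal sketch of the lookup structure that points 2
and 3 describe, using standard C++ containers and made-up names rather than the
actual nsInMemoryDataSource types (point 1, the arena allocator, is orthogonal
and omitted):

// Illustrative sketch only -- not the real nsInMemoryDataSource code.
// It shows the shape of the per-source property table (point 2) and the
// per-<source, property> target hash (point 3).  All names are made up.
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Node = std::string;                   // stands in for nsIRDFNode*

constexpr std::size_t kHashThreshold = 10;  // the "(say) ten" above

struct TargetSet {
    std::vector<Node> list;             // few targets: linear scan is cheap
    std::unordered_set<Node> hash;      // many targets: O(1) membership test
    bool hashed = false;

    void Add(const Node& target) {
        if (!hashed && list.size() >= kHashThreshold) {
            hash.insert(list.begin(), list.end());
            list.clear();
            hashed = true;
        }
        if (hashed) hash.insert(target);
        else        list.push_back(target);
    }
    bool Has(const Node& target) const {
        if (hashed) return hash.count(target) != 0;
        for (const Node& t : list)
            if (t == target) return true;
        return false;
    }
};

struct GraphSketch {
    // source -> (property -> targets); the real proposal would only switch
    // a source over to a property hash past ~10 distinct properties.
    std::unordered_map<Node, std::unordered_map<Node, TargetSet>> sources;

    void Assert(const Node& s, const Node& p, const Node& t) {
        sources[s][p].Add(t);
    }
    bool HasAssertion(const Node& s, const Node& p, const Node& t) const {
        auto si = sources.find(s);
        if (si == sources.end()) return false;
        auto pi = si->second.find(p);
        if (pi == si->second.end()) return false;
        return pi->second.Has(t);
    }
};

The point is simply that HasAssertion() and GetTarget() stop being linear walks
once the hash layers kick in.
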
Right now, the entire list of newsgroups for a server gets put into the 
subscribe datasource, which is an in-memory datasource.

That's a big in-memory datasource for a server like news.mcom.com.

Before waterson spends time on this, we should make sure this work would be 
worth his time.

I also have to make sure I'll continue to use the in-memory datasource, and not 
a custom one.
Status: NEW → ASSIGNED
Target Milestone: --- → M18
use PLHash for this, not nsHashTable.

steal from waterson, as this hurts the subscribe dialog.
Assignee: waterson → sspitzer
Status: ASSIGNED → NEW
notes from irc conversation with waterson:

<waterson> anyway, if you're going to add a hashtable to remember the properties 
in in-memory datasource
<waterson> it shouldn't be too hard
<waterson> have you looked thru that code at all?
<sspitzer> yes, I was debugging through it
<waterson> ok
<sspitzer> when trying to figure out why my HasAssertion() wasn't doing what I 
thought it would.
<sspitzer> I'll get the HasAssertion() riddle figured out
<waterson> so there are two top-level hashtables keyed by the "source"
<sspitzer> and then, when I get time, add the hashtable.
<waterson> and "target" respectively
<sspitzer> ok...
<waterson> when you assert
<waterson> it creates an Assertion object
<waterson> that gets added to two linked lists
<waterson> the "forward" links, which comes from the source-indexed table
<waterson> and
<waterson> the "backwards" links, which comes from the target-indexed table
<sspitzer> (this is good, keep going...)
<waterson> so when doing GetTarget, for example
<waterson> it looks up the source in the source-indexed table
<waterson> and just walks through the list comparing properties until it finds a 
match
<waterson> and returns the target
<waterson> so
<waterson> there are two cases to optimize
<waterson> source with many properties
<waterson> source with few properties, but many targets
<waterson> your case is the latter
<waterson> so you'd want to index on both property and target.
<sspitzer> gotcha.
<sspitzer> any need to make both fast?
<sspitzer> how about other in memory datasources, like bookmarks?
<sspitzer> or, are we "good enough" for them?
<waterson> well, we could make both go fast if we used a tree structure instead 
of hashtables
<sspitzer> isn't there some tree datastructures in xpcom/ds?
<waterson> there are
<waterson> ideally, we'd make both cases go fast
<waterson> because there are some graphs where you do have a lot of properties 
<waterson> on the same source
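For reference, the layout described above looks roughly like this (an
illustrative C++ sketch with made-up names, not the actual code; refcounting,
locking, and cleanup are omitted):

// Sketch of the existing (unoptimized) layout: two top-level hash tables,
// one keyed by source and one by target, whose entries head singly linked
// Assertion chains.
#include <string>
#include <unordered_map>

using Node = std::string;

struct Assertion {
    Node source, property, target;
    Assertion* mNext;      // "forward" chain, hanging off the source
    Assertion* mInvNext;   // "backwards" chain, hanging off the target
};

struct DataSourceSketch {
    std::unordered_map<Node, Assertion*> forwardArcs;  // keyed by source
    std::unordered_map<Node, Assertion*> reverseArcs;  // keyed by target

    void Assert(const Node& s, const Node& p, const Node& t) {
        Assertion* as = new Assertion{s, p, t, nullptr, nullptr};
        as->mNext = forwardArcs[s];          // push onto the source's list
        forwardArcs[s] = as;
        as->mInvNext = reverseArcs[t];       // and onto the target's list
        reverseArcs[t] = as;
    }

    // GetTarget degenerates into a linear walk over all arcs out of the
    // source, which is what hurts once the fan-out gets large.
    const Node* GetTarget(const Node& s, const Node& p) const {
        auto it = forwardArcs.find(s);
        if (it == forwardArcs.end()) return nullptr;
        for (Assertion* as = it->second; as; as = as->mNext)
            if (as->property == p) return &as->target;
        return nullptr;
    }
};
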
Status: NEW → ASSIGNED
David, can you make the call on whether we want to try this for beta3?
Assignee: sspitzer → bienvenu
Status: ASSIGNED → NEW
I'll defer to Chris and/or Alec. Is this still an issue, do you think?
I think rjc is working on this under the guise of another bug.
Assignee: bienvenu → rjc
Nominating for PR3 to get this on the radar that a decision needs to be 
made since there appears to be some sentiment that it's necessary and some 
confusion over whether it's being worked on.
Keywords: nsbeta3, perf
triage team:
nsbeta3+, very important to mail
Whiteboard: [nsbeta3+]
I've been playing around with this bug for a while now and think it's time to
suggest that it's marked "nsbeta3-"... 

I have a 90% rewrite of RDF's nsInMemoryDataSource which I've been playing with
for a while... it uses "smart" arcs (simple arrays which convert over to hash
tables when a size boundary is crossed)... however, it's significantly slower
(about 50%) than the current implementation in most cases (especially for
insertions) and uses more memory.
Whiteboard: [nsbeta3+]
nsbeta3- for hard P3 bug we'll have to live with.
Whiteboard: [nsbeta3-]
This milestone has passed.
Target Milestone: M18 → ---
Hey Chris.  Is this bug still valid?  And if so, who should it be assigned to?
Taking for now, and FUTURE-ing. Will pull it back if I get inspired.
Assignee: rjc → waterson
Keywords: helpwanted
Target Milestone: --- → Future
Status: NEW → ASSIGNED
Chris, mind if I take this bug from you?  <grin>  I've felt inspired lately.
Assignee: waterson → rjc
Status: ASSIGNED → NEW
New interface for RDF's in-memory data source
nsInMemoryDataSource.cpp diffs - basically, for RDF SEQs which pass a size
focal point, add a new assertion into the graph (invisible to the outside
world) which is actually a PLDHash on the associated "properties" arc to all
the values.

Reading in a largish bookmark file (2300 URLs + 100 folders) I see a 20-30%
improvement in graph population.
nsInMemoryDataSource.cpp - entire file for those who don't enjoy applying large
diffs
Waterson/Tingley, care to run with these changes for a bit?

(Note that these changes are primarily for RDF SEQ usage.  A more aggressive hack 
job for really large graphs which don't use RDF SEQs derives from these changes 
without too much of a brain spasm.)
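Very roughly, the idea looks like this: once a source's arc list crosses the
focal point, the flat assertion chain is swapped for a per-source hash keyed by
property, each entry holding that property's (short) chain.  The sketch below is
illustrative C++ with standard containers standing in for PLDHash and made-up
names; it is not the patch itself:

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using Node = std::string;

constexpr std::size_t kSeqListLimit = 8;  // analogous to the patch's focal point

struct Arc { Node property, target; };

struct SourceArcs {
    bool hashed = false;
    std::vector<Arc> chain;                             // plain arc list
    std::unordered_map<Node, std::vector<Arc>> byProp;  // "hidden" hash

    void Add(const Node& p, const Node& t) {
        if (!hashed && chain.size() >= kSeqListLimit) {
            // cross the focal point: index the existing arcs by property
            for (const Arc& a : chain)
                byProp[a.property].push_back(a);
            chain.clear();
            hashed = true;
        }
        if (hashed) byProp[p].push_back(Arc{p, t});
        else        chain.push_back(Arc{p, t});
    }

    // GetTarget no longer walks every arc out of the source; it jumps to
    // the chain for the requested property first.
    const Node* GetTarget(const Node& p) const {
        if (hashed) {
            auto it = byProp.find(p);
            return (it != byProp.end() && !it->second.empty())
                       ? &it->second.front().target : nullptr;
        }
        for (const Arc& a : chain)
            if (a.property == p) return &a.target;
        return nullptr;
    }
};
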
Status: NEW → ASSIGNED
Oopsy daisy.
Attachment #57287 - Attachment is obsolete: true
Oopsy daisy.
Attachment #57289 - Attachment is obsolete: true
Yeah, I'll give it a spin over the weekend.

Is it that much harder to keep the mutation threshold internally, rather than
requiring this extra interface and the external comparison against nextVal? 
Generalizing this will probably require keeping some sort of per-source count
inside the datasource itself...

Actually, as long as the slow duplicate-checking code is in there that walks all
the arcs when Asserting, we could probably get the count for free -- keep track
of how many properties we check, and mutate if we see k of them.
In an earlier version (I have about eight on my desktop at the moment <whee>), I
did have conversion going on entirely internally.

However, you don't get a count for free really... you have to walk over the
*entire* assertion list (which the code doesn't really do unless it happens to
have a complete mismatch) taking certain conditions into account (for example,
you wouldn't want to create a hash{propertyArc} if something was just using lots
of #child arcs to indicate containment, as that's a more pathological case --
the more aggressive "brain spasm" case I alluded to) or maintain internal counts
(which seemed bogus, maintaining consistency just for the focal conversion
point).   Having all the decision logic internal regarding containers also meant
another trip across the graph (via calling back out into
RDFContainerUtils::IsASeq for example) which was another added expense.
All that said, being even more aggressive might end up needing internal counts.
Time will tell.  Truthfully, the extra interface is probably needed anyway...
for example, it would be beneficial to be able to have the XUL template builder
"push" what it knows about containment down through RDF... doing so would allow
the in-memory datasource to better know what to optimize for.
Oops, you're right, no free counts.

I didn't get as much of a chance to test this over the weekend as I had planned,
because I'm a lazy beast.

Couple random things:

- do we really want to pre-allocate mObservers every time?  any idea on the
bloat tradeoff?  The cost of checking |if (mObservers)| seems pretty low, and
we'd still have to check |mObservers->Count() != 0| if we always allocated it.

- if the approach we're taking at this point is to rely on external hinting as
to what a "hot" resource is, there's no reason the API should pretend to be
tailored to containers, as there's nothing in the code (other than the name)
that makes it so.  I'd just change nsIRDFInMemoryDataSource to
nsIRDFOptimizedDataSource (or something) and EnsureFastContainment ->
OptimizeForSource() (or something; see the sketch below), and then go through the code looking for
spots where we can make use of it (certain spots in the XUL template code come
to mind).

- it looks like there's a frequently repeated chunk of code that checks for the
existence of the hash and pulls out either the hashed assertion chain or the
"normal" one, depending on state.  this should probably be factored out to clean
things up...

By the way, what generated these diffs?  I had to convert \r to \n before they
would apply...
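For concreteness, the rename suggested above might look something like the
following as a plain C++ abstract interface (a hypothetical sketch only, not
what the patch actually declares; the patch's real interface is
nsIRDFInMemoryDataSource with EnsureFastContainment):

// Hypothetical shape of the generalized hinting API.
typedef unsigned int nsresult;   // stand-in for the real nsresult typedef
class nsIRDFResource;            // forward declaration is enough here

class nsIRDFOptimizedDataSource {
public:
    virtual ~nsIRDFOptimizedDataSource() {}

    // Hint that aSource is "hot" (lots of arcs lead out of it), so the
    // datasource may build an internal index for it ahead of time.
    virtual nsresult OptimizeForSource(nsIRDFResource* aSource) = 0;
};
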
>  - do we really want to pre-allocate mObservers every time?

I certainly consider it a fine idea.  Bloat should be minimal;  verifying that
would be worthwhile, though.


>  there's no reason the API should pretend to be tailored to containers

Ah, but that's EXACTLY what these changes are.  It (currently) IS specifically
for RDF_SEQ containment and not applicable (as is) for other data sets such as a
node with lots of attributes or other ways of specifying containment.  Again,
#child is an example...  you wouldn't want to turn a linked list of

hash{parent node} -> [#child] -> [target #1]
                 \-> [#child] -> [target #2]
                 \-> [#child] -> [target #3]

into just a linked list again (same thing) off of a secondary hash:

hash{parent node} -> hash{#child} -> [target #1]
                                 \-> [target #2]
                                 \-> [target #3]

which the code currently would do.  This is the pathological case.  What should
really happen is more like:

hash{parent node} -> [#child] -> hash[target nodes]

>  - it looks like there's a frequently repeated chunk of code

True, although the frequency is not huge (my opinion)... a few lines of
duplication.  As you'll notice, the in-memory datasource has other issues like
this as well, such as with enumerating/updating its internal linked lists. 
Cleanup can happen after the fact; the current goal is performance.

>  - By the way, what generated these diffs?  I had to convert \r to \n before they
>  would apply...

Latest beta of MacCVS.  :)  [What?!?!  You don't use a Mac.  <grin>]  Sorry for
the inconvenience.
> The cost of checking |if (mObservers)| seems pretty low, and we'd still have
to check |mObservers->Count() != 0| if we always allocated it.


BTW, I forgot to add that this should be changed as well;  the count of the # of
observers only changes at two points:  AddObserver & RemoveObserver.  Instead of
calling mObservers->Count() all the time, it should only be done at those two
points and cached as a member variable of the in-memory datasource.  :)
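Something along these lines, with std::vector standing in for the real observer
array and illustrative member names:

// Cache the observer count in a member instead of asking the array each time
// (with the real observer array, Count() is a call through the array object,
// hence the cache).
#include <vector>

class Observer;   // stands in for nsIRDFObserver

class InMemoryDataSourceSketch {
    std::vector<Observer*> mObservers;
    int mNumObservers = 0;    // updated only in AddObserver / RemoveObserver

public:
    void AddObserver(Observer* aObserver) {
        mObservers.push_back(aObserver);
        mNumObservers = static_cast<int>(mObservers.size());
    }
    void RemoveObserver(Observer* aObserver) {
        for (auto it = mObservers.begin(); it != mObservers.end(); ++it) {
            if (*it == aObserver) {
                mObservers.erase(it);
                break;
            }
        }
        mNumObservers = static_cast<int>(mObservers.size());
    }
    // Hot paths (Assert, Change, ...) test the cached count instead of
    // calling Count() on every notification.
    bool HasObservers() const { return mNumObservers != 0; }
};
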
Blocks: 53521
Sorry it took me so long to get around to this. My big concern here is that
we're bloating the Assertion object: there tend to be a lot of these. Any way we
could do this without increasing its size? For example, using a |union|?
> My big concern here is that we're bloating the Assertion object:
> there tend to be a lot of these. Any way we could do this without
> increasing its size? For example, using a |union|?


Using a |union| is a fine idea, I was just lazy and didn't. However, there are
still a few extra bytes of bloat due to a new boolean flag which can't be
inside the |union|.

Waterson:  Here's a new diff, care to review?
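The layout under discussion is roughly the following (an illustrative sketch;
the exact fields inside the real patch's |union| differ):

// A "hash entry" assertion reuses the storage of a normal assertion's
// target/next fields for its hash-table pointer, so the struct stays the
// same size; only the flag is extra.  Types and names are illustrative.
struct PLDHashTableStub;    // stands in for PLDHashTable
struct Resource;            // stands in for nsIRDFResource / nsIRDFNode

struct AssertionSketch {
    Resource* mSource;
    Resource* mProperty;

    union {
        struct {
            Resource*        mTarget;
            AssertionSketch* mNext;      // "forward" chain
            AssertionSketch* mInvNext;   // "backwards" chain
        } as;
        struct {
            PLDHashTableStub* mPropertyHash;   // per-source hash of arcs
        } hash;
    } u;

    bool mHashEntry;    // the few extra bytes that can't live in the union
};
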
Attachment #57283 - Attachment is obsolete: true
Attachment #57284 - Attachment is obsolete: true
Attachment #57292 - Attachment is obsolete: true
Attachment #57293 - Attachment is obsolete: true
Oops, shame that it compiles on Mac but not on Linux.  :(
Darn COMPtr inside of class problem.  
Yeah, here's the love.	 ;)
Attachment #61376 - Attachment is obsolete: true
Comment on attachment 61395 [details] [diff] [review]
RDF optimizations with |union| usage, tweaked for Linux

Weird name. Focal reviews on the brain? ;-) (How about RDF_SEQ_MAX_LIST or
RDF_SEQ_LIST_LIMIT?)

>+#define RDF_SEQ_MUTATION_FOCALPOINT   8
>+


Couple places where you left commented-out or #if 0'd code in the patch. Just
remove?

>+/*
>+        if (doomed && (doomed->mFlags & HASH_ENTRY))
>+        {
>+            // rjc: just forward thinking;  if/when we have
>+            // multi-level hash-assertions, we'll have to
>+            // handle this case;  no big deal, though
>+        }
>+*/

And...

>+#if 0
>+    // Implementation methods
>+    Assertion*
>+    GetArcs(PLDHashTable *hash, nsIRDFNode* u) {
>+        PLDHashEntryHdr* hdr = PL_DHashTableOperate(hash, u, PL_DHASH_LOOKUP);
>+        return PL_DHASH_ENTRY_IS_BUSY(hdr)
>+            ? NS_REINTERPRET_CAST(Entry*, hdr)->mAssertions
>+            : nsnull; }
>+
>+    void
>+    SetArcs(PLDHashTable *hash, nsIRDFNode* u, Assertion* as) {
>+        PLDHashEntryHdr* hdr = PL_DHashTableOperate(hash, u,
>+                                                    as ? PL_DHASH_ADD : PL_DHASH_REMOVE);
>+        if (as && hdr) {
>+            Entry* entry = NS_REINTERPRET_CAST(Entry*, hdr);
>+            entry->mNode = u;
>+            entry->mAssertions = as;
>+        } }
>+#endif

And...

>+//          PL_DHashTableOperate(root->mForwardHash,
>+//                               aProperty, PL_DHASH_REMOVE);
>+            PL_DHashTableRawRemove(root->u.hash.mForwardHash, hdr);
>+
Attachment #61395 - Flags: superreview+
Sure, "RDF_SEQ_LIST_LIMIT" it is.  I'll also remove the commented out chunks.  :)
Comment on attachment 61395 [details] [diff] [review]
RDF optimizations with |union| usage, tweaked for Linux

r=tingley with a bunch of nits.

-  2) If/when RDF containers become commonplace, consider implementing
-     a special case for them to improve access time to individual
-     elements.

Rather than removing this outright, I'd replace it with a comment about the
other case... something like
  2) Optimize lookups for datasources which have a small number of properties +
     fanning out to a large number of targets.

+		    PLDHashTable*   mForwardHash; 
maybe it's just me, but I have trouble keeping mForwardHash and mForwardArcs
straight when reading this code.  mForwardHash is mapping from a source to a
set of properties; would mProperties, mPropertyHash, or even mForwardProps be
more descriptive?

The same issue with the assertion-level |DeleteForwardArcsEntry()| routine
(which at the very least should match the name of the hash table it's
clearing entries for).

+		    Assertion*	    mInvNext;

I get the willies when I see code that blindly accesses a union without
regards to the type -- as is the case with the LockedAssert() code that
sets this field.  Admittedly, it should be impossible for an assertion
that's using |u.mForwardHash| to end up having its mInvNext set (I hope),
but it still makes me uneasy.  I dunno, pull it out of the union, add
|NS_ASSERTION(!mHashEntry, "ack!");|, whatever; anything to help me sleep
better :)  Perhaps I'm overreacting.

+    nsIRDFService		*mRDFService;

Can we use a single static here instead of adding 4 bytes per datasource?

+    nsresult rv = rv = nsServiceManager::GetService(kRDFServiceCID,

typo.

-		<SETTING><NAME>MWFTP_Post_password</NAME><VALUE>0</VALUE></SETTING>
+		<SETTING><NAME>MWFTP_Post_password</NAME><VALUE>265rt2cn65mdaci^Qp'^Z&#140;^E^E@&#204;&#191;&#255;&#216;&#240;</VALUE></SETTING>

no wonder no one understands the mac build system :)

+	     PL_DHashTableEnumerate(aAssertion->u.hash.mForwardHash,
+		 DeleteForwardArcsEntry, &aAllocator);

indentation.

* the union declaration and several other spots in the code (especially the
  NS_ADDREF calls) are using hard tabs; nothing else in the file does.

* match the bracing style in the patch to nsInMemoryDataSource.cpp to the rest of the file.

Was there ever any quantification on the bloat tradeoff  (now extra 4 bytes for
mNumObservers + allocating mObservers every time)?  Hopefully the number of
datasources is small enough that saving the extra function call/pointer check
is worth it. 

We'll need to open a bug eventually for optimizing the other case as well
(one/few properties --> many targets).
Attachment #61395 - Flags: review+
Added suggested comment at top o' file, cleaned up indentation (my editor is 
messing with me) and formatting a bit, renamed mForwardHash to be mPropertyHash 
(although once the next set of RDF optimizations is written, that name too will 
be incorrect;  however, it's fine for this round) and renamed 
DeleteForwardArcsEntry to be DeletePropertyHashEntry, etc.



>  nsIRDFService              *mRDFService;
>  Can we use a single static here instead of adding 4 bytes per datasource?

Please open another bug for that, and assign it to me, Chris, or yourself.  :)



> We'll need to open a bug eventually for optimizing the other case as well
> (one/few properties --> many targets).

Either that, or we can just keep this bug open.
(I was planning on the latter.)
Summary: optimize access time for InMemoryDataSource::HasAssertion → optimize in-mem RDF for small # of props leading to large # of targets
>> nsIRDFService              *mRDFService;
>> Can we use a single static here instead of adding 4 bytes per datasource?
>
>Please open another bug for that, and assign it to me, Chris, or yourself.  :)

Why wait?

"How hard can it be?" - Homer J. Simpson

/be


Brendan:  Why wait?
rjc:      Why now?
Because it's easy, later means never too often, and we are getting very serious
about < 0 tolerance for bloat regressions.  Why did you make it a member,
anyway?  That effort was nearly the same as making it a static, it seems to me.
 We are already too far down the slippery slope.

/be
Brendan:  Since you are around and have free time this Sunday evening, here 'ya
go then: review this and I'll check it in to get rid of the unused usage.
r=mcafee for unused RDF service patch
Comment on attachment 63781 [details] [diff] [review]
Remove unused RDF service reference

Cool -- I didn't realize it was unused.  sr=brendan@mozilla.org.  Recording
mcafee's r= too.

/be
Attachment #63781 - Flags: superreview+
Attachment #63781 - Flags: review+
Indeed, the RDF service reference was just a vestigial remainder of earlier 
hacking on the in-memory datasource which slipped through the cracks.  Thanks for 
the review(s), everybody.
tever is not RDF QA anymore
QA Contact: tever → nobody
Seems like someone forgot to actually close this bug; the last two diffs seem to 
be in the trunk.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Keywords: helpwanted
Resolution: --- → FIXED
Product: Core → Core Graveyard