Closed Bug 121341 Opened 18 years ago Closed 11 years ago

nsPersistentProperties::Load is very inefficient

Categories

(Core :: XPCOM, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla1.9.1a2

People

(Reporter: sfraser_bugs, Assigned: michal)

References

Details

(Keywords: mobile, perf)

Attachments

(3 files, 9 obsolete files)

nsPersistentProperties::Load, which is used for loading string bundles, reads 
from a stream a character at a time, and does thousands of string appends. 4 
calls to it account for 44ms of early startup time. It should be possible to make 
it much faster.
Keywords: perf
OS: Mac System 9.x → All
Hardware: Macintosh → All
I think that either I or darin have a dup.  Also, verify that the Load is
reading from a buffered stream so the "inefficient" nature isn't that bad.
Oh baby! I've been wanting to rewrite this thing for a while, but haven't had a
good reason other than "it sucks" - I'm so glad to hear someone wants it rewritten.

I'm taking this.. dougt/darin - if you find the dupe, reassign it to me. 

One of the WORST parts about the whole thing is that we store the string bundles
in pure unicode, which means they take up a boatload of room. We should be
storing them in their native UTF8 and inflating them on demand.

My plan is to keep the entire text of the string bundle in memory, and use
nsDependentStrings to hold references to each string segment. i.e. if a string
bundle looks like:

foo=bar
baz=floop

then we'd keep the string 'foo=bar\nbaz=floop\n' in memory, and keep
nsDependentStrings (or some relative of them) with direct refs to 'foo', 'bar',
'baz', and 'floop' - then when someone does a
getProperty(NS_LITERAL_STRING("foo"), result); we can look up "foo" in a hash
table, and then inflate the dependent string that points to bar.
Assignee: dougt → alecf
targetting 1.0 for now.. I sure hope I can get to this
Status: NEW → ASSIGNED
Priority: -- → P2
Target Milestone: --- → mozilla1.0
Target Milestone: mozilla1.0 → mozilla1.1
woah, 1.1 came up quick. Throwing these bugs over to 1.2.. little or no work has
been done and there is no way that these bugs can be fixed in 1.1.
Target Milestone: mozilla1.1alpha → mozilla1.2alpha
Depends on: 153754
pushing a few bugs out to 1.2beta - too much stuff on my 1.2alpha plate!
I've already switched us over to using arenas, so we're much more efficient
w.r.t. memory usage.. however the parser needs some love.
Keywords: helpwanted
Target Milestone: mozilla1.2alpha → mozilla1.2beta
These bugs missed the 1.2 train. Moving these out to 1.3alpha
Target Milestone: mozilla1.2beta → mozilla1.3alpha
Target Milestone: mozilla1.3alpha → mozilla1.3beta
I had a sudden brainstorm about this last night - we should be using
nsManifestLineReader here! Its fast, has minimal copying, reuses existing code,
and so forth.
only if jband could see us now. :-)
wow, so the use of Read() for one character is only the beginning of this mess.
I just filed bug 189528 about the real mess. This bug and that bug can be fixed
simultaneously though.
Attached patch rewrite the .properties parser (obsolete) — Splinter Review
ok, here is a very fast parser, that is oriented around segmented buffers. All
state is held in nsParserState. The idea is that once bug 189528 is fixed, then
I can use ReadSegments() to parse the buffer. 

The parser is relatively straight forward. It has some basic states depending
on if it is parsing a comment, a key, or a value, and then the value parser has
some specific states depending on if it is parsing special values like \t or
\u5e4f. I have tested this with a number of different .properties files with
various combinations of escaped characters and so forth.

I've also included some #if 0'ed code in preparation for fixing bug 189528.
Comment on attachment 112107 [details] [diff] [review]
rewrite the .properties parser

doug/darin - can I get reviews on this new parser? Its pretty straight forward
and I've done a fair amount of testing on edge cases.. see the previous comment
for a little elaboration on how the parser works.
Attachment #112107 - Flags: superreview?(darin)
Attachment #112107 - Flags: review?(dougt)
oh, and the member variables mKey and mValue are nsAutoStrings because the whole
nsParserState structure is a stack-based structure.
Keywords: helpwanted
Whiteboard: fix in hand
Comment on attachment 112107 [details] [diff] [review]
rewrite the .properties parser

>Index: nsPersistentProperties.h

>+#if 0
...
>+#endif

nit: please kill the #if 0'd code or add a comment explaining why
it should remain.  same applies to other #if 0'd code.


>Index: nsPersistentProperties.cpp

>+  eParserSpecial_Special,       // awaiting a special character

whoa!  there's got to be a better name for this!


>+      aState.mValue += nsDependentSingleFragmentSubstring(tokenStart, cur);

nit: how about |Substring(tokenStart, cur)| instead?


>+  while (NS_SUCCEEDED(ret)) {
>+    ret = mIn->Read(buf, 0, PROP_BUFFER_SIZE, &nRead);
>+
>+    ParseBuffer(parserState, buf, nRead);
>+    
>+    if (nRead == 0) break;

seems like this check should be placed above ParseBuffer, no?

overall, patch looks really great!  sr=darin modulo these few nits.
Attachment #112107 - Flags: superreview?(darin) → superreview+
Comment on attachment 112107 [details] [diff] [review]
rewrite the .properties parser

what darin said.
Attachment #112107 - Flags: review?(dougt) → review+
fix is in.
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
heads up, bug #190215 might be a regression (or side effect) caused by this.

question for alecf, what about '\"', ascii char 34

I'm hitting this code in ParseValueCharacter() for that:

    default:
      // don't recognize the character, so just append it
      NS_WARNING("Unknown escaped value in properties file");
      aState.mValue += c;
      break;

that does appear in our properties file, should we be handling it (and not 
warning?)
This seems to have caused bug 190283 as well, another bad regression, esp. for
L10n people like me :(
a=dbaron for 1.3b checkin for backing out (per alecf's email request)
backed out. reopening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Target Milestone: mozilla1.3beta → mozilla1.4alpha
mass moving lower risk 1.4alpha stuff to 1.4beta
Target Milestone: mozilla1.4alpha → mozilla1.4beta
Target Milestone: mozilla1.4beta → mozilla1.5alpha
Assignee: alecf → nobody
Status: ASSIGNED → NEW
QA Contact: scc → xpcom
What exactly happened to this patch?  There's still a significant win to be had.
Priority: P2 → --
Target Milestone: mozilla1.5alpha → ---
This should be probably blocker for 1.9.

nsPersistentProperty::Load takes about 1.72% of startup time.
Assignee: nobody → michal
Blocks: 406810
(In reply to comment #25)
> This should be probably blocker for 1.9.

You should request blocking on such bugs.
Flags: blocking1.9?
definitely want this, nice find. not going to block though.
Flags: wanted1.9+
Flags: blocking1.9?
Flags: blocking1.9-
There were quite a lot of regressions caused with this patch. I manually reapplied Alec's patch to the current HEAD and also included old parser as an new component. From simple JS I'm parsing properties files with both and comparing the result. I've found some regressions and I'm fixing them. Now I testing all properties files from CVS tree. I would like to test it on as much as possible files. Where can I get all properties files for other languages?
Attached patch fixed properties files parser (obsolete) — Splinter Review
Basically there were fixed 3 bugs:
- escaping at the end of buffer was broken
- right trimming at the end trimmed also escaped whitespaces
- leading whitespaces weren't removed in multiline properties

I tested it extensively with 142 properties files I found in CVS in variants with unix and dos line ends. I also tried to set lower PROP_BUFFER_SIZE (even 1) to see if it is working properly.
Attachment #112107 - Attachment is obsolete: true
Attachment #306257 - Flags: review?(benjamin)
attached to bug just to have it somewhere...
Status: NEW → ASSIGNED
I also tested how this patch influenced startup time. Number of valgrind's Ir needed to do all nsPersistentProperties::Load() during startup decreased from 16877051 to 5607477. In absolute numbers whole startup took 566ms instead of 573ms on my HW setup.
Comment on attachment 306257 [details] [diff] [review]
fixed properties files parser

There's absolutely no way I'm going to review this without unit tests.
Attachment #306257 - Flags: review?(benjamin)
Michal, thanks for the patch!  It changes a bunch of stuff.  Could you produce a unit test so that we reduce our chances of breaking something accidently?  As you can see a similar change did produce a regression.
Attached patch unit test (obsolete) — Splinter Review
Attached patch fixed patch (obsolete) — Splinter Review
Thanks to unit test I found a bug parsing shortened unicode escaped sequences like "\uAB\n". This patch handles them correctly.
Attachment #306257 - Attachment is obsolete: true
Attachment #307514 - Flags: review?(benjamin)
Comment on attachment 307514 [details] [diff] [review]
fixed patch

>Index: xpcom/ds/nsPersistentProperties.cpp

>+//
>+// parser stuff
>+//
>+enum {
>+    eParserState_AwaitingKey,
>+    eParserState_Key,
>+    eParserState_AwaitingValue,
>+    eParserState_Value,
>+    eParserState_Comment
>+};

Is "mParserState" below always a value from this enum? Is there any reason you can't name the enum and use it directly below, instead of using PRUint32?

>+  nsAutoString mKey;
>+  nsAutoString mValue;

Why are these public?


>+  // if we see a '\' then we enter this special state
>+  PRUint32 mSpecialState;
>+
>+  PRUint32  mUnicodeValuesRead; // should be 4!
>+  PRUnichar mUnicodeValue;      // currently parsed unicode value
>+  PRBool    mHaveMultiLine;     // if this key is multi-line
>+  PRBool    mMultiLineCanSkipN; // TRUE if "\\\r" was detected
>+  PRUint32  mMinLength;         // limit right trimming at the end to not trim
>+                                // escaped whitespaces

Ore these?

>+
>+private:
>+  PRUint32 mState;
>+  nsCOMPtr<nsIPersistentProperties> mProps; // weak

nsCOMPtr is not weak. Because this class is only instantiated on the stack and you know that the properties file is going to stay around, you can just use a raw nsIPersistentProperties*, I think.

>+#define IS_WHITE_SPACE(c) \
>+  (((c) == ' ') || ((c) == '\t') || ((c) == '\r') || ((c) == '\n'))
>+
>+#define IS_EOL(c) \
>+  (((c) == '\r') || ((c) == '\n'))

Can these be inline functions instead of macros, please?

As for the actual parsing: I read through it and it looks generally correct and documented. However, I don't see where errors are handled: that is, what happens if you feed it invalid input. Does it throw it away? If so, can you at least note where it does that and file a followup bug for us to issue console errors in that case?

>Index: xpcom/ds/nsPersistentProperties.h

> private:
>-  ~nsPersistentProperties();
>+  virtual ~nsPersistentProperties();

I don't think that this is necessary or desirable. We don't ever delete a subclass through a base-class pointer, do we?

In general, this looks good. I'd really like more testcase, though, for properties-as-UTF8, malformed data, etc. See bug 397093 for related parsing issues, I think.
Attachment #307514 - Flags: review?(benjamin) → review-
Attached patch patch v3 (obsolete) — Splinter Review
(In reply to comment #36)
> Is "mParserState" below always a value from this enum? Is there any reason you
> can't name the enum and use it directly below, instead of using PRUint32?

fixed

> >+  nsAutoString mKey;
> >+  nsAutoString mValue;
> 
> Why are these public?
> 
> 
> >+  // if we see a '\' then we enter this special state
> >+  PRUint32 mSpecialState;
> >+
> >+  PRUint32  mUnicodeValuesRead; // should be 4!
> >+  PRUnichar mUnicodeValue;      // currently parsed unicode value
> >+  PRBool    mHaveMultiLine;     // if this key is multi-line
> >+  PRBool    mMultiLineCanSkipN; // TRUE if "\\\r" was detected
> >+  PRUint32  mMinLength;         // limit right trimming at the end to not trim
> >+                                // escaped whitespaces
> 
> Ore these?

They were used in ParseBuffer() and ParseValueCharacter() functions. I've moved these functions into the parser class and I've renamed the class to nsPropertiesParser since I think it isn't state-only class anymore.

> >+
> >+private:
> >+  PRUint32 mState;
> >+  nsCOMPtr<nsIPersistentProperties> mProps; // weak
> 
> nsCOMPtr is not weak. Because this class is only instantiated on the stack and
> you know that the properties file is going to stay around, you can just use a
> raw nsIPersistentProperties*, I think.

fixed 

> >+#define IS_WHITE_SPACE(c) \
> >+  (((c) == ' ') || ((c) == '\t') || ((c) == '\r') || ((c) == '\n'))
> >+
> >+#define IS_EOL(c) \
> >+  (((c) == '\r') || ((c) == '\n'))
> 
> Can these be inline functions instead of macros, please?

fixed

> As for the actual parsing: I read through it and it looks generally correct and
> documented. However, I don't see where errors are handled: that is, what
> happens if you feed it invalid input. Does it throw it away? If so, can you at
> least note where it does that and file a followup bug for us to issue console
> errors in that case?

Correct me if I'm wrong, but neither old parser did check for invalid input. The goal was to have 100% compatible parser that is faster than the old one. IMHO there is also no definition of valid input. If you will pass random data to the parser it will parse it...

> >Index: xpcom/ds/nsPersistentProperties.h
> 
> > private:
> >-  ~nsPersistentProperties();
> >+  virtual ~nsPersistentProperties();
> 
> I don't think that this is necessary or desirable. We don't ever delete a
> subclass through a base-class pointer, do we?

fixed

> In general, this looks good. I'd really like more testcase, though, for
> properties-as-UTF8, malformed data, etc. See bug 397093 for related parsing
> issues, I think.

Unfortunately I don't have access to the bug.
Attachment #307514 - Attachment is obsolete: true
Attachment #315138 - Flags: review?(benjamin)
Attached patch unit test v2 (obsolete) — Splinter Review
Added test of valid and invalid UTF-8 sequence
Attachment #307512 - Attachment is obsolete: true
Attachment #315620 - Flags: review?(benjamin)
I've found probably a bug while I've been playing with invalid UTF-8 input. Nothing should be loaded from properties file when there is some invalid input. But it seems that input is buffered by 8kb. When invalid input isn't in the first 8kb then parsing succeeds partially. E.g. when at least 8kb of comments is added between properties utf8Property and latin1Property in http://mxr.mozilla.org/firefox/source/intl/strres/tests/397093.properties then unit test will fail. Since this is problem of old parser as well as new parser I think that new bug should be filed for this. Any opinions?
loading properties files in fennec goes from .75 seconds to .26 seconds w/ this patch + buffering (bug 433098).  Must have.

Great work michal!
Keywords: mobile
Attachment #315620 - Flags: review?(benjamin) → review+
Attachment #315138 - Flags: review?(benjamin) → review+
Attachment #315138 - Flags: superreview?(bzbarsky)
Before I start reading through the details (which is likely to take a few weeks, to be honest) I have two questions:

1)  What determined the choice of buffer size?
2)  Is there a reason you're not using ReadSegments?
(In reply to comment #41)
> Before I start reading through the details (which is likely to take a few
> weeks, to be honest) I have two questions:
> 
> 1)  What determined the choice of buffer size?
> 2)  Is there a reason you're not using ReadSegments?

My patch is derived from #112107 and I didn't change this part of code. Code that use ReadSegments was there but it was commented out.
It seems to me that using ReadSegments would be a good idea: avoid extra copying of the buffer, etc...

If we don't plan to do that, I can look at the patch as-is, but otherwise it doesn't make sense to review this and then review a rewrite to use ReadSegments.

Note that attachment 112107 [details] [diff] [review] didn't use ReadSegments because it didn't work right for jar:, but that should be fixed now. 
(In reply to comment #43)
> It seems to me that using ReadSegments would be a good idea: avoid extra
> copying of the buffer, etc...
> 
> If we don't plan to do that, I can look at the patch as-is, but otherwise it
> doesn't make sense to review this and then review a rewrite to use
> ReadSegments.

OK, don't do the review. I'll rewrite it.

> Note that attachment 112107 [details] [diff] [review] didn't use ReadSegments because it didn't work
> right for jar:, but that should be fixed now. 

You mean that it didn't work parsing properties file located in JAR file?
Using ReadSegments back then wouldn't have worked for parsing properties files in a jar, yeah.  It should work now.
Attached patch unit test v3 (obsolete) — Splinter Review
test at the buffer boundary wasn't actually at the boundary
Attachment #315620 - Attachment is obsolete: true
Attached patch patch v4 (obsolete) — Splinter Review
ReadSegments() is used for reading file.
Attachment #315138 - Attachment is obsolete: true
Attachment #322579 - Flags: superreview?(bzbarsky)
Attachment #315138 - Flags: superreview?(bzbarsky)
Adding this to the 1.9.1 train - we could use help with start time...
Flags: wanted1.9.1+
the ReadSegments() patch fails to build for me on VS 2008

DMOZILLA_SKIN_VERSION=\"1.8\"  -D_MOZILLA_CONFIG_H_ -DMOZILLA_CLIENT /e/mozilla-
build/mozilla/src/xpcom/ds/nsPersistentProperties.cpp
nsPersistentProperties.cpp
e:/mozilla-build/mozilla/src/xpcom/ds/nsPersistentProperties.cpp(525) : error C2
664: 'nsIUnicharInputStream::ReadSegments' : cannot convert parameter 1 from 'ns
result (__cdecl *)(nsIUnicharInputStream *,void *,const PRUnichar *,PRUint32,PRU
int32,PRUint32 *)' to 'nsWriteUnicharSegmentFun'
        None of the functions with this name in scope match the target type
e:/mozilla-build/mozilla/src/xpcom/ds/nsPersistentProperties.cpp(526) : fatal er
ror C1903: unable to recover from previous error(s); stopping compilation
make[6]: *** [nsPersistentProperties.obj] Error 2
make[6]: Leaving directory `/e/mozilla-build/mozilla/src/ff-opt/xpcom/ds'
SegmentWriter should be an NS_METHOD.

I'm going to try to get to this review as soon as I'm done moving and stuff, so in about 3 weeks...  I might be able to do before then, but not sure.
Attached patch patch v5 (obsolete) — Splinter Review
(In reply to comment #49)
> the ReadSegments() patch fails to build for me on VS 2008

I don't have VS 2008. Can you please test this new patch, if it compiles successfully?
(In reply to comment #51)
> I don't have VS 2008. Can you please test this new patch, if it compiles
> successfully?

Yeah, this one compiles without problems.
Attachment #322579 - Attachment is obsolete: true
Attachment #322579 - Flags: superreview?(bzbarsky)
Comment on attachment 325534 [details] [diff] [review]
patch v5

>Index: xpcom/ds/nsPersistentProperties.cpp
>+  static NS_METHOD SegmentWriter(nsIUnicharInputStream* aStream,
>+                                void* aClosure,

Fix the indentation here and for the remaining args to this method?

>+  PRBool ParseValueCharacter(PRUnichar c, const PRUnichar* cur,
>+                             const PRUnichar* &tokenStart, nsAString& oldValue);

Might be nice to document the args (especially the inout ones!).

>+  PRBool    mMultiLineCanSkipN; // TRUE if "\\\r" was detected

Isn't it true just if you hit '\r'?

>+inline PRBool IsWhiteSpace(PRUnichar aChar)
>+  return (((aChar) == ' ') || ((aChar) == '\t') ||
>+          ((aChar) == '\r') || ((aChar) == '\n'));

That's overparenthesized (a legacy of its past as a macro?).  Remove the extra parens around aChar throughout?

>+inline PRBool IsEOL(PRUnichar aChar)

Same here.

>+PRBool nsPropertiesParser::ParseValueCharacter(
>+      // handle multilines - since this is the beginning of a line,

How do we know this is the beginning of a line?

>+      // ignore sequential line endings

You mean ignoring \n after \r?  Or something else?

>+        // if this will be at the buffer, characters from tokenStart to end

I'm not sure what the part before the comma means here.

>+        if (c == ' ' || c == '\t') {
>+          // don't allow another '\n' to be skipped
>+          mMultiLineCanSkipN = PR_FALSE;
>+        // if this will be at the buffer, characters from tokenStart to end
>+        // are appended at the end of ParseBuffer()
>+        tokenStart = cur+1;
>+          break;
>+        }

Please fix the indentation.  And again, what does the part before the comma mean?

>+      }
>+      break;
>+    }
>+    break;

That doesn't make sense to me.  Why have the inner break?

>+NS_METHOD nsPropertiesParser::SegmentWriter(nsIUnicharInputStream* aStream,
>+  parser->ParseBuffer(aFromSegment + aToOffset, aCount);

This is wrong.  The first argument should just be aFromSegment.

>+nsresult nsPropertiesParser::ParseBuffer(const PRUnichar* aBuffer,
>+  // points the start/end of the current key or value

"points to"?

>+        // no value at all! mimic the normal value-ending4

What's with the '4' there?

>+  if (mState == eParserState_Value && tokenStart &&
>+      mSpecialState == eParserSpecial_None)
>+    mValue += Substring(tokenStart, cur);
>+  // if we're still parsing the key, then append whatever we have..
>+  else if (mState == eParserState_Key && tokenStart)
>+    mKey += Substring(tokenStart, cur);

This would be a lot easier to read with curlies around the if/else bodies.

>+  while (NS_SUCCEEDED(mIn->ReadSegments(nsPropertiesParser::SegmentWriter, &parser, 2048, &nProcessed)) &&
>+         nProcessed != 0);

I'd use either 4096 or just whateve the stream claims is available, really (or the max of the two).  The fewer function calls and bouncing around on the stack, the better.

>-  mIn->Close();
>-  NS_RELEASE(mIn);

Shouldn't you still close and null out mIn?  What's the benefit of keeping it open?

sr=bzbarsky with these comments addressed.  I'm sorry for the lag....
Attachment #325534 - Flags: superreview+
Attached patch patch v6Splinter Review
(In reply to comment #53)
> >+  PRBool    mMultiLineCanSkipN; // TRUE if "\\\r" was detected
> 
> Isn't it true just if you hit '\r'?

No, it is set to true when mSpecialState == eParserSpecial_Escaped in following case:

    case '\r':
    case '\n':
      mHaveMultiLine = PR_TRUE;
      mMultiLineCanSkipN = (c == '\r');
      mSpecialState = eParserSpecial_None;
      break;

It is really true only when "\\\r" was found.

> >+NS_METHOD nsPropertiesParser::SegmentWriter(nsIUnicharInputStream* aStream,
> >+  parser->ParseBuffer(aFromSegment + aToOffset, aCount);
> 
> This is wrong.  The first argument should just be aFromSegment.

I don't think it is wrong, but it is unnecessary. aToOffset is always 0 because SegmentWriter() consumes always all available data.

> >+  while (NS_SUCCEEDED(mIn->ReadSegments(nsPropertiesParser::SegmentWriter, &parser, 2048, &nProcessed)) &&
> >+         nProcessed != 0);
> 
> I'd use either 4096 or just whateve the stream claims is available, really (or
> the max of the two).  The fewer function calls and bouncing around on the
> stack, the better.

Method available() isn't present in interface nsIUnicharInputStream, so I've changed it to 4096. I've also added error handling to this code and it fixes problem mentioned in comment #39.
Attachment #325534 - Attachment is obsolete: true
Attached patch unit test v4Splinter Review
test at the buffer boundary was moved to proper offset, because buffer size was increased from 2048 to 4096
Attachment #322578 - Attachment is obsolete: true
Keywords: checkin-needed
> No, it is set to true when mSpecialState == eParserSpecial_Escaped

Ah, ok.  Gotcha.

> Method available() isn't present in interface nsIUnicharInputStream

Indeed.  <sigh>.  Maybe file a bug about that?  It could be useful...
> That doesn't make sense to me.  Why have the inner break?

I finally figured it out; the weird indent confused me...

I'll restore this when I land the patch.
Pushed changeset cc82f602d986.  Michal, thanks for picking this up!
Status: ASSIGNED → RESOLVED
Closed: 17 years ago11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Doesn't seem to have obviously improved Ts, though.  :(
Keywords: checkin-needed
Whiteboard: fix in hand
Target Milestone: --- → mozilla1.9.1a2
I've backed this out as part of the investigation into bug 450401
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Relanded.
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.