Closed Bug 276366 Opened 20 years ago Closed 8 years ago

need alloca() support in configure.in

Categories

(Firefox Build System :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: tim.ruehsen, Unassigned)

Details

(Keywords: perf)

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041219 Firefox/1.0 (Debian package 1.0+dfsg.1-1)
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041219 Firefox/1.0 (Debian package 1.0+dfsg.1-1)

We have *many* functions which use malloc() (e.g. in the form of PR_Malloc())
for temporary buffers. malloc() is (very) slow in comparison with alloca(),
implies the possibility of memory leaks, crashes (if freed twice) and memory
fragmentation.

So, why not using alloca() on systems that support it (at least most gcc/glibc
systems)?

Here is a typical (bit stupid) example of malloc usage, found in many routines:
buf = PR_MALLOC(strlen(s1)+strlen(s2)+1);
if (buf) {
  strcpy(buf,s1);
  strcat(buf,s2);
  some_function(buf);
  PR_FREE(buf);
}
or
buf = PR_smprintf("%s%s",s1,s2);
if (buf) {
  some_function(buf);
  PR_FREE(buf);
}

with alloca(), not nicer but faster:
#ifdef HAVE_ALLOCA
buf = alloca(strlen(s1)+strlen(s2)+1);
if (buf) {
  strcpy(buf,s1);
  strcat(buf,s2);
  some_function(buf);
}
#else
... malloc code...
#endif

For this, we need to change the configure script. It is just these two lines to
be inserted into configure.in:
AC_CHECK_HEADERS(alloca.h)
AC_FUNC_ALLOCA



Reproducible: Always
Keywords: perf
of course the other possibility for such code would be
nsCAutoString/nsPrintfCString...
-alloca() is available to C as well 
-construction/destruction of C++ classes is slow (just to mention, but please - 
the discussion is beyond the scope of this report) 
-nsPrintfCString malloc()s as soon as the resulting buffer is bigger than 15 
bytes 
I can't find the class definition and implementation for nsAutoString - please 
give a pointer. 
 
the last thing we need is a way for people to confuse yet another allocator.
let's not.
I was a bit confused yesterday (from C++ sources looking like C sources). We do 
not need alloca() for C++ since constructs like 
  char buf[strlen(s1)+strlen(s2)+1]; 
are allowed by C++ definition. 
 
So, we are just talking about C sources. The above construct is not a problem 
for 'modern' C compilers like gcc, but still it is an extension to the C 
standard. alloca() does the same, but is also available outside the variable 
definition scope. 
 
alloca() is worth using for tuning C programms. It is not meant as 'another 
allocator' for general purpose. But it should be available for developers who 
are working on tuning. 
 
Anyway, we already can use it: 
#ifdef __GNUC__ 
  use alloca() 
#else 
  use malloc()/free() 
#endif 
 
But I would not claim this as save. HAVE_ALLOCA and HAVE_ALLOCA_H would be more 
reliable. So, why not adding 2 lines to configure.in? 
 
 
>  char buf[strlen(s1)+strlen(s2)+1]; 
>are allowed by C++ definition. 

no they are not. C99 allows variable-length arrays, and GNU C++ allows them, but
standard C++ doesn't, and compilers like MSVC reject them.
(see for example
http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/gcc/variable-length.html
- "Variable-length automatic arrays are allowed in ISO C99, and as an extension
GCC accepts them in C89 mode and in C++.")
thx for enlightenment. 
these are good reasons to support alloca() via configure.in. 
 
Just a last word before I go home... 
I made a quick and dirty speed comparison between nsCAutoString and alloca(): 
each 10 million times: 
a) 
void test(char *s1, char *s2) 
{ 
 nsCAutoString as(s1); 
 as+=s2; 
} 
b) 
void test(char *s1, char *s2) 
{ 
 int len1=strlen(s1), len2=strlen(s2); 
 char *buf; 
 
 buf=(char *)PR_Malloc(len1+len2+1); 
 if (buf) { 
  strcpy(buf,s1); 
  strcpy(buf+len1,s2); 
  PR_Free(buf); 
 } 
} 
c) 
void test(char *s1, char *s2) 
{ 
 int len1=strlen(s1), len2=strlen(s2); 
 char *buf; 
 
 buf=(char *)alloca(len1+len2+1); 
 if (buf) { 
  strcpy(buf,s1); 
  strcpy(buf+len1,s2); 
 } 
} 
 
a) 4.10 seconds 
b) 2.25 seconds 
c) 1.32 seconds 
 
 
nsAutoString is sometimes used as a member variable in classes that are intended
to be instantiated on the stack.  Could such use cases be a concern?  Moreover,
right now when nsAutoString overflows to the heap, it allocates a sharable
buffer.  If alloca is used instead of malloc, then I don't think it will be
possible to share the buffer with any arbitrary nsSubstring.
Darin, i am not shure that we are talking about the same thing.
Maybe you can be more precise. What exactly is 'sharable'(I can just guess what
you are talking about, but i am not 100% shure)?
Can you give a code example to demonstrate your point?

Don't think alloca() is ment to be a replacement for malloc()!
My examples should show you in which cases it might worth using it...
Tim:

Take this for example:

  nsAutoString a;

  // make |a|'s buffer contain a copy of "foopy", which happens to fit within
  // |a|'s stack buffer, so no heap allocation is required.
  a.Assign("foopy", 5);

  // make |a|'s buffer contain an allocated copy of the first 1024 elements of 
  // the character array pointed to by |blah|.
  a.Assign(blah, 1024);

  nsString b;
  b = a;

  PRBool sameBuffer = b.get() == a.get();

I'm saying that |sameBuffer| would be true given that whenever our string code
implements copy-on-write buffer sharing.  In other words, whenever it is
necessary to allocate a string buffer from the heap, an additional few bytes are
allocated at the front of the string buffer to store a reference count (and a
few other things).

So, I'm not sure how alloca would be useful for nsAutoString.
alloca's pretty righteous if you just want fast LIFO storage of variable size. 
I know of cases where nsAutoString or nsCAutoString is used today that would
benefit from it, although it's hard to say the perf gain in real apps would be
measurable. Maybe we need an unshared (thread-local) auto-string template type?

/be
Darin: 
I did not talk about using alloca() withing nsAutoString. 
First of all, we should discuss if we allow using alloca() at all. 
If we come to the point where we say 'yes', we can discuss using alloca() 
within nsAutoString with all it's implications within a second bug/request. 
 
I just say this because i am afraid that the original request is getting out of 
sight. 
 
Brendan: 
(real app perf measurement) 
This is a question for later... once we say 'yes' to alloca() we have the 
possibility to optimize code that seems to be 'slow' (e.g. loading the mysql 
documentation from hard disk takes much too long in firefox... and this is 
definitely an issue with string parsing and copying). 
 
(unshared auto-string template) 
this is also an issue that we should discuss after we said 'yes' to alloca(). 
 
After all, you see i am absolutely convinced about using alloca(). I used for 
several year in many threaded applications. Especially in heavily used daemons. 
Not only speed but also to lower (sometimes even avoid) memory fragmentation. 
 
I have lot's of mozilla source code here that needs to be fixed. And once I 
touch it, I would like to make a complete review/rewrite. In several cases I 
would like to introduce alloca() (and besides, some other functions like 
PR_asprintf and PL_strlcpy - but thats another bug report...) 
 
(note that NSPR uses its own configure script, in nsprpub/, which would also
need the ALLOCA checks if you want to use it there)
> a) 4.10 seconds 
> b) 2.25 seconds 
> c) 1.32 seconds 

wow... i wouldn't have expected there to be so much difference between case a
and case b.  perhaps we should investigate that testcase further because it may
reveal a great opportunity to improve performance.

anyways, in response to your previous comment, i understand that this bug is
just about adding configure magic for alloca.  i was just responding to the
claim that alloca might be useful as an allocator for one or more of our string
types.  i'm concerned that that may be difficult to implement given the way the
string classes are used today.

as for alloca alone, it would be nice to have some idea of where it may be used
in mozilla code.  do you have some mozilla profiles that reveal significant
cases that could be optimized with some #ifdef HAVE_ALLOCA fun?

also, you should know that mozilla is mostly a single threaded app.  it is true
that there are some background threads, but they probably account for no more
than 1% of the CPU time spent inside the mozilla process for typical page loads.
Of course, case a was slower.  It should be written like this instead:

  void test(char *s1, char *s2) 
  {
    nsCAutoString as; 
    as = nsDependentCString(s1) + nsDependentCString(s2);
  }

If the size of s1 and s2 combined fits within as's stack buffer (64 bytes), then
I would expect this testcase to perform similarly to case c.  Otherwise, it
would perform like case b.  There is extra overhead in this testcase though
because of the ctor/dtor's for nsCAutoString and the nsDependentCString objects.
 A better testcase would take two nsCString objects as parameters or somehow
mitigate the overhead of those wrapper classes since that overhead would skew
the results.
I will make a test d) at the beginning of next week (than i am back in the
office). Of course the overhead of ctors/dtors is relative big when using small
strings. With larger strings - i assume - the relative difference between a),b)
and c) will be smaller. I try to find some time to make graphs of time consumed
as a f(strlen(s1)+strlen(s2)) for a)-d). That should give us an exact overview
of when using alloca is reasonable.

Happy new year!
Here is the promised graphic.

Test case e) (just to make it complete):
void test_e(char *s1, char *s2)
{
 char buf[256];
 PRUint32 len1, len2;

 len1=PL_strlcpy(buf,s1,sizeof(buf));
 if (len1>=sizeof(buf))
  test_c(s1,s2);
 else {
  len2=PL_strlcpy(buf+len1,s2,sizeof(buf));
  if (len2>=sizeof(buf)-len1)
   test_c(s1,s2);
 }
}
The function PL_strlcpy() has been filed as bug #276845.

Before actually starting the test, it tried to fragment the malloc a bit. Just
to build up a 'real' environment. Here is the code:

static void *mem[1024];
for (it=0;it<1024;it++) mem[it]=0;
for (it=0;it<10000;it++) {
  r=rand()%1024;
  if (mem[r]) { PR_Free(mem); mem[r]=0; }
  else mem[r]=PR_Malloc(rand()%1024);
}

I wonder about getting 4763 lines of error (all lines the same):
free(): invalid pointer 0x804d280!
Any explanation?

But however, I ignored this and made a measurement like this
for (len=0;len<=127;len++) {
    memset(s,'a',len);
    s[len]=0;

    t1=getmicros();
    for (it=0;it<1000000;it++) {
	test_c(s,s);
    }
    printf("%d %lld\n",len*2,(getmicros()-t1)/1000);
}

#include <sys/time.h>
#include <time.h>
long long getmicros()
{
    struct timeval tv;

    gettimeofday(&tv,NULL);
    return tv.tv_sec*1000000+tv.tv_usec;
}

Just one conclusion: c++ ctors/dtors are a big overhead. This is not compiler
dependant but a design decision (or flaw) of the c++ inventor(s). There had
been a good articel about c++ vs. java in a german computer magazin (c't). It
showed and proved that ctors/dtors in java are faster than in c++ (by design).
If we want fast routines we should always think about using (temporary)
classes. Sometimes we should avoid them even if the code does not look
'objectoriented' or 'c++ish' any more.


OK, here is my next question: how long do i have to wait for alloca()? ;-)
Tim: this is very interesting.  can you please provide complete source code for
the various test cases?  thanks!
please do not wonder. I raped 'TestStrings.cpp' from xpcom/tests, so I did not
have to hassle with the makefiles or anything.

I made output for a-e and put this into a gfx tool, which I wrote a while ago.
I can't give you gfx tool, it belongs the company i am working for.
There is something misleading about these tests since test (a) and test (e)
should overlap up until the point where the nsCAutoString in test (a) is forced
to allocate from the heap.  I think that the overhead of the nsCAutoString API
is obscuring the picture here.  Yes, it shows that we should reduce the overhead
of the nsCAutoString API.

A better way to test this would be to modify the code for nsCAutoString and
simply change the allocator.  For test (e) you could change nsCAutoString to
have a 256 byte stack buffer, and for test (c) you could change nsCAutoString's
allocator to use alloca instead of malloc.  Then, you would be doing a fair
comparison.
what options do you compile with?


may I ask which c't edition contains that comparison? this interests me...
I compile 'non-optimized', set in .mozconfig. 
But notice, that I do not claim the results are 'water-proof'. I just wanted to 
show a tendency and make the developers aware of dtor/ctor implications. 
 
Hmmm... after all I just want to use alloca for further patches within C and 
C++ sources. 
 
The articles are in c't 19/03 and 21/03. 
 
--enable-debug build disable inlining. I do not think that such benchmarks can
show usable results...
this is the same with statistics: never trust them if you didn't fake them ;-) 
 
before we go into too many iterations: just make the comparisons with your own 
favorite settings (the source code is attached). 
i assume that the result will differ in quantity but not in quality. 
 
the next iteration will be with optimization on ;-) 
 
Tim: it's pointless to compare allocators in the raw vs. the allocators used by
nsAutoString internally.  either hack nsAutoString to use different allocators
and test that or don't include nsAutoString in the comparisons.  more
importantly: i can see that alloca might be useful in some cases, but what are
those cases?  have you collected a list of places in the mozilla codebase where
you would make use of alloca?
biesi: 
I completely recompiled the mozilla source without debug and made the test 
again, no gfx, i just compared the numbers roughly: 
a) and d) are now slightly slower (???) 
b) ist slightly faster 
c) stays the same 
 
but maybe something is wrong with 'make clean' - sometimes i experienced 
compiling problems (then I had to: 'make clean','./configure','make clean') 
 
Here is how TestStrings compiled: 
 
TestStrings.cpp 
c++ -o TestStrings.o -c  -DOSTYPE=\"Linux2.6\" -DOSARCH=\"Linux\" -I./../ds 
-I./services  -I../../dist/include/string -I../../dist/include/xpcom 
-I../../dist/include -I/usr/oms/src/mozilla/mozilla/dist/include/nspr     
-I/usr/X11R6/include  -fPIC  -I/usr/X11R6/include -fno-rtti -fno-exceptions 
-Wall -Wconversion -Wpointer-arith -Wcast-align -Woverloaded-virtual -Wsynth 
-Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wno-long-long -pedantic 
-fshort-wchar -pthread -pipe  -DNDEBUG -DTRIMMED -O  -I/usr/X11R6/include 
-DMOZILLA_CLIENT -include ../../mozilla-config.h -Wp,-MD,.deps/TestStrings.pp 
TestStrings.cpp 
TestStrings.cpp: In function `int main(int, char**)': 
TestStrings.cpp:697: warning: unused variable `char*buf' 
c++  -I/usr/X11R6/include -fno-rtti -fno-exceptions -Wall -Wconversion 
-Wpointer-arith -Wcast-align -Woverloaded-virtual -Wsynth 
-Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wno-long-long -pedantic 
-fshort-wchar -pthread -pipe  -DNDEBUG -DTRIMMED -O -o TestStrings 
TestStrings.o     -L../../dist/bin -L../../dist/lib -L../../dist/bin -lxpcom 
-lxpcom_core  -L/usr/oms/src/mozilla/mozilla/dist/lib -lplds4 -lplc4 -lnspr4 
-lpthread -ldl   -ldl -lm 
 
Darin, it was never my intention to make nsCAutoString obsolete. The only 
reason why I made the comparisons were comments #1 and #4. 
Maybe the comparison was not useless and helps investigating for a faster 
nsCAutoString (but this is definitely not my job). 
Just one example of needless nsCAutoString usage: 
http://lxr.mozilla.org/mozilla1.7/source/browser/components/migration/src/nsOperaProfileMigrator.cpp#503 
(lines 503-505 could be replaced by *aResult = atoi(val); 
 
The comparison shows that strlen+alloca+strcpy is faster than 
strlen+malloc+strcpy+free. Much faster for short strings. 
 
 
Ok, back to alloca. 
 
http://lxr.mozilla.org/mozilla1.7/source/dbm/src/nsres.c#284 
(either use alloca instead of malloc or use a C99 variable-length array. 
Question: do we still support C89 C compilers, if yes why and which?) 
 
http://lxr.mozilla.org/mozilla1.7/source/embedding/browser/activex/src/control/ActiveScriptSite.cpp#172 
(same here) 
 
http://lxr.mozilla.org/mozilla1.7/source/expat/xmlwf/readfilemap.c#54 
(same here) 
 
http://lxr.mozilla.org/mozilla1.7/source/mailnews/local/src/nsMailboxService.cpp#97 
(same here) 
 
http://lxr.mozilla.org/mozilla1.7/source/mailnews/addrbook/src/nsVCardObj.cpp#588 
(an example of implicit malloc. can be replaced by alloca+strcpy) 
 
http://lxr.mozilla.org/mozilla1.7/source/mailnews/extensions/mdn/src/nsMsgMdnGenerator.cpp#473 
(ughh... nothing to do with alloca but I stumbled over this ****. maybe the 
sourcecode should be completely rewritten) 
 
http://lxr.mozilla.org/mozilla1.7/source/mailnews/local/src/nsPop3Protocol.cpp#2688 
(could be replaced by a static char buffer) 
 
http://lxr.mozilla.org/mozilla1.7/source/mailnews/mime/cthandlers/vcard/mimevcrd.cpp#575 
(good example for alloca. since we have several lines doing memory allocation 
but just one line for freeing) 
 
OK, that list is interesting.  Did you generate that list by inspecting the code
or did you use some profiling tool to determine that those are hot spots worth
fixing?
I manually search for hot spots, which are easy to find (typical misuse of e.g. 
strcpy, strcat, strncpy, strncat, sprintf, malloc for temp buffers etc.). 
When compiling the mozilla project i also see lots of warnings (gcc 3.3), some 
of them might be dangerous like 'use of uninitialized variable'. 
 
Also, did you know that gcc is able to check the format string against the 
argument types for printf like functions? I can not see that this feature is 
used within the mozilla code. I will create a bug with patches for PR_smprintf 
and similar functions. 
 
Version: unspecified → 1.7 Branch
Version: 1.7 Branch → Trunk
This is an automated message, with ID "auto-resolve01".

This bug has had no comments for a long time. Statistically, we have found that
bug reports that have not been confirmed by a second user after three months are
highly unlikely to be the source of a fix to the code.

While your input is very important to us, our resources are limited and so we
are asking for your help in focussing our efforts. If you can still reproduce
this problem in the latest version of the product (see below for how to obtain a
copy) or, for feature requests, if it's not present in the latest version and
you still believe we should implement it, please visit the URL of this bug
(given at the top of this mail) and add a comment to that effect, giving more
reproduction information if you have it.

If it is not a problem any longer, you need take no action. If this bug is not
changed in any way in the next two weeks, it will be automatically resolved.
Thank you for your help in this matter.

The latest beta releases can be obtained from:
Firefox:     http://www.mozilla.org/projects/firefox/
Thunderbird: http://www.mozilla.org/products/thunderbird/releases/1.5beta1.html
Seamonkey:   http://www.mozilla.org/projects/seamonkey/
Status: UNCONFIRMED → NEW
Ever confirmed: true
Product: Mozilla Application Suite → Core
I'm going to guess that memory changes in Gecko in the 12 years since this bug was updated have addressed matters.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: