Closed Bug 52573 Opened 19 years ago Closed 12 years ago

Bonsai doesn't scale very well

Categories

(Webtools Graveyard :: Bonsai, defect, P1, critical)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rkotalampi, Assigned: bear)

References

Details

(Keywords: crash)

Attachments

(5 files, 1 obsolete file)

When the tree is spanked badly and there are a lot of modifications, it drains a huge
amount of resources. One mail is sent per file, each of those forking a long chain of
Perl programs and eating a lot of memory.

Currently we have some limits in sendmail, but they don't seem to help
because in many cases the problem is not the load but the memory all these big
processes take.

Bonsai needs to be rearchitected, either on the CVS side or on the database side, to
better scale to situations like this.
We have the same problem on warp, although typically the changes are not quite
as large.
<rayw> Define "brought the whole system to hork". <risto> rayw, 
bugzilla/tinderbox/bonsai crashed.... mysqld crashed... sshd crashed... inetd 
crashed
Keywords: crash
Let me just add here that we already have limitations in sendmail that were 
supposed to prevent this happening:

O MaxDaemonChildren=25
O QueueLA=8

But as I wrote, my guess is that it's not a load issue... it's the memory that
these 25 children plus their process chains will eat. Let me see if I can find some
sar data about today's incident.
CPU usage:
SunOS lounge.mozilla.org 5.6 Generic_105181-16 sun4u    09/13/00

16:50:00    %usr    %sys    %wio   %idle
16:51:00      53      14       6      28
16:52:00      81      17       1       1
16:53:01      79      18       3       1
16:54:00      75      21       3       0
16:55:01      64      23      12       0
16:56:50      50      20      29       2
16:57:00      52      18      30       0
16:58:01      44      16      39       0

Average       62      19      15       4

CPU load looks ok to me...

---
Memory (freeswap is in disk blocks, 512 bytes each; freemem is in pages):

SunOS lounge.mozilla.org 5.6 Generic_105181-16 sun4u    09/13/00

16:50:00 freemem freeswap
16:51:00   43401  5025628
16:52:00   11643  4359608
16:53:01    4103  3431154
16:54:00    4089  2551139
16:55:01    3203  1760839
16:56:50    2572   875564
16:57:00    2428   260161
16:58:01    3612   204572

Average     9402  2374776

OUCH! In 7 minutes someone allocated about 2G.
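
As a rough check of that figure, assuming the freeswap column is in 512-byte blocks as noted above the table, the drop between the 16:51 and 16:58 samples can be worked out like this (a Python sketch using only the numbers from the sar output; the variable names are just for illustration):

```python
BLOCK_BYTES = 512  # assumed block size, per the note above the table

freeswap_start = 5025628  # freeswap at 16:51:00
freeswap_end = 204572     # freeswap at 16:58:01

# Swap consumed between the two samples, in blocks, bytes and GiB.
blocks_used = freeswap_start - freeswap_end
bytes_used = blocks_used * BLOCK_BYTES
gib_used = bytes_used / 2**30

print(f"{blocks_used} blocks = {bytes_used} bytes = {gib_used:.2f} GiB")
```

That comes to roughly 2.3 GiB of swap, which is in line with the "about 2G" estimate.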

---

Unfortunately at 16:58 crond died and we didn't get more data but I'm sure the 
idea can be seen in these. Someone used all the memory and swap in the server.

it was probably all the 1000+ sendmail and handlecheckin.pl processes.
Not only that, because sendmail in theory should limit max children to 25. But
handlecheckin.pl launches another program, which launches more, and more, and
more....
someone outside cpd landed a massive checkin last night and warp was broken from
5:30-8:30pm due to lack of resources (memory & swap).  Anyone have any thoughts
on how to hack bonsai to throttle handlemail and/or addcheckin?
This is being worked on now, along with other enhancements that give more control
over the CVS repository.
Status: NEW → ASSIGNED
QA Contact: matty → timeless
Priority: P3 → P1
There's a dupe of this bug somewhere...  ("addCheckin uses all available memory"
or something to that effect) I remember seeing it earlier, but I can't go back
to look for it without losing my place in the list.  If I remember when I'm done
with this buglist I'll go back and look.
*** Bug 39328 has been marked as a duplicate of this bug. ***
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity.  Only changing open bugs to
minimize unnecessary spam.  Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.
Severity: major → critical
Somehow I suspect the "this is being worked on" from 2001 never materialized.  Removing "assigned" status.

This is something we need to fix in short order on bonsai.m.o...  we've been getting bit by it again recently.

We just made a change to tinderbox a few weeks back that could (and probably should) be borrowed for this.  See bug 354462 and its dependencies.  The script run by sendmail on the incoming mail just dumps the mail in a directory.  A cron job then comes through and processes the stuff in the directory serially so you don't wind up with 50 processes running at once if a bunch of directories got touched on the same checkin.
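
A minimal sketch of that spool-and-drain pattern, written in Python rather than the Perl that tinderbox and bonsai actually use. The function names (`enqueue_mail`, `drain_queue`), the `.mail` suffix and the file-naming scheme are hypothetical and only illustrate the idea; the real scripts may work differently.

```python
import os
import tempfile
import time


def enqueue_mail(queue_dir, raw_mail):
    """Run once per incoming mail: only write the message to the spool directory."""
    os.makedirs(queue_dir, exist_ok=True)
    # Write to a hidden temp file first, then rename it, so the drain job
    # never picks up a half-written message.
    fd, tmp_path = tempfile.mkstemp(dir=queue_dir, prefix=".tmp-")
    with os.fdopen(fd, "wb") as f:
        f.write(raw_mail)
    token = os.path.basename(tmp_path)[len(".tmp-"):]
    final_name = f"{time.time_ns():020d}-{token}.mail"
    final_path = os.path.join(queue_dir, final_name)
    os.rename(tmp_path, final_path)
    return final_path


def drain_queue(queue_dir, handle):
    """Run from cron: process queued messages one at a time, in name order."""
    names = sorted(n for n in os.listdir(queue_dir) if n.endswith(".mail"))
    for name in names:
        path = os.path.join(queue_dir, name)
        with open(path, "rb") as f:
            handle(f.read())
        os.remove(path)  # delete only after the handler has returned
    return len(names)
```

The point of the design is that the per-mail step does almost no work, so a checkin touching many directories costs many small file writes instead of many concurrent heavyweight processes; the expensive handling happens in a single process at a time.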
Status: ASSIGNED → NEW
OS: Solaris → All
QA Contact: timeless → bonsai
Assignee: tara → bear
This updates handleCheckinMail.pl to write the mail files to the data directory.
Attachment #274046 - Flags: review?(cls)
Attachment #274047 - Flags: review?(cls)
Attachment #274046 - Flags: review?(cls) → review+
Comment on attachment 274047 [details]
new file to process all mail files

Remove the & before function calls && r=cls
Attachment #274047 - Flags: review?(cls) → review+
Checking in handleCheckinMail.pl;
/cvsroot/mozilla/webtools/bonsai/handleCheckinMail.pl,v  <--  handleCheckinMail.pl
new revision: 1.8; previous revision: 1.7
done
RCS file: /cvsroot/mozilla/webtools/bonsai/processMail.pl,v
done
Checking in processMail.pl;
/cvsroot/mozilla/webtools/bonsai/processMail.pl,v  <--  processMail.pl
initial revision: 1.1
done
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Are there instructions somewhere how to deploy this?  I assume it needs a cron job set up and so forth.
doh! I am committing the following to the INSTALL file (just below where it says how to set up handleCheckinMail.pl)

To process the queued mail from handleCheckinMail.pl, you will need to set up
a cron job to call processMail.pl.  processMail.pl does take an optional
parameter to locate the bonsai data directory, but if it's not present it
will default to the directory where processMail.pl resides.

As the bonsai user, add a cron job to call 'processMail.pl'.
    For example,
        MAILTO="root"
        USER=bonsai
        */5 * * * *     /usr/local/bonsai/processMail.pl

This will cause the bonsai mail to be processed every five minutes and
to mail the root user if any errors occur.
Oh, so this patch actually makes bonsai run setuid now also?  (There is no bonsai user on the production server currently)  That in itself requires additional config changes, does it not?
yes, if the production server isn't running with a bonsai user then setuid will need to be used.

Let me know if there are any tweaks you need to make production life easier - they are probably changes that bonsai should have in the first place.
So... the Makefile apparently doesn't set the permissions up how they need to be for the setuid stuff to actually work...  everything's still owned by apache or root.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #289670 - Flags: review? → review?(cls)
Comment on attachment 289670 [details] [diff] [review]
Patch to fix Makefile

>+	@echo "Fixing permissions"
>+	@chown -R ${BONSAI_USER} $(PREFIX)
>+	@chgrp -R ${BONSAI_GROUP} $(PREFIX)

Why not |@chown -R ${BONSAI_USER}:${BONSAI_GROUP} $(PREFIX)| instead of doubling the time it takes by doing them separately?
(In reply to comment #24)
> Why not |@chown -R ${BONSAI_USER}:${BONSAI_GROUP} $(PREFIX)| instead of doubling
> the time it takes by doing them separately?

Because the combined version doesn't work on Solaris.
Attachment #289670 - Flags: review?(cls) → review+
Checking in Makefile.in;
/cvsroot/mozilla/webtools/bonsai/Makefile.in,v  <--  Makefile.in
new revision: 1.19; previous revision: 1.18
done
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED
The mail handling code here doesn't work.  There's a taint error in handleCheckinMail.pl (Insecure dependency in chdir while running setuid at /etc/smrsh/handlewwwCheckinMail.pl line 28.)

Even after fixing that (detaint $ARGV[0] before using it), I couldn't get any test commits to show up.  I turned on debug mode, and it finds the queue file and deletes it, but the data from the file never makes it to the database.  I ran out of time to debug this on the live system, and couldn't locate anyone familiar with the code, so I reverted back to the previous version of bonsai in production.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: bonsai don't scale very well → Bonsai doesn't scale very well
oh, the other thing I ran into... processMail.pl is not installed by the Makefile.
Sendmail is pretty bad, AFAIK. There are better alternatives:

http://shearer.org/MTA_Comparison
http://www.geocities.com/mailsoftware42/

I'd try Postfix. It's fast and maintained.

For the scripting, maybe Parallel or Stackless Python can help.
(In reply to comment #29)
> Sendmail is pretty bad, AFAIK. There are better alternatives:
[...]
> I'd try Postfix. It's fast and maintained.

I love postfix to death, and we use it on our primary mail relays (where it matters), however, I suspect the choice of MTA in use here has very little relation if any to the problem at hand (which is in the bonsai application's handling of mail, not how it gets passed off to it by the MTA).
So, there were a number of problems:
* processMail.pl needed to be chmod'd +x
* if (($#ARGV >= 0) && (-d $ARGV[0])) is wrong, as it's always true
* $ARGV[0] needed to be detainted
* $0 doesn't work anymore for the current script's full path, as it's setuid now

This patch is currently being used on http://bonsai-stage.mozilla.org/, and it seems to be working. Yay!

While my fixes seem to work, there may be better ways to do what I'm doing. Please let me know if there's something better I could do.
Attachment #320146 - Flags: review?(cls)
Attachment #320146 - Flags: review?(bear)
Attachment #320146 - Flags: review?(cls) → review+
Attachment #320146 - Flags: review?(bear)
Checking in Makefile.in;
/cvsroot/mozilla/webtools/bonsai/Makefile.in,v  <--  Makefile.in
new revision: 1.20; previous revision: 1.19
done
Checking in handleAdminMail.pl;
/cvsroot/mozilla/webtools/bonsai/handleAdminMail.pl,v  <--  handleAdminMail.pl
new revision: 1.7; previous revision: 1.6
done
Checking in handleCheckinMail.pl;
/cvsroot/mozilla/webtools/bonsai/handleCheckinMail.pl,v  <--  handleCheckinMail.pl
new revision: 1.9; previous revision: 1.8
done
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
While I was upgrading all the Mozilla bonsai instances, I noticed that processMail.pl had some of the same issues that handle*Mail.pl had, along with a few other things. I took this opportunity to clean it up and use the commit as a test for the scripts. This has already been committed, but I'm requesting post-commit review.
Attachment #321223 - Flags: review?(cls)
So, turns out a bug in the original processMail.pl caused bonsai-www and bonsai-l10n not to get updated properly (bug 434489). The issue was that there wasn't a chdir happening, which caused |require| and the |system()| calls not to work correctly, as the call wasn't happening in the right directory. I didn't notice it when testing, as I was mostly testing bonsai.m.o, which worked fine due to bonsai's homedir on dm-webtools02 being /var/www/webtools/bonsai-cvs. I tested bonsai-www and bonsai-l10n, too, but I was manually running processMail.pl under the proper directory, so it worked fine. Oh well. I've patched processMail.pl on dm-webtools02 with ajschult's patch, and everything seems to be working now.

I committed ajschult's modification with my r+, but I still want to have my change and ajschult's change reviewed by bear or cls.

Checking in processMail.pl;
/cvsroot/mozilla/webtools/bonsai/processMail.pl,v  <--  processMail.pl
new revision: 1.4; previous revision: 1.3
done
Attachment #321223 - Attachment is obsolete: true
Attachment #321612 - Flags: review?(cls)
Attachment #321612 - Flags: review?(bear)
Attachment #321223 - Flags: review?(cls)
Comment on attachment 321612 [details] [diff] [review]
Clean-up processMail.pl to use @BONSAI_DIR@ and other fixes from ajschult - v2

r=cls if you add an |or die "";| to the chdir so that it's obvious why the script failed.
Attachment #321612 - Flags: review?(cls) → review+
(In reply to comment #35)
> (From update of attachment 321612 [details] [diff] [review])
> r=cls if you add an |or die "";| to the chdir so that it's obvious why the
> script failed.

Checking in processMail.pl;
/cvsroot/mozilla/webtools/bonsai/processMail.pl,v  <--  processMail.pl
new revision: 1.5; previous revision: 1.4
done
Attachment #321612 - Flags: review?(bear)
Product: Webtools → Webtools Graveyard