Bonsai doesn't scale very well

RESOLVED FIXED

Status

P1
critical
18 years ago
2 years ago

People

(Reporter: rkotalampi, Assigned: bear)

Tracking

({crash})

Details

Attachments

(5 attachments, 1 obsolete attachment)

(Reporter)

Description

18 years ago
When the tree is spanked badly and there's a lot of modification it drains a huge 
amount of resources. One mail per file, each of those forking a long chain of 
perl programs and eating a lot of memory.

Currently we have some limitations in sendmail but they don't seem to help 
because in many cases the problem is not the load but the memory all these big 
processes take.

bonsai needs to be rearchitected either on the cvs side or on the database side to 
scale better in situations like this.

Comment 1

18 years ago
we have the same problem on warp, although typically the changes are not quite 
as large.

Comment 2

18 years ago
<rayw> Define "brought the whole system to hork".
<risto> rayw, bugzilla/tinderbox/bonsai crashed.... mysqld crashed... sshd crashed... inetd crashed
Keywords: crash
(Reporter)

Comment 3

18 years ago
Let me just add here that we already have limitations in sendmail that were 
supposed to prevent this happening:

O MaxDaemonChildren=25
O QueueLA=8

But like I wrote, my guess is that it's not a load issue... it's a memory issue: 
the memory these 25 children plus the process chain will eat. Let me see if I can 
find some sar data about today's incident.
(Reporter)

Comment 4

18 years ago
CPU usage:
SunOS lounge.mozilla.org 5.6 Generic_105181-16 sun4u    09/13/00

16:50:00    %usr    %sys    %wio   %idle
16:51:00      53      14       6      28
16:52:00      81      17       1       1
16:53:01      79      18       3       1
16:54:00      75      21       3       0
16:55:01      64      23      12       0
16:56:50      50      20      29       2
16:57:00      52      18      30       0
16:58:01      44      16      39       0

Average       62      19      15       4

CPU load looks ok to me...

---
Memory (these are disk blocks, 512 bytes each):

SunOS lounge.mozilla.org 5.6 Generic_105181-16 sun4u    09/13/00

16:50:00 freemem freeswap
16:51:00   43401  5025628
16:52:00   11643  4359608
16:53:01    4103  3431154
16:54:00    4089  2551139
16:55:01    3203  1760839
16:56:50    2572   875564
16:57:00    2428   260161
16:58:01    3612   204572

Average     9402  2374776

OUCH! In 7 minutes someone allocated about 2G.

---

Unfortunately at 16:58 crond died and we didn't get more data but I'm sure the 
idea can be seen in these. Someone used all the memory and swap in the server.
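
For reference, the "about 2G" estimate can be checked from the freeswap column alone. This is only a back-of-the-envelope Python sketch; it takes the 512-byte block size stated above at face value and uses the first and last samples (16:51:00 and 16:58:01).

```python
# freeswap samples copied from the sar output above, in 512-byte disk blocks
BLOCK_BYTES = 512
freeswap_start = 5025628   # sample at 16:51:00
freeswap_end = 204572      # sample at 16:58:01

blocks_used = freeswap_start - freeswap_end
bytes_used = blocks_used * BLOCK_BYTES
elapsed_s = 7 * 60 + 1     # 16:51:00 -> 16:58:01

gib_used = bytes_used / 2**30
mib_per_s = bytes_used / 2**20 / elapsed_s

print(f"swap consumed: {blocks_used} blocks = {gib_used:.2f} GiB")
print(f"average rate:  {mib_per_s:.1f} MiB/s")
```

That is roughly 2.3 GiB of swap gone in seven minutes, which matches the estimate in the comment.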

Comment 5

18 years ago
it was probably all the 1000+ sendmail and handlecheckin.pl processes.
(Reporter)

Comment 6

18 years ago
Not only that, because sendmail in theory should limit max children to 25. But 
handlecheckin.pl launches another program, which launches more, and more, and 
more....

Comment 7

18 years ago
someone outside cpd landed a massive checkin last night and warp was broken from
5:30-8:30pm due to lack of resources (memory & swap).  Anyone have any thoughts
on how to hack bonsai to throttle handlemail and/or addcheckin?

Comment 8

17 years ago
this is being worked on now along with other enhancements that give more control
over a CVS repository
Status: NEW → ASSIGNED

Updated

16 years ago
QA Contact: matty → timeless

Updated

16 years ago
Priority: P3 → P1
There's a dupe of this bug somewhere...  ("addCheckin uses all available memory"
or something to that effect) I remember seeing it earlier, but I can't go back
to look for it without losing my place in the list.  If I remember when I'm done
with this buglist I'll go back and look.
*** Bug 39328 has been marked as a duplicate of this bug. ***

Comment 12

16 years ago
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity.  Only changing open bugs to
minimize unnecessary spam.  Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.
Severity: major → critical
Somehow I suspect the "this is being worked on" from 2001 never materialized.  Removing "assigned" status.

This is something we need to fix in short order on bonsai.m.o...  we've been getting bit by it again recently.

We just made a change to tinderbox a few weeks back that could (and probably should) be borrowed for this.  See bug 354462 and its dependencies.  The script run by sendmail on the incoming mail just dumps the mail in a directory.  A cron job then comes through and processes the stuff in the directory serially so you don't wind up with 50 processes running at once if a bunch of directories got touched on the same checkin.
Status: ASSIGNED → NEW
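
The spool-then-drain approach described in this comment can be sketched roughly as follows. This is an illustrative Python sketch, not the tinderbox or bonsai code (both are Perl); the function names, the `mail.` file prefix, and the `handle` callback are all hypothetical.

```python
import os
import tempfile


def spool_message(spool_dir, raw_mail):
    """Cheap step, run once per incoming mail: save the bytes and exit."""
    os.makedirs(spool_dir, exist_ok=True)
    # Write under a temporary name, then rename, so the processor never
    # picks up a half-written file (rename is atomic within a filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, prefix="tmp.")
    with os.fdopen(fd, "wb") as f:
        f.write(raw_mail)
    unique = os.path.basename(tmp_path)[len("tmp."):]
    final_path = os.path.join(spool_dir, "mail." + unique)
    os.rename(tmp_path, final_path)
    return final_path


def drain_spool(spool_dir, handle):
    """Expensive step, run from cron: process queued mails one at a time."""
    names = [n for n in os.listdir(spool_dir) if n.startswith("mail.")]
    # Oldest first, so checkins are handled roughly in arrival order.
    names.sort(key=lambda n: (os.stat(os.path.join(spool_dir, n)).st_mtime_ns, n))
    processed = 0
    for name in names:
        path = os.path.join(spool_dir, name)
        with open(path, "rb") as f:
            handle(f.read())
        os.remove(path)  # removed only after the handler has returned
        processed += 1
    return processed
```

The point of the split is that the per-mail step stays tiny no matter how many files a checkin touches, while the heavy work happens in a single process.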

Updated

12 years ago
OS: Solaris → All
QA Contact: timeless → bonsai
(Assignee)

Updated

11 years ago
Assignee: tara → bear
(Assignee)

Comment 14

11 years ago
Created attachment 274046 [details] [diff] [review]
patch to update handleCheckinMail.pl

this updates handleCheckinMail.pl to write the mail files to the data directory
Attachment #274046 - Flags: review?(cls)
(Assignee)

Comment 15

11 years ago
Created attachment 274047 [details]
new file to process all mail files
Attachment #274047 - Flags: review?(cls)

Updated

11 years ago
Attachment #274046 - Flags: review?(cls) → review+

Comment 16

11 years ago
Comment on attachment 274047 [details]
new file to process all mail files

Remove the & before function calls && r=cls
Attachment #274047 - Flags: review?(cls) → review+
(Assignee)

Comment 17

11 years ago
Checking in handleCheckinMail.pl;
/cvsroot/mozilla/webtools/bonsai/handleCheckinMail.pl,v  <--  handleCheckinMail.pl
new revision: 1.8; previous revision: 1.7
done
RCS file: /cvsroot/mozilla/webtools/bonsai/processMail.pl,v
done
Checking in processMail.pl;
/cvsroot/mozilla/webtools/bonsai/processMail.pl,v  <--  processMail.pl
initial revision: 1.1
done
Status: NEW → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → FIXED
Are there instructions somewhere how to deploy this?  I assume it needs a cron job set up and so forth.
(Assignee)

Comment 19

11 years ago
doh! I am committing the following to the INSTALL file (just below where it says how to set up handleCheckinMail.pl)

To process the queued mail from handleCheckinMail.pl, you will need to set up
a cron job to call processMail.pl.  processMail.pl does take an optional
parameter to locate the bonsai data directory, but if it's not present it
will default to the directory where processMail.pl resides.

As the bonsai user, add a cron job to call 'processMail.pl'.
    For example,
        MAILTO="root"
        USER=bonsai
        */5 * * * *     /usr/local/bonsai/processMail.pl

This will cause the bonsai mail to be processed every five minutes and
to mail the root user if any errors occur.
Oh, so this patch actually makes bonsai run setuid now also?  (There is no bonsai user on the production server currently)  That in itself requires additional config changes, does it not?
(Assignee)

Comment 21

11 years ago
yes, if the production server isn't running with a bonsai user then setuid will need to be used.

Let me know if there are any tweaks you need to make production life easier - they are probably changes that bonsai should have in the first place.
So... the Makefile apparently doesn't set the permissions up how they need to be for the setuid stuff to actually work...  everything's still owned by apache or root.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #289670 - Flags: review? → review?(cls)
Comment on attachment 289670 [details] [diff] [review]
Patch to fix Makefile

>+	@echo "Fixing permissions"
>+	@chown -R ${BONSAI_USER} $(PREFIX)
>+	@chgrp -R ${BONSAI_GROUP} $(PREFIX)

Why not |@chown -R ${BONSAI_USER}:${BONSAI_GROUP} $(PREFIX)| instead of doubling the time it takes by doing them separately?
(In reply to comment #24)
> Why not |@chown -R ${BONSAI_USER}:${BONSAI_GROUP} $(PREFIX)| instead of doubling
> the time it takes by doing them separately?

Because the combined version doesn't work on Solaris.

Updated

11 years ago
Attachment #289670 - Flags: review?(cls) → review+
Checking in Makefile.in;
/cvsroot/mozilla/webtools/bonsai/Makefile.in,v  <--  Makefile.in
new revision: 1.19; previous revision: 1.18
done
Status: REOPENED → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → FIXED
The mail handling code here doesn't work.  There's a taint error in handleCheckinMail.pl (Insecure dependency in chdir while running setuid at /etc/smrsh/handlewwwCheckinMail.pl line 28.)

Even after fixing that (detaint $ARGV[0] before using it), I couldn't get any test commits to show up.  I turned on debug mode, and it finds the queue file and deletes it, but the data from the file never makes it to the database.  I ran out of time to debug this on the live system, and couldn't locate anyone familiar with the code, so I reverted back to the previous version of bonsai in production.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: bonsai don't scale very well → Bonsai doesn't scale very well
oh, the other thing I ran into... processMail.pl is not installed by the Makefile.

Comment 29

11 years ago
Sendmail is pretty bad, AFAIK. There are better alternatives:

http://shearer.org/MTA_Comparison
http://www.geocities.com/mailsoftware42/

I'd try Postfix. It's fast and maintained.

For the scripting, maybe Parallel or Stackless Python can help.
(In reply to comment #29)
> Sendmail is pretty bad, AFAIK. There are better alternatives:
[...]
> I'd try Postfix. It's fast and maintained.

I love postfix to death, and we use it on our primary mail relays (where it matters), however, I suspect the choice of MTA in use here has very little relation if any to the problem at hand (which is in the bonsai application's handling of mail, not how it gets passed off to it by the MTA).
Created attachment 320146 [details] [diff] [review]
Fix problems with Makefile and handle*Mail.pl scripts - v1

So, there were a number of problems:
* processMail.pl needed to be chmod'd +x
* if (($#ARGV >= 0) && (-d $ARGV[0])) is wrong, as it's always true
* $ARGV[0] needed to be detainted
* $0 doesn't work anymore for the current script's full path, as it's setuid now

This patch is currently being used on http://bonsai-stage.mozilla.org/, and it seems to be working. Yay!

While my fixes seem to work, there may be better ways to do what I'm doing. Please let me know if there's something better I could do.
Attachment #320146 - Flags: review?(cls)
Attachment #320146 - Flags: review?(bear)
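
As an illustration of the argument handling discussed in this comment (optional directory argument, validated before use, with a fixed fallback), here is a rough Python analogue. Python has no taint mode, so an allow-list pattern stands in for Perl's detainting; the pattern, the function name, and the error messages are assumptions, not what the patch does.

```python
import os
import re

# Allow-list for directory arguments: an absolute path made of plain
# path characters only. The exact pattern is an assumption.
SAFE_DIR = re.compile(r"/[A-Za-z0-9_./-]+")


def resolve_data_dir(argv, default_dir):
    """Return the directory to work in.

    argv[1], when present, must look safe and name an existing directory;
    with no argument, fall back to default_dir.
    """
    if len(argv) < 2:
        return default_dir
    candidate = argv[1]
    if not SAFE_DIR.fullmatch(candidate) or ".." in candidate.split("/"):
        raise ValueError(f"unsafe directory argument: {candidate!r}")
    if not os.path.isdir(candidate):
        raise ValueError(f"not a directory: {candidate!r}")
    return candidate
```

Passing the fallback in explicitly avoids relying on the script's own path, which is the part that stopped working once the script ran setuid.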

Updated

11 years ago
Attachment #320146 - Flags: review?(cls) → review+
Attachment #320146 - Flags: review?(bear)
Checking in Makefile.in;
/cvsroot/mozilla/webtools/bonsai/Makefile.in,v  <--  Makefile.in
new revision: 1.20; previous revision: 1.19
done
Checking in handleAdminMail.pl;
/cvsroot/mozilla/webtools/bonsai/handleAdminMail.pl,v  <--  handleAdminMail.pl
new revision: 1.7; previous revision: 1.6
done
Checking in handleCheckinMail.pl;
/cvsroot/mozilla/webtools/bonsai/handleCheckinMail.pl,v  <--  handleCheckinMail.pl
new revision: 1.9; previous revision: 1.8
done
Status: REOPENED → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → FIXED
Created attachment 321223 [details] [diff] [review]
Clean-up processMail.pl to use @BONSAI_DIR@ - v1

While I was upgrading all the Mozilla bonsai instances, I noticed that processMail.pl had some of the same issues that handle*Mail.pl had, along with a few other things. I took this opportunity to clean it up and use the commit as a test for the scripts. This has already been committed, but I'm requesting post-commit review.
Attachment #321223 - Flags: review?(cls)
Created attachment 321612 [details] [diff] [review]
Clean-up processMail.pl to use @BONSAI_DIR@ and other fixes from ajschult - v2

So, turns out a bug in the original processMail.pl caused bonsai-www and bonsai-l10n not to get updated properly (bug 434489). The issue was that there wasn't a chdir happening, which caused |require| and the |system()| calls not to work correctly, as the call wasn't happening in the right directory. I didn't notice it when testing, as I was mostly testing bonsai.m.o, which worked fine due to bonsai's homedir on dm-webtools02 being /var/www/webtools/bonsai-cvs. I tested bonsai-www and bonsai-l10n, too, but I was manually running processMail.pl under the proper directory, so it worked fine. Oh well. I've patched processMail.pl on dm-webtools02 with ajschult's patch, and everything seems to be working now.

I committed ajschult's modification with my r+, but I still want to have my change and ajschult's change reviewed by bear or cls.

Checking in processMail.pl;
/cvsroot/mozilla/webtools/bonsai/processMail.pl,v  <--  processMail.pl
new revision: 1.4; previous revision: 1.3
done
Attachment #321223 - Attachment is obsolete: true
Attachment #321612 - Flags: review?(cls)
Attachment #321612 - Flags: review?(bear)
Attachment #321223 - Flags: review?(cls)

Comment 35

11 years ago
Comment on attachment 321612 [details] [diff] [review]
Clean-up processMail.pl to use @BONSAI_DIR@ and other fixes from ajschult - v2

r=cls if you add an |or die "";| to the chdir so that it's obvious why the script failed.
Attachment #321612 - Flags: review?(cls) → review+
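
The review request amounts to "change directory first, and fail loudly if that is not possible". A minimal Python sketch of the same idea, assuming the install directory is passed in (in the real script it comes from the @BONSAI_DIR@ substitution); the function name and message wording are hypothetical:

```python
import os
import sys


def enter_bonsai_dir(bonsai_dir):
    """chdir into the install directory, or exit with a clear message."""
    try:
        os.chdir(bonsai_dir)
    except OSError as err:
        # Same spirit as Perl's `chdir(...) or die "...: $!"`: say which
        # directory failed and why, instead of failing later on a
        # relative require or system() call.
        sys.exit(f"cannot chdir to {bonsai_dir}: {err.strerror}")
```

Without the early exit, the first visible symptom is a relative path lookup failing somewhere else, which is how the bonsai-www and bonsai-l10n breakage above went unnoticed.
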
(In reply to comment #35)
> (From update of attachment 321612 [details] [diff] [review])
> r=cls if you add an |or die "";| to the chdir so that it's obvious why the
> script failed.

Checking in processMail.pl;
/cvsroot/mozilla/webtools/bonsai/processMail.pl,v  <--  processMail.pl
new revision: 1.5; previous revision: 1.4
done
Attachment #321612 - Flags: review?(bear)
Product: Webtools → Webtools Graveyard