Closed Bug 54256 Opened 20 years ago Closed 20 years ago

runaway cvs processes on cvs.mozilla.org

Categories

(mozilla.org Graveyard :: Server Operations, task, P3)

Sun
Solaris
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dmose, Assigned: rkotalampi)

Details

10-20 processes per day are spinning and chewing up CPU on thelizard.mozilla.org.
Eventually, the load gets so high that mail stops being processed.  This seems
to have started happening when the MN6 branch was cut last week.  I've killed a
bunch of them yesterday and today, and they show no signs of letting up.

Some preliminary analysis: all the processes I've tried attaching to with gdb
look like this:

#0  0x3e1a0 in translate_symtag (rcs=0xaed60,
    tag=0xc4538 "\tNetscape_20000922_BRANCH") at ../../src/rcs.c:3371
#1  0x3d070 in RCS_gettag (rcs=0xaed60,
    symtag=0xc4538 "\tNetscape_20000922_BRANCH", force_tag_match=1,
    simple_tag=0x0) at ../../src/rcs.c:2486
#2  0x52ef4 in val_fileproc (callerdat=0xeffff868, finfo=0xeffff548)
    at ../../src/tag.c:691
#3  0x4786c in do_file_proc (p=0xc1658, closure=0xeffff540)
    at ../../src/recurse.c:595
#4  0x29bec in walklist (list=0xcfab0, proc=0x477b8 <do_file_proc>,
    closure=0xeffff540) at ../../src/hash.c:370
#5  0x476a8 in do_recursion (frame=0xeffff5e0) at ../../src/recurse.c:517
#6  0x47cf0 in do_dir_proc (p=0x0, closure=0xeffff6e8)
    at ../../src/recurse.c:807
#7  0x29bec in walklist (list=0xcf9e8, proc=0x47890 <do_dir_proc>,
    closure=0xeffff6e8) at ../../src/hash.c:370
#8  0x47764 in do_recursion (frame=0xeffff790) at ../../src/recurse.c:543
#9  0x47084 in start_recursion (fileproc=0x52ed4 <val_fileproc>,
    filesdoneproc=0, direntproc=0x52f14 <val_direntproc>, dirleaveproc=0,
    callerdat=0xeffff868, argc=0, argv=0xcee88, local=0, which=6, aflag=1,
    readlock=1, update_preload=0x0, dosrcs=1) at ../../src/recurse.c:216
#10 0x53188 in tag_check_valid (name=0xc4538 "\tNetscape_20000922_BRANCH",
    argc=0, argv=0xc3cf0, local=0, aflag=1,
    repository=0xae5d0 "/cvsroot/mozilla/build/mac") at ../../src/tag.c:835
#11 0x17dbc in checkout_proc (pargc=0xeffffa24, argv=0xc3cec, where_orig=0x0,
    mwhere=0x0, mfile=0xc3d30 "XPCOM_BASE", shorten=804352, local_specified=0,
    omodule=0xa8178 "mozilla/build/mac", msg=0x7ac00 "Updating")
    at ../../src/checkout.c:1005
#12 0x36bec in do_module (db=0xa7610, mname=0xa8178 "mozilla/build/mac",
    m_type=CHECKOUT, msg=0x7ac00 "Updating",
    callback_proc=0x17450 <checkout_proc>, where=0x0, shorten=0,
    local_specified=0, run_module_prog=0, extra_arg=0x0)
    at ../../src/modules.c:552
#13 0x171fc in checkout (argc=1, argv=0xc44d4) at ../../src/checkout.c:373
#14 0x4d914 in do_cvs_command (cmd_name=0x8a718 "checkout",
    command=0x16934 <checkout>) at ../../src/server.c:2349
#15 0x4eeac in serve_co (arg=0xc0082 "") at ../../src/server.c:3322
#16 0x50a7c in server (argc=662272, argv=0xeffffe98) at ../../src/server.c:4599
#17 0x353fc in main (argc=1, argv=0xeffffe98) at ../../src/main.c:923
(gdb) frame 2
#2  0x52ef4 in val_fileproc (callerdat=0xeffff868, finfo=0xeffff548)
    at ../../src/tag.c:691
../../src/tag.c:691: No such file or directory.
(gdb) print *finfo
$1 = {file = 0xc9d70 "BuildList.pm", update_dir = 0xceeb8 "",
  fullname = 0xc9de8 "BuildList.pm",
  repository = 0xc0820 "/cvsroot/mozilla/build/mac", entries = 0x0,
  rcs = 0xaed60}

cvs always appears to be spinning while looking for the branch tag in that
particular file (which happens to be in the Attic, and in fact does not have the
tag).

Perhaps a newer version of cvs fixes this?
We're looking into this... 
Status: NEW → ASSIGNED
Okay, after intensive detective work I have a theory of what is happening.

Clue #1: I looked at where the requests that hork cvs come from -> they all 
seem to be Macs

Clue #2: I did some snooping. Take a look at this packet:

           0: 0800 2095 e8a1 00b0 c285 aca1 0800 4500    .. ...........E.
          16: 00a2 3801 4000 f806 33bc d00c 24ee cfc8    ..8.@.ø.3...$...
          32: 51d5 c001 0961 3058 0793 cea3 d4d6 5018    Q....a0X......P.
          48: 8000 2f44 0000 4469 7265 6374 6f72 7920    ../D..Directory
          64: 2e0a 2f63 7673 726f 6f74 0a41 7267 756d    ../cvsroot.Argum
          80: 656e 7420 2d41 0a41 7267 756d 656e 7420    ent -A.Argument
          96: 2d72 0a41 7267 756d 656e 7420 094e 6574    -r.Argument .Net
         112: 7363 6170 655f 3230 3030 3039 3232 5f42    scape_20000922_B
         128: 5241 4e43 480a 4172 6775 6d65 6e74 202d    RANCH.Argument -
         144: 6e0a 4172 6775 6d65 6e74 206d 6f7a 696c    n.Argument mozil
         160: 6c61 2f62 7569 6c64 2f6d 6163 0a63 6f0a    la/build/mac.co.

Look for string "-r.Argument .Net". The dot in front of "Net" is 0x09 -> HT.
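
For anyone who wants to double-check the byte, here is a minimal Python sketch that decodes the row at offset 96 of the dump above. The hex string is copied from that row; the variable names are only illustrative.

```python
# Row at offset 96 of the first packet dump, copied as-is.
row = "2d72 0a41 7267 756d 656e 7420 094e 6574"

# bytes.fromhex() skips the spaces between the 16-bit groups.
decoded = bytes.fromhex(row)

print(decoded)               # b'-r\nArgument \tNet'
print(hex(decoded[12]))      # 0x9, the byte right before "Net"
```

The byte right after "Argument " is 0x09 (HT), not 0x20 or a printable character.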

Compare it to this pserver packet (pulled from my home adsl):

           0: 0800 2095 e8a1 00b0 c285 aca1 0800 4500    .. ...........E.
          16: 00ad bcad 4000 3506 ad47 3fc1 79f7 cfc8    ....@.5..G?.y...
          32: 51d5 052e 0961 3166 b667 db40 41f5 8018    Q....a1f.g.@A...
          48: 7d78 fa9f 0000 0101 080a 010d 8167 1bec    }x...........g..
          64: 81f1 4172 6775 6d65 6e74 202d 4e0a 4172    ..Argument -N.Ar
          80: 6775 6d65 6e74 202d 500a 4172 6775 6d65    gument -P.Argume
          96: 6e74 202d 720a 4172 6775 6d65 6e74 204e    nt -r.Argument N
         112: 6574 7363 6170 655f 3230 3030 3039 3232    etscape_20000922
         128: 5f42 5241 4e43 480a 4172 6775 6d65 6e74    _BRANCH.Argument
         144: 206d 6f7a 696c 6c61 2f62 7569 6c64 2f6d     mozilla/build/m
         160: 6163 0a44 6972 6563 746f 7279 202e 0a2f    ac.Directory ../
         176: 6376 7372 6f6f 740a 636f 0a             cvsroot.co.

-> No tabs there.
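
The comparison between the two packets can be sketched in Python as below. This is only an illustration: the two payloads are retyped from the ASCII columns of the dumps above (TCP/IP headers dropped), and suspicious_arguments is a made-up helper name, not anything in cvs.

```python
def suspicious_arguments(payload):
    """Return the values of 'Argument' requests that contain control bytes."""
    flagged = []
    for line in payload.split(b"\n"):
        if not line.startswith(b"Argument "):
            continue
        value = line[len(b"Argument "):]
        # Anything below 0x20 (tab included) or DEL is treated as suspicious.
        if any(byte < 0x20 or byte == 0x7F for byte in value):
            flagged.append(value)
    return flagged


# Payload of the Mac request (first dump).
mac_request = (
    b"Directory .\n/cvsroot\nArgument -A\nArgument -r\n"
    b"Argument \tNetscape_20000922_BRANCH\nArgument -n\n"
    b"Argument mozilla/build/mac\nco\n"
)

# Payload of the request from the home ADSL line (second dump).
home_request = (
    b"Argument -N\nArgument -P\nArgument -r\n"
    b"Argument Netscape_20000922_BRANCH\n"
    b"Argument mozilla/build/mac\nDirectory .\n/cvsroot\nco\n"
)

print(suspicious_arguments(mac_request))   # [b'\tNetscape_20000922_BRANCH']
print(suspicious_arguments(home_request))  # []
```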

I talked about this with smfr, and according to him the Mac build machines do 
explicitly pull mozilla/build/mac before they run anything else. 

So, my theory here is that for some reason the Macs run this command:

cvs co -A -r<tab>Netscape_20000922_BRANCH -n mozilla/build/mac

Apparently the maccvs client doesn't check for this, so it ends up violating 
the pserver protocol.
We need someone to look into these macs and see if the theory is valid. This is 
causing major problems for both internal (cvs.netscape.com) and external 
(cvs.mozilla.org) cvs servers. Cc:ing smfr and leaf.
Btw, I have a script, "cvs-kill.pl", that I run from cron to find these runaway 
processes and kill 'em. But someone should still take a look at those macs 
and find out if the problem is what I've described. Also, the internal cvs 
servers don't run the script.
I'll take a look at the automation and see what I can see.  That would explain 
why the macs are more screwy than usual on the branch.  Thanks for the detective 
work!
granrose fixed the errant tab, so I think this is fixed.
Yes, it looks like all the macs are pulling successfully.  Now the only question 
is: is the server happy again?
We should see that soon. I'm keeping the bug open until we can be sure that 
cvs.mozilla.org or cvs.mcom.com aren't hit anymore.
Nice detective work, Risto!
Thanks Dan! Sniffing is fun.

Things have been calm, closing the bug. I've also posted to gnu.cvs.bug to see 
if this is a known bug and whether it's already fixed, maybe in 1.11.1. Looking 
at the code, I don't see it checking anything other than numeric tags.
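
As an illustration of the kind of check that seems to be missing for symbolic tags, here is a small Python sketch. It is not the cvs code. It assumes the rule given in the CVS manual: tag names start with a letter and contain only letters, digits, "-" and "_". The function name is hypothetical.

```python
import re

# Assumed rule (from the CVS manual): a letter first, then letters,
# digits, '-' or '_'. Nothing else, so no whitespace or control chars.
_SYMBOLIC_TAG = re.compile(r"[A-Za-z][A-Za-z0-9_-]*\Z")


def is_plausible_symbolic_tag(tag):
    """True if tag looks like a legal symbolic tag under the assumed rule."""
    return _SYMBOLIC_TAG.match(tag) is not None


print(is_plausible_symbolic_tag("Netscape_20000922_BRANCH"))    # True
print(is_plausible_symbolic_tag("\tNetscape_20000922_BRANCH"))  # False
```

Numeric revisions such as "1.2" fail this test on purpose, since they would be handled by the separate numeric-tag check.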
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard