Closed Bug 137800 Opened 22 years ago Closed 21 years ago

tinderbox core dumps (y2sun1 Solaris 6, OSF 5.1)

Categories

(NSS :: Test, defect, P1)


Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: sonja.mirtitsch, Assigned: bishakhabanerjee)

Details

The machine had hanging selfserv processes, which I killed a few times. I checked
where they came from and found cores in client and ronlydir on y2sun1
(4/14 22:52 and 4/16 01:29) and on dijkstra (4/16 00:37).
Maybe problems in dbtest.
I saw the failures, and built on axilla, which is also Solaris 2.6.  I have been
running SSL stress tests on it (100,000 connections in each test) and haven't
seen the problem.

You have a script that detects coredumps in the QA, right?  In that case, maybe
it should shut down the tinderbox on that machine, so that the state is frozen. 
Then we can go in and debug the core file.  That may be harder to do than I
realize, though...
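
A minimal sketch of the check suggested above, assuming core files are named `core` or `core.<suffix>` and that the test driver could be taught to look for a stop file before starting another cycle. The function names and the stop-file convention are hypothetical and are not taken from the actual QA scripts.

```python
import os


def find_core_files(root):
    """Return sorted paths of files under root named 'core' or 'core.<suffix>'."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == "core" or name.startswith("core."):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def freeze_if_cores(root, stop_file):
    """If any core file exists under root, write stop_file and return the cores.

    Assumption: the test driver checks for stop_file and refuses to start
    another cycle, so the directory state stays as it was at the crash.
    """
    cores = find_core_files(root)
    if cores:
        with open(stop_file, "w") as fh:
            fh.write("\n".join(cores) + "\n")
    return cores
```

Run against the test results directory after each cycle, this would leave the list of cores in the stop file for whoever comes in to debug.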
Put a stop-on-fail on dijkstra's tinderbox (selfserv core).

I am not certain about the Solaris cores, since they look like shell core dumps.
Some of our lab machines had trouble tonight, so there might be something wrong
with the network.

Neither of the cores had anything to do with the dbtest; that test just copies
the client directory and makes it read-only.
copied
http://cindercone.red.iplanet.com/share/builds/mccrel3/nss/nsstip/tinderbox/tests_results/security/dijkstra-20020417-05.26
to /u/sonmi/tmp

ssl.sh: SSL Stress Test Extended test ===============================
ssl.sh: skipping  Stress SSL2 RC4 128 with MD5 for Extended test
ssl.sh: Stress SSL3 RC4 128 with MD5 ----
selfserv -D -p 8444 -d ../ext_server -n dijkstra.red.iplanet.com \
         -w nss   -i ../tests_pid.315559  &
selfserv started at Wed Apr 17 05:31:01 PDT 2002
tstclnt -p 8444 -h dijkstra -q 
        -d ../ext_client < /usr2/nss_tbx_OSF1-5.1/builds/tinderbox/OSF1-5.1/mozilla/security/nss/tests/ssl/sslreq.txt \
strsclnt -q -p 8444 -d ../ext_client -w nss -c 1000 -C c \
          dijkstra.red.iplanet.com
strsclnt started at Wed Apr 17 05:31:01 PDT 2002
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
/usr2/nss_tbx_OSF1-5.1/builds/tinderbox/OSF1-5.1/mozilla/security/nss/tests/all.sh: 317397 Memory fault - core dumped
strsclnt completed at Wed Apr 17 05:31:19 PDT 2002
/usr2/nss_tbx_OSF1-5.1/builds/tinderbox/OSF1-5.1/mozilla/security/nss/tests/all.sh: 317160 Terminated

OS: Solaris → other
Hardware: Sun → Other
Summary: y2sun1 Solaris 6 tinderbox core dumps → tinderbox core dumps (y2sun1 Solaris 6, OSF 5.1)
Here is the trace from the core:

   7 free(0x3ffc0087f58, 0x100000000, 0x3ffc0087390, 0xfffffffffffffff6, 0x300020219bc) [0x3ff800d3500]
   8 PR_Free(ptr = 0x140219240) ["prmem.c":82, 0x300020219b8]
   9 PR_DestroyLock(lock = 0x140219240) ["ptsynch.c":176, 0x30002035900]
  10 pk11_CleanupFreeLists() ["pkcs11u.c":1789, 0x30002856da8]
  11 NSC_Finalize(pReserved = (nil)) ["pkcs11.c":2409, 0x30002844308]
  12 SECMOD_UnloadModule(mod = 0x14003d420) ["pk11load.c":285, 0x30000849324]
  13 SECMOD_SlotDestroyModule(module = 0x14003d420, fromSlot = 1) ["pk11util.c":653, 0x30000861d00]
  14 PK11_DestroySlot(slot = 0x1400fb000) ["pk11slot.c":483, 0x3000084cba0]
  15 PK11_FreeSlot(slot = 0x1400fb000) ["pk11slot.c":515, 0x3000084cca4]
  16 SECMOD_DestroyModule(module = 0x14003d420) ["pk11util.c":630, 0x30000861bd8]
  17 SECMOD_DestroyModuleListElement(element = 0x14003ae00) ["pk11util.c":669, 0x30000861d94]
  18 SECMOD_DestroyModuleList(list = 0x14003ae00) ["pk11util.c":684, 0x30000861e00]
  19 SECMOD_Shutdown() ["pk11util.c":87, 0x300008608d0]
  20 NSS_Shutdown() ["nssinit.c":462, 0x3000082e9e8]
  21 main(argc = 13, argv = 0x11fffc018) ["strsclnt.c":1147, 0x1200078c0]
Unfortunately, I couldn't get any more information from the core file.  Bob,
does this tell you anything?
I'm not sure about that core file.  Here is another, generated by my own build:

   8 PR_Free(ptr = 0x14019ea80) ["prmem.c":82, 0x300020219b8]
   9 PORT_Free(ptr = 0x14019ea80) ["secport.c":149, 0x30000884050]
  10 SECITEM_FreeItem(zap = 0x14019ea80, freeit = 1) ["secitem.c":227, 0x30000882ef4]
  11 PK11_DestroyContext(context = 0x14015fc80, freeit = 1) ["pk11skey.c":3250, 0x3000085c480]
  12 ssl_ResetSecurityInfo(sec = 0x20000fd5458) ["sslsecur.c":805, 0x3ffbffeaee0]
  13 ssl_DestroySecurityInfo(sec = 0x20000fd5458) ["sslsecur.c":845, 0x3ffbffeb060]
  14 ssl_DestroySocketContents(ss = 0x20000fd5440) ["sslsock.c":354, 0x3ffbffef6b4]
  15 ssl_FreeSocket(ss = 0x14014de00) ["sslsock.c":418, 0x3ffbffef8e8]
  16 ssl_DefClose(ss = 0x14014de00) ["ssldef.c":244, 0x3ffbffe7460]
  17 ssl_SecureClose(ss = 0x14014de00) ["sslsecur.c":906, 0x3ffbffeb314]
  18 ssl_Close(fd = 0x140122c40) ["sslsock.c":1178, 0x3ffbfff17c8]
  19 PR_Close(fd = 0x140122c40) ["priometh.c":131, 0x3000201618c]
> 20 do_connects(a = 0x11fffbed0, b = 0x140115800, connection = 9) ["strsclnt.c":779, 0x1200069e0]


The interesting thing to note is that the SECITEM_FreeItem call in
PK11_DestroyContext immediately follows a call to PK11_FreeSymKey.  Here is some
possibly useful information from context->slot:

(dbx) print *context->slot
struct {
    functionList = 0x300428976d0
    module = 0x14003d420
    needTest = 0
    isPerm = 1
    isHW = 0
    isInternal = 1
    disabled = 0
    reason = PK11_DIS_NONE
    readOnly = 1
    needLogin = 0
    hasRandom = 1
    defRWSession = 0
    isThreadSafe = 1
    flags = 32771
    session = 1
    sessionLock = 0x140111640
    slotID = 1
    defaultFlags = 2684370749
    refCount = 33
    refLock = 0x1400f6d80
    freeListLock = 0x1401117c0
    freeSymKeysHead = 0x1401d1f00
    keyCount = 13


It appears this is a shutdown error, not a stress error.  I've written a script
that runs strsclnt instances until a core file is generated.  I must have been
very lucky to get this core; the script has been running for a while.
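
For anyone wanting to repeat this, here is a rough sketch of such a loop. It is not the script used above; it assumes a POSIX system, where Python's `subprocess` reports a child killed by signal N as return code -N. The function name and the `max_runs` limit are made up for illustration.

```python
import subprocess


def run_until_crash(cmd, max_runs=1000):
    """Run cmd repeatedly until one run is killed by a signal.

    Returns (run_number, signal_number) for the first crashing run,
    or None if no run crashed within max_runs attempts.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        # On POSIX a negative return code -N means "terminated by signal N",
        # e.g. -11 for SIGSEGV ("Memory fault - core dumped" in the log).
        if result.returncode < 0:
            return run, -result.returncode
    return None
```

It would be called with the strsclnt command line from the log, for example `run_until_crash(["strsclnt", "-q", "-p", "8444", "-d", "../ext_client", "-w", "nss", "-c", "1000", "-C", "c", "dijkstra.red.iplanet.com"])`, with selfserv already running. Whether a core file is actually written still depends on the core size limit in the shell that starts it.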
The first core file looks like shutdown, the second is not (we don't free
contexts on shutdown).

Note that both of these crash in free, I suspect that our problem is probably a
double free or a free into the wrong heap.

Since Ian can't reproduce it easily, it's probably a race condition.

bob
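
To make the suspected class of bug concrete, here is a toy model of how a reference-count race can end in a double free. This is a generic illustration only, not NSS code and not a diagnosis of these cores; `ToyHeap`, `Shared` and the release functions are all hypothetical names. The buggy release is written as a generator so that the unlucky schedule can be replayed deterministically.

```python
import threading


class ToyHeap:
    """Stand-in for an allocator that detects a second free of one block."""

    def __init__(self):
        self.live = set()
        self.next_block = 0

    def alloc(self):
        self.next_block += 1
        self.live.add(self.next_block)
        return self.next_block

    def free(self, block):
        if block not in self.live:
            raise RuntimeError("double free of block %d" % block)
        self.live.remove(block)


class Shared:
    """An object held by two owners, freed when the last reference is dropped."""

    def __init__(self, heap):
        self.heap = heap
        self.block = heap.alloc()
        self.refcount = 2
        self.lock = threading.Lock()


def release_unlocked(obj):
    """Buggy release, written as a generator so two calls can be interleaved."""
    obj.refcount -= 1              # step 1: drop our reference
    yield                          # the other owner may run here
    if obj.refcount == 0:          # step 2: re-read the shared count
        obj.heap.free(obj.block)


def release_locked(obj):
    """Decrement and test under one lock, so exactly one caller sees zero."""
    with obj.lock:
        obj.refcount -= 1
        last = obj.refcount == 0
    if last:
        obj.heap.free(obj.block)


def interleave_unlocked(obj):
    """Run two buggy releases with the unlucky schedule; return the error text."""
    a, b = release_unlocked(obj), release_unlocked(obj)
    next(a)                        # A decrements: 2 -> 1
    next(b)                        # B decrements: 1 -> 0
    next(a, None)                  # A sees 0 and frees the block
    try:
        next(b, None)              # B also sees 0 and frees it again
    except RuntimeError as exc:
        return str(exc)
    return None
```

With any other schedule the unlocked version behaves correctly, which would fit a crash that only shows up occasionally under load.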
another core:

   7 malloc(0x3ffc0086e90, 0x1000000a0, 0x3ff800d423c, 0x15c, 0x140148280) [0x3ff800d1ca0]
   8 calloc(0x3ff800d423c, 0x15c, 0x140148280, 0x14003af80, 0x30002021948) [0x3ff800d4238]
   9 PR_Calloc(nelem = 1, elsize = 348) ["prmem.c":64, 0x30002021944]
  10 PORT_ZAlloc(bytes = 348) ["secport.c":137, 0x3000288c330]
  11 SHA1_NewContext() ["sha_fast.c":328, 0x30002869714]
  12 NSC_DigestInit(hSession = 709, pMechanism = 0x20000f45680) ["pkcs11c.c":1062, 0x3000284a0b8]
  13 pk11_context_init(context = 0x14021e980, mech_info = 0x20000f45680) ["pk11skey.c":3364, 0x3000085c898]
  14 pk11_CreateNewContextInSlot(type = 544, slot = 0x1400faa00, operation = 2164260864, symKey = (nil), param = 0x20000f456e8) ["pk11skey.c":3450, 0x3000085cc7c]
  15 PK11_CreateDigestContext(hashAlg = SEC_OID_SHA1) ["pk11skey.c":3558, 0x3000085cfc4]
  16 ssl3_InitState(ss = 0x14014b700) ["ssl3con.c":7641, 0x3ffbffdc538]
  17 ssl3_SendClientHello(ss = 0x14014b700) ["ssl3con.c":2590, 0x3ffbffd1190]
  18 ssl2_BeginClientHandshake(ss = 0x14014b700) ["sslcon.c":3072, 0x3ffbffe5758]
  19 ssl_Do1stHandshake(ss = 0x14014b700) ["sslsecur.c":155, 0x3ffbffe97a4]
  20 ssl_SecureSend(ss = 0x14014b700, buf = 0x120004090 = "GET /abc HTTP/1.0\r\n\r\n", len = 21, flags = 0) ["sslsecur.c":1036, 0x3ffbffeb930]
  21 ssl_SecureWrite(ss = 0x14014b700, buf = 0x120004090 = "GET /abc HTTP/1.0\r\n\r\n", len = 21) ["sslsecur.c":1070, 0x3ffbffeba90]
  22 ssl_Write(fd = 0x14019e0c0, buf = 0x120004090, len = 21) ["sslsock.c":1260, 0x3ffbfff1bf0]
  23 PR_Write(fd = 0x14019e0c0, buf = 0x120004090, amount = 21) ["priometh.c":141, 0x30002016214]
  24 handle_connection(ssl_sock = 0x14019e0c0, connection = 47) ["strsclnt.c":645, 0x120006540]
  25 do_connects(a = 0x11fffbed0, b = 0x140108040, connection = 47) ["strsclnt.c":772, 0x1200069a4]
Assigned the bug to Bishakha.
Assignee: sonja.mirtitsch → bishakhabanerjee
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee
This is a crash, therefore it qualifies as P1, IFF it is still happening,
otherwise it should be resolved WORKSFORME.

Sonja, Bishakha, does this still happen?
Priority: -- → P1
Can't verify if it still happens at Sun; Solaris 6 is no longer supported on
NSS > 3.3.x, and neither is OSF.
Have never seen this on Netscape Tinderboxes. We do run them on OSF/1 and
Solaris 5.8.  Have never run on Solaris 2.6 here.

Resolving "WORKSFORME".
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → WORKSFORME