Bug 508081 (Open), opened 12 years ago, updated 5 years ago

Sqlite shared database continues to malfunction after we lost and regained access to the database

Categories: NSS :: Libraries, defect
Version: 3.12
Platform: All / Linux
Priority: Not set
Severity: normal
Tracking: (Not tracked)
People: (Reporter: wtc, Assigned: KaiE)
Attachments: (4 files, 1 obsolete file)

This bug was first reported in Chromium issue 15630 (http://crbug.com/15630).

Suppose the file system where the NSS sqlite shared database resides goes
away and comes back while NSS is already initialized.  The softoken returns
CKR_DEVICE_ERROR (0x48) when we try to search for certificates because some
sqlite function fails with SQLITE_IOERR.  The stack trace is:

Breakpoint 3, sdb_mapSQLError (type=SDB_CERT, sqlerr=10) at sdb.c:349
349		return CKR_DEVICE_ERROR;
(gdb) where
#0  sdb_mapSQLError (type=SDB_CERT, sqlerr=10) at sdb.c:349
#1  0xf1bbdcdd in sdb_FindObjects (sdb=0xf1287220, sdbFind=0xf56619c8, 
    object=0xf5694a18, arraySize=5, count=0xf57608e0) at sdb.c:781
#2  0xf1bc23b3 in sftkdb_FindObjects (handle=0xf1266d08, find=0xf56619c8, 
    ids=0xf5694a18, arraySize=5, count=0xf57608e0) at sftkdb.c:1241
#3  0xf1ba9963 in sftk_searchDatabase (handle=0xf1266d08, search=0xf1267be0, 
    pTemplate=0xf5760a84, ulCount=3) at pkcs11.c:4144
#4  0xf1ba9ccf in sftk_searchTokenList (slot=0xf1276f60, search=0xf1267be0, 
    pTemplate=0xf5760a84, ulCount=3, tokenOnly=0xf5760968, isLoggedIn=1)
    at pkcs11.c:4265
#5  0xf1ba9f0e in NSC_FindObjectsInit (hSession=16777217, 
    pTemplate=0xf5760a84, ulCount=3) at pkcs11.c:4317
#6  0xf77861ff in find_objects (tok=0xf12a45e0, sessionOpt=0xf12a1db8, 
    obj_template=0xf5760a84, otsize=3, maximumOpt=0, statusOpt=0xf5760aec)
    at devtoken.c:334
#7  0xf7786599 in find_objects_by_template (token=0xf12a45e0, 
    sessionOpt=0xf12a1db8, obj_template=0xf5760a84, otsize=3, maximumOpt=0, 
    statusOpt=0xf5760aec) at devtoken.c:463
#8  0xf7786e59 in nssToken_FindCertificatesBySubject (token=0xf12a45e0, 
    sessionOpt=0xf12a1db8, subject=0xf5760b80, 
    searchType=nssTokenSearchType_TokenOnly, maximumOpt=0, 
    statusOpt=0xf5760aec) at devtoken.c:657
#9  0xf777d95d in nssTrustDomain_FindCertificatesBySubject (td=0xf12a1cd0, 
    subject=0xf5760b80, rvOpt=0x0, maximumOpt=0, arenaOpt=0x0)
    at trustdomain.c:646
#10 0xf777daa7 in NSSTrustDomain_FindCertificatesBySubject (td=0xf12a1cd0, 
    subject=0xf5760b80, rvOpt=0x0, maximumOpt=0, arenaOpt=0x0)
    at trustdomain.c:702
#11 0xf77760f2 in CERT_CreateSubjectCertList (certList=0x0, handle=0xf12a1cd0, 
    name=0xf12b02f8, sorttime=1249323300718664, validOnly=1)
    at stanpcertdb.c:691
#12 0xf782fed7 in pkix_pl_Pk11CertStore_CertQuery (params=0xf126a8b0, 
    pSelected=0xf5760cc0, plContext=0xf19068a8) at pkix_pl_pk11certstore.c:238
#13 0xf78310f7 in pkix_pl_Pk11CertStore_GetCert (store=0xf564b330, 
    selector=0xf564f840, parentVerifyNode=0xf1247b08, pNBIOContext=0xf5760d2c, 
    pCertList=0xf5760d34, plContext=0xf19068a8) at pkix_pl_pk11certstore.c:618
#14 0xf77c72e0 in pkix_Build_GatherCerts (state=0xf190d248, 
    certSelParams=0xf126a8b0, pNBIOContext=0xf5760df0, plContext=0xf19068a8)
    at pkix_build.c:1807
#15 0xf77c87ed in pkix_BuildForwardDepthFirstSearch (pNBIOContext=0xf5760f5c, 
    state=0xf190d248, pValResult=0xf5760f54, plContext=0xf19068a8)
    at pkix_build.c:2343
#16 0xf77cd4df in pkix_Build_InitiateBuildChain (procParams=0xf126bcd8, 
    pNBIOContext=0xf5761010, pState=0xf5761018, pBuildResult=0xf5761014, 
    pVerifyNode=0xf5761078, plContext=0xf19068a8) at pkix_build.c:3551
#17 0xf77ce1d6 in PKIX_BuildChain (procParams=0xf126bcd8, 
    pNBIOContext=0xf576108c, pState=0xf5761088, pBuildResult=0xf5761090, 
    pVerifyNode=0xf5761078, plContext=0xf19068a8) at pkix_build.c:3719
#18 0xf7728fea in CERT_PKIXVerifyCert (cert=0xbd07188, usages=2, 
    paramsIn=0xf57611d8, paramsOut=0xf5761190, wincx=0x0) at certvfypkix.c:2148
[Rest of stack trace omitted for brevity]

The sqlite function that failed is the sqlite3_step call in sdb_FindObjects.
That sqlite3_step call returns SQLITE_IOERR (10 = 0x0A).  That error seems
to come from this failure:

#0  unixFileSize (id=0xf1277d60, pSize=0xf5781010)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/os_unix.c:1110
#1  0x09465419 in sqlite3OsFileSize (id=0xf1277d60, pSize=0xf5781010)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/os.c:80
#2  0x0946a162 in sqlite3PagerPagecount (pPager=0xf1277c78, pnPage=0xf5781058)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/pager.c:2548
#3  0x0946b66b in sqlite3PagerAcquire2 (pPager=0xf1277c78, pgno=1, 
    ppPage=0xf57810dc, noContent=0, pDataToFill=0x0)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/pager.c:3848
#4  0x0946b537 in pagerAcquire (pPager=0xf1277c78, pgno=1, ppPage=0xf57810dc, 
    noContent=0)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/pager.c:3780
#5  0x0946b889 in sqlite3PagerAcquire (pPager=0xf1277c78, pgno=1, 
    ppPage=0xf57810dc, noContent=0)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/pager.c:3918
#6  0x094a5cf6 in sqlite3BtreeGetPage (pBt=0xf12777e8, pgno=1, 
    ppPage=0xf5781120, noContent=0)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/btree.c:1082
#7  0x094a6ad1 in lockBtree (pBt=0xf12777e8)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/btree.c:1715
#8  0x094a7097 in sqlite3BtreeBeginTrans (p=0xf12777c0, wrflag=0)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/btree.c:1983
#9  0x094ba1d3 in sqlite3VdbeExec (p=0xf091cce8)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/vdbe.c:2479
#10 0x0947f293 in sqlite3Step (p=0xf091cce8)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/vdbeapi.c:476
#11 0x0947f4d9 in sqlite3_step (pStmt=0xf091cce8)
    at /usr/local/google/home/wtc/chrome1/src/third_party/sqlite/src/vdbeapi.c:540
#12 0xf1bbdc4f in sdb_FindObjects (sdb=0xf1287220, sdbFind=0xf091aea0, 
    object=0xf091ade0, arraySize=5, count=0xf57818e0) at sdb.c:762
#13 0xf1bc23b3 in sftkdb_FindObjects (handle=0xf1266d08, find=0xf091aea0, 
    ids=0xf091ade0, arraySize=5, count=0xf57818e0) at sftkdb.c:1241
#14 0xf1ba9963 in sftk_searchDatabase (handle=0xf1266d08, search=0xf091b250, 
    pTemplate=0xf5781a84, ulCount=3) at pkcs11.c:4144
#15 0xf1ba9ccf in sftk_searchTokenList (slot=0xf1276f60, search=0xf091b250, 
    pTemplate=0xf5781a84, ulCount=3, tokenOnly=0xf5781968, isLoggedIn=1)
    at pkcs11.c:4265
#16 0xf1ba9f0e in NSC_FindObjectsInit (hSession=16777217, 
    pTemplate=0xf5781a84, ulCount=3) at pkcs11.c:4317
#17 0xf77861ff in find_objects (tok=0xf12a45e0, sessionOpt=0xf12a1db8, 
    obj_template=0xf5781a84, otsize=3, maximumOpt=0, statusOpt=0xf5781aec)
    at devtoken.c:334
#18 0xf7786599 in find_objects_by_template (token=0xf12a45e0, 
    sessionOpt=0xf12a1db8, obj_template=0xf5781a84, otsize=3, maximumOpt=0, 
    statusOpt=0xf5781aec) at devtoken.c:463
#19 0xf7786e59 in nssToken_FindCertificatesBySubject (token=0xf12a45e0, 
    sessionOpt=0xf12a1db8, subject=0xf5781b80, 
    searchType=nssTokenSearchType_TokenOnly, maximumOpt=0, 
    statusOpt=0xf5781aec) at devtoken.c:657
#20 0xf777d95d in nssTrustDomain_FindCertificatesBySubject (td=0xf12a1cd0, 
    subject=0xf5781b80, rvOpt=0x0, maximumOpt=0, arenaOpt=0x0)
    at trustdomain.c:646
#21 0xf777daa7 in NSSTrustDomain_FindCertificatesBySubject (td=0xf12a1cd0, 
    subject=0xf5781b80, rvOpt=0x0, maximumOpt=0, arenaOpt=0x0)
    at trustdomain.c:702
#22 0xf77760f2 in CERT_CreateSubjectCertList (certList=0x0, handle=0xf12a1cd0, 
    name=0xf12b02f8, sorttime=1249325704833941, validOnly=1)
    at stanpcertdb.c:691
#23 0xf782fed7 in pkix_pl_Pk11CertStore_CertQuery (params=0xf091b008, 
    pSelected=0xf5781cc0, plContext=0xf091e868) at pkix_pl_pk11certstore.c:238
#24 0xf78310f7 in pkix_pl_Pk11CertStore_GetCert (store=0xf091b5a8, 
    selector=0xf091af00, parentVerifyNode=0xf091ad68, pNBIOContext=0xf5781d2c, 
    pCertList=0xf5781d34, plContext=0xf091e868) at pkix_pl_pk11certstore.c:618
#25 0xf77c72e0 in pkix_Build_GatherCerts (state=0xf091ac18, 
    certSelParams=0xf091b008, pNBIOContext=0xf5781df0, plContext=0xf091e868)
    at pkix_build.c:1807
#26 0xf77c87ed in pkix_BuildForwardDepthFirstSearch (pNBIOContext=0xf5781f5c, 
    state=0xf091ac18, pValResult=0xf5781f54, plContext=0xf091e868)
    at pkix_build.c:2343
#27 0xf77cd4df in pkix_Build_InitiateBuildChain (procParams=0xf091b418, 
    pNBIOContext=0xf5782010, pState=0xf5782018, pBuildResult=0xf5782014, 
    pVerifyNode=0xf5782078, plContext=0xf091e868) at pkix_build.c:3551
#28 0xf77ce1d6 in PKIX_BuildChain (procParams=0xf091b418, 
    pNBIOContext=0xf578208c, pState=0xf5782088, pBuildResult=0xf5782090, 
    pVerifyNode=0xf5782078, plContext=0xf091e868) at pkix_build.c:3719
#29 0xf7728fea in CERT_PKIXVerifyCert (cert=0xbd07188, usages=2, 
    paramsIn=0xf57821d8, paramsOut=0xf5782190, wincx=0x0) at certvfypkix.c:2148
[Rest of stack trace omitted for brevity]

The fstat() call in unixFileSize fails (returns -1) with errno 107
(ENOTCONN).
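For reference, the numeric error chain described above can be spelled out with the values quoted in this report (a sketch, not NSS code; the constants mirror sqlite3.h and pkcs11t.h, and ENOTCONN's reported value 107 is the Linux one):

```python
import errno

# Values as quoted in this report:
SQLITE_IOERR = 10         # sqlite3_step's return code when fstat() fails
CKR_DEVICE_ERROR = 0x48   # sdb_mapSQLError's mapping of SQLITE_IOERR

# fstat() on the vanished mount fails with errno ENOTCONN; sqlite
# surfaces that as SQLITE_IOERR, and softoken as CKR_DEVICE_ERROR.
chain = (errno.ENOTCONN, SQLITE_IOERR, CKR_DEVICE_ERROR)
print(chain)
```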

Note that Chromium uses its own copy of the sqlite library.  I'm not
sure if that matters to this bug.
I just verified that the old DBM database doesn't have this bug.  So
this bug can be considered a regression by products that switch from
DBM to sqlite.

Bob, I have the setup to reproduce and debug this bug.  You are
welcome to come to my office to debug this.
Version: 3.11.14 → 3.12
Question: Does old dbm lack the bug because old dbm does not return an error, because it returns a different error, or because the error is handled differently?

My first thought is that the DBM semantics were wrong (they should have returned an error if the underlying file went away), but the upper level code should have been more tolerant of underlying failures in softoken itself.

NOTE: This has a couple of issues.... 
Say you have a database on one of these flaky file systems. You delete a certificate from the builtins (which puts a 'delete' record in the old database). If your file system goes away, does dbm suddenly 'see' that certificate again? Is this the failure mode we want?

bob
DBM always held the file open the whole time.  If I'm not mistaken, sqlite3 closes and 
reopens it between (some) accesses.
Bob, the old DBM database successfully returns the root CA cert I
added manually after it loses and regains access to the cert database.
I am running into this bug again.  Chromium copies PSM's algorithm
of storing intermediate CA certs in the NSS cert database.  This
causes the new temp certs that libSSL creates for the intermediate
CA certs to become "perm" certs:

http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/certdb/stanpcertdb.c&rev=1.84&mark=378-380,398#362

When we lose access to ~/.pki/nssdb, these intermediate CA certs
become undiscoverable if we use the sql db, even after we regain
access to ~/.pki/nssdb.  On the other hand, if we use the dbm db,
the softoken's db slot continues to return these intermediate CA
certs even when we lose access to ~/.pki/nssdb.

Re: Nelson's comment 3: I found that we open the sql or dbm
databases at NSS initialization, and do not close them until
NSS shutdown.  However, the dbm code reads from the file only
once, and can satisfy subsequent C_FindObjects calls from
memory, so the dbm code is oblivious to the fact that the
filesystem is gone.  I believe this is why the dbm database
handles this condition "better" than the sql database.  As
Bob noted, dbm's behavior is less correct, but from a user's
point of view, dbm's behavior makes the cert chains validate,
and is therefore more desirable.

I don't know the softoken code well enough to propose a fix.
I think it'll require reopening the db after detecting an error
with the current db handle.

This bug is a serious problem for Linux Chromium users at Google,
so I am afraid that I will have to stop adding the intermediate
CA certs to the cert db.
I forgot to add: I will not give up on the sql db yet.
One obvious "workaround" I tried was to remove the code in
CERT_NewTempCertificate that simply returns the perm cert
if the cert is already in the cert db:
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/certdb/stanpcertdb.c&rev=1.84&mark=378-380,398#362

Index: mozilla/security/nss/lib/certdb/stanpcertdb.c
===================================================================
RCS file: /cvsroot/mozilla/security/nss/lib/certdb/stanpcertdb.c,v
retrieving revision 1.84
diff -u -u -r1.84 stanpcertdb.c
--- mozilla/security/nss/lib/certdb/stanpcertdb.c	29 May 2009 19:16:54 -0000	1.84
+++ mozilla/security/nss/lib/certdb/stanpcertdb.c	11 Sep 2009 05:16:26 -0000
@@ -374,11 +374,13 @@
 	/* First, see if it is already a temp cert */
 	c = NSSCryptoContext_FindCertificateByEncodedCertificate(gCC, 
 	                                                       &encoding);
+#if 0
 	if (!c) {
 	    /* Then, see if it is already a perm cert */
 	    c = NSSTrustDomain_FindCertificateByEncodedCertificate(handle, 
 	                                                           &encoding);
 	}
+#endif
 	if (c) {
 	    /* actually, that search ends up going by issuer/serial,
 	     * so it is still possible to return a cert with the same

However, when I did this, the certificate chain building
code in libpkix always got into a loop.  I tried several changes
to eliminate the duplicate temp and perm certs, but couldn't
get rid of the cert loop problem.

I also tried opening two cert dbs containing the same
intermediate CA certs, but libpkix did not get a cert
loop.  So it seems that having two perm certs, from two
tokens, of the same intermediate CA is fine.
This whole thing sounds like an SQLite issue, particularly if the failure is persistent.

bob
Blocks: 783994
What happens while the network file system is gone? Which of the following is true?

(a) Application will block forever on sqlite access, until the network is back

(b) sqlite function calls will fail

From the symptoms being reported, I'd guess it's (b).

If it's (b), then we cannot really blame sqlite. It's the application's job to deal with the storage being unavailable and recover on its own.
Wan-Teh has suggested closing the database after each write operation, and reopening it prior to any operation.

But apparently we need the database for read access. If we cannot retrieve the list of intermediate CAs from a cache, then apparently NSS doesn't cache such data; rather, NSS must read the database each time it needs information.

This "must read database each time" makes sense, given the intention of sharing data between multiple applications.

Having to open the database prior to each database access seems very expensive.
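To put a rough number on that cost, here is a minimal sketch using Python's built-in sqlite3 binding (not NSS code; the absolute figures depend on the machine and filesystem, but opening per read reliably costs more than a long-lived handle, because each open pays the connection-setup and schema-read price again):

```python
import os
import sqlite3
import tempfile
import time

db_path = os.path.join(tempfile.mkdtemp(), "bench.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE t (v TEXT)")
conn.execute("INSERT INTO t VALUES ('x')")
conn.commit()

N = 500

# Strategy A: one long-lived handle (what NSS does today).
start = time.perf_counter()
for _ in range(N):
    conn.execute("SELECT v FROM t").fetchall()
keep_open = time.perf_counter() - start

# Strategy B: open and close around every single read
# (the suggestion discussed above).
start = time.perf_counter()
for _ in range(N):
    c = sqlite3.connect(db_path)
    c.execute("SELECT v FROM t").fetchall()
    c.close()
reopen_each = time.perf_counter() - start

conn.close()
print(f"keep-open: {keep_open:.4f}s  reopen-each: {reopen_each:.4f}s")
```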
(In reply to Wan-Teh Chang from comment #5)
> 
> When we lose access to ~/.pki/nssdb, these intermediate CA certs
> become undiscoverable if we use the sql db, even after we regain
> access to ~/.pki/nssdb.  On the other hand, if we use the dbm db,
> the softoken's db slot continues to return these intermediate CA
> certs even when we lose access to ~/.pki/nssdb.

I assume they were still successfully written to disk, but the NSS read logic isn't smart enough to read the data after an earlier failure, because we don't open the database again.
Because of the intended data sharing across multiple processes,
I assume that an NSS process N1 performing operations N1a and N1z
deals correctly with any database operations N2c, N3d, N4e, ...
performed by NSS processes N2, N3, N4, ...
that happen between the time of N1a and N1z.

If this assumption is true, in order to recover, shouldn't it be sufficient to simply close and reopen the database after a failure?


Another question: what happens if N1a, N1b succeed, but N1c fails, then the user continues to use the application, which triggers additional operations N1d, N1e, which fail, too. Then the network comes back, NSS recovers by reopening the database, and NSS attempts to perform operation N1f. What happens now? Will the operation N1f be based on data matching N1b, or rather on N1e? If it's based on N1e (because NSS has cached data despite the failing transactions), will operation N1f write all remaining data to the database? If not, we probably aren't willing to accept such data loss.

However, if the operation N1f was based on data from N1b (the most recent successful database operation), then we might be OK.

If the answer to any of the above is like "difficult to say, because of the complexity of NSS internals", then I'd rather like to make a different proposal: If sqlite fails, either exit the application, or permanently switch NSS into a "broken, won't access sqlite again, abort all future operations with failure code" mode.
(In reply to Kai Engert (:kaie) from comment #12)
> ... or permanently
> switch NSS into a "broken, won't access sqlite again, abort all future
> operations with failure code" mode.

Which might actually be the current behaviour. But I'd prefer that the user learns about the broken application state, like aborting any SSL connections, making the application unusable until a restart.

In my opinion, if the application continues to run without storage access, and if we cannot avoid the risk of data loss even after storage comes back, then I'd prefer either a clean process exit, or a "sorry user, cannot continue, please restart".
I performed multiple experiments, using "sshfs" and a minimal sqlite application, with the following operations:

- 1 open db
- 2 create table
- 3 insert row a
- 4 insert row b
- 5 select and print

I executed steps 1, 2, 3, then paused the app.
I disconnected the mount (killed sshfs).
I continued with step 4, which detects an I/O error.
I restored the mountpoint.
I ran step 5 with the same database handle.
It still failed with an I/O error.

This means, unless the application reopens the database, all future data access will fail.


I performed a second experiment.
I changed the application to check, prior to an operation, whether the previous operation had failed, and if so, to close and re-open the database.

I executed steps 1, 2, 3, then paused the app.
I disconnected the mount (killed sshfs).
I continued with step 4, which detects an I/O error.
I restored the mountpoint.
I tried to continue with step 5, which attempted to re-open the database.
If the application uses an absolute path to the database, re-opening succeeds.
I continued with operation 5. The select succeeded and printed row b.


This confirms my guess from comment 9: it's (b). We can't blame sqlite; it's our own job to recover.
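The two experiments above can be replayed in miniature with Python's built-in sqlite3 module. This is a hedged sketch, not NSS code: a real sshfs outage surfaces as SQLITE_IOERR, which we stand in for here by closing the handle underneath the caller; run_with_recovery and conn_box are illustrative names.

```python
import os
import sqlite3
import tempfile

def run_with_recovery(conn_box, db_path, sql, params=()):
    """Execute sql; on failure, close and reopen the database once and retry.

    conn_box is a one-element list holding the current connection, so the
    caller sees the replacement handle after a reopen.
    """
    try:
        return conn_box[0].execute(sql, params).fetchall()
    except (sqlite3.OperationalError, sqlite3.ProgrammingError):
        # Mirrors the second experiment: reopen via an absolute path,
        # then retry the operation on the fresh handle.
        try:
            conn_box[0].close()
        except sqlite3.Error:
            pass
        conn_box[0] = sqlite3.connect(os.path.abspath(db_path))
        return conn_box[0].execute(sql, params).fetchall()

# Steps 1-4 of the experiment: open, create table, insert rows a and b.
db_path = os.path.join(tempfile.mkdtemp(), "test.db")
box = [sqlite3.connect(db_path)]
box[0].execute("CREATE TABLE t (v TEXT)")
box[0].execute("INSERT INTO t VALUES ('a')")
box[0].execute("INSERT INTO t VALUES ('b')")
box[0].commit()

# Simulate the handle going bad (a real sshfs outage yields SQLITE_IOERR;
# here we just close the handle so the next call fails).
box[0].close()

# Step 5: the select fails on the dead handle; recovery reopens and succeeds.
rows = run_with_recovery(box, db_path, "SELECT v FROM t ORDER BY v")
print(rows)  # [('a',), ('b',)]
```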

We should find answers for my questions in comment 11, 12 and 13.
Wan-Teh, can you please tell me, which filesystem did you use for your tests?


I'm trying to simulate the part "filesystem goes away and comes back".

When using the sshfs file system, if I disconnect the network interface, then my test process just blocks. I assume in your scenario the I/O call eventually timed out?
I learned that NSS/sdb uses two separate database handles for its operation.

When strictly reading only (no write transaction active), the database handle "sqlReadDB" is used.

This database handle "sqlReadDB" gets opened only once (*), and is kept open for the whole NSS session.

After a failure, we should set a new flag "readDbFailed", and on the next attempt to use sqlReadDB while "readDbFailed" is set, NSS must close and reopen sqlReadDB.


(*) There is an exception to my "only once" statement. In one special scenario, when attempting to read meta data and getting error code sqlite_schema, NSS will attempt to reopen (sdb_reopenDBLocal). Can we reuse function sdb_reopenDBLocal after an I/O failure, too?
Wan-Teh, regarding your suggestion to open and close the database each time, and regarding my worry about that being too expensive, you might want to read the comment in front of function sdb_openDBLocal.
The file system where this bug was originally reported is NFS with Kerberos
access control.
(In reply to Wan-Teh Chang from comment #18)
> The file system where this bug was originally reported is NFS with Kerberos
> access control.

Can you confirm you were using the "soft" mount option on the client side?

(I think you must have used "soft", because the "hard" option is defined to wait forever and never returns an I/O error to the application.)
Attached patch certutil patch for testing (obsolete) — Splinter Review
Wan-Teh, questions for you are at the end of this comment.


I'm focusing on this bug and I want it resolved, as I consider it a blocker to deploying the NSS shared database format by default.

We need a minimal test case. I have been guessing too much, it's time to test.

I tried to reproduce this bug, both using Firefox and using a modified certutil, but I cannot reproduce.

I modified the list (-L) command of certutil to operate in a loop. (See attached patch for certutil.) After starting, it will init as usual. Then it will pause until a key is pressed, then it will produce the listing, then it will loop, again pausing for a key to be pressed.

If I understand the bug, when using this modified certutil, the following incorrect behavior is reported:
- mount a network directory /net
  (I used NFS, mount options "-o intr,soft,timeo=10")
- export NSS_DEFAULT_DB_TYPE="sql"
- copy a cert9+key4 db to /net - the database should contain a couple of certificates,
  ensure that certutil -L reports them.
- execute:
  certutil -d /net -L
- do NOT yet press enter. Wait for the tool to report "press enter to list"
- disconnect the network on the NFS server
- press enter, thereby instructing certutil to list
- after a few seconds, you should get an error message, NO certs listed
- reconnect the network on the NFS server
- wait until the network share becomes available again
- in the terminal window running certutil, press enter again, wait, again, wait, again...

- REPORTED incorrect result:
     the list of certificates will never be printed

- EXPECTED good result:
     eventually, when the network share is back, certutil should print the list of certificates


Wan-Teh, do you AGREE or DISAGREE that my test case should be equivalent to the failure you have reported, and that you saw the REPORTED incorrect result? Could you please try the attached patch to certutil in your environment, and tell me which result you get now?

If you disagree, can you please clarify how your test scenario is different, and how we could construct a minimal test for this bug?


The trouble is, I cannot reproduce a failure using NSS trunk. I get the EXPECTED good result in my tests!
In case anyone of you is still able to reproduce a failure,
can you please test this potential fix?
Assignee: rrelyea → kaie
Attachment #677194 - Attachment is obsolete: true
Forget my other questions.

Now I'm able to reproduce, using a modified version of the "vfychain" tool.
(In reply to Kai Engert (:kaie) from comment #23)
> Forget my other questions.
> Now I'm able to reproduce, using a modified version of the "vfychain" tool.

Sorry, that was a mistake...


I *cannot* reproduce the failure.

Even when testing the code from NSS version 3.12.4 (the latest published version prior to Wan-Teh's comment 5), I *cannot* reproduce.

I have a patch to the vfychain tool for testing, which pauses prior to verification, and loops. I did an equivalent test to what I have described in comment 20.

I start vfychain, wait after NSS init, pause, disconnect the NFS server, let it verify, then I get the error message that verification failed, then I restore the network connection, wait until the share is back, then continue the loop, and vfychain succeeds with verification (still in the same session/process).

I used a breakpoint on sdb_mapSQLError, and when the initial loop reported the error, I got *exactly* the same stack as reported in the initial comment in this bug report - but vfychain still recovers from it after the network is back.


I propose to resolve this as WORKSFORME.

If you still see this bug, then please help me to reproduce the failure.
Please use the patch to vfychain that I'll attach, and please try yourself to reproduce.

If you can reproduce, then please give me more detailed information about the environment you are using:
- version of server software (OS and NFS)
- version of client software
- full set of mount options on the client (as reported by the "mount" command with the mount active)
VFY, tested with a certificate database that contains the startcom intermediate (as loaded by visiting https://kuix.de), having the libnssckbi.so roots module added with modutil, having saved the kuix.de server cert to file kuix.pem, with the vfychain patch applied, and testing with the following command line:
vfychain -d ~/moz/nss/from-remote/ -p -p -u 1 -v -a /tmp/kuix.pem

(In order to build NSS 3.12.4 on my Fedora 17 system, I used security/coreconf from the tip of the NSS_3_12_BRANCH, and the newer makefile for freebl, but besides that, it was 3.12.4 with NSPR 4.8 and sqlite 3.3.7)
After we discussed this on the phone 2-3 weeks ago, I had done an additional test, where I added the cert to perm storage during execution. I still couldn't reproduce.

Ryan, Wan-Teh, have you been able to experiment with the patch I have provided? It would be good to get your confirmation of:
- whether you are still able to reproduce
- whether or not my patch fixes the issue

Thanks
Attachment #677199 - Flags: review?(ryan.sleevi)
Priority: -- → P2
Target Milestone: --- → 3.14.2
Kai: I need you to do an experiment.

Edit the unixFileSize function in lib/sqlite/sqlite3.c.

/*
** Determine the current size of a file in bytes
*/
static int unixFileSize(sqlite3_file *id, i64 *pSize){
  int rc;
  struct stat buf;
  assert( id );
  rc = osFstat(((unixFile*)id)->h, &buf);
  SimulateIOError( rc=1 );
  if( rc!=0 ){
    ((unixFile*)id)->lastErrno = errno;
    return SQLITE_IOERR_FSTAT;
  }
  *pSize = buf.st_size;

  /* When opening a zero-size database, the findInodeInfo() procedure
  ** writes a single byte into that file in order to work around a bug
  ** in the OS-X msdos filesystem.  In order to avoid problems with upper
  ** layers, we need to report this file size as zero even though it is
  ** really 1.   Ticket #3260.
  */
  if( *pSize==1 ) *pSize = 0;


  return SQLITE_OK;
}

Add a printf statement to print ((unixFile*)id)->zPath. Then
run Firefox and visit some https sites. Let me know what file
pathnames are printed. Thanks.

I am planning to use a magic file named "missing.txt" in the
NSS database directory to make unixFileSize fail artificially.
This will allow us to simulate an inaccessible filesystem
easily. So I need to know what is actually stored in the zPath
field of the unixFile structure. Thanks.
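A hypothetical Python analogue of that plan (the real change would patch unixFileSize in sqlite3.c; "missing.txt" is the marker name proposed above, and open_db_with_fault_injection is an illustrative name, not a real API):

```python
import os
import sqlite3
import tempfile

def open_db_with_fault_injection(db_path):
    """If a magic file named "missing.txt" sits in the database
    directory, behave as if the filesystem were gone by raising the
    kind of error sqlite would surface (simulated fault injection)."""
    marker = os.path.join(os.path.dirname(os.path.abspath(db_path)),
                          "missing.txt")
    if os.path.exists(marker):
        raise sqlite3.OperationalError("disk I/O error (simulated)")
    return sqlite3.connect(db_path)

# Demo: opening works until the magic file appears.
d = tempfile.mkdtemp()
db_path = os.path.join(d, "cert9.db")
conn = open_db_with_fault_injection(db_path)   # succeeds
conn.close()

open(os.path.join(d, "missing.txt"), "w").close()   # drop the magic file
try:
    open_db_with_fault_injection(db_path)
    failed = False
except sqlite3.OperationalError:
    failed = True
print(failed)  # True
```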
(In reply to Wan-Teh Chang from comment #28)
> Kai: I need you to do an experiment.
> 
> run Firefox and visit some https sites. Let me know what file
> pathnames are printed.

Output, piped through "sort | uniq -c":

   2414 path-to-profile/cert9.db
     26 path-to-profile/content-prefs.sqlite
     23 path-to-profile/cookies.sqlite
      1 path-to-profile/cookies.sqlite-wal
     29 path-to-profile/downloads.sqlite
     47 path-to-profile/key4.db
     26 path-to-profile/permissions.sqlite
    133 path-to-profile/places.sqlite
      8 path-to-profile/places.sqlite-wal
     17 path-to-profile/signons.sqlite
     11 path-to-profile/webappsstore.sqlite
      2 (null)
Priority: P2 → --
Target Milestone: 3.14.2 → ---
Ryan, did you have time to look at the patch?

Wan-Teh, did you have a chance to work on your plan?
ping
Comment on attachment 677199 [details] [diff] [review]
potential fix (v6)

Sorry for the delay. I'm not a good person to review this code, as I have not spent anywhere near the amount of time that Wan-Teh has investigating, nor do I have an environment to reliably repro (I use Windows as my primary env).
Attachment #677199 - Flags: review?(ryan.sleevi) → review?(wtc)
This bug has been sitting idle for too long.

Given the lack of attention to get this issue fixed, we should no longer accept it as a blocker to make the sqlite database the default database.
No longer blocks: 783994
See Also: → 783994