Closed Bug 486304 Opened 12 years ago Closed 12 years ago

cert7.db/cert8.db "corruption" when importing a large certificate (>64K)

Categories

(NSS :: Libraries, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED
3.12.4

People

(Reporter: mozbgz, Assigned: nelson)

Details

(Whiteboard: FIPS)

Attachments

(4 files, 1 obsolete file)

Importing a (relatively) large certificate into cert8.db fails, and will corrupt cert8.db. The attached certificate can be used to reproduce the problem. With trunk (3.12.x), I get:

C:\>certutil -d mydb -A -t ,, -n foo -i mozchomp.crt
certutil: could not add certificate to token or database: Error adding certificate to database.

... while 3.11.9 just fails silently.

In both cases, the cert8.dir subdirectory is created, with a 90564-byte file holding a blob with this certificate (as expected for records larger than DBS_MAX_ENTRY_SIZE).

Trying to list the cert8.db contents afterwards, however, just gives an empty result. And while it's possible to add additional certs to the DB, they silently "disappear" - i.e., the record is apparently added (judging from the hex dump of cert8.db), but it can't be retrieved - effectively making the DB read-only from that point on.

What's most worrisome, however, is the behavior when trying to add such a certificate to a cert8.db with existing entries - it will get corrupted, and you may lose access to some of the existing certificates (in one of my tests almost half of them - depending on where the new entry is added into the Berkeley DB's hash structure, I assume).

The issue seems to be limited to cert entries, as I could not reproduce similar failures when importing CRLs (I used a sample CRL with ~750 KByte). And as a last observation, the problem does not exist with the sqlite backend (NSS_DEFAULT_DB_TYPE=sql), fortunately enough.
Assignee: nobody → rrelyea
Priority: -- → P1
Whiteboard: FIPS
Target Milestone: --- → 3.12.4
I tested this using a tip of trunk certutil and a test DB that had 14 certs.
I got the error message

> certutil: could not add certificate to token or database: Error adding 
> certificate to database.

after which, a certutil -L showed only 7 of the previous 14 certs.  :(

I guess the next step is to trace through it.  

(BTW, I like the .gif file in the cert. :)
OK, I hunted this down.  

The first problem is with the format of an "encoded" cert DB "entry".  
The length of a certificate is only a 16-bit value there.
Even when we write the certificate as a "blob", the blob contains an 
"encoded DB cert entry", with an explicit length that is only 16 bits.

The problem is seen in two functions in file pcertdb.c.
1) in EncodeDBCertEntry(), 

646    buf[6] = ( entry->derCert.len >> 8 ) & 0xff;
647    buf[7] = entry->derCert.len & 0xff;

2) in DecodeDBCertEntry()

760    /* is database entry correct length? */
761    entry->derCert.len = ( ( dbentry->data[lenoff] << 8 ) |
762                             dbentry->data[lenoff+1] );
763    nnlen = ( ( dbentry->data[lenoff+2] << 8 ) | dbentry->data[lenoff+3] );
764    if ( ( entry->derCert.len + nnlen + headerlen )
765        != dbentry->len) {
766        PORT_SetError(SEC_ERROR_BAD_DATABASE);     <== failure is here
767        goto loser;
768    }
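The truncation can be seen with a minimal standalone sketch (a hypothetical helper, not NSS code) that round-trips a length through the two-byte field used at the lines quoted above:

```c
#include <stdint.h>

/* Hypothetical helper mirroring the encode/decode pair quoted above:
 * the length is squeezed through a two-byte field, so any value at or
 * above 0x10000 comes back reduced mod 65536. */
uint32_t encode_decode_len16(uint32_t der_len)
{
    uint8_t buf[2];
    buf[0] = (der_len >> 8) & 0xff;          /* as in EncodeDBCertEntry, line 646 */
    buf[1] = der_len & 0xff;                 /* as in EncodeDBCertEntry, line 647 */
    return ((uint32_t)buf[0] << 8) | buf[1]; /* as in DecodeDBCertEntry, lines 761-762 */
}
```

For a ~90000-byte certificate the decoded length comes back as 90000 mod 65536 = 24464, so the sanity check at line 764 can never hold and the entry is rejected with SEC_ERROR_BAD_DATABASE.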

I think I can fix this with a heuristic that will not require any change 
to the encoded cert entry format, because we also have the overall length
of the encoded entry.  I think we can reconstruct the correct derCert.len
with the information at hand.
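A sketch of such a heuristic (hypothetical names and layout, not the actual patch): since the overall entry length, nnlen, and headerlen are all known, the full certificate length can be recomputed, with the stored 16-bit field serving only as a consistency check:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction: recover the full DER length from the
 * overall entry length, and use the stored (possibly truncated) 16-bit
 * field only as a cross-check.  Returns 0 for a genuinely bad entry. */
size_t reconstruct_cert_len(size_t entry_len, size_t nnlen,
                            size_t headerlen, uint16_t stored_len16)
{
    size_t full_len = entry_len - nnlen - headerlen;
    if ((full_len & 0xffff) != stored_len16)
        return 0;  /* stored low bits don't match: corrupt entry */
    return full_len;
}
```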

The second question/issue is: why does this make the DB behave as if it 
was corrupted?  and is it really corrupted?  If not, if the only problem
is the incorrect length in the blob, then it should not cause the DB to
seem to lose other certs in the DB.  This is bad.  I will continue my 
investigation.
The reason for the second problem, the appearance that the cert DB has lost
entries, rather than merely having one bad entry, is in the function 
nsslowcert_TraverseDBEntries() (and maybe in other similar functions).
It stops the traversal if the attempt to handle any one cert fails.

4239    do {
    ...
4252            rv = (* callback)(&dataitem, &keyitem, type, udata);
4253            if ( rv != SECSuccess ) {
4254                return(rv);
4255            }
4256        }
4257    } while ( certdb_Seq(handle->permCertDB, &key, &data, R_NEXT) == 0 );

removing lines 4253-4257 makes the DB appear complete and behave in a 
more normal fashion.  I'm not sure what the best way is to deal with that.
Maybe it is to merely count the errors, and only return an error result
if there was NO successful call to (*callback).

Bob, what do you think?
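A minimal sketch of that alternative (simplified types and names, not the actual nsslowcert code): continue the traversal past failing entries, and fail overall only when at least one callback was made and none succeeded:

```c
#include <stddef.h>

typedef enum { SECSuccess = 0, SECFailure = -1 } SECStatus;
typedef SECStatus (*EntryCallback)(int entry, void *arg);

/* Continue past per-entry failures; report SECFailure only when
 * every callback invocation failed (and at least one was made). */
SECStatus traverse_entries(EntryCallback callback, const int *entries,
                           size_t count, void *arg)
{
    size_t successes = 0;
    SECStatus rv = SECSuccess;

    for (size_t i = 0; i < count; i++) {
        rv = callback(entries[i], arg);
        if (rv == SECSuccess)
            successes++;
        /* no early return: one bad entry must not hide the rest */
    }
    return (count == 0 || successes > 0) ? SECSuccess : rv;
}

/* Sample callbacks for exercising both outcomes. */
SECStatus fail_odd(int entry, void *arg)
{
    (void)arg;
    return (entry % 2) ? SECFailure : SECSuccess;
}

SECStatus fail_all(int entry, void *arg)
{
    (void)entry; (void)arg;
    return SECFailure;
}
```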
Attached patch proposed patch v1 (obsolete) — Splinter Review
This patch solves both problems described above.
Bob, please review for NSS 3.12.4
Assignee: rrelyea → nelson
Status: NEW → ASSIGNED
Attachment #370518 - Flags: review?(rrelyea)
Attachment #370518 - Flags: review?(rrelyea) → review+
Comment on attachment 370518 [details] [diff] [review]
proposed patch v1

r+ rrelyea

So the issue was:

1) we couldn't handle entries bigger than 64K (the resulting entry would always show an error).

2) once we had an error in an entry, we would never find the other entries.

So the database itself is not 'corrupted', but we couldn't read such a database.

This makes more sense to me than the earlier proposition that somehow we weren't going through the 'entry too large' code for the certs, since I was pretty sure that code was a shim dropped in at almost the last layer before the db call.
That previous patch didn't really do what I wanted.  
This one does, I believe.  
The difference between the two is the value finally returned by 
nsslowcert_TraverseDBEntries.  

The first patch returns SECFailure if ANY call to the callback fails.
It returns SECSuccess if either 
a) no calls to the callback were made, or 
b) all calls to the callback succeeded.

The second patch returns SECSuccess if 
a) no calls to the callback were made, or
b) ANY call to the callback succeeded.
It returns SECFailure only if ALL calls to the callback failed.
That's what I originally intended for it to do.

Bob, please decide which of these alternatives is best.
Attachment #370548 - Flags: review?(rrelyea)
Kaspar, do you think anyone cares enough about 3.11.x, going forward, 
that we should try to carry this fix back to the 3.11 branch?
Thanks for promptly looking into this, Nelson.

(In reply to comment #7)
> Kaspar, do you think anyone cares enough about 3.11.x, going forward, 
> that we should try to carry this fix back to the 3.11 branch?

As a Thunderbird 2.x user, I still do, to a certain extent... as do the SeaMonkey 1.x folks, probably ;-) For Firefox, it's true that it's less of a concern, given that 2.x was EOL'd last December.
Comment on attachment 370548 [details] [diff] [review]
Alternative patch (checked in)

As I look more closely, in both patches, the overloaded 'ret' can be confusing. It's both the return code from dbm and a counter of successes or failures.

In the existing code it requires constructing a logic map to convince myself that it does what you expect. That map should probably be in a comment.

Finally, this function currently returns failure if it can't find the first entry, so in the one case where the database is completely empty, it would return failure (that may not be a problem if we always have some record in a database).

Anyway, I'll r+ if you add a comment explaining that 1) ret is initialized to zero implicitly by the previous call (before the loop), so if ret is non-zero, some callback call succeeded (return true).
2) if ret is zero and no callbacks were made, rv will not change from its initialized value of SECSuccess. 
3) if ret is zero and any callbacks were made, we know that all the callbacks failed. rv is set to the last callback value. Since they all failed, the last one also failed.

I'm ok with the semantics
Attachment #370548 - Flags: review?(rrelyea) → review+
(BTW feel free to cast the comments in your own words, just as long as they contain the appropriate logic).

bob
Turns out that this issue has been present ever since NSS 3 was released (which also means that 64K certs are definitely rarely seen in the wild - otherwise I assume it wouldn't have gone unnoticed for that long :-).

Similar code for length checks is also found in DecodeDBSubjectEntry - I guess this should be adapted as well, right? Let me know if you would like to see a sample certificate for testing this.
Summary: cert8.db corruption when trying to import a large certificate → cert7.db/cert8.db "corruption" when importing a large certificate (>64K)
Version: trunk → 3.0
Kaspar, if you're volunteering to make a cert with a 64K+ subject name, 
please do, and please file a separate bug for that.  Thanks. 
However, that bug may get "WONTFIX"ed.  Now that we have cert9, we're not 
really trying to perfect cert8.  There won't be a cert8.1 :)  But if it's 
as easy to fix as this one, then I expect it will get fixed about as quickly.
I believe a subject entry doesn't have the subject itself, but a list of issuer/serial numbers indexed by that subject.

That means you need to create N certs with the same subject such that the sum of the issuer/serial numbers (encoded) is greater than 64k. If you create an intermediate CA with a 1K subject, then issue 64 different certificates from that intermediate with the same subject, that should do the trick. ;-)

bob
A single issuer name larger than 64K will suffice to meet the criterion 
that "the sum of the issuer/serial numbers (encoded) is greater than 64k."

But also, consider that the DB size limitations apply to DB keys as well
as to DB "entries".
Well, a certificate with a 64K subject DN signed by attachment 370394 [details] actually does the trick. NSS will write blobs to cert8.dir for both the cert and the subject. With the "alternative patch" applied, I can then list the certificate, but sadly, I can no longer get at it with certutil ;-)

C:\>certutil -L -d mydb

Certificate Nickname                                         Trust Attributes
                                                             SSL,S/MIME,JAR/XPI

mozchomp2                                                    ,,

C:\>certutil -L -d mydb -n mozchomp2
certutil: Could not find: mozchomp2
: security library: bad database.

(In reply to comment #12)
> Please do, and please file a separate bug for that.  Thanks. 
> However, that bug may get "WONTFIX"ed.

I don't like filing bugs which might get WONTFIXed :-) But anyway, why should it be a separate bug? It's again a 64K+ certificate, just a somewhat more peculiar one... actually one which more closely follows the idea in that (in)famous posting from Bob Jueneman in 1997:

> Therefore, if I choose to include as one of the RDN components of the DN
> (perhaps as a component of a multi-valued RDN for the leaf node) a 1
> gigabyte MPEG movie of me playing with my cat, that is perfectly allowable
> per the present standard, no matter how unwise or brain-dead such an
> approach might be.

(http://www.cs.auckland.ac.nz/~pgut001/pubs/dave.der doesn't really count, the movie is only in an extension ;-)
Please attach mozchomp2 to this bug.  I'll at least look at it.
(In reply to comment #16)
> Please attach mozchomp2 to this bug.

Ok, here it is. Doesn't include a 1 gigabyte MPEG movie (but the 90K GIF instead, which you said you like in comment 1 :-).
Attachment #370518 - Attachment is obsolete: true
Comment on attachment 370548 [details] [diff] [review]
Alternative patch (checked in)

I added additional comments as Bob requested and committed it.
pcertdb.c; new revision: 1.8; previous revision: 1.7
Attachment #370548 - Attachment description: Alternative patch → Alternative patch (checked in)
With this patch, I'm able to list the contents of mozchomp2.  I also fixed 
some other 16-bit length problems that I found by code inspection.
Comment on attachment 371634 [details] [diff] [review]
patch part 2, fix a few more length errors, v1 (checked in)

The problem that is shown with Kaspar's two existing certs 
is actually a problem with a cert 7/8 db "nickname" record
being too long.  This patch also fixes problems with the 
SMIME record being too long and with the CRL entry being 
too long.  (I'm really stunned that we haven't run into the 
CRL problem before, because CRLs larger than 64KB are not 
uncommon.)  

Kaspar and/or Bob, please review and/or test.
Attachment #371634 - Flags: superreview?(rrelyea)
Attachment #371634 - Flags: review?(mozbugzilla)
Comment on attachment 371634 [details] [diff] [review]
patch part 2, fix a few more length errors, v1 (checked in)

Note: we are still assuming urls, smimeOptions and optionsDates cannot go over 64k. It's unlikely that they will, and we have no recourse if they do... but we should at least remember the possibility.

bob
Comment on attachment 371634 [details] [diff] [review]
patch part 2, fix a few more length errors, v1 (checked in)

Hmmm, I must have forgotten to set the + flag when I made my comment...

r+
Attachment #371634 - Flags: superreview?(rrelyea) → superreview+
re: CRL problem

We did run into the CRL problem -- long ago. The current code has similar functionality to what you added (even a comment that says 'CRL entry is greater than 64 K. Hack to make this continue to work').

Your code is functionally equivalent to the code you replaced, but now that that test is prevalent throughout the code, it's best to keep it in the same form (rather than 2 different versions that do the same thing but don't look the same).

It's also pretty clear that when we ran into this case initially for CRLs, we made the same choice as we have today (calculate the correct length if the length overflows our 16-bit length value).

bob
Comment on attachment 371634 [details] [diff] [review]
patch part 2, fix a few more length errors, v1 (checked in)

Checking in pcertdb.c; new revision: 1.9; previous revision: 1.8
Attachment #371634 - Flags: review?(mozbugzilla)
Attachment #371634 - Attachment description: patch part 2, fix a few more length errors, v1 → patch part 2, fix a few more length errors, v1 (checked in)
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED