Closed Bug 1103787 Opened 10 years ago Closed 9 years ago

xpcshell tests fail on nightly builds

Categories

(Toolkit :: Application Update, defect)

x86_64
Windows 8
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: bbondy, Assigned: bbondy)

References

Details

TBPL dep builds always succeeded the first time, but when re-running them at a later time, the later ones would always fail.  I used a hex editor (in the end, a lot of other things were attempted first) to investigate why this happens and it turns out that the dep builds use the nightly binaries when re-running the xpcshell tests instead of the dep build binaries.

This isn't a problem currently on Windows because each build contains the normal certs and the xpcshell certs.   And the fallback key decides which one to use.  So you can use nightly binary or dep binaries and it's fine.  This was supposed to change with the multi platform mar signing work so that each binary only contains a single cert. 

This is the biggest blocker left for landing the work.  There's still a linux problem independent of this.
This is how I plan to fix this:

What we need to do is always include the normal certs and the xpcshell cert in the updater binary.
And the solution has to be cross platform.  The old solution I was going to use was only including the certs needed, but see Comment 0 for why that's not possible anymore.

We'll continue using the well known nss cert store (which is committed to mozilla-central) for the test mar files. This is nice because these mars don't need to be secured.  And we have no chance of ever losing the key. Re-signing becomes easy when we need to change the test mar files.

We'll maintain a whitelist of MAR hashes that updater will check the MAR file it's processing against. 

If updater finds the MAR hash is inside the whitelist, it'll attempt to do the operation with the xpcshell fallback key.  It'll only ever use this key if the test MAR file hash matches something in the whitelist.  This means we'll need to update our whitelist anytime we have MAR file changes.  They already need to be resigned anyway so we can just do this via some kind of script in a followup. 

This would seem like it would require an extra process of the file to get the MAR hash. But I think we can just do the normal security check, and if it fails, then look in the whitelist (I think will require some libmar changes).  That way tests will read the file twice, but everyone's updates will only ever read the file once. 

Needinfo'ing to make sure you see this and to let me know if you have any objections to this plan.
Flags: needinfo?(robert.strong.bugs)
By the way I'm considering also doing this same type of whitelist check based on the input mar file for the maintenance service as well (lower priority than this work). This would eliminate the need to manually configure windows talos machines with the test xpcshell certs (Which I'm sure releng would love).  And it would allow everyone to run updater tests without manually configuring their machines.   This new change would just skip the authenticode check altogether for the whitelist mar files.  So a bit worse testing, but I think that code is unlikely to change for the input mars.   Let me know if you think we should do this too and I'll post another task for myself for that.  Also we could get rid of a lot of the test code that has checks for auto-skipping, so better testing for non tbpl machines.
The crux of this plan is that the hashing algorithm has to be hard to reproduce a new mar with one of those whitelisted hashes. Which I think is the definition of a good hashing algorithm. We may need security's input on this plan.  Is bsmith still around?

If we can't do this we can just make the xpcshell test mar files signed by a key mozilla manages but is not checked into source code.  But still use the whitelist type idea.  This is just a little harder on maintenance.
Is there a buildbot/test harness issue which should be fixed, or is this work which has merit in it's own right ?
This could be fixed by using the dep builds instead on re-running an xpcshell test, instead of using the nightly build package. But I'm not sure if it's possible that we ever run tests using the nightly builds in some other case, like maybe for pgo builds or something.
More detailed, better explained, slightly modified possible plan for fixing this:

Check for a valid verified mar here as usual during updater binary running:
http://dxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/updater/archivereader.cpp#54

If it fails, then then check that: SIGNATURES.SIGNATURE_ENTRY.Signature of the mar file matches something in the whitelist.  If it does, use the xpcshell cert to do the verify and repeat line 54 of archivereader.cpp.  The signature can be extracted using libmar's `extract_signature` API.
(See https://wiki.mozilla.org/Software_Update:MAR for explanation of what SIGNATURES.SIGNATURE_ENTRY.Signature is.)

For each of these files:
./toolkit/mozapps/update/tests/data/complete.mar
./toolkit/mozapps/update/tests/data/complete_mac.mar
./toolkit/mozapps/update/tests/data/old_version.mar
./toolkit/mozapps/update/tests/data/partial.mar
./toolkit/mozapps/update/tests/data/partial_mac.mar
./toolkit/mozapps/update/tests/data/simple.mar
./toolkit/mozapps/update/tests/data/wrong_product_channel.mar

Use the signmar tool to extract the signatures:
./signmar -n(i) -X signed_input_archive.mar base_64_encoded_signature_file

Add those signatures to the whitelist.

We use: RSA-PKCS1-SHA1 (2048 bits / 256 bytes) for the file hashing.
We could also check for matching file size in the whitelist.

--

Slight possible modification of the plan is to make the whitelist contain what the nightly builds would hash those MAR files to.  But then if we use some other type of binaries ever to run tests (such as release channel or diff nightly channel), then it would fail.
The attack surface for this would be:
Can an attacker produce a MAR file of the same size which when signed by the cert store in question [1] produces the exact RSA-PKCS1-SHA1 signature as a MAR in the whitelist.


[1]: Currently it's public but we can make it a different cert store that's protected if needed.
(In reply to Brian R. Bondy [:bbondy] from comment #6)
> See the OSX 10.6 Opt line for example:
> https://tbpl.mozilla.org/?tree=Oak&rev=dfdeaab30aca

I dug into the buildbot side. The passing xpcshell job does indeed come from the dep build, and the first failing xpcshell comes from testing the nightly build. I'm changing the summary because that nightly test is correctly picking up the nightly binary. The later failures are retriggers of the nightly test.
Summary: Tbpl re-running xpcshell tests uses nightly binaries instead of dep build based binaries → xpcshell tests fail on nightly builds
Just giving an update on the progress for this bug.

I've been working on building a second updater which has the different certs in it. And using that updater binary only in xpcshell tests.  There are various commits to oak for this and the base implementation is there.

I was getting a failure from a couple of the xpcshell tests which call firefox-bin with process-updates.  That uses updater.app always for the binary so it hangs on postupdate log checking.  To get around this I tried separating out the src binary in head_update.js from the destination binary, you can see that attempt in oak/rev/061c798d4c7e.  In the end I can't easily use this though because on osx we have updater.app which contains updater-xpcshell binary instead of updater binary.  Making it so updater.app won't open because it can't find it's udpater binary.

Instead I'm just going to use the base implementation that I had and use the xpcshell binary only when certain env vars are set inside toolkit xre code.
Won't just renaming the binary in the bundle be enough?
Flags: needinfo?(robert.strong.bugs)
Sure sure but it adds more special cases to the code so I wanted to avoid it.  If you really want no test code in xre though I can do that. The xre change is simpler that's why I was opting for it.  See https://hg.mozilla.org/projects/oak/rev/061c798d4c7e
I thought what we discussed was creating the updater bin in the tests/data directory and using that.

You state that it adds more special cases in the code yet it doesn't add test code to xre seem to contradict each other. Are you saying it adds more special cases outside of the client code? If so, can you provide details or clarify what is being special cased?

Also, there are no xre changes in the link you provided in comment #12.
As per IRC the conversation I'm not going to use copyFileToTestAppDir and just use the binary from the tests/data instead.
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=883e17fc475f
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.