Status

defect
P1
major
RESOLVED FIXED
7 years ago
Last year

People

(Reporter: kaie, Assigned: espindola)

Tracking

({regression})

18 Branch
mozilla19
x86_64
macOS
Dependency tree / graph

Firefox Tracking Flags

(firefox17-, firefox18+ fixed)

Details

(Whiteboard: [qa?])

Attachments

(9 attachments, 2 obsolete attachments)

3.14 KB, patch
Details | Diff | Splinter Review
3.59 KB, patch
Details | Diff | Splinter Review
8.99 KB, text/plain
Details
17.38 KB, text/plain
Details
21.02 KB, text/plain
Details
2.87 KB, patch
Details | Diff | Splinter Review
634 bytes, patch
rail
: review+
Details | Diff | Splinter Review
4.72 KB, patch
rail
: review+
Details | Diff | Splinter Review
3.18 KB, patch
Details | Diff | Splinter Review
Reporter

Description

7 years ago
At least two people face the problem that all signed S/MIME messages they send are reported by recipients as having invalid signatures.

Upon inspection, the messages contain a signingTime attribute that is backdated, exactly 2 years in the past. All other attributes in the email headers report correct current time.

I wasn't able to reproduce on Linux 64 bit, nor on OSX 10.6 32 bit.

The users apparently use OSX 10.7 with 64 bit.
I use 10.6 with 64 bit
Reporter

Updated

7 years ago
Depends on: 795017
Reporter

Comment 2

7 years ago
Mark, can you please say whether the signed S/MIME messages you send from your computer are good or bad?
Reporter

Comment 3

7 years ago
It's not just Ludovic, JB's messages (OSX 10.8) also show the very same behavior, message signing time exactly 2 years in the past.

Comment 4

7 years ago
Ludovic, JB: bug 786531 is related to the DER encoding of time. Could you try reverting the patch
in that bug and see if it fixes this bug? Thanks.

Comment 5

7 years ago
This patch is the diffs in mozilla/security/nss/lib/pkcs7 and
mozilla/security/nss/lib/smime (the two directories that make
up libsmime3.dylib) between NSS_3_13_6_RTM and NSS_3_14_BETA1,
ignoring the comment changes (such as the MPL 2 license headers).

Only one file was changed: mozilla/security/nss/lib/smime/smimeutil.c.
It is the patch for bug 676108. It does not seem related to
the encoding of time. But it is worth backing out to see if it
causes this bug.

Comment 6

7 years ago
Comment on attachment 669671 [details] [diff] [review]
Changes to libsmime3.dylib between NSS 3.13.6 and NSS 3.14 Beta 1

I just realized that mozilla/security/nss/lib/pkcs12 is also a
part of libsmime3.dylib. I verified that there are no changes
to mozilla/security/nss/lib/pkcs12 other than the MPL 2 license
headers.
Reporter

Comment 7

7 years ago
I'm worried that Ludo isn't building on his own, or am I wrong?

I would have offered to make a test build for you, but I don't have a 64 bit OSX system.

In theory there would be the option to produce a try server build, but the last time I had asked, it was impossible to produce a Thunderbird try server build, which uses changes to the core of Mozilla.

Comment 8

7 years ago
This patch file is the diffs in NSPR between NSPR_4_9_2_RTM
NSPR_4_9_3_BETA1, ignoring the version number change and a
new test program. None of the changes is related to this bug.
See comment 4. If there is a bug in time encoding then WebRTC may also be affected as it generates certificates which contain timestamps.

Also, it is worth checking whether I messed up the import of NSS 3.14 into mozilla-central. NSS 3.14 had several files that were moved and/or removed, so it was more complicated than normal. Still, I would expect NSS to fail to build if I imported it incorrectly.
Whiteboard: [WebRTC], [blocking-webrtc?]
Reporter

Comment 10

7 years ago
(In reply to Brian Smith (:bsmith) from comment #9)
> Also, it is worth checking whether I messed up the import of NSS 3.14 into
> mozilla-central.

Current mozilla-central appears to be correct - it contains NSPR 4.9.3 beta1 and NSS 3.14 beta1 code, plus the patch from bug 797572.

But I noticed that people in bug 795507 made a change on top of NSS, probably unknowingly that directory mozilla/dbm is part of NSS, not Mozilla.
(In reply to Kai Engert (:kaie) from comment #7)
> In theory there would be the option to produce a try server build, but the
> last time I had asked, it was impossible to produce a Thunderbird try server
> build, which uses changes to the core of Mozilla.

It is still possible:

https://wiki.mozilla.org/Thunderbird/Infrastructure/TryServer#Pushing_mozilla-central_patches
Reporter

Comment 12

7 years ago
It's confirmed that it's specific to 64 bit.

I have access to a VM that runs OSX 64 bit with OSX 10.6.8.

Using the same test certificate that I used successfully on a 32 bit system, Thunderbird creates a message with the bad information on that 64 bit system.


I guess I should attempt to use cmsutil to simulate what Thunderbird is doing.


Thanks to Dustin for bringing the OSX 64bit VM back online (it frequently crashes, we don't know why...).
Reporter

Comment 13

7 years ago
I cannot reproduce using NSS utility cmsutil.

I used
  cmsutil -d ~/bug799572 -S -G -N "StartCom Ltd. ID von testonly+sha2@kuix.de" \
          -P -Y "StartCom Ltd. ID von testonly+sha2@kuix.de" \
          -i ~/bug799572/plaintext -o ~/bug799572/cmsout-opt

Where directory bug799572 contains the NSS databases that I took from the Thunderbird profile (the one that showed the problem on the same machine).

The signingTime attribute is correct and shows the timestamp of execution.
I remember that people have been doing some work on the NSS build system, including changes within NSS, and changes of Mozilla's build system outside of NSS, regarding how NSS is built. A 64-bit-specific issue, that reproduces in Gecko but not NSS's utilities, suggests the possibility that this may be a build configuration issue. What changes to how we build NSS have landed?
Reporter

Comment 15

7 years ago
Hurray!

I can reproduce the bug when combining the NSS binaries from the Thunderbird 18 build with the cmsutil utility I built locally?

That means Brian is right, the NSS binaries built by Mozilla/Thunderbird are broken.
Reporter

Comment 16

7 years ago
> I can reproduce the bug when combining the NSS binaries from the Thunderbird
> 18 build with the cmsutil utility I built locally?

not '?' but "!!!"
Reporter

Comment 17

7 years ago
I guess the recommendation that we can learn from this excercise shall be clear.

Don't mess with building NSS, unless your Mozilla software build environment is able to run the full NSS test suite...

Don't try to optimize yet, but make it a higher priority to help the NSS project to execute its tests inside the Mozilla test infrastructure on all platforms. Just saying...
Reporter

Updated

7 years ago
Assignee: nobody → nobody
Component: Libraries → Build Config
Product: NSS → Core
Summary: Incorrect old signingTime attribute with Thunderbird nightly (18) with NSPR 4.9.3 (beta) and NSS 3.14 (beta) → Mozilla's changes to building NSS break NSS
Version: 3.14 → 18 Branch
Reporter

Updated

7 years ago
No longer depends on: 795017

Comment 18

7 years ago
Ekr's patch in bug 769930 looks correct to me. I'm very
curious what he did wrong.
Me too. Adding glandium since the original patch is his.
Should be easy enough to sanity-check that that patch is at fault, just build a Thunderbird build with that backed out. You can also test the nightly before it landed, looks like that would be 2012-08-13.
If a thunderbird build has these problems, a firefox build would have them too. So it should be possible to check with a custom built cmsutil and the firefox libraries. That would make it easier, because pushing m-c changes to thunderbird try is not exactly convenient.
(In reply to Kai Engert (:kaie) from comment #17)
> Don't mess with building NSS, unless your Mozilla software build environment
> is able to run the full NSS test suite...

Seeing how many things try to use .so files in the NSS test suite scripts, I doubt the full NSS test suite runs on OSX and Windows. Numerous tests also copy $DIST/lib/* under test_results, which means in mozilla build case, unstripped libxul.so and various other things are copied. The test suite happily fills 3.5GB that way before failing because of no free space left on my machine.
FWIW, filed bug 799855, about the nss test suite.
(In reply to Ted Mielczarek [:ted] from comment #20)
> Should be easy enough to sanity-check that that patch is at fault, just
> build a Thunderbird build with that backed out. You can also test the
> nightly before it landed, looks like that would be 2012-08-13.

The first build I was able to find was http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/08/2012-08-11-03-05-26-comm-central/ and they display the signed message properly.
And it works too with the build from the 14th.
And bug 769930 landed between the build from the 11th and the one from the 14th
In that range, what changed under security is bug 792581.
(In reply to Mike Hommey [:glandium] from comment #28)
> In that range, what changed under security is bug 792581.

Nut nothing changed security/nss, only psm. So that wouldn't explain Kai's comment 15
(In reply to Kai Engert (:kaie) from comment #17)
> I guess the recommendation that we can learn from this excercise shall be
> clear.
> 
> Don't mess with building NSS, unless your Mozilla software build environment
> is able to run the full NSS test suite...
> 
> Don't try to optimize yet, but make it a higher priority to help the NSS
> project to execute its tests inside the Mozilla test infrastructure on all
> platforms. Just saying...

FWIW, the change in question was not an "optimization", but a necessary split to build NSS and PSM at different times. The WebRTC code relies on NSS, but must be built before the PSM code. It's unfortunate that this bug occurred and I agree that we should work on getting the NSS test suite running on all Firefox/Thunderbird builds.
(In reply to Mark Banner (:standard8) from comment #30)
> Just to be explicit:
> 
> http://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=c24a0fd08031&tochange=2da1f2bde40e
> http://hg.mozilla.org/comm-central/
> pushloghtml?fromchange=861e8385f731&tochange=5a49ea8a7052

Notably, bug 787982 is in that range.
Updating clang sounds plausible, since this seems to be limited to OS X.
Reporter

Comment 35

7 years ago
(In reply to Ludovic Hirlimann [:Usul] from comment #24)
> The first build I was able to find was
> http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/08/2012-08-
> 11-03-05-26-comm-central/ and they display the signed message properly.


Ludo, because your wording is ambiguous, I hope you don't mind that I clarify:

It's NOT the DISPLAYing that has the bug.

It's the PRODUCING of signed messages that triggers this bug.

So, in order to test a version, you must use it to create a new signed message, and open/view it afterwards.
That's what I did.
(In reply to Ted Mielczarek [:ted] from comment #32)
> (In reply to Mark Banner (:standard8) from comment #30)
> > Just to be explicit:
> > 
> > http://hg.mozilla.org/mozilla-central/
> > pushloghtml?fromchange=c24a0fd08031&tochange=2da1f2bde40e
> > http://hg.mozilla.org/comm-central/
> > pushloghtml?fromchange=861e8385f731&tochange=5a49ea8a7052
> 
> Notably, bug 787982 is in that range.

That was just a wrapper, and I don't think it was actually enabled in that time frame (the changes to enable it were buildbot side).
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #34)
> Updating clang sounds plausible, since this seems to be limited to OS X.

The old one was not deleted. If you want you can just change the manifest to build with the old one on try.
(In reply to Ted Mielczarek [:ted] from comment #33)
> ... as well as bug 784691.

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #34)
> Updating clang sounds plausible, since this seems to be limited to OS X.

I did a try build with clang downgraded to before that bug:

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/bugzilla@standard8.plus.com-ec6c86dc2ca1/try-comm-central-macosx64/

This sends correctly signed emails, so I think it proves that upgrading clang broke this.
So:
88eb1c08043919a224da1d25b37a5b0c5719f3547ddfc1b456c577d5fa66564706b6e25853fedefaa08ca8fb1dff765d7eb747bbe99ed81c09afbfd43b222e96 works

e4b198da311aa851b14a65a392b1f429d0df942bd6167887e389c89fbdbd46c00d3d09f88923f8b2bf6c41c3ab9ea87de54473deb3b6b0a71eb4146d483ba035 is known to fail?

Do you have a try push with e4b198da311aa851b14a65a392b1f429d0df942bd6167887e389c89fbdbd46c00d3d09f88923f8b2bf6c41c3ab9ea87de54473deb3b6b0a71eb4146d483ba035 ?

Does any other test fail? If not, sorry for the silly question, but how do I reproduce the s/mime message problem?

Can someone remind me how to do a thunderbird try push? :-)
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #40)
> So:
> 88eb1c08043919a224da1d25b37a5b0c5719f3547ddfc1b456c577d5fa66564706b6e25853fed
> efaa08ca8fb1dff765d7eb747bbe99ed81c09afbfd43b222e96 works
> 
> e4b198da311aa851b14a65a392b1f429d0df942bd6167887e389c89fbdbd46c00d3d09f88923f
> 8b2bf6c41c3ab9ea87de54473deb3b6b0a71eb4146d483ba035 is known to fail?
> 
> Do you have a try push with
> e4b198da311aa851b14a65a392b1f429d0df942bd6167887e389c89fbdbd46c00d3d09f88923f
> 8b2bf6c41c3ab9ea87de54473deb3b6b0a71eb4146d483ba035 ?
> 
> Does any other test fail? If not, sorry for the silly question, but how do I
> reproduce the s/mime message problem?

Follow the instructions on the intranet wiki on getting a cert :
    Go to https://www.startssl.com/?app=12
    click on the signup button
    Fill in the form (notice that there is only one email possible - so if you have aliases (like I have ludovic and lhirlimann) you'll need to repeat the process for each email address
    Read you email, copy the code and click on the url provided
    Continue with the form
    Wait for the second email with code and url to come (this takes a bit of time as it is reviewed by a human from startssl)
    Choose High Grade and wait for the certificate to finish generating (this takes up to 5 minutes).
    The certficate is now installed in your firefox
    You need to back it up. Go to preferences (tools -> option , Ff -> preferences, Edit > Preferences ) Advanced -> Encryption -> view certificates
    In Your certificates, you'll have a Entry under StartCom Ltd. Select and click the backup button.
    Choose a password for that file (the file exetnsion will be .p12).
    Copy that file on your usb key (as auth to the startssl website will be done thru that cert).
    In Thunderbird go to the view certificates (same as in Firefox), and choose Import ....
    Enter the password from step 11
    Now go to your account setting in Thunderbird. Go down to the Security section of your account settings. Under the digital signing click the choose and choose your newly imported cert) click the always sign and you are done 

Once you have the cert in Thunderbird send an email to yourself and see the broken icon instead of the non broken one.
 
> Can someone remind me how to do a thunderbird try push? :-)

Comment 11 has that.
OK. Looking at the try I see that

88eb1c08043919a224da1d25b37a5b0c5719f3547ddfc1b456c577d5fa66564706b6e25853fedefaa08ca8fb1dff765d7eb747bbe99ed81c09afbfd43b222e96 works

e4b198da311aa851b14a65a392b1f429d0df942bd6167887e389c89fbdbd46c00d3d09f88923f8b2bf6c41c3ab9ea87de54473deb3b6b0a71eb4146d483ba035 is unknown

50eb6fa636403f444bab10aee2370f1ac624a1e0105566639c7c266e784dbc6e4cd3901bbd11f53c4c63b2a6d2d07a603b3d9c333f5049bdc7a816b7d225631b fails

I will send a try job while I try to setup ssl to reproduce it.
The builds have finished, but my cert proces has not. Can someone test the builds in comment 43? Thanks.
I ran the nss tests with 88eb1c08043919a224da1d25b37a5b0c5719f3547ddfc1b456c577d5fa66564706b6e25853fedefaa08ca8fb1dff765d7eb747bbe99ed81c09afbfd43b222e96 and e4b198da311aa851b14a65a392b1f429d0df942bd6167887e389c89fbdbd46c00d3d09f88923f8b2bf6c41c3ab9ea87de54473deb3b6b0a71eb4146d483ba035. An interesting regression was:

 cert.sh: Verify Certificate for Password Test Cert with new password --------------------------
 certutil -V -n PasswordCert -u S -d /Users/espindola/mozilla-central/tests_results/security/Rafaels-MacBook-Pro.4/dbpass -f ../tests.fipspw
-certutil: certificate is valid
-cert.sh: #149: Verify Certificate for Password Test Cert with new password - PASSED
+certutil: certificate is invalid: Peer's Certificate has expired.
+cert.sh: #149: Verify Certificate for Password Test Cert with new password (255=0)  - FAILED
+cert.sh ERROR: Verify Certificate for Password Test Cert with new password failed 255

I will try to figure out how to run just that test and try to find out what regressed.
Status: NEW → ASSIGNED

Comment 46

7 years ago
Rafael: please try backing out the NSS patch I noted in comment 4.

Also, how do I use clang on the command line on Mac OS X? Thanks.

Comment 47

7 years ago
(In reply to comment #46)
> Also, how do I use clang on the command line on Mac OS X? Thanks.

It's a drop-in replacement for gcc.
Who can help confirm if this bug blocks webRTC ? Need info to decide if we need to track it for FF18
bajaj: That depends on what's done to correct this bug.  As-is, this bug does not affect WebRTC.  Previous experience on Sunday was that ASAN builds with webrtc were broken using the clang that was used before they installed an update last Sunday (Saturday?); I don't know if that would come into play here if the solution is to regress clang.  If WTC is right that it was the patch mentioned in comment 4, then fixing that wouldn't hurt WebRTC.

And finally, backing out all of NSS 3.14 would badly hurt WebRTC, though likely we could either patch 3.13 and/or code around it by dropping call security or temporarily implementing an alternative encryption scheme to DTLS-SRTP, and probably dropping DataChannels (or running them in the clear).  None of those would be pretty or very easy.  (NSS 3.13 might not be *too* bad if we had to, but we'd need to add the DTLS support from 3.14 to it and the make changes we did.)
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #45)
>  cert.sh: Verify Certificate for Password Test Cert with new password
> --------------------------
>  certutil -V -n PasswordCert -u S -d
> /Users/espindola/mozilla-central/tests_results/security/Rafaels-MacBook-Pro.
> 4/dbpass -f ../tests.fipspw
> -certutil: certificate is valid
> -cert.sh: #149: Verify Certificate for Password Test Cert with new password
> - PASSED
> +certutil: certificate is invalid: Peer's Certificate has expired.
> +cert.sh: #149: Verify Certificate for Password Test Cert with new password
> (255=0)  - FAILED
> +cert.sh ERROR: Verify Certificate for Password Test Cert with new password
> failed 255
> 
> I will try to figure out how to run just that test and try to find out what
> regressed.

NSS_TESTS=certs ./all.sh will run just the "certs" tests.

Looking at certs.sh, I can see that it runs these sets of tests:

cert_init 
cert_all_CA
cert_extended_ssl 
cert_ssl 
cert_smime_client        
cert_fips
cert_eccurves
cert_extensions
cert_test_password
cert_test_distrust

I think you can change this to make the tests complete even faster:

cert_init 
cert_all_CA
#cert_extended_ssl 
#cert_ssl 
#cert_smime_client        
#cert_fips
#cert_eccurves
#cert_extensions
cert_test_password
cert_test_distrust
Is there any reason to believe that solely backing out NSS 3.14 would address the issue?
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #43)
> OK. I did these pushes with exactly the same parent to make it easier to
> isolate the bug:

I would if I could find the builds.(In reply to Brian Smith (:bsmith) from comment #50)
> (In reply to Rafael Ávila de Espíndola (:espindola) from comment #45)
> >  cert.sh: Verify Certificate for Password Test Cert with new password
> > --------------------------
> >  certutil -V -n PasswordCert -u S -d
> > /Users/espindola/mozilla-central/tests_results/security/Rafaels-MacBook-Pro.
> > 4/dbpass -f ../tests.fipspw
> > -certutil: certificate is valid
> > -cert.sh: #149: Verify Certificate for Password Test Cert with new password
> > - PASSED
> > +certutil: certificate is invalid: Peer's Certificate has expired.
> > +cert.sh: #149: Verify Certificate for Password Test Cert with new password
> > (255=0)  - FAILED
> > +cert.sh ERROR: Verify Certificate for Password Test Cert with new password
> > failed 255
> > 
> > I will try to figure out how to run just that test and try to find out what
> > regressed.
> 
> NSS_TESTS=certs ./all.sh will run just the "certs" tests.
> 
> Looking at certs.sh, I can see that it runs these sets of tests:
> 
> cert_init 
> cert_all_CA
> cert_extended_ssl 
> cert_ssl 
> cert_smime_client        
> cert_fips
> cert_eccurves
> cert_extensions
> cert_test_password
> cert_test_distrust
> 
> I think you can change this to make the tests complete even faster:
> 
> cert_init 
> cert_all_CA
> #cert_extended_ssl 
> #cert_ssl 
> #cert_smime_client        

I'm being bitten by smime_clients so skipping that test might not be the best idea.

Comment 53

7 years ago
The NSS_3_14_BETA1 upgrade occurred at 2012-10-01 11:02 -0700, which was
before the regression window identified by Ludovic in comment 27. So this
bug does not block WebRTC.

I confirmed that regression window. I also confirmed that the incorrect
NSS shared library compiled by the new clang is libnssutil3.dylib. So the
NSS patch in comment 4 is the first thing we need to look at with the new
clang.

I cannot reproduce this bug with the clang in Xcode 4.5.1.
Assignee: nobody → respindola
Severity: normal → major
Priority: -- → P1
Whiteboard: [WebRTC], [blocking-webrtc?]

Comment 54

7 years ago
My debugging showed that something seems to go wrong on line 61 in dertime.c:
http://mxr.mozilla.org/security/source/security/nss/lib/util/dertime.c#50

50     /* Convert an int64 time to a printable format.  */
51     PR_ExplodeTime(gmttime, PR_GMTParameters, &printableTime);
52 
53     /* The month in UTC time is base one */
54     printableTime.tm_month++;
55 
56     /* remove the century since it's added to the tm_year by the 
57        PR_ExplodeTime routine, but is not needed for UTC time */
58     printableTime.tm_year %= 100; 
59 
60     d[0] = HIDIGIT(printableTime.tm_year);
61     d[1] = LODIGIT(printableTime.tm_year);


It seems to always store '0' as the last (low) digit of the year. This
is why we end up with year 2010 ("10" in the two-digit year format).

I wonder if it has something to do with printableTime.tm_year being a
PRInt16, because all the other members of printableTime are PRInt32
and they are stored correctly.

Comment 55

7 years ago
I reproduced the bug with the clang I built from the SVN trunk.
I could not create a reduced test case, so I will need to use
the NSS source file to illustrate the bug.

Here is the NSS C source file in question.

Comment 56

7 years ago
$ clang --version
Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.2.0
Thread model: posix

Comment 57

7 years ago
$ clang --version
clang version 3.2 (trunk 165885)
Target: x86_64-apple-darwin12.2.0
Thread model: posix

Comment 58

7 years ago
I forgot to mention that this bug is related to compiler optimization.

Here are the diffs between the good and bad assembly code for the
function _DER_TimeToUTCTimeArena. Note line 71 in the assembly code.

@@ -68,7 +68,7 @@ LBB0_5:
 	imull	$100, %esi, %edx
 	subl	%edx, %ecx
 	movswl	%cx, %ecx
-	movw	%cx, -40(%rbp)
+	movw	%ax, -40(%rbp)
 	movslq	%ecx, %rcx
 	imulq	$1717986919, %rcx, %rcx ## imm = 0x66666667
 	sarq	$34, %rcx

If I change %ax back to %cx, then the NSS "cert" and "smime" tests pass.
Should we be reverting the compiler we're using for builds? If we have been bitten by a compiler bug here, that makes me think that the same compiler bug will occur elsewhere. All of our assumptions on security are based on the the idea that the code is compiled correctly.
It doesn't look like we upgraded clang to fix any critical compiler bugs, so backing this out is probably a sensible solution for now until we get a fix upstream.
Summary: Mozilla's changes to building NSS break NSS → clang update broke NSS
(In reply to Ted Mielczarek [:ted] from comment #61)
> It doesn't look like we upgraded clang to fix any critical compiler bugs, so
> backing this out is probably a sensible solution for now until we get a fix
> upstream.

I fixed it already. I am testing a new build and should have the patches for review tonight or tomorrow.
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #62)
> (In reply to Ted Mielczarek [:ted] from comment #61)
> > It doesn't look like we upgraded clang to fix any critical compiler bugs, so
> > backing this out is probably a sensible solution for now until we get a fix
> > upstream.
> 
> I fixed it already. I am testing a new build and should have the patches for
> review tonight or tomorrow.

Is our build system using bleeding-edge clang? If so, isn't that what contributed to this bug?

I definitely would like us to upgrade compilers much more quickly, but it makes me nervous to rely on a relatively-untested compiler.
> Is our build system using bleeding-edge clang? If so, isn't that what
> contributed to this bug?
> 
> I definitely would like us to upgrade compilers much more quickly, but it
> makes me nervous to rely on a relatively-untested compiler.

That is what allowed us to switch to clang in the first place and is what makes the work on asan possible. With gcc (and specially visual studio) we have to hack our code to avoid bugs.

The compiler version gets branched when firefox branches, so it gets excellent testing as the code itself.
I'll note that a week ago plus a few days, when we landed webrtc, we triggered asan issues that were fixed conveniently by the face that we'd updated clang the same day (10/5?).  Just a warning that downgrading clang might introduce new issues to deal with to get asan working again.
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #64)
> That is what allowed us to switch to clang in the first place and is what
> makes the work on asan possible. With gcc (and specially visual studio) we
> have to hack our code to avoid bugs.

There's no stable or even beta release of Clang that can build mozilla-central? Definitely I don't want to regress to ancient builds of GCC. But, it just seems very risky to be upgrading our compiler in the middle of Aurora. Normally we are much more (too) conservative about compiler upgrades. I hope there is some good balance we can find here.

ASAN doesn't need to use the same compiler version as production builds of Firefox, right? So, we could revert to some more stable build of Clang for production builds but still use newer clang for ASAN. (Not sure how important it is to have the same compiler version used for production and ASAN, as far as the usefulness of ASAN results.)
>But, it just seems very risky to be upgrading our compiler in the
> middle of Aurora. 

We are not, that is the point. The compiler that is used in controlled by a file in hg. When I check-in a new compiler it will be used by m-i. When that gets merged, m-c will use it and so forth.
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #67)
> >But, it just seems very risky to be upgrading our compiler in the
> > middle of Aurora. 
> 
> We are not, that is the point. The compiler that is used in controlled by a
> file in hg. When I check-in a new compiler it will be used by m-i. When that
> gets merged, m-c will use it and so forth.

OK, I guess I was jumping to conclusions. It isn't clear what version of clang is being used on Aurora or whether the NSS currently on Aurora has this bug. Note that the NSS upgrade occurred just a few days before the Aurora merge. So, we at least need to verify that Aurora isn't affected by this bug.
I feel substantially more at ease using bleeding-edge clang and having Rafael at hand to fix any issues we hit than I do using any other toolchain. He's been able to track down and quickly fix every clang issue we've hit in recent memory. Historically we've just had to work around GCC and VC++ bugs.
(In reply to Ted Mielczarek [:ted] from comment #69)
> I feel substantially more at ease using bleeding-edge clang and having
> Rafael at hand to fix any issues we hit than I do using any other toolchain.
> He's been able to track down and quickly fix every clang issue we've hit in
> recent memory. Historically we've just had to work around GCC and VC++ bugs.

+1
> OK, I guess I was jumping to conclusions. It isn't clear what version of
> clang is being used on Aurora or whether the NSS currently on Aurora has
> this bug. Note that the NSS upgrade occurred just a few days before the
> Aurora merge. So, we at least need to verify that Aurora isn't affected by
> this bug.

The file to check is browser/config/tooltool-manifests/macosx64/releng.manifest. In aurora this has r163716, so unfortunately the compiler bug is present there. I will check whether to backport a patch or upgrade once inbound is clear.

Beta has r161152, so it should be good.
> tracking-firefox17: ? → +

Does this really affect Gecko 17? When I looked earlier, I thought we hadn't upgraded gecko 17 to the broken version of clang.
(In reply to Mark Banner (:standard8) from comment #72)
> > tracking-firefox17: ? → +
> 
> Does this really affect Gecko 17? When I looked earlier, I thought we hadn't
> upgraded gecko 17 to the broken version of clang.

We should verify, at least, that the NSS test suite passes when built with the compiler used on -beta. So, I think it is good to track for 17.
Posted patch Update the build script (obsolete) — Splinter Review
I have checked that the nss testsuite regression is fixed in this revision. I have also uploaded packages and pushed to try.

The try run is at: https://tbpl.mozilla.org/?tree=Try&rev=c30ddce479bd

I will also  do a thunderbird try push so that others can verify that the thunderbird issue was also fixed.
Attachment #672023 - Flags: review?(rail)

Updated

7 years ago
Attachment #672023 - Flags: review?(rail) → review+

Updated

7 years ago
Attachment #672025 - Flags: review?(rail) → review+
Regarding the problems finding the self-serve process, perhaps this is a race condition on the shell script. I suspect that might happen in this sequence (in ssl.sh):

            kill_selfserv
            start_selfserv
            is_selfserv_alive

I suggest changing it to sleep for a bit between each of those statements, as a workaround.
(In reply to Brian Smith (:bsmith) from comment #76)
> Regarding the problems finding the self-serve process, perhaps this is a
> race condition on the shell script. I suspect that might happen in this
> sequence (in ssl.sh):
> 
>             kill_selfserv
>             start_selfserv
>             is_selfserv_alive
> 
> I suggest changing it to sleep for a bit between each of those statements,
> as a workaround.

No, still the same problem with a "sleep 10" after each. output.log has

selfserv starting at Tue 16 Oct 2012 18:23:33 EDT
selfserv -D -p 8443 -d ../server -n Rafaels-MacBook-Pro.local  \
          -w nss -r -i ../tests_pid.28686  &
trying to connect to selfserv at Tue 16 Oct 2012 18:23:33 EDT
tstclnt -p 8443 -h Rafaels-MacBook-Pro.local  -q \
        -d ../client -v < /Users/espindola/mozilla-central/security/nss/tests/ssl/sslreq.dat
selfserv: PR_Bind returned error -5982:
Local Network address is in use
I had a selfserv running already. Killing it and running all.sh again solved the problem, so maybe it was really a race condition that left it running after a previous all.sh run?
Posted patch new versionSplinter Review
Sorry, running on try found both a clang and a firefox one (bug 803707) which took me ages to track.

I have pushed to try with the fix 803707. The only error that was caused by bug 803707 disappears in a local run, so I am fairly confident that this will be green.
Attachment #672023 - Attachment is obsolete: true
Attachment #673434 - Flags: review?(rail)
Attachment #672025 - Attachment is obsolete: true
Attachment #673436 - Flags: review?(rail)

Updated

7 years ago
Attachment #673434 - Flags: review?(rail) → review+

Updated

7 years ago
Attachment #673436 - Flags: review?(rail) → review+
Try is green. The diff-talus results with 10 runs are

a11yr_paint
MacOSX 10.7                       | (299.85,       1.99901972816)  -> (295.25,       1.9872346897)   1.0156x better

dromaeo_css
MacOSX 10.7                       | (3139.833,     12.2209582955)  -> (3101.891,     19.1911123021)  1.0122x worse

tp5n_paint
MacOSX 10.8                       | (173.9239,     0.528994698327) -> (172.6863,     0.618013690671) 1.0072x better


The text segment of XUL went from 35704832 to 35598336.

Since all the results are good, I will check the build scrip in. Unfortunately, the actual update is blocked on bug 803707.
https://hg.mozilla.org/mozilla-central/rev/0e73daa7fc4c
https://hg.mozilla.org/mozilla-central/rev/2612d7f2b20d
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
[Approval Request Comment]
Bug caused by (feature/regressing bug #): 

The compiler in aurora has http://llvm.org/pr14088 which causes a nss miscompilation.

User impact if declined:

Certificate expire date check fails on thunderbird. Not clear if it breaks something in firefox.
 
Testing completed (on m-c, etc.): 

The compiler on m-c has been switched. The nss testsuite passes and thunderbird works.

Risk to taking this patch (and alternatives if risky): 

Other compiler bugs are possible. We can backport the just the fix, but that means keeping track of a compiler for aurora that was never used on m-c.

String or UUID changes made by this patch:
Attachment #674835 - Flags: approval-mozilla-aurora?
Not clear if this affects 17, who can confirm if we've upgraded gecko 17 to the broken version of clang?

Has there been testing that confirms that the new compiler doesn't adversely affect Firefox?
(In reply to Lukas Blakk [:lsblakk] from comment #89)
> Not clear if this affects 17, who can confirm if we've upgraded gecko 17 to
> the broken version of clang?

The manifest for beta (current 17) is:
http://hg.mozilla.org/releases/mozilla-beta/raw-file/tip/browser/config/tooltool-manifests/macosx64/releng.manifest

It says it has r161152, which is ok. llvm's pr14088 was introduced in r162099.

Aurora (current 18) on the other hand has this manifest:

http://hg.mozilla.org/releases/mozilla-aurora/raw-file/tip/browser/config/tooltool-manifests/macosx64/releng.manifest

which says r163716, so it has the bug.

> Has there been testing that confirms that the new compiler doesn't adversely
> affect Firefox?

It is being currently used by nightly.
Comment on attachment 674835 [details] [diff] [review]
update manifests

Approving for aurora considering testing in comment 90
Attachment #674835 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #90)
> (In reply to Lukas Blakk [:lsblakk] from comment #89)
> > Not clear if this affects 17, who can confirm if we've upgraded gecko 17 to
> > the broken version of clang?
> 
> The manifest for beta (current 17) is:
> http://hg.mozilla.org/releases/mozilla-beta/raw-file/tip/browser/config/
> tooltool-manifests/macosx64/releng.manifest
> 
> It says it has r161152, which is ok. llvm's pr14088 was introduced in
> r162099.
> 

Leaving it as a nomination on 17 considering the above comment . (lsblakk - looks like we can go ahead an " - " it for 17 )
> Aurora (current 18) on the other hand has this manifest:
> 
> http://hg.mozilla.org/releases/mozilla-aurora/raw-file/tip/browser/config/
> tooltool-manifests/macosx64/releng.manifest
> 
> which says r163716, so it has the bug.
> 
> > Has there been testing that confirms that the new compiler doesn't adversely
> > affect Firefox?
> 
> It is being currently used by nightly.
Is there any need for a QA to investigate this issue? Are the automated tests enough?
Whiteboard: [qa?]

Updated

Last year
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.