Closed Bug 1244803 Opened 8 years ago Closed 8 years ago

HSTS and HPKP preload list periodic updates failed on 30 January 2016

Categories

(Release Engineering :: General, defect)

defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: keeler, Assigned: Cykesiopka)

References

Details

Attachments

(3 files)

The HSTS preload list periodic updater failed this weekend:

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux64/mozilla-central-linux64-periodicupdate-bm73-build1-build0.txt.gz

...
request to 'zehntner.ch' finished
spinning off request to 'smartphone.continental.com ' (remaining retries: 3)
cat: nsSTSPreloadList.errors: No such file or directory
ERROR: xpcshell exited with a non-zero exit code: 3
program finished with exit code 70
elapsedTime=2207.751528
========= master_lag: 0.12 =========
========= Finished 'bash scripts/scripts/periodic_file_updates/periodic_file_updates.sh ...' failed (results: 2, elapsed: 36 mins, 47 secs) (at 2016-01-30 03:40:40.156576) =========
...
So, still no investigation?
It failed again today:

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux64/mozilla-central-linux64-periodicupdate-bm91-build1-build1.txt.gz

Has there been any change that prevents the script from accessing the nsSTSPreloadList.errors file? Maybe wrong path or something like that?
Flags: needinfo?(dkeeler)
Flags: needinfo?(catlee)
Bug 1246761 may be responsible for this.
Flags: needinfo?(dkeeler)
Note that Bug 1246765 is at least partially responsible (an issue with the script itself).

However, the HPKP updates seem to have failed as well. I've confirmed that the script still works locally, so further investigation is still needed.
Summary: HSTS preload list periodic update failed on 30 January 2016 → HSTS and HPKP preload list periodic updates failed on 30 January 2016
Depends on: 1253845
And now we're hosed, and poor Cykesiopka is going to have to manually update the list again, on aurora before Monday or on beta after Monday.

The list currently on aurora expires April 30th, and that branch releases April 19th and is replaced by the next one on June 7th, so not only will HPKP be broken for much of its released life (not my problem), but we'll have xpcshell permaorange on mozilla-release if we need to chemspill 46 (my problem).
Severity: normal → blocker
Is there any significance to the way that on esr38 the failure is "New HSTS preload error list nsSTSPreloadList.errors is empty. I guess that's good?" and "New HSTS preload list nsSTSPreloadList.inc is empty. That's less good.", or is that just a change in the script since then?
(In reply to Phil Ringnalda (:philor) from comment #5)
> And now we're hosed, and poor Cykesiopka is going to have to manually update
> the list again, on aurora before Monday or on beta after Monday.
> 
> The list currently on aurora expires April 30th, and that branch releases
> April 19th and is replaced by the next one on June 7th, so not only will
> HPKP be broken for much of its released life (not my problem), but we'll
> have xpcshell permaorange on mozilla-release if we need to chemspill 46 (my
> problem).

Right. Thanks for the reminder.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a210889b0d8d

This patch is against aurora 46.
It includes a full HPKP update, but only an expiration time update for HSTS because I have no desire to connect to four thousand something random hosts.

I'll ping someone to check this into aurora after the try run comes back green. Or, failing that, I'll get this checked into beta 46.
(In reply to Phil Ringnalda (:philor) from comment #6)
> Is there any significance to the way that on esr38 the failure is "New HSTS
> preload error list nsSTSPreloadList.errors is empty. I guess that's good?"
> and "New HSTS preload list nsSTSPreloadList.inc is empty. That's less
> good.", or is that just a change in the script since then?

I assume you're referring to https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-esr38-linux64/mozilla-esr38-linux64-periodicupdate-bm94-build1-build0.txt.gz, which is for 2016-01-30.

To answer your question: this looks somewhat disturbing, since the log shows that the script runs at least partially through. I'll look into it.
(In reply to :Cykesiopka from comment #9)
> I assume you're referring to
> https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-esr38-
> linux64/mozilla-esr38-linux64-periodicupdate-bm94-build1-build0.txt.gz,
> which is for 2016-01-30.

And also 2016-02-13; after that we got into the same silent failure to download the build and then non-silent failure to resolve stage.m.o, bug 1253845, that we have on other trees as well.
Depends on: 1253958
(In reply to Phil Ringnalda (:philor) from comment #11)
> And also 2016-02-13; after that we got into the same silent failure to
> download the build and then non-silent failure to resolve stage.m.o, bug
> 1253845, that we have on other trees as well.

This turned out to be a separate issue which I filed Bug 1253958 for.
Assigning this to you, please let me know if you're not actively working on it, or there are parts that RelEng need to act on.
Assignee: nobody → cykesiopka.bmo
Flags: needinfo?(catlee)
Bug 1233958 made it to m-c so I'm running a job there now.

By the way, any reason we can't bump
  const MAX_CONCURRENT_REQUESTS = 5;
to something higher ? A value of 20 worked fine on my laptop and was considerably faster than the ~40 minutes we take normally.
(In reply to Nick Thomas [:nthomas] from comment #15)
> Bug 1233958 made it to m-c so I'm running a job there now.

Oops, I meant bug 1253958.
(In reply to Nick Thomas [:nthomas] from comment #16)
> Which resulted in this landing -
> http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=74546cb2ed83

Cool, thanks.
(In reply to Nick Thomas [:nthomas] from comment #15)
> By the way, any reason we can't bump
>   const MAX_CONCURRENT_REQUESTS = 5;
> to something higher ? A value of 20 worked fine on my laptop and was
> considerably faster than the ~40 minutes we take normally.

I think keeler will have to answer this one.
Flags: needinfo?(dkeeler)
(In reply to Nick Thomas [:nthomas] from comment #15)
> By the way, any reason we can't bump
>   const MAX_CONCURRENT_REQUESTS = 5;
> to something higher ? A value of 20 worked fine on my laptop and was
> considerably faster than the ~40 minutes we take normally.

I bet we could increase it to 20. We should keep an eye out for intermittent failures to connect to sites, though (although we do retry each site up to 3 times).
Flags: needinfo?(dkeeler)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #20)
> (In reply to Nick Thomas [:nthomas] from comment #15)
> > By the way, any reason we can't bump
> >   const MAX_CONCURRENT_REQUESTS = 5;
> > to something higher ? A value of 20 worked fine on my laptop and was
> > considerably faster than the ~40 minutes we take normally.
> 
> I bet we could increase it to 20. We should keep an eye out for intermittent
> failures to connect to sites, though (although we do retry each site up to 3
> times).

I filed Bug 1255251 for this.
Score for March 19th:

esr38: success
esr45: success
mozilla-aurora: 1200 seconds without output at the exact same spot as March 10th
mozilla-central: success
(In reply to Phil Ringnalda (:philor) from comment #23)
> Score for March 19th:
> 
> esr38: success
> esr45: success
> mozilla-aurora: 1200 seconds without output at the exact same spot as March
> 10th
> mozilla-central: success

Thanks for keeping track.

I briefly looked into the aurora failures, but didn't see anything obvious.

In particular:
 - There is currently no difference in getHSTSPreloadList.js between m-c and m-a.
 - The last entry in the logs suggest that this is the line the script gets stuck at: https://hg.mozilla.org/releases/mozilla-aurora/file/ff51ee15fa46/security/manager/tools/getHSTSPreloadList.js#l289
   - In particular, zzw.ca is the last on the sorted list to process and it seemed to have been processed successfully, so the script should've moved on to the line of code mentioned above.

I'll look into it more later.
nthomas: Is it possible to get an m-a run done using a debug build (and stderr directed into the log if it isn't already)? I've been trying to reproduce the failure locally, but without success. My hope is that the debug build will also fail, but provide a useful error message.
Flags: needinfo?(nthomas)
I'll try to find some time to run this manually on a build machine.
Here's what happened with mozilla-aurora debug build (rev 11a460c667e4), with 'set -x' in the script too. See 'log' for the combined output of stdout and stderr. I also grabbed the two nsSTSPreloadList files which were in the firefox/ directory.

It looks like the files were written to disk, and then the process hangs. Nothing is happening if I strace the xpcshell process after the fact, but it's using 100% CPU.
Flags: needinfo?(nthomas)
(In reply to Nick Thomas [:nthomas] from comment #27)
> Here's what happened with mozilla-aurora debug build (rev 11a460c667e4),
> with 'set -x' in the script too. See 'log' for the combined output of stdout
> and stderr. I also grabbed the two nsSTSPreloadList files which were in the
> firefox/ directory.
> 
> It looks like the files were written to disk, and then the process hangs.
> Nothing is happening if I strace the xpcshell process after the fact, but
> it's using 100% CPU.

Thanks! I'll try and debug further locally then.
(In reply to Nick Thomas [:nthomas] from comment #27)
> It looks like the files were written to disk, and then the process hangs.
> Nothing is happening if I strace the xpcshell process after the fact, but
> it's using 100% CPU.

Perhaps bug 1247798 is causing this? (although I don't know why it would be specific to mozilla-aurora...)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #29)
> (In reply to Nick Thomas [:nthomas] from comment #27)
> > It looks like the files were written to disk, and then the process hangs.
> > Nothing is happening if I strace the xpcshell process after the fact, but
> > it's using 100% CPU.
> 
> Perhaps bug 1247798 is causing this? (although I don't know why it would be
> specific to mozilla-aurora...)

I looked into this, and I believe you're correct. Basically, during shutdown of the script runs, all the time is being spent in this while loop: https://hg.mozilla.org/releases/mozilla-aurora/file/27f40946fe2d/modules/libpref/prefapi.cpp#l867. Lazy printf debugging suggests that a lot of the entries being looped through are indeed related to Safe Browsing, the URL Classifier etc.
Depends on: 1247798
Here's a proof of concept patch that makes it easy to reproduce this behaviour.
It also includes a potential fix.

Below are some numbers for how long shutdown takes once the script has finished writing out both output files.

// m-a (faa65506b112)
  n  | w/o fix | w/ fix
---------------|--------
2000 | 33s     | -
4000 | 192s    | -
6000 | 689s    | 13s

// m-i (4678a49486b3)
  n  | w/o fix | w/ fix
---------------|--------
2000 | 50s     | -
4000 | 258s    | -
6000 | 1092s   | 14s

Quite the difference. The results here also explain why I wasn't initially able to reproduce (I was trying to save time, and limited the number of entries to a few hundred).

It's very strange how only m-a times out when m-i is worse off. I have no explanation for this at the moment.
I can't think of anything on the RelEng side, eg the timeouts appear to be the same.
Results for 2016-04-02:
 - esr38: No issues.
 - esr45: Some sort of runtime failure during the HPKP run that I can't reproduce locally. If it happens again next week I'll dig deeper.
 - m-a, m-c: Bug 1247798.

(In reply to Nick Thomas [:nthomas] from comment #32)
> I can't think of anything on the RelEng side, eg the timeouts appear to be
> the same.
Well, I guess the results for this week answer that.
Status: NEW → ASSIGNED
Results for 2016-04-16:
 - m-c, m-a, esr38: No issues.
 - esr45: Now also bitten by Bug 1247798.

I will try and get the fix onto esr45 <insert long sigh here>.

(I also wonder how practical it would be to get an m-c build + m-c versions of the HSTS and HPKP scripts updating all relevant branches.)
Certainly less maintenance that way, but IIRC the code consuming the preload lists may support different features across different branches. dkeeler would know.
Is there something about these data sets that require running xpcshell to fetch them? How difficult would it be to reimplement in python?
One difficulty in just using the m-c build to update across all branches is situations like in bug 1255425 where we changed the entire format. This is rare, though, so maybe it's less work than what we're doing now. The other issue is that part of why we use xpcshell is to simulate the browser visiting every site on the list (so that a site is only on the list if that version of the browser correctly processed its HSTS header). If we used m-c to define the list for every branch, there might be a branch that would consider a site HSTS even though it wouldn't normally. This is probably not a huge concern, though, since that code isn't changing much these days. I think it is important that we continue to use xpcshell to simulate the browser connecting to sites, though (rather than using, e.g. python).
The fix for Bug 1247798 was uplifted to ESR45 a few days ago, so m-c, m-a and esr45 are now all fine as of the 2016-05-28 periodic update.

esr38 AIUI is EOL now, so it doesn't matter.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: