The HSTS and HPKP automatic update scripts aren't quite fool-proof yet (see e.g. bug 1092606), and they occasionally break the build. It's particularly a bummer when this happens, since they're currently scheduled to run on Saturdays, when nobody is around. Let's re-schedule for something like Thursday mornings.
What do you want to do, among things that are possible, when (not if) it loses a push race? The reason they run Saturday morning is that nobody wants to write the code to deal with someone else pushing between the time the updater pulls and when it pushes.
The scripts update files that only they touch (i.e. normally no human-initiated commit touches those files). If there's a push between the time the scripts check out and when they check in, in the majority of cases an automatic merge should be successful. If not, they can abandon the attempted changes and send an email that they failed or something.
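The retry-and-bail behavior described above could be sketched roughly as follows. This is a hypothetical illustration, not the real updater: the `run_hg` helper, the retry count, and `push_with_retry` itself are all made up here, and the real scripts would need real conflict detection and email notification.

```python
# Sketch of the push-race handling suggested above: try to push; if someone
# else landed in the meantime, pull, attempt an automatic merge, and abandon
# the changes (notifying someone) if the merge fails or we keep losing races.
# All names here are illustrative, not from the actual update scripts.
import subprocess

MAX_ATTEMPTS = 3

def run_hg(*args, cwd="."):
    """Run an hg command in the given repo, returning True on success."""
    return subprocess.run(["hg", *args], cwd=cwd).returncode == 0

def push_with_retry(repo, run=run_hg, notify_failure=print):
    for _ in range(MAX_ATTEMPTS):
        if run("push", cwd=repo):
            return True            # landed cleanly, no race
        # Someone pushed in between: pull their changes and try to merge.
        run("pull", cwd=repo)
        if not run("merge", cwd=repo):
            break                  # merge conflict: give up
        run("commit", "-m", "Merge after losing a push race", cwd=repo)
    # Abandon the attempted changes and let a human know.
    run("update", "--clean", cwd=repo)
    notify_failure(repo)
    return False
```

Since the scripts normally touch files nobody else edits, the `hg merge` step should succeed in the common case; the conflict branch should be rare.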
Can we run these every day instead?
keeler: See comment 3.
Sure, we could (note that the HSTS updater takes on the order of an hour (the HPKP updater is faster)).
OOC, why do they check in this error log? https://dxr.mozilla.org/mozilla-central/source/security/manager/ssl/StaticHPKPins.errors Would those be better as logs in the Treeherder job?
Yeah - as long as they're accessible somewhere, I don't think we need to check (either of) the error logs in.
Summary: reschedule HSTS and HPKP automatic updates for Thursday mornings (PST) or something → reschedule HSTS and HPKP automatic updates to run daily, and be visible on treeherder
Component: General Automation → Buildduty
QA Contact: catlee → bugspam.Callek
That's a bit of a problem with our plans for autoland, in that we never want actual merges when we "merge" autoland to m-c, which requires that there not be anything on m-c which isn't on autoland below the merge point. Having this land on m-c every day would require that it happen at a time when a sheriff is available to merge it to autoland, and then autoland couldn't be merged back until that push, or whatever push above it has backed out everything busted, had finished PGO builds. We could half-ass around it by just having actual merges from autoland for a while, until the ocean boils and there are hardly any pushes going to mozilla-inbound, at which point this could be switched to push there without much fear of push races. But the ideal would be either teaching this how to deal with push races or, even prettier, teaching it to do whatever it would take to let autoland do its landing for it.
Could you please tell me on which server I can see the logs that are being generated each Saturday?
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux64/ https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-aurora-linux64/ https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-esr45-linux64/ (they're at the bottom - search for "periodicupdate")
Andrei, you asked earlier in the day about updating treeherder so these jobs appear. I think you need to write a patch to include these jobs here: github.com:mozilla/treeherder-service.git, treeherder/etl/buildbot.py, and update the tests as well (tests/etl/test_buildbot.py). You can probably ask questions in #treeherder if you need more details.
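For the treeherder side, the kind of change needed in treeherder/etl/buildbot.py is a buildername-to-job mapping. The sketch below is hypothetical (I haven't checked the file's actual structure, and the `pfu` job symbol is invented); the buildername format matches the one seen in the buildbot logs, e.g. "Linux x86-64 mozilla-aurora periodic file update".

```python
# Illustrative sketch of a buildername -> job-type mapping in the spirit of
# treeherder/etl/buildbot.py. The regex, function name, and "pfu" symbol are
# assumptions for illustration, not the real treeherder code.
import re

PERIODIC_UPDATE_RE = re.compile(r"periodic file update", re.IGNORECASE)

def extract_job_type(buildername):
    """Classify a buildbot buildername into a treeherder job type."""
    if PERIODIC_UPDATE_RE.search(buildername):
        return {"name": "periodic file update", "job_symbol": "pfu"}
    return {"name": "unknown", "job_symbol": "?"}
```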
Assignee: nobody → aobreja
Created attachment 8798857 [details] [diff] [review] bug1093196_b_custom_notifications.patch The patch for the buildbotcustom repository; it should reschedule the HSTS and HPKP automatic updates to run daily at 3:02 AM.
Attachment #8798857 - Flags: review?(kmoir)
Created attachment 8798858 [details] [diff] [review] bug1093196_treeherder.patch Patch that should make these jobs visible on treeherder. Thanks Kim for the hint.
That's not going to make it visible on treeherder, because just like tbpl before it, what treeherder really is is "a list of pushes, and the jobs that are pending/running/finished on them" rather than "a list of jobs, some of them associated with pushes." Unlike pretty much everything else we run (release tagging in the non-release-promotion world is the only other direct parallel), the periodic update job doesn't start out with a revision that it ran on; it either creates a revision if it succeeds, or doesn't create one if it fails in any way. Even if you altered the script to capture the revision it creates by pushing, and altered the job's final step to suddenly run on that revision rather than on no revision (I have no idea about the feasibility of doing that), you're still only going to be able to usefully make it visible on treeherder when it succeeds, and not when it fails, since when it fails there's no revision where it makes the tiniest bit of sense to display it.
Comment on attachment 8798858 [details] [diff] [review] bug1093196_treeherder.patch Usually I get someone from the treeherder team to review these requests. (edmorley) Also, usually I create a pull request against their github repo which allows you to run the tests and see if they pass.
Attachment #8798857 - Flags: checked-in+
Comment on attachment 8798857 [details] [diff] [review] bug1093196_b_custom_notifications.patch http://hg.mozilla.org/build/buildbotcustom/rev/3fbd15421d2a to remove the accidental commit of misc.py.orig in 3d94b2506858, http://hg.mozilla.org/build/buildbotcustom/rev/c3eb75097fe3 to merge to production.
Note that this still doesn't solve the fundamental problem that caused the recent public stir, namely that changes to our release cycle have invalidated the assumption that skipping these updates on Beta is OK because we release often enough that users will get a newer version before we run out of time. Running every day on Trunk/Aurora isn't going to magically change the fact that we can go 7+ weeks without an update after we go to Beta, and that we sometimes throttle updates to release users for multiple weeks. We still need a better story for not cutting it so close with the expiration date if we want to avoid ever getting stuck in this situation again.
(And note that I already had to do a manual update to the expiration time for Fx50 to avoid the same problem happening again months after the last time)
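The expiration-margin concern above amounts to simple date arithmetic: the preload data baked into a build has to stay valid for the longest stretch users can go without picking up a new build. A back-of-the-envelope sketch, with illustrative numbers only (the 3-week throttle and 2-week margin are assumptions, not policy):

```python
# Illustrative check that a preload list's validity window covers the
# worst-case gap between updates. Numbers are examples from this thread
# and assumptions, not the actual release policy.
from datetime import timedelta

beta_cycle = timedelta(weeks=7)          # "7+ weeks without an update" on Beta
release_throttle = timedelta(weeks=3)    # updates throttled "multiple weeks" (assumed 3)
safety_margin = timedelta(weeks=2)       # assumed buffer

REQUIRED_VALIDITY = beta_cycle + release_throttle + safety_margin  # 12 weeks

def expiry_is_safe(preload_validity):
    """True if the baked-in expiration outlasts the worst-case update gap."""
    return preload_validity >= REQUIRED_VALIDITY
```

Under these assumptions, a validity window shorter than about 12 weeks is exactly the "cutting it too close" situation the comment warns about.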
Created attachment 8804224 [details] periodic_file_update job.PNG By checking the latest logs from (1), I found that we don't have a revision number, which should be set in set_script_properties (see (2), step 8). script_repo_revision gets a value in (2) at step 6 (get_script_repo_revision), but that value is not copied into revision at step 8, so revision ends up empty. (3) is an example of what runs in set_script_properties, where the revision number should be set. My guess is that we should force revision to take the value of script_repo_revision. (1) https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-aurora-linux64/ (2) http://buildbot-master72.bb.releng.usw2.mozilla.com:8001/builders/Linux%20x86-64%20mozilla-aurora%20periodic%20file%20update/builds/4 (3) http://buildbot-master72.bb.releng.usw2.mozilla.com:8001/builders/Linux%20x86-64%20mozilla-aurora%20periodic%20file%20update/builds/4/steps/set_script_properties/logs/stdio
The set_script_properties step is defined here: (4). To set the revision number we could also reuse this part: (5) (lines 52-58). (4) http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py (5) http://hg/build/tools/file/tip/scripts/valgrind/valgrind.sh
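The fix comment 26 guesses at, stripped of buildbot plumbing, is just copying one build property into another. A minimal sketch (the `copy_revision` helper is hypothetical; only the property names `script_repo_revision` and `revision` come from the build log):

```python
# Sketch of forcing "revision" to take the value of "script_repo_revision",
# as suggested above, so treeherder has a revision to hang the job on.
# copy_revision is an illustrative helper, not real buildbotcustom code.
def copy_revision(properties):
    """Fill in the revision property from script_repo_revision if missing."""
    if "revision" not in properties and "script_repo_revision" in properties:
        properties["revision"] = properties["script_repo_revision"]
    return properties
```

In real buildbot code this would happen in (or right after) the set_script_properties step, against the build's properties object rather than a plain dict.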
Depends on: 1317254
Andrei, what's the current status of this bug? Perhaps we can discuss in the standup tomorrow morning so we can get it unblocked
I guess this was intended for Andrei, so shifting the ni request to him :)
Flags: needinfo?(aselagea) → needinfo?(aobreja)
I talked to Andrei about this in our standup and he said that the job runs once a day. The problem is that the script doesn't reference a revision, so it doesn't appear on treeherder.
Note explaining the priority level: P5 doesn't mean we've lowered the priority; quite the contrary. However, we're aligning these levels to the buildduty quarterly deliverables, where P1-P3 are taken by our daily waterline KTLO operational tasks.
Priority: -- → P5
Can we close this now that this work has landed?
Yes, this was fixed by scheduling via taskcluster (so the buildbot job has a revision to work from). Bug 1402457
Status: NEW → RESOLVED
Last Resolved: 5 months ago
Resolution: --- → FIXED
(In reply to Justin Wood (:Callek) from comment #26) > Yes, this was fixed by scheduling via taskcluster (so the buildbot job has a > revision to work from). Bug 1402457 Awesome, thanks!