Bug 1223872 - Single locale release updates should not race each other
Status: RESOLVED FIXED
Product: Release Engineering
Classification: Other
Component: Balrog: Backend
Version: unspecified
Hardware: All
OS: All
Importance: -- normal
Target Milestone: ---
Assigned To: Varun Joshi (:vjoshi)
QA Contact: Ben Hearsum (:bhearsum)
Mentors:
Depends on: 1224674
Blocks:
Reported: 2015-11-11 09:49 PST by Rail Aliiev [:rail] ⌚️ET
Modified: 2016-08-29 11:04 PDT (History)
CC: 5 users
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
balrog_metrics-tools-4.diff (2.14 KB, patch)
2015-11-13 06:01 PST, Rail Aliiev [:rail] ⌚️ET
no flags
balrog_metrics-tools-5.diff (2.16 KB, patch)
2015-11-13 07:20 PST, Rail Aliiev [:rail] ⌚️ET
catlee: review+
rail: checked-in+

Description Rail Aliiev [:rail] ⌚️ET 2015-11-11 09:49:49 PST
We see a lot of races in funsize similar to this:

1) locale A requests data_version
2) locale B requests data_version
3) locale B updates the release blob
4) locale A fails to update the release blob

They probably don't update the same data in the blob, so we should be able to find a workaround for this.
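
For illustration, each per-locale submission does roughly this (a simplified sketch; the URL and field names are illustrative, not the exact Balrog admin API):

# Simplified sketch of a per-locale funsize submission (illustrative names only).
import requests

API = "https://balrog-admin.example.com/releases/Firefox-45.0a1-nightly"

def submit_locale(platform, locale, build_data):
    # 1) read the release to learn the current data_version
    blob = requests.get(API).json()
    data_version = blob["data_version"]

    # 2) write this locale's data back, asserting the data_version we read.
    #    If another locale's submission landed in between, the server's
    #    optimistic-concurrency check rejects us with HTTP 400.
    resp = requests.put(
        "%s/builds/%s/%s" % (API, platform, locale),
        data={"data_version": data_version, "data": build_data},
    )
    if resp.status_code == 400:
        raise RuntimeError("data_version %s is stale; lost the race" % data_version)
    return resp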
Comment 1 Rail Aliiev [:rail] ⌚️ET 2015-11-13 06:01:25 PST
Created attachment 8687173 [details] [diff] [review]
balrog_metrics-tools-4.diff

Let's start with some basic request metrics for now.

We can walk the TC tasks using the index, fetch their logs, process them with something like https://github.com/etsy/logster (that would require a custom parser) and submit the results to graphite.
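
Something along these lines for the parser side (a rough sketch of the idea, not a working logster plugin):

# Rough sketch: pull per-request elapsed times out of task logs and turn
# them into a handful of graphite-ready metrics.
# (the log line format matched here is made up for illustration)
import re

LINE_RE = re.compile(r"Balrog request .* took (?P<secs>[0-9.]+)s")

def parse_log(lines):
    elapsed = [float(m.group("secs"))
               for m in (LINE_RE.search(line) for line in lines) if m]
    if not elapsed:
        return {}
    return {
        "balrog.requests.count": len(elapsed),
        "balrog.requests.max": max(elapsed),
        "balrog.requests.avg": sum(elapsed) / len(elapsed),
    }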
Comment 2 Rail Aliiev [:rail] ⌚️ET 2015-11-13 06:28:58 PST
BTW, this is an issue for l10n repacks as well now. It all suddenly started on Tue Nov 3, see https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=bc9c6e996006&filter-searchStr=l10n&exclusion_profile=false

I wonder if it is somehow related to the recent DB/webhead migrations...
Comment 3 Rail Aliiev [:rail] ⌚️ET 2015-11-13 06:56:42 PST
According to the webapp/db PHX1-SCL3 dashboard, aus migrated on Monday, Oct 26
Comment 4 Rail Aliiev [:rail] ⌚️ET 2015-11-13 07:20:01 PST
Created attachment 8687210 [details] [diff] [review]
balrog_metrics-tools-5.diff

req.elapsed is not that good from what I see: it's usually less than a second, while the surrounding log lines are 10-15 secs apart. Not sure why; let's go with the old-school way and time the call ourselves.
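
I.e. something like this instead of trusting req.elapsed, which only measures up to the point the response headers are parsed (sketch only, not the exact patch):

# Sketch: wall-clock the whole call ourselves rather than using req.elapsed.
import logging
import time
import requests

log = logging.getLogger(__name__)

def timed_request(method, url, **kwargs):
    start = time.time()
    resp = requests.request(method, url, **kwargs)
    log.info("Balrog request %s %s took %.2fs (status %s)",
             method, url, time.time() - start, resp.status_code)
    return resp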
Comment 5 Rail Aliiev [:rail] ⌚️ET 2015-11-13 07:27:58 PST
Comment on attachment 8687210 [details] [diff] [review]
balrog_metrics-tools-5.diff

https://hg.mozilla.org/build/tools/rev/58900072a047
Comment 6 Rail Aliiev [:rail] ⌚️ET 2015-11-13 12:40:42 PST
Looks like this is related to the aus migration, which according to the dashboard happened on Oct 26.

It starts with the push from the 25th (nightly builds on the 26th):
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=d53a52b39a95&filter-searchStr=update-%20balrog

The whole picture can be seen at this link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=update-&fromchange=f8283eaf5aad

We are interested in jobs ending with "u" (en-USu, 1.1u, 5u, etc.) that have "HTTP 400" errors in the summary.

On Oct 29 we stopped funsize (see https://bugzilla.mozilla.org/show_bug.cgi?id=1219879#c3 and https://bugzilla.mozilla.org/show_bug.cgi?id=1220252#c3)

It was turned back on on Mon, Nov 2 (https://bugzilla.mozilla.org/show_bug.cgi?id=1220857). Nov 2 and 3 were terrible - tons of balrog submission errors.

Sounds like we degraded after the migration, but not sure where: networking, db, TC-to-VPN delays...
Comment 7 Rail Aliiev [:rail] ⌚️ET 2015-11-18 08:15:05 PST
Back to the pool - I don't think that I can look at this until after Mozlando.
Comment 8 OrangeFactor Robot 2015-11-20 17:00:18 PST
16 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-aurora: 16

Platform breakdown:
* windows8-32: 6
* linux32: 6
* windows8-64: 3
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1223872&startday=2015-11-20&endday=2015-11-20&tree=all
Comment 9 OrangeFactor Robot 2015-11-22 17:04:27 PST
83 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-aurora: 83

Platform breakdown:
* windows8-32: 28
* linux32: 22
* linux64: 18
* windows8-64: 15

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1223872&startday=2015-11-16&endday=2015-11-22&tree=all
Comment 10 Ben Hearsum (:bhearsum) 2016-06-10 08:24:38 PDT
Varun is actively working on this!
Comment 11 [github robot] 2016-06-24 07:39:00 PDT
Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/18047698e064bff5af1459323fe943daf6c5175a
bug 1223872: merge blob updates on server when safe to do so (#93). r=bhearsum
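
The gist of the change, roughly (a hypothetical flat-dict sketch of the idea, not the actual Balrog implementation): when a write arrives with a stale data_version, diff the incoming blob against the version it was based on and apply just that change to the current blob, rejecting it only if it collides with what the intervening write touched.

# Hypothetical sketch of "merge when safe": a three-way merge over flat dicts.
#   ancestor - blob the client based its change on
#   current  - blob as it stands now (someone else updated it first)
#   incoming - blob the client is trying to write
def merge_if_safe(ancestor, current, incoming):
    merged = dict(current)
    for key, value in incoming.items():
        if value == ancestor.get(key):
            continue  # the client didn't touch this key; keep the current value
        if current.get(key) not in (ancestor.get(key), value):
            raise ValueError("conflicting change to %r; refusing to merge" % key)
        merged[key] = value  # only the client changed it (or both agree)
    return merged

In the single-locale case each submission only touches its own locale's entry, so concurrent writers merge cleanly instead of one of them failing with HTTP 400.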
Comment 12 OrangeFactor Robot 2016-07-08 18:00:13 PDT
67 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-aurora: 67

Platform breakdown:
* linux64: 33
* linux32: 30
* windowsxp: 3
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1223872&startday=2016-07-08&endday=2016-07-08&tree=all
Comment 13 OrangeFactor Robot 2016-07-10 18:04:25 PDT
67 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-aurora: 67

Platform breakdown:
* linux64: 33
* linux32: 30
* windowsxp: 3
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1223872&startday=2016-07-04&endday=2016-07-10&tree=all
Comment 14 Nick Thomas [:nthomas] 2016-07-10 18:52:55 PDT
This is in production I think, but we still hit some issues around old_data_version over the weekend, e.g.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=bbb29a9b88dd680dbb59577cbe4dc6e58d117100&filter-searchStr=l10n&exclusion_profile=false&selectedJob=4293214
https://treeherder.mozilla.org/logviewer.html#?job_id=4293214&repo=mozilla-central#L24037

Could you investigate, Varun?
Comment 15 Varun Joshi (:vjoshi) 2016-07-10 21:35:25 PDT
(In reply to Nick Thomas [:nthomas] from comment #14)
> This is in production I think, but we still hit some issues around
> old_data_version. eg over the weekend 
> 
> eg
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> central&revision=bbb29a9b88dd680dbb59577cbe4dc6e58d117100&filter-
> searchStr=l10n&exclusion_profile=false&selectedJob=4293214
> https://treeherder.mozilla.org/logviewer.html#?job_id=4293214&repo=mozilla-
> central#L24037
> 
> Could you investigate Varun ?

Yes, I'm on it!
Comment 16 Ben Hearsum (:bhearsum) 2016-07-11 10:46:26 PDT
(In reply to Nick Thomas [:nthomas] from comment #14)
> This is in production I think, but we still hit some issues around
> old_data_version. eg over the weekend 
> 
> eg
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> central&revision=bbb29a9b88dd680dbb59577cbe4dc6e58d117100&filter-
> searchStr=l10n&exclusion_profile=false&selectedJob=4293214
> https://treeherder.mozilla.org/logviewer.html#?job_id=4293214&repo=mozilla-
> central#L24037
> 
> Could you investigate Varun ?

It's a bit confusing right now, actually. CloudOps is running code with this fix in it, but we haven't moved admin traffic over to them yet. The WebOps admin box is still running older code, so this is effectively not in production yet. We should be cutting admin over later this week, so hopefully we'll have this in production sometime next week.
Comment 17 Nick Thomas [:nthomas] 2016-07-11 16:14:02 PDT
That makes more sense. Perhaps we should update the webops admin soon, so that we're not changing hosting and code at the same time when we swap over to CloudOps.
Comment 18 OrangeFactor Robot 2016-07-11 18:00:11 PDT
21 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-aurora: 21

Platform breakdown:
* osx-10-10: 11
* windows8-64: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1223872&startday=2016-07-11&endday=2016-07-11&tree=all
Comment 19 OrangeFactor Robot 2016-07-12 18:00:19 PDT
113 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-aurora: 113

Platform breakdown:
* linux32: 54
* linux64: 36
* osx-10-10: 13
* windowsxp: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1223872&startday=2016-07-12&endday=2016-07-12&tree=all
Comment 20 OrangeFactor Robot 2016-07-13 18:00:14 PDT
66 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-aurora: 66

Platform breakdown:
* linux32: 33
* linux64: 32
* windowsxp: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1223872&startday=2016-07-13&endday=2016-07-13&tree=all
Comment 21 OrangeFactor Robot 2016-07-17 18:03:51 PDT
206 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-aurora: 206

Platform breakdown:
* linux32: 89
* linux64: 71
* osx-10-10: 24
* windowsxp: 12
* windows8-64: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1223872&startday=2016-07-11&endday=2016-07-17&tree=all
Comment 22 Ben Hearsum (:bhearsum) 2016-08-29 11:04:43 PDT
This landed in production a while ago. Thanks Varun!
