Use version 2 of the json-pushes API and '&startID=N' rather than '&fromchange=SHA' for polling the pushlog

RESOLVED FIXED

Status

Tree Management
Treeherder: Data Ingestion
P3
normal
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: emorley, Assigned: camd)

Tracking

(Depends on: 1 bug)

Details

Attachments

(1 attachment)

PR
46 bytes, text/x-github-pull-request
emorley
: review+
Details | Review | Splinter Review
(Reporter)

Description

4 years ago
According to jeads, using &fromchange=REV with json-pushes requires an additional join (compared to &startID=N), which is presumably what is contributing to the higher rates of timeout accessing json-pushes than with TBPL.

We should either:

a) fix json-pushes to not time out (jeads mentioned wanting to find out if it was due to deadlocks etc) / not need the additional join

b) switch back to using startID=N (at least short term), since this is what TBPL has been using successfully for years, and works "just fine". Note this will mean that we'll need to go back to having a manual cache clear when repos are reset, until the pushlog bug 1065771 is fixed.
(Reporter)

Comment 1

4 years ago
https://hg.mozilla.org/hgcustom/version-control-tools/file/default/hgext/pushlog-legacy/pushlog-feed.py#l102

   102             elif self.querystart == QueryType.CHANGESET:
   103                 where.append("id > (select c.pushid from changesets c where c.node = :start_node)")
   104                 params['start_node'] = hex(self.repo.lookup(self.querystart_value))
   105             elif self.querystart == QueryType.PUSHID:
   106                 where.append("id > :start_id")
   107                 params['start_id'] = self.querystart_value
(In reply to Ed Morley [:edmorley] from comment #0)
> According to jeads, using &fromchange=REV with json-pushes requires an
> additional join (compared to &startID=N), which is presumably what is
> contributing to the higher rates of timeout accessing json-pushes than with
> TBPL.
> 
> We should either:
> 
> a) fix json-pushes to not time out (jeads mentioned wanting to find out if
> it was due to deadlocks etc) / not need the additional join

Hal, how can we get access to the weblogs for hgweb?
Flags: needinfo?(hwine)
(In reply to Jonathan Griffin (:jgriffin) from comment #3)
> Hal, how can we get access to the weblogs for hgweb?

For now, ask for them in #vcs, and specify (roughly) how much data you want (# weeks) We'll pull them off the 10 hosts and make them available.
Flags: needinfo?(hwine)
(Reporter)

Comment 5

4 years ago
Re:
19:32 <jgriffin> Can someone help retrieve weblogs for hgweb queries in the format http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=5d6ec4dddf14 ?
19:32 <jgriffin> We’re seeing lots of timeouts for those queries in Treeherder and are trying to determine the cause
19:33 <•fubar> jgriffin: how far back?
19:36 <jgriffin> fubar: not far, just a few days worth would be fine
19:36 <•fubar> jgriffin: np. m-c specifically, or any tree?
19:37 <jgriffin> fubar: any/all trees
19:37 <jgriffin> the problem we’re seeing doesn’t seem to be tree-specific
20:10 <•fubar> jgriffin: hg-pushloghtml.txt in your homedir on people
20:10 <jgriffin> fubar: tyvm!

The pushloghtml is not what TBPL/treeherder use, that's json-pushes.

The requests of interest are:

http://hg.mozilla.org/*/json-pushes?full=1&fromchange=*
(Reporter)

Comment 6

4 years ago
When we do time out requesting json-pushes, how quickly do we retry? Or more, by how long does it delay ingestion? (Just trying to set priority for this bug)
Flags: needinfo?(mdoglio)
A single timeout in the connection to json-pushes will delay the ingestion by about a minute. But that's probably not very relevant; if that timeout is due to the json-pushes service being under heavy load, we will likely see a series of timeouts, not just one. That could result in several minutes of ingestion delay, it's very hard to tell precisely.
Flags: needinfo?(mdoglio)
I forgot to mention, the pushlog ingestion is per repo. That means a series of timeouts in one repo will not affect the other ones.
(Reporter)

Comment 9

4 years ago
Thank you - if we retry after 1 minute, I think this is a P2 given it will require digging more into the logs and the lower hanging (and higher priority) P1s that are already filed.
Priority: P1 → P2
(Reporter)

Updated

4 years ago
Depends on: 1079788
(Reporter)

Comment 10

4 years ago
I've broken the stats gathering out to bug 1079788, since this bug is about mitigation steps we can take on treeherder's side (whereas I imagine we'll need pushlog changes too).
(Reporter)

Updated

4 years ago
Blocks: 1080757
No longer blocks: 1059400
(Reporter)

Updated

4 years ago
Blocks: 1096919
(Reporter)

Comment 11

4 years ago
Adjusting summary, since bug 1104374 will need us to use startID anyway.
Depends on: 1065771
No longer depends on: 1079788
Summary: Fix the json-pushes timeouts when using 'fromchange' or else switch back to 'startID' → Switch back to using '&startID=N' when performing incremental json-pushes requests
(Reporter)

Updated

4 years ago
Blocks: 1104374
(Reporter)

Updated

4 years ago
No longer blocks: 1080757
Component: Treeherder → Treeherder: Data Ingestion
Bug 1065771 is now in production. URLs like https://hg.mozilla.org/mozilla-central/json-pushes?version=2 now work. Adding the |version=2| querystring param will change the output format of the JSON slightly. A new top-level "lastpushid" property is now present. You can use that to track push offsets and to detect if/when the pushlog is reset (the value would decrease if the database is reset).
(Reporter)

Updated

4 years ago
Depends on: 1114843
(Reporter)

Updated

4 years ago
Assignee: nobody → emorley
(Reporter)

Updated

3 years ago
Status: NEW → ASSIGNED
(Reporter)

Updated

3 years ago
Priority: P2 → P3
(Reporter)

Updated

3 years ago
Assignee: emorley → nobody
Status: ASSIGNED → NEW
(Reporter)

Updated

3 years ago
Summary: Switch back to using '&startID=N' when performing incremental json-pushes requests → Use version 2 of the json-pushes API and '&startID=N' rather than '&fromchange=SHA' for polling the pushlog
(Reporter)

Comment 14

3 years ago
(ignore the startdate params; it's 'lastpushid' that is the relevant part there)
(Assignee)

Updated

3 years ago
Assignee: nobody → cdawson
(Assignee)

Comment 15

3 years ago
Note from GPS:

Polling pushlog every 10-15 seconds should be fine. TaskCluster does something in that range. Although one thing they do that I would appreciate you don't do is queue and fire 20+ HTTP requests nearly simultaneously. That can overload the servers. Staggering requests is the way to go.
(Assignee)

Comment 16

3 years ago
Created attachment 8651205 [details] [review]
PR
Attachment #8651205 - Flags: review?(emorley)
(Assignee)

Comment 17

3 years ago
Hi Ed-- I know you already looked this over.  But I make a couple changes (including removing the change to the poll interval, as you suggested).  So please have a look at your convenience.

Thanks!  :)
(Reporter)

Comment 18

3 years ago
Comment on attachment 8651205 [details] [review]
PR

Thank you for doing this :-)
Attachment #8651205 - Flags: review?(emorley) → review+
(Assignee)

Updated

3 years ago
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.