Closed Bug 770811 Opened 12 years ago Closed 10 years ago

Recurring pushlog issues with hg.mozilla.org/try

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: smaug, Assigned: bkero)

References

(Blocks 1 open bug)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/786] )

+++ This bug was initially created as a clone of Bug #767657 +++

https://hg.mozilla.org/try/ won't load and https://tbpl.mozilla.org/?tree=Try&rev=<some changeset> loads only headers.


It seems the problem is there again.
Whiteboard: [TreeClosure][Workaround in comment 21]
[root@hgssh1.dmz.scl3 try]# hg heads | wc -l
3591

That's a lot of heads :| This comes back to the vicious cycle of us misusing Mercurial (in the way we use try) and the Mercurial bug in diff calculation. This will only be permanently fixed upstream, but that patch hasn't made it into any releases because, last we heard, it was failing unit tests.

The fastest fix for this right now is to reset try.
Assignee: server-ops-devservices → shyam
That heads count is actually incorrect (since hg heads spits out several lines of information per head).

This is more accurate :

[root@hgssh1.dmz.scl3 try]# hg heads|grep ^changeset|wc -l
676

And with this many heads, we shouldn't be seeing this issue... but we are.
Or shorter to type: hg heads -q | wc -l

Added url to some recent try repo stats (may not stay up forever)

At the moment, there is no script running to try and reduce the number of heads; as Shyam notes, based on past experience this is 'too few' heads to require that.
Number of heads doesn't matter.  This problem is solely caused by a changeset referencing a parent that is hugely old, the diff calculation is exponential.
(In reply to Ben Kero [:bkero] from comment #4)
> Number of heads doesn't matter.  This problem is solely caused by a
> changeset referencing a parent that is hugely old, the diff calculation is
> exponential.

Soooo, if this is true it's an issue that could/will affect our main repos at some point? If so, we *need* to solve this somehow, even if it's a change in how we configure/use hg. It's already borderline tree-closure worthy for just try.
iirc, Ben noted in another bug that this "exponential depth calculation" only affects http access - if you manage your repos via ssh, it's not an issue.

Another approach would be to uncover _why_ we're referencing old changesets - it's a relatively sudden change - it may be related to one feature/project/bug (and a biweekly scheduled reset for project duration may be acceptable).
(In reply to comment #5)
> (In reply to Ben Kero [:bkero] from comment #4)
> > Number of heads doesn't matter.  This problem is solely caused by a
> > changeset referencing a parent that is hugely old, the diff calculation is
> > exponential.
> 
> Soooo, if this is true its an issue that could/will affect our main repos at
> some point? If so we *need* to solve this somehow, even if its a change in how
> we configure/use hg. Its already borderline tree closure worthy for just try.

Yes, as far as I can tell, all it takes is for someone to back out an old changeset and then merge to tip, or something to that effect.  That seems _very very_ bad...
We (dev services and releng) had a quick meeting about this today and have come up with a few points:

1) Logging of ssh connections on the load balancer, to try and find pushes that are taking a long time
2) bkero will file an upstream hg bug about the issue mentioned in comment #4 and see what the hg devs can help with
3) hwine will try to come back with some ideas on how we can pull some data from current and older try repos to try and find patterns that cause the issues mentioned in comment #4

We'll keep this bug updated. I'm not sure this is going to be an easy or fast one to fix, but we're working on it :) Thanks a lot for your patience!
Severity: normal → blocker
Summary: hg try repo is broken → TBPL can't load try
> I'm not sure this is going to be an easy or fast one to fix, but we're working on it :)

Can we please strip the hg repository so things get back into a working state?  You can clone the repository so you can work on a Real Fix, but in the meantime, we're all blocked on this.
Back to normal, what was done/found today (THANKS SHYAM):

* We identified that https:// was timing out at ~30 seconds, and http:// loads were taking ~1.5 minutes (e.g. hg.mozilla.org/try/json-pushes?full=1&maxhours=24 )
* Finding out what is causing the https:// timeout is not something Shyam can solve quickly after midnight on a Saturday.
* TBPL uses https:// and pulls the JSON over https://
* We had a few aborted pushes clogging up the pushlog, which caused the JSON output to return errors.
* Shyam cleared those aborted pushes from the DB
* Philor got TBPL to pull the JSON over http://

NOW we are able to load the try pages fine [for now]; we still have the "what the heck takes so long" and "fix the https:// timeout" issues to solve, though.
Severity: blocker → normal
Depends on: 771849
I didn't change tbpl, other than locally.

Interesting fact, though: immediately after fox2mike fixed pushlog, I was able to load &rev=something URLs just fine, but now they are back where they were, and where they were was apparently hitting that 30 second connection reset. My local tbpl, pulling json-pushes over http, does fine at loading them (as long as you are patient), in 90 seconds. Or 87 seconds, or 92 seconds, but not sometimes that and other times 60, or 45 - within two or three seconds of 90 seconds for every rev I've tried. I wouldn't make a fuss about that, except that it turned out the 30 second connection reset over https was significant, so maybe somehow the 90 second success over http is, too.

Hacking around this for tbpl in bug 771849, but, 90 seconds‽
Depends on: 771852
Blocks: 771849
No longer depends on: 771849
(In reply to Phil Ringnalda (:philor) from comment #11)

> Hacking around this for tbpl in bug 771849, but, 90 seconds‽

Yeah, this is the issue mentioned in comment #4. Changing the title of this bug isn't correct, as this is still very much an issue with try. tbpl being affected by it is fallout, and moving things around in tbpl is simply a workaround to unblock devs who depend quite a bit on tbpl.
Summary: TBPL can't load try → Recurring issues with hg.mozilla.org/try
Well, I do get surprised quite often, but I would be surprised if that explains why loading http://hg.mozilla.org/try/json-pushes?full=1&maxhours=24 takes a second or two, and loading http://hg.mozilla.org/try/json-pushes?full=1&changeset=c1a427dbcd07 which is a changeset contained within the ten changesets in that maxhours=24 URL and which does not have an old parent takes 90 seconds, or 89 seconds, or 92 seconds (unless we're caching that first one, and it did take someone 90 seconds to load before I came along and got it for free in a second or two).
(In reply to Phil Ringnalda (:philor) from comment #13)
> Well, I do get surprised quite often, but I would be surprised if that
> explains why loading
> http://hg.mozilla.org/try/json-pushes?full=1&maxhours=24 takes a second or
> two, and loading
> http://hg.mozilla.org/try/json-pushes?full=1&changeset=c1a427dbcd07 which is
> a changeset contained within the ten changesets in that maxhours=24 URL and
> which does not have an old parent takes 90 seconds, or 89 seconds, or 92
> seconds (unless we're caching that first one, and it did take someone 90
> seconds to load before I came along and got it for free in a second or two).

No, we don't do any caching whatsoever (yet) and this is really interesting. I'll poke some more.
ted, any ideas on comment #13?
Depends on: 772819
This is surfacing frequently on #developers.

Do we have any idea why the recent Try repo resets haven't fixed this?
We've never had anything like this before persist after a reset, so what has changed?
Severity: normal → critical
Got that again yesterday.
This is happening all the time.
I've started saying '9%' for 'TBPL link'; too bad, as they are a really important part of my workflow.
(In reply to Phil Ringnalda (:philor) from comment #13)
> http://hg.mozilla.org/try/json-pushes?full=1&maxhours=24
(maxhours is not a valid query parameter here, just FWIW)

Should result in a SQL query that looks like:
SELECT id, user, date FROM pushlog ORDER BY date DESC LIMIT 10 OFFSET 0

then one query per push id returned, like:
SELECT node FROM changesets WHERE pushid = ? ORDER BY rev DESC

> http://hg.mozilla.org/try/json-pushes?full=1&changeset=c1a427dbcd07

This should result in a query that looks like:
SELECT id, user, date, node from pushlog LEFT JOIN changesets ON id = pushid WHERE id = (select c.pushid from changesets c where c.node = <...>) ORDER BY id DESC, rev DESC

Could be the subquery there that kills it, I am terrible with databases.
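
For anyone who wants to poke at this locally, here is a minimal timing sketch for the two query shapes described above, assuming a local copy of the pushlog SQLite database (the file name is an example, and the changeset hash is the full 40-character form of the one from comment 13):

import sqlite3, time

# Time the two query shapes against a local copy of the pushlog database.
conn = sqlite3.connect("pushlog2.db")

def timed(sql, params=()):
    start = time.time()
    rows = conn.execute(sql, params).fetchall()
    print("%.3fs  %4d rows  %s..." % (time.time() - start, len(rows), sql[:50]))

# json-pushes?full=1 (latest pushes)
timed("SELECT id, user, date FROM pushlog ORDER BY date DESC LIMIT 10 OFFSET 0")

# json-pushes?full=1&changeset=<node> (node stored as the full 40-char hash)
timed("SELECT id, user, date, node FROM pushlog "
      "LEFT JOIN changesets ON id = pushid "
      "WHERE id = (SELECT c.pushid FROM changesets c WHERE c.node = ?) "
      "ORDER BY id DESC, rev DESC",
      ("c1a427dbcd07960c0590ebdfef22dc9cf4195e37",))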
CC'ing our DBAs for input on comment #21. This is an sqlite DB.
SQLite uses reader/writer locks on the entire database file, so one writer can block many readers. Does TBPL access the SQLite database from multiple processes?

https://www.sqlite.org/whentouse.html
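
To make that failure mode concrete, here is a small sketch (run against a local copy of the pushlog database; file name and sleep time are illustrative) showing how a writer holding SQLite's whole-file exclusive lock makes a concurrent reader error out:

import sqlite3, threading, time

DB = "pushlog2.db"  # assumption: a local copy of the pushlog database

def writer():
    # isolation_level=None lets us manage the transaction by hand
    conn = sqlite3.connect(DB, isolation_level=None)
    conn.execute("BEGIN EXCLUSIVE")   # the same whole-file lock a commit takes
    time.sleep(5)                     # simulate a slow write (e.g. a push hook)
    conn.execute("ROLLBACK")

def reader():
    conn = sqlite3.connect(DB, timeout=1)   # give up after 1s instead of hanging
    try:
        conn.execute("SELECT COUNT(*) FROM pushlog").fetchone()
        print("read succeeded")
    except sqlite3.OperationalError as e:
        print("reader blocked:", e)         # expect "database is locked"

t = threading.Thread(target=writer)
t.start()
time.sleep(0.5)
reader()
t.join()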
(In reply to Chris Peterson (:cpeterson) from comment #23)
> SQLite uses reader/writer locks on the entire database file, so one writer
> can block many readers. Does the TPBL access the sqlite database from
> multiple processes?
> 
> https://www.sqlite.org/whentouse.html

I'm not sure tbpl accesses this DB directly; this is only accessed by hg.m.o IIRC. I'd be super surprised if tbpl had access to this one.
(In reply to Shyam Mani [:fox2mike] from comment #24)
> I'm not sure tbpl accesses this DB directly, this is only accessed by hg.m.o
> IIRC. I'd be super surprised if tbpl had access to this one.

The same question is still valid though, just a different process. Does more than one process on the host where this db lives ever access it at the same time? If so, the whole-DB lock will hurt us.
(In reply to Johnny Stenback (:jst, jst@mozilla.com) from comment #25)
> The same question is still valid though, just a different process. Do more
> than one process on the host where this db lives ever access it at the same
> time? If so, the whole db lock will hurt us.

Ted's the best person to answer that. This is primarily accessed by the pushlog stuff we've built over hg (which is installed as a global hook IIRC).
Yes, the database can be accessed by multiple processes. Each push to the repository will trigger the hook which inserts into the db. Each query to the pushlog web command will query the db.

But per comment 13, some queries seem to be taking longer than others, and it looks like TBPL is using those queries. It may be just bad schema/index design in the DB.
(In reply to Ted Mielczarek [:ted] from comment #21)

> Could be the subquery there that kills it, I am terrible with databases.

Ted, I haven't finished looking into this yet, just reading the ticket, but let's say it is - do we have control over that part, or would it be more like "make a bug for this hg tool and have those developers figure out a solution?"

If it comes to it, can we figure out a solution and push a patch to them?
I've been informed that my comment 28 is irrelevant, as we wrote the addon to hg. :D
The pushlog is all in-house code, we can change it as needed:
http://hg.mozilla.org/hgcustom/hghooks/file/38218ddfda34/mozhghooks/pushlog.py is the hook that does the inserts, and
http://hg.mozilla.org/hgcustom/pushlog/file/e99a36d3fd4a/pushlog-feed.py is the web bit that does the queries

I wrote most of it, and I will freely admit to being bad at database design, so there may be some obvious fix here.
Subqueries are haaaard so I would be willing to bet that's a culprit. I'm reading up on the sqlite query optimizer now at http://www.sqlite.org/optoverview.html
Another issue we may be running into:

"SQLite stores the entire database in a single disk file" - and the indexes are stored in the same file

If tbpl has to read the beginning and end of the file and go back and forth constantly, disk seeking could be what's taking so long.
In comment 21, in the subquery "select c.pushid from changesets c where c.node = <…>" does that turn out to be "where c.node='c1a427dbcd07' " or something else? I ask because there also may be optimization differences between using a number (like push_id) and a string.
Not sure if this is of any use, but I've been logging the response body, headers and time of http://hg.mozilla.org/try.

32% of the requests finish in about 1s. The rest take around 90s.

There are several cases where a 1s response is identical to a 90s response in everything except Date and ETag header.
Also, I believe that http://hg.mozilla.org/try bypasses the pushlog completely.
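
For reference, a logging loop along these lines (the URL and interval are just examples) is enough to capture the fast-vs-slow split described above, including the Date/ETag headers:

import time, urllib.request

URL = "http://hg.mozilla.org/try/"   # or any of the slow json-pushes URLs above

while True:
    start = time.time()
    try:
        resp = urllib.request.urlopen(URL, timeout=300)
        body = resp.read()
        print("%6.1fs  %7d bytes  Date=%s  ETag=%s" % (
            time.time() - start, len(body),
            resp.headers.get("Date"), resp.headers.get("ETag")))
    except Exception as e:
        print("%6.1fs  error: %s" % (time.time() - start, e))
    time.sleep(60)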
Bug 774862 may be related here. We've been getting sporadic 500 errors from try recently. Our polling code specifically looks for "unknown revision" in the error message if it gets a 500 error. If the poller sees "unknown revision" it assumes that the repository has been reset, and so resets the poller's internal state. This results in re-triggering builds that already ran.

So something is causing hg / try / pushlog to think a previous revision doesn't exist for a brief moment in time.
Here is something that may be of interest:

I just tried to push a few csets (pushlog 143411e132bf) to try. It was waiting here for a minute or two:

kats@kgupta-air mozilla$ hg push -f try -r tip
pushing to ssh://hg.mozilla.org/try/
searching for changes

So I hit ctrl-c to abort it. Immediately I got this output:

^Cinterrupted!
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 3 changesets with 2 changes to 1 files (+1 heads)
remote: Looks like you used try syntax, going ahead with the push.
remote: If you don't get what you expected, check http://trychooser.pub.build.mozilla.org/ for help with building your trychooser request.
remote: Thanks for helping save resources, you're the best!
remote: Killed by signal 2.

So the push itself worked (because if I push again I get the "no changes found" output from hg), but my push isn't showing up on https://tbpl.mozilla.org/?tree=Try

This seems to imply that the post-commit hook is taking forever, while the push itself worked from an hg point of view. This is likely due to database contention. Also this leads to a question: will tests be run on this half-pushed cset? I assume not, because it seems saner to use the pushlog database for triggering tests, but I could be wrong. If tests are currently running on my push then they are just eating up machine resources uselessly, because I don't know how I can get the output of those tests.
Like Chris (comment #34), I've been logging times & response body, but from a different location (in the dmz, from host github-sync1-dev). Overnight, 43 of 66 requests resulted in a dropped connection (curl error 18 after 41 wall clock seconds - I suspect this is due to different rules in the load balancer or other network configuration).

Of the 23 successful GET requests, 22 completed in <2.5 seconds, with 18 in <2 seconds.
A subsequent push (that took a while) included this output:

remote: waiting for lock on repository /repo/hg/mozilla/try/ held by 'hgssh1.dmz.scl3.mozilla.com:14954'

Not sure if that's useful at all.
I don't find any lockfiles in try hg metadata anymore; someone must have been doing a push while you were doing one.
Also, something we've started noticing recently :

fox2mike@woodpecker ~ $ time curl http://hg.mozilla.org/try/raw-file/16ae7d0bedf4/testing/talos/talos.json
{
    "talos.zip": {
        "url": "http://build.mozilla.org/talos/zips/talos.dd8466b5d8d0.zip",
        "path": ""
    }
}

real	2m52.963s
user	0m0.012s
sys	0m0.000s

Just that raw-file is taking over 3 mins to process..and dump out.
(In reply to Shyam Mani [:fox2mike] from comment #41)
> Also, something we've started noticing recently :
> 
> fox2mike@woodpecker ~ $ time curl
> http://hg.mozilla.org/try/raw-file/16ae7d0bedf4/testing/talos/talos.json
> {
>     "talos.zip": {
>         "url": "http://build.mozilla.org/talos/zips/talos.dd8466b5d8d0.zip",
>         "path": ""
>     }
> }
> 
> real	2m52.963s
> user	0m0.012s
> sys	0m0.000s
> 
> Just that raw-file is taking over 3 mins to process..and dump out.

FYI Joel/ATeam; (sent from my phone - due to no power).
(In reply to Ted Mielczarek [:ted] from comment #21)
> (In reply to Phil Ringnalda (:philor) from comment #13)
> > http://hg.mozilla.org/try/json-pushes?full=1&maxhours=24
> (maxhours is not a valid query parameter here, just FWIW)
> 
> Should result in a SQL query that looks like:
> SELECT id, user, date FROM pushlog ORDER BY date DESC LIMIT 10 OFFSET 0

Should be basically free if there's an index on date.

> then one query per push id returned, like:
> SELECT node FROM changesets WHERE pushid = ? ORDER BY rev DESC

Should be basically free if there's an index on pushid.  (If there are hundreds or thousands of rows with the same pushid, you'd want an index on (pushid, rev).)

> > http://hg.mozilla.org/try/json-pushes?full=1&changeset=c1a427dbcd07
> 
> This should result in a query that looks like:
> SELECT id, user, date, node from pushlog LEFT JOIN changesets ON id = pushid
> WHERE id = (select c.pushid from changesets c where c.node = <...>) ORDER BY
> id DESC, rev DESC

Even if this query is run naively, populating a temporary table with the results of the subquery and then joining the outer query to it, it shouldn't be too slow -- as long as there are proper indexes.  On changesets, you'd need an index on node for the subquery.  Then on pushlog, you'd need an index on id to satisfy the WHERE (ideally on (id, rev) if there are lots of rows returned, to avoid a sort).  Then on changesets, you'd also need an index on pushid for the join.  All these indexes should exist, according to <http://hg.mozilla.org/hgcustom/hghooks/file/38218ddfda34/mozhghooks/pushlog.py#l40>, except no compound index on (id, rev), but that shouldn't be a big deal.  

> Could be the subquery there that kills it, I am terrible with databases.

Could someone get a copy of the actual live queries being run, and the actual live table schemas, and the output that SQLite gives for the slow queries prefixed with EXPLAIN?  I happen to have a lot of experience optimizing MySQL queries, as a former MediaWiki developer who wrote schemas and queries that are used on Wikipedia, so I suspect I could help if slow queries are the problem.  SQLite doesn't seem much different from MySQL.
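
If someone has a copy of the database handy, something along these lines (file name is an example; the query is the one from comment 21) would dump the live schema plus SQLite's plan for the slow query:

import sqlite3

conn = sqlite3.connect("pushlog2.db")

# Dump the live table and index definitions.
for name, sql in conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE sql IS NOT NULL"):
    print(sql + ";")

# Show the query plan for the json-pushes?changeset= query.
QUERY = ("SELECT id, user, date, node FROM pushlog "
         "LEFT JOIN changesets ON id = pushid "
         "WHERE id = (SELECT c.pushid FROM changesets c WHERE c.node = ?) "
         "ORDER BY id DESC, rev DESC")
for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY,
                        ("c1a427dbcd07960c0590ebdfef22dc9cf4195e37",)):
    print(row)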
Shyam: can you maybe grab a copy of the pushlog db from the try repo and attach it here (or put it somewhere else if it's too large)?
(In reply to Ted Mielczarek [:ted] from comment #44)
> Shyam: can you maybe grab a copy of the pushlog db from the try repo and
> attach it here (or put it somewhere else if it's too large)?

https://people.mozilla.com/~shyam/pushlog2.db

That's the latest copy.
So this query is lightning-fast on the local DB copy, as I'd expect:

SELECT id, user, date, node from pushlog LEFT JOIN changesets ON id = pushid WHERE id = (select c.pushid from changesets c where c.node = 'c1a427dbcd07960c0590ebdfef22dc9cf4195e37') ORDER BY id DESC, rev DESC;

EXPLAIN QUERY PLAN indicates that it's executed just as I'd think.  Looking at pushlog-feed.py suggests that this is in fact the query that's executed.  Really, the DB is minuscule (15M), so even if there were no indexes and every query sorted everything in a temporary table, it should take way under a second per query.  But it has sane indexes too.

So if there is a DB issue, I'm guessing it has something to do with extremely long DB-wide locks.  As far as I can tell by briefly looking over the SQLite documentation, nothing should cause exclusive locks to be held for any period of time unless there's a long-running query (either read or write), which so far seems implausible to me.

(Specifically: reads run in parallel.  The only time reads will be blocked is if something is trying to take out a write lock, which only occurs immediately before the write takes place.  The write lock will block on any current reads, and while it's blocked no new reads are allowed, to prevent write starvation.  So a long-running read query will block the write lock, which in turn blocks new reads.)

So far I don't see what the issue could be on the DB side.  This is exactly the kind of lightweight workload SQLite is designed to handle.  Do we know that the DB is an issue here?  Has anyone monitored what the push hooks are doing to see why they're taking so long?
When the try repo is reset, does the pushlog DB get cleared?

If so, then whilst I'm sure the DB could be optimised, it can't be the cause of this recent spell of problems surely?

Instead, don't we really need to focus on:

(In reply to Ed Morley [:edmorley] from comment #16)
> Do we have any idea why the recent Try repo resets haven't fixed this?
> We've never had anything like this before persist after a reset, so what has
> changed?
eg: one of the new hooks that were added in the last two months:
https://hg.mozilla.org/hgcustom/hghooks/summary
Just a note, the work-around of using &pusher=foo@example.com doesn't really work great for me, as I have to look at a lot of try pushes from other people and I don't have all of their LDAP accounts memorized. :(
(In reply to Ed Morley [:edmorley] from comment #47)
> When the try repo is reset, does the pushlog DB get cleared?

Yes
 
> If so, then whilst I'm sure the DB could be optimised, it can't be the cause
> of this recent spell of problems surely?

We did that to look at what was causing comment #13

> Instead, don't we really need to focus on:
> 
> (In reply to Ed Morley [:edmorley] from comment #16)
> > Do we have any idea why the recent Try repo resets haven't fixed this?
> > We've never had anything like this before persist after a reset, so what has
> > changed?

Excellent questions. Ones we don't have the answer to, which is why we're kind of lost at this point. We know there are bugs (see comment #4 and comment #6) and it seems like we may have multiple issues at play. I'm focusing on the hgweb side of things because that's where things seem to be slow.
(In reply to Shyam Mani [:fox2mike] from comment #50)
> (In reply to Ed Morley [:edmorley] from comment #47)
> > When the try repo is reset, does the pushlog DB get cleared?
> 
> Yes
>  
> > If so, then whilst I'm sure the DB could be optimised, it can't be the cause
> > of this recent spell of problems surely?
> 
> We did that to look at what was causing comment #13
> 
> > Instead, don't we really need to focus on:
> > 
> > (In reply to Ed Morley [:edmorley] from comment #16)
> > > Do we have any idea why the recent Try repo resets haven't fixed this?
> > > We've never had anything like this before persist after a reset, so what has
> > > changed?
> 
> Excellent questions. Ones we don't have the answer to, which is why we're
> kind of lost at this point. We know there are bugs (see comment #4 and
> comment #6) and it seems like we may have multiple issues at play. I'm
> focusing on the hgweb side of things because that's where things seem to be
> slow.

fox2mike: not sure if related, but fyi. In bug#774862, we're seeing recurring 500 errors from hg.m.o on both try repo *and* non-try repos.
FWIW I agree with Aryeh in comment 46. I wouldn't think that there are so many writes that the db is causing contention, and 15 MB isn't very large, so I can't imagine it's due to disk seeking or anything.

I also like where you're going with comment 43. I suspect sqlite doesn't optimize subqueries well (just like MySQL!) and if we can change those to joins or temp tables I bet it'd go much faster.
(In reply to Sheeri Cabral [:sheeri] from comment #52)
> I also like where you're going with comment 43. I suspect sqlite doesn't
> optimize subqueries well (just like MySQL!) and if we can change those to
> joins or temp tables I bet it'd go much faster.

Well, here's the execution plan:

sqlite> EXPLAIN QUERY PLAN SELECT id, user, date, node from pushlog LEFT JOIN changesets ON id = pushid WHERE id = (select c.pushid from changesets c where c.node = 'c1a427dbcd07960c0590ebdfef22dc9cf4195e37') ORDER BY id DESC, rev DESC;
0|0|0|SEARCH TABLE pushlog USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE changesets AS c USING INDEX changeset_node (node=?) (~1 rows)
0|1|1|SEARCH TABLE changesets USING INDEX changeset_pushid (pushid=?) (~10 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY

This is about what you'd expect; rewriting to use a join is basically the same idea:

sqlite> EXPLAIN QUERY PLAN SELECT id, user, date, c2.node FROM changesets AS c1 JOIN pushlog ON c1.pushid = id JOIN changesets AS c2 ON id = c2.pushid WHERE c1.node = 'c1a427dbcd07960c0590ebdfef22dc9cf4195e37' ORDER BY id DESC, c2.rev DESC;
0|0|0|SEARCH TABLE changesets AS c1 USING INDEX changeset_node (node=?) (~1 rows)
0|1|1|SEARCH TABLE pushlog USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|2|2|SEARCH TABLE changesets AS c2 USING INDEX changeset_pushid (pushid=?) (~10 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY

(I made it an inner join instead of outer because there should never be a case where there's a changeset without a corresponding push, right?)  If you do a full EXPLAIN instead of EXPLAIN QUERY PLAN, the version with a join is a little shorter.  You can't avoid the temp table here because you're sorting by columns from two different tables.  But you actually don't need to; changesets.node has a unique index, and pushlog.id is the primary key, so the id will always be unique:

sqlite> EXPLAIN QUERY PLAN SELECT id, user, date, c2.node FROM changesets AS c1 JOIN pushlog ON c1.pushid = id JOIN changesets AS c2 ON id = c2.pushid WHERE c1.node = 'c1a427dbcd07960c0590ebdfef22dc9cf4195e37' ORDER BY c2.rev DESC;
0|0|0|SEARCH TABLE changesets AS c1 USING INDEX changeset_node (node=?) (~1 rows)
0|1|1|SEARCH TABLE pushlog USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|2|2|SEARCH TABLE changesets AS c2 USING INDEX changeset_pushid (pushid=?) (~10 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY

Now you're retrieving from c2 using pushid and ordering by rev, and you can't use any one index for both.  But we can't really help that, because the index is only usable for sorting if it's on the first table you're retrieving from; we're retrieving first from c1, but sorting by c2.  So if you wanted to get rid of the temporary table, you'd do:

sqlite> DROP INDEX changeset_pushid;
sqlite> CREATE UNIQUE INDEX changeset_pushid_rev ON changesets (pushid, rev);

(doesn't have to be unique, there's already a unique index on rev, so it should make no difference) and then rewrite again using a subquery, but with no left join (so we can reorder the tables):

sqlite> EXPLAIN QUERY PLAN SELECT id, user, date, node FROM changesets JOIN pushlog ON pushid = id WHERE pushid = (SELECT pushid FROM changesets AS c WHERE c.node = 'c1a427dbcd07960c0590ebdfef22dc9cf4195e37') ORDER BY rev DESC;
0|0|0|SEARCH TABLE changesets USING INDEX changeset_pushid_rev (pushid=?) (~10 rows)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE changesets AS c USING INDEX changeset_node (node=?) (~1 rows)
0|1|1|SEARCH TABLE pushlog USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)

Apparently SQLite isn't clever enough to reorder the tables and/or do equality propagation (pushid = id) itself in this case, but by reordering the tables myself I was able to get it to figure things out.  In this case the subquery helps us, because SQLite will retrieve the desired id from the unique index before it runs the main query.  If I try writing it as a join, it's not smart enough to figure out that it can run the query on the changesets table before anything else, and substitute the value before running the real query.  MySQL is clever enough to do this, and probably would avoid a filesort even if this were written as a join -- its optimizer will actually run single-row retrievals from unique indexes before it does anything else ("const" join type https://dev.mysql.com/doc/refman/5.6/en/explain-output.html#jointype_const).


. . . but this is all completely beside the point, because the query is extremely fast already and doesn't need to be optimized.
Aryeh - Much <3 for this analysis. And here I was going to research what sqlite's EXPLAIN was.....turns out it's just EXPLAIN.

Thank you for establishing it's not the DB!
Unfortunately this is causing people to take greater risks with landings :-(

From #developers:

snorp	Ms2ger: I couldn't even load try last night
snorp	it somehow did load just now
snorp	sigh
Ms2ger	You landed even though you couldn't load try? ?_?
snorp	I haven't been able to load try all week
...
firebot	Check-in: http://hg.mozilla.org/integration/mozilla-inbound/rev/a501733bc7b5 - Ed Morley - Backout 4987ffd173a4 (bug 687267) for bustage
(In reply to Ed Morley [:edmorley] from comment #55)
> Unfortunately this is causing people to take greater risks with landings :-(

Hi Ed,

This is a high priority for us.  We are still working on trying to find the root cause and possible fixes.  Just wanted you to know this is top of our list.
one of the underlying issues we believe is fixed upstream (but not in a release yet)

see http://www.selenic.com/hg/rev/7bf48bc7de23
The fix for this is located in:

http://www.selenic.com/hg/rev/7bf48bc7de23

Which I've confirmed is located in mercurial-2.2.2.  This means we'll need to upgrade mercurial, which is bug 741353.
So the current plan is to build an rpm for hg 2.2.3, roll it out to a single read-only webhead, and see how fast /try loads there. If that seems to fix the issue completely, we'll work on a downtime with releng and upgrade across the board to 2.2.3.
Is that it? The change talks about filediff, but we see slow requests even when requesting a full change. I also don't remember seeing the other symptoms.
tl;dr - no tree closure involved in this change

(In reply to Shyam Mani [:fox2mike] from comment #59)
> So the current plan is to build an rpm for hg 2.2.3 and roll it out to one,
> single read only webhead and see how fast /try loads there. If that seems to
> fix the issue completely, we'll work on a downtime with releng and upgrade
> across the board to 2.2.3

bit more detail here,

http access to hg comes from a distinct pool of nodes with read-only access to the hg data. Normally when upgrading we have a lot of testing to do to ensure pushes and hooks all work, and rolling out to a push node also requires a tree closure.  Since the http nodes are separate (and behind a load balancer) we can take a node out, upgrade it with no risk to the data, test it, put it back into the pool, and direct all /try traffic to the upgraded node to test for a fix there.

If this does fix the issue we can roll out the upgrade across the rest of the http pool with no impact and no need for a tree closure.  At that point we will work on more extensive testing of pushes and hooks for a tree closure to upgrade the push nodes at some later date, but rolling this out to the web nodes would ideally fix the problem here.
(In reply to Corey Shields [:cshields] from comment #56)
> Just wanted you to know this is top of our list

Thank you :-D
After spending 3 hours tracking this down with Corey and not really getting anywhere, we've reported this upstream to hg. For those interested - http://bz.selenic.com/show_bug.cgi?id=3557
Rafael was right; that fix appears to be for filediff and has nothing to do with summary or shortlog.

We worked some more on this yesterday to isolate other variables unique to our environment, to no avail.  Filed an upstream bug at http://bz.selenic.com/show_bug.cgi?id=3557
Depends on: 777521
Something that's just occurred to me after seeing all of the collapsed changesets in decoder's push to Try (https://tbpl.mozilla.org/?tree=Try&rev=5843369883ad) after the emergency reset (bug 778062) is that for every 6-week uplift we have, there are many more heads (and heads with thousands of changesets since the last common ancestor) present on aurora/beta that get pushed to Try when people are testing branch patches.

In decoder's push alone, a new head with 2800 changesets since the common ancestor has been added, a matter of minutes after the reset. Whoever pushes beta to Try next will add another few thousand. Once we have the next uplift (with heads closed and new ones created, as part of the standard uplift procedure) we're going to have yet more added on top :-(

Given that seemingly nothing else has changed, might this be the cause of the Try resets no longer being as effective as they once were?
(Sorry split post)

Or more succinctly:

Given: (from http://bz.selenic.com/show_bug.cgi?id=3557#c7)
> With our last repo reset, these performance problems returned with
> a much lower head count than in the past.

Perhaps the O(N) performance problem is not just dependent on the number of heads, but also on how many unique changesets that head has between it and the common ancestor?

Normal Try usage for people basing work on trunk will typically only add a handful of unique changesets since the common ancestor; whereas with the 'new' release train model/uplift method, aurora+beta will have thousands for each new head (a number that steadily increases every 6 weeks).
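
One way to put rough numbers on that theory is a sketch like the following, run inside a local try clone. BASELINE is an assumption standing in for the trunk tip (pick whatever revision makes sense), and the loop is slow on a repository this size:

import subprocess

BASELINE = "default"   # example: any revision standing in for the trunk tip

def hg(*args):
    return subprocess.check_output(("hg",) + args).decode()

# One short node per head, then count each head's changesets that are not
# ancestors of the baseline, i.e. its depth since the common ancestor.
heads = hg("heads", "--template", "{node|short}\n").split()
for head in heads:
    revs = hg("log", "-r", "ancestors(%s) - ancestors(%s)" % (head, BASELINE),
              "--template", "{rev}\n").split()
    print(head, len(revs), "changesets since common ancestor with", BASELINE)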
Ed, I think you've finally hit the nail on the head as to why try resets aren't as effective as before with comment #66. With the rapid release stuff in place for a while now, we're probably seeing the effects on try over time.
If this does turn out to be the cause, perhaps the simplest solution in the short term (whilst we wait and see if mercurial can be made more performant) is to create 'try-aurora' & 'try-beta' repos (or even just a single 'try-branches' one). 

This would:
* Keep the O(head*depth) issue out of the way of people working just on trunk.
* Mean that head-heavy try-aurora/try-beta repos could be reset more frequently, since with their lower load it would be easier to find a window where blowing away all results was less problematic.

We could then see if it was possible to protect the main Try repo against accidental branch pushes, by using a hook that denied any pushes that added more than N changesets since the last common ancestor.
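
Something along these lines could work as that hook. This is purely a hypothetical sketch (the function name, limit, and message are made up), and it counts the changesets added by the incoming push as a stand-in for "changesets since the last common ancestor":

# Hypothetical pretxnchangegroup hook; not an existing mozhghooks hook.

MAX_NEW_CSETS = 100   # example threshold

def maxnewcsets_hook(ui, repo, node, **kwargs):
    # 'node' is the first changeset added by the incoming push; everything
    # from there to the end of the repo is part of this changegroup.
    first = repo[node].rev()
    added = len(repo) - first
    if added > MAX_NEW_CSETS:
        ui.warn("push adds %d changesets (limit %d); did you mean to push a "
                "release branch to try?\n" % (added, MAX_NEW_CSETS))
        return 1   # non-zero return aborts the transaction
    return 0

It would be wired up in the repo's hgrc with something like pretxnchangegroup.maxnewcsets = python:/path/to/maxnewcsets.py:maxnewcsets_hook.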
(In reply to comment #67)
> Ed, I think you've finally hit the nail on the head as to why try resets aren't
> as effective as before with comment #66. With the rapid release stuff in place
> for a while now, we're probably seeing the effects on try over time.

That stuff has been in place for more than a year now!
(In reply to comment #68)
> If this does turn out to be the cause, perhaps the simplest solution in the
> short term (whilst we wait and see if mercurial can be made more performant) is
> to create 'try-aurora' & 'try-beta' repos (or even just a single 'try-branches'
> one). 

As discussed on IRC, note that the way that we uplift things into aurora/beta etc is nothing new at all.  We've been following roughly the same steps since Firefox 4 (I know that exactly because I did a few of those merges last year!).  Also, I don't believe we have any evidence to suggest that this is really what's to blame here.  The depth of aurora/beta heads is not a lot more than m-c heads (perhaps a few hundred revisions more).

A theory that I could buy is just the continuous increase in the size of our repository.  This means that even if Mercurial has to run some O(N) algorithms over and over again (for example, the one which needs to check whether .hgtags has changed in a given revision <http://bz.selenic.com/show_bug.cgi?id=3557#c18>) it's going to get worse and worse as time goes on, which is what we've been seeing.

Now I do believe that splitting the try repo as suggested by Ed will _help_ the people who want to push aurora and beta try jobs, but only because those repositories will get a lot fewer hits!  That would be the wrong criteria for splitting things up IMO, and it doesn't really prevent anyone from pushing their aurora/beta tree over to the regular try either, and hooks to prevent large pushes will just hurt the people who are working on their own branch for a long time.

> This would:
> * Keep the O(head*depth) issue out of the way of people working just on trunk.

Out of curiosity, have we established this order?
(In reply to Ehsan Akhgari [:ehsan] from comment #70)
> As discussed on IRC, note that the way that we uplift things into
> aurora/beta etc is nothing new at all.  We've been following roughly the
> same steps since Firefox 4 (I know that exactly because I did a few of those
> merges last year!).  

Ah, my mistake - that was before my time :-)

> Out of curiosity, have we established this order?

No, I was referring to the 'perhaps' of comment 66. I'm not at all familiar enough with Mercurial's implementation to know; I was hoping someone from the Selenic bug might be able to comment on it.
Ben was working on some long term solutions here.
Assignee: shyam → bkero
Dropping severity while work is being done.
Severity: critical → normal
Depends on: 827123
Depends on: 894429
Depends on: 827123
Depends on: 691459
Depends on: 994028, 1001735, 992368, 945383
Depends on: 1024877
Depends on: 1030020
Depends on: 1045089
Depends on: 1045086
This bug is very out of date w.r.t. try hg server issues. However, it has a lot of good analysis of pushlog database issues. Adjusting summary and leaving open for that reason.

All current try issues, including this one, are linked off bug https://bugzilla.mozilla.org/show_bug.cgi?id=try-tracker
No longer depends on: 945383, 992368, 994028, 1045089
Summary: Recurring issues with hg.mozilla.org/try → Recurring pushlog issues with hg.mozilla.org/try
Ah, thank you :-)
Depends on: 1053567
Component: Server Operations: Developer Services → Mercurial: hg.mozilla.org
Product: mozilla.org → Developer Services
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/204]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/204] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/786] [kanban:engops:https://kanbanize.com/ctrl_board/6/204]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/786] [kanban:engops:https://kanbanize.com/ctrl_board/6/204] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/786]
There are a gazillion issues buried in the comments. All are tracked by other bugs. I don't see much value in leaving this bug open. So I'm closing it.

FWIW, the issue that got the most attention in this bug was slowness loading certain pushlog URLs. Discussion in the Mercurial bug report indicates this was likely the result of using a read-only filesystem, which prevented branch cache population and led to lag while computing branch info on pushlog load. This issue was resolved when we moved to local disk slaves (bug 937732). There are other issues reportedly fixed in newer versions of Mercurial. We are now running Mercurial 3.2.3 in production and we have those fixes deployed.

We also just recently rewrote the pushlog code to be more robust. All kinds of issues with pushlog consistency should now be a thing of the past. As part of this, we now have a "version 2" pushlog API to enable consumers to get pushlog data more robustly. That is documented at https://mozilla-version-control-tools.readthedocs.org/en/latest/hgmo/pushlog.html.

We still have scaling issues with Try due to the number of heads. Those will be fixed in bug 1055298.
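
For consumers, here is a minimal sketch of reading the "version 2" pushlog API mentioned above, assuming the version=2 query parameter and the lastpushid/pushes response shape described in the linked documentation:

import json, urllib.request

url = "https://hg.mozilla.org/try/json-pushes?version=2"
data = json.load(urllib.request.urlopen(url))

print("last push id:", data["lastpushid"])
for pushid in sorted(data["pushes"], key=int):
    push = data["pushes"][pushid]
    print(pushid, push["user"], "%d changesets" % len(push["changesets"]))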
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME