TBPL is broken for try (after Reset try server repo, June 2010 edition ?)

RESOLVED FIXED

Status

mozilla.org Graveyard
Server Operations
--
blocker
RESOLVED FIXED
8 years ago
3 years ago

People

(Reporter: joduinn, Assigned: aravind)

Tracking

Details

(Whiteboard: [Thursday 6/10 6am], URL)

Attachments

(1 attachment)

Monthly request to reset the TryServer repo. This will need a downtime for TryServer, so please coordinate with RelEng in advance.
Blocks: 570512

Updated

8 years ago
Whiteboard: [tryserver]

Comment 1

8 years ago
Going to tackle this during tomorrow's downtime.
Assignee: server-ops → jdow
Whiteboard: [tryserver] → [Thursday 6/10 6am]

Comment 2

8 years ago
The wiki page has been updated to fix a couple of discrepancies, but this went really smoothly.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Reopening on the basis of comments in bug 571305 (in particular #0 and #7). We haven't got the nice speed improvement which is the motivation behind resetting the try server. Did the 'create an empty pushlog db' steps on the wiki get carried out?
Severity: normal → blocker
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
From a mpt machine:

# pushlog full listing - this works sometimes
$ time curl -m 300 -s http://hg.mozilla.org/try/pushloghtml | wc -c
34473
real	1m8.854s
# ... but fails other times
1825
real	3m1.645s

# pushlog tips of heads only - this is OK if a l
$ time curl -m 300 -s http://hg.mozilla.org/try/pushloghtml?tipsonly=1 | wc -c
6644
real	0m27.112s

Philor says this should be < 6 seconds.
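
The manual timing above could be turned into a repeatable check. This is only a minimal sketch, assuming a POSIX shell: `check_latency` is a hypothetical helper (not part of any existing tooling), and the 6 second limit in the example is simply the figure quoted from Philor.

```shell
# check_latency LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND, discards its output, and reports whether it finished
# successfully within LIMIT_SECONDS (whole-second resolution).
# Returns 0 when fast enough, 1 when too slow, 2 when COMMAND failed.
check_latency() {
    limit=$1
    shift
    start=$(date +%s)
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$status" -ne 0 ]; then
        echo "FAIL exit=$status after ${elapsed}s"
        return 2
    fi
    if [ "$elapsed" -gt "$limit" ]; then
        echo "SLOW ${elapsed}s (limit ${limit}s)"
        return 1
    fi
    echo "OK ${elapsed}s"
}

# Example (needs network access, so it is left commented out):
# check_latency 6 curl -m 300 -s -f 'http://hg.mozilla.org/try/pushloghtml?tipsonly=1'
```

Passing `-f` makes curl exit non-zero on an HTTP error status, so an error page returned by the server is reported as a failure rather than as a fast response.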

Updated

8 years ago
Assignee: jdow → aravind
Created attachment 450591 [details]
Talos slave transfer rates

Not making a huge amount of progress here.

Current situation
 * can't use tinderboxpushlog on try because it can't pull in the changeset data. This has an effect on developers tracking the results of their test builds but can be reconstructed using emails and tinderbox.m.o with some effort
 * try server builds are still working (ie you can pull the try repo still)
 * mozilla-central et al seem fine for builds and tests

Known data:
* we had a releng downtime (bug 570512) between 0600 and 0900 PDT for the try server reset (this bug) and to upgrade the squid proxy in the Castro office (bug 571305), plus a few minor talos changes

* (from munin) at 0730 PDT the memory usage on dm-vcview01 and 03 started ramping up linearly, with some jumps back to 0 which were presumably restarting the web server. CPU usage also more than doubled, and the number of NFS requests jumped to over 4K/s when it was averaging < 1K/s previously (but spiky).

* dm-vcview04 had a network issue, pushing more load onto the other three vcview boxes (which sit behind hg.m.o); some details for that in bug 571305, this was breaking mozilla-central but that's fine now. Timing of this not clear

* talos slaves in Castro are getting slower transfers from the proxy (see attachment). Don't know where the slowdown is in the chain of slave -> proxy -> stage.m.o -> nfs to ftp.m.o -> iscsi on equallogic, but:
 * we changed the proxy
 * tinderbox.m.o storage is also on equallogic storage (same head??)

* aravind reported assorted other things which would match with nfs being hammered on the netapp, which is where the backend for hg.m.o is

Things that might be useful to check
* is the equallogic array working much harder since this morning ? 
That might indicate the proxy is not working properly and sending more requests direct to stage. I don't see any major changes on munin's graphs for dm-ftp01 though. Or something bizarre with changing the partition size for tinderbox making equallogic misbehave ?
* is nfs running slower for hg.m.o ?
* is there a spider hitting hg.m.o or tinderbox.m.o ?
Probably should have left this all in bug 571305 but it's too late now. Morphing bug title.

Another thing to check: 'hg verify' on the try repo
Summary: Reset try server repo, June 2010 edition → TBPL is broken for try (after Reset try server repo, June 2010 edition ?)
(Assignee)

Comment 7

8 years ago
I have a temporary hack in place, which should hold us over till we get a more permanent solution in.

The try repo is now living on an in-memory filesystem on dm-vcview04.  I have tested cloning and pushlog, stuff seems to be working okay and appears to be quite speedy.

The only limitation is that the repo will only last for as long as dm-vcview04 is alive and kicking.  The worst case scenario is that we'd have to re-create the try repo (which isn't so bad).

The real fix (imo) is along similar lines, only we should be using a SSD mounted on a server.  Marking this fixed for now.
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED

Comment 8

8 years ago
Seeing this as of right now:

pushing to ssh://hg.mozilla.org/try/
searching for changes
^Cinterrupted!
remote: waiting for lock on repository /repo/hg/mozilla/try/ held by 'dm-svn01.mozilla.org:22240'
remote: Killed by signal 2.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
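
The `held by 'host:pid'` text in the push output comes from Mercurial's repository lock. A minimal sketch for inspecting it on the server, assuming the lock lives at `.hg/store/lock` and is a symlink whose target is `host:pid` (the usual layout on filesystems that support symlinks); `lock_holder` is a hypothetical helper name, not a Mercurial command.

```shell
# lock_holder REPO_PATH
# Prints the "host:pid" recorded in the repository's store lock.
# Returns 1 if no lock is present.
lock_holder() {
    lock="$1/.hg/store/lock"
    if [ -L "$lock" ]; then
        # Usual case: the lock is a symlink and its target names the holder.
        readlink "$lock"
    elif [ -f "$lock" ]; then
        # Fallback: on filesystems without symlinks the holder is stored as file content.
        cat "$lock"
    else
        return 1
    fi
}

# Example, using the path from the error message above:
# lock_holder /repo/hg/mozilla/try
```

If the process named in the output is no longer running on that host, the lock is most likely stale.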
(Assignee)

Comment 9

8 years ago
Please try pushing to it now?
(Assignee)

Comment 10

8 years ago
My fix was breaking pushes to the repo, so I reverted it and now we are back to the netapp.  Will look at this some more tomorrow.
(Assignee)

Comment 11

8 years ago
I made one other change earlier - I am now sending all the /try requests to only one of the backend servers (vcview04) instead of splaying them across all four.  That seems to have made some difference and tbpl queries are working now.  Will leave this open for now. Please comment here if you notice more breakage.
aravind: this seems to be working now. Unless there is anything still to do on your side, I think it's safe to close...
(Assignee)

Comment 13

8 years ago
(In reply to comment #12)
> aravind: this seems to be working now. unless there is anything still to do on
> your side, I think its safe to close...

Let's leave this open for now; I am trying to come up with a better way to serve the try repo.  I will comment here once I am happy.  I want to leave this bug open so folks can comment if they notice issues during my work.
(Assignee)

Comment 14

8 years ago
I am all done here.

hgweb on the try repo is now served off a local clone on dm-vcview04.  This clone is automatically kept in sync via a changegroup hook in the main try repo on hg.mozilla.org.  This being a local repo, I think we will have much better performance on hgweb.

Please ping me on irc (or page me) if you guys notice any problems with the try repo.
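
The hook-driven sync described in this comment could look roughly like the sketch below. The actual hook used on hg.mozilla.org is not shown in this bug, so everything here is an assumption: the script name, the `sync_mirror` helper, the use of ssh to reach the mirror host, and the placeholder paths. It also assumes the local clone's `default` path points back at the main try repo, which is what `hg clone` sets up.

```shell
#!/bin/sh
# Hypothetical changegroup hook script. Assumed wiring in the main
# repo's .hg/hgrc (the script path is a placeholder):
#   [hooks]
#   changegroup.sync_hgweb = /path/to/sync-try-mirror.sh

# sync_mirror MIRROR_HOST MIRROR_PATH
# Asks the mirror host to pull new changesets into its local clone.
# "hg pull" with no source uses the clone's configured default path.
# SSH can be overridden in the environment (useful for dry runs).
sync_mirror() {
    mirror_host=$1
    mirror_path=$2
    ${SSH:-ssh} "$mirror_host" "hg -R '$mirror_path' pull --quiet"
}

# In the real hook the call would go here, for example:
# sync_mirror dm-vcview04 /path/to/local/try
```

Running the pull synchronously keeps the mirror current as soon as a push completes, at the cost of making each push wait for the ssh round trip.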
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard