Closed
Bug 1036998
Opened 11 years ago
Closed 10 years ago
hg web head synchronization taking excessively long
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: hwine, Unassigned)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/900] )
Attachments
(1 file)
31.58 KB, text/plain
Failure reports from pulsebot indicate that there is at least a 3 minute gap between the time the first hg webhead updates and the last one updates. I.e. the cluster is in an inconsistent state for at least that long, perhaps on every push.
Besides the documented cases with pulsebot, this could impact CI performance, as a missing revision causes the build machine's local hg cache to be discarded and the entire repository to be re-cloned.
Comment 1•11 years ago
(In reply to Hal Wine [:hwine] (use needinfo) from comment #0)
> Failure reports from pulsebot indicate that there is at least a 3 minute gap
> between the time the first hg webhead updates and the last one updates. I.e.
> the cluster is in an inconsistent state for at least that long, perhaps on
> every push.
So, the norm for an m-i push is somewhere between 40 and 70 seconds, sometimes less. Considering the polling happens every minute, that is about normal, except for the extra 10 seconds. The norm for try is much higher: between 60 and 200 seconds, sometimes more, and I have never seen less. (That's the time between the push date and the moment pulsebot receives a pulse notification, which comes from buildbot polling the web heads.)
> Besides documented cases with pulsebot, this could impact CI performance as
> a missing revision leads to the build machine local hg cache to be
> discarded, and the entire repository re-cloned.
And it does, I've seen it happen. I don't know how often, though.
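For reference, the lag figures above are just the difference between the push date and the time the notification arrives. A minimal sketch of that arithmetic follows; the function names and the sample timestamps are hypothetical and not taken from pulsebot or buildbot.

```python
from datetime import datetime, timezone


def notification_lag_seconds(push_time: datetime, notified_time: datetime) -> float:
    """Seconds between the push date and the moment the notification was received."""
    return (notified_time - push_time).total_seconds()


def within_polling_window(lag_seconds: float, poll_interval: float = 60.0) -> bool:
    """True if the lag could be explained by polling alone (at most one interval)."""
    return lag_seconds <= poll_interval


# Made-up timestamps for illustration only, not from a real push.
push = datetime(2014, 7, 10, 14, 0, 0, tzinfo=timezone.utc)
seen = datetime(2014, 7, 10, 14, 1, 10, tzinfo=timezone.utc)
lag = notification_lag_seconds(push, seen)
```

With a one-minute poll, anything much beyond 60 seconds is lag that polling alone does not explain.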
Comment 2•11 years ago
IanN
14:00:18 hmmm, now getting abort: HTTP Error 500: Internal Server Error on the try server :(
14:01:17 https://tbpl.mozilla.org/php/getParsedLog.php?id=44962549&tree=Thunderbird-Try
pmoore
14:02:07 hwine-commuting: fubar: ^^
14:02:31 IanN: we sometimes get these http 500s from hg :(
IanN
14:02:59 k, well i have asked for a rebuild *crosses fingers*
pmoore
14:03:53 normally we'd have several retry attempts at the hg pull but i don't see one here, i can look into it to see why not
14:03:58 we should probably have an automatic retry there
14:08:36 IanN: so the http 500 occurred when trying to unbundle, and then it fell back to using a regular clone:
14:09:20 IanN: then it hit: abort: unknown revision 'aa7edad6eac9b66044da9ad4f3093c2ffd82fc6f'
14:09:47 maybe this is a problem with the webheads syncing on hg (?)
14:09:50 i will investigate...
14:10:21 (in other words, the initial 500 error shouldn't have prevented it from working - it just made it fall back to doing a regular clone)
14:11:11 coop: i haven't looked into that (try-linux64-ec2-golden)
14:11:51 coop: maybe jealousy that all its offspring get to connect to buildbot masters, and it doesn't?
14:16:23 IanN: https://hg.mozilla.org/try-comm-central/rev/aa7edad6eac9b66044da9ad4f3093c2ffd82fc6f seems ok but locally i also get the same error…
14:16:38 $ hg clone -U -r aa7edad6eac9b66044da9ad4f3093c2ffd82fc6f https://hg.mozilla.org/try-comm-central
14:16:45 abort: unknown revision 'aa7edad6eac9b66044da9ad4f3093c2ffd82fc6f'!
IanN
14:17:16 :(
14:18:06 ah now, it is working
14:18:12 it must be a problem with web heads syncing
14:18:40 IanN: i'll raise a bug and CC you
fubar
14:18:57 pmoore: there's already a (closed) bug. please reopen that one.
pmoore
14:19:28 fubar: will do
IanN
14:21:37 pmoore: ok, thanks, i'm seeing the abort too now :(
14:21:43 on try not locally
fubar
14:22:04 IanN: you have three of the last four pushes to try-comm-central. did you notice anything odd with any of those pushes?
IanN
14:22:23 fubar: they all seemed to work fine
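The automatic retry pmoore suggests around the hg pull could look roughly like the following. This is a minimal sketch, not the actual build automation code: `retry` and `pull_revision` are hypothetical names, and the attempt count and delays are assumptions chosen for illustration.

```python
import subprocess
import time


def retry(action, attempts=5, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    action is any zero-argument callable that raises on failure.
    The wait between attempts grows by the backoff factor each time.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # deliberately broad for a sketch
            last_error = error
            if attempt < attempts:
                sleep(delay)
                delay *= backoff
    raise last_error


def pull_revision(repo_dir, url, rev):
    """Pull one revision into a local clone; raises if hg exits non-zero."""
    subprocess.run(["hg", "-R", repo_dir, "pull", "-r", rev, url], check=True)


# Hypothetical usage (not executed here):
# retry(lambda: pull_revision("/builds/hg-cache", "https://hg.mozilla.org/try-comm-central", rev))
```

Retrying for a few minutes before discarding the local cache would cover the window in which a webhead has not yet received the pushed revision.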
Comment 3•11 years ago
Attaching the key parts of the log file...
Comment 4•11 years ago
pmoore
14:31:27 IanN: looking again at the logs, even after the unbundle failed, and then the clone failed, it tried again an unbundle, and that one worked! So I think the hg problems were automatically resolved, but appear as a red herring in the log file
14:31:45 that said, we should still fix those hg issues, so i'll write in the bug anyway
14:32:10 IanN: fubar: i'm updating bug 1036998
Comment 5•11 years ago
I believe our next step here is to determine the actual "sync time" now that some of the connection errors are addressed (bug 1038478).
We may have sufficient details in the new logs to do the correlation, or we may need to add some additional logging to the mirror scripts.
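Once per-webhead update times can be pulled out of the logs, the "sync time" is the spread between the first and last webhead to update for a given push. A minimal sketch of that calculation follows; the host names and timestamps are invented, and how the times get extracted from the mirror logs is left open because the log format is not described here.

```python
from datetime import datetime
from typing import Dict


def sync_window_seconds(update_times: Dict[str, datetime]) -> float:
    """Seconds between the first and the last webhead updating for one push."""
    if not update_times:
        raise ValueError("need at least one webhead timestamp")
    times = list(update_times.values())
    return (max(times) - min(times)).total_seconds()


def slowest_webhead(update_times: Dict[str, datetime]) -> str:
    """Name of the webhead that received the push last."""
    return max(update_times, key=update_times.get)


# Invented sample data for a single push; host names are placeholders.
sample = {
    "webhead-a": datetime(2014, 7, 10, 14, 0, 5),
    "webhead-b": datetime(2014, 7, 10, 14, 0, 40),
    "webhead-c": datetime(2014, 7, 10, 14, 3, 10),
}
window = sync_window_seconds(sample)
last = slowest_webhead(sample)
```

Running this per push would show whether the 3 minute gap from comment 0 is typical or an outlier, and whether one webhead is consistently last.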
Updated•11 years ago
Product: Release Engineering → Developer Services
Comment 6•11 years ago
Hal, is there anything more to do on this bug (e.g. from comment 5)?
Pete
Flags: needinfo?(hwine)
Updated•11 years ago
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/264]
Updated•11 years ago
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/264] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/895] [kanban:engops:https://kanbanize.com/ctrl_board/6/264]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/895] [kanban:engops:https://kanbanize.com/ctrl_board/6/264] [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/899] [kanban:engops:https://kanbanize.com/ctrl_board/6/264] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/899] [kanban:engops:https://kanbanize.com/ctrl_board/6/264] [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/900] [kanban:engops:https://kanbanize.com/ctrl_board/6/264]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/900] [kanban:engops:https://kanbanize.com/ctrl_board/6/264] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/900]
Comment 7•11 years ago
This sync information should be available in hg.log. Does that conclude the work necessary for this bug?
Comment 8•10 years ago
Closing this out; I don't think there's any value in keeping it open. If push times are still a problem, please file a new bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED