Closed Bug 1036998 Opened 11 years ago Closed 10 years ago

hg web head synchronization taking excessively long

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Unassigned)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/900] )

Attachments

(1 file)

Failure reports from pulsebot indicate that there is at least a 3 minute gap between the time the first hg webhead updates and the time the last one updates. I.e. the cluster is in an inconsistent state for at least that long, perhaps on every push. Besides the documented cases with pulsebot, this could impact CI performance, as a missing revision causes the build machine's local hg cache to be discarded and the entire repository to be re-cloned.
(In reply to Hal Wine [:hwine] (use needinfo) from comment #0)
> Failure reports from pulsebot indicate that there is at least a 3 minute gap
> between the time the first hg webhead updates and the last one updates. I.e.
> the cluster is in an inconsistent state for at least that long, perhaps on
> every push.

So, the norm for a m-i push is somewhere between 40 and 70 seconds, sometimes less, which considering the polling every minute, is about normal, except for the extra 10 seconds. The norm for try is much higher, between 60 and 200, sometimes more, but never seen less. (that's the time between the push date and the moment pulsebot receives a pulse notification, which comes from buildbot polling web heads)

> Besides documented cases with pulsebot, this could impact CI performance as
> a missing revision leads to the build machine local hg cache to be
> discarded, and the entire repository re-cloned.

And it does, I've seen it happen. I don't know how often, though.
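For reference, the arithmetic in the comment above can be sketched as follows. This is only an illustration: the function name is made up, and it assumes that one 60-second polling interval is the only delay that counts as "normal", so anything beyond it is left unexplained by polling.

```python
POLL_INTERVAL_S = 60  # "polling every minute", per the comment above


def unexplained_by_polling(observed_s: float,
                           poll_interval_s: float = POLL_INTERVAL_S) -> float:
    """Return the part of a push-to-notification delay that a single
    polling interval cannot account for (hypothetical helper)."""
    return max(0.0, observed_s - poll_interval_s)


# Observed ranges quoted in the comment: 40-70 s for m-i, 60-200 s for try.
for label, observed in [("m-i low", 40), ("m-i high", 70),
                        ("try low", 60), ("try high", 200)]:
    extra = unexplained_by_polling(observed)
    print(f"{label}: {observed}s observed, {extra:.0f}s beyond one poll")
```

With these numbers, the upper end for m-i leaves the "extra 10 seconds" mentioned above, while the upper end for try leaves 140 seconds.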
Blocks: 1042210
IanN 14:00:18 hmmm, now getting abort: HTTP Error 500: Internal Server Error on the try server :(
14:01:17 https://tbpl.mozilla.org/php/getParsedLog.php?id=44962549&tree=Thunderbird-Try
pmoore 14:02:07 hwine-commuting: fubar: ^^
14:02:31 IanN: we sometimes get these http 500s from hg :(
IanN 14:02:59 k, well i have asked for a rebuild *crosses fingers*
pmoore 14:03:53 normally we'd have several retry attempts at the hg pull but i don't see one here, i can look into it to see why not
14:03:58 we should probably have an automatic retry there
14:08:36 IanN: so the http 500 occurred when trying to unbundle, and then it fell back to using a regular clone:
14:09:20 IanN: then it hit: abort: unknown revision 'aa7edad6eac9b66044da9ad4f3093c2ffd82fc6f'
14:09:47 maybe this is a problem with the webheads syncing on hg (?)
14:09:50 i will investigate...
14:10:21 (in other words, the initial 500 error shouldn't have prevented it from working - it just made it fall back to doing a regular clone)
14:11:11 coop: i haven't looked into that (try-linux64-ec2-golden)
14:11:51 coop: maybe jealousy that all its offspring get to connect to buildbot masters, and it doesn't?
14:16:23 IanN: https://hg.mozilla.org/try-comm-central/rev/aa7edad6eac9b66044da9ad4f3093c2ffd82fc6f seems ok but locally i also get the same error…
14:16:38 $ hg clone -U -r aa7edad6eac9b66044da9ad4f3093c2ffd82fc6f https://hg.mozilla.org/try-comm-central
14:16:45 abort: unknown revision 'aa7edad6eac9b66044da9ad4f3093c2ffd82fc6f'!
IanN 14:17:16 :(
14:18:06 ah now, it is working
14:18:12 it must be a problem with web heads syncing
14:18:40 IanN: i'll raise a bug and CC you
fubar 14:18:57 pmoore: there's already a (closed) bug. please reopen that one.
14:19:12 ewong|sleep is now known as ewong
pmoore 14:19:28 fubar: will do
IanN 14:21:37 pmoore: ok, thanks, i'm seeing the abort too now :(
14:21:43 on try not locally
fubar 14:22:04 IanN: you have three of the last four pushes to try-comm-central. did you notice anything odd with any of those pushes?
IanN 14:22:23 fubar: they all seemed to work fine
Attached file bug1036998_log.txt
Attaching the key parts of the log file...
pmoore 14:31:27 IanN: looking again at the logs, even after the unbundle failed, and then the clone failed, it tried again an unbundle, and that one worked! So I think the hg problems were automatically resolved, but appear as a red herring in the log file
14:31:45 that said, we should still fix those hg issues, so i'll write in the bug anyway
14:32:10 IanN: fubar: i'm updating bug 1036998
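The "automatic retry" idea from the chat could look roughly like the sketch below. It is not the actual build tooling: `clone_with_retry`, the attempt count, and the delay are all hypothetical. The only things taken from this bug are the `hg clone -U -r` invocation and the two error strings seen in the logs. The runner and sleep function are injectable so the logic can be exercised without a network.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_hg(args: Sequence[str]) -> subprocess.CompletedProcess:
    """Run an hg command and capture its output (requires hg on PATH)."""
    return subprocess.run(["hg", *args], capture_output=True, text=True)


# Error strings seen in this bug; treated here as possibly transient.
TRANSIENT_MARKERS = ("unknown revision", "HTTP Error 500")


def clone_with_retry(repo_url: str, revision: str, dest: str,
                     attempts: int = 5, delay_s: float = 30.0,
                     runner: Callable = run_hg,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry `hg clone -U -r REV URL DEST` while the failure looks like a
    webhead that has not received the revision yet (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        result = runner(["clone", "-U", "-r", revision, repo_url, dest])
        if result.returncode == 0:
            return True
        if not any(marker in result.stderr for marker in TRANSIENT_MARKERS):
            return False  # a different failure; retrying is unlikely to help
        if attempt < attempts:
            sleep(delay_s)  # give the remaining webheads time to sync
    return False


# Example call (commented out because it needs hg and network access):
# clone_with_retry("https://hg.mozilla.org/try-comm-central",
#                  "aa7edad6eac9b66044da9ad4f3093c2ffd82fc6f", "try-cc")
```

Retrying only on the two known error strings keeps unrelated failures (bad URL, disk full) from being masked by the retry loop.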
See Also: → 1038478
I believe our next step here is to determine the actual "sync time" now that some of the connection errors are addressed (bug 1038478). We may have sufficient details in the new logs to do the correlation, or we may need to add some additional logging to the mirror scripts.
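Once per-webhead update times can be extracted from the logs, the "sync time" is just the spread between the first and last update. A minimal sketch follows, assuming the timestamps have already been parsed into a mapping of host name to time. The host names and times below are placeholders, not real data, and the log parsing itself is not shown because the log format is not described in this bug.

```python
from datetime import datetime
from typing import Mapping, Tuple


def sync_gap(update_times: Mapping[str, datetime]) -> Tuple[float, str]:
    """Return (seconds between the first and last webhead update,
    name of the last webhead to update). Hypothetical helper."""
    first = min(update_times.values())
    last_host = max(update_times, key=lambda host: update_times[host])
    return (update_times[last_host] - first).total_seconds(), last_host


# Placeholder values for illustration only; not taken from any real log.
example = {
    "webhead-a": datetime(2000, 1, 1, 12, 0, 0),
    "webhead-b": datetime(2000, 1, 1, 12, 1, 10),
    "webhead-c": datetime(2000, 1, 1, 12, 3, 0),
}
gap_s, slowest = sync_gap(example)
print(f"gap: {gap_s:.0f}s, last to update: {slowest}")
```

Running this per push would show whether the 3 minute gap from comment 0 is typical or an outlier.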
Product: Release Engineering → Developer Services
Hal, is there anything more to do on this bug (e.g. from comment 5)? Pete
Flags: needinfo?(hwine)
Flags: needinfo?(hwine)
This sync information should be available in hg.log. Does that conclude the work necessary for this bug?
Closing this out; I don't think there's any value in keeping this open. If push times are still a problem, please file a new bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED