Bug 1036244 (Closed): Opened 10 years ago; Closed 10 years ago

Occasional mirror process hangs "corrupt" a web head - monitor and detect

Categories

Product: Developer Services
Component: Mercurial: hg.mozilla.org
Type: defect
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: hwine, Assigned: bkero)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/846] )

We've seen a couple of "hung" processes in the new hg webhead sync process. It would be great to monitor for that so we can find and fix them before they turn into intermittent failures for devs and CI.

Context from #releng today:

15:50 < glandium> bkero: i think i always get the the same webhead
15:50 < bkero> glandium: That should be pretty easy to confirm using the aforementioned method.
15:51 < glandium> bkero: https://pastebin.mozilla.org/5535061
15:51 < glandium> bkero: and that's a changeset from yesterday
15:53 < glandium> bkero: https://pastebin.mozilla.org/5535062 that's the most recent error i got
15:54 < bkero> glandium: found hgweb6 to be missing that changeset, all the others are 200
15:54 < bkero> glandium: Ah, a 'pull' operation got locked
15:57 < bkero> glandium: should be updated now, see if any changes
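For reference, a minimal sketch of the per-webhead check bkero describes above: request the same changeset URL from every head and flag anything that isn't a 200. The host names, repository path, and revision below are placeholder assumptions, not values from this bug.

#!/usr/bin/env python3
# Hedged sketch: ask every webhead for the same changeset URL and report the
# HTTP status, so a head that missed a replication event (like hgweb6 above)
# stands out. All names below are assumptions, not taken from this bug.
import urllib.error
import urllib.request

WEBHEADS = ["hgweb%d.example.net" % n for n in range(1, 11)]  # assumed naming
REPO = "mozilla-central"   # placeholder repository
REV = "0123456789ab"       # placeholder changeset hash

for host in WEBHEADS:
    url = "https://%s/%s/rev/%s" % (host, REPO, REV)
    try:
        status = urllib.request.urlopen(url, timeout=10).getcode()
    except urllib.error.HTTPError as exc:
        status = exc.code              # e.g. 404 on a head missing the changeset
    except OSError as exc:
        status = "error: %s" % exc     # connection failure, timeout, etc.
    print("%s -> %s" % (host, status))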
Per IRC, this was theoretically part of an old SSH configuration issue that was already fixed, but this webhead's daemon was likely never restarted, so this should never happen again. That said, I still agree that monitoring to verify that is good practice.
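As a sketch of what such monitoring could look like, assuming the hang shows up as a Mercurial lock file left behind by a stuck pull (as in the IRC log above); the paths and the age threshold are assumptions, not anything specified in this bug.

#!/usr/bin/env python3
# Hedged monitoring sketch: walk the repository store on a webhead and warn
# about Mercurial lock files older than a threshold, which is how a hung
# 'pull' like the one above would show up. Paths and threshold are assumptions.
import os
import sys
import time

REPO_ROOT = "/repo/hg/mozilla"   # assumed location of the mirrored repositories
MAX_LOCK_AGE = 15 * 60           # assume a lock older than 15 minutes is stuck

stale = []
now = time.time()
for dirpath, dirnames, filenames in os.walk(REPO_ROOT):
    if os.path.basename(dirpath) != ".hg":
        continue
    for name in ("wlock", os.path.join("store", "lock")):
        lock = os.path.join(dirpath, name)
        # hg lock files are usually dangling symlinks, so use lexists/lstat
        if os.path.lexists(lock) and now - os.lstat(lock).st_mtime > MAX_LOCK_AGE:
            stale.append(lock)

if stale:
    print("WARNING: stale Mercurial locks: %s" % ", ".join(stale))
    sys.exit(1)   # non-zero exit so a Nagios-style check can alert on it
print("OK: no stale locks under %s" % REPO_ROOT)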
Component: Server Operations: Developer Services → Mercurial: hg.mozilla.org
Product: mozilla.org → Developer Services
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/66]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/66] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/846] [kanban:engops:https://kanbanize.com/ctrl_board/6/66]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/846] [kanban:engops:https://kanbanize.com/ctrl_board/6/66] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/846]
Have we had any of these issues occur recently? IIRC this was fixed by disabling the SSH ControlSockets.
Assignee: server-ops-devservices → bkero
AIUI there haven't been any failures attributed to this for quite some time. As such, this is probably not an effective use of engineering resources unless that changes. This work will likely be done when the replication infrastructure is rewritten to be more robust.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME