Open Bug 1241406 Opened 8 years ago Updated 7 years ago

No push user information recorded and push not showing on TreeHerder

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

People

(Reporter: MattN, Unassigned)

Details

I pushed to try and got this at the end of the message (after most, if not all, of the usual success info):

> error: changegroup.vcsreplicator hook raised an exception: ProduceResponse(topic='pushdata', partition=4, error=6, offset=241385)

I didn't get a Try email and it didn't show on TH, so I pushed again[1]. The error message didn't appear that time, but I still didn't get a Try email, nor did the push show on TH.

When looking at [1] the push id, user, and date are all unknown so I think something is busted.

[1] https://hg.mozilla.org/try/rev/623a2dc067dc
The try push before mine was also missing push information: https://hg.mozilla.org/try/rev/7900096b4a12
I see the same thing happening to tnikkel's push before Matt's as well

https://hg.mozilla.org/try/rev/7900096b4a12


The push two pushes before Matt's is the last successful one

https://hg.mozilla.org/try/rev/4ee1e000746f

Something has broken in between these two commits.
The last successful push was at 2016-01-21 06:48:45 UTC and the failures started from 2016-01-21 06:58:27.
Note that mozilla-inbound seems to have sync issues too; gps is on it.
I resynced try and mozilla-inbound manually before playing it safe and scheduling a resync for all repos that have been pushed to in the past few days. That should finish up within minutes.

No data was lost AFAICT. The replication just didn't occur.

What I find strange is my IRC connection from people was killed (the irssi process died) around the same time this was reported. I suspect there was a larger network event that occurred.

glandium reported this from the IRC backlog (I assume the left-hand times are Japan time):

    15:53 <nagios-scl3> Wed 22:53:09 PST [5067] hgweb9.dmz.scl3.mozilla.com:Zookeeper - hg is WARNING: ENSEMBLE WARNING - node (hgssh1.dmz.scl3.mozilla.com) is alive but
                        not available (http://m.mozilla.org/Zookeeper+-+hg)
    15:53 <nagios-scl3> Wed 22:53:09 PST [5068] hgssh1.dmz.scl3.mozilla.com:Zookeeper - hg is WARNING: NODE CRITICAL - not in read/write mode: null
                        (http://m.mozilla.org/Zookeeper+-+hg)
    15:53 <nagios-scl3> Wed 22:53:19 PST [5069] hgweb6.dmz.scl3.mozilla.com:hg vcsreplicator lag is WARNING: WARNING - exception fetching offsets:
                        OffsetResponse(topic=pushdata, partition=1, error=6, offsets=()) (http://m.mozilla.org/hg+vcsreplicator+lag)
    15:53 <nagios-scl3> Wed 22:53:19 PST [5070] hgweb9.dmz.scl3.mozilla.com:hg vcsreplicator lag is WARNING: WARNING - exception fetching offsets:
                        OffsetResponse(topic=pushdata, partition=1, error=6, offsets=()) (http://m.mozilla.org/hg+vcsreplicator+lag)
    15:53 <nagios-scl3> Wed 22:53:19 PST [5071] hgweb10.dmz.scl3.mozilla.com:hg vcsreplicator lag is WARNING: WARNING - exception fetching offsets:
                        OffsetResponse(topic=pushdata, partition=0, error=6, offsets=()) (http://m.mozilla.org/hg+vcsreplicator+lag)
    15:53 <nagios-scl3> Wed 22:53:19 PST [5072] hgweb1.dmz.scl3.mozilla.com:hg vcsreplicator lag is WARNING: WARNING - exception fetching offsets:
                        OffsetResponse(topic=pushdata, partition=1, error=6, offsets=()) (http://m.mozilla.org/hg+vcsreplicator+lag)
    15:53 <nagios-scl3> Wed 22:53:19 PST [5073] hgweb7.dmz.scl3.mozilla.com:hg vcsreplicator lag is WARNING: WARNING - exception fetching offsets:
                        OffsetResponse(topic=pushdata, partition=0, error=6, offsets=()) (http://m.mozilla.org/hg+vcsreplicator+lag)
    15:54 <nagios-scl3> Wed 22:54:09 PST [5074] hgssh1.dmz.scl3.mozilla.com:Zookeeper - hg is OK: zookeeper node and ensemble OK (http://m.mozilla.org/Zookeeper+-+hg)
    15:54 <nagios-scl3> Wed 22:54:09 PST [5075] hgweb9.dmz.scl3.mozilla.com:Zookeeper - hg is OK: zookeeper node and ensemble OK (http://m.mozilla.org/Zookeeper+-+hg)
    15:54 <nagios-scl3> Wed 22:54:09 PST [5076] hgweb1.dmz.scl3.mozilla.com:hg vcsreplicator lag is OK: OK - 8/8 consumers completely in sync
                        (http://m.mozilla.org/hg+vcsreplicator+lag)
    15:54 <nagios-scl3> Wed 22:54:09 PST [5077] hgweb10.dmz.scl3.mozilla.com:hg vcsreplicator lag is OK: OK - 8/8 consumers completely in sync
                        (http://m.mozilla.org/hg+vcsreplicator+lag)
    15:54 <nagios-scl3> Wed 22:54:09 PST [5078] hgweb9.dmz.scl3.mozilla.com:hg vcsreplicator lag is OK: OK - 8/8 consumers completely in sync
                        (http://m.mozilla.org/hg+vcsreplicator+lag)
    15:54 <nagios-scl3> Wed 22:54:09 PST [5079] hgweb6.dmz.scl3.mozilla.com:hg vcsreplicator lag is OK: OK - 8/8 consumers completely in sync
                        (http://m.mozilla.org/hg+vcsreplicator+lag)
    15:54 <nagios-scl3> Wed 22:54:09 PST [5080] hgweb7.dmz.scl3.mozilla.com:hg vcsreplicator lag is OK: OK - 8/8 consumers completely in sync
                        (http://m.mozilla.org/hg+vcsreplicator+lag) 
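
For anyone lining up the timestamps: a rough sketch that converts the first Nagios alert above to UTC and compares it with the push times quoted earlier in this bug. The calendar date of the alert is an assumption (the log only says "Wed"; 2016-01-20 is inferred from the push timestamps), and the fixed UTC offsets for PST and JST are hard-coded rather than looked up.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets; neither zone was observing DST in January 2016.
PST = timezone(timedelta(hours=-8), "PST")
JST = timezone(timedelta(hours=9), "JST")

# First Nagios alert in the log: "Wed 22:53:09 PST".
# Assumption: "Wed" is 2016-01-20, inferred from the push times below.
alert = datetime(2016, 1, 20, 22, 53, 9, tzinfo=PST)

# Push times quoted earlier in this bug (UTC).
last_good_push = datetime(2016, 1, 21, 6, 48, 45, tzinfo=timezone.utc)
first_bad_push = datetime(2016, 1, 21, 6, 58, 27, tzinfo=timezone.utc)

alert_utc = alert.astimezone(timezone.utc)
alert_jst = alert.astimezone(JST)

print(alert_utc.isoformat())        # 2016-01-21T06:53:09+00:00
print(alert_jst.strftime("%H:%M"))  # 15:53, matching the IRC client timestamps
print(last_good_push < alert_utc < first_bad_push)  # True
```

Under those assumptions the Zookeeper warning falls between the last good push and the first failed one, and the 15:53 column is consistent with the backlog being in Japan time.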


Error 6 matches what MattN posted.
Error 6 is NOT_LEADER_FOR_PARTITION.
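
A minimal sketch of how the numeric code in these messages can be turned into a name. The table is a small subset of the Kafka protocol error codes, and `describe_error` and its regex are hypothetical helpers written for this comment; they are not part of vcsreplicator or of any Kafka client library.

```python
import re

# Subset of Kafka protocol error codes (names as used by the protocol at the time).
KAFKA_ERRORS = {
    0: "NONE",
    1: "OFFSET_OUT_OF_RANGE",
    3: "UNKNOWN_TOPIC_OR_PARTITION",
    5: "LEADER_NOT_AVAILABLE",
    6: "NOT_LEADER_FOR_PARTITION",
    7: "REQUEST_TIMED_OUT",
}

# Matches both forms seen in this bug: topic='pushdata' and topic=pushdata.
FIELD_RE = re.compile(
    r"topic='?(?P<topic>\w+)'?, partition=(?P<partition>\d+), error=(?P<error>\d+)"
)


def describe_error(line):
    """Return (topic, partition, error name) for a response string, or None."""
    match = FIELD_RE.search(line)
    if match is None:
        return None
    code = int(match.group("error"))
    name = KAFKA_ERRORS.get(code, f"UNKNOWN({code})")
    return match.group("topic"), int(match.group("partition")), name


hook_error = (
    "error: changegroup.vcsreplicator hook raised an exception: "
    "ProduceResponse(topic='pushdata', partition=4, error=6, offset=241385)"
)
nagios_error = "OffsetResponse(topic=pushdata, partition=1, error=6, offsets=())"

print(describe_error(hook_error))    # ('pushdata', 4, 'NOT_LEADER_FOR_PARTITION')
print(describe_error(nagios_error))  # ('pushdata', 1, 'NOT_LEADER_FOR_PARTITION')
```

Both the push hook failure and the Nagios offset checks report the same code, on different partitions of the `pushdata` topic.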
We also encountered paragraph #2 of https://mozilla-version-control-tools.readthedocs.org/en/latest/hgmo/replication.html#data-loss for the first time. I guess that's not a theoretical limitation any more. It was bound to happen sometime. *sigh*
Component: Mercurial: Pushlog → Mercurial: hg.mozilla.org
QA Contact: hwine
QA Contact: hwine → klibby