Closed Bug 524798 Opened 15 years ago Closed 15 years ago

Firefox 3.6, 3.7 l10n repacks on push not building since Oct 4th

Categories

(Release Engineering :: General, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: coop)

Details

(Whiteboard: [l10n])

Both on mozilla-central and on 1.9.2, the repack-on-change builders don't seem to be picking up pushes. I don't see any builds after Oct 4th for any of the "... build" builders.
Whiteboard: [l10n]
Verified; the builders are not picking up any changes.

I have grepped the twistd logs and there is nothing obvious.
Unless somebody else can pick this up (please do so), I can *probably* get to it today or tomorrow.
Grabbing for now, but Armen may take this from me tomorrow.
Assignee: nobody → ccooper
Status: NEW → ASSIGNED
Priority: -- → P2
I'm seeing some timeouts in the logs. Not sure whether that would tank the whole scheduler or not.

2009-10-28 05:54:01-0700 [-] <HgLocalePoller for http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/ar>: polling failed, result Getting http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/ar/pushlog?fromchange=f211a1ffb498594947c7e1ab6a7a5d2c55066ea2 took longer than 30 seconds.
2009-10-28 05:54:01-0700 [-] Traceback (most recent call last):
2009-10-28 05:54:01-0700 [-] Failure: twisted.internet.defer.TimeoutError: Getting http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/ar/pushlog?fromchange=f211a1ffb498594947c7e1ab6a7a5d2c55066ea2 took longer than 30 seconds.
2009-10-28 05:54:01-0700 [-] <HgLocalePoller for http://hg.mozilla.org/l10n-central/as>: polling failed, result
2009-10-28 05:54:01-0700 [-] Traceback (most recent call last):
2009-10-28 05:54:01-0700 [-] Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.
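
[Editorial note] For reference, the defensive pattern relevant to the question in comment #3 ("would that tank the whole scheduler?") looks roughly like this in Twisted: if each poll's failure is consumed by an errback, the polling loop survives a timeout and simply tries again on the next interval. This is a hedged sketch, not the actual HgLocalePoller implementation; the pushlog URL is copied from the log above purely as sample input, and it uses the 2009-era twisted.web.client.getPage API.

    # Sketch only: per-poll failures are logged and swallowed so a single
    # pushlog timeout cannot stop the polling loop.
    from twisted.internet import reactor, task
    from twisted.python import log
    from twisted.web.client import getPage

    PUSHLOG = ("http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/ar/pushlog"
               "?fromchange=f211a1ffb498594947c7e1ab6a7a5d2c55066ea2")

    def poll():
        d = getPage(PUSHLOG, timeout=30)
        d.addCallback(lambda body: log.msg("pushlog: %d bytes" % len(body)))
        # log.err consumes the Failure, so the LoopingCall keeps running
        # at the next interval instead of stopping on the first timeout.
        d.addErrback(log.err, "polling failed")
        return d

    if __name__ == "__main__":
        import sys
        log.startLogging(sys.stdout)
        task.LoopingCall(poll).start(60)  # poll every 60 seconds
        reactor.run()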
FWIW, repack-on-change seems to be working in staging... lots of builds since Oct 4th.

Maybe we're seeing load issues on production master?
We've had free slaves on pm if that's what you mean, but it's certainly busy since we added debug unittests and split mochitest-plain. There's also been one "Firefox mozilla-1.9.2 win32 l10n" build pending on pm for quite a while; perhaps that's wedging it?
(In reply to comment #5)
> We've had free slaves on pm if that's what you mean, but it's certainly busy
> since we added debug unittests and split mochitest-plain. There's also been one
> "Firefox mozilla-1.9.2 win32 l10n" build pending on pm for quite a while;
> perhaps that's wedging it?

I was being non-specific about "load" until I narrow it down. ;) I was specifically worried about network congestion here, due to the recent timeout when pulling an l10n pushlog (comment #3).
My suspicion is that there was a network problem at the time of the master start/config.

If there is a problem loading the l10n.ini's, the Dispatchers don't get added to the Scheduler, and thus it doesn't listen for changes.

Not sure when the master got its last kicks.
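
[Editorial note] To make the failure mode in comment #7 concrete, here is a heavily simplified sketch. All class, method, and attribute names are hypothetical, not the buildbotcustom code: if the l10n.ini load throws during master start or reconfig, the dispatcher table stays empty and every later change is silently dropped.

    # Sketch of the failure mode: a load error leaves dispatchers empty,
    # so addChange() ignores every subsequent push until the next
    # successful (re)load. Purely illustrative.
    class L10nSchedulerSketch(object):
        def __init__(self, ini_paths):
            self.dispatchers = {}
            for path in ini_paths:
                try:
                    locales = self.loadLocales(path)  # may hit the network
                except Exception:
                    # the scheduler stays up, but with nothing wired in;
                    # this matches the state the production master seems to be in
                    continue
                for locale in locales:
                    self.dispatchers[locale] = path

        def loadLocales(self, path):
            # stand-in for parsing l10n.ini / all-locales
            with open(path) as f:
                return [line.strip() for line in f if line.strip()]

        def addChange(self, locale, changeset):
            if locale not in self.dispatchers:
                return  # change dropped: no dispatcher, so no repack requested
            print("would request a repack of %s for %s" % (locale, changeset))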
(In reply to comment #7)
> My suspicion is that there was a network problem at the time of the master
> start/config.
> 
> If there is a problem loading the l10n.ini's, the Dispatchers don't get added
> to the Scheduler, and thus it doesn't listen for changes.
> 
> Not sure when the master got its last kicks.

AFAICT the last full restart happened on Sep 24, as witnessed by https://wiki.mozilla.org/ReleaseEngineering:Maintenance and the reported age of the buildbot master process on production-master. We've had many reconfigs since then though, including some on Oct 4-5.

Would a single bad reconfig where the l10n.inis fail to load cause the problem to persist until the next restart?
I just scheduled some downtime for tomorrow (7am EDT) to restart this master and resurrect the scheduler.

Axel: is there any way we could make this more robust? Multiple initial loading attempts? Periodic retries if there's nothing set up?
Yeah, probably there are ways to make this more robust. And it might be that l10n.ini failures don't recover on reconfig, but fixing that seems hard.

I didn't find a good way to report an error, fwiw: if something is bad, you might just not have a builder to which you can hook an error message. So the best way to tell right now is to look at the waterfall and make sure that the l10n builders (the on-demand ones) report that the tree is configured.

I just cross-checked: the reconfigs on the other master got loaded.
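
[Editorial note] One possible shape for the "periodic retries" idea from comment #9, sketched with plain reactor.callLater. The scheduler.loadInis and scheduler.dispatchers names are placeholders, not existing buildbotcustom APIs, and the retry interval is an assumed value.

    # Sketch: keep retrying the l10n.ini load until at least one
    # dispatcher exists, instead of giving up after one bad (re)config.
    from twisted.internet import reactor
    from twisted.python import log

    RETRY_INTERVAL = 300  # seconds between attempts; assumed value

    def ensureDispatchers(scheduler, ini_paths):
        if scheduler.dispatchers:
            return  # already configured; nothing to do
        try:
            scheduler.loadInis(ini_paths)  # hypothetical loader
        except Exception:
            # log.err with no Failure records the current exception
            log.err(None, "l10n.ini load failed, will retry")
        if not scheduler.dispatchers:
            reactor.callLater(RETRY_INTERVAL, ensureDispatchers,
                              scheduler, ini_paths)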
(In reply to comment #10)
> I didn't find a good way to report an error, fwiw: if something is bad, you
> might just not have a builder to which you can hook an error message. So the
> best way to tell right now is to look at the waterfall and make sure that the
> l10n builders (the on-demand ones) report that the tree is configured.

Armen had a good suggestion about having any exceptions from the master twistd.log files mailed immediately to releng so things like this wouldn't go unnoticed for so long that we lose the logs needed to diagnose them.

Any such process to pull out exceptions would have to be extremely lightweight though to avoid bogging down the already-slow masters. We generate twisted logs at quite a rate.
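
[Editorial note] A rough sketch of what such a watcher could look like, kept deliberately lightweight: it only tails the file line by line and buffers a traceback until its final exception line. The log path, addresses, and SMTP host are placeholders, not real releng infrastructure.

    # Sketch of the suggestion above: follow twistd.log and mail any
    # traceback it contains. Paths and addresses are placeholders.
    import smtplib
    import time
    from email.mime.text import MIMEText

    LOG_PATH = "/builds/buildbot/master/twistd.log"  # assumed location
    MAIL_TO = "release@example.com"                  # placeholder address
    MAIL_FROM = "buildbot@example.com"               # placeholder address

    def mail_block(block):
        msg = MIMEText(block)
        msg["Subject"] = "buildbot master exception"
        msg["From"] = MAIL_FROM
        msg["To"] = MAIL_TO
        server = smtplib.SMTP("localhost")
        server.sendmail(MAIL_FROM, [MAIL_TO], msg.as_string())
        server.quit()

    def follow(path):
        """Yield lines appended to the file, a simplified `tail -F`."""
        with open(path) as f:
            f.seek(0, 2)  # start at the end; only new lines matter
            while True:
                line = f.readline()
                if not line:
                    time.sleep(1)
                    continue
                yield line

    def watch():
        pending = []
        for line in follow(LOG_PATH):
            if "Traceback (most recent call last):" in line:
                pending = [line]
            elif pending:
                pending.append(line)
                # twistd-style Failure tracebacks end with the exception line
                if "Error" in line or "Failure:" in line or len(pending) > 50:
                    mail_block("".join(pending))
                    pending = []

    if __name__ == "__main__":
        watch()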
Seems that coop restarted the master, but neither the central nor the 1.9.2 builders seem to indicate that the dispatchers went up.

Can someone from releng attach the relevant twistd.log from that restart for reference and debugging?
Someone must have accidentally removed the symlink to l10nbuilds1.ini, probably on or around Oct 4. Re-creating the symlink got things working again.
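
[Editorial note] For the record, a check like the following at master start/reconfig would turn a missing or dangling l10nbuilds1.ini symlink into a loud failure instead of a silent loss of the l10n dispatchers. This is an illustrative sketch only, not something that exists in the actual master.cfg.

    # Sketch: fail loudly at (re)config time if the ini symlink is gone.
    import os

    def check_l10n_ini(path="l10nbuilds1.ini"):
        if os.path.islink(path) and not os.path.exists(path):
            raise RuntimeError("%s is a dangling symlink" % path)
        if not os.path.exists(path):
            raise RuntimeError("%s is missing; l10n repack-on-change builders "
                               "will not be scheduled" % path)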
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering