Closed Bug 524798 Opened 15 years ago Closed 15 years ago

Firefox 3.6, 3.7 l10n repacks on push not building since Oct 4th

Categories

(Release Engineering :: General, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: coop)

Details

(Whiteboard: [l10n])

Both on mozilla-central and on 1.9.2, the repack-on-change builders don't seem to be picking up pushes. I don't see any builds after Oct 4th for any of the "... build" builders.
Whiteboard: [l10n]
Verified; the builders are not picking up any changes.

I have grepped the twistd logs and there is nothing obvious.
Unless somebody else can pick this up (please do so), I can *probably* get to it today or tomorrow.
Grabbing for now, but Armen may take this from me tomorrow.
Assignee: nobody → ccooper
Status: NEW → ASSIGNED
Priority: -- → P2
I'm seeing some timeouts in the logs. Not sure whether that would tank the whole scheduler or not.

2009-10-28 05:54:01-0700 [-] <HgLocalePoller for http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/ar>: polling failed, result Getting http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/ar/pushlog?fromchange=f211a1ffb498594947c7e1ab6a7a5d2c55066ea2 took longer than 30 seconds.
2009-10-28 05:54:01-0700 [-] Traceback (most recent call last):
2009-10-28 05:54:01-0700 [-] Failure: twisted.internet.defer.TimeoutError: Getting http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/ar/pushlog?fromchange=f211a1ffb498594947c7e1ab6a7a5d2c55066ea2 took longer than 30 seconds.
2009-10-28 05:54:01-0700 [-] <HgLocalePoller for http://hg.mozilla.org/l10n-central/as>: polling failed, result
2009-10-28 05:54:01-0700 [-] Traceback (most recent call last):
2009-10-28 05:54:01-0700 [-] Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.
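
[Editorial note] For reference, the defensive pattern relevant to the question in comment #3 ("would that tank the whole scheduler?") looks roughly like this in Twisted: if each poll's failure is consumed by an errback, the polling loop survives a timeout and simply tries again on the next interval. This is a hedged sketch, not the actual HgLocalePoller implementation; the pushlog URL is copied from the log above purely as sample input, and it uses the 2009-era twisted.web.client.getPage API.

    # Sketch only: per-poll failures are logged and swallowed so a single
    # pushlog timeout cannot stop the polling loop.
    from twisted.internet import reactor, task
    from twisted.python import log
    from twisted.web.client import getPage

    PUSHLOG = ("http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/ar/pushlog"
               "?fromchange=f211a1ffb498594947c7e1ab6a7a5d2c55066ea2")

    def poll():
        d = getPage(PUSHLOG, timeout=30)
        d.addCallback(lambda body: log.msg("pushlog: %d bytes" % len(body)))
        # log.err consumes the Failure, so the LoopingCall keeps running
        # at the next interval instead of stopping on the first timeout.
        d.addErrback(log.err, "polling failed")
        return d

    if __name__ == "__main__":
        import sys
        log.startLogging(sys.stdout)
        task.LoopingCall(poll).start(60)  # poll every 60 seconds
        reactor.run()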
FWIW, repack-on-change seems to be working in staging... lots of builds since Oct 4th.

Maybe we're seeing load issues on production master?
We've had free slaves on pm if that's what you mean, but it's certainly busy since we added debug unittests and split mochitest-plain. There's also been one "Firefox mozilla-1.9.2 win32 l10n" build pending on pm for quite a while; perhaps that's wedging it?
(In reply to comment #5)
> We've had free slaves on pm if that's what you mean, but it's certainly busy
> since we added debug unittests and split mochitest-plain. There's also been one
> "Firefox mozilla-1.9.2 win32 l10n" build pending on pm for quite a while;
> perhaps that's wedging it?

I was being non-specific about "load" until I narrow it down. ;) I was specifically worried about network congestion here, due to the recent timeout when pulling an l10n pushlog (comment #3).
My suspicion is that there was a network problem at the time of the master start/config.

If there is a problem loading the l10n.ini's, the Dispatchers don't get added to the Scheduler, and thus it doesn't listen for changes.

Not sure when the master got its last kicks.
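
[Editorial note] To make the failure mode in comment #7 concrete, here is a heavily simplified sketch. All class, method, and attribute names are hypothetical, not the buildbotcustom code: if the l10n.ini load throws during master start or reconfig, the dispatcher table stays empty and every later change is silently dropped.

    # Sketch of the failure mode: a load error leaves dispatchers empty,
    # so addChange() ignores every subsequent push until the next
    # successful (re)load. Purely illustrative.
    class L10nSchedulerSketch(object):
        def __init__(self, ini_paths):
            self.dispatchers = {}
            for path in ini_paths:
                try:
                    locales = self.loadLocales(path)  # may hit the network
                except Exception:
                    # the scheduler stays up, but with nothing wired in;
                    # this matches the state the production master seems to be in
                    continue
                for locale in locales:
                    self.dispatchers[locale] = path

        def loadLocales(self, path):
            # stand-in for parsing l10n.ini / all-locales
            with open(path) as f:
                return [line.strip() for line in f if line.strip()]

        def addChange(self, locale, changeset):
            if locale not in self.dispatchers:
                return  # change dropped: no dispatcher, so no repack requested
            print("would request a repack of %s for %s" % (locale, changeset))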
(In reply to comment #7)
> My suspicion is that there was a network problem at the time of the master
> start/config.
> 
> If there is a problem loading the l10n.ini's, the Dispatchers don't get added
> to the Scheduler, and thus it doesn't listen for changes.
> 
> Not sure when the master got its last kicks.

AFAICT the last full restart happened on Sep 24, as witnessed by https://wiki.mozilla.org/ReleaseEngineering:Maintenance and the reported age of the buildbot master process on production-master. We've had many reconfigs since then though, including some on Oct 4-5.

Would a single bad reconfig where the l10n.inis fail to load cause the problem to persist until the next restart?
I just scheduled some downtime for tomorrow (7am EDT) to restart this master and resurrect the scheduler.

Axel: is there any way we could make this more robust? Multiple initial loading attempts? Periodic retries if there's nothing set up?
Yeah, probably there are ways to make this more robust. And it might be that l10n.ini failures don't recover on reconfig, but fixing that seems hard.

I didn't find a good way to report an error, fwiw: if something is bad, you might just not have a builder to which you can hook an error message. So the best way to tell right now is to look at the waterfall and make sure that the l10n builders (the on-demand ones) report that the tree is configured.

I just cross-checked: the reconfigs on the other master got loaded.
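
[Editorial note] One possible shape for the "periodic retries" idea from comment #9, sketched with plain reactor.callLater. The scheduler.loadInis and scheduler.dispatchers names are placeholders, not existing buildbotcustom APIs, and the retry interval is an assumed value.

    # Sketch: keep retrying the l10n.ini load until at least one
    # dispatcher exists, instead of giving up after one bad (re)config.
    from twisted.internet import reactor
    from twisted.python import log

    RETRY_INTERVAL = 300  # seconds between attempts; assumed value

    def ensureDispatchers(scheduler, ini_paths):
        if scheduler.dispatchers:
            return  # already configured; nothing to do
        try:
            scheduler.loadInis(ini_paths)  # hypothetical loader
        except Exception:
            # log.err with no Failure records the current exception
            log.err(None, "l10n.ini load failed, will retry")
        if not scheduler.dispatchers:
            reactor.callLater(RETRY_INTERVAL, ensureDispatchers,
                              scheduler, ini_paths)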
(In reply to comment #10)
> I didn't find a good way to report an error, fwiw: if something is bad, you
> might just not have a builder to which you can hook an error message. So the
> best way to tell right now is to look at the waterfall and make sure that the
> l10n builders (the on-demand ones) report that the tree is configured.

Armen had a good suggestion about having any exceptions from the master twistd.log files mailed immediately to releng so things like this wouldn't go unnoticed for so long that we lose the logs needed to diagnose them.

Any such process to pull out exceptions would have to be extremely lightweight though to avoid bogging down the already-slow masters. We generate twisted logs at quite a rate.
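
[Editorial note] A rough sketch of what such a watcher could look like, kept deliberately lightweight: it only tails the file line by line and buffers a traceback until its final exception line. The log path, addresses, and SMTP host are placeholders, not real releng infrastructure.

    # Sketch of the suggestion above: follow twistd.log and mail any
    # traceback it contains. Paths and addresses are placeholders.
    import smtplib
    import time
    from email.mime.text import MIMEText

    LOG_PATH = "/builds/buildbot/master/twistd.log"  # assumed location
    MAIL_TO = "release@example.com"                  # placeholder address
    MAIL_FROM = "buildbot@example.com"               # placeholder address

    def mail_block(block):
        msg = MIMEText(block)
        msg["Subject"] = "buildbot master exception"
        msg["From"] = MAIL_FROM
        msg["To"] = MAIL_TO
        server = smtplib.SMTP("localhost")
        server.sendmail(MAIL_FROM, [MAIL_TO], msg.as_string())
        server.quit()

    def follow(path):
        """Yield lines appended to the file, a simplified `tail -F`."""
        with open(path) as f:
            f.seek(0, 2)  # start at the end; only new lines matter
            while True:
                line = f.readline()
                if not line:
                    time.sleep(1)
                    continue
                yield line

    def watch():
        pending = []
        for line in follow(LOG_PATH):
            if "Traceback (most recent call last):" in line:
                pending = [line]
            elif pending:
                pending.append(line)
                # twistd-style Failure tracebacks end with the exception line
                if "Error" in line or "Failure:" in line or len(pending) > 50:
                    mail_block("".join(pending))
                    pending = []

    if __name__ == "__main__":
        watch()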
Seems that coop restarted the master, but neither the central nor the 1.9.2 builders seem to indicate that the dispatchers went up.

Can someone from releng attach the relevant twistd.log from that restart for reference and debugging?
Someone must have accidentally removed the symlink to l10nbuilds1.ini, probably on or around Oct 4. Re-creating the symlink got things working again.
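
[Editorial note] For the record, a check like the following at master start/reconfig would turn a missing or dangling l10nbuilds1.ini symlink into a loud failure instead of a silent loss of the l10n dispatchers. This is an illustrative sketch only, not something that exists in the actual master.cfg.

    # Sketch: fail loudly at (re)config time if the ini symlink is gone.
    import os

    def check_l10n_ini(path="l10nbuilds1.ini"):
        if os.path.islink(path) and not os.path.exists(path):
            raise RuntimeError("%s is a dangling symlink" % path)
        if not os.path.exists(path):
            raise RuntimeError("%s is missing; l10n repack-on-change builders "
                               "will not be scheduled" % path)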
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering