Closed
Bug 842172
Opened 13 years ago
Closed 12 years ago
MXR seems not to be updating
Categories
(Webtools Graveyard :: MXR, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Ms2ger, Assigned: fubar)
Details
Mozilla-central seems to be stuck on Wednesday.
Comment 1•13 years ago
Going to lean on webops to help. TIA guys.
Assignee: server-ops-devservices → server-ops-webops
Component: Server Operations: Developer Services → Server Operations: Web Operations
QA Contact: shyam → nmaul
Comment 2•13 years ago
This is causing major problems for development.
Btw, could we add some checks that mozilla-central on mxr gets updated on time? If the checks fail, an email could be sent to someone, or the tree could be closed, or something like that.
Severity: normal → major
Comment 3•13 years ago
The US is on holiday today and mxr has been getting hammered by an external attack (which should now be mitigated), but I'm not sure there's anything else I can do in the meantime.
Assignee: server-ops-webops → pradcliffe
Comment 4•13 years ago
Can someone give me a search that should return results but does not? Not being a developer, I don't know what would be missing.
Comment 5•13 years ago
Killed some stuck jobs from the 15th, this may help.
Comment 6•13 years ago
(stuck jobs killed on mxr-processor1.private.scl3, as per https://mana.mozilla.org/wiki/display/websites/mxr.mozilla.org)
Updated•13 years ago
Assignee: pradcliffe → eziegenhorn
Comment 8•13 years ago
i think we really need to try to track down the root cause here. next time we see a "hung" cron, i want to run a `strace` on the process and see if it helps point us in the right direction. going to keep this bug open for now to monitor.
Comment 9•13 years ago
It hasn't updated in over a day
Comment 10•13 years ago
there are a number of crons that have active processes. the oldest of these is:
root 32552 0.0 0.0 106096 1148 ? S Feb18 0:00 /bin/bash /root/bin/mxr-daily-cron
the following is an lsof and strace of this pid:
[root@mxr-processor1.private.scl3 ~]# lsof -n -p 32552
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
bash 32552 root cwd DIR 8,3 4096 2490372 /data/www/mxr.mozilla.org
bash 32552 root rtd DIR 8,3 4096 2 /
bash 32552 root txt REG 8,3 938736 395889 /bin/bash
bash 32552 root mem REG 8,3 156872 3407879 /lib64/ld-2.12.so
bash 32552 root mem REG 8,3 1922112 3407886 /lib64/libc-2.12.so
bash 32552 root mem REG 8,3 22536 3407901 /lib64/libdl-2.12.so
bash 32552 root mem REG 8,3 138280 3408204 /lib64/libtinfo.so.5.7
bash 32552 root mem REG 8,3 99158576 274070 /usr/lib/locale/locale-archive
bash 32552 root mem REG 8,3 26060 264940 /usr/lib64/gconv/gconv-modules.cache
bash 32552 root 0u CHR 136,0 0t0 3 /dev/pts/0 (deleted)
bash 32552 root 1u CHR 136,0 0t0 3 /dev/pts/0 (deleted)
bash 32552 root 2u CHR 136,0 0t0 3 /dev/pts/0 (deleted)
bash 32552 root 255r REG 8,3 2019 131136 /root/bin/mxr-daily-cron
[root@mxr-processor1.private.scl3 ~]# strace -p 32552
Process 32552 attached - interrupt to quit
wait4(-1,
Comment 11•13 years ago
This is un-stuck.
The hanging trees in this case were l10n-central (one of the 4hour jobs, blocking mozilla-central, comm-central, mozilla, l10n, and mobile-browser) and addons (daily, blocking aurora/beta/release, b2g18, and various other trees).
The l10n-central one was hanging on "hg pull -r default" in the "hu" locale. I don't know if there's anything interesting here, or whether it was just chance that it was this locale and not some other one. I don't know what we can do here. Perhaps :bkero can weigh in?
default = http://hg.mozilla.org/l10n-central/hu/
mxr-processor1.private.scl3:/data/mxr-data/l10n-central/l10n-central/hu
I ran it by hand just now, and it completed quickly without issue.
The addons one was hanging trying to update its local addons list. This is a very weird process, and I can't say with any certainty why it was hanging. Next time (if there is a next time) I'll try to get more data on it.
I have opened bug 843740 to try and get some monitoring set up, so we'll hopefully be able to catch these situations in the future before users notice the extra delay.
It might be possible to restructure how we manage lockfiles here, and lock each tree individually. That way one tree cannot block any others from updating. This won't fix the underlying problem, but it will minimize damage if or when something does break.
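The per-tree locking idea above could look roughly like the following. This is only a sketch in Python, not the MXR scripts themselves (which are shell and Perl); the function name `tree_lock` and the one-file-per-tree layout are assumptions made for illustration. Each tree gets its own lockfile, so a hung update on one tree only blocks later runs of that same tree.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def tree_lock(lock_dir, tree):
    """Hold an exclusive lock for a single tree (hypothetical helper).

    Raises BlockingIOError immediately if that tree is already locked,
    instead of queueing behind a possibly hung job.
    """
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, tree + ".lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # LOCK_NB makes the call fail right away rather than wait.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield path
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

With this shape, a cron run would wrap each tree's update in `with tree_lock(lock_dir, tree):` and skip the tree on `BlockingIOError`, leaving the other trees free to update.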
Comment 12•13 years ago
i am going to mark this as r/fixed since i believe this has been functioning as expected since february. if i am incorrect and there are outstanding issues here, please feel free to reopen and give me heck.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 13•12 years ago
mxr hung up again.
I got alerted with:
<nagios-scl3:#sysadmins> Wed 04:16:43 PDT [5417]
mxr-processor1.private.scl3.mozilla.com:File Age - /var/lock/mxr/short is
CRITICAL: CRITICAL: 1dir(s) -- /var/lock/mxr/short/l10n-central.lock:
44197secs /var/lock/mxr/short: 3 files
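For readers unfamiliar with this kind of check: the alert above fires when a lockfile is older than some threshold. The actual Nagios plugin and its threshold are not shown in this bug, so the following is only a sketch of the idea in Python; `stale_locks` and the `.lock` suffix filter are assumptions for illustration.

```python
import os
import time


def stale_locks(lock_dir, max_age_secs, now=None):
    """Return (path, age_in_seconds) for lockfiles older than max_age_secs.

    Hypothetical helper; age is measured from the file's mtime.
    """
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(lock_dir)):
        if not name.endswith(".lock"):
            continue
        path = os.path.join(lock_dir, name)
        try:
            age = now - os.stat(path).st_mtime
        except FileNotFoundError:
            # The job finished and removed its lock between listdir and stat.
            continue
        if age > max_age_secs:
            stale.append((path, int(age)))
    return stale
```

A monitoring wrapper could report CRITICAL whenever the returned list is non-empty for a directory such as /var/lock/mxr/short.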
From the looks of the age of hung cronjobs it seems to have gotten stuck yesterday.
root 6464 0.0 0.0 9232 984 ? Ss Sep24 0:00 /bin/bash /root/bin/mxr-4hour-cron
root 6468 0.0 0.0 78788 2500 ? S Sep24 0:00 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
root 6470 0.0 0.0 48552 9572 ? S Sep24 0:03 perl /usr/bin/parallel --gnu -j-1 echo -n "{}: Starting " && date && nice -n 19 ./update-full-onetree.sh -cron {}; echo -n "{}: Ending " && date
root 6471 0.0 0.0 78752 3212 ? S Sep24 0:00 /usr/sbin/postdrop -r
root 6510 0.0 0.0 9232 1100 ? S Sep24 0:00 /bin/sh -c echo -n "l10n-central: Starting " && date && nice -n 19 ./update-full-onetree.sh -cron l10n-central; echo -n "l10n-central: Ending " && date
root 6515 0.0 0.0 9232 1196 ? SN Sep24 0:00 /bin/sh ./update-full-onetree.sh -cron l10n-central
root 6574 0.0 0.0 9232 592 ? SN Sep24 0:00 /bin/sh ./update-full-onetree.sh -cron l10n-central
root 6575 0.0 0.0 44584 7592 ? SN Sep24 0:00 perl update-src.pl -cron l10n-central
root 6617 0.0 0.0 9228 1096 ? SN Sep24 0:00 sh -c cd /data/mxr-data/l10n-central/l10n-central/ak; hg pull -r default 2>&1 || pwd; hg update --clean 2>&1; hg parents --template="{node|short}\n"
root 6618 0.0 0.0 85100 10544 ? SN Sep24 0:00 /usr/bin/python /usr/bin/hg pull -r default
root 23630 0.0 0.0 107888 13144 ? SN 03:26 0:00 ruby /usr/sbin/mcollectived --pid=/var/run/mcollectived.pid --config=/etc/mcollective/server.cfg
root 27856 0.0 0.0 140112 1364 ? S 04:00 0:00 CROND
root 27860 0.0 0.0 9232 1116 ? Ss 04:00 0:00 /bin/bash /root/bin/mxr-4hour-cron
root 27863 0.0 0.0 78788 3232 ? S 04:00 0:00 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
root 27865 0.0 0.0 48552 10884 ? S 04:00 0:00 perl /usr/bin/parallel --gnu -j-1 echo -n "{}: Starting " && date && nice -n 19 ./update-full-onetree.sh -cron {}; echo -n "{}: Ending " && date
root 27866 0.0 0.0 78752 3212 ? S 04:00 0:00 /usr/sbin/postdrop -r
root 27898 0.0 0.0 9232 1096 ? S 04:00 0:00 /bin/sh -c echo -n "comm-central: Starting " && date && nice -n 19 ./update-full-onetree.sh -cron comm-central; echo -n "comm-central: Ending " && date
root 27901 0.0 0.0 9232 1204 ? SN 04:00 0:00 /bin/sh ./update-full-onetree.sh -cron comm-central
I killed 6618 which seemed to let things carry on for now.
Can we not have a timeout on these processes so they get killed if they overrun by silly amounts? I know mxr hasn't been as problematic recently but this seems like a problem that can be automated rather than us going in and killing processes every time it happens.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•12 years ago
Assignee: cturra → klibby
Updated•12 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Comment 14•12 years ago
TIL man 1 timeout
added to update-src.pl in 76ae30bc7fd7.
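The contents of changeset 76ae30bc7fd7 are not shown in this bug, so the exact command line and time limit are unknown. As a rough sketch of the same idea in Python (not the Perl actually changed), a long-running child such as "hg pull -r default" can be given a hard time limit so that a hang turns into a failure instead of blocking the cron job. `run_with_timeout` is a hypothetical name.

```python
import subprocess


def run_with_timeout(cmd, seconds, cwd=None):
    """Run cmd with a hard time limit (hypothetical helper).

    Returns (returncode, output). If the limit is exceeded, the child is
    killed and returncode is None.
    """
    try:
        proc = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=seconds,  # subprocess.run kills the child on expiry
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        return None, exc.output
```

A caller could treat a `None` return code as "skip this repository for now and try again on the next run".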
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Component: WebOps: Other → MXR
Product: Infrastructure & Operations → Webtools
QA Contact: nmaul
Resolution: --- → FIXED
Updated•6 years ago
Product: Webtools → Webtools Graveyard