Closed Bug 842172 Opened 13 years ago Closed 12 years ago

MXR seems not to be updating

Categories

(Webtools Graveyard :: MXR, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Ms2ger, Assigned: fubar)

Details

Mozilla-central seems to be stuck on Wednesday.
Going to lean on webops to help. TIA guys.
Assignee: server-ops-devservices → server-ops-webops
Component: Server Operations: Developer Services → Server Operations: Web Operations
QA Contact: shyam → nmaul
This is causing major problems for development. Btw, could we add some checks that mozilla-central mxr gets updated on time? If the checks fail, an email could be sent to someone, or the tree could be closed, or something like that.
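A minimal sketch of the kind of freshness check requested above, assuming that "updated on time" can be approximated by the modification time of some file or directory the indexer touches. The function names, the path, and the threshold are hypothetical and not taken from the MXR setup.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds ago `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def check_freshness(path, max_age_seconds, now=None):
    """Return (ok, message) in the style of a monitoring check.

    `path` is whatever marker the update job touches (an assumption here).
    """
    try:
        age = seconds_since_update(path, now)
    except OSError as exc:
        return False, f"CRITICAL: cannot stat {path}: {exc}"
    if age > max_age_seconds:
        return False, f"CRITICAL: {path} last updated {int(age)}s ago"
    return True, f"OK: {path} last updated {int(age)}s ago"
```

A wrapper script could print the message and exit non-zero on failure, so that whatever alerting is already in place can send the email.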
Severity: normal → major
The US is on holiday today and mxr has been getting hammered by an external attack (which should now be mitigated) but not sure there's anything else I can do in the meantime.
Assignee: server-ops-webops → pradcliffe
Can someone give me a search that should give results but does not? Not being a developer I don't know what would be missing or not...
Killed some stuck jobs from the 15th, this may help.
(stuck jobs killed on mxr-processor1.private.scl3, as per https://mana.mozilla.org/wiki/display/websites/mxr.mozilla.org)
Assignee: pradcliffe → eziegenhorn
grabbing ownership of this bug.
Assignee: eziegenhorn → cturra
i think we really need to try to track down the root cause here. next time we see a "hung" cron, i want to run a `strace` on the process and see if it helps point us in the right direction. going to keep this bug open for now to monitor.
It hasn't updated in over a day
there are a number of crons that have active processes. the oldest of these is:
root 32552 0.0 0.0 106096 1148 ? S Feb18 0:00 /bin/bash /root/bin/mxr-daily-cron
the following is an lsof and strace of this pid:
[root@mxr-processor1.private.scl3 ~]# lsof -n -p 32552
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
bash 32552 root cwd DIR 8,3 4096 2490372 /data/www/mxr.mozilla.org
bash 32552 root rtd DIR 8,3 4096 2 /
bash 32552 root txt REG 8,3 938736 395889 /bin/bash
bash 32552 root mem REG 8,3 156872 3407879 /lib64/ld-2.12.so
bash 32552 root mem REG 8,3 1922112 3407886 /lib64/libc-2.12.so
bash 32552 root mem REG 8,3 22536 3407901 /lib64/libdl-2.12.so
bash 32552 root mem REG 8,3 138280 3408204 /lib64/libtinfo.so.5.7
bash 32552 root mem REG 8,3 99158576 274070 /usr/lib/locale/locale-archive
bash 32552 root mem REG 8,3 26060 264940 /usr/lib64/gconv/gconv-modules.cache
bash 32552 root 0u CHR 136,0 0t0 3 /dev/pts/0 (deleted)
bash 32552 root 1u CHR 136,0 0t0 3 /dev/pts/0 (deleted)
bash 32552 root 2u CHR 136,0 0t0 3 /dev/pts/0 (deleted)
bash 32552 root 255r REG 8,3 2019 131136 /root/bin/mxr-daily-cron
[root@mxr-processor1.private.scl3 ~]# strace -p 32552
Process 32552 attached - interrupt to quit
wait4(-1,
This is un-stuck. The hanging trees in this case were l10n-central (one of the 4hour jobs, blocking mozilla-central, comm-central, mozilla, l10n, and mobile-browser) and addons (daily, blocking aurora/beta/release, b2g18, and various other trees).
The l10n-central one was hanging on "hg pull -r default" in the "hu" locale. I don't know if there's anything interesting here, or if it was just chance that it was this locale and not some other one. I don't know what we can do here. Perhaps :bkero can weigh in?
default = http://hg.mozilla.org/l10n-central/hu/
mxr-processor1.private.scl3:/data/mxr-data/l10n-central/l10n-central/hu
I ran it by hand just now, and it completed quickly without issue.
The addons one was hanging trying to update its local addons list. This is a very weird process, and I can't say with any certainty why it was hanging. Next time (if there is a next time) I'll try to get more data on it.
I have opened bug 843740 to try and get some monitoring set up, so we'll hopefully be able to catch these situations in the future before users notice the extra delay.
It might be possible to restructure how we manage lockfiles here, and lock each tree individually. That way one tree cannot block any others from updating. This won't fix the underlying problem, but it will minimize damage if or when something does break.
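A rough sketch of the per-tree locking idea, assuming one lock file per tree and advisory `flock` locks. The names `tree_lock` and `update_trees` and the lock directory are hypothetical, and this is not how the existing MXR cron scripts are written; it only illustrates that a busy tree gets skipped instead of holding up the rest.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def tree_lock(lock_dir, tree):
    """Try to take an exclusive lock for one tree without blocking.

    Yields True if the lock was acquired, False if another job holds it.
    """
    os.makedirs(lock_dir, exist_ok=True)
    handle = open(os.path.join(lock_dir, f"{tree}.lock"), "w")
    try:
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        handle.close()  # closing the file releases the lock if we held it


def update_trees(trees, lock_dir, update_one):
    """Update each tree whose lock is free; return the trees that were skipped."""
    skipped = []
    for tree in trees:
        with tree_lock(lock_dir, tree) as acquired:
            if acquired:
                update_one(tree)
            else:
                skipped.append(tree)
    return skipped
```

With this layout a hung l10n-central job would only keep l10n-central from updating; the next cron run would skip it and carry on with the other trees.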
i am going to mark this as r/fixed since i believe this has been functioning as expected since february. if i am incorrect and there are outstanding issues here, please feel free to reopen and give me heck.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
mxr hung up again. I got alerted with:
<nagios-scl3:#sysadmins> Wed 04:16:43 PDT [5417] mxr-processor1.private.scl3.mozilla.com:File Age - /var/lock/mxr/short is CRITICAL: CRITICAL: 1dir(s) -- /var/lock/mxr/short/l10n-central.lock: 44197secs /var/lock/mxr/short: 3 files
From the looks of the age of hung cronjobs it seems to have gotten stuck yesterday.
root 6464 0.0 0.0 9232 984 ? Ss Sep24 0:00 /bin/bash /root/bin/mxr-4hour-cron
root 6468 0.0 0.0 78788 2500 ? S Sep24 0:00 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
root 6470 0.0 0.0 48552 9572 ? S Sep24 0:03 perl /usr/bin/parallel --gnu -j-1 echo -n "{}: Starting " && date && nice -n 19 ./update-full-onetree.sh -cron {}; echo -n "{}: Ending " && date
root 6471 0.0 0.0 78752 3212 ? S Sep24 0:00 /usr/sbin/postdrop -r
root 6510 0.0 0.0 9232 1100 ? S Sep24 0:00 /bin/sh -c echo -n "l10n-central: Starting " && date && nice -n 19 ./update-full-onetree.sh -cron l10n-central; echo -n "l10n-central: Ending " && date
root 6515 0.0 0.0 9232 1196 ? SN Sep24 0:00 /bin/sh ./update-full-onetree.sh -cron l10n-central
root 6574 0.0 0.0 9232 592 ? SN Sep24 0:00 /bin/sh ./update-full-onetree.sh -cron l10n-central
root 6575 0.0 0.0 44584 7592 ? SN Sep24 0:00 perl update-src.pl -cron l10n-central
root 6617 0.0 0.0 9228 1096 ? SN Sep24 0:00 sh -c cd /data/mxr-data/l10n-central/l10n-central/ak; hg pull -r default 2>&1 || pwd; hg update --clean 2>&1; hg parents --template="{node|short}\n"
root 6618 0.0 0.0 85100 10544 ? SN Sep24 0:00 /usr/bin/python /usr/bin/hg pull -r default
root 23630 0.0 0.0 107888 13144 ? SN 03:26 0:00 ruby /usr/sbin/mcollectived --pid=/var/run/mcollectived.pid --config=/etc/mcollective/server.cfg
root 27856 0.0 0.0 140112 1364 ? S 04:00 0:00 CROND
root 27860 0.0 0.0 9232 1116 ? Ss 04:00 0:00 /bin/bash /root/bin/mxr-4hour-cron
root 27863 0.0 0.0 78788 3232 ? S 04:00 0:00 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
root 27865 0.0 0.0 48552 10884 ? S 04:00 0:00 perl /usr/bin/parallel --gnu -j-1 echo -n "{}: Starting " && date && nice -n 19 ./update-full-onetree.sh -cron {}; echo -n "{}: Ending " && date
root 27866 0.0 0.0 78752 3212 ? S 04:00 0:00 /usr/sbin/postdrop -r
root 27898 0.0 0.0 9232 1096 ? S 04:00 0:00 /bin/sh -c echo -n "comm-central: Starting " && date && nice -n 19 ./update-full-onetree.sh -cron comm-central; echo -n "comm-central: Ending " && date
root 27901 0.0 0.0 9232 1204 ? SN 04:00 0:00 /bin/sh ./update-full-onetree.sh -cron comm-central
I killed 6618 which seemed to let things carry on for now. Can we not have a timeout on these processes so they get killed if they overrun by silly amounts? I know mxr hasn't been as problematic recently but this seems like a problem that can be automated rather than us going in and killing processes every time it happens.
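One way to get the timeout asked for above is to run each external command with a hard limit and kill it on overrun. The following is a minimal sketch using Python's `subprocess.run` with its `timeout` argument. The helper name `run_with_timeout` and any limit passed to it are hypothetical; the actual update scripts are shell and Perl, so this shows the idea rather than the change that would be made there.

```python
import subprocess


def run_with_timeout(argv, timeout_seconds, cwd=None):
    """Run a command and kill it if it exceeds the time limit.

    Returns (exit_code, output), or (None, partial_output) when the
    command was killed for running too long.
    """
    try:
        result = subprocess.run(
            argv,
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run has already killed and reaped the child here.
        return None, exc.output or b""
    return result.returncode, result.stdout
```

A pull like the one that hung could then be invoked as `run_with_timeout(["hg", "pull", "-r", "default"], limit, cwd=locale_dir)`, with `limit` and `locale_dir` chosen by the caller. Note that only the direct child is killed; any processes it spawned would need separate handling.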
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: cturra → klibby
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
TIL man 1 timeout. Added to update-src.pl in 76ae30bc7fd7.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Component: WebOps: Other → MXR
Product: Infrastructure & Operations → Webtools
QA Contact: nmaul
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard