Bug 834812
Opened 11 years ago
Closed 11 years ago
Track number of files in on-disk queue on sp-collectorXX nodes
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 841578
People
(Reporter: selenamarie, Assigned: ashish)
We should track the number of files that are queued for crash moving on collector nodes. These files are stored in a directory structure like:

    /home/socorro/primaryCrashStore/20120128/date/00/05/

We should graph the total number of *files* in this directory. We're not sure when we need to alarm on it; my best guess for the moment is 10000.
Comment 1•11 years ago
Specifically, we should monitor the total # of files in all subdirs beneath date/.
Comment 2•11 years ago
Small nit: we'd technically be looking for the # of *symlinks*, not files. There are no regular files under /date/..., only directories containing symlinks.

We need to be exceptionally careful about this. If whatever we do descends into primaryCrashStore/*/name/*, it will take *hours*. In my testing so far, a naive "find ... ! -path '*/name/*'" is not sufficient to avoid this. If you can devise a "find" that is superior to the below (in terms of execution speed), that'd be alright. :)

Here's something that does work fairly well:

    for i in /home/socorro/primaryCrashStore/*/date/*/*/*; do echo -n "$i: "; ls $i | wc -l; done

That prints out the number of files in each individual directory, which is great for figuring out which day has a backlog, but not ideal for monitoring. Here's a slightly revised command:

    for i in /home/socorro/primaryCrashStore/*/date/*/*/*; do ls $i; done | wc -l

This runs very quickly (under 100ms) when there aren't many files, and still runs acceptably even when there are far too many files (~500k takes a few seconds).

At present, here's what this command would return on each of the 6 nodes:

    sp-collector01.phx1: 716
    sp-collector02.phx1: 737
    sp-collector03.phx1: 720
    sp-collector04.phx1: 732
    sp-collector05.phx1: 255890
    sp-collector06.phx1: 743

Obviously, 05 has a problem (we're working on processing a backlog).

This value does grow slightly over time, and will fluctuate up to 100-200 all the time. If there is a problem it should spike *rapidly*. An alert threshold of 10,000 seems like a good starting point.

Thanks!
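For monitoring purposes, the counting loop above could be wrapped in a small check along the following lines. This is only a sketch, not what was eventually deployed: the function names count_queued and check_queue are made up for illustration, and the return code 2 for the alert case assumes a Nagios-style convention that this bug does not specify. It keeps the same glob as the commands above, so it never descends into the name/ trees.

```shell
#!/bin/sh
# Sketch only: function names and the Nagios-style return code are
# assumptions for illustration, not taken from the bug.

# Print the total number of entries (symlinks) queued under
# <root>/*/date/*/*/*. The glob only expands beneath date/, so the
# expensive name/ trees are never visited.
count_queued() {
    root=$1
    total=0
    for dir in "$root"/*/date/*/*/*; do
        # Skip the literal pattern if the glob matched nothing.
        [ -d "$dir" ] || continue
        n=$(ls -A "$dir" | wc -l)
        total=$((total + n))
    done
    echo "$total"
}

# Compare the count against a threshold and report the result.
# Returns 0 when below the threshold, 2 when at or above it.
check_queue() {
    root=$1
    threshold=$2
    count=$(count_queued "$root")
    if [ "$count" -ge "$threshold" ]; then
        echo "CRITICAL: $count queued entries (threshold $threshold)"
        return 2
    fi
    echo "OK: $count queued entries"
    return 0
}
```

With the path and starting threshold suggested in this bug, it would be invoked as: check_queue /home/socorro/primaryCrashStore 10000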
Updated•11 years ago
Assignee: server-ops → ashish
Comment 3•11 years ago
This has been implemented in Bug 841578
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard