Closed Bug 735563 Opened 13 years ago Closed 11 years ago

elmo needs to re-work how it stores its files

Categories

(Webtools Graveyard :: Elmo, defect, P2)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fox2mike, Unassigned)

Can't believe I missed this earlier...but here's the gist: While working on moving stuff from sjc1 -> phx1 for the new elmo setup, my rsync failed with a bunch of "File too large (27)" errors. After having a chat with Brian and Dan, Brian poked around in the /mnt/space directory on bm-l10n-dashboard01 and found this:
root@bm-l10n-dashboard01:/mnt/space/build-data/l10n-master/compare# time ls -l | wc -l
997585
real 2m5.987s
user 0m12.501s
sys 0m15.381s
That's almost a million files, in a single directory. The only reason this hasn't failed already is probably because there is no regular IO across all those files. Peter had some excellent ideas on how to prevent this, but we shouldn't have more than a few thousand files in _any_ directory. Please audit elmo and *fix* this for the existing as well as future files. Without this fix for existing files, the migration/setup for dev cannot proceed.
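For the audit requested above, a minimal sketch of how oversized directories could be found without running `ls -l` (which stats every file). It assumes Python 3 and uses `os.scandir`, which typically avoids a per-file stat call. The function name `oversized_dirs` and the default `limit` of 5000 are illustrative assumptions, not anything elmo or buildbot ships.

```python
import os


def oversized_dirs(root, limit=5000):
    """Yield (path, entry_count) for each directory under `root`
    that holds more than `limit` entries.

    `limit` is a hypothetical threshold standing in for
    "a few thousand files"; pick whatever the storage can handle.
    """
    stack = [root]
    while stack:
        path = stack.pop()
        count = 0
        with os.scandir(path) as entries:
            for entry in entries:
                count += 1
                # Descend into real subdirectories only, not symlinks.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
        if count > limit:
            yield path, count
```

Running `for path, n in oversized_dirs("/mnt/space/build-data"): print(n, path)` would list every directory above the threshold.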
Sorry, but that won't be possible without data loss. Those are the buildbot log files, and they hold our data. If rsync can't move those across, we'll have to tarball them and copy. Storing the buildbot logs in a different storage than files is something that requires both buildbot and elmo code, and won't be able to happen on short notice, sadly.
Can you at least use subdirectories, instead of putting 1 million files in one place?
That's buildbot-internal code that expects those files in particular places. Rewriting that log system sounds scary, in particular given that we're using a version that's not maintained anymore. Sadly, the current versions have the same log store, and don't support the tests we have, so an update won't help. Another mid- to long-term solution is to add an alternative storage for the logs, say, put the data into ES, and throw the buildbot-internal logs away. Does that sound like something we should do?
Dustin, I'm at a loss to understand why buildbot would insist on dumping its log files in a single directory...even when that means a million files in a single directory :) Would be awesome to know what our options here are. Thanks!
We should focus on how to get the data across, and not so much on why it's stored that way. We're on a deadline here after all. The logs are pretty structured in how they're named, so we can reliably chunk it up into tarballs to get them from one place to the other. Does that help? Also enables us to move the data across somewhat incrementally.
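A rough sketch of the chunked-tarball idea, assuming Python 3. The thread does not spell out the log naming scheme, so grouping by the first few characters of each filename is an assumption, and `group_by_prefix`, `write_tarballs`, and the `logs-<prefix>.tar.gz` naming are hypothetical.

```python
import os
import tarfile
from collections import defaultdict


def group_by_prefix(names, prefix_len=3):
    """Bucket filenames by their first `prefix_len` characters."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:prefix_len]].append(name)
    return dict(groups)


def write_tarballs(src_dir, dest_dir, prefix_len=3):
    """Write one gzip tarball per filename prefix; return the tarball paths.

    Each tarball can be copied and unpacked on its own, so the
    data can move across incrementally.
    """
    names = (e.name for e in os.scandir(src_dir) if e.is_file())
    written = []
    for prefix, members in sorted(group_by_prefix(names, prefix_len).items()):
        tar_path = os.path.join(dest_dir, "logs-%s.tar.gz" % prefix)
        with tarfile.open(tar_path, "w:gz") as tar:
            for name in sorted(members):
                # Store bare filenames so unpacking recreates the flat layout.
                tar.add(os.path.join(src_dir, name), arcname=name)
        written.append(tar_path)
    return written
```

A longer `prefix_len` gives more, smaller tarballs; the right value depends on how the real filenames are distributed.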
(In reply to Axel Hecht [:Pike] from comment #5)
> We should focus on how to get the data across, and not so much on why it's
> stored that way. We're on a deadline here after all.
Sorry, but the way it's stored right now does *not* scale and you're asking for failure if you continue to do this. Knowing that, I can't fathom why you're so much against doing this right. It is fundamentally wrong to have a million files in a single directory. At some point in the future, you will hit filesystem limits (already have on the netapp, for example) and that is a really robust storage system. I can't imagine the issues you will run into with this in the future. I'd like to be able to scale the system, not just blindly move it across. That isn't possible with this setup. A million files in a directory is asking for trouble and I'm trying to eliminate that right away. I'm disappointed that I didn't catch this earlier, or I'd have filed this bug a year ago.
What I'm blocking on is you determining that I'm going to fix that bug by your deadlines on datacenter moves. And yes, it's going to be me that has to fix this, nobody else knows the pieces that integrate into each other. I'm not blocking a solution, in fact I did propose something in comment 3.
(In reply to Shyam Mani [:fox2mike] from comment #0)
...
> That's almost a million files, in a single directory. The only reason this
> hasn't failed already is probably because there is no regular IO across all
> those files. Peter had some excellent ideas on how to prevent this, but we
> shouldn't have more than a few thousand files in _any_ directory.
>
Apparently more than a 1,000 is the "limit" for NFS. (From the trenches of Socorro)
(In reply to Axel Hecht [:Pike] from comment #3)
> That's buildbot-internal code that expects those files in particular places.
>
> Rewriting that log system sounds scary, in particular given that we're using
> a version that's not maintained anymore. Sadly, the current versions have
> the same log store, and don't support the tests we have, so an update won't
> help.
>
So buildbot (current versions) writes all files to the same directory? Amazed if that's true!
> Another mid- to long-term solution is to add an alternative storage for the
> logs, say, put the data into ES, and throw the buildbot-internal logs
> away. Does that sound like something we should do?
If replacing the file logging with a key-value database is an option at all, why can't we start by modifying the way it file-logs today? E.g. ::
- log.write('/compare/file-%s.log' % ooid)
+ log.write('/compare/%s/%s/%s/file-%s.log' % (ooid[:2], ooid[2:4], ooid[4:6], ooid))
Where in the compare-locales/elmo/master-ball do we READ from the logs? I can have a look to see how easy it would be to replace with something like Redis/Memcached/Mongo/Riak right there for the reading part. If we do replace the FS log storage with a key-value database today, that would at least solve the problem without having to break existing references.
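The subdirectory scheme from the diff above can be sketched as a standalone helper, assuming Python 3. `sharded_log_path` mirrors the `ooid[:2]`, `ooid[2:4]`, `ooid[4:6]` slicing. `hashed_log_path` is an added variant (not proposed in this thread) that shards on a hash of the name, which may spread files more evenly if the real names share long common prefixes. Both function names and their parameters are hypothetical.

```python
import hashlib
import os


def sharded_log_path(base_dir, ooid, depth=3, width=2):
    """Nest the log file under `depth` subdirectories taken from
    successive `width`-character slices of `ooid`."""
    if len(ooid) < depth * width:
        raise ValueError("ooid too short to shard: %r" % ooid)
    shards = [ooid[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(base_dir, *shards, "file-%s.log" % ooid)


def hashed_log_path(base_dir, name, depth=2, width=2):
    """Assumed variant: shard on a SHA-1 digest of `name` and keep
    the original filename unchanged inside the leaf directory."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    shards = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(base_dir, *shards, name)
```

With the defaults, `sharded_log_path("/compare", "abcdef123456")` places the file under `/compare/ab/cd/ef/`. Readers of the logs would need the same function to locate a file, which is why the question of where the logs are read matters.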
Using logs as a data store is inherently flawed, but I remember something like that from my earlier life as a SeaMonkey RelEng person. Buildbot 0.8+ has a database though, have you verified it still needs the logs for something? We maybe should ask our RelEng people here, they work with buildbot a whole lot and have quite deep knowledge of what it does how, AFAIK.
(In reply to Peter Bengtsson [:peterbe] from comment #8)
> Apparently more than a 1,000 is the "limit" for NFS. (From the trenches of
> Socorro)
This is untrue, Dan can give you exact numbers. Back in the socorro days, we probably had them on equallogic (not netapp) and it might have been a really old one.
(In reply to Peter Bengtsson [:peterbe] from comment #9)
> So buildbot (current versions) writes all files to the same directory?
> Amazed if that's true!
It is, although it wouldn't be *that* hard to fix. Patches accepted. And if any of you were at PyCon this week and didn't come sprint on this topic, shame on you! 1000 files is a tough limit, but you should be using log horizons to delete things before they get to the million-file level. IIRC Axel's still running 0.7.mumble, so upgrading to 0.8.7 or 0.9.0 will be a pretty significant jump.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #10)
> Using logs as a data store is inherently flawed, but I remember something
> like that from my earlier life as a SeaMonkey RelEng person. Buildbot 0.8+
> has a database though, have you verified it still needs the logs for
> something?
The database only stores half of the data, and not the status/logfile half. Getting the remainder into the db is in progress, but it's not quick.
> We maybe should ask our RelEng people here, they work with buildbot a whole
> lot and have quite deep knowledge of what it does how, AFAIK.
I'm the Buildbot maintainer. That doesn't mean I have time to fix this stuff though, just the ability to merge pull requests, hint hint.
As mentioned before, the tests that the l10n buildbot has on top of 0.7.x are not supported on 0.8.x, so updating buildbot isn't really an option.
(In reply to Shyam Mani [:fox2mike] from comment #11)
> (In reply to Peter Bengtsson [:peterbe] from comment #8)
>
> > Apparently more than a 1,000 is the "limit" for NFS. (From the trenches of
> > Socorro)
>
> This is untrue, Dan can give you exact numbers. Back in the socorro days, we
> probably had them on equallogic (not netapp) and it might have been a really
> old one.
Indeed, the limit is a lot more than 1000, but a hell of a lot less than a million.
I don't think that this is going to be quickly fixable. Do we have other options so as to not block bug 652792?
Dustin: Shyam and I came up with a plan. (I'll send the details.) Basically we want to migrate the existing VMs, and solve the data problem in Q2. The migration bug is here: https://bugzilla.mozilla.org/show_bug.cgi?id=737606 Pike is also working on getting buy-in for a retention policy in order to reduce the size of the problem. The draft policy is documented here: https://wiki.mozilla.org/Elmo/Retention_Policy
Thanks for the update. From beneath my Buildbot-maintainer hat: patches accepted to use subdirectories -- that code hasn't changed much since 0.7.10, so likely the patch would be easily forward/back-ported.
Blocks: 741957
We've met with Daniel from Metrics, and the plan is to store the buildbot logs in ElasticSearch. I filed bug 741957 to get that implemented.
Priority: -- → P2
Blocks: 857107
Restructuring some dependencies, migrating this bug to be mostly a tracker for the various things we need to get done to move off of file logs and forward to ES.
No longer blocks: 857107, 741957
Depends on: 857107
Depends on: 958061
Depends on: 958067
This is fixed. We haven't started removing the files yet, but we're mostly there.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard