Closed
Bug 735563
Opened 13 years ago
Closed 11 years ago
elmo needs to re-work how it stores its files
Categories
(Webtools Graveyard :: Elmo, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: fox2mike, Unassigned)
References
Details
Can't believe I missed this earlier, but here's the gist:
While moving data from sjc1 -> phx1 for the new elmo setup, my rsync failed with a bunch of "File too large (27)" errors. After a chat with Brian and Dan, Brian poked around in the /mnt/space directory on bm-l10n-dashboard01 and found this:
root@bm-l10n-dashboard01:/mnt/space/build-data/l10n-master/compare# time ls -l | wc -l
997585
real 2m5.987s
user 0m12.501s
sys 0m15.381s
That's almost a million files, in a single directory. The only reason this hasn't failed already is probably because there is no regular IO across all those files. Peter had some excellent ideas on how to prevent this, but we shouldn't have more than a few thousand files in _any_ directory.
Please audit elmo and *fix* this for the existing as well as future files. Without this fix for existing files, the migration/setup for dev cannot proceed.
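For the audit, a minimal Python sketch (not part of elmo or buildbot) that counts directory entries without the sorting and per-file stat that make `ls -l` take two minutes, and that flags directories above a threshold. The function names and the default limit of 5000, standing in for "a few thousand", are assumptions for illustration.

```python
import os


def count_entries(directory):
    """Count directory entries one at a time, without sorting or stat-ing them."""
    count = 0
    with os.scandir(directory) as entries:
        for _ in entries:
            count += 1
    return count


def oversized_directories(root, limit=5000):
    """Yield (path, entry_count) for every directory under root with more than limit entries.

    The default limit is a hypothetical stand-in for "a few thousand".
    """
    for dirpath, dirnames, filenames in os.walk(root):
        n = len(dirnames) + len(filenames)
        if n > limit:
            yield dirpath, n
```

Calling `oversized_directories` on a storage root (for example the build-data directory above) would list the directories that need attention.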
Comment 1•13 years ago
Sorry, but that won't be possible without data loss. Those are the buildbot log files, and they hold our data.
If rsync can't move those across, we'll have to tarball them and copy.
Storing the buildbot logs in a different storage than files is something that requires both buildbot and elmo code, and won't be able to happen on short notice, sadly.
Comment 2•13 years ago
Can you at least use subdirectories, instead of putting 1 million files in one place?
Comment 3•13 years ago
That's buildbot-internal code that expects those files in particular places.
Rewriting that log system sounds scary, in particular given that we're using a version that's not maintained anymore. Sadly, the current versions have the same log store, and don't support the tests we have, so an update won't help.
Another mid- to long-term solution is to add an alternative storage for the logs, say, put the data into ES, and throw the buildbot-internal logs away. Does that sound like something we should do?
Reporter | ||
Comment 4•13 years ago
Dustin, I'm at a loss to understand why buildbot would insist on dumping its log files in a single directory...even when that means a million files in a single directory :)
Would be awesome to know what our options here are. Thanks!
Comment 5•13 years ago
We should focus on how to get the data across, and not so much on why it's stored that way. We're on a deadline here after all.
The logs are pretty structured in how they're named, so we can reliably chunk it up into tarballs to get them from one place to the other. Does that help? Also enables us to move the data across somewhat incrementally.
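A rough Python sketch of that chunking idea, using only the standard library. The actual log naming scheme isn't spelled out in this bug, so the grouping key below (everything before the first "-" in the file name) is purely an assumption; swap in whatever matches the real names. All function names are hypothetical.

```python
import os
import tarfile
from collections import defaultdict


def name_prefix(name):
    """Hypothetical grouping key: the part of the file name before the first '-'."""
    return name.split("-", 1)[0]


def group_by_prefix(names, key=name_prefix):
    """Map each key to the list of file names that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[key(name)].append(name)
    return dict(groups)


def write_tarballs(source_dir, dest_dir, key=name_prefix):
    """Write one gzip-compressed tarball per group and return the tarball paths."""
    names = sorted(e.name for e in os.scandir(source_dir) if e.is_file())
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    for prefix, members in sorted(group_by_prefix(names, key).items()):
        tar_path = os.path.join(dest_dir, "%s.tar.gz" % prefix)
        with tarfile.open(tar_path, "w:gz") as tar:
            for name in members:
                # Store members under their bare names so they unpack flat.
                tar.add(os.path.join(source_dir, name), arcname=name)
        written.append(tar_path)
    return written
```

Because each tarball covers one group, the groups can be copied and unpacked one at a time, which is what would make the move incremental.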
Reporter | ||
Comment 6•13 years ago
(In reply to Axel Hecht [:Pike] from comment #5)
> We should focus on how to get the data across, and not so much on why it's
> stored that way. We're on a deadline here after all.
Sorry, but the way it's stored right now does *not* scale and you're asking for failure if you continue to do this. Knowing that, I can't fathom why you're so much against doing this right.
It is fundamentally wrong to have a million files in a single directory. At some point in the future, you will hit filesystem limits (already have on the netapp, for example) and that is a really robust storage system. I can't imagine the issues you will run into with this in the future.
I'd like to be able to scale the system, not just blindly move it across. That isn't possible with this setup. A million files in a directory is asking for trouble and I'm trying to eliminate that right away. I'm disappointed that I didn't catch this earlier, or I'd have filed this bug a year ago.
Comment 7•13 years ago
What I'm blocking on is you deciding that I will fix this bug on your datacenter-move deadlines. And yes, it will have to be me who fixes this; nobody else knows how the pieces integrate with each other.
I'm not blocking a solution, in fact I did propose something in comment 3.
Comment 8•13 years ago
(In reply to Shyam Mani [:fox2mike] from comment #0)
...
> That's almost a million files, in a single directory. The only reason this
> hasn't failed already is probably because there is no regular IO across all
> those files. Peter had some excellent ideas on how to prevent this, but we
> shouldn't have more than a few thousand files in _any_ directory.
>
Apparently more than 1,000 is the "limit" for NFS. (From the trenches of Socorro)
Comment 9•13 years ago
(In reply to Axel Hecht [:Pike] from comment #3)
> That's buildbot-internal code that expects those files in particular places.
>
> Rewriting that log system sounds scary, in particular given that we're using
> a version that's not maintained anymore. Sadly, the current versions have
> the same log store, and don't support the tests we have, so an update won't
> help.
>
So buildbot (current versions) writes all files to the same directory? Amazed if that's true!
> Another mid- to long-term solution is to add an alternative storage for the
> logs, say, put the data into ES, and throw the buildbot-internal logs
> away. Does that sound like something we should do?
If replacing the file logging with a key-value database is an option at all, why can't we start by modifying the way it file-logs today? E.g. ::
- log.write('/compare/file-%s.log' % ooid)
+ log.write('/compare/%s/%s/%s/file-%s.log' % (ooid[:2], ooid[2:4], ooid[4:6], ooid))
Where in the compare-locales/elmo/master-ball do we READ from the logs? I can have a look to see how easy it would be to replace with something like Redis/Memcached/Mongo/Riak right there for the reading part.
If we do replace the FS log storage with a key-value database today, that would at least solve the problem without having to break existing references.
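To make the subdirectory variant concrete, here is a small Python sketch of the same path scheme as in the diff above, plus a one-off pass that moves existing flat files into it. This is not buildbot code: `sharded_path` and `migrate_flat_directory` are made-up names, the `file-<id>.log` pattern is taken from the example only, and whether the real ids spread evenly over their first characters is an assumption. Anything that reads the logs would have to use the same mapping.

```python
import os


def sharded_path(root, ooid):
    """Return root/aa/bb/cc/file-<ooid>.log, sharding on the first six characters."""
    return os.path.join(root, ooid[:2], ooid[2:4], ooid[4:6], "file-%s.log" % ooid)


def migrate_flat_directory(root):
    """Move every file-<ooid>.log directly under root to its sharded location.

    Returns the number of files moved; other entries are left alone.
    """
    with os.scandir(root) as entries:
        names = [e.name for e in entries if e.is_file()]
    moved = 0
    for name in names:
        if not (name.startswith("file-") and name.endswith(".log")):
            continue
        ooid = name[len("file-"):-len(".log")]
        target = sharded_path(root, ooid)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.rename(os.path.join(root, name), target)
        moved += 1
    return moved
```

With two hex characters per level, each directory would hold at most 256 subdirectories, and three levels give roughly 16.7 million leaf directories, so a million files would end up spread very thinly.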
Comment 10•13 years ago
Using logs as a data store is inherently flawed, but I remember something like that from my earlier life as a SeaMonkey RelEng person. Buildbot 0.8+ has a database though, have you verified it still needs the logs for something?
We maybe should ask our RelEng people here, they work with buildbot a whole lot and have quite deep knowledge of what it does how, AFAIK.
Reporter | ||
Comment 11•13 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #8)
> Apparently more than 1,000 is the "limit" for NFS. (From the trenches of
> Socorro)
This is untrue, Dan can give you exact numbers. Back in the socorro days, we probably had them on equallogic (not netapp) and it might have been a really old one.
Comment 12•13 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #9)
> So buildbot (current versions) writes all files to the same directory?
> Amazed if that's true!
It is, although it wouldn't be *that* hard to fix. Patches accepted. And if any of you were at PyCon this week and didn't come sprint on this topic, shame on you!
1000 files is a tough limit, but you should be using log horizons to delete things before they get to the million-file level.
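As a generic illustration of pruning before a directory grows that large, here is a standalone Python sketch that keeps only the newest files by modification time. It is not Buildbot's log horizon mechanism, the function name and `keep` parameter are made up, and given comment 1 it would only be safe for logs whose data is already stored elsewhere.

```python
import os


def prune_oldest(directory, keep):
    """Delete all but the `keep` most recently modified files in directory.

    Returns the list of removed paths. Subdirectories are ignored.
    """
    with os.scandir(directory) as entries:
        files = [(e.stat().st_mtime, e.path) for e in entries if e.is_file()]
    files.sort(reverse=True)  # newest first
    removed = []
    for _, path in files[keep:]:
        os.remove(path)
        removed.append(path)
    return removed
```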
IIRC Axel's still running 0.7.mumble, so upgrading to 0.8.7 or 0.9.0 will be a pretty significant jump.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #10)
> Using logs as a data store is inherently flawed, but I remember something
> like that from my earlier life as a SeaMonkey RelEng person. Buildbot 0.8+
> has a database though, have you verified it still needs the logs for
> something?
The database only stores half of the data, and not the status/logfile half. Getting the remainder into the db is in progress, but it's not quick.
> We maybe should ask our RelEng people here, they work with buildbot a whole
> lot and have quite deep knowledge of what it does how, AFAIK.
I'm the Buildbot maintainer. That doesn't mean I have time to fix this stuff though, just the ability to merge pull requests, hint hint.
Comment 13•13 years ago
As mentioned before, the tests that the l10n buildbot has on top of 0.7.x are not supported on 0.8.x, so updating buildbot isn't really an option.
Comment 14•13 years ago
(In reply to Shyam Mani [:fox2mike] from comment #11)
> (In reply to Peter Bengtsson [:peterbe] from comment #8)
>
> > Apparently more than 1,000 is the "limit" for NFS. (From the trenches of
> > Socorro)
>
> This is untrue, Dan can give you exact numbers. Back in the socorro days, we
> probably had them on equallogic (not netapp) and it might have been a really
> old one.
Indeed, the limit is a lot more than 1000, but a hell of a lot less than a million.
Comment 15•13 years ago
I don't think that this is going to be quickly fixable. Do we have other options so as to not block bug 652792?
Comment 16•13 years ago
Dustin: Shyam and I came up with a plan. (I'll send the details.) Basically we want to migrate the existing VMs, and solve the data problem in Q2. The migration bug is here:
https://bugzilla.mozilla.org/show_bug.cgi?id=737606
Pike is also working on getting buy-in for a retention policy in order to reduce the size of the problem. The draft policy is documented here:
https://wiki.mozilla.org/Elmo/Retention_Policy
Comment 17•13 years ago
Thanks for the update. From beneath my Buildbot-maintainer hat: patches accepted to use subdirectories -- that code hasn't changed much since 0.7.10, so likely the patch would be easily forward/back-ported.
Comment 18•13 years ago
We've met with Daniel from Metrics, and the plan is to store the buildbot logs in ElasticSearch. I filed bug 741957 to get that implemented.
Updated•13 years ago
Priority: -- → P2
Comment 19•12 years ago
Restructuring some dependencies, migrating this bug to be mostly a tracker for the various things we need to get done to move off of file logs and forward to ES.
Comment 20•11 years ago
This is fixed. We haven't started removing the files yet, but we're mostly there.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•5 years ago
Product: Webtools → Webtools Graveyard