Closed Bug 1131228 Opened 9 years ago Closed 9 years ago

SymbolLRUCacheManager occasionally raising OSError trying to stat files

Categories

(Socorro :: Backend, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: rhelmer)

Details

Not sure why this is happening, as the file appears to be there when I check it:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/pyinotify.py", line 1542, in run
    self.loop()
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/pyinotify.py", line 1528, in loop
    self.process_events()
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/pyinotify.py", line 1328, in process_events
    self._default_proc_fun(revent)
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/pyinotify.py", line 967, in __call__
    return _ProcessEvent.__call__(self, event)
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/pyinotify.py", line 687, in __call__
    return meth(event)
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/socorro-master-py2.6.egg/socorro/processor/symbol_cache_manager.py", line 76, in process_IN_OPEN
    self.monitor._update_cache(event.pathname)
  File "/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/socorro-master-py2.6.egg/socorro/processor/symbol_cache_manager.py", line 192, in _update_cache
    size = os.stat(path).st_size
OSError: [Errno 2] No such file or directory: '/tmp/symbols/firefox.pdb/20885C58B68D486A83F2D7503F9D35662/firefox.sym'
This stops the cache cleaning from happening so we should probably handle more errors here (e.g. some other process coming along and wiping out files), but I'd like to know why this happens...
puppet?
(In reply to Chris Lonnen :lonnen from comment #2)
> puppet?

Maaaybe... it could be that a new stackwalker instance is downloading the file so I am wrong about it being there when the stat() fails.

Or maybe we're running SymbolLRUCacheManager on more than one thread accidentally?

phrawzty, we're trying to keep a temporary symbol cache (for symbols downloaded from S3) in /tmp/symbols/ and they seem to be disappearing out from under our cache cleaner code - do you think this is an OK place to keep these, if not can you suggest a better one? :)
Flags: needinfo?(dmaher)
(In reply to Robert Helmer [:rhelmer] from comment #3)
> phrawzty, we're trying to keep a temporary symbol cache (for symbols
> downloaded from S3) in /tmp/symbols/ and they seem to be disappearing out
> from under our cache cleaner code - do you think this is an OK place to keep
> these, if not can you suggest a better one? :)

You didn't note the name of a machine, so I guessed "socorro-processor1.stage.metrics.phx1".  The rest of this comment refers to that machine.

First off, I note that /tmp/ has a bunch of files of glob "symbol??????" (ex. "symbolrQi0eV").  Is that normal?  Also, there's no particular reason that Puppet would be causing this behaviour, so I'd look elsewhere first.  As for whether /tmp/symbols/ is an acceptable place to put these things - yeah, it's fine.  The tmpwatch runs daily but it's configured to only remove things between 10 and 30 days old (depending on the item).

Some quick searching seems to suggest that pynotify can be buggy, though most of the references to this fact are for older versions.  That said, if does make me wonder if there's something screwy going on there - for example, if pynotify is triggering before the file has been written entirely?  Alternatively, yes, something else is wiping out the file - somewhat ironically, an inotify-based tool might be useful in logging this behaviour, though I'd sooner employ audit.
Flags: needinfo?(dmaher)
My best guess here is that the file gets opened and then removed before SymbolLRUCacheManager gets the open notification. I wouldn't think that was likely with the way things are written in HTTPSymbolSupplier, since it downloads to a temp file (the /tmp/symbol???? files phrawzty is describing above) and then renames to the final file destination after successfully downloading:
https://github.com/mozilla/socorro/blob/master/minidump-stackwalk/http_symbol_supplier.cc#L289

...which means that the stackwalker shouldn't ever delete a file under /tmp/symbols, so I'm not sure what's up. If you think that tmpwatch thing is screwing things up you could try changing the location of the symbols cache to somwhere outside of /tmp.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #5)
> If you think that tmpwatch thing is
> screwing things up you could try changing the location of the symbols cache
> to somwhere outside of /tmp.

No, I *don't* think tmpwatch is screwing things up - unless the files that are "disappearing" more than 10 days old (which seems unlikely).
"if pynotify is triggering before the file has been written entirely"
This is my guess. 

I'm tempted to say "I told you so" on pyinotify. :)

I think we should make our SymbolLRUCacheManager class more defensive and careful. A simple `if os.path.isfile(...)` should do. That pyinotify says something should only be considered a hint/clue. Not a fact.
(In reply to Peter Bengtsson [:peterbe] from comment #7)
> "if pynotify is triggering before the file has been written entirely"
> This is my guess. 
> 
> I'm tempted to say "I told you so" on pyinotify. :)
> 
> I think we should make our SymbolLRUCacheManager class more defensive and
> careful. A simple `if os.path.isfile(...)` should do. That pyinotify says
> something should only be considered a hint/clue. Not a fact.

Yeah, the exception is being thrown when we're trying to stat the file so doing that (or catching the exception) would work:
https://github.com/mozilla/socorro/blob/master/socorro/processor/symbol_cache_manager.py#L192
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #5)
> My best guess here is that the file gets opened and then removed before
> SymbolLRUCacheManager gets the open notification. I wouldn't think that was
> likely with the way things are written in HTTPSymbolSupplier, since it
> downloads to a temp file (the /tmp/symbol???? files phrawzty is describing
> above) and then renames to the final file destination after successfully
> downloading:
> https://github.com/mozilla/socorro/blob/master/minidump-stackwalk/
> http_symbol_supplier.cc#L289


I think the reason these are getting left behind is that we're killing stackwalker after 30s while it's in the middle of downloading symbols (since we're remote from S3 and the cache is cold), filed bug 1131296 for that.
I'm still not sure why that would result in the file not existing. If we've called open() on it it should exist on disk, right?
PR: https://github.com/mozilla/socorro/pull/2629
Status: NEW → ASSIGNED
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/43fafddc04ba46c2bdc4433a27ee6a2de067dd37
fix bug 1131228 - catch unexpected OSError and return early

https://github.com/mozilla/socorro/commit/b5f09025afe9862b54442015bccd247209b41584
Merge pull request #2629 from rhelmer/bug1131228-catch-oserror

fix bug 1131228 - catch unexpected OSError and return early
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.