Bug 675375 (Closed): opened 13 years ago, closed 13 years ago

Frequent Android talos red with "devicemanager.FileError: error returned from pull: could not find metadata"

Categories: Testing :: Talos, defect
Hardware: ARM
OS: Android
Type: defect
Priority: Not set
Severity: normal

Tracking: Not tracked

Status: RESOLVED WORKSFORME

People: Reporter: dholbert; Assignee: Unassigned


Keywords: intermittent-failure

In the last two days, we've been getting very frequent Talos reds (roughly one red for every one or two pushes, on m-c and m-i) with this error message:
> devicemanager.FileError: error returned from pull: could not find metadata

http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311956009.1311957454.10061.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311962909.1311964350.6883.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311975178.1311976645.28669.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311972269.1311973677.15655.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311978029.1311979476.9356.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311849689.1311851237.5678.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311852539.1311853997.23538.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311855354.1311856792.6359.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311888725.1311890153.8445.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311894389.1311895829.2984.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311896109.1311897548.11519.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311896110.1311897557.11584.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311904949.1311906393.23041.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311939652.1311941059.20206.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311935369.1311936855.26358.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1311914969.1311916422.6084.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1311979463.1311980867.14135.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1311975233.1311976679.28766.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1311957064.1311958497.14507.gz
OS: Linux → Android
Hardware: x86_64 → ARM
Summary: Frequent talos red with "devicemanager.FileError: error returned from pull: could not find metadata" → Frequent Android talos red with "devicemanager.FileError: error returned from pull: could not find metadata"
The logs in comment 0 are all from one of {remote-ts, remote-tpan, remote-tzoom}.
Summary: Frequent Android talos red with "devicemanager.FileError: error returned from pull: could not find metadata" → Frequent Android talos red with "devicemanager.FileError: error returned from pull: could not find metadata" (affects remote-ts, remote-tpan, remote-tzoom)
Adding some members of the A-Team, since the problem may be in the device manager:
http://hg.mozilla.org/build/talos/file/c5e493459dd0/devicemanager.py#l704

This appears to be a devicemanager bug.
Component: Release Engineering → Talos
Product: mozilla.org → Testing
QA Contact: release → talos
Version: other → unspecified
Joel, have you found any way to repro/diagnose this?
I have only spent a couple of hours on this, and I haven't been able to reproduce it locally. I may have to look at fixing the code without a repro.

Any tips folks have to reproduce this would be appreciated.
Whiteboard: [orange]
Walking through the code, it appears we are getting an exception while doing a sock.recv().

Here is the function:
    def uread(to_recv, error_msg):
      """ unbuffered read """
      try:
        data = self._sock.recv(to_recv)
        if not data:
          # empty read: the remote agent closed the connection
          err(error_msg)
          return None
        return data
      except:
        # any exception from recv() is collapsed into the same generic error;
        # err() is what ends up raising the FileError seen in the logs
        err(error_msg)
        return None


We end up raising an exception from the except clause, which means self._sock.recv() itself is throwing. This feels like the socket is disconnecting and we then surface it as the generic error. I can test around that condition.

One theory is that we are killing the networking on the Tegra. I have seen this happen while testing when we eat up a lot of memory, and the only resolution is to power cycle. We could test that theory by looking at the history of the Tegra when this happens and seeing whether it goes offline unexpectedly during or after this test.

I don't think a Fennec crash would cause this to throw an exception.
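
To test around that condition, here is a minimal standalone sketch, not the code in devicemanager.py, of how the bare except could be split up so the log says why recv() failed, i.e. whether these reds line up with a dropped connection (socket.error), a timeout, or a clean close by the agent. The function name and the print-based reporting are hypothetical; the real helper calls err(), which raises the FileError.

    import socket

    def uread_diagnostic(sock, to_recv, error_msg):
      """ unbuffered read that reports why it failed (illustrative only) """
      try:
        data = sock.recv(to_recv)
      except socket.timeout:
        print("%s: recv() timed out" % error_msg)
        return None
      except socket.error as e:
        # a connection reset would land here if the tegra's networking died
        print("%s: socket error: %s" % (error_msg, e))
        return None
      if not data:
        # recv() returning '' means the agent closed the connection cleanly
        print("%s: connection closed by agent" % error_msg)
        return None
      return data
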
Blocks: 438871
Oh, I finally saw this reproduce! It appears we are hitting a race condition where we shut down the devicemanager at the same time we are doing a pullFile(). I call this an inch of progress!
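
For what it's worth, here is a small sketch of the kind of interlock that would close that race: have shutdown and the pull take the same lock, so the socket cannot be closed out from under an in-flight pullFile(). The class and method names below are hypothetical, not devicemanager's actual API.

    import threading

    class AgentConnection(object):
      """ hypothetical wrapper illustrating the shutdown-vs-pull race fix """
      def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()
        self._closed = False

      def pull_file(self, to_recv, error_msg):
        with self._lock:
          if self._closed:
            # shutdown already ran; fail loudly instead of reading from a
            # closed socket and surfacing a confusing pull error
            raise IOError("%s: connection already shut down" % error_msg)
          return self._sock.recv(to_recv)

      def shutdown(self):
        with self._lock:
          if not self._closed:
            self._closed = True
            self._sock.close()
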
I think at the time I knew which change this was supposed to be fixed by, but by now I've forgotten.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Whiteboard: [orange]