Closed Bug 1141418 Opened 9 years ago Closed 9 years ago

After 9 hours running, marionette server response incorrectly

Categories

(Remote Protocol :: Marionette, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:2.2+, firefox37 affected, firefox39 wontfix, firefox40 wontfix, firefox41 fixed, b2g-v2.2 fixed, b2g-master fixed)

RESOLVED FIXED
mozilla41
blocking-b2g 2.2+
Tracking Status
firefox37 --- affected
firefox39 --- wontfix
firefox40 --- wontfix
firefox41 --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: pyang, Assigned: ting)

References

Details

Attachments

(7 files, 1 obsolete file)

Build ID 20150305162505
Gaia Revision eb86137e247224e86d17ed1a0a133b2a318dce3c
Gaia Date 2015-03-05 16:51:07
Gecko Revision https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/e9d29a3b94e5
Gecko Version 37.0
Device Name flame
Firmware(Release) 4.4.2
Firmware(Incremental)  eng.cltbld.20150305.200709
Firmware Date Thu Mar5 20:07:21 EST 2015
Bootloader L1TC100118D0

STR:
Running mtbf tests about 9 hours.

EXPECT: 
marionette can continously run.

ACTUAL: 
Exception raisen in _send_message/_handle_error, server side didn't include 'error' in response, following is the traceback.

Traceback (most recent call last):
  File "/var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_client-0.8.6-py2.7.egg/marionette/marionette_test.py", line 250, in run
    self.setUp()
  File "/var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/gaiatest-0.32-py2.7.egg/gaiatest/gaia_test.py", line 805, in setUp
    self.cleanup_gaia(full_reset=True)
  File "/var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/mtbf_driver-0.2.0-py2.7.egg/mtbf_driver/MtbfTestCase.py", line 111, in cleanup_gaia
    self.device.touch_home_button()
  File "/var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/gaiatest-0.32-py2.7.egg/gaiatest/gaia_test.py", line 672, in touch_home_button
    Wait(self.marionette).until(lambda m: m.execute_script(
  File "/var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_client-0.8.6-py2.7.egg/marionette/wait.py", line 122, in until
    rv = condition(self.marionette)
  File "/var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/gaiatest-0.32-py2.7.egg/gaiatest/gaia_test.py", line 673, in <lambda>
    "return window.wrappedJSObject.scrollY") == 0)
  File "/var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_client-0.8.6-py2.7.egg/marionette/marionette.py", line 1278, in execute_script
    filename=os.path.basename(frame[0]))
  File "/var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_client-0.8.6-py2.7.egg/marionette/decorators.py", line 36, in _ 
    return func(*args, **kwargs)
  File "/var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_client-0.8.6-py2.7.egg/marionette/marionette.py", line 634, in _send_message 
    self._handle_error(response)
  File "/var/jenkins/workspace/flamekk.v2.2.moztwlab01.319.mtbf_op/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_client-0.8.6-py2.7.egg/marionette/marionette.py", line 663, in _handle_error 
    "Malformed packet, expected key 'error' to be a dict: %s" % response)
Hi Andreas, did you see this before?
Flags: needinfo?(ato)
Depends on: MTBF-Marionette
No longer depends on: MTBF-Marionette
It looks like marionette server stop returning value and keep response of "ok".  Not sure the reason but will start to dig into marionette server.
(In reply to Paul Yang [: pyang] from comment #1)
> Hi Andreas, did you see this before?

I have seen this before when there's a race condition the server code.

For example this might happen if a callback that sends back an OK message to the client is registered and not removed when the command has finished.  If it's left dangling and then fires during the execution of another command, that command's normal response would get short-circuited.

To fix this we need to narrow down where the OK message comes from (which callback) and which command is currently being executed when you get the unexpected response back.  The latter information should be available from the Gecko log, and the former might be possible to attain through using the FunctionName.caller reflection to see which function called it.
Flags: needinfo?(ato)
James, could you have someone to check this issue?
Flags: needinfo?(jlal)
This issue occurred in 6 devices out of 8 devices in recent run.
In moztwlab-01, flamekk.v2.2.moztwlab01.319.mtbf_op build 91-95.
I'm just about to land a major refactoring to Marionette as part of bug 1107706.  It changes the way commands are dispatched and responses sent back.  If you can spare the delay, it might be worth holding off for a few days to see if that resolves your problem.  Or you could pull down the bookmark off Review Board and try it out.

If it doesn't we're going to need much more detailed logging from gecko.log as well as the TC being run during the crash to have any hope at investigating.
(In reply to Andreas Tolfsen (:ato) from comment #6)
> I'm just about to land a major refactoring to Marionette as part of bug
> 1107706.  It changes the way commands are dispatched and responses sent
> back.  If you can spare the delay, it might be worth holding off for a few
> days to see if that resolves your problem.  Or you could pull down the
> bookmark off Review Board and try it out.
> 
> If it doesn't we're going to need much more detailed logging from gecko.log
> as well as the TC being run during the crash to have any hope at
> investigating.

Its risky to rely on the new implementation. It might also introduce new failures. Can we do a MTBF run with the current patch of 1107706 to make sure its an improvement and we don't need much time to make the new framework stable?
Hopefully we can have patch for b2g37 since marionette repo on master changed quite a lot.
We can provide logcat every 2 minutes and other process information.
James, Was there something changed in marionette to cause this ? If so can that be rolled-back if the forward fix is risky ?
If this is our main mtbf blocker right now we need to move forward here. Can we get a regression window? When was the last good run, can we get a regression window?

Andreas, can you get a logging patch up here?
paul, can you also confirm you were able to reproduce this and this isn't one of those rare race conditions that happened ?

OTOH, looking into changes related marioneette on b2g 37 recently,http://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/3802bdd23b74 seems to have touched that area, so I am Ni the author to help here and comment on if his changes could be related here ?
Flags: needinfo?(ejpbruel)
https://docs.google.com/a/mozilla.com/spreadsheets/d/1euwhyStKtg3yPy2D5rxuBndadGwKny8QI-YFb_FpyPI/edit#gid=1251305406
We started new test run and 1 of 7 has the issue.  Hopefully it's not race conditions but not efficient str to reproduce.
(In reply to bhavana bajaj [:bajaj] from comment #11)
> paul, can you also confirm you were able to reproduce this and this isn't
> one of those rare race conditions that happened ?
> 
> OTOH, looking into changes related marioneette on b2g 37
> recently,http://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/3802bdd23b74
> seems to have touched that area, so I am Ni the author to help here and
> comment on if his changes could be related here ?

looks like that change was a year ago, :-/ missed the date entirely on that. apologies.
Flags: needinfo?(ejpbruel)
Attached file logcat.tgz
Attaching gecko log. Andreas, can you help to see if something weird in these?
Flags: needinfo?(ato)
We encountered this bug few more times this time with marionette-v2.2. Need attention on this bug
Flags: needinfo?(bbajaj)
(In reply to Paul Yang [: pyang] from comment #14)
> Attaching gecko log. Andreas, can you help to see if something weird in
> these?

After a quick glance I can't spot anything odd in those.  The only two references I can find of the Marionette server sending back a packet containing the `error' field are these:

% g sendToClient | grep error
./logcat141:322:03-26 03:41:26.971  1262  1262 I Gecko   : 1427312486982	Marionette	INFO	sendToClient: {"from":"0","error":{"message":"timed out","status":28,"stacktrace":null}}, {e9703bd9-5293-48bd-94bd-3537125fb385}, {e9703bd9-5293-48bd-94bd-3537125fb385}
./logcat149:281:03-26 04:28:16.211  1262  1262 I Gecko   : 1427315296222	Marionette	INFO	sendToClient: {"from":"0","error":{"message":"Unable to locate element: power-switch","status":7,"stacktrace":null}}, {f6cd93f8-84ba-4f18-b349-86896a5dcc67}, {f6cd93f8-84ba-4f18-b349-86896a5dcc67}

The Python stacktrace indicates that the value of the `error' field is malformed, i.e. it's not a dictionary.  Both of those instances can be unmarshaled to dictionaries in Python.
Flags: needinfo?(ato)
This was fixed in bug 1137653 which will need uplifting to get around this issue.
Is it? I see it still happens sometimes, and someone commented in bug 1137653.
No longer depends on: 1137653
Flags: needinfo?(pyang)
(In reply to Walter Chen[:ypwalter][:wachen] from comment #18)
> Is it? I see it still happens sometimes, and someone commented in bug
> 1137653.

I couldnt see anything after comment 19 in but 1137653 about the malformed packet, they all appear to be related to a crash they are trying to capture. Can you point me to the comment please?

If that doesn't solve the malformed packet then bug 1107706 will have it but that might be slightly more risky to uplift.
It affects b2g37-v2.2 so uplift should be done IMO.
Flags: needinfo?(pyang)
I think we are still suffering from the same issue as for now. For bug 1137653, I just want to indicate that is is reopened I think :P
I think there is a patch on the way, that Paul requested uplift to help here.
Flags: needinfo?(bbajaj)
According to last run result bug 1137653 didn't resolve this.
David, do you think we should try to uplift bug 1107706 or probably another patch to go?
Flags: needinfo?(dburns)
I saw device recovered after hours; not sure how long since we came back from weekend and all devices work well without reboot.  As a workaround probably we can suspend for a while.
(In reply to Paul Yang [: pyang] from comment #23)
> According to last run result bug 1137653 didn't resolve this.
> David, do you think we should try to uplift bug 1107706 or probably another
> patch to go?

You could try uplift it but it may be a non-trivial amount of work. Are you sure that the build you used had the necessary things?
Flags: needinfo?(dburns)
I'm not sure.  But it certainly is a critical blocker since all devices now dropped into this issue, which made stability test stopped accidentally.
Comment 3 mentioned it's a possibly race condition, but we can't run on master to make sure it does not exist under new structure(so we can move to latest).
Per comment 24, I'll try to retry or wait on client side to avoid it.  And we need clues to set debug log on server side at the same time.
nominate to get more help.
blocking-b2g: --- → 2.2?
Attached file bugreport
bug report from a device. free memory only 8MB but commit size is not large.
Attached file about:memory.tgz
about memory log.
Hi Ting, can you help to see the memory report?
Flags: needinfo?(janus926)
Hi Kyle, we saw weird number of queued-ipc-messages, is that possibly memory issue?
Flags: needinfo?(khuey)
We need to add some logs to figure out how is the Marionette command failed, either it's doing incorrect command or somehow it sends incorrect response.

Keep NI since I haven't finished reading the memory report.
Re attachment 8601398 [details], I can see leaked AppWindow such as:

  0xa5b2cb00 [rc=8] FragmentOrElement (xhtml) div id='AppWindow_25' class=' appWindow render transition-opening enlarge loading' (orphan) app://system.gaiamobile.org/index.html

Seems destroy() is called on the AppWindow object 0xaed4e3d0 but not destroyBrowser(), because it has class transition-opening.

Alive, what's the reason of isActive() checking here [1], how will the browser be destroyed when it returns true?

[1] http://mxr.mozilla.org/gaia/source/apps/system/js/app_window.js#995
Flags: needinfo?(janus926) → needinfo?(alive)
If the app window is active, we need to wait until its closing transition ended. (We cannot remove the DOM right away).
In worst case the transitionend does not happen, we will close the appWindow immediately in
http://mxr.mozilla.org/gaia/source/apps/system/js/app_window.js#530
to make sure
http://mxr.mozilla.org/gaia/source/apps/system/js/app_window.js#512
gets the closed event and proceed the destroy function()
Flags: needinfo?(alive)
I have talked to Alive f2f, he claims Gaia has removed the elment from DOM tree and clear the reference [1], which Gecko should recycle it and its decendants anyway.

After double checking, I noticed there's a reference from nsGlobalWindow#5 to the AppWindow object through ontimeformatchange event listener 0xa5ea3220, which should have been removed by uninstallSubComponents(), not sure why it's still existed.

Also, |frame| points to the same thing as |element| [2], but it is not cleared to null.

[1] http://mxr.mozilla.org/gaia/source/apps/system/js/app_window.js#585
[2] http://mxr.mozilla.org/gaia/source/apps/system/js/app_window.js#648
triage: per comment 26, this blocks stability testing so let's make it 2.2+
blocking-b2g: 2.2? → 2.2+
Attached file logcat
Attaching latest logcat for marionette debug
(In reply to Paul Yang [: pyang] from comment #35)
> Created attachment 8603286 [details]
> logcat
> 
> Attaching latest logcat for marionette debug

Is there any chance of seeing the python stack for this run. The logcat looks like what I would expect
(In reply to Paul Yang [: pyang] from comment #29)
> Hi Kyle, we saw weird number of queued-ipc-messages, is that possibly memory
> issue?

It means the parent process is leaking iframes again.  These are usually caused by problems in the system app, like what Ting-Yu mentions in comment 33.
Flags: needinfo?(khuey)
(In reply to David Burns :automatedtester from comment #36)

Not sure if I understand but do you mean stacktrace in bug description?

> (In reply to Paul Yang [: pyang] from comment #35)
> > Created attachment 8603286 [details]
> > logcat
> > 
> > Attaching latest logcat for marionette debug
> 
> Is there any chance of seeing the python stack for this run. The logcat
> looks like what I would expect
Flags: needinfo?(dburns)
sleepSession() is done twice for Homescreen and registers handler for 'Marionette:restart' twice when test_mtbf_camera_multiple_shots.py gets timeout with this stack:

  TEST-UNEXPECTED-ERROR | test_mtbf_camera_multiple_shots.py TestCameraMultipleShots.test_capture_multiple_shots | TimeoutException: TimeoutException: Timed out after 10.0 seconds

  Traceback (most recent call last):
    File "/home/ting/w/fx/tests/mtbf_operation/mtbf-env/local/lib/python2.7/site-packages/marionette_client-0.8.6-py2.7.egg/marionette/marionette_test.py", line 268, in run
      testMethod()
    File "/home/ting/w/fx/tests/mtbf_operation/tests/mtbf/camera/test_mtbf_camera_multiple_shots.py", line 22, in test_capture_multiple_shots
      self.camera.launch()
    File "/home/ting/w/fx/tests/mtbf_operation/mtbf-env/local/lib/python2.7/site-packages/gaiatest-0.32-py2.7.egg/gaiatest/apps/camera/app.py", line 45, in launch
      self.wait_for_capture_ready()
    File "/home/ting/w/fx/tests/mtbf_operation/mtbf-env/local/lib/python2.7/site-packages/gaiatest-0.32-py2.7.egg/gaiatest/apps/camera/app.py", line 125, in wait_for_capture_ready
      Wait(self.marionette, timeout=10).until(lambda m: m.execute_script('return arguments[0].readyState;', [viewfinder]) > 0)
    File "/home/ting/w/fx/tests/mtbf_operation/mtbf-env/local/lib/python2.7/site-packages/marionette_client-0.8.6-py2.7.egg/marionette/wait.py", line 143, in until
      cause=last_exc)

For the next test case, Homescreen receives 'Marionette:restart' twice when switch to its frame and sends Ok twice to client, which fails command's expected response.

I don't know yet where's the extra sleepSession from.
Attached file logcat.dbg
The logcat from comment 40.
(In reply to Ting-Yu Chou [:ting] from comment #40)
> sleepSession() is done twice for Homescreen and registers handler for
> 'Marionette:restart' twice when test_mtbf_camera_multiple_shots.py gets
> timeout with this stack:

This is not all correct. sleepSession() is done twice for Homescreen, but the extra one is from test_mtbf_camera_multiple_shots.py:

  self.apps.set_permission('Camera', 'geolocation', 'deny')

Probably the timeout causes the normal path to restart it is not going through. I'll figure this out.
(In reply to Ting-Yu Chou [:ting] from comment #42)
> This is not all correct. sleepSession() is done twice for Homescreen, but
> the extra one is from test_mtbf_camera_multiple_shots.py:
> 
>   self.apps.set_permission('Camera', 'geolocation', 'deny')
> 
> Probably the timeout causes the normal path to restart it is not going
> through. I'll figure this out.

Here's the log which expect to restart it, but it did not:

I/Gecko   ( 4384): server.js | receiveMessage msg=Marionette:switchedToFrame val={"frameValue":"{22edb0ae-5754-4812-b233-16e8022b8d74}"}
I/Gecko   ( 4384): 1431667175692	Marionette	INFO	Switched to frame: {"frameValue":"{22edb0ae-5754-4812-b233-16e8022b8d74}"}
I/Gecko   ( 4384): listener.js | sendToServer lid=3-b2g msg=Marionette:switchToFrame val={"win":3,"frame":1,"command_id":"{73334a48-2613-41b4-9c79-48d383f5cea9}"}
I/Gecko   ( 4384): server.js | receiveMessage msg=Marionette:switchToFrame val={"win":3,"frame":1,"command_id":"{73334a48-2613-41b4-9c79-48d383f5cea9}"}
I/Gecko   ( 4384): 1431667175699	Marionette	INFO	trying remote frame 0
I/Gecko   ( 4384): 1431667175701	Marionette	INFO	deleting frame
I/Gecko   ( 4384): 1431667175702	Marionette	INFO	trying remote frame 1
I/Gecko   ( 4384): 1431667175703	Marionette	INFO	deleting frame
I/Gecko   ( 4384): 1431667175704	Marionette	INFO	trying remote frame 2
I/Gecko   ( 4384): 1431667175705	Marionette	INFO	deleting frame
I/Gecko   ( 4384): 1431667175711	Marionette	INFO	frame-manager load script: [object ChromeMessageSender]
Attached patch v1 (for v2.2) (obsolete) — Splinter Review
The splice here [1] adjusts the array we're iterating. Assume the array |x| is [a, b, c], if x[0] is removed, the the for loop will miss checking b.

I haven't confimred it's the root cause, but this should be fixed.

Paul, could you please apply this patch to current MTBF see if the issue goes away?

[1] https://dxr.mozilla.org/mozilla-central/source/testing/marionette/frame-manager.js#137
Flags: needinfo?(pyang)
Set 3 devices running for this patch.
Hopefully see something positive next week.
Flags: needinfo?(pyang)
I can not see from the comment 40 the issue described in comment 0 so from my part I consider it fixed by what I suggested in comment 17.

As for comment 44, that looks like it might be a separate issue. That type of problem would not return invalid packet
Flags: needinfo?(dburns)
From the result marionette server worked just fine.  Will setup another run for confirming.

Memory leak issue might worth of discussing in another bug.
Attached patch v2Splinter Review
Maybe I didn't explain clearly in comment 40.

The malformed packet is because there're two "Marionette:restart" listeners registered for Homescreen app, so once switch_to_frame is called to switch to Homescreen, marionette-server.js will receive two "Marioneete:register" messages and sendOk() to client twice. So the following command of switch-to-frame will get a Ok response, which may not match what it expects and raise malformed packet exception.

And the reason why there're two "Marionette:restart" listeners registered is because the splice call while iterating |remoteFrames| array makes the loop miss checking some elements, so the "Marionette:restart" is not sent when it should, then the listener is kept. Later when test ends, another "Marionette:restart" listener is registered from sessionTearDown(). Note the extra listener is never removed.

Hope this explains clearly why the splice call is the root cause of malformed packet.
Attachment #8606243 - Attachment is obsolete: true
Attachment #8606828 - Flags: review?(jgriffin)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #37)
> (In reply to Paul Yang [: pyang] from comment #29)
> > Hi Kyle, we saw weird number of queued-ipc-messages, is that possibly memory
> > issue?
> 
> It means the parent process is leaking iframes again.  These are usually
> caused by problems in the system app, like what Ting-Yu mentions in comment
> 33.

Paul filed bug 1165836 for this.
Comment on attachment 8606828 [details] [diff] [review]
v2

Review of attachment 8606828 [details] [diff] [review]:
-----------------------------------------------------------------

I believe the problem is actually that we should be iterating until i <= remoteFrames.length, rather than i < remoteFrames.length; we always skip the last element in the array.
Attachment #8606828 - Flags: review?(jgriffin) → review-
(In reply to Jonathan Griffin (:jgriffin) from comment #50)
> I believe the problem is actually that we should be iterating until i <=
> remoteFrames.length, rather than i < remoteFrames.length; we always skip the
> last element in the array.

Iterating an array from index 0 to length-1 is just as correct as it is.

The problem is splice() removes current element with index i, take an example:

array = [A, B, C]
i = 0: print array[i];     /* A */;
       array.splice(i, 1); /* array = [B, C], array.length = 2 */
i = 1: print array[i];     /* C */; 

Note B is not iterated, that's why i-- is used to fix the index.
Flags: needinfo?(jgriffin)
Comment on attachment 8606828 [details] [diff] [review]
v2

Review of attachment 8606828 [details] [diff] [review]:
-----------------------------------------------------------------

Ah thanks, yes, you're right.
Attachment #8606828 - Flags: review- → review+
Flags: needinfo?(jgriffin)
Attached patch patch for v2.2Splinter Review
Attachment #8607321 - Flags: approval-mozilla-b2g37?
Comment on attachment 8607321 [details] [diff] [review]
patch for v2.2

Will ask for aproval later when the patch is landed on m-c.
Attachment #8607321 - Flags: approval-mozilla-b2g37?
Test result with v2.2 patch, client didn't show malformed packet exception.  Looks clean for v2.2.
Keywords: checkin-needed
Comment on attachment 8607321 [details] [diff] [review]
patch for v2.2

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): n/a
User impact if declined: none as this affects only marionette
Testing completed: marionette unit test
Risk to taking this patch (and alternatives if risky): low
String or UUID changes made by this patch: n/a
Attachment #8607321 - Flags: approval-mozilla-b2g37?
https://hg.mozilla.org/mozilla-central/rev/431859cfb71d
Assignee: nobody → janus926
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
Josh, can you help to grant this uplift?
Flags: needinfo?(jocheng)
Flags: needinfo?(jocheng)
Attachment #8607321 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
Flags: needinfo?(jlal)
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: