Closed Bug 974807 Opened 10 years ago Closed 10 years ago

[Sora][gallery]the image editor hang at WaitForCompositor() during a call

Categories

(Core :: Graphics: CanvasWebGL, defect, P2)

defect

Tracking

()

RESOLVED DUPLICATE of bug 947227

People

(Reporter: sync-1, Assigned: jerry)

Details

(Keywords: regression, Whiteboard: [in-code-review] QARegressExclude)

Attachments

(9 files, 6 obsolete files)

7.82 KB, image/png
Details
479.88 KB, text/x-log
Details
269.71 KB, image/png
Details
8.55 KB, patch
Details | Diff | Splinter Review
61.52 KB, text/plain
Details
46 bytes, text/x-github-pull-request
bjacob
: review+
Details | Review
46 bytes, text/x-github-pull-request
bjacob
: review+
Details | Review
18.78 KB, image/png
Details
3.58 KB, patch
snorp
: review+
Details | Diff | Splinter Review
Mozilla build ID: 20140208004002
 
 Created an attachment (id=640198)
 pic
 
 DEFECT DESCRIPTION:
 the edit picture screen display black screen during a call
 
  REPRODUCING PROCEDURES:
 1.Launch Gallery,tap a picture and click edit icon;
 2.Press power key,lock the screen;
 3.Receive a call,answer the call,tap homekey,unlock the screen-->the edit screen
 display black screen-->ko
 
  EXPECTED BEHAVIOUR:
 The edit screen should display normal;
 
  ASSOCIATE SPECIFICATION:
 
  TEST PLAN REFERENCE:
 
  TOOLS AND PLATFORMS USED:
 
  USER IMPACT:
 
  REPRODUCING RATE:5/5
Attached image pic
Does this reproduce on a 1.1 Buri device?
Flags: needinfo?(sync-1)
(In reply to Jason Smith [:jsmith] from comment #2)
> Does this reproduce on a 1.1 Buri device?

Is not reproduce on 1.1
ni since Yaon already provided information
Flags: needinfo?(sync-1)
blocking-b2g: --- → 1.3?
Attached file LOG
QA Contact: ktucker
Before we do the regression window here, let's confirm this on the latest 1.3 build.
blocking-b2g: 1.3? → ---
This issue was occurring on today's Buri v 1.3.0 Mozilla RIL.
blocking-b2g: --- → 1.4?
blocking-b2g: 1.4? → 1.3?
Regression Window:

The issue did not occur on this build:

Environmental Variables:
Device: buri 1.3 MOZ
BuildID: 20140207064103
Gaia: 89f0cc7fb7587ce222798001c10671f211ed7fa8
Gecko: af0e7fea51b9
Version: 28.0
Firmware Version: v1.2-device.cfg

The issue was first introduced on this build: 

Environmental Variables:
Device: buri 1.3 MOZ
BuildID: 20140207105105
Gaia: 89f0cc7fb7587ce222798001c10671f211ed7fa8
Gecko: 164c55b989eb
Version: 28.0
Firmware Version: v1.2-device.cfg

The Gaia is the same on both builds so this appears to be a Gecko issue.
Nuwa is in that regression range. Seems strange that this is caused by nuwa though.
Blocks: 930282
Component: Gaia::Gallery → IPC
Product: Firefox OS → Core
Version: unspecified → 28 Branch
Blocks: 950266
No longer blocks: 930282
With the finger being pointed at Nuwa, Ben's patch from bug 956218 comment 26 came to mind.  Any chance of that being related, Ben?

I've also asked Jason to re-test this with a trunk build from today to include your fixes from bug 956218 to know if that is related.
Flags: needinfo?(bent.mozilla)
Keywords: qawanted
I don't think PBackground is related, and the deadlock I hit with PBackground shouldn't be an issue for PCompositor/PImageBridge because it looks like we start the compositor stuff after Nuwa freezes.
Flags: needinfo?(bent.mozilla)
Sounds like this is a different problem then, so pulling qawanted as this is probably still going to repro on the latest build.
Keywords: qawanted
Cervantes,

Please review as triage thinks Nuwa regression
blocking-b2g: 1.3? → 1.3+
Flags: needinfo?(cyu)
Can we test this with APZC disabled on the working revision?  This might be another crazy Nuwa/APZC interaction.
Flags: needinfo?(jsmith)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #15)
> Can we test this with APZC disabled on the working revision?  This might be
> another crazy Nuwa/APZC interaction.

Sure.

QA Wanted - Retest this on the working revision with APZC disabled.
Flags: needinfo?(jsmith)
Keywords: qawanted
QA Contact: ktucker → mvaughan
I was actually able to reproduce this issue on the 02/07/14 1.3 build. This led me to test this issue on older builds and I've found an alternate regression window.

This issue looks to have started reproducing on the 11/01/13 1.3 build.

- Last Working -
Device: Buri v1.3 MOZ RIL
BuildID: 20131031040201
Gaia: 412fd463bcb81f0e8bebf6d32500d0c02712748d
Gecko: f0d363d72753
Version: 28.0a1
Firmware Version: V1.2-device.cfg

- First Broken -
Device: Buri v1.3 MOZ RIL
BuildID: 20131101040203
Gaia: ccdf357ea150fc7d8b8a4b74c7adf31e7a57e465
Gecko: abe6790a5dd8
Version: 28.0a1
Firmware Version: V1.2-device.cfg

*This appears to be a gecko issue*

last working gaia/first broken gecko = REPRO
Gaia  412fd463bcb81f0e8bebf6d32500d0c02712748d
Gecko abe6790a5dd8

first broken gaia/last working gecko = NO REPRO
Gaia  ccdf357ea150fc7d8b8a4b74c7adf31e7a57e465
Gecko f0d363d72753

Push log: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f0d363d72753&tochange=abe6790a5dd8
Keywords: qawanted
I need to check with Kevin here if you guys are reproducing the same problem here. I'm really confused why there are two regression windows derived here. Can you figure out why there's an inconsistency of results here?
Flags: needinfo?(ktucker)
After further investigation, the regression window in comment 17 is the correct one. I tried multiple times today using my regression window in comment 8 and it appeared to be correct at first. Finally, I realized that an extra step is needed to reproduce the issue 100%. If you use the same picture over and over you will not encounter the issue. If you do not encounter the issue and change to a different picture and follow the original STR, the issue occurs. 

STR:
1. Tap on Gallery.
2. Tap on a picture.
3. Tap on edit.
4. Lock the phone while in edit mode.
5. Call the locked phone and answer the call.
6. Tap the home button and unlock the device.

Note: If the the black screen issue does not occur, change to a different picture in gallery(Very important) and repeat steps 4-6.

Actual Results:The user will encounter the black screen.

Correct Regression Window:

This issue looks to have started reproducing on the 11/01/13 1.3 build.

- Last Working -
Device: Buri v1.3 MOZ RIL
BuildID: 20131031040201
Gaia: 412fd463bcb81f0e8bebf6d32500d0c02712748d
Gecko: f0d363d72753
Version: 28.0a1
Firmware Version: V1.2-device.cfg

- First Broken -
Device: Buri v1.3 MOZ RIL
BuildID: 20131101040203
Gaia: ccdf357ea150fc7d8b8a4b74c7adf31e7a57e465
Gecko: abe6790a5dd8
Version: 28.0a1
Firmware Version: V1.2-device.cfg

*This appears to be a gecko issue*

last working gaia/first broken gecko = REPRO
Gaia  412fd463bcb81f0e8bebf6d32500d0c02712748d
Gecko abe6790a5dd8

first broken gaia/last working gecko = NO REPRO
Gaia  ccdf357ea150fc7d8b8a4b74c7adf31e7a57e465
Gecko f0d363d72753

Push log: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f0d363d72753&tochange=abe6790a5dd8
Flags: needinfo?(ktucker)
No longer blocks: 950266
Flags: needinfo?(cyu)
Unfortunately nothing it that range really jumps out at me :/
Only interesting points seen in the logcat are:

E/GeckoConsole( 1167): Content JS ERROR at app://gallery.gaiamobile.org/js/frame_scripts.js:1476 in displayImage: The thumbnail contained in the jpeg doesn't fitthe device screen. The full size image is rendered.This might cause out of memory errors

E/GeckoConsole(  284): Content JS WARN at app://system.gaiamobile.org/js/attention_screen.js:320 in as_tryLockOrientation/tryToUnlock: Attention screen fails on locking orientation, retrying..
Component: IPC → General
Product: Core → Firefox OS
Version: 28 Branch → unspecified
That's strange. I disable Nuwa on the latest 1.3 branch (gaia: f67b0020818272a8c8d8c1749ddb85907c1d5501, gecko: f0c0119ca2ddb053bd16c7822bad0cdb2587fa59) by:

> diff --git a/b2g/confvars.sh b/b2g/confvars.sh
> index 0912f57..ceed56a 100644
> --- a/b2g/confvars.sh
> +++ b/b2g/confvars.sh
> @@ -57,6 +57,6 @@ MOZ_PLACES=
>  MOZ_B2G=1
>  
>  if test "$OS_TARGET" = "Android"; then
> -MOZ_NUWA_PROCESS=1
> +MOZ_NUWA_PROCESS=
>  fi
>  MOZ_FOLD_LIBS=1

and the problem still can be reproduced, although with a little difference. Instead of seeing a blank screen, I saw a screen that looks like the gallery app was not refreshed (please refer to the attachment). I also see the errors/warnings as in comment #21.
Oh comment #19 already confirms this is not a Nuwa regression :).
The CanvasLayerComposite(the image edit preview) is disappeared after turn off and turn on the power.

Here is the layer's dump data.
http://pastebin.mozilla.org/4454737
We can see that the CanvasLayerComposite is disappeared.

Maybe we need to check the layout or layer parts.
Component: General → Graphics: Layers
Product: Firefox OS → Core
Version: unspecified → 28 Branch
Benoit, can you help with this?
Flags: needinfo?(bjacob)
I also can check this problem if Benoit has another block issue.
(In reply to Jerry Shih[:jerry] from comment #27)
> I also can check this problem if Benoit has another block issue.

Great, thanks!
Assignee: nobody → hshih
Flags: needinfo?(bjacob)
I just turn off and turn on the device. No call in. The screen will also be black.
Furthermore, when we change to the card view and go back, it is also black.
The crystal skull app has the same situation.
They are all using webgl.

When we turn off and turn on device, the webgl context will lost.
http://dxr.mozilla.org/mozilla-central/source/content/canvas/src/WebGLContext.cpp#902
Thus, we can't get the canvas layer for layout processing.

I will check why the webgl context lost.
Summary: [Sora][gallery]the edit picture screen display black screen during a call → [Sora][gallery]WebGL context is lost when the screen is turned off
First of all this is actually an optimization opportunity we should check for. When we send an app to the background we should make sure we force the WebGL context to be lost.

Second, when this happens, we should be sending a webglcontextlost event to content and then webglcontextrestored once we restore the context again when the device comes back.
Comment 30 is right, this sounds like the system is working as intended, and needs to be handled by the application. See Example VI there:

http://www.khronos.org/registry/webgl/specs/latest/1.0/#5.15.3

An example page implementing this is this WebGL conformance test:

https://www.khronos.org/registry/webgl/sdk/tests/conformance/context/context-lost-restored.html

WebGL contexts should, to the extent possible, be considered 'discardable' resources, especially on mobile devices. Besides turning off the screen, context loss would typically occur in low-memory situations, or when putting an application to the background, etc.

For that reason it is worth architecting WebGL-using applications in such a way that their entire WebGL state can be recreated at any time from non-WebGL state, so as to be ready to do that from a webglcontextrestored event handler.
(In reply to Jerry Shih[:jerry] from comment #29)
> ...
> 
> When we turn off and turn on device, the webgl context will lost.
> http://dxr.mozilla.org/mozilla-central/source/content/canvas/src/
> WebGLContext.cpp#902
> Thus, we can't get the canvas layer for layout processing.
> 
> I will check why the webgl context lost.

Just to be clear - we want to make sure the right messages are being sent and handled, we do not want to retain the context in this case.
I'm really surprised we want to kill contexts when they are sent to the background/screen turns off in the general case. If there's no memory pressure, we should just throttle their rAF to zero and leave them be. Reconstructing GL data takes both time and power, so context-lost should probably not be the default behavior as soon as a context leaves the foreground.

We should definitely keep our LRU and memory-pressure context lost logic, but unconditionally forcing context-lost when leaving the foreground does not seem like a good choice to me.
On a somewhat related note, I'm planning to change ProcessPriorityManager so that it does not modify the oom_adjust parameters for a process that goes to the background *unless* a new process comes to the foreground.  This is for more or less the same reason: it doesn't make sense to reclaim resources such as memory if nothing else wants to use them yet.
When we turn off the device, b2g will send the "low-memory" message at:
http://dxr.mozilla.org/mozilla-central/source/dom/ipc/ProcessPriorityManager.cpp?from=ProcessPriorityManager.cpp&case=true#1042

and then we force to lose the webgl context at:
http://dxr.mozilla.org/mozilla-central/source/content/canvas/src/WebGLContext.cpp#91

If we go back to revision 1f21bc9fc654, the webgl context will not lost when we turn off/on device.
to http://hg.mozilla.org/mozilla-central/rev/3d18224c5065
As Kyle said, I think we should modify the memory pressure condition in ProcessPriorityManager.
Thats the wrong fix. Gaia has to handle losing the context. Losing the context here is the right thing to do.
To clarify 36: the optimizations above in comment 33 are fine, we can try to avoid the context loss where it makes sense, but the real bug here is that Gaia doesn't handle context loss. That we have to fix first. In certain scenarios we will want to use the context, and Gaia must survive that.
I think that the original situation might be a real low memory pressure, so the webgl context is lost.

We have tree app here:
1.gallery
2.lock screen
3.communication

(In reply to sync-1 from comment #0)
> Mozilla build ID: 20140208004002  
>  DEFECT DESCRIPTION:
>  the edit picture screen display black screen during a call
>   REPRODUCING PROCEDURES:
>  1.Launch Gallery,tap a picture and click edit icon;
>  2.Press power key,lock the screen;
>  3.Receive a call,answer the call,tap homekey,unlock the screen
>  -->the edit screen display black screen-->ko

I will handle the context restore event in gallery first.
I think the gallery editor will work after doing that.
(In reply to Andreas Gal :gal from comment #36)
> Thats the wrong fix. Gaia has to handle losing the context. Losing the
> context here is the right thing to do.

I means that we should not always lost the context when we turn off the device.
We may or may not lose the context when we switch of the device. It can always happen. Bringing up the lockscreen for example might need just enough memory to be worth it losing the context.

On Tarako in general we want to always flush out as much memory as we can since compression is so expensive. Basically if we have zram enabled we should always compact as much as we can, and favor losing the context when in doubt.
Handle the webgl lose/restore context event in gallery app.
I can see the image back now.
There is one thing weird for webgl context lose/restore.
When I turn off the device, I can see the webgl context lost, but we receive the context restore event immediately. All webgl resources(ex. texture) go back. Even though I don't turn on the device.

If we turn off/on the device quickly, the screen is still black for a while. We need to wait the context restore processing.
Hi Matthew,
Could you try this patch?
Flags: needinfo?(mvaughan)
Status: NEW → ASSIGNED
When we handle the lose event as:
  canvas.addEventListener('webglcontextlost', function(event) {
    self.lostContext = true;
    event.preventDefault();
  }, false);

we will restore the context at:
http://dxr.mozilla.org/mozilla-central/source/content/canvas/src/WebGLContext.cpp?from=WebGLContext.cpp&case=true#1274

It's very strange. Lose the device and wait 1000 ms, then restore.
Benoit, what do you think?
Flags: needinfo?(bjacob)
The way that WebGL works is that the application first receives a webglcontextlost event; if it handles it and calls preventDefault() on it, then it will later receive a webglcontextrestored event, at which time it can resume making normal WebGL calls (but has to recreate all its WebGL state from scratch). See the spec link in comment 31.

The code linked to in comment 44 is how this is currently implemented; if this is unsatisfactory to you (for example if the 1000 ms timer delay is too long), then we can consider changing that.

Does that answer your question?

Regarding the earlier comments about whether we should be losing the WebGL context in the particular present situation: keep in mind that regardless of the details of the present situation here, there will always be circumstances where WebGL context loss happens; even if we never triggered context loss ourselves in the WebGL implementation, the EGL driver can decide to lose a context "under our feet" at any time, and since that is driver-dependent, it could happen on any device without happening on other devices. For these reasons, the only way to get Gaia apps to perform reliably on all devices and in all circumstances will be to handle WebGL context loss.
Flags: needinfo?(bjacob)
Note that you can programmatically trigger WebGL context loss using the WEBGL_lose_context extension. This should allow to reproduce this bug much more comfortably than having to e.g. turn off the screen of switch applications.

Just do this in JavaScript, assuming that your WebGL context is 'gl':

  var WEBGL_lose_context = gl.getExtension("WEBGL_lose_context");
  WEBGL_lose_context.loseContext();

and once the context is lost (in the webglcontextlost event handler), call:

  WEBGL_lose_context.restoreContext();

(Note that you can't call getExtension on a lost context, so you need to keep this WEBGL_lose_context variable around).

See this conformance test:
http://www.khronos.org/registry/webgl/sdk/tests/conformance/context/context-lost-restored.html
(In reply to Jerry Shih[:jerry] from comment #43)
> Hi Matthew,
> Could you try this patch?

Can you provide this patch in the form of a github branch? We need that to be able to test this bug.
Component: Graphics: Layers → Gaia::Gallery
Flags: needinfo?(mvaughan)
Product: Core → Firefox OS
Version: 28 Branch → unspecified
Just to be clear and explicit as there are different conversations going on.  This bug is only about "what do we do once the context is lost for whatever reason".  It does not attempt to deal with or change "when is the context lost" part.  We know the context can be lost.  We are going to deal with it properly once this bug is fixed, and the "we removed the loss of the context" is not the proper fix for this bug.

Any other conversation about "when should we force the context lost, etc.", for performance, power consumption or any other reasons should be in another bug.

Jerry, we talked about this last night - please open another bug for the conversation on "should we get rid of the context in some situations" topic.  This bug here is just about dealing with the context lost, and (I don't mind repeating myself), we are not going to deal with it by removing the cause of the context getting lost, but by dealing with the loss itself.
I so want a test of this using comment 46.
Attached file github_link.txt (obsolete) —
like this?
I am not familiar with gaia patch form.
Flags: needinfo?(jsmith)
(In reply to Jerry Shih[:jerry] from comment #50)
> Created attachment 8386181 [details]
> github_link.txt
> 
> like this?
> I am not familiar with gaia patch form.

That looks right, although the branch there needs to rebased against the latest master.
Flags: needinfo?(jsmith)
Jerry, can you find help from all the Gaia people in Taipei, no reason to spend time sorting out git patches instead of graphics work :)
for gaia 1.3 branch
Attachment #8385990 - Attachment is obsolete: true
Attachment #8386181 - Attachment is obsolete: true
for gaia master branch
Attachment #8386531 - Flags: review?(dflanagan)
Attachment #8386531 - Flags: review?(bjacob)
Attachment #8386532 - Flags: review?(dflanagan)
Attachment #8386532 - Flags: review?(bjacob)
I'd like to point out that a 1.3+ bug is not an "optimization opportunity" per comment 31, and is also not the place for statements like (per comment 31) "applications should be architected so that..."

Jerry's patch can't solve this for gallery in the general case. Though it might be made to work partially. I don't understand the issue well, but I don't see any easy way to rearchitect Gallery for 1.3 or for 1.4. I think that would be feature work, not bug fixing.

I really don't get the bit about the context being given back to us after 1000ms. If we just get it back, even while the screen is locked, why was it taken away in the first place?  That is, if locking the screen causes context loss followed by having the context restored one second later, why have the context loss in the first place?

If we can't come up with anything better, we could just have Gallery abandon any edits in progress on visibility change events when it is no longer visible.  But the user could lose work in that case.

The trickiest part I see is dealing with context loss while saving an edited image. That saving process can take 5 seconds or more for big images, and from this bug it sounds like it will fail if the user locks the screen while it is in progress.
Comment on attachment 8386531 [details] [review]
handle lose/restore webgl context event in gallery (gaia 1.3)

r- because calling finishEdit() when the context is restored is not always the right thing to do and will not work in the getFullSizeBlob() case.
Attachment #8386531 - Flags: review?(dflanagan) → review-
Comment on attachment 8386532 [details] [review]
handle lose/restore webgl context event in gallery (gaia master)

Clearing the review flag. This looks like exactly the same patch as the other one, and does not need separate review.  (I actually meant to review this patch and clear the flag on the other, but got it wrong.)
Attachment #8386532 - Flags: review?(dflanagan)
Create bug 980215 for memory pressure condition discussion and bug 980221 for webgl context lose/restore handling in b2g.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #34)
> On a somewhat related note, I'm planning to change ProcessPriorityManager so
> that it does not modify the oom_adjust parameters for a process that goes to
> the background *unless* a new process comes to the foreground.  This is for
> more or less the same reason: it doesn't make sense to reclaim resources
> such as memory if nothing else wants to use them yet.

Sounds like a good idea to me, have you opened a bug already for it?
(In reply to David Flanagan [:djf] from comment #56)
> Comment on attachment 8386531 [details] [review]
> handle lose/restore webgl context event in gallery (gaia 1.3)
> 
> r- because calling finishEdit() when the context is restored is not always
> the right thing to do and will not work in the getFullSizeBlob() case.

Yes, I got the wrong result when the context lose during calling getFullSizeBlob().
I will debug for that.
(In reply to David Flanagan [:djf] from comment #55)
> I'd like to point out that a 1.3+ bug is not an "optimization opportunity"
> per comment 31,

Indeed, for 1.3+ we are only concerned about getting correct behavior, not about minimizing memory usage further, but from the perspective of a Gaia application, what it takes is still the same one thing: hangle webglcontextlost/webglcontextrestored events. So that has to be done regardless.

> and is also not the place for statements like (per comment
> 31) "applications should be architected so that..."

This (organizing [maybe "architecting" as too big a word] WebGL code so as to be able to restore all WebGL state in a webglcontextrestored event handler) is necessary already from the "get correct behavior" perspective i.e. survive arbitrary system events (of which turning the screen of is just one possibility) on arbitrary devices.

> 
> Jerry's patch can't solve this for gallery in the general case. Though it
> might be made to work partially. I don't understand the issue well, but I
> don't see any easy way to rearchitect Gallery for 1.3 or for 1.4. I think
> that would be feature work, not bug fixing.

Please elaborate: since Gallery only has fairly limited usage of WebGL, it didn't seem to me that it would take very large changes to reorganize WebGL code to make it all restorable from a webglcontextrestorerd event handler, but I haven't looked very closely yet.

> 
> I really don't get the bit about the context being given back to us after
> 1000ms. If we just get it back, even while the screen is locked, why was it
> taken away in the first place?  That is, if locking the screen causes
> context loss followed by having the context restored one second later, why
> have the context loss in the first place?

If that is the behavior that Gecko implements at the moment, then that is a Gecko bug that needs to be fixed separately. I thought, and the intent was, to only generate a webglcontextrestored event after 1 second; the idea was that Gecko could pause the delivery of events if an application is in the background. If the webglcontextrestored event handler is actually called after 1 second when it's not useful (e.g. application in the background, or screen off) then that's a Gecko bug.

> 
> If we can't come up with anything better, we could just have Gallery abandon
> any edits in progress on visibility change events when it is no longer
> visible.  But the user could lose work in that case.

It wasn't clear to me that Gallery would rely so much on WebGL state during edits that it couldn't easily recreate its WebGL state from non-WebGL state. But if that were the case, then yes we would have to do that.

> 
> The trickiest part I see is dealing with context loss while saving an edited
> image. That saving process can take 5 seconds or more for big images, and
> from this bug it sounds like it will fail if the user locks the screen while
> it is in progress.

If saving an edited image uses WebGL, then indeed it will fail at the time of context loss. But it shouldn't be hard to retry after the context has been restored.
comment 59: this is not strictly true with ZRAM. For reasonable app switching time we need to make sure there is actual available RAM, not just available compressed swap since swapping out memory is so slow.
(In reply to Andreas Gal :gal from comment #62)
> comment 59: this is not strictly true with ZRAM. For reasonable app
> switching time we need to make sure there is actual available RAM, not just
> available compressed swap since swapping out memory is so slow.

We were actually considering of turning off memory-pressure events entirely for background applications on zRAM-enabled devices, see bug 963477 and bug 975360. The issue with them is that once an application is sent to the background parts of it can be swapped out to zRAM; forcing a GC/CC cycle and various observers to run can cause an I/O storm in which pages are swapped in and out of zRAM hampering whatever is running in the foreground. So for those devices the general consensus reached in bug 963477 is that we should let the LMK reap background processes instead of trying to minimize their memory footprint as it's faster and more efficient to deal with them that way.
By the way, there was talk above about regression windows. Have we figured exactly what regressed it?  My comments on this bug were mostly ignoring the regression aspect...
Attachment #8386531 - Attachment is obsolete: true
Attachment #8386531 - Flags: review?(bjacob)
update github patch
https://bugzilla.mozilla.org/attachment.cgi?id=8386532

handle the lose context during image saving(for master)
Comment on attachment 8386532 [details] [review]
handle lose/restore webgl context event in gallery (gaia master)

If I turn on/off the device quickly for several times during saving image, the the processNextTile() will not be called sometimes. I don't know whether I make mistake in gaia code.

David, could you take a look for this patch? I will also add more log in b2g to check the event processing.

You may need to change this value, or you will trigger the "screen reader" if you turn on/off power key for 3 times.
https://github.com/JerryShih/gaia/blob/master/apps/system/js/accessibility.js#L15
Attachment #8386532 - Flags: feedback?(dflanagan)
So, I spent some time today debugging this, with printfs in Gecko and in the gallery app.

First (in case this issue happens to other people) I had to 'make reset-gaia' before the device would run the gallery app code that I was building and flashing (!).

I'm attaching my gecko patch allowing to study what happens in the WebGL implementation while context loss/restoration happens.

With Jerry's new patch, on master, I can't reproduce any user-visible problem: I can turn off and on the device many times during either preview or saving, the logs confirm that context loss and restoration happens thanks to the event handler from Jerry's patch. Everything works fine.

However, logs do show something spurious that is worrying me a little and, if other people can still reproduce user-visible issues, could explain them. More on that below.
Attached patch debug-webgl-context-loss (obsolete) — Splinter Review
This allows to study what happens in Gecko during context loss and restoration.
(In reply to Benoit Jacob [:bjacob] from comment #67)
> However, logs do show something spurious that is worrying me a little and,
> if other people can still reproduce user-visible issues, could explain them.
> More on that below.

Ignore that, I was just mis-interpreting my logs. I can't see anything wrong at the moment, with Jerry's latest patch.

Just one suggestion in case anyone can reproduce a failure to restore the context. In WebGLContext.cpp:

void
WebGLContext::MaybeRestoreContext()
{
    // Don't try to handle it if we already know it's busted.
    if (mContextStatus != ContextNotLost || gl == nullptr) {
        return;
    }

This early-return might conceivably be wrong. My above patch has a printf there, saying "leaving early". This does show up in my above log, but I think it's OK, and in all my logs, the context ends up being properly restored at the end, even after this "leaving early". Still, if you can reproduce a failure to restore, this would be a place where to look.
Attachment #8386532 - Attachment is patch: true
Attachment #8386532 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8386532 - Attachment is patch: false
Comment on attachment 8386532 [details] [review]
handle lose/restore webgl context event in gallery (gaia master)

R+ from my WebGL perspective, this does the right thing, but I'm not a Gallery reviewer and I can't comment wrt application logic here. Just saying, the patch looks correct and the logs confirm that it does correctly and successfully drive context restoration in my local testing.
Attachment #8386532 - Flags: review?(bjacob) → review+
With some on/off tempo during image saving, the canvas element is gone.
http://dxr.mozilla.org/mozilla-central/source/content/canvas/src/WebGLContext.cpp#1247

After this, the restore event will not be called.
I will check the canvas element life cycle.
Comment on attachment 8386532 [details] [review]
handle lose/restore webgl context event in gallery (gaia master)

Hi David,
I use a ctx to save all states during the image processing, and the canvas will not be delete now. I can receive the restore event even when I turn on/off for many times.
Attachment #8386532 - Attachment mime type: text/plain → text/x-github-pull-request
Ah, interesting. The log I attached here shows that I never hit this case, otherwise the log would say "Leaving early, canvas is gone" per the patch attached above.
I don't know why the canvas is gone. Maybe the local variable in getFullSizeBlob() will be gc(ex. the tile canvas). The processNextTile() will be called even when the getFullSizeBlob() finished. 
https://github.com/JerryShih/gaia/blob/master/apps/gallery/js/ImageEditor.js#L739
Comment on attachment 8386532 [details] [review]
handle lose/restore webgl context event in gallery (gaia master)

Jerry,

Thanks for your work on this. I don't think I have any device that I can duplicate the bug on, so I can't test your patch, but I like what you've done with the callbacks.

I think you may not even need that context object, since the outer function is a closure for all the state values.

See my comments on github.  But basically, if you can test this and if fixes the bug, I'm in favor of landing it, especially if you can clean it up by removing the context object.
Attachment #8386532 - Flags: feedback?(dflanagan) → feedback+
(In reply to David Flanagan [:djf] from comment #77)
> 
> See my comments on github.  But basically, if you can test this and if fixes
> the bug, I'm in favor of landing it, especially if you can clean it up by
> removing the context object.

We need to hold the canvas object.
The |processNextTile()| function will exit when the webgl context lost.
https://github.com/JerryShih/gaia/commit/93cfa650167dc689c76751fa5149cc8d9f5922b0#diff-6e48fa715fc315697388e7075d015a41R792
This |return| prevents the cpu to run the useless |processNextTile()| function loop(leave the |setTimeout(processNextTile)| loop).
If the webgl context lost and all |processNextTile()| return, there is nobody holding the reference to the local variable(ex. the tile canvas). They might be GC.
I can see the gecko CC the HTMLCanvasElement in my log if I don't use the ctx implementation, and the HTMLCanvasElement's size is the same as the tile canvas's size.
Thus, I can make sure that the tile canvas is gone.

We need to hold these ojbect in somewhere. David, what do you think?
Flags: needinfo?(dflanagan)
(In reply to David Flanagan [:djf] from comment #77)
> 
> Thanks for your work on this. I don't think I have any device that I can
> duplicate the bug on, so I can't test your patch, but I like what you've
> done with the callbacks.

The webgl context will lose when we turn off the device if gecko's revision is after:
http://hg.mozilla.org/mozilla-central/rev/3d18224c5065
If your device is newer than this revision, you can test it.
We should restart tile processing instead of forcing the context to stay alive so we can better handle OOM.
Release the resource when the context losts, and restart the tile processing when restore.
Jerry,

Thanks for your continued work on this. I left some comments on github, but basically I really don't understand webgl well enough to offer much help.  

To respond to comment #78, when looking at your patch, I thought that the registered callbacks would be enough to keep the enclosing function and its variables around in memory, which is why I thought that you wouldn't need to maintain state in a separate object.  But if you've tried it and it doesn't work, I can't argue with that!

On this patch if we're trying to release resources on context lost for the image saving function, maybe you should let the internal canvas go as well and restart completely from scratch (i.e., don't pass false as the last argument).

Earlier in this bug there was discussion of the fact that after losing the context, it is restored a second or two later.  I don't understand why, but if that is happening, and our goal is to release resources when the user is not using the app, then it seems to me that we should not respond to context restored and instead just wait for the user to come back to the gallery app and click the Save button again.

In fact, I'm coming around to the view that I'd be okay with some data loss here. If the user locks the screen or switches apps while editing an image (and we lose the context) we just exit edit mode and discard their pending edits. If that is what it takes to fix that, its fine with me.  If there is ever time to refactor this, I'm going to try to avoid WebGL anyway, at least for the image saving case.
Flags: needinfo?(dflanagan)
Hi David,
Could you check the attachment 8391367 [details] [review]. I simply the lost mechanism. I just cancel the processing task when context lose and show the save button again for user to press. In this time, all resource(image canvas, webgl canvas and texture) will release.

(In reply to David Flanagan [:djf] from comment #82)
> here. If the user locks the screen or switches apps while editing an image
> (and we lose the context) we just exit edit mode and discard their pending
> edits. If that is what it takes to fix that, its fine with me.  If there is
> ever time to refactor this, I'm going to try to avoid WebGL anyway, at least
> for the image saving case.

If we just exit the edit mode, I feel a little strange. I prefer to stay at edit mode. We can show more message for cancel task.
Flags: needinfo?(dflanagan)
Handle webgl context lost/restore for v1.3.
Skip the image processing when lost. User should press the save button again.
Attachment #8386532 - Attachment is obsolete: true
Attachment #8391367 - Attachment is obsolete: true
Attachment #8394005 - Flags: review?(dflanagan)
Attachment #8394005 - Flags: review?(bjacob)
Flags: needinfo?(dflanagan)
Handle webgl context lost/restore for master.
Skip the image processing when lost. User should press the save button again.
Attachment #8394006 - Flags: review?(dflanagan)
Attachment #8394006 - Flags: review?(bjacob)
Comment on attachment 8394006 [details] [review]
handle lose/restore webgl context event in gallery (master)

r+ from a pure webgl perspective, with two nits on the github PR; but I don't know the application logic here, so I can't comment on other aspects.
Attachment #8394006 - Flags: review?(bjacob) → review+
Attachment #8394005 - Flags: review?(bjacob) → review+
Whiteboard: [in-code-review]
Comment on attachment 8394006 [details] [review]
handle lose/restore webgl context event in gallery (master)

I still haven't figured out how to reproduce the bug, but maybe I'm just using an old version of gecko.

I like how simple the fix has become.  I'm sending it back for a few more simplifications, but overall, it seems quite solid to me.  (I'm still relying on others to actually test it and verify that it works, however.)

I don't see anything wrong with the code. At this point I'm just asking for changes that will make it easier to understand and maintain.

Nice work, Jerry!
Attachment #8394006 - Flags: review?(dflanagan) → review-
Comment on attachment 8394005 [details] [review]
handle lose/restore webgl context event in gallery (gaia 1.3)

Clearing the review flag on the 1.3 patch since I assume it is almost identical to the other patch that I just reviewed,
Attachment #8394005 - Flags: review?(dflanagan)
Comment on attachment 8394006 [details] [review]
handle lose/restore webgl context event in gallery (master)

Update for review comment 86 and comment 87.
Attachment #8394006 - Flags: review- → review?(dflanagan)
Matthew,
Could you try these patches for 1.3 and master?

https://bugzilla.mozilla.org/attachment.cgi?id=8394005   => 1.3

https://bugzilla.mozilla.org/attachment.cgi?id=8394006   => master
Flags: needinfo?(mvaughan)
Keywords: qawanted
(In reply to Jerry Shih[:jerry] from comment #90)
> Matthew,
> Could you try these patches for 1.3 and master?
> 
> https://bugzilla.mozilla.org/attachment.cgi?id=8394005   => 1.3

The issue still reproduces for me on the 03/26/14 1.3 build on a Buri after applying the patch. I used the STR from comment 0.

> https://bugzilla.mozilla.org/attachment.cgi?id=8394006   => master

The issue does *not* reproduce for me on the 03/27/14 Master 1.5 build on a Buri after applying the patch. I used the STR from comment 0.
Flags: needinfo?(mvaughan)
Keywords: qawanted
Matthew,

Can you confirm that you can still reproduce the issue on 1.5 without the patch?
Flags: needinfo?(mvaughan)
Comment on attachment 8394006 [details] [review]
handle lose/restore webgl context event in gallery (master)

Clearing the review request given that QA says the 1.3 patch does not work.

I've left a couple of github comments on the 1.3 version of the patch, however, including one bug in the destroy() method.
Attachment #8394006 - Flags: review?(dflanagan)
(In reply to David Flanagan [:djf] from comment #92)
> Matthew,
> 
> Can you confirm that you can still reproduce the issue on 1.5 without the
> patch?

On the 03/27/14 Master 1.5 build, the issue seems to partially reproduce. When unlocking the phone during a phone call, the Edit screen will show but the image being edited will not be present. However, the image will be present after applying the patch.

Attached is a screenshot for what 1.5 looks like without the patch applied.
Flags: needinfo?(mvaughan)
Sorry for the "WRONG" direction for 1.3 device with this issue.
I have the buri device now.

For master, the webgl context will lost when we turn off device. We need the gallery's patch to handle context lost/restore.
But for 1.3, the black screen comes from forever wait for a condition variable.
https://github.com/mozilla/gecko-dev/blob/B2G_1_3_20140317_MERGEDAY/gfx/gl/SurfaceStream.cpp#L463

It's the same problem as bug 947227.

James, this patch is just from bug 947227.
Attachment #8398549 - Flags: review?(snorp)
Matthew, could you test the attachment 8398549 [details] [diff] [review] for 1.3 buri device?
I test my device as comment 0 step and it works well.
Keywords: qawanted
David, should we handle the webgl context problem for 1.3?
The 1.3 totally black screen in gallery is not a gallery's problem.
Flags: needinfo?(dflanagan)
(In reply to Jerry Shih[:jerry] from comment #96)
> Matthew, could you test the attachment 8398549 [details] [diff] [review] for
> 1.3 buri device?
> I test my device as comment 0 step and it works well.

Our contractors don't really have a build system setup to do custom gecko builds. However, I could get someone else to verify this who does custom gecko builds.

No-Jun - Can you look into this?
Flags: needinfo?(npark)
Whiteboard: [in-code-review] → [in-code-review] QARegressExclude
Attachment #8398549 - Flags: review?(snorp) → review+
Jerry, in comment #29 you say that the phone call is not necessary to reproduce this bug, and that the webgl context lost also occurs if you just lock and unlock the screen.  Is that still true? (I've never been able to reproduce that) And does doing that on 1.3 cause bug 947227 to occur?

That is: is there any bug in 1.3 that is resolved by this patch alone?  If not, then we shouldn't take the risk of landing it to 1.3, since there is no reward.

But if this bug is deemed severe enough to be 1.3+ and we need 947227 landed to fix this, then we should nominate that bug for 1.3 and uplift this to 1.3 if and when that one lands.  If this bug can be reproduced in 1.3 without the phone call then it seems pretty severe and worth the 1.3+ designation.  But if this bug only happens when editing images while on a phone call and fixing it requires a significant gecko change, then maybe we should reset this to 1.3? and ask to have it re-triaged.

It does seem clear to me that Jerry's patch (with minor fixes) should land in 1.5 and 1.4.

needinfo Jerry: does this bug reproduce without the phone call in 1.3? (And 1.5?)  Is there any bug that the gaia patch here can resolve without the gecko patch from bug 947227?  If not, I'd suggest that you set 1.3? on bug 947227.  And if this bug only happens in 1.3 when there is a phone call, then I'd suggest also that you change this bug to 1.3? and have it re-triaged.
Flags: needinfo?(dflanagan) → needinfo?(hshih)
(In reply to David Flanagan [:djf] from comment #99)
> Jerry, in comment #29 you say that the phone call is not necessary to
> reproduce this bug, and that the webgl context lost also occurs if you just
> lock and unlock the screen.  Is that still true? (I've never been able to
> reproduce that) And does doing that on 1.3 cause bug 947227 to occur?

Sorry for that. There is no context lost if I turn on/off device for "1.3".
I test with mozilla-central, not 1.3. Sorry.

Bug 947227 said that we might hang at WaitForCompositor() when the webgl app at background. 
It's not related to context lost.
 
> 
> That is: is there any bug in 1.3 that is resolved by this patch alone?  If
> not, then we shouldn't take the risk of landing it to 1.3, since there is no
> reward.

Which patch?
The gecko's patch(attachment 8398549 [details] [diff] [review])
or
the gaia lost/restore handling patch(attachment 8394005 [details] [review])?

> 
> But if this bug is deemed severe enough to be 1.3+ and we need 947227 landed
> to fix this, then we should nominate that bug for 1.3 and uplift this to 1.3
> if and when that one lands.  If this bug can be reproduced in 1.3 without
> the phone call then it seems pretty severe and worth the 1.3+ designation. 
> But if this bug only happens when editing images while on a phone call and
> fixing it requires a significant gecko change, then maybe we should reset
> this to 1.3? and ask to have it re-triaged.
> 
> It does seem clear to me that Jerry's patch (with minor fixes) should land
> in 1.5 and 1.4.
> 
> needinfo Jerry: does this bug reproduce without the phone call in 1.3? (And
> 1.5?)  Is there any bug that the gaia patch here can resolve without the
> gecko patch from bug 947227?  If not, I'd suggest that you set 1.3? on bug
> 947227.  And if this bug only happens in 1.3 when there is a phone call,
> then I'd suggest also that you change this bug to 1.3? and have it
> re-triaged.

Gallery editor mode works well when just turn on/off device for 1.3.
There is no context lost when we just turn off device.

For 1.5, the context will lost when turn off device(the new memory pressure rule http://hg.mozilla.org/mozilla-central/rev/3d18224c5065).

For 1.3, I think we don't have a easy way at gaia to fix the problem in bug 947227.
For 1.5, it already has bug 947227 patch.
Flags: needinfo?(hshih)
Per an IRC discussion with Jerry, there are actually two different bugs here:

1) the initial bug reported in comment #0 affects release 1.3 and is a duplicate of bug 947227. The fix for that bug is to uplift 947227. Jerry things (but will double-check) that the Gaia patch attached here is not required to fix this 1.3 bug.

2) There is also the bug that Jerry identified in comment #29. That bug does not require a background phone call, and does not affect release 1.3  That second bug is fixed by the Gaia patch that Jerry has attached here.

I've set 1.3? on bug 947227. That should be re-triaged to decide whether it is worth uplifing. Note that this 1.3 bug occurs only when editing images while also on a phone call and maybe does not need a fix.

I've changed this bug from 1.3+ to 1.4? (I don't have the power to just set it 1.4+, but that is what we should do.)

I do not know which of the two patches is needed for 1.3T. QA wanted to find out if 1.3T is affected by the comment #29 version of the bug or the comment #0 version of the bug.
blocking-b2g: 1.3+ → 1.4?
Depends on: 947227
Okay - I'm really confused. Let me see if I can clear up what's mentioned above.

1. The original bug as filed here is a dupe of bug 947227. Is that correct?
2. The other issue found is a new problem found by investigating this issue. Is that right?

If so, for the sake of clarity & avoiding a bug morph, can we do the following:

1. Dupe this bug over to bug 947227 & keep the 1.3? nomination
2. Open a new bug for the issue found during the investigation of this issue & triage it accordingly based on if it's a regression or not & the bug's severity.

Does that work?
Flags: needinfo?(npark) → needinfo?(dflanagan)
Keywords: qawanted
This is my test env:
gecko 1.3
commit 7b0efc4aeaa5bc55beead30b40518bcf054b5105
gaia 1.3
commit 0d54a796bdaeb96ca7fb63da2166a46b4f8859f2

I just apply the gecko bug 947227 patch. I DON'T modify the gaia part.
And it works will when use the comment 0 step.
So for 1.3, uplift bug 947227 will solve this issue.

--
I don't know whether have risk to add the webgl context handling to gaia 1.3.
There is no webgl context lost bug for gaia 1.3, but we actually don't handle this in 1.3.
I'm going to move forward with duping this.

Let's open a new bug for the second issue identified during the investigation of this bug.
Status: ASSIGNED → RESOLVED
blocking-b2g: 1.4? → ---
Closed: 10 years ago
No longer depends on: 947227
Resolution: --- → DUPLICATE
Component: Gaia::Gallery → Canvas: WebGL
Product: Firefox OS → Core
Summary: [Sora][gallery]WebGL context is lost when the screen is turned off → [Sora][gallery]the image editor hang at WaitForCompositor() during a call
bug 989847 was filed for the additional problem found during investigation of this bug.
Flags: needinfo?(dflanagan)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: