Closed Bug 256180 Opened 16 years ago Closed 11 months ago

Deeply nested elements are not rendered (TAGLVL has been exceeded)

Categories

(Core :: Layout, defect, major)

defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla66
Tracking Status
firefox66 --- fixed

People

(Reporter: brian, Assigned: hsivonen)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Keywords: testcase, Whiteboard: [webcompat])

Attachments

(6 files, 6 obsolete files)

780 bytes, text/html
Details
1.12 KB, patch
snorp
: review+
Details | Diff | Splinter Review
14.46 KB, patch
hsivonen
: review+
Details | Diff | Splinter Review
489.43 KB, patch
hsivonen
: review+
Details | Diff | Splinter Review
1.62 KB, patch
glandium
: review+
Details | Diff | Splinter Review
3.12 KB, patch
snorp
: review+
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040616
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/2004081709

Note: I have also seen this problem in the 20040616 Windows build on XP, I
downloaded the 2004081709 to confirm it with the latest version and got the same
problem.

The problem was discovered on a phpbb forum, which is restricted to registered
users, so I copied the html and made another page for it (which is why some of
the images and none of the links work), doing my best to remove irrelevant code.

In addition to the nested tables not showing beyond a certain point (an already
reported bug), once the nesting reached a certain point, it seemed that the
browser was interperating the </table>'s but not the <table>'s, resulting in the
browser breaking out of the outermost tables prematurely.  The problem begins
slightly above http://nerdlife.net/mozilla/index.html#30446 , where you can see
that the "class=" is not taking effect after the "*flees in terror from the
cannibal quotes*".  Then onward from that it continues to break out of the
tables, until http://nerdlife.net/mozilla/index.html#30447 , when there is no
sign of tables besides the quotes.

Reproducible: Always
Steps to Reproduce:
1.Goto http://nerdlife.net/mozilla/index.html#30446
2.Scroll up slightly, until you reach "*flees in terror from the cannibal quotes*"
3.Observe problem form there down.

Actual Results:  
Incorrect display

Expected Results:  
Displayed the same way as the top 4 posts.

No addition information
Also, my next post to the site brought it over to another page, so if you only
want to see the bug and not the correct post, head on over to
http://nerdlife.net/mozilla/index2.html
Attached file Testcase
Testcase showing behavior in both Mozilla and IE
Note that this behavior is also exhibited with 200 or more of any other
container element (span, div, blockquote, etc).  Tables, however, have a hard
limit of 50.
I do not experience the error in IE version 6.0.2800.1106
I'm not sure whether we should really mark bugs like this duplicate of bug
58917.  Probably not, since that one's about tags that should have been closed.
Taking bug, patch coming up...
Assignee: nobody → mats.palmgren
Status: UNCONFIRMED → NEW
Component: Layout: Tables → HTML: Parser
Ever confirmed: true
Keywords: testcase
OS: Windows XP → All
Attached patch Patch rev. 1 (obsolete) — Splinter Review
The current limit was set 03-Aug-2001, I think it's over-pessimistic for the
average PC today.
Attachment #156544 - Flags: review?(bzbarsky)
Maybe the deCOMtamination that's been done since then means that each level
uses less memory too?
Comment on attachment 156544 [details] [diff] [review]
Patch rev. 1

This has little to do with systems improving.  Stack size is generally limited
separately from memory.  Furthermore, deCOMtamination has little to do with
improving the amount of stack we use -- the main problem is the huge
nsHTMLReflowState, nsBlockReflowState, etc. objects we put on the stack.  One
big way of improving that would be to use the pres shell's StackArena for
allocating them -- I think there's a bug somewhere on doing that.

However, I'm against changing this without getting concrete numbers.  Also,
XP_MAC is dead (it's Classic Mac).
Attachment #156544 - Flags: review?(bzbarsky) → review-
FWIW, on Fedora Core 2,

$ ulimit -a | grep stack
stack size            (kbytes, -s) 10240
FWIW, on SuSE 9.1,
test@linux:~> ulimit -a|grep stack
stack size            (kbytes, -s) unlimited

Even on Fedora Core 2, isn't the number you have dynamically set by the OS
based on how much RAM there is?
*** Bug 259898 has been marked as a duplicate of this bug. ***
*** Bug 278348 has been marked as a duplicate of this bug. ***
(In reply to comment #9)
> However, I'm against changing this without getting concrete numbers.

   System (RAM)                       default ulimit -s (kilobytes)
1. Linux debian 2.2.20/i386 (256MB):  8192
2. Red Hat 9/i386 (1GB):              8192
3. FreeBSD 5.3/i386 (512MB):          65536
4. SuSE 8.?/i386 (128MB):             unlimited
5. Solaris 8/sparc (256MB):           8192
6. Windows XP+sp2/i386 (1GB):         2043 (according to cygwin ulimit)
7. SuSE 9.0/i386 (1GB):               unlimited
8. MacOSX 11.3/G4 (512MB):            8192

Test results (all builds with --disable-optimize & --enable-debug):
6. crash at frame depth 331 (Firefox static build)
7. Mozilla:
     ulimit -s      crash at frame depth
     800            292
     1024           375
     2043           753 (compare with XP's:331)
     2048           756
     4096           1518
     8192           3042
8. Camino:
     ulimit -s      crash at frame depth
     1024           368
     2048           744
     4096           1497
     8192           3004


So, for Linux/MacOSX a MAX_REFLOW_DEPTH of say 2800 seems safe.
For Windows I would suggest 300, I haven't tested 98/ME/2000 but
those could be verified using nightly builds?
For other systems we should probably stay with the current value 200
until we have figures.
The Mac situation is complicated.... apparently different OSX installs have the
stack depth limited very differently.  We've had JS engine crashes due to that,
actually.
Hmm, maybe we should actually calculate a max frame depth on platforms
that supports getrlimit(RLIMIT_STACK)?
Blocks: 205564
Severity: normal → major
Summary: Upon too many nested tables, Mozilla begins to break out of the outer tables incorrectly → Deeply nested elements are not rendered (TAGLVL has been exceeded)
suse, ulimit -s = unlimited.

i don't think 278348 is a dup of this. since 278348 have only one level nested
table.
*** Bug 287629 has been marked as a duplicate of this bug. ***
*** Bug 303373 has been marked as a duplicate of this bug. ***
Blocks: 309658
Whether the tag is closed or unclosed, it should never be the case that Mozilla
just plain gives up and stops displaying content. The patch for bug 58917
watches over certain formatting tags and omits them if the stack depth reaches
dangerous levels.

Why not add a whole bunch more tags to that list? It would certainly be better
than letting half the page disappear into the void.
Depends on: 58917
Hardware: PC → All
*** Bug 261748 has been marked as a duplicate of this bug. ***
*** Bug 260548 has been marked as a duplicate of this bug. ***
*** Bug 254767 has been marked as a duplicate of this bug. ***
*** Bug 302319 has been marked as a duplicate of this bug. ***
*** Bug 322394 has been marked as a duplicate of this bug. ***
How much slower will this be on most architectures using a custom stack allocated on the heap rather than the callstack?  (I know it will be slower, but by how much?)  Is it possible to switch from using the system stack to using a custom stack data structure once the system stack gets close to full?

This would be accomplished by having a delegation function that creates the stack on the heap, a node at a time, and then calls each function that would be called at that place in the call hierarchy.

In fact, if there is one call per node in the DOM (i.e. if there is a 1-1 mapping between the DOM and the call hierarchy), this can handled without even allocating a new stack, by capturing the state of execution (any values passed up/down in the call hierarchy) in additional node properties, and then simply walking the DOM iteratively in some appropriate order (e.g. a depth-first postorder traversal).
Duplicate of this bug: 380822
Duplicate of this bug: 377056
Duplicate of this bug: 409263
Should this bug's summary be changed now that bug 58917 is fixed?
Blocks: 428727
Duplicate of this bug: 438395
See also bug 452038 for increasing the maximum tag nesting limit.
Duplicate of this bug: 453552
QA Contact: layout.tables → parser
Duplicate of this bug: 529585
No longer blocks: 309658
Duplicate of this bug: 309658
Setting dependency of bug 452038 pointed by comment #32, for ease of tracking.

FYI.
TAGLVL referred in bug summary comes from;
  http://www.w3.org/TR/html4/sgml/sgmldecl.html
After transfer to HTML5 parser, SGML is irrelevant because of HTML 5.  
Following is copy of bug 452038 comment #15.
> We don't seem to have TAGLVL anymore.  We do still have MAX_REFLOW_DEPTH
> (which as it happens is different on Mac Classic and WINCE from everything else).
Blocks: 452038
Duplicate of this bug: 911332
Depends on: 1073463
Duplicate of this bug: 1089044
Duplicate of this bug: 1216353
Assignee: mats → nobody
Duplicate of this bug: 1247234
Happens for me on 

http://web.maxirem.fr/

> 1st button "Cotisations sociales" > calculate

On the right side, next the table, are missing inputs

Works fully, for once, on IE and Edge ;)
Duplicate of this bug: 1270866
Guys, any update here? I have reports that users hit this on gmail when copy/paste to/from html mail. IE, Chrome and Chromium and Safari are reported as working, don't let people to switch from Firefox because of this bug.
AFAICT, the HTML parser part of this bug has been fixed for a while.
Component: HTML: Parser → Layout
I guess we may need to do the depth detection in the frame constructor rather than during reflow, so that we can flattern the tree when it's getting too deep.
What are the chances of getting our layout code to a state where:
 1) Layout doesn't crash with deep trees.
 2) When the tree is too deep, have layout code flatten it so that text isn't lost even if CSS boxes are lost.
?

The tree depth limiting code in the parser tries to avoid dropping text on the floor, but in some cases, it's not safe to flatten the DOM arbirarily, and bug 1188731 shows a case where the full-system (parser plus layout) result is text not rendering anyway.

The HTML parser continues to be an inappropriate place to work around this bug. I'd much prefer fixes/workarounds to live in layout instead of complicating the hacks in the HTML parser to make HTML emails show their full text to users.
Flags: needinfo?(xidorn+moz)
Duplicate of this bug: 1397204
I'm going to try to approximate the Blink behavior per dev-platform.
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Flags: needinfo?(xidorn+moz)
Attachment #156544 - Attachment is obsolete: true
Depends on: 1400811
Filed bug 1400811 about Android on Aarch64, which we aren't shipping yet, so we shouldn't block on it.

The remaining problem is that I've calibrated the Android limit on Android 6 (i.e. using ART), but we support (run in CI) Android releases that use Dalvik, and there seems to be a lower stack limit in the Dalvik case.
Could someone with access to an (ARM-based; not x86) Jelly Bean Android device (preferably 4.3 but 4.2 or 4.1 should be OK, too), please run the following test? (I don't have access to a Jelly Bean device.)

1) Install a try build from https://queue.taskcluster.net/v1/task/eXRsNHo1R6ukfYquC_sRyw/runs/0/artifacts/public/build/target.apk (the whole try run is visible at https://treeherder.mozilla.org/#/jobs?repo=try&revision=8d85675d60f1&selectedJob=131946525 )

2) In the try build, load https://hsivonen.com/test/moz/deeptree/reftest/256180-4.html and see if it crashes.

3) In the try build, load https://hsivonen.com/test/moz/deeptree/reftest/256180-5.html and see if it crashes.

4) If either test page crashed, load https://hsivonen.com/test/moz/deeptree/dom/div.html and see what number the counter reaches before that page, too, crashes.
(In reply to Henri Sivonen (:hsivonen) from comment #70)
> Could someone with access to an (ARM-based; not x86) Jelly Bean Android
> device (preferably 4.3 but 4.2 or 4.1 should be OK, too), please run the
> following test? (I don't have access to a Jelly Bean device.)

From off-bug correspondence:

> 2) In the try build, load
> https://hsivonen.com/test/moz/deeptree/reftest/256180-4.html and see if it
> crashes.

Crashes.

> 3) In the try build, load
> https://hsivonen.com/test/moz/deeptree/reftest/256180-5.html and see if it
> crashes.

Doesn't crash.

> 4) If either test page crashed, load
> https://hsivonen.com/test/moz/deeptree/dom/div.html and see what number the
> counter reaches before that page, too, crashes.

588.
Attachment #8907991 - Flags: review?(mh+mozilla)
Attachment #8907992 - Flags: review?(bzbarsky)
Attachment #8907067 - Flags: review?(bugs)
Comment on attachment 8907067 [details]
Bug 256180 parser part - Insert elements as siblings instead of children at the Blink-defined magic depth for compatibility.

https://reviewboard.mozilla.org/r/178790/#review187670

I think I need to read this again, but could you update those small things.

::: layout/reftests/bugs/256180-5-ref.html:1
(Diff revision 13)
> +<!DOCTYPE html><meta charset=utf-8><title>&lt;font&gt; with alternating attributes and a character between each tag 4000</title>

Shouldn't we have also some tests for innerHTML

::: parser/html/javasrc/TreeBuilder.java:4715
(Diff revision 13)
>                  }
>                  removeFromListOfActiveFormattingElements(formattingEltListPos);
>                  return true;
>              }
>              StackNode<T> commonAncestor = stack[formattingEltStackPos - 1]; // weak ref
> +            T insertionCommonAncestor = depthLimited(formattingEltStackPos - 1); // weak ref

Why -1 here?

::: parser/html/javasrc/TreeBuilder.java:6291
(Diff revision 13)
>              }
>          }
>          return -1;
>      }
>  
> +    private T depthLimited(int stackPos) throws SAXException {

This needs a bit better documentation. Like say what Blink's magic number is and what this method is used for etc.

::: parser/html/javasrc/TreeBuilder.java:6291
(Diff revision 13)
>              }
>          }
>          return -1;
>      }
>  
> +    private T depthLimited(int stackPos) throws SAXException {

I think the method name is quite vague. The name doesn't hint that it returns some object. It sounds more like some method returning boolean.

Could this be something like parentFromStack or depthLimitedParentFromStack or some such.
Attachment #8907067 - Flags: review?(bugs) → review-
Comment on attachment 8907067 [details]
Bug 256180 parser part - Insert elements as siblings instead of children at the Blink-defined magic depth for compatibility.

https://reviewboard.mozilla.org/r/178790/#review187670

> Why -1 here?

Because the previous line has -1 and this line tries to match it with the twist that the Blink compatibility threshold is applied. As for why the previous line has -1: Step 12 in the spec says "Let common ancestor be the element immediately above formatting element in the stack of open elements."
Comment on attachment 8907067 [details]
Bug 256180 parser part - Insert elements as siblings instead of children at the Blink-defined magic depth for compatibility.

https://reviewboard.mozilla.org/r/178790/#review187670

> Shouldn't we have also some tests for innerHTML

Added a test that uses `innerHTML`.

> This needs a bit better documentation. Like say what Blink's magic number is and what this method is used for etc.

Added documentation.

> I think the method name is quite vague. The name doesn't hint that it returns some object. It sounds more like some method returning boolean.
> 
> Could this be something like parentFromStack or depthLimitedParentFromStack or some such.

Renamed to `nodeFromStackWithBlinkCompat`.
Renamed a method not to rely on overloading:
https://reviewboard.mozilla.org/r/178790/diff/14-15/
Duplicate of this bug: 1188731
Comment on attachment 8907067 [details]
Bug 256180 parser part - Insert elements as siblings instead of children at the Blink-defined magic depth for compatibility.

https://reviewboard.mozilla.org/r/178790/#review187888

::: parser/html/javasrc/TreeBuilder.java:4715
(Diff revision 15)
>                  }
>                  removeFromListOfActiveFormattingElements(formattingEltListPos);
>                  return true;
>              }
>              StackNode<T> commonAncestor = stack[formattingEltStackPos - 1]; // weak ref
> +            T insertionCommonAncestor = nodeFromStackWithBlinkCompat(formattingEltStackPos - 1); // weak ref

This really needs some comment that why we use commonAncestor for some things, but insertionCommonAncestor for others.

::: parser/html/javasrc/TreeBuilder.java:4785
(Diff revision 15)
>                  // } XXX AAA CHANGE
>                  detachFromParent(lastNode.node);
> -                appendElement(lastNode.node, node.node);
> +                appendElement(lastNode.node, nodeFromStackWithBlinkCompat(nodePos));
>                  lastNode = node;
>              }
>              if (commonAncestor.isFosterParenting()) {

Like here, why we use commonAncestor. Please add some comments.
Attachment #8907067 - Flags: review?(bugs) → review+
Comment on attachment 8907992 [details]
Bug 256180 layout part - Increase MAX_REFLOW_DEPTH to reduce the probability of content going silently missing.

https://reviewboard.mozilla.org/r/179666/#review187966

I think this is ok, but we should try to keep an eye out for stack overflow crashes, assuming we can detect them on crash-stats....

::: layout/generic/nsIFrame.h:21
(Diff revision 12)
> -#define MAX_REFLOW_DEPTH 200
> +#if (defined(XP_WIN) && !defined(HAVE_64BIT_BUILD)) || defined(ANDROID)
> +// Blink's magic depth limit from its HTML parser plus a little. Involves only
> +// a small increase to the default runtime stack size on 32-bit Windows with
> +// display: block and fits within the default stack available on 32-bit Android
> +// on ARM.
> +#define MAX_REFLOW_DEPTH 590

I'm not sure I follow the "Involved only a small increase" bit.  Does that mean that it fits in whatever stack size we have right now?  Or that it doesn't fit but would fit if we increased the stack a bit?   I guess the latter, given the other patches on this bug.

In general, it would be good to have the actual data, whatever it is, in the commit message or code comments.  And maybe the testcases used in the commit message...

::: layout/reftests/bugs/reftest.list:41
(Diff revision 12)
>  == 18217-zorder-4.html 18217-zorder-ref-inline-table.html
>  == 18217-zorder-5.html 18217-zorder-ref-inline-table.html
>  == 23604-1.html 23604-1-ref.html
>  == 23604-2.html 23604-2-ref.html
>  != 24998-1.html 24998-1-ref.html
> +skip-if(isDebugBuild||Android||AddressSanitizer||(winWidget&&(!is64Bit))) == 256180-1.html 256180-1-ref.html

There should be comments here explaining why the skip-if is present.
Attachment #8907992 - Flags: review?(bzbarsky) → review+
Comment on attachment 8907992 [details]
Bug 256180 layout part - Increase MAX_REFLOW_DEPTH to reduce the probability of content going silently missing.

https://reviewboard.mozilla.org/r/179666/#review187966

> I'm not sure I follow the "Involved only a small increase" bit.  Does that mean that it fits in whatever stack size we have right now?  Or that it doesn't fit but would fit if we increased the stack a bit?   I guess the latter, given the other patches on this bug.
> 
> In general, it would be good to have the actual data, whatever it is, in the commit message or code comments.  And maybe the testcases used in the commit message...

Rewrote the comment with a lot more detail.

> There should be comments here explaining why the skip-if is present.

Added comments.
Comment on attachment 8907992 [details]
Bug 256180 layout part - Increase MAX_REFLOW_DEPTH to reduce the probability of content going silently missing.

https://reviewboard.mozilla.org/r/179666/#review187966

> Rewrote the comment with a lot more detail.

Oh, and I also edited in the actual Dalvik-compatible magic limit this time!
Comment on attachment 8907067 [details]
Bug 256180 parser part - Insert elements as siblings instead of children at the Blink-defined magic depth for compatibility.

https://reviewboard.mozilla.org/r/178790/#review187888

> This really needs some comment that why we use commonAncestor for some things, but insertionCommonAncestor for others.

Added comment.

> Like here, why we use commonAncestor. Please add some comments.

Added comment why. Not sure if it's the right call, but at least it's much simpler than doing something else.
Comment on attachment 8907991 [details]
Bug 256180 build config part - Increase the max size for the runtime stack on Windows.

https://reviewboard.mozilla.org/r/179664/#review189074
Attachment #8907991 - Flags: review?(mh+mozilla) → review+
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1ae1bd9b2b5c
build config part - Increase the max size for the runtime stack on Windows. r=glandium
https://hg.mozilla.org/integration/autoland/rev/6a527c7a89a0
layout part - Increase MAX_REFLOW_DEPTH to reduce the probability of content going silently missing. r=bz
https://hg.mozilla.org/integration/autoland/rev/0ce8d073a16e
parser part - Insert elements as siblings instead of children at the Blink-defined magic depth for compatibility. r=smaug
Backed out for crashtest failures on Windows 7 debug, e.g. in layout/base/crashtests/507119.html, and on suspicion of causing crashtest failures on Android:

https://hg.mozilla.org/integration/autoland/rev/30ddb433b63d21e548321b014cf683dbc0298290
https://hg.mozilla.org/integration/autoland/rev/5adec9b302150ba3f6dcb2d534f88ba59aa222e7
https://hg.mozilla.org/integration/autoland/rev/5fce48b39d5838eb91e2a0441b44bf256d956097

Push with Windows failure: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=0ce8d073a16ee3275d3bbc93741ec9a40e7f1261&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=133774697&repo=autoland
Very deep stack full of reflows.
TEST-UNEXPECTED-FAIL | file:///Z:/task_1506587758/build/tests/reftest/tests/layout/base/crashtests/507119.html | application terminated with exit code 1
REFTEST PROCESS-CRASH | file:///Z:/task_1506587758/build/tests/reftest/tests/layout/base/crashtests/507119.html | application crashed [@ nsBlockFrame::ReflowBlockFrame(mozilla::BlockReflowInput &,nsLineList_iterator,bool *)]
Flags: needinfo?(hsivonen)
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #101)
> REFTEST PROCESS-CRASH |
> file:///Z:/task_1506587758/build/tests/reftest/tests/layout/base/crashtests/
> 507119.html | application crashed [@
> nsBlockFrame::ReflowBlockFrame(mozilla::BlockReflowInput
> &,nsLineList_iterator,bool *)]

Curiously, this doesn't reproduce with the win32 pgo autoland build when using https://hsivonen.com/test/moz/deeptree/dom/507119.html
Flags: needinfo?(hsivonen)
(The ReviewBoard interdiff tool is very confused about the rebase.)
Duplicate of this bug: 1444555
Duplicate of this bug: 1481584
Duplicate of this bug: 512605
Attached patch Windows stack, rebased (obsolete) — Splinter Review
Attachment #8907991 - Attachment is obsolete: true
Attachment #9031212 - Flags: review+
Attachment #8915028 - Attachment is obsolete: true
Attachment #8907992 - Attachment is obsolete: true
Attachment #9031214 - Flags: review+
Attachment #8907067 - Attachment is obsolete: true
Attachment #9031216 - Flags: review+
Looks like Windows crashes again, so the Windows stack needs to be readjusted.
>    GeckoThread() {
>        setName("Gecko");	
>        // Request more (virtual) stack space to avoid overflows in the CSS frame
>        // constructor. 8 MB matches desktop.
>	 super(null, null, "Gecko", 8 * 1024 * 1024);

According to bug 1400811 comment 2, that's the right place to increase the stack size. Yet, Android 4.3 API16+ debug in CI still crashes.

Is this not the right place to increase the stack size for the thread that runs Gecko layout for Fennec after all? (The file is in a geckoview directory.) Or is Android 4.3 (Dalvik, not ART) just ignoring the stack size request?
Flags: needinfo?(snorp)
(In reply to Henri Sivonen (:hsivonen) from comment #130)
> Looks like Windows crashes again, so the Windows stack needs to be
> readjusted.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=002336a77c29c576833afb8c6b0c45e5c0a790fe
(In reply to Henri Sivonen (:hsivonen) from comment #131)
> >    GeckoThread() {
> >        setName("Gecko");	
> >        // Request more (virtual) stack space to avoid overflows in the CSS frame
> >        // constructor. 8 MB matches desktop.
> >	 super(null, null, "Gecko", 8 * 1024 * 1024);
> 
> According to bug 1400811 comment 2, that's the right place to increase the
> stack size. Yet, Android 4.3 API16+ debug in CI still crashes.
> 
> Is this not the right place to increase the stack size for the thread that
> runs Gecko layout for Fennec after all? (The file is in a geckoview
> directory.) Or is Android 4.3 (Dalvik, not ART) just ignoring the stack size
> request?

It looks like Android 7.0 x86 opt is fine, which strongly suggests that this bug is just unfixable on Dalvik.
Attachment #9031213 - Flags: review?(snorp)
This patch is the same as the previously r+ed one, except the stack size for 32-bit Windows is slightly larger.
Attachment #9031212 - Attachment is obsolete: true
Attachment #9032390 - Flags: review?(mh+mozilla)
This patch turns off some unit tests for Dalvik. Given that the patches is this series work on ART, I suspect Dalvik simply ignores the requested stack size. 

This leaves 3 options:

 1) Not fixing this bug for anyone. This is basically what backing the patches out in September 2017 meant. I think this solution isn't acceptable, considering that despite the bug being 15 years old, we continue to get duplicates, including in the past 6 months.

 2) Letting Dalvik crash with unlucky inputs. This is the approach this patch set currently takes.

 3) Making the recursion limit in layout dynamic depending on the Android API level determined at run time.
Attachment #9032391 - Flags: review?(snorp)
(In reply to Henri Sivonen (:hsivonen) from comment #139)
> 
>  2) Letting Dalvik crash with unlucky inputs. This is the approach this
> patch set currently takes.

Assuming this isn't a security concern, I think I'm fine with this.
Flags: needinfo?(snorp)
Comment on attachment 9032390 [details] [diff] [review]
Windows stack, adjusted number for 32-bit case

Review of attachment 9032390 [details] [diff] [review]:
-----------------------------------------------------------------

::: config/config.mk
@@ +308,5 @@
> +# to deal with frame construction for unreasonably deep DOM trees
> +# with worst-case styling. This uses address space unnecessarily for
> +# non-main threads, but that should be tolerable on 64-bit systems.
> +# (Bug 256180)
> +WIN32_EXE_LDFLAGS      += -STACK:8388608

On Windows, committed memory (which stack is) consumes more than virtual address space: it consumes virtual memory on the system (the sum of the pagefile and physical memory) that other processes won't be able to use. I'm not sure this is really wise.
Attachment #9032390 - Flags: review?(mh+mozilla)
Comment on attachment 9032390 [details] [diff] [review]
Windows stack, adjusted number for 32-bit case

Review of attachment 9032390 [details] [diff] [review]:
-----------------------------------------------------------------

::: config/config.mk
@@ +308,5 @@
> +# to deal with frame construction for unreasonably deep DOM trees
> +# with worst-case styling. This uses address space unnecessarily for
> +# non-main threads, but that should be tolerable on 64-bit systems.
> +# (Bug 256180)
> +WIN32_EXE_LDFLAGS      += -STACK:8388608

I thought we didn't eagerly commit thread stacks? https://github.com/rust-lang/rust/pull/52847 for the rust thread-pool, but we use STACK_SIZE_PARAM_IS_A_RESERVATION elsewhere too.

(Though maybe this flag implies also eagerly committing it? Not sure)
Not all places that create threads use STACK_SIZE_PARAM_IS_A_RESERVATION, and while we can possible change our code for it to be used, we can't change third party libraries (think drivers, system libraries, etc.) that create threads without STACK_SIZE_PARAM_IS_A_RESERVATION. We /could/ hook all the relevant functions (at least CreateThread and _beginthreadex, are there others?), but in any case, it would seem like that should happen before considering changing the default stack size.

Another option would be to use a different thread as the main thread, making it possible to alter its stack size without affecting the default.
(In reply to Mike Hommey [:glandium] from comment #141)
> On Windows, committed memory (which stack is) consumes more than virtual
> address space: it consumes virtual memory on the system (the sum of the
> pagefile and physical memory) that other processes won't be able to use. I'm
> not sure this is really wise.

Considering that you previously r+ed the same patch with a slightly smaller magic number and this bug has already gone through a bitrot cycle once and is an ongoing Web compat problem, could we land this as-is and leave optimizing per-thread stack sizes on Windows as a follow-up?
Flags: needinfo?(mh+mozilla)

Please file a followup.

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #145)

Please file a followup.

Thanks for the r+. The follow-up is already on file: bug 1257522.

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2d69841d2eb7
build config part - Increase the max size for the runtime stack on Windows. r=glandium.
https://hg.mozilla.org/integration/mozilla-inbound/rev/b696df615c8b
mobile part - Increase the max size for the runtime stack of the Gecko main thread on Android. r=snorp.
https://hg.mozilla.org/integration/mozilla-inbound/rev/bc2e0a89d88e
layout part - Increase MAX_REFLOW_DEPTH to reduce the probability of content going silently missing. r=bzbarsky.
https://hg.mozilla.org/integration/mozilla-inbound/rev/125ebcfac58d
parser part - Insert elements as siblings instead of children at the Blink-defined magic depth for compatibility. r=smaug.
https://hg.mozilla.org/integration/mozilla-inbound/rev/e85e41f84971
Dalvik part - Disable some deep-tree tests on Android, because running the tests on Dalvik is not feasible. r=snorp.

(In reply to Andreea Pavel [:apavel] from comment #149)

Backed out for failing win xpcshell at xpcshell.ini:toolkit/mozapps/extensions/test/xpcshell/test_temporary.js

Treeherder on try showed the title of bug 1469904, which says "perma-failing", so I thought it was ignorable orange. :-(

(In reply to Henri Sivonen (:hsivonen) from comment #150)

(In reply to Andreea Pavel [:apavel] from comment #149)

Backed out for failing win xpcshell at xpcshell.ini:toolkit/mozapps/extensions/test/xpcshell/test_temporary.js

Treeherder on try showed the title of bug 1469904, which says "perma-failing", so I thought it was ignorable orange. :-(

On surface, test_temporary.js looks completely unrelated to these patches.

The test only ever fails on Windows. aswan, blame shows you've added Windows special-casing to the test. Do you have any idea what the test might be doing that could be affected by a larger-than-previous run-time stack size? Could it be using accidental recursion and relying on the SpiderMonkey stack size-based recursion limit taking effect before the test times out?

Flags: needinfo?(hsivonen) → needinfo?(aswan)

Just trying things: Making task creation conditional on Windowsness of the platform instead of making the task return early based on the Windowsness of the platform:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fd97684651e546ffb0e5e11086c38039c86030e

(In reply to Henri Sivonen (:hsivonen) from comment #154)

Just trying things: Making task creation conditional on Windowsness of the platform instead of making the task return early based on the Windowsness of the platform:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fd97684651e546ffb0e5e11086c38039c86030e

This did not help. :-(

(In reply to Henri Sivonen (:hsivonen) from comment #152)

The test only ever fails on Windows. aswan, blame shows you've added Windows special-casing to the test. Do you have any idea what the test might be doing that could be affected by a larger-than-previous run-time stack size? Could it be using accidental recursion and relying on the SpiderMonkey stack size-based recursion limit taking effect before the test times out?

Nothing like that is happening on purpose. My past run-ins with this test have had to do with Windows handling file locking different than mac/unix, that seems totally unrelated to the changes in this bug. Sampling a few other runs on treeherder, it looks like this test is typically taking ~4 minutes on Windows debug which is uncomfortably close to the 5 minute cap. Could your patches have slowed down debug builds a bit more and pushed this test over 5 minutes?

In any case, since we're flirting with 5 minute run times, this test really needs to be broken up into multiple smaller tests. I'd have no objection to you disabling it on Windows debug for now if its blocking something else.

Flags: needinfo?(aswan)

(In reply to Andrew Swan [:aswan] from comment #156)

I'd have no objection to you disabling it on Windows debug for now if its blocking something else.

Thanks. It fails on 64-bit Windows opt builds, too, so here's a try run disabling it on Windows generally:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=13a3e0fca5c6a434c78b94ac7c4d511e77a7cb9f

(In reply to Henri Sivonen (:hsivonen) from comment #150)

(In reply to Andreea Pavel [:apavel] from comment #149)

Backed out for failing win xpcshell at xpcshell.ini:toolkit/mozapps/extensions/test/xpcshell/test_temporary.js

Treeherder on try showed the title of bug 1469904, which says "perma-failing", so I thought it was ignorable orange. :-(

Hi Henri. Yes, that bug is perma-failing, but it's tier2. the failures from the push you landed were tier1, reason why I backed out.

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/95d8f77451e2
build config part - Increase the max size for the runtime stack on Windows. r=glandium.
https://hg.mozilla.org/integration/mozilla-inbound/rev/0fab42583213
mobile part - Increase the max size for the runtime stack of the Gecko main thread on Android. r=snorp.
https://hg.mozilla.org/integration/mozilla-inbound/rev/feb776af8fd3
layout part - Increase MAX_REFLOW_DEPTH to reduce the probability of content going silently missing. r=bzbarsky.
https://hg.mozilla.org/integration/mozilla-inbound/rev/519e21226224
parser part - Insert elements as siblings instead of children at the Blink-defined magic depth for compatibility. r=smaug.
https://hg.mozilla.org/integration/mozilla-inbound/rev/e4550e8b5a79
Dalvik part - Disable some deep-tree tests on Android, because running the tests on Dalvik is not feasible. r=snorp.
Whiteboard: [webcompat]

This looks to be crashing on win10/aarch64. See bug 1530033.

Depends on: 1537609
You need to log in before you can comment on or make changes to this bug.