Closed Bug 348901 Opened 18 years ago Closed 17 years ago

Extra whitespace from HTML source in accessible text

Categories

(Core :: Disability Access APIs, defect)

defect
Not set
major

Tracking

()

RESOLVED FIXED

People

(Reporter: parente, Assigned: aaronlev)

References

(Blocks 2 open bugs)

Details

(Keywords: access)

Attachments

(2 files, 10 obsolete files)

224 bytes, text/html
Details
50.46 KB, patch
surkov
: review+
dbaron
: approval1.9+
Details | Diff | Splinter Review
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20060816 Minefield/3.0a1

1) Run firefox.
2) Navigate to http://www.mozilla.org/access/unix/new-atk.html
3) Use at-poke.
4) Look at the third from the bottom accessible in the document (i.e. the table cell in the first column of the last row of the big table on the page).
5) Notice the leading whitespace in the static text that is not visible on the page.

The whitespace appears to be coming from the raw HTML source.
Blocks: newatk
Keywords: access
OS: Linux → All
Hardware: PC → All
Summary: Extra whitespace in accessible text → Extra whitespace from HTML source in accessible text
This is going to make things more quite a bit more complicated in nsHyperTextAccessible.cpp
Attached file test case
i think first we should have a list showing all scenerio which leading whitespace is omitted.
Blocks: keya11y
Severity: normal → major
We probably want to expose the transformed text, whatever it is. For example, CSS rules can make test that is exposed as all caps or lowercase, even though the original content is not. No doubt the text transformer has all sorts of interesting smarts.
After the new text frame rewrite in bug 333659, roc will add a method for us to get the transformed text directly. That will include giving us all caps if CSS text transform to uppercase is used, and giving us the rendered whitespace instead of source whitespace.

So we need to wait until that lands and ask for a public method to get that.
Depends on: 333659
*** Bug 356369 has been marked as a duplicate of this bug. ***
Blocks: texta11y
No longer blocks: newatk, keya11y
Blocks: lsr
Blocks: orca
I'm curious if anyone has any suggestions for how an assistive technology can identify the difference between extra whitespace that supposed to be shown versus whitespace that isn't supposed to be shown.
Sorry, I pressed "Commit" too soon - what I meant to type was "how an assistive technology can determine if whitespace is erroneously exposed by Gecko."
Will, I worked on this some today. It's dependent on some core layout changes which will allow us to query for the actual rendered text, as well as turn a DOM offset into a rendered offset into that text.

It's in the works. As soon as bug 375760 is checked in, I can start trying different things.

If you need a quick fix now, I suggest mapping multiple spaces to one space, but this will not help you with <pre> etc. And of course you can't do it in entry fields. If we for some reason have to get this done in the next few weeks, we'll have to expose an object attribute to tell you whether to compress multiple spaces to 1 or not.
(In reply to comment #9)
> Will, I worked on this some today. It's dependent on some core layout changes
> which will allow us to query for the actual rendered text, as well as turn a
> DOM offset into a rendered offset into that text.

Thanks Aaron!  The main impact this has is on our braille users.  If we can detect that we're getting bad whitespace, we can compress it, but we'd really like to also present good whitespace.  :-)  BTW, things are starting to look a lot better these days.  The fix for bug 363214 is especially useful.
We need to call nsTextFrame::EnsureTextRun and get the returned gfxSkipCharsIterator and the mTextRun. This means new a API to call that externally from layout.

Once we have those pieces of info we can use:
gfxTextRun::GetText8Bit/GetTextUnicode -- get the rendered text without extra whitespace.
gfxSkipCharsIterator::ConvertOriginalToSkipped -- convert content offset into trimmed text run offset 
gfxSkipCharsIterator::ConvertSkippedToOriginal -- convert content offset into trimmed text run offset 

No, you should add a textframe method that returns the string you want. You should probably implement this in a new nsITextFrame interface implemented as a tear-off so it doesn't bloat textframes.
Unfortunately this is a bit inconistent with nsIFrame in that callers will leak unless they use an nsCOMPtr<nsITextFrame> or explictly release the nsITextFrame. They might not realize that because most places no longer do that with nsIFrame -- in fact Release() doesn't do anything for it.

I guess if the frame is destroyed but nsITextFrame still exists, that is also a problem. But, I'm not sure how to prevent that or if it would really happen.

This old comment in nsIFrame would worry me if it wasn't Hyatt's from 2001. Should it be removed?
/* nsIFrame is in the process of being deCOMtaminated, i.e., this file is eventually
   going to be eliminated, and all callers will use nsFrame instead.  At the moment
   we're midway through this process, so you will see inlined functions and member
   variables in this file.  -dwh */


Alright, how about creating nsTextFrameBase, making it a subclass of nsFrame, make nsTextFrameThebes subclass that, put your APIs in there.
Hmm, I could use nsTextFrameBase too :-)
The new text frame is turned on in trunk so we can work on this again.
nsTextFrameBase seems like a bad idea, because either:
1) We cannot use directly as an interface in mozilla/accessible because it depends on nsFrame
or
2) We need to add a new interface that it can support, in which case we don't need the extra class at all, and I think we're adding 4 bytes per text frame instance because of multiple inheritance.

Robert, what am I missing?
Attachment #270568 - Flags: review?(roc)
Attachment #270568 - Attachment is obsolete: true
Attachment #270569 - Flags: review?(roc)
Attachment #270568 - Flags: review?(roc)
Attached patch Put on nsIFrame for now (obsolete) — Splinter Review
Per our email, putting the new methods directly in nsIFrame.
Attachment #264502 - Attachment is obsolete: true
Attachment #270569 - Attachment is obsolete: true
Attachment #270732 - Flags: superreview?(roc)
Attachment #270732 - Flags: review?(roc)
Attachment #270569 - Flags: review?(roc)
+  virtual PRUint32 GetRenderedTextLength() { return 0; }
+
+  /**
+   * Convert a text content (original) offset to a skipped (rendered text) offset
+   */
+  virtual PRUint32 OriginalToSkippedOffset(PRInt32 aOriginalOffset) { return aOriginalOffset; }
+
+  /**
+   * Convert a skipped (rendered text) offset to a text content (original) offset
+   */
+  virtual PRInt32 SkippedToOriginalOffset(PRUint32 aSkippedOffset) { return aSkippedOffset; }

Why do you need these three methods? The first two are trivial to implement with AppendRenderedTextTo. The last one can be implemented with it using binary search.
> The first two are trivial to implement with AppendRenderedTextTo. 
I'm trying to avoid the string copies. There will be a lot more of them if I use AppendRenderedTextTo() to implement.

> The last one can be implemented with it using binary search.
That sounds extremely slow and overcomplicated, compared with adding a method here. I can buy maybe removing GetRenderedTextLength() and OriginalToSkippedOffset(), but it seems a little extreme to do remove this method by implementing it using a binary search.
Is there any reason to believe this code is performance sensitive? How many times per second can it be called?
> Is there any reason to believe this code is performance sensitive?
It depends on the particular AT that's being used. Assistive technologies often walk the entire document or selection. At the moment, Firefox's superior performance with Windows screen readers is one of the reasons blind users switch to it, and I'd like it to stay that way :)

> How many times per second can it be called?
Again, it depends on the particular AT.

What do we gain by whittling down the nsIFrame methods down to just 1? Is it worth the extra work? I don't feel like a11y really asks the main codebase for all that much, but I can work to try and remove 1 or 2 methods if you require it.
One more detail on the perf issue.

The problem is that these methods are cumulative. When dealing with text offsets at the end of a large paragraph there can be many nodes etc. in between.
OK ok. How about a single method that returns a string and a gfxSkipCharsIterator representing the mapping between content offsets and string offsets? We'd have to move gfxSkipCharsIterator to gfxSkipChars.h/.cpp but that's OK. That gives you the ultimate interface.

Do you want to do this frame by frame? We could have the method return a string + iterator for the entire text node (you'd then only call it once on the primary frame for the node). That could be faster for you if performance does matter. You should be aware though that that could represent a mix of RTL and LTR text runs which may not be visually contiguous.

If you like that idea I can write the implementation for you.

Keeping the number of methods down does matter. Each one costs 4 bytes of vtable data per frame type, there's >120 frame types, so that's 1.4K of code if we can save three methods. Worth doing if it's not too painful.

BTW your patch doesn't take account of text-transform.
(In reply to comment #26)
> How about a single method that returns a string and a
> gfxSkipCharsIterator representing the mapping between content offsets and
> string offsets? We'd have to move gfxSkipCharsIterator to gfxSkipChars.h/.cpp
> but that's OK. That gives you the ultimate interface.
That sounds good.

> We could have the method return a string
> + iterator for the entire text node (you'd then only call it once on the
> primary frame for the node). 
I do like it, but I'm not sure I want to hold up the patch more for it.

> You should be aware though that that could represent a mix of RTL and
> LTR text runs which may not be visually contiguous.
How can text direction switch without a change of DOM nodes?

> there's >120 frame types, so that's 1.4K of code if
> we can save three methods.
Wow, that's true. I understand.

> BTW your patch doesn't take account of text-transform.
I realized that after I tested it, but didn't want to hold it up for that right now. Is there any easy way to take text-transform into account?
  

> How can text direction switch without a change of DOM nodes?

Embedded characters with a different natural direction. For example, the DOM text

english HEBREW

(where capital letters are Hebrew characters) would be rendered as

english WERBEH

with two text frames, one mapping "english " and one mapping "HEBREW

> Is there any easy way to take text-transform into account?

Not really. I'll do it.
Problem:
If I include gfxSkipCharsIterator inside of nsIFrame, then we need to update tons of Makefile.in's to add thebes to REQUIRES.
Actually a worse problem is linker errors -- I can't use gfxSkipCharsIterator from inside mozilla/accessible.
The idea was to move gfxSkipCharsIterator to gfxSkipChars.h/.cpp. Then you just need to link to thebes from accessible. I assume you can do that...

> If I include gfxSkipCharsIterator inside of nsIFrame, then we need to update
> tons of Makefile.in's to add thebes to REQUIRES.

just take a gfxSkipCharsIterator* to fill in, and declare "class gfxSkipCharsIterator;" at the top, so you don't have to include gfxSkipChars.h from nsIFrame.h.
>  gfxSkipCharsIterator to gfxSkipChars.h/.cpp
It's already there, or am I missing something

I like the pointer to gfxSkipCharsIterator.
Attachment #270732 - Attachment is obsolete: true
Attachment #270732 - Flags: superreview?(roc)
Attachment #270732 - Flags: review?(roc)
Attachment #271039 - Attachment is obsolete: true
Attachment #271045 - Flags: superreview?(roc)
Attachment #271045 - Flags: review?(roc)
Attachment #271039 - Flags: superreview?(roc)
Attachment #271039 - Flags: review?(roc)
How do you want to handle line breaks?

If we trim spaces at the start and end of lines, AND we return the text for an entire text node at once, you'll get things like

<p> Hello there
->
Hello
there
->
"Hellothere"

Should we replace trimmed whitespace at the end of the line with a single \n, perhaps?
Attached patch updated patch (obsolete) — Splinter Review
Try this. I haven't tested it, but it builds. I had to add a gfxSkipChars* parameter because I'm constructing a whole new offset-mapping.
Attachment #271045 - Attachment is obsolete: true
Attachment #271045 - Flags: superreview?(roc)
Attachment #271045 - Flags: review?(roc)
Soft line breaks should come out as a single " ". We only want \n for forced line breaks in white-space: pre or via <br>.
In other words, I only meant to trim extra spaces at the start and end of hard line breaks like block level elements, because it's redundant

If that's too much extra logic for now we can deal with it later.
Haven't tried it yet, but looking at the code, I would expect these impls to change:
+PRUint32 nsHyperTextAccessible::OriginalToSkippedOffset(nsIFrame *aFrame, PRInt32 aOriginalOffset)
+PRInt32 nsHyperTextAccessible::SkippedToOriginalOffset(nsIFrame *aFrame, PRUint32 aSkippedOffset)

They haven't changed, but the amount of text at the start of the string can be less now, because whitespace is trimmed off.
Hi Roc, the GetRendererText() is completely different.

Are you walking through the text continuations like this because of the line trimming? Again, I did not mean that I wanted space trimmed there. Only at the start or end of a DOM node where multiple spaces should have disappeared.

The kind of extra space i was talking about is like this:
1.   <p>    Hello    </p>
2.   <p>Hello</p>

These are rendered the same but my previous code gives me:
1. " Hello "
2. "Hello"

I wanted just "Hello" in both cases.
Sorry for the noise. Now that I'm looking at the code it seems it may be right. But, I ran out of time for testing. Back after next week.
Had a quick moment to run it. It gets stuck in an infinite loop -- keeps coming back line in GetRenderedText() skipChars.KeepChar(); -- maybe stuck on the same char?

Will have to work more on it with you when I return. I can come up with a Javascript testcase to help us work together more easily.
+    TrimmedOffsets offsets = f->GetTrimmedOffsets(frag, PR_TRUE);

Change PR_TRUE to PR_FALSE here to not trim the space from the ends of lines.

You still might get a space at the end of a block. I don't see an easy way to suppress that.

Infinite loop --- yes! Add "iter.AdvanceOriginal(1);" to the end of the while-loop body.
Attached patch Debugged and commented (obsolete) — Splinter Review
One question: we're appending 1 character at a time. Is setting the string size and then using an iterator a lot better?
Attachment #272771 - Flags: superreview?(roc)
+    PRBool isLastInContinuation = !textFrame->GetNextContinuation();
+    TrimmedOffsets trimmedContentOffsets = textFrame->GetTrimmedOffsets(textFrag, isLastInContinuation);

This means that if you have text like "<span>Foo bar baz </span>qux" and there happens to be a line break at the end of the span, the text nodes returned will be "Foo bar baz" and "qux" ... i.e. you lose the space. But if the line break goes after "bar", then you won't lose any spaces. That inconsistency bothers me; I think GetRenderedText should probably not depend on the exact layout. Is it so bad to just possibly have a trailing space at the end of a block?

Appending one character at a time is OK I think. Other approaches might be a little faster but they're not going to be hugely better; we should just go for the simpler code unless we discover this is a critical path for some important scenario.
Note: this also fixes bug 387418.
Attachment #272771 - Attachment is obsolete: true
Attachment #273403 - Flags: superreview?(roc)
Attachment #272771 - Flags: superreview?(roc)
Attachment #273403 - Flags: superreview?(roc)
Attachment #273403 - Flags: superreview+
Attachment #273403 - Flags: review+
Attachment #273403 - Flags: review?(surkov.alexander)
you forgot to update uuid of nsPIAccessible interface.
Attachment #273403 - Attachment is obsolete: true
Attachment #273748 - Flags: review?(surkov.alexander)
Attachment #273403 - Flags: review?(surkov.alexander)
Probably 'skiped' is much more for layout but for accessibility is it better 'rendered'?
(In reply to comment #50)
> Probably 'skiped' is much more for layout but for accessibility is it better
> 'rendered'?
> 

I mean probably OriginalToSkippedOffset -> ContentToRenderedOffset.
Also since we have two concepts of offsets and texts that are mixed through accessibility module then every our internal method should mention what type of concept it uses. For example.

AppendTextTo(string, offset, length) - appends rendered text to a string, offset - rendered offset.

and

nsIntRect nsHyperTextAccessible::GetBoundsForString(nsIFrame *aFrame, PRInt32 aStartOffset, PRInt32 aEndOffset) - aStartOffset and aEndOffset are content offsets.

Though it adds some complexity to code. But I assume we can't avoid this and we guarantee only that public interface methods works with rendered offsets/text. Right?
Also if our internal methods will use two concepts of offsets (i.e if it's impossible to work with one concept in method arguments) then method shouldn't have 'offset' word, it should have either 'renderedOffset' or 'contentOffset'.
Sometime in code you work with gfxSkipChars, gfxSkipCharsIterator, GetRenderedText() (I look at nsIFrame* nsHyperTextAccessible::GetPosA), can we put their usage to separate methods to hide unusual structures from every day accessibility code like you did in methods SkippedToOriginalOffset/OriginalToSkippedOffset?
Comment on attachment 273748 [details] [diff] [review]
Update uuid of nsPIAccessible interface.


>   /**
>+   * Append the rendered text to the passed-in string.
>+   * The appended text will often not contain all the whitespace from source,
>+   * depending on whether the CSS rule white-space: pre is active for this fram e.

nit, please quote "white-space: pre" for easier reading.

>+static PRUnichar TransformChar(const nsStyleText* aStyle, gfxTextRun* aTextRun,
>+                               PRUint32 aSkippedOffset, PRUnichar aChar)

>+  case NS_STYLE_TEXT_TRANSFORM_CAPITALIZE:
>+    if (aTextRun->CanBreakLineBefore(aSkippedOffset)) {
>+      nsContentUtils::GetCaseConv()->ToTitle(aChar, &aChar);
>+    }
>+    break;
>+  default:
>+    break;
>+  }

nit, you don't need default statement

I didn't dig deeper in the patch yet ;)
> public interface methods works with rendered offsets/text. Right?
Right.

(In reply to comment #54)
> Sometime in code you work with gfxSkipChars, gfxSkipCharsIterator,
> GetRenderedText() (I look at nsIFrame* nsHyperTextAccessible::GetPosA), can we
> put their usage to separate methods to hide unusual structures from every day
> accessibility code like you did in methods
> SkippedToOriginalOffset/OriginalToSkippedOffset?
In that case it's an optimization to avoid calling layout to calculate those structures twice.

As far as changing the name of all the offsets to be rendered/content:

All offsets into the hypertext accessible are rendered offsets.
Should we use the name hypertextOffset instead of renderedOffset?

Otherwise, in the case of DOMPointToOffset we would get
DOMPointToRenderedOffset(), but somehow it's not clear to me from that it's an offset into the hypertext itself. Does that make sense?

If you want to find an easy way to fix it for Firefox 3 that's fine. We can always file a follow-up bug for after Firefox 3.
I have also seen a related problem for the IAccessible.description field in ARIA widgets.  I use to see the problem for the button on http://www.mozilla.org/access/dhtml/button but the problem is no longer there.  Has this been fixed or is it a transient problem?
Comment 58 was for a different bug, my mistake.
Scott, I don't see the extra whitespace in description even in Firefox 2, unless it's via describedby. In any case, all of those codepaths will now go through the rendered text method instead, so description should be fine.
Attachment #271166 - Attachment is obsolete: true
Attachment #273748 - Attachment is obsolete: true
Attachment #274443 - Flags: review?(surkov.alexander)
Attachment #273748 - Flags: review?(surkov.alexander)
Comment on attachment 274443 [details] [diff] [review]
Change some offset names in methods to make it clear what kinds of offsets they are

r=me
Attachment #274443 - Flags: review?(surkov.alexander) → review+
Without this Braille display support of Firefox content is very difficult. Patch is ready to go (has r+sr=roc, r=surkov)
Flags: blocking1.9?
Attachment #274443 - Flags: approval1.9?
Attachment #274443 - Flags: superreview?(roc)
Comment on attachment 274443 [details] [diff] [review]
Change some offset names in methods to make it clear what kinds of offsets they are

a1.9=dbaron.  But what's the point of passing a null aAppendToString?
Attachment #274443 - Flags: approval1.9? → approval1.9+
Pavlov, the changes I made for surkov after roc's sr= were only changes to names and comments, and an added UUID. Also, I didn't make any more changes to the layout bits. Are you sure I need the extra sr=?

David, passing the null string avoids extra string copies.
Attachment #274443 - Flags: superreview?(roc)
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Flags: blocking1.9?
Building Camino Trunk is failing with --enable-accessibility (built at approx. 05:00GMT). Removing that line from .mozconfig results in a successful build. I was pointed here by phiw13, I posted my error at pastebin:

http://pastebin.mozilla.org/181393
(In reply to comment #67)
> Building Camino Trunk is failing with --enable-accessibility (built at approx.
> 05:00GMT). Removing that line from .mozconfig results in a successful build. I
> was pointed here by phiw13, I posted my error at pastebin:
> 
> http://pastebin.mozilla.org/181393
> 

This is due to bug 390280, not this one.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: