Closed Bug 196175 Opened 21 years ago Closed 6 years ago

underscore should be part of word, not punctuation

Categories

(Core :: DOM: Selection, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox59 --- fixed

People

(Reporter: vdvo, Assigned: jfkthame)

References

(Blocks 1 open bug)

Details

(Keywords: topembed+, Whiteboard: EDITORBASE+)

Attachments

(4 files, 1 obsolete file)

This may not be universal opinion, but I think that the underscore (the '_'
character) should be considered part of a word, not punctuation, for purposes of
word operations (Ctrl+Left/Right, Ctrl+Shift+Left/Right, Ctrl+Delete,
Ctrl+Backspace, double-click... anything else?). E.g., variable_name should
always be considered one word, regardless of
layout.word_select.stop_at_punctuation. [The fact that variableName may be
considered a nicer name by many people is irrelevant for this. ;-)]

I'm satisfied with all the other punctuation characters, except for '_'.
We should be following platform conventions here, not just making stuff up.  If
the native platform behavior is to stop at '_', we should stop at it.
My first reaction was agreement: I'd expect underscore to be part of the word.
Then I tried a few programs not related to mozilla:
Stopped at underscore (it's punctuation): emacs, gnome-terminal, gedit, gim (gtk
textareas).
Included underscore (it's part of the word): kedit, gvim, NS4 (i.e. motif text
fields).
So there's no platform consistency on linux.  I personally think it makes more
sense for underscore to be part of the word, but maybe we should follow what
other platforms do.  What does windows do?  Kathy, what does the mac do?
On my Mac (OS X 10.12.4.), I checked TextEdit, BBEdit and even the Finder, and
they all agrees that an underscore is part of a word, while a dash isn't for
instance. Safari included both the underscore *and* the dash, while Explorer
only included the dash, but not the underscore. Confusing.

The only reference I found was
http://developer.apple.com/techpubs/mac/HIGuidelines/HIGuidelines-217.html#HEADING217-0
 (old HIG), but that doesn't speak about underscores.
Yes, we should adhere to platform conventions, at least by default. But we could
also have a pref for what exactly is "punctuation", just as we have a pref for
whether we stop selection at punctuation or not. The request that its default
setting conform to the respective platform's conventions is natural.

As for the platform conventions, I am not convinced there are any. I've tried
three places on Windows and each acts differently: Notepad, Word, and the input
field you get when you select Start->Run. On Unix, I tried KDE3's equivalent of
Start->Run (Alt+F2), KDE3's Konsole, XFree 4.2.0's xterm, and Netscape 4.8, and
each acts differently, too. I'm not pretending that it's a representative or
even exhaustive list, but it seems that there is no strict, widely recognized
guideline. And as for Mac, I have no idea whatsoever.

On my Linux:
xterm stops selection on .-/?=%& but not _
konsole stops on ?=%& but not .-_/
KDE's run program (Alt+F2) stop on . but not on -_/?=%&
Netscape stops on nothing but whitespace.
Mozilla with stop_at_punctuation=true stops on .-_/?=%& (all that I tried).
Mozilla with stop_at_punctuation=false stops on nothing but whitespace.

I think that there should really be a pref for this. (And if it will be
introduced, I hope that the underscore will be exempt from the default
word-delimiter list, as the summary of this bug says.) Perhaps I will change the
summary into something like "word-delimiters should be configurable".
BTW, the pref has actually already been suggested in bug 98546 comment 0,
"Additional Comments From Akkana 2000-04-24 17:13". (There are many unnumbered
comments in that bug, I don't know why - did Bugzilla not support comment
numbers back then?) The idea was dropped then, but I still it's a valid one.
Also, in bug 98546 comment 51, someone mentions how this is configurable in xterm.
I'm pretty sure that Apple HIG guidelines say that _ is considered part of the
word (and should be selected as part of a word when double-clicked or doing
other word selection (like option-arrow)).
Keywords: topembed
Agree with Boris' comment #1; we should follow platform conventions.

Apple HIG says '_' is part of the word, Windows convention seems to consider it
a separate word.  

Marking Topembed+ 
- on Mac, we should treat the '_' as part of the word.
- on Windows, we need to treat it as a separate word (not as punctuation).
- Linux does not seem to have a convention; I suggest follow Mac approach.
Keywords: topembedtopembed+
Whiteboard: EDITORBASE
Well, I'm not terribly happy about the Windows convention (if it *is* a
convention - where did that statement come from?), but as long as it's
configurable, I won't object.
editorbase+ for following the platform convention, but not for the making it a
configurable preference.
Whiteboard: EDITORBASE → EDITORBASE+
need to have another meeting to argue about this.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.6alpha
(In reply to comment #10)
> need to have another meeting to argue about this.

Has there been a meeting? What's the outcome of the argument?
Here's a preliminary patch. It adds a new pref called
layout.word_select.word_characters that is a string composed of characters that
are to be treated as a part of word. The pref defaults to "_" except on
Windows, where it's "". (I'd prefer to not exclude Windows, though. Is it
really necessary?)

Is this the right approach? Or is the string pref too much bloat - should I
just add a boolean pref for the underscore, sacrificing the flexibility?

Am I handling the string right? Have I used the right class? Won't I leak
memory when the pref is changed?
Assignee: mjudge → vdvo
Comment on attachment 144935 [details] [diff] [review]
patch, first attempt

Seems to work fine for me. Let's see what the reviewers think.
Attachment #144935 - Flags: superreview?(roc)
Attachment #144935 - Flags: review?(bzbarsky)
I probably won't be able to get to this for at least a week, more likely two,
and even then this is not code I know well...  dbaron or rbs or mjudge or akkana
all know it better, I think.
Comment on attachment 144935 [details] [diff] [review]
patch, first attempt

Thanks for letting me know, Boris. Trying rbs.
Attachment #144935 - Flags: review?(bzbarsky) → review?(rbs)
Comment on attachment 144935 [details] [diff] [review]
patch, first attempt

I'm not an expert (by any means) in this part of the code but I don't
understand the changes to all.js.  I expected to see the first change in all.js
and the 2nd change in the windows prefs.js file.

I assume this patch will address both double-click selection as well as
selection via keybindings such as option-shift-left (on Mac).

I'm not sure if there is a concern for double-byte languages or not.  Perhaps
that would go through a different text transformer so this is fine?
(In reply to comment #16)
> (From update of attachment 144935 [details] [diff] [review])
> I'm not an expert (by any means) in this part of the code but I don't
> understand the changes to all.js.  I expected to see the first change in all.js
> and the 2nd change in the windows prefs.js file.

Apparently, the default prefs of all platforms are now in the one file, and
through some magic it gets preprocessed at build time so the sections like
#ifdef XP_WIN etc. are properly processed. I don't know how exactly it's done,
but it seems to work.

> I assume this patch will address both double-click selection as well as
> selection via keybindings such as option-shift-left (on Mac).

AFAICT it does, although I have only tried it on Linux.

> I'm not sure if there is a concern for double-byte languages or not.  Perhaps
> that would go through a different text transformer so this is fine?

Hmmm... no idea about that. :-(

Although, I've noticed that there is something called nsIWordBreaker. I know
nothing about it, but it doesn't seem to be used in
nsTextTransformer::ScanNormalAsciiText_[FB]_ForWordBreak...
Sorry, I don't have the stamina to review this, and trace in the debugger, etc.
nsTextFrame is already too obfuscated and I don't like this pref-controlled
change that 99% of users don't know/care about, with other programs at variance
on the matter. This is a personal opinion. Try other reviewers.
Attachment #144935 - Flags: review?(rbs) → review?(akkzilla)
Personally, I think it would be cleaner to get rid of stop_at_punctuation and
replace it with a string pref called something like word_break_chars, which
could be defaulted to include underscore or not, as the case may be, rather than
compiling in a list of punctuation characters then special-case excluding _ (and
whatever else gets added).  It looks like maybe that wouldn't be that much
harder, but you're now the expert on this code :-) -- what do you think?

It's a bummer to see how many times you had to repeat almost the same code in
the diff.  I wonder if you could make a macro or function or something to replace
"if (sWordSelectStopAtPunctuation && readingAlphaNumeric && isalnum(ch) &&
IS_ASCII_CHAR(ch)" which would be more consistent about the order of the tests.
 Though I suspect, without looking closely at the surrounding code, that there's
a lot more than that in that file which ought to be combined, refactored and
otherwise cleaned up (that's not your fault, of course).

I'm not very confident of my ability to review this either.  I suspect it's
fine, if it modifies all the places previously using
sWordSelectStopAtPunctuation. We definitely should try to get some i18n eyes on
this (cc smontagu & shanjian -- looks like shanjian has been in this code before
and perhaps knows it somewhat?), though it looks like most of those concerns
should already have been taken care of when StopAtPunctuation went in.

How many times does FindChar get called?  Do we need to worry about performance
problems when right-arrowing through text?  Or is only one of these clauses ever
called per action, so we'll never call it more than once on the same letter?
I think making the preference mean "punctuation characters treated as letters
for word breaking" is the right way to go, because the set of punctuation
characters could be rather large, especially if we included Unicode punctuation,
which we might eventually want to.

It looks like nsIWordBreaker is used only for Unicode, and the ASCII routines do
their own thing. This is actually bad because it means ASCII text occurring in a
Unicode string is treated differently (the platform punctuation preference is
not honoured, for example).

Do we ever really pass non-ASCII text to
nsTextTransformer::ScanNormalAsciiText_F_ForWordBreak? It seems to just treat
any non-ASCII characters as Latin-1. Is that really the right thing to do?

I think a good thing to do here would be:
-- make nsSampleWordBreaker::GetClass a public static inline method in
nsIWordBreaker
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsSampleWordBreaker.cpp#89
-- Factor out the single-byte (Latin-1) case into a new static inline function
nsIWordBreaker::GetClassLatin1()
-- add the punctuation control preference into that method. That way it will be
picked up by the Unicode code paths too.
-- Call GetClassLatin1 from the ASCII nsTextTransformer methods when you're
checking to see whether something is punctuation.
Comment on attachment 144935 [details] [diff] [review]
patch, first attempt

taking this off my queue until there's more action :-)
Attachment #144935 - Flags: superreview?(roc)
Comment on attachment 144935 [details] [diff] [review]
patch, first attempt

Taking it off my review queue, too many questions remaining.
Attachment #144935 - Flags: review?(akkzilla)
Blocks: 393227
From a user perspective, this has become more visible since bug 16203 was fixed. Any chance we can revisit this in time for 1.9?
Flags: wanted1.9.0.x?
Not wanted for 1.9.0.x. You might try for 1.9.1 though...
Flags: wanted1.9.0.x? → wanted1.9.0.x-
QA Contact: pmac → selection
Wouldn't it be better to allow user to list characters which should be treated as part of word rather than hardcoding them? Similar to how it's made in the Gnome Terminal, screenshot attached.
If underscores can't be treated as part of a word by default, I fully agree with  Sergey's idea.
Chrome doesn't treat underscore as delimiter and correctly selects words with underscore by double-clicking. Default windows controls (in system dialogs) also don't treat underscore as delimiter.
Depends on: 863358
Here we are 12 years later and STILL firefox does not conform to platform standards 

queue_directory

above word fails to highlight on firefox ( ubuntu ) yet correctly does highlight entire word on Chrome, Gnome terminal, Opera Browser, Sublime Text 3 ....

Please just put in this change
SOLUTION :   
    about:config
         
search for
    Layout.word_select.stop_at_punctuation   
                change from true --> false
(In reply to EventHorizon from comment #30)
> SOLUTION :   
>     about:config
>          
> search for
>     Layout.word_select.stop_at_punctuation   
>                 change from true --> false

This workaround creates another problem - selection doesn't stop at ANY punctuation, including periods.
>> This workaround creates another problem - selection doesn't stop at ANY punctuation, including periods.

So true. Just yesterday I've started using FF instead of chrome, but this little thing is so annoying.
One more to join a request to make it!
"Layout.word_select.stop_at_punctuation" does not solve a problem.
At least make it in the form of custom setting.
Please!!!

(mac os)
When selecting, for example, function names containing underscore, this little thing seems quite awkward compared to Chrome/Safari. I too would like the underscore to be treated as part of the word.
As a user that has shuffled back and forth from Firefox over the years (and recently came back after reading about the stellar enhancements that have been made over the years), this punctuation quirk (particularly underscores) has been the single behavior annoyance outside of an otherwise great browser experience.

I'm just curious, it appears as though the assignee is @Vaclav Dvorak, who is also the person that first reported the bug 15 years ago, and even came up with an initial patch (that improved but unfortunately did not fix the problem - as his patch did not honor any punctuation including parenthesis).

Since then, it does not appear that there has been any movement on this in 14 years.

@Jet Villegas, or anybody else that knows, what are the usual next steps from here?
Flags: needinfo?(bugs)
Brian: can someone from your team pick this up? 15-year old bug--it comes with a steak!
Assignee: vdvo → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(bugs) → needinfo?(bbirtles)
Also Masayuki mentions that:

> If it's necessary only for word selection, etc, looks like that we need some additional code here and its callers:
> https://searchfox.org/mozilla-central/rev/919dce54f43356c22d6ff6b81c07ef412b1bf933/layout/generic/nsFrame.cpp#8857

I'll follow up with members of my team next week and see if anyone is available to take it soon and if no-one can take it immediately, I'll set a reminder to chase this up in a month.
Flags: needinfo?(bbirtles)
taken from kubuntu konsole settings :

      "characters considered part of a word when double clicking"

           :@-./_~?&=%+#

Please assure solution works against the above set of chars

Also , when I double click on text today (bad behaviour) 
      firefox fails to highlight the entire following word :

       this_is_word 

... or if I alter settings

about:config
Layout.word_select.stop_at_punctuation   change from true --> false

then it is also bad since it now assumes following is just one word

       (this_is_word),(blah)

Clearly above line of text is NOT one word
After a bit of digging I believe it can be solved by adding the correct Unicode to this enum here:
https://searchfox.org/mozilla-central/source/gfx/harfbuzz/src/hb-unicode.h#64

And then adding that to the is punctuation check here:
https://searchfox.org/mozilla-central/source/layout/generic/nsTextFrame.cpp#8103

Unfortunately I'm not familiar with how the hb-unicode definitions work, but I thought that could help.
(In reply to EventHorizon from comment #39)
> taken from kubuntu konsole settings :
> 
>       "characters considered part of a word when double clicking"
> 
>            :@-./_~?&=%+#
> 
> Please assure solution works against the above set of chars

I don't think we should necessarily expand the scope of this bug to cover such an extended set of characters. IMO, treating underscore as if it were alphabetic (so it becomes "part of a word") should be fairly uncontroversial, as there aren't many common use cases outside its use as a "joiner" in multi_word_identifiers (or the convention of adding _leading or trailing_ underscores, e.g. to tag different types of identifier).

That Kubuntu set, though, looks like it would result in an entire email address like name@subdomain.example.com or a URL such as http://example.com/~user/path/to/something+with+spaces?param=foo#frag being selected as a single word. That's handy for the purpose of heuristically recognizing links in plain text, which the console may want to do, but I don't think it would be appropriate behavior for general text editing purposes, which the browser has to consider. 

(In reply to Hiroyuki Ikezoe (:hiro) from comment #37)
> It seems to me that using icu::BreakIterator[1] is a proper solution. 
> Chrome seems to use it [2].
> 
> [1] http://icu-project.org/apiref/icu4c/classicu_1_1BreakIterator.html
> [2]
> https://chromium.googlesource.com/chromium/blink.git/+/master/Source/
> platform/text/TextBreakIteratorICU.cpp#478

A solution could certainly be built based on BreakIterator, but we don't currently include that in our build (and adding it is controversial, due to the size of associated data -- or if we cut down on the data included, we get less-complete functionality). See also bug 1423593.

In the meantime, I think it's worth trying a minimal fix that just tweaks the behavior of underscore in our current code.
See Also: → 1423593
This seems to give the desired behavior for underscore, in my (minimal) local testing. I expect it'll probably affect some existing mochitests (in editor, a11y, etc), so if we do this, they'll need adjusting to match.
Rather surprisingly, this change didn't appear to break any existing tests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=35764d67b2d36d0626ef8ba5d72ccee48a39202d. So we should add a specific testcase for this behavior, to make sure we don't inadvertently regress it in a future update (e.g. migrating to a BreakIterator-based implementation).

:m_kato, does this seem reasonable to you as a fix we can apply immediately, without waiting on decisions about BreakIterator?
Attachment #8938811 - Flags: review?(m_kato)
Note that my proposed patch here does NOT address all the stuff in the decade-ago discussion about "following platform conventions", nor does it expose this as a pref-controlled behavior. It's not clear to me that there _are_ in fact differing, clearly-established platform conventions that we should be following; and a counter-argument would be that "the platform is Firefox (or The Web)", and as far as possible it should behave consistently, regardless of the host OS.

IMO, treating underscore as word-forming for this purpose is overwhelmingly more useful than treating it as word-separating, and we should simply make this the standard Gecko behavior. (And FWIW, that seems to be Chrome's behavior. And Safari's. Though Safari does the same with period, too, which I think is more questionable -- comment 41.)
Awesome!  I agree that using BreakIterator is not the best choice for now.  (Actually I wrote the same fix last night and tried to add test cases in test_movement_by_words.html but it originally did not work on Linux. :/)
This adds an example (with expectations based on current behavior) of underscore_separated_words to test_movement_by_words.html.
Attachment #8938884 - Flags: review?(m_kato)
Attachment #8938811 - Attachment is obsolete: true
Attachment #8938811 - Flags: review?(m_kato)
Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
Comment on attachment 8938885 [details] [diff] [review]
patch 2 - Override category of underscore for word-selection purposes so it is treated as a word-forming character, and update word-movement tests accordingly

Review of attachment 8938885 [details] [diff] [review]:
-----------------------------------------------------------------

Actually, although Edge and Blink are different behavior for this case too, we should change it.  So I agree to change to Blink and WebKit way.
Attachment #8938885 - Flags: review?(m_kato) → review+
Attachment #8938884 - Flags: review?(m_kato) → review+
Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/46053289b084
patch 1 - Add word-movement tests that involve underscore_delimited_words (currently treated as separate words). r=m_kato
https://hg.mozilla.org/integration/mozilla-inbound/rev/b1e9b56dbc7d
patch 2 - Override category of underscore for word-selection purposes so it is treated as a word-forming character, and update word-movement tests accordingly. r=m_kato
https://hg.mozilla.org/mozilla-central/rev/46053289b084
https://hg.mozilla.org/mozilla-central/rev/b1e9b56dbc7d
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: mozilla1.6alpha → mozilla59
Can you add configuration option for underscore_delimited_words behavior to allow revert old behavior, because now it is confusing at least for me, to select whole file name in urlbar instead of just part of it, which was default until now?
(In reply to drJeckyll from comment #52)
> Can you add configuration option for underscore_delimited_words behavior to
> allow revert old behavior, because now it is confusing at least for me, to
> select whole file name in urlbar instead of just part of it, which was
> default until now?

While I'm sure there will be times when selecting one "word" of a multi_word_identifier is desirable, I'm not convinced this is a common enough scenario to be worth the added complexity of making it configurable. Take the URL of this bug, for example: if I double-click anywhere in the "show_bug" part of the URL, the whole of "show_bug" will now be selected, whereas previously it would have selected either "show" or "bug", depending where I clicked. But I don't see a strong use-case for selecting its individual parts... yes, any change may take a bit of time to get used to, but is it really a serious problem?
I work with bunch of url's which are something like something_500.jpg. I'm not really interesting about something part, but I need these numbers. Until now I simple double click on 500 and it was selected. Now ... well, now it select something_500, which is confusing for me. Furthermore now I need to click on start or end of 500 and to select these digits. It is annoying and take more time, have to be careful what select & etc. I understand that selecting words make sense, but I ask to make config setting to include or exclude underscore character of start/stop words or even better to have setting which chars are treated like start/stop for words.

It was a time when Firefox have settings for almost everything. Hope something which was default for so long time can be reverted back from users which need it.
(In reply to drJeckyll from comment #54)
> I work with bunch of url's which are something like something_500.jpg. I'm
> not really interesting about something part, but I need these numbers.

When I voted for this bug, I used to work with identifiers like T__AC9505D3F968466FAF734657C92A1B63, and after switching to Firefox from IE I found out that I could no longer select the whole identifier by simply double-clicking it. That was very inconvenient.
So? :) Now you can, I don't. What is the difference? One of us is still unhappy. 

My only point is to ask this to be configurable, since users are different and have different needs and habits. If devs decided that it is too much trouble (which is not something depending on me to decide), then let it be. It will be not the first one which will make Firefox more like other browsers (see that argument before don't you?). I don't want to bring this to debate like alsa/pulse audio, removing legacy extensions, ugly UI & etc. which is pointless, I was just asking if it is possible.
+1 for making this a preference. This has been incredibly aggravating as I also frequently modify URLs that contain underscores.

Firefox already has a number of click-selection preferences that can change platform behavior:

* layout.word_select.eat_space_to_next_word
* layout.word_select.stop_at_punctuation
* browser.urlbar.clickSelectsAll
* browser.urlbar.doubleClickSelectsAll (which I always disable)
* browser.triple_click_selects_paragraph (which I also disable).

I feel it would be natural to include something like layout.word_select.stop_at_underscore or layout.word_select.stop_at_chars.

A quick search online show some conflicting opinions on this, and there is already precedent for this being configurable in many editors, IDEs, and terminals. For example, I have tweaked word breaks to be more aggressive (i.e. [^a-zA-Z0-9]) on systems where I do programming. It's much easier to double-click-drag than click-drag, as the former needs less precision than the latter.

As an aside - I find it absurd that, by convention, hyphen-ated words are treated as two words, but under_scored words are a single word. In my book, that's backwards. But apparently that's another argument.
Blocks: 1431672
Any further change here -- such as adding a preference -- really needs to be considered in a separate follow-up bug, so that we can track things properly. Accordingly, I've filed bug 1431672.
Thank you very much. Old behavior is back with layout.word_select.stop_at_underscore set to true.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: