Gecko:Line Breaking: Difference between revisions

Updating to Gecko 1.9 spec candidate (Removing the old proposals)
(Updating to Gecko 1.9 spec candidate (Removing the old proposals))
Line 8: Line 8:
* [http://www.w3.org/TR/css3-text/#line-breaking CSS3 Text]
* [http://www.w3.org/TR/css3-text/#line-breaking CSS3 Text]


= Breaking by Example =
= Breaking rules of Gecko 1.9 =


[https://bugzilla.mozilla.org/show_bug.cgi?id=255990 Bug 255990] introduced a lot of major changes to our line breaking behavior, some of which we don't really want. We should do some planning and investigation to see what we can do to allow those new breaks where we want them, and disallow them where we don't. First, let's get a clear idea of what breaks we want and what breaks we don't want. Feel free to add to the following three lists.
This section explains the line breaking rules of Gecko 1.9.


Unless otherwise specified, assume each listed string is surrounded by spaces.
Gecko 1.9 uses breaking rules based on JIS X 4051. The detailed specification is in [http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#51 the comment in nsJISx4501LineBreaker.cpp].


== Breaks We Don't Want ==
== Generic rules for most languages ==


List breaks we don't want here.
=== Breaking between two words ===


* c/o
White space characters, such as the ASCII space, are the most common break opportunities. We define the following characters as white space:
* /etc
* s/he
* 2/23/98
* cm/sec
* --
* ---
* index.html
* colo(u)r
* guest(s)
* "foo..."
* \3,000 (BACKSLASH is used as yen sign and won sign.)
* \n
* \x0A
* 100%
*  
* AT&T
* 1-5
* 2007-08-02
* -ed
* Init()
* sin(x)
* smileys: :) ;) :-) :P :D :-D :^) :-/ =^_^= \^_^/ ^-^ ^^;; o_O -__-;; >_< ><
* !?
* ?!
* ???
* !!!
* $20
* 20$
* US$20
* 20$US
* 100°
* °C


== Breaks We Want ==
* U+0020 (SPACE)
* U+0009 (CHARACTER TABULATION)
* U+000D (CARRIAGE RETURN)
* U+2000 (EN QUAD)
* U+2001 (EM QUAD)
* U+2002 (EN SPACE)
* U+2003 (EM SPACE)
* U+2004 (THREE-PER-EM SPACE)
* U+2005 (FOUR-PER-EM SPACE)
* U+2006 (SIX-PER-EM SPACE)
* U+2008 (PUNCTUATION SPACE)
* U+2009 (THIN SPACE)
* U+200A (HAIR SPACE)
* U+200B (ZERO WIDTH SPACE)
* U+3000 (IDEOGRAPHIC SPACE)


List breaks we want here.
See NS_IsSpace in [http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/public/nsILineBreaker.h nsILineBreaker.h].
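
As a rough illustration, the white space list above can be written as a small lookup in Python. This is only a sketch and not the actual NS_IsSpace code: the names BREAKABLE_SPACES, is_breakable_space and space_break_offsets are hypothetical, and reporting the offset just after each space as the break point is an assumption.

```python
# Code points copied from the white space list above.
BREAKABLE_SPACES = frozenset(
    [0x0020, 0x0009, 0x000D]
    + list(range(0x2000, 0x2007))  # EN QUAD .. SIX-PER-EM SPACE
    + list(range(0x2008, 0x200C))  # PUNCTUATION SPACE .. ZERO WIDTH SPACE
    + [0x3000]                     # IDEOGRAPHIC SPACE
)


def is_breakable_space(ch):
    """Return True if ch is one of the listed white space characters."""
    return ord(ch) in BREAKABLE_SPACES


def space_break_offsets(text):
    """Offsets just after each breakable space (assumed break position)."""
    return [i + 1 for i, ch in enumerate(text) if is_breakable_space(ch)]
```

Note that U+2007 (FIGURE SPACE) and U+00A0 (NO-BREAK SPACE) are deliberately absent from the set, so they never produce a break opportunity here.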


* somereallylongword/someotherreallylongword
=== Don't break near other breakable points ===
* %82%B1%82%EA%82%CD%93%FA%96%8C%EA%82%C5%82%B7
* foo=bar&foo=bar&foo=bar&foo=bar
* foo=bar;foo=bar;foo=bar;foo=bar
* c:\foo\bar\foo\bar\foo\bar
* \\foo\bar\foo\bar\foo\bar
* 2-bromo-4,4-dichlorophenol
* never-ending
* kites&#8212;but
* kites--but


== Breaks We're Not Sure About ==
Long words need break opportunities other than white space, and the sections below define several. All of those rules are limited by this one, which prevents a word from being broken into fragments that are too short.


List breaks we're not sure about here.
* Don't break near the start of a word.
* Don't break near the end of a word.
* Don't break near the previous breakable point.
* Don't break near a non-breakable space.


= Proposed Goals for 1.9 =
The non-breakable spaces are:


* Don't break at punctuation if there is a better opportunity within 2 characters. E.g. c/o, &co, AT&T will not break but in/out can.
* U+00A0 (NO-BREAK SPACE)
* Don't break within a sequence of punctuation.
* U+2007 (FIGURE SPACE)
* Don't break at degree signs.
 
They are defined by IS_NONBREAKABLE_SPACE in [http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp nsJISx4501LineBreaker.cpp].
 
We define "near" as within 6 characters (see CONSERVATIVE_BREAK_RANGE in [http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp nsJISx4501LineBreaker.cpp]). The value 6 is a magic number: it prevents breaks inside most date formats and smileys without any special code for them.
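
The effect of this rule can be sketched as a filter over candidate break offsets within one word. The sketch below is a guess at the behavior and not a copy of the Gecko code: filter_near_breaks is a hypothetical name, "near" is assumed to mean "fewer than 6 characters away", and the candidate offsets are assumed to come from the other rules on this page.

```python
CONSERVATIVE_BREAK_RANGE = 6  # the value stated in the text
NONBREAKABLE_SPACES = {"\u00A0", "\u2007"}  # NO-BREAK SPACE, FIGURE SPACE


def filter_near_breaks(word, candidates):
    """Drop candidate offsets that would leave a too-short fragment."""
    kept = []
    prev = 0  # the start of the word acts as the previous boundary
    for pos in sorted(candidates):
        if pos - prev < CONSERVATIVE_BREAK_RANGE:
            continue  # too close to the word start or the previous break
        if len(word) - pos < CONSERVATIVE_BREAK_RANGE:
            continue  # too close to the end of the word
        lo = max(0, pos - CONSERVATIVE_BREAK_RANGE)
        window = word[lo:pos + CONSERVATIVE_BREAK_RANGE]
        if any(ch in NONBREAKABLE_SPACES for ch in window):
            continue  # too close to a non-breakable space
        kept.append(pos)
        prev = pos
    return kept
```

Under these assumptions, the two hyphens in "2007-08-02" yield no break, because the first candidate is too close to the start of the word and the second is too close to its end.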
 
=== Breaking between CLASS_CLOSE and CLASS_OPEN ===
 
CLASS_CLOSE is the class of closing parentheses and some punctuation; CLASS_OPEN is the class of opening parentheses. A break is allowed between a CLASS_CLOSE character and a following CLASS_OPEN character, e.g., at ")(", "][" and also ")[".
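
A minimal sketch of this rule follows. The two character sets are illustrative assumptions that contain only ASCII brackets; the real CLASS_CLOSE and CLASS_OPEN classes contain more characters than shown here, and close_open_break_offsets is a hypothetical name.

```python
# Illustrative subsets only; the real classes are larger.
CLASS_CLOSE_SAMPLE = set(")]}")
CLASS_OPEN_SAMPLE = set("([{")


def close_open_break_offsets(word):
    """Offsets between a closing character and a following opening one."""
    return [
        i
        for i in range(1, len(word))
        if word[i - 1] in CLASS_CLOSE_SAMPLE and word[i] in CLASS_OPEN_SAMPLE
    ]
```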
 
=== Breaking after hyphens ===
 
The position after a hyphen is a good break point in Western languages. We define the following characters as hyphens:
 
* U+002D (HYPHEN-MINUS)
* U+058A (ARMENIAN HYPHEN)
* U+2010 (HYPHEN)
* U+2012 (FIGURE DASH)
* U+2013 (EN DASH)
 
They are defined in UAX#14.
 
However, this rule has the following limitations:
 
* If both the character before and the character after '''a''' hyphen are numeric, don't break after the hyphen.
* Don't break between hyphens.
* If the hyphen(s) are at the start of the word, don't break after them. This limitation helps with command line parameters.
* If the character before or the character after the hyphen(s) is non-numeric and non-CLASS_CHARACTER, don't break after the hyphen(s).
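
The hyphen rule and its limitations can be sketched as below. This is an interpretation of the list above, not the Gecko implementation: hyphen_break_offsets is a hypothetical name, Python's str.isalnum() stands in for "numeric or CLASS_CHARACTER", and a run of hyphens at the very end of a word is assumed to produce no break. The result still needs to pass the "near" limit described earlier.

```python
HYPHENS = {"\u002D", "\u058A", "\u2010", "\u2012", "\u2013"}


def hyphen_break_offsets(word):
    """Candidate break offsets after runs of hyphens inside one word."""
    breaks = []
    i, n = 0, len(word)
    while i < n:
        if word[i] not in HYPHENS:
            i += 1
            continue
        start = i
        while i < n and word[i] in HYPHENS:
            i += 1  # consume the whole run: never break between hyphens
        end = i
        if start == 0 or end == n:
            continue  # leading run (e.g. "--help") or trailing run
        before, after = word[start - 1], word[end]
        if not (before.isalnum() and after.isalnum()):
            continue  # a neighbor is neither numeric nor a letter
        if end - start == 1 and before.isdigit() and after.isdigit():
            continue  # a single hyphen between digits, e.g. "1-5"
        breaks.append(end)
    return breaks
```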
 
=== Breaking in file path ===
 
File paths can be very long, and they also appear in URIs. The path delimiters are good break points. We recognize two file path delimiters:
 
* U+002F (SOLIDUS)
* U+005C (REVERSE SOLIDUS)
 
Gecko 1.9 breaks '''before''' them. This ensures that a fragment after a break never looks like a natural word. If we broke '''after''' the delimiters and the last directory name were a natural word in some language (e.g., "foo/bar/document"), it would not be clear whether the last fragment is part of the file path (the example would be split into "foo/bar/" and "document"). With our rule, the second and later fragments always start with a delimiter (the example is split into "foo/bar" and "/document").
 
There are also the following limitations:
 
* Don't break before the first delimiter.
* Don't break between delimiters.
 
Note that we don't support paths that mix both delimiters, but this is a very rare case.
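
The following sketch shows one way to read the file path rule. It is an illustration under assumptions, not the Gecko code: path_break_offsets is a hypothetical name, and returning no break at all for a word that mixes both delimiters is only one possible reading of "not supported".

```python
def path_break_offsets(word):
    """Candidate break offsets before path delimiters inside one word."""
    present = [d for d in ("/", "\\") if d in word]
    if len(present) != 1:
        return []  # no delimiter, or a mixed path (assumed: no breaks)
    delim = present[0]
    breaks = []
    seen_first = False
    i = 0
    while i < len(word):
        if word[i] != delim:
            i += 1
            continue
        start = i
        while i < len(word) and word[i] == delim:
            i += 1  # consume the whole run: never break between delimiters
        if seen_first:
            breaks.append(start)  # break before the delimiter run
        seen_first = True  # the first run never gets a break before it
    return breaks
```

For "foo/bar/document" this returns [7], which splits the word into "foo/bar" and "/document" as described above.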
 
=== Breaking in query parameters of URI ===
 
We can break between parameters. Generally, U+0026 (AMPERSAND) is the parameter delimiter, and the HTML 4.01 specification recommends U+003B (SEMICOLON). We support both as parameter delimiters.
 
A break is allowed after a delimiter if the characters before the delimiter include U+003D (EQUALS SIGN).
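
As a sketch, the query parameter rule might look like this. The name query_break_offsets is hypothetical, and two points are assumptions: "the characters before the delimiter" is taken to mean anywhere earlier in the same word, and a delimiter at the very end of the word produces no break.

```python
def query_break_offsets(word):
    """Candidate break offsets after '&' or ';' once '=' has been seen."""
    breaks = []
    seen_equals = False
    for i, ch in enumerate(word):
        if ch == "=":
            seen_equals = True
        elif ch in "&;" and seen_equals and i + 1 < len(word):
            breaks.append(i + 1)  # break after the delimiter
    return breaks
```

Because "AT&T" contains no equals sign, it gets no break from this rule.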
 
=== Breaking in percent encoded string ===
 
Non-ASCII characters in a URI are percent-encoded, and the resulting string is usually very long. (E.g., Japanese text is encoded to ASCII text whose length is the number of characters * 9.)
 
A break is allowed before U+0025 (PERCENT SIGN) if the character 3 positions before it or the character 3 positions after it is also U+0025.
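
A sketch of the percent sign rule follows. The name percent_break_offsets is hypothetical, and skipping a percent sign at the very start of the word is an assumption, since a break there would not split the word.

```python
def percent_break_offsets(word):
    """Candidate break offsets before '%' in a percent-encoded run."""
    breaks = []
    for i, ch in enumerate(word):
        if ch != "%" or i == 0:
            continue
        prev_is_percent = i >= 3 and word[i - 3] == "%"
        next_is_percent = i + 3 < len(word) and word[i + 3] == "%"
        if prev_is_percent or next_is_percent:
            breaks.append(i)  # break before this percent sign
    return breaks
```

A lone percent sign, as in "100%", has no other percent sign three positions away and therefore yields no break.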
 
== CJK breaking rules ==
 
Chinese and Japanese text has more break opportunities than text in other languages. Therefore, if a word contains CJK characters, the rule against breaking near other breakable points is not applied when breaking that word.
 
== Complex scripts ==
 
The complex scripts are:
 
* Thai
* Lao
* Tibetan
 
We do not handle them ourselves; Uniscribe, Pango, and ATSUI handle them. As a result, the layout of text in complex scripts depends on the platform.