Line Breaking Resources
- Line-breaking meta bug
- JIS X 4051 (in Japanese; fantasai can scan pages as requested)
- UAX 14, but see also proposed update
- Masayuki's browser chart 1 and browser chart 2
- Jukka Korpela's comments on UAX 14
- CSS3 Text
Breaking rules of Gecko 1.9
This section explains the line-breaking rules of Gecko 1.9.
Gecko 1.9 uses breaking rules based on JIS X 4051. The detailed specification is in the comments of nsJISx4501LineBreaker.cpp.
The behavior was designed with the following principles:
- Don't break Western text layout compatibility with Gecko 1.8.1 or earlier.
- URIs should be breakable for Japanese marketing.
- Hyphens should be breakable for Western languages.
- Avoid risky changes and approaches. It was too late in the release cycle for them.
Generic rules for most languages
Breaking between two words
White-space characters are the common breakable points. The following characters are treated as white space:
- U+0020 (SPACE)
- U+0009 (CHARACTER TABULATION)
- U+000D (CARRIAGE RETURN)
- U+2000 (EN QUAD)
- U+2001 (EM QUAD)
- U+2002 (EN SPACE)
- U+2003 (EM SPACE)
- U+2004 (THREE-PER-EM SPACE)
- U+2005 (FOUR-PER-EM SPACE)
- U+2006 (SIX-PER-EM SPACE)
- U+2008 (PUNCTUATION SPACE)
- U+2009 (THIN SPACE)
- U+200A (HAIR SPACE)
- U+200B (ZERO WIDTH SPACE)
- U+3000 (IDEOGRAPHIC SPACE)
See NS_IsSpace in nsILineBreaker.h.
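The list above can be written down as a small lookup. This is only a sketch in Python that mirrors the list in this section, not the actual NS_IsSpace code; the names BREAKABLE_SPACES, is_space and space_break_offsets are made up for illustration.

```python
# Code points copied from the list above (an illustration, not the Gecko source).
BREAKABLE_SPACES = frozenset(
    [0x0020, 0x0009, 0x000D]
    + list(range(0x2000, 0x2007))  # U+2000..U+2006
    + list(range(0x2008, 0x200C))  # U+2008..U+200B; U+2007 FIGURE SPACE is left out
    + [0x3000]
)


def is_space(ch):
    """Return True if ch is one of the breakable white-space characters."""
    return ord(ch) in BREAKABLE_SPACES


def space_break_offsets(text):
    """Return the offsets just after each white-space character.

    An offset i means "a new line may start at text[i]".
    """
    return [
        i + 1
        for i, ch in enumerate(text)
        if is_space(ch) and i + 1 < len(text)
    ]


print(space_break_offsets("ab cd\u3000ef"))
```

Note that U+00A0 and U+2007 are deliberately absent, so is_space returns False for them.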
Don't break at near the other breakable points
Long words need other breakable points, which the rules in the following sections provide. All of those rules are limited by this one, which prevents a word from being split into fragments that are too short.
- Don't break near the start of a word.
- Don't break near the end of a word.
- Don't break near the previous breakable point.
- Don't break near a non-breakable space, because the non-breakable space might be a word separator.
The non-breakable spaces are:
- U+00A0 (NO-BREAK SPACE)
- U+2007 (FIGURE SPACE)
These are defined by IS_NONBREAKABLE_SPACE in nsJISx4501LineBreaker.cpp.
"Near" is defined as 6 characters (see CONSERVATIVE_BREAK_RANGE in nsJISx4501LineBreaker.cpp). The value 6 is a magic number: it prevents breaks inside most date formats and smileys without any special code for them.
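The effect of this rule can be sketched as a filter over candidate break offsets. The following Python is a minimal illustration under assumptions: the function name filter_conservative and its arguments are hypothetical, and the exact boundary handling (whether a distance of exactly 6 is allowed) is a guess, not taken from the Gecko source.

```python
CONSERVATIVE_BREAK_RANGE = 6  # the "near" distance described above
NONBREAKABLE_SPACES = {"\u00A0", "\u2007"}


def filter_conservative(word, candidates):
    """Drop candidate breaks that would leave a too-short fragment.

    word: a run of text containing no breakable white space.
    candidates: sorted offsets; offset i means "break before word[i]".
    """
    accepted = []
    last = 0  # the start of the word acts as the previous break point
    for pos in candidates:
        # Assumption: a distance of exactly 6 is far enough.
        near_start_or_prev = pos - last < CONSERVATIVE_BREAK_RANGE
        near_end = len(word) - pos < CONSERVATIVE_BREAK_RANGE
        window = word[max(0, pos - CONSERVATIVE_BREAK_RANGE):
                      pos + CONSERVATIVE_BREAK_RANGE]
        near_nbsp = any(ch in NONBREAKABLE_SPACES for ch in window)
        if near_start_or_prev or near_end or near_nbsp:
            continue
        accepted.append(pos)
        last = pos
    return accepted


# A date: both candidates (after each hyphen) are too close to an edge.
print(filter_conservative("2008-01-15", [5, 8]))
# A longer compound word keeps its candidate.
print(filter_conservative("aaaaaaaa-bbbbbbbb", [9]))
```

In this sketch the date yields no break at all, which is the behavior the magic number is meant to produce.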
Breaking between CLASS_CLOSE(_LIKE_CHARACTER) and CLASS_OPEN(_LIKE_CHARACTER)
CLASS_CLOSE(_LIKE_CHARACTER) is the class of closing parentheses and some punctuation. CLASS_OPEN(_LIKE_CHARACTER) is the class of opening parentheses. A break is allowed between a character of the first class and a character of the second, e.g., inside ")(", "][" and also ")[".
Breaking after hyphens
The position after a hyphen is a good breaking point for Western languages. The following characters are treated as hyphens:
- U+002D (HYPHEN-MINUS)
- U+058A (ARMENIAN HYPHEN)
- U+2010 (HYPHEN)
- U+2012 (FIGURE DASH)
- U+2013 (EN DASH)
They are defined in UAX#14.
However, this rule has the following limitations:
- If both the character before and the character after a hyphen are numeric, don't break after the hyphen.
- Don't break between hyphens.
- If the hyphen(s) are at the start of the word, don't break after the hyphen(s). This limitation keeps command-line parameters intact.
- If the character before the hyphen(s) or the character after the hyphen(s) is non-numeric and non-CLASS_CHARACTER, don't break after the hyphen(s).
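The hyphen rule and its limitations can be combined into one check. The Python below is a rough sketch, not the Gecko implementation: can_break_after_hyphen is a hypothetical name, and CLASS_CHARACTER is approximated with str.isalpha(), which is an assumption made only for this illustration.

```python
HYPHENS = {"\u002D", "\u058A", "\u2010", "\u2012", "\u2013"}


def can_break_after_hyphen(word, i):
    """Return True if a break is allowed right after word[i]."""
    if word[i] not in HYPHENS:
        return False
    # Find the whole run of hyphens that contains position i.
    start = i
    while start > 0 and word[start - 1] in HYPHENS:
        start -= 1
    end = i
    while end + 1 < len(word) and word[end + 1] in HYPHENS:
        end += 1
    if i != end:
        return False  # never break between hyphens
    if start == 0:
        return False  # leading hyphen(s), e.g. "--help"
    if end + 1 >= len(word):
        return False  # nothing follows the hyphen(s)
    prev_ch, next_ch = word[start - 1], word[end + 1]
    if prev_ch.isdigit() and next_ch.isdigit():
        return False  # e.g. "2008-01"
    # Assumption: CLASS_CHARACTER is approximated by str.isalpha().
    for ch in (prev_ch, next_ch):
        if not (ch.isdigit() or ch.isalpha()):
            return False
    return True


for word, i in [("well-known", 4), ("2008-01", 4), ("--help", 1), ("(-)", 1)]:
    print(word, can_break_after_hyphen(word, i))
```

The "near" rule described earlier would still be applied on top of this check.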
Breaking in file path
File paths are sometimes very long, and they also appear in URIs. The path delimiters are good points for breaking. We handle two file path delimiters:
- U+002F (SOLIDUS)
- U+005C (REVERSE SOLIDUS)
Gecko 1.9 breaks before the delimiter. With this rule, every fragment after the first starts with a delimiter, so it never looks like a natural word. For example, suppose the last directory name is a natural word in some language (e.g., "foo/bar/document"). If we broke after the delimiter, the text would be separated into "foo/bar/" and "document", and it would not be clear whether "document" is part of the file path. With our rule, the example is separated into "foo/bar" and "/document".
The rule has the following limitations:
- Don't break before the first delimiter.
- Don't break between delimiters.
Note that we don't support paths that mix both delimiters, but that is a very rare case.
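The path rule and its two limitations can be sketched as follows. This is an illustrative Python version with hypothetical names (path_break_offsets, split_at). For simplicity it treats both delimiters as one set, and it ignores the "near" rule described earlier.

```python
PATH_DELIMITERS = {"/", "\\"}


def path_break_offsets(text):
    """Return offsets i where a line may start just before text[i]."""
    offsets = []
    seen_delimiter = False
    for i, ch in enumerate(text):
        if ch not in PATH_DELIMITERS:
            continue
        is_first = not seen_delimiter
        seen_delimiter = True
        follows_delimiter = i > 0 and text[i - 1] in PATH_DELIMITERS
        if is_first or follows_delimiter:
            continue  # the two limitations listed above
        offsets.append(i)
    return offsets


def split_at(text, offsets):
    """Cut text into fragments at the given offsets."""
    bounds = [0] + list(offsets) + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


path = "foo/bar/document"
print(split_at(path, path_break_offsets(path)))
# prints ['foo/bar', '/document']
```

Every fragment after the first begins with the delimiter, as in the "foo/bar" and "/document" example.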
Breaking in query parameters of URI
We can break between parameters. Generally, U+0026 (AMPERSAND) is the delimiter between parameters, and the HTML 4.01 specification recommends U+003B (SEMICOLON) instead. We support both as parameter delimiters.
A break is allowed after a delimiter if the characters before the delimiter include U+003D (EQUALS SIGN).
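A possible reading of this rule, sketched in Python. The function name query_break_offsets is hypothetical, and "the characters before the delimiter include an equals sign" is interpreted here as "an equals sign appears anywhere earlier in the same string", which is an assumption.

```python
PARAM_DELIMITERS = {"&", ";"}


def query_break_offsets(text):
    """Return offsets i where a line may start just after a delimiter."""
    offsets = []
    seen_equals = False
    for i, ch in enumerate(text):
        if ch == "=":
            seen_equals = True
        elif ch in PARAM_DELIMITERS and seen_equals and i + 1 < len(text):
            offsets.append(i + 1)
    return offsets


print(query_break_offsets("q=a&r=b;s=c"))  # breaks after "&" and after ";"
print(query_break_offsets("AT&T"))         # no "=", so no break
```

The equals-sign condition keeps ordinary text containing an ampersand, such as "AT&T", from being treated as a query string.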
Breaking in percent encoded string
International (non-ASCII) characters in a URI are percent-encoded. The resulting string is too long in most cases. (E.g., Japanese text encoded as UTF-8 becomes ASCII text whose length is the number of characters * 9, because each character is typically 3 bytes and each byte becomes 3 characters such as "%E6".)
A break is allowed before U+0025 (PERCENT SIGN) if the character 3 positions before it is U+0025 or the character 3 positions after it is U+0025.
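Both the length arithmetic and the break rule can be checked with a short Python sketch. urllib.parse.quote is the standard library function for percent-encoding; percent_break_offsets is a hypothetical name for an illustration of the rule, not the Gecko code.

```python
from urllib.parse import quote


def percent_break_offsets(text):
    """Return offsets i where a line may start just before a "%" at text[i]."""
    offsets = []
    for i, ch in enumerate(text):
        if ch != "%" or i == 0:
            continue
        before = i >= 3 and text[i - 3] == "%"
        after = i + 3 < len(text) and text[i + 3] == "%"
        if before or after:
            offsets.append(i)
    return offsets


encoded = quote("日本語")  # three Japanese characters, UTF-8 by default
print(encoded, len(encoded))  # the length is 3 * 9 = 27
print(percent_break_offsets(encoded))
print(percent_break_offsets("50%off"))  # a lone "%" is not a break point
```

Requiring a neighboring "%" three characters away means an isolated percent sign in ordinary text does not create a break.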
CJK breaking rules
Chinese and Japanese text is more breakable than text in other languages. Therefore, if a word contains CJK characters, the "Don't break at near the other breakable points" rule is not applied when breaking that word.
Complex scripts are not handled by our own code. Uniscribe/Pango/ATSUI handle them, so the layout results for complex-script text depend on the platform.
Issues for future version
Use UAX#14 based new code
We should drop the current ugly implementation and create a new class based on UAX#14.
Implement prioritized line-breaking
Prioritized line-breaking would be a better approach for us. E.g., a word that is not too long should not be broken internally as long as it does not overflow the line.
Should not ignore the lang attribute?
The current implementation ignores the lang attribute, but it is sometimes important. E.g., in a Korean context, we should not break between Chinese characters, right? (In Chinese and Japanese text, words are not separated by spaces, but in Korean text they are.)
Tibetan should be handled ourselves (?)
Tibetan line-breaking rules are defined in UAX#14 and the spec is not complex, so we could handle Tibetan ourselves in the future. (Is the spec for Tibetan in UAX#14 good enough?)