This document is still a draft. Comments are most welcome on the discussion page.
In this document, I will:
- Describe the fundamental issues involved in bidirectional editing.
- Give an overview of how Mozilla currently tackles these issues.
- Point out some problems with the current approach.
- Present an alternative approach, which can solve some of these problems.
- Suggest a relatively simple way of implementing this approach within the existing framework.
Before starting, one important note:
Mozilla currently implements what's known as visual caret movement. That is, pressing the left (right) arrow key always moves the caret one place to the left (right), regardless of the directionality of the text the caret is on, or of the paragraph directionality. This approach is also the system approach on Mac OS X (and always was on Mac OS), but is not the system behavior on Windows (which uses logical caret movement instead).
This document assumes that this functionality is going to remain in place, i.e., the examples all refer to visual caret movement. However, most of the points raised below also apply to logical caret movement, since some operations (such as moving up/down, or positioning the caret using the mouse) are visual in nature even when logical caret movement is used. If you're interested in the visual-vs.-logical debate, see bug 167288.
The main issue involved in bidirectional editing is that there is no one-to-one relationship between a logical position in the text, and a visual position on the screen. A single logical position can map to two different visual positions, and a single visual position might map to two different logical positions.
In the following examples, I will use uppercase Latin letters to represent RTL (e.g. Hebrew) letters, whereas lowercase Latin letters will represent LTR (e.g. Latin) letters. This is the convention, as it makes it easier for people who cannot read RTL languages to understand the examples.
Consider, for example, the text with the following logical representation: latinHEBREWmore (this example deliberately omits spaces, in order to avoid the issues associated with resolving their directionality). This text is displayed on the screen as latinWERBEHmore.
Consider the logical position between n and H. This is immediately after n, so it maps to the visual position between n and W. But it is also immediately before H, so it also maps to the visual position between H and m.
Now, consider the visual position between n and W. It is immediately after n, so it can be mapped to the logical position between n and H. But it is also immediately after W (in logical order), so it can also be mapped to the logical position between W and m.
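The two-way ambiguity described above can be reproduced with a small sketch. It is a toy model, not the full Unicode bidi algorithm: it assumes an LTR paragraph, no neutral characters, and the uppercase-is-RTL convention used in this document. All function names are hypothetical, and visual "slots" are simply the gaps between displayed characters, numbered from 0 at the left edge.

```python
def visual_order(text):
    """Logical indices in left-to-right display order.

    Toy model (an assumption of this sketch): lowercase = LTR,
    uppercase = RTL, LTR paragraph, no neutral characters.
    Each maximal uppercase run is simply reversed.
    """
    order, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j].isupper() == text[i].isupper():
            j += 1
        run = range(i, j)
        order.extend(reversed(run) if text[i].isupper() else run)
        i = j
    return order


def display(text):
    """The text as it would appear on screen in the toy model."""
    return "".join(text[i] for i in visual_order(text))


def visual_slots(text, logical_pos):
    """Visual slots that a logical position can map to."""
    where = {logical: visual for visual, logical in enumerate(visual_order(text))}
    slots = set()
    if logical_pos > 0:
        prev = logical_pos - 1
        # Trailing edge of the preceding character: right side if LTR, left if RTL.
        slots.add(where[prev] if text[prev].isupper() else where[prev] + 1)
    if logical_pos < len(text):
        nxt = logical_pos
        # Leading edge of the following character: left side if LTR, right if RTL.
        slots.add(where[nxt] + 1 if text[nxt].isupper() else where[nxt])
    return sorted(slots)


def logical_positions(text, slot):
    """Logical positions that a visual slot can map to."""
    order = visual_order(text)
    found = set()
    if slot > 0:
        left = order[slot - 1]
        found.add(left if text[left].isupper() else left + 1)
    if slot < len(text):
        right = order[slot]
        found.add(right + 1 if text[right].isupper() else right)
    return sorted(found)


text = "latinHEBREWmore"
print(display(text))               # latinWERBEHmore
print(visual_slots(text, 5))       # [5, 11]: between n and W, or between H and m
print(logical_positions(text, 5))  # [5, 11]: between n and H, or between W and m
```

Inside a run of a single direction both functions return a single value, which is why the ambiguity only shows up at run boundaries.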
Bidirectional text is stored logically, and (obviously) displayed visually. The caret, being a graphical element, corresponds to a visual location. The user can manipulate text and move the caret through a combination of logical functions (such as typing or deleting) and visual functions (such as using the arrow keys). Therefore, the problem of mapping between logical and visual positions in a way that will meet the expectations of the user is the central problem of bidirectional editing.
At this point, I would like to recommend reading Guidelines of a Logical User Interface (UI) for Editing Bidirectional Text by Matitiahu Allouche of IBM. This document presents a method for dealing with the problems associated with bidi editing. It contains some useful definitions, as well as a detailed description of a logical-to-visual mapping algorithm (in the "Conversion of cursor positions" section). This document is the basis of the current Mozilla bidi editing implementation (which I'll describe below). I'd like to thank Simon Montagu for introducing me to this document, which I'll hereafter refer to as "the IBM document" (or "the IBM algorithm").
The current Mozilla implementation
This section describes my understanding of the current system. It might contain inaccuracies. If you spot any, please let me know.
Mozilla represents the caret location internally as a (collapsed) selection, which consists of:
- A logical position inside the content tree (i.e. a content node and an offset into that node).
- A "hint", which is a boolean value indicating whether the caret should be drawn adjacent to the character immediately preceding it, or to the character immediately following it (in logical order).
The "hint" mechanism was originally devised as a way of indicating where to display the caret when it is at the end of a wrapping line. When arriving from the left (in LTR text), using the right arrow key, the caret should be displayed at the end of the line (after the space at which the line wraps). When the right arrow key is pressed again, the caret should move to the beginning of the following line. Note that the caret remains in the same logical position, so this is a simple (non-bidi) case where one logical position can be mapped to two visual positions.
When bidi support was added to Mozilla, the "hint" mechanism's role expanded to handle other cases where one logical position maps to two visual positions, as described in the previous section.
When required to display the caret, the system examines the selection object and invokes the IBM algorithm in order to determine the visual position at which the caret will be displayed. When the logical position is in the middle of an LTR run of characters or an RTL run of characters, there is no ambiguity and the algorithm is trivial. Things get interesting only when the logical caret position is between runs of different directions or, more accurately, between runs of different bidi embedding levels.
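As an illustration of the (logical position, hint) representation, here is a minimal sketch of how such a pair could be resolved to a single visual slot. It is not Mozilla's actual data structure or the IBM algorithm itself: the names Caret, Hint and caret_slot are hypothetical, and the toy bidi model (LTR paragraph, no neutrals, uppercase = RTL) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Hint(Enum):
    PREVIOUS = "previous"  # draw the caret next to the logically preceding character
    NEXT = "next"          # draw the caret next to the logically following character


@dataclass(frozen=True)
class Caret:
    offset: int  # logical offset into the text
    hint: Hint


def visual_order(text):
    """Logical indices in display order (toy model: uppercase runs are reversed)."""
    order, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j].isupper() == text[i].isupper():
            j += 1
        run = range(i, j)
        order.extend(reversed(run) if text[i].isupper() else run)
        i = j
    return order


def caret_slot(text, caret):
    """Visual slot (gap index, 0 = left edge) at which the caret is drawn."""
    if not text:
        return 0
    where = {logical: visual for visual, logical in enumerate(visual_order(text))}
    # At the very start or end of the text only one neighbour exists.
    use_previous = (caret.hint is Hint.PREVIOUS and caret.offset > 0) \
        or caret.offset == len(text)
    if use_previous:
        ch = caret.offset - 1
        # Trailing edge: right of an LTR character, left of an RTL character.
        return where[ch] if text[ch].isupper() else where[ch] + 1
    ch = caret.offset
    # Leading edge: left of an LTR character, right of an RTL character.
    return where[ch] + 1 if text[ch].isupper() else where[ch]


text = "latinHEBREWmore"
print(caret_slot(text, Caret(5, Hint.PREVIOUS)))  # 5  (latin|WERBEHmore)
print(caret_slot(text, Caret(5, Hint.NEXT)))      # 11 (latinWERBEH|more)
```

In the middle of a run the hint makes no difference, which matches the observation that the mapping is trivial there.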
When the user types a character, that character is inserted into the text stream at the logical insertion point. The insertion point is then moved to after the new character, and the new visual caret position is determined again according to the IBM logical-to-visual mapping algorithm.
When the user attempts to delete a character (either forward, using the "delete" key, or backwards, using the "backspace" key), things get a bit more complicated. In our previous example (latinHEBREWmore), suppose that the logical insertion point is between n and H, and that the visual mapping currently chosen for this position is the first one mentioned above, i.e. after (to the right of) the n. Visually, this looks like this: latin|WERBEHmore (where | represents the caret). Suppose now that the user presses the (forward) "delete" key. Since deleting is a logical function, the character that should be deleted is the one that logically follows the insertion point, that is, H. However, notice that the caret isn't currently displayed adjacent to that character! So deleting the H would likely be confusing and unexpected. IBM's algorithm handles this by specifying that in this case no deletion is actually performed; instead, the logical insertion point is mapped to the alternative visual caret position, so that the caret appears between H and m, indicating to the user that another press of the "delete" key will delete the H.
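The forward-delete behaviour just described can be sketched as follows. This is only a reading of the description above, not the IBM document's exact rules: the function names are hypothetical, the hint is a plain string ("previous" or "next"), the hint kept after a real deletion is an arbitrary choice of this sketch, and the toy bidi model (LTR paragraph, no neutrals, uppercase = RTL) is an assumption.

```python
def visual_order(text):
    """Logical indices in display order (toy model: uppercase runs are reversed)."""
    order, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j].isupper() == text[i].isupper():
            j += 1
        run = range(i, j)
        order.extend(reversed(run) if text[i].isupper() else run)
        i = j
    return order


def caret_slot(text, offset, hint):
    """Visual slot for a logical offset plus a "previous"/"next" hint."""
    if not text:
        return 0
    where = {logical: visual for visual, logical in enumerate(visual_order(text))}
    if (hint == "previous" and offset > 0) or offset == len(text):
        ch = offset - 1
        return where[ch] if text[ch].isupper() else where[ch] + 1
    ch = offset
    return where[ch] + 1 if text[ch].isupper() else where[ch]


def forward_delete(text, offset, hint):
    """Handle one press of "delete"; returns (text, offset, hint)."""
    if offset >= len(text):
        return text, offset, hint  # nothing follows the insertion point
    ambiguous = caret_slot(text, offset, "previous") != caret_slot(text, offset, "next")
    if ambiguous and hint == "previous":
        # The caret is drawn away from the character that would be deleted:
        # only re-map the caret, so the next press deletes visibly.
        return text, offset, "next"
    # The caret is drawn next to text[offset]; delete it.
    return text[:offset] + text[offset + 1:], offset, "next"


state = ("latinHEBREWmore", 5, "previous")  # displayed as latin|WERBEHmore
state = forward_delete(*state)
print(state)  # ('latinHEBREWmore', 5, 'next'): caret jumps, nothing deleted
state = forward_delete(*state)
print(state)  # ('latinEBREWmore', 5, 'next'): the H is deleted
```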
Actually, typing is not as simple as described above. Consider the same scenario as above, but this time, instead of pressing "delete", the user types a Hebrew letter (let's say X). This letter is inserted (logically) between the n and the H, which means that it will appear visually to the right of the H: latinWERBEHXmore. Notice that the newly-inserted letter appears away from the visual caret position! IBM's algorithm tries to address this issue in two ways:
- It tries to ensure that the keyboard language selection will match the current logical-to-visual mapping. For example, the situation described above will occur when the caret arrived at that position by using the right-arrow to move past the word latin. In this case, the IBM algorithm specifies that the keyboard layout should be set to an LTR layout, so the user will not be able to type a Hebrew letter without manually switching the keyboard layout to Hebrew (similarly, the other visual mapping for the same logical position is triggered by left-arrowing over the Hebrew word, which would set the keyboard layout to Hebrew). Note that this part of the IBM algorithm is currently not implemented by Mozilla - see bug 162242.
- When the user does manually switch the keyboard layout, the system adjusts the visual positioning of the caret to match the position in which the next expected character will actually appear. In the example above, if the user switches the keyboard layout from English to Hebrew, the logical-to-visual mapping will be switched so that the caret will be displayed between the H and the m, so when the X is typed, it appears at the caret's location.
Note that the combination of these two methods still doesn't solve the problem entirely. It's possible for the user to type LTR characters (such as numbers) when the keyboard layout is Hebrew, or to type neutral characters, which will become part of an RTL run, while the keyboard layout is English. I'll get back to this in the next section.
Moving the caret
I'll focus here on using the left and right arrow keys (without modifiers) to move the caret. There are other methods of moving the caret, but for the purposes of this document they will be ignored.
When the user presses an arrow key (left or right), the following process is initiated:
- The system determines the logical position associated with the visual position to the left (or right) of the current visual position (which is derived from the current logical position and the current hint). In addition to logical position, the system also determines the new value for the "hint", to ensure the caret is indeed painted in the intended visual position. The new logical position and hint are stored in the selection object.
- The system then uses the logical-to-visual mapping algorithm to map the new selection object (logical position + hint) back to a visual position, where the caret is displayed.
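The two steps above can be sketched as a round trip: resolve the current (logical position, hint) to a visual slot, step one slot left or right, then pick a new (logical position, hint) that maps back to the new slot. The rule used here for picking the new pair (associate the caret with the character it has just passed over) is an assumption of this sketch, since the actual rule is not spelled out in this document. The names are hypothetical and the toy bidi model (LTR paragraph, no neutrals, uppercase = RTL) is assumed.

```python
def visual_order(text):
    """Logical indices in display order (toy model: uppercase runs are reversed)."""
    order, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j].isupper() == text[i].isupper():
            j += 1
        run = range(i, j)
        order.extend(reversed(run) if text[i].isupper() else run)
        i = j
    return order


def caret_slot(text, offset, hint):
    """Logical-to-visual step: slot for a logical offset plus hint."""
    if not text:
        return 0
    where = {logical: visual for visual, logical in enumerate(visual_order(text))}
    if (hint == "previous" and offset > 0) or offset == len(text):
        ch = offset - 1
        return where[ch] if text[ch].isupper() else where[ch] + 1
    ch = offset
    return where[ch] + 1 if text[ch].isupper() else where[ch]


def move(text, offset, hint, step):
    """Arrow key: step = +1 for right, -1 for left. Returns (offset, hint)."""
    order = visual_order(text)
    slot = caret_slot(text, offset, hint)
    if not 0 <= slot + step <= len(text):
        return offset, hint  # already at the edge of the text
    # Visual-to-logical step: look at the character the caret passes over.
    passed = order[slot] if step > 0 else order[slot - 1]
    rtl = text[passed].isupper()
    # Moving right over LTR, or left over RTL, is a logically forward move.
    if (step > 0) != rtl:
        return passed + 1, "previous"
    return passed, "next"


text = "latinHEBREWmore"
print(move(text, 4, "next", +1))      # (5, 'previous'): latin|WERBEHmore
print(move(text, 6, "previous", +1))  # (5, 'next'):     latinWERBEH|more
```

Both calls end at logical offset 5, yet with different hints, so the caret is drawn in two different places for the same logical position.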
Problems with the current implementation
The main problems with the current system should be evident from my description of the system above. I'll re-iterate them briefly:
- The system does not always behave as the user expects:
- In the case of typing, the system (even when fully implemented) does not ensure that the typed character will appear at the location of the caret. The result could be confusing and even frustrating for the user. See bug 300004, and, specifically, the second testcase attached to it.
- In the case of deleting, when the caret is not adjacent to the to-be-deleted character, the system's solution is to not actually delete a character, but to move the caret (possibly a long distance!) to the position where the deletion would have taken place. This is likely not what the user expects. The user expects a character visually adjacent to the caret to be deleted.
- When switching keyboard layouts, the caret might move to a different position. This, again, is unexpected from the point of view of the user, who would expect the text being typed to be inserted at the caret position even if the typing is preceded by switching the keyboard layout.
The process used by the system to perform visual functions (such as responding to the right or left arrow keys) is extremely complicated, as it involves visual-to-logical mapping followed by logical-to-visual mapping, both of which are ambiguous, complex tasks (and the first of which is undocumented, as far as I know). All of this is to achieve a seemingly simple result: moving the caret visually (e.g.) one place to the left. The complexity of this process makes its implementing code bug-prone and difficult to maintain, as the many dependencies of bug 207186 will attest. This is now mostly cleared up.
A proposal for an alternative system
In this section I'll present an alternative approach to implementing bidi editing (with visual caret movement). I won't go into implementation details, but I'll sketch the basic principles.
The system I propose has two modes: logical mode and visual mode. The system is placed in logical mode following any logical function performed by the user (such as typing or deleting), and is similarly placed in visual mode following any visual function performed by the user (such as pressing arrow keys, or clicking anywhere in the text).
When the system is in logical mode, it stores the location of the caret "logically", i.e. as an offset to the (logically stored) text. When it is in visual mode, it stores the caret location visually, i.e. relative to the text as it is presented on the screen.
While the system is in logical mode, and the user performs logical functions, the system operates very much like the current one: typed characters are inserted at the logical insertion point; pressing "delete" (or "backspace") deletes the character logically following (or preceding) the current insertion point. Since the caret has to be drawn on the screen, its visual position must be determined following each logical function. This is done as with the current system, using IBM's algorithm (or a somewhat simplified variant thereof).
Switching from logical to visual mode
When switching from logical to visual mode (that is, when the user performs a visual function while the system is in logical mode), the visual position of the caret is determined as above, and becomes the base position for the performed visual function.
While in visual mode, only the visual position of the caret is tracked. Performing visual functions in this mode is simple (the caret is moved using a visual representation of the document), and no visual-to-logical or logical-to-visual mappings are performed.
Note: When I use the term "visual position", I do not mean a "geometrical position", expressed in terms of coordinates on the screen. Rather, I'm referring to a position within the stream of text as it is rendered, i.e. between two characters which, according to the rules of the bidi layout algorithm, end up being displayed next to each other.
Switching from visual to logical mode
When switching from visual to logical mode, the logical position of the caret is determined based on its stored visual position and on the logical function which is performed.
The last point is at the heart of this system, so I'll elaborate on it:
Let's take our example of latinHEBREWmore again, displayed as latinWERBEHmore. Suppose the user visually moved the caret to between the n and the W (i.e. by using the arrow keys or the mouse). Now, suppose the user types an LTR character (x). The system has to switch to logical mode, that is, to map the visual position to a logical one. As we recall, this is ambiguous: the logical positions after the n and after the W both correspond to the current visual position. However, since at this stage the system knows that the user typed an LTR character, it will prefer the logical position following the n, and the result will be logically latinxHEBREWmore and visually latinx|WERBEHmore.
Conversely, if, at the same visual location, the user types an RTL character (X), the system will prefer the other logical position mapped to this visual position, and the result will be logically latinHEBREWXmore, and visually latin|XWERBEHmore.
In both cases, the result will likely be what the user expected. Notice that we are able to do this because, unlike the current system, which tries to "guess" what character the user will type next (based, e.g., on the keyboard layout), the proposed system only resolves the visual-to-logical ambiguity when it has all the information, i.e. when it knows what character was, in fact, typed.
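The typing rule described above can be sketched like this: the caret is held as a visual slot, and only when a character arrives is a logical insertion point chosen, preferring the one tied to the visual neighbour whose direction matches the typed character. This is a minimal sketch under the same toy assumptions as the examples in this document (LTR paragraph, no neutrals, uppercase = RTL); type_char and the other names are hypothetical.

```python
def visual_order(text):
    """Logical indices in display order (toy model: uppercase runs are reversed)."""
    order, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j].isupper() == text[i].isupper():
            j += 1
        run = range(i, j)
        order.extend(reversed(run) if text[i].isupper() else run)
        i = j
    return order


def display(text):
    return "".join(text[i] for i in visual_order(text))


def type_char(text, slot, ch):
    """Insert ch at visual slot `slot`; returns (new_text, new_logical_offset)."""
    order = visual_order(text)
    options = []  # (neighbour_is_rtl, logical position tied to that neighbour)
    if slot > 0:
        left = order[slot - 1]
        rtl = text[left].isupper()
        options.append((rtl, left if rtl else left + 1))
    if slot < len(text):
        right = order[slot]
        rtl = text[right].isupper()
        options.append((rtl, right + 1 if rtl else right))
    if not options:
        position = 0  # empty text
    else:
        # Prefer the neighbour whose direction matches the typed character.
        matching = [pos for rtl, pos in options if rtl == ch.isupper()]
        position = matching[0] if matching else options[0][1]
    return text[:position] + ch + text[position:], position + 1


text = "latinHEBREWmore"                # caret drawn as latin|WERBEHmore (slot 5)
ltr_text, _ = type_char(text, 5, "x")
rtl_text, _ = type_char(text, 5, "X")
print(ltr_text, display(ltr_text))      # latinxHEBREWmore latinxWERBEHmore
print(rtl_text, display(rtl_text))      # latinHEBREWXmore latinXWERBEHmore
```

In both cases the new character is displayed at the slot where the caret was. When neither neighbour matches (for example, an LTR character typed in the middle of an RTL run), the sketch falls back to the only available logical position.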
Now consider deletion. In the above example, suppose that, with the caret visually positioned between n and W, the user presses the "backspace" key. In this case, the system can use the paragraph direction (which we'll assume is LTR) to determine that the expected result is deleting the character on the left (the n). So the logical position selected will be that between n and H.
Pressing "delete" in the same position should just do nothing, as there is no character which is both adjacent to the caret and logically following any of the logical positions which the caret's visual position can be mapped to.
In a more realistic case, when the character on one of the sides of the caret is of neutral directionality (e.g., a space), and the other is of strong directionality, the directionality of the strong character can be used instead of the paragraph direction to determine the direction of deletion. So for the following visual setting: latin |WERBEH, pressing "backspace" will delete the W (while "delete" will just do nothing).
In any event, the deleted character will always be visually adjacent to the caret.
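The deletion rules can be sketched in the same way. A visual neighbour of the caret is a candidate for "backspace" if the caret sits at its trailing edge, and for "delete" if the caret sits at its leading edge; when both neighbours qualify, the paragraph direction breaks the tie. This sketch covers only strong characters (the neutral-character refinement is not modelled), and the tie-break for forward delete is an assumption made by symmetry with the backspace case. The names are hypothetical and the toy bidi model (no neutrals, uppercase = RTL) is assumed.

```python
def visual_order(text):
    """Logical indices in display order (toy model: uppercase runs are reversed)."""
    order, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j].isupper() == text[i].isupper():
            j += 1
        run = range(i, j)
        order.extend(reversed(run) if text[i].isupper() else run)
        i = j
    return order


def delete_at(text, slot, key, paragraph_rtl=False):
    """Apply "backspace" or "delete" at a visual slot.

    Returns (new_text, new_logical_offset), or (text, None) if no character
    adjacent to the caret qualifies, in which case nothing is deleted.
    """
    order = visual_order(text)
    targets = {}  # side -> logical index of the deletable neighbour
    for side, visual_index in (("left", slot - 1), ("right", slot)):
        if not 0 <= visual_index < len(text):
            continue
        neighbour = order[visual_index]
        rtl = text[neighbour].isupper()
        # The caret is at the trailing edge of an LTR character on its left,
        # or of an RTL character on its right.
        at_trailing_edge = (side == "left") != rtl
        if at_trailing_edge == (key == "backspace"):
            targets[side] = neighbour
    if not targets:
        return text, None
    if len(targets) == 2:
        # Tie-break by paragraph direction (forward-delete case is assumed).
        backward_side = "right" if paragraph_rtl else "left"
        forward_side = "left" if paragraph_rtl else "right"
        side = backward_side if key == "backspace" else forward_side
    else:
        side = next(iter(targets))
    target = targets[side]
    return text[:target] + text[target + 1:], target


text = "latinHEBREWmore"                  # caret drawn as latin|WERBEHmore (slot 5)
print(delete_at(text, 5, "backspace"))    # ('latiHEBREWmore', 4): the n is deleted
print(delete_at(text, 5, "delete"))       # ('latinHEBREWmore', None): nothing happens
```

By construction, the only characters ever considered for deletion are the two visual neighbours of the caret.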
Note on implementation
We currently have no direct method for tracking the caret's visual position (which is necessary for implementing visual mode). Adding such a method might be complicated and cumbersome. However, adding a separate representation of the caret's position is not really required, if we are willing to give up the "simplification" benefits of the proposed system.
Notice that a logical position, when coupled with a hint, does uniquely correspond to a single visual position. So a visual position can be represented in terms of (logical position, hint).
This will make visual mode work very much like the current system works - for each visual function performed, the current logical position and hint will be considered as a visual position; the visual function will be performed based on that visual position; and the resulting visual position will be converted back to a logical position and a hint.
The difference between the current system and the proposed one will only be when switching from visual to logical mode. Instead of ignoring the hint, and performing the logical function based on the current logical position (as the current system does), the proposed system will use the hint to determine the current visual position, and will then, if that visual position is logically ambiguous, use the information about the logical function itself in order to choose at which of the possible logical positions the function will be applied.
Implementing the proposed system this way should be relatively easy given the current system. No new data structures or tracking mechanisms are required, but only new logic for converting a logical position, a hint, and a logical function to a new logical position.
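To make the last point concrete, here is a minimal sketch of the "new logic" for the typing case: it takes the existing (logical position, hint) pair plus the typed character and returns the logical position at which to insert. It first uses the hint to recover the visual slot and then, if that slot is ambiguous, lets the direction of the typed character decide. As before, this is a sketch under toy assumptions (LTR paragraph, no neutrals, uppercase = RTL), and insertion_offset and the other names are hypothetical.

```python
def visual_order(text):
    """Logical indices in display order (toy model: uppercase runs are reversed)."""
    order, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j].isupper() == text[i].isupper():
            j += 1
        run = range(i, j)
        order.extend(reversed(run) if text[i].isupper() else run)
        i = j
    return order


def caret_slot(text, offset, hint):
    """Visual slot for a logical offset plus a "previous"/"next" hint."""
    if not text:
        return 0
    where = {logical: visual for visual, logical in enumerate(visual_order(text))}
    if (hint == "previous" and offset > 0) or offset == len(text):
        ch = offset - 1
        return where[ch] if text[ch].isupper() else where[ch] + 1
    ch = offset
    return where[ch] + 1 if text[ch].isupper() else where[ch]


def insertion_offset(text, offset, hint, ch):
    """Logical offset at which typed character ch should be inserted."""
    if not text:
        return 0
    order = visual_order(text)
    slot = caret_slot(text, offset, hint)  # the hint is used, not ignored
    fallback = None
    for side, visual_index in (("left", slot - 1), ("right", slot)):
        if not 0 <= visual_index < len(text):
            continue
        neighbour = order[visual_index]
        rtl = text[neighbour].isupper()
        at_trailing_edge = (side == "left") != rtl
        position = neighbour + 1 if at_trailing_edge else neighbour
        if fallback is None:
            fallback = position
        if rtl == ch.isupper():
            return position  # neighbour direction matches the typed character
    return fallback


text = "latinHEBREWmore"
# Caret stored as (5, "previous"), i.e. drawn as latin|WERBEHmore.
print(insertion_offset(text, 5, "previous", "x"))  # 5
print(insertion_offset(text, 5, "previous", "X"))  # 11
```

For the RTL character, inserting at the stored offset 5 would put the X on the far side of the Hebrew word, whereas offset 11 puts it at the caret. No state beyond the existing offset and hint is needed to make that choice.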