Commit 6903dde

Update line-col-tracking documentation.
--HG-- extra : convert_revision : svn%3Aacbfec75-9323-0410-a652-858a13e371e0/trunk%401291
1 parent f8b30a5 commit 6903dde

1 file changed

+7
-31
lines changed

docs/line-col-tracking.txt

@@ -1,33 +1,9 @@
 Line and column number tracking
 
-The first thing is that it's trivial to do in-efficiently. The second thing
-is that it's nontrivial to do it efficiently. We have two competing interests
-at stake:
-
-1. Code brevity and comprehensibility, which means that we factor things out,
-make methods, sub-calls, a stream class, etc. This is what the Python and
-Ruby implementations did.
-2. Performance, which means avoiding function calls like the plague.
-
-This is mildly manageable, until you realize that column numbers are tracked
-as per Unicode characters, not bytes. Which means that if we go for (2), we
-will end up with a large amount of duplicated code. Further complicating issues
-is the fact that the initial implementation PH5P opted to make function calls
-to retrieve characters. I personally find this unacceptable.
-
-It should be noted, however, that we're already performing a dynamic function
-lookup on every character of HTML, which is ALSO unacceptable. One possible
-way of restructuring this is turning the entire thing into a giant loop with
-a giant conditional. Note that the conditional is evaluated sequentially, so
-let's test whether or not that's more expensive.
-
-Some surprising results: the cost it takes to perform all of those string
-comparisons dwarfs the cost from calling functions. If you convert the states
-into integers, however, having a gigantic loop is slightly faster SO LONG AS
-you use the loop to get rid of a function call. However, if we pull the common
-code out of the individual state functions and place it in the looper, things
-work nicely.
-
-The conclusion, I suppose, is that we're going to keep the method-based state
-machine and save on method calls by moving the common code outside. Let's do
-this right now.
+The original implementation of this performed line and column tracking
+in place. However, it was found that this approximately doubled the
+runtime of tokenization, so we decided to take a more optimistic approach:
+only calculate line/column numbers when explicitly asked to. This
+is slower if we attempt to calculate line/column numbers for everything
+in the document, but if there is a small enough number of errors it
+is a great improvement.
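
The on-demand approach described in the added lines can be sketched as follows. This is a minimal illustration in Python, not the project's actual code: the function name `line_col` and the idea of storing only a character offset per token are assumptions made for the example. The tokenizer would record offsets cheaply, and the line/column pair would be computed only when an error needs to be reported.

```python
from typing import Tuple


def line_col(text: str, offset: int) -> Tuple[int, int]:
    """Return the 1-based (line, column) of a character offset in text.

    Hypothetical helper: nothing is tracked during tokenization, so the
    full cost of the scan is paid only when a position is requested.
    """
    if not 0 <= offset <= len(text):
        raise ValueError("offset out of range")
    # Count the newlines that occur before the offset.
    line = text.count("\n", 0, offset) + 1
    # rfind returns -1 when no newline precedes the offset, which makes
    # the column of the first line come out as offset + 1.
    last_newline = text.rfind("\n", 0, offset)
    column = offset - last_newline
    # Python strings index by code point, so columns count characters,
    # not bytes.
    return line, column
```

Each call rescans the text from the start, which is why computing positions for every token in a document would be slower than tracking them in place, while a handful of error lookups stays cheap.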
