 Line and column number tracking
 
-The first thing is that it's trivial to do in-efficiently. The second thing
-is that it's nontrivial to do it efficiently. We have two competing interests
-at stake:
-
-1. Code brevity and comprehensibility, which means that we factor things out,
-   make methods, sub-calls, a stream class, etc. This is what the Python and
-   Ruby implementations did.
-2. Performance, which means avoiding function calls like the plague.
-
-This is mildly manageable, until you realize that column numbers are tracked
-as per Unicode characters, not bytes. Which means that if we go for (2), we
-will end up with a large amount of duplicated code. Further complicating issues
-is the fact that the initial implementation PH5P opted to make function calls
-to retrieve characters. I personally find this unacceptable.
-
-It should be noted, however, that we're already performing a dynamic function
-lookup on every character of HTML, which is ALSO unacceptable. One possible
-way of restructuring this is turning the entire thing into a giant loop with
-a giant conditional. Note that the conditional is evaluated sequentially, so
-let's test whether or not that's more expensive.
-
-Some surprising results: the cost it takes to perform all of those string
-comparisons dwarfs the cost from calling functions. If you convert the states
-into integers, however, having a gigantic loop is slightly faster SO LONG AS
-you use the loop to get rid of a function call. However, if we pull the common
-code out of the individual state functions and place it in the looper, things
-work nicely.
-
-The conclusion, I suppose, is that we're going to keep the method-based state
-machine and save on method calls by moving the common code outside. Let's do
-this right now.
+The original implementation of this performed line and column tracking
+in place. However, it was found that this approximately doubled the
+runtime of tokenization, so we decided to take a more optimistic approach:
+only calculate line/column numbers when explicitly asked to. This
+is slower if we attempt to calculate line/column numbers for everything
+in the document, but if there is a small enough number of errors it
+is a great improvement.
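
The on-demand approach described in the added text can be sketched roughly as follows. This is only an illustration in Python, not the project's actual code: the function name `line_and_column`, the 1-based line / 0-based column convention, and the assumption that the tokenizer remembers a UTF-8 byte offset for each error are all hypothetical. The column is counted in Unicode characters rather than bytes.

```python
from typing import Tuple


def line_and_column(data: bytes, offset: int) -> Tuple[int, int]:
    """Compute (line, column) for a byte offset, only when someone asks.

    Assumptions (not taken from the original project): `data` is UTF-8,
    `offset` falls on a character boundary, lines are 1-based and
    columns are 0-based and counted in characters, not bytes.
    """
    if not 0 <= offset <= len(data):
        raise ValueError("offset out of range")
    # Line number: one more than the number of newlines before the offset.
    line = data.count(b"\n", 0, offset) + 1
    # Start of the current line: just past the previous newline
    # (rfind returns -1 when there is none, which gives 0).
    line_start = data.rfind(b"\n", 0, offset) + 1
    # Column: decode only the current line's prefix and count characters.
    column = len(data[line_start:offset].decode("utf-8", errors="replace"))
    return line, column


if __name__ == "__main__":
    doc = "héllo\nwörld <b>".encode("utf-8")
    print(line_and_column(doc, doc.index(b"<")))
```

Each lookup rescans the input up to the offset, so the cost is paid per query rather than per character of the document, which matches the trade-off stated above: cheap when few positions are requested, slower when every position is.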