Commit d692006
feat(chunking): list-aware break point scanner
Replaces the two naive list patterns in BREAK_PATTERNS with a
stack-based scanner that tracks nested list frames and emits
depth-weighted break points plus a list-end transition break point.
Old behavior:
[/\n[-*]\s/g, 5, 'list']
[/\n\d+\.\s/g, 5, 'numlist']
Both scored every list-item start at 5, so the break point almost
always lost to nearby heading/blank/codeblock scores and chunks
landed mid-item on long lists. Nested sublists and the ordered `1)`
form were not detected at all.
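To make the gap concrete, the sketch below runs the two removed patterns against a small sample. The patterns are copied from the lines above; the sample text and the `matchPositions` helper are hypothetical and exist only for this illustration.

```typescript
// The two patterns this commit removes, as quoted under "Old behavior".
const oldUnordered = /\n[-*]\s/g;
const oldOrdered = /\n\d+\.\s/g;

// Hypothetical sample: a top-level item, a nested item, and both ordered forms.
const sample = "intro\n- top\n  - nested\n1. first\n2) second\n";

// Collects the start index of every match (hypothetical helper).
function matchPositions(re: RegExp, text: string): number[] {
  const positions: number[] = [];
  const scanner = new RegExp(re.source, "g"); // fresh copy so lastIndex starts at 0
  let m: RegExpExecArray | null;
  while ((m = scanner.exec(text)) !== null) {
    positions.push(m.index);
  }
  return positions;
}

const unorderedHits = matchPositions(oldUnordered, sample); // only "\n- top"
const orderedHits = matchPositions(oldOrdered, sample); // only "\n1. first"

console.log(unorderedHits, orderedHits);
```

The indented `- nested` line and the `2)` item produce no match, which is the detection gap described above.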
New scanner (findListBreakPoints):
- depth 0 item (top-level): score 70
- depth 1 item (first sublist): score 45
- depth 2+ item (deeper): score 25
- list-end (list -> non-list transition): score 75
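Read as code, the score table above amounts to a small lookup. A minimal sketch: `listItemScore` and `LIST_END_SCORE` are hypothetical names, and only the numeric values come from this message.

```typescript
// Score for a list-end (list -> non-list) transition, per the table above.
const LIST_END_SCORE = 75;

// Maps a zero-based nesting depth to the break point score for an item start.
function listItemScore(depth: number): number {
  if (depth <= 0) return 70; // top-level item
  if (depth === 1) return 45; // first sublist
  return 25; // depth 2 and deeper
}

console.log([0, 1, 2, 5].map(listItemScore), LIST_END_SCORE);
```

Because the list-end score is higher than any item score, and scores fall with depth, a chunker that prefers higher scores would tend to split after a whole list first and between top-level items second.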
Scope:
- Unordered markers: `-`, `*` (matches previous behavior; `+` is not
  supported because agent output and modern docs rarely use it)
- Ordered markers: `1.` and `1)` (new: `1)` was never detected)
- Mixed marker characters at the same indent are treated as one
  list (simpler than CommonMark's split rule, better for chunking)
- Nested sublists with proper depth tracking (new)
- Blank lines inside items don't terminate the list
- Column-0 non-list lines terminate the list and emit list-end
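The scope rules above can be pictured with a simplified stack-based scan. This is a sketch under stated assumptions, not the actual findListBreakPoints: the `BreakPoint` shape, the `ITEM_RE` pattern, the function name, and the choice to report each break point at the offset where its line starts are all hypothetical. It also treats any indented non-list line as continuation and skips list-end at EOF.

```typescript
// Hypothetical break point shape; the real type may differ.
interface BreakPoint {
  pos: number;
  score: number;
  type: string;
}

// Assumed item shape: optional spaces, then `-`, `*`, `1.` or `1)`, then one space.
const ITEM_RE = /^( *)(?:[-*]|\d+[.)]) /;

function sketchListBreakPoints(text: string): BreakPoint[] {
  const points: BreakPoint[] = [];
  const indentStack: number[] = []; // one indent width per open list frame
  let offset = 0; // offset of the current line's first character

  for (const line of text.split("\n")) {
    const item = ITEM_RE.exec(line);
    if (item) {
      const indent = item[1].length;
      // Close frames that are indented deeper than this item.
      while (indentStack.length > 0 && indentStack[indentStack.length - 1] > indent) {
        indentStack.pop();
      }
      // Open a new frame when this item is deeper than the current frame.
      if (indentStack.length === 0 || indentStack[indentStack.length - 1] < indent) {
        indentStack.push(indent);
      }
      const depth = indentStack.length - 1;
      const score = depth === 0 ? 70 : depth === 1 ? 45 : 25;
      points.push({ pos: offset, score: score, type: "list-item" });
    } else if (indentStack.length > 0 && line.length > 0 && line[0] !== " ") {
      // Column-0 non-list line: the list ends here.
      indentStack.length = 0;
      points.push({ pos: offset, score: 75, type: "list-end" });
    }
    // Blank lines and indented continuation lines leave the stack untouched.
    offset += line.length + 1;
  }
  return points;
}

const demo = sketchListBreakPoints("intro\n- a\n  - b\n    - c\n\n- d\nprose\n");
console.log(demo);
```

Mixed markers at one indent share a frame here because the stack is keyed on indent width alone, which mirrors the "treated as one list" rule.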
Deliberately deferred:
- Loose vs tight list distinction (rendering concern, no chunking
  impact)
- Lazy continuation (column-0 line that CommonMark folds back into
  the preceding item)
- 4-space indented code blocks inside items (ambiguous with
  continuation)
- Tab-as-marker-separator (`-\t`). Tab indentation was matched by
  neither old nor new; note that the old patterns' `\s` did accept a
  tab after a column-0 marker, so that one case is no longer matched
Integration: chunkDocument and chunkDocumentAsync now merge
findListBreakPoints output with scanBreakPoints before passing to
chunkDocumentWithBreakPoints. mergeBreakPoints already handles
"higher score wins at same position." AST points continue to layer
on top in the async path.
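The "higher score wins at same position" rule can be pictured as follows. This is a hypothetical stand-in, not the repository's mergeBreakPoints: the `BreakPoint` field names, the `mergeByPosition` name, and the sample values are assumptions made for the illustration.

```typescript
// Hypothetical break point shape; the real type may differ.
interface BreakPoint {
  pos: number;
  score: number;
  type: string;
}

// Keeps one break point per position, preferring the higher score.
function mergeByPosition(a: BreakPoint[], b: BreakPoint[]): BreakPoint[] {
  const best: { [pos: number]: BreakPoint } = {};
  for (const point of a.concat(b)) {
    const existing = best[point.pos];
    if (existing === undefined || point.score > existing.score) {
      best[point.pos] = point;
    }
  }
  return Object.keys(best)
    .map((key) => best[Number(key)])
    .sort((x, y) => x.pos - y.pos);
}

// Made-up inputs standing in for scanBreakPoints and findListBreakPoints output.
const scanPoints: BreakPoint[] = [
  { pos: 10, score: 20, type: "blank" },
  { pos: 25, score: 90, type: "heading" },
];
const listPoints: BreakPoint[] = [
  { pos: 10, score: 70, type: "list-item" },
  { pos: 40, score: 75, type: "list-end" },
];

const merged = mergeByPosition(scanPoints, listPoints);
console.log(merged);
```

At position 10 the list-item point replaces the lower-scored one, while points at other positions pass through unchanged.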
16 new tests in test/store.test.ts covering empty input, prose,
unordered/ordered/mixed lists, three-deep nesting, mixed marker
nesting, list-end at prose and EOF, blank-line continuation, `+`
rejection, position convention, and an end-to-end integration test
through chunkDocument confirming long lists split at item boundaries.
1 parent 2458400 commit d692006
3 files changed
Lines changed: 294 additions & 14 deletions