
refactor: custom lexer #437


Open · wants to merge 14 commits into main

Conversation

@psteinroe (Collaborator) commented Jul 1, 2025

  • adds a new tokenizer crate that turns a string into simple tokens
  • adds a new lexer + lexer_codegen that uses the tokenizer to lex into a new SyntaxKind enum

the new implementation is

  • much more performant (no extra string allocations, no calls into a C library)
  • works with broken strings (!!!!)
  • custom-made to our use case (e.g. the LineEnding variant comes with a count; see the sketch below)
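To make the LineEnding-with-count point concrete, here is a minimal sketch; the variant shape and names are illustrative only, not the PR's actual definitions (the real SyntaxKind is generated by lexer_codegen):

// Illustrative only: one token for a run of newlines, carrying the count,
// instead of one token per newline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind {
    Ident,
    Semicolon,
    Whitespace,
    LineEnding { count: usize },
}

fn lex_newlines(input: &str) -> Vec<SyntaxKind> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\n' {
            // Collapse consecutive newlines into a single counted token.
            let mut count = 1;
            while chars.peek() == Some(&'\n') {
                chars.next();
                count += 1;
            }
            tokens.push(SyntaxKind::LineEnding { count });
        }
        // ...handling of all other token kinds elided
    }
    tokens
}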

in a follow-up, we will be able to:

  • parse custom parameters that popular tools use
  • pre-process to remove unsupported constructs
  • parse non-SQL content (e.g. commands) via a simple custom parser

todos:

  • use the new lexer in the splitter
  • make sure we support all the different parameter formats popular tools use -> will do it in a follow-up
  • tests

@psteinroe changed the title refactor: parser refactor: lexer Jul 1, 2025
@psteinroe requested a review from juleswritescode July 4, 2025 16:00
@psteinroe marked this pull request as ready for review July 4, 2025 16:00
@psteinroe changed the title refactor: lexer refactor: custom lexer Jul 4, 2025
@@ -24,6 +24,7 @@ biome_rowan = "0.5.7"
 biome_string_case = "0.5.8"
 bpaf = { version = "0.9.15", features = ["derive"] }
 crossbeam = "0.8.4"
+enum-iterator = "2.1.0"
Collaborator: we already have strum, I think it does the same thing
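For context, the two crates do overlap for this use case. A minimal sketch of each, using a throwaway enum (not the PR's types):

// enum-iterator 2.x: derive Sequence, enumerate with all().
#[derive(Debug, Clone, Copy, enum_iterator::Sequence)]
enum Demo {
    A,
    B,
}

fn all_variants() -> impl Iterator<Item = Demo> {
    enum_iterator::all::<Demo>()
}

// strum equivalent (with its default `derive` feature):
//
//   use strum::IntoEnumIterator;
//
//   #[derive(Debug, Clone, Copy, strum::EnumIter)]
//   enum Demo { A, B }
//
//   Demo::iter() then yields every variant.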

let mut ends_with_semicolon = false;

// Iterate through tokens in reverse to find the last non-whitespace token
for idx in (0..lexed.len()).rev() {
Collaborator: how about matches!(iter.filter(..).next_back(), Some(semi)) here?
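A sketch of that suggestion; the kind names are hypothetical stand-ins for whatever the PR actually uses:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind {
    Ident,
    Whitespace,
    Semicolon,
}

// True if the last non-whitespace token is a semicolon. Works without an
// explicit reverse loop because slice iterators (and their Filter
// adapters) are double-ended, so next_back() scans from the end.
fn ends_with_semicolon(tokens: &[SyntaxKind]) -> bool {
    matches!(
        tokens
            .iter()
            .filter(|k| **k != SyntaxKind::Whitespace)
            .next_back(),
        Some(&SyntaxKind::Semicolon)
    )
}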


/// Returns an iterator over token kinds
pub fn tokens(&self) -> impl Iterator<Item = SyntaxKind> + '_ {
    (0..self.len()).map(move |i| self.kind(i))
Collaborator: self.kind.iter().copied() ?
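That is, iterate the backing storage directly instead of indexing. A sketch, assuming the kinds live in a Vec<SyntaxKind> field (the indexed access suggests this, but it is an assumption):

struct Lexed {
    kind: Vec<SyntaxKind>, // assumed field, reusing the SyntaxKind sketch above
}

impl Lexed {
    // Iterator over token kinds without any index arithmetic.
    fn tokens(&self) -> impl Iterator<Item = SyntaxKind> + '_ {
        self.kind.iter().copied()
    }
}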


/// Returns the kind of token at the given index
pub fn kind(&self, idx: usize) -> SyntaxKind {
    assert!(idx < self.len());
Collaborator: do you want to add a message here? Otherwise it might be better to simply let the access on line 53 panic; then you at least get an index-out-of-bounds message, right?
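Both options from the comment, sketched against the assumed Lexed struct above:

impl Lexed {
    fn len(&self) -> usize {
        self.kind.len()
    }

    // Option 1: keep the assert but give it a descriptive message.
    fn kind(&self, idx: usize) -> SyntaxKind {
        assert!(
            idx < self.len(),
            "token index {idx} out of bounds (len {})",
            self.len()
        );
        self.kind[idx]
        // Option 2: drop the assert entirely; self.kind[idx] already
        // panics with an index-out-of-bounds message on its own.
    }
}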

.collect()
}

pub(crate) fn text_range(&self, i: usize) -> std::ops::Range<usize> {
Collaborator: I think in all cases the std::ops::Range<usize> gets mapped to a TextRange; maybe better to put the logic into range(..)?
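A sketch of that refactor using biome_rowan's TextRange. It assumes Lexed stores byte offsets in a start: Vec<u32> and that token i spans start[i]..start[i + 1]; both are inferred from range_text below, not confirmed:

use biome_rowan::{TextRange, TextSize};

impl Lexed {
    // Return a TextRange directly so call sites no longer map a
    // std::ops::Range<usize> themselves.
    fn range(&self, i: usize) -> TextRange {
        let lo: u32 = self.start[i];
        let hi: u32 = self.start[i + 1];
        TextRange::new(TextSize::from(lo), TextSize::from(hi))
    }
}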

Comment on lines +101 to +106
fn range_text(&self, r: std::ops::Range<usize>) -> &str {
    assert!(r.start < r.end && r.end <= self.len());
    let lo = self.start[r.start] as usize;
    let hi = self.start[r.end] as usize;
    &self.text[lo..hi]
}
Collaborator: as far as I can tell this is only used in text, so maybe put the logic there directly instead of adding an indirection?
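That inlining could look roughly like this (the caller's signature is assumed from the helper's; text is also the assumed name of the source-string field, which Rust allows alongside a method of the same name):

impl Lexed {
    // Text covered by tokens r.start..r.end, with the former range_text
    // logic folded into its single caller.
    fn text(&self, r: std::ops::Range<usize>) -> &str {
        assert!(r.start < r.end && r.end <= self.len());
        let lo = self.start[r.start] as usize;
        let hi = self.start[r.end] as usize;
        &self.text[lo..hi]
    }
}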
