-
My personal problems (or maybe it's better to say "worries") are mainly these two points:
The problem this discussion tries to solve is the smaller one for me - I'd go for Variant 4A for simplicity. It just requires building ctags and then copying the generated code, which isn't really a big problem. The generated blob would have to stay in the repository, but since it would typically be updated once per release, it shouldn't lead to big diffs. In any case, I don't think it's something the ctags project should worry about - it's really up to us what we want to do with ctags in Geany.
-
As @techee wrote, 4A may be the best. Making the .c and .h files is easy.
-
In my judgement only 4A, "build the .h/.c at upgrade", is viable, and if the clone of uctags has packcc that's fine. Pegof would need to be packaged by most distros, since it uses a different build tool and is mostly a one-person project, so that's a future improvement.

But as @techee said, we are somewhat paranoid about performance: the parsers need to run between keystrokes, and stuttering typing is pretty unacceptable. I was told that the generated toml.c is 5000 lines; that's 10% of the total of all the other parsers except C++, and more than a third of the C++ parser. Whilst the size is not a major indicator of speed - maybe only a tiny part of that code runs on each parse - then what is the rest there for? So the concern remains. And it would be compiled into everybody's Geany, whether they use TOML or not. And how big is kotlin.c? Maybe someone who has a copy of toml.c and kotlin.c (not pegof-ed) could post them to a gist so we can all see the gory details.

I can see both arguments about the reviewability of toml.c: @techee is right that it's a major risk vector, and @dolik-rce is right about trusting your compiler, although it's not a widely used one. So for me, since you are both correct, that doesn't make the decision.

Google says the Kotlin LSP is looking for a maintainer; maybe you could improve that, since Geany now has initial support for LSPs 😁
-
That's not C, it's a-cc-embler 😉 ... and TOTALLY unreviewable. But even though it seems pretty simple, it will still be fairly large when compiled.

One of the benefits of LSP is that it isn't compiled into Geany, so no matter how big one LSP is, it won't cost anything to someone not using that language. Whilst the "small and lightweight" Geany ship sailed a long time ago, we always have concerns that we should not expand Geany too much when there are regular posts from users on Raspberry Pi systems. Or a PR to support loadable DLL parsers instead would be a "good thing" ™️, so users only pay the price for the languages they use and we don't need to care how big any parser is. But that is probably a Geany thing, not a uctags thing. And kotlin.c was so big it broke gist 😄

Anyway, since uctags includes the packcc tool, it really doesn't have anything to do; it's up to Geany how it addresses the issue.

PS I was serious about the Kotlin LSP needing help: https://github.com/fwcd/kotlin-language-server?tab=readme-ov-file#this-repository-needs-your-help
-
Kind of suspected that ;-). From what I remember, on a Raspberry Pi 3/4 a 400 LOC Kotlin file started to produce unacceptable slowdowns with the PEG parser (that was probably the unoptimized version). And even if the optimized version is faster for normal editing, I'm quite worried that we can't easily check whether there is some pathological path in the grammar that would produce much worse slowdowns when some specific conditions are met. I don't want to say we should never use a PEG parser in Geany, and I don't want to veto such attempts if others have a different opinion - I'm just not very thrilled about it myself.

To be clear, I'm sure you did a great job with the PEG parser - for me the problem is PEG parsers in general, not your work.
-
@techee What do you think about bison/flex? This is just a question.
-
Or implement an LSP server based on universal-ctags (with all the parsers it contains). I think it would be a super-cool (but huge) project. One could grab what we call the "tag manager" in Geany plus some extra code from the files that use it, and add the JSON-RPC API of LSP on top. I was playing with the idea of implementing it myself, but it's really a big project, and I've just finished the LSP plugin which consumed a huge amount of my time, so I don't plan anything bigger now.
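For anyone wondering what "add the JSON-RPC API of LSP" would roughly involve, here is a minimal Python sketch. It is an illustration only, not Geany or tag-manager code: it speaks the LSP stdio framing and answers `textDocument/documentSymbol` by shelling out to a ctags binary, assuming one built with JSON output support; the symbol-kind mapping and error handling are deliberately simplified.

```python
#!/usr/bin/env python3
# Rough sketch only: a toy LSP server that shells out to universal-ctags
# instead of reusing Geany's tag manager. Assumes a ctags binary built
# with JSON support ("--output-format=json"); not production code.
import json, subprocess, sys
from urllib.parse import urlparse, unquote

def read_message(stream):
    # LSP messages are framed as "Content-Length: N\r\n\r\n<json payload>".
    length = 0
    while True:
        line = stream.readline().decode("ascii").strip()
        if not line:
            break
        if line.lower().startswith("content-length:"):
            length = int(line.split(":", 1)[1])
    return json.loads(stream.read(length)) if length else None

def write_message(stream, payload):
    body = json.dumps(payload).encode("utf-8")
    stream.write(b"Content-Length: %d\r\n\r\n" % len(body) + body)
    stream.flush()

def document_symbols(path):
    # Ask ctags for tags as JSON lines; "--fields=+n" adds line numbers.
    out = subprocess.run(["ctags", "--output-format=json", "--fields=+n",
                          "-f", "-", path], capture_output=True, text=True).stdout
    symbols = []
    for line in out.splitlines():
        tag = json.loads(line)
        if tag.get("_type") != "tag":
            continue
        line_no = max(tag.get("line", 1) - 1, 0)
        symbols.append({
            "name": tag["name"],
            "kind": 12,  # SymbolKind.Function; a real server would map ctags kinds properly
            "location": {"uri": "file://" + path,
                         "range": {"start": {"line": line_no, "character": 0},
                                   "end": {"line": line_no, "character": 0}}},
        })
    return symbols

def main():
    stdin, stdout = sys.stdin.buffer, sys.stdout.buffer
    while True:
        msg = read_message(stdin)
        if msg is None or msg.get("method") == "exit":
            break
        if msg.get("method") == "initialize":
            write_message(stdout, {"jsonrpc": "2.0", "id": msg["id"],
                                   "result": {"capabilities": {"documentSymbolProvider": True}}})
        elif msg.get("method") == "textDocument/documentSymbol":
            path = unquote(urlparse(msg["params"]["textDocument"]["uri"]).path)
            write_message(stdout, {"jsonrpc": "2.0", "id": msg["id"],
                                   "result": document_symbols(path)})
        elif msg.get("method") == "shutdown":
            write_message(stdout, {"jsonrpc": "2.0", "id": msg["id"], "result": None})

if __name__ == "__main__":
    main()
```

A real server built on the tag manager would of course keep everything in-process instead of spawning ctags per request; the sketch only shows the protocol plumbing.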
-
Again, I personally prefer hand-written parsers, which one can easily debug and where one can see what's going on by looking at their code.
-
One point on Bison is that it only handles a subset of syntaxes (LALR(1) by default IIRC, and context-free), so it's OK if your language happens to fit in that mold, but otherwise it's forced into using the GLR extension, and that can have the same overheads as PEG. And there is no guarantee it will handle context sensitivity at all; it may be necessary to hand-roll extras. Sadly nothing is free: to get better performance there is a trade-off of generality, but if the language needs generality the performance goes down. You can't win. My understanding is that several language implementations have moved away from Bison for some of these reasons.

One thing that has been missed in the discussions of the Kotlin parser's speed is that most hand-rolled ctags parsers skip a lot of code; IIUC none of them parse expressions or statements fully, but at least at a quick glance the Kotlin PEG describes the full language, including statements and expressions. So it is likely doing a lot of work that is not useful for the ctags use-case (just parse declarations). Maybe if that could be removed it would be a good deal faster, since most programs have more statements and expressions than declarations. How Bison could be made to skip statements and expressions is also unknown.

Finally, how do PEG and Bison parsers handle incorrect code? Being run between keystrokes, there is no guarantee that the code will parse correctly (since the user hasn't finished typing yet). The ctagsd LSP that @masatake posted could of course push the problem out of Geany 😜.
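To make the point about declaration-only parsing concrete, here is a toy Python scanner. It is nothing like the real ctags Kotlin parser and the regex is far too naive for actual Kotlin; it just illustrates why a parser that only looks for declarations both does very little work and shrugs off half-typed code between keystrokes.

```python
import re

# Toy illustration (not the real ctags parser): a declaration-only scanner
# for Kotlin-ish code. It never tries to parse statements or expressions,
# so half-typed lines simply fail the regex and are skipped.
DECL = re.compile(r"^\s*(?:\w+\s+)*(fun|class|object|interface|val|var)\s+(\w+)")

def scan_declarations(source: str):
    tags = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        m = DECL.match(line)
        if m:
            tags.append((m.group(2), m.group(1), line_no))
        # Anything that is not a declaration (expressions, statements,
        # or code the user has not finished typing) is ignored.
    return tags

broken_buffer = """
class Config(val path: String) {
    fun load(): Map<String, String> {
        val result = mutabl            // <- user is still typing here
"""
print(scan_declarations(broken_buffer))
# [('Config', 'class', 2), ('load', 'fun', 3), ('result', 'val', 4)]
```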
-
Geany developers are somewhat reluctant to adopt PEG-based parsers (see geany/geany#3934 or geany/geany#3034). I do understand most of their points:
- additional tools are required (packcc, and now also optionally pegof) on all supported platforms

Hopefully, at least some of these could be resolved. I have a few ideas which I'd like to share and discuss, even though I'm aware that none of them is perfect. Maybe other people will be able to think of other ways, or improve upon these. Here they are:
1. Extended source code distribution
The code generated by packcc is platform-independent, so there is no need to regenerate it every time the final product (i.e. Geany) is built. There could be some way to distribute the ctags sources with the PEG parsers pre-generated. The build system might need a few tweaks to choose whether it should use the pre-generated files or generate them if they are not present.
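As an illustration of those "few tweaks", a build helper could behave roughly like the sketch below. It is Python with made-up paths; a real implementation would live in the autotools/Makefile rules, and the `packcc -o BASENAME` behaviour (writing BASENAME.c and BASENAME.h) is assumed from its usual usage.

```python
#!/usr/bin/env python3
# Sketch only: prefer a pre-generated parser if the extended source
# distribution shipped one, otherwise fall back to running packcc.
# Paths and file names are invented for the example.
import shutil, subprocess, sys
from pathlib import Path

def ensure_parser(peg_file: Path, build_dir: Path) -> Path:
    build_dir.mkdir(parents=True, exist_ok=True)
    target = build_dir / (peg_file.stem + ".c")
    pregenerated = peg_file.with_suffix(".c")      # e.g. peg/kotlin.c shipped next to peg/kotlin.peg

    if pregenerated.exists():
        shutil.copyfile(pregenerated, target)      # no packcc needed at build time
    elif shutil.which("packcc"):
        # Assumed CLI: packcc -o BASENAME writes BASENAME.c and BASENAME.h
        subprocess.run(["packcc", "-o", str(target.with_suffix("")), str(peg_file)],
                       check=True)
    else:
        sys.exit(f"error: no pre-generated {pregenerated.name} and no packcc in PATH")
    return target

if __name__ == "__main__":
    print(ensure_parser(Path("peg/kotlin.peg"), Path("build/gen")))
```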
Variant 1A: Extended source tarball
The ctags project could provide those sources directly in its releases as another source tarball.
Pros:
Cons:
Variant 1B: External repository
It should be possible to create a separate repository that would mirror all the changes in ctags and automatically replace the PEG files with the generated code.
Pros:
Cons:
2. Distribute ctags as a library
Geany could simply link against ctags in the form of a static or dynamic library.
Pros:
Cons:
3. Keep the generated code in the repository
The parser code could be generated during the development process and kept in version control. This is kind of against good manners, and I suggest it just for the sake of completeness.
Pros:
Cons:
4. Provide a way to generate the parsers easily
One of the main problems is that packcc and pegof are not readily available on most platforms. Even if they can be built for many of them, they can't be easily installed using a package manager, since there are no packages for them.
However, the ctags repository could provide a simple script that would clone the repositories for these tools, compile them, and then use them to generate the source code for the parsers.
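A very rough sketch of what such a script could look like follows. The repository URLs are the upstream ones, but the build commands and binary locations are assumptions rather than tested instructions (each project's README is authoritative), and since ctags already bundles packcc, cloning it may not even be necessary.

```python
#!/usr/bin/env python3
# Sketch of the "simple script" idea: fetch packcc (and optionally pegof),
# build them, and regenerate the PEG-based parsers. Build steps and the
# location of the built binaries are assumptions for illustration only.
import subprocess
from pathlib import Path

TOOLS = {
    "packcc": "https://github.com/arithy/packcc.git",
    "pegof":  "https://github.com/dolik-rce/pegof.git",   # optional optimizer
}

def run(cmd):
    print("+", " ".join(map(str, cmd)))
    subprocess.run(cmd, check=True)

def fetch_and_build(workdir: Path) -> dict:
    binaries = {}
    for name, url in TOOLS.items():
        src = workdir / name
        if not src.exists():
            run(["git", "clone", "--depth=1", url, str(src)])
        build = src / "build-dir"
        # Assumed CMake-based build; adjust per each tool's own instructions.
        run(["cmake", "-S", str(src), "-B", str(build)])
        run(["cmake", "--build", str(build)])
        binaries[name] = build / name                      # assumed output location
    return binaries

def generate_parsers(packcc: Path, peg_dir: Path):
    for peg in sorted(peg_dir.glob("*.peg")):
        # Assumed CLI: packcc -o BASENAME writes BASENAME.c and BASENAME.h
        run([str(packcc), "-o", str(peg.with_suffix("")), str(peg)])

if __name__ == "__main__":
    tools = fetch_and_build(Path("tools"))
    generate_parsers(tools["packcc"], Path("peg"))
```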
Variant 4A: generate parsers on ctags upgrade only
Since Geany just copies the ctags code into its own repository, this would mean the script only needs to be run by the person who upgrades ctags.
Pros:
Cons:
Variant 4B: generate parsers on each build
It would of course also be possible to just copy the *.peg files and run the generator on each build.
Pros:
Cons:
5. Parser-generator-as-a-service
It would be possible to create a web service that would accept a PEG grammar as input and respond with the generated source code. This could be queried either during the ctags upgrade or as part of the build process (so it would have similar pros and cons to the previous variant).
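For illustration, such a service could be as small as the following Python sketch wrapping packcc. It has no authentication, caching, rate limiting or sandboxing, so it only shows the shape of the idea; the `packcc -o` behaviour is assumed as above.

```python
#!/usr/bin/env python3
# Sketch of "parser-generator-as-a-service": POST a PEG grammar, get the
# packcc-generated C source back. Assumes a packcc binary on the server.
import subprocess, tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

class PackccHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        grammar = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        with tempfile.TemporaryDirectory() as tmp:
            peg = Path(tmp) / "parser.peg"
            peg.write_bytes(grammar)
            # Assumed CLI: packcc -o BASENAME writes BASENAME.c and BASENAME.h
            result = subprocess.run(["packcc", "-o", str(peg.with_suffix("")), str(peg)],
                                    capture_output=True)
            if result.returncode != 0:
                self.send_response(400)
                self.end_headers()
                self.wfile.write(result.stderr)
                return
            generated = peg.with_suffix(".c").read_bytes()
        self.send_response(200)
        self.send_header("Content-Type", "text/x-csrc")
        self.end_headers()
        self.wfile.write(generated)

if __name__ == "__main__":
    # e.g.: curl --data-binary @peg/kotlin.peg http://localhost:8080/ > kotlin.c
    HTTPServer(("", 8080), PackccHandler).serve_forever()
```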
Pros:
Cons:
Conclusion
None of the ideas is perfect and I honestly don't know which is best. But I hope this might spark some discussion, and that we can come up with something acceptable for everyone. So please, let the ideas flow.