Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate thrift code in multiple mods/files #454

Open
Xuanwo opened this issue Jun 19, 2024 · 9 comments · May be fixed by #510
Open

Generate thrift code in multiple mods/files #454

Xuanwo opened this issue Jun 19, 2024 · 9 comments · May be fixed by #510
Assignees
Labels
A-volo-build This issue concerns the `volo-build` crate. C-feature-accepted This feature request is accepted.

Comments

@Xuanwo
Copy link
Contributor

Xuanwo commented Jun 19, 2024

Feature Request

Crates

volo-thrift / pilota

Motivation

Pilota often generates all code in a single file, which can result in extremely large files when the thrift IDL is substantial.

This issue is evident in hive_metastore, where the generated file size approaches 10MiB. Such large files can slow down IDEs and cause web pages to crash.

Try: https://docs.rs/hive_metastore/latest/src/hive_metastore/hms.rs.html#52510-52520

Proposal

Generate thrift code in multiple mods/files by struct / service.

Alternatives

None.

@PureWhiteWu
Copy link
Member

I think this is a workaround for the lack of IDE and website capabilities. However, it may be worth doing.

We need detailed design, especially in large-scale idl (thousands of structs, enums, etc.) to avoid IDE problems, website problems, and most importantly, naming conflicts caused by too many files, and to allow users to reference all files at once in a convenient way and maintain the namespace relationship, which will be a more troublesome part.

At present, this feature does not affect normal use and is not a high priority.

@missingdays
Copy link

missingdays commented Sep 9, 2024

This issue becomes worse the more entites there are, with some files growing in size as large as several hunders megabyes.

For the design, the solution might be simply utilizing the include! macro to include the generated parts into the volo_gen.rs.
For each entity, a separate file can be generated, named volo_gen_$N.rs, where $N is an incrementing counter. Then, this file will be included in volo_gen.rs using include!("volo_gen_$N.rs");.

On most filesystems, many files in a single directory can lead to performance degradations. To workaround this problem, a separate directory can be generated for each 1_000 or 1_000_000 files. Then, the files would be included like include!("directory_$M/volo_gen_$N.rs");.

This solution preserves the single entry point volo_gen.rs, while allowing to reduce its file size. If the resulting file volo_gen.rs gets too big even with this solution, then volo_gen_$N.rs files can be generated not for a single entity, but for a batch of entities, e.g. for each 10 entities.

@PureWhiteWu
Copy link
Member

PureWhiteWu commented Sep 9, 2024

I think maybe we can try separating by mods first? Or if the file is still too large, we can try separating by struct/enum/service/const?

For example, we have two idls article.thrift and author.thrift:

// article.thrift
namespace rs article

struct Article {
    1: required i64 id,
    2: required string title,
    3: required string content,
    4: required i64 author_id,
}

struct GetArticleRequest {
    1: required i64 id,
}

struct GetArticleResponse {
    1: required Article article,
}

service ArticleService {
    GetArticleResponse GetArticle(1: GetArticleRequest req),
}

// author.thrift
namespace rs article.author // just for example

struct Author {
    1: required i64 id,
    2: required string username,
    3: required string email,
}

In that case, we can generate the following layout:

 ┌─ volo_gen.rs
 │  ┌─ struct_Article.rs
 │  │  ┌─ mod.rs
 │  │  ├─ struct_Author.rs
 │  ├─ author
 │  ├─ struct_GetArticleResponse.rs
 │  ├─ struct_GetArticleRequest.rs
 │  ├─ service_ArticleService.rs
 │  ├─ mod.rs
 ├─ article
 $OUT_DIR

With the following example code:

// in volo_gen.rs
mod article {
    include!("article/mod.rs");
}

// in article/mod.rs
include!("struct_Article.rs");
include!("struct_GetArticleRequest.rs");
include!("struct_GetArticleResponse.rs");
include!("service_ArticleService.rs");

mod author {
    include!("author/mod.rs");
}

// in article/author/mod.rs
include!("struct_Author.rs");

This will result in more readable code than just generate something like volo_gen_1.rs, volo_gen_2.rs... What do you think?

Furthermore, I'm not quite familiar with how IDEs(RustRover, VSCode + rust-analyzer) deal with include! macro, will this still result in performance problems(such as, include all files together and then analyze, which will not solve the performance problem) or some feature disabled(such as doesn't support analyzing include! macro in include! macro)? @missingdays

@missingdays
Copy link

That naming scheme is better for sure, and will make exploring the generated files easier.

will this still result in performance problems

While there's no 100% guarantee that no performance problems will arise, the main problem I see is the actual file size. To analyze files with include!, IDEs usually don't replace with macro call with the file contents, but rather analyze the included file separately.

some feature disabled

A well-formed files which contain valid Rust code should be well-supported by IDEs, even if they are include!d.

This generation scheme could also be optional, enabled only by an explicit flag.
Or it can be turned on automatically only if the big scheme is detected. For example, if there are less than 1000 entities - generate a single file, and a lot of separate files otherwise.

@missingdays
Copy link

@PureWhiteWu if the issue is not a high priority for you right now, I can try implementing it, though I would probably need some guidance on some entry points.

@PureWhiteWu
Copy link
Member

@missingdays Great! Thanks very much!
In fact, we really want to realize this function earlier, but we have not had the manpower.

Feel free to contact @Millione if you need any help.

@PureWhiteWu PureWhiteWu added A-volo-build This issue concerns the `volo-build` crate. C-feature-accepted This feature request is accepted. labels Sep 10, 2024
@missingdays
Copy link

missingdays commented Sep 10, 2024

After looking around the code a bit, I noticed that there's a Workspace mode supported by pilota, and that @Millione has done some work in supporting this mode in volo-build/src/workspace.rs, though it seems to not be used at the moment

If I'm reading the implementation correctly, it creates a workspace with one crate for each definition. So it looks like it solves the same problem, but in a more straightforward way

@PureWhiteWu
Copy link
Member

In fact, the problem we encountered comes from the both workspace and normal mode. Although the workspace mode has split at the idl level, the single idl and its dependencies are still too complicated.

In addition, because we generate all common-used(used by multi idl) structures in a common library by default, the generated files in this library are very large (this is the problem we encountered when using RustRover).

@missingdays
Copy link

This needs a support on the pilota side, I've created a pull request with the proposed solution here cloudwego/pilota#270

missingdays added a commit to missingdays/volo that referenced this issue Oct 18, 2024
@missingdays missingdays linked a pull request Oct 18, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-volo-build This issue concerns the `volo-build` crate. C-feature-accepted This feature request is accepted.
Development

Successfully merging a pull request may close this issue.

4 participants