Skip to content

Conversation

@patrickbrophy
Copy link
Contributor

Previously, concatenated ClassAds (e.g., "][" without whitespace) were handled by a wrapper reader that inserted newlines. This change modifies the grammar to natively support parsing multiple concatenated ClassAds.

Changes:

  • Modified grammar (classad.y) to accept classad_list with multiple ClassAds
  • Added ParseMultipleClassAds() function to parser package
  • Added ParseMultiple() wrapper in classad package
  • Refactored Reader to use grammar-based parsing instead of wrapper reader
  • Removed classAdSeparatorReader wrapper implementation
  • Added comprehensive tests for concatenated ClassAd parsing

This provides cleaner architecture, better performance, and native grammar support for the HTCondor format where ClassAds may be concatenated without whitespace between them.

Previously, concatenated ClassAds (e.g., "][" without whitespace) were
handled by a wrapper reader that inserted newlines. This change modifies
the grammar to natively support parsing multiple concatenated ClassAds.

Changes:
- Modified grammar (classad.y) to accept classad_list with multiple ClassAds
- Added ParseMultipleClassAds() function to parser package
- Added ParseMultiple() wrapper in classad package
- Refactored Reader to use grammar-based parsing instead of wrapper reader
- Removed classAdSeparatorReader wrapper implementation
- Added comprehensive tests for concatenated ClassAd parsing

This provides cleaner architecture, better performance, and native grammar
support for the HTCondor format where ClassAds may be concatenated without
whitespace between them.
@codecov
Copy link

codecov bot commented Dec 12, 2025

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

Copy link
Collaborator

@bbockelm bbockelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This buffers everything into memory -- I think we need to keep things more stream-oriented, processing ads one-by-one. Reasonable first attempt, just needs some tuning.

…fixes

- Add buffer size limit (10MB) to prevent unbounded memory growth
- Fix UTF-8 handling in findCompleteClassAd for proper multi-byte character support
- Refactor duplicate EOF handling logic into handleEOF() method
- Fix variable shadowing in findCompleteClassAd return values
- Add comprehensive edge case tests:
  * UTF-8 characters in attribute values and strings
  * Empty brackets handling
  * Buffer size limit enforcement
  * Unclosed brackets (malformed input)
  * Multiple ClassAds with UTF-8 content
- Optimize buffer size checks to occur before expensive scans
- Add constants for maxBufferSize and readChunkSize for maintainability

All existing tests pass, and new edge case tests verify robustness.
Copy link
Collaborator

@bbockelm bbockelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're hitting a basic design issue in the current lexer - it wants to be fed a full ClassAd as a string.

That's a dumb design causing you to generate basically a mini-parser in the reader to buffer out a full ad.

I'll try a more fundamental change, allowing the lexer to work on streams.

@bbockelm
Copy link
Collaborator

Fixed in #6

@bbockelm bbockelm closed this Dec 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants