Conversation

@LearningCircuit
Owner

Summary

  • Increases the default MAX_DIFF_SIZE from 800KB to 5MB to allow larger PRs to be reviewed
  • The 800KB limit was causing failures for PRs with ~900KB diffs in local-deep-research

Context

The AI Code Reviewer workflow in local-deep-research failed with:

❌ Error: Diff is too large (903991 bytes, max: 800000 bytes)
Please split this PR into smaller changes for review.

Changes

  • Updated MAX_DIFF_SIZE default from 800000 to 5000000 bytes
  • This can still be overridden via environment variable if needed
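As a sketch, the default-with-override behavior presumably uses the standard shell parameter-expansion pattern (the placeholder diff content and error wording are illustrative, not copied from the workflow):

```shell
#!/bin/sh
# Default to 5 MB unless the caller overrides MAX_DIFF_SIZE in the environment.
MAX_DIFF_SIZE="${MAX_DIFF_SIZE:-5000000}"

DIFF="example diff content"   # placeholder; the real workflow fetches the PR diff
DIFF_BYTES=$(printf '%s' "$DIFF" | wc -c)

if [ "$DIFF_BYTES" -gt "$MAX_DIFF_SIZE" ]; then
  echo "Error: Diff is too large ($DIFF_BYTES bytes, max: $MAX_DIFF_SIZE bytes)" >&2
  exit 1
fi
echo "Diff size OK: $DIFF_BYTES bytes"
```

Running the workflow step with `MAX_DIFF_SIZE=800000` in its `env:` block would restore the old limit without touching the script.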

Impact

  • Allows reviewing larger PRs without manual configuration
  • Still prevents excessive API usage (5MB is reasonable for most PRs)

Allows larger PRs to be reviewed without manual configuration.
The 800KB limit was too restrictive for many real-world PRs.
Adds the following environment variables to reduce token usage:
- INCLUDE_PREVIOUS_REVIEWS (default: true)
- INCLUDE_HUMAN_COMMENTS (default: true)
- INCLUDE_CHECK_RUNS (default: true)
- INCLUDE_LABELS (default: true)
- INCLUDE_PR_DESCRIPTION (default: true)
- INCLUDE_COMMIT_MESSAGES (default: true)

Set any to 'false' to exclude that context and reduce token count.
This helps handle large PRs that would otherwise exceed model context limits.
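A minimal sketch of how one of these flags might gate its context fetch (the commit-message text is a placeholder standing in for the workflow's real `gh api` call):

```shell
#!/bin/sh
# Each INCLUDE_* flag defaults to true; set it to "false" to skip that context.
INCLUDE_COMMIT_MESSAGES="${INCLUDE_COMMIT_MESSAGES:-true}"

CONTEXT=""
if [ "$INCLUDE_COMMIT_MESSAGES" = "true" ]; then
  # Placeholder; the real workflow fetches this via `gh api`.
  COMMITS="fix: increase MAX_DIFF_SIZE default"
  CONTEXT="$CONTEXT
Commit messages:
$COMMITS"
fi
printf '%s\n' "$CONTEXT"
```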
@LearningCircuit LearningCircuit added the ai_code_review Friendly AI Code Review label Nov 16, 2025
@github-actions

AI response could not be processed. Please check the workflow logs for debugging information.

When DEBUG_MODE is enabled, the workflow captures both debug output
(stderr) and JSON output (stdout) together. This caused JSON parsing
to fail because the parser received the entire mixed output.

Now the workflow extracts just the JSON object from the mixed output
before attempting to parse it.
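One way to sketch that extraction, assuming (as the debug output suggests) the JSON object starts on its own line after the debug lines; the sample mixed output below is illustrative:

```shell
#!/bin/sh
# Mixed output as produced under DEBUG_MODE: debug lines first, JSON object last.
MIXED='DEBUG: fetching PR diff
DEBUG: calling model
{"decision": "approve", "summary": "LGTM"}'

# Keep only the lines from the first "{" onward; the real workflow then
# validates the result (e.g. with `jq -e .`) before using it.
AI_JSON=$(printf '%s\n' "$MIXED" | sed -n '/^{/,$p')
printf '%s\n' "$AI_JSON"
```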
@LearningCircuit LearningCircuit added ai_code_review Friendly AI Code Review and removed ai_code_review Friendly AI Code Review labels Nov 16, 2025
@github-actions

AI Code Review

This PR effectively addresses the 800KB size limit and JSON parsing issues from debug output mixing, but introduces fragile extraction logic and potential performance concerns that should be addressed before merge.


🔒 Security

Potential Data Leakage in Debug Mode

  • Debug output includes head -c 3000 of raw AI responses which may contain sensitive code, tokens, or credentials
  • Recommend: Sanitize debug output to redact secrets, API keys, and tokens before logging

No Input Validation on Environment Variables

  • INCLUDE_* flags accept any string value; only exact "true" is checked
  • Malformed values default to disabling features silently
  • Recommend: Use case-insensitive comparison or strict validation
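A sketch of the recommended validation (the accepted synonyms and the warning text are suggestions, not part of the workflow):

```shell
#!/bin/sh
# Normalize an INCLUDE_* flag: case-insensitive, with a warning on bad values
# instead of silently disabling the feature.
normalize_flag() {
  val=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$val" in
    true|1|yes)  echo "true" ;;
    false|0|no)  echo "false" ;;
    *) echo "Warning: unrecognized flag value '$1', defaulting to true" >&2
       echo "true" ;;
  esac
}

normalize_flag "TRUE"    # accepted despite the uppercase
normalize_flag "False"
normalize_flag "maybe"   # warns on stderr, falls back to the default
```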

⚡ Performance & Resource Usage

Substantial Token Usage Increase

  • MAX_DIFF_SIZE increased 6.25x (800KB → 5MB) with all context flags defaulting to true
  • Combined with 64K max tokens, this could exhaust API quotas or hit timeout limits
  • Recommend: Consider progressive limits or warn users about token implications

Inefficient JSON Extraction Pipeline

  • Three-stage extraction (awk → perl → jq) adds ~100-500ms overhead per review
  • Perl recursive regex (?R) can be slow on large inputs and introduces unnecessary dependency
  • Recommend: Use single jq call with error handling or stream parsing
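The single-jq-call approach could look roughly like this (assumes `jq` is available; the sample response and error message are illustrative):

```shell
#!/bin/sh
# Single-pass extraction: jq parses and validates in one step, replacing the
# awk -> perl -> jq chain and failing cleanly on malformed input.
RESPONSE='{"decision": "approve", "summary": "LGTM"}'

if AI_JSON=$(printf '%s' "$RESPONSE" | jq -ce '.'); then
  printf '%s\n' "$AI_JSON" | jq -r '.decision'
else
  echo "Error: Invalid JSON response from AI model" >&2
  exit 1
fi
```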

🐛 Code Quality & Reliability

Fragile JSON Extraction Logic

  • Awk pattern /^{$/,/^}$/ fails on single-line JSON or nested objects without line breaks
  • Perl regex may capture incomplete JSON if multiple objects exist in output
  • Impact: High risk of parsing failures on valid AI responses
  • Recommend: Use grep -o with proper JSON boundary detection or require AI output to use delimiters

Magic Numbers Without Documentation

  • head -c 3000, tail -c 1000000, head -c 10000, head -c 20000 limits are arbitrary and undocumented
  • No explanation for why these specific values were chosen
  • Recommend: Define constants with comments explaining rationale
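For instance, the limits could be named once at the top of the script (the values match the review's list; the rationale comments are illustrative, since the original script documents none):

```shell
#!/bin/sh
# Named limits instead of inline magic numbers.
DEBUG_PREVIEW_BYTES=3000      # enough of the raw response to diagnose failures
MAX_RESPONSE_BYTES=1000000    # hard cap on captured model output
MAX_COMMENT_BYTES=10000       # per-comment context budget
MAX_CHECKS_BYTES=20000        # check-run context budget

RESPONSE="some long model output"
printf '%s\n' "$RESPONSE" | head -c "$DEBUG_PREVIEW_BYTES"
```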

Inconsistent Error Handling

  • Extraction failure exits with code 1, but JSON validation failure continues to logging block
  • Debug output duplicated in multiple branches
  • Recommend: Consolidate error handling and use functions for debug logging

Code Duplication

  • if [ "$DEBUG_MODE" = "true" ] blocks repeated 4+ times
  • Same gh api pattern repeated for each context type
  • Recommend: Extract debug logging and API calls into reusable functions
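A sketch of that consolidation (the `fetch_context` body is a placeholder; the real repeated pattern is assumed to be a `gh api` call):

```shell
#!/bin/sh
DEBUG_MODE="${DEBUG_MODE:-false}"

# One helper replaces the repeated `if [ "$DEBUG_MODE" = "true" ]` blocks.
debug_log() {
  [ "$DEBUG_MODE" = "true" ] && echo "DEBUG: $*" >&2
  return 0
}

# One helper replaces the repeated per-context fetch pattern.
fetch_context() {
  debug_log "fetching $1"
  echo "context-for-$1"   # placeholder for `gh api "$1"`
}

fetch_context "repos/owner/repo/pulls/1/comments"
```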

💡 Best Practices

Configuration Defaults

  • All INCLUDE_* flags default to true, negating token savings benefit
  • Users must explicitly opt-out rather than opt-in to optimizations
  • Recommend: Default to false for non-essential context (e.g., commit messages, check runs)

Missing Documentation

  • No comments explaining the complex JSON extraction fallback chain
  • No examples of expected AI response format
  • Recommend: Add inline comments and update README with configuration options

Dependency Assumption

  • Introduces perl dependency without checking availability
  • GitHub Actions Ubuntu runners have perl, but local usage may fail
  • Recommend: Add dependency check or use more portable solution
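The dependency check is a one-liner per tool; a sketch (the deliberately missing tool name in the comment is hypothetical):

```shell
#!/bin/sh
# Fail fast if a required tool is missing instead of assuming it exists.
require() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "Error: required dependency '$1' not found" >&2
    exit 1
  }
}

# The real script would call e.g. `require jq` and `require perl` up front.
require sh
echo "all dependencies present"
```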

✅ Verdict: Approved with Recommendations

The changes solve real problems but need refinement:

  1. Fix JSON extraction to be robust against all valid JSON formats
  2. Add input validation for environment variables
  3. Sanitize debug output to prevent credential leakage
  4. Document magic numbers and configuration implications
  5. Consider performance impact of 5MB default with full context

These issues are not blocking but significantly impact reliability and maintainability.


Review by Friendly AI Reviewer - made with ❤️

@github-actions github-actions bot added enhancement New feature or request bug Something isn't working and removed ai_code_review Friendly AI Code Review labels Nov 16, 2025
@LearningCircuit LearningCircuit added the ai_code_review Friendly AI Code Review label Nov 16, 2025
@github-actions

AI Code Review

Increases diff limits and adds context controls, but fragile JSON parsing and insecure debug logging require fixes before merge.


🔒 Security

Sensitive Data Exposure in Debug Mode

  • Debug output logs raw AI responses (head -c 3000) which may contain credentials, tokens, or sensitive code
  • Recommendation: Sanitize output to redact secrets before logging

Silent Configuration Failures

  • INCLUDE_* flags require exact "true" match; malformed values silently disable features
  • Recommendation: Implement case-insensitive validation with user warnings

⚡ Performance

Excessive Token Usage by Default

  • 6.25x diff size increase (800KB → 5MB) with all context features enabled by default
  • Will exhaust API quotas and significantly increase costs
  • Recommendation: Default non-essential context flags to false

Inefficient JSON Extraction

  • Three-stage pipeline (awk → perl → jq) adds unnecessary overhead
  • Perl slurps entire response into memory (-0777)
  • Recommendation: Use single jq call with delimiter protocol

🐛 Critical Reliability Issues

Broken JSON Extraction Logic

  • Awk pattern /^{$/,/^}$/ fails on single-line JSON or nested structures
  • Perl recursive regex may capture incomplete JSON objects
  • Evidence: "AI response could not be processed" error in checks
  • Recommendation: Implement robust parsing with proper JSON boundaries

Undocumented Magic Numbers

  • Arbitrary limits (3000, 1000000, 10000, 20000) with no rationale
  • Recommendation: Define named constants with explanatory comments

Code Duplication

  • Repeated DEBUG_MODE checks and gh api patterns
  • Recommendation: Extract into reusable functions

❌ Request Changes

Blocking issues:

  1. Fragile JSON extraction causing workflow failures (see check error)
  2. Potential credential leakage in debug logs
  3. Silent failures on configuration typos
  4. Defaults negate token optimization benefits

Address these before merge to ensure reliability and security.


Review by Friendly AI Reviewer - made with ❤️

@github-actions github-actions bot added security Auto-created by AI reviewer and removed ai_code_review Friendly AI Code Review labels Nov 16, 2025
Fixes 'Unable to process file command env' error by:
1. Adding quotes around $GITHUB_ENV variable
2. Trimming whitespace/newlines from DECISION value before setting it
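A sketch of both fixes together (the demo `GITHUB_ENV` path and sample DECISION value are placeholders; in Actions, `GITHUB_ENV` is provided by the runner):

```shell
#!/bin/sh
# Demo path; Actions sets GITHUB_ENV to the runner's env file.
GITHUB_ENV="${GITHUB_ENV:-/tmp/github_env_demo}"

DECISION=' approve
'
# Strip embedded newlines/CRs, then surrounding whitespace, before writing.
DECISION=$(printf '%s' "$DECISION" | tr -d '\n\r' | xargs)

# Quoting "$GITHUB_ENV" protects the path from word splitting.
echo "DECISION=$DECISION" >> "$GITHUB_ENV"
tail -n 1 "$GITHUB_ENV"
```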
@LearningCircuit LearningCircuit added ai_code_review Friendly AI Code Review and removed ai_code_review Friendly AI Code Review labels Nov 16, 2025
@github-actions

AI Code Review

Error: Invalid JSON response from AI model


Review by Friendly AI Reviewer - made with ❤️

@github-actions github-actions bot removed the ai_code_review Friendly AI Code Review label Nov 16, 2025
@github-actions

AI Code Review

Error: Invalid JSON response from AI model


Review by Friendly AI Reviewer - made with ❤️

Reverts the fragile awk/perl JSON extraction that was causing failures.

Changes:
- Removed AI_JSON extraction logic
- Removed 2>&1 to prevent stderr mixing with JSON output
- Debug messages now go to workflow logs naturally
- Keeps DECISION trimming and GITHUB_ENV quoting fixes

The script writes JSON to stdout and debug messages to stderr, so no
complex extraction is needed: capturing stdout alone yields the JSON.
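The stdout-only capture can be sketched as follows (the `ai_review` function is a stand-in for the real script invocation):

```shell
#!/bin/sh
# stdout carries JSON, stderr carries debug; capturing without 2>&1 keeps
# debug lines flowing to the workflow log instead of into the variable.
ai_review() {
  echo "DEBUG: building prompt" >&2   # reaches the workflow log
  echo '{"decision": "approve"}'      # reaches AI_RESPONSE
}

AI_RESPONSE=$(ai_review)
printf '%s\n' "$AI_RESPONSE"
```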
@LearningCircuit LearningCircuit added the ai_code_review Friendly AI Code Review label Nov 16, 2025
@github-actions

AI Code Review

This PR effectively increases diff size limits and adds granular context controls while fixing shell scripting robustness issues. The changes demonstrate good iterative development based on real-world failures.

🔒 Security

  • Improved: Removing 2>&1 prevents potential leakage of sensitive error messages into the JSON response
  • Hardened: Proper quoting of $GITHUB_ENV mitigates shell injection risks
  • No new security concerns introduced

✅ Code Quality & Best Practices

Robustness Improvements:

  • Added quotes around $GITHUB_ENV variable to prevent word splitting (line 133)
  • Implemented whitespace trimming for DECISION value using tr -d '\n\r' | xargs to prevent 'Unable to process file command' errors (line 107)
  • Correct string comparison syntax [ "$VAR" = "true" ] used throughout for bash safety

Performance & Flexibility:

  • Major enhancement: New environment variables (INCLUDE_*) allow selective context fetching, reducing token usage and API calls
  • Trade-off: MAX_DIFF_SIZE increased 6.25x (800KB → 5MB) balances real-world PR needs with reasonable API usage limits
  • Defaults maintain backward compatibility while enabling opt-out optimization

Design:

  • Clean separation: Debug output now goes directly to workflow logs rather than being mixed with JSON (removal of 2>&1)
  • Exit code handling preserved for failure detection despite stderr separation

Inference (not verified): Debug mode output may be harder to correlate with specific AI response failures since stderr is no longer captured in AI_RESPONSE, though this improves log clarity overall.


Review by Friendly AI Reviewer - made with ❤️

@github-actions github-actions bot removed the ai_code_review Friendly AI Code Review label Nov 16, 2025
@LearningCircuit LearningCircuit merged commit 2810a5b into main Nov 16, 2025
1 check passed
@LearningCircuit LearningCircuit deleted the fix/increase-max-diff-size branch November 16, 2025 13:41