feat: Phase 10 post-training completion (v0.4.0)#7
Conversation
Closes the "trained, now what?" gap with five new modules: - forgelm/inference.py: shared generation primitives (load_model, generate, generate_stream, logit_stats, adaptive_sample) used by chat, judge, synthetic - forgelm/chat.py: interactive REPL with streaming, /reset /save /temperature /system slash commands, rich optional rendering, optional Llama Guard routing - forgelm/fit_check.py: VRAM pre-flight advisor — architecture via AutoConfig, formula-based estimation, FITS/TIGHT/OOM/UNKNOWN verdict + recommendations - forgelm/export.py: GGUF export wrapping llama-cpp-python converter; adapter merge; 6 quant levels; SHA-256 appended to model_integrity.json - forgelm/deploy.py: deployment config generation for ollama/vllm/tgi/hf-endpoints CLI: forgelm chat / export / deploy subcommands + --fit-check flag; full backward compat with existing flat CLI. All subcommands work without --config. Tests: 89 new tests across 5 modules; full suite 427 passed, 34 skipped. Docs: CHANGELOG, README, roadmap, quickstart, usage reference updated. Version: 0.3.1rc1 → 0.4.0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Sorry @cemililik, your pull request is larger than the review limit of 150000 diff characters
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 37 minutes and 14 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (28)
📝 WalkthroughWalkthroughv0.4.0 (Phase 10 post‑training) release: adds shared inference primitives (streaming, adaptive sampling, logit stats), interactive terminal chat REPL, GGUF export (adapter merge + integrity), VRAM fit-check estimator, deployment config generators, CLI subcommands (chat/export/deploy/--fit-check), docs, tests, and a package version bump. Changes
Sequence DiagramsequenceDiagram
participant User as User
participant CLI as CLI Handler
participant Chat as ChatSession
participant Model as Model
participant Tokenizer as Tokenizer
participant Stream as TokenStreamer
User->>CLI: forgelm chat --model-path <path> --stream
CLI->>CLI: call forgelm.inference.load_model()
CLI->>Chat: instantiate ChatSession(model, tokenizer)
loop REPL
User->>Chat: user prompt
Chat->>Tokenizer: _to_messages(system + history + input)
Tokenizer-->>Chat: messages
Chat->>Tokenizer: _build_prompt(messages)
Tokenizer-->>Chat: prompt
alt stream enabled
Chat->>Model: generate_stream(prompt,...)
Model->>Stream: spawn TextIteratorStreamer (background)
Stream-->>Chat: token chunks
Chat->>User: print tokens live
else non-stream
Chat->>Model: generate(prompt,...)
Model-->>Chat: full response
Chat->>User: print response
end
Chat->>Chat: append/trim history
end
User->>Chat: /save path
Chat->>Chat: write JSONL file
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| BestPractice | 15 minor |
| Security | 2 critical |
🟢 Metrics 533 complexity · 35 duplication
Metric Results Complexity 533 Duplication 35
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request implements Phase 10 of the roadmap, adding comprehensive post-training features including an interactive chat REPL, GGUF export functionality, a VRAM fit advisor, and deployment configuration generators for Ollama, vLLM, TGI, and Hugging Face Endpoints. Review feedback highlights several technical improvements: correcting the GGUF quantization workflow to account for script limitations, ensuring consistent terminal rendering with the rich library, refining configuration templates to handle special characters and remote model IDs, and optimizing resource management through better memory usage and temporary file cleanup.
| _OUTTYPE_MAP = { | ||
| "f16": "f16", | ||
| "q2_k": "q2_k", | ||
| "q3_k_m": "q3_k_m", | ||
| "q4_k_m": "q4_k_m", | ||
| "q5_k_m": "q5_k_m", | ||
| "q8_0": "q8_0", | ||
| } |
There was a problem hiding this comment.
The convert_hf_to_gguf.py script bundled with llama-cpp-python (and llama.cpp) typically only supports a limited set of output types for the initial conversion (e.g., f32, f16, q8_0). It does not support K-quants like q4_k_m directly. Passing these unsupported types to --outtype will likely cause the conversion script to fail. A separate quantization step using the quantize utility is usually required for these formats.
| print(token, end="", flush=True) | ||
| response += token |
There was a problem hiding this comment.
When rich rendering is enabled, tokens should be printed using the rich console object to ensure consistent output formatting and handling of control characters. Mixing standard print with rich.console.Console.print (used for the prefix on line 224) can lead to rendering inconsistencies or issues with Rich's internal state management.
| print(token, end="", flush=True) | |
| response += token | |
| if _HAS_RICH: | |
| _console.print(token, end="") | |
| else: | |
| print(token, end="", flush=True) | |
| response += token |
| # Escape any double quotes in the system prompt | ||
| safe_prompt = system_prompt.replace('"', '\\"') | ||
| lines.append(f'SYSTEM "{safe_prompt}"') | ||
| lines.append("") |
There was a problem hiding this comment.
The Ollama SYSTEM directive is better handled using triple quotes ("""). This avoids issues with double quotes inside the system prompt and correctly handles multi-line prompts without manual backslash escaping.
| # Escape any double quotes in the system prompt | |
| safe_prompt = system_prompt.replace('"', '\\"') | |
| lines.append(f'SYSTEM "{safe_prompt}"') | |
| lines.append("") | |
| lines.append('SYSTEM """') | |
| lines.append(system_prompt) | |
| lines.append('"""') | |
| lines.append("") |
| "volumes": [f"{os.path.abspath(model_path)}:/data"], | ||
| "command": ( | ||
| f"--model-id /data " | ||
| f"--max-input-length {max_input_length} " |
There was a problem hiding this comment.
The TGI generator assumes model_path is always a local directory and mounts it as a volume. However, model_path can also be a Hugging Face Hub ID. If a Hub ID is provided, os.path.abspath will produce an invalid local path, and the volume mount will fail. The generator should detect if the path is local and only mount it if necessary.
| "volumes": [f"{os.path.abspath(model_path)}:/data"], | |
| "command": ( | |
| f"--model-id /data " | |
| f"--max-input-length {max_input_length} " | |
| "volumes": [f"{os.path.abspath(model_path)}:/data"] if os.path.isdir(model_path) else [], | |
| "command": ( | |
| f"--model-id {'/data' if os.path.isdir(model_path) else model_path} " | |
| f"--max-input-length {max_input_length} " |
|
|
||
| logger.info("Merging adapter %s into base model %s ...", adapter_path, model_path) | ||
| tokenizer = AutoTokenizer.from_pretrained(model_path) | ||
| model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="cpu") |
There was a problem hiding this comment.
Loading large models on CPU for merging can consume significant system RAM (e.g., ~140GB for a 70B model in fp16). It is highly recommended to use low_cpu_mem_usage=True to reduce the peak RAM requirement during the loading process.
| model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="cpu") | |
| model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="cpu", low_cpu_mem_usage=True) |
| if update_integrity: | ||
| _update_integrity_manifest(model_path, result) | ||
|
|
||
| return result |
There was a problem hiding this comment.
The temporary directory created for the merged model weights (line 235) is not cleaned up after the conversion process. Since these weights are saved in fp16 format, they can occupy significant disk space. Consider adding a cleanup step.
| return result | |
| if merged_dir and os.path.exists(merged_dir): | |
| import shutil | |
| shutil.rmtree(merged_dir) | |
| return result |
| thread = Thread(target=model.generate, kwargs=gen_kwargs, daemon=True) | ||
| thread.start() | ||
|
|
||
| for token_text in streamer: | ||
| yield token_text | ||
|
|
||
| thread.join() |
There was a problem hiding this comment.
The background thread used for streaming generation should be joined in a finally block. This ensures that the thread is properly joined even if an exception occurs during the iteration over the streamer or if the generator is closed prematurely.
| thread = Thread(target=model.generate, kwargs=gen_kwargs, daemon=True) | |
| thread.start() | |
| for token_text in streamer: | |
| yield token_text | |
| thread.join() | |
| thread = Thread(target=model.generate, kwargs=gen_kwargs, daemon=True) | |
| thread.start() | |
| try: | |
| for token_text in streamer: | |
| yield token_text | |
| finally: | |
| thread.join() |
- forgelm/chat.py: remove unused `rich.text.Text` import - forgelm/inference.py: remove unused `math` import (auto-fixed) - tests/test_safety_advanced.py: remove unused `result` assignment (F841) - tests/test_cli_phase10.py: remove unused _run_deploy_cmd/_run_export_cmd imports - tests/test_deploy.py: remove unused pytest import, sort imports (I001) - tests/test_export.py: remove unused subprocess/pytest imports, sort imports - tests/test_fit_check.py: remove unused types import - tests/test_inference.py: sort import block (I001), remove unused importlib/math - tests/test_oom_recovery.py: remove unused generate_training_manifest import Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- test_oom_recovery: rename capture_manifest arg cfg→config to match generate_training_manifest(config=...) keyword call signature - test_safety_advanced: patch transformers.pipeline instead of forgelm.safety.pipeline — pipeline is a local import inside run_safety_evaluation so no module-level attribute exists Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- roadmap.md: remove redundant "Phase 12 (reserved)" row; fix Phase 10.5 Mermaid color (was green/done, still planned → blue); update doc map labels to reflect completed vs planned status - completed-phases.md: add Phase 10 to summary table; replace 650-line Phase 10-13 duplicate content with a concise ✅ entry for Phase 10; remove Phase 11-13 content (each has its own dedicated file) - phase-12-quickstart.md: update heading from "Phase 12" to "Phase 10.5"; add file-name note explaining why filename stays as phase-12-* - phase-11-data-ingestion.md: fix delivery target from "v0.4.5 or v0.5.0" to v0.5.0 (consistent with roadmap.md and releases.md) - releases.md: remove duplicate --- separator; clarify v0.3.1rc1 status as folded into v0.4.0 rather than a standalone release candidate Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/roadmap/completed-phases.md (1)
3-3:⚠️ Potential issue | 🟡 MinorUpdate the archive-scope note to include Phase 10.
The summary table now includes Phase 10, but the note at line 3 still says "Phase 1-9 completed work." Tweak so the description matches the table contents.
📝 Suggested change
-> **Note:** Historical archive of ForgeLM's Phase 1-9 completed work. Kept for reference and onboarding. For active planning, see [../roadmap.md](../roadmap.md). +> **Note:** Historical archive of ForgeLM's Phase 1-10 completed work. Kept for reference and onboarding. For active planning, see [../roadmap.md](../roadmap.md).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/roadmap/completed-phases.md` at line 3, Update the archive-scope note string that currently reads "**Note:** Historical archive of ForgeLM's Phase 1-9 completed work." to reference Phase 10 (e.g. "**Note:** Historical archive of ForgeLM's Phase 1-10 completed work. Kept for reference and onboarding. For active planning, see [../roadmap.md](../roadmap.md).") so the description in completed-phases.md matches the summary table; locate and edit that exact note text in the docs/roadmap/completed-phases.md file.
🧹 Nitpick comments (12)
forgelm/export.py (1)
69-77: Minor lint cleanup: redundantimport llama_cppandformatbuiltin shadowing.
- Lines 70/77:
llama_cppis imported twice (inside thetryand again unconditionally below). Ruff flags this as F811. Drop the second import — the binding from line 70 is already in scope.- Lines 181/206: parameter name
formatshadows the builtin. Renaming tooutput_format(orfmt) avoids the SonarCloud warning and keeps the local namespace clean. This change ripples into a fewformat=formatkwargs further down but is mechanical.Optional polish, not a correctness issue.
Also applies to: 181-181, 206-207
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/export.py` around lines 69 - 77, Remove the duplicate unconditional "import llama_cpp" (the import inside the try already binds llama_cpp) to fix the F811 lint; also rename the parameter named "format" (e.g., in the function signature that currently uses format and any calls passing format=format) to "output_format" (or "fmt") and update all internal references and keyword args (e.g., format=format -> output_format=output_format) so the builtin isn't shadowed and all call sites match the new name (search for occurrences of the symbol "format" in the function and its callers around the existing references).forgelm/deploy.py (3)
117-151: TGIcount: 1GPU reservation is hard-coded.The compose template always reserves a single GPU. Multi-GPU users who actually want sharding will silently get one device. Consider exposing
num_gpus(mapping tocount) and/or--num-shardin the TGI command for parity with vLLM'stensor_parallel_size. Not a defect for the v0.4.0 single-GPU happy path; flagging as a usability gap to track.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/deploy.py` around lines 117 - 151, The _tgi_compose function currently hard-codes GPU reservation count: devices=[{"driver":"nvidia","count":1,...}], which forces single-GPU; change the signature of _tgi_compose to accept a num_gpus (or gpu_count) parameter and use that value for the devices "count" field, and also append a corresponding TGI CLI flag (e.g., "--num-shard {num_gpus}" or "--num-gpus {num_gpus}") into the command string so the container and compose reservation stay in sync; update any callers to pass the desired num_gpus or default to 1 to preserve current behavior.
188-212: 17-parameter signature could be grouped into per-target option dataclasses.SonarCloud flags the parameter count, and the comment groupings already document the per-target split (
# vLLM-specific,# TGI-specific,# HF Endpoints-specific). Promoting those groups to small dataclasses (OllamaOptions,VLLMOptions, …) — passed as**asdict()to the per-target generators — would shrink the public signature and make adding a new target easier without churning every caller. Optional, deferrable.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/deploy.py` around lines 188 - 212, Refactor the large parameter list of generate_deploy_config into small per-target dataclasses (e.g., OllamaOptions, VLLMOptions, TGIOptions, HFEndpointsOptions) so callers pass grouped option objects instead of 17 separate params; update the generate_deploy_config signature to accept those dataclass instances (or Optional[...] with defaults) and move related params (temperature/top_k/top_p -> sampling group, gpu_memory_utilization/dtype -> VLLMOptions, port -> TGIOptions, task/instance_size/instance_type/region/framework -> HFEndpointsOptions, system_prompt/max_length/trust_remote_code remain shared) and then pass each asdict(...) into the existing per-target generators inside generate_deploy_config to preserve behavior.
99-114: The vLLM "comment pseudo-keys" trick yields a malformed-looking line.Building comments by abusing dict keys and then post-processing produces output like:
# Generated by ForgeLM deploy: # Run: python -m vllm.entrypoints.openai.api_server --config vllm_config.yamlThe trailing
:after# Generated by ForgeLM deployis harmless to YAML parsers (it's inside a comment) but looks broken to humans reading the file. Prefer writing comments as a plain header string the same way_tgi_composealready does:♻️ Suggested refactor
def _vllm_config( model_path: str, max_length: int, trust_remote_code: bool, gpu_memory_utilization: float, dtype: str, ) -> str: """Generate a vLLM engine config YAML.""" cfg = { "model": model_path, "max_model_len": max_length, "gpu_memory_utilization": gpu_memory_utilization, "trust_remote_code": trust_remote_code, "dtype": dtype, - "# Generated by ForgeLM deploy": None, - "# Run": "python -m vllm.entrypoints.openai.api_server --config vllm_config.yaml", } - # Remove comment pseudo-keys (yaml.dump handles None as null; post-process) - comment_keys = [k for k in cfg if k.startswith("#")] - comments = "\n".join(f"{k}: {cfg[k] or ''}" for k in comment_keys) - for k in comment_keys: - del cfg[k] - - return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + "\n" + comments + "\n" + header = ( + "# Generated by ForgeLM deploy\n" + "# Run: python -m vllm.entrypoints.openai.api_server --config vllm_config.yaml\n\n" + ) + return header + yaml.dump(cfg, default_flow_style=False, sort_keys=False)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/deploy.py` around lines 99 - 114, The current trick of creating comment pseudo-keys in cfg then extracting comment_keys and joining them into comments produces lines with a trailing ": " (e.g., "# Generated by ...: "), so replace that flow by constructing a proper comment header string (instead of pseudo-dict keys) before dumping the YAML: remove adding keys that start with "#" to cfg, build comments as a simple newline-joined string where each line begins with "# " plus the desired text (e.g., "# Generated by ForgeLM deploy" and "# Run: ..."), and then return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + "\n" + comments + "\n" so there is no stray colon; update the variables comment_keys and comments accordingly and ensure yaml.dump only receives a clean cfg without any "#" pseudo-keys.forgelm/fit_check.py (3)
286-294: Drop two unused symbols flagged by SonarCloud.
- Line 288:
free_bytesis captured but never used — switch to_, total_bytes = torch.cuda.mem_get_info()(or use the free value, which is arguably more useful than total when other processes hold VRAM).- Line 328:
archparameter to_build_recommendationsis never referenced inside the function; remove it (and update the call site at line 309).Both are pure cleanup, no behavior change.
Also applies to: 326-333
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/fit_check.py` around lines 286 - 294, Replace the unused local free_bytes by discarding it when calling torch.cuda.mem_get_info (e.g., use "_, total_bytes = torch.cuda.mem_get_info()" or capture and use free value if preferred) so free_bytes is not left unused, and ensure available_gb is computed from the chosen value; also remove the unused "arch" parameter from the _build_recommendations function signature and delete the corresponding argument at its call site in the code (update the function definition of _build_recommendations and every call that passes arch, including the call currently adjacent to its invocation) so the unused symbol is eliminated.
161-172: Minor accuracy note on_lora_adapter_gb(FYI, not blocking).The formula
2 * lora_r * h * target_module_count * num_hidden_layersassumes every target module's A and B matrices both have dimensionh. For modules where the "other" dimension isintermediate_size(e.g.gate_proj,up_proj,down_proj) orn_kv * head_dimunder GQA (e.g.k_proj,v_proj), the actual adapter parameter count differs. This is fine for a ±15 % advisor, but if users with FFN-targeted LoRA see noticeable underestimates, it's the place to revisit.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/fit_check.py` around lines 161 - 172, The current _lora_adapter_gb uses a simplified formula assuming both A and B matrices use hidden_size (h) for their other dimension; update _lora_adapter_gb to compute adapter_params per-target-module shape instead: detect or accept which modules are targeted (or derive from arch) and for each module use the correct second dimension (e.g., arch["intermediate_size"] for FFN/gate/up/down projections, or n_kv*head_dim / arch["num_attention_heads"] for k/v under GQA) to compute A and B sizes, sum across modules and layers, then convert to GB using fp32; keep lora_r, arch["num_hidden_layers"], and arch fields as inputs and fall back to h when module-specific dims are not available.
253-258:trainable_paramsis being reverse-engineered from a GB value; compute it directly.
adapter_gbis computed fromadapter_params * 4 / (1024**3), then a few lines latertrainable_paramsis recovered asint(adapter_gb * (1024**3) / 4). The round-trip via float GB introduces unnecessary precision loss and obscures what's happening. Prefer computing the parameter count once and deriving GB from it.♻️ Suggested refactor
- target_module_count = len(lora.target_modules) - adapter_gb = _lora_adapter_gb(arch, lora.r, target_module_count) - trainable_params = int(adapter_gb * (1024 ** 3) / 4) # fp32 param count + target_module_count = len(lora.target_modules) + trainable_params = ( + 2 * lora.r * arch["hidden_size"] * target_module_count * arch["num_hidden_layers"] + ) + adapter_gb = trainable_params * 4 / (1024 ** 3) # fp32🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/fit_check.py` around lines 253 - 258, Compute the LoRA parameter count directly instead of round-tripping through GB: replace the adapter_gb -> trainable_params two-step with a direct calculation of adapter_params (e.g., call a new or existing helper like _lora_adapter_params(arch, lora.r, target_module_count) or reproduce the exact formula used by _lora_adapter_gb) and set trainable_params = int(adapter_params); then derive adapter_gb = adapter_params * 4 / (1024**3) for any GB reporting. Keep the optimizer_type = getattr(t, "galore_optim", "adamw") if getattr(t, "galore_enabled", False) else "adamw" and pass the direct trainable_params into _optimizer_state_gb(trainable_params, optimizer_type).forgelm/cli.py (1)
539-564: RedundantKeyboardInterrupthandling around_run_chat_cmd.
_run_chat_cmdalready swallowsKeyboardInterruptat lines 560-561, so the outertry/except KeyboardInterruptat the call site (lines 653-656) is unreachable. Either drop the inner handler and rely on the outer one (so a Ctrl-C between command dispatch andrun_chatis also caught), or drop the outer wrapper. Keeping both just adds noise.🧹 Suggested simplification
- if command == "chat": - try: - _run_chat_cmd(args) - except KeyboardInterrupt: - pass - sys.exit(EXIT_SUCCESS) + if command == "chat": + _run_chat_cmd(args) + sys.exit(EXIT_SUCCESS)Also applies to: 652-657
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/cli.py` around lines 539 - 564, The inner KeyboardInterrupt handler in _run_chat_cmd makes the outer KeyboardInterrupt catch at the caller unreachable; remove the inner "except KeyboardInterrupt: pass" from _run_chat_cmd (and any other command helpers that swallow KeyboardInterrupt similarly) so Ctrl-C can be handled centrally by the outer wrapper that calls run_chat; leave the ImportError and generic Exception handlers intact and ensure the function still returns/propagates on KeyboardInterrupt.README.md (1)
175-197: Project Structure section is stale relative to the v0.4.0 changes.This block isn't part of the diff, but since the PR adds five new modules (
inference.py,chat.py,fit_check.py,export.py,deploy.py) and ~89 tests, the directory tree and the trailing comment "145+ unit tests across 11 test files" are now out of date. Consider refreshing this section in a follow-up so the README reflects the v0.4.0 surface.📝 Suggested additions
├── synthetic.py # Synthetic data generation (teacher→student distillation) +├── inference.py # Shared generation primitives (load, generate, stream, sample) +├── chat.py # Interactive REPL with streaming + slash commands +├── fit_check.py # VRAM pre-flight estimator (FITS / TIGHT / OOM / UNKNOWN) +├── export.py # GGUF export via llama-cpp-python + integrity hashing +├── deploy.py # Deployment config generation (ollama / vllm / tgi / hf-endpoints) ├── wizard.py # Interactive configuration wizard ├── webhook.py # Slack/Teams webhook notifications └── utils.py # Authentication & checkpoint management🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 175 - 197, Update the Project Structure block to list the five new top-level modules (inference.py, chat.py, fit_check.py, export.py, deploy.py) alongside the existing entries (e.g., cli.py, config.py, model.py, etc.), and refresh the tests/training summary to reflect the added ~89 unit tests (update "145+ unit tests across 11 test files" to the new total and file count). Ensure the directory tree still shows the same sections (configs/deepspeed/, notebooks/, tests/, docs/guides/) and adjust any trailing comments so the README accurately represents the v0.4.0 surface.forgelm/chat.py (1)
215-256: Streaming/prefix output bypasses theoutput_fntestability seam.
__init__acceptsinput_fn/output_fnand the comment advertises the class as "re-entrant for testing", but_generate_and_printwrites theAssistant:prefix and every streamed token via rawprint()/_console.printdirectly. Tests that overrideoutput_fnwill see assistant turns vanish, and the rich/plain branching is duplicated here vs._default_output. Route everything throughself._print_inline/self._printso the seam actually holds.♻️ Suggested fix
- # Print "Assistant: " prefix - if _HAS_RICH: - _console.print("[bold green]Assistant:[/bold green] ", end="") - else: - print("Assistant: ", end="", flush=True) + # Print "Assistant: " prefix + self._print_inline("[bold green]Assistant:[/bold green] " if _HAS_RICH else "Assistant: ") response = "" if self.stream: try: for token in generate_stream( self.model, self.tokenizer, "", messages=messages, temperature=self.temperature, max_new_tokens=self.max_new_tokens, ): - print(token, end="", flush=True) + self._print_inline(token) response += token except Exception as e: self._print(f"\n[Generation error: {e}]") return - print() # trailing newline + self._print("") # trailing newline🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/chat.py` around lines 215 - 256, The _generate_and_print method writes the "Assistant:" prefix and streamed tokens using print()/_console.print directly which bypasses the output_fn seam; update _generate_and_print to call the existing self._print_inline (or self._print for final newline) for the "Assistant:" prefix and for every streamed token, and use self._print for non-streamed responses so all assistant output goes through the class's output hooks (refer to _generate_and_print, self._print_inline, self._print, and _default_output to mirror existing behavior rather than calling print()/_console.print directly).tests/test_export.py (1)
260-265: Tautological assertion masks regressions.
assert result.error is None or "not found" not in result.error or result.successis satisfied by virtually any outcome — including a failure with an unrelated error, since the second clause ("not found" not in result.error) will be true for almost every error string. The test will pass even ifexport_modelregresses to returningsuccess=Falsefor every quant. Tighten to a positive success assertion.♻️ Suggested fix
for quant in SUPPORTED_QUANTS: with patch.dict(sys.modules, {"llama_cpp": llama_cpp_stub}): with patch("subprocess.run", side_effect=fake_run): result = export_model(str(tmp_path), output_gguf, quant=quant, update_integrity=False) - # Should not fail on quant validation - assert result.error is None or "not found" not in result.error or result.success + assert result.success, f"quant={quant} failed: {result.error}" + assert result.quant == quant🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_export.py` around lines 260 - 265, The current test loop in tests/test_export.py uses a tautological assertion that masks failures; update the assertion after calling export_model (in the loop over SUPPORTED_QUANTS) to explicitly require a successful export by asserting result.success (or assert result.error is None and result.success) instead of the current "result.error is None or 'not found' not in result.error or result.success" so the test fails on regressions; keep the setup using patch.dict(sys.modules, {"llama_cpp": llama_cpp_stub}) and patch("subprocess.run", side_effect=fake_run) and only change the final assertion that checks the exported result variable `result`.forgelm/inference.py (1)
63-74: UseBitsAndBytesConfigfor 8‑bit too — bareload_in_8bitkwarg is deprecated.The 4‑bit branch correctly builds a
BitsAndBytesConfig, but the 8‑bit branch passes the rawload_in_8bit=Truekwarg tofrom_pretrained. In moderntransformers(>=4.37) the directload_in_8bitargument is deprecated and the recommended path isBitsAndBytesConfig(load_in_8bit=True)viaquantization_config. Aligning both branches also makes the code easier to read.♻️ Proposed refactor
if torch.cuda.is_available(): if load_in_4bit: from transformers import BitsAndBytesConfig model_kwargs["quantization_config"] = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", ) elif load_in_8bit: - model_kwargs["load_in_8bit"] = True + from transformers import BitsAndBytesConfig + + model_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/inference.py` around lines 63 - 74, The 8-bit branch currently sets model_kwargs["load_in_8bit"] = True which uses the deprecated direct kwarg; instead construct a BitsAndBytesConfig for 8-bit like the 4-bit branch and assign it to model_kwargs["quantization_config"]. Update the elif handling of load_in_8bit to import/instantiate BitsAndBytesConfig (as used in the load_in_4bit branch), set load_in_8bit=True and appropriate compute dtype (e.g., torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16), and remove the raw load_in_8bit kwarg so from_pretrained receives quantization_config consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@forgelm/chat.py`:
- Around line 262-275: The _safety_hint currently only checks that
forgelm.safety exists and does not perform any classification; update
ChatSession to lazy-load and store a safety classifier at construction (e.g., in
ChatSession.__init__ when enable_safety is true) so the model is loaded once,
then have _safety_hint call that classifier with the prompt/response, map
results to meaningful category/severity using forgelm.safety._HARM_CATEGORIES
(or exposed API), and print a concise user-facing annotation after each
response; alternatively, if you prefer not to implement classification now,
change the enable_safety startup path to emit a clear warning instead of
silently no-op and remove/adjust the misleading debug messaging in _safety_hint.
In `@forgelm/cli.py`:
- Around line 47-122: The shared flags (--output-format, --quiet, --log-level)
must be available to each subcommand parser; create a parent ArgumentParser
(e.g., common = argparse.ArgumentParser(add_help=False)) that defines these
arguments and pass parents=[common] into each subparser created in
_add_chat_subcommand, _add_export_subcommand, and _add_deploy_subcommand so
invocations like "forgelm export ./model --quiet" work; alternatively, add those
flags directly to each p inside those three functions. Also remove the redundant
outer KeyboardInterrupt try/except in main() since _run_chat_cmd already handles
KeyboardInterrupt.
In `@forgelm/deploy.py`:
- Around line 154-180: _hf_endpoints_json currently hard-codes "vendor": "aws"
which breaks validation for non-AWS regions; update the function to either
accept a vendor parameter or infer vendor from the region prefix (e.g., regions
starting with "europe-west"/"us-" -> "gcp", "eastus"/"westus" -> "azure",
fallback to "aws"), set that vendor into the "cloud" object, and relocate or
remove the auxiliary keys ("_generated_by" and "_docs") by placing them in a
"metadata" block (or stripping them) so the generated spec conforms to HF
Endpoints validation; target the _hf_endpoints_json function to make these
changes.
In `@forgelm/export.py`:
- Around line 231-245: The adapter-merge flow leaks the merged_dir and updates
the integrity manifest on the wrong path: ensure _merge_adapter writes into a
temporary directory (use tempfile.TemporaryDirectory()) or always remove
merged_dir in a finally block to avoid leaving large files, set source_path to
the merged_dir when calling the export and then call _update_integrity_manifest
with that source_path (not the original model_path) so the integrity entry is
recorded against the merged/exported artifact; also ensure any early returns
(e.g., the ExportResult on exception) still clean up the temporary merged_dir
before returning.
- Around line 263-280: The subprocess.run call that executes the converter
(using cmd) needs a timeout to avoid indefinite hangs: add a timeout argument to
subprocess.run and propagate a new timeout parameter through export_model()
(choose a sensible default like 600s) so callers can override for large models;
catch subprocess.TimeoutExpired and return an ExportResult(success=False,
format=format, quant=quant, error="converter timed out" or include the timeout
value) and ensure you still handle other exceptions as before. Use the same
symbols (cmd, subprocess.run invocation, ExportResult, export_model) when making
the changes so the timeout is applied to the converter invocation and surfaced
to callers.
- Around line 32-40: The _OUTTYPE_MAP currently lists K-quant strings (q2_k,
q3_k_m, q4_k_m, q5_k_m) which will be passed to convert_hf_to_gguf.py and cause
an argparse error; either remove/disable K-quant entries from _OUTTYPE_MAP and
SUPPORTED_QUANTS so only converter-accepted outtypes remain (e.g., f16, f32,
bf16, q8_0, auto) or implement a two-step flow in the export path that maps
requested K-quant values to a supported initial outtype (f16 or f32) for
convert_hf_to_gguf.py and then invokes the llama-quantize binary (or equivalent
function) to produce the final q*_k output; update any code that reads the
--quant flag to route K-quant requests through this two-step process and ensure
helpful error messages when an unsupported outtype is requested.
- Around line 61-96: _find_converter_script currently only searches inside the
installed llama_cpp package (so wheels without vendor/convert_hf_to_gguf.py
always fail); update the discovery to accept an explicit path override and a
network fallback: change the function signature to accept an optional
converter_script argument (or read an env var FORGELM_CONVERTER_SCRIPT) and
first verify that path, then try the existing local candidates, and if still
missing attempt to download the convert_hf_to_gguf.py from the
llama.cpp/convert/ repository (or return a clear error explaining to pass
--converter-script); ensure logger.debug/info messages reflect each step and
that callers (CLI parsing) add a --converter-script flag that passes the path
into _find_converter_script.
In `@forgelm/fit_check.py`:
- Around line 257-258: The estimator hard-codes optimizer_type to "adamw" when
galore is disabled, causing wrong memory estimates; update the TrainingConfig
model to add a general optimizer field (e.g., optimizer: str = "adamw") for
non-GaLore runs, then change the logic that sets optimizer_type (currently using
getattr(t, "galore_optim", "adamw") if getattr(t, "galore_enabled", False) else
"adamw") to prefer t.optimizer when galore_enabled is False, falling back to
existing galore_optim when galore_enabled is True and preserving the "adamw"
default for backward compatibility; ensure _optimizer_state_gb(trainable_params,
optimizer_type) receives this corrected optimizer_type.
In `@forgelm/inference.py`:
- Around line 256-262: Wrap the Thread target call to model.generate in a small
try/except that captures any exception into a shared container (e.g., an
exception_holder list or dict) instead of letting it be swallowed; start the
thread as before (Thread(target=run_generate, kwargs=gen_kwargs, ...)), stream
tokens from streamer, then after thread.join() check the exception_holder and
re-raise the captured exception in the main thread so OOMs/type errors/missing
eos_token_id etc. surface to the caller; reference the existing symbols
model.generate, Thread, gen_kwargs, streamer and thread.join when making this
change.
- Around line 50-51: The Unsloth backend call drops quantization flags and
silently ignores adapters: update the caller to forward load_in_4bit and
load_in_8bit into _load_unsloth and change _load_unsloth's signature to accept
those two booleans and pass them through to FastLanguageModel.from_pretrained
(instead of hardcoding load_in_4bit=True); also remove the logger.warning path
that returns a model without the adapter and instead raise a ValueError in
_load_unsloth if adapter is provided, telling the user to merge the adapter
first or use backend='transformers'.
In `@pyproject.toml`:
- Around line 74-84: The declared export extra is broken because installing
"llama-cpp-python>=0.2.90" does not provide the upstream conversion script
convert_hf_to_gguf.py; fix by either (A) removing the export extra entry named
export from pyproject.toml to avoid advertising non-functional functionality, or
(B) bundling the conversion script into the repository and packaging it (add
convert_hf_to_gguf.py to the source tree, include it in package_data in
pyproject.toml and ensure it’s installed alongside your package), and update any
docs to state the requirement on llama.cpp if you choose not to bundle it;
reference the export extra and the convert_hf_to_gguf.py script when making the
change.
---
Outside diff comments:
In `@docs/roadmap/completed-phases.md`:
- Line 3: Update the archive-scope note string that currently reads "**Note:**
Historical archive of ForgeLM's Phase 1-9 completed work." to reference Phase 10
(e.g. "**Note:** Historical archive of ForgeLM's Phase 1-10 completed work. Kept
for reference and onboarding. For active planning, see
[../roadmap.md](../roadmap.md).") so the description in completed-phases.md
matches the summary table; locate and edit that exact note text in the
docs/roadmap/completed-phases.md file.
---
Nitpick comments:
In `@forgelm/chat.py`:
- Around line 215-256: The _generate_and_print method writes the "Assistant:"
prefix and streamed tokens using print()/_console.print directly which bypasses
the output_fn seam; update _generate_and_print to call the existing
self._print_inline (or self._print for final newline) for the "Assistant:"
prefix and for every streamed token, and use self._print for non-streamed
responses so all assistant output goes through the class's output hooks (refer
to _generate_and_print, self._print_inline, self._print, and _default_output to
mirror existing behavior rather than calling print()/_console.print directly).
In `@forgelm/cli.py`:
- Around line 539-564: The inner KeyboardInterrupt handler in _run_chat_cmd
makes the outer KeyboardInterrupt catch at the caller unreachable; remove the
inner "except KeyboardInterrupt: pass" from _run_chat_cmd (and any other command
helpers that swallow KeyboardInterrupt similarly) so Ctrl-C can be handled
centrally by the outer wrapper that calls run_chat; leave the ImportError and
generic Exception handlers intact and ensure the function still
returns/propagates on KeyboardInterrupt.
In `@forgelm/deploy.py`:
- Around line 117-151: The _tgi_compose function currently hard-codes GPU
reservation count: devices=[{"driver":"nvidia","count":1,...}], which forces
single-GPU; change the signature of _tgi_compose to accept a num_gpus (or
gpu_count) parameter and use that value for the devices "count" field, and also
append a corresponding TGI CLI flag (e.g., "--num-shard {num_gpus}" or
"--num-gpus {num_gpus}") into the command string so the container and compose
reservation stay in sync; update any callers to pass the desired num_gpus or
default to 1 to preserve current behavior.
- Around line 188-212: Refactor the large parameter list of
generate_deploy_config into small per-target dataclasses (e.g., OllamaOptions,
VLLMOptions, TGIOptions, HFEndpointsOptions) so callers pass grouped option
objects instead of 17 separate params; update the generate_deploy_config
signature to accept those dataclass instances (or Optional[...] with defaults)
and move related params (temperature/top_k/top_p -> sampling group,
gpu_memory_utilization/dtype -> VLLMOptions, port -> TGIOptions,
task/instance_size/instance_type/region/framework -> HFEndpointsOptions,
system_prompt/max_length/trust_remote_code remain shared) and then pass each
asdict(...) into the existing per-target generators inside
generate_deploy_config to preserve behavior.
- Around line 99-114: The current trick of creating comment pseudo-keys in cfg
then extracting comment_keys and joining them into comments produces lines with
a trailing ": " (e.g., "# Generated by ...: "), so replace that flow by
constructing a proper comment header string (instead of pseudo-dict keys) before
dumping the YAML: remove adding keys that start with "#" to cfg, build comments
as a simple newline-joined string where each line begins with "# " plus the
desired text (e.g., "# Generated by ForgeLM deploy" and "# Run: ..."), and then
return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + "\n" +
comments + "\n" so there is no stray colon; update the variables comment_keys
and comments accordingly and ensure yaml.dump only receives a clean cfg without
any "#" pseudo-keys.
In `@forgelm/export.py`:
- Around line 69-77: Remove the duplicate unconditional "import llama_cpp" (the
import inside the try already binds llama_cpp) to fix the F811 lint; also rename
the parameter named "format" (e.g., in the function signature that currently
uses format and any calls passing format=format) to "output_format" (or "fmt")
and update all internal references and keyword args (e.g., format=format ->
output_format=output_format) so the builtin isn't shadowed and all call sites
match the new name (search for occurrences of the symbol "format" in the
function and its callers around the existing references).
In `@forgelm/fit_check.py`:
- Around line 286-294: Replace the unused local free_bytes by discarding it when
calling torch.cuda.mem_get_info (e.g., use "_, total_bytes =
torch.cuda.mem_get_info()" or capture and use free value if preferred) so
free_bytes is not left unused, and ensure available_gb is computed from the
chosen value; also remove the unused "arch" parameter from the
_build_recommendations function signature and delete the corresponding argument
at its call site in the code (update the function definition of
_build_recommendations and every call that passes arch, including the call
currently adjacent to its invocation) so the unused symbol is eliminated.
- Around line 161-172: The current _lora_adapter_gb uses a simplified formula
assuming both A and B matrices use hidden_size (h) for their other dimension;
update _lora_adapter_gb to compute adapter_params per-target-module shape
instead: detect or accept which modules are targeted (or derive from arch) and
for each module use the correct second dimension (e.g.,
arch["intermediate_size"] for FFN/gate/up/down projections, or n_kv*head_dim /
arch["num_attention_heads"] for k/v under GQA) to compute A and B sizes, sum
across modules and layers, then convert to GB using fp32; keep lora_r,
arch["num_hidden_layers"], and arch fields as inputs and fall back to h when
module-specific dims are not available.
- Around line 253-258: Compute the LoRA parameter count directly instead of
round-tripping through GB: replace the adapter_gb -> trainable_params two-step
with a direct calculation of adapter_params (e.g., call a new or existing helper
like _lora_adapter_params(arch, lora.r, target_module_count) or reproduce the
exact formula used by _lora_adapter_gb) and set trainable_params =
int(adapter_params); then derive adapter_gb = adapter_params * 4 / (1024**3) for
any GB reporting. Keep the optimizer_type = getattr(t, "galore_optim", "adamw")
if getattr(t, "galore_enabled", False) else "adamw" and pass the direct
trainable_params into _optimizer_state_gb(trainable_params, optimizer_type).
In `@forgelm/inference.py`:
- Around line 63-74: The 8-bit branch currently sets
model_kwargs["load_in_8bit"] = True which uses the deprecated direct kwarg;
instead construct a BitsAndBytesConfig for 8-bit like the 4-bit branch and
assign it to model_kwargs["quantization_config"]. Update the elif handling of
load_in_8bit to import/instantiate BitsAndBytesConfig (as used in the
load_in_4bit branch), set load_in_8bit=True and appropriate compute dtype (e.g.,
torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16), and remove
the raw load_in_8bit kwarg so from_pretrained receives quantization_config
consistently.
In `@README.md`:
- Around line 175-197: Update the Project Structure block to list the five new
top-level modules (inference.py, chat.py, fit_check.py, export.py, deploy.py)
alongside the existing entries (e.g., cli.py, config.py, model.py, etc.), and
refresh the tests/training summary to reflect the added ~89 unit tests (update
"145+ unit tests across 11 test files" to the new total and file count). Ensure
the directory tree still shows the same sections (configs/deepspeed/,
notebooks/, tests/, docs/guides/) and adjust any trailing comments so the README
accurately represents the v0.4.0 surface.
In `@tests/test_export.py`:
- Around line 260-265: The current test loop in tests/test_export.py uses a
tautological assertion that masks failures; update the assertion after calling
export_model (in the loop over SUPPORTED_QUANTS) to explicitly require a
successful export by asserting result.success (or assert result.error is None
and result.success) instead of the current "result.error is None or 'not found'
not in result.error or result.success" so the test fails on regressions; keep
the setup using patch.dict(sys.modules, {"llama_cpp": llama_cpp_stub}) and
patch("subprocess.run", side_effect=fake_run) and only change the final
assertion that checks the exported result variable `result`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4eae5ed4-6ee7-4408-8062-73ea8f7509df
📒 Files selected for processing (25)
CHANGELOG.mdREADME.mddocs/guides/quickstart.mddocs/reference/usage.mddocs/roadmap.mddocs/roadmap/completed-phases.mddocs/roadmap/phase-10-post-training.mddocs/roadmap/phase-11-data-ingestion.mddocs/roadmap/phase-12-quickstart.mddocs/roadmap/releases.mdforgelm/__init__.pyforgelm/chat.pyforgelm/cli.pyforgelm/deploy.pyforgelm/export.pyforgelm/fit_check.pyforgelm/inference.pypyproject.tomltests/test_cli_phase10.pytests/test_deploy.pytests/test_export.pytests/test_fit_check.pytests/test_inference.pytests/test_oom_recovery.pytests/test_safety_advanced.py
| optimizer_type = getattr(t, "galore_optim", "adamw") if getattr(t, "galore_enabled", False) else "adamw" | ||
| optim_gb = _optimizer_state_gb(trainable_params, optimizer_type) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Look for an optimizer-name field on TrainingConfig that we should be reading.
rg -n -C2 'class TrainingConfig\b' --type=py
rg -nP '\b(optim|optimizer)\s*[:=]' --type=py -g '!**/test*/**' -g '!**/tests/**'Repository: cemililik/ForgeLM
Length of output: 618
🏁 Script executed:
# Inspect TrainingConfig class definition and its fields
sed -n '100,150p' forgelm/config.py | cat -nRepository: cemililik/ForgeLM
Length of output: 3180
🏁 Script executed:
# Get full context around lines 257-258 in fit_check.py
sed -n '250,265p' forgelm/fit_check.py | cat -nRepository: cemililik/ForgeLM
Length of output: 926
🏁 Script executed:
# Check the _optimizer_state_gb function to understand what it accepts
sed -n '175,210p' forgelm/fit_check.py | cat -nRepository: cemililik/ForgeLM
Length of output: 1630
Optimizer is hard-coded to AdamW when GaLore is disabled, causing incorrect memory estimates.
When galore_enabled=False, the code defaults to "adamw" regardless of which optimizer the user is actually using. However, _optimizer_state_gb() intelligently detects optimizer types: 8-bit variants use ~2 bytes per parameter, Adafactor uses ~4 bytes, and AdamW uses 8 bytes. This causes the estimator to overestimate memory for users running 8-bit Adam or Adafactor without GaLore, potentially falsely returning OOM/TIGHT.
TrainingConfig currently exposes only GaLore-specific optimizer selection (galore_optim) and lacks a general optimizer field for non-GaLore cases. To fix this, either add a general optimizer configuration field to TrainingConfig (defaulting to "adamw" for backward compatibility), or document that optimizer selection is only available via galore_optim when GaLore is enabled.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@forgelm/fit_check.py` around lines 257 - 258, The estimator hard-codes
optimizer_type to "adamw" when galore is disabled, causing wrong memory
estimates; update the TrainingConfig model to add a general optimizer field
(e.g., optimizer: str = "adamw") for non-GaLore runs, then change the logic that
sets optimizer_type (currently using getattr(t, "galore_optim", "adamw") if
getattr(t, "galore_enabled", False) else "adamw") to prefer t.optimizer when
galore_enabled is False, falling back to existing galore_optim when
galore_enabled is True and preserving the "adamw" default for backward
compatibility; ensure _optimizer_state_gb(trainable_params, optimizer_type)
receives this corrected optimizer_type.
| if backend.lower() == "unsloth": | ||
| return _load_unsloth(path, adapter, trust_remote_code) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
unsloth FastLanguageModel.from_pretrained load_in_8bit support
💡 Result:
Unsloth's FastLanguageModel.from_pretrained supports load_in_8bit=True for loading models in 8-bit quantization using BitsAndBytes. The parameter is explicitly listed in the method signature with default False. It passes the quantization config to the underlying Transformers from_pretrained when load_in_8bit is True (along with load_in_4bit), enabling 8-bit loading. Official docs and examples primarily highlight load_in_4bit for QLoRA (default True), but 8-bit is available for higher precision when VRAM allows, using about 2x the memory of 4-bit. Early bugs (e.g., vision.py not handling load_in_8bit, issue #2679 from Jun 2025; memory bug #3354 from Sep 2025) were fixed in later releases, including PR #2022 adding explicit checks. Recent issues confirm it works correctly as of Oct 2025. Use with load_in_4bit=False to avoid conflicts. Example: from unsloth import FastLanguageModel model, tokenizer = FastLanguageModel.from_pretrained( model_name="unsloth/llama-3-8b-bnb-4bit", # or non-quantized base model max_seq_length=2048, load_in_8bit=True, load_in_4bit=False, )
Citations:
- 1: [Bug]FastLanguageModel/ FastModel can not load model in 8bits unslothai/unsloth#2679
- 2: https://leeroopedia.com/index.php/Implementation:Unslothai_Unsloth_FastLanguageModel_From_Pretrained
- 3: https://unsloth.ai/docs/get-started/fine-tuning-llms-guide
- 4: [Bug] load_in_8bit uses ~16 bits of memory unslothai/unsloth#3354
- 5: Add load_in_16bit Parameter and Fix 8-bit Quantization Config unslothai/unsloth#2022
🏁 Script executed:
# First, find and read the inference.py file
git ls-files | grep -E "inference\.py$"Repository: cemililik/ForgeLM
Length of output: 106
🏁 Script executed:
wc -l forgelm/inference.pyRepository: cemililik/ForgeLM
Length of output: 86
🏁 Script executed:
cat -n forgelm/inference.pyRepository: cemililik/ForgeLM
Length of output: 14513
_load_unsloth ignores quantization flags and silently drops adapters.
Two correctness issues in the Unsloth path:
-
Line 51 forwards only
(path, adapter, trust_remote_code). The caller'sload_in_4bitandload_in_8bitparameters are dropped, while line 107 hardcodesload_in_4bit=True— soload_model(..., backend="unsloth", load_in_4bit=False)still loads in 4‑bit, contradicting the API contract. -
Lines 112–117 emit a
logger.warningand return a model that does not have the adapter applied. A user who passesadapter=...receives inference results from the base model with no indication beyond a log line — silently incorrect outputs. The docstring (lines 35–36) states "Adapter weights are merged into the base model," so failing loud is the right default.
Fixes required:
Pass load_in_4bit and load_in_8bit to _load_unsloth; update the function signature to accept and forward these parameters to FastLanguageModel.from_pretrained. Raise a ValueError immediately if an adapter is passed (rather than warning and silently ignoring it), directing the user to merge the adapter first or use backend='transformers'.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@forgelm/inference.py` around lines 50 - 51, The Unsloth backend call drops
quantization flags and silently ignores adapters: update the caller to forward
load_in_4bit and load_in_8bit into _load_unsloth and change _load_unsloth's
signature to accept those two booleans and pass them through to
FastLanguageModel.from_pretrained (instead of hardcoding load_in_4bit=True);
also remove the logger.warning path that returns a model without the adapter and
instead raise a ValueError in _load_unsloth if adapter is provided, telling the
user to merge the adapter first or use backend='transformers'.
| # Phase 10: GGUF export via llama-cpp-python's bundled conversion script. | ||
| # Linux and macOS only; Windows builds are available but less reliable. | ||
| export = [ | ||
| "llama-cpp-python>=0.2.90; sys_platform != 'win32'", | ||
| ] | ||
|
|
||
| # Phase 10: rich terminal rendering for the forgelm chat REPL. | ||
| # Falls back to plain text output when not installed. | ||
| chat = [ | ||
| "rich>=13.0.0", | ||
| ] |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Does the llama-cpp-python wheel from PyPI bundle convert_hf_to_gguf.py inside the llama_cpp package, or must users obtain it from the llama.cpp repository?
💡 Result:
No, the llama-cpp-python wheel from PyPI does not bundle convert_hf_to_gguf.py inside the llama_cpp package. Users must obtain it from the llama.cpp repository.
Citations:
- 1: https://github.com/ggml-org/llama.cpp/blob/master/convert_hf_to_gguf.py
- 2: https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py
- 3: Tutorial: How to convert HuggingFace model to GGUF format [UPDATED] ggml-org/llama.cpp#7927
- 4: Tutorial: How to convert HuggingFace model to GGUF format ggml-org/llama.cpp#2948
- 5: https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file
- 6: https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py
- 7: https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama_cpp.py
Critical issue: export extra declares incomplete functionality.
The export extra declares llama-cpp-python>=0.2.90; sys_platform != 'win32' but this alone does not provide the conversion script. The llama-cpp-python wheel does not bundle convert_hf_to_gguf.py—that script is maintained in the upstream llama.cpp repository and must be obtained separately. Users installing the export extra will successfully install the dependency but fail at runtime with FileNotFoundError when attempting to use the export functionality.
Either add the conversion script to the repository, provide clear documentation that users must manually obtain it from llama.cpp, or remove the extra entirely. As currently declared, it represents broken functionality.
The chat extra remains correct: rich>=13.0.0 with appropriate fallback handling in the code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 74 - 84, The declared export extra is broken
because installing "llama-cpp-python>=0.2.90" does not provide the upstream
conversion script convert_hf_to_gguf.py; fix by either (A) removing the export
extra entry named export from pyproject.toml to avoid advertising non-functional
functionality, or (B) bundling the conversion script into the repository and
packaging it (add convert_hf_to_gguf.py to the source tree, include it in
package_data in pyproject.toml and ensure it’s installed alongside your
package), and update any docs to state the requirement on llama.cpp if you
choose not to bundle it; reference the export extra and the
convert_hf_to_gguf.py script when making the change.
- chat: emit warning instead of silent no-op in _safety_hint - cli: add --output-format/--quiet/--log-level to chat/export/deploy subparsers using argparse.SUPPRESS so they don't override main parser values; remove redundant inner KeyboardInterrupt handler in chat - deploy: parameterize cloud vendor for HF Endpoints config; fix vLLM comment generation to drop trailing colon for None values - export: produce f16 GGUF for K-quants with explicit warning (two-step llama-quantize required); add FORGELM_GGUF_CONVERTER env var override; add subprocess timeout (3600s); cleanup merged adapter dir in finally; remove duplicate llama_cpp import - fit_check: use free_bytes (not total) for available_gb; drop unused arch parameter from _build_recommendations - inference: propagate exceptions from background generate thread; raise ValueError when adapter passed to Unsloth backend; warn when Unsloth receives load_in_4bit=False; switch 8-bit loading to BitsAndBytesConfig - README: add 5 Phase 10 modules to project structure - completed-phases: update scope note to Phase 1-10 - tests: fix tautological assertion in test_export; update test_fit_check to expect free_bytes; mock signatures accept timeout kwarg - pyproject: clarify export extra comment re: K-quant two-step flow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
forgelm/inference.py (1)
51-54:⚠️ Potential issue | 🟡 Minor
load_in_8bit=Trueis silently dropped on the Unsloth backend.Lines 51–54 warn when
load_in_4bit=Falsebut say nothing when the caller explicitly passesload_in_8bit=True. Theload_modelfunction receivesload_in_8bitas a parameter but does not forward it to_load_unsloth()at line 54. Instead,_load_unslothhardcodesload_in_4bit=True(line 111) and ignores the caller'sload_in_8bitrequest — a user callingload_model(..., backend="unsloth", load_in_8bit=True)silently gets a 4-bit model.Since Unsloth's
FastLanguageModel.from_pretrainedacceptsload_in_8bitand properly passes it to bitsandbytes, forward the parameter:Recommended fix
def _load_unsloth( path: str, adapter: Optional[str], trust_remote_code: bool, + load_in_8bit: bool, ) -> Tuple[Any, Any]: """Load model via the Unsloth backend for faster inference.""" try: from unsloth import FastLanguageModel except ImportError as e: raise ImportError( "Unsloth backend requested but 'unsloth' is not installed. Install with: pip install unsloth" ) from e model, tokenizer = FastLanguageModel.from_pretrained( model_name=path, dtype=None, load_in_4bit=True, + load_in_8bit=load_in_8bit, trust_remote_code=trust_remote_code, )And update the call site (line 54):
- return _load_unsloth(path, adapter, trust_remote_code) + return _load_unsloth(path, adapter, trust_remote_code, load_in_8bit)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/inference.py` around lines 51 - 54, The Unsloth path drops the caller's load_in_8bit flag: update the load_model branch that returns _load_unsloth(...) to forward the load_in_8bit argument (i.e., call _load_unsloth(path, adapter, trust_remote_code, load_in_8bit=load_in_8bit)); then modify the _load_unsloth(...) function signature to accept load_in_8bit (and stop hardcoding load_in_4bit=True), and pass that flag into FastLanguageModel.from_pretrained(...) so bitsandbytes receives the requested 8-bit/4-bit option; also keep the existing warning about load_in_4bit=False when appropriate or add a sanity warning when both load_in_4bit and load_in_8bit are conflicting.
🧹 Nitpick comments (9)
forgelm/fit_check.py (3)
287-295: Drop unusedtotal_bytes.SonarCloud flags
total_bytesas unused; replacing with_makes the intent explicit.♻️ Suggested cleanup
- if torch.cuda.is_available(): - free_bytes, total_bytes = torch.cuda.mem_get_info() - available_gb = free_bytes / (1024**3) + if torch.cuda.is_available(): + free_bytes, _ = torch.cuda.mem_get_info() + available_gb = free_bytes / (1024**3)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/fit_check.py` around lines 287 - 295, The variable total_bytes returned from torch.cuda.mem_get_info is unused; update the unpacking in the try block where torch.cuda.mem_get_info() is called (within fit_check.py) to use an underscore for the unused value (e.g., free_bytes, _ = torch.cuda.mem_get_info()) so intent is explicit while leaving available_gb calculation (available_gb = free_bytes / (1024**3)) and the surrounding logic (hypothetical assignment and exception handling) unchanged.
360-361: Tautological condition —quant == "4bit"iffm.load_in_4bitis True.Per lines 243-249,
quantis set to"4bit"exactly whenm.load_in_4bitis true, soquant not in ("4bit",) and not m.load_in_4bitsimplifies tonot m.load_in_4bit.♻️ Suggested simplification
- if quant not in ("4bit",) and not m.load_in_4bit: + if not m.load_in_4bit: recs.append("Enable load_in_4bit: true for QLoRA; reduces base model weight memory by ~75 %.")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/fit_check.py` around lines 360 - 361, The condition in the check is tautological because quant is set to "4bit" exactly when m.load_in_4bit is True; replace the compound check `if quant not in ("4bit",) and not m.load_in_4bit:` with the simplified `if not m.load_in_4bit:` (keeping the same behavior) in the block that appends the recommendation via recs.append, referencing the variables quant and m.load_in_4bit and the recs.append call.
254-260: Reverse-computingtrainable_paramsfrom GB introduces unnecessary float round-trips.
adapter_gbwas just produced byadapter_params * 4 / (1024**3), and we immediately reverse it. Hoisting the param count out of_lora_adapter_gb(or computing it inline here) avoids the precision-losing round trip and reads more clearly.♻️ Suggested refactor
- target_module_count = len(lora.target_modules) - adapter_gb = _lora_adapter_gb(arch, lora.r, target_module_count) - trainable_params = int(adapter_gb * (1024**3) / 4) # fp32 param count + target_module_count = len(lora.target_modules) + trainable_params = ( + 2 * lora.r * arch["hidden_size"] * target_module_count * arch["num_hidden_layers"] + ) + adapter_gb = trainable_params * 4 / (1024**3) # fp32🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/fit_check.py` around lines 254 - 260, The code currently computes adapter_gb via _lora_adapter_gb(arch, lora.r, target_module_count) then reverses that to get trainable_params, which causes unnecessary float round-trips and precision loss; instead compute the adapter parameter count directly (e.g., compute adapter_params = <formula using arch, lora.r, target_module_count> or hoist that logic out of _lora_adapter_gb) and set trainable_params = adapter_params (or adapter_params cast to int) before calling _optimizer_state_gb; update references in this block (target_module_count, _lora_adapter_gb, adapter_gb, trainable_params, _optimizer_state_gb, lora.r, arch) so adapter_gb is derived from integer param count rather than used as the source of truth.tests/test_deploy.py (1)
186-195: Optional: prefermonkeypatch.chdirover manualos.chdir+ try/finally.
monkeypatch.chdir(tmp_path)is the pytest-idiomatic equivalent and auto-restores the cwd at fixture teardown, removing the manual save/restore boilerplate.♻️ Suggested refactor
- def test_default_filename_used_when_output_none(self, tmp_path): - original_cwd = os.getcwd() - os.chdir(tmp_path) - try: - result = generate_deploy_config("./model", "ollama") - assert result.success is True - assert result.output_path == "Modelfile" - assert os.path.isfile("Modelfile") - finally: - os.chdir(original_cwd) + def test_default_filename_used_when_output_none(self, tmp_path, monkeypatch): + monkeypatch.chdir(tmp_path) + result = generate_deploy_config("./model", "ollama") + assert result.success is True + assert result.output_path == "Modelfile" + assert os.path.isfile("Modelfile")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_deploy.py` around lines 186 - 195, Replace the manual cwd management in test_default_filename_used_when_output_none with pytest's monkeypatch fixture: call monkeypatch.chdir(tmp_path) at the top of the test (and remove original_cwd, os.chdir calls and the try/finally block), then run generate_deploy_config("./model", "ollama") and keep the existing assertions; this uses the built-in monkeypatch fixture to auto-restore the working directory.README.md (1)
199-202: Stale test/file counts in Project Structure footer.The PR description reports 427 passing tests; the README's "145+ unit tests across 11 test files" is from an earlier phase and undersells the current coverage. Worth refreshing as part of this README update.
📝 Suggested edit
-tests/ # 145+ unit tests across 11 test files +tests/ # 427+ unit tests across the codebase🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 199 - 202, Update the stale test count in the README's Project Structure footer: replace the line containing "tests/ # 145+ unit tests across 11 test files" with an updated summary that reflects the current test suite (e.g., "tests/ # 427 passing tests" or "tests/ # 427+ unit tests across N test files") so the README matches the PR's reported 427 passing tests; ensure the text remains in the same Project Structure section and preserves formatting.forgelm/deploy.py (1)
125-146: Drop theversion: "3.8"key from the generated docker-compose.Compose v2+ ignores the top-level
versionfield and emits a deprecation warning on each run. It is recommended to remove it from new or updated compose files.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/deploy.py` around lines 125 - 146, The generated docker-compose dict includes a top-level "version": "3.8" key which is deprecated; remove that key from the compose dict construction (the compose variable in forgelm/deploy.py) so the output YAML no longer contains the version field, keeping the "services" structure and header generation (header and yaml.dump usage) unchanged.tests/test_cli_phase10.py (1)
273-289: Minor: chat smoke tests are looser than the docstrings imply.Two small gaps worth tightening when convenient:
test_chat_does_not_require_configbuilds aMagicMock(side_effect=KeyboardInterrupt)and then re-passes it viapatch(..., side_effect=mock_run_chat). A flatpatch("forgelm.cli._run_chat_cmd", side_effect=KeyboardInterrupt)is equivalent and clearer.test_chat_subcommand_registeredsays "must succeed and show model_path" but the assertion only checksexit_info.value.code == 0. If the goal is to verifymodel_pathis documented in--help, capture stdout and assert"model_path"(or the actual help string) is present.Neither blocks the suite — flagging for cleanup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cli_phase10.py` around lines 273 - 289, In test_chat_does_not_require_config replace the indirection with a direct side effect on the patch: patch "forgelm.cli._run_chat_cmd" using side_effect=KeyboardInterrupt instead of creating a MagicMock and re-passing it (this simplifies the test and keeps behavior identical); in test_chat_subcommand_registered, run main() while capturing stdout (e.g., via capsys or patching sys.stdout) when invoking ["forgelm","chat","--help"] and add an assertion that the help output contains "model_path" (or the exact help token expected) to ensure the help text documents the model_path parameter.forgelm/chat.py (1)
218-259: Streaming bypasses the injectedoutput_fn.The constructor accepts
output_fnto support testing/embedding (line 73), but_generate_and_printwrites the streamed prefix and tokens via rawprint(...)and_console.print(...)directly (lines 224–245). Tests that inject a customoutput_fnwon't capture streamed output, and downstream embedders (e.g., a Web UI or notebook frontend) can't redirect it. Same for_print_welcome(lines 294–298).Routing all output through
self._print/self._print_inlinewould keep the abstraction honest. Not blocking, but worth a small consolidation.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/chat.py` around lines 218 - 259, The streaming path in _generate_and_print bypasses the injected output abstraction by using raw print and _console.print; change it to route all output through the instance methods (use self._print for full-line output and add/reuse a self._print_inline for incremental/token output) so generate_stream and generate both write via self._print/_print_inline instead of print/_console.print; replace the "Assistant:" prefix emission, each token emission, the trailing newline, and the error messages to call those methods; do the same consolidation in _print_welcome to use self._print/_print_inline so injected output_fn captures all output.tests/test_fit_check.py (1)
357-367: Tautological hypothetical-mode assertion.
format_fit_checkalways renders the verdict line containing the literal"UNKNOWN"token wheneververdict="UNKNOWN", so theor "UNKNOWN" in textbranch makes this assertion pass even if the "not detected (hypothetical mode)" line were silently dropped. Drop the"UNKNOWN"clause (or split into two assertions) so this actually guards the hypothetical/no-GPU note.♻️ Suggested tightening
text = format_fit_check(result) - assert "not detected" in text or "hypothetical" in text or "UNKNOWN" in text + assert "not detected" in text or "hypothetical" in text🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_fit_check.py` around lines 357 - 367, The test test_hypothetical_mode_note is tautological because format_fit_check always prints the literal "UNKNOWN" when verdict="UNKNOWN", so the current assertion can pass even if the hypothetical/no-GPU note is missing; update the test in tests/test_fit_check.py to remove the `"UNKNOWN" in text` branch and assert only for the hypothetical note (e.g., check "not detected" or "hypothetical"), or split into two assertions: one that verifies the verdict token is present and another that specifically checks for the hypothetical/no-GPU message; locate the test function name test_hypothetical_mode_note and the usage of FitCheckResult and format_fit_check to make this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@forgelm/chat.py`:
- Around line 265-275: The safety-warning log currently emitted every turn from
the per-turn call to _safety_hint should be moved to the session initialization:
in the class __init__ check enable_safety and emit the multi-line warning there
(so it runs once per session) and remove the per-turn invocation that currently
calls self._safety_hint in the response path; refactor _safety_hint to take no
parameters (or delete it and inline the single warning) to remove the unused
prompt/response parameters SonarCloud flagged, and update any references to
_safety_hint accordingly.
---
Duplicate comments:
In `@forgelm/inference.py`:
- Around line 51-54: The Unsloth path drops the caller's load_in_8bit flag:
update the load_model branch that returns _load_unsloth(...) to forward the
load_in_8bit argument (i.e., call _load_unsloth(path, adapter,
trust_remote_code, load_in_8bit=load_in_8bit)); then modify the
_load_unsloth(...) function signature to accept load_in_8bit (and stop
hardcoding load_in_4bit=True), and pass that flag into
FastLanguageModel.from_pretrained(...) so bitsandbytes receives the requested
8-bit/4-bit option; also keep the existing warning about load_in_4bit=False when
appropriate or add a sanity warning when both load_in_4bit and load_in_8bit are
conflicting.
---
Nitpick comments:
In `@forgelm/chat.py`:
- Around line 218-259: The streaming path in _generate_and_print bypasses the
injected output abstraction by using raw print and _console.print; change it to
route all output through the instance methods (use self._print for full-line
output and add/reuse a self._print_inline for incremental/token output) so
generate_stream and generate both write via self._print/_print_inline instead of
print/_console.print; replace the "Assistant:" prefix emission, each token
emission, the trailing newline, and the error messages to call those methods; do
the same consolidation in _print_welcome to use self._print/_print_inline so
injected output_fn captures all output.
In `@forgelm/deploy.py`:
- Around line 125-146: The generated docker-compose dict includes a top-level
"version": "3.8" key which is deprecated; remove that key from the compose dict
construction (the compose variable in forgelm/deploy.py) so the output YAML no
longer contains the version field, keeping the "services" structure and header
generation (header and yaml.dump usage) unchanged.
In `@forgelm/fit_check.py`:
- Around line 287-295: The variable total_bytes returned from
torch.cuda.mem_get_info is unused; update the unpacking in the try block where
torch.cuda.mem_get_info() is called (within fit_check.py) to use an underscore
for the unused value (e.g., free_bytes, _ = torch.cuda.mem_get_info()) so intent
is explicit while leaving available_gb calculation (available_gb = free_bytes /
(1024**3)) and the surrounding logic (hypothetical assignment and exception
handling) unchanged.
- Around line 360-361: The condition in the check is tautological because quant
is set to "4bit" exactly when m.load_in_4bit is True; replace the compound check
`if quant not in ("4bit",) and not m.load_in_4bit:` with the simplified `if not
m.load_in_4bit:` (keeping the same behavior) in the block that appends the
recommendation via recs.append, referencing the variables quant and
m.load_in_4bit and the recs.append call.
- Around line 254-260: The code currently computes adapter_gb via
_lora_adapter_gb(arch, lora.r, target_module_count) then reverses that to get
trainable_params, which causes unnecessary float round-trips and precision loss;
instead compute the adapter parameter count directly (e.g., compute
adapter_params = <formula using arch, lora.r, target_module_count> or hoist that
logic out of _lora_adapter_gb) and set trainable_params = adapter_params (or
adapter_params cast to int) before calling _optimizer_state_gb; update
references in this block (target_module_count, _lora_adapter_gb, adapter_gb,
trainable_params, _optimizer_state_gb, lora.r, arch) so adapter_gb is derived
from integer param count rather than used as the source of truth.
In `@README.md`:
- Around line 199-202: Update the stale test count in the README's Project
Structure footer: replace the line containing "tests/ # 145+ unit
tests across 11 test files" with an updated summary that reflects the current
test suite (e.g., "tests/ # 427 passing tests" or "tests/
# 427+ unit tests across N test files") so the README matches the PR's reported
427 passing tests; ensure the text remains in the same Project Structure section
and preserves formatting.
In `@tests/test_cli_phase10.py`:
- Around line 273-289: In test_chat_does_not_require_config replace the
indirection with a direct side effect on the patch: patch
"forgelm.cli._run_chat_cmd" using side_effect=KeyboardInterrupt instead of
creating a MagicMock and re-passing it (this simplifies the test and keeps
behavior identical); in test_chat_subcommand_registered, run main() while
capturing stdout (e.g., via capsys or patching sys.stdout) when invoking
["forgelm","chat","--help"] and add an assertion that the help output contains
"model_path" (or the exact help token expected) to ensure the help text
documents the model_path parameter.
In `@tests/test_deploy.py`:
- Around line 186-195: Replace the manual cwd management in
test_default_filename_used_when_output_none with pytest's monkeypatch fixture:
call monkeypatch.chdir(tmp_path) at the top of the test (and remove
original_cwd, os.chdir calls and the try/finally block), then run
generate_deploy_config("./model", "ollama") and keep the existing assertions;
this uses the built-in monkeypatch fixture to auto-restore the working
directory.
In `@tests/test_fit_check.py`:
- Around line 357-367: The test test_hypothetical_mode_note is tautological
because format_fit_check always prints the literal "UNKNOWN" when
verdict="UNKNOWN", so the current assertion can pass even if the
hypothetical/no-GPU note is missing; update the test in tests/test_fit_check.py
to remove the `"UNKNOWN" in text` branch and assert only for the hypothetical
note (e.g., check "not detected" or "hypothetical"), or split into two
assertions: one that verifies the verdict token is present and another that
specifically checks for the hypothetical/no-GPU message; locate the test
function name test_hypothetical_mode_note and the usage of FitCheckResult and
format_fit_check to make this change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ba0e40c2-8478-488d-bb9f-e482586e3952
📒 Files selected for processing (18)
README.mddocs/roadmap/completed-phases.mdforgelm/chat.pyforgelm/cli.pyforgelm/deploy.pyforgelm/export.pyforgelm/fit_check.pyforgelm/inference.pypyproject.tomltests/test_cli_phase10.pytests/test_compliance.pytests/test_deploy.pytests/test_export.pytests/test_fit_check.pytests/test_inference.pytests/test_judge_functions.pytests/test_model_card.pytests/test_oom_recovery.py
✅ Files skipped from review due to trivial changes (4)
- tests/test_model_card.py
- tests/test_compliance.py
- tests/test_judge_functions.py
- pyproject.toml
- chat: emit safety warning once per session in __init__ instead of every turn; remove unused _safety_hint method; route all streaming output and welcome banner through self._print/_print_inline so injected output_fn captures everything - inference: forward load_in_4bit and load_in_8bit to _load_unsloth (signature now accepts both as keyword args) and pass them to FastLanguageModel.from_pretrained; warn on conflicting flags - deploy: drop deprecated top-level "version" key from generated docker-compose.yaml - fit_check: discard unused total_bytes via underscore unpack; simplify tautological 4-bit recommendation check; compute adapter parameter count as an integer directly to avoid float round-trip via GB - README: update stale test count (145+/11 files → 427+/30 files) - tests: simplify chat KI test (drop redundant MagicMock indirection); assert chat --help output documents model_path; replace manual cwd management in deploy test with monkeypatch.chdir; split tautological assertion in fit_check hypothetical-mode test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Critical: - export: K-quant requests now write to a .f16.gguf intermediate path and ExportResult.quant reports "f16" (the actual file) instead of the requested K-quant. model_integrity.json SHA-256 stays honest — audit chain no longer claims a hash that doesn't match the file. Required llama-quantize follow-up step is logged with the exact path. - inference: TextIteratorStreamer now has timeout=60 and the streaming loop catches queue.Empty. If model.generate() raises before the queue is populated (OOM, dtype mismatch, invalid input ids), the captured worker exception is re-raised instead of blocking the chat REPL forever. High: - chat: remove unimplemented --safety flag (was a no-op after round 2; CLAUDE.md "stub tagged production" anti-pattern). Welcome banner no longer claims "Safety: on". - chat: token output now goes through _print_inline_raw which rich-escapes model tokens before rendering ([red]ALERT[/red] in training data was a low-grade prompt-injection vector). Rich console buffer is force-flushed so streaming UX doesn't appear frozen. - fit_check: size-hint matching uses word-bounded regex so "llama-13b" no longer falls back to the 3B architecture profile in --offline mode. - deploy: tgi/ollama targets now reject non-local model paths up front; HF Hub IDs no longer silently produce broken docker volumes / Modelfiles the user only discovers at deploy time. Medium: - export: adapter merge forces torch_dtype=float16 (CPU bf16 falls back to fp32 and stalls on large models); full converter stderr/stdout logged separately from the 500-char ExportResult.error so debug context is preserved. - inference: fallback prompt template (no chat_template) now appends "assistant: " to mimic add_generation_prompt=True; raw base models stop echoing "user: ..." continuations. - chat: /save persists the system prompt as the first JSONL message so saved transcripts are replayable. - fit_check: breakdown includes estimation_caveat noting ±30% typical error and unmodeled factors (FlashAttention, MoE, sliding window). Low: - cli: comment explaining the outer KeyboardInterrupt guard around the chat dispatch covers Ctrl-C during model load / banner render. - CHANGELOG: document the K-quant two-step requirement under v0.4.0. Tests: - test_kquant_produces_f16_intermediate: K-quant request → result.quant=f16 and .f16.gguf path on disk. - test_tgi_rejects_non_local_path / test_ollama_rejects_non_local_path: HF Hub IDs return success=False with an actionable error. - test_export mocks now read --outfile from the cmd args so the new K-quant rerouting is tested transparently. Existing ollama/tgi tests create real model directories under tmp_path. 430 passed, 34 skipped (3 new tests). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes ~94 of 104 SonarCloud findings. Behavior is unchanged; remaining critical findings are cognitive-complexity refactors that need their own focused PRs. Wave 1 — quick wins ------------------- Security: - nightly.yml / publish.yml: drop the workflow-level write permission and declare read-only at workflow scope; only the jobs that need write (notify-failure: issues; publish-testpypi/pypi: id-token) keep it. Unused parameters: - compliance.export_compliance_artifacts: drop the unused `config` arg. Manifest already has the config-derived data; updates 5 call sites (cli.py, trainer.py, test_compliance.py, test_integration_smoke.py). - model._apply_moe_expert_quantization: drop unused `num_experts`. - trainer.ForgeTrainer: drop unused `final_path` from _run_benchmark_if_configured / _run_safety_if_configured / _run_judge_if_configured / _export_compliance_if_needed; drop the unused `metrics` from _run_benchmark_if_configured. Updates the 4 call sites in train(). Constructor → literal: - model._apply_lora: lora_kwargs uses a dict literal. - trainer._get_common_training_kwargs / _build_trainer: kwargs and trainer_kwargs use dict literals. - inference.generate / generate_stream: gen_kwargs uses dict literals (separately flagged by SonarCloud). Duplicate literals → constants: - trainer: extract _EVT_REVERT_TRIGGERED for the audit event name used three times around the auto-revert paths. - wizard: extract _PREFERENCE_COLUMNS_HINT for the dpo/simpo/orpo dataset-format hint. PyTorch dim= explicit: - merging._slerp_merge: switch to torch.linalg.vector_norm so SLERP's scalar-magnitude semantics are explicit (previously torch.norm with no dim). - merging._ties_merge_tensor: pass dim=None to torch.quantile. Misc: - merging: rename unused `missing` to `_` in load_state_dict unpack. - wizard.run_wizard: return type now Optional[str] (it already returned None in the "defer" branch). - wizard: replace nested ternary for strategy_str with if/elif/else. - config.DistributedConfig: rephrase inline comment to avoid string-literals-that-look-like-code. - compliance._sanitize_md: type hint now Optional[str] (the body already handled None). Wave 2 — float == pytest.approx ------------------------------- Mechanical migration of ~60 assertions across 13 test files. These were exact comparisons against literal config defaults / dataclass fields (no FP arithmetic) so behavior is unchanged, but pytest.approx silences SonarCloud's float-equality rule and is the conventional pattern. Files: test_alignment, test_benchmark, test_compliance, test_config, test_cost_estimation, test_galore, test_integration_smoke, test_judge_functions, test_long_context, test_safety_advanced, test_synthetic, test_trainer, test_webhook. Three files needed `import pytest` added (test_compliance, test_long_context, test_webhook). False positives left as-is: - runtime_smoke.py:88 — already re-raises on non-zero exit; SystemExit(0) is the documented success path. - test_safety_advanced.py:235 — `result =` was already removed in round-2. - notebooks/kto_binary_feedback.ipynb — low-priority duplicate literal. Verification: - ruff format && ruff check: clean - pytest tests/: 430 passed, 34 skipped (no behavioral change) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_safety_advanced.py (1)
231-242:⚠️ Potential issue | 🟠 MajorRevert patch target to
forgelm.safety.pipeline— current target won't intercept the actual call.Since
forgelm/safety.pyusesfrom transformers import pipeline(line 86), thepipelinefunction is bound in the module's local namespace at import time. Patchingtransformers.pipelinedoes not retroactively affect this pre-bound reference. To intercept the actual call at line 147 ofsafety.py, you must patch where the name is looked up:forgelm.safety.pipeline. Otherwise, the test invokes the real Hugging Face pipeline and will fail when attempting to load a "mock/classifier" model.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_safety_advanced.py` around lines 231 - 242, The test is patching transformers.pipeline but safety.py imported pipeline into its local namespace, so the patch doesn't intercept the call; update the test to patch the exact symbol used by the code by replacing patch("transformers.pipeline", side_effect=mock_pipeline) with patch("forgelm.safety.pipeline", side_effect=mock_pipeline) so that the call inside safety_mod.run_safety_evaluation uses the mock_pipeline; keep the existing mock_pipeline and torch.cuda.empty_cache patching unchanged.
🧹 Nitpick comments (5)
forgelm/merging.py (1)
119-122: Optional: keep the missing-keys signal when applying merged state.Discarding the first return value of
load_state_dictremoves a useful diagnostic. In normal operation the linear merge produces a full state dict, somissing_keysshould be empty; logging the rare non-empty case helps catch silent merge bugs (e.g., adapters covering only a subset of layers, or key-name drift between PEFT/Transformers versions).♻️ Suggested adjustment
- _, unexpected = base_model.load_state_dict(merged_state, strict=False) + missing, unexpected = base_model.load_state_dict(merged_state, strict=False) + if missing: + logger.warning("Merge missing %d expected keys (kept at base values).", len(missing)) if unexpected: logger.warning("Merge produced %d unexpected keys (ignored).", len(unexpected))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/merging.py` around lines 119 - 122, The call to base_model.load_state_dict(merged_state, strict=False) currently discards the first return value (missing_keys) and only logs unexpected keys; change this to capture both return values (e.g., missing_keys, unexpected = base_model.load_state_dict(...)) and log when missing_keys is non-empty as well (include the count and/or the keys) so that load-state diagnostics from load_state_dict, not just unexpected keys, are preserved; update the warning to mention missing_keys and unexpected when present and keep returning base_model as before.forgelm/fit_check.py (1)
340-346: Remove the unusedquantparameter (SonarCloud).
_build_recommendationsdeclaresquantbut never reads it — the 4-bit recommendation at line 374 readsm.load_in_4bitdirectly. Drop the parameter and the corresponding argument at the call site (line 326).♻️ Proposed fix
def _build_recommendations( config: Any, total_gb: float, available_gb: Optional[float], - quant: str, grad_ckpt: bool, ) -> List[str]:recommendations = _build_recommendations( config=config, total_gb=total_gb, available_gb=available_gb, - quant=quant, grad_ckpt=grad_ckpt, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/fit_check.py` around lines 340 - 346, The _build_recommendations function declares an unused parameter quant; remove quant from the function signature and from every call site that passes it (the caller that currently forwards quant to _build_recommendations), and update any tests or callers accordingly. Specifically, edit _build_recommendations to drop the quant parameter and ensure internal logic uses m.load_in_4bit (as it already does) instead of relying on quant; then find and remove the corresponding argument where _build_recommendations(...) is invoked so signatures match.tests/test_benchmark.py (1)
158-161: Optional consistency: align line 160 withpytest.approx.The surrounding assertions in this test were normalized to
pytest.approx, but line 160 keeps the manualabs(... - 0.60) < 0.01form. Aligning it keeps the file uniform.♻️ Suggested change
- assert abs(result.average_score - 0.60) < 0.01 + assert result.average_score == pytest.approx(0.60, abs=0.01)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_benchmark.py` around lines 158 - 161, Replace the manual tolerance check for result.average_score with pytest.approx to match the other assertions: change the assertion that uses abs(result.average_score - 0.60) < 0.01 to use pytest.approx(0.60, abs=0.01) so the test uniformly uses pytest.approx for numeric comparisons referencing result.average_score.forgelm/compliance.py (1)
425-434: Heads-up: removing theconfigpositional parameter is a breaking change for external callers.All in-tree call sites (
forgelm/trainer.py:912,forgelm/cli.py:476, tests) are already updated, so this is safe for the repo. Ifexport_compliance_artifactsis considered part of the public API, consider noting the signature change in CHANGELOG under a### Changed(or### Breaking) entry for0.4.0so downstream users who imported it directly aren't surprised by aTypeError.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/compliance.py` around lines 425 - 434, Removing the positional parameter `config` from the function `export_compliance_artifacts` is a breaking API change for downstream callers; either restore backward compatibility by reintroducing a `config` parameter (e.g., def export_compliance_artifacts(manifest, output_dir, config=None) and ignore it) or document the breaking change by adding a clear entry to the CHANGELOG for version 0.4.0 under a "Changed" or "Breaking" section that states the signature changed (mention `export_compliance_artifacts(manifest, output_dir)` no longer accepts `config`), so external users know to update their imports/calls.tests/test_cli_phase10.py (1)
350-350: Usepytest.approxfor float comparison.SonarCloud flags this float-equality assertion. The PR description notes other test floats were converted to
pytest.approx; this one was missed.♻️ Proposed fix
- assert result["estimated_gb"] == 35.0 + assert result["estimated_gb"] == pytest.approx(35.0)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cli_phase10.py` at line 350, The test uses exact float equality for result["estimated_gb"]; change the assertion to use pytest.approx to avoid fragile float comparisons (e.g. assert result["estimated_gb"] == pytest.approx(35.0)) and ensure pytest is imported in tests/test_cli_phase10.py if not already; update the assertion referencing result["estimated_gb"] accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Around line 40-44: Replace the British-form "afterwards" with American-form
"afterward" in the CHANGELOG sentence that reads "`forgelm export ... --quant
q4_k_m model.gguf` produces an intermediate `model.f16.gguf`; run
`llama-quantize model.f16.gguf model.gguf Q4_K_M` afterwards to obtain the
K-quant." Update that sentence to use "afterward" for consistency with the
repo's American English documentation (search for the phrase containing
"llama-quantize model.f16.gguf model.gguf Q4_K_M" to locate the exact line).
- Line 84: Update the release heading for "## [0.3.1rc1] — 2026-04-25 (included
in v0.4.0 branch)" to use the correct release date "2026-03-28"; locate the
heading string "## [0.3.1rc1]" in CHANGELOG.md and replace the trailing date
portion so the line reads "## [0.3.1rc1] — 2026-03-28 (included in v0.4.0
branch)" to reflect the actual RC commit date.
In `@forgelm/fit_check.py`:
- Around line 97-114: The regex in the loop that scans _MODEL_SIZE_HINTS can
misclassify fractional sizes like "1.7B" because the negative lookbehind only
blocks digits; update the pattern used in the re.search inside the for loop (the
code that checks fragments against name_lower) to also reject a preceding
decimal point or other non-separator characters (e.g., use a negative lookbehind
that includes '.' such as (?<![\d.]) or otherwise ensure the fragment is
preceded by a separator or start-of-string) while keeping the existing negative
lookahead (e.g., (?![0-9a-z])) so fragments like "7b" aren’t matched inside
"1.7b".
In `@README.md`:
- Line 200: Update the stale test-count line in README.md by replacing the
existing string "tests/ # 427+ unit tests across 30 test files"
with the final suite result (e.g., "tests/ # 430 passed (+34
skipped) across 30 test files") so the README reflects the PR description;
locate the exact line containing "tests/ # 427+ unit tests across
30 test files" and edit it accordingly.
---
Outside diff comments:
In `@tests/test_safety_advanced.py`:
- Around line 231-242: The test is patching transformers.pipeline but safety.py
imported pipeline into its local namespace, so the patch doesn't intercept the
call; update the test to patch the exact symbol used by the code by replacing
patch("transformers.pipeline", side_effect=mock_pipeline) with
patch("forgelm.safety.pipeline", side_effect=mock_pipeline) so that the call
inside safety_mod.run_safety_evaluation uses the mock_pipeline; keep the
existing mock_pipeline and torch.cuda.empty_cache patching unchanged.
---
Nitpick comments:
In `@forgelm/compliance.py`:
- Around line 425-434: Removing the positional parameter `config` from the
function `export_compliance_artifacts` is a breaking API change for downstream
callers; either restore backward compatibility by reintroducing a `config`
parameter (e.g., def export_compliance_artifacts(manifest, output_dir,
config=None) and ignore it) or document the breaking change by adding a clear
entry to the CHANGELOG for version 0.4.0 under a "Changed" or "Breaking" section
that states the signature changed (mention
`export_compliance_artifacts(manifest, output_dir)` no longer accepts `config`),
so external users know to update their imports/calls.
In `@forgelm/fit_check.py`:
- Around line 340-346: The _build_recommendations function declares an unused
parameter quant; remove quant from the function signature and from every call
site that passes it (the caller that currently forwards quant to
_build_recommendations), and update any tests or callers accordingly.
Specifically, edit _build_recommendations to drop the quant parameter and ensure
internal logic uses m.load_in_4bit (as it already does) instead of relying on
quant; then find and remove the corresponding argument where
_build_recommendations(...) is invoked so signatures match.
In `@forgelm/merging.py`:
- Around line 119-122: The call to base_model.load_state_dict(merged_state,
strict=False) currently discards the first return value (missing_keys) and only
logs unexpected keys; change this to capture both return values (e.g.,
missing_keys, unexpected = base_model.load_state_dict(...)) and log when
missing_keys is non-empty as well (include the count and/or the keys) so that
load-state diagnostics from load_state_dict, not just unexpected keys, are
preserved; update the warning to mention missing_keys and unexpected when
present and keep returning base_model as before.
In `@tests/test_benchmark.py`:
- Around line 158-161: Replace the manual tolerance check for
result.average_score with pytest.approx to match the other assertions: change
the assertion that uses abs(result.average_score - 0.60) < 0.01 to use
pytest.approx(0.60, abs=0.01) so the test uniformly uses pytest.approx for
numeric comparisons referencing result.average_score.
In `@tests/test_cli_phase10.py`:
- Line 350: The test uses exact float equality for result["estimated_gb"];
change the assertion to use pytest.approx to avoid fragile float comparisons
(e.g. assert result["estimated_gb"] == pytest.approx(35.0)) and ensure pytest is
imported in tests/test_cli_phase10.py if not already; update the assertion
referencing result["estimated_gb"] accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2136c898-8b06-4bb9-bf52-b42188cc3afa
📒 Files selected for processing (33)
.github/workflows/nightly.yml.github/workflows/publish.ymlCHANGELOG.mdREADME.mdforgelm/chat.pyforgelm/cli.pyforgelm/compliance.pyforgelm/config.pyforgelm/deploy.pyforgelm/export.pyforgelm/fit_check.pyforgelm/inference.pyforgelm/merging.pyforgelm/model.pyforgelm/trainer.pyforgelm/wizard.pytests/test_alignment.pytests/test_benchmark.pytests/test_cli_phase10.pytests/test_compliance.pytests/test_config.pytests/test_cost_estimation.pytests/test_deploy.pytests/test_export.pytests/test_fit_check.pytests/test_galore.pytests/test_integration_smoke.pytests/test_judge_functions.pytests/test_long_context.pytests/test_safety_advanced.pytests/test_synthetic.pytests/test_trainer.pytests/test_webhook.py
✅ Files skipped from review due to trivial changes (4)
- tests/test_galore.py
- tests/test_config.py
- tests/test_webhook.py
- tests/test_synthetic.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/test_fit_check.py
- tests/test_export.py
Closes the SonarCloud cognitive-complexity findings (≥15 → all <15) without changing behavior. Each function was decomposed into private helpers with single responsibilities; 430 tests still pass with no regressions. Small (CC 16–24): - webhook._send → _resolve_url + _post_payload + _mask - synthetic.generate → _generate_one + _write_output - synthetic._call_file_teacher → _load_file_responses - model._apply_moe_expert_quantization → _is_frozen_expert_weight + _recast_expert_weight - model._freeze_unselected_experts → _parse_selected_experts + _expert_index_in_name - judge.run_judge_evaluation → _load_eval_prompts + _load_local_judge + _generate_response + _clip_judge_score + _save_judge_results - trainer._collect_resource_usage → _collect_gpu_info + _train_runtime_seconds + _resolve_cost_per_hour Medium (CC 29–32): - benchmark.run_benchmark → _extract_task_score + _parse_results + _save_benchmark_json (plus _ACC_METRIC_KEYS constant) - model_card.generate_model_card → _build_metrics_table + _build_benchmark_section + _build_safety_section + _format_method + _build_extra_tags - model.get_model_and_tokenizer → _load_unsloth + _load_tokenizer + _device_map_for + _build_model_kwargs + _apply_moe_config + _build_lora_config Large (CC 39–66): - cli._run_dry_run → _galore_dry_run_fields + _evaluation_dry_run_fields + _compliance_dry_run_fields + _build_dry_run_result - config._validate_consistency → _warn_general_consistency + _warn_high_risk_compliance + _validate_trainer_type + _validate_galore + _validate_distributed - cli.main → _dispatch_subcommand + _maybe_run_wizard + _load_config_or_exit + _apply_offline_flag + _maybe_run_no_train_mode + _run_training_pipeline - trainer.train → _measure_baseline_loss + _apply_benchmark_result + _apply_resource_usage + _apply_safety_result + _apply_judge_result + _finalize_artifacts + _handle_human_approval_gate + _run_training_pipeline (the orchestrator stays a thin try/except) - wizard.run_wizard → _collect_rope_scaling + _collect_neftune_alpha + _collect_webhook_config + _collect_safety_config + _collect_evaluation_config + _collect_compliance_config + _save_config_to_file + _print_wizard_summary - data._make_batch_processor → _process_text_format + _process_messages_format + _format_user_assistant_row + _process_user_assistant_format - data.prepare_dataset → _apply_mix_ratio + _merge_extra_datasets + _ensure_validation_split + _validate_trainer_columns + _shuffle_and_passthrough + _passthrough_for_trainer + _passthrough_multimodal + _format_sft_dataset - safety.run_safety_evaluation → _load_safety_prompts + _generate_safety_responses + _release_model_from_gpu + _classify_one_response + _classify_responses + _evaluate_safety_gates + _save_safety_results Verification: - ruff format && ruff check: clean - pytest tests/: 430 passed, 34 skipped (no behavioral change) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verified findings: - CHANGELOG: "afterwards" → "afterward" (American English consistency) - CHANGELOG: v0.3.1rc1 release date corrected to 2026-03-28 (matches the actual `bump version to 0.3.1rc1` commit 7f827c0); v0.4.0 gains a Breaking section documenting that export_compliance_artifacts dropped its unused `config` parameter - README: stale test-count line updated to "430 passed (+34 skipped) across 30 test files" - fit_check: regex lookbehind now blocks `.` so size-hint matching doesn't classify "1.7b" as the 7B architecture (also caught Llama-3.2 fractional sizes) - fit_check._build_recommendations: drop unused `quant` param + matching call site (body uses `m.load_in_4bit` directly) - merging: restore missing/unexpected unpack from load_state_dict and log both directions of mismatch — useful when diagnosing a bad merge - test_benchmark: replace `abs(... - 0.60) < 0.01` with `pytest.approx(0.60, abs=0.01)` for consistency - test_cli_phase10: estimated_gb assertion uses pytest.approx instead of exact float equality Skipped: - test_safety_advanced patch target — reviewer suggestion is incorrect. `safety.py` does NOT have a top-level `from transformers import pipeline`; the import lives inside `run_safety_evaluation`'s body, so `forgelm.safety.pipeline` isn't a valid attribute. The current `patch("transformers.pipeline", ...)` is the right target. Verification: - ruff format && ruff check: clean - pytest tests/: 430 passed, 34 skipped (no behavioral change) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three actionable findings + one config-level suppression for the bulk B101 noise: - pyproject.toml: add a [tool.bandit] section with exclude_dirs=["tests", ...] and skips=["B101"]. Codacy's Bandit pass was reporting "use of assert" on every test assertion (~40 hits); pytest *requires* asserts, so the right fix is config, not per-call suppression. - chat.py: narrow the bare `except Exception` around `_console.file.flush()` to `(OSError, AttributeError)` — the actual failure modes (closed stdout, console without `.file`). Closes Bandit B110. - export.py: annotate the `subprocess.run(cmd, ...)` call with `# noqa: S603 # nosec B603` plus a comment explaining why it's safe (list-form, no shell=True, paths under user control). Closes B603. - test_deploy.py: replace the hardcoded "/tmp/Modelfile" string in test_deploy_result_dataclass with `tmp_path / "Modelfile"`. Closes B108 on that one flagged line; other /tmp references in tests/ are now covered by the Bandit exclude_dirs. Verification: - ruff format && ruff check: clean - pytest tests/: 430 passed, 34 skipped (no behavioral change) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
forgelm/safety.py (1)
105-124:⚠️ Potential issue | 🟡 Minor
del modelinside_release_model_from_gpuonly drops the local parameter, not the caller's reference.The
modelparameter is a local binding —del modelhere decrements the refcount by one, but the caller (run_safety_evaluationat Line 338) still holds the original reference, so the underlying model object (and any GPU-resident tensors) won't actually be freed until the caller's reference is also released. The.cpu()call does most of the actual work; thedelis misleading. Either drop thedel, or have the caller release its reference (e.g.,model = Noneafter the helper).- try: - model.cpu() - except Exception: - pass - del model - gc.collect() + try: + model.cpu() + except Exception: + pass + gc.collect()…and in the caller:
- _release_model_from_gpu(model) + _release_model_from_gpu(model) + model = None # drop caller's reference so the object can actually be GC'd🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/safety.py` around lines 105 - 124, The helper _release_model_from_gpu is deleting only its local binding (del model) which doesn't drop the caller's reference; remove the misleading del model from _release_model_from_gpu and rely on model.cpu(), gc.collect(), and torch.cuda.empty_cache(); then update the caller run_safety_evaluation to explicitly release its reference after calling the helper (e.g., set its model variable to None) so the object can be freed and VRAM reclaimed.
🧹 Nitpick comments (13)
forgelm/synthetic.py (3)
255-271: Silent failure when seed_file is missing/invalid for thefilebackend.
_load_file_responsesreturns an empty dict ifseed_fileis unset or not a file, and also silently swallowsJSONDecodeErrorper line. Combined with_call_file_teachercaching that empty dict inself._teacher, every prompt will produce an empty response and the entire run fails with no actionable signal beyond per-prompt "empty response" entries inresult.errors. Consider logging a warning (or raising) when the seed file is unusable, and at least debug-logging skipped malformed lines.♻️ Suggested change
def _load_file_responses(self) -> dict: """Read the pre-generated prompt→response map from seed_file (one-shot).""" responses: dict = {} seed_file = self.synth_cfg.seed_file if not seed_file or not os.path.isfile(seed_file): + logger.warning( + "File teacher backend: seed_file %r missing or not a file; no responses loaded.", + seed_file, + ) return responses with open(seed_file, encoding="utf-8") as f: - for line in f: + for lineno, line in enumerate(f, start=1): + if not line.strip(): + continue try: data = json.loads(line) except json.JSONDecodeError: + logger.debug("Skipping malformed JSON in %s at line %d", seed_file, lineno) continue p = data.get("prompt", "") r = data.get("response") or data.get("completion") or data.get("output", "") if p and r: responses[p] = r + logger.info("File teacher: loaded %d prompt→response pairs from %s", len(responses), seed_file) return responses🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/synthetic.py` around lines 255 - 271, The _load_file_responses function currently returns an empty dict silently when synth_cfg.seed_file is missing/invalid and swallows JSONDecodeError per line, which leads _call_file_teacher to cache an empty teacher and produce empty responses; update _load_file_responses to log a warning when synth_cfg.seed_file is unset or not a file (including the seed_file value) and to debug-log each skipped malformed line with the JSONDecodeError and line contents or line number; optionally (configurable) raise an error instead of returning an empty dict if the seed file is required. Ensure these logs reference synth_cfg.seed_file and that _call_file_teacher handles a non-empty failure mode (no caching of an empty dict) so callers see an actionable error.
119-132: Rate-limit sleep also runs after the final prompt.Minor:
time.sleep(self.synth_cfg.api_delay)executes inside the loop on every iteration including the last, addingapi_delayseconds to total runtime for no benefit. Also, the sleep happens regardless of whether_generate_onemade a network call (it doesn't on early failures, but the call still went through_call_teacher, so this is mostly fine). Consider skipping the sleep on the final iteration:♻️ Suggested change
- for i, prompt in enumerate(prompts): + last_idx = len(prompts) - 1 + for i, prompt in enumerate(prompts): entry = self._generate_one(prompt, i, result) if entry is not None: generated.append(entry) - if rate_limit: + if rate_limit and i != last_idx: time.sleep(self.synth_cfg.api_delay)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/synthetic.py` around lines 119 - 132, The loop in the synthetic generation routine sleeps after every prompt including the last, adding an unnecessary delay; update the for-loop that calls self._generate_one(prompt, i, result) to only call time.sleep(self.synth_cfg.api_delay) when rate_limit is true AND the current index is not the final one (e.g., check i < len(prompts) - 1) so the delay is skipped after the last prompt; keep references to self.synth_cfg.api_delay, rate_limit, prompts, and _generate_one when making the change.
71-74: Log a warning on empty response for consistency.The exception path at Line 68 logs a warning, but the empty-response branch silently updates
result.errorswithout an equivalent log. This makes silent-failure modes (e.g., a teacher returning blank content, or_call_file_teachernot finding a key) harder to debug.♻️ Suggested change
if not response: result.failed += 1 result.errors.append(f"Prompt {idx}: empty response") + logger.warning("Generation returned empty response for prompt %d", idx) return None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/synthetic.py` around lines 71 - 74, When handling an empty response (the branch checking "if not response"), add a warning log mirroring the exception path's behavior so the silent failure is visible; call the same logger used elsewhere (e.g., logger.warning or processLogger.warning) to emit a message containing the prompt index (idx) and context before updating result.failed and result.errors and returning None, leaving the rest of the logic unchanged.forgelm/data.py (2)
236-258: Validation split is shuffled in passthrough — review the implications.
_shuffle_and_passthroughshuffles every split whenshuffle=True, includingvalidation(andtestwhen present). For preference/KTO/GRPO trainers (and multimodal passthrough), this means validation order changes per run, which can make eval-loss metrics harder to compare across runs and could interact withEarlyStoppingCallbackreproducibility. The existing_format_sft_datasethas the same behavior, so this preserves prior semantics — but if the goal is reproducible evaluation, consider only shuffling the train split.-def _shuffle_and_passthrough(dataset, shuffle: bool) -> Dict[str, Any]: - """Return splits as-is (optionally shuffled) — for trainers that need raw columns.""" - out: Dict[str, Any] = {} - for split in dataset: - current = dataset[split] - if shuffle: - current = current.shuffle(seed=42) - out[split] = current - return out +def _shuffle_and_passthrough(dataset, shuffle: bool) -> Dict[str, Any]: + """Return splits as-is (train optionally shuffled) — for trainers that need raw columns.""" + out: Dict[str, Any] = {} + for split in dataset: + current = dataset[split] + if shuffle and split == "train": + current = current.shuffle(seed=42) + out[split] = current + return out🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/data.py` around lines 236 - 258, The current _shuffle_and_passthrough shuffles every split when shuffle=True (including validation/test), which harms reproducible evaluation; change its behavior so only the "train" split is shuffled: inside _shuffle_and_passthrough iterate splits but only call current.shuffle(seed=42) when split == "train" and shuffle is True, leaving other splits untouched; keep _passthrough_for_trainer as-is so trainers still return the passthrough dataset.
100-119: Edge case: empty batch could be misclassified as a missing-column error.In
_process_user_assistant_format, when the dataset legitimately containsUser/Assistantcolumns but a particular batch happens to be empty,examples["User"]returns[], and theif not user_texts or not asst_texts:guard at Line 104 triggers the "missing columns"KeyErrorpath. In practice HFdatasets.map(batched=True)should not pass empty batches, so this is unlikely to fire — but the check would be more semantically correct as a key-presence test.♻️ Optional refactor
- has_system = "System" in examples - user_texts = examples.get("User", examples.get("instruction", [])) - asst_texts = examples.get("Assistant", examples.get("output", examples.get("response", []))) - - if not user_texts or not asst_texts: + has_system = "System" in examples + user_key = "User" if "User" in examples else ("instruction" if "instruction" in examples else None) + asst_key = next((k for k in ("Assistant", "output", "response") if k in examples), None) + if user_key is None or asst_key is None: fmt = _detect_dataset_format(list(examples.keys())) raise KeyError( f"Dataset must contain 'User'/'instruction' and 'Assistant'/'output' columns, " f"or a pre-formatted 'text' column. " f"Found: {list(examples.keys())}. " f"Detected format: {fmt['description']}. " f"Suggested trainer: {fmt['suggested_trainer']}" ) + user_texts = examples[user_key] + asst_texts = examples[asst_key]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/data.py` around lines 100 - 119, The code in _process_user_assistant_format currently treats empty lists as missing columns because it checks truthiness of user_texts/asst_texts; change that to check for key presence instead: determine has_user by checking "User" in examples or "instruction" in examples, and has_assistant by checking "Assistant" in examples or "output" in examples or "response" in examples; only raise the KeyError if those keys are absent (not when the retrieved lists are empty). Keep using examples.get(...) to fetch user_texts and asst_texts and preserve has_system ("System" in examples) for sys_texts handling.forgelm/judge.py (1)
108-121: Duplicate prompt-loading logic acrossjudge.pyandsafety.py.
_load_eval_promptshere is byte-for-byte equivalent to_load_safety_promptsinforgelm/safety.py(Lines 70–83). Since the refactor's goal is centralization, consider extracting a shared_load_jsonl_prompts(path)(e.g., into a smallforgelm/_io.pyutility) used by both modules.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/judge.py` around lines 108 - 121, The prompt-loading logic in _load_eval_prompts (judge.py) is duplicated from _load_safety_prompts (safety.py); extract the common behavior into a new shared helper named _load_jsonl_prompts(path) (e.g., in a small forgelm._io module), implement the same semantics (read lines, skip empty, try json.loads and return data.get("prompt", data.get("text", "")) or fallback to raw line), then replace calls to _load_eval_prompts and _load_safety_prompts to import and use _load_jsonl_prompts, removing the duplicated functions.forgelm/wizard.py (2)
194-208:_save_config_to_filefallback can lose data ifrequested_filenameand fallback collide.When the user types
my_config.yaml(the default) forrequested_filenameand the open fails (e.g., permission denied in CWD), the fallback also triesmy_config.yamlin CWD and re-raises an unmaskedOSError. Minor — consider guarding the fallback path or naming it after the input filename to make the failure mode obvious.- except OSError as e: - print(f"\n Error: Could not save config to {requested_filename}: {e}") - fallback = "my_config.yaml" + except OSError as e: + print(f"\n Error: Could not save config to {requested_filename}: {e}") + fallback = "my_config.yaml" if requested_filename != "my_config.yaml" else "forgelm_config_fallback.yaml" with open(fallback, "w") as f: yaml.dump(config, f, default_flow_style=False, sort_keys=False) print(f" Saved to fallback location: {fallback}") return fallback🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/wizard.py` around lines 194 - 208, The _save_config_to_file function currently retries writing to the same path when requested_filename equals the hardcoded fallback "my_config.yaml", which can re-raise the same OSError; change the fallback logic so it does not attempt the identical path and handle fallback errors: if requested_filename == fallback, compute a different fallback name (e.g., append a suffix or use a temporary/unique filename based on requested_filename or timestamp) or choose a safe location (user home), then wrap the fallback write in its own try/except to log and surface a clear error if it also fails; refer to _save_config_to_file, requested_filename and fallback to locate the changes.
248-475: SonarCloud is still failing —run_wizardcognitive complexity is 22 (limit 15).The helper extractions are a great start, but Sonar still flags
run_wizardas too complex. The remaining hotspots are the model-selection block (Lines 282–291, manualint(...)parsing withtry/except), the strategy-driven flag derivation (Lines 295–307), the training-objective parsing (Lines 321–332), and the GaLore branch (Lines 365–383). Extracting four more thin helpers should drop complexity well under 15.♻️ Suggested decomposition
+def _select_model() -> str: + print("\n[2/8] Model Selection") + print(" Popular models:") + for i, m in enumerate(POPULAR_MODELS, 1): + print(f" {i}) {m}") + print(f" {len(POPULAR_MODELS) + 1}) Custom (enter your own)") + raw = input(" Choice [1]: ").strip() + try: + idx = int(raw) if raw else 1 + if idx <= len(POPULAR_MODELS): + return POPULAR_MODELS[idx - 1] + return _prompt("Enter HuggingFace model name or local path") + except (ValueError, IndexError): + return POPULAR_MODELS[0] + + +def _select_strategy() -> tuple[bool, bool, bool]: + strategies = [ + "QLoRA (4-bit quantization — recommended, lowest memory)", + "LoRA (full precision — more memory, slightly better quality)", + "QLoRA + DoRA (4-bit + weight decomposition — best quality, more compute)", + "GaLore (full-parameter training via gradient projection — no adapters, lowest peak VRAM)", + ] + strategy = _prompt_choice("Choose your fine-tuning strategy:", strategies, default=1) + return "QLoRA" in strategy, "DoRA" in strategy, "GaLore" in strategy + + +def _select_trainer_type() -> str: + objectives = [...] # as today + obj = _prompt_choice("Choose your training objective:", objectives, default=1) + return obj.split(" — ")[0].strip().lower() + + +def _collect_galore_config() -> dict: + galore_rank = _prompt_int(...) + galore_optim = _prompt_choice(...) + return { + "galore_enabled": True, + "galore_optim": galore_optim.split(" ")[0], + "galore_rank": galore_rank, + "galore_update_proj_gap": 200, + "galore_scale": 0.25, + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/wizard.py` around lines 248 - 475, run_wizard is too cognitively complex; extract the four logical blocks into small helpers to reduce branching inside run_wizard: move the model selection try/except block into a new function (e.g., _select_model) that returns model_name given POPULAR_MODELS and input, move the strategy parsing (setting load_in_4bit, use_dora, use_galore) into _derive_strategy_flags(strategy) returning those booleans, move the objective->trainer_type selection and dataset_format_hint lookup into _parse_trainer_type(objective) returning trainer_type and the printed hint, and move the GaLore-specific interactive branch into _collect_galore_config() that returns galore_config (empty dict when not used) and prints the appropriate header; update run_wizard to call these helpers and remove the inlined logic so complexity is reduced.forgelm/model_card.py (1)
80-89: Metrics table includes namespaced keys (safety/*,resource/*,judge/*).
_build_metrics_tableonly filtersbenchmark/keys, so all the other namespaced metrics thattrainer.pymixes into the samemetricsdict (resource/peak_vram_gb,resource/estimated_cost_usd,safety/safe_ratio,judge/average_score, etc.) end up in the "Metrics" section alongsideeval_loss, with no grouping. If this is intentional, ignore; if you'd rather keep the table to "training metrics" only, broaden the filter.- lines = ["| Metric | Value |", "|--------|-------|"] - for key, value in sorted(metrics.items()): - if not key.startswith("benchmark/"): - lines.append(f"| {key} | {value:.4f} |") + _SECTION_PREFIXES = ("benchmark/", "resource/", "safety/", "judge/") + lines = ["| Metric | Value |", "|--------|-------|"] + for key, value in sorted(metrics.items()): + if not key.startswith(_SECTION_PREFIXES): + lines.append(f"| {key} | {value:.4f} |")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/model_card.py` around lines 80 - 89, _build_metrics_table currently only filters out keys starting with "benchmark/"; broaden the filter to also exclude namespaced keys like "safety/", "resource/", and "judge/" (or alternatively exclude any key containing a "/") so only top-level training metrics (e.g., "eval_loss") are rendered. Update the loop that iterates sorted(metrics.items()) to skip keys that start with any of the prefixes ["benchmark/", "safety/", "resource/", "judge/"] (or use a simple "if '/' in key: continue") before formatting the table rows.forgelm/trainer.py (1)
621-634:_handle_human_approval_gatesetssuccess=Truebefore human review actually approves.When
require_human_approvalis set, the run is paused awaiting reviewer signoff, buttrain_result.success = True(Line 633) marks the result as successful before any approval signal arrives. Downstream consumers readingTrainResult.success(e.g., CI gating, dashboards, the audit log "pipeline.completed" event) cannot distinguish "succeeded and approved" from "pending approval". Consider introducing a tri-state (pending/approved/rejected) or a separateawaiting_approvalflag.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/trainer.py` around lines 621 - 634, The handler _handle_human_approval_gate currently marks train_result.success = True before approval; change this to reflect a pending state instead: add a tri-state field on TrainResult (e.g., status with values "pending"/"approved"/"rejected") or a separate boolean awaiting_approval, update _handle_human_approval_gate to set train_result.status = "pending" (or train_result.awaiting_approval = True) and do NOT set TrainResult.success there, and update any downstream checks/audit emits (e.g., consumers reading TrainResult.success, pipeline.completed event logic) to treat "pending"/awaiting_approval as distinct from success so only the final approval flow sets status = "approved"/success = True.forgelm/cli.py (2)
813-856: Cognitive complexity right at threshold (16 vs 15).SonarCloud flags this function as 1 over the cognitive-complexity budget. The branching is concentrated in the dual error-handling block (
ImportError+Exception, each branching onjson_output) and the pre-train conditionals. Optionally, hoist the error-emission into a tiny helper to satisfy the threshold without restructuring the happy path.Optional refactor: extract error reporter
+def _report_pipeline_error(message: str, *, json_output: bool, log_traceback: bool = False) -> None: + if json_output: + print(json.dumps({"success": False, "error": message})) + elif log_traceback: + logger.exception("Training pipeline failed.") + else: + logger.error("%s", message) + def _run_training_pipeline(config, args, json_output: bool) -> None: ... except ImportError as e: - if json_output: - print(json.dumps({"success": False, "error": f"Missing dependency: {e}"})) - else: - logger.error("Missing dependency: %s. Check your installation.", e) + _report_pipeline_error(f"Missing dependency: {e}. Check your installation.", json_output=json_output) sys.exit(EXIT_TRAINING_ERROR) except Exception as e: - if json_output: - print(json.dumps({"success": False, "error": str(e)})) - else: - logger.exception("Training pipeline failed.") + _report_pipeline_error(str(e), json_output=json_output, log_traceback=True) sys.exit(EXIT_TRAINING_ERROR)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/cli.py` around lines 813 - 856, The function _run_training_pipeline has cognitive complexity just over the threshold due to duplicated error-handling branching on json_output; extract a small helper function (e.g., _report_training_error(json_output, payload: dict, log_msg: str, exc: Exception, exit_code: int)) to centralize the ImportError and Exception handling: move the JSON print vs logger calls into that helper and call it from both except blocks, passing a descriptive payload (including the exception) and the proper exit code; keep the rest of _run_training_pipeline unchanged so the happy path remains intact and only the error emission logic is hoisted to reduce branching.
47-182: Subparser-level flag re-declaration withargparse.SUPPRESSresolves the past invocation-order issue.Each subparser now accepts
--quiet,--log-level, and (where relevant)--output-formatdirectly, withdefault=argparse.SUPPRESSso the top-level defaults aren't clobbered when the user omits them on the subcommand.forgelm export ./model --output m.gguf --output-format jsonnow works. Nice.Minor (SonarCloud): the help strings
"Suppress INFO logs."and"Set logging verbosity (default: INFO)."are duplicated 3 and 4 times respectively. Optional cleanup is to hoist them into module-level constants and reuse, or extract a small_add_common_subparser_flags(p, include_output_format=True)helper that registers all three flags in one place.Optional helper extraction
+_HELP_QUIET = "Suppress INFO logs." +_HELP_LOG_LEVEL = "Set logging verbosity (default: INFO)." +_HELP_OUTPUT_FORMAT = "Output format: text (default) or json." +_LOG_LEVEL_CHOICES = ["DEBUG", "INFO", "WARNING", "ERROR"] + +def _add_common_subparser_flags(p, *, include_output_format: bool) -> None: + if include_output_format: + p.add_argument( + "--output-format", + type=str, + default=argparse.SUPPRESS, + choices=["text", "json"], + help=_HELP_OUTPUT_FORMAT, + ) + p.add_argument("-q", "--quiet", action="store_true", default=argparse.SUPPRESS, help=_HELP_QUIET) + p.add_argument( + "--log-level", + type=str, + default=argparse.SUPPRESS, + choices=_LOG_LEVEL_CHOICES, + help=_HELP_LOG_LEVEL, + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/cli.py` around lines 47 - 182, Duplicate help strings for the common flags (--quiet, --log-level, and --output-format) appear across the subparser builders; extract them into a single helper (e.g., _add_common_subparser_flags) or module-level constants and call that helper from _add_chat_subcommand, _add_export_subcommand, and _add_deploy_subcommand to register the shared arguments once, reusing the same help text and default=argparse.SUPPRESS for each flag so behavior is preserved while removing repeated literal strings.forgelm/config.py (1)
472-479: Redundant truthiness check on requiredlorafield.
ForgeConfig.lora: LoraConfigModelis a required field (noOptional/default), soself.lorais always truthy here. Theself.lora andguard is defensive but can be dropped for clarity.Proposed simplification
- if self.lora and self.lora.r > 0: + if self.lora.r > 0:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/config.py` around lines 472 - 479, The conditional currently checks "self.lora and self.lora.r > 0" even though ForgeConfig.lora is a required LoraConfigModel and thus always truthy; simplify the guard by removing the redundant truthiness check and only test the rank (i.e., replace the condition with "self.lora.r > 0") so the logger.info block (referencing self.training.galore_rank and self.lora.r) remains executed only when the LoRA adapter rank is positive.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@forgelm/judge.py`:
- Around line 149-158: The current clipping logic in _clip_judge_score treats
sentinel parse/transport failures (score==0 returned by
_call_api_judge/_call_local_judge) as a real 1.0 score, which hides failures and
biases avg_score; change the API helper functions (_call_api_judge and
_call_local_judge) to return None (or a distinct sentinel) on parse/transport
errors instead of {"score":0,...}, update _clip_judge_score to accept
Optional[float] and return None for None inputs (or return Optional[float] and
not clip the sentinel), and update the averaging logic where avg_score is
computed to skip None results (or explicitly count failures separately) so
invalid responses are not included in the numeric average and are logged/handled
distinctly.
In `@forgelm/safety.py`:
- Around line 199-214: In the except block that creates the fallback detail dict
in forgelm/safety.py (the block that currently sets detail = {"prompt":...,
"response":..., "label":"error","confidence":0.0,"safe":False}), mark classifier
failures as low confidence as well by adding low_confidence: True (or an error
flag) so that the existing low_confidence_count increment and manual-review
logic see these failures; update the synthesized detail created in that except
branch and ensure it remains consistent with how confidence_scores, unsafe_count
and low_confidence_count are computed.
In `@forgelm/trainer.py`:
- Around line 821-842: The fuzzy-match in _resolve_cost_per_hour uses a
dict-order loop over self._GPU_PRICING which can pick a shorter, less-specific
known_gpu before a longer, more-specific one; change the fuzzy matching to
iterate over sorted(self._GPU_PRICING.items(), key=lambda kv: len(kv[0]),
reverse=True) so candidates are checked by descending known_gpu name length (use
lowercased comparisons as before) and set usage["cost_source"]="fuzzy_match"
only when the longest matching candidate is found; this prevents substring
collisions like "NVIDIA H1" matching "NVIDIA H100..." and ensures more specific
_GPU_PRICING keys win.
- Around line 509-607: _apply_benchmark_result unconditionally reverts failed
benchmarks while _apply_safety_result and _apply_judge_result respect
self.config.evaluation.auto_revert; update _apply_benchmark_result to mirror
their gating: after logging evaluation_completed, if benchmark_result.passed
return True; otherwise check if self.config.evaluation and
self.config.evaluation.auto_revert—if false, do not call self._revert_model and
simply return True (or only mark failure in train_result without deleting
artifacts); if true, log _EVT_REVERT_TRIGGERED, call
self._revert_model(final_path, reason), set train_result.success = False and
train_result.reverted = True, then return False. Ensure you reference
_apply_benchmark_result, _EVT_REVERT_TRIGGERED, _revert_model, and
self.config.evaluation.auto_revert when making the change.
In `@forgelm/webhook.py`:
- Around line 26-45: The _mask helper currently only truncates long URLs and can
leak credentials or signed query strings; update _mask(url: str) to parse the
URL (use urllib.parse.urlparse) and return a redacted string that omits userinfo
and query (e.g., "scheme://host" plus a safe path prefix like the first path
segment followed by "/..."), and ensure _post_payload uses this redacted value
in logger.warning instead of the raw URL; also stop logging full resp.text
(either remove it from the warning or run it through a redactor) so responses
cannot leak secrets (refer to _mask, _post_payload, logger.warning, resp.text).
---
Outside diff comments:
In `@forgelm/safety.py`:
- Around line 105-124: The helper _release_model_from_gpu is deleting only its
local binding (del model) which doesn't drop the caller's reference; remove the
misleading del model from _release_model_from_gpu and rely on model.cpu(),
gc.collect(), and torch.cuda.empty_cache(); then update the caller
run_safety_evaluation to explicitly release its reference after calling the
helper (e.g., set its model variable to None) so the object can be freed and
VRAM reclaimed.
---
Nitpick comments:
In `@forgelm/cli.py`:
- Around line 813-856: The function _run_training_pipeline has cognitive
complexity just over the threshold due to duplicated error-handling branching on
json_output; extract a small helper function (e.g.,
_report_training_error(json_output, payload: dict, log_msg: str, exc: Exception,
exit_code: int)) to centralize the ImportError and Exception handling: move the
JSON print vs logger calls into that helper and call it from both except blocks,
passing a descriptive payload (including the exception) and the proper exit
code; keep the rest of _run_training_pipeline unchanged so the happy path
remains intact and only the error emission logic is hoisted to reduce branching.
- Around line 47-182: Duplicate help strings for the common flags (--quiet,
--log-level, and --output-format) appear across the subparser builders; extract
them into a single helper (e.g., _add_common_subparser_flags) or module-level
constants and call that helper from _add_chat_subcommand,
_add_export_subcommand, and _add_deploy_subcommand to register the shared
arguments once, reusing the same help text and default=argparse.SUPPRESS for
each flag so behavior is preserved while removing repeated literal strings.
In `@forgelm/config.py`:
- Around line 472-479: The conditional currently checks "self.lora and
self.lora.r > 0" even though ForgeConfig.lora is a required LoraConfigModel and
thus always truthy; simplify the guard by removing the redundant truthiness
check and only test the rank (i.e., replace the condition with "self.lora.r >
0") so the logger.info block (referencing self.training.galore_rank and
self.lora.r) remains executed only when the LoRA adapter rank is positive.
In `@forgelm/data.py`:
- Around line 236-258: The current _shuffle_and_passthrough shuffles every split
when shuffle=True (including validation/test), which harms reproducible
evaluation; change its behavior so only the "train" split is shuffled: inside
_shuffle_and_passthrough iterate splits but only call current.shuffle(seed=42)
when split == "train" and shuffle is True, leaving other splits untouched; keep
_passthrough_for_trainer as-is so trainers still return the passthrough dataset.
- Around line 100-119: The code in _process_user_assistant_format currently
treats empty lists as missing columns because it checks truthiness of
user_texts/asst_texts; change that to check for key presence instead: determine
has_user by checking "User" in examples or "instruction" in examples, and
has_assistant by checking "Assistant" in examples or "output" in examples or
"response" in examples; only raise the KeyError if those keys are absent (not
when the retrieved lists are empty). Keep using examples.get(...) to fetch
user_texts and asst_texts and preserve has_system ("System" in examples) for
sys_texts handling.
In `@forgelm/judge.py`:
- Around line 108-121: The prompt-loading logic in _load_eval_prompts (judge.py)
is duplicated from _load_safety_prompts (safety.py); extract the common behavior
into a new shared helper named _load_jsonl_prompts(path) (e.g., in a small
forgelm._io module), implement the same semantics (read lines, skip empty, try
json.loads and return data.get("prompt", data.get("text", "")) or fallback to
raw line), then replace calls to _load_eval_prompts and _load_safety_prompts to
import and use _load_jsonl_prompts, removing the duplicated functions.
In `@forgelm/model_card.py`:
- Around line 80-89: _build_metrics_table currently only filters out keys
starting with "benchmark/"; broaden the filter to also exclude namespaced keys
like "safety/", "resource/", and "judge/" (or alternatively exclude any key
containing a "/") so only top-level training metrics (e.g., "eval_loss") are
rendered. Update the loop that iterates sorted(metrics.items()) to skip keys
that start with any of the prefixes ["benchmark/", "safety/", "resource/",
"judge/"] (or use a simple "if '/' in key: continue") before formatting the
table rows.
In `@forgelm/synthetic.py`:
- Around line 255-271: The _load_file_responses function currently returns an
empty dict silently when synth_cfg.seed_file is missing/invalid and swallows
JSONDecodeError per line, which leads _call_file_teacher to cache an empty
teacher and produce empty responses; update _load_file_responses to log a
warning when synth_cfg.seed_file is unset or not a file (including the seed_file
value) and to debug-log each skipped malformed line with the JSONDecodeError and
line contents or line number; optionally (configurable) raise an error instead
of returning an empty dict if the seed file is required. Ensure these logs
reference synth_cfg.seed_file and that _call_file_teacher handles a non-empty
failure mode (no caching of an empty dict) so callers see an actionable error.
- Around line 119-132: The loop in the synthetic generation routine sleeps after
every prompt including the last, adding an unnecessary delay; update the
for-loop that calls self._generate_one(prompt, i, result) to only call
time.sleep(self.synth_cfg.api_delay) when rate_limit is true AND the current
index is not the final one (e.g., check i < len(prompts) - 1) so the delay is
skipped after the last prompt; keep references to self.synth_cfg.api_delay,
rate_limit, prompts, and _generate_one when making the change.
- Around line 71-74: When handling an empty response (the branch checking "if
not response"), add a warning log mirroring the exception path's behavior so the
silent failure is visible; call the same logger used elsewhere (e.g.,
logger.warning or processLogger.warning) to emit a message containing the prompt
index (idx) and context before updating result.failed and result.errors and
returning None, leaving the rest of the logic unchanged.
In `@forgelm/trainer.py`:
- Around line 621-634: The handler _handle_human_approval_gate currently marks
train_result.success = True before approval; change this to reflect a pending
state instead: add a tri-state field on TrainResult (e.g., status with values
"pending"/"approved"/"rejected") or a separate boolean awaiting_approval, update
_handle_human_approval_gate to set train_result.status = "pending" (or
train_result.awaiting_approval = True) and do NOT set TrainResult.success there,
and update any downstream checks/audit emits (e.g., consumers reading
TrainResult.success, pipeline.completed event logic) to treat
"pending"/awaiting_approval as distinct from success so only the final approval
flow sets status = "approved"/success = True.
In `@forgelm/wizard.py`:
- Around line 194-208: The _save_config_to_file function currently retries
writing to the same path when requested_filename equals the hardcoded fallback
"my_config.yaml", which can re-raise the same OSError; change the fallback logic
so it does not attempt the identical path and handle fallback errors: if
requested_filename == fallback, compute a different fallback name (e.g., append
a suffix or use a temporary/unique filename based on requested_filename or
timestamp) or choose a safe location (user home), then wrap the fallback write
in its own try/except to log and surface a clear error if it also fails; refer
to _save_config_to_file, requested_filename and fallback to locate the changes.
- Around line 248-475: run_wizard is too cognitively complex; extract the four
logical blocks into small helpers to reduce branching inside run_wizard: move
the model selection try/except block into a new function (e.g., _select_model)
that returns model_name given POPULAR_MODELS and input, move the strategy
parsing (setting load_in_4bit, use_dora, use_galore) into
_derive_strategy_flags(strategy) returning those booleans, move the
objective->trainer_type selection and dataset_format_hint lookup into
_parse_trainer_type(objective) returning trainer_type and the printed hint, and
move the GaLore-specific interactive branch into _collect_galore_config() that
returns galore_config (empty dict when not used) and prints the appropriate
header; update run_wizard to call these helpers and remove the inlined logic so
complexity is reduced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 55db4858-cb59-4ba3-b1e1-441adff78279
📒 Files selected for processing (12)
forgelm/benchmark.pyforgelm/cli.pyforgelm/config.pyforgelm/data.pyforgelm/judge.pyforgelm/model.pyforgelm/model_card.pyforgelm/safety.pyforgelm/synthetic.pyforgelm/trainer.pyforgelm/webhook.pyforgelm/wizard.py
🚧 Files skipped from review as they are similar to previous changes (1)
- forgelm/model.py
CI failed on tests that don't run in the local sandbox (require torch + transformers). Two regressions slipped through Wave 1.2: - tests/test_moe_functions.py: drop the `num_experts` argument from the two `_apply_moe_expert_quantization(model, 4)` calls; the function signature is now single-arg. - tests/test_oom_recovery.py: drop the `final_path` positional from the two `trainer._export_compliance_if_needed(...)` calls; the method signature is now `(metrics, result)`. No production-code change. ruff clean; the local tests that *do* run still pass (430), and the affected suites collect cleanly under skip. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, gating) Real-bug fixes: - judge: replace the score=0 failure sentinel with score=None across _parse_judge_json / _call_api_judge / _call_local_judge. The old value was clipped up to 1.0 by _clip_judge_score and silently lowered the average. _clip_judge_score now passes None through; run_judge_evaluation averages only valid scores and reports the failure count separately. judge_results.json gains a per-row "judge_failed" boolean. - safety: mark classifier-crash rows as low_confidence=True (with a classifier_error message) so they show up in the manual-review surface instead of disappearing into the unsafe bucket. - safety: drop the misleading `del model` from _release_model_from_gpu (it only dropped the local arg binding, not the caller's reference). Document the contract in the docstring; the caller now sets its own `model = None` after calling the helper so VRAM can actually be reclaimed. - trainer: fuzzy GPU pricing match now sorts by descending name length so more-specific entries beat shorter substrings (e.g. "NVIDIA H100" wins over "NVIDIA H1"). - trainer: _apply_benchmark_result now respects evaluation.auto_revert (mirroring the safety/judge gates). Benchmark failure with auto_revert= False is recorded on TrainResult but no longer destroys saved artifacts. - webhook: _mask now redacts userinfo and query strings via urlparse, returning scheme://host plus a single safe path segment. Removed resp.text from the HTTP-error log so receiver-echoed payloads can no longer leak secrets — only status code is surfaced. Cleanups / nits also applied: - config: drop the redundant `self.lora and ...` truthiness check (lora is a required pydantic model, always truthy). - data: _shuffle_and_passthrough and _format_sft_dataset only shuffle the "train" split now — validation/test stay in source order so eval metrics line up sample-by-sample across runs. - data: _process_user_assistant_format distinguishes "wrong schema" (raise) from "empty batch" (return empty list) by checking key presence instead of column truthiness. - model_card: _build_metrics_table filters any namespaced key (anything with `/`) so safety/, resource/, judge/, benchmark/ each stay in their dedicated sections instead of cluttering the metrics table. - synthetic: log a warning on empty teacher response; skip the trailing api_delay after the last prompt; _load_file_responses now warns when seed_file is unset/missing and debug-logs malformed lines with line numbers so users have something to grep for. - wizard: _save_config_to_file no longer retries the same path on OSError (which would re-raise the same error when the user typed the literal fallback name). Falls back to a timestamped file in the user's home directory instead. - cli: extract _report_training_error to centralize the json-vs-human-log split for both ImportError and Exception paths. Extract _add_common_subparser_flags to deduplicate the --quiet / --log-level / --output-format triples across chat/export/deploy. Tests: - test_judge_functions: update three assertions for the new None sentinel (test_invalid_json_returns_none_sentinel, test_empty_string, test_api_timeout). Behavior change is intentional — score=0 was hiding failures in the average. Verification: - ruff format && ruff check: clean - pytest tests/: 430 passed, 34 skipped (no behavioral regression) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
forgelm/merging.py (1)
266-268: Explicitdim=Noneon 1-D tensor is redundant but acceptable.Since
flatis already 1-D from.flatten(),torch.quantilecomputes across all elements by default. Thedim=Noneparameter explicitly documents this intent, which may aid readability despite being functionally redundant.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/merging.py` around lines 266 - 268, The explicit dim=None on the 1-D tensor is redundant; change the call that computes threshold to use torch.quantile(flat.float(), trim_fraction) (leave the surrounding comment and the variables flat and threshold unchanged) so the quantile is computed across all elements without the unnecessary dim argument.forgelm/judge.py (1)
33-37:JudgeResult.scorestype no longer matches actual contents.The dataclass declares
scores: List[float], butrun_judge_evaluationnow appends_clip_judge_score(...)results, which can beNonefor parse/transport failures (Lines 258–261). Downstream consumers that iterateresult.scoresand treat values as floats may now hitTypeError/NoneTypeissues. Either widen the annotation toList[Optional[float]](and document the sentinel) or store only valid scores here and expose failures through the existingdetails/judge_failedfields.♻️ Proposed annotation widening
- scores: List[float] = field(default_factory=list) + scores: List[Optional[float]] = field(default_factory=list)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/judge.py` around lines 33 - 37, The dataclass field JudgeResult.scores is annotated as List[float] but run_judge_evaluation appends values from _clip_judge_score which can be None; change the model to reflect this by updating JudgeResult.scores to List[Optional[float]] (and add a short comment documenting None as a sentinel for parse/transport failures), or alternatively modify run_judge_evaluation to only append non-None scores and record failures into details/judge_failed instead—choose one approach and apply it consistently across usages that iterate result.scores (e.g., callers of run_judge_evaluation and any consumers that assume float values).forgelm/export.py (1)
197-227:formatparameter shadows the builtin.Both Codacy and SonarCloud flag the keyword-only
format: str = "gguf"parameter (and its reassignment on Line 226). Renaming tooutput_format(orfmt) avoids the shadowing and matches the existingsynthetic.pyconvention (output_format). Sinceexport_modelis a brand-new public API in this PR, now is the cheapest time to fix the name before downstream callers settle on it.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forgelm/export.py` around lines 197 - 227, The function export_model currently uses the parameter name format which shadows the built-in; rename the keyword-only parameter to output_format (or fmt) in the export_model signature and update all usages inside the function (including the lowercasing line currently written as format = format.lower()) and its docstring/annotations to output_format, ensuring you also update any references in tests and callers; follow the existing convention used in synthetic.py by using output_format to avoid builtin shadowing and keep behavior identical (call .lower() on the new name where needed).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@forgelm/fit_check.py`:
- Around line 262-281: When galore_enabled is true, gate out LoRA-specific
accounting (do not add adapter_params/adapter_gb into total_gb and do not set
trainable_params = adapter_params); instead compute a projected-parameter count
for GaLore (e.g., projected_params = num_params * (galore_rank /
arch["hidden_size"])) and use that projected_params when calling
_optimizer_state_gb and when computing optim_gb; keep galore_gb as the
projection buffer cost but ensure total_gb = base_gb + optim_gb + act_gb +
galore_gb (and only add adapter_gb/adapter_params/trainable_params when not
getattr(t, "galore_enabled", False)). Reference symbols: adapter_params,
adapter_gb, trainable_params, galore_enabled, galore_rank, arch["hidden_size"],
_optimizer_state_gb, galore_gb, total_gb, num_params.
- Around line 250-257: The current heuristic reads the QLoRA-only field
bnb_4bit_compute_dtype when m.load_in_4bit is False, which can misreport dtype;
update the logic that sets quant (the quant variable) to first check model-level
indicators like m.load_in_8bit (or getattr(m, "load_in_8bit", False)) and
m.config.torch_dtype or getattr(m, "torch_dtype", None) and use those to
determine quant ("8bit" for load_in_8bit, map torch dtypes like
"float16"/"fp16"->"fp16", "bfloat16"/"bf16"->"bf16", else default to "bf16");
only fall back to bnb_4bit_compute_dtype when load_in_4bit is True or when no
model-level dtype is available, so quant reflects actual storage format rather
than a QLoRA-only compute dtype.
In `@forgelm/judge.py`:
- Around line 272-286: The current logic treats a run where valid_scores == []
the same as a low average, producing "Average judge score ... below minimum";
update the post-evaluation logic in the judge/evaluation block that computes
valid_scores, avg_score and passed to first detect the no-valid-scores case
(len(valid_scores) == 0) and set passed = False with a distinct failure_reason
such as "No valid judge scores (all parses/requests failed)" and log this
condition (using logger.error or logger.warning) including failure_count and
eval_prompts context; only compute avg_score and the existing "below minimum"
failure_reason when there is at least one valid score. Ensure the variables
referenced are valid_scores, avg_score, passed, failure_reason, failure_count,
eval_prompts and min_score so the change is localized and clear.
---
Nitpick comments:
In `@forgelm/export.py`:
- Around line 197-227: The function export_model currently uses the parameter
name format which shadows the built-in; rename the keyword-only parameter to
output_format (or fmt) in the export_model signature and update all usages
inside the function (including the lowercasing line currently written as format
= format.lower()) and its docstring/annotations to output_format, ensuring you
also update any references in tests and callers; follow the existing convention
used in synthetic.py by using output_format to avoid builtin shadowing and keep
behavior identical (call .lower() on the new name where needed).
In `@forgelm/judge.py`:
- Around line 33-37: The dataclass field JudgeResult.scores is annotated as
List[float] but run_judge_evaluation appends values from _clip_judge_score which
can be None; change the model to reflect this by updating JudgeResult.scores to
List[Optional[float]] (and add a short comment documenting None as a sentinel
for parse/transport failures), or alternatively modify run_judge_evaluation to
only append non-None scores and record failures into details/judge_failed
instead—choose one approach and apply it consistently across usages that iterate
result.scores (e.g., callers of run_judge_evaluation and any consumers that
assume float values).
In `@forgelm/merging.py`:
- Around line 266-268: The explicit dim=None on the 1-D tensor is redundant;
change the call that computes threshold to use torch.quantile(flat.float(),
trim_fraction) (leave the surrounding comment and the variables flat and
threshold unchanged) so the quantile is computed across all elements without the
unnecessary dim argument.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cd0569bd-5b84-45f7-bae0-65b42338017e
📒 Files selected for processing (23)
CHANGELOG.mdREADME.mdforgelm/chat.pyforgelm/cli.pyforgelm/config.pyforgelm/data.pyforgelm/export.pyforgelm/fit_check.pyforgelm/judge.pyforgelm/merging.pyforgelm/model_card.pyforgelm/safety.pyforgelm/synthetic.pyforgelm/trainer.pyforgelm/webhook.pyforgelm/wizard.pypyproject.tomltests/test_benchmark.pytests/test_cli_phase10.pytests/test_deploy.pytests/test_judge_functions.pytests/test_moe_functions.pytests/test_oom_recovery.py
✅ Files skipped from review due to trivial changes (3)
- tests/test_benchmark.py
- CHANGELOG.md
- tests/test_deploy.py
🚧 Files skipped from review as they are similar to previous changes (4)
- forgelm/webhook.py
- pyproject.toml
- forgelm/model_card.py
- tests/test_judge_functions.py
…izard CC) Workflow security: - nightly.yml / publish.yml: drop the workflow-level `contents: read` permission and declare it per-job. Each job now spells out exactly the scopes it needs (test jobs: read; notify-failure: read+issues; publish- testpypi/pypi: read+id-token). export.py: - Extract `_CONVERTER_SCRIPT` constant (was the literal "convert_hf_to_ gguf.py" repeated 3×). - Rename the `format=` keyword-only parameter to `format_=` so it no longer shadows the builtin. Updates `cli._run_export_cmd` and `tests/test_export.py::test_unsupported_format_returns_failure` to the new keyword. - Refactor `export_model` (CC 25 → well below 15) by extracting: * `_validate_export_request` — early rejection for bad format/quant * `_resolve_kquant_path` — K-quant → .f16.gguf rerouting * `_build_converter_command` — argv assembly * `_run_converter` — subprocess invocation + error reporting * `_cleanup_merged_dir` — temporary adapter-merge cleanup deploy.py: - `generate_deploy_config` had 18 parameters (limit 13). Group the HF-Endpoints knobs (task/instance_size/instance_type/region/framework/ vendor) into a new `HFEndpointsOptions` dataclass and the sampling defaults (temperature/top_k/top_p) into `SamplingOptions`. Net 18→11. - Update `cli._run_deploy_cmd` to pass `hf_endpoints=HFEndpointsOptions (vendor=...)` and the test `test_json_roundtrip_for_hf_endpoints` similarly. CHANGELOG gains a Breaking entry for the new signature. wizard.py: - `run_wizard` was still over the CC threshold after Wave 4. Extract: * `_select_model` — model picker (POPULAR_MODELS + custom) * `_derive_strategy_flags` — decode the 4-way strategy menu * `_parse_trainer_type` — return (trainer_type, dataset_format_hint) * `_collect_galore_config` — the GaLore-specific interactive branch tests/test_inference.py: - Hoist the inline `_Tensor`/`_Scalar` stub classes (and their helper ops `softmax`/`log`/`sum`) to module scope. `_make_torch_stub` now just builds the torch module (CC 36 → well under threshold). - Five test methods that did `torch_stub, _Tensor, _ = _make_torch_stub()` now do `torch_stub = _make_torch_stub()` — the local-variable PascalCase warnings disappear because `_Tensor` is no longer rebound in test scope. - Replace the unused `tok` in `test_adapter_is_merged` with `_`. tests/test_export.py: - `test_no_error_when_manifest_missing` no longer hardcodes `/tmp/model.gguf`; uses `tmp_path / "model.gguf"`. CHANGELOG: document the two breaking-API changes (export_model `format=` → `format_=`; generate_deploy_config grouped dataclasses). Verification: - ruff format && ruff check: clean - pytest tests/: 430 passed, 34 skipped (no behavioral regression) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… judge edges)
Real bugs:
- fit_check: when galore_enabled=True, the previous code added
adapter_gb on top of galore_gb and sized optimizer_state_gb against
the LoRA adapter — double-counting and wrong baseline. The two
strategies are mutually exclusive: under GaLore we now derive
projected_params = num_params * (galore_rank / hidden_size) and use
THAT as trainable_params for _optimizer_state_gb; adapter_gb stays 0.
Under LoRA we keep the original adapter sizing and skip galore_gb.
- fit_check: quant detection used to read bnb_4bit_compute_dtype even
when load_in_4bit was False, mis-tagging plain bf16 runs. Now picks
storage dtype: load_in_8bit → "8bit"; otherwise checks
m.torch_dtype / m.config.torch_dtype with a small fp16/bf16/fp32 map;
bnb_4bit_compute_dtype is consulted only when load_in_4bit=True.
- judge: an entire run with all-failures used to fall through to
"Average judge score (0.00) below minimum (X.XX)". Now we detect
``not valid_scores`` first and emit a distinct failure_reason
("No valid judge scores (all N/M parses/requests failed)").
Type / cleanup:
- judge: JudgeResult.scores typed as List[Optional[float]] with a
comment documenting the None sentinel — it was already storing None
but the annotation said List[float].
- merging: drop the redundant dim=None on torch.quantile; flat is
already 1-D after .flatten() so the default behavior is correct.
- export: rename the keyword-only parameter format_ → output_format
(matches the convention discussed in review); update cli call site
and test.
Verification:
- ruff format && ruff check: clean
- pytest tests/: 430 passed, 34 skipped
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three security notices flagged by Codacy after the round-7 refactor shifted line numbers: - export.py:23 (B404, "Consider possible security implications associated with the subprocess module"): annotate the import line with `# noqa: S404 # nosec B404` and a short comment pointing reviewers to the safe-usage rationale on _run_converter. - export.py:271-272 (B603/S603, S607): the multi-line subprocess.run call was already noqa'd on its opening line, but Codacy was re-flagging the `cmd,` continuation line on its own. Add the suppression on both lines (S603/B603 and S607/B607 covered) and expand the inline comment to spell out the no-shell, absolute-path invariant. Skipped (Codacy "Possibly found usage of AI tooling: PyTorch" notices on export.py:131, fit_check.py:240, inference.py 10x, merging.py:175): this is Codacy's AI/ML library-usage *policy* signal, not a security or quality issue. ForgeLM is an LLM fine-tuning toolkit; using PyTorch is unavoidable. No code change. Verification: - ruff format && ruff check: clean - pytest tests/test_export.py: 18 passed (no behavioral change) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
… progress, OCR docs) Three small Phase 11.5 backlog items pulled forward — none introduce breaking changes, none require new dependencies. ebooklib deprecation (#4 in backlog) - _extract_epub now passes options={"ignore_ncx": True, "ignore_missing_css": True} to epub.read_epub. Silences the deprecation / NCX / CSS warnings that newer ebooklib (>=0.18) emits when the keyword is omitted; behavior is unchanged for downstream. Audit progress logging (#7 in backlog) - _audit_split now emits a "computing simhashes for N rows" notice and a "X / N rows fingerprinted" line every 5000 rows when the split crosses that threshold. Quickstart-sized audits stay quiet. - audit_dataset main loop logs each split's open ("scanning split 'X' (N rows from path.jsonl)") and a "scanning for near-duplicates (O(n²); N rows)" notice when the within-split pairing step is the expensive piece. Closes the 3-5 minute silent stretch on 30K+ row audits without pulling tqdm in. OCR handoff documentation (#13 in backlog) - docs/guides/ingestion.md: new "Working with scanned PDFs" section with two worked recipes — Tesseract (local, free, via ocrmypdf or raw pdftoppm + tesseract) and AWS Textract (cloud, paid, better on multi-column / forms). Plus a brief PII-redaction-after-OCR flow. Removes the dead-end where "OCR is out of scope" left operators with nowhere to go. Backlog - docs/roadmap/phase-11-5-backlog.md updated: the three landed items removed from the table, header gains a "Recently landed" callout pointing at the relevant docs path. Twelve items remain in the follow-up queue. Verified: ruff clean, 718 tests / 11 skips unchanged, manual smoke on 6000-row audit confirms progress lines fire at the expected intervals.
Master-review §3 items 6-8, §6 Quick wins. Webhook hardening (Security §M-1/M-2/M-3, master-review §3 #8): - forgelm/webhook.py: new `_is_private_destination` resolves the URL host (IP literal or DNS) and rejects RFC1918 / loopback / link-local / reserved / multicast destinations. Cloud-metadata services (the 169.254.169.254 IMDS endpoint), in-cluster management UIs, and loopback debuggers no longer receive the run summary by accident when an operator typos / pastes a wrong URL into the YAML. - New `webhook.allow_private_destinations` opt-in for legitimate in-cluster monitoring endpoints (Slack proxies on-prem, Teams gateways behind a corporate ingress). - DNS-failure semantics: do NOT fail closed on `gaierror` (would mask typos as SSRF refusals); let `requests` produce the natural `ConnectionError` so the operator sees the real "could not resolve" message. - TLS: `requests.post(verify=verify)` now passed explicitly. New `webhook.tls_ca_bundle` path forwards as `verify=` for corporate MITM CA setups; default keeps the bundled certifi store (verify=True). - Timeout floor: `timeout < 1` clamped to 5s and warned. The previous default-of-5 silently honoured `timeout: 0` from a bad YAML, which `requests` interprets as "no timeout" — hung the trainer on a dead webhook receiver. - `notify_failure(reason)`: routes the reason text through `forgelm.data_audit.mask_secrets` before posting, so an exception string carrying a leaked API key / PEM block / Slack token is redacted to `[REDACTED-SECRET]` before the webhook payload leaves the process. Truncated to 2 KB after masking to keep Slack payloads reasonable. Safety classifier (M-6): - forgelm/safety.py: pass `trust_remote_code=False` explicitly to `pipeline("text-classification", ...)` so a future Transformers default flip can't silently start running classifier-side custom code on the production safety pass. Performance (master-review §3 #7, F-performance-001): - forgelm/data_audit.py::_audit_split: dropped the defensive `list(agg.minhashes)` / `list(agg.fingerprints)` copy on the return path. The downstream `_cross_split_pairs` already produces a fresh filtered list before walking, so the original list is never mutated by consumers. Saves ~1-2 KB per MinHash sketch × N rows × 2 = up to ~2.5 GB resident on a 1M-row split before LSH even started. Note: the master-review's F-performance-003 (lazy-torch-import in trainer.py) is partially stale — `cli.py` already lazy-imports trainer.py so `forgelm --help` and `forgelm audit` do NOT trigger the torch import (verified ~390 ms cold start). Skipping that change. Quick wins: - forgelm/cli.py:819: `str | None` → `Optional[str]` to match the uniform house style. - forgelm/cli.py: `--data-audit` deprecation help and the runtime log now declare a removal version (v0.7.0) so consumers can plan; the log was promoted from `info` to `warning`. - docs/standards/testing.md: `coverage.fail_under` example now reads `40` (matches `pyproject.toml` — the toml is the active gate). The drift was flagged by F-code-016. - forgelm/config.py::WebhookConfig: added `allow_private_destinations` (bool, default False) and `tls_ca_bundle` (Optional[str]) fields to back the new webhook hardening knobs. Verified: 822 tests pass; ruff lint + format clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iew artifact 11 verified findings actioned inline + 1 file removed at user request. REMOVED - docs/analysis/code_reviews/foundation-pr-review-20250504-opus.md — user flagged this artifact as accidentally committed; removed via 'git rm' (working tree + index). The follow-up review foundation-pr-review-202604301557-opus.md remains on disk locally but is gitignored per the existing pattern; only the 3 named planning artifacts (closure-plan, master-review, split-design) are publicly tracked. CLOSURE PLAN FIXES - §11 Faz 31 acceptance: "release-prep.yml tetiklenir" → publish.yml build → cross-os-tests → publish job chain. The two workflows were consolidated into a single publish.yml in the prior round; the acceptance text and §17 task #7 still referenced the old split filename. - §10 Faz 31 Cross-OS Install step: 'shell: bash' added so the POSIX-style 'if [ runner.os = ... ]' works on windows-latest; '--no-index --find-links dist' dropped from the Linux-only extras install (qlora pulls bitsandbytes which lives on PyPI, not in our local dist/). The extras install moved into a separate step gated by 'if: runner.os == 'Linux'' so Windows + macOS skip cleanly. SPLIT DESIGN FIXES - §1.d, §2.d, §3 fenced code blocks: added 'text' language identifier (markdownlint MD040 — three previously-unlabeled fences for the data_audit/cli package layout illustrations + the week-by-week sequence). CODE FIXES - forgelm/compliance.py:_verify_genesis_manifest — manifest present-but-unreadable / present-but-missing-required-fields now returns VerifyResult(valid=False, ...) instead of warn-and-success. Rationale: the manifest IS the truncate-and-resume detector; an attacker who corrupts it should fail verification, not silently pass on the assumption that 'manifest unreadable' is benign. - forgelm/compliance.py:_read_chain_head — 4 KiB tail re-read fallback was hashing truncated records when a single audit entry exceeded the 4 KiB window. Replaced with a progressive-widen pattern: window starts at 4 KiB, doubles each retry on empty whole-records segment, up to full-file. Trailing-newline guard already protected against in-flight writes; this fix protects against oversized records. - forgelm/safety.py:run_safety_evaluation — added library-API boundary check: 'isinstance(batch_size, int) and batch_size >= 1' raises ValueError with a clear message. Pydantic Field(ge=1) enforces this for config-driven calls but the function is also a public Python API (importable as forgelm.safety.run_safety_evaluation), so direct callers can bypass schema validation. Prior C-32 rejection covered the internal helpers (_generate_responses_batched etc.) which are Pydantic-fed; this is a different code path. - forgelm/trainer.py:_revert_model — webhook reason suffix "Adapters discarded." → "Artifacts discarded." so the message is correct whether save_final_model produced LoRA adapters, a merged model, or another output kind. PLAN §15.5 CARRY-OVER REGISTRY EXPANSION (C-35..C-43) 9 new entries documenting the round-3 fix decisions; all 9 closed inline this commit. C-38 (timestamp fix) marked REMOVED since the underlying file was deleted. ACCEPTANCE - pytest tests/ --ignore=tests/test_cost_estimation.py: 881 passed, 13 skipped, 0 failed. - ruff format --check: 81 files clean. - ruff check: All checks passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the three HIGH findings from the multi-agent review and the six MEDIUMs that have a clean doc-only fix: - F-23-01 (HIGH): rewrote §4.1 ¶2 to acknowledge the existing test_lazy_imports.py only covers forgelm.trainer / forgelm.model; bare `import forgelm` is verified manually today and Phase 19 task #7 closes the gap with a real subprocess regression test. - F-23-02 (HIGH): added a TYPE_CHECKING import block to the §4.2 __getattr__ sketch so mypy --strict sees real signatures (not Any), and replaced the if/elif chain with a _LAZY_SYMBOLS dict + caching via globals()[name] = value (gemini-code-assist comments addressed in the same edit: maintainability + per-access caching). - F-23-03 (HIGH): replaced the "raise AttributeError on forgelm._http" test with a stronger and import-side-effect-immune assertion ("_http" not in dir(forgelm) and not in __all__). Documented in §2.3 that internal = absent from public surface, not import-protected. - F-23-04 (MEDIUM): §12 sign-off note honestly acknowledges that no existing test imports the new stable names and Phase 19 task #7 ships the coverage; the table claim was aspirational. - F-23-05 (MEDIUM): §4.3 reframes the cli/__init__.py lazy migration as hygiene, not perf; explicitly notes _dispatch eagerly imports six submodules so a real CLI invocation defeats the laziness anyway. - F-23-06 (MEDIUM): §5.2 names the CI trigger (release-* branch push) and explicitly documents the script as a no-op for v0.5.5 because there is no prior __api_version__-bearing baseline. - F-23-07 (MEDIUM): added a tier-completeness audit table covering every top-level module + sub-package (Stable / Experimental / Internal) so reach-ins to forgelm.safety / forgelm.judge / forgelm.ingestion are not silently load-bearing. - F-23-08 (MEDIUM): replaced the §2.1 audit_dataset signature (`**kwargs` placeholder) and verify_audit_log signature (missing hmac_secret) with the real signatures audited from forgelm/data_audit/_orchestrator.py and forgelm/compliance.py. - BOT (sourcery): "spec'es" typo in CHANGELOG.md replaced with "specifies". Reviewer recommendation said "pre-Phase-19 must land before Phase 19 starts"; that's exactly what this commit does — design + Phase 19 implementation plan are now consistent with the audited live code. Closes (review IDs): F-23-01 .. F-23-08 + sourcery-typo Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gent review findings - Wave 5 PR #34 round 3 absorption: full sweep of CodeRabbit's 6 OOD + 20 Minor inline review comments PLUS 24 multi-agent review findings (code-correctness / privacy-security / standards / test-rigor docs under docs/analysis/code_reviews/wave5-*.md). HIGH/Major absorptions: - 4 broken `forgelm.cli.subcommands.*` symbol citations across purge/approvals reference docs (en+tr): _run_row_erasure → _run_purge_row_id, _handle_pending → _run_approvals_list_pending, _handle_show → _run_approvals_show (F-W5-PRIV-02). - configuration.md (en+tr): drop `kserve | triton` ghost choices and the entire `## deployment:` ghost YAML block (no `DeploymentConfig` in ForgeConfig; extra="forbid" rejects it at load) — replace with a v0.6.0+ Phase 14 callout (F-W5-S1, F-W5-S2). - ingestion.md (en+tr): drop conflicting "instructions format default" claim; document raw-chunk JSONL with summary-only --output-format {text,json} (CR-OOD ingestion). - closure-plan PR #33 status: line 131 said "PR #33 pending" but line 119 said merged with `01e40ba` (CR-OOD closure-plan). - leakage.md (en+tr): remove `audit.leakage_check.remove_cross_split_overlap` knob reference — knob doesn't exist in code; replace with the documented jq-based removal flow (CR-OOD leakage). - TR + EN experiment-tracking: drop `$ ` prompt prefix in `text`-tagged preview blocks for `forgelm compare-runs` (CR-OOD experiment-tracking). - roadmap mermaid (en+tr): add Phase 12.6 (Closure Cycle) → v0.5.5 node + edge so visual flow matches table (CR-OOD roadmap). Minor absorptions: - phase-12-6-closure-cycle.md L123: stale "(currently 0.5.1rc1)" parenthetical → already-bumped during Wave 5 (F-W5-S4 / CR Minor). - gdpr-erasure.md (en+tr): rename duplicate `### What gets recorded` / `### Neyi kaydeder` H3 to disambiguate Article 17 vs 15 sections (CR Minor #2 + #20). - verify-gguf.md (en+tr) L147: revise misleading "still match SHA-256" hash explanation; clarify magic-header layer doesn't catch truncation independent of sidecar (CR Minor #3). - verify_audit link mismatch in 4 reference docs (en+tr × verify_gguf + verify_annex_iv): label text now matches href (CR Minor #4 + #8). - air_gap_deployment-tr.md L19: add `text` language tag to ASCII diagram fence (CR Minor #6). - site/index.html L374: drop internal "Phase 19" milestone leak; tag reads `embed` (user-facing label) (CR Minor #7). - air_gap_deployment.md L193: fix dangling sentence fragment in HF_TOKEN gated-models tip (CR Minor #9). - performance.md (en+tr) L184: drop ghost `forgelm audit --output-format parquet`; replace with pandas/pyarrow conversion step (CR Minor #10). - phase-11-5-backlog.md L10-12: link text → package paths matching href (forgelm/data_audit.py → forgelm/data_audit/, etc.) (CR Minor #11). - testing.md L140: stale `23/23` → `39/39` bilingual parity count (CR Minor #12). - testing.md L198: `python3` prefix consistency in checklist (CR Minor #13). - supply-chain.md (en+tr): bandit cadence aligned to standards doc (Nightly 03:00 UTC + release tag, not "every PR + main push") (CR Minor #14 + #15). - translations.js L1927: "Three values" → "Two values" (only forgelm-lang + forgelm-theme listed) (CR Minor #16). - approvals_subcommand.md (en+tr) L26: staging path → match L84 JSON sample (`final_model.staging.<run_id>` canonical, run-id-less form honoured as fallback) (CR Minor #17 + #19). - translations.js L1445/L1462/L1504/L1521 (en+tr): `<code>reject</code>` → `<code>forgelm reject</code>` for fully-qualified copy/paste parity (CR Minor #18). - _meta.yaml header: add formal DE/FR/ES/ZH deferred-policy banner per closure-plan Task L (F-W5-S6). - release.md L20 + L359: stale "0.3.1rc1 current" + "no __version__" claims; corrected to 0.5.5 + reflect that __version__ + __api_version__ ARE exported per Library API standard (F-W5-S3). - purge_subcommand.md (en+tr): add WARNING-on-fallback language for --output-dir asymmetry vs reverse-pii (F-W5-PRIV-05). - verify-audit.md (en+tr): replace overstated "escrow" wording with KMS/HSM-based separation (F-W5-PRIV-06). Tool tightening + new tests: - check_cli_help_consistency.py: tighten _FORWARD_REF_MARKERS to whole-phrase forms (drop bare "planned"/"future"/"roadmap" which matched "planned attacks" / "future work" / "roadmap-driven" in unrelated prose); add Turkish equivalents for bilingual docs (F-W5-01). - test_check_cli_help_consistency.py: 3 new regression tests: bare-substring negative (F-W5-01), >±3-line forward-ref window negative (F-W5-TR-01a), and Wrong:-block scope-to-one-block resumption (F-W5-TR-01b). Suite 17 → 20 tests, all green. User-manual orphan fix: - _meta.yaml: register `reference/json-output` page (existed in filesystem since Wave 2 but never declared in nav config — was invisible on the live site). Now declared = on-disk = 65 each. Skipped (with reason): - CR Minor #5 (library_api.md `forgelm.__version__` example): INVALID — verified `python3 -c "import forgelm; print(forgelm.__version__)"` prints 0.5.5 cleanly; both __version__ and __api_version__ ARE exported per Library API standard. CodeRabbit's web-search must have hit pre-Wave-5 PyPI. Doc is correct; code matches. - F-W5-S5 (parity tool autodiscover usermanuals): registry shape refactor (~50 new pairs) is scope creep for round 3; defer to v0.6. - F-W5-S8 (compliance_summary-tr.md): explicitly deferred to v0.6 by the finding itself. - F-W5-S9, F-W5-TR-11/12 (NIT cite/style): non-load-bearing. - F-W5-02/03/04 (cognitive complexity, unterminated fence, resume-after-Wrong unit test): refactor scope; covered by F-W5-TR-01 block-scope tests for the resume case. - F-W5-TR-02 (cleanup commit baseline regression test): the live CI --strict gate IS the guard; doubling up the integration-test layer is defensible but not blocking. - F-W5-TR-03/05/06/07/08/09/10 (extras): non-blocking coverage gaps. Self-review gauntlet (all green): ruff format/check ✅ bilingual_parity --strict 39/39 ✅ anchor_resolution --strict 264/264 ✅ cli_help_consistency --strict 425/425 ✅ (reduced from 449 due to intentional invocation removals: ghost YAML block, parquet ghost, 4 broken symbol-cite paths, etc.) site_claims --strict (artefacts/templates/GPU/version) ✅ field_descriptions --strict ✅ pytest 1433 passed / 14 skipped (+5 new tests vs round 2) ✅ forgelm --config config_template.yaml --dry-run ✅ build_usermanuals.py: en/tr 65/65 OK, JS bundles regenerate ✅ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>



Closes the "trained, now what?" gap with five new modules:
CLI: forgelm chat / export / deploy subcommands + --fit-check flag; full backward compat with existing flat CLI. All subcommands work without --config.
Tests: 89 new tests across 5 modules; full suite 427 passed, 34 skipped.
Docs: CHANGELOG, README, roadmap, quickstart, usage reference updated.
Version: 0.3.1rc1 → 0.4.0
Summary
Changes
Type
Testing
pytest tests/passesruff check .passesruff format --check .passesforgelm --config config_template.yaml --dry-runworksChecklist
pip install forgelm[...])config_template.yaml) updated if new config fields addedSummary by CodeRabbit
New Features
Documentation
Bug Fixes / Security
Tests
Chores