
feat(detectors): add violence detector #1865

Open
nuthalapativarun wants to merge 1 commit into NVIDIA:main from nuthalapativarun:feat/87-violence-detector

@nuthalapativarun

Contributor

Summary

This adds a new detectors.violence.ViolentSpeech detector that flags model
output containing keyword/phrase indicators of violent content (threats,
incitement to violence, glorification of violent acts), following the same
StringDetector pattern used by detectors.lmrc and detectors.unsafe_content.
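As a rough illustration of the substring-matching pattern described above, here is a minimal, self-contained sketch. It does not import garak: `SubstringDetectorSketch` and `ViolentSpeechSketch` are hypothetical stand-ins, the phrase list is a placeholder rather than the list in this PR, and the `detect()` signature is simplified to take a plain list of strings.

```python
from typing import List, Optional


class SubstringDetectorSketch:
    """Hypothetical stand-in for a string-matching detector base class.

    Scores each output 1.0 if any configured substring occurs in it
    (case-insensitively), 0.0 otherwise, and None for missing outputs.
    """

    def __init__(self, substrings: List[str]):
        self.substrings = [s.lower() for s in substrings]

    def detect(self, outputs: List[Optional[str]]) -> List[Optional[float]]:
        scores: List[Optional[float]] = []
        for text in outputs:
            if text is None:
                # No output to score.
                scores.append(None)
                continue
            lowered = text.lower()
            hit = any(s in lowered for s in self.substrings)
            scores.append(1.0 if hit else 0.0)
        return scores


class ViolentSpeechSketch(SubstringDetectorSketch):
    """Hypothetical detector: placeholder phrases, not the PR's real list."""

    def __init__(self):
        super().__init__(["i will kill you", "beat them to death"])


detector = ViolentSpeechSketch()
scores = detector.detect(
    ["I will kill you if you come back.", "The weather is lovely today."]
)
print(scores)  # [1.0, 0.0]
```

The real detector would subclass the project's own base class and carry its tags and docstrings; the sketch only shows the hit/no-hit scoring that the tests in this PR check.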

Closes #87

AI assistance disclosure

This PR was drafted with AI assistance (Claude). I reviewed the detector's
substring list, tags, and docstrings, and wrote/verified the accompanying
tests locally. The keyword list and tagging conventions were checked against
existing content-safety detectors (lmrc.py, unsafe_content.py,
exploitation.py) for style consistency.

Verification

  • Supporting configuration such as generator configuration file - n/a, new detector with no config requirements
  • garak -t <target_type> -n <model_name> - n/a, detector-only change
  • Run the tests and ensure they pass: python -m pytest tests/detectors -q -k violence - 7 passed
  • Verify the thing does what it should - new tests load the plugin and assert detect() returns 1.0 for output containing violent keywords
  • Verify the thing does not do what it should not - tests assert detect() returns 0.0 for benign output
  • Document the thing and how it works - added docs/source/detectors/violence.rst and linked it in index_detectors.rst; class docstrings describe the detection approach

Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
leondz self-assigned this Jun 15, 2026
@leondz

leondz commented Jun 22, 2026

Collaborator

Thanks for this. I think we need a more grounded, general approach to determining whether an utterance shows violence. A keyword approach like this is decontextualised -- but accurate determinations about hate speech tend to require context (e.g. https://aclanthology.org/2021.acl-long.247/). Can we take a deeper approach than a keyword-based one? Perhaps using an open-weights model?
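To make the decontextualisation concern concrete, here is a small hedged sketch of how plain substring matching can misfire in both directions. The `keyword_flag` helper, the `KEYWORDS` list, and the example sentences are all hypothetical and are not taken from this PR.

```python
from typing import List

# Hypothetical keyword list, for illustration only.
KEYWORDS: List[str] = ["kill", "stab"]


def keyword_flag(text: str, keywords: List[str]) -> float:
    """Return 1.0 if any keyword occurs in the text (case-insensitive), else 0.0."""
    lowered = text.lower()
    return 1.0 if any(k in lowered for k in keywords) else 0.0


# An explicit threat: flagged, as intended.
explicit_threat = keyword_flag("I'm going to kill you.", KEYWORDS)

# A benign technical instruction: also flagged (false positive).
benign_command = keyword_flag("Run `kill -9 1234` to stop the process.", KEYWORDS)

# A veiled threat with no listed keyword: not flagged (false negative).
veiled_threat = keyword_flag(
    "It would be a shame if something happened to your family.", KEYWORDS
)

print(explicit_threat, benign_command, veiled_threat)  # 1.0 1.0 0.0
```

A context-aware classifier, such as the open-weights model suggested above, would be one way to separate these cases; the sketch only shows where keyword matching alone falls short.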



Development

Successfully merging this pull request may close these issues.

detector: violence

2 participants