8 changes: 8 additions & 0 deletions build/configs/scanners.yaml
@@ -382,6 +382,14 @@ scanners:
        limit: 2000
        pdf_to_png: True
        no_object_extraction: True
  'ScanPdfObjHash':

Should we use our existing PDF scanner instead?

    - positive:
        flavors:
          - 'application/pdf'
          - 'pdf_file'
      priority: 5
      options:
        scanner_timeout: 10 # in seconds
  'ScanOnenote':
    - positive:
        flavors:
3 changes: 2 additions & 1 deletion build/python/backend/requirements.txt
@@ -30,6 +30,7 @@ olefile==0.46
oletools==0.60.1
opencv-python==4.8.1.78
opencv-contrib-python==4.8.1.78
pdf-object-hashing @ git+https://github.com/0xkyle/pdf_object_hashing.git
Pillow>=11.2.1
pi-heif>=0.16.0
idna==3.10
@@ -54,4 +55,4 @@ signify==0.3.0
ssdeep==3.4
tldextract==5.1.3
tnefparse==1.4.0
xmltodict==0.12.0
xmltodict==0.12.0
30 changes: 30 additions & 0 deletions src/python/strelka/scanners/scan_pdf_obj_hash.py
@@ -0,0 +1,30 @@
from strelka import strelka
from hashlib import md5
from pdf_object_hashing import pdf_object as po

class ScanPdfObjHash(strelka.Scanner):

Does this do something for every object? How long does it take, and is it expensive? We've observed in the past that naively getting every object is expensive and creates a bunch of noise. See #116 for more context, and we can also chat further internally if needed.

    def scan(self, data, file, options, expire_at):

Could you include type annotations here?
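
For example, a minimal sketch of what an annotated signature could look like. The types are assumptions inferred from how the values are used in this diff (`data` is handed to the parser as raw bytes); `FileStub` is a hypothetical stand-in for the real Strelka file object, and treating `expire_at` as an integer timestamp is also an assumption.

```python
from typing import Any, Dict


class FileStub:
    """Hypothetical stand-in for the Strelka file object (assumption)."""


class AnnotatedScanner:
    """Illustrative only; the real class would subclass strelka.Scanner."""

    def __init__(self) -> None:
        self.event: Dict[str, Any] = {}

    def scan(
        self,
        data: bytes,
        file: FileStub,
        options: Dict[str, Any],
        expire_at: int,
    ) -> None:
        # Body elided; only the annotated signature is the point here.
        self.event["size"] = len(data)
```

The real annotations should follow whatever the other scanners in the repository already use.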

        pdf_object = po(fdata=data)
        if pdf_object:
            obj_hash_str = ""
            pdf_file_hash = pdf_object.sha256
            try:
                pdf_object.check_pdf_header()
                pdf_object.trailer_process()
                pdf_object.trailer_process()
                pdf_object.start_object_parsing()
                pdf_object.pull_objects_xref_aware()
            except:
                self.event["object_hash"] = "error"
            file_ordered_objects = pdf_object.get_objects_by_file_order(in_use_only=True)

Should this always be run, or just when we didn't encounter an object hash error?
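
If the answer is "only on success", one option is a `try`/`except`/`else` layout. The sketch below is a self-contained illustration, not the scanner itself: `FakeParser`, its methods, and the `"parse_failed"` value are all hypothetical stand-ins for the parser calls in this diff.

```python
from typing import Any, Dict, List


class FakeParser:
    """Hypothetical stand-in for the PDF parser used in this diff."""

    def __init__(self, fail: bool) -> None:
        self.fail = fail

    def parse(self) -> None:
        if self.fail:
            raise ValueError("malformed PDF")

    def ordered_objects(self) -> List[Dict[str, str]]:
        return [{"object_type": "Catalog"}, {"object_type": "Pages"}]


def collect_object_types(parser: FakeParser) -> Dict[str, Any]:
    event: Dict[str, Any] = {}
    try:
        parser.parse()
    except ValueError:
        # Parsing failed: record it and do not read objects from a parser
        # whose state is unknown.
        event["error"] = "parse_failed"
    else:
        # Only reached when parse() raised nothing.
        event["object_types"] = [
            obj["object_type"] for obj in parser.ordered_objects()
        ]
    return event
```

With this layout the object walk cannot overwrite or contradict an error recorded in the `except` branch.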

            if file_ordered_objects:
                for item in file_ordered_objects:
                    obj_hash_str += item["object_type"] + "|"
            if obj_hash_str:
                obj_hash = md5(obj_hash_str.encode()).hexdigest()
                self.event["object_hash"] = obj_hash
                self.event["hash_string"] = obj_hash_str
            else:
                self.event["object_hash"] = False

I don't know if this is just considered 'pythonic', but switching up the data type here gives me pause. Up above, we set it to the string "error" when there's a problem, but here we set it to False. I'm assuming we're doing this so that we can quickly see that there isn't an object hash or a hash string, but then we'd still need to inspect the string, when it exists, to see whether it's an error.

Could we track the error and/or flag presence in a field other than object_hash/hash_string, and only populate those keys in the dict if we have a meaningful value to provide?
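
A rough sketch of the shape I have in mind, written as a standalone function so it runs on its own. The flag names are hypothetical, and mapping the returned `flags` list onto the scanner's own flag mechanism is an assumption; the hash construction mirrors the one in this diff (each object type followed by "|", then MD5).

```python
from hashlib import md5
from typing import Any, Dict, List, Optional, Tuple


def build_event(
    object_types: Optional[List[str]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Return (event, flags). `object_types` is None when parsing failed."""
    event: Dict[str, Any] = {}
    flags: List[str] = []

    if object_types is None:
        flags.append("pdf_obj_hash_error")  # hypothetical flag name
        return event, flags
    if not object_types:
        flags.append("pdf_obj_hash_no_objects")  # hypothetical flag name
        return event, flags

    # Keys are only set when there is a real value to report.
    hash_string = "".join(obj_type + "|" for obj_type in object_types)
    event["hash_string"] = hash_string
    event["object_hash"] = md5(hash_string.encode()).hexdigest()
    return event, flags
```

Consumers can then treat `object_hash` as always being a hex string when present, and look at the flags to tell "failed" apart from "nothing to hash".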

                self.event["hash_string"] = False