adding support for pdf-object-hashing #146
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| from strelka import strelka | ||
| from hashlib import md5 | ||
| from pdf_object_hashing import pdf_object as po | ||
|
|
||
| class ScanPdfObjHash(strelka.Scanner): | ||
|
Does this do something for every object? How long does it take, and is it expensive? We've observed in the past that naively getting every object is expensive and creates a bunch of noise. See #116 for more context, and we can also chat further internally if needed. |
||
| def scan(self, data, file, options, expire_at): | ||
|
Could you include type annotations here? |
||
| pdf_object = po(fdata=data) | ||
| if pdf_object: | ||
| obj_hash_str = "" | ||
| pdf_file_hash = pdf_object.sha256 | ||
| try: | ||
| pdf_object.check_pdf_header() | ||
| pdf_object.trailer_process() | ||
| pdf_object.trailer_process() | ||
| pdf_object.start_object_parsing() | ||
| pdf_object.pull_objects_xref_aware() | ||
| except: | ||
| self.event["object_hash"] = "error" | ||
| file_ordered_objects = pdf_object.get_objects_by_file_order(in_use_only=True) | ||
|
Should this always be run, or just when we didn't encounter an object hash error? |
||
| if file_ordered_objects: | ||
| for item in file_ordered_objects: | ||
| obj_hash_str += item["object_type"] + "|" | ||
| if obj_hash_str: | ||
| obj_hash = md5(obj_hash_str.encode()).hexdigest() | ||
| self.event["object_hash"] = obj_hash | ||
| self.event["hash_string"] = obj_hash_str | ||
| else: | ||
| self.event["object_hash"] = False | ||
|
I don't know if this is just considered 'pythonic', but switching up the data type here gives me pause. Up above, we set it to the string "error" when there's a problem, but here we just set it to False. I'm assuming we're doing this so that we can quickly see that there isn't an object hash or a hash string, but then we'd still need to check the string, when it exists, to see whether it's an error. Could we track the error and/or flag presence in a field other than object_hash/hash_string, and only populate those keys in the dict if we have a meaningful value to provide? |
||
| self.event["hash_string"] = False | ||
|
|
||
Should we use our existing PDF scanner instead?
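
To make the inline feedback above concrete, here is a rough sketch of how the hash-building step could keep `object_hash` and `hash_string` as strings only, skip hashing when parsing failed, and report problems in a separate field. The helper name `build_object_hash_event`, the `flags` key, and the flag values are all hypothetical and not part of Strelka or `pdf_object_hashing`. The input is assumed to be the `object_type` values taken from `get_objects_by_file_order(in_use_only=True)`.

```python
from hashlib import md5
from typing import Dict, Iterable, List, Optional, Union

EventFields = Dict[str, Union[str, List[str]]]


def build_object_hash_event(
    object_types: Optional[Iterable[str]],
    parse_failed: bool = False,
) -> EventFields:
    """Return event fields for a PDF object hash (hypothetical helper).

    "object_hash" and "hash_string" are only present when there is a
    meaningful value; problems are reported through "flags" instead of
    overloading those keys with "error" or False.
    """
    event: EventFields = {"flags": []}
    flags: List[str] = event["flags"]  # type: ignore[assignment]

    # Do not attempt to hash when the parsing steps raised.
    if parse_failed:
        flags.append("pdf_object_parse_error")  # hypothetical flag name
        return event

    # Same construction as in the diff: each object type followed by "|".
    hash_string = "".join(f"{object_type}|" for object_type in (object_types or []))
    if not hash_string:
        flags.append("no_objects")  # hypothetical flag name
        return event

    event["hash_string"] = hash_string
    event["object_hash"] = md5(hash_string.encode()).hexdigest()
    return event


if __name__ == "__main__":
    print(build_object_hash_event(["Catalog", "Pages", "Page"]))
    print(build_object_hash_event(None, parse_failed=True))
```

Inside `scan`, this might be called as `self.event.update(build_object_hash_event(types, parse_failed))`, with `parse_failed` set in a narrower `except Exception:` block instead of the bare `except:`. For the annotation request, something like `def scan(self, data: bytes, file: strelka.File, options: dict, expire_at: int) -> None:` would probably fit, assuming that matches the conventions of the other scanners in this repository.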