Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sweep: Update all occurrences of file_cache to use sha1 instead of md5. There are two files. #3334

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 10 additions & 6 deletions docs/pages/blogs/file-cache.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -77,7 +77,7 @@ class Obj:
obj = Obj("test")
print(recursive_hash(obj)) # -> this works fine
try:
hashlib.md5(obj).hexdigest()
hashlib.sha1(obj).hexdigest()
except Exception as e:
print(e) # -> this doesn't work
```
Expand All @@ -89,18 +89,18 @@ We use recursive_hash, which works for arbitrary python objects.
def recursive_hash(value, depth=0, ignore_params=[]):
"""Hash primitives recursively with maximum depth."""
if depth > MAX_DEPTH:
return hashlib.md5("max_depth_reached".encode()).hexdigest()
return hashlib.sha1("max_depth_reached".encode()).hexdigest()

if isinstance(value, (int, float, str, bool, bytes)):
return hashlib.md5(str(value).encode()).hexdigest()
return hashlib.sha1(str(value).encode()).hexdigest()
elif isinstance(value, (list, tuple)):
return hashlib.md5(
return hashlib.sha1(
"".join(
[recursive_hash(item, depth + 1, ignore_params) for item in value]
).encode()
).hexdigest()
elif isinstance(value, dict):
return hashlib.md5(
return hashlib.sha1(
"".join(
[
recursive_hash(key, depth + 1, ignore_params)
Expand All @@ -113,7 +113,7 @@ def recursive_hash(value, depth=0, ignore_params=[]):
elif hasattr(value, "__dict__") and value.__class__.__name__ not in ignore_params:
return recursive_hash(value.__dict__, depth + 1, ignore_params)
else:
return hashlib.md5("unknown".encode()).hexdigest()
return hashlib.sha1("unknown".encode()).hexdigest()
```

### file_cache
Expand Down Expand Up @@ -204,4 +204,8 @@ Finally we check cache key existence and write to the cache in the case of a cac
## Conclusion

We hope this code is useful to you. We've found it to be a massive time saver when debugging LLM calls.

### Note on the Shift from md5 to sha1

We decided to shift from md5 to sha1 hashing for our `file_cache` functionality to improve the security and reduce the chances of hash collisions. sha1 provides a slightly longer hash, which results in a lower probability of different inputs producing the same output. This change is particularly important as the scale of data we cache grows, ensuring that our caching mechanism remains reliable and efficient.
We'd love to hear your feedback and contributions at https://github.com/sweepai/sweep!