Treats your files as repositories and allows finding duplicates in the repositories and outside of them
config: ~/.config/dedup/
- Each repo consists of x index DBs
- They are separated by modulo x filesize
- The Repo manages the index files in a way that the outside is unaware of them
~/.config/dedup/repos/myRepo/dedup_repo.cfg
| Name | Absolute Path (normalized path) | Index files |
|---|---|---|
| immiches | /home/user/immich | 10 |
- All files of an index are written as logs in a way that all entries are Read only until pruned
- When reading the last path wins
- if it is marked deleted it is considered not existent
- The lookup always has to happen size -> hash because the hash alone is irrelevant. see Processes
~/.config/dedup/repos/myRepo/01.idx
| Relative path (normalized path) | size (bytes) | hash (string) | deleted (date or null) | last modified (date) |
|---|---|---|---|---|
| my/file | 12442 | SHA1-string | Nov 23 |
- check for bak files
- for each
- rename [xxx].bak to xxx
- if fail
- fail process with corruption error saying exactly what's the case and how to fix it
- read all .idx files
dedup repo create <name> <path> -i 10
- Creates the config dir
- Creates the dedup_repo.cfg
- Creates the idx files
dedup repo ls
- lists the repo names
dedup repo rm [<name> | --all]
- deletes the subdir under repos
dedup repo update [<name> | --all]
- Reads the current repo DB
- Iterates over the files and
- Adds files that
- Have been deleted since
- Have been added since
- Have changed their last modified date without a change
- Adds files that
- read the database
- build a set of paths to existing files
- iterate over the files on disk
- for each
- check if path is in lookup
- if not - addFile()
- else check if size is equal
- if not - addFile()
- else check if last modified is equal
- if not - addFile ()
- else check force hash config
- if yes - hash file and check if hash is equal
- if not add file
- else continue
- else assume equal and continue
- endif forced hash
- if yes - hash file and check if hash is equal
- endif last modified
- endif size equal
- endif path found
- check if path is in lookup
- mark all left in lookup as deleted
Pruning a repo means remove all deprecated information
dedup repo prune [<name> | --all]
- Rename the index files to [old_name].bak
- Create new index files
- For each bak file
- Read it completely
- remove all but last state
- for each remaining entry
- Add file to the index files
- Is added to the index file modulo [number]
- Only the last state is added
- Also the deleted are added to keep the info that this file existed
- Add file to the index files
- Delete the bak file
- CLI main with simple argument parsing (JCommander)
- Config Dir Management (use default)
- Simple Repository Management (query Config Dir for Repository)
- List, create and delete Repo commands
- Index Manager
- Repo interface