-
In my Data Science project, my team had to collect images from many search engines to create a dataset, and we chose Google Sheets for assigning labeling tasks to each member because it is convenient.
-
Crawling the Internet returns many similar images, which can bias the dataset. Here is my solution for filtering similar images during the Data Preparation step.
-
Get image urls from Search Engines. I have a repo for that here
-
Copy and paste these URLs into Google Sheets. Here, we can see how similar images are arranged next to each other
-
Connect to Google Sheets using Python
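A minimal sketch of this step, assuming the gspread library and a service-account key file (this README does not name the library the scripts use). The helper names open_worksheet and first_column are hypothetical, and reading URLs from the first column of the range is an assumption; the spreadsheet, worksheet, and range values in the usage comment mirror the CLI examples in this README.

```python
from typing import List


def open_worksheet(auth_file: str, spreadsheet: str, worksheet: str):
    """Open one worksheet with a service-account key (assumes gspread is installed)."""
    import gspread  # imported lazily so the pure helper below works without it

    client = gspread.service_account(filename=auth_file)
    return client.open(spreadsheet).worksheet(worksheet)


def first_column(rows: List[List[str]]) -> List[str]:
    """Keep the non-empty first cell of each row returned for a range."""
    return [row[0].strip() for row in rows if row and row[0].strip()]


# Possible usage (needs real credentials and a shared spreadsheet):
#   ws = open_worksheet("credentials.json", "example", "Sheet1")
#   urls = first_column(ws.get("B2:C"))
```

The spreadsheet usually has to be shared with the service account's e-mail address before it can be opened by name.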
-
With only one hash value, some images are reported as identical even though they are different. Therefore, we decided to calculate three hash values for each pair of images: an average hash, a perceptual hash, and a difference hash (ahash, phash, and dhash).
-
If at least two of the three hash distances indicate that two images are similar (distance below the -d/--diff threshold, the "different points" value), the images are arranged next to each other:
distances = [ahash0 - ahash1, phash0 - phash1, dhash0 - dhash1]
diff_results = sum(dist < args['diff'] for dist in distances)
if diff_results >= 2:
    print(f'|--Similar with url {idx1 + 1}: {url1}')
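The two-out-of-three vote and the reordering can be sketched as follows. This is an illustration under stated assumptions, not the code from sort_similar.py: plain integers stand in for the three hash values, the Hamming distance stands in for the subtraction in the snippet above, and the names hamming, is_similar, and group_similar are hypothetical.

```python
from typing import List, Sequence, Tuple

Hashes = Tuple[int, int, int]  # stand-ins for (ahash, phash, dhash)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def is_similar(h0: Hashes, h1: Hashes, diff: int, votes: int = 2) -> bool:
    """True when at least `votes` of the three distances fall below `diff`."""
    distances = [hamming(x, y) for x, y in zip(h0, h1)]
    return sum(dist < diff for dist in distances) >= votes


def group_similar(urls: Sequence[str], hashes: Sequence[Hashes], diff: int) -> List[str]:
    """Reorder urls so each url is followed by the later urls similar to it."""
    ordered: List[str] = []
    used = [False] * len(urls)
    for i in range(len(urls)):
        if used[i]:
            continue
        used[i] = True
        ordered.append(urls[i])
        for j in range(i + 1, len(urls)):
            if not used[j] and is_similar(hashes[i], hashes[j], diff):
                used[j] = True
                ordered.append(urls[j])
    return ordered
```

Comparing every pair is quadratic in the number of URLs, which is usually acceptable for a single worksheet but may need a smarter index for very large crawls.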
-
Decide which images to keep and begin labeling
-
Install libraries:
pip install -r requirements.txt
-
Sort similar images in Google Sheets:
- Example:
python sort_similar.py -s "example" -w "Sheet1" -r "B2:C" -a credentials.json
usage: sort_similar.py [-h] -s SPREADSHEET -w WORKSHEET -r RANGE -a AUTH [-d DIFF]

optional arguments:
  -h, --help            show this help message and exit
  -s SPREADSHEET, --spreadsheet SPREADSHEET
                        spreadsheet name
  -w WORKSHEET, --worksheet WORKSHEET
                        worksheet name
  -r RANGE, --range RANGE
                        updated range
  -a AUTH, --auth AUTH  credentials file
  -d DIFF, --diff DIFF  different points
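For readers who want to adapt the script, the options listed in the usage text for sort_similar.py could be declared with Python's argparse roughly like this. It is a sketch reconstructed from the usage text alone: the function names build_parser and parse_args are hypothetical, and the integer type and the default of 10 for --diff are assumptions, since this README does not state them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options shown in the usage text of sort_similar.py."""
    parser = argparse.ArgumentParser(prog="sort_similar.py")
    parser.add_argument("-s", "--spreadsheet", required=True, help="spreadsheet name")
    parser.add_argument("-w", "--worksheet", required=True, help="worksheet name")
    parser.add_argument("-r", "--range", required=True, help="updated range")
    parser.add_argument("-a", "--auth", required=True, help="credentials file")
    # type=int and default=10 are assumptions; the README does not give them
    parser.add_argument("-d", "--diff", type=int, default=10, help="different points")
    return parser


def parse_args(argv=None) -> dict:
    """Return the options as a dict, so they can be read as args['diff']."""
    return vars(build_parser().parse_args(argv))
```

Returning a dict matches the args['diff'] lookup used in the similarity snippet earlier in this README.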
- Download images from urls in Google Sheets:
- Example:
python download_images.py -s "example" -w "Sheet1" -r "B2:C" -a credentials.json -o images/
usage: download_images.py [-h] -s SPREADSHEET -w WORKSHEET -r RANGE -a AUTH -o OUT

optional arguments:
  -h, --help            show this help message and exit
  -s SPREADSHEET, --spreadsheet SPREADSHEET
                        spreadsheet name
  -w WORKSHEET, --worksheet WORKSHEET
                        worksheet name
  -r RANGE, --range RANGE
                        updated range
  -a AUTH, --auth AUTH  credentials file
  -o OUT, --out OUT     path to images directory
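The download step itself can be approximated with the standard library alone. The sketch below is not the code of download_images.py: the function name download_images, the numbered file names, the 10-second timeout, and the .jpg fallback extension are all assumptions chosen for illustration.

```python
import os
import urllib.request
from pathlib import Path
from typing import Iterable, List
from urllib.parse import urlparse


def download_images(urls: Iterable[str], out_dir: str, timeout: float = 10.0) -> List[Path]:
    """Save each url into out_dir as 0001.ext, 0002.ext, ...; skip urls that fail."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    for idx, url in enumerate(urls, start=1):
        # Keep the extension from the url path; fall back to .jpg (an assumption)
        ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
        path = target / f"{idx:04d}{ext}"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                path.write_bytes(response.read())
        except (OSError, ValueError) as err:
            # URLError/HTTPError are OSError subclasses; ValueError covers malformed urls
            print(f"|--Skipped url {idx}: {url} ({err})")
            continue
        saved.append(path)
    return saved
```

Numbering files by their position keeps each saved image traceable to its row in the sheet, and skipping failed URLs lets one dead link pass without stopping the whole run.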