Imagededup - 😎 Finding duplicate images made easy

Last update: Jan 07, 2023

Overview

Image Deduplicator (imagededup)

imagededup is a Python package that simplifies the task of finding exact and near duplicates in an image collection.

This package provides hashing algorithms, which are particularly good at finding exact duplicates, and convolutional neural networks, which are also adept at finding near duplicates. An evaluation framework is also provided to judge the quality of deduplication for a given dataset.

The package provides the following functionality:

- Finding duplicates in a directory using one of the following algorithms:
  - Convolutional Neural Network (CNN)
  - Perceptual hashing (PHash)
  - Difference hashing (DHash)
  - Wavelet hashing (WHash)
  - Average hashing (AHash)
- Generation of encodings for images using one of the above stated algorithms.
- Framework to evaluate effectiveness of deduplication given a ground truth mapping.
- Plotting duplicates found for a given image file.
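To make the hashing idea concrete, here is a minimal sketch of difference hashing on a tiny grayscale grid. It is not the package's implementation: the function name `dhash_hex` and the sample grids are assumptions for illustration, and a real pipeline would first resize the image to a fixed small size.

```python
from typing import List


def dhash_hex(pixels: List[List[int]]) -> str:
    """Return a hex string built from left-to-right brightness comparisons.

    Each bit is 1 when a pixel is brighter than its right-hand neighbour.
    `pixels` is a row-major grayscale grid (hypothetical input format).
    """
    bits = [
        1 if row[col] > row[col + 1] else 0
        for row in pixels
        for col in range(len(row) - 1)
    ]
    value = int("".join(map(str, bits)), 2)
    # Pad so that grids of the same shape always give hashes of the same length.
    hex_len = (len(bits) + 3) // 4
    return format(value, f"0{hex_len}x")


original = [
    [10, 20, 30, 25, 5],
    [50, 40, 45, 60, 61],
]
# Uniformly brighter copy: every neighbour comparison keeps its outcome.
brighter = [[p + 15 for p in row] for row in original]

print(dhash_hex(original), dhash_hex(brighter))
```

Because only the sign of neighbouring differences is kept, the uniformly brightened copy produces the same hash as the original in this toy case.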

Detailed documentation for the package can be found at: https://idealo.github.io/imagededup/

imagededup runs on Linux, MacOS X and Windows and is distributed under the Apache 2.0 license. It was originally compatible with Python 3.6+; v0.3.0 dropped support for Python 3.6 and 3.7.

⚙️ Installation

There are two ways to install imagededup:

Install imagededup from PyPI (recommended):

pip install imagededup

⚠️ Note: The TensorFlow >=2.1 and TensorFlow 1.15 releases include GPU support by default. Before that, the CPU and GPU packages were separate. If you have GPUs, install a TensorFlow version with GPU support, especially when you use CNN to find duplicates, as it is much faster. See the TensorFlow guide for details on how to install GPU support for older versions of TensorFlow.

Install imagededup from the GitHub source:


git clone https://github.com/idealo/imagededup.git
cd imagededup
pip install "cython>=0.29"
python setup.py install

🚀 Quick Start

To find duplicates in an image directory using perceptual hashing, use the following workflow:

Import perceptual hashing method

from imagededup.methods import PHash
phasher = PHash()

Generate encodings for all images in an image directory

encodings = phasher.encode_images(image_dir='path/to/image/directory')

Find duplicates using the generated encodings

duplicates = phasher.find_duplicates(encoding_map=encodings)

Plot duplicates obtained for a given file (e.g., 'ukbench00120.jpg') using the duplicates dictionary

from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')

The output looks as below:

[Image: plot of the duplicates found for 'ukbench00120.jpg']

The complete code for the workflow is:

from imagededup.methods import PHash
phasher = PHash()

# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')

# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)

# plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
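Once a duplicates dictionary is available, a common next step is deciding which files to delete. The package lists find_duplicates_to_remove among its API calls for this purpose; the sketch below only illustrates one plausible selection rule on plain data. It assumes the dictionary maps each filename to a list of its duplicates, and `files_to_remove` is a hypothetical helper, not part of imagededup.

```python
from typing import Dict, List


def files_to_remove(duplicates: Dict[str, List[str]]) -> List[str]:
    """Keep the first file seen in each group and list its duplicates for removal."""
    to_remove: List[str] = []
    scheduled = set()
    for filename, dups in duplicates.items():
        if filename in scheduled:
            # Already marked for removal, so its own duplicates are not dropped again.
            continue
        for dup in dups:
            if dup not in scheduled:
                scheduled.add(dup)
                to_remove.append(dup)
    return to_remove


example = {
    "a.jpg": ["b.jpg", "c.jpg"],
    "b.jpg": ["a.jpg", "c.jpg"],
    "c.jpg": ["a.jpg", "b.jpg"],
    "d.jpg": [],
}
print(files_to_remove(example))
```

With this rule, one representative of every duplicate group survives, and files without duplicates are never touched.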

For more examples, refer to this part of the repository.

For more detailed usage of the package functionality, refer to: https://idealo.github.io/imagededup/

⏳ Benchmarks

Detailed benchmarks on speed and classification metrics for different methods have been provided in the documentation. Generally speaking, the following conclusions can be drawn:

- CNN works best for near duplicates and datasets containing transformations.
- All deduplication methods fare well on datasets containing exact duplicates, but difference hashing is the fastest.
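To show what "classification metrics" can mean for deduplication, the following sketch computes pair-level precision and recall from a ground truth mapping and a retrieved mapping. The function name `precision_recall`, the micro-averaging choice and the sample data are assumptions for illustration; the package's own evaluation framework may compute its metrics differently.

```python
from typing import Dict, List, Tuple


def precision_recall(
    ground_truth: Dict[str, List[str]],
    retrieved: Dict[str, List[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (file, duplicate) pairs."""
    tp = fp = fn = 0
    for filename, true_dups in ground_truth.items():
        true_set = set(true_dups)
        found_set = set(retrieved.get(filename, []))
        tp += len(true_set & found_set)   # correctly retrieved duplicates
        fp += len(found_set - true_set)   # retrieved but not true duplicates
        fn += len(true_set - found_set)   # true duplicates that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


truth = {"1.jpg": ["2.jpg"], "2.jpg": ["1.jpg"], "3.jpg": []}
found = {"1.jpg": ["2.jpg", "3.jpg"], "2.jpg": ["1.jpg"], "3.jpg": ["1.jpg"]}
print(precision_recall(truth, found))
```

In this toy case every true duplicate is found, but half of the retrieved pairs are wrong, which is the kind of trade-off a threshold change would shift.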

🤝 Contribute

We welcome all kinds of contributions. See the Contribution guide for more details.

📝 Citation

Please cite Imagededup in your publications if this is useful for your research. Here is an example BibTeX entry:

@misc{idealods2019imagededup,
  title={Imagededup},
  author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
  year={2019},
  howpublished={\url{https://github.com/idealo/imagededup}},
}

🏗 Maintainers

Tanuj Jain, github: tanujjain
Christopher Lennan, github: clennan
Dat Tran, github: datitran

© Copyright

See LICENSE for details.

Comments

Optional parallelization of cosine similarity computation (Issue #95)

The PR adds parallel_cosine_similarity to find_duplicates which is set to True by default so that it won't break any existing code. I have a really huge dataset and not enough RAM to use multiprocessing so introducing the ability to disable parallel computation of cosine similarity was the only way to use the package.
enhancement

opened by EduardKononov 8
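The memory concern raised in this PR can be illustrated with a single-process sketch that yields similar pairs one row at a time instead of holding a full similarity matrix. It is a pure-Python illustration under stated assumptions: `similar_pairs` and `normalize` are hypothetical names, and this is not how the package computes cosine similarity.

```python
import math
from typing import Dict, Iterator, List, Tuple


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def similar_pairs(
    encodings: Dict[str, List[float]], threshold: float
) -> Iterator[Tuple[str, str, float]]:
    """Lazily yield (name_a, name_b, cosine similarity) for pairs above threshold."""
    names = list(encodings)
    unit = [normalize(encodings[name]) for name in names]
    for i in range(len(names)):
        # Only the upper triangle is visited, and no N x N matrix is stored.
        for j in range(i + 1, len(names)):
            score = sum(a * b for a, b in zip(unit[i], unit[j]))
            if score >= threshold:
                yield names[i], names[j], score


vectors = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [0.0, 3.0]}
print(list(similar_pairs(vectors, threshold=0.9)))
```

The trade-off is speed for memory: extra storage stays proportional to the number of encodings, at the cost of running on one core.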

Cannot install imagededup==0.0.1, .. imagededup==0.1.0 because these package versions have conflicting dependencies.

Not sure if this is a recent issue, or related to my using an M1 Mac. Nevertheless, the tail of a very long traceback is below:

ERROR: Cannot install imagededup==0.0.1, imagededup==0.0.2, imagededup==0.0.3, imagededup==0.0.4 and imagededup==0.1.0 because these package versions have conflicting dependencies.

The conflict is caused by:
    imagededup 0.1.0 depends on tensorflow==2.0.0
    imagededup 0.0.4 depends on tensorflow==2.0.0
    imagededup 0.0.3 depends on tensorflow==1.13.1
    imagededup 0.0.2 depends on tensorflow==1.13.1
    imagededup 0.0.1 depends on tensorflow==1.13.1

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

opened by robmarkcole 7

Duplicates are not found when comparing two files

The code below always returns empty duplicates when two files are compared, regardless of whether the PNG files are the same or not. (Python 3.8, OSX Catalina)

import sys
import os

from imagededup.methods import PHash

if __name__ == '__main__':
    hasher = PHash()
    image_map = {}

    for i in range(1,3):
        if not os.path.exists(sys.argv[i]):
            sys.exit(sys.argv[i] + " not found")
        image_map[sys.argv[i]] = hasher.encode_image(image_file=sys.argv[i])
    
    
    duplicates = hasher.find_duplicates(
        encoding_map=image_map,
        max_distance_threshold=0,
        scores=True)

    print(duplicates)

bug duplicate

opened by mmertama 7
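For readers unfamiliar with the max_distance_threshold argument used in the report above, the sketch below shows the usual meaning of a Hamming-distance threshold on hex hash strings: a threshold of 0 accepts only identical hashes. The helper names and the sample hashes are assumptions for illustration, and the sketch does not diagnose the reported bug.

```python
def hamming_distance(hash1: str, hash2: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hash1) != len(hash2):
        raise ValueError("hashes must have the same length")
    return bin(int(hash1, 16) ^ int(hash2, 16)).count("1")


def within_threshold(hash1: str, hash2: str, max_distance_threshold: int) -> bool:
    """True when the two hashes differ in at most the allowed number of bits."""
    return hamming_distance(hash1, hash2) <= max_distance_threshold


h1 = "9f172786e71f1e00"
h2 = "9f172786e71f1e01"  # differs from h1 in exactly one bit

print(hamming_distance(h1, h2))
print(within_threshold(h1, h2, 0), within_threshold(h1, h2, 1))
```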

Relax the dependency specifiers
I can see why you may think you need to lock these down to exact versions, but in general for a library like this it's better to keep them more liberal for several reasons:

Locking gives a false sense of security - it doesn't actually lock the whole tree, so a release of any of the dependencies of these packages could cause a break. You need to use a tool like Pipenv, poetry or pip-tools to lock the entire dependency tree.

This is a library, so you need to ensure that it fits in nicely with larger applications that may well have similar dependencies. By locking them to exact versions you are going to cause dependency issues, as their specifiers may conflict (e.g. Pillow >6.0.0).

It's conceptually wrong - you're saying that this library only works with Pillow ==6.0.0. Not 6.0.1, which could be a bugfix release that fixes a bunch of issues.

Testing dependencies, like pytest and others, should also not be locked. They are not critical to the application as a whole, and it's fairly obvious if they break. Which they won't.
opened by orf 6
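The specifier-conflict argument above can be illustrated with a deliberately simplified check. This is a sketch, not PEP 440 or pip's resolver: `parse`, `satisfies`, `resolve` and the list of available versions are all assumptions made up for the example.

```python
from typing import List, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '6.0.1' into (6, 0, 1); only plain numeric versions are handled."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against one specifier such as '==6.0.0' or '>6.0.0'."""
    # Two-character operators are tested first so '>=' is not read as '>'.
    for op in ("==", ">=", "<=", ">", "<"):
        if spec.startswith(op):
            v, target = parse(version), parse(spec[len(op):])
            return {
                "==": v == target,
                ">=": v >= target,
                "<=": v <= target,
                ">": v > target,
                "<": v < target,
            }[op]
    raise ValueError(f"unsupported specifier: {spec}")


def resolve(available: List[str], specs: List[str]) -> List[str]:
    """Return the versions that satisfy every specifier at once."""
    return [v for v in available if all(satisfies(v, s) for s in specs)]


available = ["6.0.0", "6.0.1", "6.1.0"]  # hypothetical Pillow releases
pinned = resolve(available, ["==6.0.0", ">6.0.0"])   # library pins, app wants newer
relaxed = resolve(available, [">=6.0.0", ">6.0.0"])  # library uses a range instead
print(pinned, relaxed)
```

With the exact pin no version satisfies both requirements, while the range leaves the application free to pick a newer release.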

Installation error

I forked the repository and I worked on the "dev" branch and while running the command:

python setup.py install

I got this error message:

Processing dependencies for imagededup==0.1.0
Searching for tensorflow~=2.0.0
Reading https://pypi.org/simple/tensorflow/
No local packages or working download links found for tensorflow~=2.0.0

opened by ShaharNaveh 5

Handle multi picture objects (MPO)

I have a LOT of images (roughly 1/3 of my entire personal library) that register in PIL as MPO. Your code barfs on all of them, but changing image_utils.py line 15 to IMG_FORMATS = ['JPEG', 'PNG', 'BMP', 'MPO'] fixes this.

I didn't want to submit a PR in case you already know this and it causes a headache somewhere else. If not, please change this.
enhancement

opened by drrelyea 5

Unable to execute this project

I get several errors when running this project. The main one appears after running the "pip install imagededup" command:

I have already looked at all the other issues, but none of them solves this problem.

opened by uly94 4

Fix/installation

Running python setup.py install on the dev branch fails.

Tensorflow and numpy aren't playing well with each other. Leaving tf>1.0 in setup.py with no mention of numpy (i.e., relying on tf to pull in numpy) leads to an error: tf 2.4.1 gets installed along with numpy 1.20.1 (due to the pip resolver algorithm), but tf 2.4.1 needs numpy~=1.19.2. Explicitly requiring numpy<1.20.0 makes the installation work. Additionally, newer versions of numpy (1.20.1), scipy (1.6 onwards) and matplotlib no longer support Python 3.6. The changes proposed in this PR will also work from Python 3.7 onwards.

opened by tanujjain 4

cannot identify image file 'filename.png' 2021-01-17 20:12:33,709: WARNING Invalid image file filename.png:

Getting this error, cannot identify image file 'filename.png' 2021-01-17 20:12:33,709: WARNING Invalid image file filename.png: in v0.2.4 for .png file

opened by awsaf49 4

Image format of encode_image method

Hi there, Thanks for your repo. To use encode_image(image_array=img_array) method, must the input numpy image array format be as BGR or RGB (because for example OpenCV default format is BGR, but Pillow is RGB)? Best
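The channel-order question above comes down to reversing the last axis of the pixel data. The source does not state which order encode_image expects, so the sketch below only shows the conversion itself on a plain nested list; `bgr_to_rgb` is a hypothetical helper. For a NumPy array the equivalent operation is the slice `image[..., ::-1]`.

```python
from typing import List

Pixel = List[int]


def bgr_to_rgb(image: List[List[Pixel]]) -> List[List[Pixel]]:
    """Reverse the channel order of a row-major H x W x 3 nested list."""
    return [[pixel[::-1] for pixel in row] for row in image]


# One row, two pixels: pure blue and pure green in BGR order.
bgr_image = [[[255, 0, 0], [0, 255, 0]]]
rgb_image = bgr_to_rgb(bgr_image)
print(rgb_image)
```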

opened by ahkarami 4

Fix tests

Currently we have failing Linux tests because we rely on the order of how images are loaded which varies across OSs. We don't care in which order images are loaded so we shouldn't test for it, that's why I removed the lines from tests/test_hashing.py

We also have a failing test for macOS Python 3.6 on Azure pipelines which was not reproducible on my MacBook but was fixed by initialising a new CNN object for the failing test.
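The load-order problem described above is commonly avoided by comparing collections without regard to order. The sketch below is a generic illustration, not the change made to tests/test_hashing.py; `assert_same_items` is a hypothetical helper.

```python
import os
import tempfile
from typing import Iterable


def assert_same_items(actual: Iterable[str], expected: Iterable[str]) -> None:
    """Compare two collections of filenames while ignoring their order."""
    actual_sorted, expected_sorted = sorted(actual), sorted(expected)
    assert actual_sorted == expected_sorted, (actual_sorted, expected_sorted)


with tempfile.TemporaryDirectory() as image_dir:
    for name in ("b.jpg", "a.jpg", "c.jpg"):
        open(os.path.join(image_dir, name), "wb").close()
    # os.listdir returns entries in arbitrary order, which can differ across OSs.
    loaded = os.listdir(image_dir)

assert_same_items(loaded, ["a.jpg", "b.jpg", "c.jpg"])
```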

opened by clennan 4

Supporting image compression?

Hey!

I believe most duplicates are created through compression. For example, if I upload an image to a different service, I re-download it. It's usually compressed, and its metadata may have changed. I have many duplicates from various platforms like Google, Facebook, etc.

I haven't seen anything in the documentation about how this repository handles compression. Will it be able to recognize duplicate images with various levels of compression?

Thanks!

opened by PetrochukM 3

Introducing new optional multiprocessing parameters

WHAT/WHY

Introduces new optional multiprocessing parameters to several methods:

num_enc_workers - Change number of processes to generate encodings (Addresses #156)

num_sim_workers/num_dist_workers - Change number of processes to compute similarity/distances (Addresses #95, #113)

HOW

APIs impacted

For both CNN as well as Hashing functions, the following user-facing api calls get new parameters-

encode_images

find_duplicates

find_duplicates_to_remove

Choice of default values

num_enc_workers: For CNN methods, this is set to 0 by default (0=disabled). Furthermore, parallelization of CNN encoding generation is only supported on the Linux platform. This is because pytorch requires the call to DataLoader to be wrapped within an if __name__ == '__main__' construct to enable multiprocessing on non-Linux platforms (due to the difference between the fork() and spawn() calls used for multiprocessing on different systems). Such a construct does not fit well with the current code structure. For Hashing methods, this parameter is set to the cpu count of the system. The default values preserve backward compatibility.

num_dist_workers/num_sim_workers: Set to cpu count of the system. The default values preserve backward compatibility.
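The worker-count convention described above (0 disables parallelism, otherwise a process pool of the given size) can be sketched with the standard library. This is an illustration under stated assumptions, not the package's code: `toy_encode` is a stand-in that returns a SHA-256 digest rather than an image encoding, and `encode_all` with its `num_workers` parameter is hypothetical.

```python
import hashlib
import multiprocessing
from typing import List, Optional


def toy_encode(data: bytes) -> str:
    """Stand-in encoder (assumption): a SHA-256 digest, not a perceptual hash."""
    return hashlib.sha256(data).hexdigest()


def encode_all(items: List[bytes], num_workers: Optional[int] = None) -> List[str]:
    """Encode sequentially when num_workers == 0, otherwise in a process pool."""
    if num_workers is None:
        num_workers = multiprocessing.cpu_count()
    if num_workers == 0:
        return [toy_encode(item) for item in items]
    with multiprocessing.Pool(processes=num_workers) as pool:
        # pool.map preserves input order in its result list.
        return pool.map(toy_encode, items)


if __name__ == "__main__":
    # On platforms that spawn worker processes, pool-based calls belong
    # inside this guard; the sequential call below is safe anywhere.
    print(encode_all([b"one", b"two"], num_workers=0))
```

Calling `encode_all(items, num_workers=2)` from inside the guard would run the same work in two processes and return the results in the same order.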
opened by tanujjain 0

Formatted and Updated README.md - Added Streamlit based WebApp 👨‍💻✅

Hello @tanujjain / @clennan / @datitran, and https://github.com/idealo ,

Kudos to you for bringing up imagededup. I worked on developing a simple streamlit based webapp on the same and I think it will be fruitful to have it as a part of README here as the motivation behind developing this came from your work 😄! The entire webapp codebase has also been added as a separate folder while the main README.md of the project has been updated with the same.

You can find the entire webapp source-code in the stream_app directory of the repo.

Happy opensourcing!

Cheers, Prateek

opened by prateekralhan 7

On demand duplicate check during runtime with a 'growing' BKTree

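What I would like to achieve is roughly the following:

EXISTING_HASHES: set = set()

def is_duplicate(img_bytes: bytes) -> bool:
    # get_hash is not defined in this snippet; it stands for whatever hashing function is used.
    return get_hash(img_bytes) in EXISTING_HASHES

def main():
    # get_new_image and file are likewise left undefined in the original post.
    image_bytes = get_new_image()
    if is_duplicate(image_bytes):
        return

    # Record the hash so the set grows as new images arrive.
    EXISTING_HASHES.add(get_hash(image_bytes))
    with open(file, 'wb') as f:
        f.write(image_bytes)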

opened by sla-te 0
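A 'growing' index of the kind requested above can be sketched as a small BK-tree that supports both insertion and near-match lookup at runtime. This is a self-contained illustration over integer hashes with Hamming distance; the `BKTree` class and the `add_if_new` helper are assumptions for the example and are not the package's bktree implementation.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[int, Dict[int, tuple]]  # (stored hash, children keyed by distance)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


class BKTree:
    """Minimal BK-tree over integer hashes using Hamming distance."""

    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def add(self, value: int) -> None:
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            dist = hamming(value, node[0])
            if dist == 0:
                return  # identical hash is already stored
            child = node[1].get(dist)
            if child is None:
                node[1][dist] = (value, {})
                return
            node = child

    def search(self, value: int, max_distance: int) -> List[int]:
        if self.root is None:
            return []
        matches, stack = [], [self.root]
        while stack:
            stored, children = stack.pop()
            dist = hamming(value, stored)
            if dist <= max_distance:
                matches.append(stored)
            # Triangle inequality: only children in this distance band can match.
            for edge, child in children.items():
                if dist - max_distance <= edge <= dist + max_distance:
                    stack.append(child)
        return matches


def add_if_new(tree: BKTree, image_hash: int, max_distance: int = 0) -> bool:
    """Insert the hash and return True unless a near match is already stored."""
    if tree.search(image_hash, max_distance):
        return False
    tree.add(image_hash)
    return True
```

A caller would compute a hash for each incoming image, call `add_if_new`, and write the file to disk only when it returns True.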

Error in the case of too many images

Hello,

I ran into a problem when I tried to find duplicate images in a dataset of 40000 images.

I have already tried this solution but it didn't work https://github.com/idealo/imagededup/issues/95

Do you know how to fix it? Thank you in advance.

Best regards

opened by hoangkhoiLE 0

Releases(v0.3.0)

v0.3.0(Oct 15, 2022)
Installation fix

Make package installable by removing tensorflow as a dependency and replacing it with pytorch #173

Drop support for python 3.6 and python 3.7 #173

✨ New features and improvements

Use MobileNetv3 for generating CNN encodings #173

Introduce a 'recursive' option to generate encodings for images organized in a nested directory structure #104
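What a recursive traversal adds over a flat directory listing can be shown with pathlib. The sketch is only an illustration of the idea: `list_image_files` is a hypothetical helper, it does not filter by image format, and it is not the package's code.

```python
import tempfile
from pathlib import Path
from typing import List


def list_image_files(image_dir: str, recursive: bool = False) -> List[Path]:
    """List files directly in image_dir, or in all nested folders when recursive."""
    root = Path(image_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file())


with tempfile.TemporaryDirectory() as image_dir:
    (Path(image_dir) / "top.jpg").write_bytes(b"")
    (Path(image_dir) / "nested").mkdir()
    (Path(image_dir) / "nested" / "inner.jpg").write_bytes(b"")
    flat = [p.name for p in list_image_files(image_dir)]
    deep = [p.name for p in list_image_files(image_dir, recursive=True)]

print(flat, deep)
```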

Breaking changes

Size of CNN encodings is 576 instead of 1024 #173

Since CNN encodings are generated using a different network, the robustness might be different; user might need to change similarity threshold settings #173

Hashes (all types) may be different from previous versions for a given image #173

v0.2.4(Nov 23, 2020)
🔴 Bug fixes

Fix broken cython brute force in Python 3.8 #117

Close figure after plotting to avoid figure overwrite #111

Allow encode_image method of cnn to accept 2d arrays #110

Relax dependencies and update packages #116, #107, #102, #119

v0.2.2(Dec 11, 2019)
✨ New features and improvements

Switched to creating list comprehensions to create lists on demand instead of slower explicit for loops that rely on calling the append function in every iteration. #76

Used sets for membership tests

Used broadcasting instead of explicit for loops
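Two of the changes listed above can be illustrated in plain Python: replacing an append loop with a list comprehension, and using a set for membership tests. The function names and data are assumptions for the example; the broadcasting change concerns array operations and is not shown here.

```python
from typing import Iterable, List


def squares_loop(values: Iterable[int]) -> List[int]:
    """Explicit loop that calls append on every iteration."""
    out = []
    for v in values:
        out.append(v * v)
    return out


def squares_comprehension(values: Iterable[int]) -> List[int]:
    """Same result, built by a list comprehension."""
    return [v * v for v in values]


def unseen(names: Iterable[str], known: Iterable[str]) -> List[str]:
    """Return the names that are not in known, preserving input order."""
    known_set = set(known)  # average O(1) lookups instead of scanning a list
    return [name for name in names if name not in known_set]


print(squares_comprehension(range(4)))
print(unseen(["a.jpg", "b.jpg", "c.jpg"], ["b.jpg"]))
```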

v0.2.1(Nov 3, 2019)
🔴 Bug fixes

Add Manifest.in so that c files are included in the source distribution #72

v0.2.0(Oct 30, 2019)
✨ New features and improvements

Added a Cython implementation of brute-force search. This is now used as the default search_method on Linux and MacOS X. For Windows, we still use bktree as the default, as we are not sure that popcnt is supported #56

Expand supported image formats. Now it also supports: 'MPO', 'PPM', 'TIFF', 'GIF', 'SVG', 'PGM', 'PBM' #35

🔴 Bug fixes

Relaxing the package dependencies #36

Removal of print statements #39

Fix type error when saving scores #55 & #61

👥 Contributors

Thanks to @jonatron, @orf, @DannyFeliz, @ImportTaste, @fridzema, @iozevo, @MomIsBestFriend, @YadunandanH for the pull requests and contributions.

v0.1.0(Oct 8, 2019)

This is the first release of imagededup.

We added:
- 🧮 Several hashing algorithms (PHash, DHash, WHash, AHash) and convolutional neural networks
- 🔎 An evaluation framework to judge the quality of deduplication
- 🖼 Easy plotting functionality of duplicates
- ⚙️ Simple API

Owner

idealo

idealo's technology org page, Germany's largest price comparison service. Visit us at https://idealo.github.io/.

GitHub Repository https://idealo.github.io/imagededup/

SPTAG: A library for fast approximate nearest neighbor search

SPTAG: A library for fast approximate nearest neighbor search SPTAG SPTAG (Space Partition Tree And Graph) is a library for large scale vector approxi

4.3k Jan 01, 2023

Audio Visual Emotion Recognition using TDA

Audio Visual Emotion Recognition using TDA RAVDESS database with two datasets analyzed: Video and Audio dataset: Audio-Dataset: https://www.kaggle.com

3 May 11, 2022

Python wrapper of LSODA (solving ODEs) which can be called from within numba functions.

numbalsoda numbalsoda is a python wrapper to the LSODA method in ODEPACK, which is for solving ordinary differential equation initial value problems.

52 Jan 09, 2023

Cascaded Pyramid Network (CPN) based on Keras (Tensorflow backend)

ML2 Takehome Project Reimplementing the paper: Cascaded Pyramid Network for Multi-Person Pose Estimation Dataset The model uses the COCO dataset which

1 Nov 22, 2021

Official implementation of "Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets" (CVPR2021)

Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets This is the official implementation of "Towards Good Pract

52 Nov 22, 2022

Decompose to Adapt: Cross-domain Object Detection via Feature Disentanglement

Decompose to Adapt: Cross-domain Object Detection via Feature Disentanglement In this project, we proposed a Domain Disentanglement Faster-RCNN (DDF)

19 Nov 24, 2022

An example of semantic segmentation using tensorflow in eager execution.

Semantic segmentation using Tensorflow eager execution Requirement Python 2.7+ Tensorflow-gpu OpenCv H5py Scikit-learn Numpy Imgaug Train with eager e

25 Sep 29, 2022

Character Grounding and Re-Identification in Story of Videos and Text Descriptions

Character in Story Identification Network (CiSIN) This project hosts the code for our paper. Youngjae Yu, Jongseok Kim, Heeseung Yun, Jiwan Chung and

8 Dec 09, 2022

The source of my 'pyfade' module, available on PyPI.

Version: 1.2 Introduction Pyfade is a module for creating color gradients. It lets you change each line of your text by

20 Sep 12, 2021

Camera ready code repo for the NeuRIPS 2021 paper: "Impression learning: Online representation learning with synaptic plasticity".

Impression-Learning-Camera-Ready Camera ready code repo for the NeuRIPS 2021 paper: "Impression learning: Online representation learning with synaptic

2 Feb 09, 2022

Face Mesh is a face geometry solution that estimates 468 3D face landmarks in real-time even on mobile devices

Face-Mesh Face Mesh is a face geometry solution that estimates 468 3D face landmarks in real-time even on mobile devices. It employs machine learning

9 Dec 21, 2022

PG2Net: Personalized and Group PreferenceGuided Network for Next Place Prediction

PG2Net PG2Net:Personalized and Group Preference Guided Network for Next Place Prediction Datasets Experiment results on two Foursquare check-in datase

5 Dec 20, 2022

This is the code repository for the paper A hierarchical semantic segmentation framework for computer-vision-based bridge column damage detection

Bridge-damage-segmentation This is the code repository for the paper A hierarchical semantic segmentation framework for computer-vision-based bridge c

5 Dec 07, 2022

Official implementation of NeurIPS 2021 paper "One Loss for All: Deep Hashing with a Single Cosine Similarity based Learning Objective"

71 Dec 22, 2022

A Runtime method overload decorator which should behave like a compiled language

strongtyping-pyoverload A Runtime method overload decorator which should behave like a compiled language there is a override decorator from typing whi

20 Oct 31, 2022

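Interactive annotation software, tentatively named iann

iann Interactive annotation software, tentatively named iann. Installation: install paddle following the instructions on the official website. Install the other dependencies: pip install -r requirements.txt Run: git clone https://github.com/PaddleCV-SIG/iann/ cd iann python iann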

294 Dec 30, 2022

A data-driven approach to quantify the value of classifiers in a machine learning ensemble.

Documentation | External Resources | Research Paper Shapley is a Python library for evaluating binary classifiers in a machine learning ensemble. The

188 Dec 29, 2022

Cooperative Driving Dataset: a dataset for multi-agent driving scenarios

Cooperative Driving Dataset (CODD) The Cooperative Driving dataset is a synthetic dataset generated using CARLA that contains lidar data from multiple

124 Dec 28, 2022

Implementation of Hire-MLP: Vision MLP via Hierarchical Rearrangement and An Image Patch is a Wave: Phase-Aware Vision MLP.

Hire-Wave-MLP.pytorch Implementation of Hire-MLP: Vision MLP via Hierarchical Rearrangement and An Image Patch is a Wave: Phase-Aware Vision MLP Resul

29 Oct 28, 2022

《K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters》(2020)

K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters This repository is the implementation of the paper "K-Adapter: Infusing Knowledge

118 Dec 13, 2022
