[tif] semhash for semantic deduplication

January 14, 2025 1 minute read

I often find myself in a position where I have to deduplicate. And most of the time, it’s a lot of data that needs to be deduplicated. So I keep my eye out for libraries that might help with this task. Today I saw one from MinishLab called semhash.

SemHash is a lightweight and flexible tool for deduplicating datasets using semantic similarity. It combines fast embedding generation from Model2Vec with efficient ANN-based similarity search through Vicinity.

SemHash supports both single-dataset deduplication (e.g., cleaning up a train set) and multi-dataset deduplication (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.
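To make the idea in the quote concrete, here is a minimal sketch of threshold-based semantic deduplication. It is not SemHash's API: the `embed`, `cosine`, and `deduplicate` names are hypothetical, the bag-of-words vector is a toy stand-in for Model2Vec embeddings, and the linear scan stands in for the ANN search that Vicinity provides. It only illustrates the single-dataset and cross-dataset cases.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(records, reference=None, threshold=0.8):
    """Keep each record unless it is too similar to something already seen.

    With reference=None this deduplicates a single dataset. With a
    reference list, records that match any reference item are dropped too
    (e.g. removing test items that overlap with a train set).
    """
    seen = [embed(r) for r in (reference or [])]
    kept = []
    for record in records:
        vec = embed(record)
        # Brute-force comparison; a real tool would use an ANN index here.
        if all(cosine(vec, other) < threshold for other in seen):
            kept.append(record)
            seen.append(vec)
    return kept


train = [
    "The cat sat on the mat",
    "the cat sat on the mat!",
    "Dogs chase balls in the park",
]
test = ["The cat sat on the mat", "Birds sing at dawn"]

clean_train = deduplicate(train)
clean_test = deduplicate(test, reference=train)
```

The brute-force loop is quadratic in the number of records, which is exactly the cost that fast embeddings plus approximate nearest-neighbor search are meant to avoid on large datasets.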

The two links in the quote above point to other projects from MinishLab.

We’re a two-person (@pringled and @stephantul) open-source company, with a focus on Natural Language Processing. We believe that if you make models fast enough, you unlock new possibilities.

Looks like the team is

I love the library names, the retro gaming art in the GitHub repo, and the focus on unlocking new use cases by just making things really fast. Some of the things they suggest you can do with their software:

Ingest the entire English Wikipedia in 5 minutes

Classify tens of thousands of documents per second on CPU

Approximately deduplicate extremely large datasets in minutes

Build the fastest RAG application in the world

Easily evaluate which ANN algorithm works best for your data

Looking forward to giving some of these a try!

Source: semhash, via Philipp Schmid


Gabriel Altay
