In this post, I'm not discussing drawing but instead a technique to efficiently identify candidate near duplicate news articles.
Shingling
In previous posts, I've discussed a technique called "w-shingling", which lets you determine whether two news articles are near duplicates. Briefly, given two documents, you (see the sketch after this list):
1. calculate a set of w-shingles by identifying all sequences of three words that start with stop words (XQuery, Python)
2. determine the similarity of the w-shingles for the two documents by calculating their Jaccard coefficient
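Here's a minimal Python sketch of those two steps (not the code from the earlier posts): the whitespace tokenizer and the tiny stop word list are illustrative placeholders, and `shingles` and `jaccard` are hypothetical helper names.

```python
# A tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "and", "to", "for"}

def shingles(text, size=3):
    """Return the set of w-shingles: every run of `size` words that
    begins with a stop word (using a naive whitespace tokenizer)."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + size])
        for i in range(len(words) - size + 1)
        if words[i] in STOP_WORDS
    }

def jaccard(a, b):
    """Jaccard coefficient of two shingle sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Calling `jaccard(shingles(doc1), shingles(doc2))` then yields a similarity score between 0 and 1.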
This technique works well enough, but it requires you to compare every single pairwise combination of documents in your corpus. This adds up quickly: even for a relatively small set of 100 documents, there are 100 × 99 / 2 = 4,950 pairs to compare! For a large corpus (such as a news archive with millions of documents), or in situations where you're doing real-time comparisons (such as when crawling websites), you need a way to efficiently narrow down the number of comparisons you have to perform.
Sketching
The technique of sketching builds upon the w-shingling technique to quickly identify sets of candidate duplicates, which can then be compared using more accurate methods.
To calculate a sketch for a document you:
1. calculate the w-shingles for the document
2. calculate the hash of each w-shingle
3. map each hash into one of n buckets by taking its modulus (hash mod n)
4. take the smallest of these values as the sketch for the document
Documents that share the same sketch value fall into the same bucket, and the documents in each bucket are candidate near duplicates.
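Reusing the hypothetical `shingles()` helper above, a minimal Python sketch of those four steps might look like this. The names `sketch` and `buckets` are mine, `zlib.crc32` is just a stand-in for any stable hash function, and 25 buckets is an arbitrary choice.

```python
import zlib
from collections import defaultdict

def sketch(text, n=25):
    """Sketch of a document: hash each w-shingle, reduce each hash
    modulo n, and keep the smallest value (None if there are no shingles)."""
    # zlib.crc32 is a stand-in for any stable hash; Python's built-in
    # hash() is randomized per process, so sketches wouldn't be reproducible.
    hashes = (zlib.crc32(s.encode("utf-8")) for s in shingles(text))
    return min((h % n for h in hashes), default=None)

def buckets(documents, n=25):
    """Group documents (a dict of id -> text) by sketch value; each
    bucket holds candidate near duplicates for closer comparison."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        groups[sketch(text, n)].append(doc_id)
    return groups
```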
Efficient
The sketch technique described above is linear in the number of documents, i.e. you calculate one sketch per document, and it requires only simple functions built into every modern language (hashing and modulus arithmetic). It entails calculating the w-shingles for each document, which we'll reuse in the next step (calculating the Jaccard coefficient for candidate pairs). The sketches and the hashed w-shingles require far less memory to store than the original text strings. And it results in *far* fewer pairwise comparisons.
If you sort our 100 example documents into 25 buckets, say, then you'd only need to do 150 pairwise comparisons (assuming the 100 documents are spread evenly, 4 documents per bucket, which means 25 sets of 6 pairwise comparisons).
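To double-check that arithmetic with Python's standard library (the even four-per-bucket split is, of course, an idealized assumption):

```python
from math import comb

print(comb(100, 2))     # 4950 pairwise comparisons with no bucketing
print(25 * comb(4, 2))  # 150 comparisons across 25 buckets of 4 documents
```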
So sketching and w-shingling together are a very memory- and time-efficient way to find near duplicates.
More Precise Duplicate Detection
Can we expand this technique even further? If we regard sketches and w-shingling as being increasingly expensive and increasingly accurate ways to identify near duplicates, can we add a third step to provide even more precise results?
To some extent, it depends upon how you are defining duplicates. The w-shingling technique will generally identify very similar news articles that are "about" the same thing, since so many of the key phrases are identical. These might be news articles from different publications written about the same news events (which might be useful for clustering search results, for example).
However, I'm particularly interested in identifying articles that are variants of "the same" text. For example, two different versions of the same news article (perhaps published at different times on different websites, but from the same source, such as The Associated Press). That's why I would like to examine the technique of Longest Common Substring. But in another post...