I. Scenario Introduction
Scenario - URL Deduplication Filtering at Billion-Scale Data Volume
Taking large-scale web crawlers as an example, in large-scale web crawlers targeting specific keywords, the number of URLs to be crawled may reach hundreds of millions. Duplicate URLs not only waste crawling resources but may also cause crawlers to fall into infinite loops or invalid crawling, severely affecting crawling efficiency and data quality.
Traditional Solutions
- Store in database with unique constraint
- I/O bottleneck
- Use cache, such as HashSet
- High memory overhead
Solution Optimization
Considering that URL crawler deduplication allows for false positives, a small portion of data misjudgment is completely acceptable at large data volumes.
- Use Bitmap with single-hash compression storage, utilizing a binary array to record whether a URL has appeared, occupying less memory space.
- Memory usage estimation for 10 billion URLs: with 1% false positive rate, data volume , bitmap size , occupying 125GB.
- Further optimize the bitmap, i.e., Bloom Filter, utilizing multiple ($k$) hash functions to map each element to multiple ($k$) bits in the binary array, representing one element with multiple bits.
II. Bloom Filter Introduction
Concept
Bloom Filter is a probabilistic data structure based on Bit Array. It can efficiently detect whether an element "might exist" or "definitely does not exist".
- Space Efficient: Uses bit array to store data, greatly saving space.
- Fast Query Speed: Insert and query operations both have constant time complexity , where is the number of hash functions.
- Possible False Positives: There is a possibility of misjudgment, i.e., possibly mistakenly believing that an element not inserted exists in the set (False Positive), but it will never misjudge that an element does not exist.
Diagram
Initialize Bloom Filter

image.png|500
image.png|500
Query: Definitely Does Not Exist vs Might Exist
Definitely Does Not Exist: Query and insert Z
When querying Z, only one bit is found to be 1 and the other bit is 0, proving it definitely does not exist.

image.png|500
image.png|500
image.png|500
Parameters
Parameter Meanings
- Number of elements to be inserted
- Bit array length
- Number of hash functions
- False positive rate
Derivation Formulas
- Probability that a certain bit remains 0 after inserting 1 element with hashes:
- Probability that a certain bit remains 0 after inserting elements with hashes:
- Probability that a certain bit is set to 1 after inserting elements with hashes:
- Probability that bits obtained by hashing 1 element times are all set to 1:
- Derivable false positive rate
Parameter Estimation Given the number of elements to be inserted and false positive rate :
- Bit array length
- Number of hash functions
Characteristics
Add, Query, No Delete It is obvious that Bloom Filter cannot support deletion operations, because multiple elements may hash to the same position in the Bloom Filter. Refer to the Bloom Filter diagram above.
Space Occupancy Memory usage estimation for 10 billion URLs, 1% false positive rate:
- For Bitmap: length , occupying 125GB
- For Bloom Filter: bit array length , occupying approximately 12GB
III. Bloom Filter in Practice - Using Guava Library
In actual projects, the Bloom Filter provided in Google's Guava library is typically used. Guava Maven Repository
Small Case Study
Introduce Guava dependency
Core methods of Guava library's Bloom Filter:
put(): Insert elementmightContain(): Determine if element might exist
Example case:
Insert data, judge data, false positives occurred 100075 times.
IV. Guava BloomFilter Source Code Deep Analysis
BloomFilter Class
Encapsulates the core logic of Bloom Filter, such as adding, querying elements, initialization parameter calculation (such as bitmap size, number of hash functions), etc.
- create(Funnel, int, double) Lightweight user interface
- create(Funnel, long, double) Advanced interface
- create(funnel, expectedInsertions, fpp, Strategy) Core implementation
Factory creation of BloomFilter:
Core methods:
Funnel Interface
The main responsibility of the Funnel interface is to "funnel" an object's state into a PrimitiveSink, thereby producing a continuous byte sequence, so that custom structures can use Bloom Filters.
- T from: The object to be converted.
- PrimitiveSink into: The container receiving the object data. PrimitiveSink provides various methods to write primitive data types and strings into a unified byte stream.
Guava comes with native Funnels, such as integerFunnel(), stringFunnel(Charset), byteArrayFunnel(), etc.
Case Study
Goods
PojoBloomFilter
Funnel Function
- Ensure Consistency: Funnel ensures that as long as it is the same structure, the pre-processing before hashing is consistent.
- Control Granularity: For custom classes, if you want to use Bloom Filter, you can choose which fields to inject. For example, Goods injects id and name.
Strategy Interface (Default Implementation: MURMUR128MITZ64)
The Strategy interface defines the hash strategy, element insertion, and query specific operations of the Bloom Filter. Default uses MURMUR128MITZ64:
- MURMUR128: MurmurHash3 128-bit variant
- MITZ: Derived from Google engineer Mitz Pettel, designed the following "double hashing" technique
- 64: Uses 64 bits each time
It can be found that it does not actually use hash functions, and the implementation uses one Hash and splits it into two fields. "Less Hashing, Same Performance: Building a Better Bloom Filter" proposed an optimization algorithm that only needs to calculate two independent hash functions to construct a Bloom Filter equivalent to using multiple hash functions: $$gi(x)=h1(x)+i·h_2(x)$$
The strategy passes index = (combinedHash & Long.MAX_VALUE) % bitSize) to BitArray. The purpose is to make combinedHash positive, otherwise modulo with bitSize cannot be performed (negative index exception).
BitArray Module
BitArray is the core of the Bloom Filter's underlying implementation, responsible for the storage and management of the bit array.
- Uses long(64 bit) array
- bitCount is the number of bits set to 1, using native method Long.bitCount() to统计负载程度
- get method: index logically right-shifted to get long element index, index uses AND to get specific index after modulo
- set method: uses compareAndSet for atomic update
- putAll method: uses OR operation on two BitArrays, used for Bloom Filter merging
V. Bloom Filter Derived Technologies
Counting Bloom Filter
- Allows deletion operations based on ordinary Bloom Filter
- Implemented through bit counting method
- Disadvantage is larger space occupation
Bloomfilter

image.png|500
image.png|500
Periodic Reconstruction of Bloom Filter
- Periodic reconstruction
- Delete old Bloom Filter
- Reconstruct new Bloom Filter
VI. Practical Scenarios
Large-scale Product IDs
- User requests a product
- Unlike crawler URL filtering, no mistakes allowed
- Use database?
- Use cache?
- Use Bloom Filter?

image.png|700