Bloom Filter Overview - Chanler's Blog

I. Scenario Introduction

Scenario - URL Deduplication Filtering at Billion-Scale Data Volume

Taking large-scale web crawlers as an example, in large-scale web crawlers targeting specific keywords, the number of URLs to be crawled may reach hundreds of millions. Duplicate URLs not only waste crawling resources but may also cause crawlers to fall into infinite loops or invalid crawling, severely affecting crawling efficiency and data quality.

Traditional Solutions

Store in database with unique constraint
- I/O bottleneck
Use cache, such as HashSet
- High memory overhead

Solution Optimization

Considering that URL crawler deduplication allows for false positives, a small portion of data misjudgment is completely acceptable at large data volumes.

Use Bitmap with single-hash compression storage, utilizing a binary array to record whether a URL has appeared, occupying less memory space.
- Memory usage estimation for 10 billion URLs: with 1% false positive rate, data volume $n=10^{10}$ , bitmap size $m=10^{12}$ , occupying 125GB.
Further optimize the bitmap, i.e., Bloom Filter, utilizing multiple ($k$) hash functions to map each element to multiple ($k$) bits in the binary array, representing one element with multiple bits.

II. Bloom Filter Introduction

Concept

Bloom Filter is a probabilistic data structure based on Bit Array. It can efficiently detect whether an element "might exist" or "definitely does not exist".

Space Efficient: Uses bit array to store data, greatly saving space.
Fast Query Speed: Insert and query operations both have constant time complexity $O(k)$ , where $k$ is the number of hash functions.
Possible False Positives: There is a possibility of misjudgment, i.e., possibly mistakenly believing that an element not inserted exists in the set (False Positive), but it will never misjudge that an element does not exist.

Diagram

Initialize Bloom Filter

Insert X Y

Query: Definitely Does Not Exist vs Might Exist Definitely Does Not Exist: Query and insert Z When querying Z, only one bit is found to be 1 and the other bit is 0, proving it definitely does not exist.

Might Exist: When querying D, both bits are found to be 1. It is impossible to determine whether it exists or not, i.e., possible false positive or true positive, actually false positive.

When querying X', both bits are found to be 1. It is impossible to determine whether it exists or not, i.e., possible false positive or true positive, actually true positive.

Parameters

Parameter Meanings

Number of elements to be inserted $n$
Bit array length $m$
Number of hash functions $k$
False positive rate $p$

Derivation Formulas

Probability that a certain bit remains 0 after inserting 1 element with $k$ hashes: $(1−\frac{1}{m})^k$
Probability that a certain bit remains 0 after inserting $n$ elements with $k$ hashes: $(1−\frac{1}{m})^{nk}$
Probability that a certain bit is set to 1 after inserting $n$ elements with $k$ hashes: $1−(1−\frac{1}{m})^{nk}$
Probability that bits obtained by hashing 1 element $k$ times are all set to 1: $(1−(1−\frac{1}{m})^{nk})^k$
Derivable false positive rate $p=(1−e^{−\frac{nk}{m}})^k$

Parameter Estimation Given the number of elements to be inserted $n$ and false positive rate $p$ :

Bit array length $m=−\frac{nlnp}{(ln2)^2}$
Number of hash functions $k=\frac{m}{n}ln2$

Characteristics

Add, Query, No Delete It is obvious that Bloom Filter cannot support deletion operations, because multiple elements may hash to the same position in the Bloom Filter. Refer to the Bloom Filter diagram above.

Space Occupancy Memory usage estimation for 10 billion URLs, 1% false positive rate:

For Bitmap: length $m=10^{12}$ , occupying 125GB
For Bloom Filter: bit array length $m=9.59*10^{10}$ , occupying approximately 12GB

III. Bloom Filter in Practice - Using Guava Library

In actual projects, the Bloom Filter provided in Google's Guava library is typically used. Guava Maven Repository

Small Case Study

Introduce Guava dependency

XML

<!-- https://mvnrepository.com/artifact/com.google.guava/guava -->  
<dependency>  
    <groupId>com.google.guava</groupId>  
    <artifactId>guava</artifactId>  
    <version>32.1.2-jre</version>  
</dependency>

Core methods of Guava library's Bloom Filter:

put(): Insert element
mightContain(): Determine if element might exist

Example case:

JAVA

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterExample
{
    private static Integer expectedInsertions = 10000000;
    private static Double fpp = 0.01;

    private static BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), expectedInsertions, fpp);

    public static void main(String[] args)
    {
        for (int i = 0; i < expectedInsertions; i++)
        {
            bloomFilter.put(i);
        }

        int count = 0;
        for (int i = expectedInsertions; i < expectedInsertions * 2; i++)
        {
            if (bloomFilter.mightContain(i))
            {
                count++;
            }
        }

        System.out.println("一共误判了：" + count);
    }
}

Insert $1e7$ data, judge $1e7$ data, false positives occurred 100075 times.

IV. Guava BloomFilter Source Code Deep Analysis

BloomFilter Class

Encapsulates the core logic of Bloom Filter, such as adding, querying elements, initialization parameter calculation (such as bitmap size, number of hash functions), etc.

create(Funnel, int, double) Lightweight user interface
create(Funnel, long, double) Advanced interface
create(funnel, expectedInsertions, fpp, Strategy) Core implementation

Factory creation of BloomFilter:

JAVA

public static <T extends @Nullable Object> BloomFilter<T> create(  
    Funnel<? super T> funnel, int expectedInsertions, double fpp) {  
  return create(funnel, (long) expectedInsertions, fpp);  
}

public static <T extends @Nullable Object> BloomFilter<T> create(  
    Funnel<? super T> funnel, long expectedInsertions, double fpp) {  
  return create(funnel, expectedInsertions, fpp, BloomFilterStrategies.MURMUR128_MITZ_64);  
}

Core methods:

JAVA

public boolean put(@ParametricNullness T object) {  
    return strategy.put(object, funnel, numHashFunctions, bits);  
}

public boolean mightContain(@ParametricNullness T object) {  
    return strategy.mightContain(object, funnel, numHashFunctions, bits);  
}

static int optimalNumOfHashFunctions(long n, long m) {  
    // (m / n) * log(2), but avoid truncation due to division!  
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));  
}

static long optimalNumOfBits(long n, double p) {  
    if (p == 0) {  
        p = Double.MIN_VALUE;  
    }  
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));  
}

Funnel Interface

The main responsibility of the Funnel interface is to "funnel" an object's state into a PrimitiveSink, thereby producing a continuous byte sequence, so that custom structures can use Bloom Filters.

JAVA

public interface Funnel<T> {
    void funnel(T from, PrimitiveSink into);
}

T from: The object to be converted.
PrimitiveSink into: The container receiving the object data. PrimitiveSink provides various methods to write primitive data types and strings into a unified byte stream.

Guava comes with native Funnels, such as integerFunnel(), stringFunnel(Charset), byteArrayFunnel(), etc.

Case Study

Goods

JAVA

@Data  
@AllArgsConstructor  
public class Goods {  
    private final int id;  
    private final String name;  
}

PojoBloomFilter

JAVA

public static void main(String[] args)  
{  
    // Expected to insert 10000 Person objects, false positive rate set to 0.01  
    BloomFilter<Goods> personBloomFilter = BloomFilter.create(  
            PersonFunnel.INSTANCE,  
            10000,  
            0.01  
    );  
  
    // Create and insert a Goods object  
    Goods apple = new Goods(1, "apple");  
    personBloomFilter.put(apple);  
  
    // Query whether a certain object might exist  
    Goods queryGoods1 = new Goods(1, "apple");  
    if (personBloomFilter.mightContain(queryGoods1))  
    {  
        System.out.println(queryGoods1 + "可能存在于布隆过滤器中");  
    }  
    else  
    {  
        System.out.println(queryGoods1 + "肯定不存在于布隆过滤器中");  
    }  
  
    Goods queryGoods2 = new Goods(2, "banana");  
    if (personBloomFilter.mightContain(queryGoods2))  
    {  
        System.out.println(queryGoods2 + "可能存在于布隆过滤器中");  
    }  
    else  
    {  
        System.out.println(queryGoods2 + "肯定不存在于布隆过滤器中");  
    }  
}

Funnel Function

Ensure Consistency: Funnel ensures that as long as it is the same structure, the pre-processing before hashing is consistent.
Control Granularity: For custom classes, if you want to use Bloom Filter, you can choose which fields to inject. For example, Goods injects id and name.

Strategy Interface (Default Implementation: MURMUR128MITZ64)

The Strategy interface defines the hash strategy, element insertion, and query specific operations of the Bloom Filter. Default uses MURMUR128MITZ64:

MURMUR128: MurmurHash3 128-bit variant
MITZ: Derived from Google engineer Mitz Pettel, designed the following "double hashing" technique
64: Uses 64 bits each time

It can be found that it does not actually use $k$ hash functions, and the implementation uses one Hash and splits it into two fields. "Less Hashing, Same Performance: Building a Better Bloom Filter" proposed an optimization algorithm that only needs to calculate two independent hash functions to construct a Bloom Filter equivalent to using multiple hash functions: $$gi(x)=h1(x)+i·h_2(x)$$

JAVA

public <T extends @Nullable Object> boolean put(  
    @ParametricNullness T object,  
    Funnel<? super T> funnel,  
    int numHashFunctions,  
    LockFreeBitArray bits) {  
    long bitSize = bits.bitSize();  
    byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();  
    long hash1 = lowerEight(bytes);  
    long hash2 = upperEight(bytes);  
    boolean bitsChanged = false;  
    long combinedHash = hash1;  
    for (int i = 0; i < numHashFunctions; i++) {  
        // Make the combined hash positive and indexable  
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);  
        combinedHash += hash2;  
    }  
return bitsChanged;  
}

The strategy passes index = (combinedHash & Long.MAX_VALUE) % bitSize) to BitArray. The purpose is to make combinedHash positive, otherwise modulo with bitSize cannot be performed (negative index exception).

BitArray Module

BitArray is the core of the Bloom Filter's underlying implementation, responsible for the storage and management of the bit array.

JAVA

static final class LockFreeBitArray {
    private static final int LONG_ADDRESSABLE_BITS = 6;  
    final AtomicLongArray data;  
    private final LongAddable bitCount;
    
    LockFreeBitArray(long bits) {  
        checkArgument(bits > 0, "data length is zero!");  
        // Avoid delegating to this(long[]), since AtomicLongArray(long[]) will clone its input and  
        // thus double memory usage.  
        this.data =  new AtomicLongArray(Ints.checkedCast(LongMath.divide(bits, 64, RoundingMode.CEILING)));  
        this.bitCount = LongAddables.create();  
    }
    
    boolean set(long bitIndex) {  
      if (get(bitIndex)) {  
          return false;  
      }  
      int longIndex = (int) (bitIndex >>> LONG_ADDRESSABLE_BITS);  
      long mask = 1L << bitIndex; // only cares about low 6 bits of bitIndex  
      long oldValue;  
      long newValue;  
      do {  
          oldValue = data.get(longIndex);  
          newValue = oldValue | mask;  
          if (oldValue == newValue) {  
              return false;  
          }  
      } while (!data.compareAndSet(longIndex, oldValue, newValue));  
      // We turned the bit on, so increment bitCount.  
      bitCount.increment();  
      return true;  
    }
    
    boolean get(long bitIndex) {  
      return (data.get((int) (bitIndex >>> LONG_ADDRESSABLE_BITS)) & (1L << bitIndex)) != 0;  
    }
}

Uses long(64 bit) array
bitCount is the number of bits set to 1, using native method Long.bitCount() to统计负载程度
get method: index logically right-shifted to get long element index, index uses AND to get specific index after modulo
set method: uses compareAndSet for atomic update
putAll method: uses OR operation on two BitArrays, used for Bloom Filter merging

V. Bloom Filter Derived Technologies

Counting Bloom Filter

Allows deletion operations based on ordinary Bloom Filter
Implemented through bit counting method
Disadvantage is larger space occupation

Bloomfilter

Counting Bloomfilter

Periodic Reconstruction of Bloom Filter

Periodic reconstruction
Delete old Bloom Filter
Reconstruct new Bloom Filter

VI. Practical Scenarios

Large-scale Product IDs

User requests a product
Unlike crawler URL filtering, no mistakes allowed
Use database?
Use cache?
Use Bloom Filter?