Bloom Filter 和 Cuckoo Filter 的源码简析

Bloom Filter 和 Cuckoo Filter有类似的用途，很多人将它们称为概率数据结构或者近似成员数据结构(approximate membership data structure)。最近看了一个Bloom Filter和一个Cuckoo Filter的比较好的实现，代码都是用C++实现，代码量也不大，读起来没有什么难度。

本文贴出来的是两个项目的大体构成，不涉及到空间、错误率等的理论分析，希望能给需要的新手一些帮助。

两个项目的网址如下：

C++ Bloom filter library: https://code.google.com/p/bloom/
Cuckoo Filter: https://github.com/efficient/cuckoofilter

1. Bloom Filter

(https://code.google.com/p/bloom/)

源码目录：

$ tree
.
├── Makefile
├── bloom_filter.hpp
├── bloom_filter_example01.cpp
├── bloom_filter_example02.cpp
├── bloom_filter_example03.cpp
├── random-list.txt
├── word-list-extra-large.txt
├── word-list-large.txt
└── word-list.txt

文件包括三个用于测试example、三个用于测试insert的txt词典，bloom filter的实现在bloom_filter.hpp中。

bloom_filter.hpp中包括3个class：bloom_parameters、bloom_filter和compressible_bloom_filter。其中bloom_parameters是构建一个BF需要的参数，compressible_bloom_filter 类继承自bloom_filter类。

1.1 bloom_filter.hpp

bloom_parameters

包含的成员：

// BF所允许的最小、最大bit数
unsigned long long int minimum_size;
unsigned long long int maximum_size;

// BF所允许的最小、最大哈希函数
unsigned int minimum_number_of_hashes;
unsigned int maximum_number_of_hashes;

// 预计要插入的元素数，默认为10000
unsigned long long int projected_element_count;

// 预计的false positive概率，默认为projected_element_count的倒数
double false_positive_probability;

unsigned long long int random_seed;

// 以下关于“最优参数”的结构和方法计算限制projected_element_count和
// false_positive_probability条件下最小的bit数和最小的哈希函数数
struct optimal_parameters_t {
    ...
};
optimal_parameters_t optimal_parameters;
virtual bool compute_optimal_parameters()
{
    ...
}

bloom_filter

protected:

   typedef unsigned int bloom_type;
   typedef unsigned char cell_type;

public:

    //构造函数1，无参数，默参数默认都为0
    bloom_filter()
    :... 
    {}
    // 构造函数2，传入的参数为一个bloom_parameters对象，
    // 会根据对象中的optimal_parameters结构中的参数初始化一个BF
    bloom_filter(const bloom_parameters& p)
    :...
    {
        salt_count_ = p.optimal_parameters.number_of_hashes;
        table_size_ = p.optimal_parameters.table_size;
        generate_unique_salt();
        raw_table_size_ = table_size_ / bits_per_char;
        bit_table_ = new cell_type[static_cast<std::size_t>(raw_table_size_)];
        std::fill_n(bit_table_,raw_table_size_,0x00);
    }

    // 构造函数3，参数是另一个bloom filtrer
    bloom_filter(const bloom_filter& filter)
    :...
    {
        this->operator=(filter);
    }

    // 一些运算符重载
    inline bool operator == (const bloom_filter& f) const {...}
    inline bool operator != (const bloom_filter& f) const  {...}
    inline bloom_filter& operator = (const bloom_filter& f) {...}
    inline bool operator!() const {...}

    // 析构函数
    virtual ~bloom_filter()  {...}

    // 清空BF
    inline void clear()
    {
        std::fill_n(bit_table_,raw_table_size_,0x00);
        inserted_element_count_ = 0;
    }

    // 一些insert方法，参数不同
    inline void insert(const unsigned char* key_begin, const std::size_t& length)
    {
        std::size_t bit_index = 0;
        std::size_t bit = 0;
        for (std::size_t i = 0; i < salt_.size(); ++i)
        {
            compute_indices(hash_ap(key_begin,length,salt_[i]),bit_index,bit);
            bit_table_[bit_index / bits_per_char] |= bit_mask[bit];
        }
        ++inserted_element_count_;
    }
    template<typename T>
    inline void insert(const T& t) {...}
    inline void insert(const std::string& key) {...}
    inline void insert(const char* data, const std::size_t& length) {...}
    template<typename InputIterator>
    inline void insert(const InputIterator begin, const InputIterator end)  {...}

    // 一些contains(即BF lookup)方法
    inline virtual bool contains(const unsigned char* key_begin, const std::size_t length) const {...}
    template<typename T>
    inline bool contains(const T& t) const {...}
    inline bool contains(const std::string& key) const {...}
    inline bool contains(const char* data, const std::size_t& length) const {...}
    template<typename InputIterator>
    inline InputIterator contains_all(const InputIterator begin, const InputIterator end) const {...}
   inline InputIterator contains_none(const InputIterator begin, const InputIterator end) const {...}

    // size方法，返回BF所占空间
   inline virtual unsigned long long int size() const {
      return table_size_;
   }

    // element_count方法，返回BF总共所插入的元素
    inline std::size_t element_count() const {
        return inserted_element_count_;
    }
    // 返回当前计算的false positive 概率，而不是目标概率。
    inline double effective_fpp() const {...}

    // 一些运算符重载
    inline bloom_filter& operator &= (const bloom_filter& f) {...}
    inline bloom_filter& operator |= (const bloom_filter& f) {...}
    inline bloom_filter& operator ^= (const bloom_filter& f) {...}

    // 返回BF地址
    inline const cell_type* table() const {
      return bit_table_;
    }

    // 返回所用哈希函数数量
    inline std::size_t hash_count() {
      return salt_.size();
    }

    // 给定一个哈希值，计算素应该置位BF中哪些bits
    inline virtual void compute_indices(const bloom_type& hash, std::size_t& bit_index, std::size_t& bit) const
    {
      bit_index = hash % table_size_;
      bit = bit_index % bits_per_char;
    }

    // 生成哈希函数
    void generate_unique_salt() {...}

    // 计算一个元素的哈希值
   inline bloom_type hash_ap(const unsigned char* begin, std::size_t remaining_length, bloom_type hash) const {...}

compressible_bloom_filter

// 构造函数，通过一个bloom_parameters对象构造可以压缩的BF
compressible_bloom_filter(const bloom_parameters& p) : bloom_filter(p)
{
    size_list.push_back(table_size_);
}

// 返回BF的大小
inline unsigned long long int size() const {...}

// 压缩，参数是个小于1的数，即压缩后是占原来的大小的百分比
// 压缩的方法是将被压缩部分的bits置位到前面，这样会减小大小，但是提升false postive
inline bool compress(const double& percentage) {...}

// 计算大小
inline void compute_indices(const bloom_type& hash, std::size_t& bit_index, std::size_t& bit) const {...}

1.2 example01

设置目标元素为10000个，false postive为0.5%，然后计算最优的参数，并生成一个BF。(这时BF需要13794 Bytes，需要8个函数)
往里面插入3个string字符串和100个int数字。
测试contains方法，看里面是否有刚才插入的3个string和100个int。
测试contains，查找3个不存在的string的100个不存在的int。

1.3 example02

    // 从txt文件中读出要插入的word
    load_word_list(argc,argv,word_list))

    // 生成outliers
    generate_outliers(word_list,outliers);

    // 计算单词表一共占多大空间，存到word_list_storage_size变量中。
   for (unsigned int i = 0; i < word_list.size(); ++i)
   {
      word_list_storage_size += word_list[i].size();
   }

    // 指定一个false positive率
   const double desired_probability_of_false_positive = 1.0 / word_list.size();
    // 


    /*
      Terminology
      MinFPC : Minimum (smallest) False Positive Count
      MaxFPC : Maximum (largest) False Positive Count
      AverageFPC : Average False Positive Count
      AverageFPP : Average False Positive Probability

      FPQ  : False Positive Queries
      IPFP : Indicative (desired) False Positive Probability
      PFP  : Probability of a False Positive (based on the FPQ)
      DPFP : Difference as a percentage between IPFP and PFP
      TvD  : percentage of the filter size versus the raw data size
   */
   printf("Round\t   Queries\t   FPQ\t   IPFP\t           PFP\t            DPFP\t    TvD\n");
   unsigned int max_false_positive_count = 0;
   unsigned int min_false_positive_count = std::numeric_limits<unsigned int>::max();
   unsigned int total_false_positive = 0;
   unsigned int total_zero_fp = 0;
   unsigned long long int bloom_filter_size = 0;
   static const unsigned int rounds = 1000;

   // 以1~1000 作为随机种子。
   while (random_seed < rounds)
   {
      // 设置参数：目标元素数和false positive
      bloom_parameters parameters;
      parameters.projected_element_count    = word_list.size();
      parameters.false_positive_probability = desired_probability_of_false_positive;
      parameters.random_seed                = ++random_seed;

      if (!parameters)
      {
         std::cout << "Error - Invalid set of bloom filter parameters!" << std::endl;
         return 1;
      }
      // 计算最优BF大小等参数
      parameters.compute_optimal_parameters();
      // 初始化BF
      bloom_filter filter(parameters);
      // 插入word_list的所有词
      filter.insert(word_list.begin(),word_list.end());
      // lookup所有已经插入的词，并打印没有找到的元素，一般这情况不会发生
      std::vector<std::string>::iterator it = filter.contains_all(word_list.begin(),word_list.end());
      if (word_list.end() != it)
      {
         std::cout << "ERROR: key not found in bloom filter! =>" << (*it) << std::endl;
         return 1;
      }
      // lookup一些未插入的词，统计false positive个数
      unsigned int current_total_false_positive = 0;
      for (std::deque<std::string>::iterator it = outliers.begin(); it != outliers.end(); ++it)
      {
         if (filter.contains(*it))
         {
            ++current_total_false_positive;
         }
      }
      total_number_of_queries += (outliers.size() + word_list.size());
      // 计算false positive率
      double pfp = current_total_false_positive / (1.0 * outliers.size());

      printf("%6llu\t%10llu\t%6d\t%8.7f\t%8.7f\t%9.3f%%\t%8.6f\n",
              static_cast<unsigned long long>(random_seed),
              static_cast<unsigned long long>(total_number_of_queries),
              current_total_false_positive,
              desired_probability_of_false_positive,
              pfp,
              (100.0 * pfp) / desired_probability_of_false_positive,
              (100.0 * filter.size()) / (bits_per_char * word_list_storage_size));
      // 记录多个rounds中最小和最大的false positve个数
      if (current_total_false_positive < min_false_positive_count)
         min_false_positive_count = current_total_false_positive;
      else if (current_total_false_positive > max_false_positive_count)
         max_false_positive_count = current_total_false_positive;

     total_false_positive += current_total_false_positive;

     if (0 == current_total_false_positive)
        ++total_zero_fp;

     bloom_filter_size = filter.size();
   }
   // 平均每个round的false positive个数和百分比
   double average_fpc = (1.0 * total_false_positive) / rounds;
   double average_fpp = average_fpc / (outliers.size() + word_list.size());
   // 打印最小、最大的false positive数、平均FP百分率、无FP的round的百分比
   // 打印BF的大小、 所插入数据的总大小
   printf("Bloom Filter Statistics\n"
          "MinFPC: %d\tMaxFPC: %d\tAverageFPC: %8.5f\tAverageFPP: %9.8f Zero-FPC:%d\n"
          "Filter Size: %lluKB\tData Size: %dKB\n",
          min_false_positive_count,
          max_false_positive_count,
          average_fpc,
          average_fpp,
          total_zero_fp,
          bloom_filter_size / (8 * 1024),
          static_cast<unsigned int>(word_list_storage_size / 1024));
   return 0;

1.4 example03

是关于压缩bloom filter的测试。(略)

2. Cuckoo Filter

(https://github.com/efficient/cuckoofilter)

文件：

$ tree
.
├── LICENSE
├── Makefile
├── README.md
├── benchmarks
│   ├── Makefile
│   ├── bulk-insert-and-query.cc
│   ├── conext-figure5.cc
│   ├── conext-table3.cc
│   ├── random.h
│   └── timing.h
├── example
│   └── test.cc
└── src
    ├── bitsutil.h
    ├── cuckoofilter.h
    ├── debug.h
    ├── hashutil.cc
    ├── hashutil.h
    ├── packedtable.h
    ├── permencoding.h
    ├── printutil.cc
    ├── printutil.h
    ├── simd-block.h
    └── singletable.h

2.1 Example和Benchmarks

example中一个最简单的test.cc是一个最简单的功能测试。

benchmarks文件夹中有bulk-insert-and-query、conext-figure5和conext-table3三个benchmarks。具体的：

bulk-insert-and-query：用argv[1]个随机生成的元素批量Add到filter中测试insert操作，并用指定成功率(比如75%成功率每4个有一个应该是没有insert过的元素)的Contain()来测试lookup操作。
统计的信息包括：bulk insert和bulk query的速率(每秒多少次)。输出示例：

                  Million    Find    Find    Find    Find    Find                       optimal  wasted
                 adds/sec      0%     25%     50%     75%    100%       ε  bits/item  bits/item   space
Cuckoo12       23.78   37.24   35.04   37.17   37.35   36.35  0.131%      18.30       9.58   91.1%
SemiSort13       11.63   17.55   17.08   17.14   17.54   22.32  0.064%      18.30      10.62   72.4%
Cuckoo8       35.31   49.32   50.24   49.98   48.32   50.49  2.044%      12.20       5.61  117.4%
SemiSort9       13.99   22.23   22.78   22.13   23.16   24.06  1.207%      12.20       6.37   91.5%
Cuckoo16       27.06   36.94   37.12   35.31   36.81   35.10  0.009%      24.40      13.46   81.4%
SemiSort17       10.37   15.70   15.84   15.78   15.55   15.93  0.004%      24.40      14.72   65.8%
SimdBlock8       74.22   72.34   74.23   74.34   74.69   74.32  0.508%      12.20       7.62   60.1%
time: 14.34 seconds

测试的几种filter包括：cuckoo、semisort、semblock，每种每个元素又用不同的bits表示。

conext-figure5和conext-table3是作者CoNEXT会议上论文的对应figure5和table3两张图。
figure5：当filter满时的查找性能。

table3：相同大小和近似元素数的filter的false positive率和构建(插入？)速度。

2.2 具体实现

src文件夹是具体实现：

├── bitsutil.h     -----> 一些位操作的宏定义
├── cuckoofilter.h -----> cuckoofilter的实现
├── debug.h        -----> 一些用于DEBUG信息的宏定义
├── hashutil.cc    -----> 用于求指纹的几种哈希函数的实现
├── hashutil.h     --/
├── packedtable.h  -----> PackedTable类
├── permencoding.h --/    
├── printutil.cc   -----> 用于打印的封装
├── printutil.h    --/    
├── simd-block.h   -----> SIMD优化的block bloom filter的实现
└── singletable.h  -----> SingleTable类

cuckoofilter.h：
定义了模板类CuckooFilter，并定义了几个public方法：Add、Contain、Delete、Info、Size和SizeInBytes。

// A cuckoo filter class exposes a Bloomier filter interface,
// providing methods of Add, Delete, Contain. It takes three
// template parameters:
//   ItemType:  the type of item you want to insert
//   bits_per_item: how many bits each item is hashed into
//   TableType: the storage of table, SingleTable by default, and
// PackedTable to enable semi-sorting
template <typename ItemType, size_t bits_per_item,
          template <size_t> class TableType = SingleTable,
          typename HashFamily = TwoIndependentMultiplyShift>
class CuckooFilter {
    ...
}

如上图，创建一个cuckoofilter需要2~4个参数，其中元素的类型和每个元素指纹所占的bits数是必要的两个参数，后两个参数分别表示table 的类型和hash函数的类型，都有默认值。其中table类型中，SingleTable即普通的cuckoofilter，PackedTable即经过semi-sorting压缩过的cuckoofilter，可以节省一些空间。

下面是Add方法的实现，其中又调用了AddImpl方法。

template <typename ItemType, size_t bits_per_item,
          template <size_t> class TableType, typename HashFamily>
Status CuckooFilter<ItemType, bits_per_item, TableType, HashFamily>::Add(
    const ItemType &item) {
  size_t i;
  uint32_t tag;

  if (victim_.used) {
    return NotEnoughSpace;
  }

  GenerateIndexTagHash(item, &i, &tag);  // 生成item的index(位置/桶号)和tag(指纹)
  return AddImpl(i, tag); // 调用AddImpl进行insert
}

template <typename ItemType, size_t bits_per_item,
          template <size_t> class TableType, typename HashFamily>
Status CuckooFilter<ItemType, bits_per_item, TableType, HashFamily>::AddImpl(
    const size_t i, const uint32_t tag) {
  size_t curindex = i;
  uint32_t curtag = tag;
  uint32_t oldtag;

  for (uint32_t count = 0; count < kMaxCuckooCount; count++) { // kickout 过程
    bool kickout = count > 0;
    oldtag = 0;
    if (table_->InsertTagToBucket(curindex, curtag, kickout, oldtag)) {
      num_items_++;
      return Ok;
    }
    if (kickout) {
      curtag = oldtag;
    }
    curindex = AltIndex(curindex, curtag); // AltIndex找到当前桶号对应的另一个桶位置
  }

  victim_.index = curindex;
  victim_.tag = curtag;
  victim_.used = true;
  return Ok;
}

singletable.h和packedtable.h：

在cuckoofilter.h中的模板类中，有一个cuckoo哈希表的类型可以选择，即未压缩的SingleTable和用semi-sorting压缩过的PackedTable，对应的类分别定义在singletable.h和packedtable.h两个文件中。

以SingleTable为例，有NumBuckets、SizeInBytes、SizeInTags、Info、ReadTag、WriteTag、FindTagInBuckets、FindTagInBucket、DeleteTagFromBucket、InsertTagToBucket、NumTagsInBucket等方法，用途可顾名思义。

simd-block.h

这是一个block bloom filter的实现 (Putze, Felix, Peter Sanders, and Johannes Singler. “Cache-, hash-and space-efficient bloom filters.” International Workshop on Experimental and Efficient Algorithms. Springer, Berlin, Heidelberg, 2007.)：

(1)每个block 是一个小的BF，且

(2)每个Add()操作锁带来的bit修改是一定数量的，来保证可以用SIMD指令加速。

hashutil.h

包含了好几个把item哈希到32bit指纹的哈希函数实现：Bobhash、MurmurHash、SuperFastHash、NullHash、MD5Hash、SHA1Hash。后两种直接调用了OpenSSL的evp库。

JciX ~

Jc's Blog