Iter.fromArray / BloomFilter

Seb · March 8, 2021, 2:52pm

Hello everyone!

I was looking at this tutorial on implementing a BloomFilter:

I don’t understand why bitMap[digest] is set to true inside the for loop.

It was explained that one index in bitMap is set to 1 after applying multiple hash functions, not that multiple bitMap indices are found using different hash functions…
Maybe I don’t fully understand what Iter.fromArray does, and the explanation of the Iter module in the Motoko library didn’t clear it up for me.

I’d really appreciate it if someone could explain that to me.

cryptoschindler · March 8, 2021, 8:33pm

Hey @Seb, welcome to the forum and great first question!

Usually for a bloom filter you use multiple hash functions - which is also the case here. You take your item and hash it with each hash function f inside the hashFuncs array. The computed hash digest is used as the index for our bitMap to set the corresponding value to 1, indicating that we’ve seen this item.
This also gives a nice explanation of the concept:
https://llimllib.github.io/bloomfilter-tutorial/

I hope it helps. If not, please don’t hesitate to ask further questions.
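
For reference, the loop being discussed can be sketched roughly as follows. This is a minimal reconstruction, not the tutorial's exact code: the class shape, the `capacity` parameter, the `add` name, and the `Nat32.toNat` conversion are assumptions, and it assumes a recent base library in which `Hash.Hash` is `Nat32`. `Iter.fromArray(hashFuncs)` only turns the array of hash functions into an iterator, so the loop body runs once per hash function and sets one bit each time.

```motoko
import Array "mo:base/Array";
import Hash "mo:base/Hash";
import Iter "mo:base/Iter";
import Nat32 "mo:base/Nat32";

module {
  // Assumed shape: `capacity` is the bitmap length (must be > 0),
  // `hashFuncs` is the array of independent hash functions.
  public class BloomFilter<S>(capacity : Nat, hashFuncs : [(S) -> Hash.Hash]) {
    // One Bool per slot; every slot starts as false (0).
    let bitMap : [var Bool] = Array.init<Bool>(capacity, false);

    public func add(item : S) {
      // `f` is bound to each hash function in turn.
      for (f in Iter.fromArray(hashFuncs)) {
        // Each function hashes the original item, giving its own index.
        let digest = Nat32.toNat(f(item)) % capacity;
        // Set the bit for this function's index; one assignment per function.
        bitMap[digest] := true;
      };
    };
  };
}
```

With k hash functions, a single call to `add` sets up to k bits, which is why the assignment sits inside the loop rather than after it.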

Seb · March 8, 2021, 9:00pm

Hey! Thank you for your answer

I totally get the concept of the bloom filter, but I still don’t understand the way the code is structured.
The instruction bitMap[digest] := true should be placed after the for loop, not inside it.

cryptoschindler · March 8, 2021, 9:50pm

Hey Seb, sorry for my misunderstanding! Why should it be outside? Are you referring to this?

github.com

DFINITY-Education/data-structures/blob/master/module-2.md

# Module 2: Object-Oriented Data Structure: Bloom Filters

In this Module, you will implement a bloom filter that allows users to determine if an item is present in a given set.

## Background

A **Bloom filter** is a probabilistic data structure designed to indicate, with high efficiency and low memory, if an element is contained in a set. It's **probabilistic** because although it can tell you with certainty that an element is not in the data structure, it can only tell you that an element *may be* contained in the structure. In other words, false negative results (indicating the element doesn't exist in the set when it actually does) won't occur, but false positive results (indicating the element exists when it doesn't) are possible.

Such a data structure is especially useful in instances where we care more about ensuring that an element is definitely not in a set. For instance, when registering a new username, many services aim to quickly indicate whether a given name is already taken.  The cost of a false positive - indicating that a username is already taken when it is actually available - isn't high, so this tradeoff for increased efficiency is worthwhile.

Bloom filters use a **bitmap** as the base data structure. A bitmap is simply an array where each index contains either a 0 or a 1. The filter takes in the value that's being entered into the data structure, hashes it to multiple indices (ranging from 0 to the length - 1 of the bitmap) using several different hash functions, and stores a 1 at that particular index. The beauty of a bloom filter - and the aspect that makes it so space-efficient - is the fact that we don't need to actually store the given element in our set. We simply hash the element, go to the location in our bitmap that it hashes to, and insert a 1 into that spot (or multiple spots if using multiple hash functions).

**Example bitmap with values initialized to 0:**

| 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    |

To test for membership in the set, the program hashes the value being searched using the same aforementioned hash functions. If any of the resulting indices holds a 0 in the bitmap, then you know that the element is *not* in the set. If all of them hold a 1, then all you can conclude is that the element *might be* in the set. You cannot determine if the item exists with certainty because there could be other combinations of different hashed values that overlap with the same bits. Naturally, as you enter more elements into the data structure, the bitmap fills up and the probability of producing a false positive increases. [This interactive site](https://llimllib.github.io/bloomfilter-tutorial/) provides a great visual explanation of the mechanics behind bloom filters.


cryptoschindler · March 8, 2021, 9:57pm

The formulation here might be a bit unclear but I believe it is referring to the behaviour of the code that your screenshot shows.

The filter takes in the value that’s being entered into the data structure, hashes it to multiple indices (ranging from 0 to the length - 1 of the bitmap) using several different hash functions, and stores a 1 at that particular index.

Seb · March 8, 2021, 10:58pm

Okay, I get it this time!

My mistake was to think that hash function number 2 was applied to the result of hash function number 1…
I don’t know why I thought that, because it would serve no purpose…
We use multiple hash functions to decrease the risk of collisions. It makes sense now.

I really appreciate your help.
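
The point that each hash function is applied to the original item, rather than to the previous digest, can be sketched like this. The names `HashFn`, `indices`, and `mightContain` are hypothetical helpers for illustration only, and using `Nat32` digests is an assumption about the base library version.

```motoko
import Array "mo:base/Array";
import Iter "mo:base/Iter";
import Nat32 "mo:base/Nat32";

module {
  // Hypothetical alias: a hash function from items of type S to a Nat32 digest.
  public type HashFn<S> = S -> Nat32;

  // One bitmap index per hash function. Every function receives the original
  // `item`; no digest is fed into the next function. Traps if capacity is 0.
  public func indices<S>(item : S, hashFuncs : [HashFn<S>], capacity : Nat) : [Nat] {
    Array.map<HashFn<S>, Nat>(
      hashFuncs,
      func(f : HashFn<S>) : Nat { Nat32.toNat(f(item)) % capacity }
    )
  };

  // Membership test: the item might be present only if every index is set.
  public func mightContain<S>(bitMap : [var Bool], item : S, hashFuncs : [HashFn<S>]) : Bool {
    for (i in Iter.fromArray(indices<S>(item, hashFuncs, bitMap.size()))) {
      // A single unset bit proves the item was never added.
      if (not bitMap[i]) { return false };
    };
    true
  };
}
```

Note that chaining the functions would not even type-check for a generic `S`, since each function expects an `S` and returns a digest.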
