Reservoir Sampling
Reservoir Sampling is a family of randomized algorithms for randomly choosing a sample of k
items from a list S
containing n
items, where n
is either a very large or unknown number. Typically, n
is large enough that the list doesn't fit into memory.
Essentials
Problem:
Choose
k
entries fromn
numbers. Make sure each number is selected with the probability ofk/n
Basic idea:
Choose
1, 2, 3, ..., k
first and put them into the reservoir.For
k+1
, pick it with a probability ofk/(k+1)
, and randomly replace a number in the reservoir.For
k+i
, pick it with a probability ofk/(k+i)
, and randomly replace a number in the reservoir.Repeat until
k+i
reachesn
Proof:
For
k+i
, the probability that it is selected and will replace a number in the reservoir isk/(k+i)
For a number in the reservoir before (let's say
X
), the probability that it keeps staying in the reservoir isP(X was in the reservoir last time)
xP(X is not replaced by k+i)
=
P(X was in the reservoir last time)
x(1 - P(k+i is selected and replaces X))
=
k/(k+i-1)
x(1 - k/(k+i) x 1/k)
=
k/(k+i)
When
k+i
reachesn
, the probability of each number staying in the reservoir isk/n
Example
Choose
3
numbers from[111, 222, 333, 444]
. Make sure each number is selected with a probability of3/4
First, choose
[111, 222, 333]
as the initial reservior.Then choose
444
with a probability of3/4
For 111 , it stays with a probability of
P(444 is not selected)
+P(444 is selected but it replaces 222 or 333)
=
1/4
+3/4
*2/3
=
3/4
The same case with
222
and333
Now all the numbers have the probability of
3/4
to be picked
经典题目
Last updated