Reservoir sampling
This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these messages)
(February 2010)No issues specified. Please specify issues, or remove this template. |
Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory. The most common example was labelled Algorithm R by Jeffrey Vitter in his paper on the subject.[1]
This simple O(n) algorithm as described in the Dictionary of Algorithms and Data Structures[2] consists of the following steps (assuming S is zero-based):
Fill up the 'reservoir' with the first k items from S For every item S[j] where j ≥ k: Choose a random integer r between 0 and j (inclusive) If r < k then replace element r in the reservoir with S[j]
Relation to Fisher-Yates shuffle
Suppose one wanted to draw k random cards from a deck of playing cards (i.e., n=52). A natural approach would be to shuffle the deck and then take the top k cards. In the general case, the shuffle also needs to work even if the number of cards in the deck is not known in advance, a condition which is satisfied by the inside-out version of the Fisher-Yates shuffle:
To initialize an array a of n elements to a randomly shuffled copy of S, both 0-based:
a[0] ← S[0]
for i from 1 to n - 1 do
r ← random (0 .. i)
a[i] ← a[r]
a[r] ← S[i]
Note that although the rest of the cards are shuffled, only the top k are important in the present context. Therefore, the array a need only track the cards in the top k positions while performing the shuffle, reducing the amount of memory needed. Truncating a to length k, the algorithm is modified accordingly:
To initialize an array a to k random elements of S (which is of length n), both 0-based:
a[0] ← S[0]
for i from 1 to k - 1 do
r ← random (0 .. i)
a[i] ← a[r]
a[r] ← S[i]for i from k to n - 1 do
r ← random (0 .. i)
if (r < k) then a[r] ← S[i]
Since the order of the first k cards is immaterial, the first loop can be removed and a can be initialized to be the first k items of S. This yields Algorithm R.
Example implementation
The following is a simple implementation of the algorithm in Python that samples the set of English Wikipedia page titles:
import random
SAMPLE_COUNT = 10
# Force the value of the seed so the results are repeatable
random.seed(12345)
sample_titles = []
for index, line in enumerate(open("enwiki-20091103-all-titles-in-ns0")):
# Generate the reservoir
if index < SAMPLE_COUNT:
sample_titles.append(line)
else:
# Randomly replace elements in the reservoir with a decreasing probability
# Choose an integer between 0 and (index - 1) (inclusive)
r = random.randrange(index)
if r < SAMPLE_COUNT:
sample_titles[r] = line
print sample_titles