Reservoir sampling

Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory. The most common example was labelled Algorithm R by Jeffrey Vitter in his paper on the subject.^[1]

This simple O(n) algorithm as described in the Dictionary of Algorithms and Data Structures^[2] consists of the following steps (assuming S is zero-based):

Fill up the 'reservoir' with the first k items from S
For every item S[j] where j ≥ k: 
    Choose a random integer r between 0 and j (inclusive)
    If r < k then replace element r in the reservoir with S[j]

Relation to Fisher-Yates shuffle

Suppose one wanted to draw k random cards from a deck of playing cards (i.e., n=52). A natural approach would be to shuffle the deck and then take the top k cards. In the general case, the shuffle also needs to work even if the number of cards in the deck is not known in advance, a condition which is satisfied by the inside-out version of the Fisher-Yates shuffle:

To initialize an array a of n elements to a randomly shuffled copy of S, both 0-based:
   a[0] ← S[0]
   for i from 1 to n - 1 do
       r ← random (0 .. i)
       a[i] ← a[r]
       a[r] ← S[i]

Note that although the rest of the cards are shuffled, only the top k are important in the present context. Therefore, the array a need only track the cards in the top k positions while performing the shuffle, reducing the amount of memory needed. Truncating a to length k, the algorithm is modified accordingly:

To initialize an array a to k random elements of S (which is of length n), both 0-based:
   a[0] ← S[0]
   for i from 1 to k - 1 do
       r ← random (0 .. i)
       a[i] ← a[r]
       a[r] ← S[i]

   for i from k to n - 1 do
       r ← random (0 .. i)

       if (r < k) then a[r] ← S[i]

Since the order of the first k cards is immaterial, the first loop can be removed and a can be initialized to be the first k items of S. This yields Algorithm R.

Example implementation

The following is a simple implementation of the algorithm in Python that samples the set of English Wikipedia page titles:

import random
SAMPLE_COUNT = 10

# Force the value of the seed so the results are repeatable
random.seed(12345)

sample_titles = []
for index, line in enumerate(open("enwiki-20091103-all-titles-in-ns0")):
        # Generate the reservoir
        if index < SAMPLE_COUNT:
                sample_titles.append(line)
        else:                   
                # Randomly replace elements in the reservoir with a decreasing probability                
                # Choose an integer between 0 and (index - 1) (inclusive)                
                r = random.randrange(index)                
                if r < SAMPLE_COUNT:                        
                        sample_titles[r] = line
print sample_titles

References

[1] Random Sampling with a Reservoir

[2] Dictionary of Algorithms and Data Structures

[1]

[2]