Algorithm: Weighted Random Sampling

 2018-05-02

Uniform Random Sampling

I got handed a design that required choosing n items from a set of N items without repeats. In the literature this is called sampling without replacement. This algorithm was recently added to the C++17 standard as std::sample.

In fact std::sample is two similar algorithms, selection sampling and reservoir sampling. These implementations operate on a population range of ForwardIterators or InputIterators respectively. Recall that BidirectionalIterator (like std::list) and RandomAccessIterator (like std::vector) also model ForwardIterator and thus are handled by the selection sampling implementation. These algorithms are readily found in Knuth’s TAOCP Volume 2 3.4.2 Random Sampling and Shuffling.

Algorithm S: Selection Sampling

Algorithm R: Reservoir Sampling

Weighted Random Sampling

Later the original design was expanded upon, as is wont to happen, to require the items in the set to be weighted such that some were more likely to be chosen than others. This is known as Weighted Random Sampling or WRS and our particular design calls again for sampling without replacement. There are in fact two reasonable interpretations of these weights. This problem is covered in a wonderful paper by Pavlos S. Efraimidis, Weighted Random Sampling over Data Streams (2015).

A-ES: WRS-N-W

WRS-N-W stands for Weighted Random Sampling without Replacement, with defined Weights.

In this scheme the probability of every unselected item to be selected in that round is proportional to the relative item weight with respect to the weights of all unselected items.

Weighted Random Sampling over Data Streams, Efraimidis, 2015

Accelerating weighted random sampling without replacement, Muller, 2016

/// Sample n weighted random elements from population [first,last) without replacement.
/// Implements A-ES WRS-N-W algorithm.
/// see Weighted Random Sampling over Data Streams, Pavlos S. Efraimidis (2010)
///
/// \param first> beginning of the range of the sampling population
/// \param last> end of the range of the sampling population
/// \param name=weight> beginning of the range of sample weights parallel to the population range, weights must be > 0
/// \param name=out> beginning of output range where samples will be stored
/// \param name=n> the number of samples to make
/// \returns the end of the output range
template< class PopulationIterator, class WeightIterator, class SampleIterator, class Distance,
          class UniformRandomBitGenerator >
SampleIterator sample_weighted_w( PopulationIterator first, PopulationIterator last, WeightIterator weight,
                                  SampleIterator out, Distance n, UniformRandomBitGenerator&& g )
{
	using KeySample = std::pair<double, PopulationIterator>;
	auto cmp = [](const KeySample& x, const KeySample& y) { return x.first > y.first; };

	std::priority_queue<KeySample, std::vector<KeySample>, decltype(cmp)> queue(cmp);

	// we modify the original algorithm to operate on the logarithmic scale for increased numeric stability
	// see Accelerating weighted random sampling without replacement - Kirill Muller
	auto genkey = [&g](auto w) { return w / std::exponential_distribution<double>()(g); };

	// fill the reservoir
	for ( Distance i = 0; first != last && i < n; ++first, (void)++weight, (void)++i )
	{
		assert(*weight > 0);
		double k = genkey( *weight );
		queue.push( KeySample{k, first} );
	}	

	// replace in reservoir with weighted probability
	for (; first != last; ++first, (void)++weight )
	{
		assert(*weight > 0);
		
		double k = genkey( *weight )			;
		if (k > queue.top().first)
		{
			queue.pop();
			queue.push( KeySample{k, first} );
		}
	}

	while (!queue.empty())
	{
		*out++ = *queue.top().second;
		queue.pop();
	}
	return out;
}

A-Chao: WRS-N-P

WRS-N-P stands for Weighted Random Sampling without Replacement, with defined Probabilities.

In this scheme the probability of each item to be included in the random sample is proportional to its relative weight.

Assume any two items vi and vj of the population with weights wi and wj respectively. The probability pi that vi is in the random sample is equal to cpj, where pj is the probability that vj is in the random sample.

For heavy items with relative weight larger than 1/m we say that the respective items are “infeasible” or “overweight”. These items would have an inclusion probability greater than 1 which is impossible.

A General Purpose Unequal Probability Sampling Plan, Chao, 1982