Zero-Numerator Problem: Calculating the Expected Number of Mistakes in Data Entry Jobs

I have a project for which I need to digitize a series of tables from scanned PDF pages. Due to the scan quality, some pages are handled manually by research assistants. After receiving a new batch, I typically like to estimate how many data-entry errors to expect. I therefore draw a random evaluation sample and carefully check whether the entered numbers match the originals in the PDF. Depending on the sample size and the quality of the batch, I often arrive at a count of zero errors for the evaluation sample.

What does a zero count for the evaluation sample say about the distribution of errors in the entire batch? If the sample is randomly drawn, the number of errors, F, follows a binomial distribution with parameters S and p. The parameter S denotes the number of checked cells (the sample size) and p measures the true proportion of incorrect values across all cells. While S is known, p is not and constitutes the estimand.
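To see how easily a zero count arises even when the true error rate is nonzero, here is a minimal sketch in R (the values of S and p are purely illustrative):

# Probability of observing F = 0 errors among S checked cells,
# given a true error proportion p: P(F = 0) = (1 - p)^S
S <- 20
p <- 0.014
dbinom(0, size = S, prob = p)  # equals (1 - p)^S, roughly 0.75

In other words, with a true error rate of 1.4%, an evaluation sample of 20 cells comes back error-free about three times out of four.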

A typical estimator for p is the sample mean, but in the event of F=0 the estimate for p is 0 for any sample size. That is, the sample mean estimator always suggests that the expected number of errors is zero, independent of the size of the evaluation sample. The same holds for the usual interval estimator: with zero errors, the estimated sample variance is also 0, and consequently the 95% confidence interval collapses to a single point regardless of the sample size.
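Using the textbook Wald interval for illustration:

# Sample mean and Wald-type 95% CI after S checks with zero errors
S <- 20
errors <- rep(0, S)                  # evaluation sample with F = 0
p_hat <- mean(errors)                # point estimate: 0
se <- sqrt(p_hat * (1 - p_hat) / S)  # estimated standard error: 0
p_hat + c(-1.96, 1.96) * se          # interval collapses to [0, 0]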

A version of this problem is well known in the biostatistical literature. The task there is to compare two procedures (or treatments): one for which the risk of people being harmed (or killed) is well known, and another for which the only available sample information contains not a single harmed individual. The standard approach seems to be to estimate the upper bound of the 95% confidence interval by 3/S (also known as the rule of three).
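The rule of three is easy to check against the exact bound obtained by solving (1 - p)^S = 0.05 for p:

# Rule-of-three upper bound vs. the exact one-sided 95% bound
S <- 20
3 / S             # rule of three: 0.15
1 - 0.05^(1 / S)  # exact bound: about 0.139

The 3/S shortcut is a first-order approximation of the exact bound (since -log(0.05) is roughly 3) and works reasonably well for large S.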

As pointed out in several contributions (for a summary, see Winkler et al. 2002), a more principled way to form an estimate requires Bayesian Statistics and starts with the assignment of an informed prior density over p. This prior is then updated using the newly gathered data point '0 errors in an evaluation sample of size S'. To the extent that the prior is expressed as a Beta distribution, the calculations are algebraically easy: the Beta prior is conjugate to the binomial likelihood, so the posterior is again a Beta distribution.
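For reference, the conjugate update itself is a one-liner (the helper below is a hypothetical illustration, not part of any package):

# Beta-binomial conjugate update: a Beta(a, b) prior over p combined with
# F errors among S checked cells yields a Beta(a + F, b + S - F) posterior
update_beta <- function(a, b, S, F) c(a = a + F, b = b + S - F)
update_beta(a = 1, b = 1, S = 20, F = 0)  # flat prior -> Beta(1, 21)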

However, the tricky part is typically to come up with a reasonable prior density for p. In practice, I make use of some very old numbers reported in a paper by Baddeley and Longman (1978). Back in the days when postcodes had to be typed manually, they conducted an experiment with groups of postmen to evaluate which training procedures reduce the number of wrongly typed "alpha-numeric code material".

In the table below I reproduce the Baddeley-Longman numbers, together with the variance I calculated from the reported ranges. Averaging across the four groups suggests that the rate of errors is about 1.4% with a variance of about 0.7 (in squared percentage points); a quick numerical check follows the table. A more conservative set of estimates could be based on the values of group 4 alone (2.1 for the mean and 1.1 for the variance).

              Group 1      Group 2      Group 3      Group 4
% Incorrect   1.09         1.14         1.41         2.06
Range         0.22 – 2.18  0.06 – 2.42  0.40 – 3.45  0.38 – 4.65
N             18           18           18           18
Variance      0.49         0.59         0.76         1.07
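As a quick check of the pooled numbers (all values taken from the table above):

# Pooled prior moments across the four groups, in percent
mean(c(1.09, 1.14, 1.41, 2.06))  # 1.425, i.e. about 1.4%
mean(c(0.49, 0.59, 0.76, 1.07))  # 0.7275, i.e. about 0.7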

Using these values, it is easy to calculate the implied Beta prior, and then the posterior distribution and its characteristics once combined with the information from the evaluation sample. In Bayesian Statistics these calculations are known as a Beta-binomial update.
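Here is a sketch of the full calculation, assuming the pooled moments from above are converted to the proportion scale and turned into Beta parameters by the method of moments (this is my reading of the procedure; the exact parameterization used in the package may differ slightly):

# Fit Beta(a, b) to prior mean m and variance v by method of moments
m <- 0.014                 # pooled mean, proportion scale
v <- 0.7 / 100^2           # pooled variance, converted from percent scale
ab <- m * (1 - m) / v - 1  # a + b
a <- m * ab
b <- (1 - m) * ab
# Posterior after S = 20 checks with F = 0 errors: Beta(a, b + 20)
a / (a + b + 20)        # posterior mean, roughly 0.013
qbeta(0.95, a, b + 20)  # posterior 95% quantile, roughly 0.027

Both values land in the same ballpark as the output of the packaged function shown below.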

In order to simplify these calculations, I have written a short R function that performs them when given information about the evaluation sample. The user can choose between the two pre-programmed priors mentioned above or supply their own. The function is part of my datatools package and can be installed from my GitHub via devtools::install_github('sumtxt/datatools').

As an example, suppose we have an evaluation sample of S=20 with F=0 (no errors). Using the function, we find that the expected proportion of errors in the population is about 1.3% and the upper bound of the 95% credible interval equals 2.5%.

datatools::est_typing_err(20, F=0, prior='postmen', quantity='mean')
[1] 0.01306888

datatools::est_typing_err(20, F=0, prior='postmen', quantity='95q')
[1] 0.02538156

Notice: These estimates differ from those implied by the sample-mean estimator (0%) and the rule-of-three estimator (3/20 = 15%).