Sampling Parameters

Sampling is based on three parameters. They influence the number of documents that must be added to a sample.

Estimation Interval
The estimation interval defines an admissible error rate, that is, how far the percentage of confirmed documents among the reviewed documents in the universe, or in a specific review workflow, may diverge from the percentage of confirmed documents in the sample. The smaller the estimation interval, the more documents are randomly selected for the sample, and the greater the probability that the sample results are accurate.
Confidence Threshold
There is always some probability that a sample contains unusually many or unusually few relevant documents and is therefore not representative of the matter. The confidence threshold expresses the likelihood that the measurement over the sample is within the estimation interval. The higher the confidence threshold, the larger the sample.
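To illustrate how the estimation interval and the confidence threshold drive the sample size, the following minimal sketch uses the common normal-approximation formula for estimating a proportion. This formula is an assumption made for illustration; the formula that the product actually applies is not documented here. The function name sample_size and the parameter expected_rate are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(estimation_interval: float,
                confidence: float,
                expected_rate: float = 0.5) -> int:
    """Approximate sample size for estimating a proportion.

    Assumption: textbook normal approximation for a large universe,
    not necessarily the calculation used by the product.
    expected_rate = 0.5 is the most conservative (largest sample) choice.
    """
    # Two-sided critical value, e.g. confidence 0.95 gives z of about 1.96
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * expected_rate * (1 - expected_rate) / estimation_interval ** 2
    return math.ceil(n)


print(sample_size(0.05, 0.95))  # 385
print(sample_size(0.05, 0.99))  # 664 (higher confidence, larger sample)
print(sample_size(0.01, 0.95))  # 9604 (smaller interval, larger sample)
```

Under these assumptions, narrowing the estimation interval from 5% to 1% increases the sample roughly 25-fold, while raising the confidence threshold from 95% to 99% increases it less than twofold.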
Hypothesis Test
The hypothesis test determines how the tolerated rate of misclassified documents is calculated. The calculation is based on the number of misclassified documents found in the sample, for example during a second-level review of the sample.
There is a one-sided and a two-sided hypothesis test. Which one to use depends on the error rates you can tolerate, that is, whether you expect the error rates in your review results to be below a certain percentage or within a certain range.
  • One-sided hypothesis test
    With a one-sided hypothesis test, any review result is acceptable whose rate of misclassified documents is lower than the rate of misclassified documents in the sample minus the estimation interval.
    If, for example, the second-level review of the sample reveals a 10% rate of misclassified documents, and the estimation interval is 1%, the tolerated rate of misclassified documents is any percentage lower than 10% - 1% = 9%.
  • Two-sided hypothesis test
    This hypothesis test narrows down the tolerated rate of misclassified documents from two sides. The tolerated rate of misclassified documents must lie within the error rate in the sample plus/minus the estimation interval, like the plus/minus for opinion polls.
    If, for example, the second-level review of the sample reveals a 10% rate of misclassified documents, and the estimation interval is 1%, the tolerated rate of misclassified documents is any percentage higher than 10% - 1% = 9% and lower than 10% + 1% = 11%.
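The two hypothesis tests described above can be sketched as follows. This is only an illustration of the arithmetic in the examples, not the product's implementation; the function names tolerated_range and is_tolerated are hypothetical, and rates are expressed as fractions (0.10 for 10%).

```python
def tolerated_range(sample_error_rate: float,
                    estimation_interval: float,
                    two_sided: bool) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the tolerated misclassification rate.

    One-sided: any rate below (sample rate - estimation interval).
    Two-sided: any rate between (sample rate - estimation interval)
               and (sample rate + estimation interval).
    """
    lower = sample_error_rate - estimation_interval
    upper = sample_error_rate + estimation_interval
    if two_sided:
        return (lower, upper)
    # One-sided: only an upper limit applies; a rate cannot be below 0.
    return (0.0, lower)


def is_tolerated(rate: float,
                 sample_error_rate: float,
                 estimation_interval: float,
                 two_sided: bool) -> bool:
    """Check whether a review result's misclassification rate is tolerated."""
    low, high = tolerated_range(sample_error_rate, estimation_interval, two_sided)
    if two_sided:
        return low < rate < high
    return rate < high


# Example from the text: 10% sample error rate, 1% estimation interval.
print(is_tolerated(0.08, 0.10, 0.01, two_sided=False))   # True: below 9%
print(is_tolerated(0.095, 0.10, 0.01, two_sided=False))  # False: not below 9%
print(is_tolerated(0.095, 0.10, 0.01, two_sided=True))   # True: between 9% and 11%
print(is_tolerated(0.12, 0.10, 0.01, two_sided=True))    # False: above 11%
```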

Copyright © 2019 Open Text. All Rights Reserved. Trademarks owned by Open Text.