Estimation bias due to duplicated observations
The LCSR staff prepared a summary of the lecture «Estimation bias due to duplicated observations: A Monte Carlo simulation» by Francesco Sarracino (STATEC, HSE-LCSR), which was presented at the Sixth LCSR International Workshop.
Summary:
Availability of reliable data is a prerequisite for social science research. Applied researchers’ work often relies on survey data, and the quality of the results depends on accurate recording of respondents’ answers. However, this condition is not always met. Many widely used surveys contain a considerable number of duplicate records. Duplicate records are records in which the set of all (or nearly all) answers from a given respondent is identical to that of another respondent.
Surveys in social sciences usually include a large number of questions. Therefore, it is highly unlikely that two respondents provided identical answers to all (or nearly all) substantive survey questions. In other words, it is unlikely that two identical records come from answers of two real respondents. It is more probable that either one record corresponds to a real respondent and the second one is its duplicate, or that both records are fake. Duplicate records can result from an error or forgery by interviewers, data coders, or data processing staff and should, therefore, be treated as suspicious observations. Unfortunately, little is known about the bias induced by duplicated observations in survey data.
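As a minimal illustration of the definition above, the following Python sketch counts how often each complete answer pattern occurs in a dataset and flags records that occur more than once. The toy data and the names `duplicate_multiplicity`, `survey` and `flagged` are hypothetical and do not come from the lecture. The sketch only catches exact duplicates; near-duplicates (records identical on nearly all answers) would need a similarity threshold that is not shown here.

```python
from collections import Counter

def duplicate_multiplicity(records):
    """Return, for each record, how many times its exact answer pattern occurs."""
    counts = Counter(tuple(r) for r in records)
    return [counts[tuple(r)] for r in records]

# Hypothetical toy survey: each row holds one respondent's answers.
survey = [
    (1, 4, 2, 5),
    (3, 3, 1, 2),
    (1, 4, 2, 5),  # identical to the first row, so the two form a doublet
    (2, 5, 5, 1),
]

multiplicity = duplicate_multiplicity(survey)
flagged = [m > 1 for m in multiplicity]  # True marks a suspicious record
print(multiplicity)  # [2, 1, 2, 1]
```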
Dr. Sarracino and Dr. Mikucka are the first to analyze the effect of duplicate records on estimates obtained in OLS regression. They assess the severity of the bias induced by duplicate records in two scenarios. First, they focus on the bias induced by the number of duplications, that is, how many times a single record is repeated (for example, a sextuplet). Second, they examine the bias due to the number of duplicated records, that is, how many different records are repeated (for example, 79 doublets). Furthermore, they investigate how the risk of obtaining biased estimates changes when the duplicates are situated in specific parts of the distribution (the center, the lower and the upper tail, or across the whole distribution). Finally, they compare the ‘naive’ estimation, which ignores the presence of duplicate records, with four alternative solutions for decreasing the bias: (1) excluding duplicates from the analysis; (2) flagging duplicates and controlling for them in the estimation; (3) using robust regression, which weights down the impact of influential cases on the estimates; and (4) weighting the observations by the inverse of the duplicates’ multiplicity.
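To make the last of these solutions concrete, here is a small Python sketch of weighting by the inverse of the multiplicity. It is an illustration under simplifying assumptions, not the authors' code: it uses a single predictor, made-up data, and a hypothetical helper `wls_slope`. Each record that occurs m times receives the weight 1/m, and the weighted slope is compared with the ‘naive’ slope that gives every record a weight of 1.

```python
from collections import Counter

def wls_slope(x, y, w):
    """Weighted least-squares slope of y on x (one predictor plus intercept)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

# Hypothetical toy data: the last record appears three times (a triplet).
x = [1.0, 2.0, 3.0, 4.0, 9.0, 9.0, 9.0]
y = [1.2, 1.9, 3.2, 3.8, 2.0, 2.0, 2.0]

# Multiplicity of each record, counting identical (x, y) pairs.
counts = Counter(zip(x, y))
mult = [counts[(xi, yi)] for xi, yi in zip(x, y)]

naive = wls_slope(x, y, [1.0] * len(x))               # duplicates ignored
weighted = wls_slope(x, y, [1.0 / m for m in mult])   # inverse multiplicity
print(naive, weighted)
```

In this toy case each copy in the triplet gets the weight 1/3, so the three copies together count as a single observation, while the ‘naive’ fit lets the repeated record pull the slope toward itself.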
To this end, the authors generated a dataset (N = 1,500) with four variables with a known covariance matrix. They used a Monte Carlo simulation with 1,000 replications to investigate the effect of 40 patterns of duplicate records on the bias of regression estimates. Moreover, they used DFBETAS to assess the severity of the bias related to the various patterns of duplicate records and to the various solutions.
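The general shape of such a simulation can be sketched in Python as follows. This is a strongly simplified illustration and not the authors' procedure: it uses one predictor instead of four variables with a known covariance matrix, an assumed true slope of 0.5, and an assumed way of injecting duplicates (randomly chosen records are overwritten with copies of other records, so the sample size stays constant). The criterion for calling an estimate unbiased is also an assumption: a DFBETAS-style standardized shift of the slope below 2/sqrt(N), which is the conventional cut-off for DFBETAS. The function names and defaults are hypothetical.

```python
import math
import random

def ols_slope_and_se(x, y):
    """OLS slope of y on x (with intercept) and the slope's standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, math.sqrt(rss / (n - 2) / sxx)

def share_unbiased(n=1500, n_doublets=79, reps=200, seed=0):
    """Share of replications in which the duplicates barely move the slope."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.5 * xi + rng.gauss(0, 1) for xi in x]  # assumed true slope: 0.5
    clean_slope, _ = ols_slope_and_se(x, y)
    unbiased = 0
    for _ in range(reps):
        # Assumed injection scheme: overwrite randomly chosen records with
        # copies of other randomly chosen records (n stays the same).
        idx = rng.sample(range(n), 2 * n_doublets)
        src, dst = idx[:n_doublets], idx[n_doublets:]
        xd, yd = x[:], y[:]
        for s, d in zip(src, dst):
            xd[d], yd[d] = x[s], y[s]
        dup_slope, se = ols_slope_and_se(xd, yd)
        # Assumed criterion: standardized shift below 2 / sqrt(n).
        if abs(dup_slope - clean_slope) / se < 2 / math.sqrt(n):
            unbiased += 1
    return unbiased / reps

print(share_unbiased(reps=50))
```

Because of these simplifications, the share printed by the sketch is not expected to reproduce the percentages reported in the lecture; it only shows how a pattern of duplicates can be injected repeatedly and how the resulting estimates can be compared with the duplicate-free baseline.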
The authors show that the risk of obtaining biased estimates of regression coefficients increases with the number of duplicate records. If the data contain a single sextuplet (i.e. less than 1% of the sample), the probability of obtaining unbiased estimates is 41.6%. If the data contain 79 doublets of identical records (i.e. duplicates sum up to about 10% of the sample), the probability of obtaining unbiased estimates is about 11.4%. Hence, even a small number of duplicate records creates a considerable risk of obtaining biased estimates. This suggests that researchers who fail to account for the presence of duplicate records may reach misleading conclusions.
Additionally, the authors provide evidence that the risk of bias is not lower if the duplicate records are located close to the center of the distribution of the dependent variable. The differences between the ‘typical’, ‘unconstrained’ and ‘deviant’ variants are small. Even if duplicate observations are drawn from the center of a distribution, the risk of obtaining biased estimates remains high: 60.4% in the case of a ‘naive’ regression run on data with a sextuplet of duplicated records, and 87.9% if the data contain 79 doublets of duplicated records.
The authors also explore the effectiveness of the five approaches under comparison (the ‘naive’ estimation and the four alternatives) in minimizing the estimation bias induced by the presence of duplicated records. They demonstrate that weighting the duplicates by the inverse of their multiplicity is the best of the solutions considered. They argue that this solution outperforms ‘naive’ estimates in the presence of one doublet, and that it performs as well as dropping or flagging the duplicates when one triplet, quadruplet, quintuplet or sextuplet is present in the data. Weighting by the inverse of the multiplicity is also the best solution for minimizing the bias when the number of duplicated records increases. The performance of this solution decreases as the number of duplicates grows, but the chances of obtaining unbiased estimates are higher than with the alternative solutions. Finally, robust regression, which weights down the impact of influential cases on the estimated regression coefficients, performs poorly in all cases. The risk of obtaining biased estimates is higher with robust regression than when duplicate cases are ignored in the analysis (‘naive’ estimation).
These results are sobering, but they leave room for optimism. Although duplicate data plague many surveys, it is possible to adopt solutions that minimize the risk of biased estimates. The authors suggest three possible solutions for reducing the bias: (1) excluding duplicates from the analysis, (2) flagging and controlling for the duplicated records, or (3) weighting by the inverse of multiplicity. This conclusion emphasizes the importance of collecting high-quality data, since correcting the data with statistical tools remains a challenging task. The authors call for further research on how to address the presence of multiple doublets in surveys.
Author of the news: Ivan Aimaliev,
Laboratory for Comparative Social Research
- A video recording of the lecture is also available on YouTube