An R package to identify sample cross-contamination in whole genome sequencing studies

Identification of sample cross-contamination is crucial in next-generation sequencing (NGS) studies because undetected contamination may bias association studies. In PCR-free germline multiplexed whole genome sequencing (WGS) studies, sample cross-contamination can be investigated by comparing the fraction of non-matching reads at homozygous sites with the expected sequencing error fraction and looking for an excess. In this presentation, we propose a probabilistic method to infer contaminated samples and their contaminants at low levels of contamination. Optionally, the distance on the well plate between the contaminant and the contaminated sample can be penalized. The method is implemented in a freely available R package. We compare it with three alternative methods (ART-DeCo, VerifyBamID2, and the built-in function of Illumina’s DRAGEN platform) and demonstrate its accuracy on simulated data. We illustrate the method using real data from the pilot phase of a large-scale NGS experiment with 9000 whole genome sequences. In the real data, our method successfully identified cross-contamination. Sample cross-contamination in NGS studies can thus be identified using a simple-to-use R package.
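
The core idea of testing for an excess of non-matching reads at homozygous sites can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R package's implementation or interface: the function names, the default error fraction of 0.001, and the use of a simple one-sided binomial test are all assumptions made here. The probabilistic method described above additionally infers the contaminant and can penalize well-plate distance, neither of which is shown.

```python
from math import exp, lgamma, log


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1.0 - p)
        )
        total += exp(log_pmf)  # terms far in the tail underflow harmlessly to 0.0
    return min(total, 1.0)


def excess_mismatch(sites, error_rate=0.001):
    """Pool read counts over homozygous sites and test for excess mismatches.

    sites      -- list of (non_matching_reads, depth) pairs, one per homozygous site
    error_rate -- assumed expected sequencing error fraction (hypothetical default)

    Returns (observed mismatch fraction, one-sided binomial p-value).
    """
    mismatches = sum(m for m, _ in sites)
    depth = sum(d for _, d in sites)
    observed = mismatches / depth
    p_value = binom_upper_tail(mismatches, depth, error_rate)
    return observed, p_value


# Toy data: 100 homozygous sites at depth 30.
clean = [(0, 30)] * 100          # no non-matching reads at any site
contaminated = [(1, 30)] * 100   # one non-matching read at every site

obs_clean, p_clean = excess_mismatch(clean)
obs_contam, p_contam = excess_mismatch(contaminated)
```

In this toy example the second sample has a mismatch fraction of about 3.3%, far above the assumed 0.1% error fraction, so its p-value is very small, while the first sample shows no excess at all.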