Duplicate Detection in Biological Data using Association


Koh J. L., Lee M. L., Khan M. A., Tan P. T., Brusic V.

2nd European Workshop on Data Mining and Text Mining for Bioinformatics, Piacenza, Italy, 24 - 26 September 2004, pp.35-41, (Full Text Paper)

  • Publication Type: Conference Paper / Full Text Paper
  • City of Publication: Piacenza
  • Country of Publication: Italy
  • Pages: pp.35-41
  • Affiliated with Bezmiâlem Vakıf Üniversitesi: No

Abstract

Recent advances in biotechnology have produced massive amounts of raw biological data, which are accumulating at an exponential rate. Errors, redundancy and discrepancies are prevalent in the raw data, and there is a serious need for systematic approaches to biological data cleaning. This work examines the extent of redundancy in biological data and proposes a method for detecting duplicates. Duplicate relations in a real-world biological dataset are modeled as association rules, so that these rules can be induced from data with known duplicates using association rule mining. Our approach of using association rule induction to find duplicate relations is new. Evaluation of our method on a real-world dataset shows that our duplicate association rules can accurately identify up to 96.8% of the duplicates in the dataset, with 0.3% false positives and 0.0038% false negatives.
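
The idea of inducing duplicate rules from labeled data can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the attributes, thresholds, or mining algorithm used, so the attribute names (`sequence`, `organism`, `length`), the function `mine_duplicate_rules`, the thresholds, and the toy data below are all hypothetical. Each record pair is represented by the set of attributes on which the two records agree, plus a known duplicate label, and the sketch enumerates rules of the form "these attributes match, therefore duplicate" by brute force, keeping those with sufficient support and confidence.

```python
from itertools import combinations


def mine_duplicate_rules(pairs, min_support=0.3, min_confidence=0.9):
    """Return rules (attributes, support, confidence) for 'attributes match => duplicate'.

    pairs: list of (set of matching attribute names, is_duplicate) tuples.
    Support is the fraction of all pairs that match on the attributes AND are duplicates;
    confidence is the fraction of pairs matching on the attributes that are duplicates.
    """
    n = len(pairs)
    attributes = sorted(set().union(*(matched for matched, _ in pairs)))
    rules = []
    # Brute-force enumeration of attribute subsets (fine for a handful of attributes).
    for size in range(1, len(attributes) + 1):
        for itemset in combinations(attributes, size):
            items = frozenset(itemset)
            covered = [dup for matched, dup in pairs if items <= matched]
            if not covered:
                continue
            both = sum(covered)  # pairs that match on the itemset and are duplicates
            support = both / n
            confidence = both / len(covered)
            if support >= min_support and confidence >= min_confidence:
                rules.append((itemset, support, confidence))
    return rules


# Hypothetical toy data: which attributes agree for each record pair, and the known label.
example_pairs = [
    ({"sequence", "organism"}, True),
    ({"sequence", "organism", "length"}, True),
    ({"sequence", "length"}, True),
    ({"organism"}, False),
    ({"length"}, False),
    ({"organism", "length"}, False),
]

rules = mine_duplicate_rules(example_pairs)
for itemset, support, confidence in rules:
    print(itemset, round(support, 3), round(confidence, 3))
```

On this toy data only rules that include a matching `sequence` survive the thresholds. A real dataset would call for an efficient frequent-itemset miner rather than exhaustive enumeration.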