Correction of Jaccard Similarity Index for Chance Agreement in Cluster Analysis

Grant Winner

Ahmed Albatineh, Ph.D. – Farquhar College of Arts and Sciences

Dean

Donald Rosenblum – Farquhar College of Arts and Sciences

Abstract

Cluster analysis is the art of uncovering structure in data sets using clustering algorithms. Often we are interested in measuring the similarity between two groupings of the same data set, which is done with similarity indices. Such indices are widely used in many disciplines, including gene expression and microarray analysis, marketing behavioral research, ecology, and botany, to name a few. The problem with such indices is that they do not account for agreement due to chance between the two groupings. In this study, I will derive a new mathematical procedure to correct the Jaccard similarity index for chance agreement, which will substantially improve the performance of this index in cluster structure recovery and validation studies. The Jaccard index was introduced in 1908 to measure the degree of relatedness between two biological communities with respect to their species composition, and it is widely used in ecology and botany. I believe the results of this study will be of great importance to colleagues working in ecology, botany, biology, gene expression and microarray data analysis, and any other field where cluster analysis and the measurement of similarity are of interest.
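
To make the quantity under discussion concrete, the following is a minimal sketch of the uncorrected Jaccard index between two groupings of the same data set. It assumes the common pair-counting formulation, in which the index is the number of point pairs placed together in both groupings divided by the number of pairs placed together in at least one. The abstract does not state which formulation the study uses, and the function name pair_jaccard is hypothetical. The sketch does not implement the chance correction proposed in the study, which is not given here.

```python
from itertools import combinations


def pair_jaccard(labels_a, labels_b):
    """Uncorrected Jaccard index between two groupings of the same points.

    Assumes the pair-counting formulation: pairs grouped together in both
    labelings, divided by pairs grouped together in at least one of them.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")

    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1

    denom = both + only_a + only_b
    # Assumption: if no pair is grouped together in either labeling
    # (all singletons), treat the two groupings as identical.
    return both / denom if denom else 1.0


# Two groupings of six points; the label values themselves are arbitrary.
grouping_1 = [0, 0, 0, 1, 1, 1]
grouping_2 = [0, 0, 1, 1, 2, 2]
print(pair_jaccard(grouping_1, grouping_2))  # 2 shared pairs out of 7
```

The index depends only on which points are grouped together, so relabeling the clusters leaves it unchanged. The raw value carries no baseline for what two unrelated groupings would score, which is the gap a chance correction is meant to close.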