Authorship Clustering using Homogeneous Feature Space and Two-stepped Automatic Fuzzy Cmeans Clustering

Document Type: Original Article


Computer Engineering Department, Bu Ali Sina University, Hamedan, Iran


Identifying the author of an anonymous or disputed document is a cornerstone of automatic forensic applications. It is also a challenging task for both humans and computers, given the complex content of documents and the variety of their backgrounds. Owing to its nature, the task is generally treated as unsupervised. Clustering documents according to the linguistic style of the authors who wrote them has received little attention from the research community. To address this problem, the PAN Evaluation Framework became the first effort to promote the development of author clustering. There are different approaches to the task; this article proposes a method based on a set of homogeneous features and two-stepped automatic fuzzy C-means (FCM) clustering. We use word n-grams, part-of-speech tags and some other context-free features, then estimate the number of clusters with a document similarity graph (DSG), and finally use FCM to cluster the corpus. The method completes the task in a very short amount of time, and its performance is comparable with that of the leaderboard competitors in the PAN CLEF 2017 challenge.
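
The pipeline summarized above (n-gram features, a similarity graph to estimate the number of clusters, then FCM) can be illustrated with a minimal sketch. It is not the authors' implementation. The following are assumptions made only for illustration: the function names, the use of word unigram counts alone, the cosine threshold, taking the number of connected components of the thresholded graph as the cluster count, and the farthest-first initialization of the FCM centers.

```python
import math
from collections import Counter


def ngram_vector(text, n=1):
    """Count word n-grams in a lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[key] for key, v in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def estimate_k(vectors, threshold):
    """Assumed heuristic: link documents whose similarity reaches the
    threshold and return the number of connected components."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(vectors))})


def fuzzy_c_means(points, k, m=2.0, iters=50):
    """Standard FCM updates; returns the membership matrix u[i][c]."""
    n, dim = len(points), len(points[0])
    # Assumed initialization: farthest-first traversal (deterministic).
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    u = []
    for _ in range(iters):
        # Membership update: u_ic = 1 / sum_j (d_ic / d_ij)^(2 / (m - 1)).
        u = []
        for p in points:
            d = [max(math.dist(p, c), 1e-12) for c in centers]
            u.append([
                1.0 / sum((d[c] / d[j]) ** (2.0 / (m - 1.0)) for j in range(k))
                for c in range(k)
            ])
        # Center update: mean of the points weighted by u_ic^m.
        centers = []
        for c in range(k):
            w = [u[i][c] ** m for i in range(n)]
            total = sum(w)
            centers.append([
                sum(w[i] * points[i][t] for i in range(n)) / total
                for t in range(dim)
            ])
    return u


# Toy corpus (hypothetical): two pairs of documents with disjoint vocabularies.
docs = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "stock prices rose sharply today",
    "stock prices fell sharply today",
]
vectors = [ngram_vector(doc) for doc in docs]
k = estimate_k(vectors, threshold=0.5)

# Densify the sparse counts over a shared vocabulary before running FCM.
vocab = sorted(set().union(*vectors))
points = [[float(v[term]) for term in vocab] for v in vectors]
memberships = fuzzy_c_means(points, k)
labels = [row.index(max(row)) for row in memberships]
```

In this sketch the graph step fixes the number of clusters before FCM runs, which is one plausible reading of "two-stepped"; the threshold value strongly affects the estimate and would need tuning on real data.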


Main Subjects