Scalable k-NN based text clustering

Alessandro Lulli, Thibault Debatty, Matteo Dell'Amico, Pietro Michiardi, Laura Ricci

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate k-NN graph, which is then used to compute connected components representing clusters. Our focus is to understand the scalability / accuracy tradeoff that underlies our method: we do so through an extensive experimental campaign, where we use real-life datasets, and show that even rough approximations of k-NN graphs are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.
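The pipeline the abstract describes (build a k-NN graph over the texts, then read clusters off its connected components) can be illustrated with a minimal sketch. This is not the authors' implementation. It builds an exact, brute-force k-NN graph rather than an approximate one, so it does not scale. The character-trigram Jaccard similarity, the `min_sim` edge threshold, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_graph(texts, k=2, min_sim=0.3):
    """Brute-force k-NN graph: each item keeps its k most similar items.

    Edges below min_sim are dropped (an assumption of this sketch) so that
    unrelated items are not forced into the same cluster.
    """
    grams = [ngrams(t) for t in texts]
    graph = defaultdict(set)
    for i in range(len(texts)):
        sims = [(jaccard(grams[i], grams[j]), j)
                for j in range(len(texts)) if j != i]
        sims.sort(reverse=True)
        for sim, j in sims[:k]:
            if sim >= min_sim:
                # Store edges as undirected for the component search.
                graph[i].add(j)
                graph[j].add(i)
    return graph


def connected_components(n, graph):
    """Label each of the n nodes with a component id via iterative DFS."""
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for neigh in graph[node]:
                if labels[neigh] == -1:
                    labels[neigh] = current
                    stack.append(neigh)
        current += 1
    return labels


def cluster_texts(texts, k=2, min_sim=0.3):
    """Cluster texts: k-NN graph first, then connected components."""
    return connected_components(len(texts), knn_graph(texts, k, min_sim))
```

Calling `cluster_texts(["cheap watches buy now", "cheap watches buy today", "meeting agenda for monday"])` returns one integer label per input text, and texts that share a label belong to the same cluster. In this sketch, raising `k` or lowering `min_sim` tends to merge clusters, while the opposite tends to split them.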

Original language: English
Title of host publication: Proceedings - 2015 IEEE International Conference on Big Data, Big Data 2015
Editors: Howard Ho, Beng Chin Ooi, Mohammed J. Zaki, Xiaohua Hu, Laura Haas, Vipin Kumar, Sudarsan Rachuri, Shipeng Yu, Morris Hui-I Hsiao, Jian Li, Feng Luo, Saumyadipta Pyne, Kemafor Ogan
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 958-963
Number of pages: 6
ISBN (Electronic): 9781479999255
Publication status: Published - 22 Dec 2015
Event: 3rd IEEE International Conference on Big Data, Big Data 2015 - Santa Clara, United States
Duration: 29 Oct 2015 to 1 Nov 2015

Publication series

Name: Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015

Conference

Conference: 3rd IEEE International Conference on Big Data, Big Data 2015
Country/Territory: United States
City: Santa Clara
Period: 29/10/15 to 1/11/15
