Scalable k-NN based text clustering

Alessandro Lulli, Thibault Debatty, Matteo Dell'Amico, Pietro Michiardi, Laura Ricci

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Peer-reviewed

Abstract

Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate k-NN graph, which is then used to compute connected components representing clusters. Our focus is to understand the scalability / accuracy tradeoff that underlies our method: we do so through an extensive experimental campaign, where we use real-life datasets, and show that even rough approximations of k-NN graphs are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.
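The pipeline outlined in the abstract (build a k-NN graph over the texts, then read clusters off its connected components) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes character-trigram Jaccard similarity, and it builds an exact k-NN graph by brute force where the paper builds an approximate one, so it does not scale. The similarity floor `min_sim` and all function names are hypothetical choices made for this illustration.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the set of character n-grams of a text (assumed feature choice)."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def knn_graph(texts, k=2, min_sim=0.3):
    """Exact brute-force k-NN graph, O(n^2) comparisons.

    Assumption: edges below `min_sim` are dropped so that unrelated
    texts are not linked merely because every node needs k neighbours.
    """
    sets = [shingles(t) for t in texts]
    edges = []
    for i in range(len(texts)):
        sims = [(jaccard(sets[i], sets[j]), j)
                for j in range(len(texts)) if j != i]
        sims.sort(reverse=True)  # most similar first
        for sim, j in sims[:k]:
            if sim >= min_sim:
                edges.append((i, j))
    return edges


def connected_components(n, edges):
    """Group node indices 0..n-1 into components with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


def cluster_texts(texts, k=2, min_sim=0.3):
    """Cluster texts as connected components of their k-NN graph."""
    return connected_components(len(texts), knn_graph(texts, k, min_sim))


texts = [
    "cheap pills online",
    "cheap pills online now",
    "cheap pills online today",
    "football match tonight",
    "football match tomorrow",
]
print(cluster_texts(texts))  # [[0, 1, 2], [3, 4]]
```

In this toy setting, `k` and `min_sim` are the knobs that trade cluster granularity against connectivity. A scalable variant would replace the brute-force neighbour search with an approximate k-NN graph construction, which is the tradeoff the abstract says the paper studies.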

Original language: English
Title of host publication: Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015
Editors: Feng Luo, Kemafor Ogan, Mohammed J. Zaki, Laura Haas, Beng Chin Ooi, Vipin Kumar, Sudarsan Rachuri, Saumyadipta Pyne, Howard Ho, Xiaohua Hu, Shipeng Yu, Morris Hui-I Hsiao, Jian Li
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 958-963
Number of pages: 6
ISBN (Electronic): 9781479999255
Publication status: Published - 22 Dec 2015
Event: 3rd IEEE International Conference on Big Data, IEEE Big Data 2015 - Santa Clara, United States
Duration: 29 Oct 2015 to 1 Nov 2015

Publication series

Name: Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015

Conference

Conference: 3rd IEEE International Conference on Big Data, IEEE Big Data 2015
Country/Territory: United States
City: Santa Clara
Period: 29 Oct 2015 to 1 Nov 2015
