TY - GEN
T1 - Building k-NN graphs from large text data
AU - Debatty, Thibault
AU - Michiardi, Pietro
AU - Thonnard, Olivier
AU - Mees, Wim
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014
Y1 - 2014
N2 - In this paper we present our new design of NNCTPH, a scalable algorithm to build an approximate k-NN graph from large text datasets. The algorithm uses a modified version of Context Triggered Piecewise Hashing to bin the input data into buckets, and uses NN-Descent, a versatile graph-building algorithm, inside each bucket. We use datasets consisting of the subject of spam emails to experimentally test the influence of the different parameters of the algorithm on the number of computed similarities, on processing time, and on the quality of the final graph. We also compare the algorithm with a sequential and a MapReduce implementation of NN-Descent. For our datasets, the algorithm proved to be up to ten times faster than NN-Descent, for the same quality of produced graph. Moreover, the speedup increased with the size of the dataset, making NNCTPH a sensible choice for very large text datasets.
AB - In this paper we present our new design of NNCTPH, a scalable algorithm to build an approximate k-NN graph from large text datasets. The algorithm uses a modified version of Context Triggered Piecewise Hashing to bin the input data into buckets, and uses NN-Descent, a versatile graph-building algorithm, inside each bucket. We use datasets consisting of the subject of spam emails to experimentally test the influence of the different parameters of the algorithm on the number of computed similarities, on processing time, and on the quality of the final graph. We also compare the algorithm with a sequential and a MapReduce implementation of NN-Descent. For our datasets, the algorithm proved to be up to ten times faster than NN-Descent, for the same quality of produced graph. Moreover, the speedup increased with the size of the dataset, making NNCTPH a sensible choice for very large text datasets.
UR - http://www.scopus.com/inward/record.url?scp=84921725248&partnerID=8YFLogxK
U2 - 10.1109/BigData.2014.7004276
DO - 10.1109/BigData.2014.7004276
M3 - Conference contribution
AN - SCOPUS:84921725248
T3 - Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
SP - 573
EP - 578
BT - Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
A2 - Lin, Jimmy
A2 - Pei, Jian
A2 - Hu, Xiaohua Tony
A2 - Chang, Wo
A2 - Nambiar, Raghunath
A2 - Aggarwal, Charu
A2 - Cercone, Nick
A2 - Honavar, Vasant
A2 - Huan, Jun
A2 - Mobasher, Bamshad
A2 - Pyne, Saumyadipta
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2nd IEEE International Conference on Big Data, IEEE Big Data 2014
Y2 - 27 October 2014 through 30 October 2014
ER -