Determining the k in k-means with mapreduce

Thibault Debatty, Wim Mees, Pietro Michiardi, Olivier Thonnard

Publication: Contribution to journal › Conference article › Peer-reviewed

Abstract

In this paper we propose a MapReduce implementation of G-means, a variant of k-means that automatically determines k, the number of clusters. We show that our implementation scales to very large datasets and very large values of k, as its computation cost is proportional to nk. Other techniques, which run a clustering algorithm with different values of k and choose the value that provides the best results, have a computation cost proportional to nk². We run experiments that confirm that the processing time is proportional to k. These experiments also show that, because G-means adds new centers progressively, if and where they are needed, it reduces the probability of falling into a local minimum and ultimately finds better centers than classical k-means processing.
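The gap between the two cost figures in the abstract can be illustrated with a small counting sketch. It is not taken from the paper: it assumes that n is the number of data points, that one clustering pass with k centers costs n * k distance computations, and that the progressive approach at most doubles the number of centers per round (an upper bound on growth, since only centers that need splitting are split). The function names are hypothetical.

```python
def cost_try_every_k(n: int, k_max: int) -> int:
    """Distance computations when clustering is run once for each k = 1..k_max.

    Assumed cost model (not from the paper): one pass with k centers costs n * k.
    The sum n * (1 + 2 + ... + k_max) grows like n * k_max ** 2.
    """
    return sum(n * k for k in range(1, k_max + 1))


def cost_progressive_doubling(n: int, k_max: int) -> int:
    """Distance computations when the number of centers doubles each round.

    Rounds use 1, 2, 4, ... centers while below k_max, followed by one final
    pass with k_max centers. Under the same cost model the total stays below
    2 * n * k_max, so it grows like n * k_max.
    """
    total, k = 0, 1
    while k < k_max:
        total += n * k
        k *= 2
    return total + n * k_max


if __name__ == "__main__":
    n = 1_000_000  # hypothetical dataset size
    for k_max in (16, 64, 256):
        every_k = cost_try_every_k(n, k_max)
        doubling = cost_progressive_doubling(n, k_max)
        print(f"k_max={k_max:4d}  every-k={every_k:.2e}  "
              f"doubling={doubling:.2e}  ratio={every_k / doubling:.1f}")
```

Under these assumptions, the ratio between the two totals grows roughly linearly with k_max, which is the extra factor of k separating nk² from nk.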

Original language: English
Pages (from-to): 19-28
Number of pages: 10
Journal: CEUR Workshop Proceedings
Volume: 1133
Publication status: Published - 2014
Event: 2014 Joint Workshops on International Conference on Extending Database Technology, EDBT 2014 and International Conference on Database Theory, ICDT 2014 - Athens, Greece
Duration: 28 March 2014 → …
