DBSCAN ALGORITHM FOR DOCUMENT CLUSTERING

Authors

  • Radu George Cretulescu "Lucian Blaga" University of Sibiu
  • Daniel Morariu "Lucian Blaga" University of Sibiu
  • Macarie Breazu "Lucian Blaga" University of Sibiu
  • Daniel Volovici "Lucian Blaga" University of Sibiu

Abstract

Document clustering is the problem of automatically grouping similar documents into categories based on a similarity metric. Most available data, particularly on the web, is unclassified, so we need powerful clustering algorithms that can work with this type of data. Search engines, for example, return a list of pages relevant to the user query, and this list must be generated quickly and as accurately as possible. In this paper we present a clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and its limitations when clustering documents (or web pages). Documents are represented using the “bag-of-words” representation (word occurrence frequency), a representation on which many algorithms usually fail. We use Information Gain as the feature selection method and evaluate the DBSCAN algorithm by its capacity to assign all the samples in the dataset to clusters.
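
As a rough illustration of the setup described in the abstract, the following minimal Python sketch clusters bag-of-words vectors with a textbook DBSCAN and reports the fraction of samples that end up in some cluster. It is not the authors' implementation: the cosine distance, the values of eps and min_pts, the toy documents, and all function names are assumptions made for this example, and the Information Gain feature selection step is omitted.

```python
import math
from collections import Counter

NOISE = -1  # label for samples that belong to no cluster


def bag_of_words(text):
    """Word occurrence frequencies for one document."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse frequency vectors (assumed metric)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors, eps, min_pts):
    """Return one label per vector: a cluster id (0, 1, ...) or NOISE."""
    n = len(vectors)
    # The eps-neighbourhood of a sample includes the sample itself.
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbours[j])
        cluster += 1
    return labels


def coverage(labels):
    """Fraction of samples that were integrated into some cluster."""
    return sum(1 for label in labels if label != NOISE) / len(labels)


# Toy documents, invented for this example only.
docs = [
    "stock market shares rise",
    "stock market shares fall",
    "market shares stock trading",
    "football match goal scored",
    "football match goal saved",
    "recipe for apple pie",
]
vectors = [bag_of_words(d) for d in docs]
labels = dbscan(vectors, eps=0.3, min_pts=2)
print(labels, coverage(labels))
```

On these toy documents the last one shares no words with the others and is left as noise, so coverage is below 1. Samples left outside every cluster are what the evaluation criterion in the abstract measures.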

Published

2019-12-05

Section

Articles