Topic Oriented Probability Based and Semi Supervised Document Clustering
DOI:
https://doi.org/10.51983/ajeat-2012.1.1.2506
Keywords:
Document Clustering, Text Documents, Word Frequency, Probability, Tokenization, Structural Filtering
Abstract
Clustering related or similar objects has long been regarded as a useful way to help users navigate an information space such as a document collection. The major challenge in document clustering, however, is high dimensionality. Data mining and statistical techniques have been applied with some success to large sets of documents to produce meaningful subsets automatically. Many clustering algorithms and techniques have been developed and implemented since the earliest days of computational information retrieval, but as document collections have grown, these techniques have not scaled to large collections because of their computational overhead. Traditional document clustering is usually treated as unsupervised learning, so it cannot effectively group documents according to a user's needs. To address this problem, the proposed system uses an interactive text clustering methodology: topic-oriented, probability-based, semi-supervised document clustering. The interactive approach lets a human refine the clustering output. The system's efficiency is evaluated by implementing it and testing the clustering results with the DBSCAN and K-means clustering algorithms. Experiments show that the proposed document clustering algorithm achieves an average efficiency of 94.4% across various document categories.
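The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a topic-oriented, probability-based assignment with user guidance might look like. It is an illustration under stated assumptions, not the authors' method: the seed-word dictionary `SEED_WORDS`, the function names, and the add-one smoothing are all hypothetical choices made for this example. Each document is tokenized, the frequencies of user-supplied seed words are counted per topic, and the counts are normalized into a probability for each topic.

```python
import re
from collections import Counter

# Hypothetical seed words a user might supply per topic (the semi-supervised hint).
SEED_WORDS = {
    "sports": ["match", "team", "goal"],
    "finance": ["bank", "stock", "market"],
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def topic_probabilities(text, seeds=SEED_WORDS):
    """Estimate P(topic | document) from seed-word frequencies.

    Add-one smoothing (an assumption of this sketch) keeps every topic
    at a non-zero probability even when none of its seed words occur.
    """
    counts = Counter(tokenize(text))
    scores = {
        topic: 1 + sum(counts[word] for word in words)
        for topic, words in seeds.items()
    }
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}

def cluster(documents, seeds=SEED_WORDS):
    """Place each document in the cluster of its most probable topic."""
    clusters = {topic: [] for topic in seeds}
    for doc in documents:
        probs = topic_probabilities(doc, seeds)
        best = max(probs, key=probs.get)
        clusters[best].append(doc)
    return clusters
```

In an interactive setting, a user could inspect the resulting clusters, edit the seed words, and call `cluster` again, which is one simple way to read "human refinement of clustering outputs".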
License
Copyright (c) 2012 The Research Publication
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.