High performance latent dirichlet allocation for text mining

Liu, Zelong

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/7726

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Li, M	-
dc.contributor.author	Liu, Zelong	-
dc.date.accessioned	2013-11-28T10:12:01Z	-
dc.date.available	2013-11-28T10:12:01Z	-
dc.date.issued	2013	-
dc.identifier.uri	http://bura.brunel.ac.uk/handle/2438/7726	-
dc.description	This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.	en_US
dc.description.abstract	Latent Dirichlet Allocation (LDA) is a fully probabilistic generative model with a three-level Bayesian structure. LDA computes the latent topic structure of the data and extracts the significant information from documents. However, traditional LDA has several limitations in practical applications. LDA cannot be used directly for classification because it is an unsupervised learning model; it must be embedded into an appropriate classification algorithm. As a generative model, LDA often generates latent topics in categories to which the target documents do not belong, which introduces deviation in the computation and reduces classification accuracy. The number of topics in LDA greatly influences the learning of the model parameters. Noise samples in the training data also affect the final text classification result, and the quality of LDA-based classifiers depends to a great extent on the quality of the training samples. Although parallel LDA algorithms have been proposed to deal with huge amounts of data, balancing computing loads in a computer cluster poses another challenge. This thesis presents a text classification method that combines the LDA model with the Support Vector Machine (SVM) classification algorithm to improve classification accuracy while reducing the dimensionality of the datasets. Based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the algorithm automatically optimizes the number of topics to be selected, which reduces the number of iterations in computation. Furthermore, this thesis presents a noise data reduction scheme to process noise data. Even when the noise ratio in the training data set is large, the noise reduction scheme maintains a high level of classification accuracy. Finally, the thesis parallelizes LDA using the MapReduce model, which is the de facto computing standard for supporting data-intensive applications. A genetic-algorithm-based load balancing algorithm is designed to balance the workloads among computers in a heterogeneous MapReduce cluster where the computers have a variety of computing resources in terms of CPU speed, memory space and hard disk space.	en_US
dc.language.iso	en	en_US
dc.publisher	Brunel University School of Engineering and Design PhD Theses	-
dc.relation.uri	http://bura.brunel.ac.uk/bitstream/2438/7726/1/Fulltext.pdf	-
dc.subject	Probabilistic topic models	en_US
dc.subject	Text classification	en_US
dc.subject	Noisy data reduction	en_US
dc.subject	Parallel computing	en_US
dc.subject	Static load balancing	en_US
dc.title	High performance latent dirichlet allocation for text mining	en_US
dc.type	Thesis	en_US
Appears in Collections:	Electronic and Computer Engineering; Dept of Electronic and Electrical Engineering Theses

Files in This Item:

File	Description	Size	Format
Fulltext.pdf		1.69 MB	Adobe PDF
