Text Classification with Sparse Composite Document Vectors (SCDV)
Various machine learning algorithms are used for text classification. Since these algorithms do not understand raw text, documents must first be represented as fixed-dimension vectors.
We modify the feature-formation technique graded-weighted Bag of Word Vectors (gwBoWV) for faster and better composite document feature vector formation.
We propose a very simple feature-construction algorithm with the potential to overcome many weaknesses of current distributional vector representations and other composite document representation techniques.
We demonstrate the effectiveness of our method through experiments on multi-class classification on the 20Newsgroup dataset and multi-label text classification on the Reuters-21578 dataset.
Theory
There are two steps in building SCDV:
1. Precompute word-topics vectors.
2. Build sparse document vectors using the word-topics vectors.
Precomputation of word-topics vectors
Word vectors for all words in the vocabulary are clustered into K semantic clusters using a soft clustering algorithm (e.g., GMM), so each word belongs to every cluster with some probability.
For each word wi, K word-cluster vectors (wcvik) are created by multiplying the word vector by the probability of the word belonging to each cluster.
Concatenate the K word-cluster vectors and weight the result with the idf of wi to form the word-topics vector (wtvi).
This precomputation leads to a significant reduction in feature-formation time.
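A minimal sketch of this precomputation, using toy word vectors and placeholder idf values (in practice these would come from a trained word2vec model and the training corpus); K, d, and all names here are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-ins: in practice word_vectors come from a trained model
# (e.g. gensim word2vec) and idf values from the training corpus.
rng = np.random.default_rng(0)
vocab = ["price", "market", "stock", "team", "game"]
d, K = 50, 3                                   # embedding size, cluster count
word_vectors = {w: rng.normal(size=d) for w in vocab}
idf = {w: 1.0 for w in vocab}                  # placeholder idf values

X = np.stack([word_vectors[w] for w in vocab])            # (V, d)

# Soft clustering: each word belongs to every cluster with some probability.
gmm = GaussianMixture(n_components=K, covariance_type="diag",
                      random_state=0).fit(X)
probs = gmm.predict_proba(X)                              # (V, K)

# wcv_ik = P(cluster k | w_i) * wv_i; concatenate the K word-cluster
# vectors and scale by idf(w_i) to get the word-topics vector wtv_i.
word_topic_vectors = {
    w: idf[w] * (probs[i][:, None] * X[i][None, :]).ravel()   # (K*d,)
    for i, w in enumerate(vocab)
}
```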
Building sparse document vectors using word-topics vectors
Basic preprocessing, such as stop-word removal, is done on the documents.
For each word wi appearing in the preprocessed document Dj, sum the word-topics vectors (wtvi) to obtain the document vector dvDj.
Make the document vector dvDj sparse by zeroing feature values whose magnitude is below 5% of the average feature-value range, resulting in SCDVDj.
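A minimal sketch of document vector formation, reusing the word_topic_vectors dict from the previous sketch; the per-document 5% threshold below is an approximation of the sparsification step, which in the paper derives the threshold from averages over all documents:

```python
import numpy as np

def scdv_vector(doc_tokens, word_topic_vectors, p=0.05):
    """Sum precomputed word-topics vectors over a preprocessed document,
    then zero small feature values to make the vector sparse."""
    dim = len(next(iter(word_topic_vectors.values())))
    dv = np.zeros(dim)
    for w in doc_tokens:                        # stop words already removed
        if w in word_topic_vectors:
            dv += word_topic_vectors[w]
    # Per-document approximation of the 5% sparsity threshold: zero
    # features whose magnitude falls below p * (|max| + |min|) / 2.
    t = p * (abs(dv.max()) + abs(dv.min())) / 2
    dv[np.abs(dv) < t] = 0.0
    return dv

# Usage with the word_topic_vectors dict from the previous sketch:
doc = ["stock", "market", "price"]
scdv = scdv_vector(doc, word_topic_vectors)
```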
Results and Analysis
Multi-class classification on 20Newsgroup dataset
- We evaluate performance using standard metrics: accuracy, macro-averaged precision, macro-averaged recall, and macro-averaged F-measure (a minimal evaluation sketch follows the tables below).
- We compare our performance with NTSG, Paragraph Vector models (distributed memory and distributed bag of words), gwBoWV, TWE (TWE-1, TWE-2, TWE-3), weighted Bag of Concepts, and an average word-vector model (AvgVec), in which the document vector is the average of all word embedding vectors in the document.
- We also compare our results with the reported results of other topic-modeling-based document embedding methods: WTM, w2v-LDA, TV+MeanWV, LTSG, Gaussian-LDA, Topic2Vec, and MvTM.
- We perform significantly better than the current state-of-the-art, NTSG, on the 20Newsgroup dataset.
- We report a time comparison with gwBoWV and TWE-1 and a space comparison with gwBoWV in the tables below.
| Time (in seconds) | gwBoWV | TWE-1 | SCDV |
|---|---|---|---|
| Cluster formation time | 90 | 660 | 90 |
| Document vector formation time | 1170 | 180 | 60 |
| Total training time | 1320 | 858 | 210 |
| Total prediction time | 780 | 120 | 42 |
| Space | gwBoWV | SCDV |
|---|---|---|
| Document vectors | 1.1 GB | 236 MB |
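The evaluation sketch referenced above, using randomly generated stand-in features in place of real SCDV vectors and a linear SVM as the classifier (an assumption; the exact classifier settings may differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Stand-in features: each row would in practice be a document's SCDV
# vector and each label one of the 20 newsgroup classes.
X, y = make_classification(n_samples=400, n_features=150, n_informative=40,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

clf = LinearSVC().fit(X_tr, y_tr)      # linear SVM over document vectors
pred = clf.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))
p, r, f1, _ = precision_recall_fscore_support(y_te, pred, average="macro")
print(f"macro-P={p:.3f}  macro-R={r:.3f}  macro-F1={f1:.3f}")
```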
Multi-label classification on Reuters-21578 dataset
- We evaluate performance using standard metrics: Precision@K, nDCG@K, coverage error, label ranking average precision score (LRAPS), and weighted F-measure (a minimal scikit-learn sketch follows this list).
- We compare our performance with Paragraph Vector models (distributed memory and distributed bag of words), gwBoWV, TWE (TWE-1, TWE-2, TWE-3), an average word-vector model (AvgVec), and a tf-idf weighted average word-vector model (tfidf AvgVec).
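The multi-label metrics sketch referenced above, computed on toy data with scikit-learn's ndcg_score, coverage_error, and label_ranking_average_precision_score; the precision_at_k helper is a hypothetical illustration:

```python
import numpy as np
from sklearn.metrics import (ndcg_score, coverage_error,
                             label_ranking_average_precision_score)

# Toy multi-label data: rows are documents, columns are labels. In
# practice y_score would come from a one-vs-rest classifier over SCDV
# vectors and y_true from the Reuters-21578 label sets.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.6, 0.1],
                    [0.3, 0.8, 0.2, 0.4],
                    [0.7, 0.6, 0.1, 0.5]])

# Hypothetical helper: fraction of the top-K ranked labels that are relevant.
def precision_at_k(y_true, y_score, k):
    top_k = np.argsort(-y_score, axis=1)[:, :k]
    return np.mean([y_true[i, top_k[i]].mean() for i in range(len(y_true))])

print("Prec@1:", precision_at_k(y_true, y_score, 1))
print("nDCG@5:", ndcg_score(y_true, y_score, k=5))
print("coverage error:", coverage_error(y_true, y_score))
print("LRAPS:", label_ranking_average_precision_score(y_true, y_score))
```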
(Table comparing the models above on Prec@1, Prec@5, nDCG@5, coverage error, LRAPS, and weighted F-score.)
We modified gwBoWV, reducing overall feature-vector computation time and the space occupied by document vectors.
Our method (SCDV) outperforms both TWE and gwBoWV in performance and time complexity by a significant margin.