Text Classification with Sparse Composite Document Vectors (SCDV)

The Crux

Theory

Building an SCDV representation involves two steps:

1. Precomputing word-topic vectors

2. Building sparse document vectors from the word-topic vectors
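The two steps above can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes pretrained word vectors and idf weights are supplied as dicts, uses scikit-learn's `GaussianMixture` for the soft clustering, and replaces the paper's exact sparsification rule with a simple hard threshold at a fraction of the largest component; function names and parameter values are our own.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def word_topic_vectors(word_vecs, idf, n_clusters=3, seed=0):
    """Step 1: soft-cluster word vectors with a GMM, then build each
    word-topic vector as the concatenation over clusters of
    P(cluster | word) * word_vector, scaled by the word's idf weight."""
    words = sorted(word_vecs)
    X = np.stack([word_vecs[w] for w in words])
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed).fit(X)
    probs = gmm.predict_proba(X)  # shape: (n_words, n_clusters)
    wtv = {}
    for i, w in enumerate(words):
        # concatenate the cluster-probability-weighted copies of the word vector
        wtv[w] = idf[w] * np.concatenate([p * word_vecs[w] for p in probs[i]])
    return wtv

def scdv(doc_tokens, wtv, sparsity=0.04):
    """Step 2: average the word-topic vectors of the document's words,
    then zero out components whose magnitude falls below a fraction of
    the largest absolute component (a simplified sparsification)."""
    vecs = [wtv[w] for w in doc_tokens if w in wtv]
    dv = np.mean(vecs, axis=0)
    threshold = sparsity * np.abs(dv).max()
    dv[np.abs(dv) < threshold] = 0.0
    return dv
```

The resulting document vector has dimension (word-vector size x number of clusters); the thresholding is what makes it sparse and cheap to store, which is the source of the space savings reported below.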

Results and Analysis

Multi-class classification on the 20Newsgroups dataset

Performance Comparison

| Models | Accuracy (%) | Precision (%) | Recall (%) | F-score (%) |
| --- | --- | --- | --- | --- |
| SCDV | 84.6 | 84.6 | 84.5 | 84.6 |
| NTSG-1 | 82.6 | 82.5 | 81.9 | 81.2 |
| NTSG-2 | 82.5 | 83.7 | 82.8 | 82.4 |
| NTSG-3 | 81.9 | 83.0 | 81.7 | 81.1 |
| gwBoWV | 81.6 | 81.1 | 81.1 | 80.9 |
| LTSG | 82.8 | 82.4 | 81.8 | 81.8 |
| TWE-1 | 81.5 | 81.2 | 80.6 | 80.6 |
| WTM | 80.9 | 80.3 | 80.3 | 80.0 |
| BOW | 79.7 | 79.5 | 79.0 | 79.0 |
| w2v-LDA | 77.7 | 77.4 | 77.2 | 76.9 |
| TV+MeanWV | 72.2 | 71.8 | 71.5 | 71.6 |
| MvTM | 72.2 | 71.8 | 71.5 | 71.6 |
| lda2vec | 81.3 | 81.4 | 80.4 | 80.5 |
| TWE-2 | 79.0 | 78.6 | 77.9 | 77.9 |
| TWE-3 | 77.4 | 77.2 | 76.2 | 76.1 |
| weight-AvgVec | 81.9 | 81.7 | 81.9 | 81.7 |
| weight-BOC | 71.8 | 71.3 | 71.8 | 71.4 |
| PV-DBOW | 75.4 | 74.9 | 74.3 | 74.3 |
| PV-DM | 72.4 | 72.1 | 71.5 | 71.5 |
| AvgVec | 71.8 | 71.2 | 70.5 | 70.0 |

Time Comparison

| Time (sec) | gwBoWV | TWE-1 | SCDV |
| --- | --- | --- | --- |
| Cluster formation time | 90 | 660 | 90 |
| Document vector formation time | 1170 | 180 | 60 |
| Total training time | 1320 | 858 | 210 |
| Total prediction time | 780 | 120 | 42 |

Space Comparison

| Space Occupied | gwBoWV | SCDV |
| --- | --- | --- |
| Document vectors | 1.1 GB | 236 MB |

Multi-label classification on the Reuters-21578 dataset

Performance Comparison

| Models | Prec@1 | Prec@5 | nDCG@5 | Coverage error | LRAPS | Weighted F-score |
| --- | --- | --- | --- | --- | --- | --- |
| SCDV | 94.20 | 36.98 | 49.55 | 6.48 | 93.30 | 81.75 |
| gwBoWV | 92.9 | 36.14 | 48.55 | 8.16 | 91.46 | 79.16 |
| TWE-1 | 90.91 | 35.49 | 47.54 | 9.03 | 89.25 | 74.76 |
| PV-DM | 87.54 | 33.24 | 44.21 | 13.15 | 86.21 | 70.24 |
| PV-DBOW | 88.78 | 34.51 | 46.42 | 11.28 | 87.43 | 73.68 |
| AvgVec | 89.09 | 34.73 | 46.48 | 9.67 | 87.28 | 71.91 |
| tfidf AvgVec | 89.33 | 35.04 | 46.83 | 9.42 | 87.90 | 71.97 |

Conclusion