Topic Modeling Best Practices

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(4/5), 993-1022.

Blei, D. M., & Lafferty, J. (2009). Topic Models. In A. Srivastava, M. Sahami, & V. Kumar (Eds.), Text Mining (pp. 71-93). New York: Chapman and Hall/CRC.

Carron-Arthur, B., Reynolds, J., Bennett, K., Bennett, A., & Griffiths, K. M. (2016). What’s all the talk about? Topic modelling in a mental health Internet support group. BMC Psychiatry, 16, 1-12.

Carter, D. J., Brown, J., & Rahmani, A. (2016). Reading the high court at a distance: Topic modelling the legal subject matter and judicial activity of the high court of Australia, 1903-2015. University of New South Wales Law Journal, 39(4), 1300-1354.

Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., & Blei, D. (2009). Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems 22.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.

Hall, D., Jurafsky, D., & Manning, C. D. (2008). Studying the History of Ideas Using Topic Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 363-371). Stroudsburg, PA: Association for Computational Linguistics.

Jockers, M. (2011, September 29). The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors [Blog post].

Martin, F., & Johnson, M. (2015). More efficient topic modeling through a noun only approach. In Proceedings of the Australasian Language Technology Association Workshop (pp. 111-115).

Mimno, D. (2012). Computational historiography: Data mining in a century of classics journals. Journal on Computing and Cultural Heritage, 5(1), 3:1-3:19. doi: 10.1145/2160165.2160168

Mohr, J. W. & Bogdanov, P. (2013). Introduction—Topic models: What they are and why they matter. Poetics 41(6), 545-569. doi: 10.1016/j.poetic.2013.10.001

Savoy, J. (2013). Authorship attribution based on a probabilistic topic model. Information Processing and Management, 49(1), 341-354. doi: 10.1016/j.ipm.2012.06.003

Suominen, A., & Toivanen, H. (2016). Map of science with topic modeling: Comparison of unsupervised learning and human-assigned subject classification. Journal of the Association for Information Science & Technology, 67(10), 2464-2476. doi: 10.1002/asi.23596

Székely, N., & vom Brocke, J. (2017). What can we learn from corporate sustainability reporting? Deriving propositions for research and practice from over 9,500 corporate sustainability reports published between 1999 and 2015 using topic modelling technique. PLoS ONE, 12(4), 1-27. doi: 10.1371/journal.pone.0174807

Wallach, H., Mimno, D., & McCallum, A. (2009). Rethinking LDA: Why Priors Matter. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 22 (pp. 1973-1981). Curran Associates, Inc.

Wang, S.-H., Ding, Y., Zhao, W., Huang, Y.-H., Perkins, R., Zou, W., & Chen, J. (2016). Text mining for identifying topics in the literatures about adolescent substance use and depression. BMC Public Health, 16.

Yau, C.-K., Porter, A., Newman, N., & Suominen, A. (2014). Clustering scientific documents with topic modeling. Scientometrics, 100, 767.

Zheng, B., McLean, D., & Lu, X. (2006). Identifying biological concepts from a protein-related corpus with a probabilistic model. BMC Bioinformatics, 7(58). doi: 10.1186/1471-2105-7-58