Topic modeling is a text mining technique that discovers patterns of word co-occurrence across a corpus of documents; these patterns are then conceived of as hidden “topics” present in the corpus. Topic modeling has been used successfully to facilitate information retrieval, to classify documents, and to explore large corpora of texts.
Although the application of topic modeling algorithms to large corpora is accomplished with computers, a successful topic model requires a researcher to prepare the corpus for processing and to make decisions about the algorithm’s parameters, such as how many topics should be inferred and how those topics should be distributed across the documents. This literature review addresses these questions by first introducing the concept of topic modeling, then briefly surveying select studies that have utilized topic models, and finally examining how those studies, and other relevant literature, address the decisions that must be made by the user of a topic model.
Topic modeling began in the 1980s as a means of improving information retrieval. An early topic modeling algorithm, Latent Semantic Analysis (LSA), was introduced by Deerwester, Dumais, Furnas, Landauer, and Harshman (1990). LSA groups documents together based on their latent semantic structure, assuming that words with similar semantic ranges tend to occur together. The algorithm improved upon earlier modes of document retrieval because users no longer had to search with the exact words that appear in related documents; similarity could be discovered through the wider semantic range of a term.
Blei, Ng, and Jordan (2003) built upon LSA and developed another algorithm for topic modeling called Latent Dirichlet Allocation (LDA). The principles of LDA (and some of its extensions) are described in a mathematical manner in Blei and Lafferty (2009). In LDA, a document is conceived as a collection of observable words, but each of those words is assumed to be drawn from an underlying and unobserved (hence “latent”) topic. LDA measures the probabilistic distribution of a particular word over a set of topics, and the distribution of a given topic over a set of documents. LDA “discovers” the topics by calculating patterns of word co-occurrence across a collection of documents, or corpus. Each topic in the model is distributed across each document as a probability, and the topic probabilities within a single document sum to 1.0. For example, if a model has 10 topics and topic 4 has a probability of 0.3 in “document A,” the probabilities of the remaining 9 topics in that document sum to 0.7.
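To make these distributions concrete, the following is a minimal sketch using the Gensim library and a tiny invented corpus; both the corpus and the parameter values are illustrative only, not drawn from the studies discussed here.

```python
# A minimal sketch of LDA's document-topic distributions, assuming Gensim and
# a tiny toy corpus (illustrative only).
from gensim import corpora, models

docs = [["whale", "sea", "ship", "harpoon"],
        ["ball", "courtship", "letter", "estate"],
        ["whale", "ship", "voyage", "captain"]]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, random_state=1)

# For any document, the topic probabilities sum to (approximately) 1.0.
doc_topics = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)
print(doc_topics)                      # e.g. [(0, 0.7...), (1, 0.2...)]
print(sum(p for _, p in doc_topics))   # ~1.0
```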
While many discussions of topic modeling and LDA focus on the mathematics, there are other introductions to LDA that describe the process by metaphor. Perhaps most informative is Jockers (2011), who uses the imagery of a buffet. Imagine a buffet of themes where Herman Melville and Jane Austen are filling their plates. Melville tends to grab words from the “whaling” and “seafaring” sections, whereas Austen takes her words from the “gossip” and “courtship” sections. Each author may grab larger or smaller handfuls from each section for their respective plates, but at the end of the day, those are the words they have to work with. In other words, the intuition behind LDA is that each document is a distribution of topics and each topic is characterized by a specific set of words. Assuming that is the case, LDA finds words that often occur together and treats them as topics. The distribution of these topics across documents can be measured by the observable words that actually appear in the documents.
Topic modeling has had success not just in information retrieval, but also in document classification and exploratory analysis of large corpora. It is not surprising, then, that examples of topic modeling can be found across a variety of academic domains: information science, the social sciences, legal studies, and the humanities. In the following paragraphs I highlight a few studies in these domains. These studies form the basis of my later discussion of how to set the parameters of a successful topic model.
The field of information science contains a number of examples of using topic models to classify scientific articles. Yau, Porter, Newman, and Suominen (2014) demonstrate that a topic model can be used to successfully group a broad collection of scientific articles by discipline. Important though that may be, topic models can also be used to classify much narrower and more specific corpora. Zheng, McLean, and Lu (2006) use a topic model to identify recurring themes in biomedical studies of proteins. More recently, Suominen and Toivanen (2016) apply a topic model to the “cartography of science” and argue that because topic models do not require predefined labels, they are useful for classification because “new to the world knowledge” does not have to be fitted to historically established categories. Hall, Jurafsky, and Manning (2008) use a topic model to study changes in computational linguistics from 1978 to 2006. These researchers are able to identify a number of trends in the field, including the rise and fall of certain ideas and the convergence of areas of focus for major conferences.
Researchers in the social sciences have also put topic modeling to good use. Carron-Arthur, Reynolds, Bennett, Bennett, and Griffiths (2016) use a topic model to analyze postings in an online support group. The goal of the study is to identify coherent topics and then to analyze which types of users contributed to which kinds of topics. Székely and vom Brocke (2017) identify and analyze recurring themes in corporate sustainability reports in order to see how important environmental issues are being addressed by different corporations. Finally, Wang, Ding, Zhao, Huang, Perkins, Zou, and Chen (2016) use a topic model to find recurring themes in the scholarly literature on adolescent substance abuse and depression.
The field of legal studies also contains an example of topic modeling. Carter, Brown, and Rahmani (2016) use a topic model to analyze judgments from the High Court of Australia. This study also argues that topic modeling may open up new modes of legal scholarship which ought to be pursued by legal experts.
The humanities have also seen a large number of examples of topic modeling. Mimno (2012) uses topic models to explore a very large corpus of academic journals in the field of Classical Studies. This model allows Mimno to draw conclusions about how the field has changed over the last century. Authorship attribution is often an important research question in the humanities, and Savoy (2013) demonstrates that a topic model can outperform traditional metrics for this task such as the Delta rule or chi-squared distance. In 2013 the journal Poetics devoted a whole issue to the use of topic modeling in humanities-related studies. In the introductory article to this issue, Mohr and Bogdanov (2013) apply a topic model to the other articles in the issue as a means of showing what a topic model can do in the service of exploratory analysis.
Each of the studies touched on here shows that topic modeling has proven to be a useful tool for a variety of researchers. More importantly, however, each of these studies provides concrete examples of how a topic model can be implemented. In what follows I review how these studies shed light on the decisions that need to be made when implementing a model.
Before a topic model can be built on a corpus of documents, those documents must be processed in a way that not only allows the computer to read them but also ensures that only the most informative features of those documents are used for the topic model. Székely and vom Brocke (2017) enumerate the steps used to prepare a corpus of corporate sustainability reports: (1) collect the data, (2) use optical character recognition (OCR) to convert PDF files to plain text files, (3) filter out non-English words, (4) tokenize the documents, (5) clean the text by making every letter lowercase and removing special characters and numbers, (6) lemmatize the words, and (7) remove stop words.
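As a rough illustration, steps (4) through (7) of such a pipeline might look like the following sketch. It assumes the documents are already plain text and uses NLTK; the toolkit and the sample sentence are my choices for illustration, not those of the studies cited here.

```python
# A sketch of a common preprocessing pipeline: tokenize, lowercase, drop
# non-alphabetic tokens, lemmatize, and remove stop words. Assumes NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# one-time downloads of the resources this sketch relies on
for resource in ("punkt", "wordnet", "stopwords"):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text.lower())                 # tokenize and lowercase
    tokens = [t for t in tokens if t.isalpha()]          # drop numbers and punctuation
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # reduce to lexical form
    return [t for t in tokens if t not in stop_words]    # remove stop words

print(preprocess("The reports describe new policies and emissions targets for 2017."))
# e.g. ['report', 'describe', 'new', 'policy', 'emission', 'target']
```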
Most researchers who build topic models follow a similar pipeline for processing texts, but there is some variation. The decision to filter out non-English words may make a model easier to build (because you do not have to account for two words meaning the same thing), but it may also eliminate informative features. Mimno (2012) makes the deliberate decision to keep foreign languages in his corpus because it allows him to note differences between classical studies conducted in English-speaking countries and those conducted in Germany.
The lemmatization step mentioned above prevents the topic modeling algorithm from counting “write,” “writes,” and “writing,” for example, as three distinct word types. Lemmatization accomplishes this goal by reducing each word to its lexical form. However, lemmatization is not the only strategy for addressing this issue. Zheng et al. (2006) use stemming rather than lemmatization when processing biomedical literature on proteins. Stemming does not reduce words to their lexical form; instead it breaks off the morphological ending of a word and returns only the stem, so “write,” “writes,” and “writing” are all reduced to “writ.” The difficulty with stemming, however, is that in some instances it can be confusing to move from the stem back to the actual word used (Mimno, 2012). Lemmatization and stemming each normalize the text in helpful ways, but more work may need to be done in this regard. Savoy (2013) also normalizes the newspaper articles he processes to standardize spelling and contractions, changing “don’t” to “do not,” for example.
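The difference between the two strategies can be seen in a small sketch, again assuming NLTK. Note that exact stems depend on the algorithm chosen; the aggressive Lancaster stemmer is shown here for illustration, while gentler stemmers such as Porter's may preserve more of the original word.

```python
# A sketch contrasting stemming and lemmatization, assuming NLTK
# (nltk.download('wordnet') is needed for the lemmatizer).
from nltk.stem import WordNetLemmatizer, LancasterStemmer

words = ["write", "writes", "writing"]
stemmer = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(w) for w in words])                   # stems, e.g. ['writ', 'writ', 'writ']
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # lemmas: ['write', 'write', 'write']
```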
Removing stop words is a common practice in preparing documents for topic modeling because stop words are so frequent that they are not informative of a document’s content. Stop words, however, are not the only words which need to be removed to build an informative topic model. Depending on the nature of the corpus, other words may need to be removed as well. The corpus of scientific articles processed by Suominen and Toivanen (2016), for example, tended to have acknowledgment sections which were not informative for their project, so words in those sections were removed.
After stop words and words from superfluous sections are removed, a document will still contain words which are uninformative. A word that is used too rarely will not be informative for the topic model because it will fail to show how documents are similar to one another, and a word that is used too frequently will fail to show how documents differ from one another. There is no standard guideline for deciding what counts as too rare or too frequent: Székely and vom Brocke (2017) remove words that appear in fewer than two documents, Carter et al. (2016) remove words that appear in fewer than 50 documents or in more than 50% of the corpus, Suominen and Toivanen (2016) remove words that occur only once in their corpus, and Mimno (2012) recommends what could be called a 5-10 rule: remove words that appear fewer than 5-10 times and words that appear in more than 5-10% of the corpus. These guidelines are helpful but cannot be applied without consideration of the corpus being used and the objective of the topic model. Székely and vom Brocke (2017) make a point of manually checking the remaining vocabulary in their corpus before building their topic model.
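In Gensim, thresholds like these can be applied through the corpus dictionary, as in the following sketch. The toy corpus is a stand-in, and the thresholds (keep words appearing in at least two documents but in no more than 50% of documents) merely echo the kinds of cutoffs mentioned above; they are illustrative, not prescriptive.

```python
# A sketch of filtering too-rare and too-frequent words with Gensim's Dictionary.
from gensim.corpora import Dictionary

docs = [["climate", "emission", "report"],
        ["climate", "emission", "audit"],
        ["climate", "water", "report"],
        ["climate", "waste", "recycling"]]

dictionary = Dictionary(docs)
# keep words in at least 2 documents and in no more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
print(dictionary.token2id)  # e.g. only 'emission' and 'report' survive
```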
Martin and Johnson (2015) provide another alternative for finding the most informative features of a corpus: eliminating all words except nouns. These researchers conduct an experiment in which they build three topic models, each based on a different version of the same news corpus. The first version is a raw corpus, the second is a lemmatized corpus, and the third is a lemmatized corpus that contains nouns only. The researchers note that content-rich information is typically carried by a document’s nouns rather than by its verbs or adjectives (though the latter may be useful if one is attempting sentiment analysis). In this experiment the lemmatized corpus and the lemmatized noun-only corpus produce more semantically coherent topics than the raw corpus does. The advantage of the noun-only corpus over the lemmatized corpus is that it eliminates a number of words (verbs and adjectives, for example) and therefore improves efficiency in building the topic model.
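A lemmatized, noun-only version of a document can be produced with a part-of-speech tagger, as in the sketch below. It assumes spaCy and its small English model (installed with python -m spacy download en_core_web_sm); the tooling and the sample sentence are my choices for illustration, not necessarily those used by Martin and Johnson.

```python
# A sketch of building a lemmatized, noun-only representation of a document.
import spacy

nlp = spacy.load("en_core_web_sm")

def noun_lemmas(text):
    doc = nlp(text)
    # keep only nouns (and proper nouns), reduced to their lemmas
    return [tok.lemma_.lower() for tok in doc if tok.pos_ in ("NOUN", "PROPN")]

print(noun_lemmas("The court delivered its judgments on the disputed contracts."))
# e.g. ['court', 'judgment', 'contract']
```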
One of the parameters required by LDA before building a topic model is the number of topics which the model should infer from the document corpus (this value is labeled K in the literature). Yau et al. (2014) see this as a limitation of the LDA algorithm because researchers cannot know in advance how many topics to select for the value of K. Carter et al. (2016) argue that there is no definitive answer to how many topics a corpus contains, and Suominen and Toivanen (2016) conclude that the number of topics chosen can only be evaluated in light of the “real world performance” of the model. Therefore, researchers tend toward a trial-and-error approach in choosing the number of topics the model should extract from the corpus. Carron-Arthur et al. (2016) try several values for K ranging from 10 to 100 in increments of 10. The researchers evaluate the results qualitatively by looking at the top words in each topic. They observe that models containing more than 30 topics tend toward duplication, but models below 20 topics tend to merge what the researchers believe ought to be distinct topics. Therefore, 25 topics are selected for a corpus of 131,004 documents. Székely and vom Brocke (2017) test 3, 5, 10, 20, 50, 70, and 100 as values for K for their corpus of 9,514 documents and settle on K = 70 because it is the best balance between topics that are too general and topics that are too specific. Carter et al. (2016) test models where K = 10, 15, 20, 50, and 100, and report results on models where K = 10 and 100 for their corpus of 7,476 documents. Suominen and Toivanen (2016) choose K = 60 for their corpus of 144,081 documents. Mohr and Bogdanov (2013) set K = 25 for their small corpus of 8 documents. Zheng et al. (2006) choose a higher value, K = 300, for their corpus of 25,005 documents. Yau et al. (2014) try a number of different values for K ranging from 50 to 200 but settle on 50 because interpreting and using the results is more manageable with fewer topics. Finally, Hall et al. (2008) choose K = 100 for their corpus of 12,500 documents; through a process of adjusting their topic model, these researchers end up with 43 useful topics for their project.
These studies do not reveal a definitive answer to how many topics should be chosen for a topic model, but there is a consensus that trial and error is a necessary part of the process. There also seems to be consensus that the right number of topics for a model should be a middle ground between topics which are too general (choosing too few topics) and topics which are too specific (choosing too many topics). What counts as too general or as too specific depends entirely upon the objective of the topic model.
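In practice, that trial and error often amounts to a loop like the following sketch: train a model for each candidate value of K, print the top words per topic, and judge the results by eye. The toy corpus stands in for a real preprocessed corpus, and the candidate values of K are illustrative only.

```python
# A sketch of a trial-and-error search over K, assuming Gensim.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["whale", "sea", "ship"], ["ball", "courtship", "letter"],
        ["whale", "ship", "voyage"], ["estate", "letter", "marriage"]]
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

for k in (2, 3, 5):  # with a real corpus, values such as 10, 25, 50, 100
    lda = LdaModel(bow_corpus, num_topics=k, id2word=dictionary,
                   passes=10, random_state=1)
    print(f"--- K = {k} ---")
    for topic_id, top_words in lda.show_topics(num_topics=k, num_words=5):
        print(topic_id, top_words)  # inspect the top words manually
```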
Topic models not only need to be told how many topics to infer from the corpus of documents, they also need priors for the per-document topic distribution (how widely topics are spread over a document) and the per-topic word distribution (how widely words are spread over a topic). These hyper-parameters, also called priors, are the alpha and beta values respectively. As the value of alpha increases, the number of topics prominent in a document increases; as the value of beta increases, the number of words prominent in a topic increases. Both alpha and beta take positive values, and values below 1 encourage sparser distributions.
According to Wallach et al. (2009), hyper-parameter settings are closely related to the issue of selecting the number of topics because they dictate how the topics are related to the documents and the words in the corpus. These researchers demonstrate that the optimal setting for alpha is ‘asymmetric’, that is, allowing a different alpha value for each topic.
Despite the importance of hyper-parameter settings, there is little discussion of hyper-parameters in the example studies discussed above. Suominen and Toivanen (2016), for example, do not comment on hyper-parameters at all, and Carron-Arthur et al. (2016) and Székely and vom Brocke (2017) simply mention that they relied on the topic modeling software they were using (MALLET) to automatically estimate the values for alpha and beta. Carter et al. (2016) discuss the meaning of alpha and beta, but then go on to say that they set the value of alpha to auto in the software they used (Gensim). Only Wang et al. (2016) go into any detail about the values they picked for these hyper-parameters and their rationale for doing so. These researchers want to produce a topic model with sparse topic and word distributions. In other words, they do not want topics distributed too widely among documents, nor do they want words distributed too widely among topics. According to Wang et al. (2016), sparse topic and word distributions lead to topics which are more interpretable because the models are essentially simpler: no matter the value set for K, if alpha is set to a small value, an individual document will draw on only a few topics. Toward that end, these researchers set the value of alpha to 0.1 and beta to 0.01.
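In Gensim, both approaches might look like the following sketch: explicit sparse priors in the spirit of Wang et al.'s 0.1 and 0.01 values, and automatic estimation as Carter et al. describe. Note that Gensim names the beta prior eta; the toy corpus and the value of K are again stand-ins, not recommendations.

```python
# A sketch of setting LDA priors explicitly versus letting Gensim estimate them.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["whale", "sea", "ship"], ["ball", "courtship", "letter"],
        ["whale", "ship", "voyage"], ["estate", "letter", "marriage"]]
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# sparse priors, echoing Wang et al.'s alpha = 0.1 and beta (eta) = 0.01
sparse_lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      alpha=0.1, eta=0.01, random_state=1)

# letting the library estimate asymmetric priors from the data
auto_lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                    alpha="auto", eta="auto", random_state=1)
```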
All of the above discussion raises the question: what makes a good topic model? On the one hand, Mohr and Bogdanov (2013) express one aspect of a topic model’s success in a way that is worth quoting: “The objective of topic modeling is to find the parameters of the LDA process that has likely generated the corpus…it is the task of reverse-engineering the intents of the author(s) in producing the corpus” (p. 547). On the other hand, Carter et al. (2016) stress that the success of a topic model may be best measured by whether it is able to accomplish the task(s) for which it was built.
At the very least, a successful model must contain topics which are semantically coherent. There are both qualitative and quantitative measures of a topic’s coherence, but most of the literature reviewed here uses qualitative evaluations of the coherence of topics in a topic model. For qualitative evaluation, researchers can manually examine the topics and see whether the top words in a topic actually make sense together. Martin and Johnson (2015) comment that the top words in a topic must not be ambiguous or lack an interpretable theme. Chang, Boyd-Graber, Gerrish, Wang, and Blei (2009) suggest a “word intrusion test” to measure a topic’s coherence. In this test, a participant is presented with the top n words from a topic, except that one of the words has been replaced by an “intruder” word. If the participant is able to identify the intruder word, the topic can be considered coherent. Carron-Arthur et al. (2016) also stress the importance of “domain knowledge” in this area: words that appear as intruders to the lay person may not be perceived as such by those with subject expertise in the area.
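As a quantitative complement to such manual checks, a measure like Gensim's CoherenceModel can score a model's topics against the corpus. The sketch below uses the 'c_v' measure, which is only one of several options, and the toy corpus is a stand-in; on a real corpus, higher scores loosely track human judgments of coherence.

```python
# A sketch of a quantitative coherence check with Gensim's CoherenceModel.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

docs = [["whale", "sea", "ship"], ["ball", "courtship", "letter"],
        ["whale", "ship", "voyage"], ["estate", "letter", "marriage"]]
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, random_state=1)

coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v")
print(coherence.get_coherence())  # average coherence across topics
```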
A successful topic model, then, is one which is useful. If the topic model is intended as a tool for analyzing a large corpus of documents, it is successful to the degree that it does so in ways that are informative for the researcher. If a topic model is intended as a tool for grouping documents into meaningful groups, it is successful to the degree that it accomplishes that task. If a topic model is intended as a tool for information retrieval, it is successful to the degree that it returns the information the user is looking for. Whatever the use case may be, topic models must be carefully built to accomplish those tasks.
Blei, D. M., & Lafferty, J. (2009). Topic models. In A. Srivastava, M. Sahami, & V. Kumar (Eds.), Text Mining (pp. 71-93). New York: Chapman and Hall/CRC.