A number of researchers have suggested that one of the limitations of LDA is that it cannot identify how many topics are in a corpus, leaving this decision to the human user (Yau et al., 2014 and Suominen and Toivanen, 2016). Indeed, there is no way to identify the "correct" number of topics in advance of building the topic model (Carter et al., 2016). If the user specifies too few topics for the model, then the topics will be too general to be useful for exploratory analysis or information retrieval. By contrast, if the user specifies too many topics for the model, the topics will be too specific, or redundant, to be of use; too many topics also make the interpretation of the model unwieldy. Therefore, most users experiment with the number of topics and make qualitative evaluations about which number of topics is most useful (Chang et al., 2016). Ultimately, the right choice about the number of topics depends on the way in which the model is going to be used (Carter et al., 2016). As such, the ratio of documents (n) in a corpus to topics (k) to be extracted from the corpus ranges widely. Just to provide a few examples:
Here, I analyze the properties of three topic models, each of which has a different number of topics:
model_25_topics: This model has 25 topics, is based on a noun-only corpus, and has the alpha value set to symmetric.
model_75_topics: This model has 75 topics, is based on a noun-only corpus, and has the alpha value set to symmetric.
model_150_topics: This model has 150 topics, is based on a noun-only corpus, and has the alpha value set to symmetric.
from gensim import corpora, models, similarities
import pyLDAvis.gensim
import json
import spacy
path = '../noun_corpus/'
# load metadata for later use
with open('../data/doc2metadata.json', encoding='utf8', mode='r') as f:
    doc2metadata = json.load(f)
# load dictionary and corpus for the noun models
dictionary = corpora.Dictionary.load(path + 'noun_corpus.dict')
corpus = corpora.MmCorpus(path + 'noun_corpus.mm')
# load model_25_topics
model_25_topics = models.ldamodel.LdaModel.load(path + 'noun_25.model')
# load model_75_topics
model_75_topics = models.ldamodel.LdaModel.load(path + 'noun_75.model')
# load model_150_topics
model_150_topics = models.ldamodel.LdaModel.load(path + 'noun_150.model')
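These models were trained in advance (the full training code is documented in Appendix 2). As a rough sketch of how three models differing only in the number of topics might be produced with gensim, assuming the dictionary and corpus loaded above; the passes and random_state values here are illustrative assumptions, not the original settings:
# illustrative sketch only: training LDA models that differ only in num_topics
# alpha='symmetric' follows the model descriptions above; passes and random_state are assumed values
for k in [25, 75, 150]:
    lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                                   alpha='symmetric', passes=10, random_state=42)
    lda.save(path + 'noun_{}.model'.format(k))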
model_25_topics
model_25_viz = pyLDAvis.gensim.prepare(model_25_topics, corpus, dictionary)
pyLDAvis.display(model_25_viz)
model_25_topics produced 4 topics which lack semantic or contextual coherence, 2 topics of mixed coherence, and 19 topics which are coherent. Therefore, its topics are:
To illustrate what is meant by each category, consider the following examples:
Examples of junk topics:
Example of mixed topic:
Examples of coherent topics:
model_75_topics
model_75_viz = pyLDAvis.gensim.prepare(model_75_topics, corpus, dictionary)
pyLDAvis.display(model_75_viz)
model_75_topics produced 13 topics which lack semantic or contextual coherence, 5 topics of mixed coherence, and 57 topics which are coherent. Therefore, its topics are:
A number of the topics from model_25_topics reappear in model_75_topics. However, some topics, such as topic 9 from model_25_topics, appear to be given more nuance in model_75_topics, for example:
model_75_topics also introduces many new coherent topics not found in model_25_topics, for example:
model_150_topics
model_150_viz = pyLDAvis.gensim.prepare(model_150_topics, corpus, dictionary)
pyLDAvis.display(model_150_viz)
model_150_topics produced 33 topics which lack semantic or contextual coherence, 8 topics of mixed coherence, and 109 topics which are coherent. Therefore, its topics are:
The coherent topics found in the previous models are present in model_150_topics, but a large number of other coherent topics are added, for example:
model_25_topics contained 76% coherent topics, model_75_topics contained 76% coherent topics, and model_150_topics contained 73% coherent topics. So, relative to the number of topics in each model, the performance was similar. However, in raw numbers, model_150_topics contains far more coherent topics than either of the other two models. This suggests that model_150_topics provides a more nuanced model of the corpus. Topics which did not register in the other models, such as topic 78 (holiness code) and topic 112 (patristics), are revealed in model_150_topics. The utility of having nuanced topics needs to be weighed against the difficulty of keeping track of so many topics while doing an exploratory analysis of a corpus; nuance comes at the cost of efficiency.
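The percentages above follow directly from the hand-labeled counts reported for each model; a small sketch of the arithmetic, with the counts copied from the descriptions above:
# coherent-topic share computed from the hand-labeled counts reported above
# each tuple is (junk, mixed, coherent)
hand_counts = [('model_25_topics', (4, 2, 19)),
               ('model_75_topics', (13, 5, 57)),
               ('model_150_topics', (33, 8, 109))]
for name, (junk, mixed, coherent) in hand_counts:
    total = junk + mixed + coherent
    print(name, 'coherent topics: {:.0%}'.format(coherent / total))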
# count how many documents a model assigns to one, several, or no topics
# (a topic counts as assigned if its probability for the document is at least 0.20)
def cluster_test(corpus, model):
    docs_with_1_topic = 0
    docs_with_multiple_topics = 0
    docs_with_no_topics = 0
    total_docs = 0
    for doc in corpus:
        topics = model.get_document_topics(doc, minimum_probability=0.20)
        total_docs += 1
        if len(topics) == 1:
            docs_with_1_topic += 1
        elif len(topics) > 1:
            docs_with_multiple_topics += 1
        else:
            docs_with_no_topics += 1
    print('Corpus assigned to a single topic:', (docs_with_1_topic / total_docs) * 100, '%')
    print('Corpus assigned to multiple topics:', (docs_with_multiple_topics / total_docs) * 100, '%')
    print('Corpus assigned to no topics:', (docs_with_no_topics / total_docs) * 100, '%')
model_25_topics
cluster_test(corpus, model_25_topics)
model_75_topics
cluster_test(corpus, model_75_topics)
model_150_topics
cluster_test(corpus, model_150_topics)
model_25_topics outperforms the other two models in that it left only 1.47% of documents unassigned to a topic. By contrast, model_75_topics left 14.59% of documents unassigned and model_150_topics left 28.14% of documents unassigned. Additionally, although model_25_topics assigned fewer documents to a single topic than the other two models, it assigned far more documents to multiple topics, thus providing a more robust clustering system in which a document may belong to more than one topic.
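To see what these percentages mean for an individual document, one can inspect a single document's topic assignments directly. A minimal sketch follows; the choice of the first document is arbitrary, and the 0.20 threshold simply mirrors cluster_test:
# inspect the per-document topic assignments behind the percentages above
doc = next(iter(corpus))  # first document in the corpus, chosen arbitrarily
for name, model in [('model_25_topics', model_25_topics),
                    ('model_75_topics', model_75_topics),
                    ('model_150_topics', model_150_topics)]:
    print(name, '->', model.get_document_topics(doc, minimum_probability=0.20))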
# build indices for similarity queries
index_25 = similarities.MatrixSimilarity(model_25_topics[corpus])
index_75 = similarities.MatrixSimilarity(model_75_topics[corpus])
index_150 = similarities.MatrixSimilarity(model_150_topics[corpus])
# define retrieval test
def retrieval_test(new_doc, lda, index):
    new_bow = dictionary.doc2bow(new_doc)  # change new document to bag-of-words representation
    new_vec = lda[new_bow]  # change new bag of words to a topic vector
    index.num_best = 10  # set index to generate 10 best results
    matches = index[new_vec]
    for match in matches:
        score = str(match[1])
        key = 'doc_' + str(match[0])
        article_dict = doc2metadata[key]
        author = article_dict['author']
        title = article_dict['title']
        year = article_dict['pub_year']
        print(key + ': ' + author.title() + ' (' + year + '). ' + title.title() +
              '\n\tsimilarity score -> ' + score + '\n')
# set up nlp for new docs
nlp = spacy.load('en')
stop_words = spacy.en.STOPWORDS
# extract singular, proper, and plural noun lemmas; keep alphabetic tokens and drop stop words
def get_noun_lemmas(text):
    doc = nlp(text)
    tokens = [token for token in doc]
    noun_tokens = [token for token in tokens if token.tag_ == 'NN' or token.tag_ == 'NNP' or token.tag_ == 'NNS']
    noun_lemmas = [noun_token.lemma_ for noun_token in noun_tokens if noun_token.is_alpha]
    noun_lemmas = [noun_lemma for noun_lemma in noun_lemmas if noun_lemma not in stop_words]
    return noun_lemmas
# load and process Greene, N. E. (2017)
with open('../abstracts/greene.txt', encoding='utf8', mode='r') as f:
    text = f.read()
greene = get_noun_lemmas(text)
# load and process Hollenback, G. M. (2017)
with open('../abstracts/hollenback.txt', encoding='utf8', mode='r') as f:
    text = f.read()
hollenback = get_noun_lemmas(text)
# load and process Dinkler, M. B. (2017)
with open('../abstracts/dinkler.txt', encoding='utf8', mode='r') as f:
    text = f.read()
dinkler = get_noun_lemmas(text)
model_25_topics
retrieval_test(greene, model_25_topics, index_25)
model_75_topics
retrieval_test(greene, model_75_topics, index_75)
model_150_topics
retrieval_test(greene, model_150_topics, index_150)
Two documents from the corpus were matched with the Greene article in all three models:
doc_2855 shows up as the 3rd highest match in model_25_topics (similarity score of 92.9%) and model_150_topics (similarity score of 78.2%), but as the 7th highest match in model_75_topics (similarity score of 87.1%). doc_8205 shows up as the 7th highest match in model_25_topics (similarity score of 90.4%) and model_75_topics (similarity score of 87.1%), but as the 10th highest match in model_150_topics (similarity score of 72.2%).
model_25_topics
retrieval_test(hollenback, model_25_topics, index_25)
model_75_topics
retrieval_test(hollenback, model_75_topics, index_75)
model_150_topics
retrieval_test(hollenback, model_150_topics, index_150)
model_25_topics returned results having to do with biblical law and rabbinic interpretation. model_75_topics returned results that focus primarily on issues of gender and sexuality. Finally, model_150_topics returned results focusing on translation issues. Clearly, each model understands this article differently. All three themes (law, gender/sexuality, and translation issues) are present in the article, so in a sense each model is useful. However, and interestingly, none of these models returned the article to which the present article is a response: Walsh, J. T. (2001). Leviticus 18:22 and 20:13: Who Is Doing What to Whom? Journal of Biblical Literature, 120, 201-9.
model_25_topics
retrieval_test(dinkler, model_25_topics, index_25)
model_75_topics
retrieval_test(dinkler, model_75_topics, index_75)
model_150_topics
retrieval_test(dinkler, model_150_topics, index_150)
Each topic model retrieved documents dealing with the gospels, which on a general level is fitting for this article. There is one document from the corpus which was retrieved by all three models:
model_25_topics ranked this as the 8th highest match (similarity score of 97.4%), whereas both model_75_topics and model_150_topics ranked it as the 1st highest match (similarity scores of 86.7% and 80.0% respectively). It may seem strange that these two models ranked this document as the highest match, insofar as it is about the Gospel of John while the query article is about the Gospel of Luke, but the nuance provided by these models picks up the shared theme of literary characterization.