In a topic model, the value of the hyper-parameter alpha dictates how the topics are distributed across the documents. A higher value for alpha means that a topic will be distributed more widely across the documents, whereas a lower value for alpha means that a topic will be distributed more narrowly. Wallach et al. (2009) argue that attention to the setting of alpha is important in constructing a robust topic model. Yet many studies that utilize topic models for various purposes simply set the alpha hyper-parameter to its default value (e.g., Carron-Arthur et al. (2016) and Székely and vom Brocke (2017)). In the case of gensim, the default value for alpha is 'symmetric', which means that the value for alpha is uniform for each topic. gensim calculates this symmetric value by dividing 1.0 by the number of topics in the model, so a model with 75 topics has an alpha of 0.013.
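The models compared below were trained ahead of time and are loaded from disk, but the following sketch shows how each of the three alpha settings can be passed to gensim's LdaModel constructor. It assumes corpus and dictionary objects like the ones loaded in the code below; it is an illustration of the alpha parameter, not the original training code.

# Illustrative sketch (not the original training code): passing the three
# alpha settings to gensim. Assumes `corpus` and `dictionary` are loaded
# as in the code below.
from gensim import models

num_topics = 75

# 'symmetric' (the default): uniform alpha of 1.0 / num_topics (~0.013 for 75 topics)
lda_symmetric = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                         num_topics=num_topics, alpha='symmetric')

# 'auto': gensim learns a separate alpha for each topic during training
lda_auto = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                    num_topics=num_topics, alpha='auto')

# fixed: the same (deliberately high) value of 0.5 for every topic
lda_05 = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                  num_topics=num_topics, alpha=[0.5] * num_topics)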
Here I analyze the properties of topic models with three different alpha values:
model_alpha_symmetric: This model takes gensim's default setting for alpha, which here results in a value of 0.013 for each topic. The number of topics is 75 and the model is based on a noun-only version of the corpus.

model_alpha_auto: This model has the alpha value set to 'auto', which means that gensim estimates a value for each topic, resulting in asymmetric values for alpha. See below for the values of alpha in this model. The number of topics is 75 and the model is based on a noun-only version of the corpus.

model_alpha_05: This model has an alpha value of 0.5 for each topic (symmetric). Setting alpha to 0.5 for each topic stretches the model beyond what is perhaps reasonable, but the intent is to show the effects such a setting has on the model. The number of topics is 75 and the model is based on a noun-only version of the corpus.

from gensim import corpora, models, similarities
import pyLDAvis.gensim
import spacy
import json
path = '../noun_corpus/'
# load metadata for later use
with open('../data/doc2metadata.json', encoding='utf8', mode='r') as f:
    doc2metadata = json.load(f)
# load dictionary and corpus for the noun models
dictionary = corpora.Dictionary.load(path + 'noun_corpus.dict')
corpus = corpora.MmCorpus(path + 'noun_corpus.mm')
# load alpha = symmetric model
model_alpha_symmetric = models.ldamodel.LdaModel.load(path + 'noun_75.model')
# load alpha = auto model
model_alpha_auto = models.ldamodel.LdaModel.load(path + 'alphas/noun_auto.model')
# load alpha = 0.5 model
model_alpha_05 = models.ldamodel.LdaModel.load(path + 'alphas/noun_05.model')
print('Alpha values for symmetric model:\n ', model_alpha_symmetric.alpha)
print('Alpha values for auto model:\n ', model_alpha_auto.alpha)
print('Alpha values for 0.5 model:\n ', model_alpha_05.alpha)
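Because the 'auto' model learns a separate alpha for each topic, a short numeric summary of those values makes the contrast with the flat 0.013 of the symmetric model easier to see. A minimal sketch, assuming numpy is available (it is installed alongside gensim); this summary step is an addition for illustration, not part of the original notebook.

import numpy as np

# summarize the per-topic alpha values learned by the 'auto' model
auto_alphas = np.asarray(model_alpha_auto.alpha)
print('auto alpha: min = {:.4f}, mean = {:.4f}, max = {:.4f}'.format(
    auto_alphas.min(), auto_alphas.mean(), auto_alphas.max()))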
model_alpha_symmetric
model_alpha_symmetric_viz = pyLDAvis.gensim.prepare(model_alpha_symmetric, corpus, dictionary)
pyLDAvis.display(model_alpha_symmetric_viz)
model_alpha_symmetric produced 13 topics which lack semantic or contextual coherence, 5 topics of mixed coherence, and 57 topics which are coherent. Its topics, therefore, are roughly 17% junk, 7% mixed, and 76% coherent.
A few examples of junk topics:
An example of mixed topics:
A few examples of coherent topics:
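The topic word lists behind these coherence judgments can be reproduced directly from any of the loaded models. A minimal sketch (the topic number 0 is only a placeholder; the junk, mixed, and coherent topics are the ones discussed in this section):

# print the ten most probable words for a given topic
# (topic 0 here is only illustrative)
for word, prob in model_alpha_symmetric.show_topic(0, topn=10):
    print('{:<15} {:.4f}'.format(word, prob))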
model_alpha_auto
model_alpha_auto_viz = pyLDAvis.gensim.prepare(model_alpha_auto, corpus, dictionary)
pyLDAvis.display(model_alpha_auto_viz)
model_alpha_auto produced 13 topics which lack semantic or contextual coherence, 5 topics of mixed coherence, and 57 topics which are coherent. Its topics, therefore, are roughly 17% junk, 7% mixed, and 76% coherent.
A few examples of junk topics:
An example of mixed topics:
A few examples of coherent topics:
model_alpha_05
model_alpha_05_viz = pyLDAvis.gensim.prepare(model_alpha_05, corpus, dictionary)
pyLDAvis.display(model_alpha_05_viz)
model_alpha_05 produced 12 topics which lack semantic or contextual coherence, 5 topics of mixed coherence, and 58 topics which are coherent. Its topics, therefore, are roughly 16% junk, 7% mixed, and 77% coherent.
A few examples of junk topics:
An example of mixed topics:
A few examples of coherent topics:
Each of these models produced a similar number of coherent topics: model_alpha_symmetric has 57 coherent topics, model_alpha_auto also has 57 coherent topics, and model_alpha_05 has 58. A close examination of the topics in these models reveals that they are very similar to one another (especially model_alpha_symmetric and model_alpha_auto) in terms of the words that make up each topic, although they differ in the order of prominence of words within a topic and in the prominence of topics across the corpus (hence the topics are numbered differently in the visualizations above). Interestingly, model_alpha_05 identified an important topic in New Testament scholarship: topic 24 (justification).
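One way to make the similarity claim concrete is to compare the top words of each topic across models. The following is a sketch added for illustration (not part of the original analysis) that pairs each topic of model_alpha_symmetric with its closest topic in model_alpha_auto by Jaccard similarity of the top ten words.

# for each topic in the symmetric model, find the auto-model topic whose
# top-10 words overlap most (Jaccard similarity of the word sets)
def top_words(model, topic_id, topn=10):
    return {word for word, _ in model.show_topic(topic_id, topn=topn)}

num_topics = 75
for i in range(num_topics):
    words_i = top_words(model_alpha_symmetric, i)
    best_j, best_sim = 0, -1.0
    for j in range(num_topics):
        words_j = top_words(model_alpha_auto, j)
        sim = len(words_i & words_j) / len(words_i | words_j)
        if sim > best_sim:
            best_j, best_sim = j, sim
    print('symmetric topic {:2d} ~ auto topic {:2d} (Jaccard = {:.2f})'.format(i, best_j, best_sim))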
# classify each document by how many topics it is strongly assigned to
def cluster_test(corpus, model):
    docs_with_1_topic = 0
    docs_with_multiple_topics = 0
    docs_with_no_topics = 0
    total_docs = 0
    for doc in corpus:
        # only count topics with a probability of at least 0.20 for this document
        topics = model.get_document_topics(doc, minimum_probability=0.20)
        total_docs += 1
        if len(topics) == 1:
            docs_with_1_topic += 1
        elif len(topics) > 1:
            docs_with_multiple_topics += 1
        else:
            docs_with_no_topics += 1
    print('Corpus assigned to a single topic:', (docs_with_1_topic / total_docs) * 100, '%')
    print('Corpus assigned to multiple topics:', (docs_with_multiple_topics / total_docs) * 100, '%')
    print('Corpus assigned to no topics:', (docs_with_no_topics / total_docs) * 100, '%')
model_alpha_symmetric
cluster_test(corpus, model_alpha_symmetric)
model_alpha_auto
cluster_test(corpus, model_alpha_auto)
model_alpha_05
cluster_test(corpus, model_alpha_05)
The results of the cluster test for model_alpha_symmetric and model_alpha_auto are fairly close to one another: model_alpha_symmetric left 14.5% of the documents in the corpus unassigned to a topic, and model_alpha_auto left 16.3% unassigned. model_alpha_05 did not perform nearly as well, leaving 48.8% of the documents in the corpus unassigned to a topic.
# build indices for similarity queries
index_symmetric = similarities.MatrixSimilarity(model_alpha_symmetric[corpus])
index_auto = similarities.MatrixSimilarity(model_alpha_auto[corpus])
index_05 = similarities.MatrixSimilarity(model_alpha_05[corpus])
# define retrieval test
def retrieval_test(new_doc, lda, index):
    new_bow = dictionary.doc2bow(new_doc)  # change new document to bag-of-words representation
    new_vec = lda[new_bow]  # change new bag of words to a topic vector
    index.num_best = 10  # set index to generate 10 best results
    matches = index[new_vec]
    scores = []
    for match in matches:
        score = match[1]
        scores.append(score)
        score = str(score)
        key = 'doc_' + str(match[0])
        article_dict = doc2metadata[key]
        author = article_dict['author']
        title = article_dict['title']
        year = article_dict['pub_year']
        print(key + ': ' + author.title() + ' (' + year + '). ' + title.title() +
              '\n\tsimilarity score -> ' + score + '\n')
# set up nlp for new docs
nlp = spacy.load('en')
stop_words = spacy.en.STOPWORDS
def get_noun_lemmas(text):
    doc = nlp(text)
    tokens = [token for token in doc]
    # keep only singular, plural, and proper nouns
    noun_tokens = [token for token in tokens if token.tag_ == 'NN' or token.tag_ == 'NNP' or token.tag_ == 'NNS']
    # lemmatize, then drop non-alphabetic tokens and stop words
    noun_lemmas = [noun_token.lemma_ for noun_token in noun_tokens if noun_token.is_alpha]
    noun_lemmas = [noun_lemma for noun_lemma in noun_lemmas if noun_lemma not in stop_words]
    return noun_lemmas
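This notebook uses the spaCy 1.x API (the 'en' shortcut and spacy.en.STOPWORDS). For readers on a current spaCy release, roughly equivalent preprocessing would look like the following sketch, assuming the en_core_web_sm model has been downloaded; it is an equivalent, not the code used here.

# equivalent noun-lemma extraction for spaCy 2.x/3.x (sketch, not the original code)
# requires: python -m spacy download en_core_web_sm
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_sm')

def get_noun_lemmas(text):
    doc = nlp(text)
    noun_tokens = [token for token in doc if token.tag_ in ('NN', 'NNP', 'NNS')]
    noun_lemmas = [token.lemma_ for token in noun_tokens if token.is_alpha]
    return [lemma for lemma in noun_lemmas if lemma not in STOP_WORDS]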
# load and process Greene, N. E. (2017)
with open('../abstracts/greene.txt', encoding='utf8', mode='r') as f:
    text = f.read()
greene = get_noun_lemmas(text)
# load and process Hollenback, G. M. (2017)
with open('../abstracts/hollenback.txt', encoding='utf8', mode='r') as f:
    text = f.read()
hollenback = get_noun_lemmas(text)
# load and process Dinkler, M. B. (2017)
with open('../abstracts/dinkler.txt', encoding='utf8', mode='r') as f:
    text = f.read()
dinkler = get_noun_lemmas(text)
model_alpha_symmetric
retrieval_test(greene, model_alpha_symmetric, index_symmetric)
model_alpha_auto
retrieval_test(greene, model_alpha_auto, index_auto)
model_alpha_05
retrieval_test(greene, model_alpha_05, index_05)
These models achieved similar average similarity scores in the first information retrieval task, and each model returned documents about psalms in its results. Six documents from the corpus matched the Greene article in all three models (although the models did not rank these matches consistently):
model_alpha_symmetric
retrieval_test(hollenback, model_alpha_symmetric, index_symmetric)
model_alpha_auto
retrieval_test(hollenback, model_alpha_auto, index_auto)
model_alpha_05
retrieval_test(hollenback, model_alpha_05, index_05)
Each model returned documents dealing with gender and sexuality, which is appropriate given the nature of the query article. Four documents from the corpus matched the Hollenback article in all three models:
Interestingly, all three models ranked doc_463 as the second most likely match. It is also worth noting that doc_8757: Walsh, Jerome T. (2001). Leviticus 18:22 And 20:13: Who Is Doing What To Whom? was returned as a match for model_alpha_symmetric and for model_alpha_05. This is the article to which the query article responds.
model_alpha_symmetric
retrieval_test(dinkler, model_alpha_symmetric, index_symmetric)
model_alpha_auto
retrieval_test(dinkler, model_alpha_auto, index_auto)
model_alpha_05
retrieval_test(dinkler, model_alpha_05, index_05)
Each topic model retrieved documents dealing with the gospels, which at a general level is appropriate for the query article. The similarity scores for each model are close to one another in this retrieval task. Three documents from the corpus were returned by all three models:
doc_8158 was ranked as the top match by both model_alpha_symmetric and model_alpha_auto.
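The overlap claims above (documents returned by all three models for a given query) can be checked by intersecting the result sets. A small sketch using a hypothetical helper, retrieval_keys, which mirrors retrieval_test but returns the matched document keys instead of printing them:

# hypothetical helper: same lookup as retrieval_test, but return the doc keys
def retrieval_keys(new_doc, lda, index):
    new_vec = lda[dictionary.doc2bow(new_doc)]
    index.num_best = 10
    return {'doc_' + str(match[0]) for match in index[new_vec]}

# documents returned for the Dinkler query by all three models
shared = (retrieval_keys(dinkler, model_alpha_symmetric, index_symmetric)
          & retrieval_keys(dinkler, model_alpha_auto, index_auto)
          & retrieval_keys(dinkler, model_alpha_05, index_05))
print(sorted(shared))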