Compare Noun and Regular Corpus Models
There is a truism in computer science that illustrates the importance of what goes into a topic model: "garbage in, garbage out." The output of a topic model can only be as good as its input. Therefore, when building a topic model it is important to use the most informative features of the corpus. This is why it is necessary not only to tokenize documents into word tokens, but also to remove stop words and lemmatize the tokens (Székely and vom Brocke 2017). Some researchers have also tested whether building a topic model on a noun-only corpus improves the model's performance (Martin and Johnson 2015). The idea behind this approach is that nouns are often more informative of a document's content than other parts of speech such as adjectives, adverbs, or verbs. While this may be true, it is important to consider how the model is going to be used; if the model is being used for authorship attribution, for example, adjectives, adverbs, and verbs may be informative features (Savoy 2013).
Here, I analyze the properties of topic models built on different versions of the JBL corpus:
r_model: This model is based on the regular version of the corpus containing all parts of speech. The number of topics is 75 and the value of alpha is set to symmetric.

n_model: This model is based on a noun-only version of the corpus. The number of topics is 75 and the value of alpha is set to symmetric.

The details of how these versions were built can be seen in the processing section of this project; the Python library spaCy was used to process the corpus. Unfortunately, an examination of the noun-only model reveals that not all non-nouns were filtered out by spaCy, but a comparison of the unique word tokens for each version of the corpus shows that many non-nouns were removed: the regular corpus contains 7,834 unique tokens and the noun-only corpus contains 4,790 unique tokens.
from gensim import corpora, models, similarities
import pyLDAvis.gensim
import json
import spacy
# load metadata for later use
with open('../data/doc2metadata.json', encoding='utf8', mode='r') as f:
    doc2metadata = json.load(f)
# load regular corpus, dictionary, and model
r_dictionary = corpora.Dictionary.load('../general_corpus/general_corpus.dict')
r_corpus = corpora.MmCorpus('../general_corpus/general_corpus.mm')
r_model = models.ldamodel.LdaModel.load('../general_corpus/general_75.model')
# load noun-only corpus, dictionary, and model
n_dictionary = corpora.Dictionary.load('../noun_corpus/noun_corpus.dict')
n_corpus = corpora.MmCorpus('../noun_corpus/noun_corpus.mm')
n_model = models.ldamodel.LdaModel.load('../noun_corpus/noun_75.model')
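As a quick check on the vocabulary claim above, the number of unique tokens in each version of the corpus can be read directly off the two gensim dictionaries. A minimal sketch (the printed counts should match the figures reported above if the same preprocessing was used):

# compare vocabulary sizes of the two corpus versions;
# len() of a gensim Dictionary is the number of unique tokens it contains
print('Unique tokens in regular corpus:', len(r_dictionary))
print('Unique tokens in noun-only corpus:', len(n_dictionary))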
r_model
r_model_viz = pyLDAvis.gensim.prepare(r_model, r_corpus, r_dictionary)
pyLDAvis.display(r_model_viz)
r_model produced 11 topics which lack semantic or contextual coherence ("junk" topics), 7 topics of mixed coherence, and 57 topics which are coherent: roughly 15%, 9%, and 76% of the 75 topics, respectively.
A few examples of junk topics:
A few examples of mixed topics:
A few examples of coherent topics:
n_model
n_model_viz = pyLDAvis.gensim.prepare(n_model, n_corpus, n_dictionary)
pyLDAvis.display(n_model_viz)
n_model produced 13 topics which lack semantic or contextual coherence, 5 topics of mixed coherence, and 57 topics which are coherent: roughly 17%, 7%, and 76% of the 75 topics, respectively.
A few examples of junk topics:
An example of a mixed topic:
A few examples of coherent topics:
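The junk, mixed, and coherent labels above come from reading through the most probable terms of each topic. A minimal sketch of how those term lists can be pulled from either trained model for review, using the gensim LdaModel objects loaded above (the topic numbers are simply the ones discussed below):

# print the ten most probable words for a few topics of each model
for topic_id in [7, 8, 51, 57]:
    words = [word for word, prob in r_model.show_topic(topic_id, topn=10)]
    print('r_model topic', topic_id, ':', ', '.join(words))
for topic_id in [3, 11, 34, 38]:
    words = [word for word, prob in n_model.show_topic(topic_id, topn=10)]
    print('n_model topic', topic_id, ':', ', '.join(words))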
r_model and n_model each produced 57 coherent topics (76% of the total number of topics). Further, a close examination demonstrates a striking similarity in the topics produced by the two models:

r_model topic 8: essay, john, bible, studies, paul, theology, jesus, david, commentary, fortress
n_model topic 3: essay, bible, john, commentary, old, theology, james, fortress, david, paul

r_model topic 7 (narrative criticism): narrative, story, reader, literary, character, audience, speech, reading, response, narrator
n_model topic 11 (narrative criticism): story, narrative, character, reader, narrator, account, event, motif, element, pattern

r_model topic 57 (dead sea scrolls): qumran, scroll, shall, society, member, sea, dead, scrolls, community, council
n_model topic 34 (dead sea scrolls): qumran, scroll, dead, sea, community, scrolls, document, cave, sect, fragment

r_model topic 51: son, father, child, family, mother, brother, dead, burial, wife, bear
n_model topic 38 (family): son, father, child, family, mother, bother, marriage, wife, daughter, birth

There are some slight differences, however. For example, in the Dead Sea Scrolls topic, r_model includes the imperative verb "shall," which says something about the nature of Dead Sea Scrolls literature. Another difference occurs between r_model topic 51 and n_model topic 38; the former is a mixed topic but the latter is a coherent topic because the words "dead" and "burial" are not present. As a noun, "burial" might be expected in n_model, but perhaps with fewer features in the noun-only corpus the distribution of words across topics was altered.
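The pairings above were identified by eye, but the similarity between the two models' topics can also be estimated programmatically. The sketch below, which is my own illustration rather than part of the original analysis, matches an r_model topic to the n_model topic whose ten most probable words overlap most:

# for a given r_model topic, find the n_model topic with the largest
# overlap among the ten most probable words of each topic
def closest_noun_topic(r_topic_id, num_topics=75, topn=10):
    r_words = {word for word, prob in r_model.show_topic(r_topic_id, topn=topn)}
    best_topic, best_overlap = None, -1
    for n_topic_id in range(num_topics):
        n_words = {word for word, prob in n_model.show_topic(n_topic_id, topn=topn)}
        overlap = len(r_words & n_words)
        if overlap > best_overlap:
            best_topic, best_overlap = n_topic_id, overlap
    return best_topic, best_overlap

# for example, r_model topic 57 (dead sea scrolls) should map to the n_model
# topic that shares most of its vocabulary
print(closest_noun_topic(57))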
# define cluster test: classify each document by how many topics it is
# assigned with a probability of at least 0.20
def cluster_test(corpus, model):
    docs_with_1_topic = 0
    docs_with_multiple_topics = 0
    docs_with_no_topics = 0
    total_docs = 0
    for doc in corpus:
        topics = model.get_document_topics(doc, minimum_probability=0.20)
        total_docs += 1
        if len(topics) == 1:
            docs_with_1_topic += 1
        elif len(topics) > 1:
            docs_with_multiple_topics += 1
        else:
            docs_with_no_topics += 1
    print('Corpus assigned to a single topic:', (docs_with_1_topic / total_docs) * 100, '%')
    print('Corpus assigned to multiple topics:', (docs_with_multiple_topics / total_docs) * 100, '%')
    print('Corpus assigned to no topics:', (docs_with_no_topics / total_docs) * 100, '%')
r_model
cluster_test(r_corpus, r_model)
n_model
cluster_test(n_corpus, n_model)
r_model assigned 58.2% of the documents in the corpus to a single topic and 26.6% to multiple topics, leaving 15.1% of the documents unassigned. n_model assigned 55.2% of the documents to a single topic and 30.1% to multiple topics, leaving 14.5% unassigned. There is not a substantial difference between these models in the proportion of the corpus left unassigned.
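The single-topic versus multiple-topic distinction above depends on the minimum_probability threshold of 0.20 used in cluster_test. A quick way to see what that threshold does for an individual document (the first document of the regular corpus is an arbitrary example):

# inspect one document's topic distribution at two thresholds
doc = next(iter(r_corpus))  # first document in the corpus
print(r_model.get_document_topics(doc, minimum_probability=0.20))  # topics above the 0.20 cutoff
print(r_model.get_document_topics(doc, minimum_probability=0.05))  # a fuller picture of the same document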
# build indices for similarity queries
r_index = similarities.MatrixSimilarity(r_model[r_corpus])
n_index = similarities.MatrixSimilarity(n_model[n_corpus])
# define retrieval test
def retrieval_test(dictionary, new_doc, lda, index):
    new_bow = dictionary.doc2bow(new_doc)  # change new document to a bag-of-words representation
    new_vec = lda[new_bow]  # change new bag of words to a topic vector
    index.num_best = 10  # set index to return the 10 best results
    matches = index[new_vec]
    for match in matches:
        score = str(match[1])
        key = 'doc_' + str(match[0])
        article_dict = doc2metadata[key]
        author = article_dict['author']
        title = article_dict['title']
        year = article_dict['pub_year']
        print(key + ': ' + author.title() + ' (' + year + '). ' + title.title() +
              '\n\tsimilarity score -> ' + score + '\n')
# set up nlp for new docs
nlp = spacy.load('en')
stop_words = spacy.en.STOPWORDS
# define regular lemmatizer
def get_lemmas(text):
    doc = nlp(text)
    tokens = [token for token in doc]
    lemmas = [token.lemma_ for token in tokens if token.is_alpha]
    lemmas = [lemma for lemma in lemmas if lemma not in stop_words]
    return lemmas
# define noun-only lemmatizer
def get_noun_lemmas(text):
    doc = nlp(text)
    tokens = [token for token in doc]
    noun_tokens = [token for token in tokens if token.tag_ in ('NN', 'NNP', 'NNS')]
    noun_lemmas = [noun_token.lemma_ for noun_token in noun_tokens if noun_token.is_alpha]
    noun_lemmas = [noun_lemma for noun_lemma in noun_lemmas if noun_lemma not in stop_words]
    return noun_lemmas
# load and process Greene, N. E. (2017)
with open('../abstracts/greene.txt', encoding='utf8', mode='r') as f:
    text = f.read()
r_greene = get_lemmas(text)
n_greene = get_noun_lemmas(text)
# load and process Hollenback, G. M. (2017)
with open('../abstracts/hollenback.txt', encoding='utf8', mode='r') as f:
    text = f.read()
r_hollenback = get_lemmas(text)
n_hollenback = get_noun_lemmas(text)
# load and process Dinkler, M. B. (2017)
with open('../abstracts/dinkler.txt', encoding='utf8', mode='r') as f:
    text = f.read()
r_dinkler = get_lemmas(text)
n_dinkler = get_noun_lemmas(text)
r_model
retrieval_test(r_dictionary, r_greene, r_model, r_index)
n_model
retrieval_test(n_dictionary, n_greene, n_model, n_index)
Each model returned documents about the Psalms, which is appropriate given the query article, and the two models produced similar similarity scores. However, each model returned a different set of results, without any documents in common.
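The observation that the two result sets share no documents can also be checked directly from the indexes rather than by comparing the printed lists. A sketch using the objects defined above (the helper top_doc_ids is my own addition, not part of the original code):

# return the document ids of the ten best matches for a query document
def top_doc_ids(dictionary, new_doc, lda, index, n=10):
    index.num_best = n
    matches = index[lda[dictionary.doc2bow(new_doc)]]
    return {match[0] for match in matches}

r_ids = top_doc_ids(r_dictionary, r_greene, r_model, r_index)
n_ids = top_doc_ids(n_dictionary, n_greene, n_model, n_index)
print('Documents returned by both models:', r_ids & n_ids)  # expected to be empty for this query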
r_model
retrieval_test(r_dictionary, r_hollenback, r_model, r_index)
n_model
retrieval_test(n_dictionary, n_hollenback, n_model, n_index)
Each model returned documents which focus on gender, especially on women's roles in various biblical contexts, and the two models produced similar similarity scores. Two documents from the corpus were returned by both models: doc_8757 is in fact a document to which the query article is a response, but neither model places this document very high in its ranking (number 6 in r_model with a similarity score of 72.9% and number 10 in n_model with a similarity score of 68.2%). The other document, doc_1974, is a review of a book about women in biblical narratives; it was ranked 7th by r_model (with a similarity score of 71.2%) and 8th by n_model (with a similarity score of 70.2%).
r_model
retrieval_test(r_dictionary, r_dinkler, r_model, r_index)
n_model
retrieval_test(n_dictionary, n_dinkler, n_model, n_index)
Each model returned documents related to gospel studies, which is appropriate given the query article. r_model had higher similarity scores than did n_model. Each model returned a unique list of results, with no documents in common.
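The claim that r_model produced higher similarity scores than n_model for this query can be quantified by averaging the scores of the top matches. A sketch using the same objects as above (the helper mean_top_score is my own addition):

# average similarity score of the ten best matches for a query document
def mean_top_score(dictionary, new_doc, lda, index, n=10):
    index.num_best = n
    matches = index[lda[dictionary.doc2bow(new_doc)]]
    return sum(score for doc_id, score in matches) / len(matches)

print('r_model mean similarity for the Dinkler query:', mean_top_score(r_dictionary, r_dinkler, r_model, r_index))
print('n_model mean similarity for the Dinkler query:', mean_top_score(n_dictionary, n_dinkler, n_model, n_index))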