Topic Modeling Best Practices

View the Project on GitHub msaxton/topic-model-best-practices

Analysis: Comparing a Regular Corpus with a Noun-only Corpus

A truism in computer science captures the importance of what goes into a topic model: "garbage in, garbage out." The output of a topic model can only be as good as its input, so when building a topic model it is important to use the most informative features of the corpus. This is why it is necessary not only to tokenize documents into word tokens, but also to remove stop words and lemmatize those tokens (Székely and vom Brocke (2017)). Some researchers have also tested whether building a topic model on a noun-only corpus improves the model's performance (Martin and Johnson (2015)). The idea behind this approach is that nouns are often more informative of a document's content than other parts of speech such as adjectives, adverbs, or verbs. While this may be true, it is important to consider how the model will be used; if the model is intended for authorship attribution, for example, adjectives, adverbs, and verbs may be informative features (Savoy (2013)).

Here, I analyze the properties of topic models built on different versions of the JBL corpus:

  • r_model: This model is based on the regular version of the corpus containing all parts of speech. The number of topics is 75 and the value of alpha is set to symmetric.
  • n_model: This model is based on a noun-only version of the corpus. The number of topics is 75 and the value of alpha is set to symmetric.

The details of how these versions were built can be seen in the processing section of this project; the Python library spaCy was used to process the corpus. Unfortunately, an examination of the noun-only model reveals that spaCy did not filter out every non-noun. A comparison of the unique word tokens in each version of the corpus nonetheless shows that many non-nouns were removed: the regular corpus contains 7,834 unique tokens and the noun-only corpus contains 4,790.
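To put the two vocabulary sizes on a common scale, the following minimal sketch computes the relative reduction. The helper name `vocabulary_reduction` is hypothetical; only the two token counts come from the text above.

```python
def vocabulary_reduction(regular_size, filtered_size):
    """Return the percentage of unique tokens removed by filtering."""
    return (regular_size - filtered_size) / regular_size * 100

# unique token counts reported above for the two versions of the corpus
regular_vocab = 7834
noun_vocab = 4790

reduction = vocabulary_reduction(regular_vocab, noun_vocab)
print(f'{reduction:.1f}% of unique tokens were removed')
```

Under these counts, the noun-only filter removed a little over a third of the vocabulary.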

Set Up: Import Packages and Load Topic Models

In [1]:
from gensim import corpora, models, similarities
import pyLDAvis.gensim
import json
import spacy

# load metadata for later use
with open('../data/doc2metadata.json', encoding='utf8', mode='r') as f:
    doc2metadata = json.load(f)
    
# load regular corpus, dictionary, and model
r_dictionary = corpora.Dictionary.load('../general_corpus/general_corpus.dict')
r_corpus = corpora.MmCorpus('../general_corpus/general_corpus.mm')
r_model = models.ldamodel.LdaModel.load('../general_corpus/general_75.model')

# load noun-only corpus, dictionary, and model
n_dictionary = corpora.Dictionary.load('../noun_corpus/noun_corpus.dict')
n_corpus = corpora.MmCorpus('../noun_corpus/noun_corpus.mm')
n_model = models.ldamodel.LdaModel.load('../noun_corpus/noun_75.model')

Topic Coherence

Topic Coherence: r_model

In [3]:
r_model_viz = pyLDAvis.gensim.prepare(r_model, r_corpus, r_dictionary)
pyLDAvis.display(r_model_viz)
Out[3]:

r_model produced 11 topics that lack semantic or contextual coherence, 7 topics of mixed coherence, and 57 topics that are coherent. Therefore its topics are:

  • 14.7% junk topics
  • 9.3% mixed topics
  • 76% coherent topics

A few examples of junk topics:

  • topic 8: essay, john, bible, studies, paul, theology, jesus, david, commentary, fortress
  • topic 54: stone, head, symbol, hand, scene, garment, art, figure, serpent, right
  • topic 73: e, clement, clause, clem, antecedent, arnold, toy, ropes, lyon, miller

A few examples of mixed topics:

  • topic 31: city, jerusalem, wall, gate, land, north, south, valley, east, jordan (This topic could either be thought of as a mix of "city" and "geography" or perhaps as a coherent topic labeled "locations")
  • topic 51: son, father, child, family, mother, brother, dead, burial, wife, bear (This topic deals mostly with "family" but the words "dead" and "burial" suggest a mixed topic.)

A few examples of coherent topics:

  • topic 7 (narrative criticism): narrative, story, reader, literary, character, audience, speech, reading, response, narrator
  • topic 25 (archeology): israel, period, archaeological, israeilite, palestine, archaeology, site, age, excavation, iron
  • topic 57 (dead sea scrolls): qumran, scroll, shall, society, member, sea, dead, scrolls, community, council

Topic Coherence: n_model

In [4]:
n_model_viz = pyLDAvis.gensim.prepare(n_model, n_corpus, n_dictionary)
pyLDAvis.display(n_model_viz)
Out[4]:

n_model produced 13 topics that lack semantic or contextual coherence, 5 topics of mixed coherence, and 57 topics that are coherent. Therefore its topics are:

  • 17.3% junk topics
  • 6.7% mixed topics
  • 76% coherent topics
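The percentages in such a breakdown follow directly from the topic counts. As a small sketch (the function name `coherence_breakdown` is hypothetical, and the counts are the manually assigned labels reported above for n_model), the shares can be computed as:

```python
def coherence_breakdown(junk, mixed, coherent):
    """Return each category's share of all topics, as a percentage rounded to one decimal."""
    total = junk + mixed + coherent
    return {
        'junk': round(junk / total * 100, 1),
        'mixed': round(mixed / total * 100, 1),
        'coherent': round(coherent / total * 100, 1),
    }

# counts from the manual review of n_model described above
n_breakdown = coherence_breakdown(junk=13, mixed=5, coherent=57)
print(n_breakdown)
```

The same function can be called with the r_model counts (11, 7, 57) to compare the two models on equal terms.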

A few examples of junk topics:

  • topic 3: essay, bible, john, commentary, old, theology, james, fortress, david, paul
  • topic 9: faith, hebrews, life, sin, thought, love, sense, people, heart, judgment
  • topic 74: plate, script, hatch, harrison, index, haran, equivalent, print, earthquake, tribution

An example of a mixed topic:

  • topic 44: esther, hand, foot, eye, king, garment, head, house, moore, gold (This topic could be thought of as a mix of "body" and the story of Esther).

A few examples of coherent topics:

  • topic 11 (narrative criticism): story, narrative, character, reader, narrator, account, event, motif, element, pattern
  • topic 34 (dead sea scrolls): qumran, scroll, dead, sea, community, scrolls, document, cave, sect, fragment
  • topic 38 (family): son, father, child, family, mother, bother, marriage, wife, daughter, birth

Topic Coherence: Brief Discussion

r_model and n_model each produced 57 coherent topics (76% of total topics). Further, a close examination reveals a striking similarity between the topics produced by the two models:

  • r_model topic 8: essay, john, bible, studies, paul, theology, jesus, david, commentary, fortress
  • n_model topic 3: essay, bible, john, commentary, old, theology, james, fortress, david, paul

  • r_model topic 7 (narrative criticism): narrative, story, reader, literary, character, audience, speech, reading, response, narrator
  • n_model topic 11 (narrative criticism): story, narrative, character, reader, narrator, account, event, motif, element, pattern

  • r_model topic 57 (dead sea scrolls): qumran, scroll, shall, society, member, sea, dead, scrolls, community, council
  • n_model topic 34 (dead sea scrolls): qumran, scroll, dead, sea, community, scrolls, document, cave, sect, fragment

  • r_model topic 51: son, father, child, family, mother, brother, dead, burial, wife, bear
  • n_model topic 38 (family): son, father, child, family, mother, bother, marriage, wife, daughter, birth

There are some slight differences, however. For example, in the Dead Sea Scrolls topic, r_model includes the modal verb "shall," which says something about the nature of Dead Sea Scrolls literature. Another difference occurs between r_model topic 51 and n_model topic 38; the former is a mixed topic but the latter is a coherent topic because the words "dead" and "burial" are not present. As a noun, "burial" might be expected in n_model, but perhaps with fewer features in n_model the distribution of words across topics was altered.
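The similarity between paired topics can also be expressed as a number. The sketch below uses Jaccard overlap on the top-ten word lists of the two Dead Sea Scrolls topics quoted above; Jaccard overlap is an illustrative choice made here, not a measure used elsewhere in this analysis, and the helper name `jaccard` is hypothetical.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two top-word lists have in common."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# top-ten words copied from the topic listings above
r_topic_57 = ['qumran', 'scroll', 'shall', 'society', 'member',
              'sea', 'dead', 'scrolls', 'community', 'council']
n_topic_34 = ['qumran', 'scroll', 'dead', 'sea', 'community',
              'scrolls', 'document', 'cave', 'sect', 'fragment']

shared = sorted(set(r_topic_57) & set(n_topic_34))
overlap_score = jaccard(r_topic_57, n_topic_34)
print(shared)
print(round(overlap_score, 2))
```

Six of the fourteen distinct words appear in both lists, and the words unique to r_model include the non-noun "shall".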

Clustering Test

In [9]:
def cluster_test(corpus, model):
    docs_with_1_topic = 0
    docs_with_multiple_topics = 0
    docs_with_no_topics = 0
    total_docs = 0
    for doc in corpus:
        topics = model.get_document_topics(doc, minimum_probability=0.20)
        total_docs += 1
        if len(topics) == 1:
            docs_with_1_topic += 1
        elif len(topics) > 1:
            docs_with_multiple_topics += 1
        else:
            docs_with_no_topics += 1
    print('Corpus assigned to a single topic:', (docs_with_1_topic / total_docs) * 100, '%')
    print('Corpus assigned to multiple topics:', (docs_with_multiple_topics / total_docs) * 100, '%')
    print('corpus assigned to no topics:', (docs_with_no_topics / total_docs) * 100, '%')

Clustering: r_model

In [10]:
cluster_test(r_corpus, r_model)
Corpus assigned to a single topic: 58.2798459563543 %
Corpus assigned to multiple topics: 26.60462130937099 %
corpus assigned to no topics: 15.115532734274712 %

Clustering: n_model

In [5]:
cluster_test(n_corpus, n_model)
Corpus assigned to a single topic: 55.26315789473685 %
Corpus assigned to multiple topics: 30.14548566538297 %
corpus assigned to no topics: 14.591356439880187 %

Clustering: Brief Discussion

r_model assigned 58.3% of the documents in the corpus to a single topic and 26.6% to multiple topics, leaving 15.1% unassigned. n_model assigned 55.3% to a single topic and 30.1% to multiple topics, leaving 14.6% unassigned. Here a document counts as assigned to a topic only when that topic reaches the 0.20 minimum probability set in cluster_test. The two models do not differ substantially in the share of the corpus left unassigned.
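The three categories partition the corpus according to how many topics clear the probability threshold for each document. The following sketch reproduces that logic in plain Python on hypothetical topic distributions, so it runs without the trained models; the function name `classify_documents`, the toy data, and the use of `>=` at the boundary are all assumptions made for illustration.

```python
from collections import Counter

def classify_documents(doc_topic_dists, threshold=0.20):
    """Count documents with one, several, or no topics at or above the threshold."""
    counts = Counter()
    for dist in doc_topic_dists:
        strong = [topic for topic, prob in dist if prob >= threshold]
        if len(strong) == 1:
            counts['single'] += 1
        elif len(strong) > 1:
            counts['multiple'] += 1
        else:
            counts['none'] += 1
    return counts

# hypothetical (topic_id, probability) lists for four documents
toy_dists = [
    [(3, 0.85), (9, 0.10)],              # one dominant topic
    [(1, 0.45), (7, 0.40)],              # two strong topics
    [(2, 0.15), (4, 0.15), (6, 0.12)],   # no topic reaches 0.20
    [(5, 0.60), (8, 0.25)],              # two strong topics
]

counts = classify_documents(toy_dists)
shares = {label: counts[label] / len(toy_dists) * 100
          for label in ('single', 'multiple', 'none')}
print(shares)
```

Because every document falls into exactly one category, the three shares always sum to 100%. Raising the threshold moves documents toward "none", and lowering it moves them toward "multiple".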

Information Retrieval Test

In [2]:
# build indices for similarity queries
r_index = similarities.MatrixSimilarity(r_model[r_corpus])
n_index = similarities.MatrixSimilarity(n_model[n_corpus])

# define retrieval test
def retrieval_test(dictionary, new_doc, lda, index):
    new_bow = dictionary.doc2bow(new_doc)  # change new document to bag of words representation
    new_vec = lda[new_bow]  # change new bag of words to a vector
    index.num_best = 10  # set index to generate 10 best results
    matches = (index[new_vec])
    scores = []
    for match in matches:
        score = (match[1])
        scores.append(score)
        score = str(score)
        key = 'doc_' + str(match[0])
        article_dict = doc2metadata[key]
        author = article_dict['author']
        title = article_dict['title']
        year = article_dict['pub_year']
        print(key + ': ' + author.title() + ' (' + year + '). ' + title.title() + '\n\tsimilarity score -> ' + score + '\n')

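# set up nlp for new docs
# note: written against an older spaCy release; in spaCy 3 the equivalents are
# spacy.load('en_core_web_sm') and spacy.lang.en.stop_words.STOP_WORDS
nlp = spacy.load('en')
stop_words = spacy.en.STOPWORDS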

# define regular lemmatizer
def get_lemmas(text):
    doc = nlp(text)
    tokens = [token for token in doc]
    lemmas = [token.lemma_ for token in tokens if token.is_alpha]
    lemmas = [lemma for lemma in lemmas if lemma not in stop_words]
    return lemmas

# define noun-only lemmatizer
def get_noun_lemmas(text):
    doc = nlp(text)
    # keep singular, plural, and singular proper nouns (Penn Treebank tags)
    noun_tokens = [token for token in doc if token.tag_ in ('NN', 'NNP', 'NNS')]
    noun_lemmas = [noun_token.lemma_ for noun_token in noun_tokens if noun_token.is_alpha]
    noun_lemmas = [noun_lemma for noun_lemma in noun_lemmas if noun_lemma not in stop_words]
    return noun_lemmas

# load and process Greene, N. E. (2017)
with open('../abstracts/greene.txt', encoding='utf8', mode='r') as f:
    text = f.read()
    r_greene = get_lemmas(text)
    n_greene = get_noun_lemmas(text)
    
# load and process Hollenback, G. M. (2017)
with open('../abstracts/hollenback.txt', encoding='utf8', mode='r') as f:
    text = f.read()
    r_hollenback = get_lemmas(text)
    n_hollenback = get_noun_lemmas(text)

# load and process Dinkler, M. B. (2017)
with open('../abstracts/dinkler.txt', encoding='utf8', mode='r') as f:
    text = f.read()
    r_dinkler = get_lemmas(text)
    n_dinkler = get_noun_lemmas(text)
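The retrieval test ranks corpus documents by how close their topic vectors are to the topic vector of the new document; gensim's MatrixSimilarity is documented as computing cosine similarity. The sketch below shows that ranking step in plain Python on hypothetical sparse topic vectors. The helper names `cosine` and `top_matches` and the toy vectors are assumptions for illustration, not part of the project's code.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (topic_id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_matches(query_vec, doc_vecs, num_best=10):
    """Return (doc_position, score) pairs for the num_best most similar documents."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:num_best]

# hypothetical topic vectors for a query and three corpus documents
query = [(7, 0.6), (25, 0.4)]
docs = [
    [(7, 0.6), (25, 0.4)],   # same topic mixture as the query
    [(7, 0.9), (57, 0.1)],   # shares topic 7 only
    [(31, 1.0)],             # no topics in common
]

matches = top_matches(query, docs, num_best=2)
print(matches)
```

A score near 1 therefore means that two documents have nearly the same mixture of topics, not that they share vocabulary, which is why the two models can return different documents with similar scores.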

Finding Articles Similar to Greene, N. E. (2017). Creation, destruction, and a Psalmist's plea: rethinking the poetic structure of Psalm 74.

Information Retrieval: r_model

In [3]:
retrieval_test(r_dictionary, r_greene, r_model, r_index)
doc_8337: Durlesser, James A. (1991). Review Of Trishagion Und Gottesherrschaft: Psalm 99 Als Neuinterpretation Von Tora Und Propheten
	similarity score -> 0.9167608022689819

doc_7176: Malchow, Bruce V. (1997). Review Of Psalm 102 Im Kontext Des Vierten Psalmenbuches
	similarity score -> 0.8992549180984497

doc_8705: Jerome F. D. Creach (1999). Review Of The Songs Of Ascents (Psalms 120-134): Their Place In Israelite History And Religion
	similarity score -> 0.8797650337219238

doc_7288: Limburg, James (1997). Review Of  Jahwe Wird Kommen, Zu Herrschen Über Die Erde: Ps 90-110 Als Komposition 
	similarity score -> 0.8787395358085632

doc_7286: Miller, Patrick D. (1997). Review Of Die Komposition Des Psalters: Ein Formgeschichtlicher Ansatz
	similarity score -> 0.8757284879684448

doc_2110: Mays, James L. (1985). Review Of The Psalms Of The Sons Of Korah
	similarity score -> 0.8631658554077148

doc_5243: John Wm. Wevers (1960). Review Of  Psalm 89: Eine Liturgie Aus Dent Ritual Des Leidenden Königs 
	similarity score -> 0.8608895540237427

doc_7287: Watts, James W. (1997). Review Of  Merveilles À Nos Yeux: Etude Structurelle De Vingt Psaumes Dont Celui De 1 Ch 16,8-36 
	similarity score -> 0.8604172468185425

doc_7250: Allen, Leslie C. (1998). Review Of The Structure Of Psalms 93-100
	similarity score -> 0.8420727252960205

doc_7539: Ceresko, Anthony R. (1995). Review Of  Voyez De Vos Yeux: Étude Structurelle De Vingt Psaumes, Dont Le Psaume 119 
	similarity score -> 0.8418440818786621

Information Retrieval: n_model

In [4]:
retrieval_test(n_dictionary, n_greene, n_model, n_index)
doc_9217: Briggs, Charles A. (1899). An Inductive Study Of Selah
	similarity score -> 0.8950629830360413

doc_1411: Berry, George R. (1914). The Titles Of The Psalms
	similarity score -> 0.8928478956222534

doc_804: Peters, John P. (1921). Another Folk Song
	similarity score -> 0.8910002708435059

doc_757: Peters, John P. (1916). Ritual In The Psalms
	similarity score -> 0.8899545669555664

doc_2855: Jefferson, Helen Genevieve (1952). Psalm 93
	similarity score -> 0.8780089616775513

doc_123: Armstrong, Ryan M. (2012). Psalms Dwelling Together In Unity: The Placement Of Psalms 133 And 134 In Two Different Psalms Collections
	similarity score -> 0.8754132986068726

doc_8205: Waltke, Bruce K. (1991). Superscripts, Postcripts, Or Both
	similarity score -> 0.871891438961029

doc_9314: Peters, John P. (1910). Notes On Some Ritual Uses Of The Psalms
	similarity score -> 0.8673380017280579

doc_2970: Liebreich, Leon J. (1955). The Songs Of Ascents And The Priestly Blessing
	similarity score -> 0.8623627424240112

doc_5418: Buss, Martin J. (1963). The Psalms Of Asaph And Korah
	similarity score -> 0.8536018133163452

Brief Discussion: Finding articles similar to Greene, N. E. (2017). Creation, destruction, and a Psalmist's plea: rethinking the poetic structure of Psalm 74.

Each model returned documents about the Psalms, which is appropriate given the query article, and the two models produced similar similarity scores. However, the two result lists have no documents in common.

Finding Articles Similar to Hollenback, G. M. (2017). Who is doing what to whom revisited: Another look at Leviticus 18:22 and 20:13.

Information Retrieval: r_model

In [5]:
retrieval_test(r_dictionary, r_hollenback, r_model, r_index)
doc_7540: Collins, John J. (1995). Review Of Marriage As A Covenant: A Study Of Biblical Law And Ethics Governing Marriage, Developed From The Perspective Of Malachi
	similarity score -> 0.7883983254432678

doc_6899: Blenkinsopp, Joseph (1995). Review Of The View Of Women Found In The Deuteronomic Family Laws
	similarity score -> 0.7663593888282776

doc_7805: Bird, Phyllis A. (1993). Review Of Frauen Im Alten Israel: Eine Begriffsgeschichtliche Und Sozialrechtliche Studie Zur Stellung Der Frau Im Alten Testament
	similarity score -> 0.750497043132782

doc_1639: Running, Leona Glidden (1984). Review Of Il Matrimonio Israelitico. Una Theoria Generale
	similarity score -> 0.7494863271713257

doc_8166: Collins, Adela Yarbro (1988). Review Of  Schweigen, Schmuck Und Schleier: Drei Neutestamentliche Vorschriften Zur Verdrängung Der Frauen Auf Dem Hintergrund Einer Frauenfeindlichen Exegese Des Alten Testaments Im Antiken Judentum 
	similarity score -> 0.7305008172988892

doc_8757: Walsh, Jerome T. (2001). Leviticus 18:22 And 20:13: Who Is Doing What To Whom?
	similarity score -> 0.7292389869689941

doc_1974: Trible, Phyllis (1987). Review Of The Israelite Woman: Social Role And Literary Type In Biblical Narrative
	similarity score -> 0.7121146321296692

doc_2355: De George, Susan G. (1986). Review Of Women In The Ministry Of Jesus: A Study Of Jesus' Attitudes To Women And Their Roles As Reflected In His Earthly Life
	similarity score -> 0.7006977796554565

doc_8946: Mclay, Tim (2000). Review Of The Oracle Of Tyre: The Septuagint Of Isaiah Xxiii As Version And Vision
	similarity score -> 0.6980003714561462

doc_5894: Macdonald, John (1980). Review Of  Untersuchungen Zum Begriff וצר Im Alten Testament 
	similarity score -> 0.6873050928115845

Information Retrieval: n_model

In [6]:
retrieval_test(n_dictionary, n_hollenback, n_model, n_index)
doc_8995: Martin, Troy W. (2004). Paul'S Argument From Nature For The Veil In 1 Corinthians 11:13-15: A Testicle Instead Of A Head Covering
	similarity score -> 0.7967202663421631

doc_463: Cosgrove, Charles H. (2005). A Woman'S Unbound Hair In The Greco-Roman World, With Special Reference To The Story Of The "Sinful Woman" In Luke 7:36-50
	similarity score -> 0.7828606367111206

doc_8719: Burrus, Virginia (1999). Review Of Early Christian Women And Pagan Opinion: The Power Of The Hysterical Woman
	similarity score -> 0.7453253865242004

doc_284: Lemos, T. M. (2006). Shame And Mutilation Of Enemies In The Hebrew Bible
	similarity score -> 0.7283123731613159

doc_1851: Kraemer, Ross S. (1985). Review Of In Memory Of Her: A Feminist Theological Reconstruction Of Christian Origins
	similarity score -> 0.7240121364593506

doc_316: Nasrallah, Laura (2006). Review Of A Woman'S Place: House Churches In Earliest Christianity
	similarity score -> 0.7107797265052795

doc_143: Townsley, Jeramy (2011). Paul, The Goddess Religions, And Queer Sects: Romans 1:23—28
	similarity score -> 0.7066832780838013

doc_1974: Trible, Phyllis (1987). Review Of The Israelite Woman: Social Role And Literary Type In Biblical Narrative
	similarity score -> 0.7020665407180786

doc_6994: Corley, Kathleen E. (1996). Review Of The Double Message: Patterns Of Gender In Luke-Acts
	similarity score -> 0.6915348768234253

doc_8757: Walsh, Jerome T. (2001). Leviticus 18:22 And 20:13: Who Is Doing What To Whom?
	similarity score -> 0.6825031638145447

Brief Discussion: Finding articles similar to Hollenback, G. M. (2017). Who is doing what to whom revisited: Another look at Leviticus 18:22 and 20:13.

Each model returned documents which focus on gender, especially on women's roles in various biblical contexts, and each model had similar similarity scores. There were two documents from the corpus which were returned by each model:

  • doc_8757: Walsh, Jerome T. (2001). Leviticus 18:22 And 20:13: Who Is Doing What To Whom?
  • doc_1974: Trible, Phyllis (1987). Review Of The Israelite Woman: Social Role And Literary Type In Biblical Narrative

doc_8757 is in fact the article to which the query article responds, but neither model ranks it very highly (6th in r_model with a similarity score of 0.729 and 10th in n_model with a similarity score of 0.683). The other document, doc_1974, is a review of a book about women in biblical narratives. It was ranked 7th by r_model (similarity score of 0.712) and 8th by n_model (similarity score of 0.702).
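The overlap between two result lists, and the rank of each shared document, can be checked mechanically. This sketch uses the document keys from the two Hollenback result lists above; the helper name `shared_results` is hypothetical.

```python
def shared_results(ranking_a, ranking_b):
    """Map each document found in both rankings to its 1-based rank in each."""
    ranks_b = {doc: rank for rank, doc in enumerate(ranking_b, start=1)}
    return {doc: (rank, ranks_b[doc])
            for rank, doc in enumerate(ranking_a, start=1)
            if doc in ranks_b}

# document keys in the order returned by each model for the Hollenback query
r_results = ['doc_7540', 'doc_6899', 'doc_7805', 'doc_1639', 'doc_8166',
             'doc_8757', 'doc_1974', 'doc_2355', 'doc_8946', 'doc_5894']
n_results = ['doc_8995', 'doc_463', 'doc_8719', 'doc_284', 'doc_1851',
             'doc_316', 'doc_143', 'doc_1974', 'doc_6994', 'doc_8757']

overlap = shared_results(r_results, n_results)
print(overlap)
```

Each value is a pair of ranks, first in r_model and then in n_model.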

Finding Articles Similar to Dinkler, M. B. (2017). Building character on the road to Emmaus: Lukan characterization in contemporary literary perspective.

Information Retrieval: r_model

In [7]:
retrieval_test(r_dictionary, r_dinkler, r_model, r_index)
doc_8835: Graber, Philip L. (2002). Review Of Transitivity-Based Foregrounding In The Acts Of The Apostles: A Functional-Grammatical Approach To The Lukan Perspective
	similarity score -> 0.9306855797767639

doc_8985: Nolland, John (2000). Review Of  Den Anfang Hören: Leserorientierte Evangelienexegese Am Beispiel Von Matthäus 1-2 
	similarity score -> 0.9304404258728027

doc_7745: Darr, John A. (1991). Review Of Literary Criticism And The Gospels: The Theoretical Challenge
	similarity score -> 0.9158278703689575

doc_8421: Darr, John A. (1993). Review Of Host, Guest, Enemy And Friend: Portraits Of The Pharisees In Luke And Acts
	similarity score -> 0.9145811796188354

doc_8887: Bauer, David R. (2000). Review Of Matthew'S Parables: Audience-Oriented Perspectives
	similarity score -> 0.9141808748245239

doc_8156: Bassler, Jouette M. (1988). Review Of Narrative Space And Mythic Meaning In Mark
	similarity score -> 0.904059886932373

doc_6919: Stegner, William Richard (1995). Review Of The Transfiguration: A Source- And Redaction-Critical Study Of Luke 9:28-36
	similarity score -> 0.9022995233535767

doc_8235: Watson, Duane F. (1991). Review Of The Rhetorical Strategy Of 1 Peter, With Special Regard To Ambiguous Expressions
	similarity score -> 0.8984384536743164

doc_5910: Beardslee, William A. (1980). Review Of Structural Exegesis: From Theory To Practice: Exegesis Of Mark 15 And 16; Hermeneutical Implications
	similarity score -> 0.8975492119789124

doc_7430: Dean, Margaret E. (1996). Review Of God-With-Us: The Dominant Perspective In Matthew'S Story And Other Essays
	similarity score -> 0.8948432207107544

Information Retrieval: n_model

In [8]:
retrieval_test(n_dictionary, n_dinkler, n_model, n_index)
doc_8158: Tyson, Joseph B. (1988). Review Of The Lukan Voice: Confusion And Irony In The Gospel Of Luke
	similarity score -> 0.8752881288528442

doc_7866: Lincoln, Andrew T. (1989). The Promise And The Failure: Mark 16:7, 8
	similarity score -> 0.8574842214584351

doc_1952: Praeder, Susan Marie (1984). Review Of Mark As Story: An Introduction To The Narrative Of A Gospel
	similarity score -> 0.8537847995758057

doc_7796: Malbon, Elizabeth Struthers (1993). Echoes And Foreshadowings In Mark 4-8 Reading And Rereading
	similarity score -> 0.8523545265197754

doc_8712: Brodie, Thomas L. (1999). Review Of The Discipleship Paradigm: Readers And Anonymous Characters In The Fourth Gospel
	similarity score -> 0.852075457572937

doc_264: Ahearne-Kroll, Stephen P. (2010). Audience Inclusion And Exclusion As Rhetorical Technique In The Gospel Of Mark
	similarity score -> 0.839181661605835

doc_7865: Malbon, Elizabeth Struthers (1989). The Jewish Leaders In The Gospel Of Mark: A Literary Study Of Marcan Characterization
	similarity score -> 0.8106918931007385

doc_6706: Boomershine, Thomas E. (1981). Mark 16:8 And The Apostolic Commission
	similarity score -> 0.7990190982818604

doc_312: Sylva, Dennis (2006). Review Of Dialogue And Drama: Elements Of Greek Tragedy In The Fourth Gospel
	similarity score -> 0.7910127639770508

doc_3857: Robbins, Vernon K. (1973). The Healing Of Blind Bartimaeus (10:46-52) In The Marcan Theology
	similarity score -> 0.7885808348655701

Brief Discussion: Finding Articles Similar to Dinkler, M. B. (2017). Building character on the road to Emmaus: Lukan characterization in contemporary literary perspective.

Each model returned documents related to gospel studies, which is appropriate given the query article. r_model produced higher similarity scores than n_model. The two result lists have no documents in common.
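The difference in similarity scores can be summarized with a short calculation. The sketch below copies the ten scores from each Dinkler result list above, rounded to four decimals, and compares their means; the mean is only one possible summary and is an illustrative choice made here.

```python
from statistics import mean

# similarity scores from the two result lists above, rounded to four decimals
r_scores = [0.9307, 0.9304, 0.9158, 0.9146, 0.9142,
            0.9041, 0.9023, 0.8984, 0.8975, 0.8948]
n_scores = [0.8753, 0.8575, 0.8538, 0.8524, 0.8521,
            0.8392, 0.8107, 0.7990, 0.7910, 0.7886]

r_mean = mean(r_scores)
n_mean = mean(n_scores)
print(f'r_model mean: {r_mean:.3f}, n_model mean: {n_mean:.3f}')

# check whether even the weakest r_model match outscores the best n_model match
print(min(r_scores) > max(n_scores))
```

Because each model scores documents in its own topic space, a higher score from one model does not by itself show that its results are more relevant.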