Appendix 1: Documentation of Corpus Preparation
Here, I process a collection of documents, namely articles from the Journal of Biblical Literature (JBL), into a format that will be most useful for building a topic model. When the process is complete, there will be a dictionary which maps the articles (documents) to the appropriate metadata, and two corpora: the first a general corpus in which each journal article is represented by a list of informative words (minus pre-defined stop words), and the second a corpus in which each journal article is represented by a list of the most informative nouns only.
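To make the target output concrete, a single entry of that metadata dictionary will look roughly like this (the key names come from the processing code below; the values are illustrative, not drawn from the actual corpus):
example_entry = {
    'doc_0': {
        'title': 'review of a history of ancient israel',  # hypothetical title
        'article_id': '10.2307/0000000',                    # hypothetical identifier
        'author': 'smith, jane',
        'pub_year': '1995',
    }
}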
There are several Python libraries that I use to process the JBL articles into a corpus ready to be modeled. The most important of these are:
spaCy, a natural language processing library which allows me to prepare the text in meaningful ways, such as removing stop words, part-of-speech tagging, and lemmatizing
gensim, a topic modeling library which does the computational work of building the topic model.
import os
import re
import collections
import xml.etree.ElementTree as ET
import spacy
from gensim import corpora
import json
The output of a topic model will only be as good as its input. It is therefore important to select the most informative words, or features, from the corpus. This allows the topic model to focus on the "signal" by eliminating as much "noise" as possible. Toward that end, I take a few steps: I remove stop words (spaCy's built-in English stop words plus the custom, Roman numeral, and German stop words defined below), I normalize abbreviations of biblical book names to their full forms, and I lemmatize the remaining words.
Preprocessing also includes tokenizing each document, that is to say, transforming each document into a list of discrete items; in this case, each item is a word. In the functions I define below for this purpose, I create for each document a list of general tokens (minus stop words) and a list of noun-only tokens. This allows me to build two different versions of the topic model, which I compare to see which of the models is more informative.
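As a quick illustration of the difference between the two token lists, consider a made-up sentence; the values below are hand-written to show the idea, not produced by the functions defined later:
sentence = 'Paul wrote an early letter to the church'
general_tokens = ['paul', 'write', 'early', 'letter', 'church']  # all informative lemmas, stop words removed
noun_tokens = ['paul', 'letter', 'church']                       # lemmas of the nouns only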
books_abbrs = [('gen', 'genesis'),('exod', 'exodus'),('ex', 'exodus'),('lev', 'leviticus'),('num', 'numbers'),
('deut', 'deuteronomy'),('josh', 'joshua'),('judg', 'judges'), ('jud', 'judges'),('sam', 'samuel'),('kgs', 'kings'),
('chr', 'chronicles'),('neh', 'nehemiah'),('esth', 'esther'),('ps', 'psalms'),('pss', 'psalms'),
('prov', 'proverbs'),('eccl', 'ecclesiastes'),('qoh', 'qoheleth'), ('isa', 'isaiah'),
('jer', 'jeremiah'),('lam', 'lamentations'),('ezek', 'ezekiel'),('hos', 'hosea'),('obad', 'obadiah'),
('mic', 'micah'),('nah', 'nahum'),('hab', 'habakkuk'),('zeph', 'zephaniah'),('hag', 'haggai'),
('zech', 'zechariah'),('mal', 'malachi'),('matt', 'matthew'),('mk', 'mark'),('lk', 'luke'),
('jn', 'john'),('rom', 'romans'),('cor', 'corinthians'),('gal', 'galatians'),('eph', 'ephesians'),
('phil', 'philippians'),('col', 'colossians'),('thess', 'thessalonians'),('tim','timothy'),
('phlm', 'philemon'),('heb', 'hebrews'),('jas', 'james'),('pet', 'peter'),('rev', 'revelation'),
('tob', 'tobit'),('jdt', 'judith'), ('wis', 'wisdom of solomon'),('sir', 'sirach'), ('bar', 'baruch'),
('macc', 'maccabees'), ('esd', 'esdras'), ('tg', 'targum')]
custom_stop_words = ['ab', 'al', 'alten', 'america', 'atlanta', 'au', 'av', 'avrov', 'b', 'ba', 'bauer', 'berlin',
'boston', 'brill', 'brown', 'c', 'cad', 'cambridge', 'cf', 'ch', 'chap', 'chapter', 'charles',
'chicago', 'chs', 'cit', 'cite', 'claremont', 'college', 'craig', 'cum', 'd', 'dans', 'dennis',
'diese', 'dissertation', 'dm', 'dtr', 'ed', 'eds', 'eerdmans', 'ek', 'elisabeth', 'en', 'et',
'ev', 'ez', 'f', 'far', 'ff', 'fiir', 'g', 'gar', 'george', 'geschichte', 'gott', 'gottes',
'grand', 'h', 'ha', 'hall', 'hartford', 'hat', 'haven', 'henry', 'ia', 'ibid', 'io',
'isbn', 'iv', 'ivye', 'ix', 'jeremias', 'jesu', 'k', 'ka', 'kai', 'kal', 'kat', 'kee', 'ki', 'kim',
'kirche', 'klein', 'knox', 'l', 'la', 'le', 'leiden', 'leipzig', 'les', 'loc', 'louisville', 'm',
'ma', 'madison', 'marie', 'marshall', 'mohr', 'n', 'na', 'neuen', 'ni', 'nu', 'nur', 'o', 'ol',
'om', 'op', 'ov', 'ovadd', 'ovk', 'oxford', 'paper', 'paulus', 'ph', 'philadelphia', 'post',
'pres', 'president', 'press', 'pro', 'prof', 'professor', 'r', 'ra', 'rab', 'rapids', 'refer',
'reviews', 'ro', 'robert', 'robinson', 'rov', 's', 'sa', 'schmidt', 'schriften', 'scott', 'sec',
'section', 'seiner', 'sheffield', 'siebeck', 'stanely', 'studien', 't', 'thee', 'theologie',
'they', 'thing', 'thou', 'thy', 'tiibingen', 'tov', 'tr', 'tv', 'u', 'um', 'univ', 'unto', 'v',
'van', 'verse', 'vol', 'volume', 'vs', 'vss', 'vv', 'w', 'william', 'wunt',
'y', 'yap', 'ye', 'york', 'zeit']
with open('../romannumeral.txt') as f:
rom_nums = f.read()
rom_nums = re.sub('romannumeral', '', rom_nums)
rom_nums = re.sub('lxx', '', rom_nums) # lxx is an abbr. for 'septuagint'
rom_nums = re.split(r'\t\n', rom_nums)
with open('../data/german_stop_words') as f:
german_stop_words = f.readlines()
german_stop_words = [word.strip() for word in german_stop_words]
nlp = spacy.load('en')
stop_words = spacy.en.STOPWORDS
stop_words.update(custom_stop_words)
stop_words.update(rom_nums)
stop_words.update(german_stop_words)
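A quick sanity check (assuming the files above loaded as expected) confirms that words from the different sources all ended up in the combined stop word set:
print(len(stop_words))         # total size of the combined stop word set
print('verse' in stop_words)   # True: comes from the custom list above
print('aber' in stop_words)    # True, assuming it appears in the German stop word file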
def substitute(list_tuples, string):
for tuple_ in list_tuples:
string = re.sub(r'\b' + tuple_[0] + r'\b', tuple_[1], string)
return string
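For example, applying substitute with the abbreviation list defined above expands book abbreviations wherever they occur as whole words (an illustrative call, not part of the original pipeline):
print(substitute(books_abbrs, 'see gen 1 and exod 3'))  # 'see genesis 1 and exodus 3'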
def get_lemmas(doc):
tokens = [token for token in doc]
lemmas = [token.lemma_ for token in tokens if token.is_alpha]
lemmas = [lemma for lemma in lemmas if lemma not in stop_words]
for index, item in enumerate(lemmas):
item = substitute(books_abbrs, item)
lemmas[index] = item
return lemmas
def get_noun_lemmas(doc):
tokens = [token for token in doc]
noun_tokens = [token for token in tokens if token.tag_ == 'NN' or token.tag_ == 'NNP' or token.tag_ == 'NNS']
noun_lemmas = [noun_token.lemma_ for noun_token in noun_tokens if noun_token.is_alpha]
noun_lemmas = [noun_lemma for noun_lemma in noun_lemmas if noun_lemma not in stop_words]
for index, item in enumerate(noun_lemmas):
item = substitute(books_abbrs, item)
noun_lemmas[index] = item
return noun_lemmas
def process_text(text):
doc = nlp(text)
lemmas = get_lemmas(doc)
noun_lemmas = get_noun_lemmas(doc)
return lemmas, noun_lemmas
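A short usage sketch of process_text; the exact lemmas depend on the spaCy model and the stop word list, so the output in the comments is only indicative:
sample_text = 'The prophet Isaiah spoke to the people of Jerusalem.'
sample_lemmas, sample_nouns = process_text(sample_text)
print(sample_lemmas)  # e.g. ['prophet', 'isaiah', 'speak', 'people', 'jerusalem']
print(sample_nouns)   # e.g. ['prophet', 'isaiah', 'people', 'jerusalem']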
JSTOR's Data for Research provided both metadata and full-text articles for JBL in the form of XML files and TXT files, respectively. However, many of these articles are not useful for the topic model and are not processed: articles not written in English, as well as front matter, back matter, annual indexes, and volume information. Articles whose processed text turns out to be empty are likewise skipped.
After leaving those articles aside, I extract the metadata from the xml files, map the metadata to the relevant article, and then store this mapping as a dictionary for later reference. Then, I extract the full text for each article and run it through the preprocessing functions I defined above.
%%time
xml_files = sorted(os.listdir('../data/metadata/'))
txt_files = sorted(os.listdir('../data/ocr/'))
mapping_dict = collections.OrderedDict()
general_docs = []
noun_docs = []
i = 0
for xml, txt in zip(sorted(xml_files), sorted(txt_files)):
article_dict = {}
# read xml file
tree = ET.parse('../data/metadata/' + xml)
root = tree.getroot()
# only process english articles
lang = root.find('./front/article-meta/custom-meta-group/custom-meta/meta-value')
if (lang.text == 'eng') or (lang.text == 'en'):
# add title to article dict
title = root.find('./front/article-meta/title-group/article-title')
try:
title = title.text
title = title.lower()
except AttributeError:
book_reviewed = root.find('./front/article-meta/product/source')
title = 'review of ' + book_reviewed.text.lower() # jbl does not title book reviews
unwanted_titles = ['front matter', 'back matter', 'annual index', 'volume information'] # ignore these titles
        if title not in unwanted_titles:
article_dict['title'] = title
# add article_id to article_dict
article_id = root.find('./front/article-meta/article-id')
article_id = article_id.text.lower()
article_dict['article_id'] = article_id
# add author to article_dict
f_name = root.find('./front/article-meta/contrib-group/contrib/string-name/given-names')
l_name = root.find('./front/article-meta/contrib-group/contrib/string-name/surname')
author = root.find('./front/article-meta/contrib-group/contrib/string-name')
            if f_name is not None:
                author = l_name.text + ', ' + f_name.text
            elif author is not None:
                author = author.text
            else:
                author = 'author not listed'
article_dict['author'] = author
# add publish date to article_dict
pub_year = root.find('./front/article-meta/pub-date/year')
article_dict['pub_year'] = pub_year.text
# read txt file
with open('../data/ocr/' + txt, mode='r', encoding='utf8') as f:
text = f.read()
lemmas, nouns = process_text(text)
if len(nouns) > 0: # only want docs which are not empty
general_docs.append(lemmas)
noun_docs.append(nouns)
key = 'doc_' + str(i)
mapping_dict[key] = article_dict
i += 1
else:
continue
if i % 500 == 0: # displaying progress
                print('finished doc ', i)
else:
continue
else:
continue
with open('../data/doc2metadata.json', encoding='utf8', mode='w') as outfile:
json.dump(mapping_dict, outfile)
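Later, the metadata for any document can be looked up by its doc_n key, which is how the topic models are connected back to titles and authors. A brief sketch, assuming the JSON file was written as above:
with open('../data/doc2metadata.json', encoding='utf8', mode='r') as infile:
    doc2metadata = json.load(infile)
print(doc2metadata['doc_0']['title'])  # title of the first document in the corpus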
Initializing a Gensim corpus (which serves as the basis of a topic model) entails two steps: first, creating a dictionary which maps each unique token in the corpus to an integer id; and second, converting each document into a bag-of-words representation, that is, a list of (token id, token count) pairs. A single document in that representation looks like this:
[(1, 1.0), (2, 1.0), (3, 5.0), (4, 1.0), (5, 9.0)]
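The toy example below, which is not drawn from the JBL corpus, shows how these two steps fit together; the exact ids assigned depend on gensim's internals, so the printed output is only indicative:
toy_docs = [['temple', 'priest', 'temple', 'offering'],
            ['priest', 'covenant', 'law']]
toy_dictionary = corpora.Dictionary(toy_docs)                    # step 1: map each unique token to an integer id
toy_corpus = [toy_dictionary.doc2bow(doc) for doc in toy_docs]   # step 2: count tokens per document
print(toy_corpus[0])  # e.g. [(0, 1), (1, 1), (2, 2)] -- (token id, count) pairs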
After the dictionary is created, the researcher must specify two important parameters (passed to filter_extremes below) which further identify the most informative features.
no_below
This parameter filters out words which are too rare to be informative. Its value is the minimum number of documents in which a token must appear. Here I have set the value to 100, indicating that if a token occurs in fewer than 100 documents, it will not be included in the topic model.
no_above
This parameter filters out words which are too frequent to be informative. Its value is the maximum fraction of documents in which a token may appear. Here I have set the value to 0.5, indicating that if a token occurs in more than 50% of the documents, it will not be included in the topic model.
This filtering selects a "goldilocks zone" of informative features. Tokens which are too rare fail to register similarity among documents, because so few documents use them. By contrast, words which are too frequent fail to register difference among documents, because so many documents share them in common.
There is no formula for deciding what counts as too rare or too frequent; it depends on the size of the corpus and its lexical diversity. I tried a number of variations of these parameters for the regular corpus. First, holding no_above constant:
no_below=20 and no_above=0.5, which left 22,283 unique tokens
no_below=50 and no_above=0.5, which left 12,642 unique tokens
no_below=100 and no_above=0.5, which left 7,834 unique tokens
Then, holding no_below constant:
no_below=100 and no_above=0.3, which left 7,617 unique tokens
no_below=100 and no_above=0.4, which left 7,753 unique tokens
no_below=100 and no_above=0.5, which left 7,834 unique tokens
no_below=100 and no_above=0.9, which left 7,901 unique tokens
As you can see, adjusting no_below had a greater effect on the number of unique tokens than did adjusting no_above. The model I tested with no_below=20 and no_above=0.5 contained more "junk topics" than did the other models, so I decided that 22,283 unique tokens was too many features for this corpus. no_below=50 and no_above=0.5 produced fewer junk topics, but no_below=100 and no_above=0.5 did even better.
I followed a similar process for the noun-only corpus. First, holding no_above constant:
no_below=20 and no_above=0.5, which left 15,761 unique tokens
no_below=50 and no_above=0.5, which left 8,264 unique tokens
no_below=100 and no_above=0.5, which left 4,780 unique tokens
Then, holding no_below constant:
no_below=100 and no_above=0.3, which left 4,685 unique tokens
no_below=100 and no_above=0.4, which left 4,758 unique tokens
no_below=100 and no_above=0.5, which left 4,790 unique tokens
no_below=100 and no_above=0.6, which left 4,813 unique tokens
For the noun-only corpus I also settled on no_below=100 and no_above=0.5 (the short sketch below shows how counts like these can be reproduced).
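A sketch of how counts like those above can be reproduced, assuming general_docs (or noun_docs) is still in memory; filter_extremes modifies a dictionary in place, so a fresh dictionary is built for each setting:
for no_below, no_above in [(20, 0.5), (50, 0.5), (100, 0.3), (100, 0.5)]:
    trial_dictionary = corpora.Dictionary(general_docs)
    trial_dictionary.filter_extremes(no_below=no_below, no_above=no_above)
    print(no_below, no_above, len(trial_dictionary))  # number of unique tokens kept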
The general corpus contains lemmatized forms of all the words used in the text (minus the stop words outlined above), regardless of their part of speech.
# create dictionary
general_dictionary = corpora.Dictionary(general_docs)
general_dictionary.filter_extremes(no_below=100, no_above=0.5)
general_dictionary.save('../general_corpus/general_corpus.dict')
# create corpus
corpus = [general_dictionary.doc2bow(doc) for doc in general_docs]
corpora.MmCorpus.serialize('../general_corpus/general_corpus.mm', corpus)
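Once serialized, the dictionary and corpus can be loaded back for model training without re-running the preprocessing, using gensim's standard load methods:
loaded_dictionary = corpora.Dictionary.load('../general_corpus/general_corpus.dict')
loaded_corpus = corpora.MmCorpus('../general_corpus/general_corpus.mm')
print(len(loaded_dictionary), 'unique tokens;', len(loaded_corpus), 'documents')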
The noun-only corpus contains lemmatized forms of only the nouns used in the text (minus the stop words outlined above).
# create dictionary
noun_dictionary = corpora.Dictionary(noun_docs)
noun_dictionary.filter_extremes(no_below=100, no_above=0.5)
noun_dictionary.save('../noun_corpus/noun_corpus.dict')
# create corpus
noun_corpus = [noun_dictionary.doc2bow(doc) for doc in noun_docs]
corpora.MmCorpus.serialize('../noun_corpus/noun_corpus.mm', noun_corpus)
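As a final sanity check, the two corpora and the metadata mapping should describe the same set of documents (a small sketch, assuming everything above ran in a single session):
assert len(corpus) == len(noun_corpus) == len(general_docs) == len(noun_docs)
print(len(corpus), 'documents in each corpus;', len(mapping_dict), 'entries in the metadata mapping')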