Topic Modeling Best Practices

build_models

Building Topic Models

The core of this project is building multiple topic models from the same corpus of documents, the JBL. Three primary variations are tested:

  • The number of topics assigned to each model
  • The value assigned to the alpha hyper-parameter
  • The types of words included in the corpus, whether all parts of speech or only nouns

As a baseline for comparison, I build the following model:

  • Model based on a noun-only corpus, 75 topics, alpha = 'symmetric'

To analyze the number of topics, I build the following models:

  • Model based on a noun-only corpus, 25 topics, alpha = 'symmetric'
  • Model based on a noun-only corpus, 150 topics, alpha = 'symmetric'

To analyze the value of alpha, I build the following models:

  • Model based on a noun-only corpus, 75 topics, alpha = 'auto'
  • Model based on a noun-only corpus, 75 topics, alpha = 0.5

To analyze the types of words included in the corpus, I build the following model:

  • Model based on a regular corpus, 75 topics, alpha = 'symmetric'
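
The six configurations above can be summarized as a simple parameter grid. The sketch below is only a restatement of that list for reference; the `model_configs` name is hypothetical and is not used by the training cells that follow.

In [ ]:
# hypothetical summary of the six configurations; not used by the cells below
model_configs = [
    {'corpus': 'noun',    'num_topics': 75,  'alpha': 'symmetric'},  # baseline (Model 1)
    {'corpus': 'noun',    'num_topics': 25,  'alpha': 'symmetric'},  # Model 2
    {'corpus': 'noun',    'num_topics': 150, 'alpha': 'symmetric'},  # Model 3
    {'corpus': 'noun',    'num_topics': 75,  'alpha': 'auto'},       # Model 4
    {'corpus': 'noun',    'num_topics': 75,  'alpha': 0.5},          # Model 5
    {'corpus': 'regular', 'num_topics': 75,  'alpha': 'symmetric'},  # Model 6
]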

Set up

In [ ]:
import numpy as np  # needed below for the custom alpha array in Model 5

from gensim import corpora, models
In [ ]:
# load corpus and dictionary for noun-only models
n_path = '../noun_corpus/'
n_dictionary = corpora.Dictionary.load(n_path + 'noun_corpus.dict')
n_corpus = corpora.MmCorpus(n_path + 'noun_corpus.mm')
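
As an optional sanity check (not part of the original training code), the loaded dictionary and corpus expose a few attributes that confirm the corpus was serialized as expected.

In [ ]:
# optional sanity check on the loaded objects
print(len(n_dictionary))   # number of unique tokens in the dictionary
print(n_corpus.num_docs)   # number of documents in the serialized corpus
print(n_corpus.num_terms)  # number of distinct terms indexed in the corpus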

Model 1: Model based on a noun-only corpus, 75 topics, alpha = 'symmetric'

In [ ]:
# alpha defaults to 'symmetric' when not specified
lda_75 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=75, passes=100, random_state=42)
lda_75.save(n_path + 'noun_75.model')
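
To get a first look at the baseline model, the topics can be printed and a coherence score computed. This is an illustrative sketch rather than part of the training pipeline; it uses gensim's CoherenceModel with the 'u_mass' measure, which only needs the bag-of-words corpus loaded above.

In [ ]:
# print a sample of topics from the baseline model
for topic_id, topic in lda_75.show_topics(num_topics=5, num_words=10, formatted=True):
    print(topic_id, topic)

# u_mass coherence can be computed from the bag-of-words corpus alone
coherence_model = models.CoherenceModel(model=lda_75, corpus=n_corpus,
                                        dictionary=n_dictionary, coherence='u_mass')
print(coherence_model.get_coherence())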

Model 2: Model based on a noun-only corpus, 25 topics, alpha = 'symmetric'

In [ ]:
lda_25 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=25, passes=100, random_state=42)
lda_25.save(n_path + 'noun_25.model')

Model 3: Model based on a noun-only corpus, 150 topics, alpha = 'symmetric'

In [ ]:
lda_150 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=150, passes=100, random_state=42)
lda_150.save(n_path + 'noun_150.model')
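
One quick, rough way to compare the three topic-count settings is to score each trained model on the same corpus. The sketch below uses gensim's log_perplexity as an illustration; the substantive comparison of these models is made elsewhere in the project.

In [ ]:
# rough comparison of the three topic-count settings on the same corpus
for name, model in [('25 topics', lda_25), ('75 topics', lda_75), ('150 topics', lda_150)]:
    print(name, model.log_perplexity(n_corpus))  # per-word likelihood bound (less negative is better)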

Model 4: Model based on a noun-only corpus, 75 topics, alpha = 'auto'

In [ ]:
lda_auto = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=75, passes=100, alpha='auto', random_state=42)
lda_auto.save(n_path + 'alphas/noun_auto.model')
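
With alpha = 'auto', gensim learns an asymmetric document-topic prior from the data during training. The learned values are stored on the model as a numpy array and can be inspected, as in this optional sketch.

In [ ]:
# the learned prior is a numpy array with one value per topic
print(lda_auto.alpha.shape)                        # (75,)
print(lda_auto.alpha.min(), lda_auto.alpha.max())  # spread of the learned prior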

Model 5: Model based on a noun-only corpus, 75 topics, alpha = 0.5

In [ ]:
np_05 = np.full(75, 0.5)  # array of 0.5 values, one per topic, used as a fixed prior
lda_05 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=75, passes=100, alpha=np_05, random_state=42)
lda_05.save(n_path + 'alphas/noun_05.model')

Model 6: Model based on a regular corpus, 75 topics, alpha = 'symmetric'

In [ ]:
# load dictionary and corpus for regular corpus models
r_path = '../general_corpus/'
r_dictionary = corpora.Dictionary.load(r_path + 'general_corpus.dict')
r_corpus = corpora.MmCorpus(r_path + 'general_corpus.mm')

# a distinct variable name keeps the noun-only baseline (lda_75) from being overwritten
lda_general_75 = models.LdaModel(r_corpus, id2word=r_dictionary, num_topics=75, passes=100, random_state=42)
lda_general_75.save(r_path + 'general_75.model')
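
For later analysis, any of the saved models can be reloaded with gensim's load method. A minimal sketch, assuming the paths used above:

In [ ]:
# reload a saved model for later comparison
reloaded = models.LdaModel.load(r_path + 'general_75.model')
print(reloaded.num_topics)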