Label topics function — label

This function generates topic labels using four metrics: highest probability, FREX, lift, and score. For each topic, it returns the top n vocabulary terms according to each metric.

Usage

label_topics(beta, vocab, wordcounts, n = 8, frex_weight = 0.5)

Arguments

beta: a numeric matrix of dimension (topics x words) giving the probability distribution of words within each topic. Each row should sum to 1. beta must be on the probability scale (not the log scale).
vocab: a character vector of vocabulary terms corresponding to the columns of beta.
wordcounts: a numeric vector giving the total count of each word across the entire dataset.
n: the number of top words to return for each topic. The default is 8.
frex_weight: a weight between 0 and 1 that controls the balance between frequency and exclusivity in the FREX metric. It is the w in the FREX formula under Details, where it weights the frequency rank, so values closer to 1 favor frequency and values closer to 0 favor exclusivity. The default is 0.5.
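
A small set of inputs meeting these requirements might be built as follows. This is only an illustrative sketch: the toy values and the name logbeta are assumptions made for the example, not data or objects from the package.

```r
# Toy inputs (hypothetical values, for illustration only).
vocab <- c("tax", "vote", "school", "health")

# 2 topics x 4 words; each row is a probability distribution over vocab.
beta <- rbind(
  c(0.50, 0.30, 0.10, 0.10),
  c(0.05, 0.15, 0.40, 0.40)
)

# Total count of each word in the dataset, in the same order as vocab.
wordcounts <- c(120, 90, 100, 110)

# Sanity checks mirroring the argument descriptions.
stopifnot(
  is.matrix(beta), is.numeric(beta),
  ncol(beta) == length(vocab),
  length(wordcounts) == length(vocab),
  all(beta >= 0),                                   # probability scale, not log scale
  isTRUE(all.equal(rowSums(beta), rep(1, nrow(beta))))
)

# If a model stores log probabilities instead, exponentiate them first.
# logbeta is a hypothetical name for such a matrix.
logbeta <- log(beta)
beta_from_log <- exp(logbeta)
```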

Value

A list of the top n vocabulary terms for each topic, ranked according to each of the four metrics: highest probability, FREX, lift, and score.

Details

Highest Probability: For each topic, words are ranked by their probability within that topic, and the n words with the largest probabilities are selected.
FREX: FREX combines frequency and exclusivity for each word in each topic. For frequency, the word probabilities are ranked within each topic and the ranks are scaled to values between 0 and 1. For exclusivity, each word’s probability in a topic is divided by the sum of its probabilities across all topics, which measures how exclusive the word is to that topic; these values are then ranked within each topic and scaled to values between 0 and 1. The FREX score is the weighted harmonic mean of the frequency rank and the exclusivity rank: frex <- 1 / (w / freq_rank + (1 - w) / ex_rank), where w is frex_weight.
Lift: The overall frequency of each word is its total count divided by the total count of all words in the dataset. Lift is each word’s probability within a topic divided by that overall frequency.
Score: The score starts from the logarithm of the topic-word probabilities. For each word, the average log probability across all topics serves as its baseline level. The score for a topic and word is the difference between the word’s log probability in that topic and its average log probability, multiplied by beta (the word’s probability in that topic).
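
The four calculations described above can be sketched in a few lines of base R. This is a minimal illustration, not the function's actual implementation: sketch_labels and the toy inputs are hypothetical, ties are ranked with the default averaging of rank(), w is applied exactly as in the FREX formula given in this section, and beta is assumed to be strictly positive so that log(beta) is finite.

```r
# Hypothetical helper illustrating the four metrics (not the package's code).
sketch_labels <- function(beta, vocab, wordcounts, n = 8, w = 0.5) {
  # Rank the values in each row (topic) and scale the ranks to (0, 1].
  row_rank <- function(m) t(apply(m, 1, rank)) / ncol(m)

  # Highest probability: the topic-word probabilities themselves.
  prob <- beta

  # FREX: weighted harmonic mean of frequency rank and exclusivity rank.
  excl <- sweep(beta, 2, colSums(beta), "/")  # share of each word's total probability
  freq_rank <- row_rank(beta)
  ex_rank <- row_rank(excl)
  frex <- 1 / (w / freq_rank + (1 - w) / ex_rank)

  # Lift: topic-word probability divided by the word's overall frequency.
  overall <- wordcounts / sum(wordcounts)
  lift <- sweep(beta, 2, overall, "/")

  # Score: beta * (log beta - mean log beta across topics).
  logbeta <- log(beta)  # assumes beta > 0 everywhere
  score <- beta * sweep(logbeta, 2, colMeans(logbeta), "-")

  # Top n terms per topic; the result has one row per topic.
  top_terms <- function(m) {
    ord <- apply(m, 1, order, decreasing = TRUE)  # words x topics
    top <- ord[seq_len(n), , drop = FALSE]        # n x topics
    t(matrix(vocab[top], nrow = n))               # topics x n
  }
  lapply(list(prob = prob, frex = frex, lift = lift, score = score), top_terms)
}

# Toy inputs (hypothetical values, for illustration only).
vocab <- c("tax", "vote", "school", "health")
beta <- rbind(
  c(0.50, 0.30, 0.10, 0.10),
  c(0.05, 0.15, 0.40, 0.40)
)
wordcounts <- c(120, 90, 100, 110)

res <- sketch_labels(beta, vocab, wordcounts, n = 2)
res$frex  # one row per topic, n columns of terms
```

Ranking within each row before combining frequency and exclusivity puts both quantities on the same 0 to 1 scale, which is why the harmonic mean can weigh them against each other directly.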