1. Introduction
This article explores the fundamental aspects of Latent Dirichlet Allocation (LDA), a widely used unsupervised probabilistic technique for topic modeling. It explains the core principles of LDA, offering an accessible interpretation of the mathematical concepts that govern the model, and walks through the training process using Gibbs sampling with a worked illustration. Real-world applications and potential extensions of the LDA model are also explored.
2. Model Assumptions
- Documents are treated as "bags of words": the order in which the words appear is disregarded.
- There is a fixed set of topics, and each topic is a probability distribution over words.
- Every document is considered a mixture of these topics, with the contribution of each topic given by a probability.
For example:
- Document 1 Topics: 0.5 * Art + 0.4 * History + 0.1 * Physics
- Document 2 Topics: 0.7 * Technology + 0.2 * Business + 0.1 * Politics
- Topic Art: Painting * p1 + Sculpture * p2 + Photography * p3 + ...
- Topic History: War * p1 + Revolution * p2 + Empire * p3 + ...
- Topic Physics: Quantum * p1 + Relativity * p2 + Particle * p3 + ...
- Topic Technology: AI * p1 + Robotics * p2 + Blockchain * p3 + ...
- Topic Business: Startup * p1 + Corporation * p2 + Investment * p3 + ...
- Topic Politics: Democracy * p1 + Policy * p2 + Election * p3 + ...
3. Model Intuition
As stated in Section 2, the fundamental idea behind LDA is that every document in
the text corpus is a mixture of topics, and each word in a document is attributable to one of these topics.
Saying that a document is "generated" by its topic distribution means that
each document is assumed to be produced in the following manner:
A document's topic mixture is determined. For example, we might say that Document 1 is 70% about "sports", 20% about "politics", and 10% about "economics".
Each word in the document is selected. This is done by first choosing a topic based on the document's topic distribution and then choosing a word based on the topic's probability distribution over words. For example, if we choose the "sports" topic for a word in Document 1, we might then choose the word "football" if "football" has a high probability in the "sports" topic. And we do this for each "word slot" of the document.
In this way, the document is "generated" by its given topic distribution, and each word in turn
is generated by the topic distribution of its respective document and the word distribution of its assigned topic.
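To make this generative story concrete, here is a minimal sketch in Python of the process just described. The vocabulary, the number of topics, and the prior values are illustrative assumptions, not part of the model itself.

```python
# A minimal sketch of the LDA generative process; all sizes and words are
# made-up assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["football", "goal", "election", "policy", "market", "stock"]
K, V = 2, len(vocab)          # number of topics, vocabulary size
alpha, beta = 0.5, 0.5        # symmetric Dirichlet priors

# Step 0: each topic is a distribution over the vocabulary (phi).
phi = rng.dirichlet([beta] * V, size=K)          # shape (K, V)

def generate_document(n_words):
    # Step 1: draw the document's topic mixture (theta) from Dirichlet(alpha).
    theta = rng.dirichlet([alpha] * K)
    words = []
    for _ in range(n_words):
        # Step 2a: pick a topic for this word slot from theta.
        z = rng.choice(K, p=theta)
        # Step 2b: pick a word from that topic's word distribution.
        w = rng.choice(V, p=phi[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(10)
print("topic mixture:", np.round(theta, 2))
print("document:", " ".join(words))
```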
The task of LDA is to reverse this process. Given a corpus of documents, LDA tries to
figure out the topic distributions for each document and the word distributions for each topic
that would have most likely resulted in the observed corpus.
4. Theorem Explanation
4.1 Formula
\(P(\boldsymbol{W},\boldsymbol{Z},\theta,\varphi;\alpha,\beta) = \prod_{i=1}^{K}P(\varphi_i;\beta)\,\prod_{j=1}^{M}P(\theta_j;\alpha)\,\prod_{t=1}^{N}P(Z_{j,t} \mid \theta_j)\, P(W_{j,t} \mid \varphi_{Z_{j,t}})\)
4.2 Explanation
The formula represents the joint probability distribution for an LDA model, denoted as \(P(\boldsymbol{W},\boldsymbol{Z},\theta,\varphi;\alpha,\beta)\). Here's what each part means:
\(P(\boldsymbol{W},\boldsymbol{Z},\theta,\varphi;\alpha,\beta)\) is the joint probability of the observed words \(\boldsymbol{W}\), the latent (or hidden) topic assignments \(\boldsymbol{Z}\), the per-document topic proportions \(\theta\), and the per-topic word probabilities \(\varphi\), given the Dirichlet prior parameters \(\alpha\) and \(\beta\).
\(\prod_{j=1}^{M}P(\theta_j;\alpha)\) is the probability of the topic distribution for each document \(j\) under a Dirichlet prior \(\alpha\).
\(\prod_{i=1}^{K}P(\varphi_i;\beta)\) is the probability of the word distribution for each topic \(i\) under a Dirichlet prior \(\beta\).
\(\prod_{t=1}^{N}P(Z_{j,t} \mid \theta_j)\) is the probability of the topic assignments \(Z_{j,t}\) for each word \(t\) in each document \(j\), given the topic distribution \(\theta_j\) of that document.
\(P(W_{j,t} \mid \varphi_{Z_{j,t}})\) is the probability of each word \(W_{j,t}\) in each document \(j\), given the word distribution \(\varphi_{Z_{j,t}}\) of the assigned topic \(Z_{j,t}\) for that word.
The aim of the LDA model is to find values for \(\boldsymbol{Z}\), \(\theta\), and \(\varphi\) that maximize this joint probability, given the observed words \(\boldsymbol{W}\) and the priors \(\alpha\) and \(\beta\). Due to the complexity of this problem, approximation methods like Gibbs sampling or variational inference are often used to estimate these values.
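As a small numeric illustration of how these factors combine, the following sketch evaluates the log of the joint probability above for a corpus consisting of a single toy document. Every value in it (the topic mixture, the word distributions, and the assignments) is made up purely for demonstration.

```python
# Toy evaluation of the LDA joint probability for one document with three
# word slots; all numbers below are invented for illustration.
import numpy as np
from scipy.stats import dirichlet

K, V = 2, 4                     # topics, vocabulary size
alpha = np.full(K, 0.5)         # Dirichlet prior on theta
beta = np.full(V, 0.5)          # Dirichlet prior on each phi_i

theta = np.array([0.7, 0.3])                 # document's topic mixture
phi = np.array([[0.4, 0.4, 0.1, 0.1],        # topic 0's word distribution
                [0.1, 0.1, 0.4, 0.4]])       # topic 1's word distribution
Z = [0, 0, 1]                                # topic assignment per word slot
W = [0, 1, 3]                                # observed word ids

log_p = dirichlet.logpdf(theta, alpha)                           # P(theta_j; alpha)
log_p += sum(dirichlet.logpdf(phi[i], beta) for i in range(K))   # P(phi_i; beta)
for z, w in zip(Z, W):
    log_p += np.log(theta[z])      # P(Z_{j,t} | theta_j)
    log_p += np.log(phi[z, w])     # P(W_{j,t} | phi_{Z_{j,t}})

print("log joint probability:", log_p)
```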
5. Training - Gibbs Sampling
5.1 Objective
Intuitively, the aim of Gibbs sampling in LDA is to arrive at topic assignments that are as "monochromatic" as possible: the topic distribution within each document and the word distribution within each topic should be concentrated. This corresponds to a desirable model in which each document is characterized by one or a few topics, and, similarly, each word is strongly associated with a limited number of topics.
5.2 Procedure
The following steps outline a typical training procedure (a runnable sketch follows the list):
- Initialize the model by randomly assigning topics to each word in each document.
- Iterate sequentially through each word in each document. Holding all other topic assignments fixed, score each topic by how often the current word is assigned to that topic across the corpus and how often that topic appears in the current document (excluding the current word's own assignment).
- Apply a smoothing operation to prevent zero probabilities. The smoothing values added here correspond to the Dirichlet priors.
- Based on the computed probabilities, decide the topic for the current word via a random draw.
- As the iterations progress, the topic assignments stabilize and the estimated distributions approach the underlying distribution.
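The sketch below implements these steps as a minimal collapsed Gibbs sampler. The toy corpus, the number of topics, the priors, and the iteration count are all arbitrary assumptions chosen to keep the example small; it is an illustration of the procedure, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: the documents, K, priors, and iteration count are assumptions.
docs = [["football", "goal", "goal", "match"],
        ["election", "policy", "vote"],
        ["football", "election", "goal", "policy"]]
vocab = sorted({w for d in docs for w in d})
w2id = {w: i for i, w in enumerate(vocab)}
corpus = [[w2id[w] for w in d] for d in docs]

K, V, M = 2, len(vocab), len(corpus)   # topics, vocabulary size, documents
alpha, beta = 0.5, 0.1                 # symmetric Dirichlet priors
n_iter = 200

# Count tables plus a random initial topic assignment for every word.
ndk = np.zeros((M, K))                 # topic counts per document
nkw = np.zeros((K, V))                 # word counts per topic
nk = np.zeros(K)                       # total words assigned to each topic
Z = [[rng.integers(K) for _ in doc] for doc in corpus]
for j, doc in enumerate(corpus):
    for t, w in enumerate(doc):
        z = Z[j][t]
        ndk[j, z] += 1
        nkw[z, w] += 1
        nk[z] += 1

for _ in range(n_iter):
    for j, doc in enumerate(corpus):
        for t, w in enumerate(doc):
            # Remove the current word's assignment from the counts.
            z = Z[j][t]
            ndk[j, z] -= 1
            nkw[z, w] -= 1
            nk[z] -= 1
            # Score each topic: (how much document j likes the topic) times
            # (how much the topic likes word w), smoothed by alpha and beta.
            p = (ndk[j] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            # Draw the new topic at random in proportion to the scores.
            z = rng.choice(K, p=p / p.sum())
            Z[j][t] = z
            ndk[j, z] += 1
            nkw[z, w] += 1
            nk[z] += 1

# Point estimates of the document-topic and topic-word distributions.
theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
print("document-topic mixtures:\n", np.round(theta, 2))
print("topic-word distributions:\n", np.round(phi, 2))
```

Note that the score for each topic is the product of a document-level count and a corpus-level count, each smoothed by its Dirichlet prior, mirroring the second and third steps above.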
5.3 Hyperparameters
The model depends on several critical hyperparameters:
- N - The number of iterations.
- K - The number of topics. A higher value allows for more fine-grained topic identification but may lead to overfitting. A lower value yields broader, less specific topics.
- Dirichlet priors - These parameters control the "smoothness" of the topic and word distributions. A lower value results in sparser distributions, i.e., fewer topics heavily contribute to each document and fewer words significantly contribute to each topic. Conversely, a higher value results in smoother, denser distributions, i.e., more topics contribute comparably to each document, and more words contribute comparably to each topic. Some implementations also re-estimate the Dirichlet priors as training proceeds, although this adjustment is not present in all code implementations. A small demonstration of this effect follows the list.
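The following snippet, using only NumPy, illustrates the effect of the prior's magnitude by drawing a single topic mixture from Dirichlet priors of different values; the choice of five topics is arbitrary.

```python
# How the Dirichlet prior value controls sparsity: smaller values concentrate
# mass on a few topics, larger values spread it more evenly.
import numpy as np

rng = np.random.default_rng(0)
K = 5  # arbitrary number of topics for the demonstration
for a in (0.1, 1.0, 10.0):
    sample = rng.dirichlet([a] * K)
    print(f"alpha = {a:>4}:", np.round(sample, 2))
# Typically, alpha = 0.1 puts most of the mass on one or two topics,
# while alpha = 10 yields a nearly uniform mixture.
```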
5.4 Initialization of Dirichlet Priors
There are several strategies for initializing the Dirichlet priors:
- Random initialization.
- Small value initialization - This strategy generally leads to sparser results.
- Initialization based on prior knowledge.
- Empirical Bayes method - This involves maximizing the marginal likelihood of the data, which tends to perform better than relying on prior knowledge.
- Cross-validation to select the best-performing initialization (a sketch of cross-validating over candidate prior values follows this list).
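As one possible realization of the cross-validation strategy, the sketch below uses scikit-learn, whose LDA implementation is fit by variational inference rather than Gibbs sampling, to grid-search over candidate prior values. The tiny corpus and the candidate values are assumptions made for illustration.

```python
# Hedged sketch: selecting Dirichlet priors by cross-validation, scored by
# scikit-learn's approximate log-likelihood. Corpus and candidates are made up.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

docs = ["football goal match referee",
        "election policy vote parliament",
        "goal match striker football",
        "vote election policy debate"]
X = CountVectorizer().fit_transform(docs)

param_grid = {
    "doc_topic_prior": [0.1, 0.5, 1.0],    # alpha candidates
    "topic_word_prior": [0.01, 0.1, 1.0],  # beta candidates
}
search = GridSearchCV(
    LatentDirichletAllocation(n_components=2, random_state=0),
    param_grid,
    cv=2,   # uses the model's score (approximate log-likelihood) on held-out folds
)
search.fit(X)
print("best priors:", search.best_params_)
```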
6. Basic Applications of LDA
6.1 Topic Modeling
The basic use of LDA, as stated above, takes a set of documents as input and produces a topic distribution for each document. However, LDA does not output named topics: each discovered topic is only a distribution over words, with no label attached. By studying the words that dominate each topic, it is possible to assign names and labels to the topics, but this requires further steps and techniques. A minimal usage sketch follows.
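A minimal sketch of this workflow using scikit-learn's LDA implementation (which trains by variational inference rather than Gibbs sampling) might look as follows; the corpus, the number of topics, and the choice of four top words per topic are assumptions made for the example.

```python
# Fit LDA on a toy corpus and inspect the top words of each topic; the topic
# "names" are only whatever a human reads into those words.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the striker scored a late goal in the football match",
        "parliament passed the new election policy after the vote",
        "the football team trained before the match",
        "voters debated the policy before the election"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)            # per-document topic mixtures

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {', '.join(top)}")    # inspect top words to label topics
print("document-topic mixtures:\n", doc_topics.round(2))
```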
6.2 Document Clustering
LDA can be used to cluster large collections of text documents into topics, which can help with tasks such as information retrieval, document classification, and recommendation systems.
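For instance, the document-topic mixtures produced by LDA can be fed directly into a standard clustering algorithm. In the sketch below the mixtures are simulated with random Dirichlet draws so the snippet is self-contained; in practice they would come from a fitted model such as the one in Section 6.1.

```python
# Cluster documents by their topic mixtures; doc_topics is simulated here.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
K = 5
doc_topics = rng.dirichlet([0.1] * K, size=100)   # 100 documents x K topics

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(doc_topics)
print("cluster sizes:", np.bincount(labels))
```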
6.3 Content-based recommendation
LDA can be used to identify the topics that a user is interested in based on their past behavior or preferences, and then recommend similar content that matches those topics.
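One simple way to realize this, sketched below under the assumption that document-topic mixtures are already available (simulated here with random draws), is to average the topic mixtures of items the user liked and recommend the documents whose mixtures are most similar by cosine similarity.

```python
# Content-based recommendation from topic mixtures; doc_topics is simulated.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
doc_topics = rng.dirichlet([0.1] * 5, size=50)    # 50 documents x 5 topics

liked = doc_topics[[3, 7]]                        # items the user engaged with
user_profile = liked.mean(axis=0, keepdims=True)  # user's inferred topic taste
scores = cosine_similarity(user_profile, doc_topics).ravel()
# In practice, items the user has already seen would be excluded here.
print("top recommendations:", np.argsort(scores)[::-1][:5])
```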
6.4 Sentiment Analysis
LDA can be used to analyze the sentiment of a piece of text by identifying the topics that are most commonly associated with positive or negative sentiment.
7. Extended Applications of LDA
As described so far, LDA models the distribution of words in a corpus of text, and it is most often applied directly to text. With appropriate feature transformations, however, it can also handle other kinds of problems.
7.1 Fraud Detection
- Each transaction has features such as date, time, type, amount, and location.
- Transform these into bag-of-words tokens, such as "time: 9:00 am", "amount: $1000", "location: Champaign, IL". Each such feature-value pair is treated as a single word in the bag-of-words model.
- Applying LDA gives a topic-mixture distribution for each transaction.
- Cluster the transactions based on their topic distributions.
- Human effort is then needed to recognize which topic or cluster is suspicious (a sketch of the feature transformation follows this list).
- An in-depth example can be found here.
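A hedged sketch of the feature transformation described above follows; the transaction records, the bucketing scheme, and the token format are invented for illustration.

```python
# Turn each transaction into a "document" of categorical feature tokens,
# which can then be fed to any bag-of-words LDA pipeline.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

transactions = [
    {"time": "09:00", "type": "transfer", "amount_bucket": "1000+", "location": "Champaign_IL"},
    {"time": "02:00", "type": "withdrawal", "amount_bucket": "1000+", "location": "unknown"},
    {"time": "10:00", "type": "purchase", "amount_bucket": "0-50", "location": "Champaign_IL"},
]
# Each transaction becomes a pseudo-document of "feature:value" tokens.
pseudo_docs = [" ".join(f"{k}:{v}" for k, v in t.items()) for t in transactions]

X = CountVectorizer(token_pattern=r"\S+").fit_transform(pseudo_docs)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
print(topics.round(2))   # cluster or inspect these mixtures to flag unusual patterns
```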
7.2 Medical diagnosis
- Data comes from patients' medical histories and case symptoms, such as "cough", "asthma", "breathlessness", and "wheezing".
- Medications used and lab results can also be part of the data; with proper feature processing, countless features can be included.
- Apply LDA and retrieve a topic distribution for each patient.
- From the result, human effort is required to recognize which topics correspond with high probability to specific medical conditions.
- An in-depth example can be found here.