Bayes Rule
$$P(\text{cause} \mid \text{evidence}) = \frac{P(\text{evidence} \mid \text{cause}) \, P(\text{cause})}{P(\text{evidence})}$$

In words, it becomes:

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Normalization}}$$

where $P(\text{cause} \mid \text{evidence})$ is the Posterior, $P(\text{evidence} \mid \text{cause})$ is the Likelihood, $P(\text{cause})$ is the Prior, and $P(\text{evidence})$ is the Normalization.
Evidence is what we know -- the given feature.
Cause is the class we are trying to classify the item into.
This is useful because the likelihood $P(\text{evidence} \mid \text{cause})$ is usually more stable than the posterior $P(\text{cause} \mid \text{evidence})$.
In the example "We see a teal bike. What is the chance that it's a Veoride?", Veoride likes teal and regular bike owners don't. So the probability of teal given Veoride is stable, while the probability of Veoride given teal depends on what mix of bikes we happen to be looking at.
MAP Estimate
The Maximum a Posteriori (MAP) Estimate chooses the type with the highest posterior probability:
- Suppose type $c$ comes from a set of types $\{c_1, \dots, c_m\}$.
- MAP picks the type $\hat{c}$ such that $P(c \mid e)$ is highest:
$$\hat{c} = \operatorname{argmax}_{c} \; P(c \mid e)$$
- where $\operatorname{argmax}_c$ returns the index (value of $c$) that gives the largest value.
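A minimal Python sketch of the MAP decision, using made-up posterior values for the bike example above:

```python
# Hypothetical posteriors P(c | evidence) for each candidate type c.
posterior = {"veoride": 0.89, "other": 0.11}

# MAP estimate: the type with the highest posterior probability (argmax).
map_estimate = max(posterior, key=posterior.get)
print(map_estimate)  # -> veoride
```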
Ignoring normalization factor
Bayesian estimation often works with the equation
$$P(\text{cause} \mid \text{evidence}) \propto P(\text{evidence} \mid \text{cause}) \, P(\text{cause})$$
The normalization is often ignored because we are only trying to determine the relative probabilities.
$P(\text{evidence})$ is often the same within one comparison. That is to say, for all the candidate causes being compared against the same evidence, the denominator does not change; it remains static. So we only care about what the posterior is proportional to: $P(\text{evidence} \mid \text{cause}) \, P(\text{cause})$.
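A small sketch of why this works: the denominator is shared by every candidate cause, so comparing the unnormalized products $P(\text{evidence} \mid \text{cause}) \, P(\text{cause})$ picks the same winner. The likelihoods and priors below are made-up numbers.

```python
# Made-up likelihoods P(teal | c) and priors P(c) for two candidate causes.
likelihood = {"veoride": 0.9, "other": 0.05}
prior = {"veoride": 0.3, "other": 0.7}

# Unnormalized posterior scores: P(evidence | c) * P(c).
score = {c: likelihood[c] * prior[c] for c in prior}
print(max(score, key=score.get))  # same argmax as with the normalized posterior
```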
MLE Estimate
The Maximum a Posteriori (MAP) Estimate reflects the relative frequencies of the two underlying causes/types by including the prior $P(\text{cause})$.
Maximum Likelihood Estimate
- Maximizes the likelihood: $\hat{c} = \operatorname{argmax}_{c} \; P(e \mid c)$
- Pros: can be a sensible choice if we have poor information about the prior probabilities
- Cons: inaccurate if the prior probabilities of different causes are very different.
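A one-line contrast with MAP, again with hypothetical numbers: MLE drops the prior and maximizes the likelihood alone, so it can disagree with MAP when the priors are very skewed.

```python
# Hypothetical likelihoods P(teal | c); MLE ignores the prior P(c) entirely.
likelihood = {"veoride": 0.9, "other": 0.05}
mle_estimate = max(likelihood, key=likelihood.get)
print(mle_estimate)  # -> veoride, regardless of how rare Veoride bikes are
```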
Naive Bayes
Naive Bayes deals with multiple pieces of evidence ($e_1, e_2, \dots, e_n$).
- MAP Estimate: $\hat{c} = \operatorname{argmax}_c \; P(c \mid e_1, \dots, e_n)$
- Assume that $e_1, \dots, e_n$ are conditionally independent given $c$:
$$P(e_1, \dots, e_n \mid c) = \prod_{i} P(e_i \mid c)$$
- So the Naive Bayes equation is:
$$\hat{c} = \operatorname{argmax}_c \; P(c) \prod_{i} P(e_i \mid c)$$
In larger dimensions, where the evidence is a long vector of features (e.g. all the words in a document), this independence assumption keeps the number of probabilities we must estimate manageable.
E.g. Classify UK vs. US English given an input document.
We need to estimate the following from training data:
- The likelihoods $P(W \mid \text{UK})$ and $P(W \mid \text{US})$ for each word type $W$
- The priors $P(\text{UK})$ and $P(\text{US})$

So there are about $2M + 2$ probabilities to estimate, where $M$ is the number of word types, whereas the full joint distribution is exponentially large in $M$.
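A minimal sketch of the resulting classifier; the vocabulary and all the probabilities below are invented for illustration, not real estimates.

```python
# Hypothetical priors and per-word likelihoods P(word | class).
priors = {"UK": 0.5, "US": 0.5}
likelihoods = {
    "UK": {"the": 0.06, "colour": 0.004, "lorry": 0.001, "color": 0.0001, "truck": 0.0005},
    "US": {"the": 0.06, "colour": 0.0001, "lorry": 0.0001, "color": 0.004, "truck": 0.001},
}

def classify(words):
    """Naive Bayes: argmax over classes of P(c) * prod_i P(w_i | c)."""
    scores = {}
    for c in priors:
        score = priors[c]
        for w in words:
            score *= likelihoods[c].get(w, 1e-6)  # crude fallback for unseen words
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("the colour of the lorry".split()))  # -> UK
```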
In the following sections, we look at how to find the likelihood of each word appearing in documents of a certain type.
Simple Estimation of Probabilities from Data
Given documents of class $C$ from the training data:
- First try: the naive estimate is
$$P(W \mid C) = \frac{\text{count}(W)}{n}$$
where $\text{count}(W)$ is the number of times word $W$ appears in those documents and $n$ is the total number of words in them.
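A small sketch of this count-based estimate; the training text is a placeholder, not data from the course.

```python
from collections import Counter

# Placeholder stand-in for the class-C training documents.
training_text = "the colour of the lorry was grey".split()

counts = Counter(training_text)   # count(W) for each word type
n = len(training_text)            # total number of words in the class-C data

def p_word_given_class(word):
    return counts[word] / n

print(p_word_given_class("the"))     # 2/7
print(p_word_given_class("pickup"))  # 0.0 -- unseen words get zero (see Overfitting and Smoothing)
```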
Underflow and Log Transformation
Problem: Underflow
Using the simple estimates, multiplying together the probabilities of many words (especially uncommon ones, which have very small probabilities) can produce numbers too small for standard floating-point storage.
Log Transformation
The log transformation gives better precision on small values and loses precision on large values.
- $P(c) \prod_i P(e_i \mid c)$ becomes $\log P(c) + \sum_i \log P(e_i \mid c)$
That is, our naive Bayes algorithm will be maximizing
$$\log P(c) + \sum_{i} \log P(e_i \mid c)$$
E.g. "The" and "my" are very common. We don't care about what their probability differences are. But we do care about the appearance probability of less commonly used words.
Overfitting and Smoothing
Problem: Overfitting
Words that didn't appear in the training data get an estimated probability of zero. Words that were uncommon in the training data get inaccurate estimates. Zeroes destroy the Naive Bayes algorithm: a single zero factor wipes out the whole product (and $\log 0$ is undefined).
Laplace Smoothing
Smoothing assigns non-zero probabilities to unseen words. Note that it's tricky because all probabilities of words must add up to one.
Let:
- $\text{UNK}$ = all unseen words (treated as a single word type)
- $V$ = number of word types seen in the training data
- $n$ = number of words in our class-$C$ training data

With smoothing constant $\alpha$, the Laplace-smoothed estimates are
$$P(W \mid C) = \frac{\text{count}(W) + \alpha}{n + \alpha(V + 1)}, \qquad P(\text{UNK} \mid C) = \frac{\alpha}{n + \alpha(V + 1)}$$

If we only used $\text{count}(W) + \alpha$ in the numerator and $n$ in the denominator, the probabilities would add up to more than 1. The extra $\alpha(V + 1)$ in the denominator forces it back to a proper distribution.
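A sketch of the smoothed estimate as written above, with placeholder training text and $\alpha = 1$:

```python
from collections import Counter

training_text = "the colour of the lorry was grey".split()  # placeholder data
alpha = 1.0

counts = Counter(training_text)
n = len(training_text)   # words in the class-C training data
V = len(counts)          # word types seen in training

def p_smoothed(word):
    # Every unseen word falls into the single UNK type: alpha / (n + alpha*(V+1)).
    return (counts[word] + alpha) / (n + alpha * (V + 1))

print(p_smoothed("the"))     # (2 + 1) / (7 + 7) = 3/14
print(p_smoothed("pickup"))  # 1/14 -- non-zero even though unseen
# The V seen types plus the one UNK type sum to 1:
print(sum(p_smoothed(w) for w in counts) + alpha / (n + alpha * (V + 1)))
```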
Performance of Laplace smoothing
- overestimates probability of unseen words
- underestimates probability of common words
Deleted Estimation
Used to directly measure estimation errors. The steps are as follows:
- Divide our training data into two halves 1 and 2.
- Pick a specific count $r$ in the first half.
- Suppose that $w_1, \dots, w_k$ are the words that occur $r$ times in the first half. We can estimate the corrected count for this group of words as the average of their counts in the second half.
- Specifically, suppose $x_i$ is the count of $w_i$ in the second half of the dataset.
- The corrected count value is then
$$\frac{x_1 + x_2 + \dots + x_k}{k}$$
- Further, make the estimate symmetrical:
  - Compute the corrected count as above, and again reversing the roles of the two halves.
  - Assuming both halves contain roughly the same number of words, average the two corrected counts to get the estimate of the true count for a word with observed count $r$.
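A rough sketch of the procedure, assuming two placeholder half-corpora (not real training data):

```python
from collections import Counter

half1 = "a a b c c c d".split()   # placeholder first half
half2 = "a b b c c d d".split()   # placeholder second half

def corrected_counts(first, second):
    """For each count r seen in `first`, average the `second`-half counts of
    the words that occur exactly r times in `first`."""
    c1, c2 = Counter(first), Counter(second)
    by_r = {}
    for word, r in c1.items():
        by_r.setdefault(r, []).append(c2[word])
    return {r: sum(xs) / len(xs) for r, xs in by_r.items()}

forward = corrected_counts(half1, half2)
backward = corrected_counts(half2, half1)

# Symmetrical estimate: average the two directions (for counts seen in both),
# assuming the halves hold roughly the same number of words.
symmetric = {r: (forward[r] + backward[r]) / 2 for r in forward.keys() & backward.keys()}
print(forward, backward, symmetric)
```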