Bayes Rule

$$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$$

In words, it becomes:
$$P(\text{cause} \mid \text{evidence}) = \frac{P(\text{evidence} \mid \text{cause})\,P(\text{cause})}{P(\text{evidence})}$$

Evidence is what we know -- the given feature.
Cause is the class we are trying to classify the item into.

This is useful because P(evidence|cause) tends to be stable, but P(cause|evidence) is less stable, because it depends on the set of possible causes and how common they are right now.

In a "We see a teal bike. What is the chance that it's a veoride?" example, Veoride likes teal and regular bike owners don't. So probability of teal given Veoride is stable, while probability of Veoride given Teal depends on what types of bikes we are looking at.

MAP Estimate

The Maximum a Posteriori (MAP) estimate chooses the type with the highest posterior probability:
$$\hat{C}_{\text{MAP}} = \arg\max_{C} P(C \mid \text{evidence})$$

Ignoring normalization factor
Bayesian estimation often works with the equation
$$P(\text{cause} \mid \text{evidence}) \;\propto\; P(\text{evidence} \mid \text{cause})\,P(\text{cause})$$

The normalization is often ignored because we only need relative probabilities. Within a single comparison, P(evidence) is the same for every candidate cause: whatever item we are classifying, the denominator stays fixed across the candidates, so we only need a quantity proportional to the posterior.
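
Concretely, since the denominator is shared by every candidate cause, dropping it does not change which cause wins the comparison:

$$\arg\max_{C} P(C \mid \text{evidence}) = \arg\max_{C} \frac{P(\text{evidence} \mid C)\,P(C)}{P(\text{evidence})} = \arg\max_{C} P(\text{evidence} \mid C)\,P(C)$$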

MLE Estimate

The Maximum a Posteriori (MAP) estimate reflects the relative frequencies of the underlying causes/types by including the prior P(cause). If we know that all causes/types are equally likely, then P(cause) is the same value for each and can be dropped from the comparison. We have:

Maximum Likelihood Estimate
$$P(\text{cause} \mid \text{evidence}) \;\propto\; P(\text{evidence} \mid \text{cause})$$
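
To see how the two estimates can disagree when the priors are not equal, here is a toy comparison in code; the priors and likelihoods are invented for illustration:

```python
# Toy MAP vs. MLE comparison (all numbers are made up).
priors      = {"veoride": 0.05, "other": 0.95}   # P(cause): Veorides are rare here
likelihoods = {"veoride": 0.90, "other": 0.10}   # P(teal | cause)

# MLE ignores the prior; MAP weights the likelihood by the prior.
mle_choice = max(likelihoods, key=likelihoods.get)
map_choice = max(priors, key=lambda c: likelihoods[c] * priors[c])

print(mle_choice)   # 'veoride' -- best explains the evidence on its own
print(map_choice)   # 'other'   -- the strong prior against Veoride flips the answer
```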

Naive Bayes

Naive Bayes deals with multiple pieces of evidence (A, B, ...) about the same cause/class C. It is based on applying Intro Probability and #Bayes Rule with a strong independence assumption between features.

In larger dimensions, where C causes effects E1, ..., En:
$$P(C \mid E_1, \dots, E_n) \;\propto\; P(E_1 \mid C)\,P(E_2 \mid C)\cdots P(E_n \mid C)\,P(C) = P(C)\prod_{k=1}^{n} P(E_k \mid C)$$
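
This combines #Bayes Rule with the normalization dropped and the "naive" conditional-independence assumption on the likelihood,

$$P(E_1, \dots, E_n \mid C) = \prod_{k=1}^{n} P(E_k \mid C),$$

which is what lets us estimate one likelihood per effect instead of the full joint distribution.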

E.g. classify UK vs. US English given an input document W1, W2, ..., Wn.

We need the following pre-knowledge from training data:
  • the prior P(C) for each class
  • the likelihood P(W|C) for each word type W and each class C

So there are O(n) probabilities to estimate, where n is the number of word types, whereas the full joint distribution is O(2^n).

In the following sections, we look at how to estimate the likelihood of each word appearing in a document of a given class, P(W|C), as well as the underlying problems and their solutions.
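
As a concrete sketch of the classification rule itself, assuming the priors and word likelihoods have already been estimated (the tables below are hypothetical stand-ins):

```python
# Naive Bayes document classification sketch; the probability tables are
# hypothetical stand-ins for values estimated from training data.
priors = {"UK": 0.5, "US": 0.5}
likelihood = {
    "UK": {"the": 0.06, "colour": 0.004, "lorry": 0.002},
    "US": {"the": 0.06, "color": 0.004, "truck": 0.002},
}

def classify(words, priors, likelihood, unk_prob=1e-6):
    """Score each class by P(C) * prod_k P(W_k|C) and return the best class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for w in words:
            # Unseen words get a tiny placeholder probability just to keep the
            # sketch running; smoothing (later) handles this properly.
            score *= likelihood[c].get(w, unk_prob)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(["the", "lorry"], priors, likelihood))   # -> 'UK'
```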

Simple Estimation of Probabilities from Data

Given a document W1...Wn and two classes C1 and C2:

First try:
count(W)= number of times W occurs in the documents of class C
n = number of total words in the documents of class C
The naive estimate is $P(W \mid C) = \frac{\text{count}(W)}{n}$
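
A minimal sketch of this first-try estimate on a toy set of documents for one class (the tokens are made up):

```python
from collections import Counter

# Toy documents of a single class C, already tokenized (made-up tokens).
class_docs = [["the", "lorry", "is", "teal"],
              ["the", "colour", "is", "teal"]]

counts = Counter(w for doc in class_docs for w in doc)   # count(W) for each word W
n = sum(counts.values())                                 # total words in class C's documents
p_w_given_c = {w: c / n for w, c in counts.items()}      # P(W|C) = count(W) / n

print(p_w_given_c["teal"])              # 2/8 = 0.25
print(p_w_given_c.get("truck", 0.0))    # unseen word -> 0.0, the problem smoothing fixes
```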

Text Classification


Underflow and Log Transformation

Problem: Underflow

Using the simple estimation, uncommon words have tiny probabilities, and multiplying many word probabilities together quickly produces numbers too small for standard floating-point storage.

Log Transformation

The log transformation gives better precision on small values and loses precision on large values.

That is, our naive Bayes algorithm will be maximizing
$$\log P(C \mid W_1 \dots W_n) \;\propto\; \log P(C) + \sum_{k=1}^{n} \log P(W_k \mid C)$$

E.g. "The" and "my" are very common. We don't care about what their probability differences are. But we do care about the appearance probability of less commonly used words.

Overfitting and Smoothing

Problem: Overfitting

Words that didn't appear in the training data get an estimated probability of zero, and words that were uncommon in the training data get inaccurate estimates. Zeroes destroy the Naive Bayes computation: a single zero factor wipes out the whole product, and its log is negative infinity.

Laplace Smoothing

Smoothing assigns non-zero probabilities to unseen words. Note that it's tricky because the probabilities of all words must still add up to one. In the standard form of Laplace smoothing, we add a constant α to the count of every word type, including a single UNK pseudo-word that stands in for all unseen words:

$$P(W \mid C) = \frac{\text{count}(W) + \alpha}{n + \alpha\,(V + 1)}$$

where V is the number of distinct word types seen in training. If we only added α to the counts of UNK and the seen words without enlarging the denominator, the probabilities would sum to more than 1; the α(V + 1) term in the denominator forces them back into a proper distribution.
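
A sketch of this smoothed estimate on the same toy class documents as before (α = 1 and the tokens are made up):

```python
from collections import Counter

# Laplace-smoothed P(W|C) with a single UNK pseudo-word for unseen words.
alpha = 1.0
class_docs = [["the", "lorry", "is", "teal"],
              ["the", "colour", "is", "teal"]]

counts = Counter(w for doc in class_docs for w in doc)
n = sum(counts.values())      # total words in class C's documents
v = len(counts)               # number of distinct word types seen in training

def p_smoothed(word):
    # The denominator covers the V seen types plus the UNK type, so the
    # smoothed probabilities still sum to one.
    return (counts.get(word, 0) + alpha) / (n + alpha * (v + 1))

print(p_smoothed("teal"))    # (2 + 1) / (8 + 6) = 3/14
print(p_smoothed("truck"))   # unseen word falls into the UNK slot: 1/14, not zero
```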

Performance of Laplace smoothing

Deleted Estimation

Used to directly measure estimation errors. The steps are as follows:

  1. Divide our training data into two halves 1 and 2.
  2. Pick a specific count r in the first half.
  3. Suppose that W1...Wn are words that occur r times in the first half. We can estimate the corrected count for this group of words as the average of their counts in the second half.
    • Specifically, suppose C(Wk) is the count of Wk in the second half of the dataset
    • The corrected count value is $\text{Corr}(r) = \frac{1}{n}\sum_{k=1}^{n} C(W_k)$
  4. Further, make the estimate symmetrical:
    • Assume both halves contain roughly the same number of words, and compute:
    • Corr(r) as above
    • Corr'(r), reversing the roles of the two halves
    • (Corr(r) + Corr'(r)) / 2 as the estimate of the true count for a word with observed count r
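
A small sketch of the procedure on a hypothetical two-half split (the tokens are invented for illustration):

```python
from collections import Counter

# Deleted estimation sketch on a hypothetical corpus split into two halves.
half1 = ["the", "teal", "lorry", "the", "colour"]
half2 = ["the", "the", "teal", "truck", "colour", "colour"]
c1, c2 = Counter(half1), Counter(half2)

def corrected_count(r, first, second):
    # Take the words occurring exactly r times in the first half; their corrected
    # count is the average of those same words' counts in the second half.
    group = [w for w, c in first.items() if c == r]
    return sum(second[w] for w in group) / len(group)

corr_12 = corrected_count(1, c1, c2)      # first half -> second half
corr_21 = corrected_count(1, c2, c1)      # roles of the halves reversed
print((corr_12 + corr_21) / 2)            # symmetrized estimate for r = 1 -> 0.75
```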