Published on 2014-11-17

Bayes Classifier

Bayes Classifier tries to minimises the probability of misclassification.

Background Knowlege

Bayes' theorem

Bayes' theorem is stated mathematically as the following simple form:

Bayes' Theorem

For proposition A and evidence or background B,

  • P(A), the prior probability, is the initial degree of belief in A.
  • P(A|B), the conditional probability, is the degree of belief in A, having taken B into account.
  • the quotient P(B|A)/P(B) represent the support B provide for A.

Another form of Bayes' Theorem that is generally encountered when looking at two competing statements or hypotheses is:

Bayes' Theorem

Because P(B) = P(B|A)P(A) + P(BA)PA), in fact.

Chain Rule

In probability theory, the Chain Rule permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities.

Consider an indexed set of sets A1 to An, we can apply the definition of conditional probability bo obtain the value of this member of the joint distribution.

Chain Rule

For example:

Chain Rule example


We want to estimate P(C|F), in which C is a variable stands for a class, and F is a vector variable represents values of a set of attributes or features. With Bayes' Theorem, we know

P(C|F) = P(F|C)P(C)/P(F)

  • P(C = c), the probability of C for a specific value c, is easy to estimate by frequency of c
  • P(F = f), the probability of F for a specific value f, is a constant in fact, and could be ommited in condition that we consider P(C = ci|F = f) only.

Thus the problem is now to estimate P(F|C).

Naive Bayes Classifier

In P(F|C), the feature variable F consists of a set of features Fi with 1 <= i <= n, thus P(F|C) could be rewritten as P(F1, ..., Fn|C). Using the Chain Rule for repeated applications of the definition of conditional probability, we get:

P(F|C) = P(F1, ..., Fn|C) = P(F1|C)P(F2|C, F1)...P(Fn|C, F1, ..., Fn-1)

The Naive Assumption: assume that each feature Fi is conditionally independent of every other feature Fj for ji, given the category C. Therefore:

  • P(F2|C, F1) = P(F2|C)
  • P(F3|C, F1, F2) = P(F3|C)
  • ...
  • P(Fn|C, F1, ..., Fn-1) = P(Fn|C)


P(F|C) = P(F1|C)P(F2|C)...P(Fn|C)

Now we get the estimation probability of category c. And with the same procedure we could get the estimation of every category, in which the category with maximum estimation will be picked as the classified category.

Bayesian Network

Bayesian Network, also known as Bayes(ian) Belief Network or Belief Network, is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed asyclic graph (DAG). Bayesian Networks work well even if the features are not independent.

Formally, Bayesian networks are DAGs whose

  • Nodes represent random variables in the Bayesian sense, e.i., they may be observable quantities, latent variables, unknown parameters or hypotheses.
  • Edges reoresent conditional dependencies, thus nodes that are not connected represent variables that are conditionally independent of each other.

In a Bayesian network, each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives, as output, the probability of the variable represented by the node. This probability function associated with the node could be represented by a probability table.

  • If node X has no parent node, then the probability table consists of prior probability P(X) only.
  • If node X has one parent node Y, then its probability table also includes conditional probability P(X|Y).
  • Else, if node X has more parent nodes {Yi for 1 <= i <= k}, then its probability table also includes conditional probability P(X|Y1, ..., Yk)

Build a Bayesian Network

  1. Build the network structure: divide the variables into cause variables and outcome variables, and some variables might be both cause variables for ones and outcome variables for others. An expert might be helpful. In applications the task of defining the network is too complex for humans, see Structure Learning.
  2. Once the structure is built, the probability associated with a node is easy to compute.


  • Bayesian Network could handle dependent features.
  • Building Bayesian Network is difficult and costing, but once built, it is easy to change.
  • Bayesian Network is good at processing incomplete data: the omissive feature of an instance could be substituted by sum of all probable values of the feature.
  • Bayesian Network is robust to overfitting.

Artificial Neural Networks

Artificial Neural Networks (ANNs) are computational models inspired by an animal's central nervous systems, and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown.


The perceptron is an algorithm for learning a binary classifier: a function that maps its input x (a real-valued vector) to an output value f(x) (a single binary value):


where w is a vector of real-valued weights, w.x is the dot product, and b is the 'bias', a constant term that does not depend on any input values.


  1. Initialise the weights w with 0 or random values as w(0).
  2. For each example j = {xj, yj} in training set D, perform the following steps
    • Calculate the actual output: y^j(k) = f[w(k), xj]
    • Update the weights: for each weight wi, wi(k+1) = wi(k) + λ(yj - y^j(k)xji)
  3. For offline learning, the step 2 may be repeated until the iteration error is less than a user-specified error threshold γ, or a predetermined number of iterations have been completed.

The parameter λ is called learning rate.


Artificial neural network types vary from those with only one or two layers of single direction logic, to complicated multi-input many directional feedback loops and layers.

For more information about ANN history and approches, see Wikipedia.