*Instructions: you must type your solutions and submit them to Blackboard. Only PDF format is accepted. The assignment is to be completed by a group of two. Name the solutions file Assign1_FirstName1_FirstName2. You need to submit both the problem set and the programming exercises.*

# Problem Set:

# I- Curve Fitting Problem

**PS-I-1**: Assume the curve fitting problem is solved using the polynomial function y(x, w) and the sum-of-squares error function E(w). Derive the optimal solution w* that minimizes E(w).
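
Minimizing the sum-of-squares error leads to a linear system in the polynomial coefficients (the normal equations). The sketch below is an illustrative check, not the required derivation; the function name `fit_polynomial` and the test data are our own choices.

```python
# Hypothetical sketch: closed-form least-squares fit of a degree-M polynomial.
# Minimizing E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 leads to the normal
# equations (Phi^T Phi) w* = Phi^T t, where Phi is the design matrix.
import numpy as np

def fit_polynomial(x, t, degree):
    """Return w* minimizing the sum-of-squares error for a polynomial fit."""
    Phi = np.vander(x, degree + 1, increasing=True)  # Phi[n, j] = x_n^j
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # solves the normal equations
    return w

x = np.array([0.0, 1.0, 2.0, 3.0])
t = 2.0 + 3.0 * x                 # noiseless linear data
w_star = fit_polynomial(x, t, degree=1)
print(np.round(w_star, 6))        # recovers [2. 3.]
```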

# II- Probability Theory

**PS-II-1**: Describe the generic Bayesian approach to solving the curve fitting problem described above.

**PS-II-2**: The Gaussian distribution is one of the most important probability distributions for continuous variables. It is widely used to model the data in regression problems.

- Derive the mean and variance of the (univariate) Gaussian distribution.
- Derive the mean and variance of the multivariate Gaussian distribution (bonus: at most 2 points).
- Specify the Bayesian approach using the Gaussian distribution and derive the optimal parameters' values.
- Show that maximizing the likelihood is equivalent to minimizing the sum-of-squares error function.
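
As a numeric illustration of the last point (a sketch, not the required proof): under a Gaussian noise model the negative log-likelihood equals the sum-of-squares error up to a positive scale and an additive constant, so both criteria share the same minimizer. The scalar case below checks this for estimating a Gaussian mean.

```python
# Illustrative check: for Gaussian noise, the negative log-likelihood is
# (1/(2*var)) * sum (x_n - mu)^2 + const, so the ML estimate of mu is
# exactly the minimizer of the sum of squares, i.e. the sample mean.
import numpy as np

data = np.array([1.0, 2.0, 4.0, 5.0])

def neg_log_likelihood(mu, x, var=1.0):
    return 0.5 * np.sum((x - mu) ** 2) / var + 0.5 * len(x) * np.log(2 * np.pi * var)

grid = np.linspace(0.0, 6.0, 601)                 # step 0.01, contains the mean 3.0
nll = [neg_log_likelihood(m, data) for m in grid]
sse = [np.sum((data - m) ** 2) for m in grid]

# Both criteria are minimized at the same point: the sample mean.
print(grid[np.argmin(nll)], grid[np.argmin(sse)], data.mean())
```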

**PS-II-3**: Assume the probability of a certain disease is 0.01. The probability of testing positive given that a person is infected with the disease is 0.95, and the probability of testing positive given that the person is not infected with the disease is 0.05.

- Calculate the probability of testing positive.
- Use Bayes' rule to calculate the probability of being infected with the disease given that the test is positive.
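
The structure of both computations can be sketched as follows; the numbers below are made-up illustrative values, deliberately *not* the ones in the problem, so the exercise is still left to solve.

```python
# A minimal sketch of the two computations, using illustrative numbers
# (not the values from the problem).
def total_positive(prior, p_pos_given_disease, p_pos_given_healthy):
    # Law of total probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
    return p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)

def posterior_disease(prior, p_pos_given_disease, p_pos_given_healthy):
    # Bayes' rule: P(D|+) = P(+|D)P(D) / P(+)
    p_pos = total_positive(prior, p_pos_given_disease, p_pos_given_healthy)
    return p_pos_given_disease * prior / p_pos

p_pos = total_positive(0.1, 0.9, 0.2)    # 0.09 + 0.18 = 0.27
post = posterior_disease(0.1, 0.9, 0.2)  # 0.09 / 0.27 = 1/3
print(p_pos, round(post, 4))
```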

# III- Decision Theory

**PS-III-1**: Assume you have a generic classification problem with K classes. To avoid making decisions on the difficult cases, the *reject option* is defined by introducing a threshold θ and rejecting those inputs x for which the largest of the posterior probabilities p(C_k|x) is less than or equal to θ. Prove that:

- when θ = 1, all examples are rejected
- when θ < 1/K, no examples are rejected
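
A simulation can make the two claims plausible before proving them (this is a sanity check, not the proof): any valid posterior vector over K classes has its maximum in [1/K, 1], since the K probabilities sum to 1.

```python
# Numeric sanity check: for random valid posteriors over K classes,
# max_k p(C_k|x) always lies in [1/K, 1], so theta = 1 rejects every
# input and any theta strictly below 1/K rejects none.
import numpy as np

rng = np.random.default_rng(0)
K = 5
posteriors = rng.dirichlet(np.ones(K), size=1000)  # rows sum to 1
max_post = posteriors.max(axis=1)

all_rejected = np.all(max_post <= 1.0)                  # theta = 1
none_rejected = not np.any(max_post <= 1.0 / K - 1e-9)  # theta just below 1/K
print(all_rejected, none_rejected)  # True True
```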

**PS-III-2**: There are many reasons to separate the inference step from the decision step. List these reasons and describe in depth how computing posterior probabilities would help.

# IV- Information Theory

**PS-IV-1**: Entropy takes advantage of the nonuniform distribution of events to guide the use of variable-length codes representing these events, in the hope of achieving a shorter average code length.

- Describe a coding schema to achieve that
- Show that the proposed coding schema ensures the decoding is unique
- Support your work with examples
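
One well-known schema of this kind is Huffman coding; the sketch below (one possible choice, assuming a known symbol distribution) shows that frequent symbols get shorter codewords and that the resulting code is prefix-free, which is what guarantees unique decodability.

```python
# Minimal Huffman-coding sketch: build codes by repeatedly merging the two
# least probable subtrees; the result is a prefix-free code.
import heapq

def huffman_code(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> bitstring."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}  # left subtree gets '0'
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

code = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
lengths = {s: len(b) for s, b in code.items()}
print(lengths)  # more probable symbols get shorter codewords
```

Because no codeword is a prefix of another, a bitstream can be decoded greedily from left to right with exactly one valid parse.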

**PS-IV-2**: The KL-divergence from a distribution q(x) to a distribution p(x) can be thought of as a distance measure from q to p:

KL(p||q) = − Σ_x p(x) log [ q(x) / p(x) ]

If p(x) = q(x), then KL(p||q) = 0; otherwise KL(p||q) > 0.

We can define mutual information as the KL-divergence from the observed joint distribution of X and Y to the product of their marginals:

I(X, Y) ≡ KL( p(x, y) || p(x) p(y) )

- Using the KL-divergence-based definition of mutual information, show that I(X, Y) = H(X) − H(X|Y) and I(X, Y) = H(Y) − H(Y|X)

- From this definition, prove that mutual information is symmetric, i.e., I(X, Y) = I(Y, X)
- According to this definition, under what conditions do we have I(X, Y) = 0?
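
The three bullets can be checked numerically on a small joint distribution before attempting the proofs (a sketch under our own toy distribution, not the required derivation):

```python
# Numeric illustration: for a 2x2 joint distribution, I(X,Y) computed as
# KL(p(x,y) || p(x)p(y)) equals H(X) - H(X|Y), is symmetric, and vanishes
# when X and Y are independent.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(joint):
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    indep = np.outer(px, py)         # product of marginals p(x)p(y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / indep[mask]))

joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])
px, py = joint.sum(axis=1), joint.sum(axis=0)
H_X = entropy(px)
H_X_given_Y = entropy(joint.flatten()) - entropy(py)  # H(X|Y) = H(X,Y) - H(Y)
I = mutual_info(joint)

print(np.isclose(I, H_X - H_X_given_Y))     # True
print(np.isclose(I, mutual_info(joint.T)))  # symmetry: True

indep_joint = np.outer([0.4, 0.6], [0.7, 0.3])  # independent X and Y
print(np.isclose(mutual_info(indep_joint), 0.0))  # True
```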
