## Machine learning assignment

Instructions: You must type your solutions and submit them to Blackboard. Only PDF format is accepted. The assignment is to be completed by a group of two. Name the solutions file Assign1_FirstName1_FirstName2. Submit both the problem set and the programming exercises.

# Problem Set:

# I-  Curve Fitting

PS-I-1: Assume the curve fitting problem is solved using the polynomial function $y(x, \mathbf{w})$ and the sum-of-squares error function $E(\mathbf{w})$. Derive the optimal solution $\mathbf{w}^*$, where

$$y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j, \qquad E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2$$
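Although the problem asks for an analytic derivation, a small numerical sketch (toy data and polynomial order are assumptions here, not part of the problem) can be used to sanity-check your closed-form answer against a least-squares solver:

```python
import numpy as np

# Illustrative check (not the requested derivation): for the sum-of-squares
# error, the optimal weights solve the normal equations (Phi^T Phi) w = Phi^T t,
# where Phi is the polynomial design matrix with Phi[n, j] = x_n ** j.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # toy targets

M = 3                                        # assumed polynomial order
Phi = np.vander(x, M + 1, increasing=True)   # design matrix

# lstsq minimizes ||Phi w - t||^2 directly ...
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ... and should agree with the closed-form solution of the normal equations.
w_closed = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
assert np.allclose(w_star, w_closed)
```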

# II-  Probability Theory

PS-II-1: Describe the generic Bayesian approach to solve the curve fitting problem described above.

PS-II-2: The Gaussian distribution is one of the most important probability distributions for continuous variables. It is widely used to model the data in regression problems.

1. Derive the mean and variance of the (univariate) Gaussian distribution.
2. Derive the mean and variance of the multivariate Gaussian distribution (bonus: at most 2 points).
3. Specify the Bayesian approach using the Gaussian distribution and derive the optimal parameters' values.
4. Show that maximizing the likelihood is equivalent to minimizing the sum-of-squares error function.
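As a numerical companion to the univariate case (toy data assumed, not a substitute for the derivation), the sketch below illustrates that the sample mean and the biased sample variance are the maximum-likelihood estimates:

```python
import numpy as np

# Toy illustration (assumed data): the MLE of a univariate Gaussian's mean is
# the sample mean, and the MLE of its variance uses 1/N, not 1/(N-1).
rng = np.random.default_rng(42)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

mu_ml = data.mean()                        # MLE of the mean
var_ml = ((data - mu_ml) ** 2).mean()      # MLE of the variance (biased, 1/N)

# A grid search over mu confirms the log-likelihood peaks at the sample mean.
def neg_log_lik(mu, var):
    return (0.5 * np.sum((data - mu) ** 2) / var
            + 0.5 * data.size * np.log(2 * np.pi * var))

mus = np.linspace(1.5, 2.5, 201)
best_mu = mus[np.argmin([neg_log_lik(m, var_ml) for m in mus])]
assert abs(best_mu - mu_ml) < 0.01
```

Note the connection to part 4: for fixed variance, `neg_log_lik` differs from the sum-of-squares error only by a scale factor and an additive constant.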

PS-II-3: Assume the probability of a certain disease is 0.01. The probability of testing positive given that a person is infected with the disease is 0.95, and the probability of testing positive given that a person is not infected with the disease is 0.05.

1. Calculate the probability of testing positive.
2. Use Bayes' rule to calculate the probability of being infected with the disease given that the test is positive.
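A minimal Python sketch, using only the numbers given in the problem, can serve as a sanity check for your hand calculation:

```python
# Sanity check for the two parts above, using the law of total probability
# and Bayes' rule (all values taken from the problem statement).
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# 1. P(positive) = P(pos | D) P(D) + P(pos | not D) P(not D)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# 2. P(D | positive) = P(pos | D) P(D) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(p_pos)                # 0.059
print(p_disease_given_pos)  # ~0.161
```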

# III-  Decision Theory

PS-III-1: Assume you have a generic classification problem with $K$ classes. To avoid making decisions on the difficult cases, the reject option is defined by introducing a threshold $\theta$ and rejecting those inputs $x$ for which the largest of the posterior probabilities $p(C_k \mid x)$ is less than or equal to $\theta$. Prove that:

• when π = 1 , all examples are rejected
• when π < 1/πΎ , no examples are rejected

PS-III-2: There are many reasons to separate the inference step from the decision step. List these reasons and describe in depth how computing posterior probabilities would help.

# IV-  Information Theory

PS-IV-1: Entropy takes advantage of the nonuniform distribution of events to guide the use of variable-length codes for representing these events, in the hope of achieving a shorter average code length.

• Describe a coding scheme that achieves this
• Show that the proposed coding scheme ensures that decoding is unique
• Support your work with examples
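As a hint of the kind of example that supports this question, one well-known scheme of this kind is a prefix code built by Huffman's algorithm (the symbol probabilities below are an assumed example): no codeword is a prefix of another, which is what makes decoding unique.

```python
import heapq
from math import log2

# Assumed example distribution (dyadic, so the code can match the entropy).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Huffman construction: repeatedly merge the two least-probable subtrees,
# prefixing '0' to one side's codewords and '1' to the other's.
heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)  # tie-breaker so tuples never compare the dicts
while len(heap) > 1:
    p1, _, c1 = heapq.heappop(heap)
    p2, _, c2 = heapq.heappop(heap)
    merged = {s: "0" + code for s, code in c1.items()}
    merged.update({s: "1" + code for s, code in c2.items()})
    heapq.heappush(heap, (p1 + p2, counter, merged))
    counter += 1
codes = heap[0][2]

avg_len = sum(probs[s] * len(codes[s]) for s in probs)
entropy = -sum(p * log2(p) for p in probs.values())
# For this dyadic distribution the average length equals the entropy: 1.75 bits.
assert abs(avg_len - entropy) < 1e-9
```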

PS-IV-2: The KL-divergence from a distribution $q(x)$ to a distribution $p(x)$ can be thought of as a distance measure from $q$ to $p$:

$$KL(q \| p) = -\sum_x q(x) \log \frac{p(x)}{q(x)}$$

If $q(x) = p(x)$, then $KL(q \| p) = 0$; otherwise $KL(q \| p) > 0$.

We can define mutual information as the KL-divergence from the observed joint distribution of $X$ and $Y$ to the product of their marginals:

$$I(X, Y) \equiv KL\big(p(x, y) \,\|\, p(x)\,p(y)\big)$$

1. Using the KL-divergence-based definition of mutual information, show that $I(X, Y) = H(X) - H(X \mid Y)$ and $I(X, Y) = H(Y) - H(Y \mid X)$.

• From this definition, prove that mutual information is symmetric, i.e., $I(X, Y) = I(Y, X)$
• According to this definition, under what conditions do we have $I(X, Y) = 0$?
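The identities above can be checked numerically on a toy joint distribution (the distribution below is an assumed example, not part of the problem):

```python
import numpy as np

# Assumed toy joint distribution of two binary variables X and Y.
p_xy = np.array([[0.25, 0.25],
                 [0.00, 0.50]])
p_x = p_xy.sum(axis=1)    # marginal of X
p_y = p_xy.sum(axis=0)    # marginal of Y

def entropy(p):
    p = p[p > 0]          # 0 log 0 is treated as 0
    return -np.sum(p * np.log2(p))

# I(X,Y) computed directly as KL(p(x,y) || p(x)p(y)).
mask = p_xy > 0
I = np.sum(p_xy[mask] * np.log2(p_xy[mask] / np.outer(p_x, p_y)[mask]))

# I(X,Y) via the identity H(X) - H(X|Y), using H(X|Y) = H(X,Y) - H(Y).
H_x_given_y = entropy(p_xy.ravel()) - entropy(p_y)
assert np.isclose(I, entropy(p_x) - H_x_given_y)
```

Replacing `p_xy` with `np.outer(p_x, p_y)` (independent $X$ and $Y$) drives `I` to zero, which previews the answer to the last bullet.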
