Instructions: You must type your solutions and submit them on Blackboard. Only PDF format is accepted. The assignment is to be completed by a group of two. Name the solutions file Assign1_FirstName1_FirstName2. Submit both the problem set and the programming exercises.
I- Curve Fitting Problem
PS-I-1: Assume the curve fitting problem is solved using the polynomial function 𝑦(𝑥, 𝐰) and the sum-of-squares error function 𝐸(𝐰). Derive the optimal solution 𝐰∗, where

$$y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j, \qquad E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl\{ y(x_n, \mathbf{w}) - t_n \bigr\}^2 .$$
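For intuition only (not a substitute for the derivation), the sketch below solves the resulting normal equations numerically with NumPy; the toy data and the degree M = 3 are illustrative assumptions, not part of the problem.

```python
import numpy as np

# Toy data (assumption for illustration): noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

M = 3  # polynomial degree (assumed)
Phi = np.vander(x, M + 1, increasing=True)  # design matrix: Phi[n, j] = x_n ** j

# Setting the gradient of E(w) to zero yields the normal equations
# (Phi^T Phi) w* = Phi^T t, solved here directly.
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print("w* =", w_star)
```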
II- Probability Theory
PS-II-1: Describe the generic Bayesian approach to solving the curve fitting problem described above.
PS-II-2: The Gaussian Distribution is one of the most important probability distributions for continuous variables. It is widely used to model the data in regression problems.
- Derive the mean and variance of the (univariate) Gaussian Distribution (a numerical check of this result is sketched after this list)
- Derive the mean and variance of the Multivariate Gaussian Distribution (bonus: at most 2 points)
- Specify the Bayesian approach using the Gaussian Distribution and derive the optimal parameter values.
- Show that maximizing the likelihood is equivalent to minimizing the sum-of-squares error function.
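The following minimal sketch (with hypothetical parameter values) checks the univariate result numerically: integrating 𝑥 and (𝑥 − 𝜇)² against the Gaussian density recovers the mean 𝜇 and the variance 𝜎².

```python
import numpy as np

# Hypothetical parameters for the check.
mu, sigma = 1.5, 0.7

# Dense grid covering essentially all of the probability mass.
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
dx = x[1] - x[0]
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# E[x] and var[x] by numerical integration (Riemann sum).
mean = np.sum(x * pdf) * dx
var = np.sum((x - mean) ** 2 * pdf) * dx
print(mean, var)  # ~1.5 and ~0.49 = sigma**2
```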
PS-II-3: Assume the probability of a certain disease is 0.01. The probability of testing positive given that a person is infected with the disease is 0.95, and the probability of testing positive given that a person is not infected with the disease is 0.05.
- Calculate the probability of testing positive.
- Use Bayes’ Rule to calculate the probability of being infected with the disease given that the test is positive (a numerical check follows this list).
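As a quick arithmetic self-check (not a substitute for writing out the derivation), both quantities can be computed directly from the numbers given:

```python
# Given: P(D) = 0.01, P(+|D) = 0.95, P(+|not D) = 0.05.
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05

# Law of total probability: P(+) = P(+|D) P(D) + P(+|not D) P(not D).
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' rule: P(D|+) = P(+|D) P(D) / P(+).
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(p_pos, p_d_given_pos)  # 0.059 and ~0.161
```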
III- Decision Theory
PS-III-1: Assume you have a generic classification problem with 𝐾 classes. To avoid making decisions on the difficult cases, the reject option is defined by introducing a threshold 𝜃 and rejecting those inputs 𝑥 for which the largest of the posterior probabilities $p(C_k|x)$ is less than or equal to 𝜃. Prove that:
- when 𝜃 = 1, all examples are rejected
- when 𝜃 < 1/𝐾, no examples are rejected (a small demonstration of both claims follows)
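A minimal sketch of the reject rule, with made-up posterior vectors for 𝐾 = 3 classes, illustrates both claims; since each posterior vector sums to 1, its largest entry always lies in [1/𝐾, 1].

```python
import numpy as np

def rejected(posteriors, theta):
    """Reject an input when its largest posterior probability is <= theta."""
    return np.max(posteriors) <= theta

K = 3
# Hypothetical posterior vectors for a few inputs (each row sums to 1).
P = np.array([[0.9, 0.05, 0.05],
              [0.4, 0.35, 0.25],
              [1/3, 1/3, 1/3]])

print([rejected(p, theta=1.0) for p in P])  # all True: max posterior <= 1 always
print([rejected(p, theta=0.2) for p in P])  # all False: max posterior >= 1/K > theta
```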
PS-III-2: There are many reasons to separate the inference step from the decision step. List these reasons and describe in depth how computing posterior probabilities would help.
IV- Information Theory
PS-IV-1: Entropy takes advantage of the nonuniform distribution of events to guide the use of variable-length codes for representing these events, in the hope of achieving a shorter average code length.
- Describe a coding scheme that achieves this
- Show that the proposed coding scheme ensures unique decoding
- Support your work with examples (one illustrative sketch is given below)
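One classic scheme of this kind, offered purely as an illustration rather than the required answer, is Huffman coding: it assigns shorter codewords to more probable events and produces a prefix code, so no codeword is a prefix of another and a bit stream decodes unambiguously left to right. A minimal sketch with hypothetical probabilities:

```python
import heapq

def huffman_code(probs):
    """Build a Huffman prefix code for a {symbol: probability} source."""
    # Heap entries: (probability, tiebreaker, {symbol: codeword-so-far}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        # Prepend 0/1 to every codeword in the two merged subtrees.
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

# Hypothetical source: four events with a nonuniform distribution.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
print(code)  # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

avg_len = sum(probs[s] * len(w) for s, w in code.items())
print(avg_len)  # 1.75 bits per event
```

For this particular distribution the average code length equals the entropy (1.75 bits), the shortest achievable by any uniquely decodable code.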
PS-IV-2: The KL-divergence from a distribution 𝑝(𝑥) to a distribution 𝑞(𝑥) can be thought of as a distance measure from 𝑝 to 𝑞:
$$\mathrm{KL}(p \,\|\, q) = -\sum_x p(x) \log_2 \frac{q(x)}{p(x)}$$
If 𝑝(𝑥) = 𝑞(𝑥) for all 𝑥, then 𝐾𝐿(𝑝||𝑞) = 0; otherwise 𝐾𝐿(𝑝||𝑞) > 0.
We can define mutual information as the KL-divergence from the observed joint distribution of 𝑋 and 𝑌 to the product of their marginals:

$$I(X, Y) \equiv \mathrm{KL}\bigl(p(x, y) \,\|\, p(x)\,p(y)\bigr)$$
- Using the KL-divergence based definition of mutual information, show that 𝐼(𝑋, 𝑌) = 𝐻(𝑋) − 𝐻(𝑋|𝑌) and 𝐼(𝑋, 𝑌) = 𝐻(𝑌) − 𝐻(𝑌|𝑋)
- From this definition, prove that mutual information is symmetric, i.e. 𝐼(𝑋, 𝑌) = 𝐼(𝑌, 𝑋)
- According to this definition, under what conditions do we have 𝐼(𝑋, 𝑌) = 0? (A numerical check of these identities is sketched below.)
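The identities can be verified numerically on a small discrete example; the 2×2 joint distribution below is made up for illustration.

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over a 2x2 space (entries sum to 1).
pxy = np.array([[0.30, 0.20],
                [0.10, 0.40]])
px = pxy.sum(axis=1)  # marginal p(x)
py = pxy.sum(axis=0)  # marginal p(y)

# I(X, Y) = KL(p(x, y) || p(x) p(y)), in bits.
I = np.sum(pxy * np.log2(pxy / np.outer(px, py)))

# H(X) and H(X|Y) = -sum_{x,y} p(x, y) log2 p(x|y), with p(x|y) = p(x, y)/p(y).
Hx = -np.sum(px * np.log2(px))
Hx_given_y = -np.sum(pxy * np.log2(pxy / py))
print(np.isclose(I, Hx - Hx_given_y))  # True: I(X, Y) = H(X) - H(X|Y)

# Symmetry: swapping the roles of X and Y (transposing pxy) leaves I unchanged.
I_swapped = np.sum(pxy.T * np.log2(pxy.T / np.outer(py, px)))
print(np.isclose(I, I_swapped))  # True: I(X, Y) = I(Y, X)

# Independence: if p(x, y) = p(x) p(y), the log ratio vanishes and I = 0.
indep = np.outer(px, py)
print(np.isclose(np.sum(indep * np.log2(indep / np.outer(px, py))), 0.0))  # True
```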