联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Algorithm 算法作业Algorithm 算法作业

日期:2019-12-12 10:52

COMPSCI 671D Fall 2019

Homework 4

1 EM Algorithm for Topic Modeling (35 points)

In this question we will try to design an algorithm for discovering the abstract topics that occur in a

set of documents. In the problem, we have a set of M abstract documents in the universe, D, which

are mixtures of latent topics. We also have a dictionary of words, W, with size N. Intuitively, W is

the list of all unique words that occur at least once within D. Using these two sets we can denote

the number of times word wj occurs in document di as n(wj , di). It follows that we can represent

a document di as a vector of size N, each entry of which corresponds to how many times word wj

appears in it. The topics z are latent variables in a topic set Z sized K. Intuitively, if a document

is about a certain topic, it will include some particular words with higher probability. For example,

“music”, “show” and “film” appear frequently in documents about Arts while “school”, “student”

and “teachers” usually occur if the document is about Education. Meanwhile, a document can be

a mixture of several topics, e.g., a document can be about arts education for high school students.

Inspired by the intuition, a reasonable approach to model the generative process is as follows.

Pr(di

, zk, wj ) = Pr(di) Pr(zk|di) Pr(wj |zk)

which means the topics only depend on the documents and the words only depend on topics. For

computational convenience, let Pr(di) be a uniform distribution and Pr(zk|di) and Pr(wj |zk) be

Multinomial distributions. i.e.

We are interested in learning αik and βkj because they are substantively interesting: αik represents

the proportion of topic k that makes up document i, and βkj the probability of seeing word j under

topic k. Therefore, a single word wj in a document di

is generated in this way: (1) Choose a topic

zk from the topic distribution P r(zk|di), the probability of choosing topic k is αik; (2) Choose a

word from the vocabulary distribution p(wj |zk) in this topic, the probability is βkj . We want to

learn these parameters by maximizing the likelihood of the data using the EM algorithm.

1

a) For our set of N documents and W word of choices, write down the log-likelihood of the model

above, i.e. log(p(D = di

, W = wj |α, β)).

Remember that in the EM (Expectation-Maximization) algorithm, we can figure out the parameters

and the hidden variables by iteratively computing (1) the distribution of latent variables using

the old parameters and (2) find new parameters that maximize the likelihood using the old latent

variable distribution. So, you can do the following:

b) E-step: Derive the distribution of latent variable p(zk|wj , di

, αold, βold) by fixing the old parameters.

c) M-step: Find the α

new and β

new that optimize the log likelihood using p(zk|wj , di

, αold, βold)

as the distribution of z.

You would iterate these until convergence to get the final parameters in practice.

Just so you know, the model you have just been working with in this problem is a very famous

and very popular model. :)

2 Neural Networks (65 points)

In this problem you will get the chance to construct a neural network with architecture 784 ? H ?

H ? 1, where H is the number of hidden nodes you choose. This neural network can be used for

binary classification, such 0, 1 digits classification for MNIST. The model can be represented as

f(x; θ) : R

784 → [0, 1]

where θ = {W1 ∈ R

784×H, b1 ∈ R

H,W2 ∈ R

H×H, b2 ∈ R

H, w3 ∈ R

H, b3 ∈ R, }. Explicitly,

The loss function for this model (or the negative log-likelihood) is the usual one:


a) Backpropogation

Derive the gradients ?w3L(θ), ?b3L(θ), ?W2L(θ), ?b2L(θ), ?W1L(θ), ?b1L(θ) by backpropagation.

Hints: The problem would be much easier if you define zl = WT

l hl?1 + bl

, and derive the

relationship between ?zlL(θ) and ?zl?1L(θ). Also, you can give the final result in a recursive way

as long as it is correct and consistent.

2

b) Weight Initialization

Neural Networks are normally trained by gradient descent based optimization algorithms, for

which we have to give initial values for our parameters (weight and bias). The bias parameters

are usually initialized as 0’s, while the weight parameters are usually initialized by independently

sampling from N(0, σ2).

(i) What are the potential problems if we choose σ to be either too small or too large?

(ii) One nice property we would like our weights to have is that by proper initialization, we

want the values before activation to be similar in distribution for each layer. It turns out we can

achieve this by choosing σ properly. Now suppose we use ReLU activation in our neural network

and the inputs are normalized. We want to choose the proper σl

for the l-th hidden layer such that

V ar(zl) = V ar(zl?1).

We assume the neurons in each layer are mutually independent and share the same distribution

and we use zl to represent the random variable that generates zl

, where zl = WT

l hl?1 + bl and

hl?1 = max(0, zl?1) and Wl ∈ R

nl?1×nl.

Find σt such that V ar(zl) = V ar(zl?1) and prove your result.

c) Implement your Neural Network

Use the preprocessed MNIST data provided in the attachment. MNIST is a dataset of images

of handwritten digits that also contain labels for the number they correspond to. We want to train

a neural network to learn to recognize these handwritten digits. Use gradient descent with the

gradients you have derived in part a) to train a neural network on the data. Report the accuracy

for the network on the train and test data. Since the sample size is large, we can approximate the

true gradient with stochastic gradient for computational convenience:

which means at each step, instead of calculating the gradients with all samples in the dataset,

we randomly pick a mini-batch of B samples and calculate the gradient using these samples. In

implementing the neural network, it would be beneficial to build and train the model in a general

way. That is, build a NN with d hidden layers such that we can easily modify d.

d) Vanishing Gradient

(i) In our current model, we have 2 hidden layers. What is the magnitude of ?W1L(θ) (you

can use Frobenius norm or induced matrix norm)? Now try to modify the the number of hidden

layers d, from 1 to 10. What happens to the magnitude of ?W1L(θ) when we increase d? Plot

the magnitude against d on log scale. In this question, you don’t need to train the whole neural

network, just initialize it with random weights, e.g. N (0, 1), and calculate the gradient.

(ii) Suppose that all weights are finite and the weight matrices W2,W3, ...,Wd are diagonalizable

matrices in which the absolute value of the eigenvalues are upper-bounded by 4 ? . Prove

that

lim

d→∞

?W1L(θ) = 0

Hints: Try to use the the relationship between ?zlL(θ) and ?zl?1L(θ) and try to apply a common

inequality.

3

(iii) The result above is known as the vanishing gradient problem in the feedforward neural

network. Will there still be such a problem if we use a tanh activation? What if we use a ReLU

activation?

Give an intuitive explanation of why vanishing gradient happens in light of the shapes of activation

functions and provide 2 ways of dealing with this problem.

e) Sigmoid Activation

One potential problem of using the sigmoid activation function for hidden layers is that it maps

from R to [0, 1], which means the output of the sigmoid function is always positive. What happens

to the signs of the gradients of hidden layer weights in this situation if we only have one example?

What if we are adding gradients up across a batch of data?

4


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp