• #ACL2021NLP #ACL2021 Please check our group’s recent publication at the main conference of @aclmeeting. We uncovered a compositional generalization problem existing in NMT models and contributed a new dataset. Contributed by Yafu Li, Yongjing Yin, Yulong Chen, Yue Zhang.

  • Prof Yue Zhang leads the #NLP lab at Westlake University @Westlake_Uni. Our group focuses on machine learning-based natural language processing, as well as application-oriented tasks, such as web information extraction and financial market prediction. Welcome to join us!

  • #NLProc #ACL2021 G-Transformer for Document-level Machine Translation Paper:arxiv.org/abs/2105.14761 Code:github.com/baoguangsheng/ Our @aclmeeting paper at the main conference introduces locality bias to fix the failure of Transformer training on document-level MT data.

word2vec Parameter Learning Explained

NLP Deep Talk 2年前 (2020-01-04) 196次浏览 已收录 0个评论 扫描二维码

The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent two years. The vector representations of words learned by word2vec models have been shown to carry semantic meanings and are useful in various NLP tasks. As an increasing number of researchers would like to experiment with word2vec or similar techniques, I notice that there lacks a material that comprehensively explains the parameter learning process of word embedding models in details, thus preventing researchers that are non-experts in neural networks from understanding the working mechanism of such models.

This note provides detailed derivations and explanations of the parameter update equations of the word2vec models, including the original continuous bag-of-word (CBOW) and skip-gram (SG) models, as well as advanced optimization techniques, including hierarchical softmax and negative sampling. Intuitive interpretations of the gradient equations are also provided alongside mathematical derivations.

In the appendix, a review on the basics of neuron networks and backpropagation is provided. I also created an interactive demo, wevi, to facilitate the intuitive understanding of the model.111An online interactive demo is available at: http://bit.ly/wevi-online.

Continuous Bag-of-Word Model

1.1 One-word context

We start from the simplest version of the continuous bag-of-word model (CBOW) introduced in Mikolov et al. (2013a). We assume that there is only one word considered per context, which means the model will predict one target word given one context word, which is like a bigram model. For readers who are new to neural networks, it is recommended that one go through Appendix A for a quick review of the important concepts and terminologies before proceeding further.

word2vec Parameter Learning Explained
Figure 1: A simple CBOW model with only one word in the context

Figure 1 shows the network model under the simplified context definition222In Figures 123, and the rest of this note,  is not the transpose of , but a different matrix instead.. In our setting, the vocabulary size is , and the hidden layer size is . The units on adjacent layers are fully connected. The input is a one-hot encoded vector, which means for a given input context word, only one out of  units, , will be 1, and all other units are 0.

The weights between the input layer and the output layer can be represented by a  matrix . Each row of  is the -dimension vector representation  of the associated word of the input layer. Formally, row  of W is . Given a context (a word), assuming  and  for , we have


which is essentially copying the -th row of  to  is the vector representation of the input word . This implies that the link (activation) function of the hidden layer units is simply linear (i.e., directly passing its weighted sum of inputs to the next layer).

From the hidden layer to the output layer, there is a different weight matrix , which is an  matrix. Using these weights, we can compute a score  for each word in the vocabulary,


where  is the -th column of the matrix . Then we can use softmax, a log-linear classification model, to obtain the posterior distribution of words, which is a multinomial distribution.


where  is the output of the -the unit in the output layer. Substituting (1) and (2) into (3), we obtain


Note that  and  are two representations of the word  comes from rows of , which is the inputhidden weight matrix, and  comes from columns of , which is the hiddenoutput matrix. In subsequent analysis, we call  as the “input vector”, and  as the “output vector” of the word .

Update equation for hiddenoutput weights

Let us now derive the weight update equation for this model. Although the actual computation is impractical (explained below), we are doing the derivation to gain insights on this original model with no tricks applied. For a review of basics of backpropagation, see Appendix A.

The training objective (for one training sample) is to maximize (4), the conditional probability of observing the actual output word  (denote its index in the output layer as ) given the input context word  with regard to the weights.


where  is our loss function (we want to minimize ), and  is the index of the actual output word in the output layer. Note that this loss function can be understood as a special case of the cross-entropy measurement between two probabilistic distributions.

Let us now derive the update equation of the weights between hidden and output layers. Take the derivative of  with regard to -th unit’s net input , we obtain


where , i.e.,  will only be 1 when the -th unit is the actual output word, otherwise . Note that this derivative is simply the prediction error  of the output layer.

Next we take the derivative on  to obtain the gradient on the hiddenoutput weights.


Therefore, using stochastic gradient descent, we obtain the weight updating equation for hiddenoutput weights:




where  is the learning rate, and  is the -th unit in the hidden layer is the output vector of . Note that this update equation implies that we have to go through every possible word in the vocabulary, check its output probability , and compare  with its expected output  (either 0 or 1). If  (“overestimating”), then we subtract a proportion of the hidden vector  (i.e., ) from , thus making  farther away from ; if  (“underestimating”, which is true only if , i.e., ), we add some  to , thus making  closer333Here when I say “closer” or “farther”, I meant using the inner product instead of Euclidean as the distance measurement. to . If  is very close to , then according to the update equation, very little change will be made to the weights. Note, again, that  (input vector) and  (output vector) are two different vector representations of the word .

Update equation for inputhidden weights

Having obtained the update equations for , we can now move on to . We take the derivative of  on the output of the hidden layer, obtaining


where  is the output of the -th unit of the hidden layer is defined in (2), the net input of the -th unit in the output layer; and  is the prediction error of the -th word in the output layerEH, an -dim vector, is the sum of the output vectors of all words in the vocabulary, weighted by their prediction error.

Next we should take the derivative of  on . First, recall that the hidden layer performs a linear computation on the values from the input layer. Expanding the vector notation in (1) we get


Now we can take the derivative of  with regard to each element of , obtaining


This is equivalent to the tensor product of  and EH, i.e.,


from which we obtain a  matrix. Since only one component of  is non-zero, only one row of  is non-zero, and the value of that row is , an -dim vector. We obtain the update equation of  as


where  is a row of , the “input vector” of the only context word, and is the only row of  whose derivative is non-zero. All the other rows of  will remain unchanged after this iteration, because their derivatives are zero.

Intuitively, since vector EH is the sum of output vectors of all words in vocabulary weighted by their prediction error , we can understand (16) as adding a portion of every output vector in vocabulary to the input vector of the context word. If, in the output layer, the probability of a word  being the output word is overestimated (), then the input vector of the context word  will tend to move farther away from the output vector of ; conversely if the probability of  being the output word is underestimated (), then the input vector  will tend to move closer to the output vector of ; if the probability of  is fairly accurately predicted, then it will have little effect on the movement of the input vector of . The movement of the input vector of  is determined by the prediction error of all vectors in the vocabulary; the larger the prediction error, the more significant effects a word will exert on the movement on the input vector of the context word.

As we iteratively update the model parameters by going through context-target word pairs generated from a training corpus, the effects on the vectors will accumulate. We can imagine that the output vector of a word  is “dragged” back-and-forth by the input vectors of ’s co-occurring neighbors, as if there are physical strings between the vector of  and the vectors of its neighbors. Similarly, an input vector can also be considered as being dragged by many output vectors. This interpretation can remind us of gravity, or force-directed graph layout. The equilibrium length of each imaginary string is related to the strength of cooccurrence between the associated pair of words, as well as the learning rate. After many iterations, the relative positions of the input and output vectors will eventually stabilize.

1.2 Multi-word context

Figure 2 shows the CBOW model with a multi-word context setting. When computing the hidden layer output, instead of directly copying the input vector of the input context word, the CBOW model takes the average of the vectors of the input context words, and use the product of the inputhidden weight matrix and the average vector as the output.


where  is the number of words in the context,  are the words the in the context, and  is the input vector of a word . The loss function is


which is the same as (7), the objective of the one-word-context model, except that  is different, as defined in (18) instead of (1).

word2vec Parameter Learning Explained
Figure 2: Continuous bag-of-word model

The update equation for the hiddenoutput weights stay the same as that for the one-word-context model (11). We copy it here:


Note that we need to apply this to every element of the hiddenoutput weight matrix for each training instance.

The update equation for inputhidden weights is similar to (16), except that now we need to apply the following equation for every word  in the context:


where  is the input vector of the -th word in the input context;  is a positive learning rate; and  is given by (12). The intuitive understanding of this update equation is the same as that for (16).

Skip-Gram Model

The skip-gram model is introduced in Mikolov et al. (2013ab). Figure 3 shows the skip-gram model. It is the opposite of the CBOW model. The target word is now at the input layer, and the context words are on the output layer.

word2vec Parameter Learning Explained
Figure 3: The skip-gram model.

We still use  to denote the input vector of the only word on the input layer, and thus we have the same definition of the hidden-layer outputs  as in (1), which means  is simply copying (and transposing) a row of the inputhidden weight matrix, , associated with the input word . We copy the definition of  below:


On the output layer, instead of outputing one multinomial distribution, we are outputing  multinomial distributions. Each output is computed using the same hiddenoutput matrix:


where  is the -th word on the -th panel of the output layer is the actual -th word in the output context words;  is the only input word;  is the output of the -th unit on the -th panel of the output layer is the net input of the -th unit on the -th panel of the output layer. Because the output layer panels share the same weights, thus


where  is the output vector of the -th word in the vocabulary, , and also  is taken from a column of the hiddenoutput weight matrix, .

The derivation of parameter update equations is not so different from the one-word-context model. The loss function is changed to


where  is the index of the actual -th output context word in the vocabulary.

We take the derivative of  with regard to the net input of every unit on every panel of the output layer and obtain


which is the prediction error on the unit, the same as in (8). For notation simplicity, we define a -dimensional vector  as the sum of prediction errors over all context words:


Next, we take the derivative of  with regard to the hiddenoutput matrix , and obtain


Thus we obtain the update equation for the hiddenoutput matrix ,




The intuitive understanding of this update equation is the same as that for (11), except that the prediction error is summed across all context words in the output layer. Note that we need to apply this update equation for every element of the hiddenoutput matrix for each training instance.

The derivation of the update equation for the inputhidden matrix is identical to (12) to (16), except taking into account that the prediction error  is replaced with . We directly give the update equation:


where EH is an -dim vector, each component of which is defined as


The intuitive understanding of (35) is the same as that for (16).

Optimizing Computational Efficiency

So far the models we have discussed (“bigrammodel, CBOW and skip-gram) are both in their original forms, without any efficiency optimization tricks being applied.

For all these models, there exist two vector representations for each word in the vocabulary: the input vector , and the output vector . Learning the input vectors is cheap; but learning the output vectors is very expensive. From the update equations (22) and (33), we can find that, in order to update , for each training instance, we have to iterate through every word  in the vocabulary, compute their net input , probability prediction  (or  for skip-gram), their prediction error  (or  for skip-gram), and finally use their prediction error to update their output vector .

Doing such computations for all words for every training instance is very expensive, making it impractical to scale up to large vocabularies or large training corpora. To solve this problem, an intuition is to limit the number of output vectors that must be updated per training instance. One elegant approach to achieving this is hierarchical softmax; another approach is through sampling, which will be discussed in the next section.

Both tricks optimize only the computation of the updates for output vectors. In our derivations, we care about three values: (1) , the new objective function; (2) , the new update equation for the output vectors; and (3) , the weighted sum of predictions errors to be backpropagated for updating input vectors.

3.1 Hierarchical Softmax

Hierarchical softmax is an efficient way of computing softmax (Morin and Bengio, 2005; Mnih and Hinton, 2009). The model uses a binary tree to represent all words in the vocabulary. The  words must be leaf units of the tree. It can be proved that there are  inner units. For each leaf unit, there exists a unique path from the root to the unit; and this path is used to estimate the probability of the word represented by the leaf unit. See Figure 4 for an example tree.

word2vec Parameter Learning Explained
Figure 4: An example binary tree for the hierarchical softmax model. The white units are words in the vocabulary, and the dark units are inner units. An example path from root to  is highlighted. In the example shown, the length of the path  means the -th unit on the path from root to the word .

In the hierarchical softmax model, there is no output vector representation for words. Instead, each of the  inner units has an output vector . And the probability of a word being the output word is defined as


where  is the left child of unit  is the vector representation (“output vector”) of the inner unit  is the output value of the hidden layer (in the skip-gram model ; and in CBOW, );  is a special function defined as


Let us intuitively understand the equation by going through an example. Looking at Figure 4, suppose we want to compute the probability that  being the output word. We define this probability as the probability of a random walk starting from the root ending at the leaf unit in question. At each inner unit (including the root unit), we need to assign the probabilities of going left and going right.444While an inner unit of a binary tree may not always have both children, a binary Huffman tree’s inner units always do. Although theoretically one can use many different types of trees for hierarchical softmax, word2vec uses a binary Huffman tree for fast training. We define the probability of going left at an inner unit  to be


which is determined by both the vector representation of the inner unit, and the hidden layer’s output value (which is then determined by the vector representation of the input word(s)). Apparently the probability of going right at unit  is


Following the path from the root to  in Figure 4, we can compute the probability of  being the output word as


which is exactly what is given by (37). It should not be hard to verify that


making the hierarchical softmax a well defined multinomial distribution among all words.

Now let us derive the parameter update equation for the vector representations of the inner units. For simplicity, we look at the one-word context model first. Extending the update equations to CBOW and skip-gram models is easy.

For the simplicity of notation, we define the following shortenizations without introducing ambiguity:


For a training instance, the error function is defined as


We take the derivative of  with regard to , obtaining


where  if  and  otherwise.

Next we take the derivative of  with regard to the vector representation of the inner unit  and obtain


which results in the following update equation:


which should be applied to . We can understand  as the prediction error for the inner unit . The “task” for each inner unit is to predict whether it should follow the left child or the right child in the random walk.  means the ground truth is to follow the left child;  means it should follow the right child.  is the prediction result. For a training instance, if the prediction of the inner unit is very close to the ground truth, then its vector representation  will move very little; otherwise  will move in an appropriate direction by moving (either closer or farther away555Again, the distance measurement is inner product. from ) so as to reduce the prediction error for this instance. This update equation can be used for both CBOW and the skip-gram model. When used for the skip-gram model, we need to repeat this update procedure for each of the  words in the output context.

In order to backpropagate the error to learn inputhidden weights, we take the derivative of  with regard to the output of the hidden layer and obtain

EH (54)

which can be directly substituted into (23) to obtain the update equation for the input vectors of CBOW. For the skip-gram model, we need to calculate a EH value for each word in the skip-gram context, and plug the sum of the EH values into (35) to obtain the update equation for the input vector.

From the update equations, we can see that the computational complexity per training instance per context word is reduced from  to , which is a big improvement in speed. We still have roughly the same number of parameters ( vectors for inner-units compared to originally  output vectors for words).

3.2 Negative Sampling

The idea of negative sampling is more straightforward than hierarchical softmax: in order to deal with the difficulty of having too many output vectors that need to be updated per iteration, we only update a sample of them.

Apparently the output word (i.e., the ground truth, or positive sample) should be kept in our sample and gets updated, and we need to sample a few words as negative samples (hence “negative sampling”). A probabilistic distribution is needed for the sampling process, and it can be arbitrarily chosen. We call this distribution the noise distribution, and denote it as . One can determine a good distribution empirically.666As described in (Mikolov et al., 2013b), word2vec uses a unigram distribution raised to the th power for the best quality of results.

In word2vec, instead of using a form of negative sampling that produces a well-defined posterior multinomial distribution, the authors argue that the following simplified training objective is capable of producing high-quality word embeddings:777Goldberg and Levy (2014) provide a theoretical analysis on the reason of using this objective function.


where  is the output word (i.e., the positive sample), and  is its output vector;  is the output value of the hidden layer in the CBOW model and  in the skip-gram model is the set of words that are sampled based on , i.e., negative samples.

To obtain the update equations of the word vectors under negative sampling, we first take the derivative of  with regard to the net input of the output unit :


where  is the “label” of word  when  is a positive sample;  otherwise. Next we take the derivative of  with regard to the output vector of the word ,


which results in the following update equation for its output vector:


which only needs to be applied to  instead of every word in the vocabulary. This shows why we may save a significant amount of computational effort per iteration.

The intuitive understanding of the above update equation should be the same as that of (11). This equation can be used for both CBOW and the skip-gram model. For the skip-gram model, we apply this equation for one context word at a time.

To backpropagate the error to the hidden layer and thus update the input vectors of words, we need to take the derivative of  with regard to the hidden layer’s output, obtaining


By plugging EH into (23) we obtain the update equation for the input vectors of the CBOW model. For the skip-gram model, we need to calculate a EH value for each word in the skip-gram context, and plug the sum of the EH values into (35) to obtain the update equation for the input vector.


The author would like to thank Eytan Adar, Qiaozhu Mei, Jian Tang, Dragomir Radev, Daniel Pressel, Thomas Dean, Sudeep Gandhe, Peter Lau, Luheng He, Tomas Mikolov, Hao Jiang, and Oded Shmueli for discussions on the topic and/or improving the writing of the note.


  • Goldberg and Levy (2014) Goldberg, Y. and Levy, O. (2014). word2vec explained: deriving mikolov et al.’s negative-sampling word-embedding method. arXiv:1402.3722 [cs, stat]arXiv: 1402.3722.
  • Mikolov et al. (2013a) Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Mikolov et al. (2013b) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
  • Mnih and Hinton (2009) Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language modelIn Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 1081–1088. Curran Associates, Inc.
  • Morin and Bengio (2005) Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language modelIn AISTATS, volume 5, pages 246–252. Citeseer.

Appendix A Back Propagation Basics

a.1 Learning Algorithms for a Single Unit

Figure 5 shows an artificial neuron (unit).  are input values;  are weights;  is a scalar output; and  is the link function (also called activation/decision/transfer function).

word2vec Parameter Learning Explained
Figure 5: An artificial neuron

The unit works in the following way:


where  is a scalar number, which is the net input (or “new input”) of the neuron is defined as


Using vector notation, we can write


Note that here we ignore the bias term in . To include a bias term, one can simply add an input dimension (e.g., ) that is constant 1.

Apparently, different link functions result in distinct behaviors of the neuron. We discuss two example choices of link functions here.

The first example choice of  is the unit step function (aka Heaviside step function):


A neuron with this link function is called a perceptron. The learning algorithm for a perceptron is the perceptron algorithm. Its update equation is defined as:


where  is the label (gold standard) and  is the learning rate . Note that a perceptron is a linear classifier, which means its description capacity can be very limited. If we want to fit more complex functions, we need to use a non-linear model.

The second example choice of  is the logistic function (a most common kind of sigmoid function), defined as


The logistic function has two primary good properties: (1) the output  is always between 0 and 1, and (2) unlike a unit step function,  is smooth and differentiable, making the derivation of update equation very easy.

Note that  also has the following two properties that can be very convenient and will be used in our subsequent derivations:


We use stochastic gradient descent as the learning algorithm of this model. In order to derive the update equation, we need to define the error function, i.e., the training objective. The following objective function seems to be convenient:


We take the derivative of  with regard to ,


where  because , and (68) and (69). Once we have the derivative, we can apply stochastic gradient descent:


a.2 Back-propagation with Multi-Layer Network

Figure 6 shows a multi-layer neural network with an input layer , a hidden layer , and an output layer . For clarity we use  as the subscript for input, hidden, and output layer units respectively. We use  and  to denote the net input of hidden layer units and output layer units respectively.

word2vec Parameter Learning Explained
Figure 6: A multi-layer neural network with one hidden layer

We want to derive the update equation for learning the weights  between the input and hidden layers, and  between the hidden and output layers. We assume that all the computation units (i.e., units in the hidden layer and the output layer) use the logistic function  as the link function. Therefore, for a unit  in the hidden layer, its output is defined as


Similarly, for a unit  in the output layer, its output is defined as


We use the squared sum error function given by


where , a  weight matrix (input-hidden), and , a  weight matrix (hidden-output). , a -dimension vector, which is the gold-standard labels of output.

To obtain the update equations for  and , we simply need to take the derivative of the error function  with regard to the weights respectively. To make the derivation straightforward, we do start computing the derivative for the right-most layer (i.e., the output layer), and then move left. For each layer, we split the computation into three steps, computing the derivative of the error with regard to the output, net input, and weight respectively. This process is shown below.

We start with the output layer. The first step is to compute the derivative of the error w.r.t. the output:


The second step is to compute the derivative of the error with regard to the net input of the output layer. Note that when taking derivatives with regard to something, we need to keep everything else fixed. Also note that this value is very important because it will be reused multiple times in subsequent computations. We denote it as  for simplicity.


The third step is to compute the derivative of the error with regard to the weight between the hidden layer and the output layer.


So far, we have obtained the update equation for weights between the hidden layer and the output layer.


where  is the learning rate.

We can repeat the same three steps to obtain the update equation for weights of the previous layer, which is essentially the idea of back propagation.

We repeat the first step and compute the derivative of the error with regard to the output of the hidden layer. Note that the output of the hidden layer is related to all units in the output layer.


Then we repeat the second step above to compute the derivative of the error with regard to the net input of the hidden layer. This value is again very important, and we denote it as .


Next we repeat the third step above to compute the derivative of the error with regard to the weights between the input layer and the hidden layer.


Finally, we can obtain the update equation for weights between the input layer and the hidden layer.


From the above example, we can see that the intermediate results () when computing the derivatives for one layer can be reused for the previous layer. Imagine there were another layer prior to the input layer, then  can also be reused to continue computing the chain of derivatives efficiently. Compare Equations (78) and (83), we may find that in (83), the factor  is just like the “error” of the hidden layer unit . We may interpret this term as the error “back-propagated” from the next layer, and this propagation may go back further if the network has more hidden layers.

Appendix B wevi: Word Embedding Visual Inspector

An interactive visual interface, wevi (word embedding visual inspector), is available online to demonstrate the working mechanism of the models described in this paper. See Figure 7 for a screenshot of wevi.

The demo allows the user to visually examine the movement of input vectors and output vectors as each training instance is consumed. The training process can be also run in batch mode (e.g., consuming 500 training instances in a row), which can reveal the emergence of patterns in the weight matrices and the corresponding word vectors. Principal component analysis (PCA) is employed to visualize the “high”-dimensional vectors in a 2D scatter plot. The demo supports both CBOW and skip-gram models.

After training the model, the user can manually activate one or multiple input-layer units, and inspect which hidden-layer units and output-layer units become active. The user can also customize training data, hidden layer size, and learning rate. Several preset training datasets are provided, which can generate different results that seem interesting, such as using a toy vocabulary to reproduce the famous word analogy: king – queen = man – woman.

It is hoped that by interacting with this demo one can quickly gain insights of the working mechanism of the model. The system is available at http://bit.ly/wevi-online. The source code is available at http://github.com/ronxin/wevi.

word2vec Parameter Learning Explained
Figure 7: wevi screenshot (http://bit.ly/wevi-online)

CSIT FUN , 版权所有丨如未注明 , 均为原创丨本网站采用BY-NC-SA协议进行授权
转载请注明原文链接:word2vec Parameter Learning Explained
喜欢 (0)
分享 (0)
表情 贴图 加粗 删除线 居中 斜体 签到


  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址