These are my learning notes on Mr. Li Hongyi's 《Machine Learning》 course at National Taiwan University (NTU), so the text refers to the content of the lecture videos many times. I am a beginner and my understanding is limited, so if you find any shortcomings, please do not hesitate to comment.

   Comments and discussion are welcome in the comment area ~

1. Why use word embedding (Word Embedding)

   Before word embedding, the 1-of-N Encoding method was commonly used, as shown in the figure below

This method has two main shortcomings. First, the representation is orthogonal, and it is precisely because of this orthogonality that the relationship between words with similar attributes is lost. Second, this coding method makes the code very long: for example, if there are 100,000 words, we need a vector of length 100,000 to encode each one.
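
To make the orthogonality problem concrete, here is a minimal sketch using NumPy. The five-word vocabulary and the helper `one_hot` are made up for illustration and do not come from the lecture.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration only).
vocab = ["apple", "bag", "cat", "dog", "elephant"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the 1-of-N encoding of `word` over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

cat, dog = one_hot("cat"), one_hot("dog")

# Any two different words are orthogonal, so "cat" looks exactly as
# unrelated to "dog" as it does to "bag".
similarity_cat_dog = float(cat @ dog)
similarity_cat_bag = float(cat @ one_hot("bag"))
similarity_cat_cat = float(cat @ cat)
```

Both `similarity_cat_dog` and `similarity_cat_bag` are zero, and the vector length equals the vocabulary size, which is the second shortcoming mentioned above.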

   To overcome these shortcomings, the word embedding (Word Embedding) method was proposed. This method maps each word into a high-dimensional space (although its dimension is still much lower than that of 1-of-N Encoding). Similar words end up close together, and different words are separated. Each axis can be seen as an attribute that distinguishes these words. For example, in the picture above, the abscissa can be regarded as the difference between living things and everything else, and the ordinate can be regarded as the difference between things that move and things that do not.
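
One way to picture the mapping is as a table with one short row per word. The sketch below assumes a vocabulary of 1000 words and 2 dimensions, and uses random numbers in place of learned values, so it only shows the shapes involved, not a trained embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1000, 2   # illustrative sizes (assumption)

# Random stand-in for the embedding table; in practice it is learned.
embedding = rng.normal(size=(vocab_size, embed_dim))

word_index = 42                   # hypothetical index of some word
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Multiplying a 1-of-N vector by the table just selects one row,
# turning a length-1000 code into a length-2 vector.
word_vector = one_hot @ embedding
```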

2. Why word embedding (Word Embedding) is unsupervised learning

   During learning, we only know that the input is one code for a word and the output is another code for that word, but we do not know what the output code should look like, so this is unsupervised learning.
   Some people may want to learn word embeddings with an auto-encoder. But if the input is in 1-of-N Encoding form, this basically does not work: the input vectors are unrelated to one another, so it is hard to extract any useful information through the auto-encoding process.

3. Two approaches to word embedding (Word Embedding)

   Word embedding (Word Embedding) methods fall mainly into two categories: count based and prediction based.

3.1 Count based

   The main idea of this method is shown in the figure below

If two words appear together frequently, their word vectors should be similar. So the dot product of the two word vectors should be directly proportional to the number of times the words co-occur, which is very similar to the matrix factorization in the previous lesson. For details, please refer to Glove Vector.
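
As a rough illustration of the count based idea (not the actual GloVe algorithm), the sketch below fits 2-dimensional vectors so that the dot product of two word vectors approximates their co-occurrence count. The four words, the counts in `N`, the learning rate, and the step count are all made up for this example.

```python
import numpy as np

words = ["cat", "dog", "road", "car"]
# Made-up symmetric co-occurrence counts N[i, j] (assumption).
N = np.array([
    [0, 3, 0, 0],
    [3, 0, 1, 0],
    [0, 1, 0, 3],
    [0, 0, 3, 0],
], dtype=float)

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(len(words), 2))   # one row per word
off_diagonal = ~np.eye(len(words), dtype=bool)    # ignore pairs (i, i)

def loss_and_grad(V):
    """Squared error between V_i . V_j and N[i, j] over all i != j."""
    residual = (V @ V.T - N) * off_diagonal
    loss = float((residual ** 2).sum())
    grad = 4.0 * residual @ V   # residual is symmetric, hence the factor 4
    return loss, grad

initial_loss, _ = loss_and_grad(V)
for _ in range(2000):
    _, grad = loss_and_grad(V)
    V -= 0.01 * grad            # plain gradient descent
final_loss, _ = loss_and_grad(V)

cat_dog = float(V[0] @ V[1])    # pair that co-occurs often
cat_car = float(V[0] @ V[3])    # pair that never co-occurs
```

After training, `cat_dog` should be clearly larger than `cat_car`, mirroring the counts the vectors were fitted to.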

3.2 Prediction based

   The original idea is the method shown in the figure below

Here the input of the neural network is the previous word $w_{i-1}$ in 1-of-N Encoding form. After passing through the neural network, the output should be the probability of each word being the next word $w_i$. Because the output is also in 1-of-N Encoding form, each dimension of the output is the probability of one particular word. The input to the first hidden layer, $z$, is then taken as the word vector.
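
The forward pass described above can be sketched as follows. This is a minimal sketch with untrained random weights; the sizes and the names `W`, `U`, and `softmax` are assumptions for illustration, and no training loop is included.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 6, 3   # illustrative sizes (assumption)

W = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))  # first layer
U = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))  # output layer

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

prev_index = 2                  # hypothetical index of the previous word
x = np.zeros(vocab_size)
x[prev_index] = 1.0             # 1-of-N Encoding of the previous word

z = W @ x                       # input to the first hidden layer
probs = softmax(U @ z)          # one probability per candidate next word
```

Because `x` is 1-of-N encoded, `z` is simply column `prev_index` of `W`, which is why `z` can serve as that word's vector.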

   In actual use, the model does not rely only on the relationship between one word and the next. Instead, several preceding words are used to predict the following word, and something similar to weight sharing is applied during training, as shown in the figure below

We can see that the input neurons in the same position share the same weight (in the figure, this is represented by lines of the same color). There are two main reasons for this. First, it ensures that the same word receives the same code (that is, the same weights) no matter which input position it appears in. Second, weight sharing reduces the number of parameters in the model.
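
Both reasons can be checked with a small sketch. The sizes, the random matrices, and the helper `one_hot` are assumptions for illustration; `W1` and `W2` stand for a hypothetical model without sharing, and `W` for the shared version.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 6, 3   # illustrative sizes (assumption)

def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

# Without sharing: one weight matrix per input position.
W1 = rng.normal(size=(hidden_dim, vocab_size))
W2 = rng.normal(size=(hidden_dim, vocab_size))
same_word = one_hot(4, vocab_size)
code_at_pos1 = W1 @ same_word   # code when the word is two steps back
code_at_pos2 = W2 @ same_word   # code when the word is one step back

# With sharing: a single matrix W is used for every position.
W = rng.normal(size=(hidden_dim, vocab_size))
x_prev2 = one_hot(1, vocab_size)
x_prev1 = one_hot(4, vocab_size)
z_shared = W @ (x_prev2 + x_prev1)

params_untied = W1.size + W2.size
params_shared = W.size
```

Here `code_at_pos1` and `code_at_pos2` differ even though the word is the same, while the shared model gives one code per word and needs half as many first-layer parameters.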

   So how do we ensure that they keep the same weight during training? As shown in the figure below

In the gradient update, first we give the shared parameters the same initial value. Second, when updating a parameter, we subtract not only its own gradient but also the gradient of the other neuron at the same position. This ensures that the two parameters receive exactly the same update.
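
The update rule can be sketched as below: two copies of the first-layer weights start equal, and each is updated with the sum of both gradients, so they stay equal. The sizes, learning rate, word indices, and the fixed output layer `U` are all assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, lr = 6, 3, 0.1   # illustrative values (assumption)

def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

W_init = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))
W1, W2 = W_init.copy(), W_init.copy()    # same initial value
U = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))  # kept fixed here

x1, x2 = one_hot(1, vocab_size), one_hot(4, vocab_size)   # two context words
target = 2                                # hypothetical next-word index

for _ in range(20):
    z = W1 @ x1 + W2 @ x2
    logits = U @ z
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    d_logits = probs - one_hot(target, vocab_size)  # cross-entropy gradient
    d_z = U.T @ d_logits
    grad_W1 = np.outer(d_z, x1)
    grad_W2 = np.outer(d_z, x2)
    # Each copy subtracts its own gradient AND the other copy's gradient.
    W1 -= lr * (grad_W1 + grad_W2)
    W2 -= lr * (grad_W2 + grad_W1)
```

After the loop, `W1` and `W2` are still identical even though each has moved away from `W_init`.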

   Besides predicting the next word from the previous ones, we can also predict the middle word from the words on both sides, or predict the words on both sides from the middle word

Although a neural network is used here, it is not deep learning: there is only a single linear hidden layer. This is mainly because deep methods were tried in the past but were hard to train. Besides, if one layer can already achieve the desired effect, why insist on a deep model?

   Experiments show that there are regular correspondences between word vectors. For example, countries correspond well to their capitals, and the three tenses of a verb form a relatively stable triangular relationship.
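
The country and capital correspondence is often used for analogy arithmetic. The sketch below uses hand-made 3-dimensional toy vectors, which are not real embeddings and are chosen only so that the arithmetic is easy to follow; the helper names `cosine` and `analogy` are also made up.

```python
import numpy as np

# Hand-made toy vectors (NOT trained embeddings): the first two axes
# loosely encode the country, the third axis encodes "is a capital".
vectors = {
    "Italy":   np.array([1.0, 0.0, 0.0]),
    "Rome":    np.array([1.0, 0.0, 1.0]),
    "Germany": np.array([0.0, 1.0, 0.0]),
    "Berlin":  np.array([0.0, 1.0, 1.0]),
    "Japan":   np.array([-1.0, -1.0, 0.0]),
    "Tokyo":   np.array([-1.0, -1.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word d such that a : b is like c : d."""
    query = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

answer = analogy("Italy", "Rome", "Germany")   # Italy : Rome = Germany : ?
```

With these toy vectors the offset from a country to its capital is the same for every pair, so `answer` is "Berlin"; real embeddings only satisfy this approximately.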

PS: Thank you @mabowen110 for pointing out that I forgot to add the source of the blog. Thank you very much ~