National Taiwan University (NTU) Prof. Li Hongyi's《Machine Learning》
<https://www.coursera.org/learn/ntumlone-mathematicalfoundations>
These are my learning notes for the course, so the text refers to the content of the lecture videos in many places. I am still a beginner and my writing is far from polished; if anything is lacking, please do not hesitate to point it out in the comments.

   You are welcome to leave comments and interact in the comment area~~~~

1. Why use word embedding (Word Embedding)

   Before word embedding, the 1-of-N Encoding method was commonly used, as shown in the figure below




This approach has two main shortcomings. First, the representation is orthogonal, and it is precisely this orthogonality that weakens the relationship between words with similar attributes. Second, this encoding makes the codeword very long: for example, with 100,000 words, we need a vector of length 100,000 to encode each word.

   To overcome these shortcomings, the word embedding (Word Embedding) method was proposed. This method maps each word into a high-dimensional space (though still of much lower dimension than the 1-of-N Encoding), so that similar words cluster together while dissimilar words are kept apart. Each axis can be seen as an attribute that distinguishes the words; for example, in the figure above, the horizontal axis can be seen as separating living things from everything else, and the vertical axis as separating things that move from things that do not.
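   To make the difference concrete, here is a minimal sketch (the toy vocabulary and the 2-dimensional embedding are made up purely for illustration; a real embedding matrix would be learned, not random):

```python
import numpy as np

# Toy vocabulary; in practice this could contain 100,000 words.
vocab = ["dog", "cat", "apple", "run", "jump"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_of_n(word):
    """1-of-N Encoding: a vector as long as the vocabulary, with a single 1."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Word embedding: each word is instead mapped to a short dense vector
# (2-dimensional here only for illustration), so similar words can sit close together.
embedding_matrix = np.random.randn(len(vocab), 2)

def embed(word):
    return embedding_matrix[word_to_idx[word]]

print(one_of_n("dog"))  # length-5, orthogonal to every other word's code
print(embed("dog"))     # length-2 dense vector
```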

2. Why word embedding (Word Embedding) is unsupervised learning

   Because during learning we only know that the input is one encoding of a word and the output is another encoding of that word, but we do not know what that encoding should look like, so it is unsupervised learning.
   Some people may want to realize word embedding with an auto-encoder, but if the input is in 1-of-N Encoding form, this basically cannot work: the input vectors are unrelated to one another, so it is hard to extract any useful information through the auto-encoding process.

3. Two approaches to word embedding (Word Embedding)

   Word embedding (Word Embedding) is mainly done in two ways: count based (based on statistics) and prediction based.

3.1 Count based (based on statistics)

   The main idea of this method is shown in the figure below





If two words appear together more frequently, then their word vectors should be more similar. Therefore the dot product of two word vectors should be directly proportional to the number of times the two words co-occur; this is very similar to the matrix factorization from the previous lesson. For details, please refer to
Glove Vector: <http://nlp.stanford.edu/projects/glove/>
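   A minimal sketch of the count-based idea (this is not the actual GloVe objective, which uses log counts, bias terms and a weighting function; the co-occurrence counts below are made up):

```python
import numpy as np

# Made-up co-occurrence counts N[i][j]: how often word i and word j appear together.
N = np.array([[0., 8., 1.],
              [8., 0., 2.],
              [1., 2., 0.]])

dim, lr = 2, 0.01
V = np.random.randn(3, dim) * 0.1      # one vector per word

# Learn the vectors so that the dot product V[i].V[j] approximates N[i][j]:
# words that co-occur often end up with similar (high dot product) vectors.
for _ in range(2000):
    for i in range(3):
        for j in range(3):
            if i == j:
                continue
            err = V[i] @ V[j] - N[i, j]
            V[i], V[j] = V[i] - lr * err * V[j], V[j] - lr * err * V[i]

print(V @ V.T)  # off-diagonal entries should roughly match N
```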

3.2 Prediction based

   The original idea behind this method is shown in the figure below




Here the input to the neural network is the word vector of the previous word w_{i-1} (in 1-of-N Encoding form). After passing through the neural network, the output gives, for each word, the probability that it is the next word w_i; because the output is also in 1-of-N Encoding form, each dimension of the output is the probability of one particular word. The input of the first hidden layer, z, is then taken as the word vector.
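   A minimal sketch of this prediction-based setup, under assumed toy sizes (5 words, 2-dimensional hidden layer) and random, untrained weights:

```python
import numpy as np

V, dim = 5, 2                        # toy vocabulary size and hidden-layer size
W1 = np.random.randn(dim, V) * 0.1   # first-layer weights; z = W1 @ x will be the word vector
W2 = np.random.randn(V, dim) * 0.1   # output layer: one score per word in the vocabulary

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(prev_word_idx):
    x = np.zeros(V)
    x[prev_word_idx] = 1.0           # 1-of-N Encoding of the previous word w_{i-1}
    z = W1 @ x                       # input of the first (linear) hidden layer
    return z, softmax(W2 @ z)        # probability of each word being the next word w_i

z, probs = forward(3)
print("word vector (z) of word 3:", z)
print("distribution over the next word:", probs)
```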

   In actual use, we do not model just the relationship between one word and the next; instead, several preceding words are used to predict the word that follows. During training there is something similar to weight sharing, as shown in the figure below





We can see that input neurons at corresponding positions share the same weights (represented in the figure by lines of the same color). There are two main reasons for this: first, it guarantees that the same word gets the same encoding (that is, the same word vector) no matter which input position it is fed into; second, weight sharing reduces the number of parameters in the model.
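   A minimal sketch of what the sharing means in the forward pass (toy sizes, two preceding words; the single matrix W plays the role of the tied weights):

```python
import numpy as np

V, dim = 5, 2
W = np.random.randn(dim, V) * 0.1    # ONE weight matrix, reused for every input position

def one_hot(idx):
    x = np.zeros(V)
    x[idx] = 1.0
    return x

def hidden(prev2_idx, prev1_idx):
    # Both context words pass through the very same W, so a word receives the same
    # vector regardless of its position, and no extra parameters are introduced.
    return W @ one_hot(prev2_idx) + W @ one_hot(prev1_idx)

print(hidden(1, 3))
```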

   So how do we ensure that these weights remain the same during training? As shown in the figure below





During the gradient update, we first give the shared parameters the same initial value; then, at each update step, a parameter subtracts not only its own gradient but also the gradient of the tied parameter at the corresponding position, which ensures that the two parameters are updated in exactly the same way.
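   A small sketch of that update rule, with made-up scalar values standing in for the two tied weights and their gradients:

```python
lr = 0.1
w1 = w2 = 0.5                 # shared parameters start from the same initial value

# Suppose back-propagation produced these (made-up) gradients for the two copies.
grad1, grad2 = 0.3, -0.1

# Each copy subtracts BOTH gradients, so the two copies receive identical updates
# and therefore stay equal after every step.
w1 = w1 - lr * grad1 - lr * grad2
w2 = w2 - lr * grad2 - lr * grad1

print(w1, w2)                 # still equal: 0.48 0.48
```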

   Besides predicting the following word from the preceding ones, we can also predict the middle word from the words on both sides (the CBOW variant), or predict the words on both sides from the middle word (the Skip-gram variant).
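   A small sketch of how training pairs would be built for these two variants, on a made-up toy sentence:

```python
# Toy sentence, purely for illustration.
sentence = ["the", "dog", "chased", "the", "cat"]

# Predict the middle word from the words on both sides (CBOW-style pairs).
cbow_pairs = [((sentence[i - 1], sentence[i + 1]), sentence[i])
              for i in range(1, len(sentence) - 1)]

# Predict the words on both sides from the middle word (Skip-gram-style pairs).
skipgram_pairs = [(sentence[i], neighbor)
                  for i in range(1, len(sentence) - 1)
                  for neighbor in (sentence[i - 1], sentence[i + 1])]

print(cbow_pairs)
print(skipgram_pairs)
```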




Although a neural network is used here, this is not deep learning: there is only a single linear hidden layer. This is mainly because deep methods were tried in the past but turned out to be hard to train, and since a single layer can already achieve the desired effect, why insist on going deep?

   Experiments show that there are consistent relationships between word vectors: for example, countries correspond well to their capitals, and the three tenses of a verb form a fairly stable triangular relationship.
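   This correspondence is what makes analogy arithmetic possible; a minimal sketch with made-up 2-D vectors (real embeddings would be learned and much higher-dimensional):

```python
import numpy as np

# Made-up 2-D vectors chosen so that country/capital pairs share the same offset.
vec = {
    "Italy":   np.array([1.0, 3.0]),
    "Rome":    np.array([1.5, 1.0]),
    "Germany": np.array([3.0, 3.2]),
    "Berlin":  np.array([3.5, 1.2]),
}

# If V(capital) - V(country) is roughly constant, then
# V(Berlin) - V(Germany) + V(Italy) should land near V(Rome).
query = vec["Berlin"] - vec["Germany"] + vec["Italy"]
closest = min(vec, key=lambda w: np.linalg.norm(vec[w] - query))
print(closest)  # with these toy numbers: Rome
```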




PS: Thanks to netizen @mabowen110 for pointing out that I forgot to add the source of this blog. Thank you very much~