Unsupervised Learning: Word Embedding (Word Vectors)
These are study notes for Prof. Hung-yi Lee's Machine Learning course at National Taiwan University (NTU), so the text refers to the lecture videos throughout. I am still a beginner and my understanding is limited, so please do not hesitate to point out anything I got wrong.
Comments and discussion in the comment section are very welcome~
1. Why use word embedding (Word Embedding)
Before word embedding, the common representation of words was 1-of-N encoding (one-hot encoding), as shown in the figure below.
This representation has two main shortcomings. First, the vectors are mutually orthogonal, and precisely because of this orthogonality, the relationship between words with similar meanings is lost. Second, the encoding is very long: with a vocabulary of 100,000 words, each word needs a vector of length 100,000.
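The two shortcomings above can be seen directly in a small sketch (toy vocabulary; real vocabularies are on the order of 100,000 words, so the vectors would be 100,000-dimensional):

```python
import numpy as np

# Toy vocabulary. With a real vocabulary of ~100,000 words, each
# one-hot vector would be 100,000-dimensional (shortcoming 2).
vocab = ["dog", "cat", "apple"]

def one_hot(word, vocab):
    """Return the 1-of-N (one-hot) encoding of `word` over `vocab`."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

dog = one_hot("dog", vocab)
cat = one_hot("cat", vocab)

# Any two distinct one-hot vectors are orthogonal, so "dog" looks
# no more similar to "cat" than to "apple" (shortcoming 1).
print(np.dot(dog, cat))  # 0.0
```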
To overcome these shortcomings, the word embedding (Word Embedding) method was proposed. It maps words into a high-dimensional continuous space (though one of much lower dimension than the 1-of-N encoding), where similar words cluster together and dissimilar words are pulled apart. Each axis can be interpreted as an attribute that distinguishes words: in the figure above, for example, the horizontal axis can be read as separating living things from everything else, and the vertical axis as separating things that move from things that do not.
2. Why word embedding (Word Embedding) is unsupervised learning
During learning, we only know that the input is one encoding of a word and the output should be another encoding of that word, but we have no labels telling us what that target encoding should look like. That is why it is unsupervised learning.
One might want to learn word embeddings with an auto-encoder, but if the input is a 1-of-N encoding, this approach basically cannot work: the input vectors are mutually orthogonal and carry no relational information, so the auto-encoding process has nothing useful to extract from them.
3. Two approaches to word embedding (Word Embedding)
Word embedding (Word Embedding) methods mainly fall into two families: count based and prediction based.
3.1 Count based
The main idea of this method is shown in the figure below
If two words frequently appear together, their word vectors should be similar. Concretely, the inner product of two word vectors should be proportional to how often the two words co-occur. This is very similar to the matrix factorization covered in the previous lecture.
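A minimal sketch of the count-based idea, assuming a made-up co-occurrence matrix and using truncated SVD as one simple way to perform the factorization (the actual count-based methods from the lecture, such as GloVe-style models, use more refined objectives):

```python
import numpy as np

# Toy co-occurrence counts N[i][j]: how often word i and word j
# appear together in some corpus (the numbers here are made up).
words = ["dog", "cat", "apple", "pear"]
N = np.array([
    [0, 8, 1, 0],
    [8, 0, 0, 1],
    [1, 0, 0, 9],
    [0, 1, 9, 0],
], dtype=float)

# Find low-dimensional vectors such that V[i] . W[j] ~ N[i][j],
# here via truncated SVD: keep only the top `dim` singular values.
U, s, Vt = np.linalg.svd(N)
dim = 2
V = U[:, :dim] * np.sqrt(s[:dim])    # word vectors, one row per word
W = Vt[:dim].T * np.sqrt(s[:dim])    # context vectors, one row per word

# V @ W.T is the best rank-2 approximation of the count matrix, so
# frequently co-occurring pairs keep a large inner product.
approx = V @ W.T
print(approx.shape)  # (4, 4)
```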
3.2 Prediction based
The original idea is the method shown in the figure below
Here the input to the neural network is the word vector of the previous word w_{i-1} in 1-of-N encoding form, and the output is the probability of the next word w_i. Since the output is also in 1-of-N form, each output dimension represents the probability that w_i is one particular word. The values z of the first hidden layer (the 1-of-N input multiplied by the first-layer weights) are then taken as the word vector.
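A minimal numpy sketch of this forward pass, with made-up sizes and random weights (no training loop, just the structure: one-hot input, linear first layer whose weight rows are the word vectors, softmax output over the vocabulary):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5      # vocabulary size (toy)
dim = 3    # embedding dimension

# Row i of W_in is the word vector of word i: multiplying a one-hot
# vector by W_in simply selects that row.
W_in = rng.normal(size=(V, dim))
W_out = rng.normal(size=(dim, V))   # output layer: next-word scores

def predict_next(word_id):
    """P(w_i | w_{i-1}): feed one word, get a distribution over words."""
    z = W_in[word_id]               # first hidden layer = the word vector
    scores = z @ W_out
    e = np.exp(scores - scores.max())
    return e / e.sum()              # softmax: dim k = P(next word is k)

p = predict_next(2)
print(p.sum())  # probabilities over the vocabulary sum to 1
```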
In practice, the model does not use just one word to predict the next; it uses several preceding words to predict the one that follows. During training this involves something like weight sharing, as shown in the figure below.
We can see that input neurons in the same position share the same weights (drawn in the figure as lines of the same color). There are two main reasons for this. First, it ensures that the same word gets the same encoding no matter which input slot it appears in (because the weights are identical). Second, weight sharing reduces the number of parameters in the model.
So how do we ensure the shared weights stay identical during training? As shown in the figure below:
During gradient updates, first give the shared parameters the same initial value. Then, at each update step, subtract not only a parameter's own gradient but also the gradient of the corresponding parameter at the other position. This guarantees the two parameters follow exactly the same update trajectory.
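The tied update rule above can be sketched in a few lines (hypothetical scalar parameters and gradients, purely for illustration):

```python
def tied_update(w1, w2, g1, g2, lr=0.1):
    """Update two weight-shared parameters so they stay equal.

    Each parameter subtracts its own gradient AND the gradient of its
    shared partner; starting from the same initial value, the two
    therefore remain identical after every step.
    """
    w1 = w1 - lr * (g1 + g2)
    w2 = w2 - lr * (g2 + g1)
    return w1, w2

w1, w2 = 0.5, 0.5                     # same initial value
w1, w2 = tied_update(w1, w2, g1=0.3, g2=-0.1)
print(w1 == w2)  # True
```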
Besides predicting the following word from the preceding ones, we can also predict a middle word from the words on both sides of it, or predict the surrounding words from the middle one.
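These two variants (predicting the middle word from both sides, and the surrounding words from the middle) correspond to what word2vec calls CBOW and Skip-gram. A sketch of how their training pairs differ, with hypothetical helper names and a toy sentence:

```python
def cbow_pairs(tokens, window=1):
    """(context words -> middle word) training pairs."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

def skipgram_pairs(tokens, window=1):
    """(middle word -> each surrounding word) training pairs."""
    pairs = []
    for i in range(window, len(tokens) - window):
        for j in list(range(i - window, i)) + list(range(i + 1, i + window + 1)):
            pairs.append((tokens[i], tokens[j]))
    return pairs

sent = ["the", "dog", "chased", "the", "cat"]
print(cbow_pairs(sent)[0])      # both neighbors predict "dog"
print(skipgram_pairs(sent)[0])  # "dog" predicts one neighbor
```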
Although a neural network is used here, this is not deep learning: the model has only a single linear hidden layer. Deep versions were tried in the past, but they are hard to train, and since a single layer already achieves the desired effect, there is no need to go deep.
Experiments show that word vectors exhibit consistent relationships: for example, countries correspond well to their capitals, and the three forms of a verb (e.g. present, past, past participle) form a fairly stable triangular relationship.
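The country-capital correspondence is usually demonstrated with vector arithmetic. A sketch with hand-made 2-D "embeddings" chosen purely for illustration (real vectors would come from a trained model):

```python
import numpy as np

# Toy vectors: the country -> capital offsets are made parallel on
# purpose, mimicking what trained embeddings exhibit.
vec = {
    "Germany": np.array([1.0, 3.0]),
    "Berlin":  np.array([1.0, 1.0]),
    "France":  np.array([2.0, 3.0]),
    "Paris":   np.array([2.0, 1.0]),
    "Italy":   np.array([3.0, 3.0]),
    "Rome":    np.array([3.0, 1.0]),
}

# V(Berlin) - V(Germany) + V(France) should land near V(Paris).
query = vec["Berlin"] - vec["Germany"] + vec["France"]

def nearest(query, vec, exclude=()):
    """Return the word whose vector is closest (Euclidean) to `query`."""
    return min((w for w in vec if w not in exclude),
               key=lambda w: np.linalg.norm(vec[w] - query))

print(nearest(query, vec, exclude=("Berlin", "Germany", "France")))
```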
PS: Thanks to @mabowen110 for pointing out that I forgot to cite the source of this post. Much appreciated~