Study notes on <https://web.stanford.edu/~jurafsky/slp3/6.pdf>.

<>Documents and vectors

|        | As You Like It | Twelfth Night | Julius Caesar | Henry V |
| ------ | -------------- | ------------- | ------------- | ------- |
| battle | 1              | 0             | 7             | 13      |
| good   | 114            | 80            | 62            | 89      |
| fool   | 36             | 58            | 1             | 4       |
| wit    | 20             | 15            | 2             | 3       |

As You Like It ------> [ 1, 114, 36, 20]
Twelfth Night  ------> [ 0,  80, 58, 15]
Julius Caesar  ------> [ 7,  62,  1,  2]
Henry V        ------> [13,  89,  4,  3]

<>Words and vectors

|             | aardvark | … | computer | data | pinch | result | sugar |
| ----------- | -------- | - | -------- | ---- | ----- | ------ | ----- |
| apricot     | 0        | … | 0        | 0    | 1     | 0      | 1     |
| pineapple   | 0        | … | 0        | 0    | 1     | 0      | 1     |
| digital     | 0        | … | 2        | 1    | 0     | 1      | 0     |
| information | 0        | … | 1        | 6    | 0     | 4      | 0     |

digital ------> [ 0,..., 2, 1, 0, 1, 0]

<>Computing similarity with cosine

$$\text{dot-product}(\overrightarrow{v},\overrightarrow{w}) = \sum_{i=1}^N v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N$$

$$\vert\overrightarrow{v}\vert = \sqrt{\sum_{i=1}^N v_i^2}$$

$$\overrightarrow{a}\cdot\overrightarrow{b} = \vert\overrightarrow{a}\vert\,\vert\overrightarrow{b}\vert\cos\theta$$

$$\cos\theta = \frac{\overrightarrow{a}\cdot\overrightarrow{b}}{\vert\overrightarrow{a}\vert\,\vert\overrightarrow{b}\vert}$$

$$\cos(\overrightarrow{v},\overrightarrow{w}) = \frac{\overrightarrow{v}\cdot\overrightarrow{w}}{\vert\overrightarrow{v}\vert\,\vert\overrightarrow{w}\vert} = \frac{\sum_{i=1}^N v_i w_i}{\sqrt{\sum_{i=1}^N v_i^2}\,\sqrt{\sum_{i=1}^N w_i^2}}$$
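As a quick check, the cosine formula can be computed directly over the term-document count vectors from the first table. A minimal sketch (the function name is my own):

```python
import math

# Term-document count vectors from the table above:
# counts of battle, good, fool, wit in each play.
as_you_like_it = [1, 114, 36, 20]
twelfth_night = [0, 80, 58, 15]

def cosine(v, w):
    """cos(v, w) = (v . w) / (|v| * |w|)."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

print(cosine(as_you_like_it, twelfth_night))
```

The two comedies come out highly similar because both vectors are dominated by the "good" and "fool" counts.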

<>TF-IDF

TF-IDF = Term Frequency - Inverse Document Frequency

$$\text{tf}_{t,d} = \begin{cases} 1+\log_{10}\text{count}(t,d) & \text{if count}(t,d)>0 \\ 0 & \text{otherwise} \end{cases}$$

Here $\text{df}_t$ is the number of documents in which the term $t$ appears!

$$\text{idf}_t = \frac{N}{\text{df}_t}$$

$$\text{idf}_t = \log_{10}\left(\frac{N}{\text{df}_t}\right)$$

$$w_{t,d} = \text{tf}_{t,d}\times\text{idf}_t$$

|        | As You Like It | Twelfth Night | Julius Caesar | Henry V |
| ------ | -------------- | ------------- | ------------- | ------- |
| battle | 0.074          | 0             | 0.22          | 0.28    |
| good   | 0              | 0             | 0             | 0       |
| fool   | 0.019          | 0.021         | 0.0036        | 0.0083  |
| wit    | 0.049          | 0.044         | 0.018         | 0.022   |
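The tf, idf, and weight formulas above can be sketched directly. This is a minimal illustration; the mini-corpus numbers are hypothetical, not the Shakespeare counts:

```python
import math

def tf(count):
    """tf_{t,d} = 1 + log10(count(t,d)) if count > 0, else 0."""
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(n_docs, df):
    """idf_t = log10(N / df_t)."""
    return math.log10(n_docs / df)

def tfidf(count, n_docs, df):
    """w_{t,d} = tf_{t,d} * idf_t."""
    return tf(count) * idf(n_docs, df)

# Hypothetical mini-corpus of N = 4 documents: a term that occurs
# 7 times in one document and appears in 2 of the 4 documents.
print(tfidf(7, n_docs=4, df=2))
```

A term that appears in every document gets $\text{idf} = \log_{10}(1) = 0$, which is exactly why "good" has weight 0 in every column of the table above.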

<>Word2Vec

There is a hands-on tutorial at <https://juejin.im/post/5b986f296fb9a05d11176a15>, but it is still fairly rough. TensorFlow also has a tutorial, Vector Representations of Words
<https://www.tensorflow.org/tutorials/representation/word2vec>,
but without some background, several of its concepts are hard to grasp. So to understand word2vec reasonably completely, you need to combine material from multiple sources. While walking through the Stanford textbook, these notes also bring in other articles for comparison and reflection, in the hope of giving you a relatively complete picture.

<>word embedding

What is different about an embedding?

An embedding also represents a word as a vector, but it uses a much lower dimensionality and a dense representation.

$$\text{hello} \longrightarrow \underbrace{[0, 0, 0, 1, 2, 0, \dots, 0]}_{N \text{ numbers}}$$

$$\text{hello} \longrightarrow \underbrace{[0.012, 0.025, 0.001, 0.078, 0.056, 0.077, \dots, 0.022]}_{n \text{ numbers, typically around 100 to 500}}$$

* No dimension explosion, because we choose the dimensionality ourselves, and it is usually small
* The vectors are dense, so we do not need the special optimization algorithms that sparse vectors require for efficient computation

<>Data model

word2vec has two common ways of preparing the data:

* CBOW: use the context words to predict the target word
* skip-gram: use the target word to predict the context words

the quick brown fox jumped over the lazy dog

(fox, quick) (fox, brown) (fox, jumped) (fox, over)

([quick brown jumped over], fox)
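The skip-gram (target, context) pairs above can be generated with a sketch like this, assuming a window size of 2 to match the example (the function name is my own):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within a fixed window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
pairs = skipgram_pairs(sentence)
# Pairs with "fox" as the target: quick, brown, jumped, over
print([p for p in pairs if p[0] == "fox"])
```

CBOW groups the same window the other way around: all context words together predict the single target word.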

<https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html>

The skip-gram model

The CBOW model

! In other words, word2vec does not care which word would be predicted as the next word; it only cares whether the two words stand in a context relationship with each other.

! For logistic regression, see my other note, Logistic Regression
<https://luozhouyang.github.io/logistic_regression/>.

<>Neural language model

Let's start from the description of the **neural language model**.

$$P(w_t\vert h) = \text{softmax}(\text{score}(w_t,h)) = \frac{\exp{\,\text{score}(w_t,h)}}{\sum_{w' \in V}\exp{\,\text{score}(w',h)}}$$

$$J_{\text{ML}} = \log{P(w_t\vert h)} = \text{score}(w_t,h) - \log\left(\sum_{w' \in V}\exp{\,\text{score}(w',h)}\right)$$

This is because none of the sampled words is the correct target word. The model is shown in the figure below:

$$J_{\text{NEG}} = \log Q_\theta(D=1 \vert w_t, h) + k \mathop{\mathbb{E}}_{\tilde w \sim P_{\text{noise}}} \left[ \log Q_\theta(D = 0 \vert \tilde w, h) \right]$$

the probability of $w_t$.

<>The classifier

* Treat pairs of the target word and its context words as positive samples
* Randomly pick some other words, pair them with the target word, and treat those pairs as negative samples
* Use logistic regression to train a binary classifier that distinguishes the two cases
* The regression weights are our embeddings

word2vec trains a binary logistic regression that, given a tuple $(t,c)$ of a target word $t$ and a candidate context word $c$, returns the probability that $c$ really is a context word of $t$:

$$P(+\vert t,c)$$

$$P(-\vert t,c) = 1 - P(+\vert t,c)$$

$$\text{Similarity}(t,c) \approx t\cdot c$$

$$P(+\vert t,c) = \frac{1}{1+e^{-t\cdot c}}$$

$$P(-\vert t,c) = 1 - P(+\vert t,c) = \frac{e^{-t\cdot c}}{1+e^{-t\cdot c}}$$

$$P(+\vert t,c_{1:k}) = \prod_{i=1}^k\frac{1}{1+e^{-t\cdot c_i}}$$

$$\log{P(+\vert t,c_{1:k})} = \sum_{i=1}^k\log{\frac{1}{1+e^{-t\cdot c_i}}}$$
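The classifier probabilities above reduce to sigmoids of dot products. A minimal sketch with toy embeddings (all vector values here are made up for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_positive(t, c):
    """P(+|t, c) = sigma(t . c)."""
    return sigmoid(sum(ti * ci for ti, ci in zip(t, c)))

def log_p_positive_all(t, contexts):
    """log P(+|t, c_1..k) = sum_i log sigma(t . c_i)."""
    return sum(math.log(p_positive(t, c)) for c in contexts)

# Toy 3-dimensional embeddings (hypothetical values).
t = [0.5, -0.2, 0.3]
contexts = [[0.4, 0.1, 0.2], [-0.3, 0.2, 0.1]]
print(log_p_positive_all(t, contexts))
```

Taking logs turns the product over the $k$ context words into a sum, which is both numerically stabler and easier to differentiate during training.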

<>Training the skip-gram model

$$p_\alpha(w) = \frac{\text{count}(w)^\alpha}{\sum_{w'}\text{count}(w')^\alpha}$$

After applying this weighting:

$$P_\alpha(a) = 0.97$$

$$P_\alpha(b) = 0.03$$
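The weighted distribution above can be sketched with $\alpha = 0.75$, the value word2vec uses in practice, on a hypothetical two-word corpus whose raw unigram probabilities are 0.99 and 0.01:

```python
def weighted_unigram(counts, alpha=0.75):
    """p_alpha(w) = count(w)^alpha / sum_w' count(w')^alpha."""
    total = sum(c ** alpha for c in counts.values())
    return {w: (c ** alpha) / total for w, c in counts.items()}

# Hypothetical corpus of 100 tokens: 'a' appears 99 times and 'b' once,
# so the raw unigram probabilities are P(a) = 0.99 and P(b) = 0.01.
p = weighted_unigram({"a": 99, "b": 1})
print(round(p["a"], 2), round(p["b"], 2))  # → 0.97 0.03
```

Raising the counts to a power below 1 lifts the rare word's sampling probability from 0.01 to about 0.03, matching the numbers above, so rare words get drawn as negative samples more often than their raw frequency suggests.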

* Maximize the probability of the positive samples, i.e. maximize the similarity of the positive pairs
* Minimize the probability of the negative samples, i.e. minimize the similarity of the negative pairs

$$L(\theta) = \sum_{(t,c)\in +}\log P(+\vert t,c) + \sum_{(t,c)\in -}\log P(-\vert t,c)$$

For one positive pair and its $k$ negative samples, we have:

$$L(\theta) = \log P(+\vert t,c) + \sum_{i=1}^k\log P(-\vert t,n_i)$$

$$L(\theta) = \log\sigma(c\cdot t) + \sum_{i=1}^k\log\sigma(-n_i\cdot t)$$

$$L(\theta) = \log\frac{1}{1+e^{-c\cdot t}} + \sum_{i=1}^k\log\frac{1}{1+e^{n_i\cdot t}}$$

* Draw the samples and compute the probabilities
* Compute the loss with cross entropy
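The objective above, for one positive pair $(t, c)$ and $k$ sampled negative words $n_i$, can be sketched as follows (toy 2-dimensional embeddings with hypothetical values; in training, gradient ascent would push this quantity up):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def neg_sampling_objective(t, c, negatives):
    """L(theta) = log sigma(c.t) + sum_i log sigma(-n_i.t)."""
    value = math.log(sigmoid(dot(c, t)))
    for n in negatives:
        value += math.log(sigmoid(-dot(n, t)))
    return value

# Toy 2-dimensional embeddings (hypothetical values).
t = [0.6, -0.1]
c = [0.5, 0.2]
negatives = [[-0.4, 0.3], [0.1, -0.2]]
print(neg_sampling_objective(t, c, negatives))
```

The objective grows as the positive pair's dot product increases and as the negative pairs' dot products decrease, which is exactly the two bullet points above.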

<>The two weight matrices W and C

**If we want the vector representation of each word, we just take the corresponding row of $W$!** This is because every training word is one-hot encoded, so multiplying it directly with $W$ selects exactly that row.

For details, see <https://juejin.im/post/5b986f296fb9a05d11176a15>.

<>Recommended articles

1. Vector Representations of Words <https://www.tensorflow.org/tutorials/representation/word2vec>
2. 自己动手实现word2vec(skip-gram模型) <https://juejin.im/post/5b986f296fb9a05d11176a15>
3. Logistic Regression <https://luozhouyang.github.io/logistic_regression/>
4. Learning Word Embedding <https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html>

<>Contact me

* Email: stupidme.me.lzy@gmail.com <mailto:stupidme.me.lzy@gmail.com>
* WeChat: luozhouyang0528