1. Algorithm overview

0x1: Logistic regression

Logistic regression is a classical classification method in statistical learning. It belongs to the family of log-linear models (taking logarithms transforms it into a linear model).

From a probabilistic perspective, logistic regression is essentially a model of the conditional probability of the label given the features.

0x2: Maximum entropy criterion

The principle behind the maximum entropy model:
Acknowledge what is known (the given knowledge); make no assumptions about the unknown, without any bias.
When predicting the probability distribution of a random event, the prediction should satisfy all known conditions while making no subjective assumptions about the unknown. Under these circumstances, the most uniform probability distribution is the minimum-risk prediction.

Because such a distribution has the largest information entropy among all feasible ones, this model is called the "maximum entropy model" (Maximum Entropy).

0x3: The relationship between logistic regression and deep neural networks (DNN)

In the logistic regression model, the sigmoid output is a probability (a confidence). Taking the argmax of the conditional probability output of logistic regression yields the traditional logistic regression classifier; stacking such units layer upon layer yields an artificial neural network.


1. Logistic regression model

0x1: logistic distribution (logistic distribution)

Let X be a continuous random variable. X follows the logistic distribution if X has the following distribution function and density function:

F(x) = P(X ≤ x) = 1 / (1 + e^(-(x - μ)/γ))
f(x) = F'(x) = e^(-(x - μ)/γ) / ( γ (1 + e^(-(x - μ)/γ))^2 )

where μ is the location parameter and γ > 0 is the shape parameter.

The graphs of the logistic density function f(x) and the distribution function F(x) are shown in the figure below.

The distribution function F(x) is a logistic function; its graph is an S-shaped curve (sigmoid curve), symmetric about the point (μ, 1/2).

The curve grows fastest near the center and slowly at both ends. The smaller the shape parameter γ, the faster the sigmoid curve grows near the center.
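As a quick numerical check of the two formulas above, here is a minimal Python sketch (the parameter values μ = 0, γ = 1 and the probe points are illustrative choices, not from the text):

```python
import math

def logistic_cdf(x, mu=0.0, gamma=1.0):
    # distribution function F(x) = 1 / (1 + exp(-(x - mu) / gamma))
    return 1.0 / (1.0 + math.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    # density f(x) = F'(x) = exp(-(x - mu)/gamma) / (gamma * (1 + exp(-(x - mu)/gamma))**2)
    t = math.exp(-(x - mu) / gamma)
    return t / (gamma * (1.0 + t) ** 2)

# the S-curve is symmetric about the point (mu, 1/2)
print(logistic_cdf(0.0))                                   # 0.5 at the center
print(round(logistic_cdf(2.0) + logistic_cdf(-2.0), 10))   # symmetry: sums to 1.0
# a smaller shape parameter gamma gives a steeper curve (larger density) at the center
print(logistic_pdf(0.0, gamma=0.5), logistic_pdf(0.0, gamma=1.0))
```

The last line illustrates the remark about the shape parameter: γ = 0.5 gives density 0.5 at the center versus 0.25 for γ = 1.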

0x2: Binomial logistic regression model - binary classification

The binomial logistic regression model (binomial logistic regression model) is a classification model represented by the conditional probability distribution P(Y|X), a parameterized logistic distribution:

P(Y=1|x) = exp(w·x + b) / (1 + exp(w·x + b))
P(Y=0|x) = 1 / (1 + exp(w·x + b))

For a given input instance x, the formulas above give P(Y=1|x) and P(Y=0|x). Binomial logistic regression compares the two conditional probability values (much like the comparison of maximum posterior probabilities in naive Bayes) and assigns instance x to the class with the higher probability.

1. A conclusion verified with the help of binomial logistic regression: logistic regression is a log-linear model

Consider a property of the logistic regression model. The odds of an event is the ratio of the probability that the event occurs to the probability that it does not, odds = p / (1 - p). For binomial logistic regression, the log-odds (logit) is:

log [ P(Y=1|x) / (1 - P(Y=1|x)) ] = w·x + b

That is, in the logistic regression model the log-odds of the output Y = 1 is a linear function of the input x. Put differently, the model in which the log-odds of Y = 1 is represented by a linear function of the input x is exactly the logistic regression model.

It should be noted that this holds in many other settings as well, only in more complicated forms.
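The log-linear property can be verified numerically: the log-odds computed from the model's probability output recovers the linear function w·x + b exactly. A small sketch (the weights w, b and probe points are arbitrary illustrative values):

```python
import math

def p_y1(x, w, b):
    # P(Y=1|x) = exp(w*x + b) / (1 + exp(w*x + b))
    z = w * x + b
    return math.exp(z) / (1.0 + math.exp(z))

w, b = 2.0, -1.0
for x in [-3.0, 0.0, 1.5]:
    p = p_y1(x, w, b)
    log_odds = math.log(p / (1.0 - p))
    # log [ P(Y=1|x) / (1 - P(Y=1|x)) ] equals the linear function w*x + b
    print(x, round(log_odds, 6), round(w * x + b, 6))
```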

0x3: Multinomial logistic regression model - multiclass classification

The multinomial logistic regression model (multi-nominal logistic regression model) is used for multiclass classification. Let the discrete random variable Y take values in the set {1, 2, ..., K}. Then the multinomial logistic regression model is:

P(Y=k|x) = exp(w_k·x) / (1 + Σ_{k=1}^{K-1} exp(w_k·x)),  k = 1, 2, ..., K-1
P(Y=K|x) = 1 / (1 + Σ_{k=1}^{K-1} exp(w_k·x))
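A minimal sketch of the K-class formulas above; the weight vectors are made-up numbers, with K - 1 = 2 weight vectors giving K = 3 classes (class K is the reference class with score 0):

```python
import math

def multinomial_lr(x, W):
    # W holds the K-1 weight vectors w_k; class K has implicit score 0
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for w in W]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]   # P(Y=k|x), k = 1..K-1
    probs.append(1.0 / denom)                       # P(Y=K|x)
    return probs

x = [1.0, 2.0]
W = [[0.5, -0.2], [-1.0, 0.3]]
probs = multinomial_lr(x, W)
print([round(p, 4) for p in probs])   # a valid distribution over the 3 classes
```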


2. Maximum entropy model

0x1: Maximum entropy principle

The maximum entropy principle is a criterion in probability model learning. According to it, when learning a probability model, among all possible probability models (distributions) the model with the largest entropy is the best one. Constraints are generally used to determine the set of candidate probability models, so the maximum entropy principle can also be stated as: select the model with maximum entropy from the set of models satisfying the constraints.

Let the discrete random variable X have probability distribution P(X). Its entropy is:

H(P) = - Σ_x P(x) log P(x)

Entropy satisfies the inequality 0 ≤ H(P) ≤ log |X|, where |X| is the number of values X can take. The right equality holds if and only if the distribution of X is uniform; that is, entropy is maximal when X is uniformly distributed.
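The inequality above can be illustrated with a few hand-picked distributions over 5 values (the particular numbers are arbitrary examples):

```python
import math

def entropy(p):
    # H(P) = -sum_x P(x) log P(x), with 0 * log 0 treated as 0
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

point = [1.0, 0.0, 0.0, 0.0, 0.0]      # no uncertainty
skewed = [0.6, 0.1, 0.1, 0.1, 0.1]
uniform = [0.2] * 5                     # maximum uncertainty

print(entropy(point))                   # lower bound 0
print(round(entropy(skewed), 4))
print(round(entropy(uniform), 4), round(math.log(5), 4))  # upper bound log|X| = log 5
```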

Intuitively speaking:
according to the maximum entropy principle, the probability model to be selected must first satisfy the existing facts, i.e., the constraints; without more information, the uncertain parts are treated as "equally likely". The maximum entropy principle expresses equal likelihood by maximizing entropy; since "equally likely" is hard to operationalize in engineering, entropy provides a numerical objective that can be optimized.

1. An example illustrating the maximum entropy principle

Suppose a random variable X has 5 values {A, B, C, D, E}, and we want to estimate the probability of each value.

Solution: these probabilities satisfy the constraint:

P(A) + P(B) + P(C) + P(D) + P(E) = 1

The problem is that infinitely many probability distributions satisfy this constraint. If there is no other information, one way to estimate the distribution is to regard every value as equally probable:

P(A) = P(B) = P(C) = P(D) = P(E) = 1/5

Equal probability expresses ignorance of the facts; since there is no more information, this judgment is reasonable.

Sometimes, however, prior knowledge gives us additional constraints on the probabilities, for example:

P(A) + P(B) = 3/10

Infinitely many distributions still satisfy these two constraints. In the absence of other information, we can continue to apply the maximum entropy principle, namely:

P(A) = P(B) = 3/20
P(C) = P(D) = P(E) = 7/30

In the maximum likelihood estimation of the naive Bayes method, assigning a uniform prior in the absence of prior knowledge (i.e., maximum entropy) is one application of this "maximum entropy principle". If we do have some knowledge of the prior distribution before parameter estimation, the procedure becomes MAP estimation.
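The claim can be stress-tested numerically. Assuming, as in the classic textbook version of this example, that the extra constraint is P(A) + P(B) = 3/10, the candidate solution P(A) = P(B) = 3/20, P(C) = P(D) = P(E) = 7/30 should have entropy at least as large as any other feasible distribution; a brute-force sketch:

```python
import math
import random

def entropy(p):
    # H(P) = -sum p_i log p_i
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# candidate maximum entropy solution under the two constraints
claimed = [3/20, 3/20, 7/30, 7/30, 7/30]
h_star = entropy(claimed)

random.seed(0)
for _ in range(1000):
    # random feasible distribution: P(A) + P(B) = 3/10 and total mass 1
    a = random.random()
    p1, p2 = 0.3 * a, 0.3 * (1 - a)
    r = [random.random() + 1e-12 for _ in range(3)]
    s = sum(r)
    p = [p1, p2] + [0.7 * ri / s for ri in r]
    assert entropy(p) <= h_star + 1e-9

print(round(h_star, 4))   # no random feasible distribution beat the candidate
```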

0x2: Definition of maximum entropy model

The maximum entropy principle is a general principle of statistical learning; applying it to classification yields the maximum entropy model. Assume the classification model is a conditional probability distribution P(Y|X): given an input X, it outputs Y with conditional probability P(Y|X).

Given a training data set:

T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}

The goal of learning is to choose the best classification model based on the principle of maximum entropy .

Let us first define the constraints, namely: the expected value of the feature function f(x, y) with respect to the empirical distribution P~(X, Y) (estimated from the sample),

E_P~(f) = Σ_{x,y} P~(x, y) f(x, y)

and the expected value of the feature function with respect to the model P(Y|X) and the empirical distribution P~(X),

E_P(f) = Σ_{x,y} P~(x) P(y|x) f(x, y)

If the model is able to capture the information in the training data, we can assume that the two expected values are equal:

E_P(f) = E_P~(f)

We regard the above equation as a constraint condition of model learning. With n feature functions f_i(x, y), i = 1, 2, ..., n, there are n constraints.

1. Mathematical definition of the maximum entropy model

Suppose the set of models satisfying all the constraints above is:

C = { P | E_P(f_i) = E_P~(f_i), i = 1, 2, ..., n }

The conditional entropy defined on the conditional probability distribution P(Y|X) is:

H(P) = - Σ_{x,y} P~(x) P(y|x) log P(y|x)

The model in the set C with the largest conditional entropy H(P) is called the maximum entropy model.


3. Logistic regression model and maximum entropy model strategy

0x1: Constrained optimization problem - searching the maximum entropy model space

The learning process of the maximum entropy model is the process of solving for the maximum entropy model; it can be formalized as a constrained optimization problem.

For a given training data set T and feature functions f_i(x, y), i = 1, 2, ..., n, learning the maximum entropy model is equivalent to the constrained optimization problem:

max_{P∈C}  H(P) = - Σ_{x,y} P~(x) P(y|x) log P(y|x)
s.t.  E_P(f_i) = E_P~(f_i), i = 1, 2, ..., n
      Σ_y P(y|x) = 1

Following the convention of optimization, rewrite the maximization problem as the equivalent minimization problem:

min_{P∈C}  -H(P) = Σ_{x,y} P~(x) P(y|x) log P(y|x)
s.t.  E_P(f_i) - E_P~(f_i) = 0, i = 1, 2, ..., n
      Σ_y P(y|x) = 1

The solution of this constrained optimization problem is the model learned by maximum entropy learning. The derivation follows.

Transform the primal constrained optimization problem into the unconstrained dual problem, and solve the primal problem by solving the dual. First, introduce the Lagrange multipliers w_0, w_1, ..., w_n and define the Lagrangian function L(P, w):

L(P, w) = -H(P) + w_0 ( 1 - Σ_y P(y|x) ) + Σ_{i=1}^n w_i ( E_P~(f_i) - E_P(f_i) )

The primal optimization problem is:

min_{P∈C} max_w L(P, w)

The dual problem is:

max_w min_{P∈C} L(P, w)

First, solve the inner minimization problem of the dual, min_{P∈C} L(P, w). It is a function of w; write it as

Ψ(w) = min_{P∈C} L(P, w) = L(P_w, w)

called the dual function, and record its solution as P_w = argmin_{P∈C} L(P, w).

Specifically, take the partial derivative of L(P, w) with respect to P(y|x):

∂L(P, w)/∂P(y|x) = Σ_{x,y} P~(x) ( log P(y|x) + 1 ) - Σ_y w_0 - Σ_{x,y} ( P~(x) Σ_{i=1}^n w_i f_i(x, y) )

Set the partial derivative to 0. In the case P~(x) > 0, solving gives:

P(y|x) = exp ( Σ_{i=1}^n w_i f_i(x, y) + w_0 - 1 ) = exp ( Σ_{i=1}^n w_i f_i(x, y) ) / exp ( 1 - w_0 )

Because Σ_y P(y|x) = 1, we get:

P_w(y|x) = (1 / Z_w(x)) exp ( Σ_{i=1}^n w_i f_i(x, y) ), where:

Z_w(x) = Σ_y exp ( Σ_{i=1}^n w_i f_i(x, y) )

Z_w(x) is called the normalization factor; f_i(x, y) are the feature functions; w_i are the feature weights. The two expressions above define the maximum entropy model, and w is the parameter vector of the model.
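The parametric form P_w(y|x) = exp(Σ_i w_i f_i(x, y)) / Z_w(x) is easy to sketch directly. In the toy code below the feature functions and weights are invented purely for illustration:

```python
import math

def maxent_prob(x, labels, features, w):
    # score(y) = sum_i w_i * f_i(x, y)
    scores = {y: sum(wi * fi(x, y) for wi, fi in zip(w, features)) for y in labels}
    Z = sum(math.exp(s) for s in scores.values())     # normalization factor Z_w(x)
    return {y: math.exp(scores[y]) / Z for y in scores}

# hypothetical 0/1 feature functions firing on particular (x, y) combinations
features = [
    lambda x, y: 1.0 if ('sunny' in x and y == 'outdoor') else 0.0,
    lambda x, y: 1.0 if ('rain' in x and y == 'indoor') else 0.0,
]
w = [1.5, 2.0]    # feature weights

probs = maxent_prob({'sunny'}, ['indoor', 'outdoor'], features, w)
print(probs)      # a proper distribution; 'outdoor' gets the boost from feature 1
```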

Afterwards, solve the outer maximization problem of the dual, max_w Ψ(w), and write its solution as w*, namely w* = argmax_w Ψ(w).

1. The example revisited with the Lagrangian method

Earlier we solved this model estimation problem under finite constraints by directly applying the maximum entropy principle and intuition; here we solve the same constrained optimization problem again with the Lagrange multiplier method.

Denote A, B, C, D, E by y_1, y_2, y_3, y_4, y_5. The optimization problem of learning the maximum entropy model is:

min  -H(P) = Σ_{i=1}^5 P(y_i) log P(y_i)
s.t.  P(y_1) + P(y_2) = 3/10
      Σ_{i=1}^5 P(y_i) = 1

Introduce Lagrange multipliers w_0, w_1 and define the Lagrangian function:

L(P, w) = Σ_{i=1}^5 P(y_i) log P(y_i) + w_1 ( P(y_1) + P(y_2) - 3/10 ) + w_0 ( Σ_{i=1}^5 P(y_i) - 1 )

According to Lagrangian duality, the solution of the primal optimization problem can be obtained by solving the dual optimization problem max_w min_P L(P, w).

First solve the minimization over P: fix w_0 and w_1, and take the partial derivatives

∂L/∂P(y_1) = 1 + log P(y_1) + w_1 + w_0   (and likewise for P(y_2))
∂L/∂P(y_i) = 1 + log P(y_i) + w_0,   i = 3, 4, 5

Setting the partial derivatives to 0 gives:

P(y_1) = P(y_2) = e^(-w_1 - w_0 - 1)
P(y_3) = P(y_4) = P(y_5) = e^(-w_0 - 1)

Therefore:

min_P L(P, w) = -2 e^(-w_1 - w_0 - 1) - 3 e^(-w_0 - 1) - (3/10) w_1 - w_0

Then solve the maximization over w:

max_w ( -2 e^(-w_1 - w_0 - 1) - 3 e^(-w_0 - 1) - (3/10) w_1 - w_0 )

Take the partial derivatives with respect to w_0 and w_1 and set them to 0, obtaining:

e^(-w_1 - w_0 - 1) = 3/20,   e^(-w_0 - 1) = 7/30

so that:

P(y_1) = P(y_2) = 3/20,   P(y_3) = P(y_4) = P(y_5) = 7/30

As you can see, the result is consistent with the intuitive solution obtained earlier.

0x2: Maximum likelihood estimation - fitting the training sample by empirical risk minimization

In the previous section we saw that the maximum entropy model is the conditional probability distribution

P_w(y|x) = (1 / Z_w(x)) exp ( Σ_{i=1}^n w_i f_i(x, y) ),  Z_w(x) = Σ_y exp ( Σ_{i=1}^n w_i f_i(x, y) )

We now prove that maximizing the dual function is equivalent to maximum likelihood estimation of the maximum entropy model (the core ideas of the maximum entropy principle and the maximum likelihood principle coincide here).

Given the empirical probability distribution P~(X, Y) of the training data, the log-likelihood function of the conditional probability distribution P(Y|X) is:

L_P~(P_w) = log Π_{x,y} P(y|x)^{P~(x,y)} = Σ_{x,y} P~(x, y) log P(y|x)

When the conditional probability distribution P(y|x) is the maximum entropy model P_w(y|x), the log-likelihood function becomes:

L_P~(P_w) = Σ_{x,y} P~(x, y) Σ_{i=1}^n w_i f_i(x, y) - Σ_x P~(x) log Z_w(x)

The dual function Ψ(w) of the conditional-probability maximum entropy model works out to the same expression:

Ψ(w) = Σ_{x,y} P~(x, y) Σ_{i=1}^n w_i f_i(x, y) - Σ_x P~(x) log Z_w(x)

Comparing the two expressions:

Ψ(w) = L_P~(P_w)

That is, the dual function is equal to the log-likelihood function, which proves that maximizing the dual function in maximum entropy model learning is equivalent to maximum likelihood estimation of the maximum entropy model.

Thus the learning problem of the maximum entropy model is transformed into the problem of maximizing the log-likelihood function, or equivalently the dual function.

The maximum entropy model can be written in the more general form:

P_w(y|x) = (1 / Z_w(x)) exp ( Σ_{i=1}^n w_i f_i(x, y) ),  Z_w(x) = Σ_y exp ( Σ_{i=1}^n w_i f_i(x, y) )

where f_i(x, y) can be any real-valued feature function.

0x3: The equivalence between logistic regression and the maximum entropy model

The maximum entropy model is similar to the logistic regression model; both are called log-linear models (log linear model).

They also share a common form and a common strategy: model learning amounts to maximum likelihood estimation of the model on the given training data (following the maximum entropy principle), or to regularized maximum likelihood estimation.

In the maximum entropy model, for binary classification let the feature functions be

f_i(x, y) = x_i if y = 1, and 0 if y = 0

Then the maximum entropy model reduces to the logistic regression model.
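This reduction can be checked numerically: with feature functions f_i(x, y) = x_i for y = 1 and 0 for y = 0, the maximum entropy distribution over {0, 1} collapses to the sigmoid of w·x. A sketch with made-up weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def maxent_p_y1(x, w):
    # f_i(x, y) = x_i if y == 1 else 0, so score(y=1) = w.x and score(y=0) = 0
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    Z = math.exp(s1) + math.exp(0.0)    # Z_w(x) over the two labels
    return math.exp(s1) / Z

w = [0.7, -1.2, 0.4]
x = [1.0, 0.5, 2.0]
z = sum(wi * xi for wi, xi in zip(w, x))
print(maxent_p_y1(x, w), sigmoid(z))    # the two values coincide
```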


4. Logistic regression and maximum entropy - parameter estimation

Now that we have the maximum entropy principle and maximum likelihood estimation, can we not directly obtain the global optimum of the objective function (i.e., the conditional probability)? As with decision tree learning, where maximum information gain is only a guiding principle, engineering practice needs an efficient numerical procedure to find the "best parameters".

Learning the logistic regression model or the maximum entropy model reduces to an optimization problem with the likelihood function as the objective. Because the objective contains many product terms, solving for the extremum analytically is very difficult, so iterative algorithms are used (the same is true in deep neural networks). From an optimization point of view, the objective function here has good properties: it is a smooth convex function, so many optimization methods apply and are guaranteed to find the global optimum. Commonly used methods are the improved iterative scaling method, gradient descent, Newton's method, and quasi-Newton methods.
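Of the iterative methods just listed, plain gradient descent on the negative log-likelihood of logistic regression is the simplest to sketch. The 1-D data, learning rate and iteration count below are arbitrary toy choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# toy separable data: negative x -> label 0, positive x -> label 1
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
Y = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    gw = gb = 0.0
    for x, y in zip(X, Y):
        p = sigmoid(w * x + b)
        # gradient of the negative log-likelihood w.r.t. w and b
        gw += (p - y) * x
        gb += (p - y)
    w -= lr * gw
    b -= lr * gb

preds = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in X]
print(w, b, preds)   # preds reproduce the labels
```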

0x1: Improved iterative scaling method

The improved iterative scaling method (improved iterative scaling, IIS) is an optimization algorithm for learning the maximum entropy model.

The maximum entropy model is known to be:

P_w(y|x) = (1 / Z_w(x)) exp ( Σ_{i=1}^n w_i f_i(x, y) ),  Z_w(x) = Σ_y exp ( Σ_{i=1}^n w_i f_i(x, y) )

The log-likelihood function is:

L(w) = Σ_{x,y} P~(x, y) Σ_{i=1}^n w_i f_i(x, y) - Σ_x P~(x) log Z_w(x)

The goal is to estimate the model parameters by maximum likelihood, that is, to find the maximum of the log-likelihood function. Because the likelihood is built from the model's probabilities (and the probability model here is the maximum entropy model), maximizing the log-likelihood yields the optimal maximum entropy model.

The idea of IIS is:

suppose the current parameter vector of the maximum entropy model is w = (w_1, w_2, ..., w_n)^T; we want to find a new parameter vector w + δ = (w_1 + δ_1, ..., w_n + δ_n)^T such that the model's log-likelihood increases. If such a parameter-update method exists, it can be applied repeatedly until the maximum of the log-likelihood function is reached.


5. Discussion: connections and differences between the logistic regression model and the perceptron model

We know that the perceptron model classifies an input x with a linear function whose score ranges over the whole real line. Taking a univariate (1-dimensional feature) perceptron as an example, the perceptron's decision function is in effect a step function.

A better decision function f should satisfy:

* f(x;β) ∈ [0,1].
* f should at least be a continuous function, because we want the model output f to vary smoothly with x.
* f should be as simple as possible.
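The contrast between the perceptron's step decision function and its smooth replacement can be made concrete with a small sketch (the probe points are arbitrary; the threshold is at 0):

```python
import math

def step(z):
    # perceptron-style decision function: a discontinuous jump at z = 0
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    # logistic replacement: continuous, and bounded in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

for z in [-2.0, -0.1, 0.1, 2.0]:
    print(z, step(z), round(sigmoid(z), 4))
# the step output jumps from 0 to 1 at z = 0, while the sigmoid changes smoothly
```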



6. Implementing the maximum entropy model in Python (on the MNIST dataset)

# -*- coding: utf-8 -*-
import time
import math
from collections import defaultdict

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class MaxEnt(object):
    def init_params(self, X, Y):
        self.X_ = X
        self.Y_ = set()
        self.cal_Pxy_Px(X, Y)          # count (x, y) pairs and x occurrences
        self.N = len(X)                # training set size
        self.n = len(self.Pxy)         # number of distinct (x, y) pairs, n in the book
        self.M = 10000.0               # the M on p.91 of the book; acts like a learning rate
        self.build_dict()
        self.cal_EPxy()                # E_P~(f): expectation of f(x,y) w.r.t. P~(X,Y)

    def build_dict(self):
        self.id2xy = {}
        self.xy2id = {}
        for i, (x, y) in enumerate(self.Pxy):
            self.id2xy[i] = (x, y)
            self.xy2id[(x, y)] = i

    def cal_Pxy_Px(self, X, Y):
        self.Pxy = defaultdict(int)
        self.Px = defaultdict(int)
        for i in range(len(X)):
            x_, y = X[i], Y[i]
            self.Y_.add(y)
            for x in x_:
                self.Pxy[(x, y)] += 1
                self.Px[x] += 1

    def cal_EPxy(self):
        # empirical expectation at the bottom of p.82 of the book;
        # f(x,y) is a 0/1 indicator, so count / total = P~(x, y)
        self.EPxy = defaultdict(float)
        for fid in range(self.n):
            (x, y) = self.id2xy[fid]
            self.EPxy[fid] = float(self.Pxy[(x, y)]) / float(self.N)

    def cal_pyx(self, X, y):
        result = 0.0
        for x in X:
            if self.fxy(x, y):
                result += self.w[self.xy2id[(x, y)]]
        return (math.exp(result), y)

    def cal_probality(self, X):
        # conditional probability P(y|x), Eq. 6.22 on p.85 of the book
        Pyxs = [self.cal_pyx(X, y) for y in self.Y_]
        Z = sum(prob for prob, y in Pyxs)          # normalization factor Z_w(x)
        return [(prob / Z, y) for prob, y in Pyxs]

    def cal_EPx(self):
        # model expectation at the top of p.83: E_P(f) w.r.t. P(y|x) and P~(x)
        self.EPx = [0.0 for _ in range(self.n)]
        for i, X in enumerate(self.X_):
            Pyxs = self.cal_probality(X)
            for x in X:
                for Pyx, y in Pyxs:
                    if self.fxy(x, y):
                        self.EPx[self.xy2id[(x, y)]] += Pyx * (1.0 / self.N)

    def fxy(self, x, y):
        return (x, y) in self.xy2id

    def train(self, X, Y):
        self.init_params(X, Y)                     # counts, expectations, dictionaries
        self.w = [0.0 for _ in range(self.n)]      # initialize all feature weights to 0
        max_iteration = 50
        for times in range(max_iteration):
            print('iterate times %d' % times)
            self.cal_EPx()
            # Eq. 6.34 on p.91: delta_i = (1/M) * log(E_P~(f_i) / E_P(f_i))
            sigmas = [1 / self.M * math.log(self.EPxy[i] / self.EPx[i])
                      for i in range(self.n)]
            # one pass updates the weights of all feature functions at once
            self.w = [self.w[i] + sigmas[i] for i in range(self.n)]

    def predict(self, testset):
        results = []
        for test in testset:
            # argmax_y P(y|x) under the trained maximum entropy model
            result = self.cal_probality(test)
            results.append(max(result, key=lambda x: x[0])[1])
        return results


def rebuild_features(features):
    # turn (a0, a1, a2, ...) into ('0_a0', '1_a1', '2_a2', ...) so that each
    # (pixel position, pixel value) pair becomes a distinct binary feature
    return [[str(i) + '_' + str(f) for i, f in enumerate(feature)]
            for feature in features]


if __name__ == "__main__":
    print('Start read data')
    time_1 = time.time()
    raw_data = pd.read_csv('./data/train_binary.csv', header=0)
    data = raw_data.values
    imgs = data[0::, 1::]
    labels = data[::, 0]
    # 2/3 of the data for training, 1/3 for testing
    train_features, test_features, train_labels, test_labels = train_test_split(
        imgs, labels, test_size=0.33, random_state=23323)
    train_features = rebuild_features(train_features)
    test_features = rebuild_features(test_features)
    time_2 = time.time()
    print('read data cost ', time_2 - time_1, ' second', '\n')
    print('Start training')
    met = MaxEnt()
    met.train(train_features, train_labels)
    time_3 = time.time()
    print('training cost ', time_3 - time_2, ' second', '\n')
    print('Start predicting')
    test_predict = met.predict(test_features)
    time_4 = time.time()
    print('predicting cost ', time_4 - time_3, ' second', '\n')
    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is ", score)

7. Experimenting with logistic regression in scikit-learn

0x1: MNIST classification using multinomial logistic + L1

# -*- coding: utf-8 -*-
import time
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state

# Turn down for faster convergence
t0 = time.time()
train_samples = 5000
# fetch_mldata('MNIST original') has been removed from scikit-learn;
# fetch_openml is the current way to load MNIST
mnist = fetch_openml('mnist_784', as_frame=False)
X = mnist.data.astype('float64')
y = mnist.target
random_state = check_random_state(0)
permutation = random_state.permutation(X.shape[0])
X = X[permutation]
y = y[permutation]
X = X.reshape((X.shape[0], -1))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_samples, test_size=10000)
# feature standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Turn up tolerance for faster convergence
clf = LogisticRegression(C=50. / train_samples, multi_class='multinomial',
                         penalty='l1', solver='saga', tol=0.1)
clf.fit(X_train, y_train)
sparsity = np.mean(clf.coef_ == 0) * 100
score = clf.score(X_test, y_test)
print("Sparsity with L1 penalty: %.2f%%" % sparsity)
print("Test score with L1 penalty: %.4f" % score)
coef = clf.coef_.copy()
plt.figure(figsize=(10, 5))
scale = np.abs(coef).max()
for i in range(10):
    l1_plot = plt.subplot(2, 5, i + 1)
    l1_plot.imshow(coef[i].reshape(28, 28), interpolation='nearest',
                   cmap=plt.cm.RdBu, vmin=-scale, vmax=scale)
    l1_plot.set_xticks(())
    l1_plot.set_yticks(())
    l1_plot.set_xlabel('Class %i' % i)
plt.suptitle('Classification vector for...')
run_time = time.time() - t0
print('Example run in %.3f s' % run_time)
plt.show()
0x2: Logistic regression vs. linear regression on 1-D data

# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# this is our test set, it's just a straight line with some Gaussian noise
xmin, xmax = -5, 5
n_samples = 100
np.random.seed(0)
X = np.random.normal(size=n_samples)
y = (X > 0).astype(float)
X[X > 0] *= 4
X += .3 * np.random.normal(size=n_samples)
X = X[:, np.newaxis]

# run the classifier
clf = linear_model.LogisticRegression(C=1e5)
clf.fit(X, y)

# and plot the result
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.scatter(X.ravel(), y, color='black', zorder=20)
X_test = np.linspace(-5, 10, 300)

def model(x):
    return 1 / (1 + np.exp(-x))

loss = model(X_test * clf.coef_ + clf.intercept_).ravel()
plt.plot(X_test, loss, color='red', linewidth=3)

ols = linear_model.LinearRegression()
ols.fit(X, y)
plt.plot(X_test, ols.coef_ * X_test + ols.intercept_, linewidth=1)
plt.axhline(.5, color='.5')
plt.ylabel('y')
plt.xlabel('X')
plt.xticks(range(-5, 10))
plt.yticks([0, 0.5, 1])
plt.ylim(-.25, 1.25)
plt.xlim(-4, 10)
plt.legend(('Logistic Regression Model', 'Linear Regression Model'),
           loc="lower right", fontsize='small')
plt.show()

Copyright (c) 2017 LittleHann All rights reserved