1. Algorithm overview

0x1: Logistic regression

Logistic regression is a classical classification method in statistical learning. It belongs to the family of log-linear models (it can be reduced to a linear model after a logarithmic transformation).

From a probabilistic point of view, logistic regression essentially models the conditional probability of the class label given the input features.

0x2: Maximum entropy criterion

The principle behind the maximum entropy model is:
Acknowledge what is known (the available knowledge and constraints); make no assumptions about what is unknown, i.e. introduce no bias.
When predicting the probability distribution of a random event, the prediction should satisfy all known constraints and make no subjective assumptions about anything else. Under these circumstances the most uniform probability distribution gives the minimum-risk prediction.

Because this distribution has the largest information entropy among all distributions satisfying the constraints, the model is called the maximum entropy model (Maximum Entropy).

0x3: The relationship between logistic regression and deep neural networks (DNN)


In the logistic regression model the sigmoid output is a probability, i.e. a confidence score. Taking the argmax of this conditional probability output directly gives the traditional logistic regression classifier; stacking such sigmoid units layer upon layer yields an artificial neural network.
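
To make the stacking idea concrete, here is a minimal sketch (the weights and input below are hypothetical, chosen only for illustration): a single sigmoid unit is exactly a logistic regression classifier, and composing such units gives a tiny neural network.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single logistic-regression "unit": P(Y=1|x) = sigmoid(w.x + b)
w = np.array([0.5, -1.2, 0.3])   # hypothetical weights
b = 0.1                          # hypothetical bias
x = np.array([1.0, 2.0, -0.5])
p = sigmoid(w @ x + b)
y_hat = int(p >= 0.5)            # argmax over {P(Y=0|x), P(Y=1|x)}

# Stacking the same kind of unit layer upon layer gives a small neural network
W1 = np.random.randn(4, 3) * 0.1  # layer 1: 3 inputs -> 4 hidden units
W2 = np.random.randn(1, 4) * 0.1  # layer 2: 4 hidden units -> 1 output
hidden = sigmoid(W1 @ x)
p_nn = sigmoid(W2 @ hidden)
print(p, y_hat, p_nn)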

Relevant Link:
https://www.cnblogs.com/little-YTMM/p/5582271.html
 

1. Logistic regression model

0x1: The logistic distribution

Let X be a continuous random variable. X follows the logistic distribution if X has the following distribution function and density function:

F(x) = P(X \le x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}

f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1 + e^{-(x-\mu)/\gamma}\right)^2}

In these formulas, \mu is the location parameter and \gamma > 0 is the shape parameter.

The graphs of the density function and the distribution function of the logistic distribution are shown in the figure below.

[Figure: density function f(x) and distribution function F(x) of the logistic distribution]

The distribution function F(x) is a logistic function. Its graph is an S-shaped (sigmoid) curve, symmetric about the point (\mu, 1/2). The curve grows fastest near the center and changes slowly at the two tails; the smaller the shape parameter \gamma, the faster the curve grows near the center.
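
A minimal numerical sketch of these two functions (the parameter values below are arbitrary, chosen only for illustration):

import numpy as np

def logistic_cdf(x, mu=0.0, gamma=1.0):
    # Distribution function F(x) of the logistic distribution
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    # Density function f(x) = F'(x)
    z = np.exp(-(x - mu) / gamma)
    return z / (gamma * (1.0 + z) ** 2)

x = np.linspace(-6, 6, 7)
print(logistic_cdf(x))   # S-shaped curve, F(mu) = 0.5
print(logistic_pdf(x))   # bell-shaped curve, maximal at x = mu
# a smaller gamma makes the curve steeper around the center:
print(logistic_cdf(0.5, gamma=0.2), logistic_cdf(0.5, gamma=2.0))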

0x2: Binomial logistic regression model - binary classification

The binomial logistic regression model (binomial logistic regression model) is a classification model defined by the conditional probability distribution P(Y|X), a parameterized form of the logistic distribution:

P(Y=1 \mid x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}, \qquad P(Y=0 \mid x) = \frac{1}{1 + \exp(w \cdot x + b)}

For a given input instance x, the two conditional probabilities P(Y=1|x) and P(Y=0|x) can be computed from the formulas above. Binomial logistic regression compares the two values (much like comparing maximum posterior probabilities in naive Bayes) and assigns x to the class with the larger probability.

1. Verifying a claim with binomial logistic regression: logistic regression is a log-linear model

Let us examine one characteristic of the logistic regression model. The odds of an event are the ratio of the probability that the event occurs to the probability that it does not occur, odds = p / (1 - p). For binomial logistic regression, the log-odds (logit) is:

\log \frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = w \cdot x

That is, in the logistic regression model the log-odds of the output Y = 1 is a linear function of the input x; equivalently, the model in which the log-odds of Y = 1 is represented by a linear function of the input x is the logistic regression model.

It should be noted that this linear relationship holds by construction for the model; the true relationship between x and the log-odds in real data may be more complicated.
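
A quick numerical check of this identity (the weight vector and input are hypothetical, chosen only for illustration):

import numpy as np

w = np.array([1.5, -0.7])   # hypothetical weight vector (bias folded into w if desired)
x = np.array([0.4, 2.0])

linear = w @ x
p1 = np.exp(linear) / (1.0 + np.exp(linear))   # P(Y=1|x)
log_odds = np.log(p1 / (1.0 - p1))

print(linear, log_odds)   # the two values coincide: the logit is linear in x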

0x3: Multinomial logistic regression model - multi-class classification

The multinomial logistic regression model (multinomial logistic regression model) is used for multi-class classification. Let the discrete random variable Y take values in the set {1, 2, ..., K}. The multinomial logistic regression model is then:

P(Y=k \mid x) = \frac{\exp(w_k \cdot x)}{1 + \sum_{k=1}^{K-1} \exp(w_k \cdot x)}, \quad k = 1, 2, \ldots, K-1, \qquad P(Y=K \mid x) = \frac{1}{1 + \sum_{k=1}^{K-1} \exp(w_k \cdot x)}
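
A minimal sketch of this parameterization (the weight vectors and input below are hypothetical):

import numpy as np

K = 3                                   # number of classes
W = np.array([[0.2, -0.5],              # w_1
              [1.0,  0.3]])             # w_2 (class K carries no weight vector)
x = np.array([0.8, -1.1])

scores = np.exp(W @ x)                  # exp(w_k . x) for k = 1, ..., K-1
Z = 1.0 + scores.sum()
probs = np.append(scores / Z, 1.0 / Z)  # P(Y=1|x), ..., P(Y=K|x)

print(probs, probs.sum())               # a valid distribution summing to 1
print(int(np.argmax(probs)) + 1)        # predicted class label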

 

2. Maximum entropy model

0x1: Maximum entropy principle


The maximum entropy principle is a criterion for selecting probability models. According to this principle, when learning a probability model, among all possible models (distributions) the one with the largest entropy is the best. Usually constraints are used to determine the set of candidate models, so the maximum entropy principle can also be stated as: choose the model with the maximum entropy from the set of models that satisfy the constraints.

Suppose the discrete random variable X has probability distribution P(X). Its entropy is:

H(P) = -\sum_{x} P(x) \log P(x)

The entropy satisfies the inequality

0 \le H(P) \le \log |X|

where |X| is the number of values X can take. The right-hand equality holds if and only if the distribution of X is uniform; that is, the entropy is maximal when X is uniformly distributed.
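
A small sketch illustrating that, for a fixed support size, the uniform distribution attains the maximal entropy (the skewed distribution is an arbitrary example):

import numpy as np

def entropy(p):
    # Shannon entropy H(P) = -sum p log p, treating 0 log 0 as 0
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

uniform = np.ones(5) / 5
skewed = np.array([0.7, 0.1, 0.1, 0.05, 0.05])

print(entropy(uniform), np.log(5))   # equal: entropy attains the bound log|X|
print(entropy(skewed))               # strictly smaller than log(5)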

Intuitively speaking: according to the maximum entropy principle, the probability model we choose must first satisfy the existing facts, i.e. the constraints. In the absence of further information, the remaining uncertain parts are treated as "equally likely". The maximum entropy principle expresses this equal likelihood by maximizing the entropy, because "equally likely" is hard to operationalize directly, whereas entropy is a numerical quantity that can be optimized.
1. An example to illustrate the principle of maximum entropy

Suppose the random variable X has 5 possible values {A, B, C, D, E}, and we want to estimate the probability of each value.

Solution: these probabilities must satisfy the constraint

P(A) + P(B) + P(C) + P(D) + P(E) = 1

The problem is that infinitely many probability distributions satisfy this constraint. If there is no other information, one way to estimate the distribution is to treat every value as equally likely:

P(A) = P(B) = P(C) = P(D) = P(E) = \frac{1}{5}

Assigning equal probabilities expresses our ignorance about the facts; since there is no more information, this judgment is reasonable.

Sometimes, however, prior knowledge gives us additional constraints on the probabilities. For example (the classical textbook version of this example adds the constraint):

P(A) + P(B) = \frac{3}{10}

There are still infinitely many probability distributions satisfying these two constraints. In the absence of other information, we can continue to apply the maximum entropy principle, which gives:

P(A) = P(B) = \frac{3}{20}, \qquad P(C) = P(D) = P(E) = \frac{7}{30}

In the maximum likelihood estimation of the naive Bayes method, assigning a uniform prior when no prior knowledge is available (i.e. maximum entropy) is exactly one application of this "maximum entropy principle". If instead we do have some knowledge of the prior distribution before parameter estimation, the estimate becomes a MAP estimate.

0x2: Definition of maximum entropy model

The maximum entropy principle is a general principle of statistical learning; applying it to classification gives the maximum entropy model. Assume the classification model is a conditional probability distribution P(Y|X): given an input X, it outputs Y with conditional probability P(Y|X).

Given a training data set

T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}

the goal of learning is to use the maximum entropy principle to select the best classification model.

Let us first define the constraints. The expected value of a feature function f(x, y) with respect to the empirical joint distribution \tilde{P}(X, Y) (estimated from the sample) is

E_{\tilde{P}}(f) = \sum_{x,y} \tilde{P}(x, y) f(x, y)

and the expected value of the feature function with respect to the model P(Y|X) and the empirical marginal distribution \tilde{P}(X) is

E_{P}(f) = \sum_{x,y} \tilde{P}(x) P(y \mid x) f(x, y)

If the model is able to capture the information in the training data, we can assume that these two expectations are equal:

E_{P}(f) = E_{\tilde{P}}(f)

We take this equality as the constraint condition for model learning. If n feature functions f_i(x, y), i = 1, 2, \ldots, n, are introduced, then there are n constraints.

1. Mathematical definition of maximum entropy model

Suppose the set of models satisfying all the constraints is

C = \{ P \mid E_{P}(f_i) = E_{\tilde{P}}(f_i), \; i = 1, 2, \ldots, n \}

The conditional entropy defined on the conditional probability distribution P(Y|X) is

H(P) = -\sum_{x,y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)

The model in the set C with the largest conditional entropy H(P) is called the maximum entropy model.
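
A small sketch of how the conditional entropy H(P) is computed from an empirical marginal \tilde{P}(x) and a candidate model P(y|x) (the toy numbers are hypothetical):

import numpy as np

# Hypothetical empirical marginal P~(x) and candidate model P(y|x)
P_tilde_x = {"x1": 0.6, "x2": 0.4}
P_y_given_x = {
    "x1": {"y1": 0.5, "y2": 0.5},
    "x2": {"y1": 0.9, "y2": 0.1},
}

def conditional_entropy(P_tilde_x, P_y_given_x):
    # H(P) = -sum_{x,y} P~(x) P(y|x) log P(y|x)
    h = 0.0
    for x, px in P_tilde_x.items():
        for y, pyx in P_y_given_x[x].items():
            if pyx > 0:
                h -= px * pyx * np.log(pyx)
    return h

print(conditional_entropy(P_tilde_x, P_y_given_x))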

 

3. Logistic regression model and maximum entropy model strategy

0x1: Constrained optimization - searching the space of maximum entropy models

The learning process of the maximum entropy model is the process of solving for the maximum entropy model; this learning can be formalized as a constrained optimization problem.

For a given training data set T = \{(x_1, y_1), \ldots, (x_N, y_N)\} and feature functions f_i(x, y), i = 1, 2, \ldots, n, learning the maximum entropy model is equivalent to the constrained optimization problem

\max_{P \in C} \; H(P) = -\sum_{x,y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)
\text{s.t.} \quad E_{P}(f_i) = E_{\tilde{P}}(f_i), \; i = 1, 2, \ldots, n, \qquad \sum_{y} P(y \mid x) = 1

Following the usual convention, the maximization problem is rewritten as the equivalent minimization problem

\min_{P \in C} \; -H(P) = \sum_{x,y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)
\text{s.t.} \quad E_{P}(f_i) - E_{\tilde{P}}(f_i) = 0, \; i = 1, 2, \ldots, n, \qquad \sum_{y} P(y \mid x) = 1

The solution of this constrained optimization problem is the solution learned for the maximum entropy model. The derivation follows.


The constrained primal optimization problem is transformed into an unconstrained dual optimization problem, and the primal problem is solved by solving the dual. First, introduce Lagrange multipliers w_0, w_1, \ldots, w_n and define the Lagrangian function L(P, w):

L(P, w) \equiv -H(P) + w_0 \Bigl(1 - \sum_{y} P(y \mid x)\Bigr) + \sum_{i=1}^{n} w_i \bigl(E_{\tilde{P}}(f_i) - E_{P}(f_i)\bigr)

The primal optimization problem is

\min_{P \in C} \max_{w} L(P, w)

The dual problem is

\max_{w} \min_{P \in C} L(P, w)

First, solve the inner minimization problem of the dual, \min_{P \in C} L(P, w). It is a function of w; denote it by

\Psi(w) = \min_{P \in C} L(P, w) = L(P_w, w)

which is called the dual function. At the same time, denote its solution by

P_w = \arg\min_{P \in C} L(P, w) = P_w(y \mid x)

Concretely, take the partial derivative of L(P, w) with respect to P(y|x):

\frac{\partial L(P, w)}{\partial P(y \mid x)} = \sum_{x,y} \tilde{P}(x) \Bigl(\log P(y \mid x) + 1 - w_0 - \sum_{i=1}^{n} w_i f_i(x, y)\Bigr)

Setting the partial derivative to 0 and using \tilde{P}(x) > 0, we solve and obtain

P(y \mid x) = \exp\Bigl(\sum_{i=1}^{n} w_i f_i(x, y) + w_0 - 1\Bigr) = \frac{\exp\bigl(\sum_{i=1}^{n} w_i f_i(x, y)\bigr)}{\exp(1 - w_0)}

Since \sum_{y} P(y \mid x) = 1, we get

P_w(y \mid x) = \frac{1}{Z_w(x)} \exp\Bigl(\sum_{i=1}^{n} w_i f_i(x, y)\Bigr), \qquad Z_w(x) = \sum_{y} \exp\Bigl(\sum_{i=1}^{n} w_i f_i(x, y)\Bigr)

Here Z_w(x) is called the normalization factor, f_i(x, y) are the feature functions, and w_i are the feature weights. These two expressions define the maximum entropy model, and w is its parameter vector.
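
A minimal sketch of evaluating this parametric form P_w(y|x) with binary feature functions (the feature functions, weights, and labels below are hypothetical):

import math

# Hypothetical binary feature functions f_i(x, y) and weights w_i
features = [
    lambda x, y: 1.0 if (x == "sunny" and y == "play") else 0.0,
    lambda x, y: 1.0 if (x == "rainy" and y == "stay") else 0.0,
]
w = [0.8, 1.2]
labels = ["play", "stay"]

def p_w(y, x):
    # P_w(y|x) = exp(sum_i w_i f_i(x, y)) / Z_w(x)
    def score(label):
        return math.exp(sum(wi * f(x, label) for wi, f in zip(w, features)))
    Z = sum(score(label) for label in labels)   # normalization factor Z_w(x)
    return score(y) / Z

print(p_w("play", "sunny"), p_w("stay", "sunny"))   # the two values sum to 1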

Afterwards, solve the outer maximization problem of the dual, \max_{w} \Psi(w). Denote its solution by w^*, namely

w^* = \arg\max_{w} \Psi(w)

1. The earlier example, solved with the Lagrange multiplier method

Earlier we solved a model estimation problem under finite constraints by directly applying the maximum entropy principle and intuition; here we solve the same constrained optimization problem again, this time using the Lagrange multiplier method.

Let y_1, y_2, y_3, y_4, y_5 denote A, B, C, D, E respectively. The optimization problem for learning the maximum entropy model is then

\min \; -H(P) = \sum_{i=1}^{5} P(y_i) \log P(y_i)
\text{s.t.} \quad P(y_1) + P(y_2) = \frac{3}{10}, \qquad \sum_{i=1}^{5} P(y_i) = 1

Introduce Lagrange multipliers w_0 and w_1 and define the Lagrangian function

L(P, w) = \sum_{i=1}^{5} P(y_i) \log P(y_i) + w_1 \Bigl(P(y_1) + P(y_2) - \frac{3}{10}\Bigr) + w_0 \Bigl(\sum_{i=1}^{5} P(y_i) - 1\Bigr)

According to Lagrangian duality, the solution of the primal optimization problem can be obtained by solving the dual optimization problem \max_{w} \min_{P} L(P, w).

First solve the minimization of L(P, w) with respect to P. To do so, fix w_0 and w_1 and take the partial derivatives:

\frac{\partial L}{\partial P(y_1)} = 1 + \log P(y_1) + w_1 + w_0, \qquad \frac{\partial L}{\partial P(y_2)} = 1 + \log P(y_2) + w_1 + w_0, \qquad \frac{\partial L}{\partial P(y_i)} = 1 + \log P(y_i) + w_0, \; i = 3, 4, 5

Setting each partial derivative to 0 and solving gives

P(y_1) = P(y_2) = e^{-w_1 - w_0 - 1}, \qquad P(y_3) = P(y_4) = P(y_5) = e^{-w_0 - 1}

Therefore

\min_{P} L(P, w) = L(P_w, w) = -2 e^{-w_1 - w_0 - 1} - 3 e^{-w_0 - 1} - \frac{3}{10} w_1 - w_0

Next, solve the maximization with respect to w:

\max_{w} \; -2 e^{-w_1 - w_0 - 1} - 3 e^{-w_0 - 1} - \frac{3}{10} w_1 - w_0

Taking the partial derivatives with respect to w_0 and w_1 and setting them to 0 gives

e^{-w_1 - w_0 - 1} = \frac{3}{20}, \qquad e^{-w_0 - 1} = \frac{7}{30}

and therefore

P(y_1) = P(y_2) = \frac{3}{20}, \qquad P(y_3) = P(y_4) = P(y_5) = \frac{7}{30}

As we can see, the result agrees with the earlier intuitive solution.
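
A quick numerical cross-check of this result, maximizing the entropy directly under the same (assumed) 3/10 constraint with scipy:

import numpy as np
from scipy.optimize import minimize

# Maximize H(P) subject to P(y1) + P(y2) = 3/10 and sum(P) = 1 (assumed constraints)
def neg_entropy(p):
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3.0 / 10.0},
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
]
p0 = np.full(5, 0.2)
res = minimize(neg_entropy, p0, bounds=[(1e-9, 1.0)] * 5, constraints=constraints)

print(np.round(res.x, 4))   # approximately [0.15, 0.15, 0.2333, 0.2333, 0.2333]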

0x2: Maximum likelihood estimation - fitting the training sample via empirical risk minimization

In the last section we saw that the maximum entropy model is the conditional probability distribution represented by

P_w(y \mid x) = \frac{1}{Z_w(x)} \exp\Bigl(\sum_{i=1}^{n} w_i f_i(x, y)\Bigr), \qquad Z_w(x) = \sum_{y} \exp\Bigl(\sum_{i=1}^{n} w_i f_i(x, y)\Bigr)

Let us now prove that maximizing the dual function is equivalent to maximum likelihood estimation of the maximum entropy model (here the core ideas of the maximum entropy principle and the maximum likelihood principle coincide).

Given the empirical probability distribution \tilde{P}(X, Y) of the training data, the log-likelihood function of the conditional probability distribution P(Y|X) is

L_{\tilde{P}}(P_w) = \log \prod_{x,y} P(y \mid x)^{\tilde{P}(x,y)} = \sum_{x,y} \tilde{P}(x, y) \log P(y \mid x)

When the conditional probability distribution is the maximum entropy model P_w(y|x), the log-likelihood function becomes

L_{\tilde{P}}(P_w) = \sum_{x,y} \tilde{P}(x, y) \sum_{i=1}^{n} w_i f_i(x, y) - \sum_{x} \tilde{P}(x) \log Z_w(x)

On the other hand, expanding the dual function \Psi(w) of the maximum entropy model gives

\Psi(w) = \sum_{x,y} \tilde{P}(x) P_w(y \mid x) \log P_w(y \mid x) + \sum_{i=1}^{n} w_i \Bigl(\sum_{x,y} \tilde{P}(x, y) f_i(x, y) - \sum_{x,y} \tilde{P}(x) P_w(y \mid x) f_i(x, y)\Bigr) = \sum_{x,y} \tilde{P}(x, y) \sum_{i=1}^{n} w_i f_i(x, y) - \sum_{x} \tilde{P}(x) \log Z_w(x)

Comparing the two expressions:

\Psi(w) = L_{\tilde{P}}(P_w)

That is, the dual function equals the log-likelihood function. This proves that maximizing the dual function in maximum entropy model learning is equivalent to maximum likelihood estimation of the maximum entropy model.

In this way, the learning problem of the maximum entropy model is transformed into the problem of maximizing the log-likelihood function, or equivalently the dual function.

The maximum entropy model can be written in the more general form

P_w(y \mid x) = \frac{1}{Z_w(x)} \exp\Bigl(\sum_{i=1}^{n} w_i f_i(x, y)\Bigr), \qquad Z_w(x) = \sum_{y} \exp\Bigl(\sum_{i=1}^{n} w_i f_i(x, y)\Bigr)

where the f_i(x, y) may be any real-valued feature functions.

0x3: The equivalence between logistic regression and maximum entropy model

The maximum entropy model and the logistic regression model have similar forms; both are called log-linear models (log linear model).

They also share a common learning strategy: model learning amounts to maximum likelihood estimation of the model on the given training data (following the maximum entropy principle), or to regularized maximum likelihood estimation.

In the maximum entropy model, let Y \in \{0, 1\} and choose the feature functions

f_i(x, y) = \begin{cases} x_i, & y = 1 \\ 0, & y = 0 \end{cases}

Then

P_w(Y=1 \mid x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}, \qquad P_w(Y=0 \mid x) = \frac{1}{1 + \exp(w \cdot x)}

and we recover the logistic regression model.
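
A small numerical sketch of this reduction (the weights and input are hypothetical): with the indicator-style features above, the maximum entropy model and the sigmoid of w \cdot x give the same probability.

import numpy as np

w = np.array([0.4, -1.0, 2.0])   # hypothetical weights
x = np.array([1.0, 0.5, -0.2])

# Maximum entropy model with f_i(x, y) = x_i if y == 1 else 0
score_y1 = np.exp(w @ x)   # exp(sum_i w_i f_i(x, 1))
score_y0 = np.exp(0.0)     # exp(sum_i w_i f_i(x, 0))
p1_maxent = score_y1 / (score_y1 + score_y0)

# Logistic regression
p1_logreg = 1.0 / (1.0 + np.exp(-(w @ x)))

print(p1_maxent, p1_logreg)   # identical values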

Relevant Link:
http://blog.csdn.net/foryoundsc/article/details/71374893
 

4. Logistic regression and maximum entropy algorithms - parameter estimation


Now that we have the maximum entropy principle and maximum likelihood estimation, can we not obtain the global optimum of the objective function (i.e. of the conditional probability) directly? As in decision tree learning, where maximizing information gain is only a guiding principle, an engineering implementation still needs an efficient and fast way to obtain the "optimal parameters".


Learning the logistic regression model or the maximum entropy model reduces to an optimization problem whose objective is the likelihood function. Because the objective contains many multiplicative terms, finding the extremum analytically is very difficult, so it is solved by iterative algorithms (the same holds in deep neural networks). From the optimization point of view the objective function has good properties: it is a smooth convex function, so many optimization methods apply and are guaranteed to find the global optimum. Commonly used methods are the improved iterative scaling method, gradient descent, Newton's method, and quasi-Newton methods.

0x1: Improved iterative scaling method

The improved iterative scaling method (improved iterative scaling, IIS) is an optimization algorithm for learning the maximum entropy model.

The maximum entropy model is known to be

P_w(y \mid x) = \frac{1}{Z_w(x)} \exp\Bigl(\sum_{i=1}^{n} w_i f_i(x, y)\Bigr), \qquad Z_w(x) = \sum_{y} \exp\Bigl(\sum_{i=1}^{n} w_i f_i(x, y)\Bigr)

and its log-likelihood function is

L(w) = \sum_{x,y} \tilde{P}(x, y) \sum_{i=1}^{n} w_i f_i(x, y) - \sum_{x} \tilde{P}(x) \log Z_w(x)

The goal is to estimate the model parameters by maximum likelihood, i.e. to find the maximum of the log-likelihood function. Since the likelihood is built from the model's probabilities (and the probability model here is the maximum entropy model itself), finding the maximum of the log-likelihood is the same as finding the optimal maximum entropy model.

The idea of IIS is as follows.

Suppose the current parameter vector of the maximum entropy model is w = (w_1, w_2, \ldots, w_n)^T. We want to find a new parameter vector w + \delta = (w_1 + \delta_1, \ldots, w_n + \delta_n)^T such that the log-likelihood of the model increases. If such an update rule exists, it can be applied repeatedly (iteratively) until the maximum of the log-likelihood is reached.

For the given empirical distribution (in the current iteration), when the parameters move from w to w + \delta, the change in the log-likelihood is

L(w + \delta) - L(w) = \sum_{x,y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) - \sum_{x} \tilde{P}(x) \log \frac{Z_{w+\delta}(x)}{Z_w(x)}

Using the inequality

-\log \alpha \ge 1 - \alpha, \qquad \alpha > 0

we establish a lower bound on the change of the log-likelihood:

L(w + \delta) - L(w) \ge \sum_{x,y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_w(y \mid x) \exp\Bigl(\sum_{i=1}^{n} \delta_i f_i(x, y)\Bigr)

Denote the right-hand side by A(\delta \mid w). Then

L(w + \delta) - L(w) \ge A(\delta \mid w)

That is, A(\delta \mid w) is a lower bound on the change of the log-likelihood in this iteration. If a suitable \delta can be found that raises this lower bound, the log-likelihood will increase as well.

However, \delta in A(\delta \mid w) is a vector with many components, which are hard to optimize simultaneously. IIS tries to optimize only one component \delta_i at a time while keeping the other components fixed.

To this end, IIS further loosens the lower bound. Concretely, IIS introduces the quantity

f^{\#}(x, y) = \sum_{i} f_i(x, y)

Since the f_i are binary feature functions, f^{\#}(x, y) is the number of features that fire on (x, y). A(\delta \mid w) can then be rewritten as

A(\delta \mid w) = \sum_{x,y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_w(y \mid x) \exp\Bigl(f^{\#}(x, y) \sum_{i=1}^{n} \frac{\delta_i f_i(x, y)}{f^{\#}(x, y)}\Bigr)

Applying Jensen's inequality to the (convex) exponential, this can be bounded below by

\sum_{x,y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_w(y \mid x) \sum_{i=1}^{n} \frac{f_i(x, y)}{f^{\#}(x, y)} \exp\bigl(\delta_i f^{\#}(x, y)\bigr)

Denote the right-hand side by B(\delta \mid w), so that

L(w + \delta) - L(w) \ge B(\delta \mid w)

Here B(\delta \mid w) is a new (looser) lower bound on the change of the log-likelihood. Taking the partial derivative of B(\delta \mid w) with respect to \delta_i gives

\frac{\partial B(\delta \mid w)}{\partial \delta_i} = \sum_{x,y} \tilde{P}(x, y) f_i(x, y) - \sum_{x} \tilde{P}(x) \sum_{y} P_w(y \mid x) f_i(x, y) \exp\bigl(\delta_i f^{\#}(x, y)\bigr)

This expression contains no variable other than \delta_i. Setting it to 0 gives the equation

\sum_{x,y} \tilde{P}(x) P_w(y \mid x) f_i(x, y) \exp\bigl(\delta_i f^{\#}(x, y)\bigr) = E_{\tilde{P}}(f_i)

Solving this equation for each \delta_i in turn yields \delta.
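
In the special case where f^{\#}(x, y) is a constant M for every (x, y), the equation has the closed-form solution \delta_i = \frac{1}{M} \log \frac{E_{\tilde{P}}(f_i)}{E_{P}(f_i)}; this is the update used in the Python implementation below. A minimal sketch of one such update (the expectation values are hypothetical):

import math

M = 10.0                      # assumed constant value of f#(x, y)
E_tilde = [0.30, 0.05, 0.12]  # hypothetical empirical expectations E_P~(f_i)
E_model = [0.20, 0.10, 0.12]  # hypothetical model expectations E_P(f_i)
w = [0.0, 0.0, 0.0]

# One IIS step: delta_i = (1/M) * log(E_P~(f_i) / E_P(f_i)); then w_i += delta_i
deltas = [math.log(et / em) / M for et, em in zip(E_tilde, E_model)]
w = [wi + d for wi, d in zip(w, deltas)]
print(w)   # weights increase for features the model currently under-predicts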

The full procedure of the algorithm can be seen in the Python implementation below.

 

5. The relationship and differences between the logistic regression model and the perceptron model

We know that the perceptron classifies an input x with a linear function whose value ranges over the entire real line. Taking a one-dimensional (single-feature) perceptron as an example, the perceptron's decision function is in fact a step function:

f(x) = \operatorname{sign}(w \cdot x + b)

The logistic regression model can be seen as an improved perceptron: it appends a sigmoid function to the perceptron's output, which gives the resulting probability distribution function several new properties:

* f(x; β) ∈ [0, 1].
* f should be at least a continuous function, because we want the model output f to vary smoothly with x.
* f should be as simple as possible.

The sigmoid not only compresses the output range into a probability, it also makes the probability distribution function continuous and differentiable. Through the logistic regression model, a linear function is converted into a probability:

P(Y=1 \mid x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}

The closer the value of the linear function is to positive infinity, the closer the probability is to 1; the closer the value is to negative infinity, the closer the probability is to 0. In other words, the decision output is compressed into a probability, and such a model is the logistic regression model.
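
A minimal sketch contrasting the perceptron's step-shaped decision function with the logistic regression model's smooth sigmoid output (the weight and bias are hypothetical):

import numpy as np

w, b = 2.0, -0.5      # hypothetical 1-D weight and bias shared by both models

def perceptron(x):
    # Step function: outputs a hard label in {-1, +1}
    return np.where(w * x + b >= 0, 1, -1)

def logistic(x):
    # Sigmoid of the same linear function: outputs P(Y=1|x) in (0, 1)
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

xs = np.linspace(-3, 3, 7)
print(perceptron(xs))   # abrupt jump at the decision boundary
print(logistic(xs))     # smooth, differentiable transition through 0.5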



 

6. Implementing the maximum entropy model in Python (MNIST dataset)

0x1: Data location
https://github.com/LittleHann/lihang_book_algorithm/blob/master/data/train_binary.csv
0x2: Code
# -*- coding: utf-8 -*-
import time
import math
from collections import defaultdict

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class MaxEnt(object):

    def init_params(self, X, Y):
        self.X_ = X
        self.Y_ = set()
        # Count the (x, y) pairs and the x values
        self.cal_Pxy_Px(X, Y)
        self.N = len(X)           # size of the training set
        self.n = len(self.Pxy)    # number of distinct (x, y) pairs, i.e. n in the book
        self.M = 10000.0          # the constant M on p.91 of the book; acts like a learning rate
        self.build_dict()
        # Expectation of the feature functions f(x, y) w.r.t. the empirical distribution P~(X, Y)
        self.cal_EPxy()

    def build_dict(self):
        self.id2xy = {}
        self.xy2id = {}
        for i, (x, y) in enumerate(self.Pxy):
            self.id2xy[i] = (x, y)
            self.xy2id[(x, y)] = i

    def cal_Pxy_Px(self, X, Y):
        self.Pxy = defaultdict(int)
        self.Px = defaultdict(int)
        for i in range(len(X)):
            x_, y = X[i], Y[i]
            self.Y_.add(y)
            for x in x_:
                self.Pxy[(x, y)] += 1
                self.Px[x] += 1

    def cal_EPxy(self):
        # Empirical expectation E_P~(f) (bottom of p.82 in the book)
        self.EPxy = defaultdict(float)
        for fid in range(self.n):
            (x, y) = self.id2xy[fid]
            # f(x, y) is a 0/1 indicator, so count / total = P~(x, y)
            self.EPxy[fid] = float(self.Pxy[(x, y)]) / float(self.N)

    def cal_pyx(self, X, y):
        result = 0.0
        for x in X:
            if self.fxy(x, y):
                fid = self.xy2id[(x, y)]
                result += self.w[fid]
        return (math.exp(result), y)

    def cal_probality(self, X):
        # Conditional probability P(y|X), formula 6.22 on p.85 of the book
        Pyxs = [self.cal_pyx(X, y) for y in self.Y_]
        Z = sum([prob for prob, y in Pyxs])
        return [(prob / Z, y) for prob, y in Pyxs]

    def cal_EPx(self):
        # Model expectation E_P(f) (top of p.83 in the book)
        self.EPx = [0.0 for _ in range(self.n)]
        for i, X in enumerate(self.X_):
            Pyxs = self.cal_probality(X)   # P(y|x) under the current model
            for x in X:
                for Pyx, y in Pyxs:
                    if self.fxy(x, y):
                        fid = self.xy2id[(x, y)]
                        # Expectation of f(x, y) w.r.t. the model P(Y|X) and P~(X)
                        self.EPx[fid] += Pyx * (1.0 / self.N)

    def fxy(self, x, y):
        return (x, y) in self.xy2id

    def train(self, X, Y):
        # Initialise parameters and precompute counts and expectations
        self.init_params(X, Y)
        # Initialise the weight of every feature function to w = 0
        self.w = [0.0 for _ in range(self.n)]
        max_iteration = 50
        for times in range(max_iteration):
            print('iteration %d' % times)
            sigmas = []
            self.cal_EPx()
            for i in range(self.n):
                # IIS update, formula 6.34 on p.91 of the book
                sigma = 1 / self.M * math.log(self.EPxy[i] / self.EPx[i])
                sigmas.append(sigma)
            # if len([s for s in sigmas if abs(s) >= 0.01]) == 0:
            #     break
            # Update the weights of all feature functions in one pass
            self.w = [self.w[i] + sigmas[i] for i in range(self.n)]

    def predict(self, testset):
        results = []
        for test in testset:
            # Predict argmax_y P(y|x) for the new sample under the trained model
            result = self.cal_probality(test)
            results.append(max(result, key=lambda x: x[0])[1])
        return results


def rebuild_features(features):
    # Turn each feature vector (a0, a1, a2, ...) into ('0_a0', '1_a1', '2_a2', ...)
    new_features = []
    for feature in features:
        new_feature = []
        for i, f in enumerate(feature):
            new_feature.append(str(i) + '_' + str(f))
        new_features.append(new_feature)
    return new_features


if __name__ == "__main__":
    print('Start read data')
    time_1 = time.time()
    raw_data = pd.read_csv('./data/train_binary.csv', header=0)
    data = raw_data.values
    imgs = data[0::, 1::]
    labels = data[::, 0]
    # Hold out 1/3 of the data as the test set
    train_features, test_features, train_labels, test_labels = train_test_split(
        imgs, labels, test_size=0.33, random_state=23323)
    # Treat every pixel as one feature dimension
    train_features = rebuild_features(train_features)
    test_features = rebuild_features(test_features)
    time_2 = time.time()
    print('read data cost', time_2 - time_1, 'second\n')
    print('Start training')
    met = MaxEnt()
    met.train(train_features, train_labels)
    time_3 = time.time()
    print('training cost', time_3 - time_2, 'second\n')
    print('Start predicting')
    test_predict = met.predict(test_features)
    time_4 = time.time()
    print('predicting cost', time_4 - time_3, 'second\n')
    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is", score)
 

7. Logistic regression experiments with scikit-learn

0x1: MNIST classification using multinomial logistic regression + L1 penalty
# -*- coding:utf-8 -*-
import time
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_mldata
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state

# Turn down for faster convergence
t0 = time.time()
train_samples = 5000

mnist = fetch_mldata('MNIST original')
X = mnist.data.astype('float64')
y = mnist.target
random_state = check_random_state(0)
permutation = random_state.permutation(X.shape[0])
X = X[permutation]
y = y[permutation]
X = X.reshape((X.shape[0], -1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_samples, test_size=10000)

# Feature standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Turn up tolerance for faster convergence
clf = LogisticRegression(C=50. / train_samples, multi_class='multinomial',
                         penalty='l1', solver='saga', tol=0.1)
clf.fit(X_train, y_train)
sparsity = np.mean(clf.coef_ == 0) * 100
score = clf.score(X_test, y_test)
# print('Best C % .4f' % clf.C_)
print("Sparsity with L1 penalty: %.2f%%" % sparsity)
print("Test score with L1 penalty: %.4f" % score)

coef = clf.coef_.copy()
plt.figure(figsize=(10, 5))
scale = np.abs(coef).max()
for i in range(10):
    l1_plot = plt.subplot(2, 5, i + 1)
    l1_plot.imshow(coef[i].reshape(28, 28), interpolation='nearest',
                   cmap=plt.cm.RdBu, vmin=-scale, vmax=scale)
    l1_plot.set_xticks(())
    l1_plot.set_yticks(())
    l1_plot.set_xlabel('Class %i' % i)
plt.suptitle('Classification vector for...')

run_time = time.time() - t0
print('Example run in %.3f s' % run_time)
plt.show()
0x2: Demonstrating the classification ability of logistic regression versus simple linear regression on a dataset with a non-linear (step-shaped) relationship
# -*- coding:utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# this is our test set, it's just a straight line with some Gaussian noise
xmin, xmax = -5, 5
n_samples = 100
np.random.seed(0)
X = np.random.normal(size=n_samples)
y = (X > 0).astype(float)
X[X > 0] *= 4
X += .3 * np.random.normal(size=n_samples)
X = X[:, np.newaxis]

# run the classifier
clf = linear_model.LogisticRegression(C=1e5)
clf.fit(X, y)

# and plot the result
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.scatter(X.ravel(), y, color='black', zorder=20)
X_test = np.linspace(-5, 10, 300)

def model(x):
    return 1 / (1 + np.exp(-x))

loss = model(X_test * clf.coef_ + clf.intercept_).ravel()
plt.plot(X_test, loss, color='red', linewidth=3)

ols = linear_model.LinearRegression()
ols.fit(X, y)
plt.plot(X_test, ols.coef_ * X_test + ols.intercept_, linewidth=1)
plt.axhline(.5, color='.5')
plt.ylabel('y')
plt.xlabel('X')
plt.xticks(range(-5, 10))
plt.yticks([0, 0.5, 1])
plt.ylim(-.25, 1.25)
plt.xlim(-4, 10)
plt.legend(('Logistic Regression Model', 'Linear Regression Model'),
           loc="lower right", fontsize='small')
plt.show()


Relevant Link:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
http://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_mnist.html#sphx-glr-auto-examples-linear-model-plot-sparse-logistic-regression-mnist-py
http://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html#sphx-glr-auto-examples-linear-model-plot-logistic-py
Copyright (c) 2017 LittleHann All rights reserved