# [Data anomaly check] Chi-squared test: handling abnormal data

Brief introduction

A chi-squared test, also written as χ² test, is any statistical hypothesis test <https://blog.csdn.net/ChenVast/article/details/82797265> in which the sampling distribution <https://en.wikipedia.org/wiki/Sampling_distribution> of the test statistic is a chi-squared distribution <https://blog.csdn.net/ChenVast/article/details/82797265> when the null hypothesis <https://blog.csdn.net/ChenVast/article/details/82797265> is true.

The term "chi-squared test" is often used as shorthand for Pearson's chi-squared test. It is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.

In the standard applications of the test, the observations are classified into mutually exclusive categories, and some theory, or null hypothesis, gives the probability that any observation falls into the corresponding category. The purpose of the test is to evaluate how likely the observed data would be, assuming the null hypothesis is true.

Chi-squared tests are often constructed from a sum of squared errors or from the sample variance. Test statistics that follow a chi-squared distribution arise from the assumption of independent, normally distributed data, which is valid in many cases because of the central limit theorem. A chi-squared test can be used to attempt to reject the null hypothesis that the data are independent.
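As a rough illustration of testing the null hypothesis of independence, the sketch below computes the Pearson statistic for a contingency table. The function name `independence_statistic` and the 2×2 counts are made up for this example; the expected count of each cell is taken as (row total × column total) / grand total, and the 5% critical value 3.841 for 1 degree of freedom comes from standard chi-squared tables.

```python
def independence_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    `table` is a list of rows. Under independence, the expected count of a
    cell is row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows are two groups, columns are two outcomes.
table = [[30, 10],
         [20, 40]]
stat, df = independence_statistic(table)

CRITICAL_5PCT_DF1 = 3.841  # upper 5% point of chi-squared with 1 df
print(f"statistic = {stat:.3f}, df = {df}")
print("reject independence at 5%:", stat > CRITICAL_5PCT_DF1)
```

With these invented counts the statistic is far above the critical value, so independence would be rejected at the 5% level. With real data, every expected count should be reasonably large for the approximation to be trusted.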

A test is also called a chi-squared test when this holds only asymptotically, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-squared distribution as closely as desired by making the sample size large enough.
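The asymptotic claim can be checked informally by simulation. The sketch below is only an illustration under assumed settings: it rolls a fair six-sided die `n` times, computes the Pearson statistic, and counts how often it exceeds 11.070, the tabulated 5% critical value for 5 degrees of freedom. The function names, the number of trials, and the seed are arbitrary choices made for this example.

```python
import random

CRITICAL_5PCT_DF5 = 11.070  # upper 5% point of chi-squared with 5 df


def pearson_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def rejection_rate(n, trials=2000, seed=0):
    """Fraction of simulated fair-die samples of size n rejected at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        counts = [0] * 6
        for face in rng.choices(range(6), k=n):
            counts[face] += 1
        if pearson_statistic(counts, [n / 6] * 6) > CRITICAL_5PCT_DF5:
            rejections += 1
    return rejections / trials


for n in (12, 60, 600):
    print(n, rejection_rate(n))
```

Because the null hypothesis is true in every simulated sample, the printed rate should settle near 0.05 as `n` grows, while small samples can deviate because the statistic only takes a limited set of discrete values.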

History

In the 19th century, statistical methods were applied mainly to biological data, and it was customary for researchers to assume that observations followed a normal distribution. Examples include Sir George Airy and Professor Merriman, whose works were criticized by Karl Pearson in his 1900 paper.

At the end of the 19th century, Pearson noticed significant skewness in some biological observations. To model observations whether they were normal or skewed, Pearson, in a series of articles published from 1893 to 1916, devised the Pearson distribution, a family of continuous probability distributions that includes the normal distribution and many skewed distributions. He also proposed a method of statistical analysis that uses the Pearson distribution to model the observations and then performs a goodness-of-fit test to determine how well the model and the observations really agree.

Pearson's chi-squared test

In 1900, Pearson published a paper <http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf> on the χ² test, which is considered one of the foundations of modern statistics. In this paper, Pearson investigated the test of goodness of fit.

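Suppose that $n$ observations in a random sample from a population are classified into $k$ mutually exclusive classes with respective observed numbers $x_i$ (for $i = 1, 2, \ldots, k$), and that a null hypothesis gives the probability $p_i$ that an observation falls into the $i$-th class. Then, for all $i$, we have the expected numbers $m_i = n p_i$, where

$$\sum_{i=1}^{k} p_i = 1, \qquad \sum_{i=1}^{k} m_i = n \sum_{i=1}^{k} p_i = \sum_{i=1}^{k} x_i .$$

Pearson proposed that, if the null hypothesis is correct, then as $n \to \infty$ the limiting distribution of the quantity

$$X^2 = \sum_{i=1}^{k} \frac{(x_i - m_i)^2}{m_i} = \sum_{i=1}^{k} \frac{x_i^2}{m_i} - n$$

is the $\chi^2$ distribution.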

Pearson first dealt with the case in which the expected numbers $m_i$ are known and large enough in all cells, assuming that every observed number $x_i$ may be taken as normally distributed <https://en.wikipedia.org/wiki/Normal_distribution>. He reached the result that, in the limit as $n$ becomes large, the statistic $X^2$ follows the $\chi^2$ distribution with $k - 1$ degrees of freedom.
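The goodness-of-fit calculation described above can be sketched in a few lines. This is a minimal, standard-library-only illustration rather than a reference implementation: the names `chi2_sf` and `goodness_of_fit` and the die counts are invented for the example, the statistic is the sum of (observed − expected)² / expected with expected = n × probability, and the tail probability is computed from the series expansion of the regularized lower incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable with df
    degrees of freedom, via the series for the regularized lower incomplete
    gamma function P(df/2, x/2)."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a          # n = 0 term of the series
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def goodness_of_fit(observed, probabilities):
    """Return (statistic, degrees_of_freedom, p_value) for observed counts
    against fully specified class probabilities (no estimated parameters)."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    statistic = sum((x - m) ** 2 / m for x, m in zip(observed, expected))
    df = len(observed) - 1  # k - 1 degrees of freedom
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical data: 100 rolls of a die that is fair under the null hypothesis.
observed = [16, 18, 16, 14, 12, 24]
stat, df, p = goodness_of_fit(observed, [1 / 6] * 6)
print(f"statistic = {stat:.2f}, df = {df}, p-value = {p:.3f}")
```

For these invented counts the statistic is about 5.12 on 5 degrees of freedom, below the tabulated 5% critical value of 11.070, so the fair-die hypothesis would not be rejected. The sketch assumes the class probabilities are fully known; when parameters are estimated from the sample, the degrees of freedom need to be adjusted.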

However, Pearson next considered the case in which the expected numbers depend on parameters that have to be estimated from the sample. With $m_i$ denoting the true expected numbers and $m'_i$ the estimated expected numbers, he suggested that the difference

$$X^2 - X'^2 = \sum_{i=1}^{k} \frac{x_i^2}{m_i} - \sum_{i=1}^{k} \frac{x_i^2}{m'_i}$$

will usually be positive and small enough to be omitted. Pearson concluded that if $X'^2$ is also regarded as following the $\chi^2$ distribution with $k - 1$ degrees of freedom, the error in this approximation would not affect practical decisions. This conclusion caused some controversy in practical applications and was not settled for 20 years, until Fisher's papers of 1922 and 1924.

Applications

In cryptanalysis, the chi-squared test is used to compare the distribution of plaintext with that of (possibly) decrypted ciphertext. The lowest value of the test statistic means that the decryption was successful with high probability. This method can be generalized to solve modern cryptographic problems.
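A small sketch of this idea, assuming a Caesar (shift) cipher and English plaintext: each of the 26 candidate decryptions is scored by the chi-squared statistic between its letter counts and the counts expected from typical English letter frequencies, and the candidate with the lowest score is kept. The frequency table holds commonly quoted approximate percentages, and the function names and sample message are invented for the illustration.

```python
import string

# Approximate relative frequencies of letters in English text, in percent.
ENGLISH_FREQ = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702,
    "f": 2.228, "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153,
    "k": 0.772, "l": 4.025, "m": 2.406, "n": 6.749, "o": 7.507,
    "p": 1.929, "q": 0.095, "r": 5.987, "s": 6.327, "t": 9.056,
    "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150, "y": 1.974,
    "z": 0.074,
}


def caesar_shift(text, k):
    """Shift lowercase ASCII letters forward by k; leave other characters."""
    out = []
    for c in text:
        if c in string.ascii_lowercase:
            out.append(chr((ord(c) - ord("a") + k) % 26 + ord("a")))
        else:
            out.append(c)
    return "".join(out)


def chi_squared_score(text):
    """Chi-squared statistic of the letter counts against English."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    n = len(letters)
    score = 0.0
    for letter, pct in ENGLISH_FREQ.items():
        expected = n * pct / 100
        score += (letters.count(letter) - expected) ** 2 / expected
    return score


def crack_caesar(ciphertext):
    """Return (key, plaintext) for the shift with the lowest score."""
    key = min(range(26),
              key=lambda k: chi_squared_score(caesar_shift(ciphertext, -k)))
    return key, caesar_shift(ciphertext, -key)


message = ("meet me at the old stone bridge near the river "
           "after sunset and bring the letters with you")
ciphertext = caesar_shift(message, 3)
print(crack_caesar(ciphertext))
```

The approach relies on the message being long enough for its letter counts to resemble the language; very short texts, or texts rich in rare letters, can make a wrong key score lowest.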

In bioinformatics, the chi-squared test is used to compare the distribution of certain properties of genes (for example, genomic content, mutation rate, interaction network clustering, etc.) belonging to different categories (for example, disease genes, essential genes, genes on a certain chromosome, etc.).

Reference: <https://en.wikipedia.org/wiki/Chi-squared_test>