Brief introduction

A chi-square test, also written as χ² test, is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true.

"Chi-square test" is often used as shorthand for Pearson's chi-square test. That test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.


In the standard application of the test, the observations are classified into mutually exclusive categories, and some theory, the null hypothesis, gives the probability that any observation falls into each category. The purpose of the test is to assess how likely the observations that were made would be if the null hypothesis were true.


Chi-square tests are often constructed from a sum of squared errors, or from the sample variance. A test statistic follows a chi-square distribution under the assumption that the data are independent and normally distributed, which holds approximately in many cases because of the central limit theorem. A chi-square test can be used to attempt to reject the null hypothesis that the data are independent.
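As an illustration of the independence case, here is a minimal Python sketch. It relies on two standard facts that the text above does not spell out: under independence the expected count in a cell is (row total × column total) / n, and the statistic is compared with a chi-square distribution with (rows − 1)(columns − 1) degrees of freedom. The function name `independence_statistic` and the 2×2 table of counts are made up for the example.

```python
def independence_statistic(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts, chosen only for illustration.
table = [[30, 10],
         [20, 40]]
stat, dof = independence_statistic(table)
print(f"statistic = {stat:.3f}, degrees of freedom = {dof}")
```

With these counts every expected value is 20 or 30, and the statistic is 50/3 ≈ 16.67 on 1 degree of freedom. That is well above the usual 5% critical value of 3.841, so independence would be rejected for this made-up table.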


A test is also called a chi-square test when this holds only asymptotically, meaning that the sampling distribution of the statistic (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough.
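A small simulation can make the asymptotic statement more concrete. The sketch below draws samples from a null distribution with four categories, computes Pearson's statistic for each sample, and counts how often it exceeds 7.815, the tabulated 5% critical value of the chi-square distribution with 3 degrees of freedom. The probabilities, sample sizes, trial count, and function names are all assumptions made for the example. If the approximation is good, the rejection rate should be close to 0.05.

```python
import random


def pearson_statistic(observed, probs):
    """Sum over categories of (observed - expected)^2 / expected."""
    n = sum(observed)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(observed, probs))


def rejection_rate(probs, n, critical_value, trials, seed=0):
    """Fraction of simulated null samples whose statistic exceeds the cutoff."""
    rng = random.Random(seed)
    k = len(probs)
    rejections = 0
    for _ in range(trials):
        counts = [0] * k
        # Draw n observations from the null distribution and tally them.
        for category in rng.choices(range(k), weights=probs, k=n):
            counts[category] += 1
        if pearson_statistic(counts, probs) > critical_value:
            rejections += 1
    return rejections / trials


PROBS = [0.1, 0.2, 0.3, 0.4]       # hypothetical null probabilities
CRITICAL_5_PERCENT_DF3 = 7.815      # chi-square table value, 3 degrees of freedom

rates = {
    n: rejection_rate(PROBS, n, CRITICAL_5_PERCENT_DF3, trials=2000)
    for n in (10, 100, 500)
}
for n, rate in rates.items():
    print(f"n = {n:4d}: rejection rate = {rate:.3f}")
```

The exact figures depend on the seed, so none are quoted here. The point is only that the rate for the larger sample sizes should sit near the nominal 5%.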



In the 19th century, statistical methods were applied mainly to the analysis of biological data, and it was customary for researchers to assume that observations followed a normal distribution. Examples include Sir George Airy and Professor Merriman, whose works were criticized by Karl Pearson in his 1900 paper.

At the end of the 19th century, Pearson noticed significant skewness in some biological observations. In a series of articles published from 1893 to 1916, he devised the Pearson distribution, a family of continuous probability distributions that includes the normal distribution and many skewed distributions, in order to model observations whether they were normal or skewed. He also proposed a method of statistical analysis that uses the Pearson distribution to model the observations and then performs a goodness-of-fit test to determine how well the model and the observations really agree.


Pearson's chi-square test

In 1900, Pearson published a paper on the χ² test, which is considered one of the foundations of modern statistics. In this paper, Pearson investigated a test of goodness of fit.

Suppose that n observations in a random sample from a population are classified into k mutually exclusive classes with respective observed numbers $x_i$ (for i = 1, 2, ..., k), and that the null hypothesis gives the probability $p_i$ that an observation falls into the i-th class. Then, for all i, we have the expected numbers $m_i = n p_i$, where $\sum_{i=1}^{k} p_i = 1$ and therefore $\sum_{i=1}^{k} m_i = n = \sum_{i=1}^{k} x_i$.


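Pearson proposed that, if the null hypothesis is correct, then as n → ∞ the limiting distribution of the quantity given below is the χ² distribution. Here $x_i$ is the observed number and $m_i$ the expected number in class i:

$$X^2 = \sum_{i=1}^{k} \frac{(x_i - m_i)^2}{m_i} = \sum_{i=1}^{k} \frac{x_i^2}{m_i} - n$$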
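To make the quantity concrete, here is a minimal Python sketch that computes Pearson's statistic in two algebraically equivalent ways: as the sum of (observed − expected)² / expected, and as the sum of observed² / expected minus n. The function names and the die-roll counts are assumptions made for the example, not data from any study.

```python
def expected_counts(n, probs):
    """Expected number in each class under the null hypothesis: n * p_i."""
    return [n * p for p in probs]


def pearson_x2(observed, probs):
    """Pearson's statistic as a sum of squared, scaled deviations."""
    n = sum(observed)
    expected = expected_counts(n, probs)
    return sum((x - m) ** 2 / m for x, m in zip(observed, expected))


def pearson_x2_shortcut(observed, probs):
    """Equivalent form: sum of x_i^2 / m_i, minus n."""
    n = sum(observed)
    expected = expected_counts(n, probs)
    return sum(x * x / m for x, m in zip(observed, expected)) - n


# Hypothetical result of rolling a die 60 times; a fair die is the null hypothesis.
observed = [8, 12, 9, 11, 6, 14]
probs = [1 / 6] * 6

x2 = pearson_x2(observed, probs)
x2_alt = pearson_x2_shortcut(observed, probs)
print(f"X^2 = {x2:.4f} (shortcut form: {x2_alt:.4f})")
```

Every expected count here is 10 and the squared deviations sum to 42, so both forms give 4.2 up to floating-point rounding.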


Pearson first dealt with the case in which the expected numbers $m_i$ are known and large enough in all cells, so that each observed number $x_i$ can be taken as normally distributed. He reached the result that, in the limit as n becomes large, the statistic $X^2$ follows the χ² distribution with k − 1 degrees of freedom.
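The result above is what turns the statistic into a test: the p-value is the upper-tail probability of the χ² distribution with k − 1 degrees of freedom. The sketch below uses only the Python standard library, so the tail probability is computed with the usual power series for the regularized lower incomplete gamma function. The helper names `chi2_sf` and `goodness_of_fit_p_value` are hypothetical, as are the die-roll counts. A statistics library would normally be used instead, and this series loses accuracy for very small p-values.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X > x) for a chi-square variable with dof degrees of freedom."""
    if x <= 0:
        return 1.0
    a = dof / 2.0
    z = x / 2.0
    # Series: gamma(a, z) = z^a e^(-z) * sum_{n>=0} z^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


def goodness_of_fit_p_value(observed, probs):
    """Return (statistic, dof, p_value) with dof = k - 1 (no estimated parameters)."""
    n = sum(observed)
    stat = sum((x - n * p) ** 2 / (n * p) for x, p in zip(observed, probs))
    dof = len(observed) - 1
    return stat, dof, chi2_sf(stat, dof)


# Hypothetical die rolls; the null hypothesis is a fair die.
stat, dof, p_value = goodness_of_fit_p_value([8, 12, 9, 11, 6, 14], [1 / 6] * 6)
print(f"X^2 = {stat:.3f}, dof = {dof}, p = {p_value:.3f}")
```

For these counts the statistic is 4.2 on 5 degrees of freedom. The p-value is far above 0.05, so the fair-die hypothesis would not be rejected.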

Pearson next considered the case in which the expected numbers depend on parameters that have to be estimated from the sample. Writing $m_i$ for the true expected numbers and $m'_i$ for the estimated expected numbers, with $X^2$ and $X'^2$ the corresponding statistics, he suggested that the difference

$$X^2 - X'^2 = \sum_{i=1}^{k} \frac{x_i^2}{m_i} - \sum_{i=1}^{k} \frac{x_i^2}{m'_i}$$

will usually be positive and small enough to be neglected. Pearson concluded that if $X'^2$ is also regarded as following the χ² distribution with k − 1 degrees of freedom, the error in this approximation would not affect practical decisions. This conclusion caused some controversy in practical applications and was not settled for 20 years, until Fisher's papers of 1922 and 1924.



In cryptanalysis, the chi-square test is used to compare the distribution of plaintext with that of (possibly) decrypted ciphertext. The lowest value of the test statistic means that the decryption was successful with high probability. This method can be generalized to solve modern cryptographic problems.
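A classic small-scale instance of this idea is breaking a Caesar (shift) cipher. The sketch below tries all 26 shifts, scores each candidate decryption against English letter frequencies with Pearson's statistic, and keeps the shift with the lowest score. The frequency table holds approximate, commonly quoted percentages. The sample message, the key, and the function names are assumptions made for the example. It also assumes lowercase English text, and short messages may not be decrypted reliably.

```python
from collections import Counter
import string

# Approximate relative frequencies (percent) of letters in English text.
ENGLISH_FREQ = {
    'a': 8.17, 'b': 1.49, 'c': 2.78, 'd': 4.25, 'e': 12.70, 'f': 2.23,
    'g': 2.02, 'h': 6.09, 'i': 6.97, 'j': 0.15, 'k': 0.77, 'l': 4.03,
    'm': 2.41, 'n': 6.75, 'o': 7.51, 'p': 1.93, 'q': 0.10, 'r': 5.99,
    's': 6.33, 't': 9.06, 'u': 2.76, 'v': 0.98, 'w': 2.36, 'x': 0.15,
    'y': 1.97, 'z': 0.07,
}


def shift_text(text, shift):
    """Shift lowercase letters by `shift` positions; leave other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord('a') + shift) % 26 + ord('a')))
        else:
            out.append(ch)
    return "".join(out)


def chi_square_vs_english(text):
    """Pearson's statistic comparing letter counts in `text` with English."""
    letters = [ch for ch in text if ch in string.ascii_lowercase]
    n = len(letters)
    counts = Counter(letters)
    total = sum(ENGLISH_FREQ.values())
    stat = 0.0
    for letter, freq in ENGLISH_FREQ.items():
        expected = n * freq / total
        stat += (counts[letter] - expected) ** 2 / expected
    return stat


def crack_caesar(ciphertext):
    """Return (best_shift, decryption) for the shift with the lowest statistic."""
    scores = {s: chi_square_vs_english(shift_text(ciphertext, -s)) for s in range(26)}
    best = min(scores, key=scores.get)
    return best, shift_text(ciphertext, -best)


PLAINTEXT = ("statistical tests often compare the letter counts seen in a message "
             "with the counts that ordinary english writing would lead us to "
             "anticipate and the candidate with the smallest score is the likely answer")
KEY = 7  # hypothetical secret shift
ciphertext = shift_text(PLAINTEXT, KEY)
best_shift, recovered = crack_caesar(ciphertext)
print(best_shift, recovered[:40])
```

Wrong shifts tend to map common letters such as e and t onto rare ones such as q, x, or z, which inflates the statistic sharply. That is why the minimum usually singles out the correct key.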

In bioinformatics, the chi-square test is used to compare the distribution of certain properties of genes (for example, genomic content, mutation rate, or interaction network clustering) belonging to different categories (for example, disease genes, essential genes, or genes on a particular chromosome).

