Chi-square test in Python

December 23, 2019 4 minute read Follow @GaryLi

Chi-square (χ²) test for independence (Pearson Chi-square test)

The chi-square test for independence is a non-parametric (distribution-free) method used to test whether two categorical (nominal) variables in a contingency table are associated.
For example, suppose we have two treatments (treated and nontreated) and two treatment outcomes (cured and noncured). We can use the chi-square test for independence to check whether treatment is related to treatment outcome.
Note: The chi-square test for independence is different from the chi-square goodness of fit test.

Formula

\( \chi^2 = \sum\limits_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \)

Where, \( O_i \) = Observed value in contingency table,
\( E_i \) = Expected value for each cell in contingency table,
\( i \) = 1,2,..., n
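
To make the formula concrete, here is a minimal sketch that computes the expected counts and the uncorrected Pearson statistic by hand with NumPy. It uses the 2x2 treatment table from the worked example later in this post. The variable names are illustrative and are not part of bioinfokit.

```python
import numpy as np

# Observed counts: rows = treated, nontreated; columns = cured, noncured
observed = np.array([[60, 10],
                     [30, 25]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# Expected count for each cell if the two variables were independent
expected = row_totals * col_totals / grand_total

# Pearson statistic: sum over all cells of (O - E)^2 / E
chi_square = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(round(chi_square, 4), dof)
```

This gives a statistic of about 14.84 on 1 degree of freedom. Software that applies Yates' continuity correction to 2x2 tables reports a smaller value for the same data, because the correction shrinks each |O - E| by 0.5 before squaring.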

Hypotheses

Null hypothesis: The two categorical variables are independent (no association between the two variables) ( H₀: O_i = E_i )
Alternative hypothesis: The two categorical variables are dependent (there is an association between the two variables) ( H_a: O_i ≠ E_i )
Note: The chi-square test has no one-tailed or two-tailed P-value. The rejection region is always in the right tail of the distribution.
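
The right-tail rejection region can be sketched with SciPy's chi-square distribution. This is a minimal illustration, assuming a significance level of 0.05 and using the statistic and degrees of freedom reported in the worked example later in this post.

```python
from scipy.stats import chi2

statistic = 13.3365  # example value, taken from the output later in this post
dof = 1              # degrees of freedom for a 2x2 table
alpha = 0.05         # assumed significance level

# Critical value: the point with alpha of the probability mass to its right
critical_value = chi2.ppf(1 - alpha, dof)

# P-value: area under the chi-square density to the right of the statistic
p_value = chi2.sf(statistic, dof)

# Reject the null hypothesis when the statistic falls in the right tail
reject_null = statistic > critical_value

print(round(critical_value, 3), p_value, reject_null)
```

Larger departures of observed from expected counts always push the statistic to the right, which is why only the upper tail is used.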

Assumptions

The two variables are categorical (nominal) and data is randomly sampled
The levels of variables are mutually exclusive
The expected frequency count is at least 5 in at least 80% of the cells in the contingency table
No cell should have an expected frequency count below 1
Observations should be independent of each other
Observation data should be frequency counts and not percentages or transformed data
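
The two expected-count assumptions above can be checked programmatically. The helper below is a hypothetical sketch (the function name and return keys are my own, not a bioinfokit or SciPy API) that takes a table of expected counts and reports whether both rules hold.

```python
import numpy as np

def check_expected_counts(expected, min_count=5, min_fraction=0.8):
    """Check the expected-count rules of thumb for a chi-square test."""
    expected = np.asarray(expected, dtype=float)
    # Fraction of cells whose expected count is at least min_count
    fraction_ok = float((expected >= min_count).mean())
    return {
        "fraction_at_least_min_count": fraction_ok,
        "enough_large_cells": fraction_ok >= min_fraction,
        "no_cell_below_one": bool((expected >= 1).all()),
    }

# Expected counts for the 2x2 treatment table used in this post
report = check_expected_counts([[50.4, 19.6], [39.6, 15.4]])
print(report)
```

If either check fails, a common alternative for small 2x2 tables is Fisher's exact test.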

Perform a chi-square test for independence

We will use bioinfokit v0.9.5 or later.
Check the bioinfokit documentation for installation instructions and usage.
The example below loads a hypothetical dataset for the chi-square test for independence.

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import stat, get_data
# load example dataset
>>> df = get_data('drugdata').data
>>> df.head()
   treatments  cured  noncured
0     treated     60        10
1  nontreated     30        25
# set treatments column as index
>>> df = df.set_index('treatments')
>>> df.head()
            cured  noncured
treatments
treated        60        10
nontreated     30        25

# run chi-square test for independence
>>> res = stat()
>>> res.chisq(df=df)

# output
>>> print(res.summary)

Chi-squared test for independence

Test              Df    Chi-square      P-value
--------------  ----  ------------  -----------
Pearson            1       13.3365  0.000260291
Log-likelihood     1       13.4687  0.000242574

>>> print(res.expected_df)

Expected frequency counts

      cured    noncured
--  -------  ----------
 0     50.4        19.6
 1     39.6        15.4
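
As a cross-check, the same table can be run through SciPy directly. This is a sketch using `scipy.stats.chi2_contingency`, not a description of how bioinfokit works internally. Note that the statistic above matches SciPy's result with Yates' continuity correction enabled, which is SciPy's default for 2x2 tables.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = treated, nontreated; columns = cured, noncured
observed = np.array([[60, 10],
                     [30, 25]])

# Pearson chi-square test with Yates' continuity correction
stat, p_value, dof, expected = chi2_contingency(observed, correction=True)

# Log-likelihood ratio (G) test on the same table
g_stat, g_p_value, _, _ = chi2_contingency(
    observed, correction=True, lambda_="log-likelihood"
)

print(round(stat, 4), p_value, dof)
print(expected)
print(round(g_stat, 4), g_p_value)
```

Passing `correction=False` gives the uncorrected Pearson statistic instead, which is larger for this table.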

Interpretation

The p value obtained from the chi-square test for independence is significant (p < 0.05). We therefore reject the null hypothesis and conclude that there is a significant association between treatment (treated and nontreated) and treatment outcome (cured and noncured).

Chi-square (χ²) Goodness of Fit test

The chi-square goodness of fit test is a non-parametric (distribution-free) method used to compare the observed and expected counts for one categorical variable. The expected counts are calculated from a known theoretical expectation.
For example, suppose we have resistant (A) and susceptible (B) genotypes for some disease. Crosses between these two genotypes are expected to produce offspring in a 3:1 ratio (75% A and 25% B) as per the Mendelian ratio, assuming resistance to the disease is a dominant trait. Here, we can use the chi-square goodness of fit test to check whether the observed counts of A and B genotypes agree with the counts expected under the Mendelian ratio.

Formula

\( \chi^2 = \sum\limits_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \)

Where, \( O_i \) = Observed value for category i,
\( E_i \) = Expected value for category i,
\( i \) = 1,2,..., n
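
As a worked instance of the formula, take 200 offspring with observed counts of 155 (A) and 45 (B), the counts used in the example below. Under a 3:1 ratio, the expected counts are 200 × 0.75 = 150 and 200 × 0.25 = 50, so:

\( \chi^2 = \frac{(155 - 150)^2}{150} + \frac{(45 - 50)^2}{50} = \frac{25}{150} + \frac{25}{50} \approx 0.667 \)

With n = 2 categories, the test has n - 1 = 1 degree of freedom.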

Hypotheses

Null hypothesis: The observed and expected counts in each group are equal ( H₀: O_i = E_i )
Alternative hypothesis: The observed and expected counts in each group are different ( H_a: O_i ≠ E_i )

Assumptions

The variable should be categorical (nominal) and data is randomly sampled
The groups (levels) of the variable are mutually exclusive
The expected count should be at least 5 for each group
Observations should be independent of each other
Observation data should be frequency counts and not percentages or transformed data

Perform a Goodness of Fit test in Python

We will use bioinfokit v0.9.5 or later.
Check the bioinfokit documentation for installation instructions and usage.

# I am using interactive python interpreter (Python 3.7)
>>> from bioinfokit.analys import stat
>>> import pandas as pd
# create or import pandas dataframe of observed counts
>>> df = pd.DataFrame({'genotypes':['A', 'B'], 'observed':[155, 45]})
>>> df = df.set_index(['genotypes'])
>>> df.head()
           observed
genotypes
A               155
B                45

# run chi-square test 
>>> res = stat()
# p should be known theoretical expectation and must sum to 1
>>> res.chisq(df=df, p=(0.75, 0.25))

# output
>>> print(res.summary)

Chi-squared goodness of fit test

  Chi-Square    Df    P-value    Sample size
------------  ----  ---------  -------------
    0.666667     1   0.414216            200

# get expected counts
>>> print(res.expected_df)
           observed  expected_counts
genotypes
A               155            150.0
B                45             50.0
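
The same result can be cross-checked with SciPy. This is a sketch using `scipy.stats.chisquare`, which expects expected counts rather than proportions, so the 3:1 proportions are first scaled by the sample size. It is an independent check, not a description of bioinfokit's internals.

```python
import numpy as np
from scipy.stats import chisquare

# Observed genotype counts for A and B
observed = np.array([155, 45])

# Theoretical proportions (3:1 Mendelian ratio); they must sum to 1
expected_props = np.array([0.75, 0.25])

# Convert proportions to expected counts for the observed sample size
expected = expected_props * observed.sum()

# Goodness of fit test with (number of categories - 1) degrees of freedom
stat, p_value = chisquare(f_obs=observed, f_exp=expected)

print(expected)
print(round(stat, 6), round(p_value, 6))
```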

Interpretation

The p value obtained from the chi-square goodness of fit test is non-significant (p > 0.05), so we fail to reject the null hypothesis. We conclude that the observed genotype counts after the crosses are consistent with the counts expected under the Mendelian ratio.

References

Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature methods. 2020 Mar;17(3):261-72.

Last updated: November 24, 2020
