Chi-square test in Python
Chi-square (χ2) test for independence (Pearson Chi-square test)
- The chi-square test for independence is a non-parametric (distribution-free) method used to test whether two categorical (nominal) variables in a contingency table are associated
- For example, given two treatment groups (treated and nontreated) and two treatment outcomes (cured and noncured), the chi-square test for independence can be used to check whether the treatment is related to the outcome
- Note: The chi-square test for independence is different from the chi-square goodness of fit test
Formula
\( \chi^2 = \sum\limits_i^n \frac{(O_i - E_i)^2}{E_i} \)
Where, \( O_i \) = Observed value in contingency table,
\( E_i \) = Expected value for each cell in contingency table,
\( i \) = 1, 2, ..., n, where n is the number of cells in the contingency table
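As a minimal sketch of the formula, the snippet below computes the expected counts and the uncorrected Pearson statistic in plain Python for the 2 x 2 treatment table used in the example further down. The expected count of each cell is taken as (row total × column total) / grand total, which is the standard estimate under independence. The variable names are illustrative and are not part of bioinfokit.

```python
# Observed 2 x 2 counts: rows = treated, nontreated; columns = cured, noncured
observed = [[60, 10],
            [30, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson statistic: sum of (O - E)^2 / E over all cells (no continuity correction)
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)
print(round(chi_square, 4), dof)
```

For this table the expected counts are 50.4, 19.6, 39.6 and 15.4, and the uncorrected statistic is about 14.84 with 1 degree of freedom. The Pearson value in the example output further down is lower (13.3365), which is consistent with a continuity correction being applied to the 2 x 2 table.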
Hypotheses
- Null hypothesis: The two categorical variables are independent (no association between the two variables) ( H0: Oi = Ei )
- Alternative hypothesis: The two categorical variables are dependent (there is an association between the two variables) ( Ha: Oi ≠ Ei )
- Note: The chi-square test has no one-tailed or two-tailed P-value. The rejection region of the chi-square test is always in the right tail of the distribution.
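To illustrate the right-tailed rejection region, the sketch below computes the upper-tail probability of a chi-square statistic with 1 degree of freedom using only the standard library. It relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so P(χ2 ≥ x) = erfc(√(x/2)). The helper name `chi2_right_tail_df1` is hypothetical, and the shortcut is valid for 1 degree of freedom only.

```python
import math

def chi2_right_tail_df1(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

critical_value = 3.841  # approximate 5% critical value for 1 df

# The tail area beyond the critical value is close to 0.05
print(round(chi2_right_tail_df1(critical_value), 4))

# Only statistics larger than the critical value fall in the rejection region
for statistic in (0.5, 13.3365):
    p_value = chi2_right_tail_df1(statistic)
    print(statistic, p_value, statistic > critical_value)
```

Larger statistics always give smaller P-values, which is why only the right tail is used.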
Assumptions
- The two variables are categorical (nominal) and the data are randomly sampled
- The levels of the variables are mutually exclusive
- The expected frequency count is at least 5 for at least 80% of the cells in the contingency table
- No cell should have an expected frequency count of less than 1
- Observations should be independent of each other
- Observation data should be frequency counts and not percentages or transformed data
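The two expected-count rules can be checked programmatically. The following is a minimal sketch in plain Python; the helper name `check_expected_counts` is hypothetical, and the expected counts are those of the treatment table used in the example later in this article.

```python
def check_expected_counts(expected):
    """Return (share of cells with expected count >= 5, smallest expected count)."""
    cells = [value for row in expected for value in row]
    share_at_least_5 = sum(value >= 5 for value in cells) / len(cells)
    return share_at_least_5, min(cells)

# Expected counts for the 2 x 2 treatment table
expected = [[50.4, 19.6],
            [39.6, 15.4]]

share, smallest = check_expected_counts(expected)

# At least 80% of cells with expected count >= 5, and no cell below 1
assumptions_met = share >= 0.8 and smallest >= 1
print(share, smallest, assumptions_met)
```

If the check fails, the chi-square approximation may be unreliable for that table.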
Perform a chi-square test for independence
- We will use bioinfokit v0.9.5 or later
- Check the bioinfokit documentation for installation instructions and usage
- Download a hypothetical dataset for chi-square test for independence
# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import stat, get_data
# load example dataset
>>> df = get_data('drugdata').data
>>> df.head()
   treatments  cured  noncured
0     treated     60        10
1  nontreated     30        25
# set treatments column as index
>>> df = df.set_index('treatments')
>>> df.head()
            cured  noncured
treatments
treated        60        10
nontreated     30        25
# run chi-square test for independence
>>> res = stat()
>>> res.chisq(df=df)
# output
>>> print(res.summary)
Chi-squared test for independence

Test              Df    Chi-square      P-value
--------------  ----  ------------  -----------
Pearson            1       13.3365  0.000260291
Log-likelihood     1       13.4687  0.000242574
>>> print(res.expected_df)
Expected frequency counts

      cured    noncured
--  -------  ----------
 0     50.4        19.6
 1     39.6        15.4
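As a cross-check, the reported statistics can be reproduced in plain Python. The sketch below assumes that, for a 2 x 2 table, a Yates continuity correction is applied (each observed count is moved 0.5 toward its expected count) before both statistics are computed. This assumption is inferred from the numbers above rather than taken from the bioinfokit documentation. The P-value shortcut via `math.erfc` holds for 1 degree of freedom only.

```python
import math

observed = [[60, 10], [30, 25]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Flatten observed and expected counts in the same cell order
obs = [o for row in observed for o in row]
exp = [r * c / n for r in row_totals for c in col_totals]

# Yates correction (assumed): shift each observed count up to 0.5 toward its expected count
adj = [o - math.copysign(min(0.5, abs(o - e)), o - e) for o, e in zip(obs, exp)]

# Pearson and log-likelihood (G) statistics on the corrected counts
pearson = sum((a - e) ** 2 / e for a, e in zip(adj, exp))
log_likelihood = 2 * sum(a * math.log(a / e) for a, e in zip(adj, exp))

def p_value_df1(x):
    """Right-tail P-value for a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

print(round(pearson, 4), p_value_df1(pearson))
print(round(log_likelihood, 4), p_value_df1(log_likelihood))
```

Under this assumption both statistics agree with the summary table to the printed precision.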
Interpretation
The p value obtained from the chi-square test for independence is significant (p < 0.05). Therefore, we reject the null hypothesis and conclude that there is a significant association between treatment (treated and nontreated) and treatment outcome (cured and noncured).
Chi-square (χ2) Goodness of Fit test
- The chi-square Goodness of Fit test is a non-parametric (distribution-free) method used to compare the observed and expected values for one categorical variable. The expected values are calculated from a known theoretical expectation.
- For example, suppose we have resistant (A) and susceptible (B) genotypes for some disease. Crosses between these two genotypes are expected to produce offspring in a 3:1 ratio (75% A and 25% B genotype), as per the Mendelian ratio, assuming resistance to the disease is a dominant trait. Here, the chi-square Goodness of Fit test can be used to check whether the observed counts of A and B genotypes agree with the counts expected under the Mendelian ratio.
Formula
\( \chi^2 = \sum\limits_i^n \frac{(O_i - E_i)^2}{E_i} \)
Where, \( O_i \) = Observed value for category i,
\( E_i \) = Expected value for category i,
\( i \) = 1, 2, ..., n, where n is the number of categories
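As a worked instance of the formula, take the observed counts used in the example below (155 A and 45 B out of 200 offspring) with an expected 3:1 ratio, so that the expected counts are 200 × 0.75 = 150 and 200 × 0.25 = 50:
\( \chi^2 = \frac{(155 - 150)^2}{150} + \frac{(45 - 50)^2}{50} = \frac{25}{150} + \frac{25}{50} \approx 0.667 \)
With two categories, the test has 2 - 1 = 1 degree of freedom.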
Hypotheses
- Null hypothesis: The observed and expected counts in each group are equal ( H0: Oi = Ei )
- Alternative hypothesis: The observed and expected counts in each group are different ( Ha: Oi ≠ Ei )
Assumptions
- The variable should be categorical (nominal) and the data should be randomly sampled
- The groups of the variable are mutually exclusive
- The expected count should be at least 5 for each group
- Observations should be independent of each other
- Observation data should be frequency counts and not percentages or transformed data
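The expected-count rule depends on both the sample size and the theoretical proportions. The sketch below, in plain Python, computes the expected counts for two sample sizes under the 3:1 ratio. The helper name `expected_counts` and the sample size of 16 are hypothetical and chosen only to show a case that fails the rule.

```python
def expected_counts(sample_size, proportions):
    """Expected count per group: sample size times theoretical proportion."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("theoretical proportions must sum to 1")
    return [sample_size * p for p in proportions]

mendelian = (0.75, 0.25)  # theoretical 3:1 ratio

# Check whether every group reaches an expected count of at least 5
for sample_size in (16, 200):
    counts = expected_counts(sample_size, mendelian)
    print(sample_size, counts, min(counts) >= 5)
```

With 16 offspring the expected count for B is only 4, so the rule is not met, whereas 200 offspring give expected counts of 150 and 50.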
Perform a Goodness of Fit test in Python
- We will use bioinfokit v0.9.5 or later
- Check the bioinfokit documentation for installation instructions and usage
# I am using interactive python interpreter (Python 3.7)
>>> from bioinfokit.analys import stat
>>> import pandas as pd
# create or import pandas dataframe of observed counts
>>> df = pd.DataFrame({'genotypes':['A', 'B'], 'observed':[155, 45]})
>>> df = df.set_index(['genotypes'])
>>> df.head()
           observed
genotypes
A               155
B                45
# run chi-square test
>>> res = stat()
# p should be known theoretical expectation and must sum to 1
>>> res.chisq(df=df, p=(0.75, 0.25))
# output
>>> print(res.summary)
Chi-squared goodness of fit test

  Chi-Square    Df    P-value    Sample size
------------  ----  ---------  -------------
    0.666667     1   0.414216            200
# get expected counts
>>> print(res.expected_df)
           observed  expected_counts
genotypes
A               155            150.0
B                45             50.0
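The summary values can be verified by hand. The sketch below recomputes the statistic, degrees of freedom, and P-value in plain Python. It does not use bioinfokit, and the `math.erfc` shortcut for the P-value is valid only because this test has 1 degree of freedom.

```python
import math

observed = [155, 45]
proportions = (0.75, 0.25)  # theoretical 3:1 ratio

sample_size = sum(observed)
expected = [sample_size * p for p in proportions]

# Goodness of fit statistic: sum of (O - E)^2 / E over the categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of categories minus 1
dof = len(observed) - 1

# Right-tail P-value (erfc shortcut, 1 degree of freedom only)
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print(round(chi_square, 6), dof, round(p_value, 6), sample_size)
```

The result matches the summary table: a statistic of about 0.6667 with 1 degree of freedom and a P-value of about 0.4142.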
Interpretation
The p value obtained from the chi-square Goodness of Fit test is non-significant (p > 0.05), so we fail to reject the null hypothesis. We therefore conclude that the observed genotype counts from the crosses do not differ significantly from the counts expected under the Mendelian ratio.
References
- Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature methods. 2020 Mar;17(3):261-72.
Last updated: November 24, 2020