Chi-square test in Python


Chi-square (χ2) test for independence (Pearson Chi-square test)

  • The chi-square test for independence is a non-parametric (distribution-free) method used to test whether two categorical (nominal) variables in a contingency table are associated
  • For example, suppose we have different treatments (treated and nontreated) and treatment outcomes (cured and noncured). Here, we could use the chi-square test for independence to check whether the treatments are related to the treatment outcomes.
  • Note: The chi-square test for independence is different from the chi-square goodness of fit test

Formula

\( \chi^2 = \sum\limits_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \)

Where \( O_i \) = observed count in cell i of the contingency table,
\( E_i \) = expected count in cell i of the contingency table,
\( i \) = 1, 2, ..., n, and n = number of cells in the contingency table
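For a contingency table, the expected count of each cell under independence is (row total × column total) / grand total, and the degrees of freedom are (rows - 1) × (columns - 1). The following is a minimal sketch of that arithmetic with NumPy, using an illustrative 2×2 table and no continuity correction; the variable names are assumptions made for this example and are not part of bioinfokit.

```python
import numpy as np

# Illustrative 2x2 table of observed counts (rows: groups, columns: outcomes)
observed = np.array([[60, 10],
                     [30, 25]], dtype=float)

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count per cell under independence:
# (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Pearson chi-square statistic, summed over all cells (no continuity correction)
chi_square = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for an r x c table
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(chi_square, dof)
```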

Hypotheses

  • Null hypothesis: The two categorical variables are independent (no association between the two variables) ( H0: Oi = Ei )
  • Alternative hypothesis: The two categorical variables are dependent (there is an association between the two variables) ( Ha: Oi ≠ Ei )
  • Note: There is no one-tailed or two-tailed P-value for this test. The rejection region of the chi-square test is always in the right tail of the distribution.
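Because the rejection region lies in the right tail, the P-value is the upper-tail probability of the chi-square distribution at the observed statistic. A small sketch using `scipy.stats.chi2`, assuming a significance level of 0.05 and taking the statistic and degrees of freedom from the Pearson row of the example output in this post:

```python
from scipy.stats import chi2

chi_square = 13.3365  # test statistic (Pearson row of the example output)
dof = 1               # degrees of freedom
alpha = 0.05          # assumed significance level

# Upper-tail (right-side) probability of the chi-square distribution
p_value = chi2.sf(chi_square, dof)

# Critical value that marks the start of the rejection region
critical_value = chi2.ppf(1 - alpha, dof)

# Reject the null hypothesis when the statistic falls in the right tail
reject_null = chi_square > critical_value

print(p_value, critical_value, reject_null)
```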

Assumptions

  • The two variables are categorical (nominal) and data is randomly sampled
  • The levels of variables are mutually exclusive
  • The expected frequency count is at least 5 in at least 80% of the cells in the contingency table
  • No cell should have an expected frequency count of less than 1
  • Observations should be independent of each other
  • Observation data should be frequency counts and not percentages or transformed data
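The two expected-count rules above can be checked programmatically before running the test. This is a rough sketch; `check_expected_counts` is a hypothetical helper written for this example, not a bioinfokit or SciPy function.

```python
import numpy as np

def check_expected_counts(expected):
    """Check the expected-count rules for a chi-square test.

    Returns (ok, fraction_at_least_5, minimum), where ok is True when
    at least 80% of cells have an expected count >= 5 and no cell has
    an expected count below 1.
    """
    expected = np.asarray(expected, dtype=float)
    fraction_at_least_5 = float((expected >= 5).mean())
    minimum = float(expected.min())
    ok = fraction_at_least_5 >= 0.8 and minimum >= 1
    return ok, fraction_at_least_5, minimum

# Expected counts from a 2x2 table (illustrative values)
ok, fraction, minimum = check_expected_counts([[50.4, 19.6],
                                               [39.6, 15.4]])
print(ok, fraction, minimum)
```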

Perform a chi-square test for independence

  • We will use bioinfokit v0.9.5 or later
  • Check the bioinfokit documentation for installation and usage
  • Download a hypothetical dataset for chi-square test for independence
# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import stat, get_data
# load example dataset
>>> df = get_data('drugdata').data
>>> df.head()
   treatments  cured  noncured
0     treated     60        10
1  nontreated     30        25
# set treatments column as index
>>> df = df.set_index('treatments')
>>> df.head()
            cured  noncured
treatments
treated        60        10
nontreated     30        25

# run chi-square test for independence
>>> res = stat()
>>> res.chisq(df=df)

# output
>>> print(res.summary)

Chi-squared test for independence

Test              Df    Chi-square      P-value
--------------  ----  ------------  -----------
Pearson            1       13.3365  0.000260291
Log-likelihood     1       13.4687  0.000242574

>>> print(res.expected_df)

Expected frequency counts

      cured    noncured
--  -------  ----------
 0     50.4        19.6
 1     39.6        15.4
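As a cross-check, the same table can be passed to `scipy.stats.chi2_contingency`. For a 2×2 table SciPy applies Yates' continuity correction by default, and the corrected statistic agrees with the Pearson value reported above; without the correction the statistic is larger (about 14.84). This sketch only compares numbers and makes no claim about how bioinfokit computes them internally.

```python
from scipy.stats import chi2_contingency

# Observed counts: rows = treated, nontreated; columns = cured, noncured
observed = [[60, 10],
            [30, 25]]

# Pearson chi-square with Yates' continuity correction (SciPy default for 2x2)
chi_sq, p_value, dof, expected = chi2_contingency(observed, correction=True)

# Log-likelihood ratio (G) statistic with the same correction
g_stat, g_p, _, _ = chi2_contingency(
    observed, correction=True, lambda_="log-likelihood"
)

# Pearson chi-square without the continuity correction, for comparison
chi_sq_raw, p_raw, _, _ = chi2_contingency(observed, correction=False)

print(chi_sq, p_value, dof)
print(g_stat, g_p)
print(chi_sq_raw, p_raw)
print(expected)
```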

Interpretation

The p value obtained from the chi-square test for independence is significant (p < 0.05); therefore, we reject the null hypothesis and conclude that there is a significant association between treatment (treated and nontreated) and treatment outcome (cured and noncured).

Chi-square (χ2) Goodness of Fit test

  • The chi-square Goodness of Fit test is a non-parametric (distribution-free) method used to compare the observed and expected counts for one categorical variable. The expected counts are calculated from a known theoretical expectation.
  • For example, suppose we have resistant (A) and susceptible (B) genotypes for some disease. Crosses between these two genotypes are expected to produce offspring in a 3:1 ratio (75% A and 25% B genotype) as per the Mendelian ratio, assuming resistance to the disease is a dominant trait. Here, we could use the chi-square Goodness of Fit test to check whether the observed counts of A and B genotypes agree with the counts expected under the Mendelian ratio.

Formula

\( \chi^2 = \sum\limits_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \)

Where \( O_i \) = observed count for category i,
\( E_i \) = expected count for category i,
\( i \) = 1, 2, ..., n, and n = number of categories
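For a goodness of fit test, each expected count is the total sample size multiplied by the theoretical proportion for that category, and the degrees of freedom are the number of categories minus 1. A minimal sketch of the arithmetic, using hypothetical counts (290 and 110) and a 3:1 expected ratio chosen only for illustration:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical observed counts for two categories
observed = np.array([290, 110], dtype=float)

# Theoretical proportions (3:1 ratio); they must sum to 1
proportions = np.array([0.75, 0.25])

# Expected count per category = total sample size * theoretical proportion
expected = observed.sum() * proportions

# Chi-square statistic summed over categories
chi_square = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom = number of categories - 1
dof = len(observed) - 1

# Right-tail P-value
p_value = chi2.sf(chi_square, dof)

print(expected, chi_square, dof, p_value)
```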

Hypotheses

  • Null hypothesis: The observed and expected counts in each group are equal ( H0: Oi = Ei )
  • Alternative hypothesis: The observed and expected counts in each group are different ( Ha: Oi ≠ Ei )

Assumptions

  • The variable should be categorical (nominal) and the data should be randomly sampled
  • The groups (categories) of the variable are mutually exclusive
  • The expected count should be at least 5 for each group
  • Observations should be independent of each other
  • Observation data should be frequency counts and not percentages or transformed data

Perform a Goodness of Fit test in Python

# I am using interactive python interpreter (Python 3.7)
>>> from bioinfokit.analys import stat
>>> import pandas as pd
# create or import pandas dataframe of observed counts
>>> df = pd.DataFrame({'genotypes':['A', 'B'], 'observed':[155, 45]})
>>> df = df.set_index(['genotypes'])
>>> df.head()
           observed
genotypes
A               155
B                45

# run chi-square goodness of fit test
>>> res = stat()
# p should be known theoretical expectation and must sum to 1
>>> res.chisq(df=df, p=(0.75, 0.25))

# output
>>> print(res.summary)

Chi-squared goodness of fit test

  Chi-Square    Df    P-value    Sample size
------------  ----  ---------  -------------
    0.666667     1   0.414216            200

# get expected counts
>>> print(res.expected_df)
           observed  expected_counts
genotypes
A               155            150.0
B                45             50.0
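The same result can be cross-checked with `scipy.stats.chisquare`, which takes observed and expected counts directly. This is a sketch for comparison only; the expected counts are derived here from the 3:1 ratio used above, and the variable names are assumptions for this example.

```python
from scipy.stats import chisquare

# Observed genotype counts (A, B)
observed = [155, 45]

# Theoretical proportions under the 3:1 Mendelian ratio
expected_ratios = (0.75, 0.25)

# Expected counts = total sample size * theoretical proportion
total = sum(observed)
expected = [total * ratio for ratio in expected_ratios]

# Chi-square goodness of fit test (degrees of freedom = categories - 1)
result = chisquare(f_obs=observed, f_exp=expected)

print(expected)
print(result.statistic, result.pvalue)
```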

Interpretation

The p value obtained from the chi-square Goodness of Fit test is non-significant (p > 0.05), so we fail to reject the null hypothesis. We conclude that the observed genotype counts after the crosses do not differ significantly from the counts expected under the Mendelian ratio.

References

  • Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature methods. 2020 Mar;17(3):261-72.

Last updated: November 24, 2020