t-test in Python


One Sample t-test

  • The one sample t-test compares the mean of a random sample drawn from a population with a specific value (the hypothesized or known mean of the population).
  • For example, suppose balls are manufactured with a diameter of 5 cm, and we want to check whether the average diameter of a random sample (e.g. 50 balls) picked from the production line differs from this known size.

Assumptions

  • Dependent variable should have an approximately normal distribution (can be checked with the Shapiro-Wilk test)
  • Observations are independent of each other
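
The normality assumption can be checked before running the t-test. The following is a minimal sketch, assuming SciPy is installed. It calls `scipy.stats.shapiro` on synthetic diameters generated here purely for illustration (they are not this article's dataset), and the 0.05 cutoff is the conventional choice rather than a fixed rule.

```python
import random

from scipy import stats

# Synthetic ball diameters (cm), generated only for illustration.
rng = random.Random(42)
diameters = [rng.gauss(5.0, 1.0) for _ in range(50)]

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed.
w_stat, p_value = stats.shapiro(diameters)

if p_value > 0.05:
    print(f"W = {w_stat:.3f}, p = {p_value:.3f}: no evidence against normality")
else:
    print(f"W = {w_stat:.3f}, p = {p_value:.3f}: normality is questionable")
```

A non-significant result does not prove normality; it only means the test found no evidence against it.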

Hypotheses

  • Null hypothesis: The mean of the population from which the sample is drawn is equal to the hypothesized or known value
  • Alternative hypothesis: The population mean is not equal to the hypothesized or known value (two-tailed or two-sided)
  • Alternative hypothesis: The population mean is either greater or less than the hypothesized or known value (one-tailed or one-sided)

Formula

\( \it{t} = \frac{ \bar{x} - \mu }{ s / \sqrt{n} } \)

Under the null hypothesis, it approximately follows a \( \it{t}\)-distribution with \(n-1 \) degrees of freedom

Where, \(\bar{x} \) = sample mean; \(\mu \) = hypothesized or known population mean; \(\it{s} \)= sample standard deviation (estimate of population standard deviation \(\sigma \)), and \(\it{n} \) is the sample size
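
To make the formula concrete, here is a small sketch that computes the statistic by hand using only the Python standard library. The function name `one_sample_t` and the six diameters are hypothetical and chosen for illustration; they are not taken from the article's dataset.

```python
import math
import statistics


def one_sample_t(sample, mu):
    """Return (t, degrees of freedom) for a one sample t-test against mu."""
    n = len(sample)
    x_bar = statistics.mean(sample)   # sample mean
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
    t = (x_bar - mu) / (s / math.sqrt(n))
    return t, n - 1


# Hypothetical diameters (cm), not taken from the article's dataset.
diameters = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2]
t_stat, dof = one_sample_t(diameters, mu=5)
print(f"t = {t_stat:.4f}, df = {dof}")
```

The p-value is then obtained by comparing `t_stat` against a t-distribution with `dof` degrees of freedom.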

How to perform one sample t-test in Python?

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has any, they will be omitted
>>> df = get_data('t_one_samp').data
>>> df.head()
       size
0  5.739987
1  5.254042
2  5.152388
3  4.870819
4  3.536251

>>> res = stat()
>>> res.ttest(df=df, test_type=1, res='size', mu=5)
# output
>>> print(res.summary)

One Sample t-test

------------------  --------
Sample size         50
Mean                 5.05128
t                    0.36789
Df                  49
P-value (one-tail)   0.35727
P-value (two-tail)   0.71454
Lower 95.0%          4.77116
Upper 95.0%          5.3314
------------------  --------

Interpretation

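The p value obtained from the one sample t-test is not significant (two-tailed p = 0.71454 > 0.05), and therefore, we fail to reject the null hypothesis: there is no evidence that the average diameter of the balls differs from 5 cm. Note that a non-significant result does not prove that the average diameter is exactly 5 cm.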
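
The p-values and confidence interval in the summary can be recovered from the reported t statistic, degrees of freedom, and mean. The sketch below assumes SciPy is installed and uses `scipy.stats.t`; it illustrates the relationship between the numbers and is not necessarily how bioinfokit computes them internally.

```python
from scipy import stats

# Values copied from the summary output above.
t_stat = 0.36789
dof = 49
mean = 5.05128
mu = 5

# One-tailed p-value: area under the t-distribution beyond |t|.
p_one_tail = stats.t.sf(abs(t_stat), dof)
# Two-tailed p-value: both tails, so twice the one-tailed value.
p_two_tail = 2 * p_one_tail

# Standard error recovered from t = (mean - mu) / se.
se = (mean - mu) / t_stat
# 95% confidence interval: mean +/- critical t value * standard error.
t_crit = stats.t.ppf(0.975, dof)
lower, upper = mean - t_crit * se, mean + t_crit * se

print(f"p (one-tail) = {p_one_tail:.5f}, p (two-tail) = {p_two_tail:.5f}")
print(f"95% CI = ({lower:.5f}, {upper:.5f})")
```

Because the hypothesized value of 5 lies inside the 95% interval, the two-tailed test is not significant at the 0.05 level.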

Two sample t-test (unpaired or independent t-test)

  • The two sample independent t-test is used to compare the means of two independent groups
  • For example, we have two different plant genotypes (genotype A and genotype B) and would like to test whether the yield of genotype A is significantly different from that of genotype B

Two sample t-test Hypotheses

  • Null hypothesis: Two group means are equal
  • Alternative hypothesis: Two group means are different (two-tailed or two-sided)
  • Alternative hypothesis: Mean of one group is either greater or less than that of the other group (one-tailed or one-sided)

Two sample t-test Assumptions

  • Observations in two groups have an approximately normal distribution (can be checked with the Shapiro-Wilk test)
  • Homogeneity of variances (variances are equal between treatment groups) (Levene or Bartlett Test)
  • The two groups are sampled independently from each other from the same population
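
The homogeneity of variances assumption can be checked with Levene's or Bartlett's test. This is a minimal sketch, assuming SciPy is installed, that calls `scipy.stats.levene` and `scipy.stats.bartlett`. The yields are synthetic values generated here for illustration and are not the article's dataset.

```python
import random

from scipy import stats

# Synthetic yields for two genotypes, generated only for illustration.
rng = random.Random(0)
yield_a = [rng.gauss(80, 3) for _ in range(20)]
yield_b = [rng.gauss(90, 3) for _ in range(20)]

# Both tests share the null hypothesis that the group variances are equal.
levene_stat, levene_p = stats.levene(yield_a, yield_b)
bartlett_stat, bartlett_p = stats.bartlett(yield_a, yield_b)

print(f"Levene:   W = {levene_stat:.3f}, p = {levene_p:.3f}")
print(f"Bartlett: T = {bartlett_stat:.3f}, p = {bartlett_p:.3f}")
```

A small p-value from either test suggests unequal variances, in which case Welch's t-test is the safer choice. Levene's test is less sensitive to departures from normality than Bartlett's test.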

Formula

\( \it{t} = \frac{ \bar{x_1} - \bar{x_2} }{ \sqrt{s^2 (\frac{1}{n_1} + \frac{1}{n_2}) } } \)

Under the null hypothesis, it approximately follows a \( \it{t}\)-distribution with \(n_1+n_2-2 \) degrees of freedom

Where, \(\bar{x_1} \) and \( \bar{x_2}\) = means of the two independent samples; \(\it{s^2} \)= pooled sample variance (estimate of the unknown common population variance \(\sigma^2\) ), and \(\it{n_1} \) and \(\it{n_2} \) are the sample sizes of the two independent samples

\(\it{s^2} \) is calculated as,

\( \it{s^2} = \frac{ (n_1-1) s_{x_1}^2 + (n_2-1) s_{x_2}^2 }{ n_1+n_2-2 } \)

Where, \(s_{x_1}^2 \) and \(s_{x_2}^2\)= sample variances (estimate of unknown population variances \(\sigma_{x_1}^2\) and \(\sigma_{x_2}^2\) )
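
As a worked example of the pooled formula, the sketch below computes the statistic from summary statistics alone, using only the standard library. The function name `pooled_t` is hypothetical. The group means, standard deviations, and sizes are the ones reported in the parameter estimates table of the Python example later in this section, so the result can be compared with that output.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, standard error, df) for the equal-variance two sample t-test."""
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, se, n1 + n2 - 2


# Genotype A and genotype B summary statistics from the parameter estimates table.
t_stat, se, dof = pooled_t(79.1, 3.30817, 6, 89.4, 3.29059, 6)
print(f"t = {t_stat:.4f}, SE = {se:.4f}, df = {dof}")
```

The values agree with the t, Std Error, and df rows of the bioinfokit summary up to rounding.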

For Welch's t-test (used when the group variances are not equal),

\( \it{t} = \frac{ \bar{x_1} - \bar{x_2} }{ \sqrt{ \frac{s_{x_1}^2}{n_1} + \frac{s_{x_2}^2}{n_2} } } \)

It approximately follows a \( \it{t}\)-distribution with \(\nu \) degrees of freedom, where

\( \nu = \frac{ (s_{x_1}^2/n_1 + s_{x_2}^2/n_2)^2 }{ \frac{ (s_{x_1}^2/n_1)^2 }{n_1-1} + \frac{ (s_{x_2}^2/n_2)^2 }{n_2-1} } \)
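
The degrees of freedom formula is easier to follow in code. Here is a standard-library sketch with a hypothetical function name, `welch_t`, applied to the same genotype summary statistics reported in the parameter estimates table of the Python example below. It uses the signed difference of the means so that the sign of t matches the equal-variance formula.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, nu) for Welch's unequal-variance t-test."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2   # estimated variance of each group mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    nu = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, nu


t_stat, nu = welch_t(79.1, 3.30817, 6, 89.4, 3.29059, 6)
print(f"t = {t_stat:.4f}, nu = {nu:.4f}")
```

In general \(\nu \) is not an integer. Because the two groups here have equal sizes and nearly equal standard deviations, \(\nu \) comes out just under the equal-variance value of 10 and t is the same as in the pooled test.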

How to perform Two sample t-test in Python?

  • We will use bioinfokit v0.9.6 or later
  • Check the bioinfokit documentation for installation instructions and usage
  • Download dataset for two sample and Welch’s t-test
# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has any, they will be omitted
>>> df = get_data('t_ind_samp').data
>>> df.head()
  Genotype  yield
0        A   78.0
1        A   84.3
2        A   81.0
3        B   88.0
4        B   92.0

>>> res = stat()
# for unequal variance t-test (Welch's t-test) set evar=False
>>> res.ttest(df=df, xfac="Genotype", res="yield", test_type=2)
# output
>>> print(res.summary)

Two sample t-test with equal variance

------------------  -------------
Mean diff           -10.3
t                    -5.40709
Std Error             1.90491
df                   10
P-value (one-tail)    0.000149204
P-value (two-tail)    0.000298408
Lower 95.0%         -14.5444
Upper 95.0%          -6.05561
------------------  -------------

Parameter estimates

Level      Number    Mean    Std Dev    Std Error    Lower 95.0%    Upper 95.0%
-------  --------  ------  ---------  -----------  -------------  -------------
A               6    79.1    3.30817      1.35056        75.6283        82.5717
B               6    89.4    3.29059      1.34338        85.9467        92.8533

Note: Although a t-test can be performed when the sample sizes of the two groups are unequal, equal sample sizes give the highest power for a given total number of observations.
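
One way to see why balanced groups help: in the equal-variance formula, the standard error is the pooled standard deviation multiplied by \( \sqrt{ \frac{1}{n_1} + \frac{1}{n_2} } \). For a fixed total sample size this factor is smallest when \(n_1 = n_2\), and a smaller standard error gives a larger absolute t for the same mean difference. The sketch below uses a hypothetical helper, `se_factor`, to compare three ways of splitting 12 observations.

```python
import math


def se_factor(n1, n2):
    """Multiplier applied to the pooled standard deviation in the standard error."""
    return math.sqrt(1 / n1 + 1 / n2)


# Three ways of splitting 12 observations between two groups.
for n1, n2 in [(6, 6), (3, 9), (2, 10)]:
    print(f"n1 = {n1:2d}, n2 = {n2:2d}: factor = {se_factor(n1, n2):.4f}")
```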

Interpretation

The p value obtained from the t-test is significant (p < 0.05), and therefore, we conclude that the yield of genotype A is significantly different from that of genotype B.

Paired t-test (dependent t-test)

  • The paired t-test is used to compare the differences between a pair of dependent variables measured on the same subject
  • For example, we have plant variety A and would like to compare the yield of A before and after the application of some fertilizer
  • Note: Paired t-test is a one sample t-test on the differences between the two dependent variables

Paired t-test Hypotheses

  • Null hypothesis: There is no difference between the two dependent variables (difference=0)
  • Alternative hypothesis: There is a difference between the two dependent variables (two-tailed or two-sided)
  • Alternative hypothesis: Difference between the two dependent variables is either greater or less than zero (one-tailed or one-sided)

Paired t-test Assumptions

  • Differences between the two dependent variables follow an approximately normal distribution (can be checked with the Shapiro-Wilk test)
  • Each subject should have a pair of measurements of the dependent variable (e.g. before and after treatment)
  • Differences between the two dependent variables should not have outliers
  • Observations are sampled independently from each other

Formula

\( \it{t} = \frac{ \bar{d} }{ s_d / \sqrt{n} } \)

Under the null hypothesis, it approximately follows a \( \it{t}\)-distribution with \(n-1 \) degrees of freedom

Where, \(\bar{d} \) = mean of sample differences; \(\it{s_d} \)= standard deviation of sample differences (estimate of population standard deviation \(\sigma_d \)), and \(\it{n} \) is the number of pairs
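
Since the paired t-test is a one sample t-test on the differences, it can be computed by hand in a few lines. This standard-library sketch uses only the five pairs displayed by `df.head()` in the example that follows, so its result is illustrative and will not match the summary for the full 65 pairs.

```python
import math
import statistics

# First five pairs as displayed by df.head() in the example that follows
# (the full dataset has 65 pairs, so these results differ from the summary).
before = [44.41, 46.29, 45.98, 43.35, 45.75]
after = [47.99, 56.64, 48.90, 49.01, 48.41]

# Paired t-test = one sample t-test on the differences against zero.
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
d_bar = statistics.mean(diffs)   # mean of the differences
s_d = statistics.stdev(diffs)    # standard deviation of the differences
t_stat = d_bar / (s_d / math.sqrt(n))

print(f"mean difference = {d_bar:.3f}, t = {t_stat:.3f}, df = {n - 1}")
```

The differences are taken as after minus before, so a positive mean difference indicates an increase.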

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has any, they will be omitted
>>> df = get_data('t_pair').data
>>> df.head()
      BF     AF
0  44.41  47.99
1  46.29  56.64
2  45.98  48.90
3  43.35  49.01
4  45.75  48.41

>>> res = stat()
>>> res.ttest(df=df, res=['AF', 'BF'], test_type=3)
# output
>>> print(res.summary)

Paired t-test

------------------  ------------
Sample size         65
Difference Mean      5.55262
t                   14.2173
Df                  64
P-value (one-tail)   8.87966e-22
P-value (two-tail)   1.77593e-21
Lower 95.0%          4.7724
Upper 95.0%          6.33283
------------------  ------------

Interpretation

The p value obtained from the t-test is significant (p < 0.05), and therefore, we conclude that the yield of plant variety A increased significantly after the application of fertilizer.


Last updated: November 15, 2020