t-test in Python
One Sample t-test
- The one sample t-test is used to compare the mean of a random sample drawn from a population with a specific value (the hypothesized or known mean of the population).
- For example, a ball is specified to have a diameter of 5 cm, and we want to check whether the average diameter of a random sample of balls (e.g. 50 balls) picked from the production line differs from this known size.
Assumptions
- Dependent variable should have an approximately normal distribution (Shapiro-Wilk test)
- Observations are independent of each other
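The normality assumption above can be checked before running the test. The following is a minimal sketch that assumes SciPy is installed and uses simulated diameters (hypothetical data, not the tutorial dataset). Independence of observations cannot be tested this way; it depends on how the sample was collected.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 50 simulated ball diameters in cm (not the tutorial dataset)
rng = np.random.default_rng(seed=0)
diameters = rng.normal(loc=5.0, scale=1.0, size=50)

# Shapiro-Wilk test: the null hypothesis is that the data come from a normal distribution
w_stat, p_value = stats.shapiro(diameters)
print(f"W = {w_stat:.4f}, p = {p_value:.4f}")

# A small p value (e.g. < 0.05) would suggest a departure from normality
looks_normal = p_value > 0.05
print("No evidence against normality" if looks_normal else "Normality is questionable")
```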
Hypotheses
- Null hypothesis: The population mean (estimated by the sample mean) is equal to the hypothesized or known value
- Alternative hypothesis: The population mean is not equal to the hypothesized or known value (two-tailed or two-sided)
- Alternative hypothesis: The population mean is either greater or less than the hypothesized or known value (one-tailed or one-sided)
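As an illustration of how the two-sided and one-sided alternatives relate, here is a small sketch using `scipy.stats.ttest_1samp` rather than bioinfokit. It assumes SciPy 1.6 or later (for the `alternative` argument) and uses only the first five diameters of the tutorial's example dataset, so the numbers will not match the full analysis.

```python
import numpy as np
from scipy import stats

# First five diameters of the example dataset (illustration only)
x = np.array([5.739987, 5.254042, 5.152388, 4.870819, 3.536251])
mu = 5.0  # hypothesized population mean

# Same data, three different alternative hypotheses
p_two = stats.ttest_1samp(x, popmean=mu, alternative="two-sided").pvalue
p_greater = stats.ttest_1samp(x, popmean=mu, alternative="greater").pvalue
p_less = stats.ttest_1samp(x, popmean=mu, alternative="less").pvalue

print(f"two-sided: {p_two:.4f}, greater: {p_greater:.4f}, less: {p_less:.4f}")
# The two one-sided p values sum to 1, and the smaller one is half the two-sided p value
```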
Formula
\( \it{t} = \frac{ \bar{x} - \mu }{ s / \sqrt{n} } \)
It follows approximately \( \it{t}\)-distribution with \(n-1 \) degrees of freedom
Where, \(\bar{x} \) = sample mean; \(\mu \) = hypothesized or known population mean; \(\it{s} \)= sample standard deviation (estimate of population standard deviation \(\sigma \)), and \(\it{n} \) is the sample size
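To make the formula concrete, the sketch below computes t by hand with NumPy and cross-checks it against `scipy.stats.ttest_1samp`. The data are the first five diameters of the example dataset, used purely for illustration; the variable names are my own.

```python
import numpy as np
from scipy import stats

# First five diameters of the example dataset (illustration only)
x = np.array([5.739987, 5.254042, 5.152388, 4.870819, 3.536251])
mu = 5.0  # hypothesized population mean

n = x.size
x_bar = x.mean()
s = x.std(ddof=1)                      # sample standard deviation (n - 1 denominator)
t_manual = (x_bar - mu) / (s / np.sqrt(n))
df = n - 1                             # degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df)   # two-tailed p value

# Cross-check against SciPy's one sample t-test
t_scipy, p_scipy = stats.ttest_1samp(x, popmean=mu)
print(f"manual: t = {t_manual:.5f}, p = {p_manual:.5f}")
print(f"scipy:  t = {t_scipy:.5f}, p = {p_scipy:.5f}")
```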
How to perform one sample t-test in Python?
- We will use bioinfokit v0.9.6 or later
- Check bioinfokit documentation for installation and documentation
# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it does, they will be omitted
>>> df = get_data('t_one_samp').data
>>> df.head()
size
0 5.739987
1 5.254042
2 5.152388
3 4.870819
4 3.536251
>>> res = stat()
>>> res.ttest(df=df, test_type=1, res='size', mu=5)
# output
>>> print(res.summary)
One Sample t-test
------------------ --------
Sample size 50
Mean 5.05128
t 0.36789
Df 49
P-value (one-tail) 0.35727
P-value (two-tail) 0.71454
Lower 95.0% 4.77116
Upper 95.0% 5.3314
------------------ --------
Interpretation
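The p value obtained from the one sample t-test is not significant (p > 0.05), and therefore, we fail to reject the null hypothesis: there is no evidence that the average diameter of the balls differs from 5 cm. Note that a non-significant result does not prove that the mean is exactly 5 cm.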
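The p values and the 95% confidence interval in the summary can be reproduced from the reported mean, t, and Df. This is only a sketch using `scipy.stats.t`, with the numbers copied from the summary table above, so small rounding differences are expected.

```python
from scipy import stats

# Values copied from the one sample t-test summary above
mean, t_value, df, mu = 5.05128, 0.36789, 49, 5.0

# Tail probabilities of the t-distribution
p_one_tail = stats.t.sf(abs(t_value), df)
p_two_tail = 2 * p_one_tail

# Standard error recovered from t = (mean - mu) / se
se = (mean - mu) / t_value

# 95% confidence interval for the mean: mean +/- t_crit * se
t_crit = stats.t.ppf(0.975, df)
ci_lower = mean - t_crit * se
ci_upper = mean + t_crit * se

print(f"p (one-tail) = {p_one_tail:.5f}, p (two-tail) = {p_two_tail:.5f}")
print(f"95% CI = ({ci_lower:.5f}, {ci_upper:.5f})")
```

The interval contains the hypothesized value of 5, which agrees with the non-significant two-tailed p value.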
Two sample t-test (unpaired or independent t-test)
- The two sample independent t-test is used to compare the means of two independent groups
- For example, we have two different plant genotypes (genotype A and genotype B) and would like to test whether the yield of genotype A is significantly different from that of genotype B
Two sample t-test Hypotheses
- Null hypothesis: Two group means are equal
- Alternative hypothesis: Two group means are different (two-tailed or two-sided)
- Alternative hypothesis: Mean of one group is either greater or less than that of the other group (one-tailed or one-sided)
Two sample t-test Assumptions
- Observations in the two groups have an approximately normal distribution (Shapiro-Wilk test)
- Homogeneity of variances (variances are equal between treatment groups) (Levene or Bartlett test)
- Observations in the two groups are sampled independently of each other
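The normality and equal-variance assumptions listed above can be checked with SciPy. The sketch below uses simulated yields for two genotypes (hypothetical data, not the tutorial dataset) and the functions `scipy.stats.shapiro`, `scipy.stats.levene`, and `scipy.stats.bartlett`.

```python
import numpy as np
from scipy import stats

# Hypothetical yields for two genotypes (simulated, not the tutorial dataset)
rng = np.random.default_rng(seed=1)
yield_a = rng.normal(loc=80.0, scale=3.0, size=20)
yield_b = rng.normal(loc=90.0, scale=3.0, size=20)

# Normality within each group (null hypothesis: data are normally distributed)
shapiro_p_a = stats.shapiro(yield_a).pvalue
shapiro_p_b = stats.shapiro(yield_b).pvalue

# Homogeneity of variances (null hypothesis: group variances are equal)
lev_stat, lev_p = stats.levene(yield_a, yield_b)
bart_stat, bart_p = stats.bartlett(yield_a, yield_b)

print(f"Shapiro-Wilk p: A = {shapiro_p_a:.4f}, B = {shapiro_p_b:.4f}")
print(f"Levene p = {lev_p:.4f}, Bartlett p = {bart_p:.4f}")
```

If the variance tests are significant, Welch's t-test is usually the safer choice. Bartlett's test is itself sensitive to non-normality, so Levene's test is often preferred when normality is in doubt.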
Formula
\( \it{t} = \frac{ \bar{x_1} - \bar{x_2} }{ \sqrt{s^2 (\frac{1}{n_1} + \frac{1}{n_2}) } } \)
It follows approximately \( \it{t}\)-distribution with \(n_1+n_2-2 \) degrees of freedom
Where, \(\bar{x_1} \) and \( \bar{x_2}\) = means for two independent samples; \(\it{s^2} \)= pooled sample variance (estimate of unknown population variance \(\sigma^2\) ), and \(\it{n_1} \) and \(\it{n_2} \) are the sample size for two independent samples
\(\it{s^2} \) is calculated as,
\( \it{s^2} = \frac{ (n_1-1) s_{x_1}^2 + (n_2-1) s_{x_2}^2 }{ n_1+n_2-2 } \)
Where, \(s_{x_1}^2 \) and \(s_{x_2}^2\)= sample variances (estimate of unknown population variances \(\sigma_{x_1}^2\) and \(\sigma_{x_2}^2\) )
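The pooled-variance formula can be verified numerically. The following sketch uses made-up yields (hypothetical values, chosen only for illustration), computes the pooled variance and t by hand, and compares the result with `scipy.stats.ttest_ind` with `equal_var=True`.

```python
import numpy as np
from scipy import stats

# Hypothetical yields for two independent groups (illustration only)
a = np.array([78.0, 84.3, 81.0, 79.5, 76.2, 80.1])
b = np.array([88.0, 92.0, 89.5, 85.7, 90.3, 91.1])

n1, n2 = a.size, b.size
var1, var2 = a.var(ddof=1), b.var(ddof=1)    # sample variances

# Pooled sample variance and the t statistic
s2_pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(s2_pooled * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-tailed p value

# Cross-check against SciPy's equal-variance two sample t-test
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=True)
print(f"manual: t = {t_manual:.5f}, df = {df}, p = {p_manual:.6f}")
print(f"scipy:  t = {t_scipy:.5f}, p = {p_scipy:.6f}")
```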
For Welch's test (where group variances are not equal)
\( \it{t} = \frac{ \bar{x_1} - \bar{x_2} }{ \sqrt{ \frac{s_{x_1}^2}{n_1} + \frac{s_{x_2}^2}{n_2} } } \)
It follows approximately \( \it{t}\)-distribution with \(\nu \) degrees of freedom
\( \nu = \frac{ (s_{x_1}^2/n_1 + s_{x_2}^2/n_2)^2 }{ \frac{ (s_{x_1}^2/n_1)^2 }{n_1-1} + \frac{ (s_{x_2}^2/n_2)^2 }{n_2-1} } \)
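The Welch statistic and its degrees of freedom \(\nu\) can be sketched in the same way. The data below are hypothetical, with unequal sample sizes and visibly different spreads, and the result is compared with `scipy.stats.ttest_ind` using `equal_var=False`.

```python
import numpy as np
from scipy import stats

# Hypothetical groups with unequal sizes and unequal variances (illustration only)
a = np.array([78.0, 84.3, 81.0, 79.5, 76.2, 80.1])
b = np.array([88.0, 97.5, 83.2, 92.0, 101.4, 79.9, 94.6, 86.3])

n1, n2 = a.size, b.size
v1 = a.var(ddof=1) / n1    # s_x1^2 / n1
v2 = b.var(ddof=1) / n2    # s_x2^2 / n2

# Welch's t statistic (signed difference; its absolute value gives the same two-tailed p value)
t_manual = (a.mean() - b.mean()) / np.sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), nu)

# Cross-check against SciPy's unequal-variance (Welch) t-test
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(f"manual: t = {t_manual:.5f}, nu = {nu:.3f}, p = {p_manual:.6f}")
print(f"scipy:  t = {t_scipy:.5f}, p = {p_scipy:.6f}")
```

Note that \(\nu\) is generally not an integer and always lies between \(\min(n_1, n_2) - 1\) and \(n_1 + n_2 - 2\).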
How to perform Two sample t-test in Python?
- We will use bioinfokit v0.9.6 or later
- Check bioinfokit documentation for installation and documentation
- Download dataset for two sample and Welch’s t-test
# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it does, they will be omitted
>>> df = get_data('t_ind_samp').data
>>> df.head()
Genotype yield
0 A 78.0
1 A 84.3
2 A 81.0
3 B 88.0
4 B 92.0
>>> res = stat()
# for unequal variance t-test (Welch's t-test) set evar=False
>>> res.ttest(df=df, xfac="Genotype", res="yield", test_type=2)
# output
>>> print(res.summary)
Two sample t-test with equal variance
------------------ -------------
Mean diff -10.3
t -5.40709
Std Error 1.90491
df 10
P-value (one-tail) 0.000149204
P-value (two-tail) 0.000298408
Lower 95.0% -14.5444
Upper 95.0% -6.05561
------------------ -------------
Parameter estimates
Level      Number    Mean    Std Dev    Std Error    Lower 95.0%    Upper 95.0%
-------  --------  ------  ---------  -----------  -------------  -------------
A               6    79.1    3.30817      1.35056        75.6283        82.5717
B               6    89.4    3.29059      1.34338        85.9467        92.8533
Note: Even though you can perform a t-test when the sample size is unequal between two groups, equal sample sizes in the two groups are preferable because they increase the power of the t-test.
Interpretation
The p value obtained from the t-test is significant (p < 0.05), and therefore, we conclude that the yield of genotype A is significantly different from that of genotype B.
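As a cross-check, the t statistic and two-tailed p value can be recovered from the group means, standard deviations, and sample sizes alone. This sketch uses `scipy.stats.ttest_ind_from_stats` with the numbers copied from the "Parameter estimates" table above, so agreement is limited by rounding in that table.

```python
from scipy import stats

# Group summaries copied from the "Parameter estimates" table above
res = stats.ttest_ind_from_stats(
    mean1=79.1, std1=3.30817, nobs1=6,   # genotype A
    mean2=89.4, std2=3.29059, nobs2=6,   # genotype B
    equal_var=True,                      # pooled-variance (equal variance) t-test
)
t_value, p_two_tail = res.statistic, res.pvalue
print(f"t = {t_value:.5f}, p (two-tail) = {p_two_tail:.9f}")
```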
Paired t-test (dependent t-test)
- The paired t-test is used to compare the differences between a pair of dependent measurements taken on the same subject
- For example, we have plant variety A and would like to compare the yield of A before and after the application of some fertilizer
- Note: The paired t-test is equivalent to a one sample t-test on the differences between the two dependent variables
Paired t-test Hypotheses
- Null hypothesis: There is no difference between the two dependent variables (difference=0)
- Alternative hypothesis: There is a difference between the two dependent variables (two-tailed or two-sided)
- Alternative hypothesis: Difference between the two dependent variables is either greater or less than zero (one-tailed or one-sided)
Paired t-test Assumptions
- Differences between the two dependent variables follow an approximately normal distribution (Shapiro-Wilk test)
- Each subject (independent unit) should have a pair of dependent measurements
- Differences between the two dependent variables should not have outliers
- Subjects (pairs of observations) are sampled independently of each other
Formula
\( \it{t} = \frac{ \bar{d} }{ s_d / \sqrt{n} } \)
It follows approximately \( \it{t}\)-distribution with \(n-1 \) degrees of freedom
Where, \(\bar{d} \) = mean of sample differences; \(\it{s_d} \)= standard deviation of sample differences (estimate of population standard deviation \(\sigma_d \)), and \(\it{n} \) is the sample size
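The sketch below illustrates that the paired t-test is a one sample t-test on the differences. It uses only the first five BF/AF pairs of the example dataset, so the numbers will differ from the full 65-pair analysis, and it compares a manual calculation with `scipy.stats.ttest_rel` and `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# First five before (BF) / after (AF) pairs of the example dataset (illustration only)
bf = np.array([44.41, 46.29, 45.98, 43.35, 45.75])
af = np.array([47.99, 56.64, 48.90, 49.01, 48.41])

d = af - bf                      # per-subject differences
n = d.size
d_bar = d.mean()                 # mean of the differences
s_d = d.std(ddof=1)              # standard deviation of the differences
t_manual = d_bar / (s_d / np.sqrt(n))

# Paired t-test and one sample t-test on the differences give the same result
t_rel, p_rel = stats.ttest_rel(af, bf)
t_one, p_one = stats.ttest_1samp(d, popmean=0.0)
print(f"manual t = {t_manual:.5f}")
print(f"ttest_rel:   t = {t_rel:.5f}, p = {p_rel:.5f}")
print(f"ttest_1samp: t = {t_one:.5f}, p = {p_one:.5f}")
```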
How to perform paired t-test in Python?
# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it does, they will be omitted
>>> df = get_data('t_pair').data
>>> df.head()
BF AF
0 44.41 47.99
1 46.29 56.64
2 45.98 48.90
3 43.35 49.01
4 45.75 48.41
>>> res = stat()
>>> res.ttest(df=df, res=['AF', 'BF'], test_type=3)
# output
>>> print(res.summary)
Paired t-test
------------------ ------------
Sample size 65
Difference Mean 5.55262
t 14.2173
Df 64
P-value (one-tail) 8.87966e-22
P-value (two-tail) 1.77593e-21
Lower 95.0% 4.7724
Upper 95.0% 6.33283
------------------ ------------
Interpretation
The p value obtained from the t-test is significant (p < 0.05), and therefore, we conclude that the yield of plant variety A significantly increased by the application of fertilizer.
References
- Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature methods. 2020 Mar;17(3):261-72.
- Kim TK, Park JH. More about the basic assumptions of t-test: normality and sample size. Korean journal of anesthesiology. 2019 Aug;72(4):331.
Last updated: November 15, 2020