t-test in Python
One Sample t-test
- One sample t-test is used to compare the mean of a sample (drawn at random from a population) with a specific value (the hypothesized or known mean of the population).
- For example, balls are manufactured to have a diameter of 5 cm, and we want to check whether the average diameter of a random sample (e.g. 50 balls) picked from the production line differs from this known size.
Assumptions
- Dependent variable should have an approximately normal distribution (Shapiro-Wilk test)
- Observations are independent of each other
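The normality assumption can be checked before running the t-test. The snippet below is a minimal sketch that assumes SciPy is installed and uses its `scipy.stats.shapiro` function; the diameters are made-up values for illustration, not the dataset used in this article.

```python
from scipy import stats

# hypothetical ball diameters in cm (made up for illustration)
diameters = [5.2, 4.9, 5.3, 5.0, 5.1, 4.8, 5.4, 5.0, 5.1, 4.9]

# Shapiro-Wilk test: the null hypothesis is that the data come from a normal distribution
w_stat, p_value = stats.shapiro(diameters)

# a p value above the chosen alpha (0.05 here) gives no evidence against normality
looks_normal = bool(p_value > 0.05)
print(f"W = {w_stat:.3f}, p = {p_value:.3f}, no evidence against normality: {looks_normal}")
```

A non-significant result does not prove normality; it only means the test found no evidence against it, which matters most for small samples.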
Hypotheses
- Null hypothesis: Sample mean is equal to the hypothesized or known population mean
- Alternative hypothesis: Sample mean is not equal to the hypothesized or known population mean (two-tailed or two-sided)
- Alternative hypothesis: Sample mean is either greater than or less than the hypothesized or known population mean (one-tailed or one-sided)
Formula
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
It approximately follows a t-distribution with $n - 1$ degrees of freedom
Where, $\bar{x}$ = sample mean; $\mu$ = hypothesized or known population mean; $s$ = sample standard deviation (estimate of population standard deviation $\sigma$), and $n$ is the sample size
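As a rough illustration of the formula above, the statistic can be computed by hand with the Python standard library. The `one_sample_t` helper and the five diameters are hypothetical; they are not part of bioinfokit or of the dataset used in this article.

```python
import math
from statistics import mean, stdev


def one_sample_t(sample, mu):
    """Return (t, degrees of freedom) for a one sample t-test."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    t = (x_bar - mu) / (s / math.sqrt(n))
    return t, n - 1


# hypothetical ball diameters in cm (made up for illustration)
diameters = [5.2, 4.9, 5.3, 5.0, 5.1]
t_stat, dof = one_sample_t(diameters, mu=5.0)
print(f"t = {t_stat:.4f}, df = {dof}")
```

The sign of `t_stat` shows the direction of the difference: positive when the sample mean is above the hypothesized mean, negative when it is below.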
How to perform one sample t-test in Python?
- We will use bioinfokit v0.9.6 or later
- Check the bioinfokit documentation for installation and usage
# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has any, they will be omitted
>>> df = get_data('t_one_samp').data
>>> df.head()
       size
0  5.739987
1  5.254042
2  5.152388
3  4.870819
4  3.536251
>>> res = stat()
>>> res.ttest(df=df, test_type=1, res='size', mu=5)
# output
>>> print(res.summary)
One Sample t-test
------------------  --------
Sample size         50
Mean                5.05128
t                   0.36789
Df                  49
P-value (one-tail)  0.35727
P-value (two-tail)  0.71454
Lower 95.0%         4.77116
Upper 95.0%         5.3314
------------------  --------
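Interpretation
The p value obtained from the one sample t-test is not significant (two-tailed p = 0.71454, which is > 0.05), and therefore, we fail to reject the null hypothesis. There is no evidence that the average diameter of the balls in the random sample differs from 5 cm. Note that a non-significant result does not prove that the mean is exactly 5 cm.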
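The quantities reported in the summary (t, one-tailed and two-tailed p values, and the 95% confidence interval of the mean) can also be reproduced outside bioinfokit. The following is a sketch that assumes SciPy is available; the data are hypothetical, so the numbers will not match the output above.

```python
from scipy import stats

# hypothetical ball diameters in cm (made up for illustration)
diameters = [5.2, 4.9, 5.3, 5.0, 5.1]
mu = 5.0  # hypothesized population mean

# two-tailed one sample t-test
result = stats.ttest_1samp(diameters, popmean=mu)
t_stat, p_two = result.statistic, result.pvalue

# one-tailed p value: tail area beyond |t|, i.e. half of the two-tailed p value
dof = len(diameters) - 1
p_one = stats.t.sf(abs(t_stat), dof)

# 95% confidence interval of the mean: mean +/- t_crit * standard error
sample_mean = sum(diameters) / len(diameters)
std_err = stats.sem(diameters)  # s / sqrt(n), with n - 1 in the variance
t_crit = stats.t.ppf(0.975, dof)
lower, upper = sample_mean - t_crit * std_err, sample_mean + t_crit * std_err

print(f"t = {t_stat:.4f}, p (one-tail) = {p_one:.4f}, p (two-tail) = {p_two:.4f}")
print(f"95% CI = ({lower:.4f}, {upper:.4f})")
```

When the hypothesized mean lies inside the 95% confidence interval, the two-tailed test is not significant at the 0.05 level, which is the same conclusion expressed in a different form.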
Two sample t-test (unpaired or independent t-test)
- Two sample independent t-test is used to compare the means of two independent groups
- For example, we have two different plant genotypes (genotype A and genotype B) and would like to check whether the yield of genotype A is significantly different from that of genotype B
Two sample t-test Hypotheses
- Null hypothesis: Two group means are equal
- Alternative hypothesis: Two group means are different (two-tailed or two-sided)
- Alternative hypothesis: Mean of one group is either greater than or less than that of the other group (one-tailed or one-sided)
Two sample t-test Assumptions
- Observations in the two groups have an approximately normal distribution (Shapiro-Wilk test)
- Homogeneity of variances (variances are equal between treatment groups) (Levene or Bartlett test)
- The two groups are sampled independently of each other from the same population
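The homogeneity of variances assumption decides between the standard two sample t-test and Welch's t-test. A minimal sketch of both checks named above, assuming SciPy is installed and using made-up yields for two genotypes:

```python
from scipy import stats

# hypothetical yields for two genotypes (made up for illustration)
yield_a = [78.0, 84.3, 81.0, 80.1, 79.5, 82.2]
yield_b = [88.0, 92.0, 90.5, 89.1, 91.3, 87.9]

# both tests have the null hypothesis that the group variances are equal
levene_stat, levene_p = stats.levene(yield_a, yield_b)
bartlett_stat, bartlett_p = stats.bartlett(yield_a, yield_b)

# a p value above alpha (0.05 here) gives no evidence of unequal variances
variances_look_equal = bool(levene_p > 0.05)
print(f"Levene p = {levene_p:.3f}, Bartlett p = {bartlett_p:.3f}")
```

Levene's test is less sensitive to departures from normality than Bartlett's test, so it is the more cautious choice when normality is in doubt.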
Formula
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$
It approximately follows a t-distribution with $n_1 + n_2 - 2$ degrees of freedom
Where, $\bar{x}_1$ and $\bar{x}_2$ = means for two independent samples; $s^2$ = pooled sample variance (estimate of unknown population variance $\sigma^2$), and $n_1$ and $n_2$ are the sample sizes for two independent samples
$s^2$ is calculated as,
$$s^2 = \frac{(n_1 - 1) s_{x_1}^2 + (n_2 - 1) s_{x_2}^2}{n_1 + n_2 - 2}$$
Where, $s_{x_1}^2$ and $s_{x_2}^2$ = sample variances (estimates of unknown population variances $\sigma_{x_1}^2$ and $\sigma_{x_2}^2$)
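The pooled variance and the t statistic can be worked through by hand as a check on the formulas. This is only a sketch using the Python standard library; the `pooled_t` helper and the yields are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Return (t, pooled variance, degrees of freedom) for an equal variance t-test."""
    n1, n2 = len(x1), len(x2)
    # pooled variance: sample variances weighted by their degrees of freedom
    s2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2)) / math.sqrt(s2 * (1 / n1 + 1 / n2))
    return t, s2, n1 + n2 - 2


# hypothetical yields (made up for illustration)
yield_a = [78.0, 84.0, 81.0]
yield_b = [88.0, 92.0, 90.0]
t_stat, s2, dof = pooled_t(yield_a, yield_b)
print(f"pooled variance = {s2:.2f}, t = {t_stat:.4f}, df = {dof}")
```

Here the sample variances are 9 and 4, so the pooled variance is (2 × 9 + 2 × 4) / 4 = 6.5, and the degrees of freedom are 3 + 3 − 2 = 4.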
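For Welch's test (where group variances are not equal)
$$t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\frac{s_{x_1}^2}{n_1} + \frac{s_{x_2}^2}{n_2}}}$$
It approximately follows a t-distribution with $\nu$ degrees of freedom
$$\nu = \frac{\left( s_{x_1}^2 / n_1 + s_{x_2}^2 / n_2 \right)^2}{\frac{(s_{x_1}^2 / n_1)^2}{n_1 - 1} + \frac{(s_{x_2}^2 / n_2)^2}{n_2 - 1}}$$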
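A short sketch of Welch's statistic and its degrees of freedom, following the two formulas above term by term. The `welch_t` helper and the yields are hypothetical and use only the Python standard library.

```python
import math
from statistics import mean, variance


def welch_t(x1, x2):
    """Return (t, nu) for Welch's t-test, with t as an absolute value as in the formula above."""
    n1, n2 = len(x1), len(x2)
    v1 = variance(x1) / n1  # s_x1^2 / n1
    v2 = variance(x2) / n2  # s_x2^2 / n2
    t = abs(mean(x1) - mean(x2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, nu


# hypothetical yields with unequal group sizes (made up for illustration)
yield_a = [78.0, 84.0, 81.0]
yield_b = [88.0, 92.0, 90.0, 94.0]
t_stat, nu = welch_t(yield_a, yield_b)
print(f"t = {t_stat:.4f}, nu = {nu:.4f}")
```

Unlike the pooled test, the degrees of freedom here are usually not an integer, and they never exceed $n_1 + n_2 - 2$.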
How to perform Two sample t-test in Python?
- We will use bioinfokit v0.9.6 or later
- Check the bioinfokit documentation for installation and usage
- Download dataset for two sample and Welch’s t-test
# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has any, they will be omitted
>>> df = get_data('t_ind_samp').data
>>> df.head()
  Genotype  yield
0        A   78.0
1        A   84.3
2        A   81.0
3        B   88.0
4        B   92.0
>>> res = stat()
# for unequal variance t-test (Welch's t-test) set evar=False
>>> res.ttest(df=df, xfac="Genotype", res="yield", test_type=2)
# output
>>> print(res.summary)
Two sample t-test with equal variance
------------------  -------------
Mean diff           -10.3
t                   -5.40709
Std Error           1.90491
df                  10
P-value (one-tail)  0.000149204
P-value (two-tail)  0.000298408
Lower 95.0%         -14.5444
Upper 95.0%         -6.05561
------------------  -------------
Parameter estimates
Level      Number    Mean    Std Dev    Std Error    Lower 95.0%    Upper 95.0%
-------  --------  ------  ---------  -----------  -------------  -------------
A               6    79.1    3.30817      1.35056        75.6283        82.5717
B               6    89.4    3.29059      1.34338        85.9467        92.8533
Note: A t-test can be performed when the sample sizes of the two groups are unequal, but for a fixed total sample size, equal group sizes generally give the t-test the most power.
Interpretation
The p value obtained from the t-test is significant (p < 0.05), and therefore, we conclude that the yield of genotype A is significantly different from that of genotype B (mean difference of -10.3, i.e. genotype A yields less than genotype B).
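For comparison, the same pair of tests (equal variance and Welch's) can be run with SciPy, assuming it is installed. Note that `equal_var` is SciPy's parameter name, not bioinfokit's, and the yields below are hypothetical, so the results differ from the summary above.

```python
from scipy import stats

# hypothetical yields with unequal group sizes (made up for illustration)
yield_a = [78.0, 84.0, 81.0]
yield_b = [88.0, 92.0, 90.0, 94.0]

# standard two sample t-test (pooled variance)
student = stats.ttest_ind(yield_a, yield_b, equal_var=True)

# Welch's t-test (variances not assumed equal)
welch = stats.ttest_ind(yield_a, yield_b, equal_var=False)

print(f"equal variance: t = {student.statistic:.4f}, p = {student.pvalue:.4f}")
print(f"Welch:          t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
```

SciPy reports a signed t statistic (first group minus second group) and a two-tailed p value by default.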
Paired t-test (dependent t-test)
- Paired t-test is used to compare the differences between a pair of dependent variables measured on the same subject
- For example, we have plant variety A and would like to compare the yield of A before and after the application of some fertilizer
- Note: Paired t-test is a one sample t-test on the differences between the two dependent variables
Paired t-test Hypotheses
- Null hypothesis: There is no difference between the two dependent variables (difference=0)
- Alternative hypothesis: There is a difference between the two dependent variables (two-tailed or two-sided)
- Alternative hypothesis: Difference between the two dependent variables is either greater than or less than zero (one-tailed or one-sided)
Paired t-test Assumptions
- Differences between the two dependent variables follow an approximately normal distribution (Shapiro-Wilk test)
- Each subject should have a pair of measurements of the dependent variable (e.g. before and after treatment)
- Differences between the two dependent variables should not have outliers
- Observations (pairs) are sampled independently of each other
Formula
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
It approximately follows a t-distribution with $n - 1$ degrees of freedom
Where, $\bar{d}$ = mean of sample differences; $s_d$ = standard deviation of sample differences (estimate of population standard deviation $\sigma_d$), and $n$ is the sample size (number of pairs)
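The paired formula can be sketched by hand as a one sample computation on the differences. The `paired_t` helper is hypothetical, and the five before/after pairs are only the first rows of the example dataset used in this section, so the result will not match a test on the full data.

```python
import math
from statistics import mean, stdev


def paired_t(after, before):
    """Return (t, degrees of freedom) for a paired t-test on after - before."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    d_bar = mean(diffs)  # mean of the differences
    s_d = stdev(diffs)   # standard deviation of the differences
    return d_bar / (s_d / math.sqrt(n)), n - 1


# first five pairs only (BF = before fertilizer, AF = after fertilizer)
bf = [44.41, 46.29, 45.98, 43.35, 45.75]
af = [47.99, 56.64, 48.90, 49.01, 48.41]
t_stat, dof = paired_t(af, bf)
print(f"t = {t_stat:.4f}, df = {dof}")
```

The order of subtraction sets the sign of t: with after − before, a positive t indicates an increase after treatment.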
# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has any, they will be omitted
>>> df = get_data('t_pair').data
>>> df.head()
      BF     AF
0  44.41  47.99
1  46.29  56.64
2  45.98  48.90
3  43.35  49.01
4  45.75  48.41
>>> res = stat()
>>> res.ttest(df=df, res=['AF', 'BF'], test_type=3)
# output
>>> print(res.summary)
Paired t-test
------------------  ------------
Sample size         65
Difference Mean     5.55262
t                   14.2173
Df                  64
P-value (one-tail)  8.87966e-22
P-value (two-tail)  1.77593e-21
Lower 95.0%         4.7724
Upper 95.0%         6.33283
------------------  ------------
Interpretation
The p value obtained from the t-test is significant (p < 0.05), and therefore, we conclude that the yield of plant variety A increased significantly after the application of fertilizer (mean difference of 5.55 between AF and BF).
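The equivalence between the paired t-test and a one sample t-test on the differences can be checked numerically. This sketch assumes SciPy is installed and again uses only the first five pairs of the example dataset, so its numbers differ from the summary above.

```python
from scipy import stats

# first five pairs only (BF = before fertilizer, AF = after fertilizer)
bf = [44.41, 46.29, 45.98, 43.35, 45.75]
af = [47.99, 56.64, 48.90, 49.01, 48.41]

# paired t-test on the two related samples
paired = stats.ttest_rel(af, bf)

# one sample t-test on the differences against a mean of zero
diffs = [a - b for a, b in zip(af, bf)]
one_sample = stats.ttest_1samp(diffs, popmean=0)

print(f"paired:     t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"one sample: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
```

Both calls report the same t statistic and two-tailed p value, which is why the normality assumption applies to the differences and not to the two original variables.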
Last updated: November 15, 2020