t-test in Python

December 15, 2020 6 minute read Follow @GaryLi

One Sample t-test

The one sample t-test compares the mean of a random sample drawn from a population with a specific value (the hypothesized or known population mean).
For example, a ball is specified to have a diameter of 5 cm, and we want to check whether the average diameter of a random sample of balls (e.g. 50 balls) picked from the production line differs from that known size.

Assumptions

Dependent variable should have an approximately normal distribution (can be checked with the Shapiro-Wilk test)
Observations are independent of each other

Hypotheses

Null hypothesis: The mean of the population the sample was drawn from is equal to the hypothesized or known population mean
Alternative hypothesis: The population mean is not equal to the hypothesized or known population mean (two-tailed or two-sided)
Alternative hypothesis: The population mean is either greater than or less than the hypothesized or known population mean (one-tailed or one-sided)

Formula

\( \it{t} = \frac{ \bar{x} - \mu }{ s / \sqrt{n} } \)

Under the null hypothesis, it approximately follows a \( \it{t}\)-distribution with \(n-1 \) degrees of freedom

Where, \(\bar{x} \) = sample mean; \(\mu \) = hypothesized or known population mean; \(\it{s} \)= sample standard deviation (estimate of population standard deviation \(\sigma \)), and \(\it{n} \) is the sample size
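
To make the formula concrete, here is a minimal sketch that evaluates it using only the Python standard library. The function name `one_sample_t` and the five diameters are made up for illustration; they are not part of bioinfokit or of the dataset used below.

```python
import math
import statistics

def one_sample_t(sample, mu):
    """Return (t, degrees of freedom) for a one sample t-test against mu."""
    n = len(sample)
    x_bar = statistics.mean(sample)   # sample mean
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
    t = (x_bar - mu) / (s / math.sqrt(n))
    return t, n - 1

# hypothetical diameters in cm, for illustration only
diameters = [5.1, 4.9, 5.3, 5.0, 5.2]
t, df = one_sample_t(diameters, mu=5)
print(f"t = {t:.4f}, df = {df}")
```

A t value close to zero means the sample mean lies within a small fraction of a standard error of the hypothesized mean.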

How to perform one sample t-test in Python?

We will use bioinfokit v0.9.6 or later
See the bioinfokit documentation for installation and usage

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has, they will be omitted
>>> df = get_data('t_one_samp').data
>>> df.head()
       size
0  5.739987
1  5.254042
2  5.152388
3  4.870819
4  3.536251

>>> res = stat()
>>> res.ttest(df=df, test_type=1, res='size', mu=5)
# output
>>> print(res.summary)

One Sample t-test

------------------  --------
Sample size         50
Mean                 5.05128
t                    0.36789
Df                  49
P-value (one-tail)   0.35727
P-value (two-tail)   0.71454
Lower 95.0%          4.77116
Upper 95.0%          5.3314
------------------  --------

Interpretation

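The p value obtained from the one sample t-test is not significant (p > 0.05), and therefore, we fail to reject the null hypothesis: there is no evidence that the average diameter of the balls differs from 5 cm. Note that a non-significant result does not prove that the average diameter is exactly 5 cm.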
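
The p-values in the summary can be cross-checked from the reported t and Df alone. The sketch below is a standard-library approximation, not how bioinfokit computes them: it integrates the density of the t-distribution numerically with Simpson's rule. The helper names `t_pdf` and `two_tailed_p` are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and |t|.

    The area is computed with Simpson's rule; steps must be even.
    """
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total   # the density is symmetric around 0
    return 1 - central_area

# t and Df taken from the one sample summary above
p_two = two_tailed_p(0.36789, 49)
p_one = p_two / 2   # one-tailed p-value in the direction of the observed difference
print(f"two-tail p = {p_two:.5f}, one-tail p = {p_one:.5f}")
```

The result should agree closely with the 0.71454 and 0.35727 reported in the summary. For very small p-values the subtraction from 1 loses precision, so a dedicated statistics library is preferable there.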

Two sample t-test (unpaired or independent t-test)

The two sample independent t-test is used to compare the means of two independent groups
For example, we have two different plant genotypes (genotype A and genotype B) and would like to test whether the yield of genotype A is significantly different from that of genotype B

Two sample t-test Hypotheses

Null hypothesis: Two group means are equal
Alternative hypothesis: Two group means are different (two-tailed or two-sided)
Alternative hypothesis: Mean of one group is either greater than or less than that of the other group (one-tailed or one-sided)

Two sample t-test Assumptions

Observations in the two groups have an approximately normal distribution (can be checked with the Shapiro-Wilk test)
Homogeneity of variances (variances are equal between treatment groups) (can be checked with the Levene or Bartlett test)
The observations in the two groups are sampled independently of each other

Formula

\( \it{t} = \frac{ \bar{x_1} - \bar{x_2} }{ \sqrt{s^2 (\frac{1}{n_1} + \frac{1}{n_2}) } } \)

Under the null hypothesis, it approximately follows a \( \it{t}\)-distribution with \(n_1+n_2-2 \) degrees of freedom

Where, \(\bar{x_1} \) and \( \bar{x_2}\) = means for two independent samples; \(\it{s^2} \)= pooled sample variance (estimate of unknown population variance \(\sigma^2\) ), and \(\it{n_1} \) and \(\it{n_2} \) are the sample size for two independent samples

\(\it{s^2} \) is calculated as,

\( \it{s^2} = \frac{ (n_1-1) s_{x_1}^2 + (n_2-1) s_{x_2}^2 }{ n_1+n_2-2 } \)

Where, \(s_{x_1}^2 \) and \(s_{x_2}^2\)= sample variances (estimate of unknown population variances \(\sigma_{x_1}^2\) and \(\sigma_{x_2}^2\) )
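
The pooled formula can be evaluated from group summaries alone. The sketch below uses only the standard library and a hypothetical helper `pooled_t`. Its inputs are the means, standard deviations and group sizes listed under "Parameter estimates" in the genotype example later in this section. The degrees of freedom equal the denominator of the pooled variance, n1 + n2 - 2.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two sample t statistic computed from group summaries."""
    # pooled variance: per-group variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / std_err
    return t, std_err, n1 + n2 - 2

# genotype A and genotype B summaries from the parameter estimates table
t, std_err, df = pooled_t(79.1, 3.30817, 6, 89.4, 3.29059, 6)
print(f"t = {t:.4f}, std error = {std_err:.4f}, df = {df}")
```

Up to rounding of the inputs, the result should match the t, Std Error and df rows of the two sample summary.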

For Welch's test (where group variances are not equal)

\( \it{t} = \frac{ \mid \bar{x_1} - \bar{x_2} \mid }{ \sqrt{ \frac{s_{x_1}^2}{n_1} + \frac{s_{x_2}^2}{n_2} } } \)

It follows approximately \( \it{t}\)-distribution with \(\nu \) degrees of freedom

\( \nu = \frac{ (s_{x_1}^2/n_1 + s_{x_2}^2/n_2)^2 }{ \frac{ (s_{x_1}^2/n_1)^2 }{n_1-1} + \frac{ (s_{x_2}^2/n_2)^2 }{n_2-1} } \)
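
As a rough sketch, the Welch statistic and its degrees of freedom can be computed in the same way from group summaries. The helper name `welch_t` is hypothetical, and the inputs are again the genotype summaries from the example that follows. The sketch returns the signed statistic; its magnitude is what matters for a two-sided test.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1**2 / n1   # squared standard error of group 1 mean
    v2 = sd2**2 / n2   # squared standard error of group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, nu

t, nu = welch_t(79.1, 3.30817, 6, 89.4, 3.29059, 6)
print(f"t = {t:.4f}, nu = {nu:.4f}")
```

With equal group sizes and nearly equal variances, as here, Welch's statistic coincides with the pooled one and \(\nu\) is only marginally below \(n_1+n_2-2\). The two tests diverge as the variances and sample sizes become more unequal.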

How to perform Two sample t-test in Python?

We will use bioinfokit v0.9.6 or later
See the bioinfokit documentation for installation and usage
Download dataset for two sample and Welch's t-test

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has, they will be omitted
>>> df = get_data('t_ind_samp').data
>>> df.head()
  Genotype  yield
0        A   78.0
1        A   84.3
2        A   81.0
3        B   88.0
4        B   92.0

>>> res = stat()
# for unequal variance t-test (Welch's t-test) set evar=False
>>> res.ttest(df=df, xfac="Genotype", res="yield", test_type=2)
# output
>>> print(res.summary)

Two sample t-test with equal variance

------------------  -------------
Mean diff           -10.3
t                    -5.40709
Std Error             1.90491
df                   10
P-value (one-tail)    0.000149204
P-value (two-tail)    0.000298408
Lower 95.0%         -14.5444
Upper 95.0%          -6.05561
------------------  -------------

Parameter estimates

Level      Number    Mean    Std Dev    Std Error    Lower 95.0%    Upper 95.0%
-------  --------  ------  ---------  -----------  -------------  -------------
A               6    79.1    3.30817      1.35056        75.6283        82.5717
B               6    89.4    3.29059      1.34338        85.9467        92.8533

Note: Even though you can perform a t-test when the sample size is unequal between two groups, it is more efficient to have an equal sample size in two groups to increase the power of the t-test.
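
One way to see why balanced groups help: in the equal-variance formula the standard error is proportional to sqrt(1/n1 + 1/n2), and for a fixed total number of observations this factor is smallest when the groups are the same size. The short sketch below (hypothetical helper `se_factor`, total of 12 chosen to match the example) tabulates the factor for different splits.

```python
import math

def se_factor(n1, n2):
    """sqrt(1/n1 + 1/n2): the part of the standard error set by the group sizes."""
    return math.sqrt(1 / n1 + 1 / n2)

total = 12
for n1 in range(2, total - 1):
    n2 = total - n1
    print(f"n1={n1:2d}  n2={n2:2d}  factor={se_factor(n1, n2):.3f}")
```

A smaller factor gives a larger t statistic for the same mean difference and pooled variance, which is what increases the power.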

Interpretation

The p value obtained from the t-test is significant (p < 0.05), and therefore, we conclude that the yield of genotype A is significantly different from that of genotype B.

Paired t-test (dependent t-test)

The paired t-test is used to compare the difference between a pair of dependent variables measured on the same subject
For example, we have plant variety A and would like to compare the yield of A before and after the application of some fertilizer
Note: The paired t-test is a one sample t-test on the differences between the two dependent variables

Paired t-test Hypotheses

Null hypothesis: There is no difference between the two dependent variables (mean difference = 0)
Alternative hypothesis: There is a difference between the two dependent variables (two-tailed or two-sided)
Alternative hypothesis: The mean difference between the two dependent variables is either greater than or less than zero (one-tailed or one-sided)

Paired t-test Assumptions

Differences between the two dependent variables follow an approximately normal distribution (can be checked with the Shapiro-Wilk test)
Each subject should have a pair of measurements of the dependent variable
Differences between the two dependent variables should not have outliers
Pairs of observations are sampled independently of each other

Formula

\( \it{t} = \frac{ \bar{d} }{ s_d / \sqrt{n} } \)

Under the null hypothesis, it approximately follows a \( \it{t}\)-distribution with \(n-1 \) degrees of freedom

Where, \(\bar{d} \) = mean of sample differences; \(\it{s_d} \)= standard deviation of sample differences (estimate of population standard deviation \(\sigma_d \)), and \(\it{n} \) is the sample size
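
Because the paired t-test is a one sample t-test on the differences, the formula can be sketched with the standard library as follows. The helper `paired_t` is hypothetical. The values are only the five rows printed by `df.head()` in the example that follows, so the result will not match the summary for the full dataset.

```python
import math
import statistics

def paired_t(after, before):
    """Return (t, degrees of freedom) for a paired t-test on after - before."""
    if len(after) != len(before):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)   # mean of the differences
    s_d = statistics.stdev(diffs)    # standard deviation of the differences
    t = d_bar / (s_d / math.sqrt(n))
    return t, n - 1

# first five rows of the BF (before) and AF (after) columns
before = [44.41, 46.29, 45.98, 43.35, 45.75]
after = [47.99, 56.64, 48.90, 49.01, 48.41]
t, df = paired_t(after, before)
print(f"t = {t:.4f}, df = {df}")
```

Swapping the two arguments only flips the sign of t, which is why the order of the columns matters for one-tailed tests.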

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has, they will be omitted
>>> df = get_data('t_pair').data
>>> df.head()
      BF     AF
0  44.41  47.99
1  46.29  56.64
2  45.98  48.90
3  43.35  49.01
4  45.75  48.41

>>> res = stat()
>>> res.ttest(df=df, res=['AF', 'BF'], test_type=3)
# output
>>> print(res.summary)

Paired t-test

------------------  ------------
Sample size         65
Difference Mean      5.55262
t                   14.2173
Df                  64
P-value (one-tail)   8.87966e-22
P-value (two-tail)   1.77593e-21
Lower 95.0%          4.7724
Upper 95.0%          6.33283
------------------  ------------

Interpretation

The p value obtained from the t-test is significant (p < 0.05), and therefore, we conclude that the yield of plant variety A increased significantly after the application of fertilizer.

Last updated: November 15, 2020
