MedicalStat

Medical Statistics Formulas and Equations


Note: In the formulas the symbol \( \ln(x) \) is the natural logarithm, i.e. the inverse function of \( \text{e}^x = \exp(x) \), meaning that \( \ln(\text{e}^x) = x \) and that \( \text{e}^{\ln(x)} = x \). In some textbooks \( \ln(x) \) is written as \( \log(x) \). The expression \( \log(x) \) normally denotes, however, the base-10 logarithm, not the natural logarithm.

The symbol \( |x| \) is the absolute (numerical) value of a number, which is always non-negative. For example \( |3-5| = |-2| = 2 \) and \( |5-3| = |2| = 2 \).



Risk and Odds

Calculate proportions, their confidence intervals plus test equality between proportions HERE.
Calculate risk, odds, risk ratio, odds ratio and confidence intervals plus perform test HERE.

Proportions:


Events table for a group:

  Events (observations): e
  Group total: n

Proportion:

$$ \widehat{p} = \frac{e}{n} $$

or (as a percentage):

$$ \widehat{p} = \frac{e}{n} \cdot 100 $$

Standard error of proportion:

$$ SE(\widehat{p})=\sqrt{\frac{\widehat{p} \cdot (1-\widehat{p})}{n}} $$

Approximated 95 % CI of proportion:

$$ 95\% \; CI=[\widehat{p} - 1.96 \cdot SE(\widehat{p}) \; : \; \widehat{p} + 1.96 \cdot SE(\widehat{p})] $$

Exact 95 % CI of proportion:

$$ 95\% \; CI=\left[\frac{e}{e + (n - e + 1) \cdot f_1} \; : \; \frac{e + 1}{e + 1 + \frac{n - e}{f_2}}\right] $$

where f1 is the 0.975 quantile of the f-distribution with the degrees of freedom DF1 = 2(n - e) + 2 and DF2 = 2e. The number f2 is the 0.975 quantile of the f-distribution with degrees of freedom DF1 = 2e + 2 and DF2 = 2(n - e).
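
As an illustration, here is a minimal Python sketch (not part of the site itself; the counts e and n are made-up examples) of the proportion, its standard error, and both the approximated and the exact 95 % CI, using scipy for the F-distribution quantiles:

```python
import math
from scipy import stats

e, n = 12, 80                                # events and group size (example numbers)
p_hat = e / n                                # proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)      # standard error of the proportion

# Approximated 95 % CI
approx_ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Exact 95 % CI via the F-distribution, as in the formula above
f1 = stats.f.ppf(0.975, 2 * (n - e) + 2, 2 * e)
f2 = stats.f.ppf(0.975, 2 * e + 2, 2 * (n - e))
lower = e / (e + (n - e + 1) * f1)
upper = (e + 1) / (e + 1 + (n - e) / f2)

print(p_hat, se, approx_ci, (lower, upper))
```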

Odds: to find the odds from a given risk (or vice versa), the following formulas can be used to convert between risk and odds:

$$ risk=\frac{odds}{1+odds} \; \iff \; odds=\frac{risk}{1-risk} \;\;\; $$

and as a consequence (dealing with proportions):

$$ \widehat{p}=\frac{odds}{1+odds} \; \iff \; odds=\frac{\widehat{p}}{1-\widehat{p}} $$

Standard Error of the logarithmic transformed odds:

$$ SE\left({\ln(odds)}\right)=\sqrt{\frac{\;\;\; 1 \;\;\;}{e} + \frac{ 1 }{\; n-e \;}} $$

Approximated 95 % CI of ln(odds):

$$ 95 \; \% \; CI=[\ln\left({odds}\right) - 1.96 \cdot SE\left({\ln(odds)}\right)\;\; : \;\; \ln\left({odds}\right) + 1.96 \cdot SE\left({\ln(odds)}\right)] $$

Approximated 95 % CI of odds:

$$ 95\; \% \;CI = [\text{e}^{\ln\left({odds}\right) - 1.96 \cdot SE\left({\ln(odds)}\right)}\;\; : \;\; \text{e}^{\ln\left({odds}\right) + 1.96 \cdot SE\left({\ln(odds)}\right)}]$$ $$ =\; [\exp\left({\ln\left({odds}\right) - 1.96 \cdot SE\left({\ln(odds)}\right)}\right) \;\; : \;\; \exp\left({ \ln\left({odds}\right) + 1.96 \cdot SE\left({\ln(odds)}\right) }\right)] $$

or (alternatively):

$$ 95 \; \% \; CI = \left[\frac{odds}{\text{e}^{1.96 \cdot SE\left({\ln(odds)}\right)}} \;\; : \;\; odds \cdot \text{e}^{1.96 \cdot SE\left({\ln(odds)}\right)} \right] $$

Exact 95 % CI of odds:

To calculate the exact 95 % confidence interval of a given odds, one could calculate an exact 95% CI of the proportion using the f-distribution (see above) and then convert both the lower and upper bound in that interval to odds via the formula:

$$ odds = \frac{proportion}{1 - proportion}$$
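
A small sketch of the odds calculations above, with the same assumed counts e and n; the approximated confidence interval is built on the log scale and converted back by exponentiation:

```python
import math

e, n = 12, 80
p_hat = e / n
odds = p_hat / (1 - p_hat)                   # same as e / (n - e)
se_ln_odds = math.sqrt(1 / e + 1 / (n - e))  # SE of ln(odds)

# Approximated 95 % CI of the odds
lower = odds / math.exp(1.96 * se_ln_odds)
upper = odds * math.exp(1.96 * se_ln_odds)
print(odds, (lower, upper))
```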

Risk:


Standard 2 × 2 table (rows: exposure, columns: outcome):

                    Outcome: Yes    Outcome: No    Total
  Exposed           d1              h1             n1
  Not exposed       d2              h2             n2
  Total             d               h              n


The risk of an event is here a conditional probability: the probability that the event happens given that another event (the exposure status) is known. In this case it is either the risk of getting the outcome (D) given that one has been exposed (E), or the risk of getting the outcome given that one has not been exposed (not E).

$$ Risk_1 = P(D \: | \: E) = \frac{d_1}{n_1} $$ $$ Risk_2 = P(D \: | \: not \: E) = \frac{d_2}{n_2} $$

Standard Error of risk:

$$ SE(risk_1)=\sqrt{\frac{risk_1 \cdot (1-risk_1)}{n_1}} , SE(risk_2)=\sqrt{\frac{risk_2 \cdot (1-risk_2)}{n_2}} $$

95 % Confidence Interval of risk:

$$ 95\; \% \;CI = \left[ lower \: bound \: : \: upper \: bound \right] $$ $$ = \left[ risk - 1.96 \cdot SE(risk) \; : \;risk + 1.96 \cdot SE(risk) \right] $$

If you need to find the Standard Error knowing only the 95 % Confidence Interval:

$$ SE(risk) = \frac{upper \: bound - lower \: bound}{2 \cdot 1.96} = \frac{upper \: bound - lower \: bound}{3.92} $$

Risk Difference


$$ RD = risk_1 - risk_2 $$

Standard error of Risk Difference:

$$ SE(RD)=\sqrt{SE(risk_1)^2+SE(risk_2)^2} $$

95 % Confidence Interval of RD:

$$ 95\; \% \; CI = \left[ RD - 1.96 \cdot SE(RD) : RD + 1.96 \cdot SE(RD) \right] $$

Calculation of the z-value of RD in the general case, where the null hypothesis is denoted H0:

$$ z = \frac{| RD- H0 |}{SE(RD)} $$

In the standard case where H0 = 0:

$$ z = \frac{| RD- 0 |}{SE(RD)} = \frac{| RD |}{SE(RD)} $$

p-value of RD corresponding to a given z-value (doubled, because the test is two-sided):

$$ p=2 \cdot \int_z^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2} \cdot x^2} dx $$

i.e. two times the area under the bell-shaped normal distribution curve from z to infinity
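
A minimal sketch of the risk difference calculations, assuming made-up 2 × 2 counts and using scipy's normal distribution for the p-value:

```python
import math
from scipy import stats

d1, n1 = 30, 120                           # exposed: events, total
d2, n2 = 18, 140                           # not exposed: events, total

risk1, risk2 = d1 / n1, d2 / n2
se1 = math.sqrt(risk1 * (1 - risk1) / n1)
se2 = math.sqrt(risk2 * (1 - risk2) / n2)

rd = risk1 - risk2
se_rd = math.sqrt(se1**2 + se2**2)
ci = (rd - 1.96 * se_rd, rd + 1.96 * se_rd)

z = abs(rd) / se_rd                        # H0: RD = 0
p = 2 * stats.norm.sf(z)                   # two-sided p-value
print(rd, ci, z, p)
```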


Risk Ratio (or: Relative Risk or: Prevalence Rate Ratio)


$$ RR=\frac{Risk_1}{Risk_2} = \frac{P(D \: | \: E)}{P(D \: | \: not \: E)} = \frac{\frac{d_1}{n_1}}{\frac{d_2}{n_2}}=\frac{d_1 \cdot n_2}{n_1 \cdot d_2} $$

If RR is greater than 1, exposure is associated with a higher risk of the outcome than non-exposure (the risk of getting the outcome in the exposed group is greater than the risk in the non-exposed group). If RR is between 0 and 1, exposure is associated with a lower risk of the outcome than non-exposure.

Standard Error of the logarithmic transformed RR, ln(RR):

$$ SE(\ln\left({RR}\right)) = \sqrt{\frac{1}{d_1} - \frac{1}{n_1} + \frac{1}{d_2} - \frac{1}{n_2}} $$

95 % Confidence Interval of ln(RR):

$$ 95 \; \% \; CI = \left[\ln\left({RR}\right) - 1.96 \cdot SE(\ln\left({RR}\right)) : \ln\left({RR}\right) + 1.96 \cdot SE(\ln\left({RR}\right))\right] $$

95 % Confidence Interval of RR:

$$ 95 \; \% \; CI = \left[ e^{\ln\left({RR}\right) - 1.96 \cdot SE(\ln\left({RR}\right))} : e^{\ln\left({RR}\right) + 1.96 \cdot SE(\ln\left({RR}\right))} \right] $$

If this interval contains 1 (i.e. the lower bound is below 1 and the upper bound is above 1) then we cannot reject the null hypothesis H0: RR = 1 (on a 5 % significance level). In other words, the two risks would not be significantly different from each other. In this case the p-value from the test will be above 0.05. The p-value will be below 0.05 if the interval does not contain 1. It is therefore not strictly necessary to perform the z-test; a look at the confidence interval will suffice.

or, (alternatively):

$$ 95 \; \% \; CI = \left[\frac{RR}{e^{1.96 \cdot SE(\ln\left({RR}\right))}} : RR \cdot e^{1.96 \cdot SE(\ln\left({RR}\right))} \right] $$

If you need to find the SE(ln(RR)) knowing only the 95 % Confidence Interval :

$$ SE(\ln(RR)) = \frac{\ln(upper \: bound) - \ln(lower \: bound)}{2 \cdot 1.96} $$ $$ = \frac{\ln(upper \: bound) - \ln(lower \: bound)}{3.92} $$

Calculation of z-value in the general case, where the null hypothesis is H0:

$$ z = \frac{| \ln\left({RR}\right) - \ln\left({H0}\right)|}{SE(\ln\left({RR}\right))} $$

In the standard case where H0 = 1, then \( \ln\left({H0}\right) = \ln\left({1}\right) = 0 \) and therefore the formula is:

$$ z=\frac{ | \ln\left({RR}\right) - \ln\left({1}\right)| }{SE(\ln\left({RR}\right))} =\frac{ | \ln\left({RR}\right) - 0 | }{SE(\ln\left({RR}\right))} = \frac{ | \ln\left({RR}\right) | }{SE(\ln\left({RR}\right))} $$

The p-value corresponding to the z-value is:

$$ p=2 \cdot \int_z^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2} \cdot x^2} dx $$
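
The risk ratio calculations can be sketched in the same way (the 2 × 2 counts below are again assumed examples):

```python
import math
from scipy import stats

d1, n1 = 30, 120                           # exposed: events, total
d2, n2 = 18, 140                           # not exposed: events, total

rr = (d1 / n1) / (d2 / n2)
se_ln_rr = math.sqrt(1/d1 - 1/n1 + 1/d2 - 1/n2)

ci = (rr / math.exp(1.96 * se_ln_rr), rr * math.exp(1.96 * se_ln_rr))

z = abs(math.log(rr)) / se_ln_rr           # H0: RR = 1
p = 2 * stats.norm.sf(z)
print(rr, ci, z, p)
```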

Odds:


The odds of disease (D) given that exposure (E) has occurred:

$$ Odds_1 = \frac{P(D \: | \: E)}{1 - P(D \: | \: E)} = \frac{P(D \: | \: E)}{P(not \: D \: | \: E)} = \frac{\: \frac{d_1}{n_1} \:}{\frac{h_1}{n_1}} = \frac{d_1 \cdot n_1}{h_1 \cdot n_1} = \frac{d_1}{h_1} $$

The odds of disease (D) given that exposure (E) has NOT occurred:

$$ Odds_2 = \frac{P(D \: | not \: E)}{1 - P(D \: | not \: E)} = \frac{P(D \: | not \: E)}{P(not \: D \: | not \: E)} = \frac{\: \frac{d_2}{n_2} \:}{\frac{h_2}{n_2}} = \frac{d_2 \cdot n_2}{h_2 \cdot n_2} = \frac{d_2}{h_2} $$

Standard error of logarithmic transformed odds:

$$ SE(\ln(odds_1)) = \sqrt{\frac{1}{d_1}+\frac{1}{h_1}} \;\;\; \text{and} \;\;\; SE(\ln(odds_2))=\sqrt{\frac{1}{d_2}+\frac{1}{h_2}} $$

95 % Confidence Interval of logarithmic transformed odds:

$$ 95\; \% \;CI = \left[ \ln(odds) - 1.96 \cdot SE(\ln(odds)) \;\; : \;\; \ln(odds) + 1.96 \cdot SE(\ln(odds))\right] $$

95 % Confidence Interval of odds:

$$ 95 \; \% \; CI = \left[ \text{e}^{\ln(odds) - 1.96 \cdot SE(\ln(odds))} : \text{e}^{\ln(odds) + 1.96 \cdot SE(\ln(odds))}\right] $$

or, alternatively:

$$ 95\; \% \;CI = \left[\frac{odds}{\text{e}^{1.96 \cdot SE(\ln(odds))}} \;\; : \;\; odds \cdot \text{e}^{1.96 \cdot SE(\ln(odds))}\right] $$

Odds Ratio


$$ OR = \frac{Odds_1}{Odds_2} = \frac{\frac{d_1}{h_1}}{\frac{d_2}{h_2}}=\frac{d_1 \cdot h_2}{h_1 \cdot d_2} $$

Standard Error of the logarithmic transformed OR, ln(OR):

$$ SE(\ln\left({OR}\right)) = \sqrt{\frac{1}{d_1}+\frac{1}{h_1}+\frac{1}{d_2}+\frac{1}{h_2}} $$

95 % Confidence Interval of ln(OR):

$$ 95 \; \% \; CI = \left[ \ln\left({OR}\right) - 1.96 \cdot SE(\ln\left({OR}\right)) : \ln\left({OR}\right) + 1.96 \cdot SE(\ln\left({OR}\right)) \right] $$

95 % Confidence Interval of OR:

$$ 95 \; \% \; CI = \left[ e^{\ln\left({OR}\right) - 1.96 \cdot SE(\ln\left({OR}\right))} : e^{\ln\left({OR}\right) + 1.96 \cdot SE(\ln\left({OR}\right))}\right] $$

or, (alternatively):

$$ 95 \; \% \; CI = \left[\frac{OR}{e^{1.96 \cdot SE(\ln\left({OR}\right))}} : OR \cdot e^{1.96 \cdot SE(\ln\left({OR}\right))}\right] $$

Calculation of z-value in the general case, where the null hypothesis is H0:

$$ z = \frac{| \ln\left({OR}\right) - \ln\left({H0}\right)|}{SE(\ln\left({OR}\right))} $$

In the standard case where H0 = 1, then \( \ln\left({H0}\right) = \ln\left({1}\right) = 0 \) and therefore the formula is:

$$ z=\frac{ | \ln\left({OR}\right) - \ln\left({1}\right)| }{SE(\ln\left({OR}\right))} =\frac{ | \ln\left({OR}\right) - 0 | }{SE(\ln\left({OR}\right))} = \frac{ | \ln\left({OR}\right) | }{SE(\ln\left({OR}\right))} $$

The p-value corresponding to the z-value is:

$$ p=2 \cdot \int_z^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2} \cdot x^2} dx $$
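
A corresponding sketch for the odds ratio, with assumed counts d1, h1, d2, h2 from a 2 × 2 table:

```python
import math
from scipy import stats

d1, h1 = 30, 90                            # exposed: outcome yes / no
d2, h2 = 18, 122                           # not exposed: outcome yes / no

or_ = (d1 * h2) / (h1 * d2)
se_ln_or = math.sqrt(1/d1 + 1/h1 + 1/d2 + 1/h2)

ci = (or_ / math.exp(1.96 * se_ln_or), or_ * math.exp(1.96 * se_ln_or))

z = abs(math.log(or_)) / se_ln_or          # H0: OR = 1
p = 2 * stats.norm.sf(z)
print(or_, ci, z, p)
```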

Test whether two ORs could be equal, in other words H0: OR1 = OR2:

$$ z=\frac{\ln\left({OR_1} \right) - \ln\left({OR_2} \right)}{\sqrt{SE\left(\ln\left({OR_1}\right)\right)^2 + SE\left(\ln\left({OR_2}\right)\right)^2}} $$

Or, alternatively:

$$ z = \frac{\ln\left({\frac{OR_1}{OR_2}}\right)}{\sqrt{SE\left(\ln\left({OR_1}\right)\right)^2 + SE\left(\ln\left({OR_2}\right)\right)^2}} $$

The corresponding p-value is:

$$ p=2 \cdot \int_{z}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} \cdot x^2} dx $$

95 % Confidence Interval of the difference "ln(OR1) - ln(OR2)":
First, the difference ln(OR1) - ln(OR2) is found:

$$ \text{Diff}_{\ln\left({OR}\right)}=\ln\left({OR_1}\right) - \ln\left({OR_2}\right) $$

Standard Error of this difference:

$$ SE\left(\text{Diff}_{\ln\left({OR}\right)}\right) = \sqrt{SE\left(\ln\left({OR_1}\right)\right)^2 + SE\left(\ln\left({OR_2}\right)\right)^2} $$

95 % CI of Diffln(OR): (The number 0 will be in this interval if H0 is true)

$$ 95 \; \% \; CI = \left[\text{Diff}_{\ln\left({OR}\right)} - 1.96 \cdot SE\left(\text{Diff}_{\ln\left({OR}\right)}\right) \; \; : \;\; \text{Diff}_{\ln\left({OR}\right)} + 1.96 \cdot SE\left(\text{Diff}_{\ln\left({OR}\right)}\right) \right] $$

Exponentiated: (This will be the 95 % CI of the Ratio \( \frac{OR_1}{OR_2} \) ). The number 1 will be in this interval if H0 is true.

$$ 95 \; \% \; CI = \left[ \text{e}^{\text{Diff}_{\ln\left({OR}\right)} - 1.96 \cdot SE\left(\text{Diff}_{\ln\left({OR}\right)}\right) } \;\; : \;\; \text{e}^{\text{Diff}_{\ln\left({OR}\right)} + 1.96 \cdot SE\left(\text{Diff}_{\ln\left({OR}\right)}\right) } \right] $$ $$ = \left[ \exp\left({\text{Diff}_{\ln\left({OR}\right)} - 1.96 \cdot SE\left(\text{Diff}_{\ln\left({OR}\right)}\right) }\right) \;\; : \;\; \exp\left({\text{Diff}_{\ln\left({OR}\right)} + 1.96 \cdot SE\left(\text{Diff}_{\ln\left({OR}\right)}\right) }\right) \right] $$

Sensitivity & Specificity

Calculate sensitivity, specificity and all other concepts described below HERE.

Test result vs. disease status:

                      Sick                      Not Sick                  Total
  Tested Positive     True Positives (TP)       False Positives (FP)      TP + FP
  Tested Negative     False Negatives (FN)      True Negatives (TN)       FN + TN
  Total               TP + FN                   FP + TN                   N

Disease Prevalence (the proportion of the sample that has the disease):

$$ DP = \frac{TP + FN}{N} $$

Sensitivity. This is the probability that you test positive given the fact that you have the disease. It is the same as the power (1 - β) of the test:

$$ Sensitivity = \frac{TP}{TP + FN} $$

Specificity. The probability that you test negative given the fact that you don't have the disease:

$$ Specificity = \frac{TN}{FP + TN} $$

Positive Predictive Value. The probability that you have the disease given the fact that you tested positive:

$$ PPV = \frac{TP}{TP + FP} $$

Negative Predictive Value. The probability that you don't have the disease given the fact that you tested negative:

$$ NPV = \frac{TN}{FN + TN} $$

Accuracy. The probability that the test gives the correct answer (be it either positive or negative):

$$ Accuracy = \frac{TP + TN}{N} $$

False Positive Rate. The probability that you test positive given the fact that you don't have the disease. This is the same as the risk of type 1 error (the significance level, α):

$$ FPR = \frac{FP}{FP + TN} $$

False Negative Rate. The "miss rate". The probability that you test negative given the fact that you have the disease. This is the same as the risk of type 2 error (β):

$$ FNR = \frac{FN}{TP + FN} $$

False Discovery Rate. The probability that you don't have the disease given the fact that you tested positive:

$$ FDR = \frac{FP}{TP + FP} $$

False Omission Rate. The probability that you have the disease given the fact that you tested negative:

$$ FOR = \frac{FN}{FN + TN} $$

All of the above concepts are risks/proportions and therefore they have corresponding standard errors. These can be found using the formulas under section "Standard error of proportion" under "Proportions" above. After having found the SE the 95 % Confidence Intervals can be calculated as described under "Proportions".

Positive Likelihood Ratio. The probability that a person with the disease tests positive divided by the probability that a person without the disease tests positive. If the ratio is greater than 1 the test is more likely to give a true positive than a false positive. If the ratio is between 0 and 1 the test is more likely to give a false positive than a true positive:

$$ LR+ = \frac{TPR}{FPR} = \frac{sensitivity}{1 - specificity}$$

Negative Likelihood Ratio. The probability that a person with the disease tests negative divided by the probability that a person without the disease tests negative. If the ratio is greater than 1 the test is more likely to give a false negative than a true negative. If the ratio is between 0 and 1 the test is more likely to give a true negative than a false negative:

$$ LR- = \frac{FNR}{TNR} = \frac{1 - sensitivity}{specificity} $$

Diagnostic Odds Ratio. The ratio of the odds of testing positive among those who have the disease to the odds of testing positive among those who do not have the disease:

$$ DOR = \frac{LR+}{LR-} $$

F1 Score. This is used as a single measure of performance of the test for the positive class. It is a measure of the test's accuracy:

$$ F_1 = 2 \cdot \frac{PPV \cdot sensitivity}{PPV + sensitivity} $$

Power. The power of the test is the probability that you test positive given the fact that you have the disease. It is the same as the sensitivity:

$$ Power = 1 - \beta = sensitivity $$
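
The diagnostic-test quantities above can be computed directly from the four cells of the table; the counts in this sketch are made-up examples:

```python
tp, fp, fn, tn = 90, 25, 10, 375
n = tp + fp + fn + tn

prevalence  = (tp + fn) / n
sensitivity = tp / (tp + fn)
specificity = tn / (fp + tn)
ppv         = tp / (tp + fp)
npv         = tn / (fn + tn)
accuracy    = (tp + tn) / n
fpr         = fp / (fp + tn)               # = 1 - specificity
fnr         = fn / (tp + fn)               # = 1 - sensitivity
fdr         = fp / (tp + fp)               # = 1 - PPV
for_        = fn / (fn + tn)               # = 1 - NPV
lr_pos      = sensitivity / (1 - specificity)
lr_neg      = (1 - sensitivity) / specificity
dor         = lr_pos / lr_neg
f1          = 2 * ppv * sensitivity / (ppv + sensitivity)

print(sensitivity, specificity, ppv, npv, lr_pos, lr_neg, dor, f1)
```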

Means

Compare two or more already known mean values with a z-test and a t-test HERE.
Calculate mean, standard deviation and confidence intervals from input data sets and perform z-test and a t-test HERE.

Mean value \( \bar{x} \) of sample or data set consisting of n numbers:

$$ \bar{x} = \frac{x_1 + x_2 + x_3 + ... + x_n}{n} $$

Variance of a sample (essentially the average of the squared distances to the mean value, with n - 1 in the denominator):

$$ Var(x)=\sum_{i=1}^n {\frac{\left( x_i - \bar{x} \right)^2}{n-1}} $$

or (as an alternative):

$$ Var(x)=\frac{\sum_{i=1}^n{\left(x_i^2\right)} - n \cdot \bar{x}^2}{n-1} $$

This is slightly different from the variance of a population, which is:

$$ Var(x)=\sum_{i=1}^n {\frac{\left( x_i - \bar{x} \right)^2}{n}} = \frac{\sum_{i=1}^n{\left(x_i^2\right)} - n \cdot \bar{x}^2}{n}$$

Standard Deviation, SD, of a sample (roughly speaking, the typical distance of the observations from the mean value):

$$ SD = \sqrt{Var\left( x \right) } $$

The number \( Var\left( x \right) \) is (in this case) the variance of the sample, see above. The SD of a population is found with the same formula, but using the variance of the population instead.

95 % Confidence Interval of a Standard Deviation:

$$ 95 \; \% \; CI = \left[ SD \cdot \sqrt{\frac{DF}{(\chi^2)_1}} \; : \; SD \cdot \sqrt{\frac{DF}{(\chi^2)_2}} \right] $$

where DF = n - 1 is the degrees of freedom and the numbers \( \chi^2_1 \) and \( \chi^2_2 \) are the 97.5 and 2.5 percent quantiles of the chi-square distribution. In other words:

$$ 0.975 = \int_{0}^{\chi^2_1} {\left( \frac{x^{\frac{1}{2} \cdot df - 1} \cdot \text{e}^{-\frac{1}{2} \cdot x}}{2^{\frac{1}{2} \cdot df} \cdot \Gamma\left(\frac{df}{2}\right)} \right) dx} \;\; \text{and} \;\; 0.025 = \int_{0}^{\chi^2_2} {\left( \frac{x^{\frac{1}{2} \cdot df - 1} \cdot \text{e}^{-\frac{1}{2} \cdot x}}{2^{\frac{1}{2} \cdot df} \cdot \Gamma\left(\frac{df}{2}\right)} \right) dx}$$

For an explanation of the Gamma function Γ please see the section "F-test".

To test whether a given SD could be equal to a certain number SD0 we calculate the chi square value:

$$ \chi^2 = \frac{SD^2}{SD_0^2} \cdot DF $$

The corresponding p-value is two times the smaller of the two numbers p1 and p2

$$ p_1 = \int_{\chi^2}^\infty {\left( \frac{x^{\frac{1}{2} \cdot df - 1} \cdot \text{e}^{-\frac{1}{2} \cdot x}}{2^{\frac{1}{2} \cdot df} \cdot \Gamma\left(\frac{df}{2}\right)} \right) dx} \;\; \text{and} \;\; p_2 = 1 - p_1 $$

If the p-value is smaller than 0.05 we can reject the null hypothesis H0: SD = SD0.
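
A sketch of the SD confidence interval and the test of H0: SD = SD0, using scipy's chi-square quantiles (the SD, sample size and SD0 are assumed example values):

```python
import math
from scipy import stats

sd, n = 2.4, 25
df = n - 1

chi2_1 = stats.chi2.ppf(0.975, df)         # 97.5 % quantile
chi2_2 = stats.chi2.ppf(0.025, df)         # 2.5 % quantile
ci = (sd * math.sqrt(df / chi2_1), sd * math.sqrt(df / chi2_2))

sd0 = 2.0                                  # H0: SD = SD0
chi2_stat = (sd**2 / sd0**2) * df
p1 = stats.chi2.sf(chi2_stat, df)          # area from chi-square value to infinity
p = 2 * min(p1, 1 - p1)
print(ci, chi2_stat, p)
```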

Standard error of a mean value from a sample:

$$ SE=\frac{SD}{\sqrt{n}} $$

95 % Confidence Interval of Mean:

$$ 95 \; \% \; CI=\left[ \bar{x} - 1.96 \cdot SE : \bar{x} + 1.96 \cdot SE \right] $$

The "mean difference" between two samples:

$$ MD = Mean_1 - Mean_2$$

This number will be negative if Mean1 < Mean2. In that case you could swap the order: \(MD = Mean_2 - Mean_1\).

Standard Error of Mean Difference:

$$ SE_{MD}=\sqrt{\left( SE_1 \right)^2 + \left(SE_2\right)^2 } $$

95 % CI of Mean Difference:
(the number 0 will be contained in this interval if H0 is true in the case where H0 says that MD = 0)

$$ 95 \; \% \; CI = \left[ MD - 1.96 \cdot SE_{MD} : MD + 1.96 \cdot SE_{MD} \right] $$

Z-test:


z-value, testing if two means are equal:

$$ z=\frac{|MD|}{SE_{MD}}=\frac{|Mean_1 - Mean_2|}{\sqrt{\left(SE_1\right)^2 + \left(SE_2 \right)^2}} $$

This is equivalent to testing the null hypothesis, that MD is equal to 0, i.e. H0: MD = 0

p-value corresponding to this z-value:

$$ p = 2 \cdot \int_z^\infty {\frac{1}{\sqrt{2\pi}} \text{e}^{-\frac{1}{2}x^2} dx} $$

If the null hypothesis is NOT H0: MD = 0 but instead H0: MD > 0 or H0: MD < 0 the z-value is found in the same way, only without the numerical signs:

$$ z = \frac{MD - 0}{SE_{MD}} = \frac{Mean_1 - Mean_2 - 0}{\sqrt{\left(SE_1\right)^2 + \left(SE_2 \right)^2}} = \frac{Mean_1 - Mean_2}{\sqrt{\left(SE_1\right)^2 + \left(SE_2 \right)^2}}$$

This z-value can then be either negative or positive.

If H0 says that MD > 0 then the p-value is calculated as the integral:

$$ p = \int_{-\infty}^{z} {\frac{1}{\sqrt{2\pi}} \text{e}^{-\frac{1}{2}x^2} dx} $$

In other words the area under the normal distribution curve from minus infinity to z.

If H0 says that MD < 0 then the p-value is calculated as the integral:

$$ p = \int_z^\infty {\frac{1}{\sqrt{2\pi}} \text{e}^{-\frac{1}{2}x^2} dx} $$

In other words the area under the curve from z to infinity.

In the general case, if the null hypothesis says that MD = z0 for some number z0 the z-value is found as:

$$ z = \frac{|MD - z_0|}{SE_{MD}} = \frac{|Mean_1 - Mean_2- z_0|}{\sqrt{\left(SE_1\right)^2 + \left(SE_2 \right)^2}} $$

And the corresponding p-value is:

$$ p = 2 \cdot \int_z^\infty {\frac{1}{\sqrt{2\pi}} \text{e}^{-\frac{1}{2}x^2} dx} $$

If the null hypothesis says that MD > z0 the z-value is:

$$ z = \frac{MD - z_0}{SE_{MD}} = \frac{Mean_1 - Mean_2- z_0}{\sqrt{\left(SE_1\right)^2 + \left(SE_2 \right)^2}} $$

And the p-value will be:

$$ p = \int_{-\infty}^{z} {\frac{1}{\sqrt{2\pi}} \text{e}^{-\frac{1}{2}x^2} dx} $$

If the null hypothesis says that MD < z0 the z-value is:

$$ z = \frac{MD - z_0}{SE_{MD}} = \frac{Mean_1 - Mean_2- z_0}{\sqrt{\left(SE_1\right)^2 + \left(SE_2 \right)^2}} $$

And the p-value is found as:

$$ p = \int_z^\infty {\frac{1}{\sqrt{2\pi}} \text{e}^{-\frac{1}{2}x^2} dx} $$
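
A minimal sketch of the z-test above, assuming the two means and their standard errors are already known (the numbers are examples); both the two-sided and the one-sided p-values are shown:

```python
import math
from scipy import stats

mean1, se1 = 5.1, 0.32
mean2, se2 = 4.4, 0.28

md = mean1 - mean2
se_md = math.sqrt(se1**2 + se2**2)

# Two-sided test of H0: MD = 0
z = abs(md) / se_md
p_two_sided = 2 * stats.norm.sf(z)

# One-sided versions use the signed z-value
z_signed = md / se_md
p_h0_md_greater = stats.norm.cdf(z_signed)   # area from minus infinity to z
p_h0_md_less    = stats.norm.sf(z_signed)    # area from z to infinity
print(z, p_two_sided, p_h0_md_greater, p_h0_md_less)
```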

F-test:


Using an F-test to see if two variances \( Var_1 = {({SD}_1})^{2} \) and \( Var_2 = {({SD}_2})^{2} \) from two samples could be equal:

First, the degrees of freedom of the two samples are found as \( df_1 = n_1 - 1 \) and \( df_2 = n_2 - 1 \), where \( n_1 \) and \( n_2 \) are the sample sizes.

The first number \( f_1 \) in the f-test is calculated as:

$$ f_1 = \frac{(SD_1)^2}{(SD_2)^2} $$

The second number \( f_2 \) in the f-test is calculated as:

$$ f_2 = \frac{(SD_2)^2}{(SD_1)^2} = \frac{1}{f_1} $$

The p-value \( p_1 \) corresponding to \( f_1 \) is the smaller of the following two numbers:

$$ p=\int_{f_1}^\infty{\left( \frac{df_1^{\frac{df_1}{2}} \cdot df_2 ^\frac{df_2}{2} \cdot \Gamma \left(\frac{df_1 + df_2}{2}\right) \cdot x^{\frac{df_1}{2}-1}}{\Gamma \left(\frac{df_1}{2}\right) \cdot \Gamma \left(\frac{df_2}{2}\right) \cdot \left(df_2 + df_1 \cdot x\right)^{\frac{df_1 + df_2}{2}}} \right) dx} \; \; \; \; \; \; \text{and} \;\;\;\;\;\; 1-p $$

The Gamma number in the expression is in its general form \( \Gamma{\left( \frac{k}{2}\right)} \) calculated as follows:

$$ \Gamma\left(\frac{k}{2}\right) = \begin{cases} \; \sqrt{\pi} & , \text{if $k = 1$} \\[2ex] \; 1 & , \text{if $ k = 2$} \\[2ex] \; (\frac{4}{2} - 1) \cdot (\frac{6}{2} - 1) \cdot (\frac{8}{2} - 1) \cdot \cdots \cdot (\frac{k}{2} - 1) & , \text{if $k$ is even and greater than 2} \\[2ex] \; \sqrt{\pi} \cdot (\frac{3}{2} - 1) \cdot (\frac{5}{2} - 1) \cdot (\frac{7}{2} - 1) \cdot \cdots \cdot (\frac{k}{2} - 1) & , \text{if $k$ is odd and greater than 1} \end{cases} $$

The p-value \( p_2 \) corresponding to the f-value \( f_2 \) is the smaller of the following two numbers:

$$ p = \int_{f_2}^\infty {\left( \frac{df_1^{\frac{df_1}{2}} \cdot df_2 ^\frac{df_2}{2} \cdot \Gamma \left(\frac{df_1 + df_2}{2}\right) \cdot x^{\frac{df_1}{2}-1}}{\Gamma \left(\frac{df_1}{2}\right) \cdot \Gamma \left(\frac{df_2}{2}\right) \cdot \left(df_2 + df_1 \cdot x\right)^{\frac{df_1 + df_2}{2}}} \right) dx} \;\;\;\;\;\; \text{and} \;\;\;\;\;\; 1-p $$

The overall p-value of the F-test is:

$$ p = p_1 + p_2 $$
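
A sketch of the F-test as defined above (the sample SDs and sizes are assumed examples); scipy's F-distribution replaces the integrals:

```python
from scipy import stats

sd1, n1 = 2.4, 25
sd2, n2 = 1.8, 30
df1, df2 = n1 - 1, n2 - 1

f1 = sd1**2 / sd2**2
f2 = 1 / f1

tail1 = stats.f.sf(f1, df1, df2)           # area from f1 to infinity
p1 = min(tail1, 1 - tail1)

tail2 = stats.f.sf(f2, df1, df2)           # area from f2 to infinity
p2 = min(tail2, 1 - tail2)

p = p1 + p2                                # overall p-value as defined on this page
print(f1, f2, p)
```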

T-test:


If the F-test shows that the two populations could have the same variance then the common variance, \( Var_c \) , is calculated:

$$ {Var}_c=\frac{(df_1) \cdot (SD_1)^2 + (df_2) \cdot (SD_2)^2}{{df}_1 + {df}_2} = \frac{(n_1 - 1) \cdot (SD_1)^2 + (n_2 - 1) \cdot (SD_2)^2}{n_1 + n_2 - 2} $$

If desired, one could also calculate the common SD as well:

$$ {SD}_c = \sqrt{{Var_c}} = \sqrt{\frac{(df_1) \cdot (SD_1)^2 + (df_2) \cdot (SD_2)^2}{{df}_1 + {df}_2}} $$

The Mean Difference is:

$$ MD = {Mean}_1 - {Mean}_2 $$

The Standard Error of the Mean Difference (in this case, assuming same variance in both samples) is calculated as one of the 4 expressions:

$$ \begin{align} {SE}_{MD} & = \sqrt{\left(\frac{{(SD_c)}^2}{n_1} + \frac{{(SD_c)}^2}{n_2} \right)} \\ & = \sqrt{{(SD_c)}^2 \cdot \left(\frac{1}{n_1} + \frac{1}{n_2} \right)} \\ & = \sqrt{(SD_c)^2} \cdot \sqrt{\left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \\ & = {SD}_c \cdot \sqrt{\left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \end{align} $$

The 95 % Confidence Interval of MD is:

$$ 95 \; \% \; CI = \left[ MD - t_{(1-\frac{\alpha}{2})} \cdot {SE}_{MD} \; : \; MD + t_{(1-\frac{\alpha}{2})} \cdot {SE}_{MD} \right] $$

The number \( t_{(1-\frac{\alpha}{2})} \) is the quantile in the t-distribution with \( n_1 + n_2 - 2 \) degrees of freedom corresponding to the significance level α. For example, if α = 0.05 then 1 - α/2 = 1 - 0.025 = 0.975, in other words the 97.5 % quantile.

The t-value in the t-test is calculated using this formula:

$$ t = \frac{|MD|}{{SE}_{MD}} $$

and finally the corresponding p-value is calculated as:

$$ p = 2 \cdot \int_{t}^\infty { \left( \frac{\Gamma{\left( \frac{df+1}{2}\right)}}{\sqrt{\pi \cdot df} \cdot \Gamma{\left(\frac{df}{2}\right)} \cdot \left( 1 + \frac{x^2}{df}\right)^{(df+1)/2}} \right) dx} $$

where \( df \) is the total degrees of freedom for both samples: \( df = {df}_1 + {df}_2 = n_1 + n_2 - 2 \).

If (on the other hand) the F-test shows the two variances \( {({SD}_1})^{2} \) and \( {({SD}_2})^{2} \) can NOT be assumed equal, then the Standard Error of MD is found in the following way:

$$ {SE}_{MD} = \sqrt{\frac{(SD_1)^2}{n_1} + \frac{(SD_2)^2}{n_2}} = \sqrt{(SE_1)^2 + (SE_2)^2} $$

The degrees of freedom can, in this case, be found as a sort of weighted average between the two DF's of the groups:

$$ DF = \frac{((SE_1)^2 + (SE_2)^2)^2}{ \frac{(SD_1)^4}{n_1 \cdot n_1 \cdot (n_1 - 1)} + \frac{(SD_2)^4}{n_2 \cdot n_2 \cdot (n_2 - 1)}} = \frac{((SE_1)^2 + (SE_2)^2)^2}{ \frac{(SE_1)^4}{n_1 - 1} + \frac{(SE_2)^4}{n_2 - 1}} $$

And the 95 % CI of the Mean Difference will then be:

$$ 95 \; \% \; CI = \left[ MD - t_{(1-\frac{\alpha}{2})} \cdot {SE}_{MD} \; : \; MD + t_{(1-\frac{\alpha}{2})} \cdot {SE}_{MD} \right] $$

The t-value in this case (assuming different variances in the two samples) is found using the same formula as above (but with the modified version of SEMD):

$$ t = \frac{|MD|}{{SE}_{MD}} $$

and the corresponding p-value is:

$$ p = 2 \cdot \int_{t}^\infty { \left( \frac{\Gamma{\left(\frac{df+1}{2}\right)}}{\sqrt{\pi \cdot df} \cdot \Gamma{\left(\frac{df}{2}\right)} \cdot \left( 1 + \frac{x^2}{df}\right)^{(df+1)/2}}\right) dx} $$

The above procedure dealt with the case where the null hypothesis stated that MD = 0

If you want to test a different null hypothesis in the t-test, for example H0: MD > 0 or H0: MD < 0, then please look under "Z-test" above for an overview of the procedure; it is the same here, only with the formulas for t and p adjusted correspondingly.
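
A sketch of the two-sample t-test, both with the pooled (common) variance and with the modified SE and degrees of freedom when the variances cannot be assumed equal; the summary statistics are assumed examples:

```python
import math
from scipy import stats

mean1, sd1, n1 = 5.1, 2.4, 25
mean2, sd2, n2 = 4.4, 1.8, 30
md = mean1 - mean2

# Equal variances: common variance and pooled SE
df = n1 + n2 - 2
var_c = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
se_md = math.sqrt(var_c * (1/n1 + 1/n2))
t_val = abs(md) / se_md
p = 2 * stats.t.sf(t_val, df)

# Unequal variances: modified SE and degrees of freedom
se1sq, se2sq = sd1**2 / n1, sd2**2 / n2
se_md_u = math.sqrt(se1sq + se2sq)
df_u = (se1sq + se2sq)**2 / (se1sq**2 / (n1 - 1) + se2sq**2 / (n2 - 1))
t_u = abs(md) / se_md_u
p_u = 2 * stats.t.sf(t_u, df_u)

print(t_val, p, t_u, p_u)
```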


Confidence Intervals

Calculate confidence intervals for means, proportions and slopes HERE.

Confidence Interval of a Mean


For a given significance level alpha (α) the (100 - α) % approximated confidence interval of a mean value \( \bar{x} \) is:

$$\left[ \bar{x} - z_{(1-\frac{\alpha}{2})} \cdot SE : \bar{x} + z_{(1-\frac{\alpha}{2})} \cdot SE \right] $$

The number \( z_{(1 - \frac{\alpha}{2})}\) is the \(1 - \frac{\alpha}{2}\) quantile of the standard normal distribution N(0,1).

Standard Error of a mean is \( SE = \frac{SD}{\sqrt{n}}\) where SD is the standard deviation and n is the sample size.

For example: If \(\alpha = 5\%\) then \( z_{(1 - \frac{\alpha}{2})} = z_{(1 - \frac{0.05}{2})} = z_{0.975} = 1.96\) .

Therefore, in this case, the 95 % approximated Confidence Interval of a mean will be:

$$ 95 \; \% \; CI=\left[ \bar{x} - 1.96 \cdot SE : \bar{x} + 1.96 \cdot SE \right] $$

If you want the exact confidence interval (using the t-distribution) instead of the approximated interval above, it is:

$$\left[ \bar{x} - t_{(1-\frac{\alpha}{2})} \cdot SE : \bar{x} + t_{(1-\frac{\alpha}{2})} \cdot SE \right] $$

Here, the number \( t_{(1 - \frac{\alpha}{2})}\) is the \(1 - \frac{\alpha}{2}\) quantile of the t-distribution with DF = N - 1 degrees of freedom, where N is the sample size.

For example: If \(\alpha = 5\%\) and DF = 14 then \( t_{(1 - \frac{\alpha}{2})} = t_{(1 - \frac{0.05}{2})} = t_{0.975} = 2.145\). This 97.5 % quantile can also be found via the section "Tables" in the menu.

In this case, the 95 % exact Confidence Interval of a given mean will be:

$$ 95 \; \% \; CI=\left[ \bar{x} - 2.145 \cdot SE : \bar{x} + 2.145 \cdot SE \right] $$

Confidence Interval of a Proportion


After having chosen the desired significance level α (often 5 %) the (100 - α) % approximated confidence interval of a proportion \( \hat{p} \) is:

$$\left[ \hat{p} - z_{(1-\frac{\alpha}{2})} \cdot SE : \hat{p} + z_{(1-\frac{\alpha}{2})} \cdot SE \right] $$

Standard Error of a proportion \( \hat{p} \) is \( SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\) where n is the sample size.

For an explanation of the number \( z_{(1-\frac{\alpha}{2})} \) please see above under "Confidence Interval of a Mean"

The (100 - α) % exact confidence interval of a proportion \( \hat{p} \) is:

$$\left[ \hat{p} - t_{(1-\frac{\alpha}{2})} \cdot SE : \hat{p} + t_{(1-\frac{\alpha}{2})} \cdot SE \right] $$

Using the t-distribution. For an explanation of the number \( t_{(1-\frac{\alpha}{2})} \) please see above under "Confidence Interval of a Mean"


Confidence Interval of a Slope (β value)


After a linear regression the regression line will be of the form \( Y = \beta X + constant \: (+ error) \). The number β is the slope (regression coefficient) and the constant is the intercept with the Y axis.

For a significance level α the (100 - α) % approximated confidence interval of the slope β is:

$$\left[ \beta - z_{(1-\frac{\alpha}{2})} \cdot SE \: : \: \beta + z_{(1-\frac{\alpha}{2})} \cdot SE \right] $$

Standard Error of a slope β (regression coefficient) is:

$$SE = |\beta| \cdot \sqrt{\frac{1 - R^2}{DF \cdot R^2}} , $$

where DF = N - 2 is the degrees of freedom, N is the number of points (x,y) and R2 is the squared correlation coefficient (coefficient of determination).

For an explanation of the number \( z_{(1-\frac{\alpha}{2})} \) please see above under "Confidence Interval of a Mean"

The (100 - α) % exact confidence interval of the slope β is:

$$\left[ \beta - t_{(1-\frac{\alpha}{2})} \cdot SE \: : \: \beta + t_{(1-\frac{\alpha}{2})} \cdot SE \right] $$

Using the t-distribution. For an explanation of the number \( t_{(1-\frac{\alpha}{2})} \) please see above under "Confidence Interval of a Mean"
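
A small sketch of the slope confidence intervals, assuming the slope β, the squared correlation R² and the number of points N are already known (example values):

```python
import math
from scipy import stats

beta, r2, n = 0.85, 0.64, 20
df = n - 2
se = abs(beta) * math.sqrt((1 - r2) / (df * r2))

z = stats.norm.ppf(0.975)                  # approximated interval (normal quantile)
ci_approx = (beta - z * se, beta + z * se)

t = stats.t.ppf(0.975, df)                 # interval using the t-distribution
ci_t = (beta - t * se, beta + t * se)
print(se, ci_approx, ci_t)
```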


Chi-Square-Test

Perform chi-squared test of independence between two variables HERE.
Perform goodness-of-fit test to see whether an observed data set follows a known distribution HERE.

Observations in sample (i rows for the categories of variable B, j columns for the categories of variable A):

                  Category 1    Category 2    ...    Category j    Total
  Category 1      O11           O12           ...    O1j           Sum_{i=1}
  Category 2      O21           O22           ...    O2j           Sum_{i=2}
  ...             ...           ...           ...    ...           ...
  Category i      Oi1           Oi2           ...    Oij           Sum_{i=i}
  Total           Sum_{j=1}     Sum_{j=2}     ...    Sum_{j=j}     Sum_{total}



Calculation of expected values (assuming H0 is true): each expected value E_{hk} is the corresponding column sum divided by the grand total, multiplied by the corresponding row sum:

$$ \text{E}_{hk}=\frac{\text{Sum}_{j=k}}{\text{Sum}_{total}} \cdot \text{Sum}_{i=h} $$

For example \( \text{E}_{11}=\frac{\text{Sum}_{j=1}}{\text{Sum}_{total}} \cdot \text{Sum}_{i=1} \), \( \text{E}_{12}=\frac{\text{Sum}_{j=2}}{\text{Sum}_{total}} \cdot \text{Sum}_{i=1} \), ... , \( \text{E}_{ij}=\frac{\text{Sum}_{j=j}}{\text{Sum}_{total}} \cdot \text{Sum}_{i=i} \). The row and column totals of the table of expected values are the same as in the table of observed values.

After having found the expected values, the number \(\chi^2 \) (chi square) is calculated as follows:

$$ \chi^2=\sum_{h=1}^i{\left(\sum_{k=1}^j {\frac{\left(O_{hk}-E_{hk}\right)^2}{E_{hk}}} \right)} $$

In other words: the sum of the squared distances between the observed values and the expected values relative to the expected values. Like this:

$$ \chi^2 = \frac{\left(O_{11}-E_{11}\right)^2}{E_{11}} + \frac{\left(O_{12}-E_{12}\right)^2}{E_{12}} + ... + \frac{\left(O_{1j}-E_{1j}\right)^2}{E_{1j}} + ... $$ $$ ... + \frac{\left(O_{21}-E_{21}\right)^2}{E_{21}} + \frac{\left(O_{22}-E_{22}\right)^2}{E_{22}} + ... + \frac{\left(O_{2j}-E_{2j}\right)^2}{E_{2j}} + ... $$ $$ ... + \frac{\left(O_{i1}-E_{i1}\right)^2}{E_{i1}} + \frac{\left(O_{i2}-E_{i2}\right)^2}{E_{i2}} + ... + \frac{\left(O_{ij}-E_{ij}\right)^2}{E_{ij}} $$

The p-value corresponding to the chi-square number is then found as follows, where df = (i - 1) · (j - 1) is the degrees of freedom:

$$ p=\int_{\chi^2}^\infty {\left( \frac{x^{\frac{1}{2} \cdot df - 1} \cdot \text{e}^{-\frac{1}{2} \cdot x}}{2^{\frac{1}{2} \cdot df} \cdot \Gamma\left( \frac{df}{2}\right)} \right) dx} $$
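
A sketch of the chi-square test of independence on a table of observed counts (the table below is made up); scipy.stats.chi2_contingency can be used as a cross-check of the expected values, the chi-square number and the p-value:

```python
import numpy as np
from scipy import stats

observed = np.array([[20, 30, 25],
                     [15, 45, 40]])

row_sums = observed.sum(axis=1, keepdims=True)
col_sums = observed.sum(axis=0, keepdims=True)
total = observed.sum()

expected = row_sums * col_sums / total
chi2 = ((observed - expected)**2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p = stats.chi2.sf(chi2, df)

print(chi2, df, p)
print(stats.chi2_contingency(observed, correction=False))   # cross-check
```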

Weighted Average

Calculate the weighted average (weighted estimate) of two or more means, slopes, risk ratios, odds ratios ... etc. HERE.

The weighted estimate (or weighted average) of two estimates can be calculated as:

$$ WE=\frac{estimate_1 \cdot w_1+estimate_2 \cdot w_2}{w_1 + w_2} $$

where the weights w1 and w2 are calculated as

$$ w_1=\frac{1}{(SE_1)^2} \; \text{ and } \; w_2=\frac{1}{(SE_2)^2} $$

If the WE is of more than two estimates, then the general formula is used:

$$ WE=\frac{\sum_{i=1}^n{\left(estimate_i \cdot w_i\right)}}{\sum_{i=1}^n{\left(w_i\right)}}= \frac{estimate_1 \cdot w_1+estimate_2 \cdot w_2+...+ estimate_n \cdot w_n}{w_1+w_2+... + w_n} $$

The Standard Error of WE:

$$ SE(WE) = \sqrt{\frac{1}{w_1+w_2+... + w_n}} $$

The above formulas are used, if the estimates are of the type Mean Values, Mean Differences, Risk Differences, ... etc.
If, however, the estimates are of the type RR, OR, IRR, ... and similar, then the following formula is used instead:

$$ \begin{align} \ln\left({ WE }\right) & = \frac{\sum_{i=1}^n{\left(\ln\left({estimate_i}\right) \cdot w_i\right)}}{\sum_{i=1}^n{\left(w_i\right)}} \\ & = \frac{\ln\left({estimate_1}\right) \cdot w_1+\ln\left({estimate_2 }\right) \cdot w_2+... + \ln\left({estimate_n }\right) \cdot w_n}{w_1+w_2+... + w_n} \end{align}$$

where the weights are (in this particular case):

$$ w_i=\frac{1}{SE(\ln\left({ estimate_i}\right))^2} $$

After having found the number \( \ln\left({ WE }\right) \) one can find the real value of WE by exponentiating:

$$ WE=\text{e}^{\ln\left({WE}\right)}=\exp\left({\ln\left({WE}\right) }\right) $$

Standard Error of ln(WE):

$$ SE\left( \ln\left({WE}\right) \right) = \sqrt{ \frac{1}{w_1 + w_2 + ... + w_n} } $$

95 % CI of ln(WE):
(The number 0 will be contained in this interval, if the H0 hypothesis ln(WE) = 0 cannot be rejected.)

$$ 95 \; \% \; CI = \left[ \ln\left({WE}\right) - 1.96 \cdot SE\left( \ln\left({WE}\right) \right) \;\; : \;\; \ln\left({WE}\right) + 1.96 \cdot SE\left( \ln\left({WE}\right) \right) \right] $$

95 % CI of WE :
(The number 1 will be contained in this interval if the H0 hypothesis WE = 1 cannot be rejected):

$$ 95 \; \% \; CI =\left[ \exp\left({ \ln\left({WE}\right) - 1.96 \cdot SE\left( \ln\left({WE}\right) \right) }\right) \; \; : \; \; \exp\left({\ln\left({WE}\right) + 1.96 \cdot SE\left( \ln\left({WE}\right)\right) }\right) \right] $$

Test to see whether WE could be 1, or equivalently whether ln(WE) = 0 (meaning that WE is not significant):

$$ z=\frac{\ln\left({WE}\right) - \ln\left({1}\right)}{SE\left(\ln\left({WE}\right)\right)} = \frac{\ln\left({WE}\right)}{SE\left(\ln\left({WE}\right)\right)} $$

The corresponding p-value is:

$$ p=2 \cdot \int_{z}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2} \cdot x^2} dx $$
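
A sketch of the weighted estimate, first for absolute estimates (means, differences) and then for ratio estimates (RR, OR, IRR), which are combined on the log scale; all estimates and standard errors below are assumed examples:

```python
import math
from scipy import stats

# Absolute estimates (e.g. mean differences) with their standard errors
estimates = [1.2, 0.8, 1.5]
ses       = [0.4, 0.3, 0.6]
weights = [1 / se**2 for se in ses]
we = sum(e * w for e, w in zip(estimates, weights)) / sum(weights)
se_we = math.sqrt(1 / sum(weights))

# Ratio estimates (e.g. OR) with SEs of the log-transformed estimates
ors       = [1.8, 2.4, 1.3]
se_ln_ors = [0.25, 0.40, 0.30]
w = [1 / s**2 for s in se_ln_ors]
ln_we = sum(math.log(o) * wi for o, wi in zip(ors, w)) / sum(w)
se_ln_we = math.sqrt(1 / sum(w))
we_ratio = math.exp(ln_we)
ci = (math.exp(ln_we - 1.96 * se_ln_we), math.exp(ln_we + 1.96 * se_ln_we))

z = abs(ln_we) / se_ln_we                  # H0: WE = 1
p = 2 * stats.norm.sf(z)
print(we, se_we, we_ratio, ci, p)
```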

Data Analysis


Perform data analysis on one or more data sets HERE and calculate means, standard deviations, confidence intervals and do z-test, t-test and ANOVA.

Quartiles

The quartiles are the three numbers that divide a data set into four parts:

  • The 25% quartile (or first quartile, Q1): 25 percent (a quarter) of the data have this value or below that.
  • The 50% quartile, the "median" (or second quartile, Q2): 50 percent (half) of the data have this value or below that.
  • The 75% quartile (or third quartile, Q3): 75 percent (three quarters) of the data have this value or below that.
  • The last quarter of the data set will have values above the 75% quartile. In other words: 25 percent of the data will have values above Q3

The best way to illustrate quartiles is probably with an example:

If we have, say, 10 people and count the cups of coffee they had during a day, we get a table like this:

Person no.:      1  2  3  4  5  6  7  8  9  10
Cups of coffee:  4  3  1  5  7  0  4  3  2  4

We see immediately that the number 4 is the mode of the data set, since it is the number of cups that occurs most often (three times).
Now, the first step to calculate the quartiles is to sort the data in ascending order:

Person no.:      6  3  9  2  8  1  7  10  4  5
Cups of coffee:  0  1  2  3  3  4  4  4   5  7

Note that the persons' index numbers are no longer in ascending order. This, however, does not matter for the quartiles, since the quartiles concern the data values, not the index values.

First we find the median: Since 10 is an even number, we divide the list into two parts with 5 in each

Person no.:      6  3  9  2  8   |   1  7  10  4  5
Cups of coffee:  0  1  2  3  3   |   4  4  4   5  7

And because 10 is an even number, the median of the data set will in this case be the number in between 3 and 4, in other words the average of the two numbers, which is 3.5. This means that 50% of the people had 3.5 cups of coffee that day or less. Although, technically speaking, any number between 3 and 4 (including 3!) would qualify as a median for this data set, since 50% would still have had that many cups or less that day.

To calculate the 25% quartile, Q1: We take the first 5 data values. Since 5 is an odd number, there is in this case a "natural" data value that divides the five numbers into two equal parts with two numbers in each. This natural divider is the third value, namely 2: 25% (a quarter) had 2 cups or less that day.

Person no.:      6  3  [9]  2  8
Cups of coffee:  0  1  [2]  3  3

(the bracketed middle value, 2, is the 25% quartile)

The calculation of the 75% quartile, Q3, is similar: We take the last five data values. The "natural" divider of these five numbers is the 8th value in the sorted list, namely 4, since it divides them into two equal parts with two numbers in each. The 75% quartile is therefore 4: 75 percent (three quarters) had 4 cups or less that day. The remaining 25% of the people had more than 4 cups that day.

Person no.:      1  7  [10]  4  5
Cups of coffee:  4  4  [4]   5  7

(the bracketed middle value, 4, is the 75% quartile)

We can finish by writing down the set of quartiles: (Q1, Q2, Q3) = (2, 3.5, 4)

Apart from the set of quartiles one can also find any quantile of a data set. For example the 20% quantile: 20% of 10 is 2 people, in other words, the 20% quantile would in this case be 1, since 20% of the ten people had 1 cup of coffee or less.
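
A small sketch of the quartile procedure used in the coffee example: sort the data, take the median, and then the medians of the lower and upper halves. Note that other quantile definitions (e.g. numpy's default interpolation) can give slightly different values:

```python
def median(sorted_vals):
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def quartiles(data):
    s = sorted(data)
    n = len(s)
    q2 = median(s)
    half = n // 2
    lower, upper = s[:half], s[n - half:]   # exclude the middle value if n is odd
    return median(lower), q2, median(upper)

cups = [4, 3, 1, 5, 7, 0, 4, 3, 2, 4]
print(quartiles(cups))                      # (2, 3.5, 4)
```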

Skewness

Skewness of a data sample:

$$ skewness = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^3 }{\left( \frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2 \right)^{3/2}} $$

Since \( Var = SD^2 = \frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2 \) is the variance of a sample this can also be written as:

$$ skewness = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^3 }{\left( SD^2 \right)^{3/2}} = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^3 }{ SD^3 } $$

A normal distribution has a skewness with the value 0. In other words; if a data set or sample is to be assumed normally distributed, it must have a skewness of 0 (or close to 0).

Kurtosis

Kurtosis of a data sample:

$$ K = \frac{\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^4 }{\left( \frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^2 \right)^{2}} $$

Since \( Var_{pop} = SD_{pop}^2 = \frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^2 \) is the variance of a population this can also be written as:

$$ K = \frac{\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^4 }{\left( Var_{pop} \right)^{2}} = \frac{\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^4 }{ SD_{pop}^4 } $$

A normal distribution has a kurtosis with the value 3. In other words; if a data set or sample is to be assumed normally distributed, it must have a kurtosis of 3 (or close to 3).
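
A sketch of the skewness and kurtosis formulas above, applied to an assumed example data set:

```python
import math

def skewness(x):
    n = len(x)
    mean = sum(x) / n
    m3 = sum((v - mean)**3 for v in x) / n               # third central moment
    var_sample = sum((v - mean)**2 for v in x) / (n - 1) # sample variance
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / var_sample**1.5

def kurtosis(x):
    n = len(x)
    mean = sum(x) / n
    m4 = sum((v - mean)**4 for v in x) / n               # fourth central moment
    var_pop = sum((v - mean)**2 for v in x) / n          # population variance
    return m4 / var_pop**2

data = [4, 3, 1, 5, 7, 0, 4, 3, 2, 4]
print(skewness(data), kurtosis(data))   # values near 0 and 3 suggest normality
```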

ANOVA (one-way)

In a one-way ANOVA (Analysis of Variance) it is investigated whether the group means differ more from the overall mean of all observations than would be expected from how much the individual observations differ from their own group means. This is done by using the F-distribution with the following number as the F-statistic:

$$ F = \frac{MSG}{MSW} $$

With degrees of freedom DF1 = J - 1 and DF2 = n - J. J is the number of groups/data sets and n is the number of all observations in total for all groups combined.

The number MSG (between-groups mean square) in the numerator is the average of the squared differences between each group mean and the overall mean:

$$ MSG = \frac{SSG}{J - 1} = \frac{ \sum_{j=1}^J {n_j( \bar{x}_j - \bar{x} )^2} }{J - 1} $$

Here, \( \bar{x}_j \) is the mean of the j'th group. \( \bar{x} \) is the overall mean of all the observations. Each squared difference \( (\bar{x}_j - \bar{x})^2 \) is weighted by multiplying by nj, the number of elements in the j'th group.

The number SSG in the numerator is called the sum of squares between-groups.

The number MSW (within-groups mean square) in the denominator is the mean of the sums of squared deviations from the mean values in all the J groups:

$$ MSW = \frac{SSW}{n - J} = \frac{ \sum_{j=1}^J {\sum_{i=1}^{n_j} {(x_{ij} - \bar{x}_j )^2} } }{n - J} $$

Here, \( x_{ij} \) is the i'th observation in the j'th group. \( \bar{x}_j \) is the mean of the j'th group.

The number SSW in the numerator is called the sum of squares within-groups.

The total sum of squares SST is the sum of squared differences between each of the total observations and the overall mean:

$$ SST = \sum_{j=1}^J {\sum_{i=1}^{n_j} {(x_{ij} - \bar{x} )^2} } = SSG + SSW $$

The p-value in the one-way ANOVA is then calculated with the F probability density function:

$$ p = \int_{F}^\infty {\left( \frac{df_1^{\frac{df_1}{2}} \cdot df_2 ^\frac{df_2}{2} \cdot \Gamma \left(\frac{df_1 + df_2}{2}\right) \cdot x^{\frac{df_1}{2}-1}}{\Gamma \left(\frac{df_1}{2}\right) \cdot \Gamma \left(\frac{df_2}{2}\right) \cdot \left(df_2 + df_1 \cdot x\right)^{\frac{df_1 + df_2}{2}}} \right) dx} $$
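
A sketch of the one-way ANOVA computation, with three assumed example groups; scipy.stats.f_oneway gives the same F and p for comparison:

```python
import numpy as np
from scipy import stats

groups = [np.array([5.1, 4.8, 5.6, 5.0]),
          np.array([4.2, 4.5, 4.0, 4.8, 4.4]),
          np.array([5.9, 6.1, 5.5, 6.0])]

all_obs = np.concatenate(groups)
n, J = len(all_obs), len(groups)
grand_mean = all_obs.mean()

ssg = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups)  # between groups
ssw = sum(((g - g.mean())**2).sum() for g in groups)            # within groups

msg = ssg / (J - 1)
msw = ssw / (n - J)
F = msg / msw
p = stats.f.sf(F, J - 1, n - J)

print(F, p)
print(stats.f_oneway(*groups))              # cross-check
```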

ANOVA (two-way)

In the two-way ANOVA two different F-values are calculated, F1 and F2:

The F-value F1 is calculated as follows:

$$ F_1 = \frac{MSG}{MSE} $$

The numerator MSG is the mean value of the between-groups sum of squares:

$$ MSG = \frac{SSG}{J - 1} = \frac{ I \cdot \sum_{j=1}^J {( \bar{x}_j - \bar{x} )^2} }{J - 1} $$

In other words the average of the squared differences between each group mean \( \bar{x}_j \) and the overall mean \( \bar{x} \). The number I is the number of rows or "blocks" in the I × J matrix consisting of the J groups (data sets), each of size I. Note that in the two-way ANOVA all columns (data sets) must have the same size.

The denominator MSE is the mean value of the error sum of squares:

$$ MSE = \frac{SSE}{(J - 1)(I - 1)} = \frac{ \sum_{i=1}^I \sum_{j=1}^J {( x_{ij} - \bar{x}_j - (\bar{x}_i - \bar{x}) )^2} }{(J - 1)(I - 1)} $$

Where \( \bar{x}_j \) is the mean of the j'th group and \( \bar{x}_i \) is the mean of the i'th row.

The F-value F2, on the other hand, is calculated as follows:

$$ F_2 = \frac{MSB}{MSE} $$

The denominator MSE is the same as with F1. The numerator MSB is the mean value of the between-rows sum of squares:

$$ MSB = \frac{SSB}{I - 1} = \frac{ J \cdot \sum_{i=1}^I {( \bar{x}_i - \bar{x} )^2} }{I - 1} $$

In other words the average of the squared differences between each row (block) mean \( \bar{x}_i \) and the overall mean \( \bar{x} \).

Furthermore, the number SST can be defined, which is the total sum of squares: SST = SSG + SSB + SSE

P-values P1 and P2 can be calculated for F1 and F2 respectively. If P1 is below the chosen significance level, the null hypothesis saying that there is no significant difference between the J group means can be rejected. If P2 is below the chosen significance level, the null hypothesis saying that there is no significant difference between the I row means can be rejected.
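
A sketch of the two-way ANOVA described above, with the data arranged as an I × J matrix (J groups as columns, I rows/blocks); the values are made-up examples, and the degrees of freedom used for the p-values follow the SSG, SSB and SSE definitions above:

```python
import numpy as np
from scipy import stats

x = np.array([[5.1, 4.2, 5.9],
              [4.8, 4.5, 6.1],
              [5.6, 4.0, 5.5],
              [5.0, 4.8, 6.0]])
I, J = x.shape
grand = x.mean()
col_means = x.mean(axis=0)                  # group means
row_means = x.mean(axis=1)                  # row (block) means

ssg = I * ((col_means - grand)**2).sum()
ssb = J * ((row_means - grand)**2).sum()
sse = ((x - col_means[None, :] - row_means[:, None] + grand)**2).sum()

msg = ssg / (J - 1)
msb = ssb / (I - 1)
mse = sse / ((J - 1) * (I - 1))

F1, F2 = msg / mse, msb / mse
p1 = stats.f.sf(F1, J - 1, (I - 1) * (J - 1))
p2 = stats.f.sf(F2, I - 1, (I - 1) * (J - 1))
print(F1, p1, F2, p2)
```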


Rates

Calculate the rates and confidence intervals of one or more groups HERE and test equality between rates.
Compare two incidence rates HERE by calculating their incidence rate ratio with confidence interval and test for equality.

Incidence table:

              Cases      Time (years)
  Group 1:    c1         t1
  Group 2:    c2         t2
  Total:      Totalc     Totalt

Incidence Rate:


$$ IR_1=\frac{c_1}{t_1} \;\;\; \text{and} \;\;\; IR_2=\frac{c_2}{t_2} $$

The standard error of the logarithmic transformed IR:

$$ SE\left({\ln(IR_1)}\right) = \sqrt{\frac{1}{c_1}} = \frac{\sqrt{1}}{\sqrt{c_1}} = \frac{1}{\sqrt{c_1}} \;\;\;\;\; \text{and} \;\;\;\;\; SE\left({\ln(IR_2)}\right) = \sqrt{\frac{1}{c_2}} = \frac{\sqrt{1}}{\sqrt{c_2}} = \frac{1}{\sqrt{c_2}} $$

95 % Confidence Interval of a logarithmic transformed Incidence Rate:

$$ 95\; \% \;CI=\left[ \ln\left({IR}\right) - 1.96 \cdot SE\left({\ln(IR)}\right) \;\; : \;\; \ln\left({IR}\right) +1.96 \cdot SE\left({\ln(IR)}\right)\right] $$

95 % Confidence Interval of the Incidence Rate:

$$ 95\; \% \;CI=\left[\text{e}^{\ln\left({IR}\right) - 1.96 \cdot SE\left({\ln(IR)}\right)} \;\; : \;\; \text{e}^{\ln\left({IR}\right) + 1.96 \cdot SE\left({\ln(IR)}\right)} \right] $$

or, consequently:

$$ 95\; \% \;CI=\left[\frac{IR}{\text{e}^{1.96 \cdot SE\left({\ln(IR)}\right)}} \;\; : \;\; IR \cdot \text{e}^{1.96 \cdot SE\left({\ln(IR)}\right)} \right] $$

Incidence Rate Ratio:


$$ IRR=\frac{IR_1}{IR_2} $$

The Standard Error of the logarithmic transformed IRR:

$$ SE\left({\ln(IRR)}\right) = \sqrt{\frac{1}{c_1} + \frac{1}{c_2} } $$

95 % CI of the logarithmic transformed IRR:

$$ 95\; \% \;CI=\left[ \ln\left({IRR}\right) - 1.96 \cdot SE\left({\ln(IRR)}\right) \;\; : \;\; \ln\left({IRR}\right) + 1.96 \cdot SE\left({\ln(IRR)}\right) \right] $$

95 % CI of IRR:

$$ 95\; \% \;CI=\left[\text{e}^{\ln\left({IRR}\right) - 1.96 \cdot SE\left({\ln(IRR)}\right)} \;\; : \;\; \text{e}^{\ln\left({IRR}\right) + 1.96 \cdot SE\left({\ln(IRR)}\right)} \right] $$

or, consequently:

$$ 95\; \% \;CI = \left[ \frac{IRR}{\text{e}^{1.96 \cdot SE\left({\ln(IRR)}\right)}} \;\; : \;\; IRR \cdot \text{e}^{1.96 \cdot SE\left({\ln(IRR)}\right)}\right] $$

Z-Value, testing if the IRR could be equal to the null hypothesis H0:

$$ z = \frac{| \; \ln\left({IRR}\right) - \ln\left({H0}\right) \; |}{SE\left({\ln(IRR)}\right)} $$

The p-value corresponding to the z-value:

$$ p = 2 \cdot \int_z^{\infty} \frac{1}{\sqrt{2\pi}} \cdot \text{e}^{-\frac{1}{2} \cdot x^2} dx $$

Incidence Rate Difference:


$$ IRD = IR_1 - IR_2 $$

Standard Error of the Incidence Rate Difference:

$$ SE(IRD) = \sqrt{ \frac{c_1}{(t_1)^2} + \frac{c_2}{(t_2)^2} } $$

or, consequently:

$$ SE(IRD) = \sqrt{ \frac{IR_1}{t_1} + \frac{IR_2}{t_2} } $$

95 % Confidence Interval of IRD:

$$ 95\; \% \;CI=\left[ IRD - 1.96 \cdot SE(IRD) \;\; : \;\; IRD + 1.96 \cdot SE(IRD) \right] $$

z-value, testing if IRD could be equal to the null hypothesis H0:

$$ z = \frac{|IRD - H0|}{SE(IRD)} $$

The corresponding p-value is:

$$ p=2 \cdot \int_z^{\infty}\frac{1}{\sqrt{2\pi}} \cdot \text{e}^{-\frac{1}{2} \cdot x^2} dx $$
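
A sketch of the incidence rate, IRR and IRD calculations, assuming example case counts and person-time:

```python
import math
from scipy import stats

c1, t1 = 28, 340.0                         # cases and person-years, group 1
c2, t2 = 15, 390.0                         # cases and person-years, group 2

ir1, ir2 = c1 / t1, c2 / t2

# Incidence rate ratio
irr = ir1 / ir2
se_ln_irr = math.sqrt(1/c1 + 1/c2)
ci_irr = (irr / math.exp(1.96 * se_ln_irr), irr * math.exp(1.96 * se_ln_irr))
z1 = abs(math.log(irr)) / se_ln_irr        # H0: IRR = 1
p_irr = 2 * stats.norm.sf(z1)

# Incidence rate difference
ird = ir1 - ir2
se_ird = math.sqrt(c1 / t1**2 + c2 / t2**2)
ci_ird = (ird - 1.96 * se_ird, ird + 1.96 * se_ird)
z2 = abs(ird) / se_ird                     # H0: IRD = 0
p_ird = 2 * stats.norm.sf(z2)

print(irr, ci_irr, p_irr, ird, ci_ird, p_ird)
```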

Chi-Square-Test (in the case of Incidence Rates):


First, the expected values are found:

              Cases (expected)                                                      Time (years)
  Group 1:    \( \text{e}_{1}=\frac{\text{Total}_{c}}{\text{Total}_{t}} \cdot \text{t}_{1} \)    t1
  Group 2:    \( \text{e}_{2}=\frac{\text{Total}_{c}}{\text{Total}_{t}} \cdot \text{t}_{2} \)    t2
  Total:      Totalc                                                                Totalt

Then the chi-square-number is calculated:

$$ \chi^2 = \frac{(c_1 - e_1)^2}{e_1} + \frac{(c_2 - e_2)^2}{e_2} $$

And then the p-value is found as the number:

$$ p= \int_{\chi^2}^\infty {\left( \frac{x^{\frac{1}{2} \cdot df - 1} \cdot \text{e}^{-\frac{1}{2} \cdot x}}{2^{\frac{1}{2} \cdot df} \cdot \Gamma\left(\frac{df}{2}\right)} \right) dx} $$

Mantel-Haenszel Stratified OR

Calculate Mantel-Haenszel odds ratio HERE from two or more strata and test the null hypothesis of OR = 1.

Mantel-Haenszel is a form of stratification to adjust for confounding via 2 × 2 tables. It can be used even if the sample sizes are very small, in other words if one or more of the cells in a 2 × 2 table is below 5.

After having performed the stratification, where we stratify by the possible confounder, we end up with n different 2 × 2 tables. Each of them has the following form:

The i'th 2 × 2 table after stratification (rows: exposure, columns: outcome):

                    Outcome: Yes    Outcome: No    Total
  Exposed           d1i             h1i            n1i
  Not exposed       d2i             h2i            n2i
  Total             di              hi             ni

Often there are only two tables, one containing all persons with the possible confounder and the other one containing all persons without the possible confounder. If, for example, "smoking status" is a possible confounder for a connection between an exposure and an outcome, we would have one table with all the smokers and the other table with all the non-smokers.

Each table, corresponding to a specific stratum, has its own OR value. If the OR values from these tables differ significantly from each other, then the variable that we stratified by (e.g. "smoking status") is an effect modifier, because the effect that the exposure has on the outcome is then modified depending on which stratum you belong to. If this is the case, the analysis is completed with this conclusion, since a variable cannot be both an effect modifier AND a confounder. If, however, the OR values from the strata do not differ significantly from each other, then we can proceed to combine them by calculating their Mantel-Haenszel weighted estimate (weighted average). If this weighted estimate (ORWE) is significantly different from the crude OR (the OR value from the "original" 2 × 2 table without stratification), then the variable that we stratified by is a confounder.

The formula for the Mantel-Haenszel weighted OR:

$$ OR_{WE} = \frac{\sum_{i=1}^n \left( w_i \cdot OR_i \right) }{\sum_{i=1}^n {w_i}} = \frac{\sum_{i=1}^n {\frac{d_{1i} \cdot h_{2i}}{n_i}} }{\sum_{i=1}^n {\frac{d_{2i} \cdot h_{1i}}{n_i}}} $$

To find the 95 % Confidence Interval of the Mantel-Haenszel weighted OR we must find the standard error of the logarithmic transformed ORWE:

$$ SE(\ln(OR_{WE})) = \sqrt{\frac{V}{Q \cdot R}} \; \; , $$

where

$$ Q = \sum_{i=1}^n {\frac{d_{1i} \cdot h_{2i}}{n_i}} , R = \sum_{i=1}^n {\frac{d_{2i} \cdot h_{1i}}{n_i}} \: \text{ and } \: V = \sum_{i=1}^n {V_i} = \sum_{i=1}^n {\frac{d_{i} \cdot h_{i} \cdot n_{1i} \cdot n_{2i}}{n_{i}^2 \cdot (n_i - 1)}}$$

Now the 95 % Confidence Interval of the logarithmic transformed ORWE will be:

$$ 95 \% CI = [\ln(OR_{WE}) - 1.96 \cdot SE(\ln(OR_{WE})) \;\; : \;\; \ln(OR_{WE}) + 1.96 \cdot SE(\ln(OR_{WE}))] $$

And the actual 95 % Confidence Interval will be the exponentialized version of this:

$$ 95 \% CI = [\text{e}^{\ln(OR_{WE}) - 1.96 \cdot SE(\ln(OR_{WE}))} \;\; : \;\; \text{e}^{\ln(OR_{WE}) + 1.96 \cdot SE(\ln(OR_{WE}))}] $$

or

$$ 95 \% CI = \left[ \frac{OR_{WE}}{\text{e}^{1.96 \cdot SE(\ln(OR_{WE}))}} \;\; : \;\; OR_{WE} \cdot \text{e}^{1.96 \cdot SE(\ln(OR_{WE}))} \right] $$
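
A sketch of the Mantel-Haenszel weighted OR and its 95 % CI, following the Q, R and V quantities defined above; the two strata below are made-up examples:

```python
import math

# Each stratum is given as (d1, h1, d2, h2) from its 2 x 2 table
strata = [(12, 40, 5, 45),
          (8, 22, 4, 30)]

Q = R = V = 0.0
for d1, h1, d2, h2 in strata:
    n1, n2 = d1 + h1, d2 + h2
    d, h, n = d1 + d2, h1 + h2, n1 + n2
    Q += d1 * h2 / n
    R += d2 * h1 / n
    V += d * h * n1 * n2 / (n**2 * (n - 1))

or_mh = Q / R
se_ln = math.sqrt(V / (Q * R))             # SE of ln(OR_WE) as defined on this page
ci = (or_mh / math.exp(1.96 * se_ln), or_mh * math.exp(1.96 * se_ln))
print(or_mh, ci)
```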

To test for homogeneity, i.e. to see whether the OR values from the strata could be equal to each other, we perform a chi-square test where the number chi-square (χ2) is calculated as follows:

$$ \chi^2 = \sum_{i=1}^n {\left( \frac{\ln(OR_i) - \ln(OR_{WE})}{SE(\ln(OR_i))} \right)^2 } $$

The corresponding p-value is then found as the area under the χ2 probability density function with DF = n - 1 degrees of freedom (the number of strata minus 1) from χ2 to infinity. Please see the section "Chi-Square-Test" above.

To test whether the calculated weighted estimate ORWE could be equal to 1, in other words to test whether the association between the exposure and the outcome is significant after having adjusted for the confounder, we perform a chi-square test. The number χ2 in the test is calculated as follows:

$$ \chi^2 = \frac{\left( \sum_{i=1}^n {(d_{1i})} - \sum_{i=1}^n {(E_{1i})} \right)^2}{\sum_{i=1}^n {(V_{i})}} = \frac{\left( \sum_{i=1}^n {(d_{1i})} - \sum_{i=1}^n {(E_{1i})} \right)^2}{\sum_{i=1}^n { \frac{d_{i} \cdot h_{i} \cdot n_{1i} \cdot n_{2i}}{n_{i}^2 \cdot (n_i - 1)} }} $$

The corresponding p-value is then found as the area under the curve of the χ2 probability density function with DF = 1 degree of freedom from χ2 to infinity. Please see the section "Chi-Square-Test" above.

The number E1i belonging to the i'th stratum is the "expected value" that would have replaced d1i if the proportion of outcome had been the same in the exposed and the unexposed group:

$$ E_{1i} = \frac{d_i \cdot n_{1i} }{n_i} $$

"Rule of Five"

The "Rule of Five" is about the validity of the χ2 test performed on the ORWE. The validity of the test can be assessed by calculating the following numbers for each stratum:

1. min(di, n1i) (the minimum of the two numbers di and n1i)

2. max(0, n1i - hi) (the maximum of the two numbers 0 and n1i - hi). This will be 0 if n1i - hi is negative.

The sum of all the minima is found, the sum of all the maxima is found, and the sum of all the expected values E1i is found:

$$ \text{Sum}_{\text{min}} = \sum_{i=1}^n {(\text{min}_{i})} \;\; \text{ and } \;\; \text{Sum}_{\text{max}} = \sum_{i=1}^n {(\text{max}_{i})} \;\; \text{ and } \;\; \text{Sum}_{\text{E}} = \sum_{i=1}^n {(\text{E}_{1i})}$$

The rule is that the absolute distance from both \( \text{Sum}_{\text{min}}\) and \( \text{Sum}_{\text{max}} \) to \( \text{Sum}_{\text{E}} \) must be 5 or more for the test to be valid:

$$ \left|\text{Sum}_{\text{min}} - \text{Sum}_{\text{E}}\right| \geq 5 \;\; \text{ and } \;\; \left|\text{Sum}_{\text{max}} - \text{Sum}_{\text{E}}\right| \geq 5 $$

Mann-Whitney (Wilcoxon) U Test


Perform Mann-Whitney U Test HERE. Both for small (N < 20) and large (N > 20) sample sizes.

Case: N < 20


The Mann-Whitney U test is a method to test whether two data sets could be assumed equally distributed by comparing their ranks, when the data sets are not (or cannot be assumed) normally distributed. It is a non-parametric test, since parameters such as mean values do not enter into the test, unlike in a z-test or t-test.

If there are fewer than 20 numbers in each of the two sets, the following approach is used:

For each value in set 1 we count the number of values in set 2 that it is larger than; a tie between two equal numbers counts as 0.5. This number is U1. The very same procedure is used to calculate U2 by comparing each value in set 2 with each value in set 1. The smaller of U1 and U2 is then Uobt. (U obtained). By comparing this Uobt. with the critical value Ucrit. found in the table of critical U values below, we can either reject or not reject the null hypothesis at a 5% or a 1% significance level. If Uobt. is smaller than or equal to Ucrit. the null hypothesis is rejected. The null hypothesis is: H0: The distributions of the two data sets are identical.

The ranks are found by sorting the complete list consisting of all values from both samples (n1 and n2) in ascending order. The lowest value (at position 1) gets rank 1, the next lowest rank 2, etc. In case of ties (equal values), each of the tied values is given the rank corresponding to the average of the positions of the tied values.

In the table n1 is the size of data set 1 and n2 is the size of data set 2:

n1 α n2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 0.05 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
0.01 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2 0.05 ... ... ... ... ... ... ... 0 0 0 0 1 1 1 1 1 2 2 2 2
0.01 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 0 0
3 0.05 ... ... ... ... 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8
0.01 ... ... ... ... ... ... ... ... 0 0 0 1 1 1 2 2 2 2 3 3
4 0.05 ... ... ... 0 1 2 3 4 4 5 6 7 8 9 10 11 11 12 13 13
0.01 ... ... ... ... ... 0 0 1 1 2 2 3 3 4 5 5 6 6 7 8
5 0.05 ... ... 0 1 2 3 5 6 7 8 9 11 12 13 14 15 17 18 19 20
0.01 ... ... ... ... 0 1 1 2 3 4 5 6 7 7 8 9 10 11 12 13
6 0.05 ... ... 1 2 3 5 6 8 10 11 13 14 16 17 19 21 22 24 25 27
0.01 ... ... ... 0 1 2 3 4 5 6 7 9 10 11 12 13 15 16 17 18
7 0.05 ... ... 1 3 5 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34
0.01 ... ... ... 0 1 3 4 6 7 9 10 12 13 15 16 18 19 21 22 24
8 0.05 ... 0 2 4 6 8 10 13 15 17 19 22 24 26 29 31 34 36 38 41
0.01 ... ... ... 1 2 4 6 7 9 11 13 15 17 18 20 22 24 26 28 30
9 0.05 ... 0 2 4 7 10 12 15 17 21 23 26 28 31 34 37 39 42 45 48
0.01 ... ... 0 1 3 5 7 9 11 13 16 18 20 22 24 27 29 31 33 36
10 0.05 ... 0 3 5 8 11 14 17 20 23 26 29 33 36 39 42 45 48 52 55
0.01 ... ... 0 2 4 6 9 11 13 16 18 21 24 26 29 31 34 37 39 42
11 0.05 ... 0 3 6 9 13 16 19 23 26 30 33 37 40 44 47 51 55 58 62
0.01 ... ... 0 2 5 7 10 13 16 18 21 24 27 30 33 36 39 42 45 46
12 0.05 ... 1 4 7 11 14 18 22 26 29 33 37 41 45 49 53 57 61 65 69
0.01 ... ... 1 3 6 9 12 15 18 21 24 27 31 34 37 41 44 47 51 54
13 0.05 ... 1 4 8 12 16 20 24 28 33 37 41 45 50 54 59 63 67 72 76
0.01 ... ... 1 3 7 10 13 17 20 24 27 31 34 38 42 45 49 53 56 60
14 0.05 ... 1 5 9 13 17 22 26 31 36 40 45 50 55 59 64 67 74 78 83
0.01 ... ... 1 4 7 11 15 18 22 26 30 34 38 42 46 50 54 58 63 67
15 0.05 ... 1 5 10 14 19 24 29 34 39 44 49 54 59 64 70 75 80 85 90
0.01 ... ... 2 5 8 12 16 20 24 29 33 37 42 46 51 55 60 64 69 73
16 0.05 ... 1 6 11 15 21 26 31 37 42 47 53 59 64 70 75 81 86 92 98
0.01 ... ... 2 5 9 13 18 22 27 31 36 41 45 50 55 60 65 70 74 79
17 0.05 ... 2 6 11 17 22 28 34 39 45 51 57 63 67 75 81 87 93 99 105
0.01 ... ... 2 6 10 15 19 24 29 34 39 44 49 54 60 65 70 75 81 86
18 0.05 ... 2 7 12 18 24 30 36 42 48 55 61 67 74 80 86 93 99 106 112
0.01 ... ... 2 6 11 16 21 26 31 37 42 47 53 58 64 70 75 81 87 92
19 0.05 ... 2 7 13 19 25 32 38 45 52 58 65 72 78 85 92 99 106 113 119
0.01 ... 0 3 7 12 17 22 28 33 39 45 51 56 63 69 74 81 87 93 99
20 0.05 ... 2 8 14 20 27 34 41 48 55 62 69 76 83 90 98 105 112 119 127
0.01 ... 0 3 8 13 18 24 30 36 42 46 54 60 67 73 79 86 92 99 105

Case: N > 20

In the case of larger sample sizes, N > 20, the principle for calculating Uobt. is the same: for each value in sample 1, count how many values in sample 2 are smaller than it (a tie counts as 0.5). This gives U1, and U2 is found in the same way for sample 2. Uobt. is the smaller of U1 and U2. Alternatively, U1 and U2 can be found from the rank sums via the formulas:

$$ U_1 = R_1 - \frac{n_1(n_1 + 1)}{2} \text{ and } U_2 = R_2 - \frac{n_2(n_2 + 1)}{2} \; , $$

where R1 and R2 are the sum of the ranks in sample 1 and sample 2. n1 and n2 are the sample sizes of sample 1 and sample 2.

After having found Uobt. one can calculate its mean value mU and its standard deviation σU:

$$ m_U = \frac{n_1 n_2}{2} \text{ and } \sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} $$

If there are tied (equal) ranks in the list, the formula for σU will be instead:

$$ \sigma_U = \sqrt{\frac{n_1 n_2}{12} \left( (n_1 + n_2 + 1) - \sum_{i=1}^{k}{ \frac{t_i^3 - t_i}{(n_1 + n_2)(n_1 + n_2 - 1)} } \right) } $$

k is the number of distinct ranks shared by two or more values, and ti is the number of values sharing the i-th of these tied ranks.

The P value is then found through a normal distribution approximation with the z-value found in this way:

$$ z_U = \frac{| U_{obt.} - n_1 n_2 / 2 |}{\sqrt{n_1 n_2(n_1 + n_2 + 1)/12}} = \frac{| U_{obt.} - m_U |}{\sigma_U}$$

The number f is the common language effect size. It is calculated this way:

$$ f = \frac{U_1}{n_1 n_2} $$

It expresses the proportion of all possible ordered pairs (s1, s2), with s1 taken from sample 1 and s2 from sample 2, in which s1 is larger than s2.

The number r is the rank-biserial correlation. It is calculated this way:

$$ r = f - u = 2f - 1 = \frac{2 U_1}{n_1 n_2} - 1 = 1 - \frac{2 U_2}{n_1 n_2} $$

It is the difference between the effect size f and its complement u = U2 / (n1 n2) = 1 - f.
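To tie the large-sample formulas together, here is a minimal sketch in Python (using numpy and scipy, and ignoring the tie correction for σU for brevity). The two example samples are invented for illustration.

import numpy as np
from scipy.stats import rankdata, norm

def mann_whitney_large(sample1, sample2):
    n1, n2 = len(sample1), len(sample2)
    ranks = rankdata(np.concatenate([sample1, sample2]))   # average ranks for ties
    r1 = ranks[:n1].sum()                                  # rank sum R1 of sample 1
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1                                      # same as R2 - n2(n2+1)/2
    u_obt = min(u1, u2)
    m_u = n1 * n2 / 2
    sigma_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)        # no tie correction here
    z = abs(u_obt - m_u) / sigma_u
    p = 2 * (1 - norm.cdf(z))                              # two-sided p-value
    f = u1 / (n1 * n2)                                     # common language effect size
    r = 2 * f - 1                                          # rank-biserial correlation
    return u_obt, z, p, f, r

print(mann_whitney_large([3.1, 4.5, 2.8, 5.0, 6.2], [5.5, 7.1, 6.8, 8.0, 7.4]))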


Regression

Do linear and logistic regression HERE. Both singular as well as multiple regression.

Singular Linear Regression


In the case where the dependent variable Y depends on only one independent variable X the model is:

$$ Y = \beta_0 + \beta_1 X + \epsilon $$

where \( \beta_0 \) is the constant, i.e. the graphical intersection (intercept) with the Y-axis when the linear graph is drawn, and \( \beta_1 \) is the slope of the line. More precisely; \( \beta_0 \) is the Y-value in the point where the line intersects the Y-axis. The point has the form \( (x, y) = (0, \beta_0) \). The number \( \epsilon \) is the error term, meaning the vertical distance from the predicted y value on the line (given by the model) to the actual y value (the y coordinate of the point).

The meaning of the slope \( \beta_1 \) is that this is how much Y will increase (or decrease) for every 1 unit increase in X. If \( \beta_1 \) is negative we have a decreasing linear function and if \( \beta_1 \) is positive we have an increasing linear function. If \( \beta_1 \) is zero (or if we cannot reject the H0 hypothesis that \( \beta_1 \) could be zero) then the variable X does not contribute to the model, because a change in X does not lead to a significant change in Y. And X can therefore be omitted from the model and the straight line will become constant (horizontal); \( Y = \beta_0 \).

Given a set of n points \( (x_1 , y_1), (x_2 , y_2), (x_3 , y_3), ..., (x_n , y_n) \) in the 2D coordinate system, the task is to find the regression line \( Y = \beta_0 + \beta_1 X \) that will hit all the points at the same time, i.e.; all the points would ideally lie on the straight line. This is in almost all cases impossible since the n points only approximately lie on a straight line. The regression line is therefore the "best fitted line" i.e. the line that tries to get as close as possible to all the n points at the same time. The slope of this line will be \( \beta_1 \) and the intersection with the Y-axis \( \beta_0 \). The regression coefficients \( \beta_1 \) and \( \beta_0 \) will only be estimates for the true values of \( \beta_1 \) and \( \beta_0 \). This is because we only have a sample of n points and not all points from the whole population. For this reason they are sometimes written with a "hat" over them; \( \widehat{\beta_0} \) and \( \widehat{ \beta_1} \).

[Illustration: the best-fitted line (regression line) and the vertical distances used in the formulas for the slope and intercept in the case of singular linear regression (only one x variable).]


The formulas:


$$ \beta_1 = \frac{\sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}}{\sum_{i=1}^n{(x_i - \bar{x})^2}} $$ $$ \beta_0 = \bar{y} - \beta_1 \bar{x} $$

Meaning: In the numerator of the fraction for \( \beta_1 \), for each point (x, y) the distance from the x value to the mean of the x values and the distance from the y value to the mean of the y values are calculated. These two distances are multiplied together, and all these products are added up.
The denominator of the fraction is the sum of the squared distances from the x values to the x mean: for each x, its distance to the x mean is found, the distance is squared, and the squares are added up.
After \( \beta_1 \) has been found, \( \beta_0 \) is found by subtracting the product of \( \beta_1 \) and the x mean from the y mean.
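A short sketch in Python of the two formulas above, computing β1 and β0 directly from the sums of deviation products. The example points are invented.

import numpy as np

def simple_linear_regression(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    # slope: sum of products of deviations divided by sum of squared x deviations
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()            # intercept
    return beta0, beta1

print(simple_linear_regression([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.1, 9.8]))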

To test the null hypothesis (H0) whether \( \beta_0 \) or \( \beta_1 \) could be equal to zero (or any other value), the following formula is used to find the t-values:

$$ t = \frac{\beta_0 - H_0}{s(\beta_0)} $$

where

$$ s(\beta_0) = \sqrt{Var(\beta_0)} $$ $$ = \sqrt{\frac{1}{n-2}\left(\sum_{i=1}^n{(y_i - \bar{y})^2} - \frac{\left(\sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}\right)^2}{\sum_{i=1}^n{(x_i - \bar{x})^2}}\right)} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n{(x_i - \bar{x})^2}}} $$

and

$$ t = \frac{\beta_1 - H_0}{s(\beta_1)} $$

where

$$ s(\beta_1) = \sqrt{Var(\beta_1)} = \sqrt{ \frac{ \frac{1}{n-2}\left(\sum_{i=1}^n{(y_i - \bar{y})^2} - \frac{\left(\sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}\right)^2}{\sum_{i=1}^n{(x_i - \bar{x})^2}}\right) }{ \sum_{i=1}^n{(x_i - \bar{x})^2} } } $$

It can be shown that in the case of \( \beta_1 \) the formula for \( s(\beta_1) \) is equivalent to:

$$ s(\beta_1) = |\beta_1| \cdot \sqrt{ \frac{1}{n-2} \cdot \frac{1 - R^2}{R^2}} $$

In both of these cases, after having found the t-values, the t-test can then be performed with \( n - 2 \) degrees of freedom to find the corresponding p-value (see section "t-test" in chapter "means").

The coefficient of determination R2 (the squared correlation coefficient) of the linear regression is found this way:

$$ R^2 = 1 - \frac{\sum_{i=1}^n{e_i^2}}{\sum_{i=1}^n{(y_i - \bar{y})^2}} = 1 - \frac{\sum_{i=1}^n{(y_i - \widehat{y_i})^2}}{\sum_{i=1}^n{(y_i - \bar{y})^2}} $$

The numerator of the fraction is the sum of the squared errors. These errors are the vertical distances from each y value to the predicted y value \( \hat{y} \) (the y value on the line, i.e. the y value that the point should have had if it had been on the line). The denominator is the sum of the squared vertical distances between each y and the y mean.
R2 is a number between 0 and 1. If R2 = 1 then all the points are lying exactly on the regression line and the model gives a perfect prediction of the y value given an x value. This will, however, almost never happen. In this case all the errors (or "residuals") ei would be zero, because there would be no vertical distance from the points' y values \( y_i \) to the predicted y values \( \hat{y}_i \) on the line. And we would then get, with the formula:

$$ R^2 = 1 - \frac{\sum_{i=1}^n{e_i^2}}{\sum_{i=1}^n{(y_i - \bar{y})^2}} = 1 - \frac{\sum_{i=1}^n{0^2}}{\sum_{i=1}^n{(y_i - \bar{y})^2}} = 1 - 0 = 1 $$

And vice versa: The closer R2 is to zero the further away the points will be from the regression line.

For a given x we can calculate the predicted y-value from the model:

$$ \hat{y} = \beta_0 + \beta_1 \cdot x $$

This is the value that y should have, if it were to lie exactly on the regression line. This predicted y-value has a confidence interval:

$$ CI = \hat{y} \; \pm \; t_{(1-\frac{\alpha}{2})} \cdot SE_{\hat{y}} $$

The number \( t_{(1-\frac{\alpha}{2})} \) is the \( (1-\frac{\alpha}{2})\cdot 100 \% \) quantile of the t-distribution with (n - 2) degrees of freedom and significance level α. See section "t-test" in chapter "means" above for further info.

\( SE_{\hat{y}} \) is the standard error of the predicted y. It has the formula:

$$ SE_{\hat{y}} = \sqrt{ s^2 \left( 1 + \frac{1}{n} + \frac{ (x - \bar{x})^2 }{\sum_{i=1}^n{x_i^2} - n\bar{x}^2} \right)} $$

Where x is the x-value from the point \( (x, \hat{y}) \) . \( n \) is the number of points used in the regression and \( \bar{x} \) is the mean of all the x values in the regression. s2 is the variance of the errors/residuals.
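A small sketch in Python of the predicted ŷ and its confidence interval, using the SE formula above, with s2 computed from the residuals on n - 2 degrees of freedom. The data and the new x-value are invented for illustration.

import numpy as np
from scipy.stats import t

def predict_with_ci(x_data, y_data, x_new, alpha=0.05):
    x, y = np.asarray(x_data, float), np.asarray(y_data, float)
    n = len(x)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    resid = y - (beta0 + beta1 * x)
    s2 = np.sum(resid ** 2) / (n - 2)                 # variance of the residuals
    y_hat = beta0 + beta1 * x_new                     # predicted y for the new x
    se = np.sqrt(s2 * (1 + 1 / n
                       + (x_new - x.mean()) ** 2 / (np.sum(x ** 2) - n * x.mean() ** 2)))
    t_q = t.ppf(1 - alpha / 2, n - 2)                 # t quantile with n - 2 DF
    return y_hat, (y_hat - t_q * se, y_hat + t_q * se)

print(predict_with_ci([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.1, 9.8], 3.5))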


Multiple Linear Regression


In the case where the dependent Y-variable is dependent on more than one, say j, X-variables, we have the model:

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + ... + \beta_j X_j + \epsilon $$

A "point" in this model would then be a data set consisting of j + 1 numbers:

$$ (y, x_1, x_2, x_3, ..., x_j) $$

If we have \( i \) such "points" to be used in a multiple linear regression they can then be arranged as follows:

$$ (y_1, x_{11}, x_{21}, x_{31}, ..., x_{j1}) $$ $$ (y_2, x_{12}, x_{22}, x_{32}, ..., x_{j2}) $$ $$ (y_3, x_{13}, x_{23}, x_{33}, ..., x_{j3}) $$ $$ ... $$ $$ (y_i, x_{1i}, x_{2i}, x_{3i}, ..., x_{ji}) $$

The word "point" is here in quotation marks, because such a data set cannot in general be plotted as a dot in a coordinate system, since we cannot draw anything with a dimension larger than 3. If the model has only 2 X-variables, that is; \( Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2\) we are able to plot the points (and the graph of the function) in a 3D coordinate system. The "linear" function Y will then be a plane in 3D. In general: if there are \( j \) X-variables in the model, the dimension of the model will be \( j + 1 \).

The \( \beta_0 \) in the equation of the model is the intersection/intercept (or the "constant") and the other \( \beta \) values are the slopes belonging to their specific X variables. Each slope is the impact that its corresponding X variable has on the change of the Y variable. If for ex. the slope belonging to \( X_2 \) is \( \beta_2 = 0.34 \) then an increase in \( X_2 \) by 1 unit will lead to an increase in Y by 0.34 units, all other X variables remaining unchanged.

A negative slope means that its corresponding X-variable has a negative effect on Y (an increase in this X will lead to a decrease in the Y value). If a \( \beta \) value is not significantly different from zero, this means that the corresponding X-variable does not have a significant impact/effect on the Y variable and this X-variable can therefore be omitted (ignored) in the model. To test whether a \( \beta \) value could be equal to zero (or any other value) one can perform a t-test with \( i - j - 1 \) degrees of freedom. If the p-value from the test is below the significance level (often 0.05) then we can reject the null hypothesis (H0) that says that \( \beta = 0 \) and the \( \beta \) is therefore significantly different from zero (or another chosen value).

In order to find the beta values \( \beta_0, \beta_1, \beta_2, ..., \beta_j \) we can write the \( i \) equations, each equation corresponding to its specific point:

$$ y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{21} + \beta_3 x_{31} + ... + \beta_j x_{j1} $$ $$ y_2 = \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} + \beta_3 x_{32} + ... + \beta_j x_{j2} $$ $$ y_3 = \beta_0 + \beta_1 x_{13} + \beta_2 x_{23} + \beta_3 x_{33} + ... + \beta_j x_{j3} $$ $$ ... $$ $$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + ... + \beta_j x_{ji} $$

We must now find the set of \( \beta \) values that satisfy all of these equations simultaneously. This is impossible, however, since all the "points" do not lie on the same "line" in j + 1 dimensional space. We therefore introduce an error value \( \epsilon \) in each equation to make all the equations true.

$$ y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{21} + \beta_3 x_{31} + ... + \beta_j x_{j1} + \epsilon_1 $$ $$ y_2 = \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} + \beta_3 x_{32} + ... + \beta_j x_{j2} + \epsilon_2 $$ $$ y_3 = \beta_0 + \beta_1 x_{13} + \beta_2 x_{23} + \beta_3 x_{33} + ... + \beta_j x_{j3} + \epsilon_3 $$ $$ ... $$ $$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + ... + \beta_j x_{ji} + \epsilon_i $$

The task is now to find the smallest possible value of each of the \( \epsilon \) that will make the "line" get as close as possible to all the "points" at the same time.

Written in matrix notation we get the set of equations to be

$$ \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ ... \\ y_i \\ \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{21} & x_{31} & ... & x_{j1} \\ 1 & x_{12} & x_{22} & x_{32} & ... & x_{j2} \\ 1 & x_{13} & x_{23} & x_{33} & ... & x_{j3} \\ ... & ... & ... & ... & ... & ...\\ 1 & x_{1i} & x_{2i} & x_{3i} & ... & x_{ji} \\ \end{bmatrix} \cdot \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ ... \\ \beta_j \\ \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ ... \\ \epsilon_i \\ \end{bmatrix} $$

Or (written in more compact matrix notation):

$$ Y = X\beta + \epsilon $$

Here, Y and \( \epsilon \) are \( i \cdot 1 \) column vectors. X is an \( i \cdot (j + 1) \) matrix and \( \beta \) is a \( (j + 1) \cdot 1 \) column vector.

With the errors isolated we get:

$$ Y - X\beta = \epsilon $$

By squaring the errors we make sure they are always positive numbers:

$$ \epsilon_i^2 = (y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i} - ... - \beta_j x_{ji})^2 $$

The sum of all these squared errors

$$ \sum_{k=1}^i{ (y_k - \beta_0 - \beta_1 x_{1k} - \beta_2 x_{2k} - ... - \beta_j x_{jk})^2 } $$

must be as small as possible to make the "line" or "hyperplane" fit as well as possible to all the "points".

It can be shown (eg. by use of differentiation) that this is achieved by setting

$$ \beta = (X^T X)^{-1} X^T Y $$

XTX is the product of the transposed matrix XT with the X matrix itself. (XTX)-1 is the inverse of the XTX matrix.

Having now found the \( \beta \) column with all the beta values of the model we can find the variance and the standard deviation of the errors:

$$ s^2 = \frac{1}{i - j - 1} \sum_{k=1}^i{\epsilon_k^2} \iff s = \sqrt{\frac{1}{i - j - 1} \sum_{k=1}^i{\epsilon_k^2}} $$

The standard deviations of each of the \( \beta \) values are now found as the square roots of the diagonal elements in the matrix:

$$ s^2 \cdot (X^T X)^{-1} $$
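The matrix formulas above translate almost directly into code. Below is a minimal sketch in Python/numpy that builds the design matrix, computes β = (XTX)-1XTY, the residual variance s2 and the standard deviations of the coefficients; the example data are invented.

import numpy as np

def multiple_linear_regression(x_vars, y):
    y = np.asarray(y, float)
    # design matrix: a column of ones (for beta_0) followed by the X variables
    X = np.column_stack([np.ones(len(y))] + [np.asarray(v, float) for v in x_vars])
    i, p = X.shape                                  # i "points", p = j + 1 columns
    beta = np.linalg.inv(X.T @ X) @ X.T @ y         # coefficient column vector
    resid = y - X @ beta                            # the errors/residuals
    s2 = resid @ resid / (i - p)                    # variance with i - j - 1 DF
    se_beta = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return beta, se_beta

y  = [3.2, 4.1, 5.9, 6.3, 7.8, 9.1]
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.0, 1.5, 3.1, 2.8, 4.0, 3.9]
print(multiple_linear_regression([x1, x2], y))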

After having found the \( \beta \) values, an F-test can be performed to test the null hypothesis H0: all the slopes are equal to zero; \( \beta_1 = \beta_2 = \beta_3 = ... = \beta_j = 0 \). In other words; the overall value of the model is useless, since none of the X variables contribute significantly to the change in the Y values. This is held up against the alternative hypothesis H1: at least one of the slopes is different from zero.

The F value for the test is calculated as

$$ F = \frac{R^2 / j}{(1 - R^2)/(i - j - 1)} $$

The corresponding p value is then found as described under F-test in chapter "means" with degrees of freedom DF1 = j and DF2 = i - j - 1 .

For a given set of x-values \( (x_1, x_2, x_3, ..., x_j) \) we can calculate the corresponding expected (or predicted) y-value according to the model:

$$ \hat{y} = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + ... + \beta_j \cdot x_j $$

This will then be the value that y should have, to make the "point" \( (\hat{y}, x_1, x_2, x_3, ..., x_j) \) lie directly on the "line" or "hyperplane" of the model.

A confidence interval of this \( \hat{y} \) can be calculated as:

$$ CI = \hat{y} \; \pm \; t_{(1-\frac{\alpha}{2})} \cdot SE_{\hat{y}} $$

The number \( t_{(1-\frac{\alpha}{2})} \) is the \( (1-\frac{\alpha}{2})\cdot 100 \% \) quantile of the t-distribution with (i - j - 1) degrees of freedom and significance level α. See section "t-test" in chapter "means" above for further info.

\( SE_{\hat{y}} \) is the standard error of \( \hat{y} \) . It can be found the following way:

$$ SE_{\hat{y}} = \sqrt{ s^2 (1 + X_0^T(X^T X)^{-1} X_0) } $$

Here, \( X_0 \) is the \( (1+j) \cdot 1 \) dimensional column vector \( (1,x_1,x_2,x_3,...,x_j)\) written vertically. \( X_0^T \) is the transpose of this vector (written horizontally instead of vertically). s2 is the variance of the errors/residuals and \( (X^T X)^{-1} \) is the inverse of \( X^T X \) (see above).

The above formula for \( SE_{\hat{y}} \) holds in general for all cases. If, however, the number of variables is only 1, i.e. the model is: \( Y = \beta_0 + \beta_1 X_1 \) the following formula could also be used instead:

$$ SE_{\hat{y}} = \sqrt{ s^2 \left( 1 + \frac{1}{i} + \frac{ (x - \bar{x})^2 }{\sum_{k=1}^i{x_k^2} - i\bar{x}^2} \right)} $$

Where x is the x-value from the point \( (x, \hat{y}) \) satisfying \( \hat{y} = \beta_0 + \beta_1 \cdot x \) . \( i \) is the number of points used in the regression and \( \bar{x} \) is the mean of the x values.

Correlation Matrix


The correlation between two data sets X1 and X2 of equal size is a number between -1 and 1. It is used to assess whether the two data sets can be assumed uncorrelated (linearly independent) of each other. If the correlation is close to -1 or 1 the data sets are strongly correlated. If the correlation is close to 0 they are uncorrelated.

In a multivariable linear regression model, all the X variables X1, X2, ... Xk should be independent from one another. Meaning that for any pair of two X data sets, Xi and Xj, their correlation should be 0 or close to zero for the linear regression model to be valid. In other words \( corr(X_i , X_j) = 0 \) . In contrast, the correlation between the Y variable and any one of the X variables should be strong, i.e. close to -1 or 1.

Formula for correlation between Xi and Xj:

$$ corr(X_i , X_j) = \frac{ Mean(X_i \cdot X_j) - Mean(X_i) \cdot Mean(X_j) }{ \sqrt{ Mean(X_i^2) - Mean(X_i)^2} \cdot \sqrt{ Mean(X_j^2) - Mean(X_j)^2} } $$ $$ = \frac{ \overline{X_i X_j} - \overline{X_i} \cdot \overline{X_j} }{ \sqrt{ \overline{X_i^2} - \overline{X_i}^2} \cdot \sqrt{ \overline{X_j^2} - \overline{X_j}^2} } $$

The correlation matrix contains the correlation coefficients between any two of the involved samples or data sets. The matrix is symmetric, since \( corr(X_i , X_j) = corr(X_j , X_i)\).
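A brief sketch in Python of the mean-based correlation formula and the resulting symmetric correlation matrix; the two small data sets are invented.

import numpy as np

def corr(xi, xj):
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    num = (xi * xj).mean() - xi.mean() * xj.mean()
    den = np.sqrt((xi**2).mean() - xi.mean()**2) * np.sqrt((xj**2).mean() - xj.mean()**2)
    return num / den

def correlation_matrix(data_sets):
    k = len(data_sets)
    return np.array([[corr(data_sets[a], data_sets[b]) for b in range(k)]
                     for a in range(k)])            # symmetric, 1's on the diagonal

x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.8, 6.1, 8.2, 9.9]
print(correlation_matrix([x1, x2]))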


Singular Logistic Regression


If we have a dependent Y variable which is binary, i.e. can only take the values 0 and 1 (the value y = 1 could represent the patient having the disease and y = 0 the opposite, the patient not having the disease), then we want a function p(x) that only gives output values between 0 and 1, namely the probability that a certain patient has the disease, given the input data of that patient. For ex. (in the case of singular logistic regression), if a patient with the x-value 23.56 has the disease/outcome, the function p should give p(23.56) = 1 (or a number as close to 1 as possible).

The logistic function will fulfill these criteria:

$$ p(x) = \frac{1}{1 + \text{e}^{-(\beta_0 + \beta_1 x)}} , $$

where the values of the parameters β0 and β1 must be chosen so that the function value p(x) will predict the correct outcome as precisely as possible: If a patient has the disease, then p(x) should be as close to 1 as possible, if a patient doesn't have the disease p(x) should be as close to 0 as possible, noticing that p(x) can never be precisely 1 or 0.

Example: If we have the following data sets for Y and X:

$$ Y = [0,0,0,0,0,0,1,0,1,1] $$ $$ X = [1,2,3,4,5,6,7,8,9,10] , $$

we want to find the values of β0 and β1 that make the function predict the actual Y-values as precisely as possible given the X-values. Using logistic regression we find that β0 = -9.5270 and β1 = 1.2675. The function p(x) will then be:

$$ p(x) = \frac{1}{1 + \text{e}^{-(-9.5270 + 1.2675 x)}} $$

[Illustration: the probability function (best-fitted curve) in the singular logistic regression model, i.e. in the case of a single independent variable (x variable).]

In this example p(10) = 0.9588. This number is not exactly 1 (as it ideally should be, since the y-value corresponding to x = 10 is 1), but it is the closest it can get with the beta values found, which are the optimal ones.

Instead of using probability/risk p(x) one can translate into odds:

$$ odds = \frac{p(x)}{1 - p(x)} = \text{e}^{\beta_0 + \beta_1 x} = \text{e}^{\beta_0} \times \text{e}^{\beta_1 x} = baseline \times OR^x $$

Using odds instead of risk can be an advantage in many cases, since risk is a bounded number which can only take values between 0 and 1, whereas odds can take any positive value.

The methods used for finding the optimal beta values, i.e. the values that make the curve get as close as possible to all the points at the same time, are more involved.
Given a set of beta values (coefficients) β0, β1, which can both have a start (guess) value of 0, for each pair (x, y) the probability p(x) is calculated if the value of y is 1, and the probability 1 - p(x) is calculated if the y value is 0. These are the predicted probabilities from the logistic function that y is 1 or 0 respectively. All of these p(x) and 1 - p(x) values are then multiplied together to give the overall likelihood (LLH) of the model with the current choice of betas:

$$ LLH = p(x_1)^{y_1} \times (1 - p(x_1))^{1 - y_1} \times p(x_2)^{y_2} \times (1 - p(x_2))^{1 - y_2} \times ... \times p(x_n)^{y_n} \times (1 - p(x_n))^{1 - y_n} $$

The task is then to find the optimal values of the betas which will make the likelihood as large as possible. Since a product of numbers between 0 and 1 will approach 0 quite fast, it's more convenient to work with the "log likelihood" instead, by taking the natural logarithm on both sides of the above equation:

$$ \ln(LLH) = y_1 \ln(p(x_1)) + (1 - y_1) \ln(1 - p(x_1)) + y_2 \ln(p(x_2)) + (1 - y_2) \ln(1 - p(x_2)) + ... $$ $$ ... + y_n \ln (p(x_n)) + (1 - y_n) \ln(1 - p(x_n)) $$

This causes no problems, since a maximum of the likelihood in terms of the betas will also be a maximum of the log likelihood. The maximizing method itself is an algorithm involving Newton-Raphson's formula for finding an extremum.
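As an illustration of the principle (not a production implementation), the sketch below maximizes the log likelihood with Newton-Raphson iterations in Python/numpy, using the small Y and X data sets from the example above; it should reproduce approximately the quoted β0 ≈ -9.53 and β1 ≈ 1.27.

import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1], dtype=float)
x = np.arange(1, 11, dtype=float)
X = np.column_stack([np.ones_like(x), x])        # design matrix: intercept column and x

beta = np.zeros(2)                               # start guess: beta_0 = beta_1 = 0
for _ in range(25):                              # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-(X @ beta)))            # current predicted probabilities p(x)
    W = np.diag(p * (1 - p))
    beta = beta + np.linalg.inv(X.T @ W @ X) @ X.T @ (y - p)

p = 1 / (1 + np.exp(-(X @ beta)))
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # ln(LLH) at the optimum
print(beta, log_lik)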


Multiple Logistic Regression


If there's more than one independent X variable, the principles and methods are the same as in the case of simple logistic regression above; we still need to find the set of beta values (β0, β1, ..., βn), that will maximize the likelihood of the model:

$$ p(x) = \frac{1}{1 + \text{e}^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}} $$

For example, if there are the three following X variables: blood pressure (X1), weight (X2) and age (X3) with the following values:

Y X1 X2 X3
0 109 13.1 68
0 112 12.9 65
1 124 14.1 64
0 125 16.2 67
0 127 21.5 93
1 130 17.5 68
1 139 30.7 89
1 150 28.4 69
0 146 25.1 67
1 155 31.5 68

And y=1 being disease/outcome and y=0 not disease, then a logistic regression will give the following values:
(β0 = -8.5165, β1 = 0.0744, β2 = 0.0455, β3 = -0.0305) and (OR0 = 0.0002, OR1 = 1.0772, OR2 = 1.0465, OR3 = 0.9700)

Written into the equation of p(x):

$$ p(x) = \frac{1}{1 + \text{e}^{-(-8.5165 + 0.0744 x_1 + 0.0455 x_2 - 0.0305 x_3)}} $$

Written in the Odds Ratio notation:

$$ Odds = 0.0002 \times 1.0772^{x_1} \times 1.0465^{x_2} \times 0.9700^{x_3} $$

Interpretation of the OR values: In a logistic regression, each variable involved is adjusted for the confounding effect (if any) that the other variables have on it. In other words, the OR value of 1.0772 means that for every increase of 1 in blood pressure, the odds of getting the outcome/disease will be 1.0772 times higher, everything else being the same, after adjustment for the confounding effects of weight (X2) and age (X3).
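A short sketch in Python of how the fitted model above is used for prediction: the coefficients are the ones quoted in the worked example, while the new patient's values are invented for illustration.

import numpy as np

beta = np.array([-8.5165, 0.0744, 0.0455, -0.0305])   # (beta_0, beta_1, beta_2, beta_3)
x_new = np.array([1.0, 135.0, 20.0, 70.0])            # hypothetical patient: 1, x1, x2, x3

p = 1 / (1 + np.exp(-(beta @ x_new)))                  # predicted risk of the outcome
odds = p / (1 - p)                                     # equals exp(beta @ x_new)
odds_ratios = np.exp(beta)                             # (OR_0, OR_1, OR_2, OR_3)
print(p, odds, odds_ratios)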


Power & Sample Size


Calculate power or sample size HERE. Both in the dichotomous case as well as the continuous case.

Power


1. Dichotomous Outcome (proportions):

To calculate the power (1 - β) of a study given the proportions p1 and p2 and sample sizes n1 and n2 you can isolate the number \( z_{(1 - \beta)} \) in the below formula for n1 under sample size, which gives:

$$ z_{(1 - \beta)} = \frac{ \sqrt{n_1(p_2 - p_1)^2} - z_{(1-\frac{\alpha}{2})} \sqrt{\left( 1 + \frac{1}{r} \right) \left( \frac{p_1 + p_2 \cdot r}{1 + r} \right) \left( 1 - \frac{p_1 + p_2 \cdot r}{1 + r} \right) } }{ \sqrt{p_1 (1 - p_1) + p_2 \left( \frac{1 - p_2}{r} \right) } } $$

\( z_{(1-\frac{\alpha}{2})} \) is the z-value corresponding to the \( 1-\frac{\alpha}{2} \) percentile in the standard normal distribution. For ex. if α = 0.05 then \( z_{(1-\frac{\alpha}{2})} = z_{(1-\frac{0.05}{2})} = z_{0.975} = 1.96 \)

Then the power (1 - β) is found as the area under the normal distribution function from \( -\infty \) to \( z_{(1 - \beta) } \)

$$ 1 - \beta = \int_{-\infty}^{z_{(1 - \beta) }} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2} \cdot x^2} dx $$
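A minimal sketch in Python of this power calculation, using scipy's standard normal distribution; the proportions and sample sizes in the example call are invented.

import numpy as np
from scipy.stats import norm

def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    r = n2 / n1
    z_a = norm.ppf(1 - alpha / 2)                 # e.g. 1.96 for alpha = 0.05
    p_bar = (p1 + p2 * r) / (1 + r)
    num = np.sqrt(n1 * (p2 - p1) ** 2) - z_a * np.sqrt((1 + 1 / r) * p_bar * (1 - p_bar))
    den = np.sqrt(p1 * (1 - p1) + p2 * (1 - p2) / r)
    return norm.cdf(num / den)                    # area from -infinity to z_(1-beta)

print(power_two_proportions(p1=0.20, p2=0.35, n1=150, n2=150))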

2. Continuous Outcome (mean values):

To calculate the power (1 - β) of a study given the mean values \( \bar{x}_1 \) and \( \bar{x}_2 \), sample sizes n1, n2 plus the common SD:

$$ 1 - \beta = \int_{-\infty}^{z_A} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2} \cdot x^2} dx + \int_{-\infty}^{z_B} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2} \cdot x^2} dx $$ $$ \text{where } z_A = z - z_{(1-\frac{\alpha}{2})} \text{ and } z_B = -z - z_{(1-\frac{\alpha}{2})} $$

\( z_{(1-\frac{\alpha}{2})} \) is the z-value corresponding to the \( 1-\frac{\alpha}{2} \) percentile in the standard normal distribution. For ex. if α = 0.05 then \( z_{(1-\frac{\alpha}{2})} = z_{(1-\frac{0.05}{2})} = z_{0.975} = 1.96 \)

The number \( z \) is found in this way:

$$ z = \frac{\bar{x}_1 - \bar{x}_2}{SD\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

Sample Size


1. Dichotomous Outcome (proportions):

To calculate the minimum necessary sample sizes n1 and n2 of a study given the proportions p1 and p2, the power 1 - β and the ratio \( r = \frac{n_2}{n_1} \):

$$ n_1 = \frac{ \left( z_{(1-\frac{\alpha}{2})} \sqrt{\left( 1 + \frac{1}{r} \right)\left( \frac{p_1 + p_2 \cdot r}{1 + r} \right)\left( 1 - \frac{p_1 + p_2 \cdot r}{1 + r} \right)} + z_{(1-\beta)} \sqrt{p_1 (1 - p_1) + p_2 \frac{1 - p_2}{r}} \right)^2 }{(p_2 - p_1)^2} $$ $$ n_2 = r \cdot n_1 $$

\( z_{(1-\frac{\alpha}{2})} \) is the z-value corresponding to the \( 1-\frac{\alpha}{2} \) percentile in the standard normal distribution. For ex. if α = 0.05 then \( z_{(1-\frac{\alpha}{2})} = z_{(1-\frac{0.05}{2})} = z_{0.975} = 1.96 \)

Same with \( z_{(1 - \beta)} \); it's the z-value corresponding to the \( 1 - \beta \) percentile in the standard normal distribution. For ex. if the power \( 1 - \beta = 0.80 \) then \( z_{(1 - \beta)} = z_{0.8} = 0.842 \)

2. Continuous Outcome (mean values):

To calculate the minimum necessary sample sizes n1 and n2 of a study given the means \( \bar{x}_1 , \bar{x}_2 \), the common SD, the power 1-β and the ratio \( r = \frac{n_2}{n_1} \):

$$ n_1 = \left( 1 + \frac{1}{r} \right) \cdot \left(SD \cdot \frac{ z_{(1-\frac{\alpha}{2})} + z_{(1-\beta)}}{\bar{x}_2 - \bar{x}_1}\right)^2 $$ $$ n_2 = r \cdot n_1 $$

\( z_{(1-\frac{\alpha}{2})} \) and \( z_{(1 - \beta)} \) : see above.
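A minimal sketch in Python of the sample-size formula for a continuous outcome; the means, SD and ratio in the example call are invented, and the results are rounded up to whole participants.

import math
from scipy.stats import norm

def sample_size_means(mean1, mean2, sd, power=0.80, r=1.0, alpha=0.05):
    z_a = norm.ppf(1 - alpha / 2)            # e.g. 1.96 for alpha = 0.05
    z_b = norm.ppf(power)                    # z_(1-beta), e.g. 0.842 for 80 % power
    n1 = (1 + 1 / r) * (sd * (z_a + z_b) / (mean2 - mean1)) ** 2
    n2 = r * n1
    return math.ceil(n1), math.ceil(n2)      # round up to whole persons

print(sample_size_means(mean1=5.0, mean2=6.0, sd=2.5, power=0.80, r=1.0))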

Survival Analysis

Do survival analysis HERE. Calculate life tables, life curves and Kaplan-Meier curves and estimates plus log-rank test.

Survival estimates and survival curves

In survival analysis one or more risk groups are followed over time to see if they develop an outcome. The outcome in this case is often death, but could also be, more generally expressed, failure. If patients in the study do not fail/die, they can be "censored" instead, either when they withdraw from the study (while it's still in progress) or when they are still alive at the time of study termination. The latter type of censoring is called right censoring, in contrast to left censoring, which occurs when the patient is censored before study start. The most common censoring is right censoring, and the "time" for a right censored patient is then the amount of time the patient has survived in the study until censoring occurs. The status "censored" is marked with the number 0 and "death/failure" with the number 1.
For a certain individual in the study, the "time" is the time the patient has been followed in the study (survived) until the event occurs. The status is either 1 = death or 0 = censored. The patient's group number is an integer (1, 2, 3, ...) making it possible to compare the effect of different treatments between the observed groups (for eg. 1 = placebo, 2 = treatment).
The result of a follow-up study of two groups could then be the raw data table:

ID Time Status Group
A145 6 1 1
B139 8 1 2
A134 9 0 1
B131 9 1 2
B141 9 0 2
A129 7 1 1
B137 12 1 2
A133 10 0 1
A128 8 1 1
B132 12 1 2
B130 15 0 2
A136 11 0 1
A142 7 1 1

In the table, the patient with ID A145 belonged to the placebo group, survived for 6 weeks and then died, whereas the patient with ID B141 belonged to the treatment group, survived for 9 weeks and then got censored.
From the above raw data table the life tables are obtained when performing the survival analysis:

Lifetable Group 1
Ordered failure times (i) No. of failures (di) No. of censored (ci) Alive at the start of the period (ai) Number at risk
ni = ai - ci/2
Probability of failure
ri = di/ni
Probability of surviving each time period
si = 1 - ri
Cumulative prob. of surviving
Si = Si-1 × si
0 0 0 7 7 0 1 1
6 1 0 7 7 0.1429 0.8571 0.8571
7 2 0 6 6 0.3333 0.6667 0.5714
8 1 0 4 4 0.25 0.75 0.4286
9 0 1 3 2.5 0 1 0.4286
10 0 1 2 1.5 0 1 0.4286
11 0 1 1 0.5 0 1 0.4286

Lifetable Group 2
Ordered failure times (i) No. of failures (di) No. of censored (ci) Alive at the start of the period (ai) Number at risk
ni = ai - ci/2
Probability of failure
ri = di/ni
Probability of surviving each time period
si = 1 - ri
Cumulative prob. of surviving
Si = Si-1 × si
0 0 0 6 6 0 1 1
8 1 0 6 6 0.1667 0.8333 0.8333
9 1 1 5 4.5 0.2222 0.7778 0.6481
12 2 0 3 3 0.6667 0.3333 0.2160
15 0 1 1 0.5 0 1 0.2160

In the above life tables, the number at risk in each time period is obtained by subtracting half of the censored patients in the time period from the patients alive at the start of the period, assuming that the censored patients were censored on average halfway through the time period. The number alive at the start of each period is obtained by subtracting the number of deaths and censored during the previous period. The survivor function Si is the probability of surviving through time period i (and longer), given that one has survived up until then. It is obtained by multiplying the initial Si with all subsequent si. For eg. in group 2, the chance of surviving 9 weeks (or more) is S = 1 × 0.8333 × 0.7778 = 0.6481.
The standard error and 95% confidence interval of each single survivor probability Si is:

$$ \left[ (S_i)^{\text{e}^{1.96 \times se}} : (S_i)^{\text{e}^{-1.96 \times se}} \right] , \text{where} \:\: se = \sqrt{ \frac{ \sum_{j=1}^i{\frac{d_j}{n_j(n_j - d_j)}} }{ \left( \sum_{j=1}^i{\log(\frac{n_j-d_j}{n_j})} \right)^2 } } $$

The sums in the expression of se are over all time periods (rows) in the above table up until and including the particular time ti.

After the calculation of the life tables for each of the groups, the survival curve of the survivor function S(i) can be drawn by connecting the dots ( ti, S(i) ) with lines, where the ti's are the times in the life table and the S(i)'s are the probabilities of the survivor function.
The above approach is mostly used when the exact time of each event is unknown, but the number of events within each time period (interval) is known. Thus, the times in the left side of a life table would often be 0, 1, 2, 3, ... etc., that is, intervals of the same length of 1 day (or week or month). If the exact times of the events are known, however, it's more appropriate to calculate the Kaplan-Meier estimate and the Kaplan-Meier curve, see below.

Kaplan-Meier estimates and Kaplan-Meier curves

The Kaplan-Meier estimate for each group involved in the study is calculated using the following procedure:

Kaplan-Meier estimate of the survivor function S(t) for Group 2
Times No. at risk at time of event(s) (nt) No. of deaths at time t (dt) No. censored at time t (ct) Risk of death/failure
rt = dt / nt
Probability of survival
st = 1 - rt
Survivor function
S(t) = S(tprevious) × st
0 6 0 0 0 1 1
8 6 1 0 0.1667 0.8333 0.8333
9 5 1 1 0.2 0.8 0.6667
12 3 2 0 0.6667 0.3333 0.2222
15 1 0 1 0 1 0.2222

Here, the number at risk at each time period is obtained by subtracting all dead and censored at the previous time. The probabilities of the survivor function S(t) for a specific time t is obtained like in the above method of the life tables by multiplying the value of S(t) from the previous time by the probability of surviving the current time period S(t) = S(tprevious) × s(t).
From this Kaplan-Meier estimate of the survivor function the Kaplan-Meier curve is drawn. It is a step function connecting the points (t, S(t)).
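The Kaplan-Meier estimate can be computed directly from (time, status) pairs. The sketch below (plain Python, no libraries) uses the group 2 data from the raw data table above and reproduces the survivor-function column of the Kaplan-Meier table.

def kaplan_meier(times, statuses):
    # statuses: 1 = death/failure, 0 = censored
    data = list(zip(times, statuses))
    s = 1.0
    curve = [(0, 1.0)]
    for t in sorted({t for t, st in data if st == 1}):         # distinct event times
        d_t = sum(1 for tt, st in data if tt == t and st == 1)  # deaths at time t
        n_t = sum(1 for tt, _ in data if tt >= t)               # still at risk at time t
        s *= 1 - d_t / n_t                                      # S(t) = S(t_previous) * s_t
        curve.append((t, s))
    return curve

# Group 2 from the raw data table above
times    = [8, 9, 9, 12, 12, 15]
statuses = [1, 1, 0, 1, 1, 0]
print(kaplan_meier(times, statuses))   # ends with S(12) = 0.2222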

[Illustration: Kaplan-Meier survival curves for the two groups.]

Mantel-Cox hazard ratio estimate

Between two groups involved in a study (for eg. treatment vs. placebo) it's possible to calculate the Mantel-Cox hazard ratio estimate. The ratio between the two rates is calculated under the assumption that each rate is constant during the entire period of time, and that the curves of the survivor functions of the two groups don't cross each other, i.e. one is constantly above the other. The calculation of the Mantel-Cox estimate of the hazard ratio is done with this formula:

$$ HR_{MC} = \frac{Q}{R} , \text{ where } Q = \sum_{j=1}^J { \frac{d_{1j} \times h_{0j}}{n_j} } \text{ and } R = \sum_{j=1}^J { \frac{d_{0j} \times h_{1j}}{n_j} } $$

In the formula, J is the total number of distinct times (days, weeks, months, ...) in the study (the number of rows in the above table) and j is the index of each specific time in the left table column, for eg. t = 7 would have index j = 3. h0j is the number of patients that survived time period j in the first group, d1j is the number who died during time period j in the second group, and vice versa with h1j and d0j. nj is the total number of patients at risk in both groups at the start of time j.

The 95% confidence interval and standard error of the HRMC are:

$$ \left[ HR_{MC} \times e^{-1.96 \times se} : HR_{MC} \times e^{1.96 \times se} \right] , $$ $$ \text{ where } se = \sqrt{\frac{V}{Q \times R}} \text{ and } V = \sum_{j=1}^J {\frac{d_j \times (n_j - d_j) \times n_{1j} \times n_{0j}}{n_j^2(n_j - 1)}} $$

In the formula, dj is the sum of the dead in both groups at each specific time j. n1j and n0j are the no. of patients at risk in group 1 and 0 respectively at start of time j.

Log-rank test of equality between the rates of two groups

It can be tested whether the hazard rates, and hence the survival rates, of two groups can be assumed to be equal. This is done with the Mantel-Cox chi-square test, also known as the log-rank test. It is (as the name indicates) a chi-square (χ2) test where the observed numbers of deaths in the groups are compared with the expected numbers of deaths in the groups, if the rates had been the same in the groups. The chi-square number is calculated as follows:

$$ \chi^2 = \frac{U^2}{V} , \text{ where } U = \sum_{j=1}^J {\left( d_{1j} - \frac{d_j \times n_{1j}}{n_j} \right) } $$

The corresponding p-value is then obtained from the chi-square distribution as the area under the curve of the chi-square probability density function from χ2 to infinity. If the p-value is below 0.05 we can reject the null hypothesis that the rates of the two groups are equal. In that case the rates are significantly different from each other at a 5% significance level. The degrees of freedom (DF) is in this case always 1 (since there are two groups).
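Finally, a minimal sketch in Python of the log-rank (Mantel-Cox) χ2 test for two groups, using the U and V sums above and scipy's chi-square distribution for the p-value; the example call uses the two groups from the raw data table.

from scipy.stats import chi2

def log_rank(times1, status1, times2, status2):
    all_times = list(times1) + list(times2)
    all_status = list(status1) + list(status2)
    event_times = sorted({t for t, s in zip(all_times, all_status) if s == 1})
    U = V = 0.0
    for t in event_times:
        n1 = sum(1 for tt in times1 if tt >= t)     # at risk in group 1 at time t
        n0 = sum(1 for tt in times2 if tt >= t)     # at risk in group 2 at time t
        d1 = sum(1 for tt, s in zip(times1, status1) if tt == t and s == 1)
        d0 = sum(1 for tt, s in zip(times2, status2) if tt == t and s == 1)
        n, d = n1 + n0, d1 + d0
        U += d1 - d * n1 / n                        # observed minus expected deaths
        if n > 1:
            V += d * (n - d) * n1 * n0 / (n ** 2 * (n - 1))
    chi_sq = U ** 2 / V
    return chi_sq, chi2.sf(chi_sq, df=1)            # p-value with DF = 1

# Group 1 and group 2 from the raw data table above
t1, s1 = [6, 9, 7, 10, 8, 11, 7], [1, 0, 1, 0, 1, 0, 1]
t2, s2 = [8, 9, 9, 12, 12, 15], [1, 1, 0, 1, 1, 0]
print(log_rank(t1, s1, t2, s2))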