Logistic regression tips: correlation between variables

This post describes how to choose a method for measuring the association between two variables, depending on whether the predictor and the outcome are categorical or continuous.

| Predictor variable | Outcome: categorical | Outcome: continuous |
| --- | --- | --- |
| Categorical | Chi-square, log-linear, logistic | t-test, ANOVA (analysis of variance), linear regression |
| Continuous | Logistic regression | Linear regression, analysis of covariance |
| Mixture of categorical and continuous | Logistic regression | Linear regression, analysis of covariance |
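As a rough illustration of how the cells of this table map onto base R functions, here is a minimal sketch. The `mtcars` dataset and the variable choices are assumptions made for the example and are not the data used in this post.

```r
# Illustrative only: mtcars ships with base R and is not the post's data.
cars <- transform(mtcars,
                  am  = factor(am, labels = c("automatic", "manual")),
                  vs  = factor(vs, labels = c("v-shaped", "straight")),
                  cyl = factor(cyl))

# Categorical predictor, categorical outcome: chi-square test
chi_fit <- chisq.test(table(cars$vs, cars$am))

# Categorical predictor, continuous outcome: one-way ANOVA
aov_fit <- aov(mpg ~ cyl, data = cars)

# Continuous predictor, categorical (binary) outcome: logistic regression
logit_fit <- glm(am ~ wt, data = cars, family = binomial)

# Continuous predictor, continuous outcome: linear regression
lm_fit <- lm(mpg ~ wt, data = cars)

# Mixed predictors, continuous outcome: analysis of covariance
ancova_fit <- lm(mpg ~ wt + cyl, data = cars)
```

Each fitted object can be inspected with `summary()` or by printing it.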

ANOVA

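The independent variable must be categorical with at least two levels ($level \geq 2$, so a two-level discrete variable also works), and the dependent variable must be continuous. In R, the aov function computes a one-way ANOVA.

aov(Y ~ X, data = mydata) # function call: Y is continuous, X is a factor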
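To make the call concrete, here is a small runnable sketch. It uses the base R `PlantGrowth` dataset as a stand-in (an assumption for illustration, not the data analysed in this post), with `weight` as the continuous outcome and `group` as a three-level factor.

```r
# PlantGrowth is a base R dataset, used here only for illustration.
data(PlantGrowth)

# One-way ANOVA: continuous outcome ~ categorical predictor
fit <- aov(weight ~ group, data = PlantGrowth)

# summary() returns a list; its first element is the ANOVA table
anova_table <- summary(fit)[[1]]
print(anova_table)

# Pull out the F statistic and p-value for the predictor (first row)
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
```

The first row of the table belongs to the predictor and the second to the residuals.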

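As an example, we investigate the relationship between continent and chest disease.

Null hypothesis ($H_0$): the means of all seven continents are equal, i.e. there is no relationship between continent and chest disease.

Alternative hypothesis ($H_1$): there is a relationship, i.e. at least one continent's mean differs.

Calling the aov function and then the summary function gives: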

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| `gapCleaned$continent` | 6 | 52531 | 8755 | 40.28 | <2e-16 *** |
| Residuals | 166 | 36083 | 217 | | |

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here the F statistic = variation among sample means (the between-group mean square) / variation within groups (the within-group mean square).

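The F statistic shows whether the variation among sample means dominates the variation within groups: the larger the F value, the larger the difference between groups. In the table above, F is 40.28, which means that the variation in chest disease between continents is far larger than the variation within each continent, and the p-value is below 0.05. We therefore reject $H_0$ and accept $H_1$.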
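The F value can be checked by hand from the sums of squares and degrees of freedom in the ANOVA table above. The sketch below copies those four numbers from the table; the variable names are only illustrative.

```r
# Values copied from the ANOVA table above
ss_between <- 52531; df_between <- 6    # gapCleaned$continent row
ss_within  <- 36083; df_within  <- 166  # Residuals row

# Mean square = sum of squares / degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic = between-group mean square / within-group mean square
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 2))
print(p_value < 0.05)
```

The rounded F value agrees with the 40.28 reported by summary, and the p-value falls well below the 0.05 threshold.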

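So far, however, we only know that the group means are not all equal. We do not yet know which pairs of groups differ. A post hoc test answers this question: the p-value for each pair indicates whether that difference is significant.

In R this is done with tuk <- TukeyHSD(aov_model). The result is shown in the table below.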

| | diff | lwr | upr | p adj |
| --- | --- | --- | --- | --- |
| AS-AF | 0.4953571 | -8.986848 | 9.9775626 | 0.9999987 |
| EE-AF | 25.4248377 | 14.352007 | 36.4976680 | 0.0000000 |
| LATAM-AF | 12.6875000 | 2.501977 | 22.8730225 | 0.0050172 |

diff giving the difference in the observed means

lwr giving the lower end point of the interval

upr giving the upper end point

p adj giving the p-value after adjustment for the multiple comparisons (when p < 0.05, the pairwise null hypothesis $H_{0}^{'}$, that the two groups do not differ significantly, is rejected)

The result can be plotted with plot(tuk).
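The post hoc step can be sketched end to end as follows. The base R `PlantGrowth` dataset again stands in for the post's data (an assumption for illustration), and the 0.05 cutoff follows the rule described above.

```r
# Fit the one-way ANOVA first; TukeyHSD works on the fitted aov object
fit <- aov(weight ~ group, data = PlantGrowth)

# Tukey's honest significant differences for every pair of groups
tuk <- TukeyHSD(fit, conf.level = 0.95)

# The result is a list with one matrix per factor,
# with columns diff, lwr, upr and "p adj"
pairs <- tuk$group
print(pairs)

# Keep only the pairs whose adjusted p-value is below 0.05
significant <- pairs[pairs[, "p adj"] < 0.05, , drop = FALSE]
print(significant)

# plot(tuk) draws one confidence interval per pair;
# intervals that do not cross zero correspond to significant pairs
```

Filtering on the `p adj` column is just a convenient way to read off which pairwise differences are significant.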

T-test

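Use the t-test when the independent variable has exactly two levels ($level = 2$) and the dependent variable is continuous. In R, the t.test function performs the test. (It requires the data to follow a normal distribution, however, so it is not widely used.)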

t.test(Y~X, data=mydata)
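A minimal runnable sketch of this call, including a rough check of the normality requirement mentioned above. The base R `ToothGrowth` dataset and the use of `shapiro.test` are assumptions made for illustration, not part of the original analysis.

```r
# ToothGrowth is a base R dataset, used here only for illustration:
# len is continuous, supp is a factor with two levels (OJ, VC).
data(ToothGrowth)

# Rough normality check within each group (Shapiro-Wilk test);
# a small p-value suggests the group is not normally distributed
normality <- tapply(ToothGrowth$len, ToothGrowth$supp,
                    function(x) shapiro.test(x)$p.value)
print(normality)

# Two-sample t-test; R defaults to the Welch variant (var.equal = FALSE)
tt <- t.test(len ~ supp, data = ToothGrowth)

print(tt$statistic)  # t value
print(tt$p.value)    # p-value for the difference in means
print(tt$conf.int)   # confidence interval for the difference
```

If the normality check fails badly, the t-test result should be read with caution.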

to be continued…
