R Studio - Proportion Data - Survfit, Survreg

Jun 13, 2021

The R Book ( Michael Crawley, ISBN: 978-0-470-97392-9 ) and its chapter 16 inspired me to run the code and pick up the most suitable distribution (Parametric survival model) for given data sample. The file contains the number of deaths and the most suitable has been shown exponential distribution. Let me share the code and output, and I do not use syntactic sugar in this example.

Gamma distribution - In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. or example the gamma distribution is frequently used to model waiting times. For instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.

Erlang distribution - The Erlang distribution was developed by A. K. Erlang to examine the number of telephone calls which might be made at the same time to the operators of the switching stations.

Chi squared distribution - The chi-squared distribution is used in the common chi-squared tests for goodness of fit of an observed distribution to a theoretical one, the independence of two criteria of classification of qualitative data, and in confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation. Many other statistical tests also use this distribution, such as Friedman's analysis of variance by ranks.

Hyperbolic secant distribution - creates relationship between the distances of a point on a hyperbola to the origin and to the coordinate axes. Examples: sinh , cosh, tanh, sech, cosech, and coth. Usage of hyperbolic secant is for example in fast pulses of applied magnetical resonance.

mortality<- read.csv("~/Documents/analytics/books/data/deaths.csv")

mortality

death treatment

1 7 high

tapply(death, treatment, mean)

control high low

3.46 6.88 4.70

model <- glm(death~treatment,Gamma)

summary(model)

Call:

glm(formula = death ~ treatment, family = Gamma)

Deviance Residuals:

Min 1Q Median 3Q Max

-0.4177 -0.1393 -0.1338 0.1486 0.4266

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.289017 0.008327 34.708 < 2e-16 ***

treatmenthigh -0.143669 0.009321 -15.414 < 2e-16 ***

treatmentlow -0.076251 0.010340 -7.374 1.11e-11 ***

---

Signif. codes:

0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.04150576)

Null deviance: 17.7190 on 149 degrees of freedom

Residual deviance: 5.8337 on 147 degrees of freedom

AIC: 413.52

Number of Fisher Scoring iterations: 4

the mean value for high dose is calculated in this reciprocal function as 0.289017 - 0.143669 = 0.1453 and after 1 / 0.1453 = 6.882.

# Survival analysis with censoring

library(survival)

sheep<- read.csv("~/Documents/analytics/books/data/sheep.deaths.csv”)

attach(sheep)

names(sheep)

[1] "death" "status" "weight" "group”

sheep

death status weight group

1 20 1 5.385 A

2 34 1 7.413 A

plot(survfit(Surv(death,status)~group),col=c("lightblue","mediumblue","midnightblue"),xlab="Age at death(months)")

legend("topright", legend=c("Group A", "Group B","Group C”),

+ col=c("lightblue","mediumblue","midnightblue"), lty=1:2, cex=0.8)

model <- survreg(Surv(death,status)~weight*group,dist="exponential”)

summary(model)

Call:

survreg(formula = Surv(death, status) ~ weight * group, dist = "exponential")

Value Std. Error z p

(Intercept) 3.8702 0.3854 10.041 1.00e-23

weight -0.0803 0.0659 -1.219 2.23e-01

groupB -0.8853 0.4508 -1.964 4.95e-02

groupC -1.7804 0.4386 -4.059 4.92e-05

weight:groupB 0.0643 0.0674 0.954 3.40e-01

weight:groupC 0.0796 0.0674 1.180 2.38e-01

Scale fixed at 1

Exponential distribution

Loglik(model)= -480.6 Loglik(intercept only)= -502.1

Chisq= 43.11 on 5 degrees of freedom, p= 3.5e-08

Number of Newton-Raphson Iterations: 5

n= 150

# the second level relationships can be removed, they are not significant

model2 <- survreg(Surv(death,status)~weight+group,dist="exponential”)

anova(model,model2,test="Chi”)

Terms Resid. Df -2*LL Test Df Deviance Pr(>Chi)

1 weight * group 144 961.1800 NA NA NA

2 weight + group 146 962.9411 = -2 -1.761142 0.4145462

# with * in model is not significant, so we can leave it out and keep only +

model3 <- survreg(Surv(death,status)~group,dist="exponential”)

anova(model,model2,test="Chi”)

Terms Resid. Df -2*LL Test Df Deviance Pr(>Chi)

1 weight * group 144 961.1800 NA NA NA

2 weight + group 146 962.9411 = -2 -1.761142 0.4145462

anova(model2,model3,test="Chi”)

Terms Resid. Df -2*LL Test Df Deviance Pr(>Chi)

1 weight + group 146 962.9411 NA NA NA

2 group 147 963.9393 = -1 -0.9981333 0.3177626

# we did simplificantion model versus model2 nor simplification model2 versus model 3 is significant

model4 <- survreg(Surv(death,status)~1,dist="exponential”)

# we did extreme simplification in model4

anova(model3,model4,test="Chi”)

Terms Resid. Df -2*LL Test Df Deviance Pr(>Chi)

1 group 147 963.9393 NA NA NA

2 1 149 1004.2865 = -2 -40.34721 1.732661e-09

# simplification from model3 to model4 is highly significant so we must accept model3 as the simpliest one

summary(model3)

Call:

survreg(formula = Surv(death, status) ~ group, dist = "exponential")

Value Std. Error z p

(Intercept) 3.467 0.167 20.80 3.91e-96

groupB -0.671 0.225 -2.99 2.83e-03

groupC -1.386 0.219 -6.34 2.32e-10

Scale fixed at 1

Exponential distribution

Loglik(model)= -482 Loglik(intercept only)= -502.1

Chisq= 40.35 on 2 degrees of freedom, p= 1.7e-09

Number of Newton-Raphson Iterations: 5

n= 150

# we must keep all 3 groups A,B,C

# now we can test various distributions for the best fit

model3 <- survreg(Surv(death,status)~group,dist="exponential”)

model4 <- survreg(Surv(death,status)~group,dist="extreme”)

model5 <- survreg(Surv(death,status)~group,dist="gaussian”)

model6 <- survreg(Surv(death,status)~group,dist="logistic”)

anova(model3,model4,model5,model6)

Terms Resid. Df -2*LL Test Df Deviance Pr(>Chi)

1 group 147 963.9393 NA NA NA

2 group 146 1225.3512 = 1 -261.411949 NA

3 group 146 1178.6582 = 0 46.692975 NA

4 group 146 1173.9478 = 0 4.710457 NA

# exponential distribution is the best, with the lowest residual deviance 963.939

tapply(predict(model3,type="response"),group,mean)

A B C

32.05556 16.38636 8.02000

Sarka Pribylova

Email: sarka.pribylova@gmail.com

R Studio - Proportion Data - Survfit, Survreg

Recent Posts

Comentarios

Sarka Pribylova Email: sarka.pribylova@gmail.com

R Studio - Proportion Data - Survfit, Survreg

Recent Posts

Comentarios

Sarka Pribylova

Email: sarka.pribylova@gmail.com