top of page

R Studio - Proportion Data - Survfit, Survreg


The R Book ( Michael Crawley, ISBN: 978-0-470-97392-9 ) and its chapter 16 inspired me to run the code and pick up the most suitable distribution (Parametric survival model) for given data sample. The file contains the number of deaths and the most suitable has been shown exponential distribution. Let me share the code and output, and I do not use syntactic sugar in this example.


Gamma distribution - In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. or example the gamma distribution is frequently used to model waiting times. For instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.


Erlang distribution - The Erlang distribution was developed by A. K. Erlang to examine the number of telephone calls which might be made at the same time to the operators of the switching stations.


Chi squared distribution - The chi-squared distribution is used in the common chi-squared tests for goodness of fit of an observed distribution to a theoretical one, the independence of two criteria of classification of qualitative data, and in confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation. Many other statistical tests also use this distribution, such as Friedman's analysis of variance by ranks.


Hyperbolic secant distribution - creates relationship between the distances of a point on a hyperbola to the origin and to the coordinate axes. Examples: sinh , cosh, tanh, sech, cosech, and coth. Usage of hyperbolic secant is for example in fast pulses of applied magnetical resonance.


mortality<- read.csv("~/Documents/analytics/books/data/deaths.csv")


mortality


death treatment

1 7 high


tapply(death, treatment, mean)


control high low

3.46 6.88 4.70


model <- glm(death~treatment,Gamma)

summary(model)


Call:

glm(formula = death ~ treatment, family = Gamma)


Deviance Residuals:

Min 1Q Median 3Q Max

-0.4177 -0.1393 -0.1338 0.1486 0.4266


Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.289017 0.008327 34.708 < 2e-16 ***

treatmenthigh -0.143669 0.009321 -15.414 < 2e-16 ***

treatmentlow -0.076251 0.010340 -7.374 1.11e-11 ***

---

Signif. codes:

0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


(Dispersion parameter for Gamma family taken to be 0.04150576)


Null deviance: 17.7190 on 149 degrees of freedom

Residual deviance: 5.8337 on 147 degrees of freedom

AIC: 413.52


Number of Fisher Scoring iterations: 4


the mean value for high dose is calculated in this reciprocal function as 0.289017 - 0.143669 = 0.1453 and after 1 / 0.1453 = 6.882.



# Survival analysis with censoring


library(survival)


sheep<- read.csv("~/Documents/analytics/books/data/sheep.deaths.csv”)


attach(sheep)


names(sheep)


[1] "death" "status" "weight" "group”


sheep


death status weight group

1 20 1 5.385 A

2 34 1 7.413 A



plot(survfit(Surv(death,status)~group),col=c("lightblue","mediumblue","midnightblue"),xlab="Age at death(months)")

legend("topright", legend=c("Group A", "Group B","Group C”),

+ col=c("lightblue","mediumblue","midnightblue"), lty=1:2, cex=0.8)
















model <- survreg(Surv(death,status)~weight*group,dist="exponential”)


summary(model)


Call:

survreg(formula = Surv(death, status) ~ weight * group, dist = "exponential")

Value Std. Error z p

(Intercept) 3.8702 0.3854 10.041 1.00e-23

weight -0.0803 0.0659 -1.219 2.23e-01

groupB -0.8853 0.4508 -1.964 4.95e-02

groupC -1.7804 0.4386 -4.059 4.92e-05

weight:groupB 0.0643 0.0674 0.954 3.40e-01

weight:groupC 0.0796 0.0674 1.180 2.38e-01


Scale fixed at 1


Exponential distribution

Loglik(model)= -480.6 Loglik(intercept only)= -502.1

Chisq= 43.11 on 5 degrees of freedom, p= 3.5e-08

Number of Newton-Raphson Iterations: 5

n= 150



# the second level relationships can be removed, they are not significant


model2 <- survreg(Surv(death,status)~weight+group,dist="exponential”)


anova(model,model2,test="Chi”)



Terms Resid. Df -2*LL Test Df Deviance Pr(>Chi)

1 weight * group 144 961.1800 NA NA NA

2 weight + group 146 962.9411 = -2 -1.761142 0.4145462



# with * in model is not significant, so we can leave it out and keep only +


model3 <- survreg(Surv(death,status)~group,dist="exponential”)


anova(model,model2,test="Chi”)



Terms Resid. Df -2*LL Test Df Deviance Pr(>Chi)

1 weight * group 144 961.1800 NA NA NA

2 weight + group 146 962.9411 = -2 -1.761142 0.4145462


anova(model2,model3,test="Chi”)



Terms Resid. Df -2*LL Test Df Deviance Pr(>Chi)

1 weight + group 146 962.9411 NA NA NA

2 group 147 963.9393 = -1 -0.9981333 0.3177626



# we did simplificantion model versus model2 nor simplification model2 versus model 3 is significant


model4 <- survreg(Surv(death,status)~1,dist="exponential”)


# we did extreme simplification in model4


anova(model3,model4,test="Chi”)



Terms Resid. Df -2*LL Test Df Deviance Pr(>Chi)

1 group 147 963.9393 NA NA NA

2 1 149 1004.2865 = -2 -40.34721 1.732661e-09



# simplification from model3 to model4 is highly significant so we must accept model3 as the simpliest one


summary(model3)


Call:

survreg(formula = Surv(death, status) ~ group, dist = "exponential")

Value Std. Error z p

(Intercept) 3.467 0.167 20.80 3.91e-96

groupB -0.671 0.225 -2.99 2.83e-03

groupC -1.386 0.219 -6.34 2.32e-10


Scale fixed at 1


Exponential distribution

Loglik(model)= -482 Loglik(intercept only)= -502.1

Chisq= 40.35 on 2 degrees of freedom, p= 1.7e-09

Number of Newton-Raphson Iterations: 5

n= 150


# we must keep all 3 groups A,B,C


# now we can test various distributions for the best fit


model3 <- survreg(Surv(death,status)~group,dist="exponential”)


model4 <- survreg(Surv(death,status)~group,dist="extreme”)


model5 <- survreg(Surv(death,status)~group,dist="gaussian”)


model6 <- survreg(Surv(death,status)~group,dist="logistic”)


anova(model3,model4,model5,model6)



Terms Resid. Df -2*LL Test Df Deviance Pr(>Chi)

1 group 147 963.9393 NA NA NA

2 group 146 1225.3512 = 1 -261.411949 NA

3 group 146 1178.6582 = 0 46.692975 NA

4 group 146 1173.9478 = 0 4.710457 NA



# exponential distribution is the best, with the lowest residual deviance 963.939


tapply(predict(model3,type="response"),group,mean)


A B C

32.05556 16.38636 8.02000




54 views0 comments

Recent Posts

See All

Python - Basic regression comparison

Regression models are the principles of machine learning models as well. They help to understand the dataset distributions. The objective...

Comments


bottom of page