R - How to ignore linearly correlated variables introduced by factor reference cell coding
Assume we have a dataset containing two categorical predictor variables (a, b) and a binary target variable (y):
    > df <- data.frame(
    >   a = factor(c("cat1","cat2","cat3","cat1","cat2")),
    >   b = factor(c("cat1","cat1","cat3","cat2","cat2")),
    >   y = factor(c(T,F,T,F,T))
    > )
The following logical relations exist in the data:

    if (a == cat3) then (b == cat3 and y == true)
    else if (a == b) then (y == true)
    else (y == false)
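As a quick sanity check (a sketch assuming the df defined above, where y is stored as a factor with levels "FALSE"/"TRUE"), the rule can be verified against the toy data:

    # Truth value predicted by the stated rule for each row
    rule <- with(df, a == "cat3" | as.character(a) == as.character(b))
    # Compare against the stored target; should print TRUE
    all(rule == (df$y == "TRUE"))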
I want to use glm to build a model on this dataset. glm automatically applies reference cell coding to the categorical variables a and b, and it takes care of finding the right number of codes for each factor variable so that no aliased variables are introduced (explained here).
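As an illustration (a sketch using the toy df above), the default treatment contrasts and the resulting design matrix can be inspected directly:

    # Default reference cell (treatment) coding: a factor with k levels
    # becomes k-1 dummy columns, with the first level as the reference cell
    contrasts(df$a)
    # The expanded design matrix glm builds from the formula y ~ a + b
    model.matrix(~ a + b, data = df)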
However, it can happen, as in the dataset above, that a linear relationship exists between one reference code generated for variable a and one reference code of variable b.
See the output of the model:
    > model <- glm(y ~ ., family=binomial(link='logit'), data=df)
    > summary(model)
    ...
    Coefficients: (1 not defined because of singularities)
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept)  1.965e-16  1.732e+00   0.000    1.000
    acat2       -2.396e-16  2.000e+00   0.000    1.000
    acat3        1.857e+01  6.523e+03   0.003    0.998
    bcat2        0.000e+00  2.000e+00   0.000    1.000
    bcat3               NA         NA      NA       NA   # <- get rid of this?
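The NA row appears because, in this particular dataset, the dummy column for bcat3 is an exact copy of the one for acat3; a quick check (sketch) makes the dependence visible:

    # The acat3 and bcat3 dummy columns coincide in this data, which is
    # exactly the linear dependence reported as a singularity
    mm <- model.matrix(~ a + b, data = df)
    all(mm[, "acat3"] == mm[, "bcat3"])      # TRUE
    # Aliased coefficients can also be located directly from the fit
    names(coef(model))[is.na(coef(model))]   # "bcat3"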
How should I handle this case? Is there a way to tell glm to omit some of the generated reference codes? In my real problem the "cat3" value corresponds to NA, and I have two meaningful factor variables that are NA in the same instances of the dataset.
Edit: The accepted answer solves the question; however, in my specific case the singularities can simply be ignored, as pointed out in the comments.
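For completeness, a minimal check (sketch) of why the singularity can be ignored in practice: R drops the aliased column internally, so the fit and its predictions are unaffected by the NA coefficient.

    # Despite the NA coefficient for bcat3, the model still produces
    # fitted probabilities; the aliased column is simply dropped
    fitted(model)
    predict(model, type = "response")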
Although the comments made under the question are pertinent, it may still be useful to try eliminating the NA model matrix columns so that one can compare against not doing such elimination, in order to satisfy oneself regarding the equivalence. In particular, run glm twice, removing the redundant model matrix columns on the second run:
    model <- glm(y ~ ., family=binomial(link='logit'), data=df)  # model in question
    mm <- model.matrix(model)[, !is.na(coef(model))]
    df0 <- data.frame(y = df$y, mm[, -1])
    update(model, data = df0)
giving:
    Call:  glm(formula = y ~ ., family = binomial(link = "logit"), data = df0)

    Coefficients:
    (Intercept)        acat2        acat3        bcat2
      1.965e-16   -2.396e-16    1.857e+01    0.000e+00

    Degrees of Freedom: 4 Total (i.e. Null);  1 Residual
    Null Deviance:      6.73
    Residual Deviance: 5.545        AIC: 13.55
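To satisfy oneself regarding the equivalence mentioned above, the two fits can be compared directly (a sketch, assuming model and df0 as defined earlier):

    # Refit without the redundant column and compare with the original fit;
    # the fitted probabilities agree, only the NA coefficient disappears
    model2 <- update(model, data = df0)
    all.equal(unname(fitted(model)), unname(fitted(model2)))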
Note that if you don't want to use the fact that you know the response is named y, you can extract the response and its name, replacing the assignment to df0 above with:
    df0 <- data.frame(model.response(model.frame(model)), mm[, -1])
    names(df0)[1] <- as.character(attr(terms(model), "variables")[[2]])
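Here model.response(model.frame(model)) pulls the response vector out of the fitted model's frame, and attr(terms(model), "variables")[[2]] recovers the response's name from the terms object, so the rebuilt data frame carries the same column name without it being hard-coded.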