R - How to ignore linearly correlated variables introduced by factor reference cell coding
Assume we have a dataset containing two categorical predictor variables (a, b) and a binary target variable (y):
    > df <- data.frame(
    >   a = factor(c("cat1","cat2","cat3","cat1","cat2")),
    >   b = factor(c("cat1","cat1","cat3","cat2","cat2")),
    >   y = factor(c(T,F,T,F,T))
    > )
The following logical relations exist in the data:

    if (a == cat3) then (b == cat3 and y == true)
    else if (a == b) then (y == true)
    else (y == false)
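As a quick sanity check (a sketch assuming the df defined above, where y is stored as a factor with levels "FALSE"/"TRUE"), the rule can be verified against the toy data:

    # Truth value predicted by the stated rule for each row
    rule <- with(df, a == "cat3" | as.character(a) == as.character(b))
    # Compare against the stored target; should print TRUE
    all(rule == (df$y == "TRUE"))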
I want to use glm to build a model on this dataset. glm automatically applies reference cell coding to the categorical variables a and b, and it takes care of finding the right number of codes for each factor variable so that no aliased variables are introduced (explained here).
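As an illustration (a sketch using the toy df above), the default treatment contrasts and the resulting design matrix can be inspected directly:

    # Default reference cell (treatment) coding: a factor with k levels
    # becomes k-1 dummy columns, with the first level as the reference cell
    contrasts(df$a)
    # The expanded design matrix glm builds from the formula y ~ a + b
    model.matrix(~ a + b, data = df)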
However, it can happen, as in the dataset above, that a linear relationship exists between one reference code generated for variable a and one reference code of variable b.
See the output of the model:
    > model <- glm(y ~ ., family=binomial(link='logit'), data=df)
    > summary(model)
    ...
    Coefficients: (1 not defined because of singularities)
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept)  1.965e-16  1.732e+00   0.000    1.000
    acat2       -2.396e-16  2.000e+00   0.000    1.000
    acat3        1.857e+01  6.523e+03   0.003    0.998
    bcat2        0.000e+00  2.000e+00   0.000    1.000
    bcat3               NA         NA      NA       NA   # <- get rid of this?
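The NA row appears because, in this particular dataset, the dummy column for bcat3 is an exact copy of the one for acat3; a quick check (sketch) makes the dependence visible:

    # The acat3 and bcat3 dummy columns coincide in this data, which is
    # exactly the linear dependence reported as a singularity
    mm <- model.matrix(~ a + b, data = df)
    all(mm[, "acat3"] == mm[, "bcat3"])      # TRUE
    # Aliased coefficients can also be located directly from the fit
    names(coef(model))[is.na(coef(model))]   # "bcat3"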
How should I handle this case? Is there a way to tell glm to omit some of the generated reference codes? In my real problem the "cat3" value corresponds to NA, and I have two meaningful factor variables that are NA in the same instances of the dataset.
Edit: The accepted answer solves the question; however, in my specific case the singularities can simply be ignored, as pointed out in the comments.
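For completeness, a minimal check (sketch) of why the singularity can be ignored in practice: R drops the aliased column internally, so the fit and its predictions are unaffected by the NA coefficient.

    # Despite the NA coefficient for bcat3, the model still produces
    # fitted probabilities; the aliased column is simply dropped
    fitted(model)
    predict(model, type = "response")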
Although the comments made under the question are pertinent, it may still be useful to try eliminating the NA model matrix columns so that one can compare against not doing such elimination, in order to satisfy oneself regarding the equivalence. In particular, run glm twice, removing the redundant model matrix columns on the second run:
    model <- glm(y ~ ., family=binomial(link='logit'), data=df)  # model in question
    mm <- model.matrix(model)[, !is.na(coef(model))]
    df0 <- data.frame(y = df$y, mm[, -1])
    update(model, data = df0)
giving:
    Call:  glm(formula = y ~ ., family = binomial(link = "logit"), data = df0)

    Coefficients:
    (Intercept)        acat2        acat3        bcat2
      1.965e-16   -2.396e-16    1.857e+01    0.000e+00

    Degrees of Freedom: 4 Total (i.e. Null);  1 Residual
    Null Deviance:      6.73
    Residual Deviance: 5.545        AIC: 13.55
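To satisfy oneself regarding the equivalence mentioned above, the two fits can be compared directly (a sketch, assuming model and df0 as defined earlier):

    # Refit without the redundant column and compare with the original fit;
    # the fitted probabilities agree, only the NA coefficient disappears
    model2 <- update(model, data = df0)
    all.equal(unname(fitted(model)), unname(fitted(model2)))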
Note that if you don't want to use the fact that you know the response is named y, you can extract the response and its name, replacing the assignment to df0 above with:
    df0 <- data.frame(model.response(model.frame(model)), mm[, -1])
    names(df0)[1] <- as.character(attr(terms(model), "variables")[[2]])
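Here model.response(model.frame(model)) pulls the response vector out of the fitted model's frame, and attr(terms(model), "variables")[[2]] recovers the response's name from the terms object, so the rebuilt data frame carries the same column name without it being hard-coded.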