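I am trying to estimate which factors determine the difference in happiness levels between New Yorkers and Chicagoans.
The data look like this.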
  Happiness     City Gender Employment   Worktype      Holiday
1        60 New York      0          0 Unemployed   Unemployed
2        80  Chicago      1          1 Whitecolor 1 day a week
3        39  Chicago      0          0 Unemployed   Unemployed
4        40 New York      1          0 Unemployed   Unemployed
5        69  Chicago      1          1  Bluecolor 2 day a week
6        90  Chicago      1          1  Bluecolor 2 day a week
7       100 New York      0          1 Whitecolor 2 day a week
8        30 New York      1          1 Whitecolor 1 day a week
Happiness is the dependent variable, and City is where the person lives. Gender is coded 0 = male, 1 = female. Employment is 0 = unemployed, 1 = employed. Worktype is a factor with three levels: "Unemployed", "Whitecolor" and "Bluecolor". Holiday is how many days off per week a person has. Here City, Gender, Worktype and Holiday are factors, while Happiness and Employment are numeric.
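The model I want to estimate is
lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))
I treat Employment as numeric, so that when Employment equals 0 (unemployed), 0:(Worktype + Holiday) = 0 and the model automatically reduces to
lm(Happiness ~ City + Gender)
for the unemployed.
However, the regression returns NA values.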
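Coefficients: (2 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                       56.75      23.56   2.408    0.138
CityNew York                     -14.50      27.21  -0.533    0.647
Gender1                           -2.25      35.99  -0.063    0.956
Employment:WorktypeBluecolor      25.00      43.02   0.581    0.620
Employment:WorktypeUnemployed        NA         NA      NA       NA
Employment:WorktypeWhitecolor     57.75      35.99   1.604    0.250
Employment:Holiday1 day a week   -50.00      54.42  -0.919    0.455
Employment:Holiday2 day a week       NA         NA      NA       NA
This seems to be caused by the "Unemployed" value in the Worktype and Holiday variables. However, I don't understand why R doesn't treat Employment:WorktypeUnemployed (which is clearly 0:Worktype = 0) as zero and drop it from the model. Is it because R is using Employment:HolidayUnemployed as the baseline and the two are perfectly multicollinear? (I had to add the "Unemployed" value to Worktype and Holiday because I want to see the effect of Worktype and Holiday relative to the unemployed. If I remove the "Unemployed" value the NAs disappear, but the baseline becomes "Whitecolor" and "1 day a week", so I can't see the effect relative to "Unemployed".)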
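If so, why do I get NA for the Employment:Holiday2 day a week coefficient? It seems unrelated to the "Unemployed" value.
Can I rely on this result and simply drop the NA coefficients?
Reproducible code is below.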
Happiness <- c(60, 80, 39, 40, 69, 90, 100, 30)
City <- as.factor(c("New York", "Chicago", "Chicago", "New York", "Chicago",
"Chicago", "New York", "New York"))
Gender <- as.factor(c(0, 1, 0, 1, 1, 1, 0, 1)) # 0 = man, 1 = woman.
Employment <- c(0, 1, 0, 0, 1, 1, 1, 1) # 0 = unemployed, 1 = employed.
Worktype <- as.factor(c("Unemployed", "Whitecolor", "Unemployed",
"Unemployed", "Bluecolor", "Bluecolor", "Whitecolor","Whitecolor"))
Holiday <- as.factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")
data <- data.frame(Happiness, City, Gender, Employment, Worktype, Holiday)
head(data,8)
str(data)
reg <- lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))
summary(reg)
Posted on 2017-12-26 11:51:32
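You shouldn't worry about the NA for Employment:WorktypeUnemployed. R tries to compute all the interactions automatically, but this particular coefficient is left undefined because Employment = 1 and Worktype = "Unemployed" never occur together, so its column in the model matrix is all zeros. It has no effect on how the other coefficients are computed. You can verify this by coding the dummy variables manually: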
> library(lme4) # for the convenient "dummy" function
> data <- data.frame(data,
+ dummy(Worktype, c("Bluecolor","Whitecolor")),
+ h1=dummy(Holiday)[,1],
+ h2=dummy(Holiday)[,2])
>
> reg <- lm(Happiness ~ City + Gender + Employment:Bluecolor + Employment:Whitecolor + Employment:h1 + Employment:h2 , data)
> summary(reg)
Call:
lm(formula = Happiness ~ City + Gender + Employment:Bluecolor +
    Employment:Whitecolor + Employment:h1 + Employment:h2, data = data)

Residuals:
         1          2          3          4          5          6          7          8
 1.775e+01  1.775e+01 -1.775e+01  8.882e-16 -1.050e+01  1.050e+01  4.441e-15 -1.775e+01

Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)              56.75      23.56   2.408    0.138
CityNew York            -14.50      27.21  -0.533    0.647
Gender1                  -2.25      35.99  -0.063    0.956
Employment:Bluecolor     25.00      43.02   0.581    0.620
Employment:Whitecolor    57.75      35.99   1.604    0.250
Employment:h1           -50.00      54.42  -0.919    0.455
Employment:h2               NA         NA      NA       NA

Residual standard error: 27.21 on 2 degrees of freedom
Multiple R-squared: 0.6798, Adjusted R-squared: -0.1208
F-statistic: 0.8491 on 5 and 2 DF, p-value: 0.619
The estimated coefficients are the same even though Employment:WorktypeUnemployed is no longer in the model.
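The same point can be checked in base R without lme4. This is a minimal sketch using only the Employment and Worktype vectors from the question; `unemp_col` is a hypothetical name for the hand-built interaction column, not something from the original post.

```r
# Employment and Worktype as given in the question.
Employment <- c(0, 1, 0, 0, 1, 1, 1, 1)
Worktype <- factor(c("Unemployed", "Whitecolor", "Unemployed", "Unemployed",
                     "Bluecolor", "Bluecolor", "Whitecolor", "Whitecolor"))

# Cross-tabulation: the cell (Employment = 1, Worktype = "Unemployed") is empty.
table(Employment, Worktype)

# The interaction column is Employment times the "Unemployed" indicator
# (unemp_col is a hypothetical helper name). Every entry is zero, so lm()
# cannot estimate a coefficient for it and reports NA.
unemp_col <- Employment * (Worktype == "Unemployed")
unemp_col
```

An all-zero column carries no information, which is why dropping it leaves the remaining estimates unchanged.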
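However, the NA for Employment:h2 (equivalent to Employment:Holiday2 day a week) is still there. This is because, in this reduced dataset, you end up with a singular model matrix (i.e. one column is a linear combination of other columns). Here every employed person has exactly one of Bluecolor/Whitecolor and exactly one of h1/h2, so Employment:h1 + Employment:h2 equals Employment:Bluecolor + Employment:Whitecolor, and one of those four columns is redundant.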
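> solve(crossprod(model.matrix(reg)))
Error in solve.default(crossprod(model.matrix(reg))) :
  system is computationally singular: reciprocal condition number = 1.79897e-18
So this problem may not occur in a larger dataset, provided the extra observations break the linear dependency among the columns. Ultimately, you could try dropping any redundancy from the model (for example, are there any employed people with 0 days off per week? If not, 1 day should be the baseline, and you would add extra columns only for Holiday > 1 day). You can use the alias() function to check which terms are causing the problem.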
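As a rough illustration of those two suggestions, the sketch below runs alias() on the question's model, compares the rank of the model matrix with its number of columns, and then refits with "1 day a week" as the baseline among the employed. The helper columns `blue`, `white`, `h1`, `h2` and the model name `reg2` are hypothetical names introduced here, and the refit is only one possible way to remove the redundancy.

```r
# Data from the question.
Happiness  <- c(60, 80, 39, 40, 69, 90, 100, 30)
City       <- factor(c("New York", "Chicago", "Chicago", "New York",
                       "Chicago", "Chicago", "New York", "New York"))
Gender     <- factor(c(0, 1, 0, 1, 1, 1, 0, 1))
Employment <- c(0, 1, 0, 0, 1, 1, 1, 1)
Worktype   <- factor(c("Unemployed", "Whitecolor", "Unemployed", "Unemployed",
                       "Bluecolor", "Bluecolor", "Whitecolor", "Whitecolor"))
Holiday    <- factor(c("Unemployed", "1 day a week", "Unemployed", "Unemployed",
                       "2 day a week", "2 day a week", "2 day a week",
                       "1 day a week"),
                     levels = c("Unemployed", "1 day a week", "2 day a week"))

reg <- lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))

# 1. Ask R which terms are aliased and on which other terms they depend.
alias(reg)

# Rank of the model matrix versus its number of columns: a rank below
# ncol(X) means some columns are linear combinations of others.
X <- model.matrix(reg)
c(columns = ncol(X), rank = qr(X)$rank)

# 2. The dependency written out by hand (hypothetical helper columns).
blue  <- Employment * (Worktype == "Bluecolor")
white <- Employment * (Worktype == "Whitecolor")
h1    <- Employment * (Holiday == "1 day a week")
h2    <- Employment * (Holiday == "2 day a week")
all(h1 + h2 == blue + white)   # both sums equal Employment

# 3. One way to drop the redundancy: treat "1 day a week" as the baseline
#    among the employed, so only the 2-day indicator enters the model.
reg2 <- lm(Happiness ~ City + Gender + blue + white + h2)
coef(reg2)
```

In `reg2` the `blue` and `white` coefficients compare each work type with the unemployed for people who have one day off, and `h2` is the additional effect of a second day off. With only eight observations the estimates remain very imprecise either way.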
https://stackoverflow.com/questions/47976109