Exercise 4: Solutions first part
Task 1.1
Load the data in R.
dat <- read.csv("NHANES1.csv")
Task 1.2
What is the dimension of the data set? How many rows (samples), and how many columns (variables) does the data set contain? What are the variable names of the data set?
dim(dat) # dimension of the data set: first number is the number of rows, second is the number of columns
[1] 5000   60
# Alternatively:
nrow(dat) # number of rows
[1] 5000
ncol(dat) # number of columns
[1] 60
names(dat) # variable names
[1] "seqn" "cd" "pb" "hg"
[5] "hdl" "hivpos" "weight" "height"
[9] "bmi" "rr_sys" "rr_dia" "srhgnrl"
[13] "srphbad_prv30d" "srmhbad_prv30d" "adlimp_prv30d" "age"
[17] "educ" "martlst" "male" "ethnic"
[21] "increl" "asthma_ever" "asthma_now" "ovrwght_ever"
[25] "arthrit_ever" "stroke_ever" "livdis_ever" "cbronch_now"
[29] "livdis_now" "cancer_ever" "rel_heartdis" "rel_asthma"
[33] "rel_diab" "heartdis_ever" "lungpath_ever" "diab_lft"
[37] "hrsworked_prvwk" "jobstat_lwk" "wrkt_irreg" "workpollut"
[41] "sleep_dur" "sleep_probl" "cannab_ever" "harddrg_ever"
[45] "drnkprd_prv12mo" "alc_lft" "cigsprd_prv30d" "smokstat"
[49] "rdyfood_prvmo" "frzfood_prvmo" "milk_month" "phq1"
[53] "phq2" "phq3" "phq4" "phq5"
[57] "phq6" "phq7" "phq8" "phq9"
colnames(dat) # works too
[1] "seqn" "cd" "pb" "hg"
[5] "hdl" "hivpos" "weight" "height"
[9] "bmi" "rr_sys" "rr_dia" "srhgnrl"
[13] "srphbad_prv30d" "srmhbad_prv30d" "adlimp_prv30d" "age"
[17] "educ" "martlst" "male" "ethnic"
[21] "increl" "asthma_ever" "asthma_now" "ovrwght_ever"
[25] "arthrit_ever" "stroke_ever" "livdis_ever" "cbronch_now"
[29] "livdis_now" "cancer_ever" "rel_heartdis" "rel_asthma"
[33] "rel_diab" "heartdis_ever" "lungpath_ever" "diab_lft"
[37] "hrsworked_prvwk" "jobstat_lwk" "wrkt_irreg" "workpollut"
[41] "sleep_dur" "sleep_probl" "cannab_ever" "harddrg_ever"
[45] "drnkprd_prv12mo" "alc_lft" "cigsprd_prv30d" "smokstat"
[49] "rdyfood_prvmo" "frzfood_prvmo" "milk_month" "phq1"
[53] "phq2" "phq3" "phq4" "phq5"
[57] "phq6" "phq7" "phq8" "phq9"
Task 1.3
All variables in the data set are of class integer, numeric, or logical (i.e., boolean). However, some of them encode categories and should be factors rather than numeric variables. Which ones?
str(dat)
'data.frame': 5000 obs. of 60 variables:
$ seqn : int 64780 65837 67799 71052 64130 70941 71372 68669 68367 62777 ...
$ cd : num 0.98 4.54 1.6 4.36 3.91 1.87 0.98 4.27 4.45 1.69 ...
$ pb : num 0.017 0.08 0.009 0.026 0.029 0.07 0.03 0.05 0.048 0.181 ...
$ hg : num 3 2.2 0.5 6.3 6.9 11.4 1.4 3.4 25.2 45.5 ...
$ hdl : num 1.81 1.03 1.16 1.22 1.01 1.22 1.34 1.16 1.11 2.38 ...
$ hivpos : logi FALSE NA FALSE FALSE NA FALSE ...
$ weight : num NA 94.5 94.9 57.2 66.4 80.5 88.3 93.9 90.9 92.1 ...
$ height : num NA 174 158 156 167 ...
$ bmi : num NA 31.2 37.8 23.7 23.9 27 32.5 29.3 31.7 25.4 ...
$ rr_sys : num 102 154 117 111 153 ...
$ rr_dia : num 11.3 68.7 68 17.3 77.3 ...
$ srhgnrl : int 4 3 3 3 3 NA 1 4 3 2 ...
$ srphbad_prv30d : int 2 2 1 2 3 NA 1 2 1 1 ...
$ srmhbad_prv30d : int 2 1 2 2 1 NA 1 2 1 3 ...
$ adlimp_prv30d : int 2 1 1 2 1 NA 1 2 1 3 ...
$ age : int 18 66 38 18 69 45 37 51 78 28 ...
$ educ : int NA 4 4 NA 2 4 1 2 2 5 ...
$ martlst : int NA 3 1 NA 1 1 6 5 1 5 ...
$ male : logi FALSE TRUE FALSE FALSE TRUE TRUE ...
$ ethnic : int 3 3 2 1 4 2 1 2 3 4 ...
$ increl : int 2 NA 2 NA NA 3 1 1 3 2 ...
$ asthma_ever : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ asthma_now : logi FALSE NA NA NA NA NA ...
$ ovrwght_ever : logi FALSE TRUE TRUE FALSE FALSE FALSE ...
$ arthrit_ever : logi NA FALSE TRUE NA TRUE FALSE ...
$ stroke_ever : logi NA TRUE FALSE NA FALSE FALSE ...
$ livdis_ever : logi NA FALSE FALSE NA FALSE FALSE ...
$ cbronch_now : logi NA NA NA NA NA NA ...
$ livdis_now : logi NA NA NA NA NA NA ...
$ cancer_ever : logi NA FALSE FALSE NA FALSE FALSE ...
$ rel_heartdis : logi NA TRUE FALSE NA TRUE FALSE ...
$ rel_asthma : logi TRUE FALSE TRUE FALSE TRUE TRUE ...
$ rel_diab : logi NA TRUE TRUE NA TRUE TRUE ...
$ heartdis_ever : logi NA FALSE FALSE NA FALSE FALSE ...
$ lungpath_ever : logi NA FALSE FALSE NA FALSE FALSE ...
$ diab_lft : int 1 1 3 1 3 1 1 1 3 1 ...
$ hrsworked_prvwk: int NA NA NA NA 36 75 40 NA NA 20 ...
$ jobstat_lwk : int 5 NA 8 NA 1 1 1 NA NA 4 ...
$ wrkt_irreg : logi NA NA NA NA FALSE FALSE ...
$ workpollut : logi NA FALSE FALSE NA FALSE TRUE ...
$ sleep_dur : int 6 4 6 8 5 6 8 8 5 8 ...
$ sleep_probl : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ cannab_ever : logi TRUE NA FALSE TRUE NA NA ...
$ harddrg_ever : logi FALSE FALSE FALSE FALSE FALSE NA ...
$ drnkprd_prv12mo: int 1 NA NA 2 NA NA 4 6 3 3 ...
$ alc_lft : int 3 NA 1 NA 3 NA 3 3 3 3 ...
$ cigsprd_prv30d : int NA NA NA NA NA NA NA NA 10 NA ...
$ smokstat : int NA 1 1 NA 1 1 NA NA 3 NA ...
$ rdyfood_prvmo : int 0 0 0 0 0 0 0 0 0 0 ...
$ frzfood_prvmo : int 1 0 1 0 0 0 0 0 0 0 ...
$ milk_month : int 3 3 3 1 0 2 2 3 0 0 ...
$ phq1 : int 0 0 0 2 0 NA 0 0 0 3 ...
$ phq2 : int 1 0 0 0 0 NA 0 0 0 3 ...
$ phq3 : int 0 1 1 1 0 NA 0 0 0 2 ...
$ phq4 : int 3 0 1 1 0 NA 0 1 0 2 ...
$ phq5 : int 1 1 0 0 0 NA 0 0 0 1 ...
$ phq6 : int 1 0 0 1 0 NA 0 0 0 3 ...
$ phq7 : int 0 0 0 0 0 NA 0 2 0 2 ...
$ phq8 : int 0 0 0 1 0 NA 0 0 0 0 ...
$ phq9 : int 0 0 0 0 0 NA 0 0 0 1 ...
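One way to shortlist factor candidates from output like this is to look for integer columns that take only a handful of distinct values, since those are often category codes. The following is a minimal sketch on a made-up data frame; `toy`, its values, and the cutoff of 5 are illustrative assumptions, not part of the exercise.

```r
# Toy data frame standing in for the course data (all values are made up).
toy <- data.frame(
  age    = c(18L, 66L, 38L, 45L, 69L, 51L),          # genuinely numeric
  ethnic = c(3L, 3L, 2L, 1L, 4L, 2L),                # coded category
  educ   = c(NA, 4L, 4L, 2L, 2L, 5L),                # coded category with a missing value
  male   = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)  # logical, left alone
)

# Class of every column.
col_class <- sapply(toy, class)

# Number of distinct non-missing values in each integer column.
is_int <- col_class == "integer"
n_distinct <- sapply(toy[is_int], function(x) length(unique(x[!is.na(x)])))

# Assumed rule of thumb: at most 5 distinct codes suggests a categorical variable.
candidates <- names(n_distinct)[n_distinct <= 5]
print(candidates)
```

Such a rule is only a hint. The phq items, for example, also take few distinct values but are left as integers in this solution, so the final decision should rest on what each variable means.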
Change the class of these variables to factor.
dat$srhgnrl <- as.factor(dat$srhgnrl)
dat$srphbad_prv30d <- as.factor(dat$srphbad_prv30d)
dat$srmhbad_prv30d <- as.factor(dat$srmhbad_prv30d)
dat$adlimp_prv30d <- as.factor(dat$adlimp_prv30d)
dat$educ <- as.factor(dat$educ)
dat$martlst <- as.factor(dat$martlst)
dat$ethnic <- as.factor(dat$ethnic)
dat$increl <- as.factor(dat$increl)
dat$diab_lft <- as.factor(dat$diab_lft)
dat$jobstat_lwk <- as.factor(dat$jobstat_lwk)
dat$alc_lft <- as.factor(dat$alc_lft)
dat$smokstat <- as.factor(dat$smokstat)
dat$milk_month <- as.factor(dat$milk_month)
Task 1.4
How many women and how many men are there in your data set?
sum(dat$male) # number of men
[1] 2481
sum(!dat$male) # number of women
[1] 2519
# Alternatively:
table(dat$male)
FALSE TRUE
2519 2481
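Counting with sum() works here because male has no missing values (2519 + 2481 = 5000). For a logical variable that does contain NA, sum() would return NA unless na.rm = TRUE is set, and table() would silently drop the missing entries. A minimal sketch with a made-up vector (`flag` is hypothetical, not a column of the data set):

```r
# Hypothetical logical vector with one missing value.
flag <- c(TRUE, FALSE, NA, TRUE, FALSE, FALSE)

n_true  <- sum(flag, na.rm = TRUE)   # number of TRUE values
n_false <- sum(!flag, na.rm = TRUE)  # number of FALSE values
n_na    <- sum(is.na(flag))          # number of missing values

# table() leaves out NA by default; useNA = "ifany" adds a count for it.
counts <- table(flag, useNA = "ifany")
print(counts)
```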
Task 1.5
What is the mean BMI in the overall population? What is the mean BMI for men and women?
# overall
mean(dat$bmi, na.rm=TRUE)
[1] 28.57915
# gender groups
mean(dat$bmi[dat$male], na.rm=TRUE) # mean bmi for men
[1] 28.03552
mean(dat$bmi[!dat$male], na.rm=TRUE) # mean bmi for women
[1] 29.11956
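The group means can also be computed in a single call with aggregate() and a formula. The sketch below uses a made-up data frame (`toy` and its values are assumptions for illustration). With the formula interface, rows with a missing BMI are dropped by default, so no na.rm argument is needed.

```r
# Toy data standing in for the bmi and male columns (values are made up).
toy <- data.frame(
  bmi  = c(22.0, 30.0, NA, 26.0, 28.0),
  male = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# One row per group; the default na.action (na.omit) removes the NA row first.
by_sex <- aggregate(bmi ~ male, data = toy, FUN = mean)
print(by_sex)
```

The result is a data frame with one row per value of male, which is convenient when the group means are needed for further processing.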
Task 1.6
Who has a higher blood mercury level: men or women? People with chronic bronchitis or people without it? ‘Hispanic’, ‘White’, ‘Black’ or ‘Other/Mixed’ people?
mean(dat$hg[dat$male == TRUE], na.rm=TRUE) # mercury level in men
[1] 8.842238
mean(dat$hg[dat$male == FALSE], na.rm=TRUE) # mercury level in women
[1] 8.067284
mean(dat$hg[dat$cbronch_now == TRUE], na.rm=TRUE) # mercury in people with chronic bronchitis
[1] 6.561111
mean(dat$hg[dat$cbronch_now == FALSE], na.rm=TRUE) # mercury in people without chronic bronchitis
[1] 6.295122
# mercury in different ethnic groups
mean(dat$hg[dat$ethnic == 1], na.rm=TRUE)
[1] 6.812474
mean(dat$hg[dat$ethnic == 2], na.rm=TRUE)
[1] 6.29259
mean(dat$hg[dat$ethnic == 3], na.rm=TRUE)
[1] 6.996337
mean(dat$hg[dat$ethnic == 4], na.rm=TRUE)
[1] 17.60334
# advanced method as one-liner because some people asked
tapply(dat$hg, dat$ethnic, mean, na.rm=TRUE) # passing the hg vector directly also works in R versions before 4.3
        1         2         3         4
 6.812474  6.292590  6.996337 17.603338
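One detail of the subsetting used in this task: cbronch_now is missing for most participants, and a logical index that contains NA returns an NA element for every missing position. The means are still correct because na.rm removes those elements, but the selected vector is longer than the group itself. A minimal sketch with made-up vectors (`hg` and `bronch` here are hypothetical stand-ins):

```r
# Hypothetical mercury values and bronchitis status with one missing status.
hg     <- c(2.0, 4.0, 6.0, 8.0)
bronch <- c(TRUE, NA, FALSE, TRUE)

# An NA in the logical index produces an NA element in the result.
picked_eq <- hg[bronch == TRUE]
print(picked_eq)

# which() keeps only the positions that are TRUE and skips NA.
picked_which <- hg[which(bronch)]
print(picked_which)

# Both give the same mean, but only the first needs na.rm = TRUE.
mean_eq    <- mean(picked_eq, na.rm = TRUE)
mean_which <- mean(picked_which)
```

Using which() is a matter of taste here; it mainly matters when the selected values are counted or inspected rather than averaged.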
Task 1.7
Use the function summary to get the summarized information on all the variables in the data set.
summary(dat)
seqn cd pb hg
Min. :62161 Min. : 0.980 Min. :0.0090 Min. : 0.500
1st Qu.:64635 1st Qu.: 1.780 1st Qu.:0.0340 1st Qu.: 2.100
Median :67110 Median : 3.020 Median :0.0530 Median : 4.200
Mean :67081 Mean : 4.761 Mean :0.0731 Mean : 8.454
3rd Qu.:69541 3rd Qu.: 5.430 3rd Qu.:0.0850 3rd Qu.: 9.400
Max. :71915 Max. :77.400 Max. :2.9600 Max. :253.500
NA's :471 NA's :471 NA's :471
hdl hivpos weight height
Min. :0.360 Mode :logical Min. : 29.1 Min. :134.5
1st Qu.:1.090 FALSE:3051 1st Qu.: 65.2 1st Qu.:160.0
Median :1.290 TRUE :14 Median : 77.2 Median :167.1
Mean :1.353 NA's :1935 Mean : 80.2 Mean :167.3
3rd Qu.:1.550 3rd Qu.: 91.3 3rd Qu.:174.4
Max. :4.530 Max. :216.1 Max. :204.5
NA's :574 NA's :278 NA's :279
bmi rr_sys rr_dia srhgnrl srphbad_prv30d
Min. :13.40 Min. : 81.33 Min. : 0.00 1 : 427 1 :2748
1st Qu.:23.80 1st Qu.:110.67 1st Qu.: 63.33 2 :1165 2 :1078
Median :27.40 Median :120.00 Median : 70.67 3 :1720 3 : 425
Mean :28.58 Mean :123.00 Mean : 70.27 4 : 795 NA's: 749
3rd Qu.:32.00 3rd Qu.:132.00 3rd Qu.: 78.00 5 : 152
Max. :80.60 Max. :234.67 Max. :125.00 NA's: 741
NA's :290 NA's :437 NA's :437
srmhbad_prv30d adlimp_prv30d age educ martlst
1 :2567 1 :3434 Min. :18.00 1 : 481 1 :2277
2 :1219 2 : 574 1st Qu.:31.00 2 : 665 5 :1004
3 : 467 3 : 246 Median :47.00 3 : 990 3 : 496
NA's: 747 NA's: 746 Mean :47.50 4 :1418 2 : 400
3rd Qu.:62.25 5 :1180 6 : 378
Max. :80.00 NA's: 266 (Other): 183
NA's : 262
male ethnic increl asthma_ever asthma_now
Mode :logical 1:1042 1 :1220 Mode :logical Mode :logical
FALSE:2519 2:1803 2 :1185 FALSE:4244 FALSE:313
TRUE :2481 3:1314 3 :1036 TRUE :752 TRUE :426
4: 841 4 :1104 NA's :4 NA's :4261
NA's: 455
ovrwght_ever arthrit_ever stroke_ever livdis_ever
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:3422 FALSE:3549 FALSE:4538 FALSE:4541
TRUE :1575 TRUE :1180 TRUE :195 TRUE :190
NA's :3 NA's :271 NA's :267 NA's :269
cbronch_now livdis_now cancer_ever rel_heartdis
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:133 FALSE:71 FALSE:4316 FALSE:4083
TRUE :119 TRUE :110 TRUE :416 TRUE :550
NA's :4748 NA's :4819 NA's :268 NA's :367
rel_asthma rel_diab heartdis_ever lungpath_ever diab_lft
Mode :logical Mode :logical Mode :logical Mode :logical 1 :4090
FALSE:3886 FALSE:2805 FALSE:4348 FALSE:4413 2 : 21
TRUE :1016 TRUE :1845 TRUE :367 TRUE :310 3 : 581
NA's :98 NA's :350 NA's :285 NA's :277 NA's: 308
hrsworked_prvwk jobstat_lwk wrkt_irreg workpollut
Min. : 1.0 1 :1461 Mode :logical Mode :logical
1st Qu.: 34.0 5 : 774 FALSE:2104 FALSE:2130
Median : 40.0 4 : 419 TRUE :491 TRUE :2243
Mean : 141.1 3 : 329 NA's :2405 NA's :627
3rd Qu.: 48.0 8 : 266
Max. :99999.0 (Other): 515
NA's :2474 NA's :1236
sleep_dur sleep_probl cannab_ever harddrg_ever
Min. : 2.000 Mode :logical Mode :logical Mode :logical
1st Qu.: 6.000 FALSE:3841 FALSE:1318 FALSE:2917
Median : 7.000 TRUE :1157 TRUE :1517 TRUE :574
Mean : 6.983 NA's :2 NA's :2165 NA's :1509
3rd Qu.: 8.000
Max. :99.000
NA's :4
drnkprd_prv12mo alc_lft cigsprd_prv30d smokstat rdyfood_prvmo
Min. : 1.000 1 : 631 Min. : 1.00 1 :1089 Min. : 0.000
1st Qu.: 1.000 2 : 124 1st Qu.: 5.00 2 : 162 1st Qu.: 0.000
Median : 2.000 3 :3045 Median :10.00 3 : 788 Median : 0.000
Mean : 2.923 NA's:1200 Mean :11.29 NA's:2961 Mean : 1.834
3rd Qu.: 3.000 3rd Qu.:15.00 3rd Qu.: 1.000
Max. :82.000 Max. :80.00 Max. :90.000
NA's :2170 NA's :4001 NA's :8
frzfood_prvmo milk_month phq1 phq2
Min. : 0.000 0 : 897 Min. :0.0000 Min. :0.0000
1st Qu.: 0.000 1 : 829 1st Qu.:0.0000 1st Qu.:0.0000
Median : 0.000 2 :1378 Median :0.0000 Median :0.0000
Mean : 2.522 3 :1861 Mean :0.3683 Mean :0.3478
3rd Qu.: 2.000 NA's: 35 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :180.000 Max. :3.0000 Max. :3.0000
NA's :6 NA's :783 NA's :779
phq3 phq4 phq5 phq6
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.6031 Mean :0.7036 Mean :0.3765 Mean :0.2633
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :3.0000 Max. :3.0000 Max. :3.0000 Max. :3.0000
NA's :780 NA's :779 NA's :780 NA's :784
phq7 phq8 phq9
Min. :0.0000 Min. :0.0000 Min. :0.00
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00
Median :0.0000 Median :0.0000 Median :0.00
Mean :0.2578 Mean :0.1814 Mean :0.06
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00
Max. :3.0000 Max. :3.0000 Max. :3.00
NA's :780 NA's :783 NA's :782
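The summary lists a maximum of 99999 for hrsworked_prvwk and 99 for sleep_dur. Values like these often serve as placeholder codes for missing or refused answers, although that is an assumption here and should be checked against the data documentation. If it holds, such codes would need to be set to NA before computing means. A minimal sketch on a made-up vector (`sleep` and the code 99 are illustrative assumptions):

```r
# Hypothetical sleep durations; 99 is assumed to be a placeholder code, not 99 hours.
sleep <- c(6, 4, 99, 8, 7, NA)

# Replace the assumed placeholder with NA; which() skips the existing NA safely.
sleep_clean <- sleep
sleep_clean[which(sleep_clean == 99)] <- NA

mean_raw   <- mean(sleep, na.rm = TRUE)        # pulled upwards by the placeholder
mean_clean <- mean(sleep_clean, na.rm = TRUE)  # based on the plausible values only
print(c(raw = mean_raw, clean = mean_clean))
```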