Exercise 4: Solutions, first part

Task 1.1

Load the data in R.

dat <- read.csv("NHANES1.csv")

Task 1.2

What is the dimension of the data set? How many rows (samples), and how many columns (variables) does the data set contain? What are the variable names of the data set?

dim(dat) # dimension of the data set, first number: number of rows, second number: number of columns
[1] 5000   60
# Alternatively:
nrow(dat)     # number of rows
[1] 5000
ncol(dat)     # number of columns
[1] 60
names(dat)    # variable names
 [1] "seqn"            "cd"              "pb"              "hg"             
 [5] "hdl"             "hivpos"          "weight"          "height"         
 [9] "bmi"             "rr_sys"          "rr_dia"          "srhgnrl"        
[13] "srphbad_prv30d"  "srmhbad_prv30d"  "adlimp_prv30d"   "age"            
[17] "educ"            "martlst"         "male"            "ethnic"         
[21] "increl"          "asthma_ever"     "asthma_now"      "ovrwght_ever"   
[25] "arthrit_ever"    "stroke_ever"     "livdis_ever"     "cbronch_now"    
[29] "livdis_now"      "cancer_ever"     "rel_heartdis"    "rel_asthma"     
[33] "rel_diab"        "heartdis_ever"   "lungpath_ever"   "diab_lft"       
[37] "hrsworked_prvwk" "jobstat_lwk"     "wrkt_irreg"      "workpollut"     
[41] "sleep_dur"       "sleep_probl"     "cannab_ever"     "harddrg_ever"   
[45] "drnkprd_prv12mo" "alc_lft"         "cigsprd_prv30d"  "smokstat"       
[49] "rdyfood_prvmo"   "frzfood_prvmo"   "milk_month"      "phq1"           
[53] "phq2"            "phq3"            "phq4"            "phq5"           
[57] "phq6"            "phq7"            "phq8"            "phq9"           
colnames(dat) # works too
 [1] "seqn"            "cd"              "pb"              "hg"             
 [5] "hdl"             "hivpos"          "weight"          "height"         
 [9] "bmi"             "rr_sys"          "rr_dia"          "srhgnrl"        
[13] "srphbad_prv30d"  "srmhbad_prv30d"  "adlimp_prv30d"   "age"            
[17] "educ"            "martlst"         "male"            "ethnic"         
[21] "increl"          "asthma_ever"     "asthma_now"      "ovrwght_ever"   
[25] "arthrit_ever"    "stroke_ever"     "livdis_ever"     "cbronch_now"    
[29] "livdis_now"      "cancer_ever"     "rel_heartdis"    "rel_asthma"     
[33] "rel_diab"        "heartdis_ever"   "lungpath_ever"   "diab_lft"       
[37] "hrsworked_prvwk" "jobstat_lwk"     "wrkt_irreg"      "workpollut"     
[41] "sleep_dur"       "sleep_probl"     "cannab_ever"     "harddrg_ever"   
[45] "drnkprd_prv12mo" "alc_lft"         "cigsprd_prv30d"  "smokstat"       
[49] "rdyfood_prvmo"   "frzfood_prvmo"   "milk_month"      "phq1"           
[53] "phq2"            "phq3"            "phq4"            "phq5"           
[57] "phq6"            "phq7"            "phq8"            "phq9"           

Task 1.3

All the variables in the data set are of class integer, numeric, or logical (i.e., boolean). However, some of the variables encode categories and should be factors rather than numerical variables. Which ones?

str(dat)
'data.frame':   5000 obs. of  60 variables:
 $ seqn           : int  64780 65837 67799 71052 64130 70941 71372 68669 68367 62777 ...
 $ cd             : num  0.98 4.54 1.6 4.36 3.91 1.87 0.98 4.27 4.45 1.69 ...
 $ pb             : num  0.017 0.08 0.009 0.026 0.029 0.07 0.03 0.05 0.048 0.181 ...
 $ hg             : num  3 2.2 0.5 6.3 6.9 11.4 1.4 3.4 25.2 45.5 ...
 $ hdl            : num  1.81 1.03 1.16 1.22 1.01 1.22 1.34 1.16 1.11 2.38 ...
 $ hivpos         : logi  FALSE NA FALSE FALSE NA FALSE ...
 $ weight         : num  NA 94.5 94.9 57.2 66.4 80.5 88.3 93.9 90.9 92.1 ...
 $ height         : num  NA 174 158 156 167 ...
 $ bmi            : num  NA 31.2 37.8 23.7 23.9 27 32.5 29.3 31.7 25.4 ...
 $ rr_sys         : num  102 154 117 111 153 ...
 $ rr_dia         : num  11.3 68.7 68 17.3 77.3 ...
 $ srhgnrl        : int  4 3 3 3 3 NA 1 4 3 2 ...
 $ srphbad_prv30d : int  2 2 1 2 3 NA 1 2 1 1 ...
 $ srmhbad_prv30d : int  2 1 2 2 1 NA 1 2 1 3 ...
 $ adlimp_prv30d  : int  2 1 1 2 1 NA 1 2 1 3 ...
 $ age            : int  18 66 38 18 69 45 37 51 78 28 ...
 $ educ           : int  NA 4 4 NA 2 4 1 2 2 5 ...
 $ martlst        : int  NA 3 1 NA 1 1 6 5 1 5 ...
 $ male           : logi  FALSE TRUE FALSE FALSE TRUE TRUE ...
 $ ethnic         : int  3 3 2 1 4 2 1 2 3 4 ...
 $ increl         : int  2 NA 2 NA NA 3 1 1 3 2 ...
 $ asthma_ever    : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ asthma_now     : logi  FALSE NA NA NA NA NA ...
 $ ovrwght_ever   : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
 $ arthrit_ever   : logi  NA FALSE TRUE NA TRUE FALSE ...
 $ stroke_ever    : logi  NA TRUE FALSE NA FALSE FALSE ...
 $ livdis_ever    : logi  NA FALSE FALSE NA FALSE FALSE ...
 $ cbronch_now    : logi  NA NA NA NA NA NA ...
 $ livdis_now     : logi  NA NA NA NA NA NA ...
 $ cancer_ever    : logi  NA FALSE FALSE NA FALSE FALSE ...
 $ rel_heartdis   : logi  NA TRUE FALSE NA TRUE FALSE ...
 $ rel_asthma     : logi  TRUE FALSE TRUE FALSE TRUE TRUE ...
 $ rel_diab       : logi  NA TRUE TRUE NA TRUE TRUE ...
 $ heartdis_ever  : logi  NA FALSE FALSE NA FALSE FALSE ...
 $ lungpath_ever  : logi  NA FALSE FALSE NA FALSE FALSE ...
 $ diab_lft       : int  1 1 3 1 3 1 1 1 3 1 ...
 $ hrsworked_prvwk: int  NA NA NA NA 36 75 40 NA NA 20 ...
 $ jobstat_lwk    : int  5 NA 8 NA 1 1 1 NA NA 4 ...
 $ wrkt_irreg     : logi  NA NA NA NA FALSE FALSE ...
 $ workpollut     : logi  NA FALSE FALSE NA FALSE TRUE ...
 $ sleep_dur      : int  6 4 6 8 5 6 8 8 5 8 ...
 $ sleep_probl    : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ cannab_ever    : logi  TRUE NA FALSE TRUE NA NA ...
 $ harddrg_ever   : logi  FALSE FALSE FALSE FALSE FALSE NA ...
 $ drnkprd_prv12mo: int  1 NA NA 2 NA NA 4 6 3 3 ...
 $ alc_lft        : int  3 NA 1 NA 3 NA 3 3 3 3 ...
 $ cigsprd_prv30d : int  NA NA NA NA NA NA NA NA 10 NA ...
 $ smokstat       : int  NA 1 1 NA 1 1 NA NA 3 NA ...
 $ rdyfood_prvmo  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ frzfood_prvmo  : int  1 0 1 0 0 0 0 0 0 0 ...
 $ milk_month     : int  3 3 3 1 0 2 2 3 0 0 ...
 $ phq1           : int  0 0 0 2 0 NA 0 0 0 3 ...
 $ phq2           : int  1 0 0 0 0 NA 0 0 0 3 ...
 $ phq3           : int  0 1 1 1 0 NA 0 0 0 2 ...
 $ phq4           : int  3 0 1 1 0 NA 0 1 0 2 ...
 $ phq5           : int  1 1 0 0 0 NA 0 0 0 1 ...
 $ phq6           : int  1 0 0 1 0 NA 0 0 0 3 ...
 $ phq7           : int  0 0 0 0 0 NA 0 2 0 2 ...
 $ phq8           : int  0 0 0 1 0 NA 0 0 0 0 ...
 $ phq9           : int  0 0 0 0 0 NA 0 0 0 1 ...

Change the class of these variables to factor.

dat$srhgnrl <- as.factor(dat$srhgnrl)
dat$srphbad_prv30d <- as.factor(dat$srphbad_prv30d)
dat$srmhbad_prv30d <- as.factor(dat$srmhbad_prv30d)
dat$adlimp_prv30d <- as.factor(dat$adlimp_prv30d)
dat$educ <- as.factor(dat$educ)
dat$martlst <- as.factor(dat$martlst)
dat$ethnic <- as.factor(dat$ethnic)
dat$increl <- as.factor(dat$increl)
dat$diab_lft <- as.factor(dat$diab_lft)
dat$jobstat_lwk <- as.factor(dat$jobstat_lwk)
dat$alc_lft <- as.factor(dat$alc_lft)
dat$smokstat <- as.factor(dat$smokstat)
dat$milk_month <- as.factor(dat$milk_month)
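
The assignments above all follow the same pattern, so they could also be written as a single step over a vector of column names. A minimal sketch, using a small made-up data frame (`toy` and `cat_vars` are illustrative names, not part of the exercise data) so that it runs on its own:

```r
# Toy data frame standing in for dat (values are invented for illustration)
toy <- data.frame(
  educ   = c(1L, 4L, NA, 2L),
  ethnic = c(3L, 3L, 2L, 1L),
  age    = c(18L, 66L, 38L, 45L)
)

# Names of the columns that hold category codes rather than quantities
cat_vars <- c("educ", "ethnic")

# lapply() applies as.factor() to each selected column;
# assigning back into toy[cat_vars] replaces those columns in place
toy[cat_vars] <- lapply(toy[cat_vars], as.factor)

str(toy)
```

With the real data, `cat_vars` would list the thirteen variables converted above; the remaining columns (here `age`) are left untouched.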

Task 1.4

How many women and how many men are there in your data set?

sum(dat$male)
[1] 2481
sum(!dat$male)
[1] 2519
# Alternatively: 
table(dat$male)

FALSE  TRUE 
 2519  2481 
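
Note that sum() works here only because male has no missing values (2519 + 2481 = 5000). For a logical variable that does contain NA, the following sketch shows what changes; the vector `male` below is invented for illustration and is not taken from the data set:

```r
# Invented logical vector with one missing value
male <- c(TRUE, FALSE, NA, TRUE, FALSE, FALSE)

n_raw   <- sum(male)                 # NA, because one entry is missing
n_known <- sum(male, na.rm = TRUE)   # counts only the known TRUE values

# table() drops NA by default; useNA = "ifany" adds a separate NA count
tab <- table(male, useNA = "ifany")

n_raw
n_known
tab
```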

Task 1.5

What is the mean BMI in the overall population? What is the mean BMI for men and women?

# overall
mean(dat$bmi, na.rm=T)
[1] 28.57915
# gender groups
mean(dat$bmi[dat$male], na.rm=T) # mean bmi for men
[1] 28.03552
mean(dat$bmi[!dat$male], na.rm=T) # mean bmi for women
[1] 29.11956
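
Both group means can also be obtained in one call. A minimal sketch with aggregate() and its formula interface, on a small invented data frame (`toy` is a stand-in, not the exercise data); by default the formula interface drops rows with missing values, so na.rm is not needed:

```r
# Toy data standing in for dat (values are invented for illustration)
toy <- data.frame(
  bmi  = c(22.0, 30.0, NA, 26.0, 28.0),
  male = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# bmi ~ male reads as "bmi grouped by male";
# FUN = mean is applied to the bmi values within each group
res <- aggregate(bmi ~ male, data = toy, FUN = mean)
res
```

The result is a data frame with one row per group, which is convenient when the group means are needed for further processing.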

Task 1.6

Who has a higher mercury level in the blood: men or women? People with chronic bronchitis or people without it? ‘Hispanic’, ‘White’, ‘Black’ or ‘Other/Mixed’ people?

mean(dat$hg[dat$male==T], na.rm=T) # mercury level in men
[1] 8.842238
mean(dat$hg[dat$male==F], na.rm=T) # mercury level in women
[1] 8.067284
mean(dat$hg[dat$cbronch_now==T], na.rm=T) # mercury in people with chronic bronchitis
[1] 6.561111
mean(dat$hg[dat$cbronch_now==F], na.rm=T) # mercury in people without c. bronchitis
[1] 6.295122
# mercury in different ethnic groups
mean(dat$hg[dat$ethnic==1], na.rm=T)
[1] 6.812474
mean(dat$hg[dat$ethnic==2], na.rm=T)
[1] 6.29259
mean(dat$hg[dat$ethnic==3], na.rm=T)
[1] 6.996337
mean(dat$hg[dat$ethnic==4], na.rm=T)
[1] 17.60334
# advanced method as one-liner because some people asked

tapply(dat$hg, dat$ethnic, mean, na.rm=T) # mean of hg within each level of ethnic
        1         2         3         4 
 6.812474  6.292590  6.996337 17.603338 
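
A note on the comparisons with `==T` and `==F` above: cbronch_now contains many missing values, and an NA in a logical index produces an NA element in the result. The means are still correct because na.rm=T removes those elements again. The sketch below illustrates this on invented vectors (`hg` and `cbronch` here are toy values, not the exercise data) and shows which() as an alternative that keeps only the positions that are TRUE:

```r
# Invented values for illustration
hg      <- c(2.0, 4.0, 6.0, 8.0)
cbronch <- c(TRUE, NA, FALSE, TRUE)

sel_na    <- hg[cbronch == TRUE]   # 2 NA 8: the NA in the index yields an NA element
sel_which <- hg[which(cbronch)]    # 2 8: which() returns only the TRUE positions

m1 <- mean(sel_na, na.rm = TRUE)   # 5, NA element removed by na.rm
m2 <- mean(sel_which)              # 5, no NA was selected in the first place

sel_na
sel_which
c(m1, m2)
```

In the real data hg itself also has missing values, so na.rm=T is still needed in either variant.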

Task 1.7

Use the function summary to get the summarized information on all the variables in the data set.

summary(dat)
      seqn             cd               pb               hg         
 Min.   :62161   Min.   : 0.980   Min.   :0.0090   Min.   :  0.500  
 1st Qu.:64635   1st Qu.: 1.780   1st Qu.:0.0340   1st Qu.:  2.100  
 Median :67110   Median : 3.020   Median :0.0530   Median :  4.200  
 Mean   :67081   Mean   : 4.761   Mean   :0.0731   Mean   :  8.454  
 3rd Qu.:69541   3rd Qu.: 5.430   3rd Qu.:0.0850   3rd Qu.:  9.400  
 Max.   :71915   Max.   :77.400   Max.   :2.9600   Max.   :253.500  
                 NA's   :471      NA's   :471      NA's   :471      
      hdl          hivpos            weight          height     
 Min.   :0.360   Mode :logical   Min.   : 29.1   Min.   :134.5  
 1st Qu.:1.090   FALSE:3051      1st Qu.: 65.2   1st Qu.:160.0  
 Median :1.290   TRUE :14        Median : 77.2   Median :167.1  
 Mean   :1.353   NA's :1935      Mean   : 80.2   Mean   :167.3  
 3rd Qu.:1.550                   3rd Qu.: 91.3   3rd Qu.:174.4  
 Max.   :4.530                   Max.   :216.1   Max.   :204.5  
 NA's   :574                     NA's   :278     NA's   :279    
      bmi            rr_sys           rr_dia       srhgnrl     srphbad_prv30d
 Min.   :13.40   Min.   : 81.33   Min.   :  0.00   1   : 427   1   :2748     
 1st Qu.:23.80   1st Qu.:110.67   1st Qu.: 63.33   2   :1165   2   :1078     
 Median :27.40   Median :120.00   Median : 70.67   3   :1720   3   : 425     
 Mean   :28.58   Mean   :123.00   Mean   : 70.27   4   : 795   NA's: 749     
 3rd Qu.:32.00   3rd Qu.:132.00   3rd Qu.: 78.00   5   : 152                 
 Max.   :80.60   Max.   :234.67   Max.   :125.00   NA's: 741                 
 NA's   :290     NA's   :437      NA's   :437                                
 srmhbad_prv30d adlimp_prv30d      age          educ         martlst    
 1   :2567      1   :3434     Min.   :18.00   1   : 481   1      :2277  
 2   :1219      2   : 574     1st Qu.:31.00   2   : 665   5      :1004  
 3   : 467      3   : 246     Median :47.00   3   : 990   3      : 496  
 NA's: 747      NA's: 746     Mean   :47.50   4   :1418   2      : 400  
                              3rd Qu.:62.25   5   :1180   6      : 378  
                              Max.   :80.00   NA's: 266   (Other): 183  
                                                          NA's   : 262  
    male         ethnic    increl     asthma_ever     asthma_now     
 Mode :logical   1:1042   1   :1220   Mode :logical   Mode :logical  
 FALSE:2519      2:1803   2   :1185   FALSE:4244      FALSE:313      
 TRUE :2481      3:1314   3   :1036   TRUE :752       TRUE :426      
                 4: 841   4   :1104   NA's :4         NA's :4261     
                          NA's: 455                                  
                                                                     
                                                                     
 ovrwght_ever    arthrit_ever    stroke_ever     livdis_ever    
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:3422      FALSE:3549      FALSE:4538      FALSE:4541     
 TRUE :1575      TRUE :1180      TRUE :195       TRUE :190      
 NA's :3         NA's :271       NA's :267       NA's :269      
                                                                
                                                                
                                                                
 cbronch_now     livdis_now      cancer_ever     rel_heartdis   
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:133       FALSE:71        FALSE:4316      FALSE:4083     
 TRUE :119       TRUE :110       TRUE :416       TRUE :550      
 NA's :4748      NA's :4819      NA's :268       NA's :367      
                                                                
                                                                
                                                                
 rel_asthma       rel_diab       heartdis_ever   lungpath_ever   diab_lft   
 Mode :logical   Mode :logical   Mode :logical   Mode :logical   1   :4090  
 FALSE:3886      FALSE:2805      FALSE:4348      FALSE:4413      2   :  21  
 TRUE :1016      TRUE :1845      TRUE :367       TRUE :310       3   : 581  
 NA's :98        NA's :350       NA's :285       NA's :277       NA's: 308  
                                                                            
                                                                            
                                                                            
 hrsworked_prvwk    jobstat_lwk   wrkt_irreg      workpollut     
 Min.   :    1.0   1      :1461   Mode :logical   Mode :logical  
 1st Qu.:   34.0   5      : 774   FALSE:2104      FALSE:2130     
 Median :   40.0   4      : 419   TRUE :491       TRUE :2243     
 Mean   :  141.1   3      : 329   NA's :2405      NA's :627      
 3rd Qu.:   48.0   8      : 266                                  
 Max.   :99999.0   (Other): 515                                  
 NA's   :2474      NA's   :1236                                  
   sleep_dur      sleep_probl     cannab_ever     harddrg_ever   
 Min.   : 2.000   Mode :logical   Mode :logical   Mode :logical  
 1st Qu.: 6.000   FALSE:3841      FALSE:1318      FALSE:2917     
 Median : 7.000   TRUE :1157      TRUE :1517      TRUE :574      
 Mean   : 6.983   NA's :2         NA's :2165      NA's :1509     
 3rd Qu.: 8.000                                                  
 Max.   :99.000                                                  
 NA's   :4                                                       
 drnkprd_prv12mo  alc_lft     cigsprd_prv30d  smokstat    rdyfood_prvmo   
 Min.   : 1.000   1   : 631   Min.   : 1.00   1   :1089   Min.   : 0.000  
 1st Qu.: 1.000   2   : 124   1st Qu.: 5.00   2   : 162   1st Qu.: 0.000  
 Median : 2.000   3   :3045   Median :10.00   3   : 788   Median : 0.000  
 Mean   : 2.923   NA's:1200   Mean   :11.29   NA's:2961   Mean   : 1.834  
 3rd Qu.: 3.000               3rd Qu.:15.00               3rd Qu.: 1.000  
 Max.   :82.000               Max.   :80.00               Max.   :90.000  
 NA's   :2170                 NA's   :4001                NA's   :8       
 frzfood_prvmo     milk_month       phq1             phq2       
 Min.   :  0.000   0   : 897   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:  0.000   1   : 829   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :  0.000   2   :1378   Median :0.0000   Median :0.0000  
 Mean   :  2.522   3   :1861   Mean   :0.3683   Mean   :0.3478  
 3rd Qu.:  2.000   NA's:  35   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :180.000               Max.   :3.0000   Max.   :3.0000  
 NA's   :6                     NA's   :783      NA's   :779     
      phq3             phq4             phq5             phq6       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.6031   Mean   :0.7036   Mean   :0.3765   Mean   :0.2633  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :3.0000   Max.   :3.0000   Max.   :3.0000   Max.   :3.0000  
 NA's   :780      NA's   :779      NA's   :780      NA's   :784     
      phq7             phq8             phq9     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00  
 Median :0.0000   Median :0.0000   Median :0.00  
 Mean   :0.2578   Mean   :0.1814   Mean   :0.06  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00  
 Max.   :3.0000   Max.   :3.0000   Max.   :3.00  
 NA's   :780      NA's   :783      NA's   :782
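
The summary shows a maximum of 99999 for hrsworked_prvwk and of 99 for sleep_dur. These look like placeholder codes for unknown answers rather than real measurements, although that is an assumption here and would need to be checked against the documentation of the data set. If it holds, such codes could be recoded as missing before computing statistics. A minimal sketch on an invented vector (`hrs` is a toy example, not the exercise data):

```r
# Invented values; 99999 is assumed to be a placeholder code, not a real measurement
hrs <- c(36, NA, 40, 99999, 20)

mean_before <- mean(hrs, na.rm = TRUE)   # dominated by the placeholder

# which() returns only the positions where the comparison is TRUE,
# so existing NA values are left untouched
hrs[which(hrs == 99999)] <- NA

mean_after <- mean(hrs, na.rm = TRUE)    # mean of the remaining values: 32

mean_before
mean_after
```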