Air Pollution and Mortality

Jae-seong Yoo 2014. 5. 29. 21:51

2014. 5. 29. 21:51

R-code.r

pollution.txt

This is the pollution data so loved by writers of papers on ridge regression.

Source: McDonald, G.C. and Schwing, R.C. (1973) 'Instabilities of regression estimates relating air pollution to mortality', Technometrics, vol.15, 463-482.

Variables in order:

PREC Average annual precipitation in inches

JANT Average January temperature in degrees F

JULT Same for July

OVR65 % of 1960 SMSA population aged 65 or older

POPN Average household size

EDUC Median school years completed by those over 22

HOUS % of housing units which are sound & with all facilities

DENS Population per sq. mile in urbanized areas, 1960

NONW % non-white population in urbanized areas, 1960

WWDRK % employed in white collar occupations

POOR % of families with income < $3000

HC Relative hydrocarbon pollution potential

NOX Same for nitric oxides

SO@ Same for sulphur dioxide

HUMID Annual average % relative humidity at 1pm

MORT Total age-adjusted mortality rate per 100,000

분류 : Cluster Analysis

출처 : http://www.umass.edu/statdata/statdata/

R-code.r

JaeseongYoo — May 29, 2014, 9:46 PM

rm(list = ls())

NANANA

[1] NA

set.seed(1)
data = scan("pollution.txt")
data = matrix(data, ncol=16)
dimnames(data)[[2]] = c("PREC", "JANT", "JULT", "OVR65", "POPN",
                                        "EDUC", "HOUS", "DENS", "NONW", "WWDRK",
                                        "POOR", "HC", "NOX", "SOat", "HUMID",
                                        "MORT")

eig = eigen(cor(data))
round(eig$value, 2)

 [1] 5.88 4.33 2.26 2.17 0.32 0.31 0.27 0.18 0.10 0.07 0.04 0.04 0.02 0.00
[15] 0.00 0.00

n_factor = sum((eig$value > 1)*1)

# Single Linkage
hclust_result = hclust(dist(data, method="euclidean"), method="single")
plot(hclust_result, hang=-1, main="Single Linkage")
rect.hclust(hclust_result, n_factor)

plot of chunk unnamed-chunk-1

cutree_result = cutree(hclust_result, n_factor)
table(cutree_result)

cutree_result
 1  2  3  4 
49  4  4  3

# Complete Linkage
hclust_result = hclust(dist(data, method="euclidean"), method="complete")
plot(hclust_result, hang=-1, main="Complete Linkage")
rect.hclust(hclust_result, n_factor)

plot of chunk unnamed-chunk-1

cutree_result = cutree(hclust_result, n_factor)
table(cutree_result)

cutree_result
 1  2  3  4 
49  4  4  3

# Average Linkage
hclust_result = hclust(dist(data, method="euclidean"), method="average")
plot(hclust_result, hang=-1, main="Average Linkage")
rect.hclust(hclust_result, n_factor)

plot of chunk unnamed-chunk-1

cutree_result = cutree(hclust_result, n_factor)
table(cutree_result)

cutree_result
 1  2  3  4 
49  4  4  3

# Centroid Method
hclust_result = hclust(dist(data, method="euclidean"), method="centroid")
plot(hclust_result, hang=-1, main="Centroid")
rect.hclust(hclust_result, n_factor)

plot of chunk unnamed-chunk-1

cutree_result = cutree(hclust_result, n_factor)
table(cutree_result)

cutree_result
 1  2  3  4 
52  3  4  1

# Ward's Method
hclust_result = hclust(dist(data, method="euclidean"), method="ward.D2")
plot(hclust_result, hang=-1, main="Ward")
rect.hclust(hclust_result, n_factor)

plot of chunk unnamed-chunk-1

cutree_result = cutree(hclust_result, n_factor)
table(cutree_result)

cutree_result
 1  2  3  4 
45  4  7  4

# K-means
kmeanclust_result =kmeans(data, n_factor, nstart=500)
kmeanclust_result

K-means clustering with 4 clusters of sizes 7, 4, 45, 4

Cluster means:
      PREC    JANT    JULT   OVR65    POPN     EDUC     HOUS    DENS
1 2541.586   18.50 2418.34   27.71 2542.30    6.557 1963.969  104.51
2   13.250 4146.25    8.70  985.04   19.00 2758.250    8.125  903.72
3   34.933   43.42   32.55   43.51   32.25   34.842   33.658   44.75
4    9.025 1004.66   13.50 4381.00    7.50  915.862    8.500 3912.50
      NONW   WWDRK    POOR       HC     NOX    SOat   HUMID    MORT
1 2282.868   10.94 2401.56    9.471 2550.42   10.33 2045.75   10.49
2   17.750 4160.75   10.15  946.117  147.75 3420.25    8.05  941.36
3   37.716   37.09   46.66   33.584   41.04   33.37   35.01   32.27
4    8.825  963.76   42.75 4118.500    8.20  886.21   21.25 4995.00

Clustering vector:
 [1] 3 3 3 4 3 3 3 1 3 3 3 2 3 3 3 1 3 3 3 4 3 3 3 1 3 3 3 2 3 3 3 1 3 3 3
[36] 4 3 3 3 1 3 3 3 2 3 3 3 1 3 3 3 4 3 3 3 1 3 3 3 2

Within cluster sum of squares by cluster:
[1] 150869711  31947851    809151  37550545
 (between_SS / total_SS =  77.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"

require(cluster)

Loading required package: cluster

clusplot(data, kmeanclust_result$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

require(fpc)

Loading required package: fpc
Loading required package: MASS
Loading required package: mclust
Package 'mclust' version 4.3
Loading required package: flexmix
Loading required package: lattice

plot of chunk unnamed-chunk-1

plotcluster(data, kmeanclust_result$cluster)

plot of chunk unnamed-chunk-1

'# Download Files Post > Data' 카테고리의 다른 글

Ecological Correlates of Sleep in mammals (0)	2014.06.02
James M. Lattin, J. Douglas Carroll, Paul E. Green, "Datasets of Analyzing Multivariate Data", Thomson, 2003 (0)	2014.03.13
아이스탯(ISTAT) raw data (0)	2013.08.10
가계동향조사 - 1990 - 통계청 MDSS (0)	2012.07.22
가계동향조사 - 1995 - 통계청 MDSS (0)	2012.07.22
가계동향조사 - 2000 - 통계청 MDSS (0)	2012.07.22
가계동향조사 - 2005 - 통계청 MDSS (0)	2012.07.22
가계동향조사 - 2010 - 통계청 MDSS (0)	2012.07.22
가계동향조사 - 2011 - 통계청 MDSS (0)	2012.06.20
가계금융조사 - 2010 - 통계청 MDSS (0)	2012.06.17

Jaeseong's private post