How can I compute a principal component analysis using my dataset to draw a graph showing PC1 vs PC2?
Please can someone suggest me how to do principal component analysis with the gene dataset I have.
I have a table with 15 columns. The first is the disease group, where 0 is control, 1 is ulcerative colitis, and 2 is Crohn's disease.
The remaining 14 columns correspond to 14 different genes.
I would like to plot PC1 vs PC2 (via prcomp) after PCA to show if any clustering or separation occurs between the three groups based on the gene expression data (each axis shows the proportion of variation). However, I'm having a hard time knowing where to start because I can't convert column 1 to row names with row.names = 1 because R doesn't allow duplicate row names.
Converting to a matrix and trying the following code doesn't work.
mockdata1 <- as.matrix(mockdata)
rownames(mockdata1) <- mockdata1[,1]
mockdata1[,1] <- NULL
or
mockdata2 <-mockdata1 [ ,-1]
In the previous example, I've been able to calculate the PCA and plot PCA1 vs PCA2 and color the data accordingly, as per row.names = 1, but I'm not sure how to overcome the first initial problem of not using it here.
I have included my data below via dput(head(mockdata))
structure(list(Disease = c(1L, 1L, 0L, 0L, 2L, 2L), Gene1 = c(9104.774619,
35924.12358, 6.780294688, 1284.690716, 69.50341155, 3935.107345
), Gene2 = c(5224.114486, 35625.73119, 18.35291351, 511.9272679,
186.7270146, 47611.65544), Gene3 = c(1472.348466, 137571.5525,
20.78531289, 3019.140256, 146.9615338, 108935.1303), Gene4 = c(2487.124686,
147604.774, 3.574347972, 1371.576262, 210.6773417, 82831.97458
), Gene5 = c(1872.328747, 235675.6461, 9.834667594, 583.1631957,
120.6931223, 75874.49936), Gene6 = c(1675.724728, 35931.1852,
9.91026361, 1634.038443, 58.04818134, 23502.78972), Gene7 = c(3775.885073,
169672.9921, 5.41305941, 929.2125312, 97.72621248, 46023.7009
), Gene8 = c(5015.202216, 137455.0032, 2.995124554, 1113.882634,
83.17636201, 14048.19237), Gene9 = c(883.5716868, 45920.44167,
6.399646876, 892.313155, 117.1104906, 10825.47974), Gene10 = c(1607.790858,
146627.0588, 1.967559425, 1237.299298, 90.8941744, 32747.04713
), Gene11 = c(2345.478241, 91047.57303, 12.33867961, 663.576224,
384.5839119, 6692.728154), Gene12 = c(2772.362496, 15511.96753,
15.64843017, 4143.085461, 169.545757, 22484.03574), Gene13 = c(4131.51741,
48601.7059, 21.66175797, 2250.0628, 316.0677196, 16612.6508),
Gene14 = c(1252.440598, 54794.36695, 2.925615978, 708.0342528,
211.822519, 14021.28425)), row.names = c(NA, 6L), class = "data.frame")
I read about the statistics behind PCA a long time ago. It basically generates df*t(df). If it is transposed, the whole gene can get 14 PCs. Alternatively, you can do this in 6 samples. Didn't know you were interested in comparing genes or diseases?
Now, just change the disease column.
df1 <- df[,-1]
prcomp(df1, scale=TRUE)
Standard deviations (1, .., p=6): [1] 3.535100e+00 1.197773e+00 2.341240e-01 1.164908e-01 4.575297e-03 1.341079e-16 Rotation (n x k) = (14 x 6): PC1 PC2 PC3 PC4 PC5 PC6 Gene1 0.2599087 -0.28372577 0.85593436 0.105415154 0.22237406 -0.03275210 Gene2 0.2217516 0.51374227 0.19750686 0.587702793 -0.28955565 -0.05771744 Gene3 0.2658620 0.28239514 -0.18315564 0.178995551 0.20110401 0.55717953 Gene4 0.2793220 0.12737931 -0.15464656 0.169556323 0.07195134 -0.18818707 Gene5 0.2818335 -0.06260368 -0.17524380 0.066508759 0.19418520 -0.23609300 Gene6 0.2753095 0.19146854 -0.05941163 -0.009035774 0.27267048 -0.02872877 Gene7 0.2804052 -0.10775145 -0.11519116 0.035882068 0.20239393 -0.18183889 Gene8 0.2697234 -0.25139746 -0.04378742 -0.067463492 0.21022545 0.41231175 Gene9 0.2787618 -0.13908918 -0.13407274 -0.103917770 -0.05348686 0.14863994 Gene10 0.2782801 -0.14660837 -0.15870274 -0.039789645 0.20346369 -0.24515625 Gene11 0.2672274 -0.27313184 -0.09102046 -0.087085239 -0.41532869 0.18704109 Gene12 0.2089616 0.55676419 0.20707121 -0.730763260 0.02663728 -0.08109933 Gene13 0.2819621 -0.06202963 0.11256511 -0.133063115 -0.54206507 0.22546812 Gene14 0.2797161 -0.12219139 -0.12034622 -0.025956127 -0.33551757 -0.46422864
For PC across diseases, transpose your matrix:
df2 <- t(df)
colnames(df2) <- df$Disease
df3 <- df2[-1,]
prcomp(df3,scale=TRUE)
Standard deviations (1, .., p=6): [1] 1.3881805 1.3086775 1.0332001 0.8999575 0.5686278 0.3994426 Rotation (n x k) = (6 x 6): PC1 PC2 PC3 PC4 PC5 PC6 1 -0.1761229 -0.4084611 0.42521515 -0.7364243 0.2424742 -0.14218944 1 0.5843042 0.3073572 -0.04722848 -0.2941286 0.4419985 0.52916471 0 -0.5145338 0.4178422 0.04746496 -0.3238892 -0.4356411 0.51353922 0 -0.4042091 0.3652627 0.47093069 0.3634759 0.5908253 -0.01526729 2 -0.3531166 0.1750554 -0.74358399 -0.2618183 0.3973516 -0.25555829 2 0.2734007 0.6324854 0.20003938 -0.2561246 -0.2215796 -0.60868811