--- title: 【迁移】使用limma包进行差异基因分析 urlname: shi-yong-limma-bao-jin-hang-cha-yi-ji-yin-fen-xi date: 2022-07-30 19:13:36 index_img: https://api.limour.top/randomImg?d=2022-07-30 19:13:36 tags: ['DEG', 'limma'] --- 虽然现在已经是高通量测序的时代,大家基本都是从counts矩阵出发,使用DESeq2进行[差异表达分析](/cong-cha-yi-ji-yin-dao-RRA-ju-he),但是[GEO](https://www.ncbi.nlm.nih.gov/geo/)和[ArrayExpress](https://www.ebi.ac.uk/arrayexpress/)上的仍有海量且持续更新的芯片数据,有时候也不可避免遇到一些FPKM格式乃至已经进行了z-score转换的数据,对于这些数据的分析,我们可以认为其在适当变换下(log2FPKM),满足正态分布,那么仍可以使用limma直接进行分析。下面博主以[E-MEXP-1422](/oligo-GEO-ArrayExpress-xin-pian-shu-ju-chu-li)为例,写一份分析代码的demo。 ```R require(limma) f_DE_limma <- function(cts_bb, rowInfo, ControlN, TreatN, rm.NA=T, trend=T){ # trend 表示先验方差是否与基因表达值的大小相关,False表示其为常数 cts_b <- cts_bb[,c(ControlN, TreatN)] conditions <- c(rep("Control",length(ControlN)), rep("Treat",length(TreatN))) design <- model.matrix(~0+factor(conditions)) colnames(design) <- levels(factor(conditions)) rownames(design) <- colnames(cts_b) print(design) contrast.matrix <- makeContrasts('Treat-Control', levels = design) print(contrast.matrix) fit <- lmFit(cts_b, design) # 拟合线性模型 fit2 <- contrasts.fit(fit, contrast.matrix) # 计算拟合系数和标准误差 fit2 <- eBayes(fit2, trend=trend) # 通过经验贝叶斯方法估计统计量和logFC值 tempOutput <- topTable(fit2, coef=1, n=Inf) if(!is.null(rowInfo)){tempOutput <- cbind(rowInfo[rownames(tempOutput),], tempOutput)} if(rm.NA){tempOutput <- na.omit(tempOutput)} tempOutput } # 经过 oligo::rma 标准化后提取出来的表达矩阵 data.exprs # SDRF <- read.delim('E-MEXP-1422.sdrf.txt') # 从sdrf文件可知 AF15、AF16为PROX1 siRNA组 # AF6、AF14为GFP siRNA组 Ct1 <- c('AF6.CEL', 'AF14.CEL') Tt1 <- c('AF15.CEL', 'AF16.CEL') f_DE_limma(data.exprs, NULL, Ct1, Tt1, F) ``` ## 蛋白质组学TCPA数据集 ### 获取数据 * 进入[TCPA的下载页面](https://tcpaportal.org/tcpa/download.html)选择感兴趣的L4数据 * unzip TCGA-PRAD-L4.zip ### 清洗数据 [f\_dedup\_IQR](/oligo-GEO-ArrayExpress-xin-pian-shu-ju-chu-li) ```R tcpa <- read.csv('tmp/TCGA-PRAD-L4.csv') type <- as.numeric(substr(tcpa$Sample_ID, 14, 15)) tcpa <- subset(tcpa, type < 10) # tp rowNa <- substr(tcpa$Sample_ID,1, 12) tcpa <- f_dedup_IQR(tcpa[-(1:4)],rowNa) tcpa ``` ## TCGAbiolinks下载蛋白质组数据 之前通过tcpa下载过蛋白数据],而[TCGAbiolinks也有下载蛋白质组学数据的示例](https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/download_prepare.html),后者看上去更全面一点。 ### 下载数据 ```R library(TCGAbiolinks) query.rppa <- GDCquery( project = "TCGA-PRAD", data.category = "Proteome Profiling", data.type = "Protein Expression Quantification" ) GDCdownload(query.rppa) rppa <- GDCprepare(query.rppa) saveRDS(rppa, 'PRAD_rppa.rds') ``` ### 清洗数据 ```R pMiss <- function(x){round(sum(is.na(x))/length(x),3)} rppa <- rppa[apply(rppa, 1, pMiss) < 0.05, ] rppa <- rppa[, apply(rppa, 2, pMiss) < 0.05] sum(is.na(rppa)) rowInfo <- rppa[1:5] rppa <- rppa[-(1:5)] rppa <- rppa[, substr(colnames(rppa), 14, 16) == "01A"] colnames(rppa) <- substr(colnames(rppa),1,12) group <- readRDS('../fig5/tcga.predict.rds') gRow <- intersect(colnames(rppa), rownames(group)) group <- group[gRow,] rppa <- rppa[, colnames(rppa) %in% gRow] Ct1 <- rownames(group)[group$group == 'Low Risk'] Tt1 <- rownames(group)[group$group == 'High Risk'] ``` ### 计算差异蛋白 ```R r1 <- f_DE_limma(rppa, rowInfo, Ct1, Tt1, trend=F) rownames(rppa) <- rowInfo$AGID save(r1, rppa, file = 'PRAD_TCPA_DE.rdata') ```