---
title: 【学习】孟德尔随机化
urlname: Mendelian-Randomization
date: 2023-10-14 17:18:54
index_img: https://api.limour.top/randomImg?d=2023-10-14 17:18:54
tags: ['SNP', 'MR']
---
## MR定义
孟德尔随机化是一种基于全基因组测序数据(GWAS数据)，利用单核首酸多态性(SNPs)作为工具变量(IV)，用于揭示因果关系的新型流行病学方法，相较于队列研究等观察性研究，暴露在出生前便已确定，较少受到反向因果及混杂因素的影响，因而能够有效减少偏倚。
![RCT与MR的比较](https://img.limour.top/2023/10/14/652a61ab222a4.webp)
MR的核心是运用遗传学数据作为桥梁，来探索某一暴露和某一结局之间的因果关联。与RCT将参与者随机分配到试验组或对照组类似，MR研究基于影响危险因素的一个或多个等位基因，对参与基因进行"随机化"(自然的随机化)，以确定这些遗传变异的携带者与非携带者相比，是否具有不同的疾病发生风险，因此，孟德尔随机化可以被认为类似于["自然的随机对照试验"](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2458144)。[MR的相关术语](https://mr-dictionary.mrcieu.ac.uk/)
## 理论假设
1. the variant is associated with the exposure
2. the variant is not associated with the outcome via a confounding pathway
3. the variant does not affect the outcome directly, only possibly indirectly via the exposure

![孟德尔随机化框架的有向无环图表示](https://img.limour.top/2023/10/14/652a651b96f98.webp)
1. 关联性假设：变异与暴露有关
2. 独立性假设：变异与结果之间没有通过混杂途径相关
3. 排他性假设：变异不直接影响结果，只可能通过暴露途径间接影响

+ 关联性假设：p值，F统计量，R^2
+ 排他性假设：与结局的相关性计算时，p值要大于0.05
+ [MR-Egger](https://doi.org/10.1093/ije/dyv080)回归相比线性回归可以弱化对排他性假设的要求
## 适用范围
+ 不确定先有鸡还是先有蛋，比如，到底是抑郁导致肺癌还是肺癌导致了抑郁？
+ 暴露因素难以测量，或者花费昂贵。例如，水溶性维生素等生物标志物的检测金标准可能成本太高，大样本无法承受，或者空腹血糖的测量需要隔夜空腹，可能不现实。
+ 暴露与结局数据来自同一人群，且不存在或存在少量可接受范围内的样本重叠
## 配置环境
+ [基础编程环境](/-ji-lu--an-zhuang-sheng-xin-de-dai-ma-bian-xie-huan-jing)
+ [GitHub 下载加速](/-fu-ke-GitHub-wen-jian-jia-su)
+ [可能需要用到的加速服务](/-ji-lu-SOCKS5-zhuan-QUIC)
```bash
conda create -n MR -c conda-forge r-devtools -y
conda activate MR
conda install -c conda-forge r-irkernel -y
Rscript -e "IRkernel::installspec(name='MR', displayname='MR')"
# Rscript -e "usethis::edit_r_environ()" # 设置 GITHUB_PAT
# nano ~/.Renviron # MRCIEU 真是超喜欢GITHUB，要访问一万次 api.github.com
conda install -c conda-forge r-rmarkdown -y
conda install -c conda-forge r-meta -y
wget https://github.com/MRCIEU/TwoSampleMR/archive/refs/heads/master.zip -O TwoSampleMR.zip
Rscript -e "devtools::install_local('TwoSampleMR.zip')"
wget https://github.com/MRCIEU/MRInstruments/archive/refs/heads/master.zip -O MRInstruments.zip
Rscript -e "devtools::install_local('MRInstruments.zip')"
conda install -c conda-forge r-susier -y
conda install -c bioconda bioconductor-variantannotation -y
wget https://github.com/MRCIEU/gwasglue/archive/refs/heads/master.zip -O gwasglue.zip
Rscript -e "devtools::install_local('gwasglue.zip')"
# wget https://github.com/MRCIEU/genetics.binaRies/archive/refs/heads/master.zip -O genetics.binaRies.zip
# Rscript -e "devtools::install_local('genetics.binaRies.zip')"
conda install -c bioconda plink -y
# whereis plink # /opt/conda/envs/MR/bin/plink
Rscript -e 'install.packages("MendelianRandomization")'
```
## 数据来源
+ 精神病学基因组：[PGC](https://pgc.unc.edu/)
+ 社会科学遗传学：[SSGAC](https://www.thessgac.org/)
+ 大脑健康和疾病：[CTG](https://ctg.cncr.nl/software/summary_statistics)
+ MRCIEU汇总数据库：[IEU](https://gwas.mrcieu.ac.uk/)
+ GWAS研究目录：[NHGRI-EBI](https://www.ebi.ac.uk/gwas/search)
+ [自己分析出数据](/shi-yong-GATK-zhao-SNP)
+ [更多相关网站](https://od.limour.top/archives/GWAS/MR)
### 一些参考数据
```bash
wget http://fileserve.mrcieu.ac.uk/ld/1kg.v3.tgz
tar -zxvf 1kg.v3.tgz
# mkdir EUR && mv EUR.* EUR
```
### 示例结局数据
+ 浏览器下载 [ADHD2022_iPSYCH_deCODE_PGC.meta.gz](https://figshare.com/ndownloader/files/40036684)
+ [上传到服务器](/Rclone-bei-fen-VPS-shu-ju-dao-onedrive)
```bash
# zcat ADHD2022_iPSYCH_deCODE_PGC.meta.gz | head
CHR SNP BP A1 A2 FRQ_A_38691 FRQ_U_186843 INFO OR SE P Direction Nca Nco
8 rs62513865 101592213 C T 0.925 0.937 0.981 0.99631 0.0175 0.8325 +---+++0-++-+ 38691 186843
8 rs79643588 106973048 G A 0.91 0.917 1 1.00411 0.0159 0.7967 ++--++-+-+-++ 38691 186843
8 rs17396518 108690829 T G 0.561 0.577 0.998 0.99611 0.0096 0.6876 --++-++??-+-- 37367 184388
8 rs983166 108681675 A C 0.57 0.586 0.996 0.99491 0.0096 0.5956 --++-++++-+-- 38691 186843
8 rs28842593 103044620 T C 0.839 0.836 0.982 0.98314 0.0135 0.2081 ----++0+??--+ 37504 184525
8 rs7014597 104152280 G C 0.824 0.824 0.997 0.99950 0.0122 0.9679 +-++-+++++--- 38691 186843
8 rs3134156 100479917 T C 0.841 0.833 0.997 0.98866 0.0128 0.3762 -+----+--++-- 38691 186843
8 rs6980591 103144592 A C 0.783 0.79 1 1.01106 0.0108 0.3075 ++-++---+++++ 38691 186843
8 rs72670434 108166508 A T 0.642 0.623 0.983 1.00672 0.0103 0.5171 +++-+++--+++- 38691 186843
```
```txt
CHR Chromosome (hg19)
SNP Marker name
BP Base pair location (hg19)
A1 Reference allele for OR (may or may not be minor allele)
A2 Alternative allele
FRQ_A_38691 allele frequency of A1 in 38,691 ADHD cases
FRQ_U_186843 allele frequency of A1 in 38,691 controls
INFO Imputation information score (the reported imputation INFO score is a weighted average across the
cohorts contributing to the meta-analysis for that variant)
OR Odds ratio for the effect of the A1 allele
SE Standard error of the log(OR)
P P-value for association test in the meta-analysis
Direction direction of effect in the included cohorts
Nca number of cases with variant information
Nco number of controls with variant information
```
其中`SNP`,`Effect allele`,`Beta(OR)`,`SE`,`P`这五列是必须的。遇到没有提供EAF的数据，可以[匹配千人基因组数据的EAF](https://github.com/HaobinZhou/Get_MR)，`get_eaf_from_1000G`。
### 示例暴露数据
```bash
wget -c https://gwas.mrcieu.ac.uk/files/ieu-a-2/ieu-a-2.vcf.gz
```
```R
VCF_dat = VariantAnnotation::readVcf('~/upload/GWAS/IEU/ieu-a-2.vcf.gz')
exp_dat = gwasglue::gwasvcf_to_TwoSampleMR(vcf = VCF_dat)
saveRDS(file = 'ieu-a-2.exp_dat', exp_dat)
exp_dat = subset(exp_dat, pval.exposure < 5e-08) # 关联性假设
# 去除连锁不平衡
# exp_dat = TwoSampleMR::clump_data(dat = exp_dat, clump_kb = 10000, clump_r2 = 0.001) # MRCIEU太喜欢用cloud api了
fix_ld_clump_local = function (dat, tempfile, clump_kb, clump_r2, clump_p, bfile, plink_bin) {
    shell <- ifelse(Sys.info()["sysname"] == "Windows", "cmd", 
        "sh")
    write.table(data.frame(SNP = dat[["rsid"]], P = dat[["pval"]]), 
        file = tempfile, row.names = F, col.names = T, quote = F)
    fun2 <- paste0(shQuote(plink_bin, type = shell), " --bfile ", 
        shQuote(bfile, type = shell), " --clump ", shQuote(tempfile, 
            type = shell), " --clump-p1 ", clump_p, " --clump-r2 ", 
        clump_r2, " --clump-kb ", clump_kb, " --out ", shQuote(tempfile, 
            type = shell))
    print(fun2)
    system(fun2)
    res <- read.table(paste(tempfile, ".clumped", sep = ""), header = T)
    unlink(paste(tempfile, "*", sep = ""))
    y <- subset(dat, !dat[["rsid"]] %in% res[["SNP"]])
    if (nrow(y) > 0) {
        message("Removing ", length(y[["rsid"]]), " of ", nrow(dat), 
            " variants due to LD with other variants or absence from LD reference panel")
    }
    return(subset(dat, dat[["rsid"]] %in% res[["SNP"]]))
}
fuck = fix_ld_clump_local(
    dat = dplyr::tibble(rsid=exp_dat$SNP, pval=exp_dat$pval.exposure),
    tempfile = file.path(getwd(),'tmp.ld_clump.exp_dat'),
    clump_kb = 10000, clump_r2 = 0.001, clump_p = 1,
    # pop = "EUR", # Super-population. Options are "EUR", "SAS", "EAS", "AFR", "AMR"
    plink_bin = '/opt/conda/envs/MR/bin/plink', # 千万别用什么 genetics.binaRies::get_plink_binary()，他们自己编译的文件有问题
    bfile = "/home/jovyan/upload/GWAS/ld/EUR" # 前缀，不是文件夹也不是文件
)
exp_dat_clumped = exp_dat[exp_dat$SNP %in% fuck$rsid,]
saveRDS(file = 'ieu-a-2.exp_gwas', exp_dat_clumped)
```
## 获取暴露数据
### 自己的数据
```R
df_gwas <- data.frame(
  SNP = c("rs1", "rs2"),
  beta = c(1, 2),
  se = c(1, 2),
  effect_allele = c("A", "T")
)
head(df_gwas)
exp_dat <- TwoSampleMR::format_data(df_gwas, type = "exposure")
```
### gwas_catalog
```R
df_gwas <-
  subset(MRInstruments::gwas_catalog,
         grepl("Speliotes", Author) &
           Phenotype == "Body mass index")
head(df_gwas)
exp_dat <- TwoSampleMR::format_data(df_gwas)
```
### metab_qtls
```R
df_gwas <-
  subset(MRInstruments::metab_qtls,
        phenotype == "Ala"
    )
head(df_gwas)
exp_dat <- TwoSampleMR::format_metab_qtls(df_gwas)
```
### proteomic_qtls
```R
df_gwas <-
  subset(MRInstruments::proteomic_qtls,
        analyte == "ApoH"
    )
head(df_gwas)
exp_dat <- TwoSampleMR::format_proteomic_qtls(df_gwas)
```
### 某个基因
```R
df_gwas <-
  subset(MRInstruments::gtex_eqtl,
        gene_name == "IRAK1BP1" & tissue == "Adipose Subcutaneous"
    )
head(df_gwas)
exp_dat <- TwoSampleMR::format_gtex_eqtl(df_gwas)
```
### 某个性状的某个甲基化位点相关QTL
```R
df_gwas <-
  subset(MRInstruments::aries_mqtl,
        cpg == "cg25212131" & age == "Birth"
    )
head(df_gwas)
exp_dat <- TwoSampleMR::format_aries_mqtl(df_gwas)
```
### IEU的ID
```R
exp_gwas <- TwoSampleMR::extract_instruments(outcomes = 'ieu-a-2')
head(exp_gwas)
saveRDS(file = 'ieu-a-2.exp_gwas', exp_gwas) # 和自己从VCF开始经过clump得到的差不多
```
###  UK_Biobank
```R
hyperten_tophits <- ieugwasr::tophits(id="ukb-b-12493", clump=0)
hyperten_gwas <- dplyr::rename(hyperten_tophits, c(
  "SNP"="rsid",
  "effect_allele.exposure"="ea",
  "other_allele.exposure"="nea",
  "beta.exposure"="beta",
  "se.exposure"="se",
  "eaf.exposure"="eaf",
  "pval.exposure"="p",
  "N"="n"))
fuck = fix_ld_clump_local(
    dat = dplyr::tibble(rsid=hyperten_gwas$SNP, pval=hyperten_gwas$pval.exposure),
    tempfile = file.path(getwd(),'tmp.ld_clump.exp_dat'),
    clump_kb = 10000, clump_r2 = 0.001, clump_p = 1,
    # pop = "EUR", # Super-population. Options are "EUR", "SAS", "EAS", "AFR", "AMR"
    plink_bin = '/opt/conda/envs/MR/bin/plink', # 千万别用什么 genetics.binaRies::get_plink_binary()，他们自己编译的文件有问题
    bfile = "/home/jovyan/upload/GWAS/ld/EUR" # 前缀，不是文件夹也不是文件
)
exp_dat_clumped = hyperten_gwas[hyperten_gwas$SNP %in% fuck$rsid,]
MR_calc_r2_F(
    beta = exp_dat_clumped$beta.exposure, # Vector of Log odds ratio. beta = log(OR)
    eaf = exp_dat_clumped$eaf.exposure, # Vector of allele frequencies.
    N = exp_dat_clumped$N, # Array of sample sizes
    se = exp_dat_clumped$se.exposure # Vector of SE.
) # 取 F>10 的
```
## 计算统计效力
```R
# 分类变量
tmp_r2 =TwoSampleMR::get_r_from_lor(
    lor = exp_dat_clumped$beta.exposure, # Vector of Log odds ratio. beta = log(OR)
    af = exp_dat_clumped$eaf.exposure, # Vector of allele frequencies.
    ncase = exp_dat_clumped$ncase.exposure, # Vector of Number of cases. 
    ncontrol = exp_dat_clumped$ncontrol.exposure, # Vector of Number of controls. 
    prevalence = 1, # Vector of Disease prevalence in the population.
)
# 连续变量
tmp_r2 =TwoSampleMR::get_r_from_pn(
    p = exp_dat_clumped$pval.exposure, # Array of pvals
    n = exp_dat_clumped$samplesize.exposure # Array of sample sizes
)
```
```R
MR_calc_r2_F = function(beta, eaf, N, se){
    # https://doi.org/10.1038/s41467-020-14389-8
    # https://doi.org/10.1371/journal.pone.0120758
    r2 = (2 * (beta^2) * eaf * (1 - eaf)) /
            (2 * (beta^2) * eaf * (1 - eaf) +
            2 * N * eaf * (1 - eaf) * se^2)
    F = r2 * (N - 2) / (1 - r2)
    print(mean(F))
    return(dplyr::tibble(r2=r2, F=F))
}
MR_calc_r2_F(
    beta = exp_dat_clumped$beta.exposure, # Vector of Log odds ratio. beta = log(OR)
    eaf = exp_dat_clumped$eaf.exposure, # Vector of allele frequencies.
    N = exp_dat_clumped$samplesize.exposure, # Array of sample sizes
    se = exp_dat_clumped$se.exposure # Vector of SE.
) # 取 F>10 的
```
## 获取结局数据
### IEU
```R
out_gwas = TwoSampleMR::extract_outcome_data(snps = exp_gwas$SNP, outcomes = 'ieu-a-7')
```
### UK_Biobank
```R
anxiety_hyperten_liberal <- TwoSampleMR::extract_outcome_data(snps = exp_dat_clumped$SNP, outcomes = "ukb-b-11311")
```
### PGC的示例
```R
df_gwas = read.table(gzfile('~/upload/GWAS/PGC/ADHD2022_iPSYCH_deCODE_PGC.meta.gz'), header = T)
head(df_gwas)
df_gwas = df_gwas[df_gwas$SNP %in% exp_gwas$SNP,]
out_gwas = data.frame(
    SNP = df_gwas$SNP,
    chr = as.character(df_gwas$CHR),
    pos = df_gwas$BP,
    beta.outcome = log(df_gwas$OR),
    se.outcome = df_gwas$SE,
    samplesize.outcome = df_gwas$Nca + df_gwas$Nco,
    pval.outcome = df_gwas$P,
    eaf.outcome = with(df_gwas, (FRQ_A_38691*Nca+FRQ_U_186843*Nco)/(Nca+Nco)),
    effect_allele.outcome = df_gwas$A1,
    other_allele.outcome = df_gwas$A2,
    outcome = 'ADHD',
    id.outcome = 'ADHD2022_iPSYCH_deCODE_PGC'    
)
out_gwas = subset(out_gwas, pval.outcome > 5e-08) # 排他性假设
```

## 附加 代理SNP
一部分暴露的SNPs在结局中找不到，可以找和这部分SNPs连锁不平衡的SNPs来代替。相关网站：[snipa](https://snipa.org/snipa3/)
## Harmonization
+ 将Exposure-SNP及Outcome-SNP等位基因方向协同
+ 根据EAF大小，剔除不能判断方向的回文SNP
+ 剔除incompatible SNP
```R
dat <- TwoSampleMR::harmonise_data(
 exposure_dat = exp_gwas, 
 outcome_dat = out_gwas
)
```
## 附加 一键报告
```R
TwoSampleMR::mr_report(dat, output_type = "md")
```
## MR分析
### 回归分析
```R
TwoSampleMR::mr_method_list() # 查看mr支持的MR分析方法
mr_regression = TwoSampleMR::mr(dat, method_list = c('mr_ivw', 'mr_egger_regression', 'mr_weighted_median'))
mr_regression_or = TwoSampleMR::generate_odds_ratios(mr_res = mr_regression) # 分类变量
{pdf(file = 'MR.BMIvsADHD.plot.pdf', width = 6, height = 6); # 导出 PDF 开始
print(TwoSampleMR::mr_scatter_plot(mr_results = mr_regression, dat = dat)); # 返回的是一个ggplot2对象
dev.off()} # 导出 PDF 结束
```
![mr_scatter_plot](https://img.limour.top/2023/10/15/652bd26b9010d.webp)
### 异质性检测
+ 有异质性用随机效应模型`ivw`，无异质性用固定效应模型（也可以用随机效应模型，两者结果一致）
+ 异质性可能带来多效性，如果没有多效性，则可以说异质性没有带来多效性
```R
TwoSampleMR::mr_heterogeneity(dat) # ivw的 Q_pval < 0.05 则说明有异质性
heterogeneity_presso = TwoSampleMR::run_mr_presso(dat, NbDistribution = 3000) # NbDistribution越高分辨率越高，找不到离群的SNP时需要提高
heterogeneity_presso[[1]]$`MR-PRESSO results`$`Global Test`$Pvalue # < 0.05 说明有异质性
heterogeneity_presso[[1]]$`MR-PRESSO results`$`Distortion Test`$`Outliers Indices` # 显示离群的SNP，将其剔除后重新分析
```
### 水平多效性
+ P < 0.05 说明不满足独立性假设，建议放弃继续做这个课题
+ P < 0.05 拒绝了截距为0的假设，说明SNP效应为0时依然有影响(截距存在)，有其他因素在起作用
```R
TwoSampleMR::mr_pleiotropy_test(dat)
```
### 敏感性分析
+ Leave-one-out analysis
+ 所有结果都不应该存在跨过0的情况，否则说明结果不稳定，不再能说明因果关系
```R
mr_loo <- TwoSampleMR::mr_leaveoneout(dat)
{pdf(file = 'MR.BMIvsADHD.leaveoneout.plot.pdf', width = 6, height = 6); # 导出 PDF 开始
print(TwoSampleMR::mr_leaveoneout_plot(leaveoneout_results = mr_loo)); # 返回的是一个ggplot2对象
dev.off()} # 导出 PDF 结束
```
### 单SNP分析
+ 对每个暴露-结果组合进行多次分析，每次使用不同的单 SNP 进行分析
```R
mr_res_single <- TwoSampleMR::mr_singlesnp(dat)
TwoSampleMR::mr_funnel_plot(mr_res_single)
TwoSampleMR::mr_forest_plot(mr_res_single)
```
### 方向性检测
```R
TwoSampleMR::directionality_test(dat) # TRUE表示确实是暴露导致了结果
```
## 附加 稳健回归
```R
dat2 <- TwoSampleMR::dat_to_MRInput(dat)
mr_ivw_robust <- MendelianRandomization::mr_ivw(dat2[[1]], model= "default", # “random”指的就是随机效应模型，“fixed”指的是固定效应模型
                                             robust = TRUE, penalized = TRUE,correl = FALSE, # 参数penalized代表下调异常值的权重
                                             weights ="simple", psi = 0,distribution = "normal",alpha = 0.05)
```
## 附加 绘制森林图
+ [美化森林图](/Forest-plot-displays-the-results-of-regression-analysis)

## 附加 计算Power
+ [Calculating statistical power in Mendelian randomization studies](https://doi.org/10.1093/ije/dyt179)
+ [Power calculations for Mendelian Randomization](https://shiny.cnsgenomics.com/mRnd/)
+ Sample size: 结局总的样本量，不是暴露的样本量
+ K: 结局中病例的比例，case/(case+control)
+ OR: IVW的OR值，exp(beta)
+ R2: MR_calc_r2_F 计算得到的所有R2的sum