2022-10-04-【迁移】导出SingleR需要的数据.md 5.4 KB


title: 【迁移】导出SingleR需要的数据 urlname: -dao-chu-SingleR-xu-yao-de-shu-ju date: 2022-10-04 17:20:44

tags: SingleR

loom文件使用记录

随着单细胞数据量的增长,计算要求成指数增长,当数据量大于10万个细胞的时候,seurat包分析就显得非常有压力了,因为在实时内存中储存数据就变得非常困难,HDF5数据格式提供了高效的磁盘存储,而不是在内存中存储数据,这就将分析扩展到大规模数据集,甚至可以达到大于100万细胞的级别 ,Linnarson实验室开发了一种基于hdf5的数据结构,loom,可以方便地存储单细胞基因组数据集和元数据。他们发布了一个名为loompy的Python API来与loom文件交互,而loomR能基于R的与loom交互(Merlin_cd6c

安装补充包

下载数据

The Human Cell Atlas is an international collaborative consortium that charts the cell types in the healthy body, across time from development to adulthood, and eventually to old age. This enormous undertaking, larger even than the Human Genome Project, will transform our understanding of the 37.2 trillion cells in the human body.

The HCA Data Portal stores and provides single-cell data contributed by labs around the world. Anyone can contribute data, find data, or access community tools and applications.

  • wget 'xxx' -O ProstateCellAtlas-human-prostate-gland-10xv2.loom

类似的网站:CZ CELLxGENE;可以检索需要的数据集,下载rds格式的文件

  • curl -o local.rds "xxx"
  • mv local.rds cellxgene_Human_prostate.rds

使用记录

公共的集群是真难用,一堆包装半天装不上,磁盘IO慢的一批。。。。。

conda activate seurat
.libPaths('')
sce <- loomR::connect(filename = "./HumanCellAtlas/ProstateCellAtlas/ProstateCellAtlas-human-prostate-gland-10xv2.loom", mode = "r", skip.validate = T)
mat <- sce[["matrix"]][,]
gene <- sce$row.attrs$Gene[]
# barcode <- sce$col.attrs$CellID[]
barcode <- sce$col.attrs$cell_names[]
mat <- t(mat)
colnames(mat)= barcode
rownames(mat)= gene
sce$close_all()
sce <- Seurat::CreateSeuratObject(counts = mat, project = 'prostate', min.cells = 3, min.features = 200)
rm(mat)
gc()
#             used   (Mb)  gc trigger     (Mb)    max used     (Mb)
# Ncells   3611333  192.9     6136950    327.8     6136950    327.8
# Vcells 287451723 2193.1 45703419996 348689.5 57126196661 435838.3
sce
# An object of class Seurat 
# 39879 features across 128673 samples within 1 assay 
# Active assay: RNA (39879 features, 0 variable features)

构造Pseudo-bulk的表达量矩阵加速SingleR运行

The Human Cell Atlas 和 CZ CELLxGENE 的单细胞数据集有些metadata里有细胞类型注释。我们使用前面下载的数据集来构建一个SingleR的参考集。

sce <- readRDS('~/HumanCellAtlas/ProstateCellAtlas/cellxgene_Human_prostate.rds')
table(sce$tissue)
sce <- SeuratObject::UpdateSeuratObject(sce)
saveRDS(sce@meta.data, 'meta.rds')
sce@assays$RNA@counts
umi <- sce@assays$RNA@counts
sce <- Seurat::CreateSeuratObject(counts = umi, project = 'prostate', min.cells = 3, min.features = 200)
sce@meta.data <- readRDS('meta.rds')
all(colnames(sce) == rownames(sce@meta.data))
sce <- subset(sce, tissue == 'prostate gland')
gc()
table(sce$`Broad cell type`)
table(sce$`Granular cell type`)
table(sce$`Granular cell type`)
table(sce$`Tissue composition`)
table(sce$`Cell types level 2`)
table(sce$`Cell types level 3`)
sce$cell_t_1 <- droplevels(sce$`Tissue composition`)
all(as.character(sce$cell_t_1) == as.character(sce$`Tissue composition`))
sce$cell_t_2 <- droplevels(sce$`Cell types level 2`)
sce$cell_t_3 <- droplevels(sce$`Cell types level 3`)
sce <-  Seurat::NormalizeData(sce)
ref <- list()
ref$cell_t_1 <- Seurat::AverageExpression(sce,
                          group.by = "cell_t_1",
                          assays = "RNA")$RNA
ref$cell_t_2 <- Seurat::AverageExpression(sce,
                          group.by = "cell_t_2",
                          assays = "RNA")$RNA
ref$cell_t_3 <- Seurat::AverageExpression(sce,
                          group.by = "cell_t_3",
                          assays = "RNA")$RNA
saveRDS(ref, 'singleR_prostate.rds')

导出SingleR需要的数据

tp_samples <- list.files('~/GEO/GSE193337')
tp_dir <- file.path('~/GEO/GSE193337', tp_samples)
names(tp_dir) <- tp_samples
counts <- Seurat::Read10X(data.dir = tp_dir)
sce <- Seurat::CreateSeuratObject(counts, project = 'prostate',
                            min.cells = 3, min.features = 200)
rm(counts)
gc()
table(sce$orig.ident)
sce <- Seurat::NormalizeData(sce)
logumi <- Seurat::GetAssayData(sce, slot="data")
saveRDS(logumi, 'logumi.rds')