Last updated on March 19, 2024 pm
loom文件使用记录
随着单细胞数据量的增长,计算要求成指数增长,当数据量大于10万个细胞的时候,seurat包分析就显得非常有压力了,因为在实时内存中储存数据就变得非常困难,HDF5数据格式提供了高效的磁盘存储,而不是在内存中存储数据,这就将分析扩展到大规模数据集,甚至可以达到大于100万细胞的级别 ,Linnarson实验室开发了一种基于hdf5的数据结构,loom,可以方便地存储单细胞基因组数据集和元数据。他们发布了一个名为loompy的Python API来与loom文件交互,而loomR 能基于R的与loom交互(Merlin_cd6c )
安装补充包
下载数据
The Human Cell Atlas is an international collaborative consortium that charts the cell types in the healthy body, across time from development to adulthood, and eventually to old age. This enormous undertaking, larger even than the Human Genome Project, will transform our understanding of the 37.2 trillion cells in the human body.
The HCA Data Portal stores and provides single-cell data contributed by labs around the world. Anyone can contribute data, find data, or access community tools and applications.
wget 'xxx' -O ProstateCellAtlas-human-prostate-gland-10xv2.loom
类似的网站:CZ CELLxGENE ;可以检索需要的数据集 ,下载rds格式的文件
curl -o local.rds "xxx"
mv local.rds cellxgene_Human_prostate.rds
使用记录
公共的集群是真难用,一堆包装半天装不上,磁盘IO慢的一批。。。。。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 conda activate seurat .libPaths( '' ) sce <- loomR:: connect( filename = "./HumanCellAtlas/ProstateCellAtlas/ProstateCellAtlas-human-prostate-gland-10xv2.loom" , mode = "r" , skip.validate = T ) mat <- sce[[ "matrix" ] ] [ , ] gene <- sce$ row.attrs$ Gene[ ] barcode <- sce$ col.attrs$ cell_names[ ] mat <- t( mat) colnames( mat) = barcode rownames( mat) = gene sce$ close_all( ) sce <- Seurat:: CreateSeuratObject( counts = mat, project = 'prostate' , min.cells = 3 , min.features = 200 ) rm( mat) gc( ) sce
构造Pseudo-bulk的表达量矩阵加速SingleR运行
The Human Cell Atlas 和 CZ CELLxGENE 的单细胞数据集有些metadata里有细胞类型注释。我们使用前面下载的数据集来构建一个SingleR的参考集。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 sce <- readRDS( '~/HumanCellAtlas/ProstateCellAtlas/cellxgene_Human_prostate.rds' ) table( sce$ tissue) sce <- SeuratObject:: UpdateSeuratObject( sce) saveRDS( sce@ meta.data, 'meta.rds' ) sce@ assays$ RNA@ counts umi <- sce@ assays$ RNA@ counts sce <- Seurat:: CreateSeuratObject( counts = umi, project = 'prostate' , min.cells = 3 , min.features = 200 ) sce@ meta.data <- readRDS( 'meta.rds' ) all ( colnames( sce) == rownames( sce@ meta.data) ) sce <- subset( sce, tissue == 'prostate gland' ) gc( ) table( sce$ `Broad cell type`) table( sce$ `Granular cell type`) table( sce$ `Granular cell type`) table( sce$ `Tissue composition`) table( sce$ `Cell types level 2`) table( sce$ `Cell types level 3`) sce$ cell_t_1 <- droplevels( sce$ `Tissue composition`) all ( as.character ( sce$ cell_t_1) == as.character ( sce$ `Tissue composition`) ) sce$ cell_t_2 <- droplevels( sce$ `Cell types level 2`) sce$ cell_t_3 <- droplevels( sce$ `Cell types level 3`) sce <- Seurat:: NormalizeData( sce) ref <- list ( ) ref$ cell_t_1 <- Seurat:: AverageExpression( sce, group.by = "cell_t_1" , assays = "RNA" ) $ RNA ref$ cell_t_2 <- Seurat:: AverageExpression( sce, group.by = "cell_t_2" , assays = "RNA" ) $ RNA ref$ cell_t_3 <- Seurat:: AverageExpression( sce, group.by = "cell_t_3" , assays = "RNA" ) $ RNA saveRDS( ref, 'singleR_prostate.rds' )
导出SingleR需要的数据
1 2 3 4 5 6 7 8 9 10 11 12 tp_samples <- list.files( '~/GEO/GSE193337' ) tp_dir <- file.path( '~/GEO/GSE193337' , tp_samples) names ( tp_dir) <- tp_samples counts <- Seurat:: Read10X( data.dir = tp_dir) sce <- Seurat:: CreateSeuratObject( counts, project = 'prostate' , min.cells = 3 , min.features = 200 ) rm( counts) gc( ) table( sce$ orig.ident) sce <- Seurat:: NormalizeData( sce) logumi <- Seurat:: GetAssayData( sce, slot= "data" ) saveRDS( logumi, 'logumi.rds' )