--- title: "cellchatr: Fast CellChat Database Querying with Rust" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{cellchatr-workflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) cellchat_available <- requireNamespace("CellChat", quietly = TRUE) ``` ## Overview `cellchatr` accelerates three bottleneck steps in the CellChat workflow using Rust via extendr. This vignette shows a complete pipeline using real `CellChatDB.human` data. The three functions replace: | cellchatr | CellChat | Speedup | |---|---|---| | `rust_subset_genes()` | `subsetData()` | O(n+m) vs O(n×m) | | `rust_wilcoxon_filter()` | `identifyOverExpressedGenes()` | 183x faster | | `rust_match_lr_pairs()` | `identifyOverExpressedInteractions()` | O(m+r) vs O(m×r) | ## Setup ```{r setup} library(cellchatr) if (requireNamespace("CellChat", quietly = TRUE)) { library(CellChat) } else { message("CellChat not installed. Install with: devtools::install_github('jinworks/CellChat')") } ``` ## Step 1 — Load CellChatDB ```{r load-db, eval=cellchat_available} db <- CellChatDB.human cat("Total L-R interactions:", nrow(db$interaction), "\n") cat("Signaling pathways:", length(unique(db$interaction$pathway_name)), "\n") head(db$interaction[, c("interaction_name", "ligand", "receptor", "pathway_name")]) ``` ## Step 2 — Subset Expression Genes to CellChatDB (`rust_subset_genes`) In a real experiment you would use your scRNA-seq gene names here. We simulate 15,000 genes with real CellChatDB genes embedded. ```{r subset-genes, eval=cellchat_available} expr_genes <- unique(c( paste0("GENE", 1:14000), db$interaction$ligand, db$interaction$receptor )) cat("Total genes in expression matrix:", length(expr_genes), "\n") db_genes <- unique(c(db$interaction$ligand, db$interaction$receptor)) cat("Genes in CellChatDB:", length(db_genes), "\n") subset_genes <- rust_subset_genes(expr_genes, db_genes) cat("Genes present in both:", length(subset_genes), "\n") head(subset_genes) ``` ## Step 3 — Identify Overexpressed Genes (`rust_wilcoxon_filter`) Parallel Wilcoxon rank-sum test across all genes and cell types. Uses Rayon for multi-core parallelism — 183x faster than base R. ```{r wilcoxon-filter, eval=cellchat_available} set.seed(42) n_genes <- length(subset_genes) n_cells <- 300 labels <- c(rep("Fibroblast", 100), rep("Macrophage", 100), rep("Endothelial", 100)) counts <- matrix(rnorm(n_genes * n_cells, mean = 1, sd = 1), nrow = n_genes) counts[1:20, labels == "Fibroblast"] <- counts[1:20, labels == "Fibroblast"] + 5 overexpressed <- rust_wilcoxon_filter(counts, labels, subset_genes, 0.05) cat("Overexpressed genes found:", length(overexpressed), "\n") head(overexpressed) ``` ## Step 4 — Match to L-R Pairs (`rust_match_lr_pairs`) HashSet-based matching — O(m + r) vs R's O(m × r). ```{r match-lr, eval=cellchat_available} interactions <- rust_match_lr_pairs( overexpressed = overexpressed, lr_ligands = db$interaction$ligand, lr_receptors = db$interaction$receptor, lr_names = db$interaction$interaction_name ) cat("Matched L-R interactions:", length(interactions), "\n") matched <- db$interaction[ db$interaction$interaction_name %in% interactions, c("interaction_name", "ligand", "receptor", "pathway_name") ] sort(table(matched$pathway_name), decreasing = TRUE)[1:10] ``` ## Benchmark ```{r benchmark, message=FALSE} library(microbenchmark) set.seed(42) n_genes <- 2000 n_cells <- 200 labels_bm <- c(rep("A", 100), rep("B", 100)) gene_names <- paste0("GENE", 1:n_genes) counts_bm <- matrix(rnorm(n_genes * n_cells), nrow = n_genes) r_wilcoxon <- function(counts, labels, gene_names, pval_threshold) { cell_types <- unique(labels) overexpressed <- c() for (g in 1:nrow(counts)) { for (ct in cell_types) { group <- counts[g, labels == ct] other <- counts[g, labels != ct] p <- wilcox.test(group, other, alternative = "greater")$p.value if (p < pval_threshold) { overexpressed <- c(overexpressed, gene_names[g]) break } } } overexpressed } bm <- microbenchmark( rust = rust_wilcoxon_filter(counts_bm, labels_bm, gene_names, 0.05), base_r = r_wilcoxon(counts_bm, labels_bm, gene_names, 0.05), times = 3 ) print(bm) ``` ## Session Info ```{r session-info} sessionInfo() ```