Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about Transition Matrix and Initial States #1198

Open
jennylsmith opened this issue May 10, 2024 · 1 comment
Open

Question about Transition Matrix and Initial States #1198

jennylsmith opened this issue May 10, 2024 · 1 comment
Assignees
Labels
question Further information is requested

Comments

@jennylsmith
Copy link

jennylsmith commented May 10, 2024

Thank you for developing a great suite of tools to investigate cell fates in developmental biology!

Our Dataset

I am using CellRank and Moscot to identify and map cellular fates of an iPSC motor neuron (iPSC-MN) scRNA-seq (10x Genomics) dataset collected across 15 time-points. We have 86,954cells for which we annotated the cell types using an SCVI/SCANVI model trained on an integrated primary spinal cord/motor cortex scRNA-seq reference.

The dataset was analyzed using MOSCOT to generate a TemporalProblem object, which initialized the RealTimeKernel to compute the cell transition matrix. The RealTimeKernel was then further analyzed using GPCCA to identify terminal/initial states and driver genes.

Demo Dataset

I have also run the tutorial code for RealTimeKernel (500_real_time.ipynb) using the scRNA-seq dataset of MEF reprogramming to produce iPSCs (cr.datasets.reprogramming_schiebinger(subset_to_serum=True)). Then, I input the results from Schiebinger dataset into the tutorial for estimators to identify the terminal and initial states (600_initial_terminal.ipynb). This identified 5 macrostates and selected the same 30 cells as the initial state and terminal state (IPS cell type).

Question about Transition Matrix

The output of initial and terminal state estimation using GPCCA in CellRank to not be quite what we expected, and I am hoping to clarify some aspects.

The GPCCA estimator for our RealTimeKernel is identifying 14 macrostates ( that are then defined into 7 terminal states and 1 initial state using the automatic method). The single_flow plot does show some expected transition from progenitor to the other cell types, such as motor neurons, but only very very small proportion appear to be transitioning.

The course grained transition matrix suggests that macrostates essentially only transition into the same macrostate(the diagonal does not have a value less 0.96). I would interpret this as meaning “if a cell starts as a progenitor cell type, most likely the descendants will be a progenitor cell type”.

In this biological model system, we are “pushing” the iPSCs to differentiate into motor neurons, so I would expect a much larger transition from progenitor cells into the more terminal differentiated cells, especially MNs.

Do you have any thoughts on why we don’t seem to identify more cell transitions from progenitors into the other cell-types? Are there any parameters in compute_transition_matrix that we should be optimizing, such as conn_weight to preferentially add more weight to transcriptomic data?

Question about Initial States

Prediction of the terminal states using the automatic method seems to be working well. We are finding 7 terminal states and these include different motor neuron sub-clusters.

However, initial state predicted using the automatic method doesn’t seem to “make sense” in both our dataset and the provided schiebinger dataset. The selected 30 cells that represent the initial state in our iPSC dataset includes the progenitor cell type from day 35 to day 50, which are very late time points in the differentiation protocol.

Similarly, in the schiebinger dataset, the initial state cells are identified as the IPS cell type, but are from days 16.5 to 18.0, which again are very late time-points in the experiment.

How should we interpret the initial state class assignment? It appears that we are identifying progenitor cells as initial macrostates, but would have thought that day0 progenitor or IPS cell types would have been the initial state cells.

# Note: I am using CellRank in R/Rmarkdown with reticulate. 

# initial the temporal problem from adata 
tp <- mt$problems$TemporalProblem(ipsc_adata)

# check the object class
tp$adata$obs['day']$cat$categories
# Index([0, 3, 5, 8, 11, 15, 18, 21, 24, 27, 30, 35, 40, 45, 50], dtype='int64') # OK

# calculate marginals
tp <- tp$score_genes_for_marginals(
  gene_set_proliferation="human",
  gene_set_apoptosis="human"
)

tp <- tp$prepare(time_key="day",
                 policy='sequential', 
                 joint_attr = "X_scANVI")
tp <- tp$solve(epsilon=1e-3, tau_a=0.95, tau_b = 0.99, scale_cost="mean")

tmk <-
  cr$kernels$RealTimeKernel$from_moscot(problem =  tp,
                                        kwargs = list("time_key" = "day",
                                                      "policy"="sequential"))

tmk$compute_transition_matrix(self_transitions="all", 
                              threshold="auto_local", 
                              conn_weight = 0.2,
                              conn_kwargs = list(use_rep = "X_scANVI",
                                                 metric = "cosine",
                                                 random_state = seed_val,
                                                 n_neighbors = 15L))
# set-up the estimators
gpcca <- cr$estimators$GPCCA(tmk)

# fit the clusters 
# two key outputs: the soft assignment of cells to macrostates, and a matrix of transition probabilities among these macrostates
# the algorithm to scan this interval for the optimal number of macrostates 
gpcca$fit(cluster_key="subtype_standardized",
          n_states=c(6L, 20L)) 


# Predict the terminal states of cells/clusters
# values reflect our confidence in assigning a cell to a terminal state
gpcca$predict_terminal_states(allow_overlap = FALSE)

# predict the initial states of cells/clusters
# by default the algorithm is expecting there is only 1 initial state cell type - perhaps there could be > 1?
gpcca$predict_initial_states(allow_overlap = TRUE,
                             n_states = 1L)

Versions

cr$logging$print_versions()

cellrank==2.0.2 scanpy==1.9.5 anndata==0.10.5.post1 numpy==1.26.4 numba==0.59.0 scipy==1.11.4 pandas==2.1.0 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.14.0 python-igraph==0.10.8 scvelo==0.3.1 pygam==0.9.1 matplotlib==3.7.1 seaborn==0.13.2

@jennylsmith jennylsmith added the question Further information is requested label May 10, 2024
@Marius1311
Copy link
Collaborator

Hi @jennylsmith, thank you for your question, and sorry that it took me a bit longer to get back to you.

Transition matrix

I wouldn't worry too much about the coarse-grained transition matrix; we use it mostly to identify initial and terminal states. When you actually want to study transitions towards terminal states, I would move back to the single-cell level and compute fate probabilities.

Initial state identification

That's a bit weird indeed - you can check whether things look okay when you simulate random walks from day 0 and let them evolve according to the transition matrix (see our kernel tutorials). If does not look okay, i.e. if random walks don't progress nicely to later days, then something is wrong with your transition matrix. Otherwise, I'm not quite sure what's going wrong here.

In any case, are you interested in the initial states in your dataset? I suppose since this these neurons are generated from iPSCs, and we typically have good marker genes for iPSCs, there are easier ways to identify your initial states? If your terminal states look fine, and you're not actually interested in the initial states, because you already know where they are, I would proceed with the terminal states and compute fate probabilities to study differentiation towards terminal states and genes that regulate these transitions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants