Tutorial
Installing SCOT
SCOT is developed using Python 3. It depends on a few Python packages, namely numpy, cython, scipy, sklearn, matplotlib, and POT.
Installing Requirements: After you clone our repository locally, you can install these dependencies using the requirements.txt file.
If you are using pip, you can do so by running pip3 install -r requirements.txt or python3 -m pip install -r requirements.txt in your terminal.
If you are using conda, you can do so with the command conda install --file requirements.txt
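After installing, a quick way to confirm that the packages are visible to your Python interpreter is to ask the import system for each one. The sketch below is only an illustration: the list of import names is an assumption (a package's import name can differ from its install name, for instance POT is imported as ot), and missing_packages is a hypothetical helper, not part of SCOT.

```python
import importlib.util

# Import names are assumptions: a package's import name can differ from its
# install name (for example, POT is imported as "ot"). Extend as needed.
SCOT_IMPORT_NAMES = ["numpy", "Cython", "scipy", "sklearn", "ot"]


def missing_packages(import_names):
    """Return the names in import_names that the import system cannot find."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(SCOT_IMPORT_NAMES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages were found.")
```

If anything is reported as missing, re-run the install command for your package manager inside the same environment you use to run your scripts.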
Getting SCOT: You can clone the SCOT repository locally in one of two ways:
1. If you use git, by running git clone https://github.com/rsinghlab/SCOT.git in your terminal, or
2. By navigating to our GitHub repository, clicking the green Code button with the download icon, selecting the Download ZIP option, and then extracting the downloaded compressed folder.
Running SCOT
Once you have cloned the SCOT repository and installed the requirements, you can use it on your own datasets by importing SCOT in a Python script:
from scot import SCOT
Note that if your Python script lives elsewhere, you need to add the path to your local copy of SCOT (the folder containing scot.py) using sys. Example:
import sys
sys.path.insert(1, '/path/to/SCOT')
from scot import SCOT
SCOT expects datasets to be numpy arrays. If your data is in text format, you can read it in using the numpy.genfromtxt() or numpy.loadtxt() functions. Example:
import numpy as np
domain1 = np.genfromtxt("path_to_data_file.txt", delimiter="\t")  # Change the delimiter according to your text file
domain2 = np.loadtxt("path_to_data_file2.txt", delimiter="\t")  # Same, but with loadtxt
# genfromtxt gives a few more options when loading, e.g. dealing with missing values.
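To illustrate the missing-value handling mentioned above, here is a small sketch that reads an in-memory table instead of a file. The data and the comma delimiter are made up for the example; with your own files you would pass the file path and the delimiter your file actually uses.

```python
import io

import numpy as np

# A small comma-separated table with one empty field in the second row.
text = "1.0,2.0,3.0\n4.0,,6.0\n"

# By default, genfromtxt marks the empty field as NaN.
with_nan = np.genfromtxt(io.StringIO(text), delimiter=",")

# filling_values substitutes a chosen number for missing entries instead.
filled = np.genfromtxt(io.StringIO(text), delimiter=",", filling_values=0.0)

print(with_nan)
print(filled)
```

Whether NaN, zero, or some other fill value is appropriate depends on your data, so treat the choice of 0.0 here as a placeholder.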
If your data is in .mtx format, which is a common format for single-cell RNA sequencing datasets, you can turn it into numpy arrays with the Python package scanpy. Example:
import scanpy as sc
my_dataset=sc.read_mtx("datasetFilename.counts.mtx")
my_dataset_npy=my_dataset.X.toarray()
Please make sure that the rows in your data matrix/numpy array correspond to samples and the columns correspond to genomic features (transpose your matrix with numpy.transpose if needed).
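As a minimal sketch of the orientation check described above, the helper below transposes a matrix only when you tell it that samples are stored in columns. The function name to_samples_by_features and the toy data are hypothetical and only serve to show the shapes involved.

```python
import numpy as np


def to_samples_by_features(matrix, samples_in_rows):
    """Return the matrix with samples in rows and features in columns."""
    matrix = np.asarray(matrix)
    return matrix if samples_in_rows else np.transpose(matrix)


# Hypothetical example: 5 genes (rows) measured in 3 cells (columns).
genes_by_cells = np.arange(15).reshape(5, 3)
cells_by_genes = to_samples_by_features(genes_by_cells, samples_in_rows=False)
print(cells_by_genes.shape)  # (3, 5)
```

Many single-cell file formats store genes in rows, so it is worth printing the shape of each array once before alignment.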
Once you have read in the datasets, you can initialize SCOT and then run the alignment algorithm, which returns either the aligned datasets or the cell-to-cell correspondence matrix, depending on your specification:
import numpy as np
from scot import SCOT
scot_aligner = SCOT(domain1, domain2)
k = 50  # a hyperparameter of the model: the number of neighbors used in the kNN graphs constructed for cells based on sequencing data correlations
e = 1e-3  # another hyperparameter of the model: the coefficient of the entropic regularization term
normalize = True  # whether to normalize the datasets before alignment
aligned_domain1, aligned_domain2 = scot_aligner.align(k=k, e=e, normalize=normalize)
Please take a look at the examples page for Python scripts demonstrating the use of SCOT to align datasets.
Choosing hyperparameters
There are two required hyperparameters for performing alignment with SCOT:
| Parameter | Description | Default Value | Recommended Range to Try |
| --- | --- | --- | --- |
| k | Number of neighbors to consider in kNN graphs | 50 | [20 – n/5], where n is the number of samples (cells) in the smallest dataset |
| e | Coefficient of the entropic regularization term in the objective function of the OT formulation | 1e-3 | [5e-4 – 1e-1] |
In general, we have found that the algorithm is fairly robust to the choice of k, and that the parameter e makes a larger difference. Larger values of e disperse the correspondence probabilities across more samples. If you expect to find 1-to-1 correspondences between samples, err towards smaller values of e.
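If you want to try several values within the recommended ranges, one simple option is to space k evenly and e logarithmically. The sketch below assumes the ranges 20 to n/5 for k and 5e-4 to 1e-1 for e; the helper names, the number of candidates, and the spacing are illustrative choices, not something SCOT prescribes.

```python
import math


def candidate_k_values(n_smallest, num=5, k_min=20):
    """Evenly spaced integer k values between k_min and n_smallest // 5."""
    k_max = max(k_min, n_smallest // 5)
    if num < 2 or k_max == k_min:
        return [k_min]
    step = (k_max - k_min) / (num - 1)
    return sorted({round(k_min + i * step) for i in range(num)})


def candidate_e_values(e_min=5e-4, e_max=1e-1, num=5):
    """Log-spaced e values between e_min and e_max (both included)."""
    if num < 2:
        return [e_min]
    log_min, log_max = math.log10(e_min), math.log10(e_max)
    step = (log_max - log_min) / (num - 1)
    return [10 ** (log_min + i * step) for i in range(num)]


# Example: the smaller of the two datasets has 1000 cells.
print(candidate_k_values(1000))
print(candidate_e_values())
```

Log spacing is used for e because its recommended range spans several orders of magnitude, so evenly spaced values would mostly sample the upper end.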
If you are not sure which hyperparameters to set while running SCOT alignment, you have two options:
1. If you have some validation data about the cell-to-cell correspondences between the two domains in your dataset, you can use these for hyperparameter tuning. For this, take a look at the hyperparameter tuning example script.
2. If you don’t have any validation data on correspondences, no worries! You can use the unsupervised hyperparameter tuning heuristic, where we use the Gromov-Wasserstein distance as a proxy for graph distances to check alignment quality as we sweep through different hyperparameter combinations. This procedure can take some time because it iterates over multiple hyperparameter combinations.
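The sweep in the second option can be pictured as a plain grid search. The sketch below is not SCOT's built-in heuristic: it assumes that a lower Gromov-Wasserstein distance indicates a better alignment, and it takes a hypothetical callback gw_distance_for(k, e) that you would write yourself (for example, by running the alignment and reading the resulting distance). A toy scoring function stands in for real alignment runs here.

```python
from itertools import product


def pick_best_hyperparameters(k_values, e_values, gw_distance_for):
    """Return ((k, e), distance) for the combination with the smallest distance.

    gw_distance_for is a user-supplied callback (hypothetical); it may return
    None for a run that failed, in which case that combination is skipped.
    """
    best_params, best_distance = None, float("inf")
    for k, e in product(k_values, e_values):
        distance = gw_distance_for(k, e)
        if distance is None:
            continue
        if distance < best_distance:
            best_params, best_distance = (k, e), distance
    return best_params, best_distance


def toy_distance(k, e):
    """Stand-in score for illustration only; not a real alignment run."""
    return abs(k - 50) / 100 + abs(e - 1e-3)


best, score = pick_best_hyperparameters([20, 50, 100], [1e-3, 1e-2], toy_distance)
print(best, score)
```

Because every combination triggers a full alignment in practice, the run time grows with the product of the two candidate lists, which is why the sweep can be slow.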
Optional parameters for SCOT alignment
align(self, k, e, balanced=True, rho=1e-3, verbose=True, normalize=True, norm="l2", XontoY=True):
| Parameter | Description | Default Value | Notes |
| --- | --- | --- | --- |
| balanced | Determines whether to perform balanced or unbalanced optimal transport | (boolean) True | By default, we perform balanced transport. However, if you have reason to believe there will be severe underrepresentation of at least one cell type in one of the domains in comparison to the other, setting this to False will yield better alignments. Otherwise, keep it at True. |
| rho | Coefficient of the Kullback-Leibler relaxation term in the unbalanced OT formulation | 5e-2 | Only needed if you are performing unbalanced OT. |
| verbose | Determines whether to print transport progress (loss over iterations) while optimizing the alignment | (boolean) True | |
| normalize | Determines whether to normalize the datasets before performing alignment on them | (boolean) True | Empirically, we have found that normalization slightly helps the resulting alignment quality, so we suggest you keep this True. |
| norm | Defines what sort of normalization will be applied to the datasets. Normalization is always performed column-wise. | l2 | Empirically, we have found that the best alignment results are obtained when the preprocessed (via a dimensionality reduction scheme) real-world sequencing datasets are l2 normalized first. Other options: zscore, l1, and max. |
| XontoY | Sets the direction of the barycentric projection | (boolean) True | The direction of the barycentric projection makes very little difference in the quality of the resulting alignment. Generally, it is advisable to project onto the dataset with the more defined clusters. |
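To make the XontoY option more concrete, here is a sketch of a barycentric projection computed from a coupling matrix. It follows a common definition (each projected sample is the coupling-weighted average of the samples in the other domain) and is an assumption about the general technique, not a copy of SCOT's internal code. The function name and the toy matrices are hypothetical.

```python
import numpy as np


def barycentric_projection(coupling, X, Y, XontoY=True):
    """Project one domain onto the other using a coupling matrix.

    coupling has shape (n_x, n_y); X has n_x rows and Y has n_y rows.
    Returns the pair (X_aligned, Y_aligned).
    """
    coupling = np.asarray(coupling, dtype=float)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if XontoY:
        # Each X sample becomes a weighted average of Y samples.
        weights = coupling.sum(axis=1)[:, None]
        return (coupling @ Y) / weights, Y
    # Otherwise each Y sample becomes a weighted average of X samples.
    weights = coupling.sum(axis=0)[:, None]
    return X, (coupling.T @ X) / weights


coupling = np.array([[0.25, 0.25], [0.0, 0.5]])
X = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
Y = np.array([[0.0, 0.0], [2.0, 4.0]])
X_aligned, Y_aligned = barycentric_projection(coupling, X, Y, XontoY=True)
print(X_aligned)
```

After projection, both returned matrices live in the feature space of the domain that was projected onto, so they can be compared or clustered together.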
Attributes of the SCOT aligner you can access:
The align function of SCOT returns the two aligned matrices. However, there is additional information you can access:
scot.coupling yields the probabilistic correspondence (coupling) matrix. The rows of the matrix correspond to the samples in the first domain (X) and the columns correspond to the samples in the second domain (Y). The order of the samples is the same as the order in the input datasets. You can use these correspondence probabilities in your downstream analyses.
scot.gwdist yields the Gromov-Wasserstein distance between the aligned datasets. This is also used internally as a proxy for alignment quality when performing fully unsupervised alignment.
scot.flag tells whether the optimization procedure for optimal transport has converged. The aligner notifies the user with a printed error message if convergence has failed, but you can also check with this flag. If it has not converged (returns False), you might need to set the parameter e to a higher value.
scot.Xgraph holds the adjacency matrix of the kNN graph built for the first domain.
scot.ygraph holds the adjacency matrix of the kNN graph built for the second domain.
scot.Cx holds the intra-domain distance matrix for the first domain, computed from shortest distances on the kNN graph.
scot.Cy holds the intra-domain distance matrix for the second domain, computed from shortest distances on the kNN graph.
scot.p corresponds to the marginal probability distribution for the samples in the first domain. We use a uniform distribution, treating this as a vector of empirical probabilities for each sample. However, if you have some prior information on the marginal probabilities, please change this after initializing SCOT and before running the alignment.
scot.q corresponds to the marginal probability distribution for the samples in the second domain. We use a uniform distribution, treating this as a vector of empirical probabilities for each sample. However, if you have some prior information on the marginal probabilities, please change this after initializing SCOT and before running the alignment.
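As an illustration of how the coupling matrix and the marginals might be used downstream, the sketch below works on small hand-made arrays rather than a real SCOT run. The helper names are hypothetical: most_likely_matches picks, for each sample in the first domain, the sample in the second domain with the highest correspondence probability, and marginal_from_weights turns arbitrary non-negative prior weights into a probability vector of the kind you could assign to p or q.

```python
import numpy as np


def most_likely_matches(coupling):
    """For each row (domain 1 sample), index of the best-matching column (domain 2 sample)."""
    return np.argmax(np.asarray(coupling, dtype=float), axis=1)


def marginal_from_weights(weights):
    """Normalize non-negative weights so that they sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return weights / weights.sum()


# Toy coupling matrix: 3 samples in domain 1, 2 samples in domain 2.
toy_coupling = np.array([[0.05, 0.30], [0.25, 0.10], [0.20, 0.10]])
matches = most_likely_matches(toy_coupling)
print(matches)

# Toy prior: the third sample is believed to be twice as frequent.
prior = marginal_from_weights([1, 1, 2])
print(prior)
```

Taking the row-wise maximum discards the uncertainty encoded in the coupling, so it is most meaningful when you expect close to 1-to-1 correspondences.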