
Tutorial

Installing SCOT

SCOT is developed in Python 3. It depends on a few Python packages, namely: numpy, cython, scipy, sklearn, matlab, and POT.

Installing Requirements: After you clone our repository locally, you can install these dependencies using the requirements.txt file.
If you are using pip, run pip3 install -r requirements.txt or python3 -m pip install -r requirements.txt in your terminal.
If you are using conda, run conda install --file requirements.txt instead.
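
After installation, you may want to confirm that the dependencies can be found by your Python interpreter. The snippet below is a minimal sketch: the helper name check_modules is hypothetical (it is not part of SCOT), and the import names assume that POT is imported as ot, scikit-learn as sklearn, and cython as Cython.

```python
import importlib.util


def check_modules(module_names):
    """Map each top-level module name to True if Python can locate it."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Assumed import names: POT -> "ot", scikit-learn -> "sklearn", cython -> "Cython".
status = check_modules(["numpy", "scipy", "sklearn", "ot", "Cython"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Any package reported as MISSING can be installed again with pip or conda before running SCOT.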

Getting SCOT: You can clone the SCOT repository locally in one of two ways:
1. If you use git, by running git clone https://github.com/rsinghlab/SCOT.git in your terminal, or
2. By navigating to our GitHub repository, clicking the green Code button with the download icon, selecting the Download ZIP option, and then extracting the downloaded compressed folder.

Running SCOT

Once you have cloned the SCOT repository and installed the requirements, you are ready to use it on your own datasets by importing SCOT in a Python script:
from scot import SCOT

Note that if your Python script lives outside the SCOT folder, you need to add the path of your local copy of SCOT (the folder that contains scot.py) to sys.path. Example:

import sys
sys.path.insert(1, '/path/to/SCOT')
from scot import SCOT

SCOT expects datasets to be numpy arrays. If your data is in text format, you can read it in with the numpy.genfromtxt() or numpy.loadtxt() functions. Example:

import numpy as np
domain1 = np.genfromtxt("path_to_data_file.txt", delimiter="\t")  # Change the delimiter to match your text file
domain2 = np.loadtxt("path_to_data_file2.txt", delimiter="\t")  # Same, but with loadtxt
# genfromtxt offers a few more options when loading, e.g. for handling missing values.
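
To make the remark about missing values concrete, here is a small sketch that reads an in-memory tab-separated table with one empty cell. The data and the choice of 0.0 as the fill value are made up for illustration; whether filling with zero is appropriate depends on your dataset.

```python
import io

import numpy as np

# A tiny tab-separated table; the middle value of the second row is missing.
text = "1.0\t2.0\t3.0\n4.0\t\t6.0\n"

# filling_values replaces missing entries (here with 0.0) instead of leaving NaN.
data = np.genfromtxt(io.StringIO(text), delimiter="\t", filling_values=0.0)
print(data.shape)
print(data)
```

With np.loadtxt, the same input would raise an error because of the empty field, which is the main reason to prefer genfromtxt for incomplete files.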

If your data is in the .mtx format, which is common for single-cell RNA sequencing datasets, you can convert it into a numpy array with the Python package scanpy. Example:

import scanpy as sc
my_dataset=sc.read_mtx("datasetFilename.counts.mtx")
my_dataset_npy=my_dataset.X.toarray()

Please make sure that the rows in your data matrix/numpy array correspond to samples and columns correspond to genomic features (and transpose your matrix with numpy.transpose if needed).
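
The orientation check can be sketched as a small helper. The function name as_samples_by_features and the idea of passing the known number of samples are assumptions for illustration; SCOT itself does not provide this helper.

```python
import numpy as np


def as_samples_by_features(matrix, n_samples):
    """Return a 2-D array with one row per sample, transposing if needed."""
    matrix = np.asarray(matrix)
    if matrix.ndim != 2:
        raise ValueError("expected a 2-D matrix")
    if matrix.shape[0] == n_samples:
        return matrix  # already samples x features
    if matrix.shape[1] == n_samples:
        return np.transpose(matrix)  # features x samples -> samples x features
    raise ValueError("neither dimension matches the number of samples")


# Hypothetical example: 3 cells measured on 5 features, stored features-first.
features_by_cells = np.zeros((5, 3))
fixed = as_samples_by_features(features_by_cells, n_samples=3)
print(fixed.shape)
```

If the matrix happens to be square, the helper cannot tell the two orientations apart and returns it unchanged, so check such cases by hand.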

Once you have read in the datasets, you can initialize SCOT and then run the alignment algorithm, which returns either the aligned datasets or the cell-to-cell correspondence matrix, depending on your specification:

import numpy as np
from scot import SCOT

scot_aligner = SCOT(domain1, domain2)
k = 50  # hyperparameter: number of neighbors in the kNN graphs built for the cells from sequencing data correlations
e = 1e-3  # hyperparameter: coefficient of the entropic regularization term
normalize = True  # normalize the datasets before aligning them
aligned_domain1, aligned_domain2 = scot_aligner.align(k=k, e=e, normalize=normalize)

Please take a look at the examples page for Python scripts demonstrating the use of SCOT to align datasets.

Choosing hyperparameters

There are two required hyperparameters for performing alignment with SCOT:

| Parameter | Description | Default Value | Recommended Range to Try |
| --- | --- | --- | --- |
| k | Number of neighbors to consider in kNN graphs | 50 | [20 – n/5], where n is the number of samples (cells) in the smaller dataset |
| e | Coefficient of the entropic regularization term in the objective function of the OT formulation | 1e-3 | [5e-4 – 1e-1] |
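
As a worked example of the recommended range for k, the sketch below lists candidate values between 20 and n/5. The helper name k_candidates and the step size are hypothetical choices for illustration, not part of SCOT.

```python
def k_candidates(n_cells_domain1, n_cells_domain2, lower=20, step=10):
    """List candidate k values from `lower` up to n/5, with n the smaller dataset size."""
    n = min(n_cells_domain1, n_cells_domain2)
    upper = n // 5
    # The list is empty when the smaller dataset has fewer than 5 * lower cells.
    return list(range(lower, upper + 1, step))


# Hypothetical dataset sizes: 1000 and 400 cells, so the upper bound is 400 / 5 = 80.
print(k_candidates(1000, 400, step=20))
```

For very small datasets the recommended range is empty, in which case a value of k below 20 may be the only option.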

In general, we have found that the algorithm is fairly robust to the choice of k, and that the parameter e makes a larger difference. Larger values of e disperse the correspondence probabilities across more samples. If you expect to find 1-to-1 correspondences between samples, err towards smaller values of e.
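
The dispersing effect of e can be illustrated on a toy problem. Note that this sketch is not SCOT's algorithm: SCOT solves a Gromov-Wasserstein problem, whereas the code below runs plain entropic optimal transport with Sinkhorn iterations on a made-up cost matrix, and the two values of e are chosen for that toy cost scale rather than taken from the table above.

```python
import numpy as np


def sinkhorn_coupling(cost, e, n_iters=500):
    """Entropic OT between uniform marginals via Sinkhorn scaling iterations."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform weights on the rows
    b = np.full(m, 1.0 / m)  # uniform weights on the columns
    kernel = np.exp(-cost / e)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def entropy(coupling):
    """Shannon entropy of a coupling; higher means more dispersed mass."""
    p = coupling[coupling > 0]
    return float(-np.sum(p * np.log(p)))


# Toy data: five points on a line, matched against themselves.
points = np.linspace(0.0, 1.0, 5)
cost = (points[:, None] - points[None, :]) ** 2

sharp = sinkhorn_coupling(cost, e=0.01)   # small e: mass stays near the diagonal
diffuse = sinkhorn_coupling(cost, e=1.0)  # large e: mass spreads across samples
print(entropy(sharp), entropy(diffuse))
```

The coupling computed with the smaller e concentrates each sample's probability on its closest match, while the larger e spreads it over several samples, which is the behavior described above.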

If you are not sure which hyperparameters to set while running SCOT alignment, you have two options:
1. If you have some validation data about the cell-to-cell correspondences between the two domains in your dataset, you can use these for hyperparameter tuning. For this, take a look at the hyperparameter tuning example script.
2. If you don’t have any validation data on correspondences, no worries! You can use the unsupervised hyperparameter tuning heuristic, where we use the Gromov-Wasserstein distance as a proxy for graph distances to check for alignment quality as we sweep through different hyperparameter combinations. This procedure can take some time as we iterate over multiple hyperparameter combinations.
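
The sweep in option 2 has the shape of a plain grid search. The sketch below shows only that outer loop: sweep_hyperparameters and toy_score are hypothetical names, and toy_score is a stand-in for the real quality measure (in the heuristic described above, a Gromov-Wasserstein distance computed from an actual alignment), so that the example runs without any data.

```python
import itertools


def sweep_hyperparameters(score_fn, k_values, e_values):
    """Try every (k, e) pair and return (score, k, e) with the lowest score."""
    best = None
    for k, e in itertools.product(k_values, e_values):
        score = score_fn(k, e)
        if best is None or score < best[0]:
            best = (score, k, e)
    return best


def toy_score(k, e):
    """Stand-in score for illustration only; lower is better."""
    return abs(k - 40) / 100 + abs(e - 5e-3)


best_score, best_k, best_e = sweep_hyperparameters(
    toy_score, k_values=[20, 40, 60], e_values=[1e-3, 5e-3, 1e-2]
)
print(best_k, best_e)
```

Because every combination triggers a full alignment in practice, the cost grows with the product of the two grid sizes, which is why the procedure can take some time.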

Optional parameters for SCOT alignment

align(self, k, e, balanced=True, rho=1e-3, verbose=True, normalize=True, norm="l2", XontoY=True):

| Parameter | Description | Default Value | Notes |
| --- | --- | --- | --- |
| balanced | Determines whether to perform balanced or unbalanced optimal transport (boolean) | True | By default, we perform balanced transport. However, if you have reason to believe that at least one cell type is severely underrepresented in one of the domains compared to the other, setting this to False will yield better alignments. Otherwise, keep it at True. |
| rho | Coefficient of the Kullback-Leibler relaxation term in the unbalanced OT formulation | 1e-3 | Only needed if you are performing unbalanced OT. |
| verbose | Determines whether to print transport progress (loss over iterations) while optimizing the alignment (boolean) | True | |
| normalize | Determines whether to normalize the datasets before aligning them (boolean) | True | Empirically, we have found that normalization slightly improves the resulting alignment quality, so we suggest you keep this True. |
| norm | Defines which normalization is applied to the datasets. Normalization is always performed column-wise. | l2 | Empirically, we have found that the best alignment results are obtained when the pre-processed (via a dimensionality reduction scheme) real-world sequencing datasets are l2-normalized first. Other options: zscore, l1, and max. |
| XontoY | Sets the direction of the barycentric projection (boolean) | True | The direction of the barycentric projection makes very little difference in the quality of the resulting alignment. Generally, it is advisable to project onto the dataset with the more clearly defined clusters. |
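
To show what the direction of the barycentric projection means, here is a minimal sketch that maps one domain into the other's feature space using a correspondence (coupling) matrix. The function names and the toy matrices are assumptions for illustration; this is the standard barycentric projection formula, not necessarily SCOT's exact implementation.

```python
import numpy as np


def project_x_onto_y(coupling, y):
    """Place each X sample at the coupling-weighted average of the Y samples."""
    weights = coupling.sum(axis=1, keepdims=True)  # total mass per X sample
    return (coupling @ y) / weights


def project_y_onto_x(coupling, x):
    """Place each Y sample at the coupling-weighted average of the X samples."""
    weights = coupling.sum(axis=0)[:, None]  # total mass per Y sample
    return (coupling.T @ x) / weights


# Toy coupling between 3 samples in X (rows) and 2 samples in Y (columns).
coupling = np.array([[0.3, 0.0],
                     [0.0, 0.3],
                     [0.2, 0.2]])
y = np.array([[0.0, 0.0],
              [2.0, 4.0]])

x_aligned = project_x_onto_y(coupling, y)
print(x_aligned)
```

The third X sample splits its mass evenly between the two Y samples, so it lands halfway between them; projecting in the other direction would instead move the Y samples into X's feature space.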

Attributes of the SCOT aligner you can access

The align function of SCOT returns the two aligned matrices. However, there is additional information you can access: