Machine Learning Meets Biology (once again) or Single Cell RNA Sequencing As a Field for Unsupervised Learning
In a past decade amount of data in biology skyrocketed, and there seems to be no limit for that, as we have more and more sequenced genomes, deeper RNA expression and epigenetics datasets. On a separate note, we have brilliant examples of the broad use of machine learning for protein structure prediction and drug discovery. Now we are also able to study the 3D organisation of the genome, which can later be used for understanding of genomic rearrangements and subsequent diseases.
But there is one relatively new field of bioinformatics where classic ML tools are particularly useful — single-cell RNA sequencing.
Before we move to single cell data, let’s briefly cover what is RNA sequencing and what is RNA in general.
Some of you may have heard about central dogma of molecular biology, such as information being transformed from DNA to RNA and from RNA to proteins. Proteins are the molecular machines which are responsible for everything in the cell. Unfortunately, it is very hard to quantify the exact amount of each protein in the cell, but we can easily sequence their corresponding RNA.
As stated in the name, RNA sequencing is a group of methods for quantifying the amount of RNA in a sample. The reason we want to quantify it is that different cells may express different genes on different levels. Because of that difference, cells have extremely various properties and functions and we are interested in what caused that. Typical examples of the use of RNA sequencing is a comparison of gene “activity” (amount of RNA) in two groups, for example, treatment and control, where we have N control samples, N treatment samples and we use statistical methods to estimate which genes are deferentially expressed (= have different means in populations) between our groups (this can be interpreted directly as A\B test for those who are familiar with them). That’s quite useful when we are trying to understand if, for example, our treatment has any effect. Another example is our attempt to understand the effect of different mutations of the genomic sequence on a level of gene expression
However, those methods have their limitations. Since we extract RNA from a group of cells, we lose all information about the difference between them and we can only get a mean expression of each gene. In some cases that as useful as knowing the average temperature of all patients in a hospital. That may be a number, but does it make sense? This is especially important for cancer studies, as it has been shown that a tumor is not homogeneous but in fact can consist of the cell with different morphological, phenotypic and expression profiles so drug sensitivity tests may not be 100% accurate. It doesn’t mean, though, that bulk RNA sequencing is useless, we just need to understand its limitations.
So what should we do? Well, without going deep into the details of molecular biology, we can try to keep the information about each cell, so every time we sequence an RNA molecule we also sequence a small technical sequence, which is unique for each cell. If we add additionally a small technical sequence to each unique molecule of RNA, we can also accurately quantify the real number of RNA in each cell. So for each molecule, we sequence we know the ID of the cell (barcode) it came from and its own unique ID (in bioinformatics we call it a unique molecular identifier, UMI) so we can quantify the exact amount of RNA in each cell, therefore keeping all possible information! After all, at the end of the experiment, we will have a genes-by-cells table P, where P[i,j] is an expression of gene i in cell j.
But, like any technique, there are problems and limitations. And one of such problems, the most important for us, is drop out events. Due to the stochastic nature of gene expression and technical limitations from each cell we get only a subset of genes, a subset of data we theoretically can get from it. We observe a lot of zero expression of different genes in our cells and unfortunately, we can’t tell if one particular “zero” is because that gene is simply not expressed in the cell or it simply didn’t make it to our subset — we call such events “dropouts”. (Some people may notice that it looks very similar to recommendation systems, where we also have quite sparse data, as one user can’t rate all possible films\products)
Now, when we cover all details necessary for the understanding of our data, both from biological and technical points of view, we can summarize it in terms of the machine learning task.
So our data has the following properties:
Typical number of features (genes) is ~20,000–30,000. (Here, however, I need no notice that we usually remove genes with low expression and work with around ~5,000–7,000 genes… which is still a lot). The number of samples can vary from 5,000 cells (usual number from one run of the pipeline we use in the lab) to 400,000 cells and even 1,300,000 cells. As mentioned above, due to dropouts and nature of the samples, data is sparse. Another important point for clustering algorithms is different scales of our features (gene A can be expressed 100 times higher than gene B and yet that doesn’t mean that gene A is more important than gene B) and scales of samples. We can have a different number of UMIs (sum of expression) per cell, but that can be a biological variation (some cells during their life simply have less mRNA) or technical (two cells can have exactly the same amount of mRNA, but from one we get more data due to technical variations). Therefore proper normalization is crucial.
The typical question biologists have is usually given all K cells, how many cell states (cell clusters) do we have and what are their marker genes, which we define as statistically significant features of a particular cluster. Given that we rarely have any labels from cells we can say that this is a typical question for unsupervised learning, where we have no prior knowledge for the number of clusters and clusters can be different size, shape, density and can be connected one with another (For example, cell cycle adds additional variation between close cell states). Keeping in mind the dropout component, which adds additional noise to data and can increase the distance between close cells, this task becomes more and more challenging.
Another important issue here is a high dimensionality of our data because of the so-called “curse of dimensionality”. Since we work with at least 5,000 genes, it would require 2⁵⁰⁰⁰ cells to cover all points in that space even if binarize gene expression. Obviously, we can’t sequence that many cells, so we need to first reduce the number of dimensions.
Typical pipeline for that is finding highly variable features (for example, using the coefficient of variation) and then using PCA to reduce the number of components to some reasonable number and on top of that, for sake of visualization, use tSNE or UMAP, or even more complex methods, as variational autoencoders and even GANs (However, this may work only with large datasets and it is hard to transfer such models from one dataset to another.)
Next step in the analysis is clustering. At this moment at least 15 different tools for single cell RNA seq clustering are widely used, using different clustering mechanism, from simple K means with several initializations to dbscan and agglomerative clustering. However, since we usually lack any labels for cells, we can try to estimate the robustness of our clusters on datasets with known labels or by using different metrics to estimate cluster performance, such as BIC, AIC or silhouette method, or we can do a smarter thing — ask biologists to help us! We can use domain knowledge, since for many important genes we know their pattern of expression and by using that we can try to analyze and identify some known cell types. This is not always possible, but this is a topic for further articles.
Let’s build a simple pipeline for single cell RNA analysis. We will analyze a public dataset of Peripheral Blood Mononuclear Cells (PBMC) from 10X Genomics, which consists of ~8000 cells, which is available here
Next, we load our data. The matrix is quite big with lots of zeros, so it is stored in a sparse format.
Let’s do some EDA to understand what are we working with
What we see is that we have a lot of genes with extremely low expression and some (maybe even 1 with extremely high expression) — they may be outliers. On the other hand, we also know that some transcription factors (genes which are regulating expression of other genes) can be expressed on a really low lever and filtering our genes like may lead to loosing some structure.
At the same time, working with 33555 features is quite hard, so let’s remove some of them. For that we can do a simple thing. First, we remove genes with low expression (let’s say 100), plot CV (coefficient of variation) and select a reasonable threshold select genes, which we will use for clustering.
Let’s use define highly variable and useful genes as genes with CV ≥ 10.
We again need to check our cells. It may happen that after filtering genes we will end up with cells with zero expression of any genes. It is better to remove them and after log-normalize our data
Next our step is using UMAP. We will use it to reduce number of dimensions to 2 and we will run clustering on that. This is not a good idea, since projection only on two dimensions may not be optimal and probably won’t reflect the whole complexity of the dataset, but it has one plus — it is easy to visualize and understand results of different clustering techniques. Another thing we can try to optimize here — is n_neighbors, min_dist, learning rate and other parameters of UMAP.
Usually clustering is done after PCA and we use tSNE to project the results on a two dimensional plane. However, UMAPis better if we want to add additional data points latter. Well, UMAP is just better in general.
We already see some structure here, without optimization of UMAP and pca parameters and with quite naive way to select features for clustering. The only question is if this structure reflects biology. But as an exercise we can try to run several clustering methods
Just visually, DBSCAN and SpectralClustering look good, but as I said, there are some metrics to monitor to find “optimal” clustering. Therefore here we can tune our parameters as well. After all, there are three steps where our naive approach can be improved
- Accurate feature selection
- Parameters tuning for UMAP+pca
- Parameters tuning for clustering
However, the main question here, as I said, if this structure reflects the true biological complexity. To answer this question we will need to look at marker genes of each identified clusters to try to match them with known cell types and understand underlying biology.
In the next post, we will compare our results with commonly used pipelines and will try to find “markers” for each cluster and identify them.