The field of single-cell genomics is advancing rapidly and is generating many new insights into complex biological systems, ranging from the diversity of microbial ecosystems to the genomics of human cancer. In this Review, we provide an overview of the current state of the field of single-cell genome sequencing. First, we focus on the technical challenges of making measurements that start from a single molecule of DNA, and then explore how some of these recent methodological advancements have enabled the discovery of unexpected new biology. Areas highlighted include the application of single-cell genomics to interrogate microbial dark matter and to evaluate the pathogenic roles of genetic mosaicism in multicellular organisms, with a focus on cancer. We then attempt to predict advances we expect to see in the next few years.
Cell theory provided an entirely new framework for understanding biology and disease by asserting that cells are the basic unit of life1. The subsequent discovery that DNA is the heritable programme that encodes the proteins that carry out cellular functions led to the development of the fields of modern genetics and genomics2. Although bulk approaches for studying genetic variation have identified thousands of new unicellular species and determined genetic aetiologies for thousands of human diseases, most of that work has been done at the level of the ecosystem or organism3, 4. However, we now know that diversity within an ecosystem of unicellular species is far greater than we can accurately measure by studying a mixed group of organisms, and that the genomes within the cells of an individual multicellular organism are not always the same. Single-cell genomics aims to provide new perspectives to our understanding of genetics by bringing the study of genomes to the cellular level (Fig. 1). These tools are opening up new frontiers by dissecting the contributions of individual cells to the biology of ecosystems and organisms. For example, it is now possible to use single-cell genomics to identify and assemble the genomes of unculturable microorganisms5, evaluate the roles of genetic mosaicism in normal physiology and disease6, and determine the contributions of intra-tumour genetic heterogeneity in cancer development or treatment response7. However, this field rests on the ability to study a single DNA molecule from individually isolated cells, a process that is technically challenging.
In this Review, we describe the current state of the field, including approaches for cell isolation, whole-genome amplification (WGA), DNA sequencing considerations and sequence data analysis, and highlight how recent progress is addressing some of the technical challenges associated with these approaches. We then discuss how those advancements have begun to fulfil some of the ambitious aspirations for the field in applications such as identifying new features of microbial ecosystems and characterizing human intercellular genetic heterogeneity, in particular in cancer.
Acquiring high-quality single-cell sequencing data has four primary technical challenges: efficient physical isolation of individual cells; amplification of the genome of that single cell to acquire sufficient material for downstream analyses; querying the genome in a cost-effective manner to identify variation that can test the hypotheses of the study; and interpreting the data within the context of biases and errors that are introduced during the first three steps. To maximize the quality of single-cell data and to ensure that the signal is separable from technical noise, each of these variables requires careful consideration when designing single-cell studies.
Cell isolation. The first step in isolating individual cells from primary samples is to produce a suspension of viable single cells. This is not trivial when working with complex solid tissues, which require mechanical or enzymatic dissociation that keeps the cells viable while not biasing for specific subpopulations. In addition, diseased tissues can have different dissociation kinetics when compared with their normal counterparts, as well as varied dissociation between samples of the same disease. Standard digestion protocols for commonly studied tissues, as well as vigorous approaches for optimizing the dissociation of rare or diseased tissues, are areas that require further development. Laser-capture microdissection8 provides a low-throughput way of isolating DNA from single cells in their native spatial context, but the quality of sequencing data derived from microdissected single cells has been relatively poor. Finally, microfluidic and bead-based methods have been developed to specifically enrich for single circulating tumour cells (CTCs), as reviewed in detail elsewhere9. Environmental microbial samples also require efficient lysis of bacteria, with requirements that can be highly variable between species10.
Once in suspension, several approaches have been developed to isolate single cells. These include methods that require manual manipulation, such as serial dilution11, micropipetting12, microwell dilution13 and optical tweezers14. In addition, several protocols have been developed to isolate intact cells or nuclei using fluorescence-activated cell sorting (FACS)15. Nuclear isolation has the advantage of enabling single-cell sequencing on frozen tissue, which has not yet been demonstrated with other methods16. For microbial samples, depending on the environmental source additional sample preparation and FACS setting considerations can also be required17. Finally, automated micromanipulation methods that use droplets or micromechanical valves in microfluidic devices are entering mainstream use18, 19, 20. Regardless of the method used, it is also important to accurately confirm that a single cell has been physically isolated so that spurious biological conclusions are not made after evaluating chambers that are empty or contain multiple cells. In the ideal case this can be accomplished by obtaining microscopy data from each chamber or well containing a single cell.
Various single-cell isolation technologies have recently been reviewed, where the trade-offs in accuracy, throughput, reproducibility and ease of use were highlighted9, 21, 22. Most of the studies using these technologies have been done to illustrate feasibility using a small number of cells. Addressing many of the fundamental biological questions that are uniquely approachable with single-cell genomics will require the interrogation of thousands of cells, making it more likely that technologies that are scalable through parallelization, such as microfluidic-based approaches, are adopted for the long term. In addition, identifying scalable methods for single-cell isolation is an area of active research that is likely to continue producing innovative new tools that will improve all the capture performance metrics.
Whole-genome amplification. Another critical component of obtaining genetic information from single cells is to amplify the single copy of a genome while minimizing the introduction of artefacts, such as amplification bias, genome loss, mutations and chimaeras. WGA has been an area with substantial progress over the past 10 years, which has been reviewed in detail elsewhere21 (Fig. 2). Briefly, the first group of methods that attempted to amplify entire human genomes from single cells coupled PCR amplification with either common sequences interspersed throughout the genome23, a common sequence ligated to sheared genomes24, or degenerate or random oligonucleotide priming25, 26 (Fig. 2a). In practice, these methods have resulted in loss of signal from the majority of the genome during the amplification owing to differences in the density of common sequences or variability in PCR efficiency between loci, which are exacerbated when starting with a single genome copy. Starting with two genome copies by sorting out tetraploid nuclei improves the recovery of the genome to about 10% using degenerate oligonucleotide primed PCR (DOP-PCR). However, it is unclear whether selection biases are introduced when selecting for rapidly dividing cells15. A recent modification of this method has extended the approach to diploid cells16. Of note, these methods use thermostable polymerases, which have higher error rates than thermolabile polymerases, resulting in more mutations introduced during the amplification process.
The second category of WGA is based on isothermal methods (Fig. 2b). The most commonly used approach is multiple displacement amplification (MDA), which uses isothermal random priming and extension with Φ29 polymerase, which has high processivity, a low error rate and strand displacement activity27, 28. These methods produce greater genome coverage than the initial PCR-based methods, with lower error rates owing to the higher fidelity of Φ29 polymerase29. However, the exponential amplification results in overrepresentation of the loci that are amplified first, which is exacerbated by greater fold amplification29. It is unclear whether the overrepresentation of specific loci is due to stochastic or systematic biases. In addition, Φ29 polymerase activity results in the formation of a low level of chimeric sequence side products30, 31, which can be reduced with endonuclease treatment allowing the physical separation of the amplicons by debranching the reaction32.
In an attempt to overcome the low coverage of PCR-based methods and lack of uniformity of isothermal approaches, two quite similar hybrid methods have been developed. Both of these methods use a limited isothermal amplification followed by PCR amplification of the amplicons generated during the isothermal step12, 33 (Fig. 2c). As the name implies, ‘displacement DOP-PCR’ (also known as PicoPLEX) uses degenerate primers in the first step to add a common sequence, followed by priming of the common sequence for subsequent PCR amplification33. Most recently, multiple annealing and looping-based amplification cycles (MALBAC) uses a similar protocol, with the exception of using random primers, as well as new common sequences and temperature cycling that are claimed to promote looping of the isothermal amplicons to inhibit further amplification before the PCR step, which may result in more uniform amplification12.
In practice, the most commonly used WGA methods in current single-cell studies are the isothermal and hybrid methods. We recently compared these methods using serial dilutions of Escherichia coli DNA, as well as single bacterial cells29. Both MDA and MALBAC could successfully amplify genomes from single cells, but when amplification was carried out in microlitre reaction volumes in tubes, a significant amount of extraneous contaminant DNA was also amplified. This contaminant DNA was largely eliminated by moving to a microfluidic format that used nanolitre volumes. Bias in amplification was different between the two methods; for MDA bias depended strongly on the amount of gain, whereas for MALBAC it was largely independent of gain. MDA and MALBAC had roughly similar bias when the MDA gain was limited by nanolitre volumes in microfluidic devices. In addition, MALBAC was better at measuring copy number variation but MDA had a significantly lower false-positive rate. These findings suggest that the amplification method should be carefully chosen for each experiment based on the type of genetic variation that will be interrogated.
Two recent reports looked at similar performance metrics in human cells. Both reports compared DOP-PCR, MDA and MALBAC34, 35. The first report34 found that MDA had better coverage than MALBAC (84% versus 52%), which resulted in higher detection rates of single nucleotide variants (SNVs; 88% versus 52%). However, MALBAC and DOP-PCR had more uniform coverage, resulting in greater sensitivity and specificity for the detection of copy number variants (CNVs) of >1 Mb. Interestingly, some cells amplified by MDA had comparable CNV detection rates to MALBAC and DOP-PCR; it is unclear what variables result in the lack of reproducible MDA uniformity. The second report35 also found that MDA had greater coverage breadth (84%) than MALBAC (72%) and DOP-PCR (39%), with MALBAC and DOP-PCR having greater uniformity (coefficients of variation of 0.10 and 0.14, respectively) than MDA (coefficient of variation 0.21). In addition, they found that the isothermal methods had lower false-positive rates, but with more false-negative variance between experiments. The authors note that MALBAC had a lower allelic dropout (ADO) rate than MDA, although MALBAC only covered 72% of the genome, so the ADO rate of 21% was probably calculated only using covered sites. Consequently, it is unclear from this analysis which method had a lower false-negative rate, as MDA covered more of the genome but lost more variant alleles at heterozygous sites owing to less uniform amplification. These studies largely confirm our conclusion from the study of single bacterial genomes that there is no clear winner in amplifier performance and that each approach has strengths depending on the metric of interest (Fig. 2d).
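The two metrics compared above, coverage breadth and the coefficient of variation of coverage, can be illustrated with a minimal Python sketch. The function names, the per-bin depth lists and the depth threshold are hypothetical choices made for illustration; the cited reports may have binned and normalized their data differently.

```python
from statistics import mean, pstdev

def coverage_breadth(depths, min_depth=1):
    """Fraction of positions (or bins) covered by at least min_depth reads."""
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)

def coefficient_of_variation(depths):
    """Standard deviation of depth divided by mean depth (lower = more uniform)."""
    return pstdev(depths) / mean(depths)

# Hypothetical per-bin read depths for two amplified cells (not real data).
even_cell = [9, 10, 11, 10, 9, 11, 10, 10]
uneven_cell = [0, 25, 3, 0, 30, 2, 20, 0]

for name, depths in (("even", even_cell), ("uneven", uneven_cell)):
    print(name, coverage_breadth(depths), round(coefficient_of_variation(depths), 2))
```

In this toy example the uneven cell loses breadth and uniformity at the same time, whereas real amplification methods can trade one against the other, as the comparisons above suggest.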
Previous reports had also shown a significant decrease in contamination when single-cell WGA was performed in a microfluidic device36. In addition, it has been shown that using nanolitre volumes of microfluidics devices results in more uniform MDA when compared to traditional microlitre reactions31. A microfluidic device has recently been developed to perform MALBAC37, but it is unclear whether the performance of MALBAC will be further improved by carrying out the reactions in enclosed nanolitre amplification chambers. A recent study that used nanolitre volumes in microwells for MDA WGA (in a technique termed microwell displacement amplification system (MIDAS)) claimed to further improve amplification uniformity and reported single-cell amplification with equivalent uniformity to bulk MDA amplification; this level of performance would be striking and awaits independent confirmation13. Performing single-cell MDA in microfluidic emulsions seems to markedly improve the uniformity of amplification, and multiple groups have had success with this approach38, 39.
Interrogation of WGA products. The next step in single-cell genomic analyses is to determine how the amplified genomes will be interrogated. Broadly speaking, for complex eukaryotic genomes such as the human genome, one can choose to query specific loci of interest (typically <1 Mb), sequence all of the protein-coding regions (the exome; 30–60 Mb) or sequence the entire genome (3 Gb). As seen in Box 1 and Table 1, each of these approaches has trade-offs in coverage, propensity for specific types of errors and cost per cell evaluated. The type of genomic interrogation also needs to be carefully considered in the context of the questions being addressed by the study, by taking the biases of the WGA method into account.
Box 1: Cost considerations when designing single-cell sequencing experiments
Targeting specific locations of the genomes of single cells can help to focus on areas that have the greatest biological contribution to the system being studied while reducing sequencing costs and false mutation discoveries. Smaller target regions are less likely to contain errors that were introduced during the first few rounds of WGA that would be propagated to result in the erroneous identification of a genetic variant (known as a false-positive variant call). Furthermore, using the bulk sample as a reference can reduce false-positive variant calls by requiring concordance of variant calls in the bulk and single cells, although this limits the mutation discovery space to variants identified in the bulk sample. Targeting can be mediated by target-specific amplification using PCR, or target capture through hybridization. Target-specific amplification provides more uniform target coverage than capture-based methods, which is important when trying to maximize coverage of a genome that has already undergone non-uniform amplification40. Target capture more easily provides greater coverage breadth41, although parallel target-specific amplification using microfluidic devices can significantly increase coverage without large increases in effort.
Single-cell exome sequencing allows broader genome interrogation, which can be used to identify variants that are unique to each of the cells. However, as the size of the genome region interrogated increases, the probability that false variants will be discovered also increases — especially when using polymerases with higher error rates with the PCR-based WGA methods (C.G. and S.R.Q., unpublished observations).
The entire genomes of single cells can also be interrogated. Again, this comes with the trade-off of increased false mutation discovery and cost with the ability to query a larger proportion of the genome. Whole-genome sequencing (WGS) of single cells also removes the additional decrease in uniformity that occurs as a result of targeted or exome capture; thus, WGS can facilitate the detection of SNVs and CNVs. In addition, WGS can catalogue non-coding and structural variants that may contribute to the biological system being studied. However, this comes at a cost of requiring roughly 30-fold more sequencing per cell relative to exome sequencing, which may become limiting if working with many cells.
Overview of single-cell sequencing errors. One of the major challenges of analysing single-cell genomics data is to develop tools that differentiate technical artefacts and noise introduced during single-cell isolation, WGA and genome interrogation from true biological variation. During single-cell isolation, the population of cells being interrogated can be biased through selection of cells based on size, viability or propensity to enter the cell cycle. Consequently, it is necessary to compare the variant alleles detected in the single cells to the bulk population to ensure there was no selection bias. This can be done by comparing the percentage of single cells with a variant to the variant allele frequency in the original bulk sample40.
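As a rough illustration of that comparison, the following Python sketch converts the fraction of single cells carrying a variant into the bulk variant allele frequency it would imply. It assumes that every carrying cell is diploid and heterozygous at the site, with no copy number change and no false-negative calls; these assumptions and the function name are ours, not taken from the cited study.

```python
def expected_bulk_vaf(cells_with_variant, cells_total, variant_copies=1, ploidy=2):
    """Bulk variant allele frequency implied by the single-cell carrier fraction.

    Assumes each carrying cell has variant_copies mutant alleles out of ploidy
    total alleles (heterozygous diploid by default) and that no calls were missed.
    """
    carrier_fraction = cells_with_variant / cells_total
    return carrier_fraction * variant_copies / ploidy

# Hypothetical example: 12 of 40 sequenced cells carry the variant.
print(expected_bulk_vaf(12, 40))
```

A bulk frequency far from this expectation would be consistent with selection during cell isolation, although dropout in the single cells would also lower the observed carrier fraction.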
As detailed in Fig. 3, numerous errors are introduced during WGA, including loss of coverage, decreased coverage uniformity, allelic imbalance, ADO and errors during genome amplification. Most published papers have attempted to quantify rates for some or all of these errors. However, many studies use cell lines to carry out quality control analyses, followed by experiments on primary samples. This makes it difficult to compare protocols between studies, as it is unclear whether similar performance can be obtained on the primary samples for each of these studies relative to the cell lines that were used for protocol optimization. One must be particularly mindful that certain cell lines or cell types may not be diploid; they can be highly aneuploid or even polyploid, and this affects experimental performance enormously. In addition, various metrics are applied in a quality control step in which the cells are categorized into a subset that meets the chosen criteria and are used to draw biological conclusions, versus a subset that is discarded. Some of the quality metrics used include visual confirmation of an individually isolated cell, WGA product qualification and/or quantification, genome coverage and ADO. Two recent studies developed methods to predict the breadth of genome coverage using low-pass sequencing, which could provide a low-cost approach for assessing cell lysate quality in larger eukaryotic genomes42,43. However, comparing single-cell genomics studies is currently difficult, as most studies do not report the total number of cells evaluated, the quality of the data from the discarded cells or the metrics used for the quality control categorization (Table 1). Finally, the definition of ADO is not uniform across studies. Some studies do not include loci where both alleles have dropped out of the single cells in their ADO calculation, which artificially reduces those ADO values relative to those studies that include all loci. 
In practice, the determination of clonal structures is hampered by loss of somatic variants, which always occurs when the whole locus drops out and occurs about 50% of the time when only one of the alleles drops out. A less ambiguous term is ‘false-negative rate’, which would take into account both allelic and locus dropout. An additional consideration for microorganism sequencing is changes in lysis and/or amplification efficiency between species owing to cellular or genome differences. A recent study compared errors and assembly performance between species44. Hence, more uniform analysis and reporting methods are needed to facilitate data interpretation between single-cell studies and provide accurate performance metrics for each approach.
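The relationship between dropout and the false-negative rate described above can be written as a small Python sketch. It assumes heterozygous somatic variants, for which the variant is always lost when the locus drops out and lost half of the time when exactly one allele drops out; the function and parameter names are hypothetical.

```python
def false_negative_rate(p_locus_dropout, p_single_allele_dropout):
    """Probability of missing a heterozygous variant.

    p_locus_dropout: probability that both alleles are lost at a site.
    p_single_allele_dropout: probability that exactly one allele is lost.
    The lost allele is assumed to be the variant allele half of the time.
    """
    if p_locus_dropout + p_single_allele_dropout > 1:
        raise ValueError("dropout probabilities cannot sum to more than 1")
    return p_locus_dropout + 0.5 * p_single_allele_dropout

# Hypothetical rates: 10% locus dropout and 20% single-allele dropout.
print(false_negative_rate(0.10, 0.20))
```

Reporting both inputs separately, rather than a single ADO value, would make it clearer which loci a study included in its calculation.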
Single-cell variant calling. Although numerous errors are introduced during WGA, tools and strategies are now being developed to overcome the additional technical noise created with WGA, allowing the identification of true variation. SNV calling requires coverage of a variant allele at a rate that exceeds the sum of the amplification and sequencing error rates. More specifically, mutations introduced during the amplification, as well as the allelic imbalances that occur during genome amplification, must be taken into account when calling variants (Fig. 4). There are two basic strategies to overcome the false-positive variants introduced as artefacts of the amplification. First, the bulk sample can be used as a reference to reduce the false discovery rate40. Second, when using only the single-cell data, two or three cells can be required to have the same variant at the same location, which is unlikely to occur by chance with the several thousand mutations introduced during single-cell WGA in a 3 Gb human genome12. However, the actual number of cells required to call a mutation has not yet been rigorously tested based on the size of the genomic region interrogated. To overcome allelic imbalance, we need variant calling algorithms that are designed to take the technical noise into consideration. One strategy is to require that all variant calls be above the level of technical noise in control samples, which should not have variants40. Another approach is to decrease the sequencing error rate by using molecular barcoding7. Finally, algorithms are beginning to be developed to correct errors in single-cell sequencing data45. Nonetheless, more tools that incorporate all single-cell amplification errors are needed to optimally carry out variant calling in single-cell data.
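To see why requiring the same variant in two or more cells is an effective filter, the expected number of coincident artefacts can be estimated with a back-of-the-envelope Python sketch. It assumes that amplification errors fall independently and uniformly across the genome and that each error is equally likely to be any of the three alternative bases; systematic, recurrent errors would violate these assumptions, and the function name is hypothetical.

```python
from math import comb

def expected_shared_artifacts(n_cells, errors_per_cell, genome_size, n_alt_bases=3):
    """Expected number of identical artefactual SNVs shared by any pair of cells.

    For one pair, each of the errors in the first cell matches a given error in
    the second with probability 1 / (genome_size * n_alt_bases).
    """
    per_pair = errors_per_cell ** 2 / (genome_size * n_alt_bases)
    return comb(n_cells, 2) * per_pair

# Hypothetical example: 3,000 WGA errors per cell in a 3 Gb genome.
print(expected_shared_artifacts(2, 3000, 3_000_000_000))
print(expected_shared_artifacts(10, 3000, 3_000_000_000))
```

Under these assumptions a chance match is rare for a single pair, but the expectation grows with the number of cell pairs compared, which is one reason the required number of supporting cells still needs rigorous testing.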
CNV detection relies on algorithms such as hidden Markov models, circular binary segmentation and rank segmentation, which can normalize noisy coverage data after single-cell WGA to identify regions that are over- or under-represented compared with a diploid genome12, 46, 47. CNV detection algorithms are currently being developed to specifically address the technical artefacts introduced during specific types of single-cell WGA47, 48. Chimaera formation can create false structural variants, although unless they occur at the beginning of the amplification they should be much less abundant than the corresponding wild-type sequences. This is important for both identifying structural variation in sequencing data and when constructing contigs for de novo genome assemblies. In addition, assemblies are hampered by loss of coverage and uneven coverage, which results in truncated or artefactual sequences in assembled genomes. Several assemblers have been created to specifically address these challenges49, 50, and it is likely that further progress will be made in the coming years.
Determining genetic relationships between single cells. General strategies for clustering gene expression and other large data sets have depended heavily on distance functions that provide a quantitative measurement of the differences between pairs of samples51. Within the context of single-cell sequencing, we require that these distance functions be robust to missing data as a result of false-negative variant detection. We have found that Jaccard distance is best suited for genotype data, as it is binary in nature52. However, we also observed that in general the false-negative rate can hinder statistical determination of the number of clones in a sample.
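For reference, the Jaccard distance between two cells with binary genotype calls can be computed as in the following Python sketch. The encoding (1 for a called variant, 0 otherwise) and the function name are assumptions for illustration. Sites where neither cell has a call do not contribute, which is one reason the measure suits sparse binary data.

```python
def jaccard_distance(cell_a, cell_b):
    """Jaccard distance between two equal-length 0/1 genotype vectors."""
    if len(cell_a) != len(cell_b):
        raise ValueError("genotype vectors must have the same length")
    both = sum(1 for a, b in zip(cell_a, cell_b) if a and b)
    either = sum(1 for a, b in zip(cell_a, cell_b) if a or b)
    if either == 0:
        return 0.0  # no variants called in either cell
    return 1.0 - both / either

# Hypothetical genotypes at five sites.
cell_1 = [1, 1, 0, 0, 1]
cell_2 = [1, 0, 0, 1, 1]
print(jaccard_distance(cell_1, cell_2))  # 2 shared of 4 called sites
```

Because a false-negative call turns a shared variant into an apparent difference, distances between cells of the same clone are inflated as the dropout rate rises.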
An alternative to distance-based methods is to perform model-based clustering53, which allows the inclusion of false-negative errors modelled as binomial processes. Model-based clustering is a soft clustering approach. Unlike methods that subdivide phylogenetic trees that have been generated by distance-based clustering, model-based clustering provides probabilities that a cell originates from the different clones. As seen in the example in Fig. 5, the observed single-cell data are represented as a binary matrix that is first considered to be derived from a mixture of an unknown number of clones with some data missing. Parameters in the model, such as the probability of a particular single cell originating from a specific clone, as well as the false-negative rate, can be estimated across a distinct number of possible clones using an expectation–maximization (EM) algorithm54. The challenge of determining the number of clones is then reduced to selecting the statistical model that best describes the observed single-cell data using Bayesian or Akaike information criteria55. There is also a hybrid approach based on obtaining an initial estimate of the number of clones derived from distance-based hierarchical clustering, which increases the convergence speed of the computationally intensive model-based methods56.
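The general idea can be sketched in Python as a mixture of Bernoulli profiles fitted by EM, with the number of clones chosen by the Bayesian information criterion. This is a simplified illustration rather than the method of the cited studies: false negatives are absorbed into the per-clone detection probabilities instead of being modelled as a separate parameter, the deterministic farthest-point initialization is our own choice, and all names and data are hypothetical.

```python
from math import exp, log

def _seed_cells(X, k):
    """Deterministic farthest-point choice of k seed cells (Hamming distance)."""
    seeds = [0]
    while len(seeds) < k:
        def gap(i):
            return min(sum(a != b for a, b in zip(X[i], X[s])) for s in seeds)
        seeds.append(max(range(len(X)), key=gap))
    return seeds

def _e_step(X, pi, theta):
    """Return per-cell clone probabilities and the total log-likelihood."""
    resp, loglik = [], 0.0
    for x in X:
        logs = [log(p) + sum(log(t) if v else log(1.0 - t) for v, t in zip(x, th))
                for p, th in zip(pi, theta)]
        top = max(logs)
        weights = [exp(v - top) for v in logs]
        total = sum(weights)
        resp.append([w / total for w in weights])
        loglik += top + log(total)
    return resp, loglik

def fit_clone_mixture(X, k, n_iter=50, eps=0.01):
    """Fit k clones to a binary cell-by-site matrix; return (probabilities, BIC)."""
    n, m = len(X), len(X[0])
    theta = [[0.75 if v else 0.25 for v in X[s]] for s in _seed_cells(X, k)]
    pi = [1.0 / k] * k
    for _ in range(n_iter):
        resp, _unused = _e_step(X, pi, theta)
        for c in range(k):
            mass = sum(r[c] for r in resp)
            pi[c] = max(mass / n, 1e-12)
            # eps keeps every detection probability strictly between 0 and 1.
            theta[c] = [(sum(r[c] * x[j] for r, x in zip(resp, X)) + eps)
                        / (mass + 2 * eps) for j in range(m)]
    resp, loglik = _e_step(X, pi, theta)
    n_params = k * m + (k - 1)
    return resp, -2.0 * loglik + n_params * log(n)

# Hypothetical data: two clones of six cells each, with one dropout in cell 5.
cells = [[1, 1, 1, 0, 0, 0]] * 5 + [[1, 0, 1, 0, 0, 0]] + [[0, 0, 0, 1, 1, 1]] * 6
bics = {k: fit_clone_mixture(cells, k)[1] for k in (1, 2, 3)}
best_k = min(bics, key=bics.get)
print(best_k, bics)
```

The returned probabilities are the soft assignments described above: each row gives the probability that a cell originates from each clone, so a cell with a dropout can still be assigned to its clone with high confidence.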
After estimating the number of clones in a sample and determining which clone each cell belongs to, a consensus clonal mutation profile can be established. We have done this using mutation frequency cutoff values that exceed the false-negative rate57, although more rigorous statistical methods could be developed. After determining the clonal genotype, the relationships between clones can be determined. There are a number of algorithms used in evolutionary biology that can be applied to establish clonal structures58, such as those based on maximum parsimony, maximum likelihood or distance-based methods such as unweighted pair group method with arithmetic mean, neighbour joining and minimum evolution algorithms58, 59. We prefer the modified use of directed minimum spanning trees, as they can be rooted and allow us to readily include ancestral clones as internal nodes of the evolutionary tree are identified.
Compartmentalizing microbial dark matter. Sequencing has the capacity to overcome the sampling bias that occurs when investigators rely on culturing methods to isolate microorganisms. Sequencing 16S ribosomal RNA has identified as-yet unculturable bacterial phyla and major archaeal groups, although the remainder of the genomes of those putative new phyla are difficult to assemble because the sequencing data are acquired from samples that are composed of multiple species. In principle, single-cell genomics has the potential to assemble the genomes of species that are present at low frequencies in these metagenomic samples4, as well as to produce assemblies of genomes of completely uncharacterized microorganisms. Here, we focus on the sequencing of species of phyla that had been detected by 16S ribosomal RNA sequencing but had not had full genome assemblies, as these single-cell studies have shown the greatest likelihood of advancing our understanding of microbial ecosystems in the near term.
The first single-cell genome to be sequenced from the environment was a member of the TM7 phylum. In this study, species were identified from the mouths of human subjects, followed by physical isolation and MDA using a microfluidic device5. In a later study, cells were sorted using FACS, followed by MDA and sequencing60. More recently, species from the OP11 phylum isolated from an anoxic spring61, SR-1 phylum derived from human oral mucosa62, TM6 phylum from biofilm on a hospital sink63 and OP9 phylum from a hot spring64 have been sequenced using similar methods. The Joint Genome Institute has undertaken a project to sequence the genomes of hundreds of unculturable microorganisms from diverse environments, and has already sequenced the genomes of numerous archaeal and bacterial species of known but unsequenced phyla65. This large sequencing study has also identified new biological phenomena in these bacteria, including a new purine synthesis pathway65.
The most important variable when performing de novo genome assemblies is the genome coverage. Almost all studies to date have used MDA. In our comparison study using raw reads from single E. coli cells, MDA performed better than MALBAC29. Much of the genome coverage for MALBAC was lost owing to contamination when the reaction was carried out in tubes. If only mapped reads are considered, MALBAC would cover a greater proportion of the genome, providing further evidence that reducing contamination using a microfluidic-based MALBAC strategy could potentially provide even better microbial genome assemblies. Tools have recently been developed to systematically assess the quality of single-cell microorganism sequencing data66, including the presence of a contaminating sequence67. Another approach for improving amplification metrics and subsequent assemblies is to capture and culture individual bacteria in droplets, followed by amplification of the hundreds to thousands of cells that descended from the original bacteria68. Alternatively, investigators have focused on species with polyploid genomes to acquire better assemblies by starting with bacteria that have 200–900 genome copies per cell69. Finally, there have been several short-read assemblers that have been developed to correct for the artefacts of single-cell MDA49, 50, 70.
Recent advances in single-cell genomics have enabled the description of completely new phyla, and are now beginning to provide biological insights that could not be made using metagenomics approaches62. In addition, a better understanding of the microbiome is creating knowledge that is leading to commercial applications. For example, new members of the Oceanospirillales order, the genomes of which contain enzymes that metabolize crude oil, were identified by single-cell sequencing of ocean samples after the Deepwater Horizon oil spill71. There is also promise in using single-cell genomics to identify unculturable human microbial pathogens, as well as to determine differences in pathogenicity between strains of the same species within an individual72. In addition, although most studies have focused on bacterial phyla with known 16S rRNA gene sequences, single-cell genomics could be used to assemble the genomes of bacteria or archaea that can be visualized, but the rRNA of which cannot be detected by PCR because of sequence divergence from the universal amplification primers.
An emerging application of single-cell genomics is to use single-cell sequencing to identify new viruses that may be difficult to assemble from metagenomic samples73. Several papers have highlighted the power of this approach, including the discovery of five new virus genera through single-cell interrogations of unculturable SUP05 bacteria74. Another study found the first viral sequences for 13 new bacterial phyla using public data sets75. Computational tools are being developed to improve methods for deciphering new viral sequences from their host76. These methods are beginning to be used to study phage–host interactions, which will probably be augmented with single-cell transcriptome sequencing. Another study has looked at virus–protist interactions using single-cell sequencing77, and it is likely that additional studies will provide details on the relationship between a virus, phage or bacterium and its host by deconvoluting the cell-to-cell variance in that interaction which is partially lost with bulk sequencing strategies.
Still, several challenges remain to increase the throughput and quality of single-cell microbial genomes. More efficient tools for isolating and lysing single microorganisms, uniform and less error-prone amplification methods, and even more robust assembly algorithms that incorporate the additional uncertainty introduced by technical artefacts during single-cell WGA are needed to produce high-quality genome assemblies. The challenge of providing a more uniform approach for producing, analysing and assessing single-cell microorganism genomes is being addressed by the Human Microbiome Project78, which is in the process of sequencing the genomes of 3,000 single cultured and uncultured bacteria isolated from various human anatomical sites. We have largely focused on single bacterial genomes, but investigators are also interrogating the genomes of other single-cell organisms, including protists77, 79.
Identifying genetic mosaicism in multicellular organisms. The development of cytogenetic methods in the 1950s led to the discovery that cells within the same individual can harbour different numbers of chromosomes80. Patients with mosaic expression of dominant Mendelian diseases were subsequently identified by unusual patterns of the stereotypical cutaneous manifestations of several diseases, including neurofibromatosis type I and hereditary haemorrhagic telangiectasia81. It was then shown that other diseases such as McCune–Albright Syndrome are only expressed as mosaic diseases, suggesting that germline mutations are lethal82. More recently, the development of variant detection methods based on microarrays and next-generation sequencing has enabled the identification of several new diseases that are the result of mosaic SNVs83, 84, 85 or CNVs86.
However, previous studies of human mosaicism have been limited to the identification of genetic aberrations that are present at relatively high frequencies owing to the low sensitivity of current technologies. Still, a human cell is estimated to acquire an SNV within its coding region after every 300 cell divisions87. As the average human body is estimated to contain 37 trillion cells88, each position in our genomes acquires hundreds to thousands of mutations in different cells as we develop from a zygote into an adult human. In addition, studies that have sampled tissues from different sites of the same person suggest that mosaic CNV and SNV rates are higher than previously appreciated89, 90. However, the role of that low-level genetic variation in the predisposition and pathogenesis of human diseases remains largely unexplored.
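The scale of this estimate can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the mutation rate and cell count come from the text, whereas the coding-region size of about 30 Mb and the simplification that building N cells takes at least N − 1 divisions (ignoring cell death and turnover) are assumptions made here, and the function name is hypothetical.

```python
# Rough estimate of how many times each coding position is mutated
# somewhere in the body during development. Illustrative sketch only.

DIVISIONS_PER_CODING_SNV = 300   # from the text: one coding SNV per ~300 divisions
CELLS_IN_BODY = 37e12            # from the text: ~37 trillion cells
CODING_REGION_BP = 30e6          # ASSUMPTION: ~30 Mb of coding sequence


def mutations_per_coding_position(cells=CELLS_IN_BODY,
                                  divisions_per_snv=DIVISIONS_PER_CODING_SNV,
                                  coding_bp=CODING_REGION_BP):
    """Expected number of independent mutations per coding base across all cells."""
    # ASSUMPTION: growing N cells from one zygote needs at least N - 1 divisions;
    # cell death and tissue turnover would only increase this number.
    min_divisions = cells - 1
    # Per-base, per-division rate implied by one coding SNV every 300 divisions.
    per_base_per_division = 1.0 / (divisions_per_snv * coding_bp)
    return min_divisions * per_base_per_division


estimate = mutations_per_coding_position()
print(f"about {estimate:,.0f} mutations per coding position")
```

Under these assumptions the estimate is on the order of a few thousand events per coding position, consistent with the range quoted above; a different assumed coding-region size or number of divisions shifts the result proportionally.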
Recent studies have started characterizing mosaic genetic variation in human samples using single-cell sequencing. For example, the de novo mutation rate and recombination map were measured in single human sperm91, followed by a second study on sperm recombination rates92 and a later study on human oocytes93. It has also been shown that a substantial percentage of single human neurons from healthy individuals harbour megabase CNVs6, 94, although these findings have been disputed95. More recently, single-cell whole-genome sequencing was used to identify mosaic SNVs, and the authors found an enrichment of mutations at sites that are actively transcribed in the brain, suggesting that those locations are the main source of mutation in those cells96. We have also used single-cell sequencing to confirm a mosaic SNV in the sodium channel SCN5A as a cause of long-QT syndrome in a neonate (Euan Ashley, James Priest, C.G. and S.R.Q., unpublished observations). It is likely that low-level mosaic genetic variants will be increasingly connected with human diseases as experimental and analytical tools continue to reduce the technical noise from single-cell WGA, which will improve our ability to distinguish true variants from experimental artefacts. In addition, these tools are likely to find direct clinical applications. Single-cell genomic techniques have long been used to screen embryos for in vitro fertilization97 and more recently they have been used to detect aneuploidy in polar bodies before implantation93, 98, 99.
Cancer. The best-studied example of genetic mosaicism is cancer. Tumour initiation, maintenance and evolution are mediated by the sequential acquisition of genetic variants in single cells. The aim of the large ongoing cancer sequencing projects is to catalogue those variants to better understand tumour biology100. However, as in other studies of genetic mosaicism, the sensitivity of detection is limited to variants that are present in about 20% of cells of a bulk sample composed of thousands of cells. The use of variant allele frequency distributions in bulk and regional sequencing studies has indicated that many cancers have considerable genetic heterogeneity101, 102. However, those methods do not co-segregate mutations into distinct clones, which is required to unambiguously determine the clonal structure of the samples, as well as to determine the evolutionary histories of the malignancies.
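The co-segregation problem can be illustrated with a toy example. The following sketch uses two hypothetical tumours of 100 cells each, with made-up mutation labels A and B, and assumes heterozygous mutations in diploid cells with no copy-number changes; it is not drawn from any of the cited studies. Both tumours give identical bulk variant allele frequencies, yet their single-cell genotypes reveal different clonal structures.

```python
from collections import Counter


def bulk_vaf(cells, mutation):
    """Expected bulk variant allele frequency for one mutation.

    ASSUMPTION: every mutation is heterozygous in a diploid cell, so each
    carrier cell contributes one mutant allele out of two.
    """
    carriers = sum(1 for genotype in cells if mutation in genotype)
    return carriers / (2 * len(cells))


# Hypothetical tumour 1: A and B co-occur in the same 30 of 100 cells.
same_clone = [frozenset({"A", "B"})] * 30 + [frozenset()] * 70
# Hypothetical tumour 2: A and B mark two separate clones of 30 cells each.
separate_clones = ([frozenset({"A"})] * 30 + [frozenset({"B"})] * 30
                   + [frozenset()] * 40)

for name, cells in (("same clone", same_clone),
                    ("separate clones", separate_clones)):
    vafs = {m: bulk_vaf(cells, m) for m in ("A", "B")}
    # Counting identical single-cell genotypes recovers the clonal composition.
    clones = {tuple(sorted(g)): n for g, n in Counter(cells).items()}
    print(name, "| bulk VAFs:", vafs, "| single-cell genotypes:", clones)
```

In both cases the bulk measurement reports the same frequency for A and for B, so only the per-cell genotypes show whether the two mutations reside in the same cells.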
Single-cell sequencing studies have now begun to dissect intra-tumour genetic heterogeneity at single-cell resolution. The first published study used DOP-PCR to identify CNVs in breast cancer nuclei15. Another group used isothermal amplification methods to identify SNVs in a renal cell tumour, and in a sample from a patient with a myeloproliferative disorder103, 104. The authors of those two studies concluded that the tumours were monoclonal even though significant genetic heterogeneity was identified between cells. Another study did identify two distinct clonal populations within a bladder carcinoma105. A subsequent single-cell sequencing study of colon cancer claimed that the tumour was biclonal in origin, which seems to be contradicted by the fact that the two putative unrelated clones share mutations106. The use of ambiguous descriptions of the clonal structures in these studies highlights the need to create common nomenclature as the field of single-cell cancer genomics matures. These initial studies provided hope that single-cell cancer sequencing would become feasible, but uncertainty about data quality owing to technical limitations prevented the investigators from gaining new biological insights.
Circulating tumour cells (CTCs) can be isolated and interrogated as a potential window into the genetics of a tumour through non-invasive sampling. The unique technical challenges associated with isolating and analysing the genomes of single CTCs have been detailed elsewhere9. Still, these studies have begun to show promise in identifying and characterizing CTCs as alternative diagnostic and disease monitoring strategies107, 108. One of the fundamental questions that is yet to be resolved is whether CTCs will provide a representative sampling of the genomic diversity within the source tumour.
More recent studies have aimed to improve experimental and computational methods so that examinations of malignancies at single-cell resolution provide a higher-resolution understanding of the disease. Some are limited by evaluating an inadequate number of loci or have insufficient genome coverage to independently determine clonal structures based only on the single-cell sequencing data41, 109, 110. However, by sorting out haematopoietic precursor populations, one study of acute myeloid leukaemia was able to order the acquisition of mutations and provide evidence that specific mutations persisted in populations that have a phenotype similar to normal haematopoietic stem cells111. A more recent breast cancer study that used MDA on tetraploid nuclei inferred the clonal structure of the sample using SNVs7. The authors also did CNV profiling, although not on the same cells, and found that most CNVs were acquired before SNVs.
We recently used MDA to amplify the genomes of almost 1,500 acute lymphoblastic leukaemia (ALL) cells40. With the large number of cells, we were able to develop methods to determine the clonal structures. In addition, we established general criteria that are required to accurately identify clonal structures, including: having a variant dropout rate of less than 30%, interrogating at least 20 mutations per sample and detecting at least three independent cells to accurately identify a new clone. The rigorous validation of our clonal structures using these analysis methods enabled us to confidently make new conclusions about the events that result in ALL formation, including the presence of co-dominant clones at diagnosis, the acquisition of clone-specific punctuated cytosine mutagenesis, the existence of leukaemia cells at various stages in differentiation arrest and the observation that KRAS mutations are acquired late in disease development but are not sufficient for clonal dominance.
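These three criteria lend themselves to a simple quality check before clonal structures are interpreted. The sketch below is a hypothetical illustration rather than the published analysis code: the thresholds are taken from the text, whereas the class, field and function names, as well as the example values, are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List

# Thresholds quoted in the text.
MAX_DROPOUT_RATE = 0.30      # variant dropout rate must be below 30%
MIN_MUTATIONS = 20           # at least 20 mutations interrogated per sample
MIN_CELLS_PER_CLONE = 3      # at least three independent cells per new clone


@dataclass
class SampleSummary:
    """Hypothetical per-sample summary; field names are illustrative."""
    variant_dropout_rate: float
    mutations_interrogated: int
    cells_per_clone: Dict[str, int]  # candidate clone label -> supporting cells


def failed_criteria(sample: SampleSummary) -> List[str]:
    """Return a description of each criterion the sample does not meet."""
    problems = []
    if not sample.variant_dropout_rate < MAX_DROPOUT_RATE:
        problems.append("variant dropout rate is not below 30%")
    if sample.mutations_interrogated < MIN_MUTATIONS:
        problems.append("fewer than 20 mutations interrogated")
    for clone, n_cells in sample.cells_per_clone.items():
        if n_cells < MIN_CELLS_PER_CLONE:
            problems.append(f"clone {clone} is supported by only {n_cells} cell(s)")
    return problems


# Made-up example: the sample passes the first two criteria,
# but candidate clone B is supported by too few cells.
example = SampleSummary(variant_dropout_rate=0.25,
                        mutations_interrogated=24,
                        cells_per_clone={"A": 40, "B": 2})
for problem in failed_criteria(example):
    print(problem)
```

Returning the list of unmet criteria, rather than a single pass or fail flag, makes it clear whether a sample should be excluded or whether only one putative clone lacks support.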
With recent experimental and computational developments, the field of single-cell genomics is poised to begin offering important new insights into cancer development and evolution. Currently, only SNVs or CNVs can be accurately identified from a single cell with targeted or low-pass sequencing; improvements in WGA methods could further decrease sequencing requirements, which would allow more cost-effective whole-genome interrogation of all genomic variation in single cells, including SNVs and structural variants that reside in non-coding regions. The strategies used to interrogate amplified cancer genomes, such as WGS, whole-exome sequencing or targeted sequencing, should be carefully selected based on the hypotheses being tested, as well as the trade-offs between cost, throughput and the quality of the data acquired (Box 1). Further computational method development is needed to maximize the accuracy of variant calling, as well as of the clonal structures identified. Finally, more uniform definitions across cancer sequencing studies are required to allow accurate comparisons between studies. For example, cell lines should not be used to evaluate the quality of methods that are performed on primary cells; unambiguous terms such as false-negative rate, which incorporates both locus dropout and ADO, should be substituted for ADO; and a universal definition for a clone should be determined. The latter point is important, as more sensitive single-cell methods are beginning to identify variants that are unique to individual cells or small groups of cells (C.G. and S.R.Q., unpublished observations), and there is no consensus with regard to whether those likely incidental rare mutations should be used to establish those cells as an independent clonal population.
In this Review, we have presented an overview of the current state of the field of single-cell genome sequencing. Substantial progress has been made over recent years in obtaining higher quality single-cell data, which has resulted in the discovery of new biological phenomena that could not be detected with standard bulk genomic interrogations. Still, many challenges remain. Increases in the throughput of cell isolation techniques, as well as improvements in genome amplification, sequencing and computational methods will undoubtedly make the field accessible to many more groups while broadening the types of hypotheses that can be tested. In addition, single-cell genome sequencing has begun to be coupled with RNA112, 113, 114 and/or protein measurements115 from the same cells. The ability to correlate genotype with other cellular building blocks, as well as phenotypic measurements, will make even more biological questions accessible. Finally, incorporating intracellular116 and intercellular117, 118 spatial information with genomic measurements will enable researchers to begin putting the cellular building blocks together by providing the surrounding cellular contexts. Many obstacles remain, but we believe the field of single-cell genomics is going to rapidly advance our understanding of microbial ecology, evolution and human disease.
Avery, O. T., Macleod, C. M. & McCarty, M. Studies on the chemical nature of the substance inducing transformation of pneumococcal types: induction of transformation by a desoxyribonucleic acid fraction isolated from Pneumococcus type III. J. Exp. Med. 79, 137–158 (1944).
Marcy, Y. et al. Dissecting biological “dark matter” with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth. Proc. Natl Acad. Sci. USA 104, 11889–11894 (2007). This study shows that we can identify uncultivated microorganisms using single-cell sequencing.
McConnell, M. J. et al. Mosaic copy number variation in human neurons. Science 342, 632–637 (2013). This article provides the first evidence that mosaic CNV may be more common than previously appreciated.
Wang, Y. et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512, 155–160 (2014). The study is an example of high-quality single-cell cancer sequencing data, which has enabled new insights into the pathogenesis of breast cancer.
Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94 (2011). This study provides the first evidence that single-cell sequencing can be used to dissect intratumour heterogeneity.
Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015). The study presents droplet-based microfluidics as a viable option for efficiently sequencing the transcriptomes of thousands of cells.
Lichter, P., Ledbetter, S. A., Ledbetter, D. H. & Ward, D. C. Fluorescence in situ hybridization with Alu and L1 polymerase chain reaction probes for rapid characterization of human chromosomes in hybrid cell lines. Proc. Natl Acad. Sci. USA 87, 6634–6638 (1990).
Troutt, A. B., McHeyzer-Williams, M. G., Pulendran, B. & Nossal, G. J. Ligation-anchored PCR: a simple amplification technique with single-sided specificity. Proc. Natl Acad. Sci. USA 89, 9823–9825 (1992).
Dean, F. B., Nelson, J. R., Giesler, T. L. & Lasken, R. S. Rapid amplification of plasmid and phage DNA using Phi29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res. 11, 1095–1099 (2001). This paper provides the first evidence that isothermal amplification could be used to efficiently analyse whole genomes.
Gawad, C., Koh, W. & Quake, S. R. Dissecting the clonal origins of childhood acute lymphoblastic leukemia by single-cell genomics. Proc. Natl Acad. Sci. USA 111, 17947–17952 (2014). This paper uses microfluidics to efficiently resequence the genomes of almost 1,500 cells, allowing new insights into the development of leukaemia.
Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012). This method overcomes some whole-genome amplification artefacts, resulting in more accurate single-cell genome assemblies.
Youssef, N. H., Blainey, P. C., Quake, S. R. & Elshahed, M. S. Partial genome assembly for a candidate division OP11 single cell from an anoxic spring (Zodletone Spring, Oklahoma). Appl. Environ. Microbiol. 77, 7804–7814 (2011).
Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013). This study identifies new phyla of microorganisms from diverse environments, enabling new insights into the biology of those ecosystems.
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Yoon, H. S. et al. Single-cell genomics reveals organismal interactions in uncultivated marine protists. Science 332, 714–717 (2011). This paper shows that single-cell sequencing can be used to study interactions of bacteria, protists and viruses at single-cell resolution.
Wang, J., Fan, H. C., Behr, B. & Quake, S. R. Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell 150, 402–412 (2012). This study establishes the feasibility of using single-cell sequencing to identify genomic structural variants and SNVs genome-wide.
Alfarawati, S., Fragouli, E., Colls, P. & Wells, D. First births after preimplantation genetic diagnosis of structural chromosome abnormalities using comparative genomic hybridization and microarray analysis. Hum. Reprod. 26, 1560–1574 (2011).
Lee, J. H. et al. Highly multiplexed subcellular RNA sequencing in situ. Science 343, 1360–1363 (2014). This study presents a method for acquiring single-cell transcriptomic data while retaining intercellular and intracellular spatial information.
C.G. is supported by funding from the Burroughs Wellcome Fund, American Lebanese Syrian Associated Charities, Hyundai Foundation for Pediatric Research, American Society of Haematology, and Leukaemia and Lymphoma Society. W.K. is supported by A*STAR (Agency of Science, Technology and Research; Singapore).
Charles Gawad is an assistant member in the Departments of Oncology and Computational Biology at St. Jude Children’s Research Hospital in Memphis, Tennessee, USA. He completed his M.D. from the University of Arizona, USA, in 2006 and subsequently completed his training in pediatric haematology-oncology. He then received his Ph.D. in cancer biology from Stanford University, California, USA, under the guidance of Patrick O. Brown, where they discovered that circular RNA is an abundant new class of non-coding RNA. Following this, he completed his postdoctoral training in single-cell genomics in Stephen Quake’s group, at Stanford University. His laboratory now focuses on combining sequencing with new biochemical and computational methods to more deeply understand the development and treatment resistance of childhood leukaemias.
Winston Koh is a scientist at a start-up company that is developing nucleic acid-based diagnostics. He was previously a graduate student in Stephen Quake’s group in the Department of Bioengineering at Stanford University, California, USA, where he received his Ph.D. in 2015. His work has focused on the development of computational and experimental methods for the application of emerging genomics technologies to clinical problems. He contributed to the assembly of the Chinese Hamster Ovary and Botryllus schlosseri genomes, as well as the development of tools for the quantification of fetal cell-free nucleic acid levels in maternal serum and the determination of clonal structures from single-cell DNA sequencing data.
Stephen R. Quake is the Lee Otterson Professor in the School of Engineering at Stanford University, California, USA, and an investigator of the Howard Hughes Medical Institute. His scientific work has focused on applying principles from his training in physics to address previously inaccessible biological questions. This has resulted in the invention of the microfluidic valve enabling microfluidic large-scale integration, the first single-molecule sequencer, a non-invasive test for fetal aneuploidy, methods for readily deciphering the immune repertoire of an individual and tools for analysing genomics at single-cell resolution. Awards recognizing his contributions include election to the National Academy of Sciences, National Academy of Engineering, Institute of Medicine, National Academy of Inventors and the American Academy of Arts and Sciences.