A new sequencing technique: direct sequencing without library construction

Date: December 14, 2012  Source: ebiotrade (生物通)


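ebiotrade report: Researchers from the Sanger Institute, a long-established UK sequencing research center, and the Babraham Institute in Cambridge, UK, have published a paper entitled "Direct sequencing of small genomes on the Pacific Biosciences RS without library preparation". It describes, for the first time, a technique that sequences single DNA molecules without any library preparation. The work was carried out on the PacBio RS, a third-generation single-molecule sequencing system. The technique both simplifies the standard genome sequencing workflow and reduces the amount of sample DNA required. The results were published in the December issue of BioTechniques.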

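Sequencing technologies that require library construction

Generally speaking, genome sequencing (specifically, first- and second-generation methods) follows a multi-step experimental workflow.

Within this workflow, library construction (also called sequence collection) is the most time-consuming and labor-intensive step. In addition, because the sample is amplified more than a thousandfold during library construction, the linear relationship between gene abundances in the sample is distorted, which affects NGS quantification. If this step could be skipped, both the speed and the accuracy of genome sequencing would improve considerably.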
 

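A new sequencing technique that needs no library construction

The new technique requires no library preparation and obtains sequence data directly from DNA fragments. Compared with standard methods it also needs very little DNA: input can be less than 1 ng (one billionth of a gram), only 1/500 to 1/600 of what conventional sequencing methods use. The researchers note that the method is suited to sequencing small genomes.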

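The first author of the paper is Dr. Paul Coupland of the Sanger Institute. "This is the first time that direct sequencing of single DNA molecules has been achieved," he said. "Using this new method we sequenced viral and bacterial genomes and found that, even without much effort spent on optimizing conditions, we could identify which organism was present. And even when these organisms carry particular genes or plasmids (which determine antibiotic resistance), or genetic information such as specific DNA base modifications, genome sequencing is not affected."

"With optimization, this technique could identify bacteria and viruses in hospitals and other healthcare settings quickly and efficiently, so it has great potential for application. It should also increase confidence in the sequence, because the process involves no library construction."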

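The researchers demonstrated this simplified direct sequencing method on the PacBio RS, a third-generation single-molecule sequencing system. The sequencing approach is known as SMRT, that is, single-molecule real-time DNA sequencing. A notable feature of the technique is that going from sample preparation to sequencing results takes less than a day, which makes it well suited to infectious disease surveillance.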

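In this study, the researchers used very small amounts of DNA to analyze small circular single-stranded and double-stranded DNA viral genomes, as well as a linear fragment from an MRSA strain of Staphylococcus aureus. The samples amounted to only 800 pg, about 600 times less than a standard analysis requires. On the PacBio the researchers obtained only 70 fragments, a small fraction of what conventional sequencing methods yield, yet this information was enough to determine which organisms the samples came from.

The method was developed on the PacBio platform at the Sanger Institute. Because it can examine unknown sequences, it can be used to identify organisms that previously could not be identified. It is also fast and needs little sample, which makes it a powerful tool for future infectious disease surveillance and clinical testing. (ebiotrade: Zhang Di)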

Full text of the original article:

Direct sequencing of small genomes on the Pacific Biosciences RS without library preparation.

Pacific Biosciences (Menlo Park, CA, USA) has developed a platform that sequences a single molecule of DNA in real time via the polymerization of that strand with a single polymerase (1-6). This technique has many benefits over multi-molecule (clonal) sequencing technologies (7, 8); one potential advantage is that it may not be absolutely necessary to make a library (i.e., create SMRT bells (9)) to generate sequence data. The only (molecular) input requirement for sequencing is a primed piece of DNA; both single-stranded and double-stranded molecules will work. The polymerase is necessarily highly processive, starting at a location on the DNA at which it can bind, i.e., a free 3′-OH group. We decided to test whether any primed DNA molecules, lacking any other features of a PacBio SMRT bell, could be used directly in a sequencing reaction. The bound complex (DNA-primer-polymerase), although lacking PacBio adapter sequences, can still be sequenced on the PacBio platform. The present efficiency of this process, in terms of the number of reads generated and Mb yield per SMRT cell, is considerably lower than that of standard libraries. With standard methods, a typical SMRT cell will yield 35,000–50,000 reads and 100–160 Mb of mapped bases. The direct sequencing method described here has generated up to 3000 reads per SMRT cell, and its utility is therefore limited to small genomes. However, this approach enables one to acquire sequence data from comparatively low amounts of DNA, even less than 1 ng of input, and within eight hours of receiving the sample. There is a slight time saving compared with the 12 h required for standard library prep. This is not the main advantage, though it does now offer a route from sample to sequence within an average working day. This protocol may be of benefit to the direct sequencing of plasmids, single-stranded or double-stranded viruses, mitochondrial DNA, and microbial pathogens in a clinical setting.

Materials and methods

M13mp18 viral DNA (both single-stranded and double-stranded; catalog no. N4040S and N4018S, respectively) and M13 forward (5′-GTTTTCCCAGTCACGAC-3′) and reverse sequencing primers (5′-AACAGCTATGACCATG-3′) were from New England Biolabs (Hitchin, UK). Methicillin-resistant Staphylococcus aureus (MRSA) plasmids were purified from a solution prep of S. aureus TW20 using a Qiagen (Crawley, UK) Plasmid Midi Kit with Qiagen Genomic-tip 100/G following the manufacturer's “very low-copy plasmid/cosmid purification protocol” from a 500 ml culture. Plasmid Safe DNase (Epicentre Biotechnologies, Madison, WI, USA) was used to reduce the amount of linear single- and double-stranded molecules from the TW20 plasmid prep. Random hexamer primers from Roche (Welwyn Garden City, UK) were used, as provided in the Transcriptor First Strand cDNA Synthesis Kit. pET28a plasmid vectors encoding EcoDamI methyltransferase (Dam constructs), expressed in dam−/dcm− Escherichia coli cells, were prepared in-house. Components from the DNA/Polymerase Binding Kit 2.0 from Pacific Biosciences were used during the annealing and binding reactions. The Annealing and Binding Calculator (version 1.3.1) provided by Pacific Biosciences was used to calculate the concentration of bound complex to be loaded onto the sample plate for the instrument. An MJ PTC-225 thermocycler from MJ Research (Watertown, MA, USA) was used for the annealing and binding reactions. The PacBio DNA Sequencing Kit 2.0 (8Rxn) and SMRT Cell 8Pac v2 (8 Cells) were used for sequencing. Sequence analysis was performed with SMRT portal, SMRT pipe, and SMRT View, version 1.3.1, and Motif Finder, version 0807, all from Pacific Biosciences.

Annealing reaction

Standard library preparation was omitted; the DNA templates were used directly in the annealing reaction. For each experiment, a quantity of DNA between 1 ng and 100 ng was annealed with suitable primers. With ssDNA, the annealing reaction used the standard PacBio protocol; i.e., 2 min at 80°C followed by cooling at 0.1°C/s to 25°C. With dsDNA, a different annealing protocol was used; the reaction was heated to 95°C for 5 min, then immediately snap-cooled on wet ice. As an example, when using ds M13mp18 DNA, 2.2 µL of DNA at 46 ng/µL (∼100 ng), 0.9 µL PacBio Primer Buffer (10×), 0.9 µL forward primer (10 µM), and 0.9 µL reverse primer (10 µM) were mixed in a final annealing reaction volume of 9 µL. The final concentration of DNA template was therefore ∼2.5 nM, with 1000 nM primer (∼400×). In order to use the PacBio Annealing and Binding calculator, we assumed that denatured M13mp18 DNA is comparable to a SMRT bell with half the original double-stranded M13mp18 molecule's nominal length; i.e., one double-stranded 7.2-kb molecule, when denatured, becomes two 3.6-kb SMRT bells. A 2-fold dilution series of DNA was used to create additional annealing reactions in the range of 0.8–100 ng of DNA. There was a massive excess of forward and reverse primers at the lower concentrations of DNA in these reactions.
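The concentrations quoted for the ds M13mp18 example can be reproduced with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the average mass of 650 g/mol per base pair is a common rule-of-thumb assumption that does not come from the paper. Under that assumption the template comes out slightly below the ∼2.5 nM quoted, and the primer excess slightly above ∼400×.

```python
# Assumption (not from the paper): average mass of one dsDNA base pair, ~650 g/mol.
AVG_BP_MASS_G_PER_MOL = 650.0


def dsdna_conc_nM(mass_ng: float, length_bp: int, volume_uL: float) -> float:
    """Molar concentration (nM) of double-stranded molecules in a reaction."""
    moles = (mass_ng * 1e-9) / (length_bp * AVG_BP_MASS_G_PER_MOL)
    return moles / (volume_uL * 1e-6) * 1e9


def diluted_conc_nM(stock_nM: float, added_uL: float, final_uL: float) -> float:
    """C1*V1 = C2*V2: concentration of a stock after dilution to the final volume."""
    return stock_nM * added_uL / final_uL


# ~100 ng of 7.2-kb ds M13mp18 in a 9 µL annealing reaction
template_nM = dsdna_conc_nM(mass_ng=100, length_bp=7200, volume_uL=9)
# 0.9 µL of a 10 µM (10,000 nM) primer stock in the same 9 µL
primer_nM = diluted_conc_nM(stock_nM=10_000, added_uL=0.9, final_uL=9)

print(f"template:      {template_nM:.2f} nM")
print(f"each primer:   {primer_nM:.0f} nM")
print(f"primer excess: {primer_nM / template_nM:.0f}x")
```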

Binding reaction, loading, and sequencing

In the binding reaction, the ratio of polymerase to template DNA was 3:1. First, 1.5 µL of polymerase (1600 nM) was combined with 25 µL of binding buffer, giving a 90 nM polymerase solution. Four µL of a 1:1:1 DTT:dNTP:binding buffer mix (each from the PacBio Binding Kit) was added to the annealed template DNA, and 1.5 µL of 90 nM polymerase was added. This was mixed gently by pipetting and then incubated at 30°C for 4 h.
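As a rough consistency check, the polymerase dilution and the 3:1 ratio can be worked through numerically for the 100 ng M13mp18 example. This is a sketch under stated assumptions, not the PacBio calculator: the helper names are hypothetical, 650 g/mol per base pair is an assumed average, and each denatured strand is counted as one template, following the paper's own convention of treating one double-stranded molecule as two SMRT bells.

```python
# Assumption (not from the paper): average mass of one dsDNA base pair, ~650 g/mol.
AVG_BP_MASS_G_PER_MOL = 650.0


def dilute_nM(stock_nM: float, stock_uL: float, diluent_uL: float) -> float:
    """Concentration after adding stock to diluent (C1*V1 = C2*V2)."""
    return stock_nM * stock_uL / (stock_uL + diluent_uL)


def ds_molecules_fmol(mass_ng: float, length_bp: int) -> float:
    """Femtomoles of double-stranded molecules in a given mass of DNA."""
    return mass_ng * 1e-9 / (length_bp * AVG_BP_MASS_G_PER_MOL) * 1e15


# 1.5 µL of 1600 nM polymerase into 25 µL binding buffer
polymerase_nM = dilute_nM(1600, 1.5, 25)
# 1.5 µL of that solution goes into the binding reaction (nM x µL = fmol)
polymerase_fmol = polymerase_nM * 1.5
# 100 ng of 7.2-kb dsDNA, two template strands per denatured molecule
template_fmol = 2 * ds_molecules_fmol(100, 7200)
ratio = polymerase_fmol / template_fmol

print(f"diluted polymerase: {polymerase_nM:.1f} nM")
print(f"polymerase:template ratio: {ratio:.1f}:1")
```

Under these assumptions the diluted polymerase comes out close to 90 nM and the ratio close to 3:1, in line with the figures in the text.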

The bound complex was loaded at 1 nM onto the instrument. Typically this is achieved by diluting the bound complexes with a mixture of 1:10 DTT:Complex Dilution Buffer. In this experiment, however, it was only possible to achieve a 1 nM loading concentration for the samples containing 100 ng and 50 ng input DNA. For the other samples in the 2-fold dilution series, the calculated concentration was <1 nM before dilution. The total volume of 14.6 µL of binding reaction was therefore loaded directly into the sample plate wells for each of these dilute samples.
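The observation that only the 100 ng and 50 ng samples could reach 1 nM can be illustrated with a small calculation over the 2-fold dilution series. The sketch rests on several assumptions that are not stated in the paper: an average of 650 g/mol per base pair, eight dilution points (inferred from the 0.8–100 ng range), every denatured strand becoming one bound complex, and the 14.6 µL binding reaction volume as the undiluted volume. The function name is hypothetical.

```python
def strand_conc_nM(mass_ng: float, length_bp: int = 7200,
                   volume_uL: float = 14.6) -> float:
    """Concentration (nM) of template strands in the binding reaction.

    Assumes ~650 g/mol per base pair and two strands per dsDNA molecule.
    """
    ds_fmol = mass_ng * 1e-9 / (length_bp * 650.0) * 1e15
    return 2 * ds_fmol / volume_uL  # fmol per µL is the same as nM


# 2-fold series from 100 ng down to ~0.8 ng (eight points, an inference)
inputs_ng = [100 / 2 ** i for i in range(8)]

for mass in inputs_ng:
    conc = strand_conc_nM(mass)
    status = "can be diluted to 1 nM" if conc >= 1.0 else "load undiluted"
    print(f"{mass:7.2f} ng -> {conc:5.2f} nM  ({status})")

loadable = [m for m in inputs_ng if strand_conc_nM(m) >= 1.0]
```

With these assumptions, `loadable` contains only the 100 ng and 50 ng inputs, which matches the behaviour described above.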

Two × 45 min sequencing movies were acquired for each sample in this study. Mapping, de novo assembly, and modification analysis were carried out with PacBio's SMRT Analysis pipeline run via the SMRT Portal interface. PacBio's Motif Finder was used in the final step of analysis for the pET28a plasmid vector to characterize the sequence-specific motif at which base modifications were observed.

Results and discussion

At the outset of this study, an experiment using single-stranded M13mp18 viral DNA and the M13 forward sequencing primer (5′-GTTTTCCCAGTCACGAC-3′) showed that it was possible to generate sequence data directly from circular DNA molecules without library preparation; i.e., fragmentation, end repair, and adapter ligation. From 25 ng of ssDNA and 100-fold molar excess of primer, it was possible to map the data generated against the 7.2 kb M13mp18 reference sequence, calling 100% of the bases with 100% consensus accuracy. We next attempted to sequence double-stranded circular molecules of M13mp18 using both forward and reverse primers to obtain information from both strands in a single run. The sequencing of dsDNA molecules should have much wider application; for example, plasmids, phages, and ultimately larger genomes. Proving the ability to generate sequence data for both strands was therefore an important step in the development of this technique, especially considering the future application of PacBio for epigenetics (including hemi-methylation patterning); the ultimate goal was to sequence fragmented linear dsDNA, e.g., any sheared genomic sample, and generate enough useful data for future applications. We denatured the double-stranded DNA at 95°C for 5 min and snap-cooled (see Materials and Methods) in the presence of excess primer to successfully prime the two strands. This snap-cooling technique and the large primer concentration were used to give maximum opportunity for priming each strand while minimizing re-annealing of the genomic DNA. Alternative annealing conditions were tested as well: (i) following the standard PacBio recommended protocol of slowly cooling from 80°C to 25°C, (ii) snap-cooling from 95°C then raising the temperature to 45°C for 2 min, and (iii) cooling as quickly as possible on a thermocycler from 95°C to 45°C. In each of these three cases, far fewer reads were generated in the sequencing run. Snap-cooling on ice from 95°C was used subsequently for each dsDNA sample.

Figure 1 shows the difference in coverage profile of the M13mp18 genome when sequenced as ssDNA with the M13 forward sequencing primer, and the dsDNA sequenced with both the M13 forward and reverse sequencing primers (Figure 1, middle panel). The uneven coverage profile is due to the population of mapped reads (on each strand) having a distribution of read lengths, while most of the reads start from approximately the same position on the genome. With PacBio sequencing at present there is a “dark-time” of several minutes prior to the start of movie acquisition, which is the time it takes from initiation of the sequencing reaction to alignment of the SMRT cell in the correct position. Although the priming sequence for a given strand is in the same position on each molecule, variation in the polymerization speed and longevity accounts for some of the observed distribution. Additionally, the SMRT pipe software might have difficulty in mapping some reads, especially those that extend beyond the end of the linear FASTA reference. The DNA molecules sequenced were circular, but the reference used is a single linear sequence. Therefore, a number of the reads generated in these runs will, in fact, extend beyond the artificially imposed boundaries of the reference file. Some of the longest reads will also span the entire circular genome, further complicating the automatic analysis. The SMRT analysis software is not designed to deal with reads of this nature, although the initial filtering of the data is unaffected, as it is based on read quality metrics only. None of the reads have PacBio adapters that signal the end of a DNA template fragment, so the standard re-sequencing (mapping) protocol in SMRT portal possibly contributes to the uneven coverage profiles generated (mapping thresholds were a maximum of one hit per read, 30% maximum divergence, and minimum anchor size of 12). Some reads were longer than the entire genome, as evidenced by the maximum read lengths in the SMRT Portal raw read-length histogram (i.e., any reads >7.2 kb). These very long reads could be observed using PacBio's SMRT View software by concatenating two M13mp18 references in tandem into a single FASTA file (Figure 2).
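The tandem-reference trick is easy to illustrate. The sketch below builds a doubled linear reference from a circular genome and shows that a read crossing the origin, which has no contiguous match on the single linear reference, does match the tandem one. The toy sequence, the function names, and the use of exact substring matching in place of a real aligner are all simplifying assumptions made for illustration.

```python
def tandem_reference(seq: str, copies: int = 2) -> str:
    """Concatenate a circular genome with itself so wrap-around reads can map."""
    return seq * copies


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Toy stand-in for a circular genome (hypothetical sequence, not M13mp18)
genome = "ACGTTGCAAGGCTTAACCGGTATCG"

# A read that starts near the end of the linear reference and wraps past the origin
wrap_read = genome[-6:] + genome[:8]

single = genome
tandem = tandem_reference(genome)

print("maps to single reference:", wrap_read in single)
print("maps to tandem reference:", wrap_read in tandem)
print(to_fasta("toy_genome_x2", tandem), end="")
```

A real analysis would hand the tandem FASTA to the mapping software rather than use substring search; positions beyond the genome length then need to be reduced modulo the genome length to get circular coordinates.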

To enhance this technique, we used random hexamer primers rather than primers specifically designed for the sample. In this case, no prior knowledge of the DNA sequence is necessary and the method can, in principle, be applied directly to a wide range of unknown samples, e.g., in a clinical setting. Figure 1 (right panel) shows the coverage of ds M13mp18 sequenced with Roche's random hexamers. The coverage is more uniform than that obtained using the specific primers. The coverage distribution is still not ideal but, having started with 50 ng of DNA, we generated similar data to that shown in Table 1 for 50 ng of input DNA (2392 mapped reads, 100% bases called, 100% consensus accuracy, 403× mean coverage).

To test the application of our direct sequencing method for the PacBio detection of modified bases, 6-kb vectors with known positions of methylation were sequenced. A pET28a vector encoding EcoDam methyltransferase, which generates N6-methyladenine (m6A) methylation in GATC motifs, was directly sequenced with subsequent kinetic analysis using PacBio's software to identify base modification. Random hexamer primers were used and the experimental procedure was as described previously for M13mp18, starting with 25 ng of DNA. Figure 3 shows the SMRT View genome browser depiction of a portion of the sequence data for one of the vectors sequenced using this direct sequencing technique; four instances of GATC methylated sites are evident as peaks in the purple trace (+ strand) and orange trace (- strand). As the GATC sequence is a palindrome, there is an m6A base on both strands, and by observing the inter-pulse duration (IPD) ratio reported by the PacBio software, it is exceptionally easy to see these base modifications. Other studies have used the PacBio RS for base modification analysis on similar plasmid/methyltransferase models (10, 11) and entire bacterial genomes (12), but to our knowledge these studies followed standard library preparation protocols and required far greater amounts of DNA than the technique described here. The data we generated were analyzed with Motif Finder, an application provided by PacBio for mining polymerization kinetics for motifs associated with base modifications. In this vector, 50 instances of GATC methylated at the A position were identified; there are 25 GATC sites in the sequence and wild-type EcoDam was expected to methylate each one of them.
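The relationship between 25 GATC sites and 50 methylated adenines follows from the motif being its own reverse complement, so each site carries one m6A on each strand. The sketch below illustrates the counting on a hypothetical toy sequence; it is not Motif Finder and does no kinetic analysis. It also treats the sequence as linear, so a site spanning the junction of a circular plasmid would be missed.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def gatc_sites(seq: str) -> list[int]:
    """0-based start positions of GATC on the + strand (linear sequence)."""
    return [m.start() for m in re.finditer("GATC", seq)]


def expected_m6a_positions(seq: str) -> dict[str, list[int]]:
    """+ strand coordinates of the adenines Dam would methylate on each strand."""
    sites = gatc_sites(seq)
    return {
        "+": [s + 1 for s in sites],  # the A of GATC on the + strand
        "-": [s + 2 for s in sites],  # the A opposite the T, on the - strand
    }


toy = "TTGATCAAGGATCCGATCTT"  # hypothetical sequence with three GATC sites
sites = gatc_sites(toy)
m6a = expected_m6a_positions(toy)
total_m6a = len(m6a["+"]) + len(m6a["-"])

print("GATC sites:", sites)
print("expected m6A bases:", total_m6a)
```

Applied to a sequence with 25 GATC sites, the same counting gives 2 × 25 = 50 expected m6A bases, which is the number reported for the vector.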

Direct sequencing was then tested using a DNA extract of Staphylococcus aureus TW20, an MRSA strain and well-known cause of nosocomial infection (13). The plasmids of this bacterial sample were of interest as an example of the application of the PacBio RS to infectious disease identification through sequencing. Antibiotic resistance genes are often carried on plasmids (14, 15) and can spread very quickly in heterogeneous bacterial communities (16-19). DNA was extracted from a solution culture of TW20 and digested with Plasmid Safe DNase to reduce the amount of linear fragments and effectively increase the concentration of plasmids in the sample. An electropherogram of the final sample showed that the DNA preparation also contained a smear of linear double-stranded fragments ranging from 100 bp to >25 kbp, with a peak at approximately 20 kb (Supplementary Figure 1). The two plasmids in TW20 are double-stranded and circular, with lengths of 3 kb and 30 kb. We generated sequence data using random hexamer primers in the annealing reaction. Four reactions containing 50 ng of the S. aureus DNA preparation with various amounts of hexamer primers, from 10-fold to 600-fold (i.e., 500 ng, 1 µg, 10 µg, and 30 µg per annealing reaction), were performed in 9 µL reaction volumes. A single SMRT cell was sequenced for each reaction, and the trend observed across these four reactions showed fewer mapped reads as the amount of random hexamer primers increased. This is perhaps because of the proximity of annealed primers on the DNA strand at higher concentrations, leading to polymerases colliding with one another, or simply the reduction of signal to noise as two fluorescent signals could be observed concurrently. The annealing reaction with 10-fold primers generated 3240 mapped reads, 20-fold generated 3085 mapped reads, 200-fold generated 2911 mapped reads, and 600-fold generated 2011 mapped reads, all with a mean mapped read length of approximately 500 bp. There was also a difference in coverage depth between the two plasmids; the mean coverage for the 3-kb plasmid was 35×, but only 5× coverage was obtained for the 30-kb plasmid, which is due mostly to the difference in plasmid length. There is a loading inefficiency of larger molecules because of their lower diffusion coefficient, as well as the disparity between the molecule's hydrodynamic radius and the very small zero-mode waveguide (ZMW). Future upgrades to the loading mechanism on the PacBio instrument (MagBead loading), which should eliminate this problem, are very close to release. The combined sequence data from these four SMRT cells produced 13,724 reads; 479 reads mapped to the plasmids and 11,247 to the genome (5.3 Mb mapped, providing a mean 1.6× coverage), an overall mapping rate of 85%, which is not dissimilar to standard mapping rates of SMRT bell libraries we have made (from a recent single SMRT cell of an S. aureus TW20 1 kb SMRT bell library, 39,478 reads were mapped from 47,465 filtered reads, a mapping rate of 83%).
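The read counts in this paragraph can be tallied in a few lines. The variable names below are illustrative. One observation from the arithmetic: the four per-reaction counts add up to exactly 11,247, which suggests they refer to genome-mapped reads; that reading is an inference from the numbers, not something the text states.

```python
# Mapped reads per annealing reaction, keyed by fold excess of random hexamers
mapped_by_primer_fold = {10: 3240, 20: 3085, 200: 2911, 600: 2011}

total_reads = 13_724      # combined reads from the four SMRT cells
plasmid_mapped = 479
genome_mapped = 11_247

# Direct sequencing: overall fraction of reads that mapped anywhere
direct_rate = (plasmid_mapped + genome_mapped) / total_reads

# Standard 1 kb SMRT bell library, single SMRT cell, for comparison
library_rate = 39_478 / 47_465

per_reaction_sum = sum(mapped_by_primer_fold.values())

print(f"sum of per-reaction mapped reads: {per_reaction_sum}")
print(f"direct sequencing mapping rate:   {direct_rate:.1%}")
print(f"standard library mapping rate:    {library_rate:.1%}")
```

Both rates round to the 85% and 83% quoted in the text.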

Finally, the technique was used to sequence linear molecules of Candidatus Phytoplasma mali, a plant-pathogenic mycoplasma with a small genome of ∼600 kb that is 21.4% GC and characterized by large terminal inverted repeats and covalently closed hairpin ends (20). The DNA was sheared to approximately 3-kb fragments and a 25-ng aliquot was sequenced using random hexamers in a similar manner to that described previously. From a single SMRT cell with 2 × 45 min movies, only 870 post-filter reads were generated of which 63 reads mapped, with a mean consensus accuracy of 84.4%. The mean mapped read-length was 817 bp and the coverage only 0.08%. The poor mapping rate is most likely due to a greater percentage of low-quality reads from this particular sample. Although the yield is poor, direct sequencing of these linear DNA molecules shows some promise too. A blastn (21) search using the NCBI server against the refseq_genomic database called out Candidatus Phytoplasma mali as the most likely taxonomic hit (Supplementary Table 1). This suggests it is possible to obtain enough information from very few mapped reads to begin to identify the genomes present in a sample. However, comparing the difference in data yield between the S. aureus and Ca. Phytoplasma mali, it is clear that further optimization of the method is required to improve the number of reads that can be mapped when sequencing linear molecules from a variety of genomic samples.
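A back-of-the-envelope estimate of sequencing depth helps put these read counts in context. The sketch uses the standard approximation depth ≈ reads × mean read length / genome size. The function name is hypothetical, and the genome size is taken as the approximate 600 kb given above, so the result is only indicative.

```python
def mean_depth(n_reads: int, mean_read_len_bp: float, genome_size_bp: float) -> float:
    """Approximate mean fold coverage: total mapped bases divided by genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp


post_filter_reads = 870
mapped_reads = 63
mean_mapped_len_bp = 817
genome_size_bp = 600_000   # approximate size quoted for Ca. Phytoplasma mali

mapping_rate = mapped_reads / post_filter_reads
depth = mean_depth(mapped_reads, mean_mapped_len_bp, genome_size_bp)

print(f"mapping rate: {mapping_rate:.1%}")
print(f"estimated mean depth: {depth:.3f}x")
```

Under these assumptions about 7% of the post-filter reads map, and the estimated mean depth is a little under 0.09-fold. That is numerically close to the 0.08 figure quoted if it is read as fold coverage rather than a percentage, although that reading is an assumption.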

The method described here utilizes the PacBio RS platform for direct sequencing, enabling the generation of sequence data from small single- and double-stranded DNA genomes. Potentially this technique also could be applied to circularized molecules, e.g., amplicons or sheared fragments that have been circularized. However, the additional circularization step and clean up would mean relatively minor time and DNA savings compared with current PacBio protocols. The direct sequencing technique could allow the identification of plasmids present in a bacterial sample in an extremely straightforward and fast manner. Although there is an indication that different genomes may be more or less accessible with this method, we have demonstrated its application to sequencing ssDNA and dsDNA viruses, plasmid vector models for methylation studies, antibiotic resistance gene-carrying plasmids, and the entire genome of a clinically relevant microbial pathogen. All of these were performed without the need for library preparation, and it is possible to generate sequence data within 8 h from <1 ng of DNA without a PCR amplification step. The fact that our method can be performed without a priori knowledge of any sequence and with no organism-specific reagents, coupled with its simplicity and speed, makes it particularly well suited for use in acute disease and infectious outbreak scenarios.


 
