NBT, February 2019 issue: four papers in an IF-35 journal focus on metagenomics research

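Nature Biotechnology (NBT, IF 35.7) published eight research papers in its February 2019 issue (https://www.nature.com/nbt/volumes/37/issues/2): three Letters, three Articles, and two Resources. Four of them (two Articles and two Resources) report advances in metagenomics. The paper on ultrafast search of bacterial genomes is the cover article of this issue.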

Below we briefly introduce these four papers:

1. Ultrafast search of bacterial genomes

Ultrafast search of all deposited bacterial and viral genomic data

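Professor Zamin Iqbal's team at the Wellcome Trust Centre for Human Genetics, University of Oxford, made a breakthrough in ultrafast search algorithms for metagenomic data. Their method can integrate, update, and rapidly index bacterial and viral genomes from around the world, and the new index uses four orders of magnitude less storage than previous methods. The study is the cover paper of this issue of Nature Biotechnology and is recommended to readers.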

Abstract

Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as realtime genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.
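The abstract does not spell out the data structure, so the following is only a minimal sketch of the general idea behind a bitsliced signature index, under stated assumptions: each dataset gets a Bloom filter of its k-mers, the filters are stored as columns of a bit matrix, and a query ANDs together the rows selected by its k-mers. The class name `ToyBitslicedIndex`, the constants `K`, `M`, `NUM_HASHES`, and the salted SHA-256 hashing are all illustrative choices, not the BIGSI implementation.

```python
import hashlib

K = 5            # k-mer length (toy value chosen for the example)
M = 256          # number of Bloom filter bits, i.e. rows of the bit matrix
NUM_HASHES = 3   # hash functions applied to each k-mer


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def bit_positions(kmer):
    """Map a k-mer to NUM_HASHES row indices using salted SHA-256 (an assumption)."""
    return [
        int.from_bytes(hashlib.sha256(f"{salt}:{kmer}".encode()).digest()[:8], "big") % M
        for salt in range(NUM_HASHES)
    ]


class ToyBitslicedIndex:
    """rows[r] is an integer bitmask; bit j is set if dataset j has Bloom bit r set."""

    def __init__(self):
        self.rows = [0] * M
        self.names = []

    def add(self, name, sequence):
        """Index a new dataset by appending one column; existing columns are untouched."""
        col = len(self.names)
        self.names.append(name)
        for kmer in kmers(sequence):
            for r in bit_positions(kmer):
                self.rows[r] |= 1 << col

    def query(self, sequence):
        """Return datasets that may contain every k-mer of the query sequence."""
        hits = (1 << len(self.names)) - 1  # start with every dataset as a candidate
        for kmer in kmers(sequence):
            for r in bit_positions(kmer):
                hits &= self.rows[r]       # one row lookup covers all datasets at once
        return [n for j, n in enumerate(self.names) if hits >> j & 1]


index = ToyBitslicedIndex()
index.add("sample_A", "ACGTACGTTAGCCGATAG")
index.add("sample_B", "TTTTTGGGGGCCCCCAAAAA")
print(index.query("CGTTAGCC"))  # always includes sample_A; false positives are possible
```

In this sketch a dataset that truly contains the query is never missed, but Bloom filters can report false positives, so hits would normally be confirmed against the underlying data. Adding a dataset only appends a column, which mirrors the incremental growth described in the abstract.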

Sequence search methods

Fig. 1 | Sequence matching methods.

a, Mapping of sequence reads to a reference genome from the same species, assuming relatively low divergence; requirement to map millions of reads in acceptable time and return an alignment and mapping score. Common tools: bwa and bowtie.

b, BLAST compares a query string with a database of reference genomes (in the figure we show RefSeq genomes in a dotted box) covering a massive phylogenetic range. BLAST takes k-mers from the query, and for each k-mer it creates a ‘neighborhood’ of k-mers within a fixed edit distance (edits are shown in red, b(iii)), and searches for these in the reference genome database. Alignment is only done by extending from these hits. BLAST can be applied to nucleotide and protein searches and can find close and remote homology matches.

c, MASH stores a tiny fingerprint of each reference in the database (in this case RefSeq). Querying with an assembly, the fingerprint of the assembly is compared with that of RefSeq to find the closest reference.

d, Sequence Bloom Tree (SBT) was the first scalable method to search through raw unassembled readsets (unassembled readsets are shown as ‘piles’ of reads (short lines), all in same color to signify same species), by indexing the k-mers in the data and then compressing the index. Designed for human data, SBT can be applied to find which RNA-seq datasets contain a given transcript.

e, BIGSI can search the complete set of raw sequence data for bacteria and viruses. RefSeq is shown in a dotted box amongst unassembled readsets; different colors to signify the massive range of species and phyla. The different input data for SBT and BIGSI mean that these methods have different speed and compression trade-offs.
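The "tiny fingerprint" described in panel c can be illustrated with a bottom-s MinHash sketch. This is a minimal sketch of the general technique, not MASH's code: the function names, the k-mer length of 4, the sketch size of 16, and the use of SHA-1 as the hash are assumptions made for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Hash a k-mer to a 64-bit integer (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k=4, size=16):
    """Fingerprint of a sequence: the `size` smallest k-mer hash values."""
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate k-mer Jaccard similarity from two fingerprints alone."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


reference = sketch("ACGTACGTTAGCCGATAGGCTAACG")
assembly = sketch("ACGTACGTTAGCCGATAGGCTTACG")
print(jaccard_estimate(reference, assembly))
```

Comparing a query fingerprint against the stored fingerprint of every reference and keeping the highest estimate gives the "closest reference" behaviour the caption describes, without touching the full sequences.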

2. Comprehensive and scalable probe design to capture sequence diversity in metagenomes

Capturing sequence diversity in metagenomes with comprehensive and scalable probe design

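The team of Hayden C. Metsky and Katherine J. Siddle at the Broad Institute of MIT and Harvard made a breakthrough in probe design for metagenomic data. Their method can design probes for whole viral genomes and can be used efficiently for virus detection and sequence capture, enabling more sensitive and cost-effective metagenomic capture sequencing.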

Abstract

Metagenomic sequencing has the potential to transform microbial detection and characterization, but new tools are needed to improve its sensitivity. Here we present CATCH, a computational method to enhance nucleic acid capture for enrichment of diverse microbial taxa. CATCH designs optimal probe sets, with a specified number of oligonucleotides, that achieve full coverage of, and scale well with, known sequence diversity. We focus on applying CATCH to capture viral genomes in complex metagenomic samples. We design, synthesize, and validate multiple probe sets, including one that targets the whole genomes of the 356 viral species known to infect humans. Capture with these probe sets enriches unique viral content on average 18-fold, allowing us to assemble genomes that could not be recovered without enrichment, and accurately preserves within-sample diversity. We also use these probe sets to recover genomes from the 2018 Lassa fever outbreak in Nigeria and to improve detection of uncharacterized viral infections in human and mosquito samples. The results demonstrate that CATCH enables more sensitive and cost-effective metagenomic sequencing.

Probe design with CATCH

Fig. 1 | Using CATCH for probe set design.

a, Sketch of CATCH’s approach to probe design, shown with three datasets (typically, each is a taxon). For each dataset d, CATCH generates candidate probes by tiling across input genomes and, optionally, reduces the number of them using locality-sensitive hashing. Then it determines a profile of where each candidate probe will hybridize (the genomes and regions within them) under a model with parameters θd (see Supplementary Fig. 1b for details). Using these coverage profiles, it approximates the smallest collection of probes that fully captures all input genomes (described in the text as s(d, θd)). Given a constraint on the total number of probes (N) and a loss function over θd, it searches for the optimal θd for all d.

b, Number of probes required to fully capture increasing numbers of HCV genomes. Approaches shown are simple tiling (gray), a clustering-based approach at two levels of stringency (red), and CATCH with three choices of parameter values specifying varying levels of stringency (blue). See Supplementary Note 2 for details regarding parameter choices. Previous approaches for targeting viral diversity use clustering in probe set design. The shaded regions around each line are 95% pointwise confidence bands calculated across randomly sampled input genomes.

c, Number of probes designed by CATCH for each dataset (of 296 datasets in total) among all 349,998 probes in the VALL probe set. Species incorporated in our sample testing are labeled.

d, Values of the two parameters selected by CATCH for each dataset in the design of VALL: number of mismatches to tolerate in hybridization and length of the target fragment (in nucleotides) on each side of the hybridized region assumed to be captured along with the hybridized region (cover extension). The label and size of each bubble indicate the number of datasets that were assigned a particular combination of values. Species included in our sample testing are labeled in black, and outlier species not included in our testing are in gray. In general, more diverse viruses (for example, HCV and HIV-1) are assigned more relaxed parameter values (here, high values) than less diverse viruses, but still require a relatively large number of probes in the design to cover known diversity (see c). Panels similar to c and d for the design of VWAFR are in Supplementary Fig. 3.
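Panel a describes tiling candidate probes across the input genomes, computing where each candidate hybridizes, and approximating the smallest probe collection that covers every genome. One standard way to approximate such a minimum cover is the greedy set cover heuristic, sketched below. This is an illustration under stated assumptions, not CATCH's code: the function names are hypothetical, hybridization is modelled crudely as a Hamming-distance match with a `mismatches` threshold plus a fixed `extension` on each side, and the search over the parameters θd under a probe budget N is omitted.

```python
def tile(genome, probe_len, stride):
    """Candidate probes: windows every `stride` bases, plus the final window."""
    starts = list(range(0, len(genome) - probe_len + 1, stride))
    last = len(genome) - probe_len
    if last >= 0 and (not starts or starts[-1] != last):
        starts.append(last)
    return {genome[s:s + probe_len] for s in starts}


def coverage(probe, genomes, mismatches, extension):
    """Set of (genome index, position) pairs the probe is assumed to capture."""
    covered = set()
    n = len(probe)
    for g, genome in enumerate(genomes):
        for s in range(len(genome) - n + 1):
            diff = sum(a != b for a, b in zip(probe, genome[s:s + n]))
            if diff <= mismatches:
                lo = max(0, s - extension)
                hi = min(len(genome), s + n + extension)
                covered.update((g, p) for p in range(lo, hi))
    return covered


def design_probes(genomes, probe_len=6, stride=3, mismatches=1, extension=0):
    """Greedy set cover: repeatedly pick the probe covering the most uncovered bases."""
    candidates = set().union(*(tile(g, probe_len, stride) for g in genomes))
    profiles = {p: coverage(p, genomes, mismatches, extension) for p in candidates}
    uncovered = {(g, p) for g, genome in enumerate(genomes) for p in range(len(genome))}
    chosen = []
    while uncovered:
        # Sorting the candidates makes tie-breaking deterministic.
        best = max(sorted(profiles), key=lambda p: len(profiles[p] & uncovered))
        gain = profiles[best] & uncovered
        if not gain:
            break  # remaining bases cannot be covered by any candidate
        chosen.append(best)
        uncovered -= gain
    return chosen


genomes = ["ACGTACGTAC", "ACGTACGAAC"]
print(design_probes(genomes))
```

Because one probe can cover near-identical regions in many genomes, the number of probes chosen this way grows more slowly than simple tiling as similar genomes are added, which is the behaviour panel b compares across methods. Raising `mismatches` or `extension` relaxes the model and shrinks the probe set, in the spirit of the two parameters shown in panel d.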

3. 1,520 genomes from cultivated human gut bacteria enable functional microbiome analyses

1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses

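On the morning of February 5, 2019, a BGI team published the world's largest human gut bacterial genome collection, the Culturable Genome Reference (CGR), in Nature Biotechnology, a Nature-family journal. The study provides more than 1,500 high-quality human gut bacterial genomes, supplying a large amount of new reference genome data for gut microbiome research and taking functional analysis of the gut microbiota to a new level. It is also the first time that so many high-quality bacterial genomes have been obtained through large-scale cultivation. The gut bacterial genome collection and strain bank, built under the leadership of the metagenomics research team at BGI-Shenzhen (深圳華大生命科學研究院), is of significant scientific value for deciphering the relationship between the gut microbiota and disease, and it provides a valuable foundational resource for in-depth study of the functions of human gut strains.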

For more coverage, see "NBT-2019: BGI releases the world's largest human gut bacterial genome collection".

Abstract

Reference genomes are essential for metagenomic analyses and functional characterization of the human gut microbiota. We present the Culturable Genome Reference (CGR), a collection of 1,520 nonredundant, high-quality draft genomes generated from >6,000 bacteria cultivated from fecal samples of healthy humans. Of the 1,520 genomes, which were chosen to cover all major bacterial phyla and genera in the human gut, 264 are not represented in existing reference genome catalogs. We show that this increase in the number of reference bacterial genomes improves the rate of mapping metagenomic sequencing reads from 50% to >70%, enabling higher-resolution descriptions of the human gut microbiome. We use the CGR genomes to annotate functions of 338 bacterial species, showing the utility of this resource for functional studies. We also carry out a pan-genome analysis of 38 important human gut species, which reveals the diversity and specificity of functional enrichment between their core and dispensable genomes.

Phylogenetic tree of gut bacteria

Fig. 1 | Phylogenetic tree of 1,520 isolated gut bacteria based on whole-genome sequences. The 1,520 high-quality genomes in CGR are classified into 338 species-level clusters (ANI ≥ 95%) based on their whole-genome sequences. Bacterial species from Firmicutes are colored in orange; Bacteroidetes, blue; Proteobacteria, green; Actinobacteria, violet; Fusobacteria, gray. Novel genera and species are highlighted by red and orange branches, respectively. The bar on the outermost layer indicates the number of genomes archived in each cluster. Rhizobium selenitireducens ATCC BAA 1503 was used as an outgroup for phylogenetic analysis.
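The caption groups genomes into species-level clusters at ANI ≥ 95% but does not say how the clusters were formed. As one possible reading, the sketch below uses single-linkage clustering with a union-find structure: any two genomes whose pairwise ANI reaches the threshold end up in the same cluster. The function name `cluster_by_ani`, the input format, and the ANI values are hypothetical, and the pairwise ANI values are assumed to come from an external tool.

```python
def cluster_by_ani(genomes, ani, threshold=95.0):
    """Single-linkage clusters: genomes are joined whenever pairwise ANI >= threshold."""
    parent = {g: g for g in genomes}

    def find(g):
        # Walk to the cluster representative, shortening the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), value in ani.items():
        if value >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise ANI values (percent) for four genomes.
genomes = ["g1", "g2", "g3", "g4"]
ani = {
    ("g1", "g2"): 98.7,
    ("g2", "g3"): 96.1,
    ("g1", "g3"): 94.0,
    ("g3", "g4"): 80.2,
}
print(cluster_by_ani(genomes, ani))  # [['g1', 'g2', 'g3'], ['g4']]
```

Note that single linkage places g1 and g3 together through g2 even though their direct ANI is below 95%; a stricter scheme such as complete linkage would split them, so the choice of linkage affects the cluster count.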

4. A human gut bacterial genome and culture collection for improved metagenomic analyses

A human gut bacterial genome and culture collection for improved metagenomic analyses

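Trevor D. Lawley's team at the Host-Microbiota Interactions Laboratory, Wellcome Sanger Institute, released 737 whole-genome-sequenced bacterial isolates cultivated from the human gastrointestinal tract. This resource increases the number of bacterial genomes from the human gastrointestinal microbiota by 37% and improves taxonomic classification by 61% compared with the HMP genome collection, which helps enable rapid, assembly-free quantification of functional genes in metagenomes. The study is similar to the BGI culturomics work above; the two were published back to back in the Resource section of the same NBT issue.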

Abstract

Understanding gut microbiome functions requires cultivated bacteria for experimental validation and reference bacterial genome sequences to interpret metagenome datasets and guide functional analyses. We present the Human Gastrointestinal Bacteria Culture Collection (HBC), a comprehensive set of 737 whole-genome-sequenced bacterial isolates, representing 273 species (105 novel species) from 31 families found in the human gastrointestinal microbiota. The HBC increases the number of bacterial genomes derived from human gastrointestinal microbiota by 37%. The resulting global Human Gastrointestinal Bacteria Genome Collection (HGG) classifies 83% of genera by abundance across 13,490 shotgun-sequenced metagenomic samples, improves taxonomic classification by 61% compared to the Human Microbiome Project (HMP) genome collection and achieves subspecies-level classification for almost 50% of sequences. The improved resource of gastrointestinal bacterial reference sequences circumvents dependence on de novo assembly of metagenomes and enables accurate and cost-effective shotgun metagenomic analyses of human gastrointestinal microbiota.

Phylogenetic tree of gastrointestinal bacteria

Fig. 1 | Phylogenetic diversity of the human gastrointestinal microbiota genome collection. Maximum-likelihood tree generated using the 40 universal core genes from the 737 HBC genomes (green outer circle) and the 617 high-quality public genomes derived from human gastrointestinal tract samples, which together make up the HGG. Branch color distinguishes bacterial phyla belonging to Actinobacteria (gold; n = 129 genomes), Bacteroidetes (green; n = 231 genomes), Firmicutes (blue; n = 772 genomes), Fusobacteria (black; n = 26 genomes), Synergistetes (pink; n = 2 genomes) and Proteobacteria (orange; n = 194 genomes) shown.

References

  1. Bradley Phelim, den Bakker Henk C, Rocha Eduardo P C, et al. Ultrafast search of all deposited bacterial and viral genomic data. Nat. Biotechnol., 2019, 37: 152-159.

  2. Metsky Hayden C, Siddle Katherine J, Gladden-Young Adrianne, et al. Capturing sequence diversity in metagenomes with comprehensive and scalable probe design. Nat. Biotechnol., 2019, 37: 160-168.

  3. Zou Yuanqiang, Xue Wenbin, Luo Guangwen, et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat. Biotechnol., 2019, 37: 179-185.

  4. Forster Samuel C, Kumar Nitin, Anonye Blessing O, et al. A human gut bacterial genome and culture collection for improved metagenomic analyses. Nat. Biotechnol., 2019, 37: 186-192.
