Nature Methods: SourceTracker, a package for tracking microbial sources (interpreting results and usage tutorial)

2020-02-07 05:05:46 微生物組 (Microbiome)

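A while ago we translated Rob Knight's review, about 18,000 characters long. Reading it through twice will give you a firm grasp of the analysis framework of the microbiome field and of where analysis methods are heading. So far it has about 19,000 views on the 宏基因組 (Metagenome) platform, 8,500+ views from its first release on the 熱心腸 platform, 8,700+ as a featured, pinned post on 科學網 (ScienceNet), and 1,200+ on CSDN: close to 40,000 views across the four platforms.

If you have not studied it carefully yet, go and read it twice!

Nature review: the experts show you step by step how to analyze microbiome data (18,000 characters)

The review mentions SourceTracker, a tool for tracking where microbes come from, and several readers asked about it in the comments. Today we explain how to read its output. We also describe how to prepare the input files and annotate the example code, so that you can apply it to your own data.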

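What is SourceTracker used for?

It estimates how much of the microbial community in one group of samples comes from each of several candidate source groups. Typical questions include:

Infant gut microbiota: which part is inherited from the mother's gut microbiota, which comes from the vaginal microbiota, and which from the skin?
Forensics: matching the microbiota of a corpse to the soil it came from, and deciding whether decay bacteria come from the body itself or from the surrounding environment.
River pollution: tracing pollutants back to their sources and estimating the contributions of nearby factories, farmland and livestock farms.
Plant microbiome assembly: how much of the rhizosphere microbiota comes from soil and how much from seed, what proportion of the phyllosphere microbiota comes from soil, and so on.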

Software overview

The paper, Bayesian community-wide culture-independent microbial source tracking, was published in Nature Methods in 2011.


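The work was a collaboration between Professor Scott Kelley of San Diego State University and Rob Knight's group. The first author is Dan Knights.

At the time of writing, Google Scholar lists 299 citations.

In the software, the target samples are called <code>Sink</code>, and the samples that represent contamination sources or origins are called <code>Source</code>. Using a Bayesian approach, it takes the community composition of the Source and Sink samples and estimates what proportion of each Sink sample comes from each Source.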
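To make the Sink/Source idea concrete, here is a deliberately simplified R sketch of our own; it is not SourceTracker's implementation. It assumes two made-up source profiles (`gut`, `soil`) that are known exactly, no Unknown source, and a single hypothetical prior count `beta`. Each sink sequence is repeatedly reassigned to a source by Gibbs sampling, and the share of sequences assigned to each source gives the estimated mixing proportions.

```r
# Toy illustration only: fixed source profiles, no "Unknown" source.
set.seed(1)

n_taxa <- 20
# Two hypothetical source profiles (taxon probabilities, each sums to 1).
gut  <- c(rep(0.09, 10), rep(0.01, 10))
soil <- c(rep(0.01, 10), rep(0.09, 10))
sources <- rbind(Gut = gut, Soil = soil)

# Simulate a sink sample of 500 sequences: 70% Gut, 30% Soil.
true_mix <- c(Gut = 0.7, Soil = 0.3)
sink_profile <- as.vector(true_mix %*% sources)
sink_taxa <- sample(n_taxa, 500, replace = TRUE, prob = sink_profile)

gibbs_mix <- function(taxa, sources, n_iter = 200, burnin = 100, beta = 1) {
  k <- nrow(sources)
  z <- sample(k, length(taxa), replace = TRUE)   # random initial assignments
  counts <- tabulate(z, nbins = k)               # sequences per source
  kept <- matrix(0, nrow = n_iter - burnin, ncol = k,
                 dimnames = list(NULL, rownames(sources)))
  for (iter in seq_len(n_iter)) {
    for (i in seq_along(taxa)) {
      counts[z[i]] <- counts[z[i]] - 1           # take sequence i out
      p <- sources[, taxa[i]] * (counts + beta)  # likelihood x prior weight
      z[i] <- sample(k, 1, prob = p)             # resample its source
      counts[z[i]] <- counts[z[i]] + 1
    }
    # After burn-in, record the mixing proportions of this sweep.
    if (iter > burnin) kept[iter - burnin, ] <- counts / length(taxa)
  }
  kept
}

draws <- gibbs_mix(sink_taxa, sources)
estimate <- colMeans(draws)
print(round(estimate, 2))
```

The estimate should land near the simulated 70/30 split. The real software additionally treats the source profiles as uncertain and adds an Unknown source, which this sketch leaves out.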

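We previously covered a Science paper from Rob Knight's group in which Figure 2A uses this software to show that the main bacteria during corpse decomposition come from soil.

16S plus functional prediction, published in Science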

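Interpreting the output

SourceTracker figure a: bar charts of the estimated source proportions. Each chart represents one sink sample, and bars of different colors show the proportion of that sample attributed to each source. Unknown stands for sources that are not among those provided, and the error bars show the standard deviation over 100 Gibbs sampling draws.

SourceTracker figure b: area plots of the estimated source proportions. Each chart represents one sink sample and each color one source. Every column is the result of one Gibbs draw, and the 100 draws are arranged so that similar mixtures sit next to each other.

SourceTracker figure c: pie charts of the estimated source proportions. Each pie represents one sink sample, and the size of each colored slice is the proportion of that sample attributed to the corresponding source.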
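The three figure types are different summaries of the same set of Gibbs draws. The R sketch below shows the arithmetic on simulated data. The `draws` matrix is a hypothetical stand-in for 100 draws of one sink sample (rows are draws, columns are sources), not real SourceTracker output, and sorting by the dominant source is only a simple approximation of "keeping similar mixtures together".

```r
set.seed(42)
sources <- c("Gut", "Oral", "Skin", "Soil", "Unknown")

# Hypothetical stand-in for 100 Gibbs draws of one sink sample:
# random positive weights, normalised so that each draw (row) sums to 1.
raw <- matrix(rgamma(100 * 5, shape = rep(c(6, 1, 3, 1, 9), each = 100)),
              nrow = 100, dimnames = list(NULL, sources))
draws <- raw / rowSums(raw)

mean_prop <- colMeans(draws)       # bar height (figure a) or pie slice (figure c)
sd_prop   <- apply(draws, 2, sd)   # error bar in figure a

# Figure b: sort the draws by the dominant source so similar mixtures are adjacent.
dominant <- names(which.max(mean_prop))
ordered  <- draws[order(draws[, dominant]), ]

print(round(rbind(mean = mean_prop, sd = sd_prop), 3))
```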

Walkthrough of the original paper

DOI: 10.1038/nmeth.1650

Figure 1 | Comparison of SourceTracker and other models. Indicated models estimate the proportions of two source environments in simulated samples, as the degree of overlap between the environments was varied from a Jensen-Shannon divergence of 0 (completely identical and thus impossible to disambiguate) to 1 (completely non-overlapping and thus trivial to disambiguate). The coefficients of determination (R²) of the estimated proportions are plotted. Each point represents the mean R² for three trials of 100 samples each; error bars show s.e.m. (n = 3).
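For readers who have not met the Jensen-Shannon divergence used on the x-axis of Figure 1, the following R sketch computes it for two abundance vectors. With base-2 logarithms it runs from 0 (identical profiles) to 1 (no shared taxa). The function name `jsd` and the example vectors are ours, for illustration only.

```r
# Jensen-Shannon divergence with log base 2, so values lie in [0, 1].
jsd <- function(p, q) {
  p <- p / sum(p)                 # convert counts to proportions
  q <- q / sum(q)
  m <- (p + q) / 2                # midpoint distribution
  # Kullback-Leibler divergence, using the convention 0 * log(0) = 0.
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

identical_envs <- jsd(c(10, 20, 30), c(10, 20, 30))   # same profile
disjoint_envs  <- jsd(c(5, 5, 0, 0), c(0, 0, 5, 5))   # no shared taxa
partial_envs   <- jsd(c(4, 4, 2, 0), c(0, 2, 4, 4))   # some overlap
print(c(identical_envs, disjoint_envs, partial_envs))
```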

From the Nature Methods paper [1]: each chart represents one <code>Sink</code> sample (PCR water from Lab 1, a NICU table, and an office telephone). The colors denote the different <code>Source</code> environments, and the area of each color is that source's proportion in the <code>Sink</code> sample.

Figure 2 | SourceTracker proportion estimates for a subset of sink samples. (a–c) Source environment proportions for three sink samples estimated using SourceTracker and 45 training samples from each source environment: mean proportions for 100 draws from Gibbs sampling (a), data for the same samples, including s.d. of the proportion estimates (b), and visualization of the 100 Gibbs draws; each column shows the mixture from one draw, with columns ordered to keep similar mixtures together (c).

Ordering the columns in (c) this way makes the chart tidier and the patterns easier to see and compare.

Figure 3 | Relative abundance of common contaminating operational taxonomic units (OTUs). SourceTracker may assign a different source environment to each observation (sequence) of an OTU in the sink samples. These ten OTU-source pairs had the highest average relative abundance across sink environments, excluding the unknown source. The legend gives the genus-level taxonomic classification of the OTU, the OTU identifier and the source environment assigned to these observations of the OTU. Note that the OTU classified as Enterobacter, a lineage commonly seen in the gut, was more prevalent in the skin training samples than the gut training samples.

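Examples from published papers

In the 2016 Science paper from Rob Knight's group [2], panel (A) shows a dynamic Bayesian inference neural information flow network of microbial taxa during corpse decomposition, with soil as the main source. The four sampling sites for the mouse carcasses were head, torso, abdomen and soil. The colors mark three environments (desert, grassland and forest), all of which were strongly and significantly associated with soil-derived microbes.

16S plus functional prediction, published in Science: the corpse decomposition process

Restoring the missing microbiota of cesarean-born infants by vaginal microbial transfer [3]: the figure shows how the sources of the microbiota at three infant body sites (skin, oral, anal) change over days 1 to 30.

Infants born by cesarean section have a higher risk of immune and metabolic disease, which is thought to result from missing exposure to the mother's birth-canal fluids and the microbes they carry. In a vaginal delivery these fluids coat the whole infant, which promotes colonization of the infant's mouth, gut and skin and helps protect the infant. When cesarean-born infants were swabbed with these fluids, the microbiota at each body site gradually came to resemble that of vaginally delivered infants. The method can partially restore the microbiota of cesarean-born infants, but the long-term health effects remain to be seen and larger sample sizes are needed.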

Installation

SourceTracker is an R script. The latest version is at https://github.com/danknights/sourcetracker (version 1.0, updated 18 September 2016).

<code># Download the script and test data (HTTPS clone; no SSH key needed)
git clone https://github.com/danknights/sourcetracker.git
cd sourcetracker</code>

example.r uses the OTU table and mapping file in the data directory to reproduce the example analysis from the paper.

<code># Run the test data
Rscript example.r</code>

While running, it prints output like the following:

<code>Rarefying training data at 1000
Gut Oral Skin Soil Unknown
.......... 1 of 125, depth= 12: 0.12 (0.09) 0.07 (0.03) 0.11 (0.06) 0.02 (0.04) 0.68 (0.09)
.......... 2 of 125, depth= 20: 0.22 (0.03) 0.08 (0.03) 0.29 (0.05) 0.03 (0.04) 0.39 (0.05)
.......... 3 of 125, depth= 10: 0.60 (0.00) 0.00 (0.00) 0.25 (0.13) 0.00 (0.00) 0.15 (0.13)
.......... 4 of 125, depth= 10: 0.28 (0.06) 0.09 (0.03) 0.03 (0.05) 0.01 (0.03) 0.59 (0.07)
.......... 5 of 125, depth= 2: 0.00 (0.00) 0.00 (0.00) 0.45 (0.16) 0.00 (0.00) 0.55 (0.16)</code>

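The training samples are first rarefied to 1000 sequences. The 125 sink samples are then processed one at a time ("1 of 125", "2 of 125", and so on). Each row reports the sequencing depth of the sample, followed by the estimated proportion for each of the five sources (Gut, Oral, Skin, Soil and Unknown); the value in parentheses appears to be the standard deviation across Gibbs draws. The results are drawn as the chart types explained above. We recommend opening the R script in RStudio to plot and explore the results interactively.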
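Rarefying means randomly subsampling each sample to the same number of sequences. The sketch below shows the idea for one sample in base R. The helper `rarefy_counts` is hypothetical, not the function SourceTracker uses internally, and leaving samples at or below the target depth unchanged is our assumption.

```r
# Rarefy one sample's OTU counts to a fixed depth by drawing sequences
# without replacement. Samples at or below the depth are returned unchanged.
rarefy_counts <- function(counts, depth) {
  if (sum(counts) <= depth) return(counts)
  # Expand the counts into one OTU index per sequence, then subsample.
  pool <- rep(seq_along(counts), times = counts)
  picked <- sample(pool, depth)
  out <- tabulate(picked, nbins = length(counts))
  names(out) <- names(counts)
  out
}

set.seed(7)
sample_counts <- c(OTU1 = 1500, OTU6 = 300, OTU7 = 0, OTU8 = 200)
rarefied <- rarefy_counts(sample_counts, 1000)
print(rarefied)
```

After rarefying, the counts sum to the chosen depth while keeping roughly the original proportions between OTUs.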

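Preparing the input files

Prepare your experimental design following <code>data/metadata.txt</code> and your OTU table following <code>data/otus.txt</code>, adjust the corresponding groups in example.r, and you can analyze your own data. It is easiest to fill in your files exactly as in the examples, so that the script runs with as few code changes as possible. The experimental design columns are described below: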

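Experimental design

Example file: <code>data/metadata.txt</code>

The file is tab-delimited. Its main columns are: sample ID, description, environment (Env), SourceSink, study and details.

Your experimental design must contain the first four columns.

<code>#SampleID</code> column: the sample ID. Start it with a letter and preferably use only letters and digits. It must match the sample names in the OTU table.
<code>Description</code> column: the sample name. It may contain spaces and non-alphanumeric characters, and it is shown as the label in plots to make them easier to read.
<code>Env</code> column: the source annotation of the sample, such as the Gut, Oral, Skin and Soil shown in the output above. This is mainly the sampling environment.
<code>SourceSink</code> column: mark the samples to be estimated as sink and the source samples as source.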

<code>#SampleID          Description  Env    SourceSink  Study     Details
Run20100430_H2O-1  PCR water 1  Lab 1  sink        Lab 1     PCR_water_sample_1_2010_04_30_run
Run20100430_H2O-2  PCR water 2  Lab 1  sink        Lab 1     PCR_water_sample_2_2010_04_30_run
Run20100430_H2O-3  PCR water 3  Lab 1  sink        Lab 1     PCR_water_sample_3_2010_04_30_run
BB2                Spodosol 1   Soil   source      88_Soils  NA
IT2                Spodosol 2   Soil   source      88_Soils  NA
CL3                Ultisol      Soil   source      88_Soils  NA</code>
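Before running the analysis on your own data, it can save time to check the experimental design table against the requirements above. The helper below is a hypothetical sketch of ours, not part of SourceTracker. It assumes the table was read with the sample IDs as row names and that the SourceSink column uses the lowercase values `source` and `sink`, as in the example file.

```r
# Hypothetical helper (not part of SourceTracker): list problems in a
# metadata table whose row names are the sample IDs.
check_metadata <- function(metadata) {
  problems <- character(0)
  required <- c("Description", "Env", "SourceSink")
  missing_cols <- setdiff(required, colnames(metadata))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing column:", missing_cols))
  }
  # A missing SourceSink column yields an empty vector here, so both
  # checks below report a problem in that case as well.
  role <- as.character(metadata$SourceSink)
  if (!any(role == "source")) problems <- c(problems, "no rows marked 'source'")
  if (!any(role == "sink"))   problems <- c(problems, "no rows marked 'sink'")
  problems
}

# A few rows copied from the example table above.
metadata <- data.frame(
  Description = c("PCR water 1", "Spodosol 1", "Ultisol"),
  Env         = c("Lab 1", "Soil", "Soil"),
  SourceSink  = c("sink", "source", "source"),
  row.names   = c("Run20100430_H2O-1", "BB2", "CL3")
)

print(check_metadata(metadata))                             # no problems
print(check_metadata(metadata[, c("Description", "Env")]))  # SourceSink dropped
```

An empty result means the table passes these basic checks. It does not verify that the sample IDs match the OTU table.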

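OTU table

Example file: <code>data/otus.txt</code>

This is the classic tabular format that QIIME produces when a BIOM file is exported: rows are OTU IDs, columns are sample names, and the matrix holds the raw sequence counts.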

<code># QIIME-formatted OTU table
#OTU ID Run20100430_ESC_C-1ss Run20100430_ESC_C-2ss Run20100430_ESC_C-3ss Run20100430_ESC_C-4ss Run20100430_ESC_C-5ss
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0</code>
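The two leading `#` lines are what makes this format awkward to read: the first is a real comment, the second is the header. The self-contained R sketch below writes a tiny table in the same layout to a temporary file (the sample names and counts are made up) and reads it back with base R's `read.table`, then transposes it so that samples become rows.

```r
# Write a tiny QIIME-style table to a temporary file (made-up counts).
otu_file <- tempfile(fileext = ".txt")
writeLines(c(
  "# QIIME-formatted OTU table",
  "#OTU ID\tSampleA\tSampleB\tSampleC",
  "1\t0\t12\t3",
  "6\t5\t0\t0",
  "7\t0\t7\t41"
), otu_file)

# skip = 1 drops the true comment line; comment.char = "" stops the
# "#OTU ID" header line from being discarded as a comment.
# check.names = FALSE keeps sample names exactly as written.
otus <- read.table(otu_file, sep = "\t", header = TRUE, row.names = 1,
                   check.names = FALSE, skip = 1, comment.char = "")

# Transpose to a matrix with samples in rows and OTUs in columns.
otus <- t(as.matrix(otus))
print(otus)
```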

Annotated code walkthrough

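<code># This runs SourceTracker on the original "contamination" data set
# (data included in 'data' folder)

# load sample metadata
metadata <- read.table('data/metadata.txt', sep='\t', header=TRUE, row.names=1, check.names=FALSE, comment.char='')

# load OTU table
# This 'read.table' command is designed for a
# QIIME-formatted OTU table.
# namely, the first line begins with a '#' sign
# and actually _is_ a comment; the second line
# begins with a '#' sign but is actually the header
# Read row and column names and skip one line; the empty comment
# character means the '#' header line is read rather than ignored
otus <- read.table('data/otus.txt', sep='\t', header=TRUE, row.names=1, check.names=FALSE, skip=1, comment.char='')
# Convert the data frame to a matrix and transpose it so that samples are rows
otus <- t(as.matrix(otus))

# extract only those samples in common between the two tables
common.sample.ids <- intersect(rownames(metadata), rownames(otus))
otus <- otus[common.sample.ids,]
metadata <- metadata[common.sample.ids,]

# double-check that the mapping file and otu table
# had overlapping samples
# Stop if the two tables share too few samples
if(length(common.sample.ids) <= 3) {
    message <- paste(sprintf('Error: there are %d sample ids in common ', length(common.sample.ids)),
                     'between the metadata file and data table')
    stop(message)
}

# extract the source environments and source/sink indices
# The comparison gives TRUE/FALSE per sample; which() turns it into row positions
# This selects 180 training (source) samples and 125 test (sink) samples
train.ix <- which(metadata$SourceSink=='source')
test.ix <- which(metadata$SourceSink=='sink')
# The test set is large, so keep only 6 samples for the demonstration
test.ix = head(test.ix)
envs <- metadata$Env
# If a Description column exists, store it in desc
if(is.element('Description',colnames(metadata))) desc <- metadata$Description

# load SourceTracker package
source('src/SourceTracker.r')

# tune the alpha values using cross-validation (this is slow!)
# tune.results <- tune.st(otus[train.ix,], envs[train.ix])
# alpha1 <- tune.results$best.alpha1
# alpha2 <- tune.results$best.alpha2
# note: to skip tuning, run this instead:
# (sets both alpha values to 0.001 and carries on)
alpha1 <- alpha2 <- 0.001

# train SourceTracker object on training data
st <- sourcetracker(otus[train.ix,], envs[train.ix])

# Estimate source proportions in test data
results <- predict(st, otus[test.ix,], alpha1=alpha1, alpha2=alpha2)

# Estimate leave-one-out source proportions in training data
# Leave-one-out is a form of cross-validation; it runs once per
# training sample, so it is very slow
# results.train <- predict(st, alpha1=alpha1, alpha2=alpha2)

# plot results
# Combine the environment and description columns into the plot labels
labels <- sprintf('%s %s', envs, desc)
# Pie charts of the proportions
plot(results, labels[test.ix], type='pie')

# other plotting functions
plot(results, labels[test.ix], type='bar')
plot(results, labels[test.ix], type='dist')
# plot(results.train, labels[train.ix], type='pie')
# plot(results.train, labels[train.ix], type='bar')
# plot(results.train, labels[train.ix], type='dist')

# plot results with legend
# Add a legend and set the colors by hand
plot(results, labels[test.ix], type='pie', include.legend=TRUE, env.colors=c('#47697E','#5B7444','#CC6666','#79BEDB','#885588'))
plot(results, labels[test.ix], type='pie', include.legend=TRUE, env.colors=rainbow(5))</code>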

Finally, the results can be plotted as pie charts (pie), bar charts (bar) or stacked area plots (dist), as in the examples above.

[1] Knights D, Kuczynski J, Charlson ES, et al. Bayesian community-wide culture-independent microbial source tracking [J]. Nature Methods, 2011, 8(9): 761. DOI: 10.1038/nmeth.1650
[2] Metcalf JL, et al. Microbial community assembly and metabolic function during mammalian corpse decomposition [J]. Science, 2016, 351(6269): 158-162.
[3] Dominguez-Bello MG, De Jesus-Laboy KM, Shen N, et al. Partial restoration of the microbiota of cesarean-born infants via vaginal microbial transfer [J]. Nature Medicine, 2016, 22(3): 250-253.
https://www.nature.com/articles/nmeth.1650
https://www.ncbi.nlm.nih.gov/pubmed/21765408
銳翌 16S analysis upgrade, part 3: SourceTracker, finding the source of microbes https://mp.weixin.qq.com/s/eAD42C8ZZAcHBXWmr6HnUw