RNA-Seq

RNA-Seq Guide

RNA-Seq Reference Data

Quantifying against the genome

When RNA-Seq is quantified against the genome, the reference data is the genome sequence, and the genome annotation file provides the annotation information: the positions of genes, the positions of exons, gene functions, and so on. This method intersects the alignment file with the annotation to produce "counts", which are then filtered to retain the statistically significant results. The quantification matrix represents the number of times an alignment overlaps (intersects) an interval annotated as part of a transcript.

When reads are first mapped against the reference genome and its annotation and expression is then estimated, there are two different levels at which gene expression can be analyzed.

  1. Gene-level analysis
  • This approach treats all transcripts of a gene (the different versions assembled from different regions (exons) of the gene) as a single unit.
  • The individual transcripts are not distinguished; instead they are merged into one sequence that contains all of the gene's exons.
  • This merged sequence is considered "representative", but it may not correspond to any single transcript that actually exists; it is more like a virtual union.
  • For some research goals this simplification is sufficient to obtain meaningful data, especially when the focus is the overall activity level of a gene.
  2. Transcript-level analysis
  • Unlike gene-level analysis, transcript-level analysis focuses on each individual transcript.
  • Rather than merging the transcripts, it measures the expression of each transcript independently.
  • This lets researchers observe changes in the abundance of isoforms (different transcript versions of the same gene), which may be crucial for cellular function and regulation.

Classifying against the transcriptome

When RNA-Seq is quantified against the transcriptome, the reference data is a FASTA file of the transcripts. Classification assigns each read to a transcript, resolving ambiguities in the process. The quantification matrix for each transcript represents the number of times reads were classified to that particular transcript.

The main tools currently able to perform classification-based RNA-Seq are:

  • Salmon
  • Kallisto

We can also generate the transcriptome FASTA file from the reference genome and annotation file using the gffread tool.

conda install gffread

# Extract the transcripts.
gffread -w refs/transcripts.fa -g refs/22.fa refs/22.gtf
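
If Salmon is the tool of choice, the transcriptome FASTA produced above can be indexed and quantified directly. A minimal sketch, assuming paired-end reads named reads/sample_R1.fq and reads/sample_R2.fq (adjust the names to your data):

# Build the transcriptome index.
salmon index -t refs/transcripts.fa -i refs/salmon_index

# Quantify one sample; -l A lets salmon detect the library type automatically.
salmon quant -i refs/salmon_index -l A \
    -1 reads/sample_R1.fq -2 reads/sample_R2.fq \
    -o quant/sample

The per-transcript estimates end up in quant/sample/quant.sf.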

The reference genome

Usually, after downloading a reference genome, we need to know some basic information about it: the genome size, its composition, its annotation, and so on. This information matters a great deal for the downstream analysis.

Reference genome size

seqkit stats *.fa

Output:

file        format  type  num_seqs        sum_len  min_len       avg_len      max_len
genomic.fa  FASTA   DNA        194  3,099,750,718      970  15,978,096.5  248,956,422

There are 194 sequences (chromosomes plus other genome sequence identifiers) with a total length of 3,099,750,718 bp; the shortest sequence is 970 bp, the longest is 248,956,422 bp, and the average sequence length is 15,978,096.5 bp.

How many chromosomes are there?

cat *.fa | grep ">"

# Using Ensembl GRCh38 as an example, count the chromosomes. (Result: 22 + X + Y + MT = 25.)
cat *.fa | grep ">.*chromosome" | wc -l

The length of each chromosome

The length of each chromosome can be obtained with the samtools faidx command.

samtools faidx *.fa
cat *.fa.fai

Output:
1    248956422            56    60    61
10   133797422     253105810    60    61
11   135086622     389133248    60    61
12   133275309     526471372    60    61
13   114364328     661967995    60    61
14   107043718     778238454    60    61
15   101991189     887066292    60    61
16    90338345     990757392    60    61
17    83257441    1082601434    60    61
18    80373285    1167246557    60    61
...

Column 1: the chromosome name.
Column 2: the chromosome length, i.e., the total length of that chromosome.
Column 3: the offset, the starting byte position of that chromosome's sequence within the FASTA file. For example, to jump to the beginning of chromosome 10 you would skip 253,105,810 bytes from the start of the file.
Column 4: the number of sequence characters per line, 60; each line of this FASTA file carries 60 bases, not counting the trailing newline.
Column 5: the number of bytes per line, 61; this is one more than the sequence length, because each line ends with a newline character (usually \n), so a full line occupies 61 characters.
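
The offset column is what makes indexed access fast: once the .fai index exists, samtools can jump straight to any region instead of scanning the whole file. For example (assuming a sequence actually named "1" in this FASTA):

samtools faidx genomic.fa 1:1-60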

Reference genome annotation files

Common reference genome annotation formats are GTF (Gene Transfer Format) and GFF (General Feature Format); both are plain-text files that record detailed information about the various genetic features on the genome.
In RNA-Seq analysis, the gene and transcript annotations are usually the most essential, because they feed directly into the downstream expression quantification and differential expression analysis. Gene-level annotation tells us which genes are expressed in a sample, while transcript-level annotation provides finer detail, such as the different splice variants.

However, whether gene and transcript features are all you need depends on the specific goal of the analysis:

  1. Gene expression quantification: if only gene-level expression analysis is needed, the gene annotations may be sufficient.

  2. Transcript quantification and splicing analysis: transcript-level expression analysis, or any analysis of splice variation, requires detailed transcript and exon annotations.

  3. Protein-coding analysis: if the analysis concerns protein-coding regions, for example searching for mutations that may affect protein structure, the coding sequence (CDS) annotation becomes important.

  4. UTR analysis: for studies of post-transcriptional regulation, such as microRNA targeting, the 5' UTR and 3' UTR annotations are key.

  5. Regulatory element analysis: if the focus is the regulatory mechanism of genes, annotations of promoters, enhancers and other regulatory elements matter a great deal.

cat *.gtf | grep -v "#" | cut -f3 | sort | uniq -c

 746300  CDS
    120  Selenocysteine
1261907  exon
 149893  five_prime_utr
  58676  gene
  86423  start_codon
  78536  stop_codon
 148456  three_prime_utr
 206534  transcript

  1. CDS (Coding Sequence): the parts of the genome that code for protein. The number (746300) tells us how many coding sequence entries the annotation file contains.

  2. Selenocysteine: a rare amino acid known as the "21st amino acid" that is incorporated directly during protein synthesis. The number (120) indicates 120 annotated selenocysteine sites.

  3. Exon: exons are the DNA segments retained in the mature mRNA, which can be transcribed and translated into protein. The number (1261907) is the count of exon entries.

  4. Five_prime_utr (5' UTR): the 5' untranslated region is the part of the mRNA upstream of the start codon; it does not code for protein but helps regulate translation. The number (149893) is the count of 5' UTR entries.

  5. Gene: the basic physical and functional unit of heredity, usually containing the sequence coding for one or more proteins. The number (58676) is the count of genes.

  6. Start_codon: the start codon is the specific nucleotide triplet that marks the beginning of protein synthesis on the mRNA. The number (86423) shows how many start codons are annotated.

  7. Stop_codon: the stop codon is one of the three nucleotide triplets that signal the end of protein synthesis. The number (78536) is the count of stop codons.

  8. Three_prime_utr (3' UTR): the 3' untranslated region is the part of the mRNA downstream of the stop codon; it likewise does not code for protein, but contains important elements that regulate mRNA stability and translation efficiency. The number (148456) is the count of 3' UTR entries.

  9. Transcript: a transcript is an RNA copy generated from a DNA sequence by transcription; here it refers to RNA molecules that are eventually translated into protein or carry out other functions. The number (206534) is the count of transcript records in the annotation.
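
As a quick consistency check, these counts can be combined: dividing the number of transcript entries by the number of gene entries gives the average number of transcripts per gene, about 3.5 for this annotation. A small sketch:

# Average number of transcripts per gene.
T=$(cat *.gtf | grep -v "#" | awk '$3 == "transcript"' | wc -l)
G=$(cat *.gtf | grep -v "#" | awk '$3 == "gene"' | wc -l)
echo "scale=1; $T / $G" | bc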

Getting to know the data

RNA-Seq data usually comes as FASTQ files, which hold the raw output of the sequencer. Each FASTQ file contains a large number of reads; each read is a base sequence accompanied by its sequencing quality scores. We can learn about the data by computing summary statistics over the FASTQ files.

# Summarize the fastq files.
seqkit stats *.fq

Output:

file               format  type    num_seqs        sum_len  min_len  avg_len  max_len
chip_HA_1_1.fq     FASTQ   DNA   40,478,593  2,023,929,650       50       50       50
chip_HA_2_1.fq     FASTQ   DNA   37,248,768  1,862,438,400       50       50       50
chip_HA_3_1.fq     FASTQ   DNA   28,160,024  1,408,001,200       50       50       50
chip_input_1_1.fq  FASTQ   DNA   42,065,898  2,103,294,900       50       50       50
chip_plau_1_1.fq   FASTQ   DNA   31,915,339  1,595,766,950       50       50       50
chip_plau_2_1.fq   FASTQ   DNA   43,102,269  2,155,113,450       50       50       50
chip_plau_3_1.fq   FASTQ   DNA   38,000,000  1,900,000,000       50       50       50
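
If more detail is needed, seqkit can report extended statistics; depending on the seqkit version, the -a flag adds measures such as N50 and the percentage of bases at Q20/Q30, which gives a first impression of base quality:

seqkit stats -a *.fq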

Alignment

Building the index

When aligning data to a reference genome, we first need to prepare index files for that genome. Each aligner requires its own index format: the index built by STAR cannot be used by HISAT2, and vice versa.
Taking HISAT2 as an example, the index can be built with the following commands:

# The reference genome.
REF=refs/*.fa
GTF=refs/*.gtf

# Build the genome index.
IDX=refs/genome

hisat2_extract_splice_sites.py ${GTF} > ${IDX}.ss
hisat2_extract_exons.py ${GTF} > ${IDX}.exon

hisat2-build -p 16 --exon ${IDX}.exon --ss ${IDX}.ss ${REF} ${IDX}

# Index the reference genome with samtools.
samtools faidx ${REF}
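
hisat2-build names the index files after the basename given on the command line, typically eight files with the .ht2 extension; a quick way to confirm the build succeeded:

ls refs/genome.*.ht2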

Running the alignment

# Create the sample id list file.
ls reads | cut -d '_' -f1,2 | sort | uniq > ids.txt

# The index name (the basename given to hisat2-build).
IDX=refs/genome

# Create the BAM folder.
mkdir -p bam

# Align the FASTQ files to the reference genome.
cat ids.txt | parallel "hisat2 -x ${IDX} -1 reads/{}_R1.fq -2 reads/{}_R2.fq 2>> alignment_summary.txt | \
samtools sort > bam/{}.bam"

# Index each BAM file.
cat ids.txt | parallel "samtools index bam/{}.bam"

Interpreting the paired-end alignment summary:

10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate

  • 10000 reads; of these: 10000 (100.00%) were paired: there were 10000 reads in total, all of them paired, meaning each read has a corresponding mate.

  • 650 (6.50%) aligned concordantly 0 times: 650 read pairs (6.50%) did not align concordantly. "Concordantly" means the two mates aligned to the same region of the reference in the expected relative orientation and at the expected distance.

  • 8823 (88.23%) aligned concordantly exactly 1 time: 8823 pairs (88.23%) aligned concordantly to exactly one location on the reference.

  • 527 (5.27%) aligned concordantly >1 times: 527 pairs (5.27%) aligned concordantly to multiple locations.

The next part gives more detail on the pairs without a concordant alignment:

  • 650 pairs aligned concordantly 0 times; of these:
  • 34 (5.23%) aligned discordantly 1 time: of the 650 pairs without a concordant alignment, 34 (5.23%) aligned discordantly once. "Discordantly" means the mates did not align to the same region of the reference with the expected relative orientation or distance, but still found alignment positions somewhere.

  • 616 pairs aligned 0 times concordantly or discordantly; of these:

  • 1232 mates make up the pairs; of these:
  • 660 (53.57%) aligned 0 times: the 616 pairs with neither a concordant nor a discordant alignment consist of 1232 individual mates, of which 660 (53.57%) did not align to the reference at all.
  • 571 (46.35%) aligned exactly 1 time: 571 mates (46.35%) aligned to exactly one location on the reference.
  • 1 (0.08%) aligned >1 times: 1 mate (0.08%) aligned to multiple locations.

The last line reports the overall alignment rate:

  • 96.70% overall alignment rate: overall, 96.70% of the reads aligned to at least one location on the reference. This percentage is computed over all reads: concordant pairs, discordant pairs, and pairs in which only one mate aligned.

Interpreting the single-end alignment summary:

62364 reads; of these:
  62364 (100.00%) were unpaired; of these:
    19 (0.03%) aligned 0 times
    62311 (99.92%) aligned exactly 1 time
    34 (0.05%) aligned >1 times

  • 62364 reads: a total of 62,364 reads were processed. A read is a short sequence fragment produced by the sequencing experiment.

  • of these: 62364 (100.00%) were unpaired: all reads are single-end (unpaired). In paired-end sequencing you normally get two reads (a pair) sequenced from the two ends of the same DNA fragment; unpaired reads mean each fragment was sequenced from one end only.

  • 19 (0.03%) aligned 0 times: 19 reads (0.03% of the total) did not align to the reference.

  • 62311 (99.92%) aligned exactly 1 time: 62,311 reads (99.92%) aligned to the reference exactly once, meaning each has a unique match position.

  • 34 (0.05%) aligned >1 times: 34 reads (0.05%) aligned to the reference more than once, indicating they likely come from repetitive regions of the genome or transcriptome, so their exact position cannot be determined.

BAM file conversion

Usually we can load BAM files directly into a viewer such as IGV, but loading can take a long time. When we only care about coverage, we can convert the files to bigWig format, which loads much faster. The conversion runs from BAM to bedGraph, and from bedGraph to bigWig.

# The reference genome; its .fai index is used below.
REF=refs/genome.fa

# Turn each BAM file into bedGraph coverage. The files will have the .bg extension.
cat ids.txt | parallel "bedtools genomecov -ibam bam/{}.bam -split -bg > bam/{}.bg"

# Convert each bedGraph coverage into bigWig coverage. The files will have the .bw extension.
cat ids.txt | parallel "bedGraphToBigWig bam/{}.bg ${REF}.fai bam/{}.bw"

Counting

Once we have the read alignment files, we need to count over them to obtain the expression level of each gene. Here we can use featureCounts, part of the Subread package, which quantifies expression at the gene or transcript level.

GTF=refs/*.gtf

cat ids.txt | parallel -j 1 echo "bam/{}.bam" | \
xargs featureCounts -p --countReadPairs -a ${GTF} -o counts.txt

Here the parallel flag -j 1 is essential: it ensures the commands run in the order listed in the file. Whenever order matters you must use -j 1; otherwise the order is not guaranteed. For paired-end data we can add the -p --countReadPairs flags so that read pairs, rather than individual reads, are counted (optional).

Using featureCounts involves many decisions. Beyond the flags shown above, other parameters adjust how the counting is done. Some commonly used ones are listed below; an example invocation follows after the list.

-t <string>     Specify feature type(s) in a GTF annotation. If multiple
                types are provided, they should be separated by ',' with
                no space in between. 'exon' by default. Rows in the
                annotation with a matched feature will be extracted and
                used for read mapping.

-g <string>     Specify attribute type in GTF annotation. 'gene_id' by
                default. Meta-features used for read counting will be
                extracted from annotation using the provided value.

-T <int>        Number of the threads. 1 by default.
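
Putting these together, a typical gene-level run over all BAM files might look like this (the thread count is arbitrary; -t exon -g gene_id simply spell out the defaults):

featureCounts -T 8 -t exon -g gene_id -a ${GTF} -o counts.txt bam/*.bam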

The resulting counts.txt file contains the expression information for every gene. The custom script format_featurecounts.r below can convert it into a simpler, more direct counts table.

#
# Transform feature counts output to simple counts.
# Adds gene name to a featurecounts output file.
#

# The featureCounts output file to simplify.
COUNTS_FILE = "counts.txt"

# The output file name.
OUTPUT_FILE = "counts.csv"

# The Ensembl dataset used for gene id to gene name mapping (for example: human).
## Lists all datasets available in Ensembl
## mart = useMart("ensembl", dataset = species_dataset)
## listDatasets(mart)
## Lists all attributes available in datasets
## listAttributes(mart)

SPECIES_DATASET = "hsapiens_gene_ensembl"


# Load the required libraries.

if (!require("biomaRt", quietly = TRUE)) {
BiocManager::install("biomaRt")
}

library(biomaRt)

# Install the optparse package automatically.
if(!require(optparse, quietly=T)){
install.packages("optparse", repos="http://cran.us.r-project.org")
library(optparse)
}

# Command line options.
option_list = list(

    make_option(c("-c", "--counts"), action = "store", default = COUNTS_FILE, dest = "counts_file", type = 'character',
                help = "input count file [%default]"),

    make_option(c("-o", "--output_file"), action = "store", default = OUTPUT_FILE, dest = "output_file", type = 'character',
                help = "the output file name [%default]"),

    make_option(c("-s", "--species_dataset"), action = "store", default = SPECIES_DATASET, dest = "species_dataset", type = 'character',
                help = "the Ensembl dataset used for gene name mapping [%default]")

)

# Parse the options.
opt = parse_args(OptionParser(option_list=option_list))

# The name of the file that contains the counts.
input_file = opt$counts_file

# The final result file.
output_file = opt$output_file

# The Ensembl dataset used for gene name mapping.
species_dataset = opt$species_dataset

# Inform the user.
cat("# Reformating featurecounts.\n")
cat("# Input:", input_file, "\n")

# Read the input data
data = read.table(input_file, stringsAsFactors = FALSE, check.names = FALSE, header=T, sep="\t")

# Remove the Chr, Start, End and Strand columns (keep Geneid and Length).
data = subset(data, select = -c(2, 3, 4, 5))

# Remove file related information from the column
headers = sub("\\.bam", "", colnames(data))
headers = sub("^(.*)/", "", headers)

# Attach the new names
colnames(data) <- headers

# Add columns with specific names
data$name = data$Geneid
data$gene = data$Geneid

targets = sub("\\..*", "", data$name)

if (species_dataset != '') {
    cat("# Species_dataset:", species_dataset, "\n")

    # Retrieve the gene id to gene name mapping from Ensembl.
    mart = useMart("ensembl", dataset = species_dataset)
    tx2gene = getBM(attributes = c("ensembl_gene_id", "external_gene_name"), mart = mart)

    # Find the indices where the targets match the mapping.
    indices = match(targets, tx2gene$ensembl_gene_id)

    # Add the gene name column.
    data$gene = tx2gene$external_gene_name[indices]

    # Fall back to the original id when no gene name was found.
    data$gene[is.na(data$gene)] = targets[is.na(data$gene)]
    data$gene[data$gene == ""] = targets[data$gene == ""]

} else {
    data$gene = targets
}

# List the desired column order.
cols <- c("name", "gene", headers[3:length(headers)])

# Slice the data.
out = data[,cols]

# Save the resulting summarized counts.
write.csv(out, file=output_file, row.names = FALSE, quote = FALSE)

cat("# Output:", output_file, "\n")

Differential expression analysis

Once we have the count matrix we can perform differential expression analysis. Commonly used tools include DESeq2, edgeR, and limma.

limma is based on linear models and estimates each gene's variance with an empirical Bayes approach that borrows information across all genes, which stabilizes the estimates especially when there are few samples.
edgeR is based on the negative binomial model. It uses empirical Bayes methods to moderate the within-group dispersion estimates, improving their stability. edgeR accounts for both the abundance and the variability of genes, which makes it well suited to RNA-Seq data.
DESeq2 is likewise based on the negative binomial model. It uses Bayesian shrinkage to model between-sample variability and the dispersion of gene expression.

Here we use DESeq2 for the differential expression analysis. Two files are needed: counts.csv, the simplified table from the previous step, and design.csv, which holds the experimental design, one column named sample and one named group, where group carries the grouping to be compared (an example follows below).
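
For a two-group comparison with three replicates each, design.csv might look like this (the sample and group names are hypothetical and must match the column names in counts.csv):

sample,group
sample1,WT
sample2,WT
sample3,WT
sample4,KO
sample5,KO
sample6,KO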

#
# Differential expression analysis with the DESeq2 package.
#
# https://bioconductor.org/packages/release/bioc/html/DESeq2.html
#
# Command line install: mamba install bioconductor-deseq2
#

DESIGN_FILE = "design.csv"

# The name of the file that contains the counts.
COUNTS_FILE = "counts.csv"

# The output file.
OUTPUT_FILE = "deseq2.csv"

# Install the optparse package automatically.
if (!require(optparse, quietly = TRUE)) {
    install.packages("optparse", repos = "http://cran.us.r-project.org")
    library(optparse)
}

# Command line options.
option_list = list(

    make_option(c("-d", "--design_file"), action = "store", default = DESIGN_FILE, dest = "design_file", type = 'character',
                help = "the design file for the samples [%default]"),

    make_option(c("-c", "--counts"), action = "store", default = COUNTS_FILE, dest = "counts_file", type = 'character',
                help = "input counts file [%default]"),

    make_option(c("-o", "--out"), action = "store", default = OUTPUT_FILE, dest = "output_file", type = 'character',
                help = "the results file [%default]"),

    make_option(c("-f", "--factor_name"), action = "store", default = "group", dest = "factor_name", type = 'character',
                help = "the factor column name in the design file [%default]"),

    make_option(c("-s", "--sample_name"), action = "store", default = "sample", dest = "sample_name", type = 'character',
                help = "the sample column name in the design file [%default]")

)

# Parse the options.
opt = parse_args(OptionParser(option_list=option_list))

# The name of the file that contains the counts.
counts_file = opt$counts_file

# The design file is in CSV format and must have the headers "sample" and "group".
design_file = opt$design_file

# The output file.
output_file = opt$output_file

# The name of the factor column
factor_name = opt$factor_name

# The name of the sample column
sample_name = opt$sample_name

# Inform the user.
cat("# Running DESeq2", "\n")
cat("# Design:", design_file, "\n")
cat("# Counts:", counts_file, "\n")

# Read the design file.
inp <- read.csv(design_file, strip.white = T, stringsAsFactors = F)

# Check the sample column name.
cat("# Sample column:", sample_name, "\n")

if (!sample_name %in% colnames(inp)) {
    stop("# Sample column: ", sample_name, " not found in design file (use -s).")
}

# Check the factor column name.
cat("# Factor column:", factor_name, "\n")
if (!factor_name %in% colnames(inp)) {
    stop("# Factor column: ", factor_name, " not found in design file (use -f).")
}

# Create a new data frame to work with.
df = data.frame(
    sample = inp[, sample_name],
    groups = inp[, factor_name]
)

# The default groups come from the design file.
groups = unique(df$groups)

# Print information on the groups
for (grp in groups) {
    n_grp <- sum(df$groups == grp)
    cat("# Group", grp, "has", n_grp, "samples.\n")
}

# Turn conditions into factors.
df$groups = factor(df$groups)

# The first level should correspond to the first entry in the file.
# Required later when building a model.
df$groups = relevel(df$groups, toString(df$groups[1]))

# Isolate the sample names.
sample_names = df$sample

# Read the data from input file.
count_df = read.csv(counts_file, strip.white = T, stringsAsFactors = F)

# Create rounded integers for the count data.
count_data = round(count_df[, sample_names, drop=F])

# Column names should be carried over. Skip sample names and FDR
other_cols = names(count_df)[!names(count_df) %in% c(sample_names, "FDR")]

# Carry over all additional columns not in sample names.
row_data = count_df[, other_cols, drop=F]

#
# Running DESeq2
#

# Bioconductor packages used in the script.
bio.p <- c("DESeq2", "tibble", "dplyr", "tools")

# Load Bioconductor packages
cat("# Initializing ", bio.p, "...")
status = suppressPackageStartupMessages(
lapply(bio.p, library, character.only = TRUE)
)
cat(" done\n")

# Create the DESeq2 dataset.
dds = DESeqDataSetFromMatrix(
    countData = count_data,
    rowData = row_data,
    colData = df,
    design = ~groups
)

# Number of rows before
n_start = nrow(dds)

# Keep rows with more than 3 reads in total.
keep <- rowSums(counts(dds)) > 3

# At least 3 samples with a count of 10 or higher
# keep <- rowSums(counts(dds) >= 10) >= 3

# Apply the filtering
dds <- dds[keep, ]

# Run deseq
dds = DESeq(dds)

# Get the normalized counts.
normed = counts(dds, normalized=TRUE)

# Round normalized counts to a single digit.
normed = round(normed, 1)

# Format the results.
res = results(dds)

#
# The rest of the code is about formatting the output dataframe.
#

# The first columns that carry over from the input.
start = dplyr::as_tibble(rowData(dds))[, other_cols]

# The main tibble
data = bind_cols(start, as_tibble(res))

# Create the foldChange column.
data$foldChange = 2 ^ data$log2FoldChange

# Rename the columns.
data = dplyr::rename(data, PValue=pvalue, FDR=padj)

# Set all NA values to 1.
data$FDR[is.na(data$FDR)] <- 1

# Create a real adjusted pvalue
data$PAdj = p.adjust(data$PValue, method="hochberg")

# Create the additional columns that we wish to present.
data$baseMeanA = 1
data$baseMeanB = 1

# Combine the normalized counts with the results.
total <- bind_cols(data, as.data.frame(normed))

# Sort by FDR.
total = arrange(total, FDR)

# Compute the false discovery counts on the sorted table.
total$falsePos = 1:nrow(total) * total$FDR

# Sample names for condition A and B
col_names_A <- sample_names[df$groups == groups[[1]]]
col_names_B <- sample_names[df$groups == groups[[2]]]

#col_names_A = unlist(df %>% filter(group==groups[1]) %>% select(sample))
#col_names_B = unlist(df %>% filter(group==groups[2]) %>% select(sample))

# Create the individual baseMean columns.
total$baseMeanA = rowMeans(total[, col_names_A])
total$baseMeanB = rowMeans(total[, col_names_B])

# Bringing some sanity to numbers. Round columns to fewer digits.
total$foldChange = round(total$foldChange, 3)
total$log2FoldChange = round(total$log2FoldChange, 1)
total$baseMean = round(total$baseMean, 1)
total$baseMeanA = round(total$baseMeanA, 1)
total$baseMeanB = round(total$baseMeanB, 1)
total$lfcSE = round(total$lfcSE, 2)
total$stat = round(total$stat, 2)
total$FDR = round(total$FDR, 4)
total$falsePos = round(total$falsePos, 0)

# Reorganize columns names to make more sense.
new_cols = c(other_cols,
"baseMean","baseMeanA","baseMeanB",
"foldChange", "log2FoldChange",
"lfcSE","stat","PValue","PAdj",
"FDR","falsePos", col_names_A, col_names_B)

# Slice the dataframe with new columns.
total = total[, new_cols]

# Count the number of significant results.
n_total = nrow(total)
fdr_count = sum(total$FDR < 0.05, na.rm=TRUE)
fdr_frac = round(fdr_count/n_total, 3) * 100

pval_count = sum(total$PValue < 0.05, na.rm=TRUE)
pval_frac = round(pval_count/n_total, 3) * 100

# Reformat these columns as string.
total$PAdj = formatC(total$PAdj, format = "e", digits = 1)
total$PValue = formatC(total$PValue, format = "e", digits = 1)

# Write the results to the standard output.
write.csv(total, file=output_file, row.names = F, quote = F)

# Inform the user.
cat("# Input:", n_start, "rows\n")
cat("# Removed:", n_start-n_total, "rows\n")
cat("# Fitted:", n_total, "rows\n")
cat("# Significant PVal:", pval_count, "(", pval_frac, "%)\n")
cat("# Significant FDRs:", fdr_count, "(", fdr_frac, "%)\n")
cat("# Results:", output_file, "\n")
#!/usr/bin/env Rscript

#
# Differential expression analysis with the edgeR package.
#
# https://bioconductor.org/packages/release/bioc/html/edgeR.html
#
# Command line install: mamba install bioconductor-edger r-tibble r-dplyr r-optparse
#

# The input counts file.
COUNTS_FILE = "counts.csv"

# The input design file.
DESIGN_FILE="design.csv"

# The default output file.
OUTPUT_FILE = "edger.csv"

# Install the optparse package automatically.
if (!require(optparse, quietly = TRUE)) {
    install.packages("optparse", repos = "http://cran.us.r-project.org")
    library(optparse)
}

# Command line options.
option_list = list(

    make_option(c("-d", "--design_file"), action = "store", default = DESIGN_FILE, dest = "design_file", type = 'character',
                help = "the design file for the samples [%default]"),

    make_option(c("-c", "--counts"), action = "store", default = COUNTS_FILE, dest = "counts_file", type = 'character',
                help = "input count file [%default]"),

    make_option(c("-o", "--out"), action = "store", default = OUTPUT_FILE, dest = "output_file", type = 'character',
                help = "the output file [%default]"),

    make_option(c("-m", "--method"), action = "store", default = "glm", dest = "method", type = 'character',
                help = "the method glm/classic [%default]"),

    make_option(c("-f", "--factor_name"), action = "store", default = "group", dest = "factor_name", type = 'character',
                help = "the factor column name in the design file [%default]"),

    make_option(c("-s", "--sample_name"), action = "store", default = "sample", dest = "sample_name", type = 'character',
                help = "the sample column name in the design file [%default]")

)

# Parse the options.
opt = parse_args(OptionParser(option_list=option_list))

# The name of the file that contains the counts.
counts_file = opt$counts_file

# The sample file is in CSV format and must have the headers "sample" and "group".
design_file = opt$design_file

# The final result file.
output_file = opt$output_file

# The method to use: glm or classic.
method = opt$method

# The name of the factor column
factor_name = opt$factor_name

# The name of the sample column
sample_name = opt$sample_name

# Bioconductor packages used in the script.
pkgs <- c("edgeR", "tibble", "dplyr", "tools")

# Load required packages
cat("# Initializing", pkgs, "...")
status = suppressPackageStartupMessages(
lapply(pkgs, library, character.only = TRUE)
)
cat(" done\n")

# Inform the user.
cat("# Tool: edgeR", "\n")
cat("# Design:", design_file, "\n")
cat("# Counts:", counts_file, "\n")

# Read the sample file.
inp <- read.csv(design_file, strip.white = T, stringsAsFactors = F)

# Sample column name.
cat("# Sample column:", sample_name, "\n")

if (!sample_name %in% colnames(inp)) {
    stop("# Sample column: ", sample_name, " not found in design file (use -s).")
}

# Factor column name.
cat("# Factor column:", factor_name, "\n")
if (!factor_name %in% colnames(inp)) {
    stop("# Factor column: ", factor_name, " not found in design file (use -f).")
}

# Create a new data frame to work with.
df = data.frame(
    sample = inp[, sample_name],
    groups = inp[, factor_name]
)

# The default groups come from the design file.
groups = unique(df$groups)

# Print the factors
cat("# Factors:", groups, "\n")

# Check that the length of the groups is 2.
if (length(groups) != 2) {
    stop("# The number of factors must be 2.")
}

# Print information on the groups
for (grp in groups) {
    n_grp <- sum(df$groups == grp)
    cat("# Group", grp, "has", n_grp, "samples.\n")
}

# Turn conditions into factors.
df$groups = factor(df$groups)

# The first level should correspond to the first entry in the file
# Required later when building a model.
df$groups = relevel(df$groups, toString(df$groups[1]))

# Isolate the sample names.
sample_names <- df$sample

# Sample names for the two groups.
col_names_A <- sample_names[df$groups == groups[[1]]]
col_names_B <- sample_names[df$groups == groups[[2]]]

# Read the data from the counts file.
counts_df = read.csv(counts_file, strip.white = T, stringsAsFactors = F)

# How many rows at the start.
n_start = nrow(counts_df)

# Create rounded integers for the count data.
counts_mat = round(counts_df[, sample_names])

# Set the rownames on the matrix.
row.names(counts_mat) = counts_df[,1]

# Column names that should be carried over. Skip sample names and FDR (if exists)
other_cols = names(counts_df)[!names(counts_df) %in% c(sample_names, "FDR")]

# The starting columns of the dataframe
start_df = counts_df[, other_cols, drop=F]

# Set the rownames as well.
row.names(start_df) = start_df[,1]

# Group selector
group <- df$groups

#
# The steps below will perform the edgeR analysis.
#
# Select the GLM method
if (method == 'glm') {

    cat("# Method: glm", "\n")

    # Following the recommended edgeR vignette here verbatim.
    dge <- DGEList(counts = counts_mat, group = group)

    # Filter by expression.
    keep <- filterByExpr(dge)
    dge <- dge[keep, , keep.lib.sizes = FALSE]

    # The design matrix.
    design <- model.matrix(~group)

    # Calculate the normalization factors and estimate the dispersions.
    dge <- calcNormFactors(dge)
    dge <- estimateDisp(dge, design)

    # Fit the quasi-likelihood model and test the group coefficient.
    fit <- glmQLFit(dge, design)
    qlf <- glmQLFTest(fit, coef = 2)
    etp <- topTags(qlf, n = Inf)

} else {

    #
    # This is the classic mode of running edgeR.
    #
    cat("# Method: classic", "\n")

    # Create a DGEList object from a table of counts and group.
    dge <- DGEList(counts = counts_mat, group = group)

    # Filter by expression.
    keep <- filterByExpr(dge)
    dge <- dge[keep, , keep.lib.sizes = FALSE]

    # Maximize the negative binomial conditional common likelihood to estimate a common dispersion across all genes.
    dge <- estimateCommonDisp(dge)

    # Estimate tagwise dispersions by an empirical Bayes method based on weighted conditional maximum likelihood.
    dge <- estimateTagwiseDisp(dge)

    # Compute genewise exact tests for differences in the means between the groups.
    etx <- exactTest(dge)

    # Extract the most differentially expressed genes.
    etp <- topTags(etx, n = Inf)
}

#
# From here on out is about formatting the output.
#

# Get the scale of the data
scale = dge$samples$lib.size * dge$samples$norm.factors

# Get the normalized counts
normed = round(t(t(counts_mat)/scale) * mean(scale))

# Turn the results into a data frame.
data = merge(start_df, etp$table, by="row.names")

# Add the rownames
row.names(data) = data[,1]

# Get rid of extra column it gained
data = data[, 2:ncol(data)]

# Create column placeholders.
data$baseMean = 1
data$baseMeanA = 1
data$baseMeanB = 1
data$foldChange = 2 ^ data$logFC
data$log2FoldChange=data$logFC
data$falsePos = 1

# Compute the adjusted p-value
data$PAdj = p.adjust(data$PValue, method="hochberg")

# Create a merged output that contains the normalized counts.
total <- merge(data, normed, by='row.names')

# Get rid of extra column it gained
total = total[, 2:ncol(total)]

# Sort again by P-value.
total = arrange(total, PValue)

# Compute the false discovery counts on the sorted table.
total$falsePos = 1:nrow(total) * total$FDR

# Create the individual baseMean columns.
total$baseMeanA = rowMeans(total[, col_names_A])
total$baseMeanB = rowMeans(total[, col_names_B])
total$baseMean = total$baseMeanA + total$baseMeanB

# Round the numbers to make them look better
total$foldChange = round(total$foldChange, 3)
total$FDR = round(total$FDR, 4)
total$PAdj = round(total$PAdj, 4)
total$logCPM = round(total$logCPM, 1)
total$log2FoldChange = round(total$log2FoldChange, 1)
total$baseMean = round(total$baseMean, 1)
total$baseMeanA = round(total$baseMeanA, 1)
total$baseMeanB = round(total$baseMeanB, 1)
total$falsePos = round(total$falsePos, 0)

# The total numbers
n_total = nrow(total)

# Count the number of significant results.
fdr_count = sum(total$FDR < 0.05)
fdr_frac = round(fdr_count/n_total, 3) * 100

pval_count = sum(total$PValue < 0.05)
pval_frac = round(pval_count/n_total, 3) * 100

# Reorganize columns names to make more sense.
new_cols = c( other_cols,
"baseMean","baseMeanA","baseMeanB",
"foldChange", "log2FoldChange",
"PValue","PAdj",
"FDR","falsePos", col_names_A, col_names_B)

# Rearrange the data frame with new column order.
total = total[, new_cols]

# Reformat these columns as string.
total$PAdj = formatC(total$PAdj, format = "e", digits = 1)
total$PValue = formatC(total$PValue, format = "e", digits = 1)

# Write the results to the standard output.
write.csv(total, file=output_file, row.names = F, quote = F)

# Compute more stats.
n_removed = n_start - n_total

# Nicer formatting
pval_count = sprintf("%4d", pval_count)
pval_frac = sprintf("%.2f", pval_frac)
fdr_count = sprintf("%4d", fdr_count)
fdr_frac = sprintf("%.2f", fdr_frac)

# Inform the user.
cat("# Input:", n_start, "rows\n")
cat("# Removed:", n_removed, "rows\n")
cat("# Fitted:", n_total, "rows\n")
cat("# Significant PVal:", pval_count, "(", pval_frac, "%)\n")
cat("# Significant FDRs:", fdr_count, "(", fdr_frac, "%)\n")
cat("# Results:", output_file, "\n")

Functional enrichment analysis

Problems encountered in RNA-Seq analysis

Alignment problems

The GC content of coding sequences (the proportion of guanine (G) and cytosine (C) in a sequence) is usually higher than the genome-wide average. This is because coding sequences tend to require a certain stability, and GC pairs form more hydrogen bonds than AT pairs, making the DNA double helix more stable. In genome sequencing and alignment analysis, a region whose GC content deviates far from the average, whether higher or lower, can introduce biases. For example, high-GC regions may be over-represented because of PCR amplification bias or sequencing platform preferences. In addition, extreme GC content can affect read alignment, because such regions may not be unique within the genome, or the alignment algorithm may handle them less accurately.

Checking the GC content of sequences:

# Check the GC content of the sequences.
cat *.fa | seqkit seq -s | geecee --filter
samtools faidx *.fa 1 |seqkit seq -s | geecee --filter
cat *.fq | seqkit seq -s | geecee --filter

In particular, when RNA-Seq alignments show spuriously long intron gaps, the cause may be long stretches of the reference that consist only of T. In HISAT2 we can use the --max-intronlen parameter to cap the maximum intron length and improve alignment accuracy. A classic intron length is 500 bp; here we can set the maximum to 2500.

# Make a directory for the bam file.
mkdir -p bam

# Run the hisat aligner for each sample.
cat ids | parallel "hisat2 --max-intronlen 2500 -x $IDX -U reads/{}.fq | samtools sort > bam/{}.bam"

# Create a bam index for each file.
cat ids | parallel samtools index bam/{}.bam

Counting problems

When using featureCounts, counting is done over exons by default, and there are two strategies: gene-level analysis and transcript-level analysis. At the gene level each exon is counted only once. At the transcript level, however, an exon may appear in multiple transcripts; if each exon were simply counted once per transcript, reads would be counted repeatedly and contributions would be over-estimated. Solving this requires a procedure that determines what fraction of each exon belongs to each transcript. This process is called "deconvolution". Its goal is to apportion each exon's expression so that, even when an exon appears in several transcripts, its total contribution is counted only once. Deconvolution usually involves mathematical and statistical methods and can be based on the expected transcript abundances or other bioinformatic information.
In general, featureCounts is recommended for gene-level analysis, while classification/pseudo-alignment based tools such as kallisto and salmon are recommended for transcript-level analysis, as sketched below.
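
A minimal transcript-level sketch with kallisto, assuming the transcriptome FASTA generated earlier with gffread and hypothetical paired-end read names:

# Build the kallisto index.
kallisto index -i refs/kallisto_index refs/transcripts.fa

# Quantify one sample; abundances are written to quant/sample/abundance.tsv.
kallisto quant -i refs/kallisto_index -o quant/sample \
    reads/sample_R1.fq reads/sample_R2.fq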

Strand specificity problems

The choice of RNA-Seq library type determines the orientation of the reads and the order in which the cDNA strands are sequenced, which means RNA-Seq reads from different library types can differ substantially. Information about the library type is very useful when assembling reads into a transcriptome or mapping them to a reference assembly, because the relative orientation of the reads and the strand they were sequenced from help resolve where short, ambiguous reads belong in the transcriptome. Unfortunately, the library type is not recorded in the sequencing output files and may be lost before the data is assembled. Even with RNA-Seq data from public repositories there is no guarantee that the library type information is correct, or present at all.
In paired-end sequencing, each sequenced pair consists of a forward and a reverse component. The reads correspond to the two ends of a DNA fragment and are generally used to improve sequencing coverage and accuracy. However, when the data is paired and stranded, understanding how each read corresponds to the reference genome or transcript becomes more complicated. In a stranded paired-end alignment, typically the first read in the pair aligns to the antisense strand and the second read in the pair aligns to the sense strand. For example, if the DNA comes from a transcript on the reverse strand, the first read in the pair will align to the forward strand, shifted leftward on the reference, and the second read will align to the reverse strand, shifted rightward relative to the fragment. Although this sounds "simple", in practice it is not. There is so much confusion and misunderstanding around strandedness in sequencing data that dedicated software tools exist, such as "Guess My Library Type", to help researchers figure out whether their data is stranded, and which kind of strandedness it has.

In some tools, parameters are available to specify the strand orientation of the library; two examples follow below.
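
The values below are assumptions about the library preparation and must match the kit actually used:

# HISAT2: a reverse-stranded (e.g. dUTP) paired-end library.
hisat2 --rna-strandness RF -x ${IDX} -1 reads/sample_R1.fq -2 reads/sample_R2.fq

# featureCounts: -s 0 = unstranded, -s 1 = stranded, -s 2 = reversely stranded.
featureCounts -p --countReadPairs -s 2 -a ${GTF} -o counts.txt bam/*.bam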

Transcript integrity

The RNA-Seq analysis workflow

[Figure: RNA-Seq analysis workflow]

Gene expression data types

[Figures: definitions of Counts, RPM, RPKM, FPKM / FPKM-UQ, and TPM]
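
The original figures are not reproduced here; for reference, the standard definitions of these units are as follows, where r_i is the number of reads assigned to gene i, L_i is the gene's (exonic) length in base pairs, and N is the total number of mapped reads:

\mathrm{RPM}_i = \frac{r_i}{N} \times 10^{6}

\mathrm{RPKM}_i = \frac{r_i \times 10^{9}}{N \times L_i}

\mathrm{TPM}_i = \frac{r_i / L_i}{\sum_j r_j / L_j} \times 10^{6}

FPKM is defined like RPKM but counts fragments (read pairs) rather than reads; FPKM-UQ replaces the total count N with an upper-quartile-based normalization factor.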

Converting between gene expression units

To convert raw counts into the units above, you need:

  • the raw counts data
  • gene length information (the sum of exon lengths)
  • if mapped reads are computed over protein-coding genes only, the gene biotype classification as well

[Figure: converting between gene expression units]
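
The conversions follow directly from the definitions above; for example, TPM can be obtained from FPKM by renormalizing each gene against the sample total:

\mathrm{TPM}_i = \frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j} \times 10^{6}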