Next-generation sequencing (NGS) technologies have revolutionised research into
the nature and diversity of genomes and transcriptomes. Since the initial description of
these technology platforms over a decade ago, massively parallel RNA sequencing
(RNA-seq) has driven many advances in the characterization and quantification of
transcriptomes. RNA-seq is a powerful gene expression profiling technology that
enables transcript discovery and provides a far more precise measure of the levels of
transcripts and their isoforms than other methods such as microarrays.
However, the analysis of RNA-seq data remains a significant challenge for many
biologists. The data generated is large and the tools for its assembly, analysis and
visualisation are still under development. Assemblies of reads can be inspected using
tools such as the Integrative Genomics Viewer (IGV) where visualisation of results
involves ‘stacking’ the reads onto a reference genome. Whilst sufficient for many
needs, when the underlying variance of the genome or transcript assemblies is
complex, this visualisation method can be limiting; errors in assembly can be
difficult to spot and visualisation of splicing events may be challenging.
Data visualisation is increasingly recognised as an essential component of genomic
and transcriptomic data analysis, enabling large and complex datasets to be better
understood. An approach that has been gaining traction in biological research is
based on the application of network visualisation and analysis methods. Networks
consist of nodes connected by edges (lines), where nodes usually represent entities
and edges the relationships between them. These are now widely used for plotting
experimentally or computationally derived relationships between genes and proteins.
The overall aim of this PhD project was to explore the use of network-based
visualisation in the analysis and interpretation of RNA-seq data. In chapter 2, I
describe the development of a data pipeline that has been designed to go from ‘raw’
RNA-seq data to a file format which supports data visualisation as a ‘DNA assembly
graph’. In DNA assembly graphs, nodes represent sequence reads and edges denote a
homology between reads above a defined threshold. Following the mapping of reads
to a reference sequence and the identification of reads that map to a given locus, pairwise
sequence alignments are performed between reads using MegaBLAST. This provides
a weighted similarity score that is used to define edges between reads. Visualisation
of the resulting networks is then carried out using BioLayout Express3D, which can
render large networks in 3-D, thereby allowing a better appreciation of the often-complex
network structure. This pipeline has formed the basis for my subsequent
work on exploring and analysing alternative splicing in human RNA-seq data. In
the second half of this chapter, I provide a series of tutorials aimed at different types
of users allowing them to perform such analyses. The first tutorial is aimed at
computational novices who might want to generate networks using a web-browser
and pre-prepared data. Other tutorials are designed for use by more advanced users
who can access the code for the pipeline through GitHub or via an Amazon Machine
Image (AMI).
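The edge-definition step of the pipeline described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the pairwise alignments are available in BLAST tabular format (`-outfmt 6`, where column 12 is the bit score), and the function names, the example read identifiers and the `min_bitscore` value are all hypothetical, since the threshold used in the thesis is not stated here.

```python
from collections import defaultdict


def build_read_graph(blast_lines, min_bitscore=50.0):
    """Return {(read_a, read_b): weight} for read pairs at or above the threshold.

    Assumes BLAST tabular output (outfmt 6): qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    The default threshold is an illustrative assumption.
    """
    edges = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query == subject:
            continue  # a read aligned to itself carries no information
        if bitscore < min_bitscore:
            continue  # similarity below the chosen threshold: no edge
        # Treat the graph as undirected and keep the best score per read pair.
        key = tuple(sorted((query, subject)))
        edges[key] = max(bitscore, edges.get(key, 0.0))
    return edges


def adjacency(edges):
    """Convert the weighted edge dictionary into {read: set(neighbours)}."""
    neighbours = defaultdict(set)
    for read_a, read_b in edges:
        neighbours[read_a].add(read_b)
        neighbours[read_b].add(read_a)
    return dict(neighbours)


# Hypothetical alignments between three reads mapped to the same locus.
example_hits = [
    "read1\tread1\t100.0\t50\t0\t0\t1\t50\t1\t50\t1e-20\t93.5",
    "read1\tread2\t98.0\t40\t1\t0\t11\t50\t1\t40\t1e-15\t70.1",
    "read2\tread1\t98.0\t40\t1\t0\t1\t40\t11\t50\t1e-15\t69.0",
    "read2\tread3\t85.0\t20\t3\t0\t31\t50\t1\t20\t1e-02\t22.3",
]
example_edges = build_read_graph(example_hits, min_bitscore=50.0)
example_neighbours = adjacency(example_edges)
```

In this toy input only the read1/read2 pair survives the threshold, so read3 would not be connected in the resulting graph. The weighted edge list is the kind of structure that a network layout tool can then render.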
In chapter 3, the utility of network-based visualisations of RNA-seq data is explored
using data processed through the pipeline described in chapter 2. The aim of the
work described in this chapter was to better understand the basic principles and
challenges associated with network visualisation of RNA-seq data, in particular how
it could be used to visualise transcript structure and splice-variation. These analyses
were performed on data generated from four samples of human fibroblasts taken at
different time points during their entry into cell division. One of the first challenges
encountered was that the existing network layout algorithm (Fruchterman-Reingold)
implemented within BioLayout Express3D did not result in an optimal
layout of the unusual graph structures produced by these analyses. Following the
implementation of the more advanced layout algorithm FMMM within the tool,
network structure could be far better appreciated. Using this layout method, the
majority of genes sequenced to an adequate depth assemble into networks with a
linear ‘corkscrew’ appearance and, when representing single-isoform transcripts, add
little to existing views of these data. However, in a small number of cases (~5%), the
networks generated from transcripts expressed in human fibroblasts possess more
complex structures, with ‘loops’, ‘knots’ and multiple ends being observed. In a
majority of cases examined, these loops were associated with alternative splicing
events, a fact confirmed by RT-PCR analyses. Other DNA assembly networks
representing the mRNAs for genes such as MKI67 showed knot-like structures,
which were found to be due to the presence of repetitive sequence within an exon of
the gene. In another case, CENPO, the unusual structure observed was due to reads
derived from the overlapping gene ADCY3, present on the opposite strand,
being wrongly mapped to CENPO. Finally, I explored the use of a
network reduction strategy as an approach to visualising highly expressed genes such
as GAPDH and TUBA1C. Having successfully demonstrated the utility of networks
in analysing transcript isoforms in data derived from a single cell type, I set out to
explore their utility in analysing transcript variation in tissue data where multiple
isoforms expressed by different cells within the tissue might be present in a given
sample.
In chapter 4, I explore the analysis of transcript variation in an RNA-seq dataset
derived from human tissue. The first half of this chapter describes the quality control
of these data, again using a network-based approach but this time based on the
correlation in expression between genes and samples. Of the 95 samples derived
from 27 human tissues, 77 passed the quality control. A network was constructed
using a correlation threshold of r ≥ 0.9; it comprised 6,109 nodes (genes) and
1,091,477 edges (correlations) and was then clustered. Subsequently, the profile and gene
content of each cluster was examined and enrichment of GO terms analysed. In the
second half of this chapter, the aim was to detect and analyse alternative splicing
events between different tissues using the rMATS tool. By using a false-discovery
rate (FDR) cut-off of < 0.01, I found that in comparisons of brain vs. heart, brain vs.
liver and heart vs. liver, the program reported 4,992, 4,804 and 3,990 splicing events,
respectively. Of these events, only 78 splicing events (in 52 genes) had an exon inclusion
level of more than 50% and an expression level of more than 30 FPKM. To further explore
the sometimes-complex structure of transcript diversity derived from tissue, RNA-seq
assembly networks for KLC1, SORBS2, GUK1, and TPM1 were examined. Each
of these networks showed different types of alternative splicing events and it was
sometimes difficult to determine the isoforms expressed between tissues using other
approaches. For instance, visualising the read assembly of long
genes such as KLC1 and SORBS2 using Sashimi plots or even Vials is problematic, simply because
of the number of exons and the size of their genomic loci. In the case of GUK1,
tissue-specific isoform expression was observed when the networks of three tissues were
combined. Arguably the most complex analysis is that of the TPM1 network, where the
uniquification step was employed for this highly expressed gene.
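The correlation-network construction used for quality control in the first half of chapter 4 can be illustrated with a small sketch. It assumes that r denotes the Pearson correlation coefficient between expression profiles, which the text above does not state explicitly; the function names, gene labels and expression values are hypothetical and chosen only to show the thresholding at r ≥ 0.9.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile has no defined correlation; treat as 0
    return cov / math.sqrt(var_x * var_y)


def correlation_edges(profiles, threshold=0.9):
    """Return {(gene_a, gene_b): r} for every gene pair with r >= threshold."""
    edges = {}
    for gene_a, gene_b in combinations(sorted(profiles), 2):
        r = pearson(profiles[gene_a], profiles[gene_b])
        if r >= threshold:
            edges[(gene_a, gene_b)] = r
    return edges


# Hypothetical expression values for three genes across four samples.
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
toy_edges = correlation_edges(toy_profiles, threshold=0.9)
```

Here geneA and geneB rise together and are joined by an edge, whereas geneC is anti-correlated with both and remains unconnected. The same function applied to sample profiles rather than gene profiles would give a sample-to-sample network. An all-pairs loop like this scales quadratically with the number of genes, so a network of the size reported above would in practice need a vectorised or dedicated implementation.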
In chapter 5, I describe usability testing of the NGS Graph Generator web application
and of the visualisation of RNA-seq assemblies as networks using BioLayout Express3D. This
testing was important to ensure that the application is well received and utilised by its
users. Almost all participants in the usability test agreed that the application would
encourage biologists to visualise and understand alternative splicing alongside
existing tools. The participants also agreed that Sashimi plots are rather difficult to view
and interpret and might perhaps miss some interesting features. However, participants
also identified areas of the application that need improvement, such as the capability
to analyse large networks in a short time and side-by-side analysis of networks with Sashimi
plots and Ensembl. Additional information about the network would also be necessary to
improve the understanding of alternative splicing.
In conclusion, this work demonstrates the utility of network visualisation of RNA-seq
data, where the unusual structure of these networks can be used to identify issues
in assembly, repetitive sequences within transcripts and splice variation. As such,
this approach has the potential to significantly improve our understanding of
transcript complexity. Overall, this thesis demonstrates that network-based
visualisation provides a new and complementary approach to characterise alternative
splicing from RNA-seq data and has the potential to be useful for the analysis and
interpretation of other kinds of sequencing data.