Data | the Non-Coding RNA Group

Ethnicity

Wed, 17 Jun 2020 00:00:00 +0000

A 2016 report on the ethnic composition of WES and WGS studies found that 84% of studies involved Europeans, while only 14% and 3% of samples originated from Asia and Africa, respectively. The bias is even more pronounced for studies of the non-coding genome, with only a handful of reports on population SNVs. Even though the situation is improving, the majority of studies remain biased towards Caucasian populations. This lack of diversity among the populations studied can limit our understanding of how efficacious a drug may be, or how likely it is to cause adverse events, in under-represented groups. However, the advent of large-scale WGS with broader populations means it is now possible to consider population-specific variation within non-coding (NC) genomic regions.

In this work we are using publicly available data from the 1000 Genomes Project and the African Variome Project, combined with data from collaborators in Africa and China, to identify variations in specific populations that occur within miRNA-associated regions. This work is funded by a Research Council of Norway FRIPRO grant.

Data Standards

Mon, 27 Apr 2020 00:00:00 +0000

The life sciences have been revolutionized by technical advances in experimental methodology. Nowadays, researchers not only generate huge amounts of data in a single experiment but the types of data they are collecting have also become highly divergent. Thus, biology is making the transition towards a data science and a ‘life cycle’ view of research data. Researchers now face the challenges associated with handling large amounts of heterogeneous data in a digital format. Some of these challenges include consolidating the data; translating it into a format that can be read by complex analysis pipelines; determining the most suitable analysis parameters; and making the data publicly available for reuse. There is growing evidence to suggest that many published results will not be reproducible over time. Thus, robust data management and stewardship plans are essential to ensure the long-term sustainability of digital data.

The Findable, Accessible, Interoperable, and Reusable (FAIR) data initiative was created in 2016 to address these issues by providing a framework for defining the minimum elements required for good data management. However, adopting FAIR principles is not straightforward, as it requires knowledge of metadata, schemata, protocols, policies, and community agreements. Moreover, the lack of exactness in the original FAIR principles means that there are no clear guidelines on how to implement the different elements. Even when robust solutions exist, data providers may have to choose among different and not necessarily compatible implementations.

The organ on chip research environment is one area where FAIR concepts are needed but have yet to be incorporated. Organ on chip seeks to simulate the activities, mechanisms and physiological responses of organs or organ systems. A major data challenge is that organ on chip research collects huge amounts of highly diverse types of data that need to be integrated to understand the mechanics of an organoid design. Currently, no standards exist in the field and, in addition to the challenges of integrating the data, there is also the problem of how to compare results among different research groups. For example, there are several Liver on Chip designs, but no way to compare their performance.

Within the Hybrid Technology Hub Organ on Chip centre of excellence at the University of Oslo, we are developing the Global Accessible Distribution Data Sharing (GADDS) platform, an all-in-one cloud platform to facilitate data archiving and sharing with a level of FAIRness. GADDS uses decentralization technologies and a tamper-proof blockchain algorithm as a metadata quality control. By providing a browser-based client interface, GADDS can simplify the implementation of FAIRness in the data collection and storage process. The platform is specifically developed for the Organ on Chip environment but has general application in any data collection and integration process requiring a level of data FAIRness. GADDS integrates version control, cloud storage and data as an all-in-one platform.
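The idea of a tamper-evident metadata log can be illustrated with a minimal hash-chain sketch. This is not the GADDS implementation, and it is much simpler than a real blockchain: each metadata record is hashed together with the hash of the previous record, so that altering any earlier record invalidates the chain. The function names (record_hash, append_record, verify_chain) and the example metadata fields are hypothetical and chosen only for illustration.

```python
import hashlib
import json

# Placeholder "previous hash" for the first record in the chain (an assumption).
GENESIS = "0" * 64


def record_hash(metadata, prev_hash):
    """Hash a metadata record together with the hash of the previous record."""
    payload = json.dumps({"metadata": metadata, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain, metadata):
    """Append a metadata record, linking it to the last record in the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {
            "metadata": metadata,
            "prev": prev_hash,
            "hash": record_hash(metadata, prev_hash),
        }
    )
    return chain


def verify_chain(chain):
    """Return True only if every link and every stored hash is still consistent."""
    prev_hash = GENESIS
    for block in chain:
        if block["prev"] != prev_hash:
            return False
        if block["hash"] != record_hash(block["metadata"], block["prev"]):
            return False
        prev_hash = block["hash"]
    return True


# Hypothetical metadata records for two uploads.
chain = []
append_record(chain, {"experiment": "liver-chip-01", "file": "run1.csv"})
append_record(chain, {"experiment": "liver-chip-01", "file": "run2.csv"})
print(verify_chain(chain))  # the untouched chain verifies

chain[0]["metadata"]["file"] = "edited.csv"
print(verify_chain(chain))  # editing an earlier record is detected
```

In this sketch the check only detects changes; a decentralized platform would additionally replicate the chain across several nodes so that no single party could rewrite it unnoticed.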

miRAW

Mon, 27 Apr 2020 00:00:00 +0000

miRAW is a miRNA target prediction tool based on Deep Learning (DL) which, rather than incorporating any prior knowledge (such as seed regions), investigates the entire miRNA and 3’UTR mRNA nucleotide sequences to learn an uninhibited set of feature descriptors related to the targeting process.

The trained model is based on more than 150,000 experimentally validated Homo sapiens miRNA:gene targets, cross-referenced with different CLIP-Seq, CLASH and iPAR-CLIP datasets to obtain ∼20,000 validated miRNA:gene exact target sites.

Using this data, we implemented and trained a deep neural network—composed of autoencoders and a feed-forward network—able to automatically learn features describing miRNA-mRNA interactions and assess functionality. Predictions were then refined using information such as site location or site accessibility energy. In a comparison using independent datasets, our DL approach consistently outperformed existing prediction methods, recognizing the seed region as a common feature in the targeting process, but also identifying the role of pairings outside this region. Thermodynamic analysis also suggests that site accessibility plays a role in targeting but that it cannot be used as a sole indicator for functionality.
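As a rough illustration of what feeding whole sequences to a network can look like, the sketch below one-hot encodes a miRNA and a candidate target site into a single fixed-length input vector. It is not the miRAW encoding: the padded lengths (30 and 40 nucleotides), the function names and the example site sequence are assumptions made for this example only.

```python
ALPHABET = "ACGU"


def one_hot(seq, length):
    """One-hot encode an RNA sequence, truncated or zero-padded to `length` positions."""
    seq = seq.upper().replace("T", "U")[:length]
    vector = []
    for i in range(length):
        base = seq[i] if i < len(seq) else None  # None marks a padding position
        vector.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vector


def encode_pair(mirna, site, mirna_len=30, site_len=40):
    """Concatenate the encodings of a miRNA and a candidate target site.

    mirna_len and site_len are illustrative defaults, not values from miRAW.
    """
    return one_hot(mirna, mirna_len) + one_hot(site, site_len)


# Toy input: a 22-nt miRNA-like sequence and a made-up candidate site.
features = encode_pair("UGAGGUAGUAGGUUGUAUAGUU", "ACUAUACAACCUACUACCUCA")
print(len(features))  # (30 + 40) positions x 4 letters
```

Because no position is singled out in the encoding, a model trained on such vectors is free to weight the seed region, or pairings outside it, according to the training data.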

miRBaseMiner

Mon, 27 Apr 2020 00:00:00 +0000

MicroRNAs are small non-coding RNA molecules that play a central role in gene regulation. miRBase is the standard reference source for analysis and interpretation of experimental studies. However, the richness and complexity of the annotation are often underappreciated by users. Moreover, even for experienced users, the size of the resource can make it difficult to explore the annotation to determine features such as species coverage, the impact of specific characteristics and changes between successive releases. A further consideration is that each new miRBase release contains entries that have had limited review and which may be removed in a future release to ensure the quality of annotation. To aid the miRBase user, we developed a software tool, miRBaseMiner, for investigating miRBase annotation and generating custom annotation sets.

We apply the tool to characterize each release from v9.2 to v22, examining how annotation has changed across releases, and highlight some of the annotation features that users should keep in mind when using miRBase for data analysis.

These include:

entries with identical or very similar sequences;
entries with multiple annotated genome locations;
hairpin precursor entries with extremely low estimated minimum free energy;
entries with reverse-complementary sequences;
entries with 3ʹ poly(A) ends.

As each of these factors can impact the identification of dysregulated features and subsequent clinical or biological conclusions, miRBaseMiner is a valuable resource for anyone using miRBase as a reference source.
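Three of the sequence-level features listed above (identical sequences, reverse-complementary sequences and 3ʹ poly(A) ends) can be checked with a few lines of code. The following is a minimal sketch and not the miRBaseMiner implementation: the function names, the poly(A) threshold of four adenines and the toy entries are all assumptions made for illustration.

```python
from collections import defaultdict

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse complement of an RNA sequence; unknown symbols become N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq))


def flag_entries(entries, poly_a_min=4):
    """Flag entries (name -> RNA sequence) that share some annotation features.

    poly_a_min is an illustrative threshold, not a value taken from miRBaseMiner.
    """
    by_seq = defaultdict(list)
    for name, seq in entries.items():
        by_seq[seq.upper()].append(name)

    # Groups of entries whose sequences are exactly identical.
    identical = [sorted(names) for names in by_seq.values() if len(names) > 1]

    # Pairs of entries where one sequence is the reverse complement of the other.
    reverse_pairs = set()
    for seq, names in by_seq.items():
        rc = reverse_complement(seq)
        if rc != seq and rc in by_seq:
            for first in names:
                for second in by_seq[rc]:
                    reverse_pairs.add(tuple(sorted((first, second))))

    # Entries ending in a run of at least poly_a_min adenines.
    poly_a = sorted(
        name for name, seq in entries.items()
        if seq.upper().endswith("A" * poly_a_min)
    )
    return {
        "identical": identical,
        "reverse_complement": sorted(reverse_pairs),
        "poly_a_3prime": poly_a,
    }


# Toy entries with made-up names and sequences.
toy = {
    "toy-mir-1": "ACGUAAAA",
    "toy-mir-2": "ACGUAAAA",
    "toy-mir-3": "UUUUACGU",
    "toy-mir-4": "GGGCCUAU",
}
print(flag_entries(toy))
```

Checks for near-identical sequences, multiple genome locations or minimum free energy would need sequence alignment, genome coordinates and an RNA folding tool respectively, and are left out of this sketch.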

Jasmine

Wed, 27 Apr 2016 00:00:00 +0000

The existence of complex subpopulations of miRNA isoforms, or isomiRs, is well established. While many tools exist for investigating isomiR populations, they differ in how they characterize an isomiR, making it difficult to compare results across different tools. Thus, there is a need for a more comprehensive and systematic standard for defining isomiRs. Such a standard allows investigation of isomiR population structure in progressively more refined sub-populations, permitting the identification of more subtle changes between conditions and leading to an improved understanding of the processes that generate these differences.

Jasmine is a software tool that incorporates a hierarchical framework for characterizing isomiR populations. Jasmine is a Java application that can process raw read data in fastq/fasta format, or mapped reads in SAM format, to produce a detailed characterization of isomiR populations. Thus, Jasmine can reveal structure not apparent in a standard miRNA-Seq analysis pipeline.
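To give a feel for what a hierarchical characterization can look like, the sketch below assigns each mapped read a label path that becomes more specific at each level: canonical versus isomiR, then which end differs from the reference miRNA, then the exact offsets. This is an illustrative scheme only, written in Python for brevity, and is not Jasmine's classification or code; the function names, the label names and the coordinate convention (0-based start, exclusive end, on the hairpin precursor) are assumptions.

```python
from collections import Counter


def classify_read(read_start, read_end, ref_start, ref_end):
    """Return a label path for a read relative to the reference miRNA.

    Coordinates are positions on the hairpin precursor (0-based, end exclusive).
    """
    d5 = read_start - ref_start  # negative: read starts upstream of the reference 5' end
    d3 = read_end - ref_end      # positive: read extends past the reference 3' end
    if d5 == 0 and d3 == 0:
        return ("canonical",)
    if d5 != 0 and d3 != 0:
        end = "both_ends"
    elif d5 != 0:
        end = "5p"
    else:
        end = "3p"
    return ("isomiR", end, f"5p{d5:+d},3p{d3:+d}")


def count_by_level(paths, level):
    """Count reads after truncating each label path to the first `level` labels."""
    return Counter(path[:level] for path in paths)


# Toy reads against a reference miRNA spanning positions 10..32 of its precursor.
reads = [(10, 32), (10, 33), (10, 33), (9, 32), (11, 34)]
paths = [classify_read(start, end, 10, 32) for start, end in reads]

print(count_by_level(paths, 1))  # coarse view: canonical vs isomiR
print(count_by_level(paths, 2))  # finer view: which end differs
```

Comparing counts at successive levels between two conditions is one way to look for changes in progressively more refined sub-populations.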