TY - JOUR
T1 - Language-Informed Basecalling Architecture for Nanopore Direct RNA Sequencing
AU - Sneddon, Alexandra
AU - Mateos, Pablo Acera
AU - Shirokikh, Nikolay E.
AU - Eyras, Eduardo
N1 - Publisher Copyright: © MLCB 2022.
PY - 2022
Y1 - 2022
AB - Algorithms developed for basecalling Nanopore signals have primarily focused on DNA to date and utilise the raw signal as the only input. However, it is known that messenger RNA (mRNA), which dominates Nanopore direct RNA (dRNA) sequencing libraries, contains specific nucleotide patterns that are implicitly encoded in the Nanopore signals since RNA is always sequenced from the 3' to 5' direction. In this study we present an approach to exploit the sequence biases in mRNA as an additional input to dRNA basecalling. We developed a probabilistic model of mRNA language and propose a modified CTC beam search decoding algorithm to conditionally incorporate the language model during basecalling. Our findings demonstrate that inclusion of mRNA language is able to guide CTC beam search decoding towards the more probable nucleotide sequence. We also propose a time-efficient approach to decoding variable-length Nanopore signals. This work provides the first demonstration of the potential for biological language to inform Nanopore basecalling. Code is available at: https://github.com/comprna/radian.
UR - http://www.scopus.com/inward/record.url?scp=85164540045&partnerID=8YFLogxK
UR - https://proceedings.mlr.press/v200/sneddon22a.html
U2 - 10.1101/2022.10.19.512968
DO - 10.1101/2022.10.19.512968
M3 - Conference article
AN - SCOPUS:85164540045
SN - 2640-3498
VL - 200
SP - 150
EP - 165
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 17th Machine Learning in Computational Biology Meeting, MLCB 2022
Y2 - 21 November 2022 through 22 November 2022
ER -