Fine tune large protein model (ProtTrans) using HuggingFace

Abstract

The advent of large language models has transformed natural language processing, enabling machines to comprehend and generate human-like language with unprecedented accuracy. Pre-trained transformer language models such as BERT, RoBERTa, and their variants have achieved state-of-the-art results on tasks ranging from sentiment analysis and question answering to machine translation and text classification. Generative models such as the Generative Pre-trained Transformer (GPT) and its variants can produce coherent, context-specific text, and conversational systems built on them, such as ChatGPT, allow users to hold natural-sounding conversations with machines.

Despite their impressive capabilities, these models are imperfect, and their performance on a given task can often be improved substantially through fine-tuning. Fine-tuning adapts a pre-trained model to a specific task or domain by adjusting its parameters to optimise performance on a target dataset. This lets the model learn task-specific features and relationships that pre-training alone may not capture, yielding accurate, specialised models for a wide range of applications.

Protein large language models (LLMs) represent a significant advancement in bioinformatics, using deep learning to understand and predict the behaviour of proteins at an unprecedented scale. These models, exemplified by the ProtTrans suite, apply natural language processing (NLP) methodologies to biological sequences. ProtTrans models, including BERT and T5 adaptations, are trained on protein sequences from databases such as UniProt and BFD, which store millions of protein sequences, and this enables the models to capture the complex patterns and functions encoded in amino acid sequences. By treating these sequences much like a language, protein LLMs offer transformative potential in drug discovery, disease understanding, and synthetic biology, bridging the gap between computational predictions and experimental biology.

In this tutorial, we will discuss and fine-tune ProtT5, a large language model trained on protein sequences, for dephosphorylation site prediction, a binary classification task. We will explore the benefits and challenges of this approach, along with techniques such as low-rank adaptation (LoRA) that make it possible to fit large language models with billions of parameters on regular GPUs.
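
The abstract mentions low-rank adaptation (LoRA) as a way to fine-tune very large models on regular GPUs. The sketch below illustrates the general idea only and is not the tutorial's own code: a frozen weight matrix W is left untouched, and only two small matrices A and B are trained, giving an effective weight of W + (alpha / r) * B A. The toy sizes, the names `lora_effective_weight` and `trainable_params`, and the 1024 x 1024 layer with rank 8 are assumptions made for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = len(a)            # A has shape (r, k)
    scale = alpha / r
    delta = matmul(b, a)  # (d, r) @ (r, k) -> (d, k)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d, k, r):
    """Trainable parameter counts for a d x k layer: full fine-tuning vs. LoRA."""
    return {"full": d * k, "lora": r * (d + k)}


# Toy sizes, chosen only for illustration.
d, k, r, alpha = 6, 4, 2, 4.0
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen pre-trained weight
a = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, small random init
b = [[0.0] * r for _ in range(d)]                               # trainable, zero init

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
w_eff = lora_effective_weight(w, a, b, alpha)

# Hypothetical 1024 x 1024 layer adapted with rank 8.
counts = trainable_params(1024, 1024, 8)
print(counts)
```

For the hypothetical 1024 x 1024 layer, full fine-tuning would update 1,048,576 weights, while the rank-8 adapter trains 16,384. That reduction is what makes fitting billion-parameter models on a single GPU plausible.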

About This Material

This is a hands-on tutorial from the Galaxy Training Network (GTN), usable either for individual self-study or as teaching material in a classroom.

Questions this will address

  • How to load large protein AI models?
  • How to fine-tune such models on downstream tasks such as post-translational modification site prediction?

Learning Objectives

  • Learn to load and use large protein models from HuggingFace
  • Learn to fine-tune them on specific tasks such as predicting dephosphorylation sites
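
As a small illustration of what "using" a ProtT5 model involves before any fine-tuning, the sketch below prepares a raw amino acid string for tokenisation. It assumes the convention described on the ProtTrans model cards: residues are separated by spaces, and the rare or ambiguous residues U, Z, O and B are mapped to X. The function name `prepare_protein_sequence` and the example sequence are hypothetical, and the tutorial itself may handle this step differently.

```python
import re


def prepare_protein_sequence(sequence):
    """Format a raw amino acid string for a ProtT5-style tokenizer.

    Assumed convention (from the ProtTrans model cards, not from this page):
    map rare/ambiguous residues U, Z, O, B to X and separate residues by spaces.
    """
    cleaned = re.sub(r"\s+", "", sequence).upper()  # drop whitespace, normalise case
    cleaned = re.sub(r"[UZOB]", "X", cleaned)       # replace rare/ambiguous residues
    return " ".join(cleaned)                        # one token per residue


# Hypothetical input, chosen only to exercise each step.
example = prepare_protein_sequence("MKTuAYIB")
print(example)  # M K T X A Y I X
```

The resulting string could then be passed to a tokenizer loaded from HuggingFace; which checkpoint and classification head to use depends on the tutorial and is not assumed here.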

Licence: Creative Commons Attribution 4.0 International

Keywords: Statistics and machine learning, deep-learning, dephosphorylation-site-prediction, fine-tuning, interactive-tools, jupyter-lab, machine-learning

Target audience: Students

Resource type: e-learning

Version: 2

Status: Active

Prerequisites:

  • A Docker-based interactive Jupyterlab powered by GPU for artificial intelligence in Galaxy
  • Introduction to Galaxy Analyses
  • JupyterLab in Galaxy

Date modified: 2024-09-19

Date published: 2024-06-17

Authors: Anup Kumar

Contributors: Anup Kumar, Björn Grüning, Helena Rasche, Michelle Terese Savage, Saskia Hiltemann, Teresa Müller

Scientific topics: Statistics and probability

