Dealing with long domains

This module provides functionality for handling long domains within proteins in the context of protein fragmentation. It identifies domains exceeding a specified maximum length and generates appropriate fragments from these long domains. The module also identifies and outputs unfragmented regions of the protein around these long domains for further processing.

Functions:
  • handle_long_domains: Main function to handle long domains in proteins, creating fragments and identifying adjacent unfragmented regions which are output as a list of ProteinSubsection objects.

Dependencies:
  • .classes.ProteinSubsection: Used to represent sections of a protein, including domains and fragments.

  • .fragmentation_methods.validate_fragmentation_parameters: Used to validate the input parameters for protein fragmentation.

  • .fragmentation_methods.check_valid_cutpoint: Used to ensure proposed fragmentation points are valid with respect to the domain boundaries and the protein.

  • .fragmentation_methods.merge_overlapping_domains: Used to merge overlapping domains within a protein.

alphafragment.long_domains.handle_long_domains(protein, length, overlap)

Identifies long domains within a protein, generates fragments that contain them, and returns these together with the unfragmented regions of the protein surrounding the long domains. Overlap is added to each fragment where possible. This function is designed to work as part of a protein fragmentation workflow.

Parameters:
  • protein (Protein): The protein object containing domain and sequence information.

  • length (dict): Dictionary containing the ideal, minimum, and maximum length values, in the format: {'min': min_len, 'ideal': ideal_len, 'max': max_len} where min_len, ideal_len, and max_len are all integers, with min_len <= ideal_len <= max_len.

  • overlap (dict): Dictionary containing the ideal, minimum, and maximum overlap values, in the format: {'min': min_overlap, 'ideal': ideal_overlap, 'max': max_overlap} where min_overlap, ideal_overlap, and max_overlap are all integers, with min_overlap <= ideal_overlap <= max_overlap.
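The two dictionaries can be illustrated with a short sketch. The helper `check_ordered` below is hypothetical and is not part of the package (the module itself relies on `validate_fragmentation_parameters`, whose exact behaviour is not shown here), and the numeric values are arbitrary examples rather than recommended settings.

```python
def check_ordered(params, name):
    """Hypothetical helper: confirm integer values with min <= ideal <= max."""
    keys = ("min", "ideal", "max")
    missing = [k for k in keys if k not in params]
    if missing:
        raise ValueError(f"{name} is missing keys: {missing}")
    if not all(isinstance(params[k], int) for k in keys):
        raise TypeError(f"{name} values must all be integers")
    if not params["min"] <= params["ideal"] <= params["max"]:
        raise ValueError(f"{name} must satisfy min <= ideal <= max")
    return params


# Illustrative values only (assumed, not defaults taken from the package).
length = check_ordered({"min": 150, "ideal": 384, "max": 400}, "length")
overlap = check_ordered({"min": 20, "ideal": 30, "max": 40}, "overlap")
```

Dictionaries in this shape are what `handle_long_domains` expects for its `length` and `overlap` arguments.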

Returns:
  • tuple: A tuple containing two lists: (1) unfragmented subsections of the protein surrounding the long domains, and (2) fragments that include the long domains with appropriate overlaps.

Function Logic:
  • Merge overlapping domains into a single domain to simplify processing, and add these to a new list.

  • Iterate over the list of domains in the protein to identify those exceeding the specified maximum length (length['max']), categorizing them as long domains.

  • For each identified long domain:
    • Determine whether it should be merged with the sequence between it and the adjacent long domains or the protein ends.

      • If the distance to one of these points is less than the minimum fragment length (length['min']), merge to avoid leaving very short fragments. If the region between two long domains is shorter than length['min'], merge it with the shorter of the two long domains.

    • For the region before the current long domain, create a subsection if it hasn’t been included in a previous fragment or subsection and if its length is viable.

  • Adjust the start and end points of each long domain fragment to include overlaps:

    • Attempt to add an overlap of between 0 and overlap['max'] residues to both the start and end points, ensuring the selected points are valid cutpoints.

    • If an overlap cannot be added, the original start and end points are used.

    • Ensure the adjusted start and end points do not extend beyond the protein’s boundaries.

  • Add the long domain, neighbouring merged regions, and overlap, as a fragment.

  • After processing all long domains, check for any remaining unfragmented protein sequence at the end and create a subsection if necessary.

  • Return the created subsections and fragments.
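Three of the steps above (merging overlapping domains, selecting long domains, and adding overlap) can be sketched in isolation. This is a simplified illustration, not the package's implementation: domains are represented as plain (start, end) tuples with 0-based inclusive coordinates instead of ProteinSubsection objects, all function names are hypothetical, the cutpoint check is passed in as a callable, and trying the largest overlap first is an assumption about the search order.

```python
def merge_overlapping(domains):
    """Merge (start, end) intervals that overlap; ends are inclusive."""
    merged = []
    for start, end in sorted(domains):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def find_long_domains(domains, max_len):
    """Keep only the domains longer than max_len residues."""
    return [(s, e) for s, e in domains if e - s + 1 > max_len]


def add_overlap(start, end, protein_len, max_overlap, is_valid_cutpoint):
    """Extend each side by the largest overlap that lands on a valid cutpoint.

    Falls back to the original start or end when no overlap is possible,
    and clamps candidates to the protein boundaries.
    """
    new_start, new_end = start, end
    for size in range(max_overlap, 0, -1):
        candidate = max(0, start - size)
        if is_valid_cutpoint(candidate):
            new_start = candidate
            break
    for size in range(max_overlap, 0, -1):
        candidate = min(protein_len - 1, end + size)
        if is_valid_cutpoint(candidate):
            new_end = candidate
            break
    return new_start, new_end


# Illustrative data: two overlapping domains and one short domain.
merged = merge_overlapping([(150, 600), (10, 200), (700, 760)])
long_domains = find_long_domains(merged, max_len=400)
fragments = [
    add_overlap(s, e, protein_len=800, max_overlap=40,
                is_valid_cutpoint=lambda pos: True)
    for s, e in long_domains
]
```

The sketch leaves out the merging of short flanking regions and the creation of unfragmented subsections, which depend on length['min'] and on the positions of neighbouring long domains.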

Notes:
  • If two long domains are adjacent and the distance between them is less than overlap['min'], the domains will still be created as separate fragments with as much overlap as is allowed by the space between them.
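One possible reading of this note is that the overlap between two neighbouring long-domain fragments is capped by the number of residues separating the domains. The sketch below shows that reading only; the function name and the inclusive coordinate convention are assumptions and are not taken from the package.

```python
def overlap_between(prev_end, next_start, max_overlap):
    """Overlap available between two long domains (inclusive coordinates).

    The gap is the number of residues strictly between the two domains;
    the overlap can be no larger than that gap or than max_overlap.
    """
    gap = max(0, next_start - prev_end - 1)
    return min(max_overlap, gap)


# With overlap = {'min': 20, 'ideal': 30, 'max': 40} and only 10 residues
# between the domains, the available overlap falls below overlap['min'].
available = overlap_between(prev_end=500, next_start=511, max_overlap=40)
```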