Handling protein objects and domains in a CSV file

This module provides functionality for initializing Protein objects from CSV files, identifying domains listed in a CSV file for a particular Protein object, and updating the CSV with protein fragment information.

Functions:
  • initialize_proteins_from_csv: Initializes Protein objects from data in a CSV file.

  • find_user_specified_domains: Identifies domains listed in a CSV file for a corresponding Protein object.

  • update_csv_with_fragments: Updates a CSV file with information about protein fragments.

  • reinitialize_proteins_from_csv: Reinitializes previously processed Protein objects from a CSV file.

  • domains_from_manual_pae: Extracts domain positions from a manually specified PAE file in the DataFrame.

  • fragments_from_csv: Extracts fragment positions from a ‘fragments’ column in the DataFrame.

Dependencies:
  • os: For file path operations.

  • json: For reading JSON data from a file.

  • pandas: For reading and processing the CSV file.

  • ast.literal_eval: For safely evaluating string literals containing Python expressions.

  • collections.Counter: For checking column names are not duplicated.

  • .classes.Protein: The Protein class for representing protein data.

  • .classes.Domain: The Domain class for representing protein domains.

  • .uniprot_fetch.fetch_uniprot_info: For fetching protein data from UniProt.

  • .alphafold_db_domain_identification.find_domains_from_pae: For identifying domains from a user-supplied PAE file.

alphafragment.process_proteins_csv.domains_from_manual_pae(df, protein)

Extracts domain positions from a manually specified PAE file in the DataFrame.

Parameters:
  • df: The DataFrame containing protein data, including a ‘pae_file’ column holding the file path to the PAE file.

  • protein: The Protein object for which to find domains.

Returns:
  • A list of domain positions

alphafragment.process_proteins_csv.find_user_specified_domains(protein_name, df)

Finds domains listed in a dataframe for a particular protein.

Parameters:
  • protein_name (str): The name of the protein to find domains for.

  • df (dataframe): DataFrame with protein names in a ‘name’ column and user-defined domains in a ‘domains’ column. The domains for each protein are given as a list of (start, end) tuples or (key, (start, end)) tuples, using 1-based indexing.

Returns:
  • list of Domain objects: A list of Domain objects representing the manually defined domains for the protein, using 0-based indexing.

Raises:
  • TypeError: If the input DataFrame or the domain data has an incorrect type.

  • ValueError: If the DataFrame is missing required columns or if the domain data cannot be parsed.

Notes:
  • Domains are expected to be provided using 1-based indexing but will be processed and output with 0-based indexing.
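The accepted input shapes and the index shift can be illustrated with a minimal sketch. It is not the library's implementation: parse_domains is a hypothetical helper, unnamed domains get a label of None for illustration, and the sketch assumes both start and end are shifted by one so that the end stays inclusive.

```python
from ast import literal_eval


def parse_domains(cell):
    """Parse a 'domains' cell into (label, start, end) tuples, 0-based.

    Hypothetical helper: assumes the end index stays inclusive after the shift.
    """
    # CSV cells arrive as strings; literal_eval only accepts Python literals.
    entries = literal_eval(cell) if isinstance(cell, str) else cell
    if not isinstance(entries, (list, tuple)):
        raise TypeError("domains must be a list of tuples")

    parsed = []
    for entry in entries:
        if len(entry) != 2:
            raise ValueError(f"cannot parse domain entry: {entry!r}")
        if isinstance(entry[1], (list, tuple)):
            # (key, (start, end)) form
            label, (start, end) = entry
        else:
            # plain (start, end) form
            label = None
            start, end = entry
        if not (isinstance(start, int) and isinstance(end, int)):
            raise ValueError(f"domain positions must be integers: {entry!r}")
        if start < 1 or end < start:
            raise ValueError(f"invalid 1-based domain range: {entry!r}")
        # Shift from 1-based (user input) to 0-based (internal).
        parsed.append((label, start - 1, end - 1))
    return parsed


domains = parse_domains("[(1, 50), ('kinase', (60, 120))]")
```

Using literal_eval rather than eval means a malformed or malicious cell raises an error instead of executing code.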

alphafragment.process_proteins_csv.fragments_from_csv(df, protein)

Extracts fragment positions from a ‘fragments’ column in the DataFrame. Converts from 1-based to 0-based indexing if the fragments start at position 1.
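The conversion rule can be sketched as follows. This is an illustration only: to_zero_based is a hypothetical helper, and it assumes that fragments are stored internally as 0-based, end-exclusive pairs (Python slice bounds), which this documentation does not state.

```python
def to_zero_based(fragments):
    """Convert 1-based inclusive (start, end) pairs to 0-based slice bounds.

    Hypothetical helper: the shift is applied only when the first fragment
    starts at position 1, mirroring the rule described above.
    """
    if fragments and fragments[0][0] == 1:
        # Under the end-exclusive assumption, only the start moves:
        # 1-based inclusive (1, 4) covers the same residues as slice [0:4].
        return [(start - 1, end) for start, end in fragments]
    return list(fragments)


sequence = "MKTAYIAKQR"
zero_based = to_zero_based([(1, 4), (5, 10)])
pieces = [sequence[start:end] for start, end in zero_based]
```

Note that a list whose first fragment does not start at position 1 is returned unchanged under this rule.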

alphafragment.process_proteins_csv.initialize_proteins_from_csv(csv_path)

Reads a CSV file with columns for protein names and accession IDs, and initializes a list of Protein objects. Protein sequences are fetched from UniProt. If fetching fails, a manually provided sequence can be used instead: for example, if a protein is known not to be in UniProt, provide its sequence in a ‘sequence’ column and leave accession_id empty. In that case no domains can be found from AlphaFold or UniProt, so any required domains must be specified manually. No Protein object is initialized for a row if both fetching and manual provision fail.

Parameters:
  • csv_path (str): The path to the CSV file.

Returns:
  • list of Protein objects: A list of Protein objects initialized from the data in the CSV file.

  • dataframe: The DataFrame containing all data from the CSV file.

Errors and Exceptions:
  • Raises ValueError if required columns (‘name’, ‘accession_id’) are not found.

  • Reports when sequences cannot be fetched (e.g., invalid accession_id, network issues) and when a manually provided sequence is used instead.

  • Prints an error message if neither fetching nor manual sequence provision is successful and the protein cannot be initialised.
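The fallback order described above (UniProt first, then the ‘sequence’ column, otherwise skip the protein) can be sketched as below. resolve_sequence is a hypothetical helper, and fetch is a stand-in callable that takes an accession ID and returns a sequence; the real signature of fetch_uniprot_info is not documented here.

```python
def resolve_sequence(name, accession_id, manual_sequence, fetch):
    """Return the sequence to use for a protein, or None if none is available.

    Hypothetical helper: `fetch` stands in for a UniProt lookup.
    """
    sequence = None
    if accession_id:
        try:
            sequence = fetch(accession_id)
        except Exception as exc:  # e.g. invalid accession ID or network issue
            print(f"Could not fetch sequence for {name} ({accession_id}): {exc}")

    if sequence:
        # A fetched sequence always takes priority over the manual one.
        return sequence
    if manual_sequence:
        print(f"Using manually provided sequence for {name}.")
        return manual_sequence

    print(f"Error: no sequence available for {name}; protein not initialised.")
    return None


def fake_fetch(accession_id):
    """Toy lookup used only for this example; the ID and sequence are made up."""
    known = {"ACC001": "MKTAYIAKQR"}
    return known[accession_id]  # raises KeyError for unknown IDs


fetched = resolve_sequence("protA", "ACC001", "AAAA", fake_fetch)
manual = resolve_sequence("protB", "", "GGGG", fake_fetch)
missing = resolve_sequence("protC", "UNKNOWN", "", fake_fetch)
```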

Notes:
  • Column names must be ‘name’ and ‘accession_id’ for protein name and accession ID respectively. If the CSV includes columns for manually specified protein sequences or domains, these should be named ‘sequence’ and ‘domains’ respectively. Capitalisation is ignored. Other columns can also be included.

  • Sequences entered in a ‘sequence’ column will only be used if no sequence data is found in UniProt. If a sequence is found in UniProt, the ‘sequence’ column will be ignored (and overwritten if the update_csv_with_fragments function is used).
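The column rules above (case-insensitive names, required ‘name’ and ‘accession_id’ columns, no duplicates) could be checked along these lines. normalize_columns is a hypothetical helper operating on a plain list of header names, not the function the module uses.

```python
from collections import Counter

REQUIRED_COLUMNS = ("name", "accession_id")


def normalize_columns(columns):
    """Lower-case column names and validate them against the rules above."""
    normalized = [str(column).lower() for column in columns]

    # Names that differ only by capitalisation collide after lower-casing.
    duplicates = sorted(
        column for column, count in Counter(normalized).items() if count > 1
    )
    if duplicates:
        raise ValueError(f"duplicated column names: {duplicates}")

    missing = [column for column in REQUIRED_COLUMNS if column not in normalized]
    if missing:
        raise ValueError(f"required columns not found: {missing}")
    return normalized


header = normalize_columns(["Name", "Accession_ID", "Sequence", "notes"])
```

With pandas, the same check can be applied to a loaded file by assigning the result back, for example `df.columns = normalize_columns(df.columns)`.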

alphafragment.process_proteins_csv.reinitialize_proteins_from_csv(input_csv_path)

Reinitializes Protein objects saved to a CSV file.

Parameters:
  • input_csv_path: The path to the CSV file containing protein data.

Returns:
  • A list of Protein objects.

  • The DataFrame containing all data from the CSV file.

alphafragment.process_proteins_csv.update_csv_with_fragments(df, output_csv, proteins)

Reads a DataFrame and writes a CSV file containing the original data, plus the fragment indices and sequences and the domains identified for each protein.

Parameters:
  • df (dataframe): DataFrame containing initial information, likely from an input CSV.

  • output_csv (file path ending .csv): Path to save the final CSV file.

  • proteins (list of Protein objects): Protein objects containing information on Domain and fragment locations to be added to the output data.

Returns:
  • dataframe: A new DataFrame with updated data and reordered columns.

Notes:
  • Reorders columns for consistency.

  • Domains in output are referenced using 1-based indexing.

  • Fragments in output are referenced using 1-based indexing and inclusive of the start and end residues.
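The shift back to 1-based, inclusive positions for output can be sketched as below. format_for_output is a hypothetical helper; whether the internal end index is inclusive or exclusive is not stated in this documentation, so the sketch takes it as an explicit argument rather than assuming either.

```python
def format_for_output(spans, end_is_exclusive):
    """Render 0-based (start, end) pairs as 1-based inclusive pairs.

    Hypothetical helper: `end_is_exclusive` says how the input end index
    should be read, since the internal convention is an assumption here.
    """
    # An exclusive 0-based end already equals the inclusive 1-based end.
    end_shift = 0 if end_is_exclusive else 1
    return [(start + 1, end + end_shift) for start, end in spans]


# Slice-style spans: residues 1-4 and 5-10 of a ten-residue sequence.
from_slices = format_for_output([(0, 4), (4, 10)], end_is_exclusive=True)

# Inclusive 0-based span: residues 1-50.
from_inclusive = format_for_output([(0, 49)], end_is_exclusive=False)
```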