Handling protein objects and domains in a CSV file
This module provides functionality for initializing Protein objects from CSV files, identifying the domains listed in a CSV file for a particular Protein object, and updating the CSV with protein fragment information.
- Functions:
initialize_proteins_from_csv: Initializes Protein objects from data in a CSV file.
find_user_specified_domains: Identifies domains listed in a CSV file for a corresponding Protein object.
update_csv_with_fragments: Updates a CSV file with information about protein fragments.
- Dependencies:
pandas: For reading and processing the CSV file.
ast.literal_eval: For safely evaluating string literals containing Python expressions.
collections.Counter: For checking column names are not duplicated.
.classes.Protein: The Protein class for representing protein data.
.classes.Domain: The Domain class for representing protein domains.
.uniprot_fetch.fetch_uniprot_info: For fetching protein data from UniProt.
- alphafragment.process_proteins_csv.find_user_specified_domains(protein_name, df)
Finds domains listed in a dataframe for a particular protein.
- Parameters:
protein_name (str): The name of the protein to find domains for.
df (dataframe): DataFrame with protein names in a column ‘name’ and user-defined domains in a column ‘domains’, given as a list of (start, end) tuples for each protein, using 1-based indexing.
- Returns:
list of Domain objects: A list of Domain objects representing the manually defined domains for the protein, using 0-based indexing.
- Raises:
TypeError: If the input DataFrame or domain_data has an incorrect type.
ValueError: If the DataFrame is missing required columns or if the domain data cannot be parsed.
- Notes:
Domains are expected to be provided using 1-based indexing but will be processed and output with 0-based indexing.
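The index conversion described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper name parse_domains is hypothetical, plain tuples stand in for Domain objects, and it assumes both the start and end coordinates are shifted by one.

```python
from ast import literal_eval


def parse_domains(domain_data):
    """Parse a 'domains' cell and convert 1-based (start, end) pairs to 0-based.

    Hypothetical helper for illustration only; the real function returns
    Domain objects rather than tuples.
    """
    if not isinstance(domain_data, str):
        raise TypeError("domain_data must be a string such as '[(1, 50)]'")
    try:
        # literal_eval only accepts Python literals, so arbitrary code is rejected
        parsed = literal_eval(domain_data)
    except (ValueError, SyntaxError) as err:
        raise ValueError(f"Could not parse domain data: {domain_data!r}") from err

    domains = []
    for start, end in parsed:
        # Assumption: both coordinates move from 1-based to 0-based
        domains.append((start - 1, end - 1))
    return domains


zero_based = parse_domains("[(1, 50), (60, 120)]")
```

Here a cell containing "[(1, 50), (60, 120)]" is read as two domains, and each coordinate is reduced by one before use.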
- alphafragment.process_proteins_csv.initialize_proteins_from_csv(csv_path)
Reads a CSV file with columns for protein names and accession IDs, and initializes a list of Protein objects. Protein sequences are fetched from UniProt. If fetching fails, a manually provided sequence can be used instead: for example, if a protein is known not to be in UniProt, provide its sequence in a ‘sequence’ column and give no accession_id. In that case no domains can be found from AlphaFold or UniProt, so any domains must be specified manually if needed. No Protein object is initialized if both fetching and manual provision fail.
- Parameters:
csv_path (str): The path to the CSV file.
- Returns:
- tuple of (list of Protein, list of str):
The first element is a list of initialized Protein objects, i.e. those for which a sequence was fetched from UniProt or provided manually.
The second element is a list of protein names for which sequences could not be retrieved or other errors occurred during fetching.
- Errors and Exceptions:
Raises ValueError if required columns (‘name’, ‘accession_id’) are not found.
Reports when sequences cannot be fetched (e.g., invalid accession_id, network issues) and when a manually provided sequence is used instead.
Prints an error message if neither fetching nor manual sequence provision is successful and the protein cannot be initialised.
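The fallback order described for this function (UniProt first, then the manual sequence, otherwise no Protein object) can be sketched as below. The helper choose_sequence and its return labels are hypothetical and only illustrate the decision. The real function also performs the UniProt request and reports errors.

```python
def choose_sequence(fetched_sequence, manual_sequence):
    """Pick the sequence to use for a protein, preferring the fetched one.

    Hypothetical helper: returns (sequence, source), or (None, None) when
    neither source is usable and the protein should be reported as failed.
    """
    if isinstance(fetched_sequence, str) and fetched_sequence.strip():
        return fetched_sequence.strip(), "uniprot"
    # An empty CSV cell may arrive as NaN or None, so only accept real strings
    if isinstance(manual_sequence, str) and manual_sequence.strip():
        return manual_sequence.strip(), "manual"
    return None, None


proteins, failed = [], []
rows = [
    ("ProteinA", "MKTAYIAKQR", "AAAA"),  # fetched sequence wins
    ("ProteinB", None, "MSTNPKPQRK"),    # fetch failed, manual sequence used
    ("ProteinC", None, float("nan")),    # nothing available
]
for name, fetched, manual in rows:
    sequence, source = choose_sequence(fetched, manual)
    if sequence is None:
        failed.append(name)
    else:
        proteins.append((name, sequence, source))
```

The two lists mirror the documented return value: one for proteins that could be initialized and one for the names that could not.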
- Notes:
Column names must be ‘name’ and ‘accession_id’ for protein name and accession ID respectively. If the CSV includes columns for manually specified protein sequences or domains, these must be named ‘sequence’ and ‘domains’ respectively. Capitalisation is ignored. Other columns can also be included.
Sequences entered in a ‘sequence’ column will only be used if no sequence data is found in UniProt. If a sequence is found in UniProt, the ‘sequence’ column will be ignored (and overwritten if the update_csv_with_fragments function is used).
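As an illustration of the column rules above, the following sketch lowercases column names, rejects duplicates with collections.Counter, and checks for the required columns. The helper name normalise_columns and the error messages are assumptions made for this example, not the module's actual code.

```python
from collections import Counter

REQUIRED_COLUMNS = ("name", "accession_id")


def normalise_columns(columns):
    """Lowercase column names, then check for duplicates and required columns.

    Hypothetical helper for illustration only.
    """
    lowered = [str(col).lower() for col in columns]

    # Names that differ only by capitalisation collide once lowercased
    duplicates = sorted(col for col, count in Counter(lowered).items() if count > 1)
    if duplicates:
        raise ValueError(f"Duplicated column names: {duplicates}")

    missing = [col for col in REQUIRED_COLUMNS if col not in lowered]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return lowered


columns = normalise_columns(["Name", "Accession_ID", "Sequence", "Domains", "Notes"])
```

With a pandas DataFrame, the same check could be applied to list(df.columns) and the result assigned back to df.columns.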
- alphafragment.process_proteins_csv.update_csv_with_fragments(df, output_csv, proteins)
Reads a dataframe and outputs a CSV file containing the original data, plus the protein fragment indices and sequences and the domains identified in each protein.
- Parameters:
df (dataframe): DataFrame containing initial information, likely from an input csv.
output_csv (file path ending in .csv): Path to save the final CSV file.
proteins (list of Protein objects): Protein objects containing information on Domain and fragment locations to be added to the output data.
- Returns:
dataframe: A new DataFrame with updated data and reordered columns.
- Notes:
Reorders columns for consistency.
Domains in output are referenced using 1-based indexing.
Fragments in output are referenced using 1-based indexing and inclusive of the start and end residues.
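To show what 1-based, inclusive fragment coordinates mean in practice, here is a small sketch. It assumes, purely for illustration, that fragments are held internally as 0-based, end-exclusive (start, end) pairs like Python slices. The source does not state the internal representation, and the helper name to_one_based_inclusive is hypothetical.

```python
def to_one_based_inclusive(fragment):
    """Convert a 0-based, end-exclusive (start, end) pair to 1-based inclusive.

    Assumption: the internal representation matches Python slicing; this is
    not confirmed by the documentation.
    """
    start, end = fragment
    if not 0 <= start < end:
        raise ValueError(f"Invalid fragment: {fragment}")
    # The start shifts by one; the exclusive end already equals the
    # 1-based index of the last residue included
    return (start + 1, end)


sequence = "MKTAYIAKQR"
fragment = (2, 6)                       # slice covering residues T, A, Y, I
reported = to_one_based_inclusive(fragment)
fragment_sequence = sequence[fragment[0]:fragment[1]]
```

Under this assumption the slice (2, 6) is reported as residues 3 to 6, and both forms select the same four residues.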