aigct.etl.repo_loader

Module description here

Attributes

COLUMN_NAME_MAP

VEP_COLUMN_LIST

VEP_COLUMN_LIST_OLD

VARIANT_EFFECT_SOURCE_DATA

VARIANT_DATA_SOURCE_DATA

Classes

RepositoryLoader

Module Contents

aigct.etl.repo_loader.COLUMN_NAME_MAP
aigct.etl.repo_loader.VEP_COLUMN_LIST
aigct.etl.repo_loader.VEP_COLUMN_LIST_OLD
aigct.etl.repo_loader.VARIANT_EFFECT_SOURCE_DATA = [['REVEL', 'REVEL', 'VEP', 'REVEL'], ['GVMP', 'gVMP', 'VEP', 'gVMP'], ['VAR_R', 'VARITY_R',...
aigct.etl.repo_loader.VARIANT_DATA_SOURCE_DATA = [['GNOMGE', 'GNOMAD_GENOMES', 'GNOMAD GENOMES'], ['GNOMEX', 'GNOMAD_EXOMES', 'GNOMAD EXOMES']]
class aigct.etl.repo_loader.RepositoryLoader(config: aigct.util.Config, repo_context: aigct.repository.RepoSessionContext)
_log_folder
_repo_context
_convert_dot_to_nan(val)
_derive_variant_effect_source_columns(row)
_task_full_path_name(task: str, file_name: str)
init_variant_task()
init_variant_effect_source()
_build_excep_where_clause(column_list: list[str], suffixes: list[str])

Builds a where clause for use with the DataFrame.query method. The clause checks for inequality between any of the compared columns in the DataFrame. For each column in column_list, it constructs a comparison in which suffixes[0] is appended to the column name on the left side and suffixes[1] is appended to the column name on the right side.

Parameters

column_list : list(str)

List of column names to compare.

suffixes : list(str)

A list of two suffixes. The first suffix is appended to each column name on the left side of the comparison, and the second suffix is appended to the column name on the right side.
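The clause construction described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name build_inequality_clause, the example column names, the backtick quoting, and the choice to join the comparisons with "or" are all assumptions.

```python
def build_inequality_clause(column_list, suffixes):
    """Hypothetical helper: return a DataFrame.query-style expression that
    is true when any suffixed column pair differs."""
    if len(suffixes) != 2:
        raise ValueError("suffixes must contain exactly two entries")
    left, right = suffixes
    # One comparison per column: <col><left suffix> != <col><right suffix>.
    # Backticks keep the names valid inside a pandas query string.
    comparisons = [
        f"(`{col}{left}` != `{col}{right}`)" for col in column_list
    ]
    # Assumption: a row is an exception if ANY pair differs, hence "or".
    return " or ".join(comparisons)


# Example column names are invented for illustration.
clause = build_inequality_clause(["SCORE", "LABEL"], ["_new", "_old"])
print(clause)
```

A string of this shape could be passed to DataFrame.query on a frame produced by merging new and existing data with the same two suffixes. Note that in pandas a missing value compared with another missing value counts as unequal, so rows with NaN on both sides would also match such a clause.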

_excep_file_full_path_name(task: str, repo_file_name: str)
_upsert_repository_file(new_data: pandas.DataFrame, task: str, columns: list[str], repo_file_name: str, pk_columns: list[str])

General function for updating one of the repository data files with new data.

For each row in new_data, the method first checks whether the row already exists in the repository file. If it doesn’t exist, the row is added. If it does exist, the existing row is updated with the new values.

Parameters

new_data : pd.DataFrame

DataFrame containing new data to be loaded.

task : str

Task code.

columns : list(str)

List of columns in new_data dataframe and in repository data file. The columns in the data file are inserted into or updated from the columns in the new_data dataframe.

repo_file_name : str

Name of repository data file to be inserted/updated.

pk_columns : list(str)

List of column names in both new_data and repo_file_name that uniquely identify a row. We determine if a row in new_data already exists in repo_file_name by using the values in this combination of columns to look up a row in repo file.
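The insert-or-update behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's implementation: it uses plain lists of dictionaries in place of a DataFrame and a CSV file, and the names upsert_rows, VARIANT_ID, and SCORE are hypothetical.

```python
def upsert_rows(existing, new_rows, columns, pk_columns):
    """Hypothetical sketch: insert or update rows keyed by pk_columns.

    existing and new_rows are lists of dicts standing in for the repository
    file and the new_data DataFrame. Returns a new list; inputs are not mutated.
    """
    def key(row):
        # The combination of primary-key values identifies a row.
        return tuple(row[c] for c in pk_columns)

    merged = {key(row): dict(row) for row in existing}
    for row in new_rows:
        # Look up the row by its key; create an empty one if it is missing.
        target = merged.setdefault(key(row), {})
        # Copy the key columns and the data columns from the new row.
        for col in (*pk_columns, *columns):
            target[col] = row[col]
    return list(merged.values())


existing = [{"VARIANT_ID": 1, "SCORE": 0.2}, {"VARIANT_ID": 2, "SCORE": 0.5}]
new_rows = [{"VARIANT_ID": 2, "SCORE": 0.9}, {"VARIANT_ID": 3, "SCORE": 0.1}]
result = upsert_rows(existing, new_rows, ["SCORE"], ["VARIANT_ID"])
print(result)
```

In this example the row with VARIANT_ID 2 is updated in place and the row with VARIANT_ID 3 is appended, while the untouched row is kept as it was.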

load_variant_file(genome_assembly: str, task: str, data_file: str, file_folder: str, data_source: str, binary_label: int, prior_genome_assembly: str, prior_prior_genome_assembly: str)

Loads data from a file, as downloaded from a source data site, into the platform repository data files. The input data_file is assumed to contain one row per variant along with the label, with a separate column in that row for each VEP score. For each row in the input data_file, the following files are populated:

  • variant.csv - We create one row.

  • variant_effect_label.csv - We create one row with the label and other informational columns.

  • variant_effect_score.csv - We create one row for each VEP score column. So if we have 5 VEP score columns, we create 5 rows in this file.

To update the files we call the _upsert_repository_file method. This method first checks if the row already exists in the file. If it doesn’t exist it adds the row. If it does exist it updates the existing row with the new values.
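The fan-out from one input row to the three repository files can be sketched as follows. This is only an illustration of the shape of the transformation: the function split_variant_row and all column names (CHROMOSOME, POSITION, BINARY_LABEL, SCORE_SOURCE, SCORE, and the score column names) are assumptions, not the actual file layouts.

```python
def split_variant_row(row, id_columns, score_columns, binary_label):
    """Hypothetical sketch: split one wide input row into repository rows.

    Returns (variant_row, label_row, score_rows), standing in for rows of
    variant.csv, variant_effect_label.csv, and variant_effect_score.csv.
    """
    # One row identifying the variant.
    variant = {c: row[c] for c in id_columns}
    # One row carrying the label shared by every variant in the file.
    label = {**variant, "BINARY_LABEL": binary_label}
    # One row per score column, so N score columns yield N rows.
    scores = [
        {**variant, "SCORE_SOURCE": col, "SCORE": row[col]}
        for col in score_columns
    ]
    return variant, label, scores


# Invented input row with two score columns.
wide_row = {"CHROMOSOME": "1", "POSITION": 12345, "REVEL": 0.81, "GVMP": 0.42}
variant, label, scores = split_variant_row(
    wide_row, ["CHROMOSOME", "POSITION"], ["REVEL", "GVMP"], binary_label=1
)
print(len(scores))
```

Each of the three results would then be written through the upsert step, keyed on whatever columns uniquely identify a row in the corresponding file.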

Parameters

genome_assembly : str

Genome assembly, typically hg38.

task : str

Task code.

data_file : str

File containing data to be loaded.

file_folder : str

Location of data_file.

data_source : str

Source of the input data_file, e.g. HOTSPOT.

binary_label : int

1 or 0. This is the binary label to be assigned to all the variants in the data_file. The assumption is that all of the variants in the file have the same label.

prior_genome_assembly : str

Genome assembly prior to genome_assembly for which we have chromosome and position data. Typically hg19.

prior_prior_genome_assembly : str

Genome assembly prior to prior_genome_assembly for which we have chromosome and position data. Typically hg18.