Skip to content

Latest commit

 

History

History
1916 lines (1734 loc) · 66.5 KB

README.md

File metadata and controls

1916 lines (1734 loc) · 66.5 KB

NetREm

Network regression embeddings reveal cell-type transcription factor coordination for gene regulation

By: Saniya Khullar, Xiang Huang, Raghu Ramesh, John Svaren, Daifeng Wang

Daifeng Wang Lab

Summary

NetREm is a software package that utilizes network-constrained regularization for biological applications and other network-based learning tasks. In biology, traditional regression methods can struggle with correlated predictors, particularly transcription factors (TFs) that regulate target genes (TGs) in gene regulatory networks (GRNs). NetREm incorporates information from prior biological networks to improve predictions and identify complex relationships among predictors (e.g. TF-TF coordination: direct/indirect interactions among TFs). This approach can highlight important nodes and edges in the network, reveal novel regularized embeddings for genes, provide insights into underlying biological processes, identify subnetworks of predictors that group together to influence the response variable, and improve model accuracy and biological/clinical significance of the models. NetREm can incorporate multiple types of network data, including Protein-Protein Interaction (PPI) networks, gene co-expression networks, and metabolic networks. In summary, network-constrained regularization may bolster the construction of more accurate and interpretable models that incorporate prior knowledge of the network structure among predictors.

Pipeline

Pipeline image of NetREm png

Hardware Requirements

The minimum requirement is a computer with 8 GB of RAM and 32 GB of storage. For large prior graph networks, 32 GB of RAM is recommended.

Software Requirements and Installation Guide

The software uses Python 3.10. After downloading the NetREm Github code, conda/Anaconda users can use the following steps to install:

  1. In the Anaconda navigator prompt, create a virtual environment of Python 3.10 by running:
    conda create -n NetREm python=3.10
  2. Activate the environment:
    conda activate NetREm
  3. Make sure to change the current directory to the NetREm folder.
  4. Install the packages and dependencies (math, matplotlib, networkx, numpy, typing, os, pandas, plotly.express, random, scipy, scikit-optimize, sklearn, sys, tqdm, warnings):
    pip install -r requirements.txt

Please note that if you encounter import errors from files or functions in the code folder (such as Netrem_model_builder.py), add an empty file named init.py to the code folder, and add the "code." prefix to all imports from the "code" folder. For example, import Netrem_model_builder as nm ➡️ import code.Netrem_model_builder as nm.


Usage of the NetREm main function netrem()

NetREm fits a Network-constrained Lasso regression machine learning model with user-provided weights for the prior network. Here, netrem is the main function with the following usage:

netrem(
        edge_list,
        beta_net = 1,
        alpha_lasso = 0.01,
        default_edge_weight = 0.01,
        edge_vals_for_d = True,
        w_transform_for_d = "none",
        degree_threshold = 0.5,
        gene_expression_nodes = [],
        overlapped_nodes_only = False,
        y_intercept = False,
        view_network = False,
        model_type = "Lasso",
        ...
)

Parameter Definition
edge_list list
A list of lists corresponding to a prior network involving predictors (nodes) and relationships among them (edges):
[[source1, target1, weight1], ..., [sourceZ, targetZ, weightZ]]. Here, weight1, ..., weightZ are optional. Nodes found in the edge_list are referred to as network nodes
beta_net float, default = 1
Regularization parameter for network penalization: $\beta_{net} \geq 0$.
alpha_lasso float, default = 0.01
A numerical regularization parameter for the lasso term ($\alpha_{lasso} \geq 0$) needed if model_type = LassoCV. Larger values typically reduce the number of final predictors in the model.
default_edge_weight float, default = 0.01
Default edge weight ($w$) assigned to any edge with missing weight
edge_vals_for_d boolean, default = True
If True, we focus on summing the edge weights to calculate the node degree $d$
w_transform_for_d string, default = "none"
Other options are "sqrt", "square", "avg". Here, "none" means we add the weights of the edges that are connected to the node to get the node degree. These other options represent transformations that can be done on this sum to yield various other node degree values $d$
degree_threshold float, default = 0.5
If edge_vals_for_d is False, then Edges with weight $w$ > degree_threshold are counted as 1 towards the node degree $d$
gene_expression_nodes list, default = []
A list of predictors (e.g. TFs) to use that typically is found as columns in the training gene expression data $X_{train}$.
Any gene_expression_nodes not found in the edge_list are added internally into the network prior edge_list using pairwise default_edge_weight. Specifying gene_expression_nodes is optional but may boost the speed of training and fitting NetREm models (by adjusting the network prior in the beginning). Thus, if the gene expression data ($X$) is available, it is recommended to input gene_expression_nodes. Otherwise, NetREm automatically determines gene_expression_nodes when fitting the model with $X_{train}$ gene expression data (when fit(X,y) is called), but needs time to recalibrate the network prior based on $X_{train}$ nodes and value set for overlapped_nodes_only.
overlapped_nodes_only boolean, default = False
This determines if NetREm should focus on common nodes found in network nodes (from edge_list) and gene expression data (based on gene_expression_nodes). Here, network nodes not found in the gene expression data will always be removed. The priority is given to gene_expression_nodes since those have gene expression values that are used by the regression.
• If overlapped_nodes_only = False, the predictors will come from gene_expression_nodes, even if those are not found in the network edge_list. Some predictors may lack relationships in the prior network.
• If overlapped_nodes_only = True, the predictors used will need to be a common node: network node also found in the gene_expression_nodes.
See overlapped_nodes_only.pdf for hands-on examples.
standardize_X boolean, default = True
This determines if NetREm should standardize $X$, for each predictor column: subtracting the mean of $X$ and dividing by the standard deviation of $X$ using the training data.
standardize_y boolean, default = True
This determines if NetREm should standardize $y$: subtracting the mean of $y$ and dividing by the standard deviation of $y$ using the training data.
center_y boolean, default = False
This determines if NetREm should center $y$: subtracting the mean of $y$ based on the training data
y_intercept boolean, default = 'False'
This is the fit_intercept parameter found in the Lasso and LassoCV classes in sklearn.
• If y_intercept = True, the model will be fit with a y-intercept term included.
• If y_intercept = False, the model will be fit with no y-intercept term.
view_network boolean, default = False
• If view_network = True, then NetREm outputs visualizations of the prior graph network. Recommended for small networks (instead of for dense hairballs)
If view_network = False, then NetREm saves time by not outputting visuals of the network.
model_type {'Lasso', 'LassoCV'}, default = 'Lasso'
• Lasso: user specifies value of $\alpha_{lasso}$
• LassoCV: NetREm performs cross-validation (CV) on training data to determine optimal $\alpha_{lasso}$
... (additional parameters) Read more in the User Guide: Additional Parameters for more parameters after model_type

Details:

We input an edge list of the prior graph network (constrains the model via network-based regularization) and a beta_net ($\beta_{net} \geq 0$, which scales the network-based regularization penalty). The user may specify the alpha_lasso ($\alpha_{lasso} \geq 0$) manually for the lasso regularization on the overall model (if model_type = Lasso) or NetREm may select an optimal $\alpha_{lasso}$ based on cross-validation (CV) on the training data (if model_type = LasssoCV). Then, netrem builds an estimator object from the class Netrem that can then take in input $X$ and $y$ data: transforms them to $\tilde{X}$ and $\tilde{y}$, respectively, and use them to fit a Lasso regression model with a regularization value of $\alpha_{lasso}$. Ultimately, the trained NetREm machine learning model is more reflective of an underlying network structure among predictors and may be more biologically meaningful and interpretable. Nonetheless, NetREm could be applied in various contexts where a network structure is present among the predictors. Input networks are typically weighted and undirected. We provide details, functions, and help with converting directed networks to undirected networks (of similarity values among nodes) here.

Output:

  • A NetREm network-regularized linear model estimator object from the NetREmModel class.

Usage of the NetREmModel object returned from netrem()

Methods:

       Method       Definition Returns
fit($X$, $y$) Building and training the NetREm model with $X$ and $y$ data. Fitted NetREm Object with several updated attributes. For instance, if model_type = LassoCV, will also return optimal value for $\alpha_{lasso}$.
predict($X$) Use our model to predict values for our response variable $y$. Numpy array of $\hat{y}$ predicted values for $y$ Numpy array of $\hat{y}$ predicted values for $y$
test_mse($X$, $y$) Evaluate our model performance capabilities on testing data using Mean Squared Error (MSE) as our metric. Numeric value corresponding to the Mean Square Error (MSE).
$$MSE = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y_i})^2$$

We assume that our $X$ and $y$ data correspond to $M$ samples and $N$ predictors. For biological applications, the $X$ data would typically be gene expression data (bulk or single-cell) for the $N$ candidate predictors (usually Transcription Factors (TFs)) for the $M$ samples. Then, the $y$ values would correspond to the gene expression values for the target gene (TG) $y$ for those same $M$ samples.

Parameter Definition
$X$ Pandas dataframe ($M$ rows by $N$ columns) where the rows are samples and columns are predictors.
$y$ Pandas dataframe ($M$ rows by 1 column) with 1 column that corresponds to values for the response variable for the same samples found in $X$.

We can retrieve our model coefficients and other attributes by calling these outputs:

Output Definition
model_coef_df Pandas dataframe of the Lasso model coefficients for all of the predictors and y-intercept (if y_intercept = True)
model_nonzero_coef_df Filtered Pandas dataframe of the Lasso model coefficients for only the predictors and y-intercept (if y_intercept = True) that have non-zero values.
optimal_alpha If model_type = LassoCV, returns the optimal $\alpha_{lasso}$ found by performing cross validation (CV) on training data
predY_train NetREm's predicted values for the training response $y$ data (used to fit the model).
mse_train Mean Square Error (MSE): predicted predY_train versus actual training values for the response variable Y.
sorted_coef_df Pandas dataframe that sorts the final model coefficients (including the y-intercept) based on their absolute values. The rank is provided from least (most important: highest absolute value coefficient) to highest (least important in model).
final_corr_vs_coef_df Pandas dataframe with 3 rows.
• NetREm regression coefficient for predictor
• correlation of each predictor with y based on the training data
• absolute value ranking of the coefficients for the predictors
combined_df Pandas dataframe with a row for each predictor and several columns detailing:
• general NetREm model information: y_intercept, train MSE, beta_net, alpha_lasso, original number of predictors in $X$, filtered number of predictors input to NetREm (based on pre-processing by user), final number of non-zero predictors selected
• predictor-specific results: NetREm coefficient for predictor, absolute value of NetREm coefficient, rank of the absolute value of the coefficient (low ranks imply higher
  • predict($X$)

We can use our model to predict values for our response variable $y$.

Parameter Definition
$X$ Pandas dataframe ($M$ rows by $N$ columns) where the rows are samples and columns are predictors.

Returns: Numpy array of $\hat{y}$ predicted values for $y$.

  • test_mse($X$, $y$)

We can evaluate our model performance capabilities on data like testing data using the Mean Squared Error (MSE) as our metric.

Parameter Definition
$X$ Pandas dataframe ($M$ rows by $N$ columns) where the rows are samples and columns are predictors.
$y$ Pandas dataframe ($M$ rows by 1 column) with 1 column that corresponds to values for the response variable for the same samples found in $X$.

Returns: Numeric value corresponding to the Mean Square Error (MSE).

$$ MSE = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y_i})^2 $$

=======

Attributes:

Attribute Definition
model_coef_df Pandas dataframe of the Lasso model coefficients for all of the predictors and y-intercept (if y_intercept = True)
model_nonzero_coef_df Filtered Pandas dataframe of the Lasso model coefficients for only the predictors and y-intercept (if y_intercept = True) that have non-zero values.
optimal_alpha If model_type = LassoCV, returns the optimal $\alpha_{lasso}$ found by performing cross validation (CV) on training data
predY_train NetREm's predicted values for the training response $y$ data (used to fit the model).
mse_train Mean Square Error (MSE): predicted predY_train versus actual training values for the response variable Y.
sorted_coef_df Pandas dataframe that sorts the final model coefficients (including the y-intercept) based on their absolute values. The rank is provided from least (most important: highest absolute value coefficient) to highest (least important in model).
final_corr_vs_coef_df Pandas dataframe with 3 rows.
• NetREm regression coefficient for predictor
• correlation of each predictor with y based on the training data
• absolute value ranking of the coefficients for the predictors
combined_df Pandas dataframe with a row for each predictor and several columns detailing:
• general NEtREm model information: y_intercept, train MSE, beta_net, alpha_lasso, original number of predictors in $X$, filtered number of predictors input to NetREm (based on pre-processing by user), final number of non-zero predictors selected
• predictor-specific results: NetREm coefficient for predictor, absolute value of NetREm coefficient, rank of the absolute value of the coefficient (low ranks imply higher absolute value for the NetREm coefficient)
B_interaction_df Pandas dataframe with a row for each pair of final predictors selected in the final model (with nonzero efficients) and several columns detailing the $y$-specific predictor-predictor interaction metric value:
• general NEtREm model information: model_type, beta_net
• predictor-predictor specific results: coord_score_cs (this is the key value we focus on as it is the coordination score), original_corr (that is the same as standardized_corr), input network weight, NetREm coefficient for each predictor, novel_link (if yes, it is artifical default edge added to the input network)

Demo (Toy Example) of NetREm:

The goal is to build a machine learning model to predict the gene expression levels of our target gene (TG) $y$ based on the gene expression levels of $N = 5$ Transcription Factors (TFs) [TF1, $TF_{2}$, $TF_{3}$, $TF_{4}$, $TF_{5}$] in a particular cell-type. Assume the gene expression values for each TF are [X1, $X_{2}$, $X_{3}$, $X_{4}$, $X_{5}$], respectively. We generate $10,000$ random cells (rows: samples) of data where the Pearson correlations ($r$) between the gene expression of each TF ($X$) with gene expression of TG $y$ as corrVals: [cor(TF1, $y$) = 0.9, cor(TF2, $y$) = 0.5, cor(TF3, $y$) = 0.4, cor(TF4, $y$) = -0.3, cor(TF5, $y$) = -0.8]. We note that the sparsity_factor_perc is 40%, so roughly 4,000 out of the 10,000 cells for each variable will be ≈0; we incorporate these dropoout, so that the underlying data mimics single-cell expression data. Thus, $M$ is $7,000$ cells since that is what is used for training the NetREm model.

The dimensions of $X$ are therefore 10,000 rows by 5 columns (predictors). More details about our generate_dummy_data function (and additional parameters we can adjust for) are in Dummy_Data_Demo_Example.pdf. Our NetREm estimator also incorporates a constraint of an undirected weighted prior graph network of biological relationships among only 5 TFs based on a weighted Protein-Protein Interaction (PPI) network ([TF1, $TF_{2}$, $TF_{3}$, $TF_{4}$, $TF_{5}$]), where higher edge weights $w$ indicate stronger biological direct and/or indirect pairwise interactions among TFs at the protein-level. These linked TFs may be involved in several shared functions at the molecular level, such as coordinating to co-regulate shared target genes (TGs).

Please note that the code for this demo example is demo_toy.py in the demo folder.

from DemoDataBuilderXandY import generate_dummy_data
from Netrem_model_builder import netrem
import PriorGraphNetwork as graph
import error_metrics as em 
import essential_functions as ef
import netrem_evaluation_functions as nm_eval

num_samples = 10000 # number of samples (e.g. single-cells)
corrs_list =  [0.9, 0.5, 0.4, -0.3, -0.8] # the individual correlations (r) of each X variable with y
train_p = 70 # the % of original data we use for M
dummy_data = generate_dummy_data(corrVals = corrs_list, # the # of elements in corrVals is the # of predictors (X)
                                 num_samples_M = num_samples, # the number of samples M is: (num_samples_M)*(train_data_percent)
                                 sparsity_factor_perc = 40, # approximately what % of data for each variable is sparse (0)
                                 train_data_percent = train_p) # the remainder out of 10,000 will be kept for testing. If 100, then ALL data is used for training and testing.

The Python console or Jupyter notebook will print out the following:

same_train_test_data = False
Please note that since we hold out 30.0% of our 100000 samples for testing, we have:
X_train = 7,000 rows (samples) and 5 columns (N = 5 predictors) for training.
X_test = 3,000 rows (samples) and 5 columns (N = 5 predictors) for testing.
y_train = 7,000 corresponding rows (samples) for training.
y_test = 3,000 corresponding rows (samples) for testing.

The $X$ data should be in the form of a Pandas dataframe as below:

X_df = dummy_data.X_df
X_df.head(10) # please view the first 10 rows
TF1 TF2 TF3 TF4 TF5
0 0.000000 1.431629 0.000000 0.000000 0.000000
1 0.000000 0.992102 0.774417 -1.464686 0.000000
2 -0.856293 0.000000 -1.682493 0.906734 -0.557047
3 -0.657480 1.008181 0.000000 1.426829 -0.699446
4 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000000 -1.001249 0.000000 0.000000 1.882605
6 1.258316 1.228853 1.789500 -1.373181 -3.245509
7 1.090277 0.000000 0.000000 1.250911 -0.683330
8 0.666747 0.000000 0.654979 0.573183 0.000000
9 -3.076390 -1.082092 -1.573686 0.742459 3.025879
y_df = dummy_data.y_df
y_df.head(10) # please view the first 10 rows
y
0 0.711674
1 0.000000
2 -1.001871
3 0.000000
4 0.000000
5 -1.141293
6 2.654407
7 1.440605
8 0.000000
9 -3.121532
# M = 7,000 samples (cells) for training data (used to train and fit NetREm model)
X_train = dummy_data.view_X_train_df()
y_train = dummy_data.view_y_train_df()

# 3,000 samples (cells) for testing data
X_test = dummy_data.view_X_test_df()
y_test = dummy_data.view_y_test_df()

Our generated single-cell gene expression data (X, y) looks like this:

X_train.corr() # pairwise correlations among the training samples
TF1 TF2 TF3 TF4 TF5
TF1 1.000000 0.438854 0.360903 -0.269164 -0.708270
TF2 0.438854 1.000000 0.176441 -0.139661 -0.391539
TF3 0.360903 0.176441 1.000000 -0.128266 -0.329087
TF4 -0.269164 -0.139661 -0.128266 1.000000 0.234778
TF5 -0.708270 -0.391539 -0.329087 0.234778 1.000000
X_test.corr() # pairwise correlations among the testing samples
TF1 TF2 TF3 TF4 TF5
TF1 1.000000 0.426107 0.336718 -0.266472 -0.759382
TF2 0.426107 1.000000 0.153369 -0.136285 -0.379380
TF3 0.336718 0.153369 1.000000 -0.088676 -0.294519
TF4 -0.266472 -0.136285 -0.088676 1.000000 0.243514
TF5 -0.759382 -0.379380 -0.294519 0.243514 1.000000
dummy_data.combined_correlations_df # breakdown of the correlations in training, testing, and overall data
i predictor expected_corr_with_Y actual_corr difference X_group num_samples sparsity_factor_perc
0 0 TF1 0.9 0.922158 0.022158 Overall unique 10000 70
1 1 TF2 0.5 0.462752 0.037248 Overall unique 10000 70
2 2 TF3 0.4 0.363821 0.036179 Overall unique 10000 70
3 3 TF4 -0.3 -0.272976 0.027024 Overall unique 10000 70
4 4 TF5 -0.8 -0.802524 0.002524 Overall unique 10000 70
0 0 TF1 0.9 0.920876 0.020876 Training unique 7000 70
1 1 TF2 0.5 0.459479 0.040521 Training unique 7000 70
2 2 TF3 0.4 0.365248 0.034752 Training unique 7000 70
3 3 TF4 -0.3 -0.267220 0.032780 Training unique 7000 70
4 4 TF5 -0.8 -0.795406 0.004594 Training unique 7000 70
0 0 TF1 0.9 0.925128 0.025128 Testing unique 3000 70
1 1 TF2 0.5 0.470230 0.029770 Testing unique 3000 70
2 2 TF3 0.4 0.360628 0.039372 Testing unique 3000 70
3 3 TF4 -0.3 -0.285976 0.014024 Testing unique 3000 70
4 4 TF5 -0.8 -0.818860 0.018860 Testing unique 3000 70

We note that NetREm will make this input TF-TF Protein-Protein Interaction (PPI) Network (PPIN) fully connected by adding a default edge weight ŋ = 0.01 for missing edges. This is for numerical stability and to also help propel the discovery of novel TF-TF coordination links.

# prior network edge_list (missing edges or edges with no edge weight will be added with the default_edge_list so the network is fully-connected):
edge_list = [["TF1", "TF2", 0.8], ["TF4", "TF5", 0.95], ["TF1", "TF3"], ["TF1", "TF4"], ["TF1", "TF5"], 
             ["TF2", "TF3"], ["TF2", "TF4"], ["TF2", "TF5"], ["TF3", "TF4"], ["TF3", "TF5"]]

beta_network_val = 1 
alpha_lasso_val = 0.1 # the default, so we do not need to specify it. 
# by default, model_type is Lasso, so alpha_lasso_val will be specified for the alpha_lasso parameter. 
# However, if we specify model_type = Lasso, so our alpha_lasso is determined by cross-validation on training data).

# Building the network regularized regression model: 
# By default, edges are constructed between all of the nodes; nodes with a missing edge are assigned the default_edge_weight. 
netrem_demo = netrem(edge_list = edge_list, 
                     beta_net = beta_network_val,
                     alpha_lasso = alpha_lasso_val, # the default is 0.1. 
                     model_type = "Lasso",
                     view_network = True)

# Fitting the NetREm model on training data: X_train and y_train:
netrem_demo.fit(X_train, y_train)

png

png

To view and extract the predicted model coefficients c for the predictors:

netrem_demo.model_coef_df
y_intercept TF1 TF2 TF3 TF4 TF5
0 None 0.376611 0.249198 0.034235 -0.091983 -0.188037

In the context of gene regulation (in biology), we predict that predictors with negative NetREm coefficients for target gene (TG) $y$ may be repressors (their activity focuses on reducing expression of $y$) and those with positive coefficients for $y$ may be activators.

To view the TG-specific TF-TF coordination (of direct and/or indirect interactions among TFs) that NetREm learned for this target gene $y$, in our given cell-type, we can view the B_interaction_df.

netrem_demo.B_interaction_df
TF1 TF2 TF3 TF4 TF5
TF1 0.000000 71.466718 41.657612 -38.149829 -97.923796
TF2 71.466718 0.000000 16.547463 -20.520985 -54.808259
TF3 41.657612 16.547463 0.000000 -24.335893 -51.672959
TF4 -38.149829 -20.520985 -24.335893 0.000000 100.000000
TF5 -97.923796 -54.808259 -51.672959 100.000000 0.000000

Based on the NetREm model, we predict:

  • TF predictors with positive $c^{*} > 0$ coefficient signs are activators of TG: linked to increased gene expression levels $y$ of the TG.
  • TF predictors with negative $c^{*} < 0$ coefficient signs are repressors of TG: linked to decreased gene expression levels $y$ of the TG.
  • TF predictors with negative $c^{*} = 0$ coefficient signs are not selected as final TFs for the TG based on the input data.

Our TG-specific TF-TF coordination network (i.e. coordination network among predictors for this $y$ variable) has coordination scores $B$ that are between -100 and 100. That is, -100 ≤ $B$ ≤ 100.

These tend to reflect the grouped variable selection property of network-regularized regression models that tend to select TF predictors that also group together in biological pathways in the original Protein-Protein Interaction Network (PPIN).

We predict positive scores 0 < $B$ ≤ 100 between predictors with the same coefficient sign (e.g. $TF_1$ and $TF_2$ have a score of 71.466718, which is higher than their links with $TF3$ since TF3 has weak links with them in the input TF-TF PPIN). This implies potential cooperativity (e.g. cobinding, pioneer/settler models) among them. They have a shared goal and $B &gt; 0$:

  • TF1, TF2, TF3 have a common goal of co-activating TG expression levels $y$ ➡️ increased $y$
  • TF4 and TF5 have a common goal of co-repressing TG expression levels $y$ ➡️ decreased $y$

We predict negative scores -100 ≤ $B$ < 0 between predictors with different coefficients sign (e.g. $TF_1$ and $TF_5$ have a score of -97.923796). This implies potential antagonistic activity (e.g. competition for binding sites, sequestration) among them: activator-repressor antagonism. For instance, these TFs may compete to regulate the $TG$ through biological mechanisms that may be investigated further through experiments. They have opposing goals that may partly cancel each other out so $B &lt; 0$:

  • TF1, TF2, TF3 each have antagonistic relations with TF4 and TF5.

We can test the performance of our data on testing data $X_{test}$ (3,000 samples), to understand better the generalizability of our NetREm model on new, unseen, data.

pred_y_test = netrem_demo.predict(X_test) # predicted values for y_test
mse_test = netrem_demo.test_mse(X_test, y_test)

print(f"The testing Mean Square Error (MSE) is {mse_test}")
The testing Mean Square Error (MSE) is 0.21834495195611514

We can analyze more metrics about our NetREm model results as below, where:

  • low NMSE values are great
  • high values for SNR and PSNR are ideal:
nmse_test  = em.nmse(y_test.values.flatten(), pred_y_test)
print(f"The testing Normalized Mean Square Error (NMSE) is {nmse_test}")
The testing Normalized Mean Square Error (NMSE) is 0.22013278683483584
snr_test  = em.snr(y_test.values.flatten(), pred_y_test)
print(f"The testing Signal-to-Noise Ratio (SNR) is {snr_test}")
The testing Signal-to-Noise Ratio (SNR) is 6.573152683008163
psnr_test  = em.psnr(y_test.values.flatten(), pred_y_test)
print(f"The testing Peak Signal-to-Noise Ratio (PSNR) is {psnr_test}")
The testing Peak Signal-to-Noise Ratio (PSNR) is 17.629710652383

These are additional metrics:

netrem_demo.final_corr_vs_coef_df
info input_data TF1 TF2 TF3 TF4 TF5
0 network regression coeff. with y: y X_train 0.376611 0.249198 0.034235 -0.091983 -0.188037
0 corr (r) with y: y X_train 0.893875 0.494397 0.401819 -0.295085 -0.792749
0 Absolute Value NetREm Coefficient Ranking X_train 1 2 5 4 3

In the context of gene regulation, our results may thereby be interpreted in the following way:

Nonetheless, NetREm can be applied to solve a suite of regression problems where there is an underlying connection among the predictors (that group together in meaningful subnetworks to influence the outcome $y$) and their correlation with one another may be utilized jointly for the predictive task rather than discarded.

We also provide a suite of evaluation functions and explanations of more advanced functionalities in our User Guide. For instance, we provide more details on our netrem and netremcv functions in terms of predicting gene regulation in myelinating Schwann cells here.

netrem_demo.combined_df
coef TF TG info train_mse train_nmse train_snr train_psnr beta_net alpha_lasso AbsoluteVal_coefficient Rank final_model_TFs TFs_input_to_model original_TFs_in_X standardized_X standardized_y centered_y
0 None y_intercept y netrem_no_intercept 0.219221 0.219221 6.591179 17.777226 1 0.1 NaN 6 5 5 5 True True False
1 0.376611 TF1 y netrem_no_intercept 0.219221 0.219221 6.591179 17.777226 1 0.1 0.376611 1 5 5 5 True True False
2 0.249198 TF2 y netrem_no_intercept 0.219221 0.219221 6.591179 17.777226 1 0.1 0.249198 2 5 5 5 True True False
3 0.034235 TF3 y netrem_no_intercept 0.219221 0.219221 6.591179 17.777226 1 0.1 0.034235 5 5 5 5 True True False
4 -0.091983 TF4 y netrem_no_intercept 0.219221 0.219221 6.591179 17.777226 1 0.1 0.091983 4 5 5 5 True True False
5 -0.188037 TF5 y netrem_no_intercept 0.219221 0.219221 6.591179 17.777226 1 0.1 0.188037 3 5 5 5 True True False
netrem_demo.TF_coord_scores_pairwise_df
23 TF4 TF5 100.000000 100.000000
19 TF5 TF4 100.000000 100.000000
20 TF1 TF5 -97.923796 97.923796
4 TF5 TF1 -97.923796 97.923796
1 TF2 TF1 71.466718 71.466718
5 TF1 TF2 71.466718 71.466718
9 TF5 TF2 -54.808259 54.808259
21 TF2 TF5 -54.808259 54.808259
14 TF5 TF3 -51.672959 51.672959
22 TF3 TF5 -51.672959 51.672959
10 TF1 TF3 41.657612 41.657612
2 TF3 TF1 41.657612 41.657612
15 TF1 TF4 -38.149829 38.149829
3 TF4 TF1 -38.149829 38.149829
17 TF3 TF4 -24.335893 24.335893
13 TF4 TF3 -24.335893 24.335893
8 TF4 TF2 -20.520985 20.520985
16 TF2 TF4 -20.520985 20.520985
7 TF3 TF2 16.547463 16.547463
11 TF2 TF3 16.547463 16.547463
organize_predictor_interaction_network(netrem_demo)
node_1 node_2 coord_score_cs sign potential_interaction absVal_coordScore model_type info candidate_TFs_N target_gene_y ... cs_magnitude_rank cs_magnitude_percentile TF_TF original_corr standardized_corr PPI_score novel_link absVal_diff_cs_and_originalCorr c_1 c_2
0 TF4 TF5 100.000000 :) :) cooperative (+) 100.000000 Lasso matrix of TF-TF interactions 5 y ... 1.0 95.0 TF4_TF5 0.234778 0.234778 0.95 no 99.765222 -0.091983 -0.188037
1 TF5 TF4 100.000000 :) :) cooperative (+) 100.000000 Lasso matrix of TF-TF interactions 5 y ... 1.0 95.0 TF5_TF4 0.234778 0.234778 0.95 no 99.765222 -0.188037 -0.091983
2 TF1 TF5 -97.923796 :( :( competitive (-) 97.923796 Lasso matrix of TF-TF interactions 5 y ... 3.0 85.0 TF1_TF5 -0.708270 -0.708270 0.01 yes 97.215526 0.376611 -0.188037
3 TF5 TF1 -97.923796 :( :( competitive (-) 97.923796 Lasso matrix of TF-TF interactions 5 y ... 3.0 85.0 TF5_TF1 -0.708270 -0.708270 0.01 yes 97.215526 -0.188037 0.376611
4 TF2 TF1 71.466718 :) :) cooperative (+) 71.466718 Lasso matrix of TF-TF interactions 5 y ... 5.0 75.0 TF2_TF1 0.438854 0.438854 0.80 no 71.027863 0.249198 0.376611
5 TF1 TF2 71.466718 :) :) cooperative (+) 71.466718 Lasso matrix of TF-TF interactions 5 y ... 5.0 75.0 TF1_TF2 0.438854 0.438854 0.80 no 71.027863 0.376611 0.249198
6 TF5 TF2 -54.808259 :( :( competitive (-) 54.808259 Lasso matrix of TF-TF interactions 5 y ... 7.0 65.0 TF5_TF2 -0.391539 -0.391539 0.01 yes 54.416720 -0.188037 0.249198
7 TF2 TF5 -54.808259 :( :( competitive (-) 54.808259 Lasso matrix of TF-TF interactions 5 y ... 7.0 65.0 TF2_TF5 -0.391539 -0.391539 0.01 yes 54.416720 0.249198 -0.188037
8 TF5 TF3 -51.672959 :( :( competitive (-) 51.672959 Lasso matrix of TF-TF interactions 5 y ... 9.0 55.0 TF5_TF3 -0.329087 -0.329087 0.01 yes 51.343872 -0.188037 0.034235
9 TF3 TF5 -51.672959 :( :( competitive (-) 51.672959 Lasso matrix of TF-TF interactions 5 y ... 9.0 55.0 TF3_TF5 -0.329087 -0.329087 0.01 yes 51.343872 0.034235 -0.188037
10 TF3 TF1 41.657612 :) :) cooperative (+) 41.657612 Lasso matrix of TF-TF interactions 5 y ... 11.0 45.0 TF3_TF1 0.360903 0.360903 0.01 yes 41.296709 0.034235 0.376611
11 TF1 TF3 41.657612 :) :) cooperative (+) 41.657612 Lasso matrix of TF-TF interactions 5 y ... 11.0 45.0 TF1_TF3 0.360903 0.360903 0.01 yes 41.296709 0.376611 0.034235
12 TF1 TF4 -38.149829 :( :( competitive (-) 38.149829 Lasso matrix of TF-TF interactions 5 y ... 13.0 35.0 TF1_TF4 -0.269164 -0.269164 0.01 yes 37.880665 0.376611 -0.091983
13 TF4 TF1 -38.149829 :( :( competitive (-) 38.149829 Lasso matrix of TF-TF interactions 5 y ... 13.0 35.0 TF4_TF1 -0.269164 -0.269164 0.01 yes 37.880665 -0.091983 0.376611
14 TF3 TF4 -24.335893 :( :( competitive (-) 24.335893 Lasso matrix of TF-TF interactions 5 y ... 15.0 25.0 TF3_TF4 -0.128266 -0.128266 0.01 yes 24.207627 0.034235 -0.091983
15 TF4 TF3 -24.335893 :( :( competitive (-) 24.335893 Lasso matrix of TF-TF interactions 5 y ... 15.0 25.0 TF4_TF3 -0.128266 -0.128266 0.01 yes 24.207627 -0.091983 0.034235
16 TF4 TF2 -20.520985 :( :( competitive (-) 20.520985 Lasso matrix of TF-TF interactions 5 y ... 17.0 15.0 TF4_TF2 -0.139661 -0.139661 0.01 yes 20.381324 -0.091983 0.249198
17 TF2 TF4 -20.520985 :( :( competitive (-) 20.520985 Lasso matrix of TF-TF interactions 5 y ... 17.0 15.0 TF2_TF4 -0.139661 -0.139661 0.01 yes 20.381324 0.249198 -0.091983
18 TF3 TF2 16.547463 :) :) cooperative (+) 16.547463 Lasso matrix of TF-TF interactions 5 y ... 19.0 5.0 TF3_TF2 0.176441 0.176441 0.01 yes 16.371022 0.034235 0.249198
19 TF2 TF3 16.547463 :) :) cooperative (+) 16.547463 Lasso matrix of TF-TF interactions 5 y ... 19.0 5.0 TF2_TF3 0.176441 0.176441 0.01 yes 16.371022 0.249198 0.034235

20 rows × 25 columns

We also note that since w_transform_for_d = "none", then to calculate the degree of each TF predictor node, we calculate the sum of all edges $w_{ij}$ for $j = 1$ to $N$ where $j \neq i$. Then, the degree $d$ of $TF_i$ is denoted as $d_i$, and it can be expressed as:

$$ d_i = \sum_{\substack{j=1 \ j \neq i}}^N w_{ij} $$

Instead of self-loops, we have:

$$ w_{ii} = \frac{d_i}{N-1} $$

We preprocess the input TF-TF PPI network to make it fully-connected for all pairwise TF-TF edges (default_edge_weight of 0.01) and diagonals reflecting averaged connectivity values.

Manuscript

Please note that the manuscript associated with NetREm is available as a pre-print on biorxiv: Link to Paper

Citation:

  • Saniya Khullar, Xiang Huang, Raghu Ramesh, John Svaren, Daifeng Wang, NetREm: Network Regression Embeddings reveal cell-type transcription factor coordination for gene regulation, bioRxiv 2023.10.25.563769; doi: https://doi.org/10.1101/2023.10.25.563769

References

[1]: Caiyan Li, Hongzhe Li, Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics, Volume 24, Issue 9, May 2008, Pages 1175–1182, https://doi.org/10.1093/bioinformatics/btn081

[2]: Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

[3]: Zhang, Y., Akutsu, T., & Ching, W. K. (2016). Incorporating network topology and phylogenetic constraints for protein-protein interaction prediction. BMC bioinformatics, 17(1), 1-15. https://doi.org/10.1186/s12859-016-1310-4

[4]: Jia, C., Zhang, Y., Chen, K., & Zhang, S. (2018). A novel feature extraction method with improved consistency for identifying cell cycle regulated genes. Bioinformatics, 34(5), 896-903. https://doi.org/10.1093/bioinformatics/btx657

[5]: Lu, Y., Chen, X., & Hu, Z. (2017). Recognition of protein/gene names from text using an ensemble of classifiers and effective abbreviation resolution. BMC bioinformatics, 18(1), 1-11. https://doi.org/10.1186/s12859-017-1515-1

[6]: Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.