diff --git a/README.md b/README.md
index 7ba1bae..ef436ba 100644
--- a/README.md
+++ b/README.md
@@ -159,51 +159,6 @@ $$ -->
 | **... (additional parameters)** |Read more in the [User Guide: Additional Parameters](https://github.com/SaniyaKhullar/NetREm/blob/main/user_guide/Additional_NetREm_Parameters.md) for more parameters after **model_type** |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
 ### Details:
 We input an edge list of the prior graph network (constrains the model via network-based regularization) and a beta_net ($\beta_{net} \geq 0$, which scales the network-based regularization penalty). The user may specify the alpha_lasso ($\alpha_{lasso} \geq 0$) manually for the lasso regularization on the overall model (if `model_type = Lasso`) or NetREm may select an optimal $\alpha_{lasso}$ based on cross-validation (CV) on the training data (if `model_type = LasssoCV`). Then, **netrem** builds an estimator object from the class Netrem that can then take in input $X$ and $y$ data: transforms them to $\tilde{X}$ and $\tilde{y}$, respectively, and use them to fit a Lasso regression model with a regularization value of $\alpha_{lasso}$. Ultimately, the trained NetREm machine learning model is more reflective of an underlying network structure among predictors and may be more biologically meaningful and interpretable. Nonetheless, NetREm could be applied in various contexts where a network structure is present among the predictors. Input networks are typically weighted and undirected. We provide details, functions, and help with converting directed networks to undirected networks (of similarity values among nodes) [here](https://github.com/SaniyaKhullar/NetREm/blob/main/user_guide/directed_to_undirected_network_example.ipynb).
@@ -301,9 +256,9 @@ $$MSE = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y_i})^2$$
 
 ## Demo (Toy Example) of NetREm:
 
-The goal is to build a machine learning model to predict the gene expression levels of our target gene (TG) $y$ based on the gene expression levels of $N = 6$ Transcription Factors (TFs) [TF1, $TF_{2}$, $TF_{3}$, $TF_{4}$, $TF_{5}$, $TF_{6}$] in a particular cell-type. Assume the gene expression values for each TF are [X1, $X_{2}$, $X_{3}$, $X_{4}$, $X_{5}$, $X_{6}$], respectively. We generate $M = 100$ random samples (rows) of data where the Pearson correlations ($r$) between gene expression of each TF ($X$) with gene expression of TG $y$ as *corrVals*: [cor(TF1, $y$) = 0.9, cor(TF2, $y$) = 0.5, cor(TF3, $y$) = 0.1, cor(TF4, $y$) = -0.2, cor(TF5, $y$) = -0.8, cor(TF6, $y$) = -0.3].
+The goal is to build a machine learning model to predict the gene expression levels of our target gene (TG) $y$ based on the gene expression levels of $N = 5$ Transcription Factors (TFs) [TF1, $TF_{2}$, $TF_{3}$, $TF_{4}$, $TF_{5}$] in a particular cell-type. Assume the gene expression values for each TF are [X1, $X_{2}$, $X_{3}$, $X_{4}$, $X_{5}$], respectively. We generate $M = 100,000$ random samples (rows) of data where the Pearson correlations ($r$) between gene expression of each TF ($X$) with gene expression of TG $y$ as *corrVals*: [cor(TF1, $y$) = 0.9, cor(TF2, $y$) = 0.5, cor(TF3, $y$) = 0.4, cor(TF4, $y$) = -0.3, cor(TF5, $y$) = -0.8].
 
-The dimensions of $X$ are therefore 100 rows by 6 columns (predictors). More details about our *generate_dummy_data* function (and additional parameters we can adjust for) are in [Dummy_Data_Demo_Example.ipynb](https://github.com/SaniyaKhullar/NetREm/blob/main/user_guide/Dummy_Data_Demo_Example.ipynb). Our NetREm estimator also incorporates a constraint of an **undirected weighted prior graph network** of biological relationships among only 5 TFs based on a weighted Protein-Protein Interaction (PPI) network ([TF1, $TF_{2}$, $TF_{3}$, $TF_{4}$, $TF_{5}$]), where higher edge weights $w$ indicate stronger biological interactions at the protein-level.
+The dimensions of $X$ are therefore 100,000 rows by 5 columns (predictors). More details about our *generate_dummy_data* function (and additional parameters we can adjust for) are in [Dummy_Data_Demo_Example.pdf](https://github.com/SaniyaKhullar/NetREm/blob/main/user_guide/Dummy_Data_Demo_Example.pdf). Our NetREm estimator also incorporates a constraint of an **undirected weighted prior graph network** of biological relationships among only 5 TFs based on a weighted Protein-Protein Interaction (PPI) network ([TF1, $TF_{2}$, $TF_{3}$, $TF_{4}$, $TF_{5}$]), where higher edge weights $w$ indicate stronger biological interactions at the protein-level.
 
 The code for this demo example is [demo_toy.py](https://github.com/SaniyaKhullar/NetREm/blob/main/demo/demo_toy.py) in the *demo* folder.
 
@@ -315,20 +270,18 @@ import error_metrics as em
 import essential_functions as ef
 import netrem_evaluation_functions as nm_eval
 
-dummy_data = generate_dummy_data(corrVals = [0.9, 0.5, 0.1, -0.2, -0.8, -0.3], # the # of elements in corrVals is the # of predictors (X)
-                                 num_samples_M = 100, # the number of samples M
-                                 standardize_X = False,
-                                 center_y = False,
+dummy_data = generate_dummy_data(corrVals = [0.9, 0.5, 0.4, -0.3, -0.8], # the # of elements in corrVals is the # of predictors (X)
+                                 num_samples_M = 100000, # the number of samples M
                                  train_data_percent = 70) # the remainder out of 100 will be kept for testing. If 100, all data is used for training and testing.
 ```
 The Python console or Jupyter notebook will print out the following:
 
     same_train_test_data = False
-    We hold out 30.0% of our 100 samples for testing, so that:
-    X_train = 70 rows (samples) and 6 columns (N = 6 predictors) for training.
-    X_test = 30 rows (samples) and 6 columns (N = 6 predictors) for testing.
-    y_train = 70 corresponding rows (samples) for training.
-    y_test = 30 corresponding rows (samples) for testing.
+    Please note that since we hold out 30.0% of our 100000 samples for testing, we have:
+    X_train = 70000 rows (samples) and 5 columns (N = 5 predictors) for training.
+    X_test = 30000 rows (samples) and 5 columns (N = 5 predictors) for testing.
+    y_train = 70000 corresponding rows (samples) for training.
+    y_test = 30000 corresponding rows (samples) for testing.
 
 The $X$ data should be in the form of a Pandas dataframe as below:
 
@@ -339,61 +292,55 @@ X_df.head()
-|   | TF1 | TF2 | TF3 | TF4 | TF5 | TF6 |
-|---|---|---|---|---|---|---|
-| 0 | 2.020840 | 0.594445 | -1.443012 | -0.688777 | 0.900770 | -2.643671 |
-| 1 | 3.224776 | -0.270632 | -0.557771 | -0.305574 | 0.054708 | -1.054197 |
-| 2 | -2.746721 | 1.502236 | 2.043813 | 1.252975 | 2.082159 | 1.227615 |
-| 3 | -0.558130 | 1.290771 | -1.230527 | -0.678410 | 0.630084 | -1.508758 |
-| 4 | -2.181462 | -0.657229 | -2.880186 | -1.629470 | 0.268042 | 1.207254 |
+|   | TF1 | TF2 | TF3 | TF4 | TF5 |
+|---|---|---|---|---|---|
+| 0 | 0.067511 | 1.168162 | 0.500052 | -1.622116 | -0.827644 |
+| 1 | 1.754397 | -1.531472 | 0.067630 | 0.857830 | -1.440013 |
+| 2 | -1.519240 | -0.764829 | 0.823048 | -0.206106 | 0.820908 |
+| 3 | 0.009735 | 2.027954 | 2.092769 | 0.886884 | 0.054337 |
+| 4 | -0.377406 | 0.905750 | -1.167745 | 1.350194 | -0.131234 |
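For readers who want to see how toy data like this can be produced, here is a minimal sketch of one way to build predictors with chosen Pearson correlations to a target $y$. It is not the repository's `generate_dummy_data` implementation, which this diff does not show. The helper name `make_correlated_predictors` is hypothetical. The sample size of 1000 is likewise an illustrative choice. The technique is a plain signal-plus-orthogonal-noise mix.

```python
import numpy as np

def make_correlated_predictors(y, corr_vals, seed=123):
    """Hypothetical helper (an assumption, not NetREm's generate_dummy_data):
    return one column per requested Pearson correlation with y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    y_std = (y - y.mean()) / y.std()  # zero mean, unit variance
    columns = []
    for r in corr_vals:
        noise = rng.normal(size=y_std.shape[0])
        noise -= noise.mean()
        # Remove the component of the noise along y so corr(noise, y) is 0.
        noise -= (noise @ y_std) / (y_std @ y_std) * y_std
        noise /= noise.std()
        # Mix signal and orthogonal noise; the column has unit variance
        # and Pearson correlation r with y.
        columns.append(r * y_std + np.sqrt(1.0 - r ** 2) * noise)
    return np.column_stack(columns)

corr_vals = [0.9, 0.5, 0.4, -0.3, -0.8]  # same corrVals as the updated demo
y = np.random.default_rng(2023).normal(0, 1, 1000)
X = make_correlated_predictors(y, corr_vals)
achieved = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
print(np.round(achieved, 3))
```

Because the noise is made exactly orthogonal to $y$, the sample correlations match the requested values up to floating-point error at any sample size. A generator that draws independent noise without this step would only match them approximately.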
@@ -444,11 +391,11 @@ y_df.head()
 ```python
-# 70 samples for training data (used to train and fit GRegulNet model)
+# 70,000 samples for training data (used to train and fit NetREm model)
 X_train = dummy_data.view_X_train_df()
 y_train = dummy_data.view_y_train_df()
 
-# 30 samples for testing data
+# 30,000 samples for testing data
 X_test = dummy_data.view_X_test_df()
 y_test = dummy_data.view_y_test_df()
 ```
diff --git a/code/DemoDataBuilderXandY.py b/code/DemoDataBuilderXandY.py
index a1a9fe5..bb12db7 100644
--- a/code/DemoDataBuilderXandY.py
+++ b/code/DemoDataBuilderXandY.py
@@ -884,13 +884,12 @@ def view_train_vs_test_data_for_predictor(self, predictor_name):
 
 
 def generate_dummy_data(corrVals,
-                        num_samples_M = 100,
+                        num_samples_M = 10000,
                         train_data_percent = 70,
                         mu = 0,
                         std_dev = 1,
                         iters_to_generate_X = 100,
                         orthogonal_X = False,
-                        ortho_scalar = 10,
                         view_input_corrs_plot = False,
                         verbose = True,
                         rand_seed_x = 123, rand_seed_y = 2023):
diff --git a/code/Netrem_model_builder.py b/code/Netrem_model_builder.py
index 761023f..1e13f6d 100644
--- a/code/Netrem_model_builder.py
+++ b/code/Netrem_model_builder.py
@@ -178,9 +178,14 @@ def updating_network_and_X_during_fitting(self, X, y):
         self.old_y = y
         y = self.center_y_data(y)
 
-        gene_expression_nodes = X_df.columns.tolist() # these are already sorted
-        #gene_expression_nodes = sorted(X_df.columns.tolist()) # these will be sorted
-        ppi_net_nodes = set(self.network_nodes_list)
+        #gene_expression_nodes = X_df.columns.tolist() # these are already sorted
+        tg_name = y.columns.tolist()[0]
+        if tg_name in X_df.columns.tolist():
+            X_df = X_df.drop(columns = [tg_name])
+
+        #gene_expression_nodes = list(set(X_df.columns.tolist()) - tg_name) # these are already sorted
+        gene_expression_nodes = sorted(X_df.columns.tolist()) # these will be sorted
+        ppi_net_nodes = set(self.network_nodes_list) # set(self.network_nodes_list) - tg_name
         common_nodes = list(ppi_net_nodes.intersection(gene_expression_nodes))
 
         if not common_nodes: # may be possible that the X dataframe 
needs to be transposed if provided incorrectly @@ -191,6 +196,7 @@ def updating_network_and_X_during_fitting(self, X, y): self.gene_expression_nodes = gene_expression_nodes self.common_nodes = sorted(common_nodes) + gene_expression_nodes = sorted(gene_expression_nodes) # 10/22 self.final_nodes = gene_expression_nodes if self.overlapped_nodes_only: self.final_nodes = common_nodes @@ -198,7 +204,7 @@ def updating_network_and_X_during_fitting(self, X, y): self.final_nodes = self.prior_network.final_nodes else: self.final_nodes = gene_expression_nodes - + self.final_nodes = sorted(self.final_nodes) # 10/22 final_nodes_set = set(self.final_nodes) ppi_nodes_to_remove = list(ppi_net_nodes - final_nodes_set) self.gexpr_nodes_added = list(set(gene_expression_nodes) - final_nodes_set) @@ -231,6 +237,7 @@ def updating_network_and_X_during_fitting(self, X, y): self.y_train = self.preprocess_y_df(y) return self + def organize_B_interaction_list(self): # TF-TF interactions to output :) self.B_train = self.compute_B_matrix(self.X_train) self.B_interaction_df = pd.DataFrame(self.B_train, index = self.final_nodes, columns = self.final_nodes) @@ -370,6 +377,7 @@ def view_W_network(self): labels = {e: G.edges[e]['weight'] for e in G.edges} return nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, ax=ax) + def compute_B_matrix_times_M(self, X): """ M is N_sample, because ||y - Xc||^2 need to be normalized by 1/n_sample, but not the 2 * beta_L2 * c'Ac term see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html @@ -412,6 +420,7 @@ def compute_X_tilde_y_tilde(self, B, X, y): y_tilde *= scale return X_tilde, y_tilde + def predict_y_from_y_tilda(self, X, X_tilda, pred_y_tilda): X = self.preprocess_X_df(X) @@ -550,57 +559,6 @@ def score(self, X, y, zero_coef_penalty=10): else: return -nmse_ - - -# def score(self, X, y, zero_coef_penalty=10): -# print("Debug: Start of score function") - -# if isinstance(X, pd.DataFrame): -# X = self.preprocess_X_df(X) -# 
print(f"Debug: preprocessed X, nulls: {X.isnull().sum().sum()}") - -# if isinstance(y, pd.DataFrame): -# y = self.preprocess_y_df(y) -# print(f"Debug: preprocessed y, nulls: {y.isnull().sum().sum()}") - -# y_pred = self.predict(X) -# print(f"Debug: y_pred, nulls: {np.isnan(y_pred).sum()}, infs: {np.isinf(y_pred).sum()}") - -# y_pred[y_pred == 0] = 1e-10 -# nmse_ = (y - y_pred)**2 - -# nmse_[y_pred == 1e-10] *= zero_coef_penalty -# denominator = (y**2).mean() - -# print(f"Debug: Denominator: {denominator}") - -# if denominator == 0: -# print("Debug: Denominator is zero.") -# return -1e10 # Some large negative value - -# nmse_ = nmse_.mean() / denominator - -# if nmse_ == 0: -# print("Debug: nmse_ is zero.") -# return -1e10 # Some large negative value - -# print(f"Debug: Returning score: {-nmse_}") -# return -nmse_ - - -# def score(self, X, y, zero_coef_penalty=10): -# if isinstance(X, pd.DataFrame): -# X = self.preprocess_X_df(X) # X_test -# if isinstance(y, pd.DataFrame): -# y = self.preprocess_y_df(y) -# # Make predictions using the predict method of your custom estimator -# y_pred = self.predict(X) -# # Calculate the normalized mean squared error between the true and predicted values -# nmse_ = (y - y_pred)**2 -# nmse_[y_pred==0] *= zero_coef_penalty -# nmse_ = nmse_.mean() / (y**2).mean() -# return -nmse_ # Return the negative normalized mean squared error - def updating_network_A_matrix_given_X(self) -> np.ndarray: """ When we call the fit method, this function is used to help us update the network information. 
@@ -692,7 +650,6 @@ def updating_network_A_matrix_given_X(self) -> np.ndarray: self.tf_names_list = self.nodes return self - def preprocess_X_df(self, X): if isinstance(X, pd.DataFrame): X_df = X @@ -708,19 +665,7 @@ def preprocess_X_df(self, X): X_df = X_df.reindex(columns=gene_names_list)# Reorder columns of dataframe to match order in `column_order` X = np.array(X_df.values.tolist()) return X -# def preprocess_X_df(self, X): -# if isinstance(X, pd.DataFrame): -# column_names_list = X.columns.tolist() -# overlap_num = len(set(column_names_list).intersection(self.final_nodes)) - -# if overlap_num == 0: -# print("Please note: we are flipping X dataframe around so that the rows are samples and the columns are gene/TF names :)") -# X = X.transpose() - -# gene_names_list = self.final_nodes # so that this matches the order of columns in A matrix as well -# X = X[gene_names_list] - -# return X.values + def preprocess_y_df(self, y): if isinstance(y, pd.DataFrame): @@ -835,18 +780,51 @@ def netrem(edge_list, beta_net = 1, alpha_lasso = 0.01, default_edge_weight = 0. return greggy - - -def generate_beta_networks(X_train, y_train, standardize_X, prior_network, overlapped_nodes_only = False, num = 10, max_beta = 200): - """ - Generate a grid of beta_network values to transform X_train. - - Parameters: - X_train (numpy array): training input data - - Returns: - numpy array: grid of beta_network values - """ +def netremCV(edge_list, X, y, + num_beta: int = 10, + extra_beta_list = [0.25, 0.5, 0.75, 1], # additional beta to try out + num_alpha: int = 10, + max_beta: float = 200, # max_beta used to help prevent explosion of beta_net values + reduced_cv_search: bool = False, # should we do a reduced search (Randomized Search) or a GridSearch? 
+ default_edge_weight: float = 0.1, + degree_threshold: float = 0.5, + gene_expression_nodes = [], + overlapped_nodes_only: bool = False, + standardize_X: bool = True, + center_y: bool = True, + y_intercept: bool = False, + model_type = "Lasso", + lasso_selection = "cyclic", + all_pos_coefs: bool = False, + tolerance: float = 1e-4, + maxit: int = 10000, + num_jobs: int = -1, + num_cv_folds: int = 5, + lassocv_eps: float = 1e-3, + lassocv_n_alphas: int = 100, # default in sklearn + lassocv_alphas = None, # default in sklearn + verbose = False, + searchVerbosity: int = 2, + show_warnings: bool = False): + + X_train = X + y_train = y + if show_warnings == False: + warnings.filterwarnings('ignore') + prior_graph_dict = {"edge_list": edge_list, + "gene_expression_nodes":gene_expression_nodes, + "edge_values_for_degree": False, + "consider_self_loops":False, + "pseudocount_for_degree":1e-3, + "default_edge_weight": default_edge_weight, + "w_transform_for_d":"none", + "threshold_for_degree": degree_threshold, + "verbose":verbose, + "view_network":False} + + prior_network = graph.PriorGraphNetwork(**prior_graph_dict) + + # generate the beta grid: if isinstance(X_train, pd.DataFrame): X_df = X_train gene_names_list = list(X_df.columns) @@ -860,267 +838,172 @@ def generate_beta_networks(X_train, y_train, standardize_X, prior_network, overl X_df = X_df.reindex(columns=common_nodes) else: X_df = X_df.reindex(columns=gene_names_list) - + + X_train_np = X_df.copy() + y_train_np = y_train.copy() if standardize_X: - print("standardizing X :)") + if verbose: + print("standardizing X :)") scaler = preprocessing.StandardScaler().fit(X_df) - X_train = scaler.transform(X_df) + X_train_np = scaler.transform(X_df) else: - X_train = np.array(X_df.values.tolist()) + X_train_np = np.array(X_df.values.tolist()) if isinstance(y_train, pd.DataFrame): - y_train = y_train.values.flatten() - beta_max = 0.5 * np.max(np.abs(X_train.T.dot(y_train))) + y_train_np = y_train_np.values.flatten() + 
beta_max = 0.5 * np.max(np.abs(X_train_np.T.dot(y_train_np))) beta_min = 0.01 * beta_max - - var_X = np.var(X_train) - var_y = np.var(y_train) + + var_X = np.var(X_train_np) + var_y = np.var(y_train_np) if beta_max > max_beta: # max_beta used to prevent explosion of beta_net values - print(":) using variance to define beta_net values") + if verbose: + print(":) using variance to define beta_net values") beta_max = 0.5 * np.max(np.abs(var_X * var_y)) * 100 beta_min = 0.01 * beta_max - print(f"beta_min = {beta_min} and beta_max = {beta_max}") - - return np.logspace(np.log10(beta_max), np.log10(beta_min), num=num) - - -def generate_alpha_beta_pairs(X_train, - y_train, - prior_network, - overlapped_nodes_only: bool = False, - standardize_X: bool = True, - center_y: bool = True, - num_beta: int = 50, - num_alpha: int = 10, - max_beta: float = 200, - y_intercept: bool = False, - maxit: int = 10000, - all_pos_coefs: bool = False, - tolerance = 1e-4, - lasso_selection = "cyclic", - num_cv_folds = 5, - num_jobs = -1, - lassocv_eps = 1e-3, - lassocv_n_alphas = 100, - lassocv_alphas = None) -> dict: - """ - Generate a pairwise set of alpha_lasso and beta_network values. - - Parameters: - X_train (numpy array): training input data - y_train (numpy array): training output data - prior_network: The prior network to be used. - overlapped_nodes_only (bool): Whether to use only overlapped nodes. Default is False. - num (int): The number of beta_network values to generate. Default is 100. - - Returns: - dict: Dictionary containing grid of alpha_lasso values and beta_network values. 
- """ - beta_grid = generate_beta_networks(X_train, y_train, standardize_X, prior_network, overlapped_nodes_only, num=num_beta, max_beta = max_beta) + if verbose: + print(f"beta_min = {beta_min} and beta_max = {beta_max}") + beta_grid = np.logspace(np.log10(beta_max), np.log10(beta_min), num=num_beta) + if extra_beta_list != None: + if len(extra_beta_list) > 0: + for add_beta in extra_beta_list: # we add additional beta based on user-defined list + beta_grid = np.append(add_beta, beta_grid) + + beta_alpha_grid_dict = {"beta_network_vals": [], "alpha_lasso_vals": []} - + # generating the alpha-values that are corresponding try: with tqdm(beta_grid, desc=":) Generating beta_net and alpha_lasso pairs") as pbar: for beta in pbar: + if verbose: + print("beta_network:", beta) # please fix it so it reflects what we want more... like the proper defaults - netremCV_demo = nm.NetREmModel(beta_network=beta, - model_type="LassoCV", - network=prior_network, - standardize_X = standardize_X, - center_y = center_y, - overlapped_nodes_only=overlapped_nodes_only) -# netremCV_demo = nm.NetREmModel(beta_network=beta, -# model_type="LassoCV", -# network=prior_network, -# overlapped_nodes_only=overlapped_nodes_only, -# standardize_X = standardize_X, -# y_intercept = y_intercept, -# max_lasso_iterations = maxit, -# all_pos_coefs = all_pos_coefs, -# tolerance = tolerance, -# lasso_selection = lasso_selection, -# num_cv_folds = num_cv_folds, -# #num_jobs = num_jobs, -# lassocv_eps = lassocv_eps, -# lassocv_n_alphas = lassocv_n_alphas, -# lassocv_alphas = lassocv_alphas) - + netremCV_demo = NetREmModel(beta_net=beta, + model_type="LassoCV", + network=prior_network, + overlapped_nodes_only=overlapped_nodes_only, + standardize_X = standardize_X, + center_y = center_y, + y_intercept = y_intercept, + max_lasso_iterations = maxit, + all_pos_coefs = all_pos_coefs, + tolerance = tolerance, + lasso_selection = lasso_selection, + num_cv_folds = num_cv_folds, + #num_jobs = num_jobs, + lassocv_eps = 
lassocv_eps, + lassocv_n_alphas = lassocv_n_alphas, + lassocv_alphas = lassocv_alphas) + if lassocv_alphas != None: + netremCV_demo.lassocv_alphas = lassocv_alphas + # Fit the model and compute alpha_max and alpha_min netremCV_demo.fit(X_train, y_train) X_tilda_train = netremCV_demo.X_tilda_train y_tilda_train = netremCV_demo.y_tilda_train alpha_max = 0.5 * np.max(np.abs(X_tilda_train.T.dot(y_tilda_train))) alpha_min = 0.01 * alpha_max - + if verbose: + print(f"alpha_min = {alpha_min} and alpha_max = {alpha_max}") + # Generate alpha_grid based on alpha_max and alpha_min optimal_alpha = netremCV_demo.regr.alpha_ - alpha_grid = np.append(optimal_alpha, np.logspace(np.log10(alpha_min), np.log10(alpha_max), num=num_alpha)) - + # take the cross-validation alpha and apply as the best alpha as well for this beta_net + beta_alpha_grid_dict["beta_network_vals"].append(beta) + beta_alpha_grid_dict["alpha_lasso_vals"].append(optimal_alpha) + # we also utilize the other alphas we have constructed dynamically and will find the best alpha among those + alpha_grid = np.logspace(np.log10(alpha_min), np.log10(alpha_max), num=num_alpha) + # Find the best alpha using cross-validation best_alpha = None best_score = float('-inf') for alpha in alpha_grid: - #netremCV_demo.regr.set_params(alpha=alpha) -# netremCV_demo = nm.NetREmModel(beta_network=beta, -# alpha_lasso = alpha, -# model_type="Lasso", -# network=prior_network, -# standardize_X = standardize_X, -# overlapped_nodes_only=overlapped_nodes_only, -# y_intercept = y_intercept, -# max_lasso_iterations = maxit, -# all_pos_coefs = all_pos_coefs, -# tolerance = tolerance, -# lasso_selection = lasso_selection, -# num_cv_folds = num_cv_folds, -# #num_jobs = num_jobs, -# lassocv_eps = lassocv_eps, -# lassocv_n_alphas = lassocv_n_alphas, -# lassocv_alphas = lassocv_alphas) - netremCV_demo = nm.NetREmModel(beta_network=beta, - alpha_lasso = alpha, - standardize_X = standardize_X, - center_y = center_y, - model_type="Lasso", - 
network=prior_network, - overlapped_nodes_only=overlapped_nodes_only) - scores = cross_val_score(netremCV_demo, X_train, y_train, cv=5) # You can change cv to your specific cross-validation strategy + netremCV_demo = NetREmModel(beta_net=beta, + alpha_lasso = alpha, + model_type="Lasso", + network=prior_network, + standardize_X = standardize_X, + center_y = center_y, + overlapped_nodes_only=overlapped_nodes_only, + y_intercept = y_intercept, + max_lasso_iterations = maxit, + all_pos_coefs = all_pos_coefs, + tolerance = tolerance, + lasso_selection = lasso_selection) + scores = cross_val_score(netremCV_demo, X_train, y_train, cv=num_cv_folds, scoring = "neg_mean_squared_error") # You can change cv to your specific cross-validation strategy mean_score = np.mean(scores) if mean_score > best_score: best_score = mean_score best_alpha = alpha - + # Append the beta and best_alpha to the dictionary beta_alpha_grid_dict["beta_network_vals"].append(beta) beta_alpha_grid_dict["alpha_lasso_vals"].append(best_alpha) - + except Exception as e: print(f"An error occurred: {e}") - print("finished generate_alpha_beta_pairs") - print(beta_alpha_grid_dict) - return beta_alpha_grid_dict - - -# Custom scoring function -def custom_mse(y_true, y_pred): - mse = mean_squared_error(y_true, y_pred) - pbar.update(1) # Update the progress bar - return -mse # Negate because GridSearchCV tries to maximize the score - - -def netremCV(edge_list, X, y, - num_beta: int = 50, - num_alpha: int = 10, - max_beta: float = 200, # max_beta used to help prevent explosion of beta_net values - reduced_cv_search: bool = False, # should we do a reduced search (Randomized Search) or a GridSearch? 
- default_edge_weight: float = 0.1, - degree_threshold: float = 0.5, - gene_expression_nodes = [], - overlapped_nodes_only: bool = False, - standardize_X: bool = True, - center_y: bool = True, - y_intercept: bool = False, - model_type = "Lasso", - lasso_selection = "cyclic", - all_pos_coefs: bool = False, - tolerance: float = 1e-4, - maxit: int = 10000, - num_jobs: int = -1, - num_cv_folds: int = 5, - lassocv_eps: float = 1e-3, - lassocv_n_alphas: int = 100, # default in sklearn - lassocv_alphas = None, # default in sklearn - verbose = False, - searchVerbosity: int = 2): - - prior_graph_dict = {"edge_list": edge_list, - "gene_expression_nodes":gene_expression_nodes, - "edge_values_for_degree": False, - "consider_self_loops":False, - "pseudocount_for_degree":1e-3, - "default_edge_weight": default_edge_weight, - "w_transform_for_d":"none", - "threshold_for_degree": degree_threshold, - "verbose":verbose, - "view_network":False} - - network_to_use = graph.PriorGraphNetwork(**prior_graph_dict) - X_train = X - y_train = y - beta_alpha_grid_dict = generate_alpha_beta_pairs(X_train, - y_train, network_to_use, - overlapped_nodes_only, standardize_X, center_y, - num_beta, num_alpha, - y_intercept, - maxit, - all_pos_coefs, - tolerance, - lasso_selection, - num_cv_folds, - num_jobs, - lassocv_eps, - lassocv_n_alphas, - lassocv_alphas) + if verbose: + print("finished generate_alpha_beta_pairs") + print(beta_alpha_grid_dict) print(f"Length of beta_alpha_grid_dict: {len(beta_alpha_grid_dict['beta_network_vals'])}") - param_grid = [{"alpha_lasso": [alpha_las], "beta_net": [beta_net]} - for alpha_las, beta_net in zip(beta_alpha_grid_dict["alpha_lasso_vals"], - beta_alpha_grid_dict["beta_network_vals"])] - - - + param_grid = [{"alpha_lasso": [alpha_las], "beta_net": [beta_net]} + for alpha_las, beta_net in zip(beta_alpha_grid_dict["alpha_lasso_vals"], + beta_alpha_grid_dict["beta_network_vals"])] + if verbose: print(":) Performing NetREmCV with both beta_network and alpha_lasso as 
UNKNOWN.") - - initial_greg = nm.NetREmModel(network=network_to_use, - y_intercept = y_intercept, - standardize_X = standardize_X, - center_y = center_y, - max_lasso_iterations=maxit, - all_pos_coefs=all_pos_coefs, - lasso_selection = lasso_selection, - tolerance = tolerance, - view_network=False, - overlapped_nodes_only=overlapped_nodes_only) - - pbar = tqdm(total=len(param_grid)) # Assuming we're trying 9 combinations of parameters - - if reduced_cv_search: - # Run RandomizedSearchCV + initial_greg = NetREmModel(network=prior_network, + y_intercept = y_intercept, + standardize_X = standardize_X, + center_y = center_y, + max_lasso_iterations=maxit, + all_pos_coefs=all_pos_coefs, + lasso_selection = lasso_selection, + tolerance = tolerance, + view_network=False, + overlapped_nodes_only=overlapped_nodes_only) + pbar = tqdm(total=len(param_grid)) # Assuming we're trying 9 combinations of parameters + + if reduced_cv_search: + # Run RandomizedSearchCV + if verbose: print(f":) since reduced_cv_search = {reduced_cv_search}, we perform RandomizedSearchCV on a reduced search space") - grid_search= RandomizedSearchCV(initial_greg, - param_grid, - n_iter=num_alpha, - cv=num_cv_folds, - #scoring=make_scorer(custom_mse, greater_is_better=False), - verbose=searchVerbosity) - else: - # Run GridSearchCV - grid_search = GridSearchCV(initial_greg, param_grid=param_grid, cv=num_cv_folds, - #scoring=make_scorer(custom_mse, greater_is_better=False), - verbose = searchVerbosity) - grid_search.fit(X_train, y_train) - - # Extract and display the best hyperparameters - best_params = grid_search.best_params_ - optimal_alpha = best_params["alpha_lasso"] - optimal_beta = best_params["beta_net"] - print(f":) NetREmCV found that the optimal alpha_lasso = {optimal_alpha} and optimal beta_net = {optimal_beta}") - - newest_netrem = nm.NetREmModel(alpha_lasso = optimal_alpha, - beta_net = optimal_beta, - network = network_to_use, - y_intercept = y_intercept, - standardize_X = standardize_X, - 
center_y = center_y, - max_lasso_iterations=maxit, - all_pos_coefs=all_pos_coefs, - lasso_selection = lasso_selection, - tolerance = tolerance, - view_network=False, - overlapped_nodes_only=overlapped_nodes_only) - newest_netrem.fit(X_train, y_train) - train_mse = newest_netrem.test_mse(X_train, y_train) - print(f":) Please note that the training Mean Square Error (MSE) from this fitted NetREm model is {train_mse}") - return newest_netrem + grid_search= RandomizedSearchCV(initial_greg, + param_grid, + n_iter=num_alpha, + cv=num_cv_folds, + scoring = "neg_mean_squared_error", + #scoring=make_scorer(custom_mse, greater_is_better=False), + verbose=searchVerbosity) + else: + # Run GridSearchCV + grid_search = GridSearchCV(initial_greg, param_grid=param_grid, cv=num_cv_folds, + scoring = "neg_mean_squared_error", + #scoring=make_scorer(custom_mse, greater_is_better=False), + verbose = searchVerbosity) + grid_search.fit(X_train, y_train) + + # Extract and display the best hyperparameters + best_params = grid_search.best_params_ + optimal_alpha = best_params["alpha_lasso"] + optimal_beta = best_params["beta_net"] + print(f":) NetREmCV found that the optimal alpha_lasso = {optimal_alpha} and optimal beta_net = {optimal_beta}") + + newest_netrem = NetREmModel(alpha_lasso = optimal_alpha, + beta_net = optimal_beta, + network = prior_network, + y_intercept = y_intercept, + standardize_X = standardize_X, + center_y = center_y, + max_lasso_iterations=maxit, + all_pos_coefs=all_pos_coefs, + lasso_selection = lasso_selection, + tolerance = tolerance, + view_network=False, + overlapped_nodes_only=overlapped_nodes_only) + newest_netrem.fit(X_train, y_train) + train_mse = newest_netrem.test_mse(X_train, y_train) + print(f":) Please note that the training Mean Square Error (MSE) from this fitted NetREm model is {train_mse}") + return newest_netrem def organize_B_interaction_network(netrem_model): @@ -1143,7 +1026,7 @@ def organize_B_interaction_network(netrem_model): 
B_interaction_df["X_standardized"] = netrem_model.standardize_X B_interaction_df["gene_data"] = "training gene expression data" - # Step 1: Sort the DataFrame + # Step 1: Please Sort the DataFrame B_interaction_df = B_interaction_df.sort_values('absVal_B', ascending=False) # Step 2: Get the rank diff --git a/code/__pycache__/DemoDataBuilderXandY.cpython-310.pyc b/code/__pycache__/DemoDataBuilderXandY.cpython-310.pyc index 2488c24..bfd02c9 100644 Binary files a/code/__pycache__/DemoDataBuilderXandY.cpython-310.pyc and b/code/__pycache__/DemoDataBuilderXandY.cpython-310.pyc differ diff --git a/code/__pycache__/Netrem_model_builder.cpython-310.pyc b/code/__pycache__/Netrem_model_builder.cpython-310.pyc index 884519f..2df8416 100644 Binary files a/code/__pycache__/Netrem_model_builder.cpython-310.pyc and b/code/__pycache__/Netrem_model_builder.cpython-310.pyc differ diff --git a/code/old_code/refresh/DemoDataBuilderXandY.py b/code/old_code/refresh/DemoDataBuilderXandY.py new file mode 100644 index 0000000..a1a9fe5 --- /dev/null +++ b/code/old_code/refresh/DemoDataBuilderXandY.py @@ -0,0 +1,919 @@ +# DemoDataBuilder Class: :) +from packages_needed import * +import pandas as pd +import numpy as np +from tqdm.auto import tqdm +import numpy as np +from sklearn.model_selection import train_test_split +import plotly.express as px +class DemoDataBuilderXandY: + """:) Please note that this class focuses on building Y data based on a normal distribution (specified mean + and standard deviation). M is the # of samples we want to generate. Thus, Y is a vector with M elements. + Then, this class returns X for a set of N predictors (each with M # of samples) based on a list of N correlation + values. For instance, if N = 5 predictors (the Transcription Factors (TFs)), we have [X1, X2, X3, X4, X5], + and a respective list of correlation values: [cor(X1, Y), cor(X2, Y), cor(X3, Y), cor(X4, Y), cor(X5, Y)]. 
+ Then, this class will generate X, a matrix of those 5 predictors (based on similar distribution as Y) + with these respective correlations.""" + + _parameter_constraints = { + "test_data_percent": (0, 100), + "mu": (0, None), + "std_dev": (0, None), + "num_iters_to_generate_X": (1, None), + "same_train_test_data": [False, True], + "rng_seed": (0, None), + "randSeed": (0, None), + "ortho_scalar": (1, None), + "orthogonal_X_bool": [True, False], + "view_input_correlations_plot": [False, True], + "num_samples_M": (1, None), + "corrVals": list + } + + def __init__(self, **kwargs): + + # define default values for constants + self.same_train_test_data = False + self.test_data_percent = 30 + self.mu = 0 + self.verbose = True + self.std_dev = 1 + self.num_iters_to_generate_X = 100 + self.rng_seed = 2023 # for Y + self.randSeed = 123 # for X + self.orthogonal_X_bool = True # False adjustment made on 9/20 + self.ortho_scalar = 10 + self.tol = 1e-2 + self.view_input_correlations_plot = False + # reading in user inputs + self.__dict__.update(kwargs) + ##################### other user parameters being loaded and checked + self.same_train_and_test_data_bool = self.same_train_test_data + # check that all required keys are present: + required_keys = ["corrVals", "num_samples_M"] + missing_keys = [key for key in required_keys if key not in self.__dict__] + if missing_keys: + raise ValueError(f":( Please note ye are missing information for these keys: {missing_keys}") + self.M = self.num_samples_M + self.N = self.get_N() + self.y = self.generate_Y() + self.X = self.generate_X() + self.same_train_and_test_data_bool = self.same_train_test_data + if self.same_train_and_test_data_bool: + self.testing_size = 1 + else: + self.testing_size = (self.test_data_percent/100.0) + self.data_sets = self.generate_training_and_testing_data() # [X_train, X_test, y_train, y_test] + self.X_train = self.data_sets[0] + self.X_test = self.data_sets[1] + self.y_train = self.data_sets[2] + self.y_test = 
self.data_sets[3] + + self.tf_names_list = self.get_tf_names_list() + self.corr_df = self.return_correlations_dataframe() + self.combined_correlations_df = self.get_combined_correlations_df() + if self.view_input_correlations_plot: + self.input_correlations_fig = self.view_input_correlations() # store the figure without shadowing the method + self._apply_parameter_constraints() + self.X_train_df = self.view_X_train_df() + self.y_train_df = self.view_y_train_df() + self.X_test_df = self.view_X_test_df() + self.y_test_df = self.view_y_test_df() + self.X_df = self.view_original_X_df() + self.y_df = self.view_original_y_df() + self.combined_train_test_x_and_y_df = self.combine_X_and_y_train_and_test_data() + + def _apply_parameter_constraints(self): + constraints = {**DemoDataBuilderXandY._parameter_constraints} + for key, value in self.__dict__.items(): + if key in constraints: + if isinstance(constraints[key], tuple): + if isinstance(constraints[key][0], type) and not isinstance(value, constraints[key][0]): + setattr(self, key, constraints[key][0]) + elif constraints[key][1] is not None and isinstance(constraints[key][1], type) and not isinstance(value, constraints[key][1]): + setattr(self, key, constraints[key][1]) + elif key == "corrVals": # special case for corrVals + if not isinstance(value, list): + setattr(self, key, list(value)) # coerce to a list rather than storing the `list` type itself + elif value not in constraints[key]: + setattr(self, key, constraints[key][0]) + return self + + def get_tf_names_list(self): + tf_names_list = [] + for i in range(0, self.N): + term = "TF" + str(i+1) + tf_names_list.append(term) + return tf_names_list + + # getter method + def get_N(self): + N = len(self.corrVals) + return N + + def get_X_train(self): + return self.data_sets[0] #X_train + + def get_y_train(self): + return self.data_sets[2] # y_train + + def get_X_test(self): + return self.data_sets[1] + + def get_y_test(self): + return self.data_sets[3] + + def view_original_X_df(self): + import pandas as pd + X_df = pd.DataFrame(self.X, columns = self.tf_names_list) +
return X_df + + def view_original_y_df(self): + import pandas as pd + y_df = pd.DataFrame(self.y, columns = ["y"]) + return y_df + + def view_X_train_df(self): + import pandas as pd + X_train_df = pd.DataFrame(self.X_train, columns = self.tf_names_list) + return X_train_df + + def view_y_train_df(self): + import pandas as pd + y_train_df = pd.DataFrame(self.y_train, columns = ["y"]) + return y_train_df + + def view_X_test_df(self): + X_test_df = pd.DataFrame(self.X_test, columns = self.tf_names_list) + return X_test_df + + def view_y_test_df(self): + y_test_df = pd.DataFrame(self.y_test, columns = ["y"]) + return y_test_df + + def combine_X_and_y_train_and_test_data(self): + X_p1 = self.X_train_df + X_p1["info"] = "training" + X_p2 = self.X_test_df + X_p2["info"] = "testing" + X_combined = pd.concat([X_p1, X_p2]).drop_duplicates() + y_p1 = self.y_train_df + y_p1["info"] = "training" + y_p2 = self.y_test_df + y_p2["info"] = "testing" + y_combined = pd.concat([y_p1, y_p2]).drop_duplicates() + combining_df = X_combined + combining_df["y"] = y_combined["y"] + return combining_df + + def return_correlations_dataframe(self): + corr_info = ["expected_correlations"] * self.N + corr_df = pd.DataFrame(corr_info, columns = ["info"]) + corr_df["TF"] = self.tf_names_list + corr_df["value"] = self.corrVals + corr_df["data"] = "correlations" + return corr_df + + def generate_Y(self): + seed_val = self.rng_seed + rng = np.random.default_rng(seed=seed_val) + y = rng.normal(self.mu, self.std_dev, self.M) + return y + + # Check if Q is orthogonal using the is_orthogonal function + def is_orthogonal(matrix): + """ + Checks if a given matrix is orthogonal. + Parameters: + matrix (numpy.ndarray): The matrix to check + Returns: + bool: True if the matrix is orthogonal, False otherwise. 
+ """ + # Compute the transpose of the matrix + matrix_T = matrix.T + + # Compute the product of the matrix and its transpose + matrix_matrix_T = np.dot(matrix, matrix_T) + + # Check if the product is equal to the identity matrix + return np.allclose(matrix_matrix_T, np.eye(matrix.shape[0])) + +# # Define the modified generate_X function +# def generate_X(self): +# """Generates a design matrix X with the given correlations while introducing noise and dependencies. +# Parameters: +# orthogonal (bool): Whether to generate an orthogonal matrix (default=False). + +# Returns: +# numpy.ndarray: The design matrix X. +# """ +# orthogonal = self.orthogonal_X_bool +# scalar = self.ortho_scalar +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N # len(corrVals) +# numIterations = self.num_iters_to_generate_X +# correlations = self.corrVals +# corrVals = [correlations[0]] + correlations + +# # Step 1: Generate Initial X +# e = np.random.normal(0, 1, (n, numTFs + 1)) +# X = np.copy(e) +# X[:, 0] = y * np.sqrt(1.0 - corrVals[0]**2) / np.sqrt(1.0 - np.corrcoef(y, X[:,0])[0,1]**2) +# for j in range(numIterations): +# for i in range(1, numTFs + 1): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# # Step 2: Add Noise +# noise_scale = 0.1 # You can adjust this value +# X += np.random.normal(0, noise_scale, X.shape) + +# # Step 3: Introduce Inter-dependencies +# # Make the second predictor a combination of the first and third predictors +# X[:, 1] += 0.3 * X[:, 0] + 0.7 * X[:, 2] + +# # Step 4: Adjust for Correlations +# for j in range(numIterations): +# for i in range(1, numTFs + 1): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# if orthogonal: +# # Compute the QR decomposition of X and take only the Q matrix +# Q = np.linalg.qr(X)[0] +# Q = scalar * Q +# return Q[:, 1:] +# else: +# # Return the X matrix without orthogonalization +# return X[:, 1:] + +# # # Display 
the modified function to ensure it looks okay +# # print(generate_X_modified) + +# def generate_X(self): +# """Generates a design matrix X with the given correlations and introduces an interaction term. + +# Returns: +# numpy.ndarray: The design matrix X. +# """ +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N # Number of predictors +# numIterations = self.num_iters_to_generate_X +# corrVals = self.corrVals + +# # Step 1: Generate Initial X based on the specified correlations with Y +# e = np.random.normal(0, 1, (n, numTFs)) +# X = np.copy(e) +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# # Step 2: Introduce Interaction Term into Y +# interaction_term = X[:, 3] * X[:, 4] +# self.y = y + 0.5 * interaction_term # Adjust the coefficient as needed + +# # Step 3: Re-adjust for specified correlations with Y +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(self.y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y + +# return X + + + + # Define the modified generate_X function to highlight the benefits of network-regularized regression +# def generate_X(self): +# """Generates a design matrix X to highlight the benefits of network-regularized regression. + +# Returns: +# numpy.ndarray: The design matrix X. 
+# """ +# np.random.seed(self.randSeed) +# n = len(self.y) +# numTFs = self.N # Number of predictors +# numIterations = self.num_iters_to_generate_X +# corrVals = self.corrVals + +# # Step 1: Generate Initial X based on the specified correlations with Y +# e = np.random.normal(0, 1, (n, numTFs)) +# X = np.copy(e) +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(self.y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y + +# # Step 2: Weaken X2 and X4 as predictors by introducing interactions in Y +# interaction_term = 0.3 * (X[:, 0] * X[:, 1]) + 0.3 * (X[:, 3] * X[:, 4]) # Interaction terms +# self.y = self.y + interaction_term # Update Y + +# # Step 3: Strengthen network edges by making X1 and X2, and X4 and X5 highly correlated +# X[:, 1] = 0.7 * X[:, 0] + 0.3 * X[:, 1] # X1 and X2 +# X[:, 3] = 0.7 * X[:, 4] + 0.3 * X[:, 3] # X4 and X5 + +# # Step 4: Re-adjust for specified correlations with Y +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(self.y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y + +# return X +# def generate_X(self): +# """Generates a design matrix X with the given correlations and introduces specified network edges and interactions. + +# Returns: +# numpy.ndarray: The design matrix X. 
+# """ +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N # Number of predictors +# numIterations = self.num_iters_to_generate_X +# corrVals = self.corrVals + +# # Step 1: Generate Initial X based on the specified correlations with Y +# e = np.random.normal(0, 1, (n, numTFs)) +# X = np.copy(e) +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# # Step 2: Weaken X2 and X4 as predictors by introducing interactions in Y +# self.y = y + 0.3 * (X[:, 1] * X[:, 0]) + 0.3 * (X[:, 3] * X[:, 4]) # Adjust the coefficients as needed + +# # Step 3: Strengthen network edges by making X1 and X2, and X4 and X5 highly correlated +# X[:, 1] = 0.7 * X[:, 0] + 0.3 * X[:, 1] # X1 and X2 +# X[:, 3] = 0.7 * X[:, 4] + 0.3 * X[:, 3] # X4 and X5 + +# # Step 4: Re-adjust for specified correlations with Y +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(self.y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y + +# return X +# def generate_X(self): +# """Generates a design matrix X with given correlations and introduces inter-predictor correlations. + +# Returns: +# numpy.ndarray: The design matrix X. 
+# """ +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N # Number of predictors +# numIterations = self.num_iters_to_generate_X +# corrVals = self.corrVals + +# # Step 1: Generate Initial X based on the specified correlations with Y +# e = np.random.normal(0, 1, (n, numTFs)) +# X = np.copy(e) +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# # Step 2: Introduce Inter-predictor Correlations +# # Make X1 and X2 highly correlated +# X[:, 0] = 0.5 * X[:, 0] + 0.5 * X[:, 1] +# # Make X4 and X5 highly correlated +# X[:, 3] = 0.525 * X[:, 3] + 0.475 * X[:, 4] + +# # Step 3: Re-adjust for specified correlations with Y +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# return X + +# def generate_X(self, tol=1e-4): +# orthogonal = self.orthogonal_X_bool +# scalar = self.ortho_scalar +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N + +# # Initialize X with standard normal distribution +# X = np.random.normal(0, 1, (n, numTFs)) + +# for i in range(numTFs): +# desired_corr = self.corrVals[i] + +# while True: +# # Create a new predictor as a linear combination of original predictor and y +# X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + +# # Standardize the predictor to have mean 0 and variance 1 +# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + +# # Calculate the actual correlation +# actual_corr = np.corrcoef(y, X[:, i])[0, 1] + +# # Calculate the difference between the actual and desired correlations +# diff = abs(actual_corr - desired_corr) + +# if diff < tol: +# break + +# # Orthogonalize the predictors to make them independent of each other +# Q, _ = np.linalg.qr(X) + +# if orthogonal: +# # Scale the orthogonalized predictors +# Q = scalar * Q +# return Q +# else: +# # 
Return the orthogonalized predictors without scaling +# return Q + +# def generate_X(self): +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N +# tol = self.tol + +# # Initialize X with standard normal distribution (vectorized) +# X = np.random.normal(0, 1, (n, numTFs)) + +# # Standardize y for correlation calculation +# y_std = (y - np.mean(y)) / np.std(y) + +# for i in tqdm(range(numTFs), desc="Generating predictors"): +# desired_corr = self.corrVals[i] + +# while True: +# # Orthogonalize Xi against all previous predictors +# for j in range(i): +# coef = np.dot(X[:, i], X[:, j]) / np.dot(X[:, j], X[:, j]) +# X[:, i] -= coef * X[:, j] + +# # Create and standardize new predictor (vectorized) +# X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] +# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + +# # Calculate actual correlation (vectorized) +# actual_corr = np.dot(y_std, X[:, i]) / n + +# # Check if actual correlation is close enough to desired correlation +# if abs(actual_corr - desired_corr) < tol: +# break + +# # Orthogonalize X to reduce inter-predictor correlation (if required) +# if self.orthogonal_X_bool: +# X, _ = np.linalg.qr(X) + +# return X + + def generate_X(self): + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N + tol = self.tol + + # Initialize X with standard normal distribution (vectorized) + X = np.random.normal(0, 1, (n, numTFs)) + + # Standardize y for correlation calculation + y_std = (y - np.mean(y)) / np.std(y) + + for i in tqdm(range(numTFs), desc="Generating predictors"): + desired_corr = self.corrVals[i] + + while True: + # Create and standardize new predictor (vectorized) + X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + + # Calculate actual correlation (vectorized) + actual_corr = np.dot(y_std, X[:, i]) / n + + # Check if actual correlation is close enough to desired 
correlation + if abs(actual_corr - desired_corr) < tol: + break + + # Orthogonalize X to reduce inter-predictor correlation (if required) + if self.orthogonal_X_bool: + X, _ = np.linalg.qr(X) + + return X + def generate_X7(self): + orthogonal = self.orthogonal_X_bool + scalar = self.ortho_scalar + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N + tol = self.tol + + # Initialize X with standard normal distribution + X = np.random.normal(0, 1, (n, numTFs)) + + desc_name = "Generating data for " + str(numTFs) + " Predictors with tolerance of " + str(tol) + " :) " + for i in tqdm(range(numTFs), desc=desc_name): + desired_corr = self.corrVals[i] + + while True: + # Create a new predictor as a linear combination of original predictor and y + X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + + # Standardize the predictor to have mean 0 and variance 1 + X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + + # Calculate the actual correlation + actual_corr = np.corrcoef(y, X[:, i])[0, 1] + + # Calculate the difference between the actual and desired correlations + diff = abs(actual_corr - desired_corr) + + if diff < tol: + break + + # Step 2: Orthogonalize the predictors to remove inter-predictor correlation + X_ortho, _ = np.linalg.qr(X) + + # Step 3: Scale each orthogonalized predictor to match the desired correlation with y + for i in tqdm(range(numTFs), desc="Rescaling orthogonalized predictors"): + desired_corr = self.corrVals[i] + + while True: + # Scale the orthogonalized predictor + X_ortho[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X_ortho[:, i] + + # Standardize the predictor + X_ortho[:, i] = (X_ortho[:, i] - np.mean(X_ortho[:, i])) / np.std(X_ortho[:, i]) + + # Calculate the actual correlation + actual_corr = np.corrcoef(y, X_ortho[:, i])[0, 1] + + # Calculate the difference between the actual and desired correlations + diff = abs(actual_corr - desired_corr) + + if diff < tol: + break + + if 
orthogonal: + # Compute the QR decomposition of X and take only the Q matrix + Q = np.linalg.qr(X_ortho)[0] + Q = scalar * Q + return Q + else: + # Return the X matrix without orthogonalization + return X_ortho + + + def generate_X5(self): + orthogonal = self.orthogonal_X_bool + scalar = self.ortho_scalar + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N + tol = self.tol + jitter = 0.05 # Noise level to reduce correlation between predictors + + # Initialize X with standard normal distribution + X = np.random.normal(0, 1, (n, numTFs)) + + desc_name = "Generating data for " + str(numTFs) + " Predictors with tolerance of " + str(tol) + " :) " + for i in tqdm(range(numTFs), desc=desc_name): + desired_corr = self.corrVals[i] + + while True: + # Create a new predictor as a linear combination of original predictor and y + X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + + # Add a small amount of noise to reduce correlation with other predictors + X[:, i] += jitter * np.random.normal(0, 1, n) + + # Standardize the predictor to have mean 0 and variance 1 + X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + + # Calculate the actual correlation + actual_corr = np.corrcoef(y, X[:, i])[0, 1] + + # Calculate the difference between the actual and desired correlations + diff = abs(actual_corr - desired_corr) + + if diff < tol: + break + + if orthogonal: + # Compute the QR decomposition of X and take only the Q matrix + Q = np.linalg.qr(X)[0] + Q = scalar * Q + return Q + else: + # Return the X matrix without orthogonalization + return X + + def generate_X3(self): + orthogonal = self.orthogonal_X_bool + scalar = self.ortho_scalar + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N + tol = self.tol + # Initialize X with standard normal distribution + X = np.random.normal(0, 1, (n, numTFs)) + desc_name = "Generating data for " + str(numTFs) + " Predictors with tolerance of " + str(tol) + " :) " + for i 
in tqdm(range(numTFs), desc=desc_name): + desired_corr = self.corrVals[i] + + while True: + # Create a new predictor as a linear combination of original predictor and y + X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + + # Standardize the predictor to have mean 0 and variance 1 + X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + + # Calculate the actual correlation + actual_corr = np.corrcoef(y, X[:, i])[0, 1] + + # Calculate the difference between the actual and desired correlations + diff = abs(actual_corr - desired_corr) + + if diff < tol: + break + + if orthogonal: + # Compute the QR decomposition of X and take only the Q matrix + Q = np.linalg.qr(X)[0] + Q = scalar * Q + return Q + else: + # Return the X matrix without orthogonalization + return X + + # Define the function for generating synthetic data with specific correlations and standard normal predictors + def generate_X1(self): + orthogonal = self.orthogonal_X_bool + scalar = self.ortho_scalar + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N + + # Initialize X with standard normal distribution + X = np.random.normal(0, 1, (n, numTFs)) + + # Adjust X to achieve the desired correlations with y + for i in range(numTFs): + corr = self.corrVals[i] + # Create a new predictor as a linear combination of original predictor and y + X[:, i] = corr * y + np.sqrt(1 - corr ** 2) * X[:, i] + + # Standardize the predictor to have mean 0 and variance 1 + X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + + if orthogonal: + # Compute the QR decomposition of X and take only the Q matrix + Q = np.linalg.qr(X)[0] + Q = scalar * Q + return Q + else: + # Return the X matrix without orthogonalization + return X +# def generate_X(self): +# orthogonal = self.orthogonal_X_bool +# scalar = self.ortho_scalar +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N +# numIterations = self.num_iters_to_generate_X +# correlations = self.corrVals 
+# corrVals = [correlations[0]] + correlations + +# # Initialize X with standard normal distribution +# X = np.random.normal(0, 1, (n, numTFs)) + +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y +# # Standardize the predictor +# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + +# if orthogonal: +# # Compute the QR decomposition of X and take only the Q matrix +# Q = np.linalg.qr(X)[0] +# Q = scalar * Q +# return Q +# else: +# # Return the X matrix without orthogonalization +# return X + + +# def generate_X(self): +# orthogonal = self.orthogonal_X_bool +# scalar = self.ortho_scalar +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N +# tol=self.tol +# # Initialize X with standard normal distribution +# X = np.random.normal(0, 1, (n, numTFs)) +# numIterations = self.num_iters_to_generate_X +# for iter_count in range(numIterations): +# max_diff = 0 # Initialize maximum difference between actual and desired correlations for this iteration +# for i in range(numTFs): +# desired_corr = self.corrVals[i] + +# # Create a new predictor as a linear combination of original predictor and y +# X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + +# # Standardize the predictor to have mean 0 and variance 1 +# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + +# # Calculate the actual correlation +# actual_corr = np.corrcoef(y, X[:, i])[0, 1] + +# # Calculate the difference between the actual and desired correlations +# diff = abs(actual_corr - desired_corr) +# max_diff = max(max_diff, diff) + +# # If the maximum difference between actual and desired correlations is below the tolerance, break the loop +# if max_diff < tol: +# break + +# if orthogonal: +# # Compute the QR decomposition of X and take only the Q matrix +# Q = np.linalg.qr(X)[0] +# Q = scalar * Q +# return Q +# else: +# # Return the X matrix without 
orthogonalization +# return X + + def generate_X_old(self): + """Generates a design matrix X with the given correlations. + Parameters: + orthogonal (bool): Whether to generate an orthogonal matrix (default=False). + + Returns: + numpy.ndarray: The design matrix X. + """ + orthogonal = self.orthogonal_X_bool + scalar = self.ortho_scalar + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N # len(corrVals) + numIterations = self.num_iters_to_generate_X + correlations = self.corrVals + corrVals = [correlations[0]] + correlations + e = np.random.normal(0, 1, (n, numTFs + 1)) + X = np.copy(e) + X[:, 0] = y * np.sqrt(1.0 - corrVals[0]**2) / np.sqrt(1.0 - np.corrcoef(y, X[:,0])[0,1]**2) + for j in range(numIterations): + for i in range(1, numTFs + 1): + corr = np.corrcoef(y, X[:, i])[0, 1] + X[:, i] = X[:, i] + (corrVals[i] - corr) * y + + if orthogonal: + # Compute the QR decomposition of X and take only the Q matrix + Q = np.linalg.qr(X)[0] + Q = scalar * Q + return Q[:, 1:] + else: + # Return the X matrix without orthogonalization + return X[:, 1:] + + + def generate_training_and_testing_data(self): + same_train_and_test_data_bool = self.same_train_and_test_data_bool + X = self.X + y = self.y + if same_train_and_test_data_bool == False: # different training and testing datasets + X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = self.testing_size) + if self.verbose: + print(f"Please note that since we hold out {self.testing_size * 100.0}% of our {self.M} samples for testing, we have:") + print(f"X_train = {X_train.shape[0]} rows (samples) and {X_train.shape[1]} columns (N = {self.N} predictors) for training.") + print(f"X_test = {X_test.shape[0]} rows (samples) and {X_test.shape[1]} columns (N = {self.N} predictors) for testing.") + print(f"y_train = {y_train.shape[0]} corresponding rows (samples) for training.") + print(f"y_test = {y_test.shape[0]} corresponding rows (samples) for testing.") + else: # training and testing 
datasets are the same :) + X_train, X_test, y_train, y_test = X, X, y, y + y_train = y + y_test = y_train + X_test = X_train + if self.verbose: + print(f"Please note that we use the same {self.M} samples for training and for testing :). Thus, we have:") + print(f"X_train = X_test = {X_train.shape[0]} rows (samples) and {X_train.shape[1]} columns (N = {self.N} predictors) for training and for testing.") + print(f"y_train = y_test = {y_train.shape[0]} corresponding rows (samples) for training and for testing.") + return [X_train, X_test, y_train, y_test] + + + def get_combined_correlations_df(self): + combined_correlations_df = self.actual_vs_expected_corrs_DefensiveProgramming_all_groups(self.X, self.y, + self.X_train, + self.y_train, + self.X_test, + self.y_test, + self.corrVals, + self.tf_names_list, + self.same_train_and_test_data_bool) + return combined_correlations_df + + def actual_vs_expected_corrs_DefensiveProgramming_all_groups(self, X, y, X_train, y_train, X_test, y_test, + corrVals, tf_names_list, + same_train_and_test_data_bool): + overall_corrs_df = self.compare_actual_and_expected_correlations_DefensiveProgramming_one_data_group(X, y, corrVals, + tf_names_list, same_train_and_test_data_bool, "Overall") + training_corrs_df = self.compare_actual_and_expected_correlations_DefensiveProgramming_one_data_group(X_train, y_train, corrVals, + tf_names_list, same_train_and_test_data_bool, "Training") + testing_corrs_df = self.compare_actual_and_expected_correlations_DefensiveProgramming_one_data_group(X_test, y_test, corrVals, + tf_names_list, same_train_and_test_data_bool, "Testing") + combined_correlations_df = pd.concat([overall_corrs_df, training_corrs_df, testing_corrs_df]).drop_duplicates() + return combined_correlations_df + + def compare_actual_and_expected_correlations_DefensiveProgramming_one_data_group(self, X_matrix, y, corrVals, + predictor_names_list, + same_train_and_test_data_boolean, + data_type): + # please note that this
function by Saniya ensures that the actual and expected correlations are close + # so that the simulation has the x-y correlations we were hoping for in corrVals + updatedDF = pd.DataFrame(X_matrix)#.shape + actualCorrsList = [] + for i in tqdm(range(0, len(corrVals))): + expectedCor = corrVals[i] + actualCor = np.corrcoef(updatedDF[i], y)[0][1] + difference = abs(expectedCor - actualCor) + predictor_name = predictor_names_list[i] + actualCorrsList.append([i, predictor_name, expectedCor, actualCor, difference]) + comparisonDF = pd.DataFrame(actualCorrsList, columns = ["i", "predictor", "expected_corr_with_Y", "actual_corr", "difference"]) + comparisonDF["X_group"] = data_type + num_samples = X_matrix.shape[0] + if same_train_and_test_data_boolean: + comparisonDF["num_samples"] = "same " + str(num_samples) + else: + comparisonDF["num_samples"] = "unique " + str(num_samples) + return comparisonDF + + # Visualizing Functions :) + def view_input_correlations(self): + corr_val_df = pd.DataFrame(self.corrVals, columns = ["correlation"])#.transpose() + corr_val_df.index = self.tf_names_list + corr_val_df["TF"] = self.tf_names_list + fig = px.bar(corr_val_df, x='TF', y='correlation', title = "Input Correlations for Dummy Example", barmode='group') + fig.show() + return fig + + + def view_train_vs_test_data_for_predictor(self, predictor_name): + combined_train_test_x_and_y_df = self.combined_train_test_x_and_y_df + combined_correlations_df = self.combined_correlations_df + print(combined_correlations_df[combined_correlations_df["predictor"] == predictor_name][["predictor", "actual_corr", "X_group", "num_samples"]]) + title_name = title = "Training Versus Testing Data Points for Predictor: " + predictor_name + fig = px.scatter(combined_train_test_x_and_y_df, x=predictor_name, y="y", color = "info", + title = title_name) + #fig.show() + return fig + + +def generate_dummy_data(corrVals, + num_samples_M = 100, + train_data_percent = 70, + mu = 0, + std_dev = 1, + 
iters_to_generate_X = 100, + orthogonal_X = False, + + ortho_scalar = 10, + view_input_corrs_plot = False, + verbose = True, rand_seed_x = 123, rand_seed_y = 2023): + + # the defaults + same_train_test_data = False + test_data_percent = 100 - train_data_percent + if train_data_percent == 100: # since all of the data is used for training, + # then the training and testing data will be the same :) + same_train_test_data = True + test_data_percent = 100 + print(f":) same_train_test_data = {same_train_test_data}") + demo_dict = { + "test_data_percent": test_data_percent, # use the adjusted value computed above + "mu": mu, "std_dev": std_dev, + "num_iters_to_generate_X": iters_to_generate_X, + "same_train_test_data": same_train_test_data, + "rng_seed": rand_seed_y, #2023, # for Y + "randSeed": rand_seed_x, #123, # for X + "ortho_scalar": ortho_scalar, + "orthogonal_X_bool": orthogonal_X, + "view_input_correlations_plot": view_input_corrs_plot, + "num_samples_M": num_samples_M, + "corrVals": corrVals, "verbose":verbose} + dummy_data = DemoDataBuilderXandY(**demo_dict) # + return dummy_data diff --git a/user_guide/Dummy_Data_Demo_Example.ipynb b/code/old_code/refresh/Dummy_Data_Demo_Example.ipynb similarity index 100% rename from user_guide/Dummy_Data_Demo_Example.ipynb rename to code/old_code/refresh/Dummy_Data_Demo_Example.ipynb diff --git a/code/old_code/refresh/NetREm Myelinating Schwann Cells Comprehensive Example.ipynb b/code/old_code/refresh/NetREm Myelinating Schwann Cells Comprehensive Example.ipynb new file mode 100644 index 0000000..df06d8c --- /dev/null +++ b/code/old_code/refresh/NetREm Myelinating Schwann Cells Comprehensive Example.ipynb @@ -0,0 +1,11785 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "bfbc1d90", + "metadata": {}, + "source": [ + "Please note that this example focuses on the raw data utilized for human Myelinating Schwann Cells (mSCs) in the Dorsal Root Ganglion. 
NetREm is applied to predict Transcription Factor (TF) to Target Gene (TG) regulatory links (given by coefficients **c***) as well as potential TF-TF interactions (given by *B* matrix values).\n", + "😊🤓\n", + "\n", + "This is a Bioinformatics Application🧑‍🔬👩‍🔬👨‍🔬👩🏼‍🔬👨🏼‍🔬🧑🏼‍🔬🧑🏻‍🔬👨🏻‍🔬👩🏻‍🔬🧑🏽‍🔬👨🏽‍🔬👩🏽‍🔬🧑🏾‍🔬👨🏾‍🔬👩🏾‍🔬🧑🏿‍🔬👨🏿‍🔬🧬🧫🔬🧑🏿‍💻👨🏿‍💻👩🏿‍💻👩🏾‍💻👨🏾‍💻🧑🏾‍💻👩🏽‍💻👨🏽‍💻🧑🏽‍💻👩🏼‍💻👨🏼‍💻🧑🏼‍💻👩🏻‍💻👨🏻‍💻🧑🏻‍💻👩‍💻👨‍💻🧑‍💻\n", + "\n", + "#### By: Saniya Khullar, Xiang Huang, Raghu Ramesh, John Svaren, Daifeng Wang\n", + "##### University of Wisconsin - Madison" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "a22d2244", + "metadata": {}, + "outputs": [], + "source": [ + "printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs))\n", + "rng_seed = 2023 # random seed for reproducibility\n", + "randSeed = 123\n", + "from packages_needed import *\n", + "import error_metrics as em \n", + "from packages_needed import *\n", + "import Netrem_model_builder as nm\n", + "import DemoDataBuilderXandY as demo\n", + "import PriorGraphNetwork as graph\n", + "import netrem_evaluation_functions as nm_eval\n", + "import essential_functions as ef\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "1c001000", + "metadata": {}, + "source": [ + "![netrem_info.png](../user_guide/pics/netrem_info.png)" + ] + }, + { + "cell_type": "markdown", + "id": "cd4d3d50", + "metadata": {}, + "source": [ + "## Input Datasets for NetREm\n", + "To load in *parquet* files (more efficient than csv files) and write them out, please ensure ye have installed `pyarrow` by running `pip install pyarrow` in the *terminal* 🧑‍💻👩‍💻👨‍💻. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "b72f1d89", + "metadata": {}, + "outputs": [], + "source": [ + "# file names (FNs) of input data: 🥸\n", + "# Please note that Saniya deposited these data files here: \n", + "# https://github.com/SaniyaKhullar/NetREm/tree/main/data/myelin_Schwann_Cells \n", + "\n", + "tfs_for_tgs_FN = \"myelin_candidate_TFs_for_TGs.parquet\"\n", + "train_data_FN = \"myelin_training_gene_expression_data.parquet\"\n", + "test_data_FN = \"myelin_testing_gene_expression_data.parquet\"\n", + "ppi_FN = \"ppi_dataframe.parquet\"" + ] + }, + { + "cell_type": "markdown", + "id": "ab0bd6bc", + "metadata": {}, + "source": [ + "These raw data files for this tutorial are available here: https://github.com/SaniyaKhullar/NetREm/tree/main/data/myelin_Schwann_Cells" + ] + }, + { + "cell_type": "markdown", + "id": "df73cba1", + "metadata": {}, + "source": [ + "A list of potential candidate TFs for each given target gene (TG). Please note that we constructed this input list of candidate TFs for the respective TGs using various data sources such as motif binding analysis, colocalization of TFs, molecular function, etc." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "2e455973", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " TG TF\n", + "0 A1BG CREB3L2\n", + "1 A1BG CTCF\n", + "2 A1BG ELF2\n", + "3 A1BG GTF3C2\n", + "4 A1BG IRF3\n", + "... ... ...\n", + "635390 ZZZ3 BACH1\n", + "635391 ZZZ3 TCF3\n", + "635392 ZZZ3 ERF\n", + "635393 ZZZ3 ZNF281\n", + "635394 ZZZ3 SMAD4\n", + "\n", + "[1784242 rows x 2 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# list of potential candidate TFs for each given target gene (TG)\n", + "tfs_for_tgs_final_df = pd.read_parquet(tfs_for_tgs_FN)\n", + "tfs_for_tgs_final_df" + ] + }, + { + "cell_type": "markdown", + "id": "dc117b82", + "metadata": {}, + "source": [ + "Single-cell gene expression data that is cell samples by genes:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "dc406d0c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Unnamed: 0 SAMD11 NOC2L KLHL17 PLEKHN1 PERM1 HES4 \\\n", + "0 ATCGCCTAGTAGATCA-1_1 0.0 0.0 0.0 0 0 1.232599 \n", + "1 ATGAGGGCATGGGAAC-1_4 0.0 0.0 0.0 0 0 0.000000 \n", + "2 GGGACCTCAGACAAAT-1_3 0.0 0.0 0.0 0 0 0.000000 \n", + "3 CGATGGCCAGATCCTA-1_5 0.0 0.0 0.0 0 0 0.000000 \n", + "4 AATCGACGTGGCACTC-1_5 0.0 0.0 0.0 0 0 0.000000 \n", + ".. ... ... ... ... ... ... ... \n", + "218 GGGAAGTAGCTTAAGA-1_2 0.0 0.0 0.0 0 0 0.000000 \n", + "219 GGAACCCGTCACTTCC-1_4 0.0 0.0 0.0 0 0 0.000000 \n", + "220 CGGGTCACAAACGGCA-1_2 0.0 0.0 0.0 0 0 0.000000 \n", + "221 CACCAAACATAGAATG-1_1 0.0 0.0 0.0 0 0 1.267693 \n", + "222 TGCAGGCCACAGTGAG-1_2 0.0 0.0 0.0 0 0 0.000000 \n", + "\n", + " ISG15 AGRN C1orf159 ... STK26 RTL8B RTL8C RTL8A SMIM10L2B \\\n", + "0 0.0 0.0 0.795283 ... 0 0 0.000000 0 0 \n", + "1 0.0 0.0 0.000000 ... 0 0 0.000000 0 0 \n", + "2 0.0 0.0 0.000000 ... 0 0 0.000000 0 0 \n", + "3 0.0 0.0 0.000000 ... 0 0 2.517899 0 0 \n", + "4 0.0 0.0 0.000000 ... 0 0 0.000000 0 0 \n", + ".. ... ... ... ... ... ... ... ... ... \n", + "218 0.0 0.0 2.360130 ... 0 0 0.000000 0 0 \n", + "219 0.0 0.0 0.000000 ... 0 0 0.000000 0 0 \n", + "220 0.0 0.0 0.000000 ... 0 0 0.000000 0 0 \n", + "221 0.0 0.0 0.822562 ... 0 0 0.000000 0 0 \n", + "222 0.0 0.0 0.000000 ... 0 0 0.000000 0 0 \n", + "\n", + " SMIM10L2A INTS6L ADGRG4 PNMA6A CCNQ \n", + "0 0 0.0 0 0 0.0 \n", + "1 0 0.0 0 0 0.0 \n", + "2 0 0.0 0 0 0.0 \n", + "3 0 0.0 0 0 0.0 \n", + "4 0 0.0 0 0 0.0 \n", + ".. ... ... ... ... ... 
\n", + "218 0 0.0 0 0 0.0 \n", + "219 0 0.0 0 0 0.0 \n", + "220 0 0.0 0 0 0.0 \n", + "221 0 0.0 0 0 0.0 \n", + "222 0 0.0 0 0 0.0 \n", + "\n", + "[223 rows x 17049 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_gexpr = pd.read_parquet(train_data_FN) # 70% of the original gene expression data (random split)\n", + "train_gexpr" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "4a7e81bc", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Unnamed: 0 SAMD11 NOC2L KLHL17 PLEKHN1 PERM1 HES4 \\\n", + "0 CCCAACTGTCGAATTC-1_5 0.000000 0.0 0.0 0 0 0.000000 \n", + "1 GTTCGCTGTACAGTTC-1_1 0.000000 0.0 0.0 0 0 0.000000 \n", + "2 AGGAGGTCATTGACTG-1_2 0.000000 0.0 0.0 0 0 0.840949 \n", + "3 TAAGTCGTCTTCGTGC-1_3 0.000000 0.0 0.0 0 0 0.000000 \n", + "4 GCCAGCATCAGAGCAG-1_2 0.000000 0.0 0.0 0 0 0.000000 \n", + ".. ... ... ... ... ... ... ... \n", + "91 AAGGAATCACGGGCTT-1_3 0.000000 0.0 0.0 0 0 0.000000 \n", + "92 ATAGAGATCAAAGGTA-1_2 0.000000 0.0 0.0 0 0 0.000000 \n", + "93 TTTGTTGTCTACGCGG-1_2 0.000000 0.0 0.0 0 0 1.001498 \n", + "94 GCCATGGGTGGAACAC-1_1 0.000000 0.0 0.0 0 0 1.973762 \n", + "95 TTTGGAGCAGTAGATA-1_2 1.548059 0.0 0.0 0 0 0.000000 \n", + "\n", + " ISG15 AGRN C1orf159 ... STK26 RTL8B RTL8C RTL8A SMIM10L2B \\\n", + "0 0.000000 0.0 0.000000 ... 0 0 0.0 0 0 \n", + "1 0.000000 0.0 0.000000 ... 0 0 0.0 0 0 \n", + "2 0.840949 0.0 0.000000 ... 0 0 0.0 0 0 \n", + "3 0.000000 0.0 0.000000 ... 0 0 0.0 0 0 \n", + "4 0.000000 0.0 0.000000 ... 0 0 0.0 0 0 \n", + ".. ... ... ... ... ... ... ... ... ... \n", + "91 0.000000 0.0 0.000000 ... 0 0 0.0 0 0 \n", + "92 0.000000 0.0 1.048932 ... 0 0 0.0 0 0 \n", + "93 0.000000 0.0 0.000000 ... 0 0 0.0 0 0 \n", + "94 0.000000 0.0 0.000000 ... 0 0 0.0 0 0 \n", + "95 0.000000 0.0 1.548059 ... 0 0 0.0 0 0 \n", + "\n", + " SMIM10L2A INTS6L ADGRG4 PNMA6A CCNQ \n", + "0 0 0.0 0 0 0.0 \n", + "1 0 0.0 0 0 0.0 \n", + "2 0 0.0 0 0 0.0 \n", + "3 0 0.0 0 0 0.0 \n", + "4 0 0.0 0 0 0.0 \n", + ".. ... ... ... ... ... 
\n", + "91 0 0.0 0 0 0.0 \n", + "92 0 0.0 0 0 0.0 \n", + "93 0 0.0 0 0 0.0 \n", + "94 0 0.0 0 0 0.0 \n", + "95 0 0.0 0 0 0.0 \n", + "\n", + "[96 rows x 17049 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_gexpr = pd.read_parquet(test_data_FN) # remaining 30% of the original gene expression data\n", + "test_gexpr" + ] + }, + { + "cell_type": "markdown", + "id": "0fbd7f77", + "metadata": {}, + "source": [ + "Input Protein-Protein Interaction (PPI) Network:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "4cd8d66d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " TF1 TF2 score\n", + "323 NR2C2 NR2C1 1.000000\n", + "432 NR2C1 NR2C2 1.000000\n", + "545 ATF7 ATF2 1.000000\n", + "572 CUX1 ATF2 1.000000\n", + "573 JUND ATF2 1.000000\n", + "... ... ... ...\n", + "23168702 TEAD1 TEAD4 0.922923\n", + "24614146 NFIA NFIB 0.812813\n", + "28106670 MEF2C MEF2B 0.912209\n", + "28172345 NFIB NFIA 0.983943\n", + "28484180 TEAD4 TEAD1 0.998360\n", + "\n", + "[20926 rows x 3 columns]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ppi_df = pd.read_parquet(ppi_FN)\n", + "ppi_df" + ] + }, + { + "cell_type": "markdown", + "id": "2081d440", + "metadata": {}, + "source": [ + "Next, please note that we will use NetREm (Network Regression Embeddings) to identify the optimal Transcription Factors (TFs) out of the N candidate TFs, which may regulate this TG.\n", + "NetREm is run 1 TG at a time, to eventually build out networks for the cell-type :)" + ] + }, + { + "cell_type": "markdown", + "id": "3d5f4a32", + "metadata": {}, + "source": [ + "## Integration of multimodal data and networks:" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "e581a231", + "metadata": {}, + "source": [ + "![netrem_step1.png](../user_guide/pics/netrem_step1.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "456225d8", + "metadata": {}, + "outputs": [], + "source": [ + "tg = \"ZZZ3\" # target gene of interest" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "66fb2bf8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " ZZZ3\n", + "0 0.0\n", + "1 0.0\n", + "2 0.0\n", + "3 0.0\n", + "4 0.0\n", + ".. ...\n", + "218 0.0\n", + "219 0.0\n", + "220 0.0\n", + "221 0.0\n", + "222 0.0\n", + "\n", + "[223 rows x 1 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# training gene expression data for target gene (TG) y\n", + "y_train = train_gexpr[[tg]]\n", + "y_train" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "07c5d661", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " ZZZ3\n", + "0 0.000000\n", + "1 1.406272\n", + "2 0.000000\n", + "3 0.000000\n", + "4 0.000000\n", + ".. ...\n", + "91 0.000000\n", + "92 0.000000\n", + "93 0.000000\n", + "94 0.000000\n", + "95 1.548059\n", + "\n", + "[96 rows x 1 columns]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# testing gene expression data for target gene (TG) y\n", + "y_test = test_gexpr[[tg]]\n", + "y_test" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "bc43ad2e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) Please note that we have N = 77 candidate TFs for our TG ZZZ3\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " TG TF\n", + "635318 ZZZ3 CTCF\n", + "635319 ZZZ3 E2F3\n", + "635320 ZZZ3 EBF1\n", + "635321 ZZZ3 FOXP1\n", + "635322 ZZZ3 GTF3C2\n", + "... ... ...\n", + "635390 ZZZ3 BACH1\n", + "635391 ZZZ3 TCF3\n", + "635392 ZZZ3 ERF\n", + "635393 ZZZ3 ZNF281\n", + "635394 ZZZ3 SMAD4\n", + "\n", + "[77 rows x 2 columns]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "candidate_TFs_for_TG_df = tfs_for_tgs_final_df[tfs_for_tgs_final_df[\"TG\"] == tg]\n", + "num_candidate_TFs_for_TG = candidate_TFs_for_TG_df.shape[0]\n", + "print(f\":) Please note that we have N = {num_candidate_TFs_for_TG} candidate TFs for our TG {tg}\")\n", + "candidate_TFs_for_TG_df" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "a70f24c4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Some of the first few candidate TFs for the TG ZZZ3: ['BACH1', 'BCL6', 'CCNT2', 'CTCF', 'E2F3', 'E4F1']\n" + ] + } + ], + "source": [ + "candidate_TFs_for_TG = list(candidate_TFs_for_TG_df[\"TF\"]) \n", + "candidate_TFs_for_TG.sort() # Saniya sorts alphabetically for convenience :)\n", + "print(f\"Some of the first few candidate TFs for the TG {tg}: {candidate_TFs_for_TG[0:6]}\")" + ] + }, + { + "cell_type": "markdown", + "id": "c69b3eb8", + "metadata": {}, + "source": [ + "Please note that we will utilize this given data to fit our network regularized regression problem for our target gene (TG):\n", + "\n", + "**NetREm model training:**\n", + "* filtered_ppi_for_TG\n", + "* X_train\n", + "* y_train\n", + "\n", + "**NetREm model testing:**\n", + "* X_test\n", + "* y_test" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "00f41fe4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " TF1 TF2 score\n", + "32314 HCFC1 SP1 1.000000\n", + "32369 SMC3 SP1 1.000000\n", + "32387 HDAC2 SP1 1.000000\n", + "32457 GTF3C2 SP1 1.000000\n", + "32619 MGA SP1 1.000000\n", + "... ... ... ...\n", + "14290770 ZFP82 TP53 0.219219\n", + "14292496 ZNF136 SMC3 0.157157\n", + "14293375 ZNF274 HDAC2 0.245245\n", + "14293598 ZNF281 SMAD4 0.183183\n", + "18525244 MYEF2 CTCF 0.013724\n", + "\n", + "[2860 rows x 3 columns]" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# we filter the PPI to only include the candidate TFs for our TG\n", + "filtered_ppi_for_TG = ppi_df[ppi_df[\"TF1\"].isin(candidate_TFs_for_TG)]\n", + "filtered_ppi_for_TG = filtered_ppi_for_TG[filtered_ppi_for_TG[\"TF2\"].isin(candidate_TFs_for_TG)].drop_duplicates()\n", + "filtered_ppi_for_TG" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "1f5c6dd4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2860\n" + ] + }, + { + "data": { + "text/plain": [ + "[['HCFC1', 'SP1', 1.0],\n", + " ['SMC3', 'SP1', 1.0],\n", + " ['HDAC2', 'SP1', 1.0],\n", + " ['GTF3C2', 'SP1', 1.0],\n", + " ['MGA', 'SP1', 1.0]]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# # Then, we need to do this conversion for NetREm:\n", + "filtered_ppi_for_TG = filtered_ppi_for_TG.values.tolist()\n", + "print(len(filtered_ppi_for_TG))\n", + "filtered_ppi_for_TG[0:5] # first 5 entries" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "21bfd35c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " BACH1 BCL6 CCNT2 CTCF E2F3 E4F1 EBF1 EGR1 \\\n", + "0 0.795283 0.000000 0.795283 0.000000 1.956615 0.0 0.0 0.0 \n", + "1 0.000000 2.167218 0.000000 0.000000 0.000000 0.0 0.0 0.0 \n", + "2 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 \n", + "3 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 \n", + "4 0.000000 0.000000 0.000000 1.297927 1.297927 0.0 0.0 0.0 \n", + ".. ... ... ... ... ... ... ... ... \n", + "218 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 \n", + "219 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 \n", + "220 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 \n", + "221 0.000000 2.158477 0.822562 0.000000 0.000000 0.0 0.0 0.0 \n", + "222 1.765982 1.765982 0.000000 0.000000 0.000000 0.0 0.0 0.0 \n", + "\n", + " ELF1 ERF ... YY1 ZBTB7A ZFP28 ZFP82 ZNF136 \\\n", + "0 0.795283 0.0 ... 0.795283 0.000000 0.000000 0.0 0.000000 \n", + "1 0.000000 0.0 ... 0.000000 0.000000 0.000000 0.0 0.000000 \n", + "2 0.000000 0.0 ... 0.000000 0.000000 0.000000 0.0 0.000000 \n", + "3 0.000000 0.0 ... 0.000000 0.000000 2.517899 0.0 0.000000 \n", + "4 2.195567 0.0 ... 0.000000 0.000000 0.000000 0.0 0.000000 \n", + ".. ... ... ... ... ... ... ... ... \n", + "218 1.757196 0.0 ... 0.000000 0.000000 0.000000 0.0 0.000000 \n", + "219 0.000000 0.0 ... 0.000000 2.478969 0.000000 0.0 0.000000 \n", + "220 0.000000 0.0 ... 0.000000 0.000000 0.000000 0.0 0.000000 \n", + "221 0.822562 0.0 ... 0.822562 0.000000 0.000000 0.0 0.822562 \n", + "222 0.000000 0.0 ... 0.000000 0.000000 0.000000 0.0 0.000000 \n", + "\n", + " ZNF140 ZNF274 ZNF281 ZNF682 ZNF76 \n", + "0 0.000000 0.795283 0.0 0.0 0.0 \n", + "1 0.000000 0.000000 0.0 0.0 0.0 \n", + "2 0.000000 0.000000 0.0 0.0 0.0 \n", + "3 0.000000 0.000000 0.0 0.0 0.0 \n", + "4 0.000000 0.000000 0.0 0.0 0.0 \n", + ".. ... ... ... ... ... 
\n", + "218 0.000000 0.000000 0.0 0.0 0.0 \n", + "219 0.000000 0.000000 0.0 0.0 0.0 \n", + "220 1.734546 0.000000 0.0 0.0 0.0 \n", + "221 0.000000 0.822562 0.0 0.0 0.0 \n", + "222 0.000000 0.000000 0.0 0.0 0.0 \n", + "\n", + "[223 rows x 77 columns]" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "X_train = train_gexpr[candidate_TFs_for_TG] \n", + "X_train" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "628beab0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
       BACH1      BCL6     CCNT2      CTCF      E2F3      E4F1  EBF1  EGR1      ELF1  ERF  ...       YY1    ZBTB7A  ZFP28     ZFP82    ZNF136  ZNF140    ZNF274    ZNF281   ZNF682     ZNF76
0   0.000000  2.536981  0.000000  0.000000  0.000000  0.000000   0.0   0.0  0.000000  0.0  ...  0.000000  0.000000    0.0  0.000000  0.000000     0.0  0.000000  0.000000  0.00000  0.000000
1   1.406272  2.326511  0.000000  0.000000  0.000000  0.000000   0.0   0.0  0.000000  0.0  ...  1.406272  0.000000    0.0  0.000000  0.000000     0.0  0.000000  0.000000  1.96871  0.000000
2   0.840949  0.840949  0.000000  0.840949  0.000000  0.000000   0.0   0.0  0.840949  0.0  ...  0.840949  0.000000    0.0  0.000000  0.840949     0.0  0.000000  0.000000  0.00000  0.000000
3   0.000000  0.000000  2.326094  0.000000  0.000000  1.726143   0.0   0.0  0.000000  0.0  ...  0.000000  0.000000    0.0  1.726143  0.000000     0.0  0.000000  0.000000  0.00000  0.000000
4   1.893136  0.000000  0.000000  0.000000  0.000000  0.000000   0.0   0.0  1.893136  0.0  ...  1.340271  0.000000    0.0  1.340271  0.000000     0.0  0.000000  0.000000  0.00000  0.000000
..       ...       ...       ...       ...       ...       ...   ...   ...       ...  ...  ...       ...       ...    ...       ...       ...     ...       ...       ...      ...       ...
91  0.000000  0.000000  2.889145  0.000000  0.000000  0.000000   0.0   0.0  0.000000  0.0  ...  0.000000  3.554086    0.0  0.000000  0.000000     0.0  0.000000  0.000000  0.00000  0.000000
92  0.000000  1.048932  0.000000  0.000000  1.048932  0.000000   0.0   0.0  1.048932  0.0  ...  0.000000  0.000000    0.0  0.000000  0.000000     0.0  0.000000  0.000000  0.00000  0.000000
93  1.001498  0.000000  0.000000  0.000000  0.000000  0.000000   0.0   0.0  0.000000  0.0  ...  0.000000  0.000000    0.0  0.000000  0.000000     0.0  1.001498  0.000000  0.00000  1.001498
94  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.0   0.0  0.000000  0.0  ...  1.410707  0.000000    0.0  0.000000  0.000000     0.0  0.000000  1.410707  0.00000  0.000000
95  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.0   0.0  0.000000  0.0  ...  0.000000  1.548059    0.0  0.000000  1.548059     0.0  0.000000  0.000000  0.00000  0.000000
\n", + "

96 rows × 77 columns

\n", + "
" + ], + "text/plain": [ + " BACH1 BCL6 CCNT2 CTCF E2F3 E4F1 EBF1 EGR1 \\\n", + "0 0.000000 2.536981 0.000000 0.000000 0.000000 0.000000 0.0 0.0 \n", + "1 1.406272 2.326511 0.000000 0.000000 0.000000 0.000000 0.0 0.0 \n", + "2 0.840949 0.840949 0.000000 0.840949 0.000000 0.000000 0.0 0.0 \n", + "3 0.000000 0.000000 2.326094 0.000000 0.000000 1.726143 0.0 0.0 \n", + "4 1.893136 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 \n", + ".. ... ... ... ... ... ... ... ... \n", + "91 0.000000 0.000000 2.889145 0.000000 0.000000 0.000000 0.0 0.0 \n", + "92 0.000000 1.048932 0.000000 0.000000 1.048932 0.000000 0.0 0.0 \n", + "93 1.001498 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 \n", + "94 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 \n", + "95 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 \n", + "\n", + " ELF1 ERF ... YY1 ZBTB7A ZFP28 ZFP82 ZNF136 ZNF140 \\\n", + "0 0.000000 0.0 ... 0.000000 0.000000 0.0 0.000000 0.000000 0.0 \n", + "1 0.000000 0.0 ... 1.406272 0.000000 0.0 0.000000 0.000000 0.0 \n", + "2 0.840949 0.0 ... 0.840949 0.000000 0.0 0.000000 0.840949 0.0 \n", + "3 0.000000 0.0 ... 0.000000 0.000000 0.0 1.726143 0.000000 0.0 \n", + "4 1.893136 0.0 ... 1.340271 0.000000 0.0 1.340271 0.000000 0.0 \n", + ".. ... ... ... ... ... ... ... ... ... \n", + "91 0.000000 0.0 ... 0.000000 3.554086 0.0 0.000000 0.000000 0.0 \n", + "92 1.048932 0.0 ... 0.000000 0.000000 0.0 0.000000 0.000000 0.0 \n", + "93 0.000000 0.0 ... 0.000000 0.000000 0.0 0.000000 0.000000 0.0 \n", + "94 0.000000 0.0 ... 1.410707 0.000000 0.0 0.000000 0.000000 0.0 \n", + "95 0.000000 0.0 ... 0.000000 1.548059 0.0 0.000000 1.548059 0.0 \n", + "\n", + " ZNF274 ZNF281 ZNF682 ZNF76 \n", + "0 0.000000 0.000000 0.00000 0.000000 \n", + "1 0.000000 0.000000 1.96871 0.000000 \n", + "2 0.000000 0.000000 0.00000 0.000000 \n", + "3 0.000000 0.000000 0.00000 0.000000 \n", + "4 0.000000 0.000000 0.00000 0.000000 \n", + ".. ... ... ... ... 
\n", + "91 0.000000 0.000000 0.00000 0.000000 \n", + "92 0.000000 0.000000 0.00000 0.000000 \n", + "93 1.001498 0.000000 0.00000 1.001498 \n", + "94 0.000000 1.410707 0.00000 0.000000 \n", + "95 0.000000 0.000000 0.00000 0.000000 \n", + "\n", + "[96 rows x 77 columns]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "X_test = test_gexpr[candidate_TFs_for_TG]\n", + "X_test" + ] + }, + { + "cell_type": "markdown", + "id": "90a9d41a", + "metadata": {}, + "source": [ + "Below, Saniya first shows the performance of 4 baseline models (all fit using Cross Validation (CV) except Linear Regression) and then presents many examples of how NetREm may be applied, via **netrem**, **netremCV**, and/or Bayesian hyperparameter optimization. We recommend trying out these examples for your specific needs. 😁\n", + "\n", + "Please note that video tutorials on NetREm will soon be available on [Saniya's YouTube channel](https://www.youtube.com/c/SaniyaKhullar)📽️👩‍🏫." + ] + }, + { + "cell_type": "markdown", + "id": "888e3c9d", + "metadata": {}, + "source": [ + "**Example 1: Mainly using defaults and/or Cross-Validation to determine best values (minimal input from user 😴)** \n", + "* 1a: *netrem*: defaults for $\\beta_{net}$ and $\\alpha_{lasso}$ \n", + "* 1b: *netrem*: default $\\beta_{net}$ and LassoCV to find $\\alpha_{lasso}$ \n", + "* 1c: *netremCV*: find the optimal $\\beta_{net}$ and optimal $\\alpha_{lasso}$ via Cross Validation (CV) (3 examples)\n", + "* 1d: *netrem and netrem-based function*: Bayesian optimization to determine the optimal $\\alpha_{lasso}$ and $\\beta_{net}$ in fixed default ranges\n", + "\n", + "**Example 2: User provides inputs for NetREm (more input needed from user🤓🤔)** \n", + "* 2a: *netrem*: using user-defined values for $\\beta_{net}$ and $\\alpha_{lasso}$ \n", + "* 2b: 
*netrem*: using user-defined value for $\\beta_{net}$ and using LassoCV to find optimal $\\alpha_{lasso}$ \n", + "* 2c: *netrem*: using GridSearchCV for comprehensive hyperparameter optimization\n", + "* 2d: *netrem*: using RandomizedSearchCV for comprehensive hyperparameter optimization\n", + "* 2e: *netrem and netrem-based function*: Bayesian optimization to determine the optimal $\\alpha_{lasso}$ and $\\beta_{net}$ for ranges of values defined by the user. \n", + "\n", + "**Example 3: User provides more inputs for more comprehensive hyperparameter optimization (building on #2)**\n", + "* 3a: *netrem*: using GridSearchCV for comprehensive hyperparameter optimization" + ] + }, + { + "cell_type": "markdown", + "id": "e9007e89", + "metadata": {}, + "source": [ + "### Baseline Examples: We fit models using the scikit-learn LinearRegression, LassoCV, RidgeCV, and ElasticNetCV estimators on the data.\n", + "\n", + "#### Baseline Example 1: Fitting models with and without the y-intercept term" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "6c63ed0d", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\saniy\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:1568: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", + " y = column_or_1d(y, warn=True)\n", + "C:\\Users\\saniy\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:1568: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", + " y = column_or_1d(y, warn=True)\n", + "C:\\Users\\saniy\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:1568: DataConversionWarning: A column-vector y was passed when a 1d array was expected. 
Please change the shape of y to (n_samples, ), for example using ravel().\n", + " y = column_or_1d(y, warn=True)\n", + "C:\\Users\\saniy\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:1568: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", + " y = column_or_1d(y, warn=True)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "
    AbsoluteVal_coefficient  Rank      TF              Info  y_intercept  final_model_TFs  TFs_input_to_model  original_TFs_in_X  train_mse  test_mse  train_nmse  test_nmse  train_snr   test_snr  train_psnr  test_psnr    TG
0                  0.097586     1  SETDB1      ElasticNetCV      True :)               14                  77                 77   0.532199  0.644359    0.613835   0.665892   2.119487   1.765960   12.517163  10.656326  ZZZ3
1                  0.089008     2    ELF1      ElasticNetCV      True :)               14                  77                 77   0.532199  0.644359    0.613835   0.665892   2.119487   1.765960   12.517163  10.656326  ZZZ3
2                  0.076589     3   BACH1      ElasticNetCV      True :)               14                  77                 77   0.532199  0.644359    0.613835   0.665892   2.119487   1.765960   12.517163  10.656326  ZZZ3
3                  0.074991     4   NFKB1      ElasticNetCV      True :)               14                  77                 77   0.532199  0.644359    0.613835   0.665892   2.119487   1.765960   12.517163  10.656326  ZZZ3
4                  0.059867     5    NFIB      ElasticNetCV      True :)               14                  77                 77   0.532199  0.644359    0.613835   0.665892   2.119487   1.765960   12.517163  10.656326  ZZZ3
..                      ...   ...     ...               ...          ...              ...                 ...                ...        ...       ...         ...        ...        ...        ...         ...        ...   ...
72                 0.008444    73   NR1D2  LinearRegression     False :(               77                  77                 77   0.875647  1.094298    1.009965   1.130868  -0.043063  -0.534117   10.354613   8.356249  ZZZ3
73                 0.007707    74   NR6A1  LinearRegression     False :(               77                  77                 77   0.875647  1.094298    1.009965   1.130868  -0.043063  -0.534117   10.354613   8.356249  ZZZ3
74                 0.002535    75    RXRA  LinearRegression     False :(               77                  77                 77   0.875647  1.094298    1.009965   1.130868  -0.043063  -0.534117   10.354613   8.356249  ZZZ3
75                 0.001695    76    THRB  LinearRegression     False :(               77                  77                 77   0.875647  1.094298    1.009965   1.130868  -0.043063  -0.534117   10.354613   8.356249  ZZZ3
76                 0.000057    77   NR2F1  LinearRegression     False :(               77                  77                 77   0.875647  1.094298    1.009965   1.130868  -0.043063  -0.534117   10.354613   8.356249  ZZZ3
\n", + "

364 rows × 17 columns

\n", + "
" + ], + "text/plain": [ + " AbsoluteVal_coefficient Rank TF Info y_intercept \\\n", + "0 0.097586 1 SETDB1 ElasticNetCV True :) \n", + "1 0.089008 2 ELF1 ElasticNetCV True :) \n", + "2 0.076589 3 BACH1 ElasticNetCV True :) \n", + "3 0.074991 4 NFKB1 ElasticNetCV True :) \n", + "4 0.059867 5 NFIB ElasticNetCV True :) \n", + ".. ... ... ... ... ... \n", + "72 0.008444 73 NR1D2 LinearRegression False :( \n", + "73 0.007707 74 NR6A1 LinearRegression False :( \n", + "74 0.002535 75 RXRA LinearRegression False :( \n", + "75 0.001695 76 THRB LinearRegression False :( \n", + "76 0.000057 77 NR2F1 LinearRegression False :( \n", + "\n", + " final_model_TFs TFs_input_to_model original_TFs_in_X train_mse \\\n", + "0 14 77 77 0.532199 \n", + "1 14 77 77 0.532199 \n", + "2 14 77 77 0.532199 \n", + "3 14 77 77 0.532199 \n", + "4 14 77 77 0.532199 \n", + ".. ... ... ... ... \n", + "72 77 77 77 0.875647 \n", + "73 77 77 77 0.875647 \n", + "74 77 77 77 0.875647 \n", + "75 77 77 77 0.875647 \n", + "76 77 77 77 0.875647 \n", + "\n", + " test_mse train_nmse test_nmse train_snr test_snr train_psnr \\\n", + "0 0.644359 0.613835 0.665892 2.119487 1.765960 12.517163 \n", + "1 0.644359 0.613835 0.665892 2.119487 1.765960 12.517163 \n", + "2 0.644359 0.613835 0.665892 2.119487 1.765960 12.517163 \n", + "3 0.644359 0.613835 0.665892 2.119487 1.765960 12.517163 \n", + "4 0.644359 0.613835 0.665892 2.119487 1.765960 12.517163 \n", + ".. ... ... ... ... ... ... \n", + "72 1.094298 1.009965 1.130868 -0.043063 -0.534117 10.354613 \n", + "73 1.094298 1.009965 1.130868 -0.043063 -0.534117 10.354613 \n", + "74 1.094298 1.009965 1.130868 -0.043063 -0.534117 10.354613 \n", + "75 1.094298 1.009965 1.130868 -0.043063 -0.534117 10.354613 \n", + "76 1.094298 1.009965 1.130868 -0.043063 -0.534117 10.354613 \n", + "\n", + " test_psnr TG \n", + "0 10.656326 ZZZ3 \n", + "1 10.656326 ZZZ3 \n", + "2 10.656326 ZZZ3 \n", + "3 10.656326 ZZZ3 \n", + "4 10.656326 ZZZ3 \n", + ".. ... ... 
\n", + "72 8.356249 ZZZ3 \n", + "73 8.356249 ZZZ3 \n", + "74 8.356249 ZZZ3 \n", + "75 8.356249 ZZZ3 \n", + "76 8.356249 ZZZ3 \n", + "\n", + "[364 rows x 17 columns]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Examples with and without the y-intercept term included:\n", + "baseline_model_names = [\"ElasticNetCV\", \"RidgeCV\", \"LassoCV\", \"LinearRegression\"]\n", + "baseline_df = pd.DataFrame()\n", + "for model in baseline_model_names:\n", + " df_to_add1 = nm_eval.baseline_metrics_function(X_train = X_train, y_train = y_train, \n", + " X_test = X_test, y_test = y_test, \n", + " tg = tg, model_name = model, y_intercept = True)\n", + " df_to_add2 = nm_eval.baseline_metrics_function(X_train = X_train, y_train = y_train, \n", + " X_test = X_test, y_test = y_test, \n", + " tg = tg, model_name = model, y_intercept = False)\n", + " baseline_df = pd.concat([baseline_df, df_to_add1, df_to_add2])\n", + "baseline_df" + ] + }, + { + "cell_type": "markdown", + "id": "3d8379ea", + "metadata": {}, + "source": [ + "😊 We can view some of the baseline metrics at-a-glance:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "0d1f0053", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
      TG              Info  y_intercept  final_model_TFs  train_mse  test_mse  train_nmse  test_nmse  train_snr   test_snr  train_psnr  test_psnr
0   ZZZ3      ElasticNetCV      True :)               14   0.532199  0.644359    0.613835   0.665892   2.119487   1.765960   12.517163  10.656326
0   ZZZ3      ElasticNetCV     False :(               16   0.517626  0.658863    0.597026   0.680881   2.240065   1.669286   12.637741  10.559653
0   ZZZ3           RidgeCV      True :)               77   0.805343  0.897581    0.928876   0.927576   0.320422   0.326504   10.718098   9.216870
0   ZZZ3           RidgeCV     False :(               77   0.805521  0.897883    0.929082   0.927888   0.319459   0.325043   10.717135   9.215409
0   ZZZ3           LassoCV      True :)               13   0.535477  0.646487    0.617616   0.668092   2.092818   1.751640   12.490494  10.642006
0   ZZZ3           LassoCV     False :(               13   0.518113  0.660872    0.597587   0.682957   2.235986   1.656065   12.633662  10.546431
0   ZZZ3  LinearRegression      True :)               77   0.878113  1.099653    1.012809   1.136401  -0.055274  -0.555317   10.342402   8.335050
0   ZZZ3  LinearRegression     False :(               77   0.875647  1.094298    1.009965   1.130868  -0.043063  -0.534117   10.354613   8.356249
\n", + "
" + ], + "text/plain": [ + " TG Info y_intercept final_model_TFs train_mse test_mse \\\n", + "0 ZZZ3 ElasticNetCV True :) 14 0.532199 0.644359 \n", + "0 ZZZ3 ElasticNetCV False :( 16 0.517626 0.658863 \n", + "0 ZZZ3 RidgeCV True :) 77 0.805343 0.897581 \n", + "0 ZZZ3 RidgeCV False :( 77 0.805521 0.897883 \n", + "0 ZZZ3 LassoCV True :) 13 0.535477 0.646487 \n", + "0 ZZZ3 LassoCV False :( 13 0.518113 0.660872 \n", + "0 ZZZ3 LinearRegression True :) 77 0.878113 1.099653 \n", + "0 ZZZ3 LinearRegression False :( 77 0.875647 1.094298 \n", + "\n", + " train_nmse test_nmse train_snr test_snr train_psnr test_psnr \n", + "0 0.613835 0.665892 2.119487 1.765960 12.517163 10.656326 \n", + "0 0.597026 0.680881 2.240065 1.669286 12.637741 10.559653 \n", + "0 0.928876 0.927576 0.320422 0.326504 10.718098 9.216870 \n", + "0 0.929082 0.927888 0.319459 0.325043 10.717135 9.215409 \n", + "0 0.617616 0.668092 2.092818 1.751640 12.490494 10.642006 \n", + "0 0.597587 0.682957 2.235986 1.656065 12.633662 10.546431 \n", + "0 1.012809 1.136401 -0.055274 -0.555317 10.342402 8.335050 \n", + "0 1.009965 1.130868 -0.043063 -0.534117 10.354613 8.356249 " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "comparison_df = baseline_df[[\"TG\", \"Info\", \"y_intercept\", \n", + " \"final_model_TFs\", \"train_mse\", \"test_mse\", \"train_nmse\", \n", + " \"test_nmse\", \"train_snr\", \"test_snr\", \"train_psnr\", \"test_psnr\"]].drop_duplicates()\n", + "comparison_df" + ] + }, + { + "cell_type": "markdown", + "id": "69b2f2a2", + "metadata": {}, + "source": [ + "Please note that:\n", + "* MSE is Mean Squared Error (smaller values are better 😀)\n", + "* NMSE is Normalized MSE (smaller values are better 😀)\n", + "* SNR is Signal to Noise Ratio (larger values are better 😀)\n", + "* PSNR is Peak SNR (larger values are better 😀)" + ] + }, + { + "cell_type": "markdown", + "id": "033fd384", + "metadata": {}, + "source": [ + "### Below, Saniya will 
show examples utilizing NetREm (Network Regression Embeddings) " + ] + }, + { + "cell_type": "markdown", + "id": "006d75b3", + "metadata": {}, + "source": [ + "## Example 1️⃣:\n", + "### using defaults when possible :) 😴" + ] + }, + { + "cell_type": "markdown", + "id": "ca867780", + "metadata": {}, + "source": [ + "### Example 1a: \n", + "#### Using the defaults for *beta_net* $\\beta_{net}$ and *alpha_lasso* $\\alpha_{lasso}$ ." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "14b59753", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2860\n" + ] + }, + { + "data": { + "text/plain": [ + "[['HCFC1', 'SP1', 1.0],\n", + " ['SMC3', 'SP1', 1.0],\n", + " ['HDAC2', 'SP1', 1.0],\n", + " ['GTF3C2', 'SP1', 1.0],\n", + " ['MGA', 'SP1', 1.0]]" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print(len(filtered_ppi_for_TG))\n", + "filtered_ppi_for_TG[0:5] # Saniya views the first few entries of the input edge list\n", + "# [[node1, node2, weight (if known)],...]" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "dfe27f59", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "using beta_net default of 1\n", + "using alpha_lasso default of 0.01\n", + "# of TFs with non-zero coefficients: 55\n", + "Training MSE: 0.4028411980913493\n", + "Testing MSE: 0.7056823978412833\n", + "CPU times: total: 0 ns\n", + "Wall time: 79.3 ms\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "
    y_intercept     BACH1      BCL6     CCNT2     CTCF      E2F3       E4F1       EBF1      ELF1        ERF  ...    STAT5B      TBX2       TCF3       TP53      USF1       YY1    ZBTB7A    ZNF140    ZNF682     ZNF76
0          None  0.091637  0.000514  0.049703  -0.0584  0.181601  -0.206912  -0.034205  0.122034  -0.089625  ...  0.078489  0.007029  -0.127275  -0.103115  0.041395  0.106009  -0.03576  0.092999  0.002658  0.139763
\n", + "

1 rows × 56 columns

\n", + "
" + ], + "text/plain": [ + " y_intercept BACH1 BCL6 CCNT2 CTCF E2F3 E4F1 \\\n", + "0 None 0.091637 0.000514 0.049703 -0.0584 0.181601 -0.206912 \n", + "\n", + " EBF1 ELF1 ERF ... STAT5B TBX2 TCF3 TP53 \\\n", + "0 -0.034205 0.122034 -0.089625 ... 0.078489 0.007029 -0.127275 -0.103115 \n", + "\n", + " USF1 YY1 ZBTB7A ZNF140 ZNF682 ZNF76 \n", + "0 0.041395 0.106009 -0.03576 0.092999 0.002658 0.139763 \n", + "\n", + "[1 rows x 56 columns]" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time \n", + "# %%time (above) reports how long the code in this cell takes to run\n", + "\n", + "# Using defaults for beta_net and alpha_lasso:\n", + "netrem_1a = nm.netrem(edge_list = filtered_ppi_for_TG)\n", + "\n", + "# Fitting the NetREm model on the training data (X_train and y_train):\n", + "netrem_1a.fit(X_train, y_train)\n", + "\n", + "# Inspecting the fitted NetREm model:\n", + "final_model_1a = netrem_1a.model_nonzero_coef_df\n", + "print(f\"# of TFs with non-zero coefficients: {netrem_1a.num_final_predictors}\")\n", + "mse_train = netrem_1a.test_mse(X_train, y_train)\n", + "mse_test = netrem_1a.test_mse(X_test, y_test)\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")\n", + "final_model_1a" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "2ef578e0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
NetREmModel(verbose=False, overlapped_nodes_only=False, all_pos_coefs=False, model_type=Lasso, use_network=True, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C467AB8E0>)
" + ], + "text/plain": [ + "NetREmModel(verbose=False, overlapped_nodes_only=False, all_pos_coefs=False, model_type=Lasso, use_network=True, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C467AB8E0>)" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_1a" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "27687c20", + "metadata": {}, + "source": [ + "![netrem_1a.png](../user_guide/pics/netrem_1a.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "8c2c20ff", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "To view the TF-TG regulatory links for the optimal TFs, please note that we can access this:\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
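| | coef | TF | TG | info | train_mse | beta_net | alpha_lasso | AbsoluteVal_coefficient | Rank | final_model_TFs | TFs_input_to_model | original_TFs_in_X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | y_intercept | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | NaN | 56 | 55 | 77 | 77 |
| 1 | 0.091637 | BACH1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.091637 | 17 | 55 | 77 | 77 |
| 2 | 0.000514 | BCL6 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.000514 | 55 | 55 | 77 | 77 |
| 3 | 0.049703 | CCNT2 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.049703 | 29 | 55 | 77 | 77 |
| 4 | -0.0584 | CTCF | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.058400 | 25 | 55 | 77 | 77 |
| 5 | 0.181601 | E2F3 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.181601 | 3 | 55 | 77 | 77 |
| 6 | -0.206912 | E4F1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.206912 | 1 | 55 | 77 | 77 |
| 7 | -0.034205 | EBF1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.034205 | 36 | 55 | 77 | 77 |
| 8 | 0.122034 | ELF1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.122034 | 7 | 55 | 77 | 77 |
| 9 | -0.089625 | ERF | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.089625 | 19 | 55 | 77 | 77 |
| 10 | -0.043602 | ESR2 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.043602 | 31 | 55 | 77 | 77 |
| 11 | -0.011249 | FOXO1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.011249 | 47 | 55 | 77 | 77 |
| 12 | 0.120234 | HCFC1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.120234 | 8 | 55 | 77 | 77 |
| 13 | 0.054667 | HDAC2 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.054667 | 27 | 55 | 77 | 77 |
| 14 | 0.114946 | IRF3 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.114946 | 9 | 55 | 77 | 77 |
| 15 | 0.006109 | IRF7 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.006109 | 51 | 55 | 77 | 77 |
| 16 | -0.074701 | KLF12 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.074701 | 22 | 55 | 77 | 77 |
| 17 | 0.012354 | KLF15 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.012354 | 46 | 55 | 77 | 77 |
| 18 | 0.021167 | MAF | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.021167 | 43 | 55 | 77 | 77 |
| 19 | 0.005506 | MAX | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.005506 | 53 | 55 | 77 | 77 |
| 20 | -0.109458 | MXI1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.109458 | 11 | 55 | 77 | 77 |
| 21 | 0.027785 | MYEF2 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.027785 | 40 | 55 | 77 | 77 |
| 22 | 0.099391 | NFIB | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.099391 | 15 | 55 | 77 | 77 |
| 23 | 0.033488 | NFIC | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.033488 | 37 | 55 | 77 | 77 |
| 24 | 0.170866 | NFKB1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.170866 | 4 | 55 | 77 | 77 |
| 25 | 0.104805 | NR1H2 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.104805 | 13 | 55 | 77 | 77 |
| 26 | -0.021396 | NR2F1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.021396 | 42 | 55 | 77 | 77 |
| 27 | 0.016633 | NR3C1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.016633 | 44 | 55 | 77 | 77 |
| 28 | 0.008364 | NR6A1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.008364 | 49 | 55 | 77 | 77 |
| 29 | 0.040354 | PLAG1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.040354 | 34 | 55 | 77 | 77 |
| 30 | -0.040854 | PML | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.040854 | 33 | 55 | 77 | 77 |
| 31 | -0.005926 | POU2F1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.005926 | 52 | 55 | 77 | 77 |
| 32 | 0.009088 | PPARA | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.009088 | 48 | 55 | 77 | 77 |
| 33 | 0.114686 | RARB | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.114686 | 10 | 55 | 77 | 77 |
| 34 | -0.058719 | RARG | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.058719 | 24 | 55 | 77 | 77 |
| 35 | 0.050561 | RFX3 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.050561 | 28 | 55 | 77 | 77 |
| 36 | -0.091143 | RORA | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.091143 | 18 | 55 | 77 | 77 |
| 37 | 0.054681 | RREB1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.054681 | 26 | 55 | 77 | 77 |
| 38 | 0.045125 | RUNX2 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.045125 | 30 | 55 | 77 | 77 |
| 39 | 0.196592 | SETDB1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.196592 | 2 | 55 | 77 | 77 |
| 40 | 0.070867 | SIN3A | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.070867 | 23 | 55 | 77 | 77 |
| 41 | -0.030716 | SMAD4 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.030716 | 39 | 55 | 77 | 77 |
| 42 | 0.07586 | SMC3 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.075860 | 21 | 55 | 77 | 77 |
| 43 | 0.021915 | SP1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.021915 | 41 | 55 | 77 | 77 |
| 44 | -0.016183 | SREBF2 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.016183 | 45 | 55 | 77 | 77 |
| 45 | 0.031441 | STAT1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.031441 | 38 | 55 | 77 | 77 |
| 46 | 0.078489 | STAT5B | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.078489 | 20 | 55 | 77 | 77 |
| 47 | 0.007029 | TBX2 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.007029 | 50 | 55 | 77 | 77 |
| 48 | -0.127275 | TCF3 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.127275 | 6 | 55 | 77 | 77 |
| 49 | -0.103115 | TP53 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.103115 | 14 | 55 | 77 | 77 |
| 50 | 0.041395 | USF1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.041395 | 32 | 55 | 77 | 77 |
| 51 | 0.106009 | YY1 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.106009 | 12 | 55 | 77 | 77 |
| 52 | -0.03576 | ZBTB7A | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.035760 | 35 | 55 | 77 | 77 |
| 53 | 0.092999 | ZNF140 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.092999 | 16 | 55 | 77 | 77 |
| 54 | 0.002658 | ZNF682 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.002658 | 54 | 55 | 77 | 77 |
| 55 | 0.139763 | ZNF76 | ZZZ3 | netrem_no_intercept | 0.402841 | 1 | 0.01 | 0.139763 | 5 | 55 | 77 | 77 |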
\n", + "
" + ], + "text/plain": [ + " coef TF TG info train_mse beta_net \\\n", + "0 None y_intercept ZZZ3 netrem_no_intercept 0.402841 1 \n", + "1 0.091637 BACH1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "2 0.000514 BCL6 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "3 0.049703 CCNT2 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "4 -0.0584 CTCF ZZZ3 netrem_no_intercept 0.402841 1 \n", + "5 0.181601 E2F3 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "6 -0.206912 E4F1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "7 -0.034205 EBF1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "8 0.122034 ELF1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "9 -0.089625 ERF ZZZ3 netrem_no_intercept 0.402841 1 \n", + "10 -0.043602 ESR2 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "11 -0.011249 FOXO1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "12 0.120234 HCFC1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "13 0.054667 HDAC2 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "14 0.114946 IRF3 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "15 0.006109 IRF7 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "16 -0.074701 KLF12 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "17 0.012354 KLF15 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "18 0.021167 MAF ZZZ3 netrem_no_intercept 0.402841 1 \n", + "19 0.005506 MAX ZZZ3 netrem_no_intercept 0.402841 1 \n", + "20 -0.109458 MXI1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "21 0.027785 MYEF2 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "22 0.099391 NFIB ZZZ3 netrem_no_intercept 0.402841 1 \n", + "23 0.033488 NFIC ZZZ3 netrem_no_intercept 0.402841 1 \n", + "24 0.170866 NFKB1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "25 0.104805 NR1H2 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "26 -0.021396 NR2F1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "27 0.016633 NR3C1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "28 0.008364 NR6A1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "29 0.040354 PLAG1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "30 -0.040854 PML ZZZ3 netrem_no_intercept 0.402841 1 \n", + "31 -0.005926 POU2F1 
ZZZ3 netrem_no_intercept 0.402841 1 \n", + "32 0.009088 PPARA ZZZ3 netrem_no_intercept 0.402841 1 \n", + "33 0.114686 RARB ZZZ3 netrem_no_intercept 0.402841 1 \n", + "34 -0.058719 RARG ZZZ3 netrem_no_intercept 0.402841 1 \n", + "35 0.050561 RFX3 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "36 -0.091143 RORA ZZZ3 netrem_no_intercept 0.402841 1 \n", + "37 0.054681 RREB1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "38 0.045125 RUNX2 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "39 0.196592 SETDB1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "40 0.070867 SIN3A ZZZ3 netrem_no_intercept 0.402841 1 \n", + "41 -0.030716 SMAD4 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "42 0.07586 SMC3 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "43 0.021915 SP1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "44 -0.016183 SREBF2 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "45 0.031441 STAT1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "46 0.078489 STAT5B ZZZ3 netrem_no_intercept 0.402841 1 \n", + "47 0.007029 TBX2 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "48 -0.127275 TCF3 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "49 -0.103115 TP53 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "50 0.041395 USF1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "51 0.106009 YY1 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "52 -0.03576 ZBTB7A ZZZ3 netrem_no_intercept 0.402841 1 \n", + "53 0.092999 ZNF140 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "54 0.002658 ZNF682 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "55 0.139763 ZNF76 ZZZ3 netrem_no_intercept 0.402841 1 \n", + "\n", + " alpha_lasso AbsoluteVal_coefficient Rank final_model_TFs \\\n", + "0 0.01 NaN 56 55 \n", + "1 0.01 0.091637 17 55 \n", + "2 0.01 0.000514 55 55 \n", + "3 0.01 0.049703 29 55 \n", + "4 0.01 0.058400 25 55 \n", + "5 0.01 0.181601 3 55 \n", + "6 0.01 0.206912 1 55 \n", + "7 0.01 0.034205 36 55 \n", + "8 0.01 0.122034 7 55 \n", + "9 0.01 0.089625 19 55 \n", + "10 0.01 0.043602 31 55 \n", + "11 0.01 0.011249 47 55 \n", + "12 0.01 0.120234 8 55 \n", + "13 0.01 
0.054667 27 55 \n", + "14 0.01 0.114946 9 55 \n", + "15 0.01 0.006109 51 55 \n", + "16 0.01 0.074701 22 55 \n", + "17 0.01 0.012354 46 55 \n", + "18 0.01 0.021167 43 55 \n", + "19 0.01 0.005506 53 55 \n", + "20 0.01 0.109458 11 55 \n", + "21 0.01 0.027785 40 55 \n", + "22 0.01 0.099391 15 55 \n", + "23 0.01 0.033488 37 55 \n", + "24 0.01 0.170866 4 55 \n", + "25 0.01 0.104805 13 55 \n", + "26 0.01 0.021396 42 55 \n", + "27 0.01 0.016633 44 55 \n", + "28 0.01 0.008364 49 55 \n", + "29 0.01 0.040354 34 55 \n", + "30 0.01 0.040854 33 55 \n", + "31 0.01 0.005926 52 55 \n", + "32 0.01 0.009088 48 55 \n", + "33 0.01 0.114686 10 55 \n", + "34 0.01 0.058719 24 55 \n", + "35 0.01 0.050561 28 55 \n", + "36 0.01 0.091143 18 55 \n", + "37 0.01 0.054681 26 55 \n", + "38 0.01 0.045125 30 55 \n", + "39 0.01 0.196592 2 55 \n", + "40 0.01 0.070867 23 55 \n", + "41 0.01 0.030716 39 55 \n", + "42 0.01 0.075860 21 55 \n", + "43 0.01 0.021915 41 55 \n", + "44 0.01 0.016183 45 55 \n", + "45 0.01 0.031441 38 55 \n", + "46 0.01 0.078489 20 55 \n", + "47 0.01 0.007029 50 55 \n", + "48 0.01 0.127275 6 55 \n", + "49 0.01 0.103115 14 55 \n", + "50 0.01 0.041395 32 55 \n", + "51 0.01 0.106009 12 55 \n", + "52 0.01 0.035760 35 55 \n", + "53 0.01 0.092999 16 55 \n", + "54 0.01 0.002658 54 55 \n", + "55 0.01 0.139763 5 55 \n", + "\n", + " TFs_input_to_model original_TFs_in_X \n", + "0 77 77 \n", + "1 77 77 \n", + "2 77 77 \n", + "3 77 77 \n", + "4 77 77 \n", + "5 77 77 \n", + "6 77 77 \n", + "7 77 77 \n", + "8 77 77 \n", + "9 77 77 \n", + "10 77 77 \n", + "11 77 77 \n", + "12 77 77 \n", + "13 77 77 \n", + "14 77 77 \n", + "15 77 77 \n", + "16 77 77 \n", + "17 77 77 \n", + "18 77 77 \n", + "19 77 77 \n", + "20 77 77 \n", + "21 77 77 \n", + "22 77 77 \n", + "23 77 77 \n", + "24 77 77 \n", + "25 77 77 \n", + "26 77 77 \n", + "27 77 77 \n", + "28 77 77 \n", + "29 77 77 \n", + "30 77 77 \n", + "31 77 77 \n", + "32 77 77 \n", + "33 77 77 \n", + "34 77 77 \n", + "35 77 77 \n", + "36 77 77 \n", + "37 77 
77 \n", + "38 77 77 \n", + "39 77 77 \n", + "40 77 77 \n", + "41 77 77 \n", + "42 77 77 \n", + "43 77 77 \n", + "44 77 77 \n", + "45 77 77 \n", + "46 77 77 \n", + "47 77 77 \n", + "48 77 77 \n", + "49 77 77 \n", + "50 77 77 \n", + "51 77 77 \n", + "52 77 77 \n", + "53 77 77 \n", + "54 77 77 \n", + "55 77 77 " + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print(\"To view the TF-TG regulatory links for the optimal TFs, please note that we can access this:\")\n", + "netrem_1a.combined_df" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "2c4f3853", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'info': 'NetREm Model',\n", + " 'alpha_lasso': 0.01,\n", + " 'beta_net': 1,\n", + " 'y_intercept': False,\n", + " 'model_type': 'Lasso',\n", + " 'max_lasso_iterations': 10000,\n", + " 'network': ,\n", + " 'verbose': False,\n", + " 'all_pos_coefs': False,\n", + " 'model_info': 'fitted_model :)',\n", + " 'target_gene_y': 'ZZZ3',\n", + " 'tolerance': 0.0001,\n", + " 'lasso_selection': 'cyclic'}" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_1a.get_params()" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "1146999b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dict_keys(['info', 'verbose', 'overlapped_nodes_only', 'num_cv_folds', 'num_jobs', 'all_pos_coefs', 'model_type', 'use_network', 'y_intercept', 'max_lasso_iterations', 'view_network', 'model_info', 'target_gene_y', 'tolerance', 'lasso_selection', 'lassocv_eps', 'lassocv_n_alphas', 'lassocv_alphas', 'beta_net', 'network', 'alpha_lasso', 'optimal_alpha', 'prior_network', 'preprocessed_network', 'network_params', 'network_nodes_list', 'kwargs', 'X_df', 'gene_expression_nodes', 'common_nodes', 'final_nodes', 'gexpr_nodes_added', 'gexpr_nodes_to_add_for_net', 'filter_network_bool', 'A_df', 'A', 'nodes', 
'network_info', 'M', 'N', 'X_train', 'y_train', 'B_train', 'B_interaction_df', 'B_train_times_M', 'X_tilda_train', 'y_tilda_train', 'X_training_to_use', 'y_training_to_use', 'regr', 'final_alpha', 'coef', 'predY_tilda_train', 'mse_tilda_train', 'predY_train', 'mse_train', 'model_coef_df', 'model_nonzero_coef_df', 'sorted_coef_df', 'corr_vs_coef_df', 'final_corr_vs_coef_df', 'combined_df', 'num_final_predictors'])" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vars(netrem_1a).keys() # to view all of the keys we may call" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "bfdba727", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) We can view the sorted B-matrix of TF-TF interaction for N = 77 TFs for TG ZZZ3 where beta_net = 1.\n" + ] + } + ], + "source": [ + "print(f\":) We can view the sorted B-matrix of TF-TF interaction for N = {num_candidate_TFs_for_TG} TFs \", end=\"\")\n", + "print(f\"for TG {tg} where beta_net = {netrem_1a.beta_net}.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "180439d9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
| | TF1 | TF2 | B_train_weight | sign | potential_interaction | absVal_B | info | candidate_TFs_N | target_gene_y | num_final_predictors | model_type | beta_net | gene_data | rank | percentile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2092 | FOXO1 | NFIB | 1.119274e+00 | :) | :( | 1.119274e+00 | B matrix of TF-TF interactions | 77 | ZZZ3 | 55 | Lasso | 1 | training gene expression data | 1.0 | 99.982912 |
| 1028 | NFIB | FOXO1 | 1.119274e+00 | :) | :( | 1.119274e+00 | B matrix of TF-TF interactions | 77 | ZZZ3 | 55 | Lasso | 1 | training gene expression data | 1.0 | 99.982912 |
| 4416 | NFIB | SREBF2 | 9.570243e-01 | :) | :( | 9.570243e-01 | B matrix of TF-TF interactions | 77 | ZZZ3 | 55 | Lasso | 1 | training gene expression data | 3.0 | 99.948735 |
| 2136 | SREBF2 | NFIB | 9.570243e-01 | :) | :( | 9.570243e-01 | B matrix of TF-TF interactions | 77 | ZZZ3 | 55 | Lasso | 1 | training gene expression data | 3.0 | 99.948735 |
| 1046 | RORA | FOXO1 | 9.287842e-01 | :) | :( | 9.287842e-01 | B matrix of TF-TF interactions | 77 | ZZZ3 | 55 | Lasso | 1 | training gene expression data | 5.0 | 99.914559 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3834 | TCF3 | RXRB | -1.163842e-06 | :( | :( competitive (-) | 1.163842e-06 | B matrix of TF-TF interactions | 77 | ZZZ3 | 55 | Lasso | 1 | training gene expression data | 5847.0 | 0.085441 |
| 960 | PML | ESRRA | -9.827955e-07 | :( | :( competitive (-) | 9.827955e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 55 | Lasso | 1 | training gene expression data | 5849.0 | 0.051265 |
| 2784 | ESRRA | PML | -9.827955e-07 | :( | :( competitive (-) | 9.827955e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 55 | Lasso | 1 | training gene expression data | 5850.0 | 0.034176 |
| 4708 | ESR2 | TCF3 | -9.284305e-07 | :( | :( competitive (-) | 9.284305e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 55 | Lasso | 1 | training gene expression data | 5851.0 | 0.017088 |
| 908 | TCF3 | ESR2 | -9.284305e-07 | :( | :( competitive (-) | 9.284305e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 55 | Lasso | 1 | training gene expression data | 5851.0 | 0.017088 |
\n", + "

5852 rows × 15 columns

\n", + "
" + ], + "text/plain": [ + " TF1 TF2 B_train_weight sign potential_interaction absVal_B \\\n", + "2092 FOXO1 NFIB 1.119274e+00 :) :( 1.119274e+00 \n", + "1028 NFIB FOXO1 1.119274e+00 :) :( 1.119274e+00 \n", + "4416 NFIB SREBF2 9.570243e-01 :) :( 9.570243e-01 \n", + "2136 SREBF2 NFIB 9.570243e-01 :) :( 9.570243e-01 \n", + "1046 RORA FOXO1 9.287842e-01 :) :( 9.287842e-01 \n", + "... ... ... ... ... ... ... \n", + "3834 TCF3 RXRB -1.163842e-06 :( :( competitive (-) 1.163842e-06 \n", + "960 PML ESRRA -9.827955e-07 :( :( competitive (-) 9.827955e-07 \n", + "2784 ESRRA PML -9.827955e-07 :( :( competitive (-) 9.827955e-07 \n", + "4708 ESR2 TCF3 -9.284305e-07 :( :( competitive (-) 9.284305e-07 \n", + "908 TCF3 ESR2 -9.284305e-07 :( :( competitive (-) 9.284305e-07 \n", + "\n", + " info candidate_TFs_N target_gene_y \\\n", + "2092 B matrix of TF-TF interactions 77 ZZZ3 \n", + "1028 B matrix of TF-TF interactions 77 ZZZ3 \n", + "4416 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2136 B matrix of TF-TF interactions 77 ZZZ3 \n", + "1046 B matrix of TF-TF interactions 77 ZZZ3 \n", + "... ... ... ... \n", + "3834 B matrix of TF-TF interactions 77 ZZZ3 \n", + "960 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2784 B matrix of TF-TF interactions 77 ZZZ3 \n", + "4708 B matrix of TF-TF interactions 77 ZZZ3 \n", + "908 B matrix of TF-TF interactions 77 ZZZ3 \n", + "\n", + " num_final_predictors model_type beta_net \\\n", + "2092 55 Lasso 1 \n", + "1028 55 Lasso 1 \n", + "4416 55 Lasso 1 \n", + "2136 55 Lasso 1 \n", + "1046 55 Lasso 1 \n", + "... ... ... ... 
\n", + "3834 55 Lasso 1 \n", + "960 55 Lasso 1 \n", + "2784 55 Lasso 1 \n", + "4708 55 Lasso 1 \n", + "908 55 Lasso 1 \n", + "\n", + " gene_data rank percentile \n", + "2092 training gene expression data 1.0 99.982912 \n", + "1028 training gene expression data 1.0 99.982912 \n", + "4416 training gene expression data 3.0 99.948735 \n", + "2136 training gene expression data 3.0 99.948735 \n", + "1046 training gene expression data 5.0 99.914559 \n", + "... ... ... ... \n", + "3834 training gene expression data 5847.0 0.085441 \n", + "960 training gene expression data 5849.0 0.051265 \n", + "2784 training gene expression data 5850.0 0.034176 \n", + "4708 training gene expression data 5851.0 0.017088 \n", + "908 training gene expression data 5851.0 0.017088 \n", + "\n", + "[5852 rows x 15 columns]" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b_matrix_1a = nm.organize_B_interaction_network(netrem_1a)\n", + "b_matrix_1a" + ] + }, + { + "cell_type": "markdown", + "id": "c28a9769", + "metadata": {}, + "source": [ + "Please note that the original B matrix provided by NetREm for the above dataframe can be accessed as:" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "09ec182e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
| | BACH1 | BCL6 | CCNT2 | CTCF | E2F3 | E4F1 | EBF1 | EGR1 | ELF1 | ERF | ... | YY1 | ZBTB7A | ZFP28 | ZFP82 | ZNF136 | ZNF140 | ZNF274 | ZNF281 | ZNF682 | ZNF76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BACH1 | 1.018840 | 0.279252 | 0.187354 | 0.188082 | 0.185729 | 0.066060 | 0.100796 | -0.000003 | 0.382366 | 0.012201 | ... | 0.240004 | 0.185409 | 0.050625 | 0.089965 | 0.033797 | 0.060643 | 0.106928 | 0.021657 | 0.131355 | 0.167068 |
| BCL6 | 0.279252 | 0.917320 | 0.135990 | 0.201410 | 0.131881 | 0.057713 | 0.124561 | 0.063045 | 0.302283 | 0.014507 | ... | 0.151785 | 0.255913 | 0.007860 | 0.046615 | 0.068043 | 0.081546 | 0.108980 | 0.033585 | 0.053707 | 0.077397 |
| CCNT2 | 0.187354 | 0.135990 | 0.477574 | 0.076186 | 0.094889 | 0.015151 | 0.144770 | 0.009709 | 0.236614 | -0.000004 | ... | 0.131020 | 0.066754 | 0.012559 | 0.069382 | 0.009976 | 0.034434 | 0.054954 | 0.023842 | 0.042202 | 0.053667 |
| CTCF | 0.188082 | 0.201410 | 0.076186 | 0.512125 | 0.068305 | 0.014219 | 0.072445 | 0.009709 | 0.227757 | 0.031582 | ... | 0.138476 | 0.174857 | 0.022517 | 0.016793 | 0.037815 | 0.041924 | 0.027067 | 0.023247 | 0.035524 | 0.041152 |
| E2F3 | 0.185729 | 0.131881 | 0.094889 | 0.068305 | 0.357257 | 0.027738 | 0.034148 | -0.000005 | 0.164484 | 0.014507 | ... | 0.120595 | 0.112284 | -0.000007 | 0.044022 | 0.014526 | 0.031174 | 0.049703 | 0.002296 | 0.055261 | 0.068752 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZNF140 | 0.060643 | 0.081546 | 0.034434 | 0.041924 | 0.031174 | 0.015608 | 0.028469 | 0.009709 | 0.102075 | -0.000008 | ... | 0.067458 | 0.046184 | 0.002367 | 0.009461 | -0.000087 | 0.166808 | 0.019645 | 0.005863 | 0.020685 | 0.021863 |
| ZNF274 | 0.106928 | 0.108980 | 0.054954 | 0.027067 | 0.049703 | 0.007571 | 0.050144 | -0.000004 | 0.118489 | 0.014658 | ... | 0.042232 | 0.102910 | 0.002372 | 0.009022 | 0.025451 | 0.019645 | 0.317143 | 0.002299 | 0.040701 | 0.043983 |
| ZNF281 | 0.021657 | 0.033585 | 0.023842 | 0.023247 | 0.002296 | -0.000003 | 0.049474 | -0.000002 | 0.044545 | -0.000004 | ... | 0.021528 | 0.031265 | -0.000066 | 0.008675 | -0.000004 | 0.005863 | 0.002299 | 0.058355 | 0.019606 | 0.015731 |
| ZNF682 | 0.131355 | 0.053707 | 0.042202 | 0.035524 | 0.055261 | 0.010370 | 0.026068 | -0.000205 | 0.135911 | -0.000435 | ... | 0.091661 | 0.074324 | -0.000754 | 0.020540 | 0.027598 | 0.020685 | 0.040701 | 0.019606 | 2.844038 | 0.035217 |
| ZNF76 | 0.167068 | 0.077397 | 0.053667 | 0.041152 | 0.068752 | 0.019301 | 0.046529 | -0.000006 | 0.060648 | -0.000014 | ... | 0.070376 | 0.096865 | 0.007827 | 0.007879 | 0.006081 | 0.021863 | 0.043983 | 0.015731 | 0.035217 | 0.227348 |
\n", + "

77 rows × 77 columns

\n", + "
" + ], + "text/plain": [ + " BACH1 BCL6 CCNT2 CTCF E2F3 E4F1 EBF1 \\\n", + "BACH1 1.018840 0.279252 0.187354 0.188082 0.185729 0.066060 0.100796 \n", + "BCL6 0.279252 0.917320 0.135990 0.201410 0.131881 0.057713 0.124561 \n", + "CCNT2 0.187354 0.135990 0.477574 0.076186 0.094889 0.015151 0.144770 \n", + "CTCF 0.188082 0.201410 0.076186 0.512125 0.068305 0.014219 0.072445 \n", + "E2F3 0.185729 0.131881 0.094889 0.068305 0.357257 0.027738 0.034148 \n", + "... ... ... ... ... ... ... ... \n", + "ZNF140 0.060643 0.081546 0.034434 0.041924 0.031174 0.015608 0.028469 \n", + "ZNF274 0.106928 0.108980 0.054954 0.027067 0.049703 0.007571 0.050144 \n", + "ZNF281 0.021657 0.033585 0.023842 0.023247 0.002296 -0.000003 0.049474 \n", + "ZNF682 0.131355 0.053707 0.042202 0.035524 0.055261 0.010370 0.026068 \n", + "ZNF76 0.167068 0.077397 0.053667 0.041152 0.068752 0.019301 0.046529 \n", + "\n", + " EGR1 ELF1 ERF ... YY1 ZBTB7A ZFP28 \\\n", + "BACH1 -0.000003 0.382366 0.012201 ... 0.240004 0.185409 0.050625 \n", + "BCL6 0.063045 0.302283 0.014507 ... 0.151785 0.255913 0.007860 \n", + "CCNT2 0.009709 0.236614 -0.000004 ... 0.131020 0.066754 0.012559 \n", + "CTCF 0.009709 0.227757 0.031582 ... 0.138476 0.174857 0.022517 \n", + "E2F3 -0.000005 0.164484 0.014507 ... 0.120595 0.112284 -0.000007 \n", + "... ... ... ... ... ... ... ... \n", + "ZNF140 0.009709 0.102075 -0.000008 ... 0.067458 0.046184 0.002367 \n", + "ZNF274 -0.000004 0.118489 0.014658 ... 0.042232 0.102910 0.002372 \n", + "ZNF281 -0.000002 0.044545 -0.000004 ... 0.021528 0.031265 -0.000066 \n", + "ZNF682 -0.000205 0.135911 -0.000435 ... 0.091661 0.074324 -0.000754 \n", + "ZNF76 -0.000006 0.060648 -0.000014 ... 
0.070376 0.096865 0.007827 \n", + "\n", + " ZFP82 ZNF136 ZNF140 ZNF274 ZNF281 ZNF682 ZNF76 \n", + "BACH1 0.089965 0.033797 0.060643 0.106928 0.021657 0.131355 0.167068 \n", + "BCL6 0.046615 0.068043 0.081546 0.108980 0.033585 0.053707 0.077397 \n", + "CCNT2 0.069382 0.009976 0.034434 0.054954 0.023842 0.042202 0.053667 \n", + "CTCF 0.016793 0.037815 0.041924 0.027067 0.023247 0.035524 0.041152 \n", + "E2F3 0.044022 0.014526 0.031174 0.049703 0.002296 0.055261 0.068752 \n", + "... ... ... ... ... ... ... ... \n", + "ZNF140 0.009461 -0.000087 0.166808 0.019645 0.005863 0.020685 0.021863 \n", + "ZNF274 0.009022 0.025451 0.019645 0.317143 0.002299 0.040701 0.043983 \n", + "ZNF281 0.008675 -0.000004 0.005863 0.002299 0.058355 0.019606 0.015731 \n", + "ZNF682 0.020540 0.027598 0.020685 0.040701 0.019606 2.844038 0.035217 \n", + "ZNF76 0.007879 0.006081 0.021863 0.043983 0.015731 0.035217 0.227348 \n", + "\n", + "[77 rows x 77 columns]" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_1a.B_interaction_df" + ] + }, + { + "cell_type": "markdown", + "id": "b586e2e4", + "metadata": {}, + "source": [ + "### Example 1b: \n", + "#### Use the default *beta_net* $\\beta_{net}$ but use LassoCV to help find the optimal *alpha_lasso* $\\alpha_{lasso}$ via Cross-Validation." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "717c7c66", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "using beta_net default of 1\n", + "# of TFs with non-zero coefficients: 56\n", + "Training MSE: 0.4005184771488405\n", + "Testing MSE: 0.7104270905888238\n", + "CPU times: total: 15.6 ms\n", + "Wall time: 169 ms\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "
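| | y_intercept | BACH1 | CCNT2 | CTCF | E2F3 | E4F1 | EBF1 | ELF1 | ERF | ESR2 | ... | TBX2 | TCF3 | TP53 | USF1 | USF2 | YY1 | ZBTB7A | ZNF140 | ZNF682 | ZNF76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 0.091453 | 0.050524 | -0.059331 | 0.183201 | -0.21609 | -0.035585 | 0.121124 | -0.09812 | -0.053123 | ... | 0.009692 | -0.129406 | -0.106756 | 0.0492 | -0.001734 | 0.108942 | -0.036416 | 0.095209 | 0.002629 | 0.143346 |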
\n", + "

1 rows × 57 columns

\n", + "
" + ], + "text/plain": [ + " y_intercept BACH1 CCNT2 CTCF E2F3 E4F1 EBF1 \\\n", + "0 None 0.091453 0.050524 -0.059331 0.183201 -0.21609 -0.035585 \n", + "\n", + " ELF1 ERF ESR2 ... TBX2 TCF3 TP53 USF1 \\\n", + "0 0.121124 -0.09812 -0.053123 ... 0.009692 -0.129406 -0.106756 0.0492 \n", + "\n", + " USF2 YY1 ZBTB7A ZNF140 ZNF682 ZNF76 \n", + "0 -0.001734 0.108942 -0.036416 0.095209 0.002629 0.143346 \n", + "\n", + "[1 rows x 57 columns]" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time \n", + "# Using defaults for beta and alpha:\n", + "netrem_1b = nm.netrem(edge_list = filtered_ppi_for_TG,\n", + " model_type = \"LassoCV\")\n", + "\n", + "# Fitting the NetREm model on training data: X_train and y_train:\n", + "netrem_1b.fit(X_train, y_train)\n", + "\n", + "# Analyzing the fitted NetREm model\n", + "final_model_1b = netrem_1b.model_nonzero_coef_df\n", + "print(f\"# of TFs with non-zero coefficients: {netrem_1b.num_final_predictors}\")\n", + "mse_train = netrem_1b.test_mse(X_train, y_train)\n", + "mse_test = netrem_1b.test_mse(X_test, y_test)\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")\n", + "final_model_1b\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "cb233f0c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'info': 'NetREm Model',\n", + " 'alpha_lasso': 'LassoCV finds optimal alpha',\n", + " 'beta_net': 1,\n", + " 'y_intercept': False,\n", + " 'model_type': 'LassoCV',\n", + " 'max_lasso_iterations': 10000,\n", + " 'network': ,\n", + " 'verbose': False,\n", + " 'all_pos_coefs': False,\n", + " 'model_info': 'fitted_model :)',\n", + " 'target_gene_y': 'ZZZ3',\n", + " 'num_cv_folds': 5,\n", + " 'num_jobs': -1,\n", + " 'lassocv_eps': 0.001,\n", + " 'lassocv_n_alphas': 100,\n", + " 'lassocv_alphas': None,\n", + " 'optimal_alpha': 'Cross-Validation optimal alpha lasso: 0.009564951400513843',\n", + " 'tolerance': 
0.0001,\n", + " 'lasso_selection': 'cyclic'}" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_1b.get_params()" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "b5d9634e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
NetREmModel(verbose=False, overlapped_nodes_only=False, num_cv_folds=5, num_jobs=-1, all_pos_coefs=False, model_type=LassoCV, use_network=True, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, lassocv_eps=0.001, lassocv_n_alphas=100, lassocv_alphas=None, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C46C5CF40>, alpha_lasso=LassoCV finds optimal alpha)
" + ], + "text/plain": [ + "NetREmModel(verbose=False, overlapped_nodes_only=False, num_cv_folds=5, num_jobs=-1, all_pos_coefs=False, model_type=LassoCV, use_network=True, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, lassocv_eps=0.001, lassocv_n_alphas=100, lassocv_alphas=None, network=, alpha_lasso=LassoCV finds optimal alpha)" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_1b" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "29f467d5", + "metadata": {}, + "source": [ + "![netrem_1b.png](../user_guide/pics/netrem_1b.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "0e675c2c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) We can view the sorted B-matrix of TF-TF interaction for N = 77 TFs for TG ZZZ3 where beta_net = 1.\n" + ] + } + ], + "source": [ + "print(f\":) We can view the sorted B-matrix of TF-TF interaction for N = {num_candidate_TFs_for_TG} TFs \", end=\"\")\n", + "print(f\"for TG {tg} where beta_net = {netrem_1b.beta_net}.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "74108fae", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
| | TF1 | TF2 | B_train_weight | sign | potential_interaction | absVal_B | info | candidate_TFs_N | target_gene_y | num_final_predictors | model_type | beta_net | gene_data | rank | percentile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2092 | FOXO1 | NFIB | 1.119274e+00 | :) | :( | 1.119274e+00 | B matrix of TF-TF interactions | 77 | ZZZ3 | 56 | LassoCV | 1 | training gene expression data | 1.0 | 99.982912 |
| 1028 | NFIB | FOXO1 | 1.119274e+00 | :) | :( | 1.119274e+00 | B matrix of TF-TF interactions | 77 | ZZZ3 | 56 | LassoCV | 1 | training gene expression data | 1.0 | 99.982912 |
| 4416 | NFIB | SREBF2 | 9.570243e-01 | :) | :( | 9.570243e-01 | B matrix of TF-TF interactions | 77 | ZZZ3 | 56 | LassoCV | 1 | training gene expression data | 3.0 | 99.948735 |
| 2136 | SREBF2 | NFIB | 9.570243e-01 | :) | :( | 9.570243e-01 | B matrix of TF-TF interactions | 77 | ZZZ3 | 56 | LassoCV | 1 | training gene expression data | 3.0 | 99.948735 |
| 1046 | RORA | FOXO1 | 9.287842e-01 | :) | :( | 9.287842e-01 | B matrix of TF-TF interactions | 77 | ZZZ3 | 56 | LassoCV | 1 | training gene expression data | 5.0 | 99.914559 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3834 | TCF3 | RXRB | -1.163842e-06 | :( | :( competitive (-) | 1.163842e-06 | B matrix of TF-TF interactions | 77 | ZZZ3 | 56 | LassoCV | 1 | training gene expression data | 5847.0 | 0.085441 |
| 960 | PML | ESRRA | -9.827955e-07 | :( | :( competitive (-) | 9.827955e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 56 | LassoCV | 1 | training gene expression data | 5849.0 | 0.051265 |
| 2784 | ESRRA | PML | -9.827955e-07 | :( | :( competitive (-) | 9.827955e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 56 | LassoCV | 1 | training gene expression data | 5850.0 | 0.034176 |
| 4708 | ESR2 | TCF3 | -9.284305e-07 | :( | :( competitive (-) | 9.284305e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 56 | LassoCV | 1 | training gene expression data | 5851.0 | 0.017088 |
| 908 | TCF3 | ESR2 | -9.284305e-07 | :( | :( competitive (-) | 9.284305e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 56 | LassoCV | 1 | training gene expression data | 5851.0 | 0.017088 |

5852 rows × 15 columns
" + ], + "text/plain": [ + " TF1 TF2 B_train_weight sign potential_interaction absVal_B \\\n", + "2092 FOXO1 NFIB 1.119274e+00 :) :( 1.119274e+00 \n", + "1028 NFIB FOXO1 1.119274e+00 :) :( 1.119274e+00 \n", + "4416 NFIB SREBF2 9.570243e-01 :) :( 9.570243e-01 \n", + "2136 SREBF2 NFIB 9.570243e-01 :) :( 9.570243e-01 \n", + "1046 RORA FOXO1 9.287842e-01 :) :( 9.287842e-01 \n", + "... ... ... ... ... ... ... \n", + "3834 TCF3 RXRB -1.163842e-06 :( :( competitive (-) 1.163842e-06 \n", + "960 PML ESRRA -9.827955e-07 :( :( competitive (-) 9.827955e-07 \n", + "2784 ESRRA PML -9.827955e-07 :( :( competitive (-) 9.827955e-07 \n", + "4708 ESR2 TCF3 -9.284305e-07 :( :( competitive (-) 9.284305e-07 \n", + "908 TCF3 ESR2 -9.284305e-07 :( :( competitive (-) 9.284305e-07 \n", + "\n", + " info candidate_TFs_N target_gene_y \\\n", + "2092 B matrix of TF-TF interactions 77 ZZZ3 \n", + "1028 B matrix of TF-TF interactions 77 ZZZ3 \n", + "4416 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2136 B matrix of TF-TF interactions 77 ZZZ3 \n", + "1046 B matrix of TF-TF interactions 77 ZZZ3 \n", + "... ... ... ... \n", + "3834 B matrix of TF-TF interactions 77 ZZZ3 \n", + "960 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2784 B matrix of TF-TF interactions 77 ZZZ3 \n", + "4708 B matrix of TF-TF interactions 77 ZZZ3 \n", + "908 B matrix of TF-TF interactions 77 ZZZ3 \n", + "\n", + " num_final_predictors model_type beta_net \\\n", + "2092 56 LassoCV 1 \n", + "1028 56 LassoCV 1 \n", + "4416 56 LassoCV 1 \n", + "2136 56 LassoCV 1 \n", + "1046 56 LassoCV 1 \n", + "... ... ... ... 
\n", + "3834 56 LassoCV 1 \n", + "960 56 LassoCV 1 \n", + "2784 56 LassoCV 1 \n", + "4708 56 LassoCV 1 \n", + "908 56 LassoCV 1 \n", + "\n", + " gene_data rank percentile \n", + "2092 training gene expression data 1.0 99.982912 \n", + "1028 training gene expression data 1.0 99.982912 \n", + "4416 training gene expression data 3.0 99.948735 \n", + "2136 training gene expression data 3.0 99.948735 \n", + "1046 training gene expression data 5.0 99.914559 \n", + "... ... ... ... \n", + "3834 training gene expression data 5847.0 0.085441 \n", + "960 training gene expression data 5849.0 0.051265 \n", + "2784 training gene expression data 5850.0 0.034176 \n", + "4708 training gene expression data 5851.0 0.017088 \n", + "908 training gene expression data 5851.0 0.017088 \n", + "\n", + "[5852 rows x 15 columns]" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b_matrix_1b = nm.organize_B_interaction_network(netrem_1b)\n", + "b_matrix_1b" + ] + }, + { + "cell_type": "markdown", + "id": "0bd40de1", + "metadata": {}, + "source": [ + "### Example 1c\n", + "#### Using *netremCV* to determine the optimal alpha and beta values via cross-validation approaches :). 
This function may be computationally and time-intensive:" + ] + }, + { + "cell_type": "markdown", + "id": "4b901ca1", + "metadata": {}, + "source": [ + "#### Option 1\n", + "Option 1 is RandomizedSearchCV, which is more efficient given time constraints" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "7bc702c4", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) using variance to define beta_net values\n", + "beta_min = 0.14322421928177426 and beta_max = 14.322421928177425\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "5d8107af6c6a42f6b5130718c2d95e98", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + ":) Generating beta_net and alpha_lasso pairs: 0%| | 0/50 [00:00
NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=11.868171420559674, alpha_lasso=0.009564951400513843, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C46B2AEF0>)
" + ], + "text/plain": [ + "NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=11.868171420559674, alpha_lasso=0.009564951400513843, network=)" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netCV_ex1" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "0e42a122", + "metadata": {}, + "source": [ + "![netCV_ex1.png](../user_guide/pics/netCV_ex1.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "48738d8c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'info': 'NetREm Model',\n", + " 'alpha_lasso': 0.009564951400513843,\n", + " 'beta_net': 11.868171420559674,\n", + " 'y_intercept': False,\n", + " 'model_type': 'Lasso',\n", + " 'max_lasso_iterations': 10000,\n", + " 'network': ,\n", + " 'verbose': False,\n", + " 'all_pos_coefs': False,\n", + " 'model_info': 'fitted_model :)',\n", + " 'target_gene_y': 'ZZZ3',\n", + " 'tolerance': 0.0001,\n", + " 'lasso_selection': 'cyclic'}" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netCV_ex1.get_params()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "967f1b96", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training MSE: 0.40382941437204656\n", + "Testing MSE: 0.7021367157493694\n" + ] + } + ], + "source": [ + "mse_train = netCV_ex1.test_mse(X_train, y_train)\n", + "mse_test = netCV_ex1.test_mse(X_test, y_test)\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "0d32d652", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
| | info | input_data | BACH1 | BCL6 | CCNT2 | CTCF | E2F3 | E4F1 | EBF1 | ELF1 | ... | TBX2 | TCF3 | TP53 | USF1 | USF2 | YY1 | ZBTB7A | ZNF140 | ZNF682 | ZNF76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | network regression coeff. with y: ZZZ3 | X_train | 0.09249 | 0.000728 | 0.050554 | -0.059416 | 0.180268 | -0.197291 | -0.033258 | 0.119244 | ... | 0.006187 | -0.128818 | -0.103156 | 0.04475 | -0.000138 | 0.105222 | -0.035582 | 0.088997 | 0.000426 | 0.119169 |
| 0 | corr (r) with y: ZZZ3 | X_train | 0.191081 | 0.028315 | 0.134249 | -0.053487 | 0.160718 | -0.046865 | -0.04716 | 0.207914 | ... | 0.026626 | -0.149349 | -0.065173 | 0.085353 | -0.025491 | 0.119843 | -0.124585 | 0.132537 | 0.129738 | 0.124608 |
| 0 | Absolute Value NetREm Coefficient Ranking | X_train | 16 | 54 | 29 | 25 | 3 | 1 | 37 | 7 | ... | 52 | 5 | 14 | 32 | 57 | 11 | 35 | 18 | 56 | 8 |

3 rows × 59 columns
" + ], + "text/plain": [ + " info input_data BACH1 BCL6 \\\n", + "0 network regression coeff. with y: ZZZ3 X_train 0.09249 0.000728 \n", + "0 corr (r) with y: ZZZ3 X_train 0.191081 0.028315 \n", + "0 Absolute Value NetREm Coefficient Ranking X_train 16 54 \n", + "\n", + " CCNT2 CTCF E2F3 E4F1 EBF1 ELF1 ... TBX2 \\\n", + "0 0.050554 -0.059416 0.180268 -0.197291 -0.033258 0.119244 ... 0.006187 \n", + "0 0.134249 -0.053487 0.160718 -0.046865 -0.04716 0.207914 ... 0.026626 \n", + "0 29 25 3 1 37 7 ... 52 \n", + "\n", + " TCF3 TP53 USF1 USF2 YY1 ZBTB7A ZNF140 \\\n", + "0 -0.128818 -0.103156 0.04475 -0.000138 0.105222 -0.035582 0.088997 \n", + "0 -0.149349 -0.065173 0.085353 -0.025491 0.119843 -0.124585 0.132537 \n", + "0 5 14 32 57 11 35 18 \n", + "\n", + " ZNF682 ZNF76 \n", + "0 0.000426 0.119169 \n", + "0 0.129738 0.124608 \n", + "0 56 8 \n", + "\n", + "[3 rows x 59 columns]" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netCV_ex1.final_corr_vs_coef_df" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "647fee07", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
| | TF1 | TF2 | B_train_weight | sign | potential_interaction | absVal_B | info | candidate_TFs_N | target_gene_y | num_final_predictors | model_type | beta_net | gene_data | rank | percentile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2092 | FOXO1 | NFIB | 1.119190 | :) | :( | 1.119190 | B matrix of TF-TF interactions | 77 | ZZZ3 | 57 | Lasso | 11.868171 | training gene expression data | 1.0 | 99.982912 |
| 1028 | NFIB | FOXO1 | 1.119190 | :) | :( | 1.119190 | B matrix of TF-TF interactions | 77 | ZZZ3 | 57 | Lasso | 11.868171 | training gene expression data | 1.0 | 99.982912 |
| 4416 | NFIB | SREBF2 | 0.956982 | :) | :( | 0.956982 | B matrix of TF-TF interactions | 77 | ZZZ3 | 57 | Lasso | 11.868171 | training gene expression data | 3.0 | 99.948735 |
| 2136 | SREBF2 | NFIB | 0.956982 | :) | :( | 0.956982 | B matrix of TF-TF interactions | 77 | ZZZ3 | 57 | Lasso | 11.868171 | training gene expression data | 3.0 | 99.948735 |
| 1046 | RORA | FOXO1 | 0.928712 | :) | :( | 0.928712 | B matrix of TF-TF interactions | 77 | ZZZ3 | 57 | Lasso | 11.868171 | training gene expression data | 5.0 | 99.914559 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3834 | TCF3 | RXRB | -0.000014 | :( | :( competitive (-) | 0.000014 | B matrix of TF-TF interactions | 77 | ZZZ3 | 57 | Lasso | 11.868171 | training gene expression data | 5848.0 | 0.068353 |
| 960 | PML | ESRRA | -0.000012 | :( | :( competitive (-) | 0.000012 | B matrix of TF-TF interactions | 77 | ZZZ3 | 57 | Lasso | 11.868171 | training gene expression data | 5849.0 | 0.051265 |
| 2784 | ESRRA | PML | -0.000012 | :( | :( competitive (-) | 0.000012 | B matrix of TF-TF interactions | 77 | ZZZ3 | 57 | Lasso | 11.868171 | training gene expression data | 5850.0 | 0.034176 |
| 908 | TCF3 | ESR2 | -0.000011 | :( | :( competitive (-) | 0.000011 | B matrix of TF-TF interactions | 77 | ZZZ3 | 57 | Lasso | 11.868171 | training gene expression data | 5851.0 | 0.017088 |
| 4708 | ESR2 | TCF3 | -0.000011 | :( | :( competitive (-) | 0.000011 | B matrix of TF-TF interactions | 77 | ZZZ3 | 57 | Lasso | 11.868171 | training gene expression data | 5851.0 | 0.017088 |

5852 rows × 15 columns
" + ], + "text/plain": [ + " TF1 TF2 B_train_weight sign potential_interaction absVal_B \\\n", + "2092 FOXO1 NFIB 1.119190 :) :( 1.119190 \n", + "1028 NFIB FOXO1 1.119190 :) :( 1.119190 \n", + "4416 NFIB SREBF2 0.956982 :) :( 0.956982 \n", + "2136 SREBF2 NFIB 0.956982 :) :( 0.956982 \n", + "1046 RORA FOXO1 0.928712 :) :( 0.928712 \n", + "... ... ... ... ... ... ... \n", + "3834 TCF3 RXRB -0.000014 :( :( competitive (-) 0.000014 \n", + "960 PML ESRRA -0.000012 :( :( competitive (-) 0.000012 \n", + "2784 ESRRA PML -0.000012 :( :( competitive (-) 0.000012 \n", + "908 TCF3 ESR2 -0.000011 :( :( competitive (-) 0.000011 \n", + "4708 ESR2 TCF3 -0.000011 :( :( competitive (-) 0.000011 \n", + "\n", + " info candidate_TFs_N target_gene_y \\\n", + "2092 B matrix of TF-TF interactions 77 ZZZ3 \n", + "1028 B matrix of TF-TF interactions 77 ZZZ3 \n", + "4416 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2136 B matrix of TF-TF interactions 77 ZZZ3 \n", + "1046 B matrix of TF-TF interactions 77 ZZZ3 \n", + "... ... ... ... \n", + "3834 B matrix of TF-TF interactions 77 ZZZ3 \n", + "960 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2784 B matrix of TF-TF interactions 77 ZZZ3 \n", + "908 B matrix of TF-TF interactions 77 ZZZ3 \n", + "4708 B matrix of TF-TF interactions 77 ZZZ3 \n", + "\n", + " num_final_predictors model_type beta_net \\\n", + "2092 57 Lasso 11.868171 \n", + "1028 57 Lasso 11.868171 \n", + "4416 57 Lasso 11.868171 \n", + "2136 57 Lasso 11.868171 \n", + "1046 57 Lasso 11.868171 \n", + "... ... ... ... 
\n", + "3834 57 Lasso 11.868171 \n", + "960 57 Lasso 11.868171 \n", + "2784 57 Lasso 11.868171 \n", + "908 57 Lasso 11.868171 \n", + "4708 57 Lasso 11.868171 \n", + "\n", + " gene_data rank percentile \n", + "2092 training gene expression data 1.0 99.982912 \n", + "1028 training gene expression data 1.0 99.982912 \n", + "4416 training gene expression data 3.0 99.948735 \n", + "2136 training gene expression data 3.0 99.948735 \n", + "1046 training gene expression data 5.0 99.914559 \n", + "... ... ... ... \n", + "3834 training gene expression data 5848.0 0.068353 \n", + "960 training gene expression data 5849.0 0.051265 \n", + "2784 training gene expression data 5850.0 0.034176 \n", + "908 training gene expression data 5851.0 0.017088 \n", + "4708 training gene expression data 5851.0 0.017088 \n", + "\n", + "[5852 rows x 15 columns]" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b_matrix_netremcv = nm.organize_B_interaction_network(netCV_ex1)\n", + "b_matrix_netremcv" + ] + }, + { + "cell_type": "markdown", + "id": "3def87b3", + "metadata": {}, + "source": [ + "#### Option 2:\n", + "\n", + "Option 2 is GridSearchCV, which is less efficient but more comprehensive:" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "6e915b26", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) using variance to define beta_net values\n", + "beta_min = 0.14322421928177426 and beta_max = 14.322421928177425\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "7cdc161663654a1b838aeb5afc5c0fc5", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + ":) Generating beta_net and alpha_lasso pairs: 0%| | 0/50 [00:00
NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=14.322421928177425, alpha_lasso=0.009564951400513843, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C46E874C0>)
" + ], + "text/plain": [ + "NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=14.322421928177425, alpha_lasso=0.009564951400513843, network=)" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time \n", + "\n", + "netCV_ex2 = nm.netremCV(edge_list = filtered_ppi_for_TG, X = X_train, y = y_train, searchVerbosity = 1)\n", + "netCV_ex2" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "958fba86", + "metadata": {}, + "source": [ + "![netCV_ex2.png](../user_guide/pics/netCV_ex2.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "c66c5f9a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) # of final TFs in the model for TG ZZZ3: 56\n", + "Training MSE: 0.40452577061665235\n", + "Testing MSE: 0.7004107017093232\n" + ] + } + ], + "source": [ + "print(f\":) # of final TFs in the model for TG {tg}: {netCV_ex2.num_final_predictors}\")\n", + "mse_train = netCV_ex2.test_mse(X_train, y_train)\n", + "mse_test = netCV_ex2.test_mse(X_test, y_test)\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "367d8aa0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
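| | coef | TF | TG | info | train_mse | beta_net | alpha_lasso | AbsoluteVal_coefficient | Rank | final_model_TFs | TFs_input_to_model | original_TFs_in_X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | y_intercept | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | NaN | 57 | 56 | 77 | 77 |
| 1 | 0.092587 | BACH1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.092587 | 16 | 56 | 77 | 77 |
| 2 | 0.00092 | BCL6 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.000920 | 54 | 56 | 77 | 77 |
| 3 | 0.050598 | CCNT2 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.050598 | 29 | 56 | 77 | 77 |
| 4 | -0.059361 | CTCF | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.059361 | 25 | 56 | 77 | 77 |
| 5 | 0.179477 | E2F3 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.179477 | 3 | 56 | 77 | 77 |
| 6 | -0.193458 | E4F1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.193458 | 2 | 56 | 77 | 77 |
| 7 | -0.032772 | EBF1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.032772 | 37 | 56 | 77 | 77 |
| 8 | 0.118892 | ELF1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.118892 | 6 | 56 | 77 | 77 |
| 9 | -0.079737 | ERF | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.079737 | 19 | 56 | 77 | 77 |
| 10 | -0.043963 | ESR2 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.043963 | 32 | 56 | 77 | 77 |
| 11 | -0.009189 | FOXO1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.009189 | 50 | 56 | 77 | 77 |
| 12 | 0.118487 | HCFC1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.118487 | 7 | 56 | 77 | 77 |
| 13 | 0.057439 | HDAC2 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.057439 | 26 | 56 | 77 | 77 |
| 14 | 0.109527 | IRF3 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.109527 | 10 | 56 | 77 | 77 |
| 15 | 0.014876 | IRF7 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.014876 | 46 | 56 | 77 | 77 |
| 16 | -0.068512 | KLF12 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.068512 | 23 | 56 | 77 | 77 |
| 17 | 0.012554 | KLF15 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.012554 | 47 | 56 | 77 | 77 |
| 18 | 0.021563 | MAF | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.021563 | 42 | 56 | 77 | 77 |
| 19 | 0.004856 | MAX | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.004856 | 53 | 56 | 77 | 77 |
| 20 | -0.103236 | MXI1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.103236 | 12 | 56 | 77 | 77 |
| 21 | 0.028747 | MYEF2 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.028747 | 40 | 56 | 77 | 77 |
| 22 | 0.099127 | NFIB | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.099127 | 15 | 56 | 77 | 77 |
| 23 | 0.033372 | NFIC | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.033372 | 36 | 56 | 77 | 77 |
| 24 | 0.169403 | NFKB1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.169403 | 4 | 56 | 77 | 77 |
| 25 | 0.10255 | NR1H2 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.102550 | 13 | 56 | 77 | 77 |
| 26 | -0.021933 | NR2F1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.021933 | 41 | 56 | 77 | 77 |
| 27 | 0.01755 | NR3C1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.017550 | 44 | 56 | 77 | 77 |
| 28 | 0.009942 | NR6A1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.009942 | 48 | 56 | 77 | 77 |
| 29 | 0.04056 | PLAG1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.040560 | 34 | 56 | 77 | 77 |
| 30 | -0.040835 | PML | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.040835 | 33 | 56 | 77 | 77 |
| 31 | -0.007102 | POU2F1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.007102 | 51 | 56 | 77 | 77 |
| 32 | 0.009594 | PPARA | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.009594 | 49 | 56 | 77 | 77 |
| 33 | 0.000353 | PPARD | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.000353 | 56 | 56 | 77 | 77 |
| 34 | 0.115948 | RARB | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.115948 | 8 | 56 | 77 | 77 |
| 35 | -0.060035 | RARG | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.060035 | 24 | 56 | 77 | 77 |
| 36 | 0.051793 | RFX3 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.051793 | 28 | 56 | 77 | 77 |
| 37 | -0.089946 | RORA | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.089946 | 17 | 56 | 77 | 77 |
| 38 | 0.053255 | RREB1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.053255 | 27 | 56 | 77 | 77 |
| 39 | 0.044984 | RUNX2 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.044984 | 30 | 56 | 77 | 77 |
| 40 | 0.194475 | SETDB1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.194475 | 1 | 56 | 77 | 77 |
| 41 | 0.069117 | SIN3A | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.069117 | 22 | 56 | 77 | 77 |
| 42 | -0.031496 | SMAD4 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.031496 | 39 | 56 | 77 | 77 |
| 43 | 0.074904 | SMC3 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.074904 | 21 | 56 | 77 | 77 |
| 44 | 0.02126 | SP1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.021260 | 43 | 56 | 77 | 77 |
| 45 | -0.01584 | SREBF2 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.015840 | 45 | 56 | 77 | 77 |
| 46 | 0.03191 | STAT1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.031910 | 38 | 56 | 77 | 77 |
| 47 | 0.076585 | STAT5B | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.076585 | 20 | 56 | 77 | 77 |
| 48 | 0.005611 | TBX2 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.005611 | 52 | 56 | 77 | 77 |
| 49 | -0.128551 | TCF3 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.128551 | 5 | 56 | 77 | 77 |
| 50 | -0.102327 | TP53 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.102327 | 14 | 56 | 77 | 77 |
| 51 | 0.044001 | USF1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.044001 | 31 | 56 | 77 | 77 |
| 52 | 0.104471 | YY1 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.104471 | 11 | 56 | 77 | 77 |
| 53 | -0.035432 | ZBTB7A | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.035432 | 35 | 56 | 77 | 77 |
| 54 | 0.08773 | ZNF140 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.087730 | 18 | 56 | 77 | 77 |
| 55 | 0.000388 | ZNF682 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.000388 | 55 | 56 | 77 | 77 |
| 56 | 0.114872 | ZNF76 | ZZZ3 | netrem_no_intercept | 0.404526 | 14.322422 | 0.009565 | 0.114872 | 9 | 56 | 77 | 77 |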
\n", + "
" + ], + "text/plain": [ + " coef TF TG info train_mse beta_net \\\n", + "0 None y_intercept ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "1 0.092587 BACH1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "2 0.00092 BCL6 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "3 0.050598 CCNT2 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "4 -0.059361 CTCF ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "5 0.179477 E2F3 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "6 -0.193458 E4F1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "7 -0.032772 EBF1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "8 0.118892 ELF1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "9 -0.079737 ERF ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "10 -0.043963 ESR2 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "11 -0.009189 FOXO1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "12 0.118487 HCFC1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "13 0.057439 HDAC2 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "14 0.109527 IRF3 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "15 0.014876 IRF7 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "16 -0.068512 KLF12 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "17 0.012554 KLF15 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "18 0.021563 MAF ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "19 0.004856 MAX ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "20 -0.103236 MXI1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "21 0.028747 MYEF2 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "22 0.099127 NFIB ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "23 0.033372 NFIC ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "24 0.169403 NFKB1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "25 0.10255 NR1H2 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "26 -0.021933 NR2F1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "27 0.01755 NR3C1 ZZZ3 netrem_no_intercept 0.404526 
14.322422 \n", + "28 0.009942 NR6A1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "29 0.04056 PLAG1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "30 -0.040835 PML ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "31 -0.007102 POU2F1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "32 0.009594 PPARA ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "33 0.000353 PPARD ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "34 0.115948 RARB ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "35 -0.060035 RARG ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "36 0.051793 RFX3 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "37 -0.089946 RORA ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "38 0.053255 RREB1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "39 0.044984 RUNX2 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "40 0.194475 SETDB1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "41 0.069117 SIN3A ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "42 -0.031496 SMAD4 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "43 0.074904 SMC3 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "44 0.02126 SP1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "45 -0.01584 SREBF2 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "46 0.03191 STAT1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "47 0.076585 STAT5B ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "48 0.005611 TBX2 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "49 -0.128551 TCF3 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "50 -0.102327 TP53 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "51 0.044001 USF1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "52 0.104471 YY1 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "53 -0.035432 ZBTB7A ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "54 0.08773 ZNF140 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "55 0.000388 ZNF682 ZZZ3 netrem_no_intercept 0.404526 14.322422 \n", + "56 0.114872 ZNF76 ZZZ3 
netrem_no_intercept 0.404526 14.322422 \n", + "\n", + " alpha_lasso AbsoluteVal_coefficient Rank final_model_TFs \\\n", + "0 0.009565 NaN 57 56 \n", + "1 0.009565 0.092587 16 56 \n", + "2 0.009565 0.000920 54 56 \n", + "3 0.009565 0.050598 29 56 \n", + "4 0.009565 0.059361 25 56 \n", + "5 0.009565 0.179477 3 56 \n", + "6 0.009565 0.193458 2 56 \n", + "7 0.009565 0.032772 37 56 \n", + "8 0.009565 0.118892 6 56 \n", + "9 0.009565 0.079737 19 56 \n", + "10 0.009565 0.043963 32 56 \n", + "11 0.009565 0.009189 50 56 \n", + "12 0.009565 0.118487 7 56 \n", + "13 0.009565 0.057439 26 56 \n", + "14 0.009565 0.109527 10 56 \n", + "15 0.009565 0.014876 46 56 \n", + "16 0.009565 0.068512 23 56 \n", + "17 0.009565 0.012554 47 56 \n", + "18 0.009565 0.021563 42 56 \n", + "19 0.009565 0.004856 53 56 \n", + "20 0.009565 0.103236 12 56 \n", + "21 0.009565 0.028747 40 56 \n", + "22 0.009565 0.099127 15 56 \n", + "23 0.009565 0.033372 36 56 \n", + "24 0.009565 0.169403 4 56 \n", + "25 0.009565 0.102550 13 56 \n", + "26 0.009565 0.021933 41 56 \n", + "27 0.009565 0.017550 44 56 \n", + "28 0.009565 0.009942 48 56 \n", + "29 0.009565 0.040560 34 56 \n", + "30 0.009565 0.040835 33 56 \n", + "31 0.009565 0.007102 51 56 \n", + "32 0.009565 0.009594 49 56 \n", + "33 0.009565 0.000353 56 56 \n", + "34 0.009565 0.115948 8 56 \n", + "35 0.009565 0.060035 24 56 \n", + "36 0.009565 0.051793 28 56 \n", + "37 0.009565 0.089946 17 56 \n", + "38 0.009565 0.053255 27 56 \n", + "39 0.009565 0.044984 30 56 \n", + "40 0.009565 0.194475 1 56 \n", + "41 0.009565 0.069117 22 56 \n", + "42 0.009565 0.031496 39 56 \n", + "43 0.009565 0.074904 21 56 \n", + "44 0.009565 0.021260 43 56 \n", + "45 0.009565 0.015840 45 56 \n", + "46 0.009565 0.031910 38 56 \n", + "47 0.009565 0.076585 20 56 \n", + "48 0.009565 0.005611 52 56 \n", + "49 0.009565 0.128551 5 56 \n", + "50 0.009565 0.102327 14 56 \n", + "51 0.009565 0.044001 31 56 \n", + "52 0.009565 0.104471 11 56 \n", + "53 0.009565 0.035432 35 56 \n", + "54 
0.009565 0.087730 18 56 \n", + "55 0.009565 0.000388 55 56 \n", + "56 0.009565 0.114872 9 56 \n", + "\n", + " TFs_input_to_model original_TFs_in_X \n", + "0 77 77 \n", + "1 77 77 \n", + "2 77 77 \n", + "3 77 77 \n", + "4 77 77 \n", + "5 77 77 \n", + "6 77 77 \n", + "7 77 77 \n", + "8 77 77 \n", + "9 77 77 \n", + "10 77 77 \n", + "11 77 77 \n", + "12 77 77 \n", + "13 77 77 \n", + "14 77 77 \n", + "15 77 77 \n", + "16 77 77 \n", + "17 77 77 \n", + "18 77 77 \n", + "19 77 77 \n", + "20 77 77 \n", + "21 77 77 \n", + "22 77 77 \n", + "23 77 77 \n", + "24 77 77 \n", + "25 77 77 \n", + "26 77 77 \n", + "27 77 77 \n", + "28 77 77 \n", + "29 77 77 \n", + "30 77 77 \n", + "31 77 77 \n", + "32 77 77 \n", + "33 77 77 \n", + "34 77 77 \n", + "35 77 77 \n", + "36 77 77 \n", + "37 77 77 \n", + "38 77 77 \n", + "39 77 77 \n", + "40 77 77 \n", + "41 77 77 \n", + "42 77 77 \n", + "43 77 77 \n", + "44 77 77 \n", + "45 77 77 \n", + "46 77 77 \n", + "47 77 77 \n", + "48 77 77 \n", + "49 77 77 \n", + "50 77 77 \n", + "51 77 77 \n", + "52 77 77 \n", + "53 77 77 \n", + "54 77 77 \n", + "55 77 77 \n", + "56 77 77 " + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netCV_ex2.combined_df" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "b99f60fa", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We input N = 77 candidate TFs as our predictors, X, for this prediction, of which 56 TFs were selected as optimal for our target gene (TG): ZZZ3.\n" + ] + } + ], + "source": [ + "print(f\"We input N = {num_candidate_TFs_for_TG} candidate TFs as our predictors, X, for this prediction\", end = \"\")\n", + "print(f\", of which {netCV_ex2.num_final_predictors} TFs were selected as optimal for our target gene (TG): {tg}.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "fd3deadd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
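| | y_intercept | BACH1 | BCL6 | CCNT2 | CTCF | E2F3 | E4F1 | EBF1 | EGR1 | ELF1 | ... | YY1 | ZBTB7A | ZFP28 | ZFP82 | ZNF136 | ZNF140 | ZNF274 | ZNF281 | ZNF682 | ZNF76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 0.092587 | 0.00092 | 0.050598 | -0.059361 | 0.179477 | -0.193458 | -0.032772 | 0.0 | 0.118892 | ... | 0.104471 | -0.035432 | 0.0 | 0.0 | 0.0 | 0.08773 | 0.0 | 0.0 | 0.000388 | 0.114872 |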
\n", + "

1 rows × 78 columns

\n", + "
" + ], + "text/plain": [ + " y_intercept BACH1 BCL6 CCNT2 CTCF E2F3 E4F1 \\\n", + "0 None 0.092587 0.00092 0.050598 -0.059361 0.179477 -0.193458 \n", + "\n", + " EBF1 EGR1 ELF1 ... YY1 ZBTB7A ZFP28 ZFP82 ZNF136 \\\n", + "0 -0.032772 0.0 0.118892 ... 0.104471 -0.035432 0.0 0.0 0.0 \n", + "\n", + " ZNF140 ZNF274 ZNF281 ZNF682 ZNF76 \n", + "0 0.08773 0.0 0.0 0.000388 0.114872 \n", + "\n", + "[1 rows x 78 columns]" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netCV_ex2.model_coef_df # all of the N predictors. 1 column is added due to y_intercept term. " + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "4457b6fb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
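| | y_intercept | BACH1 | BCL6 | CCNT2 | CTCF | E2F3 | E4F1 | EBF1 | ELF1 | ERF | ... | STAT5B | TBX2 | TCF3 | TP53 | USF1 | YY1 | ZBTB7A | ZNF140 | ZNF682 | ZNF76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 0.092587 | 0.00092 | 0.050598 | -0.059361 | 0.179477 | -0.193458 | -0.032772 | 0.118892 | -0.079737 | ... | 0.076585 | 0.005611 | -0.128551 | -0.102327 | 0.044001 | 0.104471 | -0.035432 | 0.08773 | 0.000388 | 0.114872 |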
\n", + "

1 rows × 57 columns

\n", + "
" + ], + "text/plain": [ + " y_intercept BACH1 BCL6 CCNT2 CTCF E2F3 E4F1 \\\n", + "0 None 0.092587 0.00092 0.050598 -0.059361 0.179477 -0.193458 \n", + "\n", + " EBF1 ELF1 ERF ... STAT5B TBX2 TCF3 TP53 \\\n", + "0 -0.032772 0.118892 -0.079737 ... 0.076585 0.005611 -0.128551 -0.102327 \n", + "\n", + " USF1 YY1 ZBTB7A ZNF140 ZNF682 ZNF76 \n", + "0 0.044001 0.104471 -0.035432 0.08773 0.000388 0.114872 \n", + "\n", + "[1 rows x 57 columns]" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netCV_ex2.model_nonzero_coef_df # the final # of non-zero predictors. 1 column is added due to y_intercept term. " + ] + }, + { + "cell_type": "markdown", + "id": "2daa8e81", + "metadata": {}, + "source": [ + "#### Option 3:\n", + "Option 3 shows how the user may pass additional parameters to netremCV for it to consider as it determines the optimal *beta_net* $\\beta_{net}$ and *alpha_lasso* $\\alpha_{lasso}$ values via cross-validation. We can set searchVerbosity = 1 if we want to avoid printing out the outputs of the searchCV. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "a191c8f6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) using variance to define beta_net values\n", + "beta_min = 0.14322421928177426 and beta_max = 14.322421928177425\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "32a26343987e497ab96d6a97a55e6529", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + ":) Generating beta_net and alpha_lasso pairs: 0%| | 0/50 [00:00
NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=True, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=0.9383027608540463, alpha_lasso=0.009564951400513843, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C46E87400>)
" + ], + "text/plain": [ + "NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=True, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=0.9383027608540463, alpha_lasso=0.009564951400513843, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C46E87400>)" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "\n", + "netCV_ex3 = nm.netremCV(edge_list = filtered_ppi_for_TG, X = X_train, y = y_train, reduced_cv_search = True,\n", + " y_intercept = True, searchVerbosity = 1)\n", + "netCV_ex3" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "15372a52", + "metadata": {}, + "source": [ + "![netCV_ex3.png](../user_guide/pics/netCV_ex3.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "187edd9d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) # of final TFs in the model for TG ZZZ3: 53\n", + "Training MSE: 0.44467536404146185\n", + "Testing MSE: 0.7517263006549427\n" + ] + } + ], + "source": [ + "print(f\":) # of final TFs in the model for TG {tg}: {netCV_ex3.num_final_predictors}\")\n", + "mse_train = netCV_ex3.test_mse(X_train, y_train)\n", + "mse_test = netCV_ex3.test_mse(X_test, y_test)\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")" + ] + }, + { + "cell_type": "markdown", + "id": "41a83fee", + "metadata": {}, + "source": [ + "### Example 1d: \n", + "#### Using Bayesian Optimization and Gaussian Processes to determine the optimal *beta_net* $\\beta_{net}$ and *alpha_lasso* $\\alpha_{lasso}$. Here, we can use the default range of values checked for $\\beta_{net}$ and $\\alpha_{lasso}$. \n", + "\n", + "In *Example 2c*, we will show how these values can be adjusted by the user based on intuition to improve performance. 
:)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "8159e12d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "using beta_net default of 1\n", + "using alpha_lasso default of 0.01\n", + ":) Please note that we are running: optimal_netrem_model_via_bayesian_param_tuner\n", + "alpha_lasso = 0.05632616797779373 ; beta_network = 1000.0\n", + "{'info': 'NetREm Model', 'alpha_lasso': 0.05632616797779373, 'beta_net': 1, 'y_intercept': True, 'model_type': 'Lasso', 'max_lasso_iterations': 10000, 'network': , 'verbose': False, 'all_pos_coefs': False, 'model_info': 'fitted_model :)', 'target_gene_y': 'ZZZ3', 'tolerance': 0.0001, 'lasso_selection': 'cyclic'}\n", + "CPU times: total: 18.8 s\n", + "Wall time: 50.1 s\n" + ] + }, + { + "data": { + "text/html": [ + "
NetREmModel(info=NetREm Model, verbose=False, all_pos_coefs=False, model_type=Lasso, y_intercept=True, max_lasso_iterations=10000, model_info=fitted_model :), target_gene_y=ZZZ3, tolerance=0.0001, lasso_selection=cyclic, beta_net=1, alpha_lasso=0.05632616797779373, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C48522830>)
" + ], + "text/plain": [ + "NetREmModel(info=NetREm Model, verbose=False, all_pos_coefs=False, model_type=Lasso, y_intercept=True, max_lasso_iterations=10000, model_info=fitted_model :), target_gene_y=ZZZ3, tolerance=0.0001, lasso_selection=cyclic, beta_net=1, alpha_lasso=0.05632616797779373, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C48522830>)" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "\n", + "netrem_bayes_demo1 = nm.netrem(edge_list = filtered_ppi_for_TG, \n", + " y_intercept = True)\n", + "\n", + "bayesian_netty = nm_eval.optimal_netrem_model_via_bayesian_param_tuner(netrem_bayes_demo1, X_train, y_train)\n", + "bayesian_net_model = bayesian_netty[\"optimal_model\"]\n", + "bayesian_net_model" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "8efc388a", + "metadata": {}, + "source": [ + "![bayesian_net_model.png](../user_guide/pics/bayesian_net_model.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "68ea9665", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) # of final TFs in the model for TG ZZZ3: 17\n", + "Training MSE: 0.5442961386563945\n", + "Testing MSE: 0.6919347461250936\n" + ] + } + ], + "source": [ + "print(f\":) # of final TFs in the model for TG {tg}: {bayesian_net_model.num_final_predictors}\")\n", + "mse_train = bayesian_net_model.test_mse(X_train, y_train)\n", + "mse_test = bayesian_net_model.test_mse(X_test, y_test)\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "92b733b9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
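| | y_intercept | BACH1 | E2F3 | ELF1 | FOXO1 | FOXP1 | MAF | MAX | NFIB | NFKB1 | NR3C1 | NR6A1 | RFX3 | RREB1 | SETDB1 | STAT5B | YY1 | ZNF682 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.107452 | 0.110969 | 0.011936 | 0.108407 | 0.005126 | 0.02512 | 0.053324 | 0.020518 | 0.10852 | 0.080691 | 0.031776 | 0.000045 | 0.037626 | 0.019253 | 0.110798 | 0.030587 | 0.01413 | 0.005165 |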
\n", + "
" + ], + "text/plain": [ + " y_intercept BACH1 E2F3 ELF1 FOXO1 FOXP1 MAF \\\n", + "0 -0.107452 0.110969 0.011936 0.108407 0.005126 0.02512 0.053324 \n", + "\n", + " MAX NFIB NFKB1 NR3C1 NR6A1 RFX3 RREB1 \\\n", + "0 0.020518 0.10852 0.080691 0.031776 0.000045 0.037626 0.019253 \n", + "\n", + " SETDB1 STAT5B YY1 ZNF682 \n", + "0 0.110798 0.030587 0.01413 0.005165 " + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bayesian_net_model.model_nonzero_coef_df # the final # of non-zero predictors. 1 column is added due to y_intercept term. " + ] + }, + { + "cell_type": "markdown", + "id": "ad672132", + "metadata": {}, + "source": [ + "## Example 2️⃣\n", + "### involves more input from the user🥈🥼🧪\n" + ] + }, + { + "cell_type": "markdown", + "id": "62315325", + "metadata": {}, + "source": [ + "### Example 2a: \n", + "#### using user-defined values for *beta_net* $\\beta_{net}$ and *alpha_lasso* $\\alpha_{lasso}$ ." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "5ba43fe5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) # of final TFs in the model for TG ZZZ3: 75\n", + "Training MSE: 0.36146345048829703\n", + "Testing MSE: 0.9005579183195876\n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fixed_netrem_2a = nm.netrem(edge_list = filtered_ppi_for_TG, \n", + " beta_net = 3,\n", + " alpha_lasso = 0.001,\n", + " view_network = True) # recommended that view_network = False since this is sadly a hairball! ☹️\n", + "fixed_netrem_2a.fit(X_train, y_train)\n", + "mse_train = fixed_netrem_2a.test_mse(X_train, y_train)\n", + "mse_test = fixed_netrem_2a.test_mse(X_test, y_test)\n", + "print(f\":) # of final TFs in the model for TG {tg}: {fixed_netrem_2a.num_final_predictors}\")\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "c70725de", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
NetREmModel(verbose=False, overlapped_nodes_only=False, all_pos_coefs=False, model_type=Lasso, use_network=True, y_intercept=False, max_lasso_iterations=10000, view_network=True, tolerance=0.0001, lasso_selection=cyclic, beta_net=3, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C46E946A0>, alpha_lasso=0.001)
" + ], + "text/plain": [ + "NetREmModel(verbose=False, overlapped_nodes_only=False, all_pos_coefs=False, model_type=Lasso, use_network=True, y_intercept=False, max_lasso_iterations=10000, view_network=True, tolerance=0.0001, lasso_selection=cyclic, beta_net=3, network=, alpha_lasso=0.001)" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fixed_netrem_2a" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "0c3c702a", + "metadata": {}, + "source": [ + "![fixed_netrem_2a.png](../user_guide/pics/fixed_netrem_2a.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "00e7a6ec", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
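| | y_intercept | BACH1 | BCL6 | CCNT2 | CTCF | E2F3 | E4F1 | EBF1 | EGR1 | ELF1 | ... | USF2 | YY1 | ZBTB7A | ZFP28 | ZNF136 | ZNF140 | ZNF274 | ZNF281 | ZNF682 | ZNF76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 0.096596 | -0.009237 | 0.085396 | -0.083271 | 0.225887 | -0.42595 | -0.07387 | -0.076801 | 0.112941 | ... | -0.119874 | 0.165508 | -0.054092 | -0.127078 | 0.117717 | 0.14481 | -0.075969 | 0.053424 | 0.000031 | 0.220575 |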
\n", + "

1 rows × 76 columns

\n", + "
" + ], + "text/plain": [ + " y_intercept BACH1 BCL6 CCNT2 CTCF E2F3 E4F1 \\\n", + "0 None 0.096596 -0.009237 0.085396 -0.083271 0.225887 -0.42595 \n", + "\n", + " EBF1 EGR1 ELF1 ... USF2 YY1 ZBTB7A ZFP28 \\\n", + "0 -0.07387 -0.076801 0.112941 ... -0.119874 0.165508 -0.054092 -0.127078 \n", + "\n", + " ZNF136 ZNF140 ZNF274 ZNF281 ZNF682 ZNF76 \n", + "0 0.117717 0.14481 -0.075969 0.053424 0.000031 0.220575 \n", + "\n", + "[1 rows x 76 columns]" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fixed_netrem_2a.model_nonzero_coef_df # the final # of non-zero predictors. 1 column is added due to y_intercept term. " + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "93f93b49", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
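| | coef | TF | TG | info | train_mse | beta_net | alpha_lasso | AbsoluteVal_coefficient | Rank | final_model_TFs | TFs_input_to_model | original_TFs_in_X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | y_intercept | ZZZ3 | netrem_no_intercept | 0.361463 | 3 | 0.001 | NaN | 76 | 75 | 77 | 77 |
| 1 | 0.096596 | BACH1 | ZZZ3 | netrem_no_intercept | 0.361463 | 3 | 0.001 | 0.096596 | 34 | 75 | 77 | 77 |
| 2 | -0.009237 | BCL6 | ZZZ3 | netrem_no_intercept | 0.361463 | 3 | 0.001 | 0.009237 | 68 | 75 | 77 | 77 |
| 3 | 0.085396 | CCNT2 | ZZZ3 | netrem_no_intercept | 0.361463 | 3 | 0.001 | 0.085396 | 38 | 75 | 77 | 77 |
| 4 | -0.083271 | CTCF | ZZZ3 | netrem_no_intercept | 0.361463 | 3 | 0.001 | 0.083271 | 40 | 75 | 77 | 77 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | 0.14481 | ZNF140 | ZZZ3 | netrem_no_intercept | 0.361463 | 3 | 0.001 | 0.144810 | 19 | 75 | 77 | 77 |
| 72 | -0.075969 | ZNF274 | ZZZ3 | netrem_no_intercept | 0.361463 | 3 | 0.001 | 0.075969 | 43 | 75 | 77 | 77 |
| 73 | 0.053424 | ZNF281 | ZZZ3 | netrem_no_intercept | 0.361463 | 3 | 0.001 | 0.053424 | 56 | 75 | 77 | 77 |
| 74 | 0.000031 | ZNF682 | ZZZ3 | netrem_no_intercept | 0.361463 | 3 | 0.001 | 0.000031 | 75 | 75 | 77 | 77 |
| 75 | 0.220575 | ZNF76 | ZZZ3 | netrem_no_intercept | 0.361463 | 3 | 0.001 | 0.220575 | 6 | 75 | 77 | 77 |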
\n", + "

76 rows × 12 columns

\n", + "
" + ], + "text/plain": [ + " coef TF TG info train_mse beta_net \\\n", + "0 None y_intercept ZZZ3 netrem_no_intercept 0.361463 3 \n", + "1 0.096596 BACH1 ZZZ3 netrem_no_intercept 0.361463 3 \n", + "2 -0.009237 BCL6 ZZZ3 netrem_no_intercept 0.361463 3 \n", + "3 0.085396 CCNT2 ZZZ3 netrem_no_intercept 0.361463 3 \n", + "4 -0.083271 CTCF ZZZ3 netrem_no_intercept 0.361463 3 \n", + ".. ... ... ... ... ... ... \n", + "71 0.14481 ZNF140 ZZZ3 netrem_no_intercept 0.361463 3 \n", + "72 -0.075969 ZNF274 ZZZ3 netrem_no_intercept 0.361463 3 \n", + "73 0.053424 ZNF281 ZZZ3 netrem_no_intercept 0.361463 3 \n", + "74 0.000031 ZNF682 ZZZ3 netrem_no_intercept 0.361463 3 \n", + "75 0.220575 ZNF76 ZZZ3 netrem_no_intercept 0.361463 3 \n", + "\n", + " alpha_lasso AbsoluteVal_coefficient Rank final_model_TFs \\\n", + "0 0.001 NaN 76 75 \n", + "1 0.001 0.096596 34 75 \n", + "2 0.001 0.009237 68 75 \n", + "3 0.001 0.085396 38 75 \n", + "4 0.001 0.083271 40 75 \n", + ".. ... ... ... ... \n", + "71 0.001 0.144810 19 75 \n", + "72 0.001 0.075969 43 75 \n", + "73 0.001 0.053424 56 75 \n", + "74 0.001 0.000031 75 75 \n", + "75 0.001 0.220575 6 75 \n", + "\n", + " TFs_input_to_model original_TFs_in_X \n", + "0 77 77 \n", + "1 77 77 \n", + "2 77 77 \n", + "3 77 77 \n", + "4 77 77 \n", + ".. ... ... \n", + "71 77 77 \n", + "72 77 77 \n", + "73 77 77 \n", + "74 77 77 \n", + "75 77 77 \n", + "\n", + "[76 rows x 12 columns]" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fixed_netrem_2a.combined_df" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "110ae80f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
NetREmModel(verbose=False, overlapped_nodes_only=False, all_pos_coefs=False, model_type=Lasso, use_network=True, y_intercept=False, max_lasso_iterations=10000, view_network=True, tolerance=0.0001, lasso_selection=cyclic, beta_net=3, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C46E946A0>, alpha_lasso=0.001)
" + ], + "text/plain": [ + "NetREmModel(verbose=False, overlapped_nodes_only=False, all_pos_coefs=False, model_type=Lasso, use_network=True, y_intercept=False, max_lasso_iterations=10000, view_network=True, tolerance=0.0001, lasso_selection=cyclic, beta_net=3, network=, alpha_lasso=0.001)" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fixed_netrem_2a" + ] + }, + { + "cell_type": "markdown", + "id": "4d6bb46d", + "metadata": {}, + "source": [ + "### Example 2b:\n", + "#### using user-defined values for *beta_net* $\\beta_{net}$ and LassoCV to find the optimal *alpha_lasso* .\n", + "This is the approach that Saniya utilized for NetREm for the paper :)📜👩‍🏫" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "02d053b4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) # of final TFs in the model for TG ZZZ3: 67\n", + "Training MSE: 0.40284765315234555\n", + "Testing MSE: 0.6833977664046426\n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fixed_netrem_2b = nm.netrem(edge_list = filtered_ppi_for_TG, \n", + " beta_net = 100,\n", + " y_intercept = True,\n", + " default_edge_weight = 0.01,\n", + " degree_threshold = 0.5,\n", + " model_type = \"LassoCV\",\n", + " view_network = True)\n", + "fixed_netrem_2b.fit(X_train, y_train)\n", + "mse_train = fixed_netrem_2b.test_mse(X_train, y_train)\n", + "mse_test = fixed_netrem_2b.test_mse(X_test, y_test)\n", + "print(f\":) # of final TFs in the model for TG {tg}: {fixed_netrem_2b.num_final_predictors}\")\n", + "\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "ac2520c4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
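| | y_intercept | BACH1 | CCNT2 | CTCF | E2F3 | E4F1 | EBF1 | ELF1 | ERF | ESR1 | ... | TP53 | USF1 | USF2 | YY1 | ZBTB7A | ZNF136 | ZNF140 | ZNF274 | ZNF682 | ZNF76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.06877 | 0.073128 | 0.060351 | -0.08667 | 0.174759 | -0.170537 | -0.045644 | 0.130413 | -0.082457 | 0.007668 | ... | -0.089806 | 0.069805 | -0.015188 | 0.113233 | -0.022505 | 0.056088 | 0.076585 | -0.001909 | 0.00184 | 0.109638 |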
\n", + "

1 rows × 68 columns

\n", + "
" + ], + "text/plain": [ + " y_intercept BACH1 CCNT2 CTCF E2F3 E4F1 EBF1 \\\n", + "0 0.06877 0.073128 0.060351 -0.08667 0.174759 -0.170537 -0.045644 \n", + "\n", + " ELF1 ERF ESR1 ... TP53 USF1 USF2 YY1 \\\n", + "0 0.130413 -0.082457 0.007668 ... -0.089806 0.069805 -0.015188 0.113233 \n", + "\n", + " ZBTB7A ZNF136 ZNF140 ZNF274 ZNF682 ZNF76 \n", + "0 -0.022505 0.056088 0.076585 -0.001909 0.00184 0.109638 \n", + "\n", + "[1 rows x 68 columns]" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fixed_netrem_2b.model_nonzero_coef_df # the final # of non-zero predictors. 1 column is added due to y_intercept term. " + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "00c53c75", + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | coef | TF | TG | info | train_mse | beta_net | alpha_lassoCV | AbsoluteVal_coefficient | Rank | final_model_TFs | TFs_input_to_model | original_TFs_in_X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.068770 | y_intercept | ZZZ3 | netrem_with_intercept | 0.402848 | 100 | Cross-Validation optimal alpha lasso: 0.004425... | 0.068770 | 29 | 67 | 77 | 77 |
| 1 | 0.073128 | BACH1 | ZZZ3 | netrem_with_intercept | 0.402848 | 100 | Cross-Validation optimal alpha lasso: 0.004425... | 0.073128 | 27 | 67 | 77 | 77 |
| 2 | 0.060351 | CCNT2 | ZZZ3 | netrem_with_intercept | 0.402848 | 100 | Cross-Validation optimal alpha lasso: 0.004425... | 0.060351 | 34 | 67 | 77 | 77 |
| 3 | -0.086670 | CTCF | ZZZ3 | netrem_with_intercept | 0.402848 | 100 | Cross-Validation optimal alpha lasso: 0.004425... | 0.086670 | 19 | 67 | 77 | 77 |
| 4 | 0.174759 | E2F3 | ZZZ3 | netrem_with_intercept | 0.402848 | 100 | Cross-Validation optimal alpha lasso: 0.004425... | 0.174759 | 1 | 67 | 77 | 77 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 63 | 0.056088 | ZNF136 | ZZZ3 | netrem_with_intercept | 0.402848 | 100 | Cross-Validation optimal alpha lasso: 0.004425... | 0.056088 | 35 | 67 | 77 | 77 |
| 64 | 0.076585 | ZNF140 | ZZZ3 | netrem_with_intercept | 0.402848 | 100 | Cross-Validation optimal alpha lasso: 0.004425... | 0.076585 | 25 | 67 | 77 | 77 |
| 65 | -0.001909 | ZNF274 | ZZZ3 | netrem_with_intercept | 0.402848 | 100 | Cross-Validation optimal alpha lasso: 0.004425... | 0.001909 | 66 | 67 | 77 | 77 |
| 66 | 0.001840 | ZNF682 | ZZZ3 | netrem_with_intercept | 0.402848 | 100 | Cross-Validation optimal alpha lasso: 0.004425... | 0.001840 | 67 | 67 | 77 | 77 |
| 67 | 0.109638 | ZNF76 | ZZZ3 | netrem_with_intercept | 0.402848 | 100 | Cross-Validation optimal alpha lasso: 0.004425... | 0.109638 | 10 | 67 | 77 | 77 |
\n", + "

68 rows × 12 columns

\n", + "
" + ], + "text/plain": [ + " coef TF TG info train_mse beta_net \\\n", + "0 0.068770 y_intercept ZZZ3 netrem_with_intercept 0.402848 100 \n", + "1 0.073128 BACH1 ZZZ3 netrem_with_intercept 0.402848 100 \n", + "2 0.060351 CCNT2 ZZZ3 netrem_with_intercept 0.402848 100 \n", + "3 -0.086670 CTCF ZZZ3 netrem_with_intercept 0.402848 100 \n", + "4 0.174759 E2F3 ZZZ3 netrem_with_intercept 0.402848 100 \n", + ".. ... ... ... ... ... ... \n", + "63 0.056088 ZNF136 ZZZ3 netrem_with_intercept 0.402848 100 \n", + "64 0.076585 ZNF140 ZZZ3 netrem_with_intercept 0.402848 100 \n", + "65 -0.001909 ZNF274 ZZZ3 netrem_with_intercept 0.402848 100 \n", + "66 0.001840 ZNF682 ZZZ3 netrem_with_intercept 0.402848 100 \n", + "67 0.109638 ZNF76 ZZZ3 netrem_with_intercept 0.402848 100 \n", + "\n", + " alpha_lassoCV \\\n", + "0 Cross-Validation optimal alpha lasso: 0.004425... \n", + "1 Cross-Validation optimal alpha lasso: 0.004425... \n", + "2 Cross-Validation optimal alpha lasso: 0.004425... \n", + "3 Cross-Validation optimal alpha lasso: 0.004425... \n", + "4 Cross-Validation optimal alpha lasso: 0.004425... \n", + ".. ... \n", + "63 Cross-Validation optimal alpha lasso: 0.004425... \n", + "64 Cross-Validation optimal alpha lasso: 0.004425... \n", + "65 Cross-Validation optimal alpha lasso: 0.004425... \n", + "66 Cross-Validation optimal alpha lasso: 0.004425... \n", + "67 Cross-Validation optimal alpha lasso: 0.004425... \n", + "\n", + " AbsoluteVal_coefficient Rank final_model_TFs TFs_input_to_model \\\n", + "0 0.068770 29 67 77 \n", + "1 0.073128 27 67 77 \n", + "2 0.060351 34 67 77 \n", + "3 0.086670 19 67 77 \n", + "4 0.174759 1 67 77 \n", + ".. ... ... ... ... \n", + "63 0.056088 35 67 77 \n", + "64 0.076585 25 67 77 \n", + "65 0.001909 66 67 77 \n", + "66 0.001840 67 67 77 \n", + "67 0.109638 10 67 77 \n", + "\n", + " original_TFs_in_X \n", + "0 77 \n", + "1 77 \n", + "2 77 \n", + "3 77 \n", + "4 77 \n", + ".. ... 
\n", + "63 77 \n", + "64 77 \n", + "65 77 \n", + "66 77 \n", + "67 77 \n", + "\n", + "[68 rows x 12 columns]" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fixed_netrem_2b.combined_df" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "84ae10b3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
NetREmModel(verbose=False, overlapped_nodes_only=False, num_cv_folds=5, num_jobs=-1, all_pos_coefs=False, model_type=LassoCV, use_network=True, y_intercept=True, max_lasso_iterations=10000, view_network=True, tolerance=0.0001, lasso_selection=cyclic, lassocv_eps=0.001, lassocv_n_alphas=100, lassocv_alphas=None, beta_net=100, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C6A39A6B0>, alpha_lasso=LassoCV finds optimal alpha)
" + ], + "text/plain": [ + "NetREmModel(verbose=False, overlapped_nodes_only=False, num_cv_folds=5, num_jobs=-1, all_pos_coefs=False, model_type=LassoCV, use_network=True, y_intercept=True, max_lasso_iterations=10000, view_network=True, tolerance=0.0001, lasso_selection=cyclic, lassocv_eps=0.001, lassocv_n_alphas=100, lassocv_alphas=None, beta_net=100, network=, alpha_lasso=LassoCV finds optimal alpha)" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fixed_netrem_2b" + ] + }, + { + "attachments": { + "netrem_1b.png": { + "image/png": "" + } + }, + "cell_type": "markdown", + "id": "ea2faab5", + "metadata": {}, + "source": [ + "![netrem_1b.png](attachment:netrem_1b.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "dc687e29", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dict_keys(['info', 'verbose', 'overlapped_nodes_only', 'num_cv_folds', 'num_jobs', 'all_pos_coefs', 'model_type', 'use_network', 'y_intercept', 'max_lasso_iterations', 'view_network', 'model_info', 'target_gene_y', 'tolerance', 'lasso_selection', 'lassocv_eps', 'lassocv_n_alphas', 'lassocv_alphas', 'beta_net', 'network', 'alpha_lasso', 'optimal_alpha', 'prior_network', 'preprocessed_network', 'network_params', 'network_nodes_list', 'kwargs', 'X_df', 'gene_expression_nodes', 'common_nodes', 'final_nodes', 'gexpr_nodes_added', 'gexpr_nodes_to_add_for_net', 'filter_network_bool', 'A_df', 'A', 'nodes', 'network_info', 'M', 'N', 'X_train', 'y_train', 'B_train', 'B_interaction_df', 'B_train_times_M', 'X_tilda_train', 'y_tilda_train', 'X_training_to_use', 'y_training_to_use', 'regr', 'final_alpha', 'coef', 'intercept', 'predY_tilda_train', 'mse_tilda_train', 'predY_train', 'mse_train', 'model_coef_df', 'model_nonzero_coef_df', 'sorted_coef_df', 'corr_vs_coef_df', 'final_corr_vs_coef_df', 'combined_df', 'num_final_predictors'])" + ] + }, + "execution_count": 58, + "metadata": {}, + "output_type": 
"execute_result" + } + ], + "source": [ + "vars(fixed_netrem_2b).keys()" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "6d1f1c81", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'info': 'NetREm Model',\n", + " 'alpha_lasso': 'LassoCV finds optimal alpha',\n", + " 'beta_net': 100,\n", + " 'y_intercept': True,\n", + " 'model_type': 'LassoCV',\n", + " 'max_lasso_iterations': 10000,\n", + " 'network': ,\n", + " 'verbose': False,\n", + " 'all_pos_coefs': False,\n", + " 'model_info': 'fitted_model :)',\n", + " 'target_gene_y': 'ZZZ3',\n", + " 'num_cv_folds': 5,\n", + " 'num_jobs': -1,\n", + " 'lassocv_eps': 0.001,\n", + " 'lassocv_n_alphas': 100,\n", + " 'lassocv_alphas': None,\n", + " 'optimal_alpha': 'Cross-Validation optimal alpha lasso: 0.004425278133386773',\n", + " 'tolerance': 0.0001,\n", + " 'lasso_selection': 'cyclic'}" + ] + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fixed_netrem_2b.get_params()" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "668bfe5a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
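| | info | input_data | BACH1 | CCNT2 | CTCF | E2F3 | E4F1 | EBF1 | ELF1 | ERF | ... | TP53 | USF1 | USF2 | YY1 | ZBTB7A | ZNF136 | ZNF140 | ZNF274 | ZNF682 | ZNF76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | network regression coeff. with y: ZZZ3 | X_train | 0.073128 | 0.060351 | -0.086670 | 0.174759 | -0.170537 | -0.045644 | 0.130413 | -0.082457 | ... | -0.089806 | 0.069805 | -0.015188 | 0.113233 | -0.022505 | 0.056088 | 0.076585 | -0.001909 | 0.001840 | 0.109638 |
| 0 | corr (r) with y: ZZZ3 | X_train | 0.191081 | 0.134249 | -0.053487 | 0.160718 | -0.046865 | -0.047160 | 0.207914 | -0.091996 | ... | -0.065173 | 0.085353 | -0.025491 | 0.119843 | -0.124585 | -0.012445 | 0.132537 | -0.000089 | 0.129738 | 0.124608 |
| 0 | Absolute Value NetREm Coefficient Ranking | X_train | 27.000000 | 34.000000 | 19.000000 | 1.000000 | 3.000000 | 41.000000 | 7.000000 | 21.000000 | ... | 18.000000 | 28.000000 | 58.000000 | 9.000000 | 50.000000 | 35.000000 | 25.000000 | 66.000000 | 67.000000 | 10.000000 |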
\n", + "

3 rows × 69 columns

\n", + "
" + ], + "text/plain": [ + " info input_data BACH1 CCNT2 \\\n", + "0 network regression coeff. with y: ZZZ3 X_train 0.073128 0.060351 \n", + "0 corr (r) with y: ZZZ3 X_train 0.191081 0.134249 \n", + "0 Absolute Value NetREm Coefficient Ranking X_train 27.000000 34.000000 \n", + "\n", + " CTCF E2F3 E4F1 EBF1 ELF1 ERF ... \\\n", + "0 -0.086670 0.174759 -0.170537 -0.045644 0.130413 -0.082457 ... \n", + "0 -0.053487 0.160718 -0.046865 -0.047160 0.207914 -0.091996 ... \n", + "0 19.000000 1.000000 3.000000 41.000000 7.000000 21.000000 ... \n", + "\n", + " TP53 USF1 USF2 YY1 ZBTB7A ZNF136 ZNF140 \\\n", + "0 -0.089806 0.069805 -0.015188 0.113233 -0.022505 0.056088 0.076585 \n", + "0 -0.065173 0.085353 -0.025491 0.119843 -0.124585 -0.012445 0.132537 \n", + "0 18.000000 28.000000 58.000000 9.000000 50.000000 35.000000 25.000000 \n", + "\n", + " ZNF274 ZNF682 ZNF76 \n", + "0 -0.001909 0.001840 0.109638 \n", + "0 -0.000089 0.129738 0.124608 \n", + "0 66.000000 67.000000 10.000000 \n", + "\n", + "[3 rows x 69 columns]" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fixed_netrem_2b.final_corr_vs_coef_df" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "8cdcad59", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + "
| | TF1 | TF2 | B_train_weight | sign | potential_interaction | absVal_B | info | candidate_TFs_N | target_gene_y | num_final_predictors | model_type | beta_net | gene_data | rank | percentile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1028 | NFIB | FOXO1 | 1.118503 | :) | :( | 1.118503 | B matrix of TF-TF interactions | 77 | ZZZ3 | 67 | LassoCV | 100 | training gene expression data | 1.0 | 99.982912 |
| 2092 | FOXO1 | NFIB | 1.118503 | :) | :( | 1.118503 | B matrix of TF-TF interactions | 77 | ZZZ3 | 67 | LassoCV | 100 | training gene expression data | 1.0 | 99.982912 |
| 4416 | NFIB | SREBF2 | 0.956989 | :) | :( | 0.956989 | B matrix of TF-TF interactions | 77 | ZZZ3 | 67 | LassoCV | 100 | training gene expression data | 3.0 | 99.948735 |
| 2136 | SREBF2 | NFIB | 0.956989 | :) | :( | 0.956989 | B matrix of TF-TF interactions | 77 | ZZZ3 | 67 | LassoCV | 100 | training gene expression data | 3.0 | 99.948735 |
| 1046 | RORA | FOXO1 | 0.928125 | :) | :( | 0.928125 | B matrix of TF-TF interactions | 77 | ZZZ3 | 67 | LassoCV | 100 | training gene expression data | 5.0 | 99.914559 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3834 | TCF3 | RXRB | -0.000012 | :( | :( competitive (-) | 0.000012 | B matrix of TF-TF interactions | 77 | ZZZ3 | 67 | LassoCV | 100 | training gene expression data | 5848.0 | 0.068353 |
| 2784 | ESRRA | PML | -0.000010 | :( | :( competitive (-) | 0.000010 | B matrix of TF-TF interactions | 77 | ZZZ3 | 67 | LassoCV | 100 | training gene expression data | 5849.0 | 0.051265 |
| 960 | PML | ESRRA | -0.000010 | :( | :( competitive (-) | 0.000010 | B matrix of TF-TF interactions | 77 | ZZZ3 | 67 | LassoCV | 100 | training gene expression data | 5849.0 | 0.051265 |
| 4708 | ESR2 | TCF3 | -0.000009 | :( | :( competitive (-) | 0.000009 | B matrix of TF-TF interactions | 77 | ZZZ3 | 67 | LassoCV | 100 | training gene expression data | 5851.0 | 0.017088 |
| 908 | TCF3 | ESR2 | -0.000009 | :( | :( competitive (-) | 0.000009 | B matrix of TF-TF interactions | 77 | ZZZ3 | 67 | LassoCV | 100 | training gene expression data | 5852.0 | 0.000000 |
\n", + "

5852 rows × 15 columns

\n", + "
" + ], + "text/plain": [ + " TF1 TF2 B_train_weight sign potential_interaction absVal_B \\\n", + "1028 NFIB FOXO1 1.118503 :) :( 1.118503 \n", + "2092 FOXO1 NFIB 1.118503 :) :( 1.118503 \n", + "4416 NFIB SREBF2 0.956989 :) :( 0.956989 \n", + "2136 SREBF2 NFIB 0.956989 :) :( 0.956989 \n", + "1046 RORA FOXO1 0.928125 :) :( 0.928125 \n", + "... ... ... ... ... ... ... \n", + "3834 TCF3 RXRB -0.000012 :( :( competitive (-) 0.000012 \n", + "2784 ESRRA PML -0.000010 :( :( competitive (-) 0.000010 \n", + "960 PML ESRRA -0.000010 :( :( competitive (-) 0.000010 \n", + "4708 ESR2 TCF3 -0.000009 :( :( competitive (-) 0.000009 \n", + "908 TCF3 ESR2 -0.000009 :( :( competitive (-) 0.000009 \n", + "\n", + " info candidate_TFs_N target_gene_y \\\n", + "1028 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2092 B matrix of TF-TF interactions 77 ZZZ3 \n", + "4416 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2136 B matrix of TF-TF interactions 77 ZZZ3 \n", + "1046 B matrix of TF-TF interactions 77 ZZZ3 \n", + "... ... ... ... \n", + "3834 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2784 B matrix of TF-TF interactions 77 ZZZ3 \n", + "960 B matrix of TF-TF interactions 77 ZZZ3 \n", + "4708 B matrix of TF-TF interactions 77 ZZZ3 \n", + "908 B matrix of TF-TF interactions 77 ZZZ3 \n", + "\n", + " num_final_predictors model_type beta_net \\\n", + "1028 67 LassoCV 100 \n", + "2092 67 LassoCV 100 \n", + "4416 67 LassoCV 100 \n", + "2136 67 LassoCV 100 \n", + "1046 67 LassoCV 100 \n", + "... ... ... ... \n", + "3834 67 LassoCV 100 \n", + "2784 67 LassoCV 100 \n", + "960 67 LassoCV 100 \n", + "4708 67 LassoCV 100 \n", + "908 67 LassoCV 100 \n", + "\n", + " gene_data rank percentile \n", + "1028 training gene expression data 1.0 99.982912 \n", + "2092 training gene expression data 1.0 99.982912 \n", + "4416 training gene expression data 3.0 99.948735 \n", + "2136 training gene expression data 3.0 99.948735 \n", + "1046 training gene expression data 5.0 99.914559 \n", + "... ... ... ... 
\n", + "3834 training gene expression data 5848.0 0.068353 \n", + "2784 training gene expression data 5849.0 0.051265 \n", + "960 training gene expression data 5849.0 0.051265 \n", + "4708 training gene expression data 5851.0 0.017088 \n", + "908 training gene expression data 5852.0 0.000000 \n", + "\n", + "[5852 rows x 15 columns]" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b_matrix_2b = nm.organize_B_interaction_network(fixed_netrem_2b)\n", + "b_matrix_2b" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "6e31ad77", + "metadata": {}, + "source": [ + "### Example 2c: \n", + "#### using GridSearchCV for comprehensive hyperparameter optimization🧮 for *beta_net* $\\beta_{net}$ and *alpha_lasso* $\\alpha_{lasso}$ ." + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "1c6d089c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "using beta_net default of 1\n", + "using alpha_lasso default of 0.01\n", + "Fitting 5 folds for each of 208 candidates, totalling 1040 fits\n", + "CPU times: total: 10.6 s\n", + "Wall time: 26.6 s\n" + ] + }, + { + "data": { + "text/html": [ + "
GridSearchCV(cv=5,\n",
+       "             estimator=NetREmModel(all_pos_coefs=False, alpha_lasso=0.01, beta_net=1, info='NetREm Model', lasso_selection='cyclic', max_lasso_iterations=10000, model_info='unfitted_model :(', model_type='Lasso', network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C71577C10>, target_gene_y='Unknown :(', tolerance=0.0001, verbose=False, y_intercept=False),\n",
+       "             n_jobs=-1,\n",
+       "             param_grid={'alpha_lasso': [1e-05, 1e-05, 1e-08, 0.0001, 0.005,\n",
+       "                                         0.001, 1e-09, 0.002, 0.1, 0.0023559,\n",
+       "                                         0.003, 0.005, 0.01],\n",
+       "                         'beta_net': [1e-06, 3, 0.1, 0.05, 1e-07, 5e-06, 0.01,\n",
+       "                                      0.001, 0.2, 0.4, 0.5, 0.6, 0.8, 5, 1,\n",
+       "                                      2]},\n",
+       "             verbose=10)
" + ], + "text/plain": [ + "GridSearchCV(cv=5,\n", + " estimator=NetREmModel(all_pos_coefs=False, alpha_lasso=0.01, beta_net=1, info='NetREm Model', lasso_selection='cyclic', max_lasso_iterations=10000, model_info='unfitted_model :(', model_type='Lasso', network=, target_gene_y='Unknown :(', tolerance=0.0001, verbose=False, y_intercept=False),\n", + " n_jobs=-1,\n", + " param_grid={'alpha_lasso': [1e-05, 1e-05, 1e-08, 0.0001, 0.005,\n", + " 0.001, 1e-09, 0.002, 0.1, 0.0023559,\n", + " 0.003, 0.005, 0.01],\n", + " 'beta_net': [1e-06, 3, 0.1, 0.05, 1e-07, 5e-06, 0.01,\n", + " 0.001, 0.2, 0.4, 0.5, 0.6, 0.8, 5, 1,\n", + " 2]},\n", + " verbose=10)" + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "\n", + "demo1 = nm.netrem(edge_list = filtered_ppi_for_TG, \n", + " view_network = False)\n", + "\n", + "param_grid = {\n", + " 'beta_net': [1e-6,3, 0.1, 0.05, 1e-7, 5e-6, 0.01,1e-3, 0.2, 0.4, 0.5, 0.6, 0.8, 5, 1, 2],\n", + " 'alpha_lasso': [1e-5, 0.00001, 1e-8, 0.0001, 0.005, 0.001, 1e-9, 0.002, 0.1, 0.0023559, 0.003, 0.005, 0.01]}\n", + "\n", + "griddy_demo1 = GridSearchCV(demo1, param_grid=param_grid, cv=5, n_jobs = -1, verbose = 10)\n", + "griddy_demo1.fit(X_train, y_train)" + ] + }, + { + "cell_type": "markdown", + "id": "aff7d3cb", + "metadata": {}, + "source": [ + "![gridster_0.png](../user_guide/pics/gridster_0.png)\n", + "![gridster_1.png](../user_guide/pics/gridster_1.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "0ebaf4e7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'alpha_lasso': 0.1, 'beta_net': 5}" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "griddy_demo1.best_params_ # these are the parameters selected by the gridSearchCV" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "eccef2a8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": 
[ + "
NetREmModel(info=NetREm Model, verbose=False, all_pos_coefs=False, model_type=Lasso, y_intercept=False, max_lasso_iterations=10000, model_info=fitted_model :), target_gene_y=ZZZ3, tolerance=0.0001, lasso_selection=cyclic, beta_net=5, alpha_lasso=0.1, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C6E41B430>)
" + ], + "text/plain": [ + "NetREmModel(info=NetREm Model, verbose=False, all_pos_coefs=False, model_type=Lasso, y_intercept=False, max_lasso_iterations=10000, model_info=fitted_model :), target_gene_y=ZZZ3, tolerance=0.0001, lasso_selection=cyclic, beta_net=5, alpha_lasso=0.1, network=)" + ] + }, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "gridsearch_netrem_model = griddy_demo1.best_estimator_\n", + "gridsearch_netrem_model" + ] + }, + { + "cell_type": "markdown", + "id": "8a2b1046", + "metadata": {}, + "source": [ + "![griddy_1.png](../user_guide/pics/griddy_1.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "41c9af6e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'info': 'NetREm Model',\n", + " 'alpha_lasso': 0.1,\n", + " 'beta_net': 5,\n", + " 'y_intercept': False,\n", + " 'model_type': 'Lasso',\n", + " 'max_lasso_iterations': 10000,\n", + " 'network': ,\n", + " 'verbose': False,\n", + " 'all_pos_coefs': False,\n", + " 'model_info': 'fitted_model :)',\n", + " 'target_gene_y': 'ZZZ3',\n", + " 'tolerance': 0.0001,\n", + " 'lasso_selection': 'cyclic'}" + ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "gridsearch_netrem_model.get_params()" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "0c4e7aca", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) # of final TFs in the model for TG ZZZ3: 12\n", + "Training MSE: 0.5480934127420958\n", + "Testing MSE: 0.6610817578852232\n" + ] + } + ], + "source": [ + "print(f\":) # of final TFs in the model for TG {tg}: {gridsearch_netrem_model.num_final_predictors}\")\n", + "mse_train = gridsearch_netrem_model.test_mse(X_train, y_train)\n", + "mse_test = gridsearch_netrem_model.test_mse(X_test, y_test)\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")" 
+ ] + }, + { + "cell_type": "markdown", + "id": "24427c81", + "metadata": {}, + "source": [ + "### Example 2d: \n", + "#### using RandomizedSearchCV for more efficient (but less comprehensive) optimization for *beta_net* $\\beta_{net}$ and *alpha_lasso* $\\alpha_{lasso}$ ." + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "20fcc68d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "using beta_net default of 1\n", + "using alpha_lasso default of 0.01\n", + "Fitting 5 folds for each of 10 candidates, totalling 50 fits\n" + ] + }, + { + "data": { + "text/html": [ + "
RandomizedSearchCV(cv=5,\n",
+       "                   estimator=NetREmModel(all_pos_coefs=False, alpha_lasso=0.01, beta_net=1, info='NetREm Model', lasso_selection='cyclic', max_lasso_iterations=10000, model_info='unfitted_model :(', model_type='Lasso', network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C71576AA0>, target_gene_y='Unknown :(', tolerance=0.0001, verbose=False, y_intercept=False),\n",
+       "                   n_jobs=-1,\n",
+       "                   param_distributions={'alpha_lasso': [1e-05, 1e-05, 1e-08,\n",
+       "                                                        0.0001, 0.005, 0.001,\n",
+       "                                                        1e-09, 0.002, 0.1,\n",
+       "                                                        0.0023559, 0.003, 0.005,\n",
+       "                                                        0.01],\n",
+       "                                        'beta_net': [1e-06, 3, 0.1, 0.05, 1e-07,\n",
+       "                                                     5e-06, 0.01, 0.001, 0.2,\n",
+       "                                                     0.4, 0.5, 0.6, 0.8, 5, 1,\n",
+       "                                                     2]},\n",
+       "                   verbose=10)
" + ], + "text/plain": [ + "RandomizedSearchCV(cv=5,\n", + " estimator=NetREmModel(all_pos_coefs=False, alpha_lasso=0.01, beta_net=1, info='NetREm Model', lasso_selection='cyclic', max_lasso_iterations=10000, model_info='unfitted_model :(', model_type='Lasso', network=, target_gene_y='Unknown :(', tolerance=0.0001, verbose=False, y_intercept=False),\n", + " n_jobs=-1,\n", + " param_distributions={'alpha_lasso': [1e-05, 1e-05, 1e-08,\n", + " 0.0001, 0.005, 0.001,\n", + " 1e-09, 0.002, 0.1,\n", + " 0.0023559, 0.003, 0.005,\n", + " 0.01],\n", + " 'beta_net': [1e-06, 3, 0.1, 0.05, 1e-07,\n", + " 5e-06, 0.01, 0.001, 0.2,\n", + " 0.4, 0.5, 0.6, 0.8, 5, 1,\n", + " 2]},\n", + " verbose=10)" + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.model_selection import RandomizedSearchCV\n", + "demo2 = nm.netrem(edge_list = filtered_ppi_for_TG, \n", + " view_network = False)\n", + "\n", + "griddy_demo2 = RandomizedSearchCV(demo2, param_distributions=param_grid, cv=5, n_jobs = -1, verbose = 10)\n", + "griddy_demo2.fit(X_train, y_train)" + ] + }, + { + "cell_type": "markdown", + "id": "6c703cb5", + "metadata": {}, + "source": [ + "![rand_Search1.png](../user_guide/pics/rand_SearchA.png)\n", + "![rand_Search2.png](../user_guide/pics/rand_SearchB.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "b34567b1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'beta_net': 5e-06, 'alpha_lasso': 0.1}" + ] + }, + "execution_count": 68, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "griddy_demo2.best_params_" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "b48fa832", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
NetREmModel(info=NetREm Model, verbose=False, all_pos_coefs=False, model_type=Lasso, y_intercept=False, max_lasso_iterations=10000, model_info=fitted_model :), target_gene_y=ZZZ3, tolerance=0.0001, lasso_selection=cyclic, beta_net=5e-06, alpha_lasso=0.1, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C6E77F610>)
" + ], + "text/plain": [ + "NetREmModel(info=NetREm Model, verbose=False, all_pos_coefs=False, model_type=Lasso, y_intercept=False, max_lasso_iterations=10000, model_info=fitted_model :), target_gene_y=ZZZ3, tolerance=0.0001, lasso_selection=cyclic, beta_net=5e-06, alpha_lasso=0.1, network=)" + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "randsearch_netrem_model = griddy_demo2.best_estimator_\n", + "randsearch_netrem_model" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "ab8f577e", + "metadata": {}, + "source": [ + "![rand_Search1.png](../user_guide/pics/rand_Search1.png)\n", + "![rand_Search2.png](../user_guide/pics/rand_Search2.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "3dc7233b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) # of final TFs in the model for TG ZZZ3: 12\n", + "Training MSE: 0.5480195513484795\n", + "Testing MSE: 0.661036024697951\n" + ] + } + ], + "source": [ + "print(f\":) # of final TFs in the model for TG {tg}: {randsearch_netrem_model.num_final_predictors}\")\n", + "mse_train = randsearch_netrem_model.test_mse(X_train, y_train)\n", + "mse_test = randsearch_netrem_model.test_mse(X_test, y_test)\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "39723bbe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
|   | y_intercept | BACH1 | ELF1 | FOXO1 | FOXP1 | MAF | MAX | NFIB | NFKB1 | NR3C1 | RREB1 | SETDB1 | YY1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 0.085496 | 0.118928 | 0.012787 | 0.010383 | 0.026445 | 0.024701 | 0.117132 | 0.052072 | 0.051238 | 0.006338 | 0.083056 | 0.005714 |
" + ], + "text/plain": [ + " y_intercept BACH1 ELF1 FOXO1 FOXP1 MAF MAX \\\n", + "0 None 0.085496 0.118928 0.012787 0.010383 0.026445 0.024701 \n", + "\n", + " NFIB NFKB1 NR3C1 RREB1 SETDB1 YY1 \n", + "0 0.117132 0.052072 0.051238 0.006338 0.083056 0.005714 " + ] + }, + "execution_count": 71, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "randsearch_netrem_model.model_nonzero_coef_df" + ] + }, + { + "cell_type": "markdown", + "id": "41ee9e46", + "metadata": {}, + "source": [ + "### Example 2e: \n", + "#### using Bayesian Optimization and Gaussian Processes to determine the optimal *beta_net* $\\beta_{net}$ and *alpha_lasso* $\\alpha_{lasso}$ given a potential range of values.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "f5d9e64a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "using beta_net default of 1\n", + "using alpha_lasso default of 0.01\n", + ":) Please note that we are running: optimal_netrem_model_via_bayesian_param_tuner\n", + "alpha_lasso = 0.01 ; beta_network = 100.0\n", + "{'info': 'NetREm Model', 'alpha_lasso': 0.01, 'beta_net': 1, 'y_intercept': True, 'model_type': 'Lasso', 'max_lasso_iterations': 10000, 'network': , 'verbose': False, 'all_pos_coefs': False, 'model_info': 'fitted_model :)', 'target_gene_y': 'ZZZ3', 'tolerance': 0.0001, 'lasso_selection': 'cyclic'}\n", + "CPU times: total: 7.47 s\n", + "Wall time: 19.5 s\n" + ] + }, + { + "data": { + "text/html": [ + "
NetREmModel(info=NetREm Model, verbose=False, all_pos_coefs=False, model_type=Lasso, y_intercept=True, max_lasso_iterations=10000, model_info=fitted_model :), target_gene_y=ZZZ3, tolerance=0.0001, lasso_selection=cyclic, beta_net=1, alpha_lasso=0.01, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C6C6F8130>)
" + ], + "text/plain": [ + "NetREmModel(info=NetREm Model, verbose=False, all_pos_coefs=False, model_type=Lasso, y_intercept=True, max_lasso_iterations=10000, model_info=fitted_model :), target_gene_y=ZZZ3, tolerance=0.0001, lasso_selection=cyclic, beta_net=1, alpha_lasso=0.01, network=)" + ] + }, + "execution_count": 72, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "netrem_bayes_demo2 = nm.netrem(edge_list = filtered_ppi_for_TG, \n", + " y_intercept = True)\n", + "\n", + "bayesian_netty2 = nm_eval.optimal_netrem_model_via_bayesian_param_tuner(netrem_bayes_demo2,\n", + " X_train, y_train,\n", + " beta_net_min = 1, \n", + " beta_net_max = 100, \n", + " alpha_lasso_min = 0.0001,\n", + " alpha_lasso_max = 0.01, \n", + " num_grid_values = 50)\n", + "bayesian_net_model2 = bayesian_netty2[\"optimal_model\"]\n", + "bayesian_net_model2" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "f0a94e1f", + "metadata": {}, + "source": [ + "![bayesian_net_model2.png](../user_guide/pics/bayesian_net_model2.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "f85feba4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'info': 'NetREm Model',\n", + " 'alpha_lasso': 0.01,\n", + " 'beta_net': 1,\n", + " 'y_intercept': True,\n", + " 'model_type': 'Lasso',\n", + " 'max_lasso_iterations': 10000,\n", + " 'network': ,\n", + " 'verbose': False,\n", + " 'all_pos_coefs': False,\n", + " 'model_info': 'fitted_model :)',\n", + " 'target_gene_y': 'ZZZ3',\n", + " 'tolerance': 0.0001,\n", + " 'lasso_selection': 'cyclic'}" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bayesian_net_model2.get_params()" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "f40dfe15", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) # of final TFs in the model for TG ZZZ3: 53\n", 
+ "Training MSE: 0.44677196830719673\n", + "Testing MSE: 0.7485799905085805\n" + ] + } + ], + "source": [ + "print(f\":) # of final TFs in the model for TG {tg}: {bayesian_net_model2.num_final_predictors}\")\n", + "mse_train = bayesian_net_model2.test_mse(X_train, y_train)\n", + "mse_test = bayesian_net_model2.test_mse(X_test, y_test)\n", + "print(f\"Training MSE: {mse_train}\")\n", + "print(f\"Testing MSE: {mse_test}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "d36243be", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
|   | y_intercept | BACH1 | BCL6 | CCNT2 | CTCF | E2F3 | E4F1 | EBF1 | ELF1 | ERF | ... | TCF3 | TP53 | USF2 | YY1 | ZBTB7A | ZFP28 | ZNF140 | ZNF274 | ZNF682 | ZNF76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.15667 | 0.111807 | 0.002305 | 0.02496 | -0.031419 | 0.135078 | -0.187243 | -0.023986 | 0.091143 | -0.013846 | ... | -0.137879 | -0.154464 | -0.039146 | 0.056958 | -0.064111 | -0.083764 | 0.118554 | -0.019879 | 0.015135 | 0.151874 |

1 rows × 54 columns
" + ], + "text/plain": [ + " y_intercept BACH1 BCL6 CCNT2 CTCF E2F3 E4F1 \\\n", + "0 -0.15667 0.111807 0.002305 0.02496 -0.031419 0.135078 -0.187243 \n", + "\n", + " EBF1 ELF1 ERF ... TCF3 TP53 USF2 YY1 \\\n", + "0 -0.023986 0.091143 -0.013846 ... -0.137879 -0.154464 -0.039146 0.056958 \n", + "\n", + " ZBTB7A ZFP28 ZNF140 ZNF274 ZNF682 ZNF76 \n", + "0 -0.064111 -0.083764 0.118554 -0.019879 0.015135 0.151874 \n", + "\n", + "[1 rows x 54 columns]" + ] + }, + "execution_count": 75, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bayesian_net_model2.model_nonzero_coef_df" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "f8222c0e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
|   | coef | TF | TG | info | train_mse | beta_net | alpha_lasso | AbsoluteVal_coefficient | Rank | final_model_TFs | TFs_input_to_model | original_TFs_in_X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.156670 | y_intercept | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.156670 | 3 | 53 | 77 | 77 |
| 1 | 0.111807 | BACH1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.111807 | 12 | 53 | 77 | 77 |
| 2 | 0.002305 | BCL6 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.002305 | 53 | 53 | 77 | 77 |
| 3 | 0.024960 | CCNT2 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.024960 | 41 | 53 | 77 | 77 |
| 4 | -0.031419 | CTCF | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.031419 | 37 | 53 | 77 | 77 |
| 5 | 0.135078 | E2F3 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.135078 | 7 | 53 | 77 | 77 |
| 6 | -0.187243 | E4F1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.187243 | 1 | 53 | 77 | 77 |
| 7 | -0.023986 | EBF1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.023986 | 42 | 53 | 77 | 77 |
| 8 | 0.091143 | ELF1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.091143 | 17 | 53 | 77 | 77 |
| 9 | -0.013846 | ERF | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.013846 | 47 | 53 | 77 | 77 |
| 10 | -0.112157 | ESR2 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.112157 | 11 | 53 | 77 | 77 |
| 11 | -0.010902 | FOXO1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.010902 | 48 | 53 | 77 | 77 |
| 12 | 0.030213 | FOXP1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.030213 | 39 | 53 | 77 | 77 |
| 13 | 0.045014 | HCFC1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.045014 | 32 | 53 | 77 | 77 |
| 14 | 0.057533 | HDAC2 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.057533 | 25 | 53 | 77 | 77 |
| 15 | 0.121722 | IRF3 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.121722 | 9 | 53 | 77 | 77 |
| 16 | -0.051893 | KLF12 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.051893 | 30 | 53 | 77 | 77 |
| 17 | 0.019619 | KLF15 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.019619 | 44 | 53 | 77 | 77 |
| 18 | 0.051233 | MAF | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.051233 | 31 | 53 | 77 | 77 |
| 19 | 0.007074 | MAX | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.007074 | 50 | 53 | 77 | 77 |
| 20 | -0.060859 | MXI1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.060859 | 24 | 53 | 77 | 77 |
| 21 | 0.055780 | MYEF2 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.055780 | 28 | 53 | 77 | 77 |
| 22 | 0.109876 | NFIB | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.109876 | 13 | 53 | 77 | 77 |
| 23 | 0.052258 | NFIC | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.052258 | 29 | 53 | 77 | 77 |
| 24 | 0.133596 | NFKB1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.133596 | 8 | 53 | 77 | 77 |
| 25 | 0.039664 | NR1D2 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.039664 | 35 | 53 | 77 | 77 |
| 26 | 0.056687 | NR1H2 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.056687 | 27 | 53 | 77 | 77 |
| 27 | -0.030452 | NR2F1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.030452 | 38 | 53 | 77 | 77 |
| 28 | 0.005572 | NR3C1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.005572 | 51 | 53 | 77 | 77 |
| 29 | 0.000801 | NR6A1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.000801 | 54 | 53 | 77 | 77 |
| 30 | 0.043955 | PPARA | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.043955 | 33 | 53 | 77 | 77 |
| 31 | 0.100120 | RARB | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.100120 | 16 | 53 | 77 | 77 |
| 32 | -0.027042 | RARG | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.027042 | 40 | 53 | 77 | 77 |
| 33 | 0.004707 | REST | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.004707 | 52 | 53 | 77 | 77 |
| 34 | 0.061880 | RFX3 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.061880 | 23 | 53 | 77 | 77 |
| 35 | -0.086117 | RORA | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.086117 | 18 | 53 | 77 | 77 |
| 36 | 0.042622 | RREB1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.042622 | 34 | 53 | 77 | 77 |
| 37 | 0.076264 | RUNX2 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.076264 | 21 | 53 | 77 | 77 |
| 38 | 0.170890 | SETDB1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.170890 | 2 | 53 | 77 | 77 |
| 39 | -0.100338 | SMAD4 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.100338 | 15 | 53 | 77 | 77 |
| 40 | 0.077066 | SMC3 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.077066 | 20 | 53 | 77 | 77 |
| 41 | 0.015308 | STAT1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.015308 | 45 | 53 | 77 | 77 |
| 42 | 0.100428 | STAT5B | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.100428 | 14 | 53 | 77 | 77 |
| 43 | 0.007686 | TBX2 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.007686 | 49 | 53 | 77 | 77 |
| 44 | -0.137879 | TCF3 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.137879 | 6 | 53 | 77 | 77 |
| 45 | -0.154464 | TP53 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.154464 | 4 | 53 | 77 | 77 |
| 46 | -0.039146 | USF2 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.039146 | 36 | 53 | 77 | 77 |
| 47 | 0.056958 | YY1 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.056958 | 26 | 53 | 77 | 77 |
| 48 | -0.064111 | ZBTB7A | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.064111 | 22 | 53 | 77 | 77 |
| 49 | -0.083764 | ZFP28 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.083764 | 19 | 53 | 77 | 77 |
| 50 | 0.118554 | ZNF140 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.118554 | 10 | 53 | 77 | 77 |
| 51 | -0.019879 | ZNF274 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.019879 | 43 | 53 | 77 | 77 |
| 52 | 0.015135 | ZNF682 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.015135 | 46 | 53 | 77 | 77 |
| 53 | 0.151874 | ZNF76 | ZZZ3 | netrem_with_intercept | 0.446772 | 1 | 0.01 | 0.151874 | 5 | 53 | 77 | 77 |
" + ], + "text/plain": [ + " coef TF TG info train_mse beta_net \\\n", + "0 -0.156670 y_intercept ZZZ3 netrem_with_intercept 0.446772 1 \n", + "1 0.111807 BACH1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "2 0.002305 BCL6 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "3 0.024960 CCNT2 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "4 -0.031419 CTCF ZZZ3 netrem_with_intercept 0.446772 1 \n", + "5 0.135078 E2F3 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "6 -0.187243 E4F1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "7 -0.023986 EBF1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "8 0.091143 ELF1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "9 -0.013846 ERF ZZZ3 netrem_with_intercept 0.446772 1 \n", + "10 -0.112157 ESR2 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "11 -0.010902 FOXO1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "12 0.030213 FOXP1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "13 0.045014 HCFC1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "14 0.057533 HDAC2 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "15 0.121722 IRF3 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "16 -0.051893 KLF12 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "17 0.019619 KLF15 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "18 0.051233 MAF ZZZ3 netrem_with_intercept 0.446772 1 \n", + "19 0.007074 MAX ZZZ3 netrem_with_intercept 0.446772 1 \n", + "20 -0.060859 MXI1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "21 0.055780 MYEF2 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "22 0.109876 NFIB ZZZ3 netrem_with_intercept 0.446772 1 \n", + "23 0.052258 NFIC ZZZ3 netrem_with_intercept 0.446772 1 \n", + "24 0.133596 NFKB1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "25 0.039664 NR1D2 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "26 0.056687 NR1H2 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "27 -0.030452 NR2F1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "28 0.005572 NR3C1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "29 0.000801 NR6A1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "30 0.043955 
PPARA ZZZ3 netrem_with_intercept 0.446772 1 \n", + "31 0.100120 RARB ZZZ3 netrem_with_intercept 0.446772 1 \n", + "32 -0.027042 RARG ZZZ3 netrem_with_intercept 0.446772 1 \n", + "33 0.004707 REST ZZZ3 netrem_with_intercept 0.446772 1 \n", + "34 0.061880 RFX3 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "35 -0.086117 RORA ZZZ3 netrem_with_intercept 0.446772 1 \n", + "36 0.042622 RREB1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "37 0.076264 RUNX2 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "38 0.170890 SETDB1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "39 -0.100338 SMAD4 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "40 0.077066 SMC3 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "41 0.015308 STAT1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "42 0.100428 STAT5B ZZZ3 netrem_with_intercept 0.446772 1 \n", + "43 0.007686 TBX2 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "44 -0.137879 TCF3 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "45 -0.154464 TP53 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "46 -0.039146 USF2 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "47 0.056958 YY1 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "48 -0.064111 ZBTB7A ZZZ3 netrem_with_intercept 0.446772 1 \n", + "49 -0.083764 ZFP28 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "50 0.118554 ZNF140 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "51 -0.019879 ZNF274 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "52 0.015135 ZNF682 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "53 0.151874 ZNF76 ZZZ3 netrem_with_intercept 0.446772 1 \n", + "\n", + " alpha_lasso AbsoluteVal_coefficient Rank final_model_TFs \\\n", + "0 0.01 0.156670 3 53 \n", + "1 0.01 0.111807 12 53 \n", + "2 0.01 0.002305 53 53 \n", + "3 0.01 0.024960 41 53 \n", + "4 0.01 0.031419 37 53 \n", + "5 0.01 0.135078 7 53 \n", + "6 0.01 0.187243 1 53 \n", + "7 0.01 0.023986 42 53 \n", + "8 0.01 0.091143 17 53 \n", + "9 0.01 0.013846 47 53 \n", + "10 0.01 0.112157 11 53 \n", + "11 0.01 0.010902 48 53 \n", + "12 0.01 0.030213 39 53 \n", + "13 
0.01 0.045014 32 53 \n", + "14 0.01 0.057533 25 53 \n", + "15 0.01 0.121722 9 53 \n", + "16 0.01 0.051893 30 53 \n", + "17 0.01 0.019619 44 53 \n", + "18 0.01 0.051233 31 53 \n", + "19 0.01 0.007074 50 53 \n", + "20 0.01 0.060859 24 53 \n", + "21 0.01 0.055780 28 53 \n", + "22 0.01 0.109876 13 53 \n", + "23 0.01 0.052258 29 53 \n", + "24 0.01 0.133596 8 53 \n", + "25 0.01 0.039664 35 53 \n", + "26 0.01 0.056687 27 53 \n", + "27 0.01 0.030452 38 53 \n", + "28 0.01 0.005572 51 53 \n", + "29 0.01 0.000801 54 53 \n", + "30 0.01 0.043955 33 53 \n", + "31 0.01 0.100120 16 53 \n", + "32 0.01 0.027042 40 53 \n", + "33 0.01 0.004707 52 53 \n", + "34 0.01 0.061880 23 53 \n", + "35 0.01 0.086117 18 53 \n", + "36 0.01 0.042622 34 53 \n", + "37 0.01 0.076264 21 53 \n", + "38 0.01 0.170890 2 53 \n", + "39 0.01 0.100338 15 53 \n", + "40 0.01 0.077066 20 53 \n", + "41 0.01 0.015308 45 53 \n", + "42 0.01 0.100428 14 53 \n", + "43 0.01 0.007686 49 53 \n", + "44 0.01 0.137879 6 53 \n", + "45 0.01 0.154464 4 53 \n", + "46 0.01 0.039146 36 53 \n", + "47 0.01 0.056958 26 53 \n", + "48 0.01 0.064111 22 53 \n", + "49 0.01 0.083764 19 53 \n", + "50 0.01 0.118554 10 53 \n", + "51 0.01 0.019879 43 53 \n", + "52 0.01 0.015135 46 53 \n", + "53 0.01 0.151874 5 53 \n", + "\n", + " TFs_input_to_model original_TFs_in_X \n", + "0 77 77 \n", + "1 77 77 \n", + "2 77 77 \n", + "3 77 77 \n", + "4 77 77 \n", + "5 77 77 \n", + "6 77 77 \n", + "7 77 77 \n", + "8 77 77 \n", + "9 77 77 \n", + "10 77 77 \n", + "11 77 77 \n", + "12 77 77 \n", + "13 77 77 \n", + "14 77 77 \n", + "15 77 77 \n", + "16 77 77 \n", + "17 77 77 \n", + "18 77 77 \n", + "19 77 77 \n", + "20 77 77 \n", + "21 77 77 \n", + "22 77 77 \n", + "23 77 77 \n", + "24 77 77 \n", + "25 77 77 \n", + "26 77 77 \n", + "27 77 77 \n", + "28 77 77 \n", + "29 77 77 \n", + "30 77 77 \n", + "31 77 77 \n", + "32 77 77 \n", + "33 77 77 \n", + "34 77 77 \n", + "35 77 77 \n", + "36 77 77 \n", + "37 77 77 \n", + "38 77 77 \n", + "39 77 77 \n", + "40 77 77 \n", 
+ "41 77 77 \n", + "42 77 77 \n", + "43 77 77 \n", + "44 77 77 \n", + "45 77 77 \n", + "46 77 77 \n", + "47 77 77 \n", + "48 77 77 \n", + "49 77 77 \n", + "50 77 77 \n", + "51 77 77 \n", + "52 77 77 \n", + "53 77 77 " + ] + }, + "execution_count": 76, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bayesian_net_model2.combined_df" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "92f8557b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dict_keys(['info', 'verbose', 'overlapped_nodes_only', 'num_cv_folds', 'num_jobs', 'all_pos_coefs', 'model_type', 'use_network', 'y_intercept', 'max_lasso_iterations', 'view_network', 'model_info', 'target_gene_y', 'tolerance', 'lasso_selection', 'lassocv_eps', 'lassocv_n_alphas', 'lassocv_alphas', 'beta_net', 'alpha_lasso', 'network', 'optimal_alpha', 'prior_network', 'preprocessed_network', 'network_params', 'network_nodes_list', 'kwargs', 'X_df', 'gene_expression_nodes', 'common_nodes', 'final_nodes', 'gexpr_nodes_added', 'gexpr_nodes_to_add_for_net', 'filter_network_bool', 'A_df', 'A', 'nodes', 'network_info', 'M', 'N', 'X_train', 'y_train', 'B_train', 'B_interaction_df', 'B_train_times_M', 'X_tilda_train', 'y_tilda_train', 'X_training_to_use', 'y_training_to_use', 'regr', 'final_alpha', 'coef', 'intercept', 'predY_tilda_train', 'mse_tilda_train', 'predY_train', 'mse_train', 'model_coef_df', 'model_nonzero_coef_df', 'sorted_coef_df', 'corr_vs_coef_df', 'final_corr_vs_coef_df', 'combined_df', 'num_final_predictors'])" + ] + }, + "execution_count": 77, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vars(bayesian_net_model2).keys()" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "d537c8cc", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + "
| | TF1 | TF2 | B_train_weight | sign | potential_interaction | absVal_B | info | candidate_TFs_N | target_gene_y | num_final_predictors | model_type | beta_net | gene_data | rank | percentile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2092 | FOXO1 | NFIB | 1.119274e+00 | :) | :( | 1.119274e+00 | B matrix of TF-TF interactions | 77 | ZZZ3 | 53 | Lasso | 1 | training gene expression data | 1.0 | 99.982912 |
| 1028 | NFIB | FOXO1 | 1.119274e+00 | :) | :( | 1.119274e+00 | B matrix of TF-TF interactions | 77 | ZZZ3 | 53 | Lasso | 1 | training gene expression data | 1.0 | 99.982912 |
| 4416 | NFIB | SREBF2 | 9.570243e-01 | :) | :( | 9.570243e-01 | B matrix of TF-TF interactions | 77 | ZZZ3 | 53 | Lasso | 1 | training gene expression data | 3.0 | 99.948735 |
| 2136 | SREBF2 | NFIB | 9.570243e-01 | :) | :( | 9.570243e-01 | B matrix of TF-TF interactions | 77 | ZZZ3 | 53 | Lasso | 1 | training gene expression data | 3.0 | 99.948735 |
| 1046 | RORA | FOXO1 | 9.287842e-01 | :) | :( | 9.287842e-01 | B matrix of TF-TF interactions | 77 | ZZZ3 | 53 | Lasso | 1 | training gene expression data | 5.0 | 99.914559 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3834 | TCF3 | RXRB | -1.163842e-06 | :( | :( competitive (-) | 1.163842e-06 | B matrix of TF-TF interactions | 77 | ZZZ3 | 53 | Lasso | 1 | training gene expression data | 5847.0 | 0.085441 |
| 960 | PML | ESRRA | -9.827955e-07 | :( | :( competitive (-) | 9.827955e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 53 | Lasso | 1 | training gene expression data | 5849.0 | 0.051265 |
| 2784 | ESRRA | PML | -9.827955e-07 | :( | :( competitive (-) | 9.827955e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 53 | Lasso | 1 | training gene expression data | 5850.0 | 0.034176 |
| 4708 | ESR2 | TCF3 | -9.284305e-07 | :( | :( competitive (-) | 9.284305e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 53 | Lasso | 1 | training gene expression data | 5851.0 | 0.017088 |
| 908 | TCF3 | ESR2 | -9.284305e-07 | :( | :( competitive (-) | 9.284305e-07 | B matrix of TF-TF interactions | 77 | ZZZ3 | 53 | Lasso | 1 | training gene expression data | 5851.0 | 0.017088 |
\n", + "

5852 rows × 15 columns

\n", + "
" + ], + "text/plain": [ + " TF1 TF2 B_train_weight sign potential_interaction absVal_B \\\n", + "2092 FOXO1 NFIB 1.119274e+00 :) :( 1.119274e+00 \n", + "1028 NFIB FOXO1 1.119274e+00 :) :( 1.119274e+00 \n", + "4416 NFIB SREBF2 9.570243e-01 :) :( 9.570243e-01 \n", + "2136 SREBF2 NFIB 9.570243e-01 :) :( 9.570243e-01 \n", + "1046 RORA FOXO1 9.287842e-01 :) :( 9.287842e-01 \n", + "... ... ... ... ... ... ... \n", + "3834 TCF3 RXRB -1.163842e-06 :( :( competitive (-) 1.163842e-06 \n", + "960 PML ESRRA -9.827955e-07 :( :( competitive (-) 9.827955e-07 \n", + "2784 ESRRA PML -9.827955e-07 :( :( competitive (-) 9.827955e-07 \n", + "4708 ESR2 TCF3 -9.284305e-07 :( :( competitive (-) 9.284305e-07 \n", + "908 TCF3 ESR2 -9.284305e-07 :( :( competitive (-) 9.284305e-07 \n", + "\n", + " info candidate_TFs_N target_gene_y \\\n", + "2092 B matrix of TF-TF interactions 77 ZZZ3 \n", + "1028 B matrix of TF-TF interactions 77 ZZZ3 \n", + "4416 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2136 B matrix of TF-TF interactions 77 ZZZ3 \n", + "1046 B matrix of TF-TF interactions 77 ZZZ3 \n", + "... ... ... ... \n", + "3834 B matrix of TF-TF interactions 77 ZZZ3 \n", + "960 B matrix of TF-TF interactions 77 ZZZ3 \n", + "2784 B matrix of TF-TF interactions 77 ZZZ3 \n", + "4708 B matrix of TF-TF interactions 77 ZZZ3 \n", + "908 B matrix of TF-TF interactions 77 ZZZ3 \n", + "\n", + " num_final_predictors model_type beta_net \\\n", + "2092 53 Lasso 1 \n", + "1028 53 Lasso 1 \n", + "4416 53 Lasso 1 \n", + "2136 53 Lasso 1 \n", + "1046 53 Lasso 1 \n", + "... ... ... ... 
\n", + "3834 53 Lasso 1 \n", + "960 53 Lasso 1 \n", + "2784 53 Lasso 1 \n", + "4708 53 Lasso 1 \n", + "908 53 Lasso 1 \n", + "\n", + " gene_data rank percentile \n", + "2092 training gene expression data 1.0 99.982912 \n", + "1028 training gene expression data 1.0 99.982912 \n", + "4416 training gene expression data 3.0 99.948735 \n", + "2136 training gene expression data 3.0 99.948735 \n", + "1046 training gene expression data 5.0 99.914559 \n", + "... ... ... ... \n", + "3834 training gene expression data 5847.0 0.085441 \n", + "960 training gene expression data 5849.0 0.051265 \n", + "2784 training gene expression data 5850.0 0.034176 \n", + "4708 training gene expression data 5851.0 0.017088 \n", + "908 training gene expression data 5851.0 0.017088 \n", + "\n", + "[5852 rows x 15 columns]" + ] + }, + "execution_count": 78, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b_matrix_2e = nm.organize_B_interaction_network(bayesian_net_model2)\n", + "b_matrix_2e" + ] + }, + { + "cell_type": "markdown", + "id": "62b29bc5", + "metadata": {}, + "source": [ + "## Example 3️⃣: \n", + "### More intensive hyperparameter tuning\n", + "Here, please note that we focus on more hyperparameters that we may tune on, such as including the y-intercept term (or not?), or using Lasso versus LassoCV (and when LassoCV is being tested, we ignore the input alpha_lasso in the param_grid values). 😊\n", + "\n", + "### Example 3a:\n", + "#### User optimizes over several hyperparameters using GridSearchCV (comprehensive):\n" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "e37b79f9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "using beta_net default of 1\n", + "using alpha_lasso default of 0.01\n", + "Fitting 5 folds for each of 480 candidates, totalling 2400 fits\n", + "CPU times: total: 34 s\n", + "Wall time: 1min 9s\n" + ] + }, + { + "data": { + "text/html": [ + "
GridSearchCV(cv=5,\n",
+       "             estimator=NetREmModel(all_pos_coefs=False, alpha_lasso=0.01, beta_net=1, info='NetREm Model', lasso_selection='cyclic', max_lasso_iterations=10000, model_info='unfitted_model :(', model_type='Lasso', network=<PriorGraphNetwork.PriorGraphNetwork object at 0x0000016C6F1F74C0>, target_gene_y='Unknown :(', tolerance=0.0001, verbose=False, y_intercept=False),\n",
+       "             n_jobs=-1,\n",
+       "             param_grid={'alpha_lasso': [1e-05, 1e-05, 0.0001, 0.005, 0.001,\n",
+       "                                         0.002, 0.1, 0.003, 0.005, 0.01],\n",
+       "                         'beta_net': [0.1, 0.05, 0.01, 0.2, 0.4, 0.5, 0.6, 0.8,\n",
+       "                                      5, 1, 2, 10],\n",
+       "                         'model_type': ['Lasso', 'LassoCV'],\n",
+       "                         'y_intercept': [True, False]},\n",
+       "             verbose=10)
" + ], + "text/plain": [ + "GridSearchCV(cv=5,\n", + " estimator=NetREmModel(all_pos_coefs=False, alpha_lasso=0.01, beta_net=1, info='NetREm Model', lasso_selection='cyclic', max_lasso_iterations=10000, model_info='unfitted_model :(', model_type='Lasso', network=, target_gene_y='Unknown :(', tolerance=0.0001, verbose=False, y_intercept=False),\n", + " n_jobs=-1,\n", + " param_grid={'alpha_lasso': [1e-05, 1e-05, 0.0001, 0.005, 0.001,\n", + " 0.002, 0.1, 0.003, 0.005, 0.01],\n", + " 'beta_net': [0.1, 0.05, 0.01, 0.2, 0.4, 0.5, 0.6, 0.8,\n", + " 5, 1, 2, 10],\n", + " 'model_type': ['Lasso', 'LassoCV'],\n", + " 'y_intercept': [True, False]},\n", + " verbose=10)" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "\n", + "demo3 = nm.netrem(edge_list = filtered_ppi_for_TG, \n", + " view_network = False)\n", + "\n", + "larger_param_grid = {\n", + " 'beta_net': [0.1, 0.05,0.01, 0.2, 0.4, 0.5, 0.6, 0.8, 5, 1, 2, 10],\n", + " 'alpha_lasso': [1e-5, 0.00001, 0.0001, 0.005, 0.001, 0.002, 0.1, 0.003, 0.005, 0.01],\n", + " 'y_intercept': [True, False],\n", + " 'model_type': [\"Lasso\", \"LassoCV\"]}\n", + "\n", + "griddy_demo3 = GridSearchCV(demo3, param_grid=larger_param_grid, cv=5, n_jobs = -1, verbose = 10)\n", + "griddy_demo3.fit(X_train, y_train)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "b73d131d", + "metadata": {}, + "source": [ + "![gridster_0.png](../user_guide/pics/gridster_0.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "cc34a859", + "metadata": {}, + "outputs": [], + "source": [ + "# griddy_demo3.cv_results_ # to view the results from GridSearchCV\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0351384", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + 
"codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/code/old_code/refresh/Netrem_model_builder.py b/code/old_code/refresh/Netrem_model_builder.py new file mode 100644 index 0000000..761023f --- /dev/null +++ b/code/old_code/refresh/Netrem_model_builder.py @@ -0,0 +1,1154 @@ +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import random +import copy +from tqdm import tqdm +import os +import sys # https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. +import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model, preprocessing # 9/19 +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge +from numpy.typing import ArrayLike +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +from numpy.typing import ArrayLike +from skopt import gp_minimize, space +from scipy.sparse.linalg.interface import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 +# from packages_needed import * +import essential_functions as ef +import error_metrics as em # why to do import +#import Netrem_model_builder as nm +import 
DemoDataBuilderXandY as demo +import PriorGraphNetwork as graph +import netrem_evaluation_functions as nm_eval +import matplotlib.pyplot as plt +import pandas as pd +import numpy as np +import networkx as nx +from sklearn.linear_model import LinearRegression, Lasso, LassoCV +from tqdm.auto import tqdm +import copy +""" +Optimization for +(1 / (2 * M)) * ||y - Xc||^2_2 + (beta / (2 * N^2)) * c'Ac + alpha * ||c||_1 +Which is converted to lasso +(1 / (2 * M)) * ||y_tilde - X_tilde @ c||^2_2 + alpha * ||c||_1 +where M = n_samples and N is the dimension of c. +Check compute_X_tilde_y_tilde() to see how we make sure above normalization is applied using Lasso of sklearn +""" + +class NetREmModel(BaseEstimator, RegressorMixin): + """ :) Please note that this class focuses on building a Gene Regulatory Network (GRN) from gene expression data for Transcription Factors (TFs), gene expression data for the target gene (TG), and a prior biological network (W). This class performs Network-penalized regression :) """ + _parameter_constraints = { + "alpha_lasso": (0, None), + "beta_net": (0, None), + "num_cv_folds": (0, None), + "y_intercept": [False, True], + "use_network": [True, False], + "max_lasso_iterations": (1, None), + "model_type": ["Lasso", "LassoCV", "Linear"], + "tolerance": (0, None), + "num_jobs": (1, 1e10), + "lasso_selection": ["cyclic", "random"], + "lassocv_eps": (0, None), + "lassocv_n_alphas": (1, None), + "standardize_X": [True, False], + "center_y": [True, False] + } + + def __init__(self, **kwargs): + self.info = "NetREm Model" + self.verbose = False + self.overlapped_nodes_only = False # restrict the nodes to only being those found in the network? 
overlapped_nodes_only + self.num_cv_folds = 5 # for cross-validation models + self.num_jobs = -1 # for LassoCV or LinearRegression (here, -1 is the max possible for CPU) + self.all_pos_coefs = False # for coefficients + self.model_type = "Lasso" + self.standardize_X = True + self.center_y = True + self.use_network = True + self.y_intercept = False + self.max_lasso_iterations = 10000 + self.view_network = False + self.model_info = "unfitted_model :(" + self.target_gene_y = "Unknown :(" + self.tolerance = 1e-4 + self.lasso_selection = "cyclic" # default in sklearn + self.lassocv_eps = 1e-3 # default in sklearn + self.lassocv_n_alphas = 100 # default in sklearn + self.lassocv_alphas = None # default in sklearn + self.beta_net = kwargs.get('beta_net', 1) + self.__dict__.update(kwargs) + required_keys = ["network", "beta_net"] + if self.model_type == "Lasso": + self.alpha_lasso = kwargs.get('alpha_lasso', 0.01) + self.optimal_alpha = "User-specified optimal alpha lasso: " + str(self.alpha_lasso) + required_keys += ["alpha_lasso"] + elif self.model_type == "LassoCV": + self.alpha_lasso = "LassoCV finds optimal alpha" + self.optimal_alpha = "Since LassoCV is model_type, please fit model using X and y data to find optimal_alpha." 
+ else: # model_type == "Linear": + self.alpha_lasso = "No alpha needed" + self.optimal_alpha = "No alpha needed" + missing_keys = [key for key in required_keys if key not in self.__dict__] # check that all required keys are present + if missing_keys: + raise ValueError(f":( Please note you are missing information for these keys: {missing_keys}") + if self.use_network: + prior_network = self.network + self.prior_network = prior_network + self.preprocessed_network = prior_network.preprocessed_network + self.network_params = prior_network.param_lists + self.network_nodes_list = prior_network.final_nodes # tf_names_list + self.kwargs = kwargs + self._apply_parameter_constraints() # ensuring that the parameter constraints are met + + + def __repr__(self): + args = [f"{k}={v}" for k, v in self.__dict__.items() if k != 'param_grid' and k in self.kwargs] + return f"{self.__class__.__name__}({', '.join(args)})" + + + def check_overlaps_work(self): + final_set = set(self.final_nodes) + network_set = set(self.network_nodes_list) + return network_set != final_set + + + def standardize_X_data(self, X_df): # only has an effect when standardize_X is True + """ :) If the user opts to standardize the X data (so that predictors have a mean of 0 + and a standard deviation of 1), this method transforms X_df with the scaler that was + fitted on the training data (a sklearn.preprocessing.StandardScaler stored as self.scaler). 
""" + if self.standardize_X: + # Transform both the training and test data + X_scaled = self.scaler.transform(X_df) + X_scaled_df = pd.DataFrame(X_scaled, columns=X_df.columns) + return X_scaled_df + else: + return X_df + + def center_y_data(self, y_df): # if the user opts to + """ :) If the user opts to center the response y data: + subtracting its mean from each observation.""" + if self.center_y: + # Center the response + y_train_centered = y_df - self.mean_y_train + return y_train_centered + else: + return y_df + + def updating_network_and_X_during_fitting(self, X, y): + # updated one :) + """ Update the prior network information and the + X input data (training) during the fitting of the model. It determines if the common predictors + should be used (based on if overlapped_nodes_only is True) or if all of the X input data should be used. """ + X_df = X.sort_index(axis=1) # sorting the X dataframe by columns. (rows are samples) + + #X_df = X.sort_index(axis=0).sort_index(axis=1) # sorting the X dataframe by rows and columns. 
+ #self.X_df = X_df + self.target_gene_y = y.columns[0] + + if self.standardize_X: # we will standardize X then + if self.verbose: + print(":) Standardizing the X data") + self.old_X_df = X_df + self.scaler = preprocessing.StandardScaler().fit(X_df) # Fit the scaler to the training data only + # this self.scalar will be utilized for the testing data to prevent data leakage and to ensure generalization :) + self.X_df = self.standardize_X_data(X_df) + X = self.X_df # overwriting and updating the X df + else: + self.X_df = X_df + + self.mean_y_train = np.mean(y) # the average y value + if self.center_y: # we will center y then + if self.verbose: + print(":) centering the y data") + # Assuming y_train and y_test are your training and test labels + self.old_y = y + y = self.center_y_data(y) + + gene_expression_nodes = X_df.columns.tolist() # these are already sorted + #gene_expression_nodes = sorted(X_df.columns.tolist()) # these will be sorted + ppi_net_nodes = set(self.network_nodes_list) + common_nodes = list(ppi_net_nodes.intersection(gene_expression_nodes)) + + if not common_nodes: # may be possible that the X dataframe needs to be transposed if provided incorrectly + print("Please note: we are flipping X dataframe around so that the rows are samples and the columns are gene/TF names :)") + X_df = X_df.transpose() + gene_expression_nodes = sorted(X_df.columns.tolist()) + common_nodes = list(ppi_net_nodes.intersection(gene_expression_nodes)) + + self.gene_expression_nodes = gene_expression_nodes + self.common_nodes = sorted(common_nodes) + self.final_nodes = gene_expression_nodes + if self.overlapped_nodes_only: + self.final_nodes = common_nodes + elif self.preprocessed_network: + self.final_nodes = self.prior_network.final_nodes + else: + self.final_nodes = gene_expression_nodes + + final_nodes_set = set(self.final_nodes) + ppi_nodes_to_remove = list(ppi_net_nodes - final_nodes_set) + self.gexpr_nodes_added = list(set(gene_expression_nodes) - final_nodes_set) + 
self.gexpr_nodes_to_add_for_net = list(set(gene_expression_nodes) - set(common_nodes)) + + if self.verbose: + if ppi_nodes_to_remove: + print(f"Please note that we remove {len(ppi_nodes_to_remove)} nodes found in the input network that are not found in the input gene expression data (X) :)") + print(ppi_nodes_to_remove) + else: + print(f":) Please note that all {len(common_nodes)} nodes found in the network are also found in the input gene expression data (X) :)") + + filter_network_bool = self.filter_network_bool = self.check_overlaps_work() #self.check_overlaps_work(X_df) + if filter_network_bool: + print("Please note that we need to update the network information") + self.updating_network_A_matrix_given_X() # updating the A matrix given the gene expression data X + if self.view_network: + ef.draw_arrow() + self.view_W_network() # call only: assigning the result back to self.view_W_network would overwrite the method and break any later fit + else: + self.A_df = self.network.A_df + self.A = self.network.A + self.nodes = self.A_df.columns.tolist() + + self.network_params = self.prior_network.param_lists + self.network_info = "fitted_network" + self.M = y.shape[0] # number of training samples + self.N = len(self.final_nodes) # number of predictors (nodes) + self.X_train = self.preprocess_X_df(X) + self.y_train = self.preprocess_y_df(y) + return self + + def organize_B_interaction_list(self): # TF-TF interactions to output :) + self.B_train = self.compute_B_matrix(self.X_train) + self.B_interaction_df = pd.DataFrame(self.B_train, index = self.final_nodes, columns = self.final_nodes) + return self + + + def fit(self, X, y): # trains the NetREm model on X and y + self.updating_network_and_X_during_fitting(X, y) + self.organize_B_interaction_list() + self.B_train_times_M = self.compute_B_matrix_times_M(self.X_train) + self.X_tilda_train, self.y_tilda_train = self.compute_X_tilde_y_tilde(self.B_train_times_M, self.X_train, + self.y_train) + self.X_training_to_use, self.y_training_to_use = self.X_tilda_train, self.y_tilda_train + self.regr = self.return_fit_ml_model(self.X_training_to_use, 
self.y_training_to_use) + ml_model = self.regr + self.final_alpha = self.alpha_lasso + if self.model_type == "LassoCV": + self.final_alpha = ml_model.alpha_ + self.optimal_alpha = "Cross-Validation optimal alpha lasso: " + str(self.final_alpha) + if self.verbose: + print(self.optimal_alpha) + self.coef = ml_model.coef_ # Please Get the coefficients + self.coef[self.coef == -0.0] = 0 + if self.y_intercept: + self.intercept = ml_model.intercept_ + self.predY_tilda_train = ml_model.predict(self.X_training_to_use) # training data + self.mse_tilda_train = self.calculate_mean_square_error(self.y_training_to_use, self.predY_tilda_train) # Calculate MSE + self.predY_train = ml_model.predict(self.X_train) # training data + self.mse_train = self.calculate_mean_square_error(self.y_train, self.predY_train) # Calculate MSE + if self.y_intercept: + coeff_terms = [self.intercept] + list(self.coef) + index_names = ["y_intercept"] + self.nodes + self.model_coef_df = pd.DataFrame(coeff_terms, index = index_names).transpose() + else: + coeff_terms = ["None"] + list(self.coef) + index_names = ["y_intercept"] + self.nodes + self.model_coef_df = pd.DataFrame(coeff_terms, index = index_names).transpose() + self.model_info = "fitted_model :)" + selected_row = self.model_coef_df.iloc[0] + selected_cols = selected_row[selected_row != 0].index # Filter out the columns with value 0 + if len(selected_cols) == 0: + self.model_nonzero_coef_df = None + self.num_final_predictors = 0 + else: + self.model_nonzero_coef_df = self.model_coef_df[selected_cols] + if len(selected_cols) > 1: # and self.model_type != "Linear": + self.netrem_model_predictor_results(y) + self.num_final_predictors = len(selected_cols) + if "y_intercept" in selected_cols: + self.num_final_predictors = self.num_final_predictors - 1 + return self + + + def netrem_model_predictor_results(self, y): # olders + """ :) Please note that this function by Saniya works on a netrem model and returns information about the predictors + such 
as their Pearson correlations with y, their rankings as well. + It returns: sorted_df, final_corr_vs_coef_df, combined_df """ + abs_df = self.model_nonzero_coef_df.replace("None", np.nan).apply(pd.to_numeric, errors='coerce').abs() + if abs_df.shape[0] == 1: + abs_df = pd.DataFrame([abs_df.squeeze()]) + sorted_series = abs_df.squeeze().sort_values(ascending=False) + sorted_df = pd.DataFrame(sorted_series) # convert the sorted series back to a DataFrame + sorted_df['Rank'] = range(1, len(sorted_df) + 1) # add a column for the rank + sorted_df['TF'] = sorted_df.index + sorted_df = sorted_df.rename(columns = {0:"AbsoluteVal_coefficient"}) + self.sorted_coef_df = sorted_df # print the sorted DataFrame + tg = y.columns.tolist()[0] + corr = pd.DataFrame(self.X_df.corrwith(y[tg])).transpose() + corr["info"] = "corr (r) with y: " + tg + all_df = self.model_coef_df + all_df = all_df.iloc[:, 1:] + all_df["info"] = "network regression coeff. with y: " + tg + all_df = pd.concat([all_df, corr]) + all_df["input_data"] = "X_train" + sorting = self.sorted_coef_df[["Rank"]].transpose().drop(columns = ["y_intercept"]) + sorting = sorting.reset_index().drop(columns = ["index"]) + sorting["info"] = "Absolute Value NetREm Coefficient Ranking" + sorting["input_data"] = "X_train" + all_df = pd.concat([all_df, sorting]) + self.corr_vs_coef_df = all_df + self.final_corr_vs_coef_df = self.corr_vs_coef_df[["info", "input_data"] + self.model_nonzero_coef_df.columns.tolist()[1:]] + + netrem_model_df = self.model_nonzero_coef_df.transpose() + netrem_model_df.columns = ["coef"] + netrem_model_df["TF"] = netrem_model_df.index.tolist() + netrem_model_df["TG"] = tg + if self.y_intercept: + netrem_model_df["info"] = "netrem_with_intercept" + else: + netrem_model_df["info"] = "netrem_no_intercept" + netrem_model_df["train_mse"] = self.mse_train + if self.model_type != "Linear": + netrem_model_df["beta_net"] = self.beta_net + if self.model_type == "LassoCV": + netrem_model_df["alpha_lassoCV"] = 
self.optimal_alpha + else: + netrem_model_df["alpha_lasso"] = self.alpha_lasso + if netrem_model_df.shape[0] > 1: + self.combined_df = pd.merge(netrem_model_df, self.sorted_coef_df) + self.combined_df["final_model_TFs"] = max(self.sorted_coef_df["Rank"]) - 1 + else: + self.combined_df = netrem_model_df + self.combined_df["TFs_input_to_model"] = len(self.final_nodes) + self.combined_df["original_TFs_in_X"] = len(self.gene_expression_nodes) + self.combined_df["standardized_X"] = self.standardize_X + self.combined_df["centered_y"] = self.center_y + return self + + def view_W_network(self): + roundedW = np.round(self.W, decimals=4) + wMat = ef.view_matrix_as_dataframe(roundedW, column_names_list=self.final_nodes, row_names_list=self.final_nodes) + w_edgeList = wMat.stack().reset_index() + w_edgeList = w_edgeList[w_edgeList["level_0"] != w_edgeList["level_1"]] + w_edgeList = w_edgeList.rename(columns={"level_0": "source", "level_1": "target", 0: "weight"}) + w_edgeList = w_edgeList[w_edgeList["weight"] != 0] + + G = nx.from_pandas_edgelist(w_edgeList, source="source", target="target", edge_attr="weight") + pos = nx.spring_layout(G) + weights_list = [G.edges[e]['weight'] * self.prior_network.edge_weight_scaling for e in G.edges] + + fig, ax = plt.subplots() + + if not self.overlapped_nodes_only: + nodes_to_add = list(set(self.gene_expression_nodes) - set(self.common_nodes)) + if nodes_to_add: + print(f":) {len(nodes_to_add)} new nodes added to network based on gene expression data {nodes_to_add}") + node_color_map = { + node: self.prior_network.added_node_color_name if node in nodes_to_add else self.prior_network.node_color_name + for node in G.nodes + } + nx.draw(G, pos, node_color=node_color_map.values(), edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) + else: + nx.draw(G, pos, node_color=self.prior_network.node_color_name, edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) + else: + 
nx.draw(G, pos, node_color=self.prior_network.node_color_name, edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) + + labels = {e: G.edges[e]['weight'] for e in G.edges} + return nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, ax=ax) + + def compute_B_matrix_times_M(self, X): + """ M is N_sample, because ||y - Xc||^2 need to be normalized by 1/n_sample, but not the 2 * beta_L2 * c'Ac term + see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html + The optimization objective for Lasso is: + (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1 where M = n_sample + Calculations""" + XtX = X.T @ X + beta_L2 = self.beta_net + N_squared = self.N * self.N + part_2 = 2.0 * float(beta_L2) * self.M / (N_squared) * self.A + B = XtX + part_2 + return B + + + def compute_B_matrix(self, X): + """ M is N_sample, because ||y - Xc||^2 need to be normalized by 1/n_sample, but not the 2 * beta_L2 * c'Ac term + see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html + The optimization objective for Lasso is: + (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1 + where M = n_sample + Outputting for user """ + return self.compute_B_matrix_times_M(X) / self.M + + + def compute_X_tilde_y_tilde(self, B, X, y): + """Compute X_tilde, y_tilde such that X_tilde.T @ X_tilde = B, y_tilde.T @ X_tilde = y.T @ X """ + U, s, _Vh = np.linalg.svd(B, hermitian=True) # B = U @ np.diag(s) @ _Vh + if (cond := s[0] / s[-1]) > 1e10: + print(f'Large conditional number of B matrix: {cond: .2f}') + S_sqrt = ef.DiagonalLinearOperator(np.sqrt(s)) + S_inv_sqrt = ef.DiagonalLinearOperator(1 / np.sqrt(s)) + X_tilde = S_sqrt @ U.T + y_tilde = (y @ X @ U @ S_inv_sqrt).T + # assert(np.allclose(y.T @ X, y_tilde.T @ X_tilde)) + # assert(np.allclose(B, X_tilde.T @ X_tilde)) + # scale: we normalize by 1/M, but sklearn.linear_model.Lasso normalize by 1/N because X_tilde is N*N matrix, + # so Lasso thinks 
the number of sample is N instead of M, to use lasso solve our desired problem, correct the scale + scale = np.sqrt(self.N / self.M) + X_tilde *= scale + y_tilde *= scale + return X_tilde, y_tilde + + def predict_y_from_y_tilda(self, X, X_tilda, pred_y_tilda): + + X = self.preprocess_X_df(X) + # Transposing the matrix before inverting + X_transpose_inv = np.linalg.inv(X.T) + + # Efficiently compute pred_y by considering the dimensions of matrices + pred_y = np.dot(np.dot(X_transpose_inv, X_tilda.T), pred_y_tilda) + + return pred_y + + + def _apply_parameter_constraints(self): + constraints = {**NetREmModel._parameter_constraints} + for key, value in self.__dict__.items(): + if key in constraints: + if isinstance(constraints[key], tuple): + if isinstance(constraints[key][0], type) and not isinstance(value, constraints[key][0]): + setattr(self, key, constraints[key][0]) + elif constraints[key][1] is not None and isinstance(constraints[key][1], type) and not isinstance(value, constraints[key][1]): + setattr(self, key, constraints[key][1]) + elif value not in constraints[key]: + setattr(self, key, constraints[key][0]) + return self + + + def calculate_mean_square_error(self, actual_values, predicted_values): + difference = (actual_values - predicted_values)# Please note that this function by Saniya calculates the Mean Square Error (MSE) + squared_diff = difference ** 2 # square of the difference + mean_squared_diff = np.mean(squared_diff) + return mean_squared_diff + + + def predict(self, X_test): + if self.standardize_X: + self.X_test_standardized = self.standardize_X_data(X_test) + X_test = self.preprocess_X_df(self.X_test_standardized) + else: + X_test = self.preprocess_X_df(X_test) # X_test + return self.regr.predict(X_test) + + + def test_mse(self, X_test, y_test): + X_test = X_test.sort_index(axis=1) # 9/20 + if self.standardize_X: + self.X_test_standardized = self.standardize_X_data(X_test) + X_test = self.preprocess_X_df(self.X_test_standardized) + else: + 
X_test = self.preprocess_X_df(X_test) # X_test + if self.center_y: + y_test = self.center_y_data(y_test) + #X_test = self.preprocess_X_df(X_test) # X_test + y_test = self.preprocess_y_df(y_test) + predY_test = self.regr.predict(X_test) # training data + mse_test = self.calculate_mean_square_error(y_test, predY_test) # Calculate MSE + return mse_test #mse_test + + + def get_params(self, deep=True): + params_dict = {"info":self.info, "alpha_lasso": self.alpha_lasso, "beta_net": self.beta_net, + "y_intercept": self.y_intercept, "model_type":self.model_type, + "standardize_X":self.standardize_X, + "center_y":self.center_y, + "max_lasso_iterations":self.max_lasso_iterations, + "network":self.network, "verbose":self.verbose, + "all_pos_coefs":self.all_pos_coefs, "model_info":self.model_info, + "target_gene_y":self.target_gene_y} + if self.model_type == "LassoCV": + params_dict["num_cv_folds"] = self.num_cv_folds + params_dict["num_jobs"] = self.num_jobs + params_dict["alpha_lasso"] = "LassoCV finds optimal alpha" + params_dict["lassocv_eps"] = self.lassocv_eps + params_dict["lassocv_n_alphas"] = self.lassocv_n_alphas + params_dict["lassocv_alphas"] = self.lassocv_alphas + params_dict["optimal_alpha"] = self.optimal_alpha + elif self.model_type == "Linear": + params_dict["alpha_lasso"] = "No alpha needed" + params_dict["num_jobs"] = self.num_jobs + if self.model_type != "Linear": + params_dict["tolerance"] = self.tolerance + params_dict["lasso_selection"] = self.lasso_selection + if not deep: + return params_dict + else: + return copy.deepcopy(params_dict) + + + def set_params(self, **params): + """ Sets the value of any parameters in this estimator + Parameters: **params: Dictionary of parameter names mapped to their values + Returns: self: Returns an instance of self """ + if not params: + return self + for key, value in params.items(): + if key not in self.get_params(): + raise ValueError(f'Invalid parameter {key} for estimator {self.__class__.__name__}') + 
setattr(self, key, value) + return self + + + def __deepcopy__(self, memo): + cls = self.__class__ + result = cls.__new__(cls) + memo[id(self)] = result + for k, v in self.__dict__.items(): + setattr(result, k, deepcopy(v, memo)) + result.optimal_alpha = self.optimal_alpha + return result + + + def clone(self): + return deepcopy(self) + + + def score(self, X, y, zero_coef_penalty=10): + if isinstance(X, pd.DataFrame): + X = self.preprocess_X_df(X) # X_test + if isinstance(y, pd.DataFrame): + y = self.preprocess_y_df(y) + + # Make predictions using the predict method of your custom estimator + y_pred = self.predict(X) + + # Handle cases where predictions are exactly zero + y_pred[y_pred == 0] = 1e-10 + + # Calculate the normalized mean squared error between the true and predicted values + nmse_ = (y - y_pred)**2 + nmse_[y_pred == 1e-10] *= zero_coef_penalty + nmse_ = nmse_.mean() / (y**2).mean() + + if nmse_ == 0: + #return float(1e1000) # Return positive infinity if nmse_ is zero + + return float("inf") # Return positive infinity if nmse_ is zero + else: + return -nmse_ + + + +# def score(self, X, y, zero_coef_penalty=10): +# print("Debug: Start of score function") + +# if isinstance(X, pd.DataFrame): +# X = self.preprocess_X_df(X) +# print(f"Debug: preprocessed X, nulls: {X.isnull().sum().sum()}") + +# if isinstance(y, pd.DataFrame): +# y = self.preprocess_y_df(y) +# print(f"Debug: preprocessed y, nulls: {y.isnull().sum().sum()}") + +# y_pred = self.predict(X) +# print(f"Debug: y_pred, nulls: {np.isnan(y_pred).sum()}, infs: {np.isinf(y_pred).sum()}") + +# y_pred[y_pred == 0] = 1e-10 +# nmse_ = (y - y_pred)**2 + +# nmse_[y_pred == 1e-10] *= zero_coef_penalty +# denominator = (y**2).mean() + +# print(f"Debug: Denominator: {denominator}") + +# if denominator == 0: +# print("Debug: Denominator is zero.") +# return -1e10 # Some large negative value + +# nmse_ = nmse_.mean() / denominator + +# if nmse_ == 0: +# print("Debug: nmse_ is zero.") +# return -1e10 # Some large 
negative value + +# print(f"Debug: Returning score: {-nmse_}") +# return -nmse_ + + +# def score(self, X, y, zero_coef_penalty=10): +# if isinstance(X, pd.DataFrame): +# X = self.preprocess_X_df(X) # X_test +# if isinstance(y, pd.DataFrame): +# y = self.preprocess_y_df(y) +# # Make predictions using the predict method of your custom estimator +# y_pred = self.predict(X) +# # Calculate the normalized mean squared error between the true and predicted values +# nmse_ = (y - y_pred)**2 +# nmse_[y_pred==0] *= zero_coef_penalty +# nmse_ = nmse_.mean() / (y**2).mean() +# return -nmse_ # Return the negative normalized mean squared error + + + def updating_network_A_matrix_given_X(self) -> np.ndarray: + """ When we call the fit method, this function is used to help us update the network information. + Here, we can generate updated W matrix, updated D matrix, and updated V matrix. + Then, those updated derived matrices are used to calculate the A matrix. + """ + network = self.network + final_nodes = self.final_nodes + W_df = network.W_df.copy() # updating the W matrix + + # Simplified addition of new nodes + if self.gexpr_nodes_added: + for node in self.gexpr_nodes_added: + W_df[node] = np.nan + W_df.loc[node] = np.nan + + # Consolidated indexing and reindexing operations + W_df = W_df.reindex(index=final_nodes, columns=final_nodes) + + # Handle missing values + W_df.fillna(value=self.prior_network.default_edge_weight, inplace=True) + np.fill_diagonal(W_df.values, 0) + + N = len(final_nodes) + self.N = N + W = W_df.values + np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + self.W = W + self.W_df = W_df + + # Check for symmetric matrix + if not ef.check_symmetric(W): + print(":( W matrix is NOT symmetric") + + # Update V matrix + self.V = N * np.eye(N) - np.ones(N) + + # Update D matrix + if not network.edge_values_for_degree: + W_bool = (W > network.threshold_for_degree) + d = np.float64(W_bool.sum(axis=0) - W_bool.diagonal()) + else: + if 
network.w_transform_for_d == "sqrt": + W_to_use = np.sqrt(W) + elif network.w_transform_for_d == "square": + W_to_use = W ** 2 + else: + W_to_use = W + d = W_to_use.diagonal() * (self.N - 1) + + # Handle pseudocount and self loops + d += network.pseudocount_for_degree + if network.consider_self_loops: + d += 1 + + d_inv_sqrt = 1 / np.sqrt(d) + self.D = ef.DiagonalLinearOperator(d_inv_sqrt) + + # Update inv_sqrt_degree_df + self.inv_sqrt_degree_df = pd.DataFrame({ + "TF": self.final_nodes, + "degree_D": self.D * np.ones(self.N) + }) + + Amat = self.D @ (self.V * W) @ self.D + A_df = pd.DataFrame(Amat, columns=final_nodes, index=final_nodes, dtype=np.float64) + + # Handle nodes based on `overlapped_nodes_only` + gene_expression_nodes = self.gene_expression_nodes + nodes_to_add = list(set(gene_expression_nodes) - set(final_nodes)) + self.nodes_to_add = nodes_to_add + if not self.overlapped_nodes_only: + for name in nodes_to_add: + A_df[name] = 0 + A_df.loc[name] = 0 + A_df = A_df.reindex(columns=sorted(gene_expression_nodes), index=sorted(gene_expression_nodes)) + else: + if len(nodes_to_add) == 1: + print(f"Please note that we remove 1 node {nodes_to_add[0]} found in the input gene expression data (X) that is not found in the input network :)") + elif len(nodes_to_add) > 1: + print(f":) Since overlapped_nodes_only = True, please note that we remove {len(nodes_to_add)} gene expression nodes that are not found in the input network.") + print(nodes_to_add) + A_df = A_df.sort_index(axis=0).sort_index(axis=1) + + self.A_df = A_df + self.A = A_df.values + self.nodes = A_df.columns.tolist() + self.tf_names_list = self.nodes + return self + + + def preprocess_X_df(self, X): + if isinstance(X, pd.DataFrame): + X_df = X + column_names_list = list(X_df.columns) + overlap_num = len(ef.intersection(column_names_list, self.final_nodes)) + if overlap_num == 0: + print("Please note: we are flipping X dataframe around so that the rows are samples and the columns are gene/TF names 
:)") + X_df = X_df.transpose() + column_names_list = list(X_df.columns) + overlap_num = len(ef.intersection(column_names_list, self.common_nodes)) + gene_names_list = self.final_nodes # so that this matches the order of columns in A matrix as well + X_df = X_df.loc[:, X_df.columns.isin(gene_names_list)] # filtering the X_df as needed based on the columns + X_df = X_df.reindex(columns=gene_names_list)# Reorder columns of dataframe to match order in `column_order` + X = np.array(X_df.values.tolist()) + return X +# def preprocess_X_df(self, X): +# if isinstance(X, pd.DataFrame): +# column_names_list = X.columns.tolist() +# overlap_num = len(set(column_names_list).intersection(self.final_nodes)) + +# if overlap_num == 0: +# print("Please note: we are flipping X dataframe around so that the rows are samples and the columns are gene/TF names :)") +# X = X.transpose() + +# gene_names_list = self.final_nodes # so that this matches the order of columns in A matrix as well +# X = X[gene_names_list] + +# return X.values + + def preprocess_y_df(self, y): + if isinstance(y, pd.DataFrame): + y = y.values.flatten() + return y + + + def return_Linear_ML_model(self, X, y): + regr = LinearRegression(fit_intercept = self.y_intercept, + positive = self.all_pos_coefs, + n_jobs = self.num_jobs) + regr.fit(X, y) + return regr + + + def return_Lasso_ML_model(self, X, y): + regr = Lasso(alpha = self.alpha_lasso, fit_intercept = self.y_intercept, + max_iter = self.max_lasso_iterations, tol = self.tolerance, + selection = self.lasso_selection, + positive = self.all_pos_coefs) + regr.fit(X, y) + return regr + + + def return_LassoCV_ML_model(self, X, y): + regr = LassoCV(cv = self.num_cv_folds, random_state = 0, + fit_intercept = self.y_intercept, + max_iter = self.max_lasso_iterations, + n_jobs = self.num_jobs, + tol = self.tolerance, + selection = self.lasso_selection, + positive = self.all_pos_coefs, + eps = self.lassocv_eps, + n_alphas = self.lassocv_n_alphas, + alphas = 
self.lassocv_alphas) + regr.fit(X, y) + return regr + + + def return_fit_ml_model(self, X, y): + if self.model_type == "Linear": + model_to_return = self.return_Linear_ML_model(X, y) + elif self.model_type == "Lasso": + model_to_return = self.return_Lasso_ML_model(X, y) + elif self.model_type == "LassoCV": + model_to_return = self.return_LassoCV_ML_model(X, y) + return model_to_return + + +def netrem(edge_list, beta_net = 1, alpha_lasso = 0.01, default_edge_weight = 0.1, + degree_threshold = 0.5, gene_expression_nodes = [], overlapped_nodes_only = False, + y_intercept = False, standardize_X = True, center_y = True, view_network = False, + model_type = "Lasso", lasso_selection = "cyclic", all_pos_coefs = False, tolerance = 1e-4, maxit = 10000, + num_jobs = -1, num_cv_folds = 5, lassocv_eps = 1e-3, + lassocv_n_alphas = 100, # default in sklearn + lassocv_alphas = None, # default in sklearn + verbose = False, + hide_warnings = True): + degree_pseudocount = 1e-3 + if hide_warnings: + warnings.filterwarnings("ignore") + default_beta = False + default_alpha = False + if beta_net == 1: + print("using beta_net default of", 1) + default_beta = True + if alpha_lasso == 0.01: + if model_type != "LassoCV": + print("using alpha_lasso default of", 0.01) + default_alpha = True + edge_vals_for_d = False + self_loops = False + w_transform_for_d = "none" + + prior_graph_dict = {"edge_list": edge_list, + "gene_expression_nodes":gene_expression_nodes, + "edge_values_for_degree": edge_vals_for_d, + "consider_self_loops":self_loops, + "pseudocount_for_degree":degree_pseudocount, + "default_edge_weight": default_edge_weight, + "w_transform_for_d":w_transform_for_d, + "threshold_for_degree": degree_threshold, + "verbose":verbose, + "view_network":view_network} + netty = graph.PriorGraphNetwork(**prior_graph_dict) # uses the network to get features like the A matrix. 
+ greg_dict = {"network": netty, + "model_type": model_type, + "use_network":True, + "standardize_X":standardize_X, + "center_y":center_y, + "y_intercept":y_intercept, + "overlapped_nodes_only":overlapped_nodes_only, + "max_lasso_iterations":maxit, + "all_pos_coefs":all_pos_coefs, + "view_network":view_network, + "verbose":verbose} + if default_alpha == False: + greg_dict["alpha_lasso"] = alpha_lasso + if default_beta == False: + greg_dict["beta_net"] = beta_net + if model_type != "Linear": + greg_dict["tolerance"] = tolerance + greg_dict["lasso_selection"] = lasso_selection + if model_type != "Lasso": + greg_dict["num_jobs"] = num_jobs + if model_type == "LassoCV": + greg_dict["num_cv_folds"] = num_cv_folds + greg_dict["lassocv_eps"] = lassocv_eps + greg_dict["lassocv_n_alphas"] = lassocv_n_alphas + greg_dict["lassocv_alphas"] = lassocv_alphas + greggy = NetREmModel(**greg_dict) + return greggy + + + + +def generate_beta_networks(X_train, y_train, standardize_X, prior_network, overlapped_nodes_only = False, num = 10, max_beta = 200): + """ + Generate a grid of beta_network values to transform X_train. 
+ + Parameters: + X_train (numpy array): training input data + + Returns: + numpy array: grid of beta_network values + """ + if isinstance(X_train, pd.DataFrame): + X_df = X_train + gene_names_list = list(X_df.columns) + if overlapped_nodes_only: + nodes_list = prior_network.nodes#self.nodes + common_nodes = ef.intersection(gene_names_list, nodes_list) + common_nodes.sort() + + X_df = X_df.loc[:, X_df.columns.isin(common_nodes)] + # Reorder columns of dataframe to match order in `column_order` + X_df = X_df.reindex(columns=common_nodes) + else: + X_df = X_df.reindex(columns=gene_names_list) + + if standardize_X: + print("standardizing X :)") + scaler = preprocessing.StandardScaler().fit(X_df) + X_train = scaler.transform(X_df) + else: + X_train = np.array(X_df.values.tolist()) + if isinstance(y_train, pd.DataFrame): + y_train = y_train.values.flatten() + beta_max = 0.5 * np.max(np.abs(X_train.T.dot(y_train))) + beta_min = 0.01 * beta_max + + var_X = np.var(X_train) + var_y = np.var(y_train) + if beta_max > max_beta: # max_beta used to prevent explosion of beta_net values + print(":) using variance to define beta_net values") + beta_max = 0.5 * np.max(np.abs(var_X * var_y)) * 100 + beta_min = 0.01 * beta_max + print(f"beta_min = {beta_min} and beta_max = {beta_max}") + + return np.logspace(np.log10(beta_max), np.log10(beta_min), num=num) + + +def generate_alpha_beta_pairs(X_train, + y_train, + prior_network, + overlapped_nodes_only: bool = False, + standardize_X: bool = True, + center_y: bool = True, + num_beta: int = 50, + num_alpha: int = 10, + max_beta: float = 200, + y_intercept: bool = False, + maxit: int = 10000, + all_pos_coefs: bool = False, + tolerance = 1e-4, + lasso_selection = "cyclic", + num_cv_folds = 5, + num_jobs = -1, + lassocv_eps = 1e-3, + lassocv_n_alphas = 100, + lassocv_alphas = None) -> dict: + """ + Generate a pairwise set of alpha_lasso and beta_network values. 
+ + Parameters: + X_train (numpy array): training input data + y_train (numpy array): training output data + prior_network: The prior network to be used. + overlapped_nodes_only (bool): Whether to use only overlapped nodes. Default is False. + num (int): The number of beta_network values to generate. Default is 100. + + Returns: + dict: Dictionary containing grid of alpha_lasso values and beta_network values. + """ + beta_grid = generate_beta_networks(X_train, y_train, standardize_X, prior_network, overlapped_nodes_only, num=num_beta, max_beta = max_beta) + beta_alpha_grid_dict = {"beta_network_vals": [], "alpha_lasso_vals": []} + + try: + with tqdm(beta_grid, desc=":) Generating beta_net and alpha_lasso pairs") as pbar: + for beta in pbar: + # please fix it so it reflects what we want more... like the proper defaults + netremCV_demo = nm.NetREmModel(beta_network=beta, + model_type="LassoCV", + network=prior_network, + standardize_X = standardize_X, + center_y = center_y, + overlapped_nodes_only=overlapped_nodes_only) +# netremCV_demo = nm.NetREmModel(beta_network=beta, +# model_type="LassoCV", +# network=prior_network, +# overlapped_nodes_only=overlapped_nodes_only, +# standardize_X = standardize_X, +# y_intercept = y_intercept, +# max_lasso_iterations = maxit, +# all_pos_coefs = all_pos_coefs, +# tolerance = tolerance, +# lasso_selection = lasso_selection, +# num_cv_folds = num_cv_folds, +# #num_jobs = num_jobs, +# lassocv_eps = lassocv_eps, +# lassocv_n_alphas = lassocv_n_alphas, +# lassocv_alphas = lassocv_alphas) + + # Fit the model and compute alpha_max and alpha_min + netremCV_demo.fit(X_train, y_train) + X_tilda_train = netremCV_demo.X_tilda_train + y_tilda_train = netremCV_demo.y_tilda_train + alpha_max = 0.5 * np.max(np.abs(X_tilda_train.T.dot(y_tilda_train))) + alpha_min = 0.01 * alpha_max + + # Generate alpha_grid based on alpha_max and alpha_min + optimal_alpha = netremCV_demo.regr.alpha_ + alpha_grid = np.append(optimal_alpha, 
np.logspace(np.log10(alpha_min), np.log10(alpha_max), num=num_alpha)) + + # Find the best alpha using cross-validation + best_alpha = None + best_score = float('-inf') + for alpha in alpha_grid: + #netremCV_demo.regr.set_params(alpha=alpha) +# netremCV_demo = nm.NetREmModel(beta_network=beta, +# alpha_lasso = alpha, +# model_type="Lasso", +# network=prior_network, +# standardize_X = standardize_X, +# overlapped_nodes_only=overlapped_nodes_only, +# y_intercept = y_intercept, +# max_lasso_iterations = maxit, +# all_pos_coefs = all_pos_coefs, +# tolerance = tolerance, +# lasso_selection = lasso_selection, +# num_cv_folds = num_cv_folds, +# #num_jobs = num_jobs, +# lassocv_eps = lassocv_eps, +# lassocv_n_alphas = lassocv_n_alphas, +# lassocv_alphas = lassocv_alphas) + netremCV_demo = nm.NetREmModel(beta_network=beta, + alpha_lasso = alpha, + standardize_X = standardize_X, + center_y = center_y, + model_type="Lasso", + network=prior_network, + overlapped_nodes_only=overlapped_nodes_only) + scores = cross_val_score(netremCV_demo, X_train, y_train, cv=5) # You can change cv to your specific cross-validation strategy + mean_score = np.mean(scores) + if mean_score > best_score: + best_score = mean_score + best_alpha = alpha + + # Append the beta and best_alpha to the dictionary + beta_alpha_grid_dict["beta_network_vals"].append(beta) + beta_alpha_grid_dict["alpha_lasso_vals"].append(best_alpha) + + except Exception as e: + print(f"An error occurred: {e}") + print("finished generate_alpha_beta_pairs") + print(beta_alpha_grid_dict) + return beta_alpha_grid_dict + + +# Custom scoring function +def custom_mse(y_true, y_pred): + mse = mean_squared_error(y_true, y_pred) + pbar.update(1) # Update the progress bar + return -mse # Negate because GridSearchCV tries to maximize the score + + +def netremCV(edge_list, X, y, + num_beta: int = 50, + num_alpha: int = 10, + max_beta: float = 200, # max_beta used to help prevent explosion of beta_net values + reduced_cv_search: bool = False, 
# should we do a reduced search (Randomized Search) or a GridSearch? + default_edge_weight: float = 0.1, + degree_threshold: float = 0.5, + gene_expression_nodes = [], + overlapped_nodes_only: bool = False, + standardize_X: bool = True, + center_y: bool = True, + y_intercept: bool = False, + model_type = "Lasso", + lasso_selection = "cyclic", + all_pos_coefs: bool = False, + tolerance: float = 1e-4, + maxit: int = 10000, + num_jobs: int = -1, + num_cv_folds: int = 5, + lassocv_eps: float = 1e-3, + lassocv_n_alphas: int = 100, # default in sklearn + lassocv_alphas = None, # default in sklearn + verbose = False, + searchVerbosity: int = 2): + + prior_graph_dict = {"edge_list": edge_list, + "gene_expression_nodes":gene_expression_nodes, + "edge_values_for_degree": False, + "consider_self_loops":False, + "pseudocount_for_degree":1e-3, + "default_edge_weight": default_edge_weight, + "w_transform_for_d":"none", + "threshold_for_degree": degree_threshold, + "verbose":verbose, + "view_network":False} + + network_to_use = graph.PriorGraphNetwork(**prior_graph_dict) + X_train = X + y_train = y + # max_beta must be passed here: the arguments are positional, so omitting it shifts every later argument by one slot + beta_alpha_grid_dict = generate_alpha_beta_pairs(X_train, + y_train, network_to_use, + overlapped_nodes_only, standardize_X, center_y, + num_beta, num_alpha, max_beta, + y_intercept, + maxit, + all_pos_coefs, + tolerance, + lasso_selection, + num_cv_folds, + num_jobs, + lassocv_eps, + lassocv_n_alphas, + lassocv_alphas) + print(f"Length of beta_alpha_grid_dict: {len(beta_alpha_grid_dict['beta_network_vals'])}") + + param_grid = [{"alpha_lasso": [alpha_las], "beta_net": [beta_net]} + for alpha_las, beta_net in zip(beta_alpha_grid_dict["alpha_lasso_vals"], + beta_alpha_grid_dict["beta_network_vals"])] + + + + print(":) Performing NetREmCV with both beta_network and alpha_lasso as UNKNOWN.") + + initial_greg = nm.NetREmModel(network=network_to_use, + y_intercept = y_intercept, + standardize_X = standardize_X, + center_y = center_y, + max_lasso_iterations=maxit, + all_pos_coefs=all_pos_coefs, + 
lasso_selection = lasso_selection, + tolerance = tolerance, + view_network=False, + overlapped_nodes_only=overlapped_nodes_only) + + pbar = tqdm(total=len(param_grid)) # Assuming we're trying 9 combinations of parameters + + if reduced_cv_search: + # Run RandomizedSearchCV + print(f":) since reduced_cv_search = {reduced_cv_search}, we perform RandomizedSearchCV on a reduced search space") + grid_search= RandomizedSearchCV(initial_greg, + param_grid, + n_iter=num_alpha, + cv=num_cv_folds, + #scoring=make_scorer(custom_mse, greater_is_better=False), + verbose=searchVerbosity) + else: + # Run GridSearchCV + grid_search = GridSearchCV(initial_greg, param_grid=param_grid, cv=num_cv_folds, + #scoring=make_scorer(custom_mse, greater_is_better=False), + verbose = searchVerbosity) + grid_search.fit(X_train, y_train) + + # Extract and display the best hyperparameters + best_params = grid_search.best_params_ + optimal_alpha = best_params["alpha_lasso"] + optimal_beta = best_params["beta_net"] + print(f":) NetREmCV found that the optimal alpha_lasso = {optimal_alpha} and optimal beta_net = {optimal_beta}") + + newest_netrem = nm.NetREmModel(alpha_lasso = optimal_alpha, + beta_net = optimal_beta, + network = network_to_use, + y_intercept = y_intercept, + standardize_X = standardize_X, + center_y = center_y, + max_lasso_iterations=maxit, + all_pos_coefs=all_pos_coefs, + lasso_selection = lasso_selection, + tolerance = tolerance, + view_network=False, + overlapped_nodes_only=overlapped_nodes_only) + newest_netrem.fit(X_train, y_train) + train_mse = newest_netrem.test_mse(X_train, y_train) + print(f":) Please note that the training Mean Square Error (MSE) from this fitted NetREm model is {train_mse}") + return newest_netrem + + +def organize_B_interaction_network(netrem_model): + B_interaction_df = netrem_model.B_interaction_df + num_TFs = B_interaction_df.shape[0] + B_interaction_df = B_interaction_df.reset_index().melt(id_vars='index', var_name='TF2', 
value_name='B_train_weight') + B_interaction_df = B_interaction_df.rename(columns = {"index":"TF1"}) + B_interaction_df = B_interaction_df[B_interaction_df["TF1"] != B_interaction_df["TF2"]] + B_interaction_df = B_interaction_df.sort_values(by = ['B_train_weight'], ascending = False) + B_interaction_df["sign"] = np.where((B_interaction_df.B_train_weight > 0), ":)", ":(") + B_interaction_df["potential_interaction"] = np.where((B_interaction_df.B_train_weight > 0), ":) cooperative (+)", + ":( competitive (-)") + B_interaction_df["absVal_B"] = abs(B_interaction_df["B_train_weight"]) + B_interaction_df["info"] = "B matrix of TF-TF interactions" + B_interaction_df["candidate_TFs_N"] = num_TFs + B_interaction_df["target_gene_y"] = netrem_model.target_gene_y + B_interaction_df["num_final_predictors"] = netrem_model.num_final_predictors + B_interaction_df["model_type"] = netrem_model.model_type + B_interaction_df["beta_net"] = netrem_model.beta_net + B_interaction_df["X_standardized"] = netrem_model.standardize_X + B_interaction_df["gene_data"] = "training gene expression data" + + # Step 1: Sort the DataFrame + B_interaction_df = B_interaction_df.sort_values('absVal_B', ascending=False) + + # Step 2: Get the rank + B_interaction_df['rank'] = B_interaction_df['absVal_B'].rank(method='min', ascending=False) + + # Step 3: Calculate the percentile + B_interaction_df['percentile'] = (1 - (B_interaction_df['rank'] / B_interaction_df['absVal_B'].count())) * 100 + return B_interaction_df \ No newline at end of file diff --git a/code/old_code/refresh/PriorGraphNetwork.py b/code/old_code/refresh/PriorGraphNetwork.py new file mode 100644 index 0000000..29d474d --- /dev/null +++ b/code/old_code/refresh/PriorGraphNetwork.py @@ -0,0 +1,547 @@ +# PriorGraphNetwork Class: :) +# Standard libraries +import os +import sys +import random +import copy +import warnings + +# Third-party libraries +import pandas as pd +import numpy as np +import networkx as nx +import scipy +import matplotlib.pyplot as plt 
+import plotly.express as px +from tqdm import tqdm +import jinja2 + +# Scikit-learn imports +from sklearn import linear_model +from sklearn.metrics import make_scorer +from sklearn.exceptions import ConvergenceWarning +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge + +# Scipy imports +from scipy.linalg import svd as robust_svd +from scipy.sparse.linalg import LinearOperator # public import path; scipy.sparse.linalg.interface is a deprecated private module + +# Type hinting +from typing import Optional, List, Tuple +from numpy.typing import ArrayLike + +# Custom module imports +import essential_functions as ef +import error_metrics as em +import DemoDataBuilderXandY as demo + + +import math +from sklearn.metrics.pairwise import cosine_similarity +from node2vec import Node2Vec + + +# Constants +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 + +# Utility functions +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) + + +class PriorGraphNetwork: + """:) This class incorporates information from a prior network (in our case, + a biological network of some sort). The input is an edge list with entries: source, target, weight. If an + edge has no weight, default_edge_weight (0.1 by default) is assumed. + If the prior network is NOT symmetric (most likely directed): + we can use graph embedding techniques like weighted node2vec (on the directed graph) to generate + an embedding, find the cosine similarity, and then use the node-node similarity values for our network. 
+ Ultimately, this class builds the W matrix (for the prior network weights to be used for our network + regularization penalty), the D matrix (of degrees), and the V matrix (custom for our approach).""" + + _parameter_constraints = { + "w_transform_for_d": ["none", "sqrt", "square"], + "degree_pseudocount": (0, None), + "default_edge_weight": (0, None), + "threshold_for_degree": (0, None), + "view_network":[True, False], + "verbose":[True, False]} + + def __init__(self, **kwargs): # define default values for constants + + self.edge_values_for_degree = False # we instead consider a threshold by default (for counting edges into our degrees) + self.consider_self_loops = False # no self loops considered + self.verbose = True # printing out statements + self.pseudocount_for_degree = 1e-3 # to ensure that we do not have any 0 degrees for any node in our matrix. + self.undirected_graph_bool = True # by default we assume the input network is undirected and symmetric :) + self.default_edge_weight = 0.1 # if an edge is missing weight information + # these are the nodes we may wish to include. If these are provided, then we may utilize these in our model. + self.gene_expression_nodes = [] # default if we use edge weights for degree: + # if edge_values_for_degree is True: we can use the edge weight values to get the degrees. 
+ self.w_transform_for_d = "none" + #self.square_root_weights_for_degree = False # take the square root of the edge weights for the degree calculations + #self.squaring_weights_for_degree = False # square the edge weights for the degree calculations + # default if we use a threshold for the degree: + self.threshold_for_degree = 0.5 + self.view_network = False + #################### +# self.dimensions = 64 +# self.walk_length = 30 +# self.num_walks = 200 +# self.p = 1 +# self.q = 0.5 +# self.workers = 4 +# self.window = 10 +# self.min_count = 1 +# self.batch_words = 4 + self.node_color_name = "yellow" + self.added_node_color_name = "lightblue" + self.edge_color_name = "red" + self.edge_weight_scaling = 5 + self.debug = False + #################### + self.preprocessed_network = False # is the network preprocessed with the final gene expression nodes + self.__dict__.update(kwargs) # overwrite with any user arguments :) + required_keys = ["edge_list"] # if consider_self_loops is true, we add 1 to degree value for each node, + # check that all required keys are present: + missing_keys = [key for key in required_keys if key not in self.__dict__] + if missing_keys: + raise ValueError(f":( Please note since edge_values_for_degree = {self.edge_values_for_degree} you are missing information for these keys: {missing_keys}") + self.network_nodes = self.network_nodes_from_edge_list() + # other defined results: + # added Aug 30th: + if isinstance(self.edge_list, pd.DataFrame): + print(":( Please input edgelist as a list of lists instead of a dataframe. 
For your edge_df, try: edge_df.values.tolist()") + #self.edge_list = self.edge_list.values.tolist() + self.original_edge_list = self.edge_list + if len(self.gene_expression_nodes) > 0: # is not None: + self.preprocessed_network = True + self.gene_expression_nodes.sort() + gene_expression_nodes = self.gene_expression_nodes + self.final_nodes = gene_expression_nodes + common_nodes = ef.intersection(self.network_nodes, self.gene_expression_nodes) + common_nodes.sort() + self.common_nodes = common_nodes + self.gex_nodes_to_add = list(set(self.gene_expression_nodes) - set(self.common_nodes)) + self.network_nodes_to_remove = list(set(self.network_nodes) - set(self.common_nodes)) + # filtering the edge_list: + self.edge_list = [edge for edge in self.original_edge_list if edge[0] in gene_expression_nodes and edge[1] in gene_expression_nodes] + else: + self.final_nodes = self.network_nodes + if self.verbose: + print(self.final_nodes) + self.tf_names_list = self.final_nodes + self.nodes = self.final_nodes + self.N = len(self.tf_names_list) + self.V = self.create_V_matrix() + if self.undirected_graph_bool: + self.directed=False + self.undirected_edge_list_to_matrix() + self.W_original = self.W + #self.edge_df = self.undirected_edge_list_updated().drop_duplicates() + else: + self.directed=True + self.W_original = self.directed_node2vec_similarity(self.edge_list, self.dimensions, + self.walk_length, self.num_walks, + self.p, self.q, self.workers, + self.window, self.min_count, self.batch_words) + self.W = self.generate_symmetric_weight_matrix() + self.W_df = pd.DataFrame(self.W, columns = self.nodes, index = self.nodes) + if self.view_network: + self.view_W_network = self.view_W_network() + else: + self.view_W_network = None + self.degree_vector = self.generate_degree_vector_from_weight_matrix() + self.D = self.generate_degree_matrix_from_weight_matrix() + # added on April 26, 2023 + degree_df = pd.DataFrame(self.final_nodes, columns = ["TF"]) + degree_df["degree_D"] = self.D * 
np.ones(self.N) + self.inv_sqrt_degree_df = degree_df ######## + self.edge_list_from_W = self.return_W_edge_list() + self.A = self.create_A_matrix() + self.A_df = pd.DataFrame(self.A, columns = self.nodes, index = self.nodes, dtype=np.float64) + self.param_lists = self.full_lists() + self.param_df = pd.DataFrame(self.full_lists(), columns = ["parameter", "data type", "description", "value", "class"]) + self.node_status_df = self.find_node_status_df() + self._apply_parameter_constraints() + + + def find_node_status_df(self): + """ Returns the node status """ + preprocessed_result = "No :(" + if self.preprocessed_network: + preprocessed_result = "Yes :)" + if self.preprocessed_network: + common_df = pd.DataFrame(self.common_nodes, columns = ["node"]) + common_df["preprocessed"] = preprocessed_result + common_df["status"] = "keep :)" + common_df["info"] = "Common Node (Network and Gene Expression)" + full_df = common_df + if len(self.gex_nodes_to_add) > 0: + gex_add_df = pd.DataFrame(self.gex_nodes_to_add, columns = ["node"]) + gex_add_df["preprocessed"] = preprocessed_result + gex_add_df["status"] = "keep :)" + gex_add_df["info"] = "Gene Expression Node Only" + full_df = pd.concat([common_df, gex_add_df]) + if len(self.network_nodes_to_remove) > 0: + net_remove_df = pd.DataFrame(self.network_nodes_to_remove, columns = ["node"]) + net_remove_df["preprocessed"] = preprocessed_result + net_remove_df["status"] = "remove :(" + net_remove_df["info"] = "Network Node Only" + full_df = pd.concat([full_df, net_remove_df]) + else: + full_df = pd.DataFrame(self.network_nodes, columns = ["node"]) + full_df["preprocessed"] = preprocessed_result + full_df["status"] = 'unknown :|' + full_df["info"] = "Original Network Node" + return full_df + + + def network_nodes_from_edge_list(self): + edge_list = self.edge_list + network_nodes = list({node for edge in edge_list for node in edge[:2]}) + network_nodes.sort() + return network_nodes + + + def _apply_parameter_constraints(self): + 
constraints = {**PriorGraphNetwork._parameter_constraints} + for key, value in self.__dict__.items(): + if key in constraints: + if isinstance(constraints[key], tuple): + if isinstance(constraints[key][0], type) and not isinstance(value, constraints[key][0]): + setattr(self, key, constraints[key][0]) + elif constraints[key][1] is not None and isinstance(constraints[key][1], type) and not isinstance(value, constraints[key][1]): + setattr(self, key, constraints[key][1]) + elif value not in constraints[key]: + setattr(self, key, constraints[key][0]) + return self + + + def create_V_matrix(self): + V = self.N * np.eye(self.N) - np.ones(self.N) + return V + + + + # Optimized functions + def preprocess_edge_list(self): + processed_edge_list = [] + default_edge_weight = self.default_edge_weight + + for sublst in self.edge_list: + if len(sublst) == 2: + processed_edge_list.append(sublst + [default_edge_weight]) + else: + processed_edge_list.append(sublst) + + return processed_edge_list + + def undirected_edge_list_to_matrix(self): + all_nodes = self.final_nodes + edge_list = self.preprocess_edge_list() + default_edge_weight = self.default_edge_weight + N = len(all_nodes) + self.N = N + weight_df = np.full((N, N), default_edge_weight) + + # Create a mapping from node to index + node_to_idx = {node: idx for idx, node in enumerate(all_nodes)} + + for edge in tqdm(edge_list) if self.verbose else edge_list: + try: + source, target, *weight = edge + weight = weight[0] if weight else default_edge_weight + weight = np.nan_to_num(weight, nan=default_edge_weight) + source_idx, target_idx = node_to_idx[source], node_to_idx[target] + weight_df[source_idx, target_idx] = weight + weight_df[target_idx, source_idx] = weight + except ValueError as e: + print(f"An error occurred: {e}") + continue + + np.fill_diagonal(weight_df, 0) + W = weight_df + np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + if not ef.check_symmetric(W): + print(":( W matrix is NOT symmetric") + + 
self.W = W + self.W_df = pd.DataFrame(W, columns=all_nodes, index=self.final_nodes, dtype=np.float64) + return self + + + def generate_symmetric_weight_matrix(self) -> np.ndarray: + """generate symmetric W matrix. W matrix (Symmetric --> W = W_Transpose). + Note: each diagonal element is the summation of other non-diagonal elements in the same row divided by (N-1) + 2023.02.14_Xiang. TODO: add parameter descriptions""" + W = self.W_original + np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (self.N - 1)) + symmetric_W = ef.check_symmetric(W) + if symmetric_W == False: + print(":( W matrix is NOT symmetric") + return None + return W + + + def return_W_edge_list(self): + wMat = ef.view_matrix_as_dataframe(self.W, column_names_list = self.tf_names_list, row_names_list = self.tf_names_list) + w_edgeList = wMat.stack().reset_index() + w_edgeList = w_edgeList[w_edgeList["level_0"] != w_edgeList["level_1"]] + w_edgeList = w_edgeList.rename(columns = {"level_0":"source", "level_1":"target", 0:"weight"}) + w_edgeList = w_edgeList.sort_values(by = ["weight"], ascending = False) + return w_edgeList + + + def view_W_network(self): + roundedW = np.round(self.W, decimals=4) + wMat = ef.view_matrix_as_dataframe(roundedW, column_names_list=self.nodes, row_names_list=self.nodes) + w_edgeList = wMat.stack().reset_index() + # https://stackoverflow.com/questions/48218455/how-to-create-an-edge-list-dataframe-from-a-adjacency-matrix-in-python + w_edgeList = w_edgeList[w_edgeList["level_0"] != w_edgeList["level_1"]] + w_edgeList = w_edgeList.rename(columns={"level_0": "source", "level_1": "target", 0: "weight"}) + w_edgeList = w_edgeList[w_edgeList["weight"] != 0] + + G = nx.from_pandas_edgelist(w_edgeList, source="source", target="target", edge_attr="weight") + pos = nx.spring_layout(G) + weights_list = [G.edges[e]['weight'] * self.edge_weight_scaling for e in G.edges] + fig, ax = plt.subplots() + if self.preprocessed_network and len(self.gex_nodes_to_add) > 0: + new_nodes = 
self.gex_nodes_to_add + print("new nodes:", new_nodes) + node_color_map = {node: self.added_node_color_name if node in new_nodes else self.node_color_name for node in G.nodes} + nx.draw(G, pos, node_color=node_color_map.values(), edge_color=self.edge_color_name, with_labels=True, width=weights_list, ax=ax) + else: + nx.draw(G, pos, node_color=self.node_color_name, edge_color=self.edge_color_name, with_labels=True, width=weights_list, ax=ax) + + labels = {e: G.edges[e]['weight'] for e in G.edges} + return nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, ax=ax) + + + def generate_degree_vector_from_weight_matrix(self) -> np.ndarray: + """generate d degree vector. 2023.02.14_Xiang TODO: add parameter descriptions + """ + if self.edge_values_for_degree == False: + W_bool = (self.W > self.threshold_for_degree) + d = np.float64(W_bool.sum(axis=0) - W_bool.diagonal()) + else: + if self.w_transform_for_d == "sqrt": #self.square_root_weights_for_degree: # taking the square root of the weights for the edges + W_to_use = np.sqrt(self.W) + elif self.w_transform_for_d == "square": # self.squaring_weights_for_degree: + W_to_use = self.W ** 2 + else: + W_to_use = self.W + d = W_to_use.diagonal() * (self.N - 1) # summing the edge weights + d += self.pseudocount_for_degree + if self.consider_self_loops: + d += 1 # we also add in a self-loop :) + # otherwise, we can just use this threshold for the degree + if self.verbose: + print(":) Please note: we are generating the prior network:") + if self.edge_values_for_degree: + print(":) Please note that we use the sum of the edge weight values to get the degree for a given node.") + else: + print(f":) Please note that we count the number of edges with weight > {self.threshold_for_degree} to get the degree for a given node.") + if self.consider_self_loops: + print(f":) Please note that since consider_self_loops = {self.consider_self_loops} we also add 1 to the degree for each node (as a self-loop).") + print(f":) We also add 
{self.pseudocount_for_degree} as a pseudocount to our degree value for each node.") + print() # + return d + + + def generate_degree_matrix_from_weight_matrix(self): # D matrix + """:) Please note that this function returns the D matrix as a diagonal matrix + where the entries are 1/sqrt(d). Here, d is a vector corresponding to the degree of each node""" + # the entries of D are much larger for singleton nodes: an unconnected node has a very small degree d, so 1/sqrt(d) is large + d = self.degree_vector + d_inv_sqrt = 1 / np.sqrt(d) + # D = np.diag(d_inv_sqrt) # full matrix D, only suitable for small scale. Use DiagonalLinearOperator instead. + D = ef.DiagonalLinearOperator(d_inv_sqrt) + return D + + + def create_A_matrix(self): # A matrix + """ Please note that this function by Saniya creates the A matrix, which is: + :) here: %*% refers to matrix multiplication + and * refers to element-wise multiplication (for 2 dataframes with same exact dimensions, + component-wise multiplication) + # Please note that this function by Saniya creates the A matrix, which is: + # (D_transpose) %*% (V*W) %*% (D) + """ + A = self.D @ (self.V * self.W) @ self.D + approxSame = ef.check_symmetric(A) # please see if A is symmetric + if approxSame: + return A + else: + print(f":( False. 
A is NOT a symmetric matrix.") + print(A) + return False + + + def full_lists(self): + # network arguments used: + # argument, description, our value + full_lists = [] + term_to_add_last = "PriorGraphNetwork" + row1 = ["default_edge_w", ">= 0", "edge weight for any edge with missing weight info", self.default_edge_weight, term_to_add_last] + row2 = ["self_loops", "boolean", "add 1 to the degree for each node (based on self-loops)?", self.consider_self_loops, term_to_add_last] + + full_lists.append(row1) + full_lists.append(row2) + if self.pseudocount_for_degree != 0: + row3 = ["d_pseudocount", ">= 0", + "to ensure that no nodes have 0 degree value in D matrix", + self.pseudocount_for_degree, term_to_add_last] + full_lists.append(row3) + if self.edge_values_for_degree: + row_to_add = ["edge_vals_for_d", "boolean", + "if True, we use the edge weight values to derive our degrees for matrix D", True, term_to_add_last] + full_lists.append(row_to_add)# arguments to add in: + if self.w_transform_for_d == "sqrt": # take the square root of the edge weights for the degree calculations + row_to_add = ["w_transform_for_d: sqrt", "string", + "for each edge, we use the square root of the edge weight values to derive our degrees for matrix D", self.w_transform_for_d, term_to_add_last] + full_lists.append(row_to_add) + if self.w_transform_for_d == "square": # square the edge weights for the degree calculations + row_to_add = ["w_transform_for_d: square", "string", + "for each edge, we square the edge weight values to derive our degrees for matrix D", self.w_transform_for_d, term_to_add_last] + full_lists.append(row_to_add) + else: # default if we use a threshold for the degree: + row_to_add = ["edge_vals_for_d", "boolean", + "if False, we use a threshold instead to derive our degrees for matrix D", False, term_to_add_last] + full_lists.append(row_to_add) + self.threshold_for_degree = 0.5 # edge weights > this threshold are counted as 1 for the degree + to_add_text = "edge weights 
> " + str(self.threshold_for_degree) + " are counted as 1 for the degree" + row_to_add = ["thresh_for_d", ">= 0", + to_add_text, self.threshold_for_degree, term_to_add_last] + full_lists.append(row_to_add) + return full_lists + + +def build_prior_network(edge_list, gene_expression_nodes = [], default_edge_weight = 0.1, + degree_threshold = 0.5, + degree_pseudocount = 1e-3, + view_network = True, + verbose = True): + edge_vals_for_d = False + self_loops = False + w_transform_for_d = "none" + prior_graph_dict = {"edge_list": edge_list, + "gene_expression_nodes":gene_expression_nodes, + "edge_values_for_degree": edge_vals_for_d, + "consider_self_loops":self_loops, + "pseudocount_for_degree":degree_pseudocount, + "default_edge_weight": default_edge_weight, + "w_transform_for_d":w_transform_for_d, + "threshold_for_degree": degree_threshold, + "view_network": view_network, + "verbose":verbose} + if verbose: + print("building prior network:") + print("prior graph network used") + netty = PriorGraphNetwork(**prior_graph_dict) # uses the network to get features like the A matrix. #################### + return netty + + +def directed_node2vec_similarity(edge_list: List[Tuple[int, int, float]], + dimensions: int = 64, + walk_length: int = 30, + num_walks: int = 200, + p: float = 1, q: float = 0.5, + workers: int = 4, window: int = 10, + min_count: int = 1, + batch_words: int = 4) -> np.ndarray: + print("directed_node2vec_similarity") + """ Given an edge list and node2vec parameters, returns a scaled similarity matrix for the node embeddings generated + by training a node2vec model on the directed graph defined by the edge list. + + Parameters: + ----------- + edge_list: List[List[int, int, float]] + A list of lists representing the edges of a directed graph. Each edge should be a list of three values: + [source_node, target_node, edge_weight]. If no edge weight is specified, it is assumed to be 1.0. 
+ + dimensions: int, optional (default=64) + The dimensionality of the node embeddings. + + walk_length: int, optional (default=30) + The length of each random walk during the node2vec training process. + + num_walks: int, optional (default=200) + The number of random walks to generate for each node during the node2vec training process. + + p: float, optional (default=1) + The return parameter for the node2vec algorithm. + + q: float, optional (default=0.5) + The in-out parameter for the node2vec algorithm. + + workers: int, optional (default=4) + The number of worker threads to use during the node2vec training process. + + window: int, optional (default=10) + The size of the window for the skip-gram model during training. + + min_count: int, optional (default=1) + The minimum count for a word in the training data to be included in the model. + + batch_words: int, optional (default=4) + The number of words in each batch during training. + + Returns: + -------- + results_dict: dict + A dictionary with 3 keys: 'similarity_matrix' (a DataFrame of scaled (0-1 range) cosine similarities for the node embeddings generated by training a node2vec model + on the directed graph defined by the edge list), 'similarity_df' (the same values in long format, without self-pairs), and 'NetREm_edgelist' (similarity_df as a list of lists). 
+ """ + print("Creating directed graph from edge list") + directed_graph = nx.DiGraph() + for edge in edge_list: + source, target = edge[:2] + weight = edge[2] if len(edge) == 3 else 1.0 + directed_graph.add_edge(source, target, weight=weight) + + # Extract unique node names from the graph + node_names = list(directed_graph.nodes) + + print("Initializing the Node2Vec model") + model = Node2Vec(directed_graph, dimensions=dimensions, walk_length=walk_length, + num_walks=num_walks, p=p, q=q, workers=workers) + + print("Training the model") + model = model.fit(window=window, min_count=min_count, batch_words=batch_words) + + print("Getting node embeddings") + node_embeddings = np.array([model.wv[node] for node in node_names]) + + print("Calculating cosine similarity matrix") + similarity_matrix = cosine_similarity(node_embeddings) + + print("Scaling similarity matrix to 0-1 range") + scaled_similarity_matrix = (similarity_matrix + 1) / 2 + + # Create a DataFrame with rows and columns labeled as node names + similarity_matrix = pd.DataFrame(scaled_similarity_matrix, index=node_names, columns=node_names) + print(f":) First 5 entries of the symmetric similarity matrix for {similarity_matrix.shape[0]} nodes.") + print(similarity_matrix.iloc[0:5, 0:5]) + + similarity_df = similarity_matrix.reset_index().melt(id_vars='index', var_name='TF2', value_name='cosine_similarity') + #similarity_df = similarity_df[similarity_df['index'] < similarity_df['TF2']] + similarity_df = similarity_df.rename(columns = {"index":"node_1", "TF2":"node_2"}) + similarity_df = similarity_df[similarity_df["node_1"] != similarity_df["node_2"]] + results_dict = {} + print("\n :) ######################################################## \n") + print(":) Please note that we return a dictionary with 3 keys based on Node2Vec and cosine similarity computations:") + print("1. 
similarity_matrix: the cosine similarity matrix for the nodes in the original directed graph") + results_dict["similarity_matrix"] = similarity_matrix + print("2. similarity_df: simplified dataframe of the cosine similarity values from the similarity_matrix.") + + results_dict["similarity_df"] = similarity_df + print("3. NetREm_edgelist: an edge_list that is based on similarity_df that is ready to be input for NetREm.") + + results_dict["NetREm_edgelist"] = similarity_df.values.tolist() + print(results_dict.keys()) + return results_dict \ No newline at end of file diff --git a/code/old_code/refresh/directed_to_undirected_network_example.ipynb b/code/old_code/refresh/directed_to_undirected_network_example.ipynb new file mode 100644 index 0000000..121c8ac --- /dev/null +++ b/code/old_code/refresh/directed_to_undirected_network_example.ipynb @@ -0,0 +1,951 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c912f6d3", + "metadata": {}, + "source": [ + "# Working with Weighted Directed Input Networks in NetREm\n", + "## Converting Weighted Directed Networks to Undirected Similarity Networks :)\n", + "### By: Saniya Khullar, Xiang Huang, Raghu Ramesh, John Svaren, Daifeng Wang\n", + "\n", + "The code for NetREm is optimized for undirected, weighted networks.\n", + "\n", + "\n", + "In this Jupyter notebook, we go through how to work with directed prior network graphs in NetREm (Network Regression Embeddings).\n", + "\n", + "There are many ways to do this. One way is the popular graph embedding method weighted node2vec, where we learn an embedding for each node based on random walks followed by a skip-gram model. Ideally, these embeddings capture as much information as possible about the original directed graph network. \n", + "\n", + "This is similar to the word2vec approach in Natural Language Processing (NLP) tasks.\n", + "Then, we will calculate the similarity among nodes based on the cosine similarity values of their embeddings. 
Ultimately, we will build out an undirected, weighted network among the predictors, which is a similarity network. \n", + "\n", + "This similarity network may be our input network for NetREm as it is undirected and weighted :)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "a7271420", + "metadata": {}, + "outputs": [], + "source": [ + "printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs))\n", + "rng_seed = 2023 # random seed for reproducibility\n", + "randSeed = 123\n", + "from packages_needed import *\n", + "import error_metrics as em \n", + "from packages_needed import *\n", + "import Netrem_model_builder as nm\n", + "import DemoDataBuilderXandY as demo\n", + "import PriorGraphNetwork as graph\n", + "import netrem_evaluation_functions as nm_eval\n", + "import essential_functions as ef" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "1d304b66", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[['TF1', 'TF2', 0.4],\n", + " ['TF2', 'TF3', 0.7],\n", + " ['TF3', 'TF4', 0.2],\n", + " ['TF3', 'TF5', 0.8],\n", + " ['TF4', 'TF6', 0.5],\n", + " ['TF5', 'TF7', 0.6],\n", + " ['TF6', 'TF8', 0.3],\n", + " ['TF7', 'TF9', 0.9],\n", + " ['TF8', 'TF10', 0.1],\n", + " ['TF9', 'TF1', 0.7]]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Create a small fake weighted directed network of 10 nodes \"TF1 to TF10\"\n", + "# Given by a DataFrame of [source, target, weight]\n", + "data = {\n", + " 'source': [\"TF1\", \"TF2\", \"TF3\", \"TF3\", \"TF4\", \n", + " \"TF5\", \"TF6\", \"TF7\", \"TF8\", \"TF9\"],\n", + " 'target': [\"TF2\", \"TF3\", \"TF4\", \"TF5\", \"TF6\",\n", + " \"TF7\", \"TF8\", \"TF9\", \"TF10\", \"TF1\"],\n", + " 'weight': [0.4, 0.7, 0.2, 0.8, 0.5, \n", + " 0.6, 0.3, 0.9, 0.1, 0.7]\n", + "}\n", + "\n", + "df = pd.DataFrame(data)\n", + "# Convert the DataFrame to a list of edge tuples\n", + "edge_list = [list(x) for x in 
df.to_records(index=False)]\n", + "edge_list" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "9dc946f1", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create a directed graph\n", + "G = nx.from_pandas_edgelist(df, 'source', 'target', ['weight'], \n", + " create_using=nx.DiGraph())\n", + "\n", + "# Generate layout\n", + "#pos = nx.spring_layout(G, seed=42)\n", + "pos = nx.circular_layout(G)\n", + "\n", + "# Explicitly create a new figure\n", + "fig, ax = plt.subplots()\n", + "\n", + "# Draw nodes, edges, and labels\n", + "nx.draw(G, pos, with_labels=True, node_color='lightgreen', node_size=700, font_size=14,\n", + " font_color='black', font_weight='bold', edge_color='red', ax=ax,\n", + " arrowstyle='->', arrowsize=25)\n", + "\n", + "# Draw edge weights\n", + "labels = nx.get_edge_attributes(G, 'weight')\n", + "nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, font_size=14)\n", + "\n", + "# Show plot\n", + "plt.title(\"Directed Network of TF1 to TF10\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "fbad0b53", + "metadata": {}, + "source": [ + " Given an edge list and node2vec parameters, the directed_node2vec_similarity function in the PriorGraphNetwork class (imported as *graph*) returns a scaled similarity matrix for the node embeddings generated by training a node2vec model on the directed graph defined by the edge list.\n", + "\n", + "**directed_node2vec_similarity**(edge_list: List[Tuple[int, int, float]],\n", + " dimensions: int = 64,\n", + " walk_length: int = 30,\n", + " num_walks: int = 200,\n", + " p: float = 1, q: float = 0.5,\n", + " workers: int = 4, window: int = 10,\n", + " min_count: int = 1,\n", + " batch_words: int = 4) -> np.ndarray:\n", + " \n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "86badb4b", + "metadata": {}, + "source": [ + "Parameters:\n", + "-----------\n", + " edge_list: List[List[int, int, float]]\n", + " A list of lists representing the edges of a directed graph. Each edge should be a list of three values:\n", + " [source_node, target_node, edge_weight]. 
If no edge weight is specified, it is assumed to be 1.0.\n", + "\n", + " dimensions: int, optional (default=64)\n", + " The dimensionality of the node embeddings.\n", + "\n", + " walk_length: int, optional (default=30)\n", + " The length of each random walk during the node2vec training process.\n", + "\n", + " num_walks: int, optional (default=200)\n", + " The number of random walks to generate for each node during the node2vec training process.\n", + "\n", + " p: float, optional (default=1)\n", + " The return parameter for the node2vec algorithm.\n", + "\n", + " q: float, optional (default=0.5)\n", + " The in-out parameter for the node2vec algorithm.\n", + "\n", + " workers: int, optional (default=4)\n", + " The number of worker threads to use during the node2vec training process.\n", + "\n", + " window: int, optional (default=10)\n", + " The size of the window for the skip-gram model during training.\n", + "\n", + " min_count: int, optional (default=1)\n", + " The minimum count for a word in the training data to be included in the model.\n", + "\n", + " batch_words: int, optional (default=4)\n", + " The number of words in each batch during training." + ] + }, + { + "cell_type": "markdown", + "id": "d90f5a0a", + "metadata": {}, + "source": [ + "We run this function below to retrieve the similarity matrix values for our nodes. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "f6448c8f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "directed_node2vec_similarity\n", + "Creating directed graph from edge list\n", + "Initializing the Node2Vec model\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "609bc0c933cf4164828857683b308cd4", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Computing transition probabilities: 0%| | 0/10 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
           TF1       TF2       TF3       TF4       TF5       TF6       TF7       TF8       TF9      TF10
TF1   1.000000  0.977565  0.982839  0.791959  0.977899  0.775640  0.967170  0.735392  0.948157  0.751666
TF2   0.977565  1.000000  0.979204  0.807541  0.981714  0.789227  0.958244  0.765684  0.974940  0.767854
TF3   0.982839  0.979204  1.000000  0.812772  0.982637  0.785401  0.984694  0.756441  0.976156  0.772716
TF4   0.791959  0.807541  0.812772  1.000000  0.729563  0.984206  0.759797  0.987939  0.761482  0.987777
TF5   0.977899  0.981714  0.982637  0.729563  1.000000  0.699416  0.982236  0.672345  0.983200  0.681467
TF6   0.775640  0.789227  0.785401  0.984206  0.699416  1.000000  0.726584  0.983813  0.727340  0.989362
TF7   0.967170  0.958244  0.984694  0.759797  0.982236  0.726584  1.000000  0.698695  0.978932  0.712766
TF8   0.735392  0.765684  0.756441  0.987939  0.672345  0.983813  0.698695  1.000000  0.714580  0.984900
TF9   0.948157  0.974940  0.976156  0.761482  0.983200  0.727340  0.978932  0.714580  1.000000  0.712718
TF10  0.751666  0.767854  0.772716  0.987777  0.681467  0.989362  0.712766  0.984900  0.712718  1.000000
\n", + "" + ], + "text/plain": [ + " TF1 TF2 TF3 TF4 TF5 TF6 TF7 \\\n", + "TF1 1.000000 0.977565 0.982839 0.791959 0.977899 0.775640 0.967170 \n", + "TF2 0.977565 1.000000 0.979204 0.807541 0.981714 0.789227 0.958244 \n", + "TF3 0.982839 0.979204 1.000000 0.812772 0.982637 0.785401 0.984694 \n", + "TF4 0.791959 0.807541 0.812772 1.000000 0.729563 0.984206 0.759797 \n", + "TF5 0.977899 0.981714 0.982637 0.729563 1.000000 0.699416 0.982236 \n", + "TF6 0.775640 0.789227 0.785401 0.984206 0.699416 1.000000 0.726584 \n", + "TF7 0.967170 0.958244 0.984694 0.759797 0.982236 0.726584 1.000000 \n", + "TF8 0.735392 0.765684 0.756441 0.987939 0.672345 0.983813 0.698695 \n", + "TF9 0.948157 0.974940 0.976156 0.761482 0.983200 0.727340 0.978932 \n", + "TF10 0.751666 0.767854 0.772716 0.987777 0.681467 0.989362 0.712766 \n", + "\n", + " TF8 TF9 TF10 \n", + "TF1 0.735392 0.948157 0.751666 \n", + "TF2 0.765684 0.974940 0.767854 \n", + "TF3 0.756441 0.976156 0.772716 \n", + "TF4 0.987939 0.761482 0.987777 \n", + "TF5 0.672345 0.983200 0.681467 \n", + "TF6 0.983813 0.727340 0.989362 \n", + "TF7 0.698695 0.978932 0.712766 \n", + "TF8 1.000000 0.714580 0.984900 \n", + "TF9 0.714580 1.000000 0.712718 \n", + "TF10 0.984900 0.712718 1.000000 " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results_dict[\"similarity_matrix\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "240fbff8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
    node_1  node_2  cosine_similarity
1      TF2     TF1           0.977565
2      TF3     TF1           0.982839
3      TF4     TF1           0.791959
4      TF5     TF1           0.977899
5      TF6     TF1           0.775640
..     ...     ...                ...
94     TF5    TF10           0.681467
95     TF6    TF10           0.989362
96     TF7    TF10           0.712766
97     TF8    TF10           0.984900
98     TF9    TF10           0.712718
\n", + "

90 rows × 3 columns

\n", + "
" + ], + "text/plain": [ + " node_1 node_2 cosine_similarity\n", + "1 TF2 TF1 0.977565\n", + "2 TF3 TF1 0.982839\n", + "3 TF4 TF1 0.791959\n", + "4 TF5 TF1 0.977899\n", + "5 TF6 TF1 0.775640\n", + ".. ... ... ...\n", + "94 TF5 TF10 0.681467\n", + "95 TF6 TF10 0.989362\n", + "96 TF7 TF10 0.712766\n", + "97 TF8 TF10 0.984900\n", + "98 TF9 TF10 0.712718\n", + "\n", + "[90 rows x 3 columns]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "similarity_df = results_dict[\"similarity_df\"]\n", + "similarity_df" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "cbe9d663", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[['TF2', 'TF1', 0.9775652289390564],\n", + " ['TF3', 'TF1', 0.9828392863273621],\n", + " ['TF4', 'TF1', 0.791959285736084],\n", + " ['TF5', 'TF1', 0.9778987169265747],\n", + " ['TF6', 'TF1', 0.7756398320198059],\n", + " ['TF7', 'TF1', 0.9671704173088074],\n", + " ['TF8', 'TF1', 0.7353924512863159],\n", + " ['TF9', 'TF1', 0.9481567144393921],\n", + " ['TF10', 'TF1', 0.7516655921936035],\n", + " ['TF1', 'TF2', 0.9775652289390564],\n", + " ['TF3', 'TF2', 0.9792038202285767],\n", + " ['TF4', 'TF2', 0.807540774345398],\n", + " ['TF5', 'TF2', 0.9817137718200684],\n", + " ['TF6', 'TF2', 0.7892271280288696],\n", + " ['TF7', 'TF2', 0.9582439661026001],\n", + " ['TF8', 'TF2', 0.7656841278076172],\n", + " ['TF9', 'TF2', 0.9749403595924377],\n", + " ['TF10', 'TF2', 0.7678544521331787],\n", + " ['TF1', 'TF3', 0.9828392863273621],\n", + " ['TF2', 'TF3', 0.9792038202285767],\n", + " ['TF4', 'TF3', 0.8127720355987549],\n", + " ['TF5', 'TF3', 0.9826374053955078],\n", + " ['TF6', 'TF3', 0.7854012846946716],\n", + " ['TF7', 'TF3', 0.9846940636634827],\n", + " ['TF8', 'TF3', 0.7564405202865601],\n", + " ['TF9', 'TF3', 0.9761558771133423],\n", + " ['TF10', 'TF3', 0.7727159261703491],\n", + " ['TF1', 'TF4', 0.791959285736084],\n", + " ['TF2', 'TF4', 
0.807540774345398],\n", + " ['TF3', 'TF4', 0.8127720355987549],\n", + " ['TF5', 'TF4', 0.7295628190040588],\n", + " ['TF6', 'TF4', 0.9842056632041931],\n", + " ['TF7', 'TF4', 0.7597967982292175],\n", + " ['TF8', 'TF4', 0.9879385828971863],\n", + " ['TF9', 'TF4', 0.761481523513794],\n", + " ['TF10', 'TF4', 0.987776517868042],\n", + " ['TF1', 'TF5', 0.9778987169265747],\n", + " ['TF2', 'TF5', 0.9817137718200684],\n", + " ['TF3', 'TF5', 0.9826374053955078],\n", + " ['TF4', 'TF5', 0.7295628190040588],\n", + " ['TF6', 'TF5', 0.6994156837463379],\n", + " ['TF7', 'TF5', 0.9822356104850769],\n", + " ['TF8', 'TF5', 0.6723446846008301],\n", + " ['TF9', 'TF5', 0.983199954032898],\n", + " ['TF10', 'TF5', 0.6814671158790588],\n", + " ['TF1', 'TF6', 0.7756398320198059],\n", + " ['TF2', 'TF6', 0.7892271280288696],\n", + " ['TF3', 'TF6', 0.7854012846946716],\n", + " ['TF4', 'TF6', 0.9842056632041931],\n", + " ['TF5', 'TF6', 0.6994156837463379],\n", + " ['TF7', 'TF6', 0.7265838980674744],\n", + " ['TF8', 'TF6', 0.9838127493858337],\n", + " ['TF9', 'TF6', 0.7273401021957397],\n", + " ['TF10', 'TF6', 0.989362359046936],\n", + " ['TF1', 'TF7', 0.9671704173088074],\n", + " ['TF2', 'TF7', 0.9582439661026001],\n", + " ['TF3', 'TF7', 0.9846940636634827],\n", + " ['TF4', 'TF7', 0.7597967982292175],\n", + " ['TF5', 'TF7', 0.9822356104850769],\n", + " ['TF6', 'TF7', 0.7265838980674744],\n", + " ['TF8', 'TF7', 0.6986950635910034],\n", + " ['TF9', 'TF7', 0.9789323806762695],\n", + " ['TF10', 'TF7', 0.7127657532691956],\n", + " ['TF1', 'TF8', 0.7353924512863159],\n", + " ['TF2', 'TF8', 0.7656841278076172],\n", + " ['TF3', 'TF8', 0.7564405202865601],\n", + " ['TF4', 'TF8', 0.9879385828971863],\n", + " ['TF5', 'TF8', 0.6723446846008301],\n", + " ['TF6', 'TF8', 0.9838127493858337],\n", + " ['TF7', 'TF8', 0.6986950635910034],\n", + " ['TF9', 'TF8', 0.7145799398422241],\n", + " ['TF10', 'TF8', 0.9848997592926025],\n", + " ['TF1', 'TF9', 0.9481567144393921],\n", + " ['TF2', 'TF9', 
0.9749403595924377],\n", + " ['TF3', 'TF9', 0.9761558771133423],\n", + " ['TF4', 'TF9', 0.761481523513794],\n", + " ['TF5', 'TF9', 0.983199954032898],\n", + " ['TF6', 'TF9', 0.7273401021957397],\n", + " ['TF7', 'TF9', 0.9789323806762695],\n", + " ['TF8', 'TF9', 0.7145799398422241],\n", + " ['TF10', 'TF9', 0.7127178311347961],\n", + " ['TF1', 'TF10', 0.7516655921936035],\n", + " ['TF2', 'TF10', 0.7678544521331787],\n", + " ['TF3', 'TF10', 0.7727159261703491],\n", + " ['TF4', 'TF10', 0.987776517868042],\n", + " ['TF5', 'TF10', 0.6814671158790588],\n", + " ['TF6', 'TF10', 0.989362359046936],\n", + " ['TF7', 'TF10', 0.7127657532691956],\n", + " ['TF8', 'TF10', 0.9848997592926025],\n", + " ['TF9', 'TF10', 0.7127178311347961]]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "undirected_edge_list = results_dict[\"NetREm_edgelist\"]\n", + "undirected_edge_list" + ] + }, + { + "cell_type": "markdown", + "id": "bad6c517", + "metadata": {}, + "source": [ + "We can visualize this updated undirected network of cosine similarities, which we can input to NetREm. :)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "54c46f82", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create an undirected graph\n", + "G = nx.Graph()\n", + "\n", + "# Add edges to the graph\n", + "for _, row in similarity_df.iterrows():\n", + " G.add_edge(row['node_1'], row['node_2'], weight=row['cosine_similarity'])\n", + "\n", + "# Generate layout\n", + "pos = nx.circular_layout(G)\n", + "# Get edge weights and normalize for better visualization\n", + "edge_weights = [G[u][v]['weight'] for u, v in G.edges()]\n", + "max_weight = max(edge_weights)\n", + "normalized_weights = [5 * (weight / max_weight) \n", + " for weight in edge_weights] # multiplied by 5 for better visibility\n", + "\n", + "# Explicitly create a new figure and axes\n", + "fig, ax = plt.subplots()\n", + "\n", + "# Draw nodes, edges, and labels\n", + "nx.draw(G, pos, with_labels=True, node_color='yellow',\n", + " node_size=700, font_size=18,\n", + " font_color='black', font_weight='bold', \n", + " edge_color='red', width=normalized_weights, ax=ax)\n", + "\n", + "# Draw edge weights\n", + "labels = nx.get_edge_attributes(G, 'weight')\n", + "for label in labels:\n", + " labels[label] = round(labels[label], 2)\n", + "nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, \n", + " font_size=6, ax=ax)\n", + "\n", + "# Show plot\n", + "plt.title(\"Undirected Weighted Network of Cosine Similarity Values\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "d7c46371", + "metadata": {}, + "source": [ + "We can compare the original directed network weights (*directed_weight* values) with the updated cosine similarity undirected values (*cosine_similarity*) as we do below:" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "a0d62a2f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " node_1 node_2 cosine_similarity directed_weight\n", + "78 TF7 TF9 0.978932 0.9\n", + "38 TF3 TF5 0.982637 0.8\n", + "7 TF9 TF1 0.948157 0.7\n", + "19 TF2 TF3 0.979204 0.7\n", + "58 TF5 TF7 0.982236 0.6\n", + ".. ... ... ... ...\n", + "31 TF6 TF4 0.984206 0.0\n", + "30 TF5 TF4 0.729563 0.0\n", + "28 TF2 TF4 0.807541 0.0\n", + "27 TF1 TF4 0.791959 0.0\n", + "89 TF9 TF10 0.712718 0.0\n", + "\n", + "[90 rows x 4 columns]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "directed_wieghts_df = df\n", + "directed_wieghts_df = directed_wieghts_df.rename(columns = {\"source\":\"node_1\", \"target\":\"node_2\",\n", + " \"weight\":\"directed_weight\"})\n", + "directed_wieghts_df\n", + "comparison_df = pd.merge(similarity_df, directed_wieghts_df, how = \"left\", on = [\"node_1\", \"node_2\"])\n", + "comparison_df = comparison_df.fillna(0)\n", + "comparison_df = comparison_df.sort_values(by = [\"directed_weight\"], ascending = False)\n", + "comparison_df" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/code/old_code/refresh/error_metrics.py b/code/old_code/refresh/error_metrics.py new file mode 100644 index 0000000..d4ca876 --- /dev/null +++ b/code/old_code/refresh/error_metrics.py @@ -0,0 +1,474 @@ +# Error_Metrics.py :) +import pandas as pd +import numpy as np +import random +import copy +from tqdm import tqdm +import os +import sys # 
https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. +import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge +from numpy.typing import ArrayLike +# from skopt import gp_minimize, space +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +import matplotlib.pyplot as plt +from numpy.typing import ArrayLike +from scipy.sparse.linalg import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 + +def calculate_mean_square_error(actual_values, predicted_values): + # Please note that this function by Saniya calculates the Mean Square Error (MSE) + difference = (actual_values - predicted_values) + squared_diff = difference ** 2 # square of the difference + mean_squared_diff = np.mean(squared_diff) + return mean_squared_diff + + +def mse(REF: np.ndarray, X: np.ndarray, axis: Optional[int] = None) -> np.float64: + """Compute mean square error between an array and a reference array. + If REF or X is complex, compute mse(REF.real, X.real) + 1j * mse(REF.imag, X.imag) + + Parameters + ---------- + REF: + ground truth, or reference array, e.g. shape=(n_sample, n_target) for machine learning + X: + result array to compare with reference, e.g. 
shape=(n_sample, n_target) for machine learning + axis: + Axis along which the comparison is computed. Defaults to None to compute the comparison + of the flattened array. + + Returns + ------- + mse_: + mean square error + + Examples + ------- + mse(REF, X, axis=0) computes the comparison along the n_sample dimension for machine learning + regression application where shape=(n_sample, n_target) + """ + + if (not np.iscomplexobj(REF)) and (not np.iscomplexobj(X)): + return ((X - REF)**2).mean(axis=axis) + else: + return mse(REF.real, X.real, axis) + 1j * mse(REF.imag, X.imag, axis) + + +def nmse(REF: np.ndarray, X: np.ndarray, axis: Optional[int] = None) -> np.float64: + """Compute normalized mean square error between an array and a reference array. + If REF or X is complex, compute nmse(REF.real, X.real) + 1j * nmse(REF.imag, X.imag) + + Parameters + ---------- + REF: + ground truth, or reference array, e.g. shape=(n_sample, n_target) for machine learning + X: + result array to compare with reference, e.g. shape=(n_sample, n_target) for machine learning + axis: + Axis along which the comparison is computed. Defaults to None to compute the comparison + of the flattened array. + + Returns + ------- + nmse_: + normalized mean square error + + Examples + ------- + nmse(REF, X, axis=0) computes the comparison along the n_sample dimension for machine learning + regression application where shape=(n_sample, n_target) + """ + if (not np.iscomplexobj(REF)) and (not np.iscomplexobj(X)): + return ((X - REF)**2).mean(axis=axis) / (REF**2).mean(axis=axis) + else: + return nmse(REF.real, X.real, axis) + 1j * nmse(REF.imag, X.imag, axis) + + +def snr(REF: np.ndarray, X: np.ndarray, axis: Optional[int] = None) -> np.float64: + """Compare an array with a reference array: compute the signal-to-noise ratio in dB. + If REF or X is complex, compute snr(REF.real, X.real) + 1j * snr(REF.imag, X.imag) + + Parameters + ---------- + REF: + ground truth, or reference array, e.g. 
shape=(n_sample, n_target) for machine learning + X: + result array to compare with reference, e.g. shape=(n_sample, n_target) for machine learning + axis: + Axis along which the comparison is computed. The default is to compute the comparison + of the flattened array. + + Returns + ------- + snr_: + signal-to-noise ratio in dB + + Examples + ------- + snr(REF, X, axis=0) computes the comparison along the n_sample dimension for machine learning + regression application where shape=(n_sample, n_target) + """ + if (not np.iscomplexobj(REF)) and (not np.iscomplexobj(X)): + return 10 * np.log10((REF**2).mean(axis=axis) / ((X - REF)**2).mean(axis=axis)) + else: + return snr(REF.real, X.real, axis) + 1j * snr(REF.imag, X.imag, axis) + + +def psnr(REF: np.ndarray, X: np.ndarray, axis: Optional[int] = None, max_: Optional[np.float64] = None) -> np.float64: + """Compute the peak signal-to-noise ratio in dB. Same arguments as snr, plus max_: the peak signal value (defaults to REF.max()). + """ + if (not np.iscomplexobj(REF)) and (not np.iscomplexobj(X)): + if max_ is None: + max_ = REF.max() + return 10 * np.log10(max_**2 / ((X - REF)**2).mean(axis=axis)) # peak defaults to REF.max(); pass max_ (e.g. 255) to override + else: + return psnr(REF.real, X.real, axis, max_) + 1j * psnr(REF.imag, X.imag, axis, max_) + + +def nmse_custom_score(y_true, y_pred): + """ + Calculates the negative normalized mean squared error (NMSE) between the true and predicted values. + """ + import numpy as np + if isinstance(y_true, pd.DataFrame): + y_true = y_true.values.flatten() + if isinstance(y_pred, pd.DataFrame): + y_pred = y_pred.values.flatten() + if not any(y_pred): # if all predicted values are 0 + return -np.inf # return the lowest possible score + nmseVal = nmse(y_true, y_pred) + return -nmseVal + + +def mse_custom_score(y_true, y_pred): + """ + Calculates the negative mean squared error (MSE) between the true and predicted values. 
+ default: greater_is_better, so we set negative mseVal to find the smallest mse + """ + import numpy as np + if isinstance(y_true, pd.DataFrame): + y_true = y_true.values.flatten() + if isinstance(y_pred, pd.DataFrame): + y_pred = y_pred.values.flatten() + if not any(y_pred): # if all predicted coefficients are 0 + return -np.inf # return a high negative score + mseVal = mse(y_true, y_pred) + return -mseVal + + +def snr_custom_score(y_true, y_pred): + """ + Higher the SNR the better + """ + if isinstance(y_true, pd.DataFrame): + y_true = y_true.values.flatten() + if isinstance(y_pred, pd.DataFrame): + y_pred = y_pred.values.flatten() + if not any(y_pred): # if all predicted coefficients are 0 + return -np.inf # return a high negative score + snrVal = snr(y_true, y_pred) + return snrVal + + +def psnr_custom_score(y_true, y_pred): + """ + Higher the psnr, the better + """ + if isinstance(y_true, pd.DataFrame): + y_true = y_true.values.flatten() + if isinstance(y_pred, pd.DataFrame): + y_pred = y_pred.values.flatten() + if not any(y_pred): # if all predicted coefficients are 0 + return -np.inf # return a high negative score + psnrVal = psnr(y_true, y_pred) + return psnrVal + +# Create a custom scorer object using make_scorer +mse_custom_scorer = make_scorer(mse_custom_score) +nmse_custom_scorer = make_scorer(nmse_custom_score) +snr_custom_scorer = make_scorer(snr_custom_score) +psnr_custom_scorer = make_scorer(psnr_custom_score) + + +def generate_model_metrics_for_baselines_df(X_train, y_train, X_test, y_test, model_name = "ElasticNetCV", y_intercept = False, tf_name = "SOX10"): + from sklearn.linear_model import ElasticNetCV, LinearRegression, LassoCV, RidgeCV + print(f"{model_name} results :) for fitting y_intercept = {y_intercept}") + if model_name == "ElasticNetCV": + regr = ElasticNetCV(cv=5, random_state=0, fit_intercept = y_intercept) + elif model_name == "LinearRegression": + regr = LinearRegression(fit_intercept = y_intercept) + elif model_name == "LassoCV": 
+ regr = LassoCV(cv=5, fit_intercept = y_intercept) + elif model_name == "RidgeCV": + regr = RidgeCV(cv=5, fit_intercept = y_intercept) + regr.fit(X_train, y_train) + if model_name in ["RidgeCV", "LinearRegression"]: + model_df = pd.DataFrame(regr.coef_) + else: + model_df = pd.DataFrame(regr.coef_).transpose() + model_df.columns = X_train.columns.tolist() + selected_row = model_df.iloc[0] + selected_cols = selected_row[selected_row != 0].index # Filter out the columns with value 0 + model_df = model_df[selected_cols] + df = model_df.replace("None", np.nan).apply(pd.to_numeric, errors='coerce') + sorted_series = df.abs().squeeze().sort_values(ascending=False) + # convert the sorted series back to a DataFrame + sorted_df = pd.DataFrame(sorted_series) + # add a column for the rank + sorted_df['Rank'] = range(1, len(sorted_df) + 1) + sorted_df['TF'] = sorted_df.index + sorted_df = sorted_df.rename(columns = {0:"AbsoluteVal_coefficient"}) + tfs = sorted_df["TF"].tolist() + if tf_name not in tfs: + sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + sorted_df.columns = ["Rank", "TF"] + sorted_df["Info"] = model_name + if y_intercept: + sorted_df["y_intercept"] = "True :)" + else: + sorted_df["y_intercept"] = "False :(" + sorted_df["num_TFs"] = model_df.shape[1] + predY_train = regr.predict(X_train) + predY_test = regr.predict(X_test) + train_mse = mse(y_train.values.flatten(), predY_train) + test_mse = mse(y_test.values.flatten(), predY_test) + train_nmse = nmse(y_train.values.flatten(), predY_train) + test_nmse = nmse(y_test.values.flatten(), predY_test) + sorted_df["train_mse"] = train_mse + sorted_df["test_mse"] = test_mse + sorted_df["train_nmse"] = train_nmse + sorted_df["test_nmse"] = test_nmse + predY_train = regr.predict(X_train) + predY_test = regr.predict(X_test) + sorted_df["train_nmse"] = nmse(y_train.values.flatten(), predY_train) + sorted_df["test_nmse"] = nmse(y_test.values.flatten(), predY_test) + sorted_df["train_snr"] = 
snr(y_train.values.flatten(), predY_train) + sorted_df["test_snr"] = snr(y_test.values.flatten(), predY_test) + sorted_df["train_psnr"] = psnr(y_train.values.flatten(), predY_train) + sorted_df["test_psnr"] = psnr(y_test.values.flatten(), predY_test) + return sorted_df + + +def generate_model_metrics_for_netrem_model_object(netrem_model, y_intercept_fit, X_train, y_train, X_test, y_test, filtered_results = False, tf_name = "SOX10", focus_gene = "y"): + if netrem_model.model_nonzero_coef_df.shape[1] == 1: + sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + sorted_df.columns = ["Rank", "TF"] + tf_netrem_found = False + if netrem_model.model_type == "LassoCV": + sorted_df["Info"] = "NetREm (b = " + str(netrem_model.beta_network) + "; LassoCV)" #+ netrem_info# + str(netrem_model.optimal_alpha) + ")" + else: + sorted_df["Info"] = "NetREm (b = " + str(netrem_model.beta_network) + "; a = " + str(netrem_model.alpha_lasso) + ")"# : " + netrem_info# + str(netrem_model.optimal_alpha) + ")" + + if y_intercept_fit: + sorted_df["y_intercept"] = "True :)" + else: + sorted_df["y_intercept"] = "False :(" + sorted_df["num_TFs"] = 0 + else: + sorted_df = netrem_model.sorted_coef_df[netrem_model.sorted_coef_df["TF"] == tf_name] + tfs = sorted_df["TF"].tolist() + tf_netrem_found = True + if tf_name not in tfs: + sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + sorted_df.columns = ["Rank", "TF"] + tf_netrem_found = False + sorted_df["Info"] = "NetREm (b = " + str(netrem_model.beta_network) + "; LassoCV)"# + str(netrem_model.optimal_alpha) + ")" + sorted_df["num_TFs"] = netrem_model.model_nonzero_coef_df.drop(columns = ["y_intercept"]).shape[1] + predY_train = netrem_model.predict(X_train) + predY_test = netrem_model.predict(X_test) + sorted_df["train_mse"] = mse(y_train.values.flatten(), predY_train) + sorted_df["test_mse"] = mse(y_test.values.flatten(), predY_test) + sorted_df["train_nmse"] = nmse(y_train.values.flatten(), predY_train) + sorted_df["test_nmse"] = 
nmse(y_test.values.flatten(), predY_test) + sorted_df["train_snr"] = snr(y_train.values.flatten(), predY_train) + sorted_df["test_snr"] = snr(y_test.values.flatten(), predY_test) + sorted_df["train_psnr"] = psnr(y_train.values.flatten(), predY_train) + sorted_df["test_psnr"] = psnr(y_test.values.flatten(), predY_test) + sorted_df_netrem = sorted_df + netrem_dict = {"sorted_df_netrem":sorted_df_netrem, "tf_netrem_found":tf_netrem_found} + return netrem_dict + + +def metrics_for_netrem_models_versus_other_models(netrem_with_intercept, netrem_no_intercept, X_train, y_train, X_test, y_test, filtered_results = False, tf_name = "SOX10", target_gene = "y"): + """ :) This is similar to function metrics_for_netrem_versus_other_models() except it focuses on 2 types of NetREm models: + 1. with y-intercept fitted + 2. with no y-intercept fitted + :) Please note: + MSE (Mean Squared Error) and NMSE (Normalized Mean Squared Error) both measure the average squared difference between the predicted and actual values (NMSE divides it by the mean squared actual value), where lower values indicate better performance. + + SNR (Signal-to-Noise Ratio) compares the mean signal power to the power of the error, and PSNR (Peak Signal-to-Noise Ratio) compares the peak (maximum) signal power to the power of the error, both in dB, where higher values indicate better performance. + + However, the specific metrics that are most relevant to a particular machine learning problem can vary depending on the application and the specific goals of the model. So, it's important to consider the context and objectives of each project when selecting evaluation metrics. 
+ """ + focus_gene = target_gene + netrem_intercept_bool = True + netrem_no_intercept_bool = True + if netrem_with_intercept is None: + netrem_intercept_bool = False + tf_netrem_found_with_intercept = False + if netrem_no_intercept is None: + netrem_no_intercept_bool = False + tf_netrem_found_no_intercept = False + + if netrem_with_intercept: + netrem_with_intercept_sorted_dict = generate_model_metrics_for_netrem_model_object(netrem_with_intercept, True, X_train, y_train, X_test, y_test, filtered_results, tf_name, focus_gene) + netrem_with_intercept_sorted_df = netrem_with_intercept_sorted_dict["sorted_df_netrem"] + netrem_with_intercept_sorted_df["y_intercept"] = "True :)" + tf_netrem_found_with_intercept = netrem_with_intercept_sorted_dict["tf_netrem_found"] + + if netrem_no_intercept_bool: + netrem_no_intercept_sorted_dict = generate_model_metrics_for_netrem_model_object(netrem_no_intercept, False, X_train, y_train, X_test, y_test, filtered_results, tf_name, focus_gene) + netrem_no_intercept_sorted_df = netrem_no_intercept_sorted_dict["sorted_df_netrem"] + netrem_no_intercept_sorted_df["y_intercept"] = "False :(" + tf_netrem_found_no_intercept = netrem_no_intercept_sorted_dict["tf_netrem_found"] + + + sorted_df_elasticcv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "ElasticNetCV", y_intercept = False, tf_name = tf_name) + sorted_df_lassocv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LassoCV", y_intercept = False, tf_name = tf_name) + sorted_df_ridgecv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "RidgeCV", y_intercept = False, tf_name = tf_name) + sorted_df_linear = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LinearRegression", 
y_intercept = False, tf_name = tf_name) + sorted_df_elasticcv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "ElasticNetCV", y_intercept = True, tf_name = tf_name) + sorted_df_lassocv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LassoCV", y_intercept = True, tf_name = tf_name) + sorted_df_ridgecv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "RidgeCV", y_intercept = True, tf_name = tf_name) + sorted_df_linear2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LinearRegression", y_intercept = True, tf_name = tf_name) + + if netrem_no_intercept_bool: + sorty_combo = pd.concat([netrem_no_intercept_sorted_df, sorted_df_elasticcv, sorted_df_ridgecv, sorted_df_lassocv, sorted_df_linear]) + else: + sorty_combo = pd.concat([sorted_df_elasticcv, sorted_df_ridgecv, sorted_df_lassocv, sorted_df_linear]) + if netrem_intercept_bool: + sorty_combo = pd.concat([sorty_combo, netrem_with_intercept_sorted_df, sorted_df_elasticcv2, sorted_df_ridgecv2, sorted_df_lassocv2, sorted_df_linear2]) + else: + sorty_combo = pd.concat([sorty_combo, sorted_df_elasticcv2, sorted_df_ridgecv2, sorted_df_lassocv2, sorted_df_linear2]) + sorty_combo = sorty_combo[sorty_combo["TF"] == tf_name] + sorty_combo["TG"] = focus_gene + sorty_combo = sorty_combo.reset_index().drop(columns = ["index"]) + if 'AbsoluteVal_coefficient' not in sorty_combo.columns.tolist(): + sorty_combo['AbsoluteVal_coefficient'] = pd.Series([float('nan')]*len(sorty_combo)) + + sorty_combo = sorty_combo[['AbsoluteVal_coefficient', 'Rank', 'TF', 'Info', 'y_intercept', 'num_TFs', 'TG', 'train_mse', + 'test_mse', 'train_nmse', 'test_nmse', 'train_snr', 'test_snr', + 'train_psnr', 'test_psnr']] + aaa = sorty_combo + 
aaa['rank_mse_train'] = aaa['train_mse'].rank(ascending=True).astype(int) + aaa['rank_mse_test'] = aaa['test_mse'].rank(ascending=True).astype(int) + aaa['rank_nmse_train'] = aaa['train_nmse'].rank(ascending=True).astype(int) + aaa['rank_nmse_test'] = aaa['test_nmse'].rank(ascending=True).astype(int) + + aaa['rank_snr_train'] = aaa['train_snr'].rank(ascending=False).astype(int) + aaa['rank_snr_test'] = aaa['test_snr'].rank(ascending=False).astype(int) + aaa['rank_psnr_train'] = aaa['train_psnr'].rank(ascending=False).astype(int) + aaa['rank_psnr_test'] = aaa['test_psnr'].rank(ascending=False).astype(int) + aaa["total_metrics_rank"] = aaa['rank_mse_train'] + aaa['rank_mse_test'] + aaa['rank_nmse_train'] + aaa['rank_nmse_test'] + aaa["total_metrics_rank"] += aaa['rank_snr_train'] + aaa['rank_snr_test'] + aaa['rank_psnr_train'] + aaa['rank_psnr_test'] + sorty_combo = aaa + + reduced_results_df = sorty_combo[sorty_combo["Rank"] != "N/A"] + reduced_results_df = reduced_results_df.sort_values(by = ["Rank"]) + + + if tf_netrem_found_with_intercept: + print(netrem_with_intercept.final_corr_vs_coef_df[["info"] + [tf_name]]) + elif tf_netrem_found_no_intercept: + print(netrem_no_intercept.final_corr_vs_coef_df[["info"] + [tf_name]]) + if filtered_results: + return reduced_results_df + else: + return sorty_combo + + +def metrics_for_netrem_versus_other_models(netrem_model, X_train, y_train, X_test, y_test, filtered_results = False, tf_name = "SOX10", target_gene = "y"): + """ :) Please note: + MSE (Mean Squared Error) and NMSE (Normalized Mean Squared Error) both measure the average squared difference between the predicted and actual values (NMSE divides it by the mean squared actual value), where lower values indicate better performance. 
+ + SNR (Signal-to-Noise Ratio) compares the mean signal power to the power of the error, and PSNR (Peak Signal-to-Noise Ratio) compares the peak (maximum) signal power to the power of the error, both in dB, where higher values indicate better performance. 
+ + However, the specific metrics that are most relevant to a particular machine learning problem can vary depending on the application and the specific goals of the model. So, it's important to consider the context and objectives of each project when selecting evaluation metrics. + """ + focus_gene = target_gene + if netrem_model.model_nonzero_coef_df.shape[1] == 1: + sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + sorted_df.columns = ["Rank", "TF"] + tf_netrem_found = False + sorted_df["Info"] = "NetREm (b = " + str(netrem_model.beta_network) + "; LassoCV)"# + str(netrem_model.optimal_alpha) + ")" + sorted_df["y_intercept"] = "False :(" + sorted_df["num_TFs"] = 0 + else: + sorted_df = netrem_model.sorted_coef_df[netrem_model.sorted_coef_df["TF"] == tf_name] + tfs = sorted_df["TF"].tolist() + tf_netrem_found = True + if tf_name not in tfs: + sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + sorted_df.columns = ["Rank", "TF"] + tf_netrem_found = False + sorted_df["Info"] = "NetREm (b = " + str(netrem_model.beta_network) + "; LassoCV)"# + str(netrem_model.optimal_alpha) + ")" + sorted_df["y_intercept"] = "False :(" + sorted_df["num_TFs"] = netrem_model.model_nonzero_coef_df.drop(columns = ["y_intercept"]).shape[1] + predY_train = netrem_model.predict(X_train) + predY_test = netrem_model.predict(X_test) + sorted_df["train_mse"] = mse(y_train.values.flatten(), predY_train) + sorted_df["test_mse"] = mse(y_test.values.flatten(), predY_test) + sorted_df["train_nmse"] = nmse(y_train.values.flatten(), predY_train) + sorted_df["test_nmse"] = nmse(y_test.values.flatten(), predY_test) + sorted_df["train_snr"] = snr(y_train.values.flatten(), predY_train) + sorted_df["test_snr"] = snr(y_test.values.flatten(), predY_test) + sorted_df["train_psnr"] = psnr(y_train.values.flatten(), predY_train) + sorted_df["test_psnr"] = psnr(y_test.values.flatten(), predY_test) + sorted_df_netrem = sorted_df + + sorted_df_elasticcv = generate_model_metrics_for_baselines_df(X_train 
= X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "ElasticNetCV", y_intercept = False, tf_name = tf_name) + sorted_df_lassocv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LassoCV", y_intercept = False, tf_name = tf_name) + sorted_df_ridgecv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "RidgeCV", y_intercept = False, tf_name = tf_name) + sorted_df_linear = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LinearRegression", y_intercept = False, tf_name = tf_name) + sorted_df_elasticcv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "ElasticNetCV", y_intercept = True, tf_name = tf_name) + sorted_df_lassocv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LassoCV", y_intercept = True, tf_name = tf_name) + sorted_df_ridgecv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "RidgeCV", y_intercept = True, tf_name = tf_name) + sorted_df_linear2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LinearRegression", y_intercept = True, tf_name = tf_name) + + sorty_combo = pd.concat([sorted_df_netrem, sorted_df_elasticcv, sorted_df_ridgecv, sorted_df_lassocv, sorted_df_linear]) + sorty_combo = pd.concat([sorty_combo, sorted_df_elasticcv2, sorted_df_ridgecv2, sorted_df_lassocv2, sorted_df_linear2]) + sorty_combo = sorty_combo[sorty_combo["TF"] == tf_name] + sorty_combo["TG"] = focus_gene + sorty_combo = sorty_combo.reset_index().drop(columns = ["index"]) + if 'AbsoluteVal_coefficient' not in 
sorty_combo.columns.tolist(): + sorty_combo['AbsoluteVal_coefficient'] = pd.Series([float('nan')]*len(sorty_combo)) + + sorty_combo = sorty_combo[['AbsoluteVal_coefficient', 'Rank', 'TF', 'Info', 'y_intercept', 'num_TFs', 'TG', 'train_mse', + 'test_mse', 'train_nmse', 'test_nmse', 'train_snr', 'test_snr', + 'train_psnr', 'test_psnr']] + + aaa = sorty_combo + aaa['rank_mse_train'] = aaa['train_mse'].rank(ascending=True).astype(int) + aaa['rank_mse_test'] = aaa['test_mse'].rank(ascending=True).astype(int) + aaa['rank_nmse_train'] = aaa['train_nmse'].rank(ascending=True).astype(int) + aaa['rank_nmse_test'] = aaa['test_nmse'].rank(ascending=True).astype(int) + + aaa['rank_snr_train'] = aaa['train_snr'].rank(ascending=False).astype(int) + aaa['rank_snr_test'] = aaa['test_snr'].rank(ascending=False).astype(int) + aaa['rank_psnr_train'] = aaa['train_psnr'].rank(ascending=False).astype(int) + aaa['rank_psnr_test'] = aaa['test_psnr'].rank(ascending=False).astype(int) + aaa["total_metrics_rank"] = aaa['rank_mse_train'] + aaa['rank_mse_test'] + aaa['rank_nmse_train'] + aaa['rank_nmse_test'] + aaa["total_metrics_rank"] += aaa['rank_snr_train'] + aaa['rank_snr_test'] + aaa['rank_psnr_train'] + aaa['rank_psnr_test'] + sorty_combo = aaa + + reduced_results_df = sorty_combo[sorty_combo["Rank"] != "N/A"] + reduced_results_df = reduced_results_df.sort_values(by = ["Rank"]) + if tf_netrem_found: + print(netrem_model.final_corr_vs_coef_df[["info"] + [tf_name]]) + if filtered_results: + return reduced_results_df + else: + return sorty_combo \ No newline at end of file diff --git a/code/old_code/refresh/essential_functions.py b/code/old_code/refresh/essential_functions.py new file mode 100644 index 0000000..ebe7587 --- /dev/null +++ b/code/old_code/refresh/essential_functions.py @@ -0,0 +1,123 @@ +# Essential_functions.py: :) +import pandas as pd +import numpy as np +import random +import copy +from tqdm import tqdm +import os +import sys # 
https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. +import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge +from numpy.typing import ArrayLike +# from skopt import gp_minimize, space +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +import matplotlib.pyplot as plt +from numpy.typing import ArrayLike +from scipy.sparse.linalg import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 + +# Python program to illustrate the intersection +# of two lists in the simplest way +def intersection(lst1, lst2): + lst3 = [value for value in lst1 if value in lst2] + return lst3 + + +def view_matrix_as_dataframe(matrix, column_names_list = [], row_names_list = []): + # :) Please note this function by Saniya returns a dataframe representation of the numpy matrix + # the names of the columns and the names of the rows (indices) are optional + matDF = pd.DataFrame(matrix) + if len(column_names_list) == matDF.shape[1]: + matDF.columns = column_names_list + if len(row_names_list) == matDF.shape[0]: + matDF.index = row_names_list + return matDF + + +def check_symmetric(a, rtol=1e-05, atol=1e-08): + # https://stackoverflow.com/questions/42908334/checking-if-a-matrix-is-symmetric-in-numpy + # Please note that this function 
checks if a matrix is symmetric in Python + # for square matrices (same # of rows and columns), there is a possibility they may be symmetric + # returns True if the matrix is symmetric (matrix = matrix_transpose) + # returns False if the matrix is NOT symmetric + return np.allclose(a, a.T, rtol=rtol, atol=atol) + + +class DiagonalLinearOperator(LinearOperator): + """Construct a diagonal matrix as a linear operator instead of a full numerical matrix np.diag(d). + This saves memory and computation time, which is especially useful when d is huge. + D.T = D + For 2d matrix A: + D @ A = d[:, np.newaxis] * A # scales rows of A + A @ D = A * d[np.newaxis, :] # scales cols of A + For 1d vector v: + D @ v = d * v + v @ D = v * d + NOTE: Coding just for fun: using a numerical matrix or a sparse matrix may be just fine for network regularization. + By Xiang Huang + """ + def __init__(self, d): + """d is a 1d vector of dimension N""" + N = len(d) + self.d = d + super().__init__(dtype=None, shape=(N, N)) + + def _transpose(self): + return self + + def _matvec(self, v): + return self.d * v + + def _matmat(self, A): + return self.d[:, np.newaxis] * A + + def __rmatmul__(self, x): + """Implementation of A @ D, and x @ D + We could implement __matmul__ in a similar way without inheriting LinearOperator + Because we inherit from LinearOperator, we can implement _matvec, and _matmat instead. 
+ """ + if x.ndim == 2: + return x * self.d[np.newaxis, :] + elif x.ndim == 1: + return x * self.d + else: + raise ValueError(f'Array should be 1d or 2d, but it is {x.ndim}d') + # Generally A @ D will call A.__matmul__(D) which raises a ValueError and not a NotImplemented + # We need to set __array_priority__ to high value higher than 0 (np.array) and 10.1 (scipy.sparse.csr_matrix) + # https://github.com/numpy/numpy/issues/8155 + # https://stackoverflow.com/questions/40252765/overriding-other-rmul-with-your-classs-mul + __array_priority__ = 1000 + + +def normalize_data_zero_to_one(data): + # https://stackoverflow.com/questions/18380419/normalization-to-bring-in-the-range-of-0-1 + return (data - np.min(data)) / (np.max(data) - np.min(data)) + + +def draw_arrow(direction = "down", color = "blue"): + x = [0.5, 0.5] + if direction == "down": + # Define the coordinates for the arrow + y = [0.9, 0.1] + else: # up-arrow + y = [0.1, 0.9] + fig, ax = plt.subplots(figsize=(2,2)) + # Plot the arrow using Matplotlib + plt.arrow(x[0], y[0], x[1]-x[0], y[1]-y[0], head_width=0.05, head_length=0.1, fc=color, ec=color) + # Set the x and y limits to adjust the plot size + plt.xlim(0, 1) + plt.ylim(0, 1) + plt.axis('off') # Hide the axis labels + plt.show() # Show the plot \ No newline at end of file diff --git a/user_guide/netremCV-old.ipynb b/code/old_code/refresh/netremCV-old.ipynb similarity index 100% rename from user_guide/netremCV-old.ipynb rename to code/old_code/refresh/netremCV-old.ipynb diff --git a/code/old_code/refresh/netremCV.ipynb b/code/old_code/refresh/netremCV.ipynb new file mode 100644 index 0000000..849d1cc --- /dev/null +++ b/code/old_code/refresh/netremCV.ipynb @@ -0,0 +1,1247 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ade59a66", + "metadata": {}, + "source": [ + "# netremCV\n", + "Cross-validation approach for estimating the optimal $\\beta_{net}$ and $\\alpha_{lasso}$.\n", + "\n", + "Selection for $\\beta_{net}$ can impact the optimal values 
for $\\alpha_{lasso}$" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "d801c820", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) same_train_test_data = False\n", + "Please note that since we hold out 30.0% of our 100000 samples for testing, we have:\n", + "X_train = 70000 rows (samples) and 5 columns (N = 5 predictors) for training.\n", + "X_test = 30000 rows (samples) and 5 columns (N = 5 predictors) for testing.\n", + "y_train = 70000 corresponding rows (samples) for training.\n", + "y_test = 30000 corresponding rows (samples) for testing.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 557.68it/s]\n", + "100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 994.05it/s]\n", + "100%|██████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1684.73it/s]\n" + ] + } + ], + "source": [ + "import sys\n", + "sys.path.append(\"../code\") # assuming \"code\" is one directory up and then down into \"code\"\n", + "\n", + "from DemoDataBuilderXandY import generate_dummy_data\n", + "from Netrem_model_builder import netrem, netremCV\n", + "import PriorGraphNetwork as graph\n", + "import error_metrics as em \n", + "import essential_functions as ef\n", + "import netrem_evaluation_functions as nm_eval\n", + "import Netrem_model_builder as nm\n", + "\n", + "dummy_data = generate_dummy_data(corrVals = [0.9, 0.5, 0.3, -0.2, -0.8],\n", + " num_samples_M = 100000,\n", + " train_data_percent = 70)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "2b97750a", + "metadata": {}, + "outputs": [], + "source": [ + "# 70,000 samples (70%) for training data (used to train and fit the NetREm model)\n", + "X_train = dummy_data.view_X_train_df()\n", + "y_train = 
dummy_data.view_y_train_df()\n", + "\n", + "# 30,000 samples (30%) for testing data\n", + "X_test = dummy_data.view_X_test_df()\n", + "y_test = dummy_data.view_y_test_df()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "69496222", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[['TF1', 'TF2', 0.9],\n", + " ['TF4', 'TF5', 0.75],\n", + " ['TF1', 'TF3'],\n", + " ['TF1', 'TF4'],\n", + " ['TF1', 'TF5'],\n", + " ['TF2', 'TF3'],\n", + " ['TF2', 'TF4'],\n", + " ['TF2', 'TF5'],\n", + " ['TF3', 'TF4'],\n", + " ['TF3', 'TF5']]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# prior network edge_list:\n", + "edge_list = [[\"TF1\", \"TF2\", 0.9], [\"TF4\", \"TF5\", 0.75], [\"TF1\", \"TF3\"], [\"TF1\", \"TF4\"], [\"TF1\", \"TF5\"], \n", + " [\"TF2\", \"TF3\"], [\"TF2\", \"TF4\"], [\"TF2\", \"TF5\"], [\"TF3\", \"TF4\"], [\"TF3\", \"TF5\"]]\n", + "edge_list" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "094a7227", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":) using variance to define beta_net values\n", + "beta_min = 1.1506396943803596 and beta_max = 115.06396943803597\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "ac3bddb5c58846a68761633cf7b46850", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + ":) Generating beta_net and alpha_lasso pairs: 0%| | 0/50 [00:00
NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=1.1506396943803596, alpha_lasso=0.051694590434151706, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x00000202249230A0>)
" + ], + "text/plain": [ + "NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=1.1506396943803596, alpha_lasso=0.051694590434151706, network=)" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time \n", + "\n", + "netrem_demoCV = netremCV(edge_list = edge_list, X = X_train, y = y_train) \n", + "netrem_demoCV" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "15723a95", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=1.1506396943803596, alpha_lasso=0.051694590434151706, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x00000202249230A0>)
" + ], + "text/plain": [ + "NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=1.1506396943803596, alpha_lasso=0.051694590434151706, network=)" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_demoCV" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "c599f58e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'info': 'NetREm Model',\n", + " 'alpha_lasso': 0.051694590434151706,\n", + " 'beta_net': 1.1506396943803596,\n", + " 'y_intercept': False,\n", + " 'model_type': 'Lasso',\n", + " 'max_lasso_iterations': 10000,\n", + " 'network': ,\n", + " 'verbose': False,\n", + " 'all_pos_coefs': False,\n", + " 'model_info': 'fitted_model :)',\n", + " 'target_gene_y': 'y',\n", + " 'tolerance': 0.0001,\n", + " 'lasso_selection': 'cyclic'}" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_demoCV.get_params()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "7ba65d39", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.13727397681026726" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_demoCV.test_mse(X_train, y_train)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "1a9f2f60", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.13781162327050317" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_demoCV.test_mse(X_test, y_test)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "758b4684", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   y_intercept       TF1       TF2      TF3       TF5
0         None  0.277752  0.063378  0.00145 -0.159248
\n", + "
" + ], + "text/plain": [ + " y_intercept TF1 TF2 TF3 TF5\n", + "0 None 0.277752 0.063378 0.00145 -0.159248" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_demoCV.model_nonzero_coef_df" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "7afe0c3f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
          TF1       TF2        TF3       TF4       TF5
TF1  5.414075  1.113261   0.361368 -0.429737 -2.771125
TF2  1.113261  1.439763  -0.107738 -0.125070 -0.773506
TF3  0.361368 -0.107738  37.921695 -0.359079 -0.715906
TF4 -0.429737 -0.125070  -0.359079  1.139272  0.197285
TF5 -2.771125 -0.773506  -0.715906  0.197285  2.877027
\n", + "
" + ], + "text/plain": [ + " TF1 TF2 TF3 TF4 TF5\n", + "TF1 5.414075 1.113261 0.361368 -0.429737 -2.771125\n", + "TF2 1.113261 1.439763 -0.107738 -0.125070 -0.773506\n", + "TF3 0.361368 -0.107738 37.921695 -0.359079 -0.715906\n", + "TF4 -0.429737 -0.125070 -0.359079 1.139272 0.197285\n", + "TF5 -2.771125 -0.773506 -0.715906 0.197285 2.877027" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_demoCV.B_interaction_df" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "64bd24b1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
    TF1  TF2  B_train_weight  sign  potential_interaction  absVal_B  info                            candidate_TFs_N  target_gene_y  num_final_predictors  model_type  beta_net  gene_data                      rank  percentile
20  TF1  TF5       -2.771125  :(    :( competitive (-)     2.771125  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data   1.0        95.0
 4  TF5  TF1       -2.771125  :(    :( competitive (-)     2.771125  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data   1.0        95.0
 5  TF1  TF2        1.113261  :)    :(                     1.113261  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data   3.0        85.0
 1  TF2  TF1        1.113261  :)    :(                     1.113261  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data   3.0        85.0
 9  TF5  TF2       -0.773506  :(    :( competitive (-)     0.773506  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data   5.0        75.0
21  TF2  TF5       -0.773506  :(    :( competitive (-)     0.773506  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data   5.0        75.0
14  TF5  TF3       -0.715906  :(    :( competitive (-)     0.715906  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data   7.0        65.0
22  TF3  TF5       -0.715906  :(    :( competitive (-)     0.715906  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data   7.0        65.0
15  TF1  TF4       -0.429737  :(    :( competitive (-)     0.429737  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data   9.0        55.0
 3  TF4  TF1       -0.429737  :(    :( competitive (-)     0.429737  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data   9.0        55.0
 2  TF3  TF1        0.361368  :)    :(                     0.361368  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data  11.0        45.0
10  TF1  TF3        0.361368  :)    :(                     0.361368  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data  11.0        45.0
13  TF4  TF3       -0.359079  :(    :( competitive (-)     0.359079  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data  13.0        35.0
17  TF3  TF4       -0.359079  :(    :( competitive (-)     0.359079  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data  13.0        35.0
19  TF5  TF4        0.197285  :)    :(                     0.197285  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data  15.0        25.0
23  TF4  TF5        0.197285  :)    :(                     0.197285  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data  15.0        25.0
 8  TF4  TF2       -0.125070  :(    :( competitive (-)     0.125070  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data  17.0        15.0
16  TF2  TF4       -0.125070  :(    :( competitive (-)     0.125070  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data  17.0        15.0
 7  TF3  TF2       -0.107738  :(    :( competitive (-)     0.107738  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data  19.0         5.0
11  TF2  TF3       -0.107738  :(    :( competitive (-)     0.107738  B matrix of TF-TF interactions                5              y                     4  Lasso        1.15064  training gene expression data  19.0         5.0
\n", + "
" + ], + "text/plain": [ + " TF1 TF2 B_train_weight sign potential_interaction absVal_B \\\n", + "20 TF1 TF5 -2.771125 :( :( competitive (-) 2.771125 \n", + "4 TF5 TF1 -2.771125 :( :( competitive (-) 2.771125 \n", + "5 TF1 TF2 1.113261 :) :( 1.113261 \n", + "1 TF2 TF1 1.113261 :) :( 1.113261 \n", + "9 TF5 TF2 -0.773506 :( :( competitive (-) 0.773506 \n", + "21 TF2 TF5 -0.773506 :( :( competitive (-) 0.773506 \n", + "14 TF5 TF3 -0.715906 :( :( competitive (-) 0.715906 \n", + "22 TF3 TF5 -0.715906 :( :( competitive (-) 0.715906 \n", + "15 TF1 TF4 -0.429737 :( :( competitive (-) 0.429737 \n", + "3 TF4 TF1 -0.429737 :( :( competitive (-) 0.429737 \n", + "2 TF3 TF1 0.361368 :) :( 0.361368 \n", + "10 TF1 TF3 0.361368 :) :( 0.361368 \n", + "13 TF4 TF3 -0.359079 :( :( competitive (-) 0.359079 \n", + "17 TF3 TF4 -0.359079 :( :( competitive (-) 0.359079 \n", + "19 TF5 TF4 0.197285 :) :( 0.197285 \n", + "23 TF4 TF5 0.197285 :) :( 0.197285 \n", + "8 TF4 TF2 -0.125070 :( :( competitive (-) 0.125070 \n", + "16 TF2 TF4 -0.125070 :( :( competitive (-) 0.125070 \n", + "7 TF3 TF2 -0.107738 :( :( competitive (-) 0.107738 \n", + "11 TF2 TF3 -0.107738 :( :( competitive (-) 0.107738 \n", + "\n", + " info candidate_TFs_N target_gene_y \\\n", + "20 B matrix of TF-TF interactions 5 y \n", + "4 B matrix of TF-TF interactions 5 y \n", + "5 B matrix of TF-TF interactions 5 y \n", + "1 B matrix of TF-TF interactions 5 y \n", + "9 B matrix of TF-TF interactions 5 y \n", + "21 B matrix of TF-TF interactions 5 y \n", + "14 B matrix of TF-TF interactions 5 y \n", + "22 B matrix of TF-TF interactions 5 y \n", + "15 B matrix of TF-TF interactions 5 y \n", + "3 B matrix of TF-TF interactions 5 y \n", + "2 B matrix of TF-TF interactions 5 y \n", + "10 B matrix of TF-TF interactions 5 y \n", + "13 B matrix of TF-TF interactions 5 y \n", + "17 B matrix of TF-TF interactions 5 y \n", + "19 B matrix of TF-TF interactions 5 y \n", + "23 B matrix of TF-TF interactions 5 y \n", + "8 B matrix of TF-TF 
interactions 5 y \n", + "16 B matrix of TF-TF interactions 5 y \n", + "7 B matrix of TF-TF interactions 5 y \n", + "11 B matrix of TF-TF interactions 5 y \n", + "\n", + " num_final_predictors model_type beta_net gene_data \\\n", + "20 4 Lasso 1.15064 training gene expression data \n", + "4 4 Lasso 1.15064 training gene expression data \n", + "5 4 Lasso 1.15064 training gene expression data \n", + "1 4 Lasso 1.15064 training gene expression data \n", + "9 4 Lasso 1.15064 training gene expression data \n", + "21 4 Lasso 1.15064 training gene expression data \n", + "14 4 Lasso 1.15064 training gene expression data \n", + "22 4 Lasso 1.15064 training gene expression data \n", + "15 4 Lasso 1.15064 training gene expression data \n", + "3 4 Lasso 1.15064 training gene expression data \n", + "2 4 Lasso 1.15064 training gene expression data \n", + "10 4 Lasso 1.15064 training gene expression data \n", + "13 4 Lasso 1.15064 training gene expression data \n", + "17 4 Lasso 1.15064 training gene expression data \n", + "19 4 Lasso 1.15064 training gene expression data \n", + "23 4 Lasso 1.15064 training gene expression data \n", + "8 4 Lasso 1.15064 training gene expression data \n", + "16 4 Lasso 1.15064 training gene expression data \n", + "7 4 Lasso 1.15064 training gene expression data \n", + "11 4 Lasso 1.15064 training gene expression data \n", + "\n", + " rank percentile \n", + "20 1.0 95.0 \n", + "4 1.0 95.0 \n", + "5 3.0 85.0 \n", + "1 3.0 85.0 \n", + "9 5.0 75.0 \n", + "21 5.0 75.0 \n", + "14 7.0 65.0 \n", + "22 7.0 65.0 \n", + "15 9.0 55.0 \n", + "3 9.0 55.0 \n", + "2 11.0 45.0 \n", + "10 11.0 45.0 \n", + "13 13.0 35.0 \n", + "17 13.0 35.0 \n", + "19 15.0 25.0 \n", + "23 15.0 25.0 \n", + "8 17.0 15.0 \n", + "16 17.0 15.0 \n", + "7 19.0 5.0 \n", + "11 19.0 5.0 " + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b_matrix = nm.organize_B_interaction_network(netrem_demoCV)\n", + "b_matrix" + ] + }, + { + 
"cell_type": "code", + "execution_count": 13, + "id": "b5ca8aca", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/code/old_code/refresh/netrem_evaluation_functions.py b/code/old_code/refresh/netrem_evaluation_functions.py new file mode 100644 index 0000000..b99b41d --- /dev/null +++ b/code/old_code/refresh/netrem_evaluation_functions.py @@ -0,0 +1,594 @@ +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import random +import copy +from tqdm import tqdm +import os +import sys # https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. 
+import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge +from numpy.typing import ArrayLike +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +from numpy.typing import ArrayLike +from skopt import gp_minimize, space +from scipy.sparse.linalg.interface import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 +# from packages_needed import * +from error_metrics import * +from DemoDataBuilderXandY import * +from PriorGraphNetwork import * +from Netrem_model_builder import * +from sklearn.linear_model import ElasticNetCV, LinearRegression, LassoCV, RidgeCV +from skopt import gp_minimize, space +from skopt.utils import use_named_args + +class BayesianObjective_Lasso: + def __init__(self, X, y, cv_folds, model, scorer="mse", print_network=False): + self.X = X + self.y = y + self.cv_folds = cv_folds + model.view_network = print_network + self.model = model + self.scorer_obj = 'neg_mean_squared_error' # the default + if scorer == "mse": + self.scorer_obj = em.mse_custom_scorer + elif scorer == "nmse": + self.scorer_obj = em.nmse_custom_scorer + elif scorer == "snr": + self.scorer_obj = em.snr_custom_scorer + elif scorer == "psnr": + self.scorer_obj = em.psnr_custom_scorer + + def __call__(self, params): + try: + alpha_lasso, beta_network = params + # print(f"Testing with alpha_lasso = {alpha_lasso}, beta_network = {beta_network}") + + netrem_model = self.model + 
#print(netrem_model.get_params()) + netrem_model.alpha_lasso = alpha_lasso + netrem_model.beta_network = beta_network + + cv_scores = cross_val_score(netrem_model, self.X, self.y, cv=self.cv_folds, scoring=self.scorer_obj) + + # Check for infinite values + if np.any(np.isinf(cv_scores)): + # print("Cross-validation scores contain infinite values.") + #return np.inf + return 1e100 # Replace infinite score with large finite value + + # Debugging: Print the individual cross-validation scores + # print(f"Individual cross-validation scores: {cv_scores}") + + score = -cv_scores.mean() + # print(f"Score with alpha_lasso = {alpha_lasso}, beta_network = {beta_network} is {score}") + + #if np.isinf(score): + #print("Score is infinite!") + + return score + + except Exception as e: + #print(f"An exception occurred: {e}") + #return np.inf # Return a high "bad" value to indicate failure + return 1e100 # Replace infinite score with large finite value + + +# Define a callback function to update the progress bar +def progress_bar_callback(res): + progress_bar.update(1) + +def optimal_netrem_model_via_bayesian_param_tuner(netrem_model, X_train, y_train, + beta_net_min = 0.5, + beta_net_max = 1000, + alpha_lasso_min = 0.0001, + alpha_lasso_max = 0.1, + num_grid_values = 100, + cv_folds = 5, + scorer = "mse", + verbose = False): + + print(":) Please note that we are running: optimal_netrem_model_via_bayesian_param_tuner") + if verbose: + print(f":) Please note we are running Bayesian optimization (via skopt Python package) for parameter hunting for beta_network and alpha_lasso with model evaluation scorer: {scorer} :)") + print("we use gp_minimize here for hyperparameter tuning") + print(f":) Please note this is a start-to-finish optimizer for NetREm (Network regression embeddings reveal cell-type protein-protein interactions for gene regulation)") + + + model_type = netrem_model.model_type + if model_type == "LassoCV": + print("please note that we can only do this for Lasso model not 
for LassoCV :(") + print("Thus, we will alter the model_type to make it Lasso") + netrem_model.model_type = "Lasso" + + param_space = [space.Real(alpha_lasso_min, alpha_lasso_max, name='alpha_lasso', prior='log-uniform'), + space.Real(beta_net_min, beta_net_max, name='beta_network', prior='log-uniform')] + objective = BayesianObjective_Lasso(X_train, y_train, cv_folds = cv_folds, model = netrem_model, scorer = scorer) + + + # Perform Bayesian optimization + result = gp_minimize(objective, param_space, n_calls=num_grid_values, random_state=123) + + results_dict = {} + optimal_model = netrem_model + if verbose: + print(":) ######################################################################\n") + print(f":) Please note the optimal model based on Bayesian optimization found: ") + + bayesian_alpha = result.x[0] + bayesian_beta = result.x[1] + optimal_model.alpha_lasso = bayesian_alpha + optimal_model.beta_network = bayesian_beta + results_dict["bayesian_alpha"] = bayesian_alpha + print(f"alpha_lasso = {bayesian_alpha} ; beta_network = {bayesian_beta}") + if verbose: + print(":) ######################################################################\n") + print("Fitting the model using these optimal hyperparameters for beta_net and alpha_lasso...") + dict_ex = optimal_model.get_params() + optimal_model = nm.NetREmModel(**dict_ex) + optimal_model.fit(X_train, y_train) + print(optimal_model.get_params()) + results_dict["optimal_model"] = optimal_model + results_dict["bayesian_beta"] = bayesian_beta + results_dict["bayesian_alpha"] = bayesian_alpha + results_dict["result"] = result + return results_dict + +# class BayesianObjective_Lasso: +# def __init__(self, X, y, cv_folds, model, scorer = "mse", print_network = False): +# self.X = X +# self.y = y +# self.cv_folds = cv_folds +# model.view_network = print_network +# self.model = model +# self.scorer_obj = 'neg_mean_squared_error' # the default +# if scorer == "mse": +# self.scorer_obj = mse_custom_scorer +# elif scorer 
== "nmse": +# self.scorer_obj = nmse_custom_scorer +# elif scorer == "snr": +# self.scorer_obj = snr_custom_scorer +# elif scorer == "psnr": +# self.scorer_obj = psnr_custom_scorer + + +# def __call__(self, params): + +# alpha_lasso, beta_network = params +# #network = PriorGraphNetwork(edge_list = edge_list) +# netrem_model = self.model +# #print(netrem_model.get_params()) +# netrem_model.alpha_lasso = alpha_lasso +# netrem_model.beta_network = beta_network +# #netrem_model.view_network = self.view_network +# score = -cross_val_score(netrem_model, self.X, self.y, cv=self.cv_folds, scoring=self.scorer_obj).mean() +# return score + + +# def optimal_netrem_model_via_bayesian_param_tuner(netrem_model, X_train, y_train, +# beta_net_min = 0.001, +# beta_net_max = 10, +# alpha_lasso_min = 0.0001, +# alpha_lasso_max = 0.1, +# num_grid_values = 100, +# gridSearchCV_folds = 5, +# scorer = "mse", +# verbose = False): +# if verbose: +# print(f":) Please note we are running Bayesian optimization (via skopt Python package) for parameter hunting for beta_network and alpha_lasso with model evaluation scorer: {scorer} :)") +# print("we use gp_minimize here for hyperparameter tuning") +# print(f":) Please note this is a start-to-finish optimizer for NetREm (Network regression embeddings reveal cell-type protein-protein interactions for gene regulation)") +# from skopt import gp_minimize, space +# model_type = netrem_model.model_type +# # param_space = [space.Real(alpha_lasso_min, alpha_lasso_max, name='alpha_lasso', prior='log-uniform'), +# # space.Real(beta_net_min, beta_net_max, name='beta_network', prior='log-uniform')] + +# if model_type == "LassoCV": +# print("please note that we can only do this for Lasso model not for LassoCV :(") +# print("Thus, we will alter the model_type to make it Lasso") +# netrem_model.model_type = "Lasso" + +# param_space = [space.Real(alpha_lasso_min, alpha_lasso_max, name='alpha_lasso', prior='log-uniform'), +# space.Real(beta_net_min, 
beta_net_max, name='beta_network', prior='log-uniform')] +# objective = BayesianObjective_Lasso(X_train, y_train, cv_folds = gridSearchCV_folds, model = netrem_model, scorer = scorer) + +# # Perform Bayesian optimization +# result = gp_minimize(objective, param_space, n_calls=num_grid_values, random_state=123) +# results_dict = {} +# optimal_model = netrem_model +# if verbose: +# print(":) ######################################################################\n") +# print(f":) Please note the optimal model based on Bayesian optimization found: ") + +# bayesian_alpha = result.x[0] +# bayesian_beta = result.x[1] +# optimal_model.alpha_lasso = bayesian_alpha +# optimal_model.beta_network = bayesian_beta +# results_dict["bayesian_alpha"] = bayesian_alpha +# print(f"alpha_lasso = {bayesian_alpha} ; beta_network = {bayesian_beta}") +# if verbose: +# print(":) ######################################################################\n") +# print("Fitting the model using these optimal hyperparameters for beta_net and alpha_lasso...") +# dict_ex = optimal_model.get_params() +# optimal_model = NetREmModel(**dict_ex) +# optimal_model.fit(X_train, y_train) +# print(optimal_model.get_params()) +# results_dict["optimal_model"] = optimal_model +# results_dict["bayesian_beta"] = bayesian_beta +# results_dict["result"] = result +# return results_dict + + +def optimal_netrem_model_via_gridsearchCV_param_tuner(netrem_model, X_train, y_train, num_grid_values, num_cv_jobs = -1): + beta_max = 0.5 * np.max(np.abs(X_train.T.dot(y_train))) + beta_min = 0.01 * beta_max + beta_grid = np.logspace(np.log10(beta_max), np.log10(beta_min), num=num_grid_values) + import copy + alpha_grid = [] + initial_gregCV = netrem_model + original_dict = copy.deepcopy(netrem_model.get_params()) + original_model = NetREmModel(**netrem_model.get_params()) + initial_gregCV.model_type = "LassoCV" + #print(initial_gregCV.get_params()) + for beta in beta_grid: + gregCV_demo = initial_gregCV + gregCV_demo.beta_network = 
beta + gregCV_demo.fit(X_train, y_train) + optimal_alpha = gregCV_demo.regr.alpha_ + alpha_grid.append(optimal_alpha) + + beta_alpha_grid_dict = {} + beta_alpha_grid_dict["beta_network_vals"] = beta_grid + beta_alpha_grid_dict["alpha_lasso_vals"] = alpha_grid #np.array(alpha_grid) + param_grid = [] + for i in tqdm(range(0, len(beta_alpha_grid_dict["beta_network_vals"]))): + beta_net = beta_alpha_grid_dict["beta_network_vals"][i] + alpha_las = beta_alpha_grid_dict["alpha_lasso_vals"][i] + param_grid.append({"alpha_lasso": [alpha_las], "beta_network": [beta_net]}) + grid_search = GridSearchCV(original_model, param_grid = param_grid, cv=gridSearchCV_folds, + n_jobs = num_cv_jobs, scoring='neg_mean_squared_error') + grid_search.fit(X_train, y_train) + # Get the best hyperparameters + best_params = grid_search.best_params_ + optimal_alpha = best_params["alpha_lasso"] + optimal_beta = best_params["beta_network"] + if isinstance(optimal_alpha, np.ndarray): + optimal_alpha = optimal_alpha[0] + if isinstance(optimal_beta, np.ndarray): + optimal_beta = optimal_beta[0] + print(f":) NetREmModelCV found that the optimal alpha_lasso = {optimal_alpha} and optimal beta_network = {optimal_beta}") + update_NetREmModel = NetREmModel(**original_dict) + update_NetREmModel.beta_network = optimal_beta + update_NetREmModel.alpha_lasso = optimal_alpha + update_NetREmModel = NetREmModel(**update_NetREmModel.get_params()) + update_NetREmModel.fit(X_train, y_train) + return update_NetREmModel + + +def model_comparison_metrics_for_target_gene_with_BayesianOpt_andOr_GridSearchCV_ForNetREm(gene_num, target_genes_list, + X_train_all, X_test_all, y_train_all, y_test_all, + scgrnom_step2_df, tfs, expression_percentile, tf_df, + js_mini, ppi_edge_list, num_tfs_family, gene_expression_genes, tf_name = "SOX10", + beta_net_min = 0.001, + beta_net_max = 10, + alpha_lasso_min = 0.0001, + alpha_lasso_max = 0.1, + num_grid_values = 100, + gridSearchCV_folds = 5, + scorer = "mse", view_network = False, 
verbose = False, num_cv_jobs = -1): + + focus_gene = target_genes_list[gene_num] # here, this is tough 9, 10 + print(f"Please note that our focus gene (Target gene (TG) y) is: {focus_gene}") + + y_train = y_train_all[[focus_gene]] + y_test = y_test_all[[focus_gene]] + + tfs_for_tg = scgrnom_step2_df[scgrnom_step2_df["TG"] == focus_gene]["TF"].tolist() + tfs_for_tg.sort() + + tfs_for_tg = intersection(tfs_for_tg, tfs) + len(tfs_for_tg) + + low_TFs_bool = False + if len(tfs_for_tg) < 5: + print(":( uh-oh!") + low_TFs_bool = True + if verbose: + print(len(tfs_for_use)) + # adding genes from the same family to the set of TFs (based on co-binding from Step 2) + tf_families_to_add = list(set(tf_df[tf_df["gene"].isin(tfs_for_tg)]["TF_Family"])) + gene_expression_avg = np.mean(X_train_all, axis=0) + + expression_threshold = np.percentile(gene_expression_avg, expression_percentile) + if verbose: + print(f":) Please note that based on the training X data, we find that the {expression_percentile}%ile average gene expression level is: {expression_threshold}") #expression_threshold + gene_expression_avg_df = pd.DataFrame(gene_expression_avg, columns = ["avg_expression"]) + gene_expression_avg_df["gene"] = gene_expression_avg_df.index + genes_above_threshold_df = gene_expression_avg_df[gene_expression_avg_df["avg_expression"] >= expression_threshold] + info_tf_family_expression_df = pd.merge(tf_df, gene_expression_avg_df, how = "inner") + info_tf_family_expression_df = info_tf_family_expression_df.sort_values(by = ["avg_expression"], ascending = False) + info_tf_family_expression_df = info_tf_family_expression_df.sort_values(by = ["TF_Family"]) + mini_info_tf_family_express_df = info_tf_family_expression_df[info_tf_family_expression_df["TF_Family"].isin(tf_families_to_add)] + # sort dataframe by 'TF_Family' and 'avg_expression' in descending order + df_sorted = mini_info_tf_family_express_df.sort_values(['TF_Family', 'avg_expression'], ascending=False) + # select the row with 
the highest 'avg_expression' for each 'TF_Family' + df_result = df_sorted.groupby('TF_Family').first().reset_index() + + ######################################################################## + df_sorty = info_tf_family_expression_df[info_tf_family_expression_df["gene"].isin(genes_above_threshold_df["gene"].tolist())] + # sort dataframe by 'TF_Family' and 'avg_expression' in descending order + df_sorted1 = df_sorty.sort_values(['TF_Family', 'avg_expression'], ascending=False) + # select the top 2 rows for each 'TF_Family' + if low_TFs_bool: + num_to_use_TFs = num_tfs_family + 1 + df_result1 = df_sorted1.groupby('TF_Family').head(n=num_to_use_TFs).reset_index(drop=True) + else: + df_result1 = df_sorted1.groupby('TF_Family').head(n=num_tfs_family).reset_index(drop=True) + if verbose: + print(df_result1) + tfs_to_use_list = df_result["gene"].tolist() + tfs_to_use_list.sort() + if verbose: + print(f" :) tfs_to_use_list = {tfs_to_use_list}") + + tfs_for_use = list(set(tfs_to_use_list + df_result1["gene"].tolist())) + tfs_for_use.sort() + + ########################################################################## + js_minier = js_mini[js_mini["TF1"].isin(tfs_for_use)] + js_minier = js_minier[js_minier["TF2"].isin(tfs_for_use)] + + # for each tf from scgrnom step 2, we add the top 3 TFs based on the cobind matrix + tfs_added_list = [] + for i in tqdm(range(0, len(tfs_to_use_list))): + tf_num = i#in tfs_for_tg: + if low_TFs_bool: + tfs_added_list += js_minier[js_minier["TF1"] == tfs_to_use_list[tf_num]].head(9)["TF2"].tolist() + else: + tfs_added_list += js_minier[js_minier["TF1"] == tfs_to_use_list[tf_num]].head(3)["TF2"].tolist() + + tfs_added_list.sort() + + + #################################### + if verbose: + print(len(tfs_added_list)) + print(tfs_added_list) + combo_tfs = list(set(tfs_to_use_list+tfs_added_list)) + if verbose: + print(len(combo_tfs)) + print(combo_tfs) + tf_columns = intersection(combo_tfs, gene_expression_genes) + tf_columns = 
list(set(tf_columns)) + tf_columns.sort() + if verbose: + print(":) # of TFs: ", len(tf_columns)) + print(tf_columns) + + if focus_gene in tf_columns: + tf_columns.remove(focus_gene) + key_genes = tf_columns + + ######################### :) We are filtering the input PPI matrix based on the + # final TFs (key_genes) to help us save time: + filtered_ppi_edge_list = [] + for edge in ppi_edge_list: + if edge[0] in key_genes and edge[1] in key_genes: + filtered_ppi_edge_list.append(edge) + + if verbose: + print(filtered_ppi_edge_list) + + X_train = X_train_all[tf_columns] + X_test = X_test_all[tf_columns] + if verbose: + print("X_train dimensions: ", X_train.shape) + print("X_test dimensions: ", X_test.shape) + + netrem_no_intercept = netrem(edge_list = filtered_ppi_edge_list, + gene_expression_nodes = key_genes, + verbose = verbose, + view_network = view_network) + + netrem_with_intercept = netrem(edge_list = filtered_ppi_edge_list, + y_intercept = True, + verbose = verbose, + gene_expression_nodes = key_genes, + view_network = view_network) + + model_comparison_df1 = pd.DataFrame() + model_comparison_df2 = pd.DataFrame() + bayes_optimizer_bool = False + griddy_optimizer_bool = False + + ##################################################################################### + no_intercept = False + with_intercept = False + try: + optimal_netrem_no_intercept = optimal_netrem_model_via_bayesian_param_tuner(netrem_no_intercept, X_train, y_train, + beta_net_min, + beta_net_max, + alpha_lasso_min, + alpha_lasso_max, + num_grid_values, + gridSearchCV_folds, + scorer, + verbose) + #optimal_netrem_no_intercept = optimal_netrem_model_via_bayesian_param_tuner(netrem_no_intercept, X_train, y_train, verbose = verbose) + optimal_netrem_no_intercept = optimal_netrem_no_intercept["optimal_model"] + no_intercept = True + except: + print(":( Bayesian optimizer is not working for no y-intercept") + optimal_netrem_no_intercept = None + + try: + optimal_netrem_with_intercept = 
optimal_netrem_model_via_bayesian_param_tuner(netrem_with_intercept, X_train, y_train, + beta_net_min, + beta_net_max, + alpha_lasso_min, + alpha_lasso_max, + num_grid_values, + gridSearchCV_folds, + scorer, + verbose) + + optimal_netrem_with_intercept = optimal_netrem_with_intercept["optimal_model"] + with_intercept = True + + except: + print(":( Bayesian optimizer is not working for y-intercept") + optimal_netrem_with_intercept = None + + if no_intercept or with_intercept: + model_comparison_df1 = metrics_for_netrem_models_versus_other_models(netrem_with_intercept = optimal_netrem_with_intercept, netrem_no_intercept = optimal_netrem_no_intercept, + X_train = X_train, y_train = y_train, + X_test = X_test, y_test = y_test, filtered_results = False, + tf_name = tf_name, target_gene = focus_gene) + model_comparison_df1["approach"] = "bayes_optimizer" + bayes_optimizer_bool = True + + ##################################################################################### + no_intercept = False + with_intercept = False + try: + griddy_netrem_no_intercept = optimal_netrem_model_via_gridsearchCV_param_tuner(netrem_no_intercept, X_train, y_train, + num_grid_values, num_cv_jobs) + + no_intercept = True + except: + print(":( gridsearchCV is not working for no y-intercept") + griddy_netrem_no_intercept = None + + try: + griddy_netrem_with_intercept = optimal_netrem_model_via_gridsearchCV_param_tuner(netrem_with_intercept, X_train, y_train, + num_grid_values, num_cv_jobs) + with_intercept = True + except: + print(":( gridsearchCV is not working for y-intercept") + griddy_netrem_with_intercept = None + + if no_intercept or with_intercept: + model_comparison_df2 = metrics_for_netrem_models_versus_other_models(netrem_with_intercept = griddy_netrem_with_intercept, netrem_no_intercept = griddy_netrem_no_intercept, + X_train = X_train, y_train = y_train, + X_test = X_test, y_test = y_test, filtered_results = False, + tf_name = tf_name, target_gene = focus_gene) + + 
model_comparison_df2["approach"] = "gridSearchCV" + griddy_optimizer_bool = True + # except: + # print(":( gridsearchCV optimizer is not working") + both_approaches_bool = False + if bayes_optimizer_bool and griddy_optimizer_bool: + combined_model_compare_df = pd.concat([model_comparison_df1, model_comparison_df2]) + both_approaches_bool = True + elif bayes_optimizer_bool: + combined_model_compare_df = pd.concat([model_comparison_df1]) + else: + combined_model_compare_df = pd.concat([model_comparison_df2]) + + if both_approaches_bool: + res3 = combined_model_compare_df + res3["combo_key"] = res3["Info"] + "_" + res3["y_intercept"] + "_" + res3["Rank"].astype(str) + "_" + res3["num_TFs"].astype(str) + # Count the number of occurrences of each combo_key + combo_key_counts = res3.groupby('combo_key').size() + + # Create a boolean mask for the combo_keys that appear more than once + combo_key_mask = combo_key_counts > 1 + + # Update the approach column for the combo_keys that appear more than once + res3.loc[res3['combo_key'].isin(combo_key_counts[combo_key_mask].index), 'approach'] = 'both' + aaa = res3 + + aaa['rank_mse_train'] = aaa['train_mse'].rank(ascending=True).astype(int) + aaa['rank_mse_test'] = aaa['test_mse'].rank(ascending=True).astype(int) + aaa['rank_nmse_train'] = aaa['train_nmse'].rank(ascending=True).astype(int) + aaa['rank_nmse_test'] = aaa['test_nmse'].rank(ascending=True).astype(int) + + aaa['rank_snr_train'] = aaa['train_snr'].rank(ascending=False).astype(int) + aaa['rank_snr_test'] = aaa['test_snr'].rank(ascending=False).astype(int) + aaa['rank_psnr_train'] = aaa['train_psnr'].rank(ascending=False).astype(int) + aaa['rank_psnr_test'] = aaa['test_psnr'].rank(ascending=False).astype(int) + aaa["total_metrics_rank"] = aaa['rank_mse_train'] + aaa['rank_mse_test'] + aaa['rank_nmse_train'] + aaa['rank_nmse_test'] + aaa["total_metrics_rank"] += aaa['rank_snr_train'] + aaa['rank_snr_test'] + aaa['rank_psnr_train'] + aaa['rank_psnr_test'] + aaa = 
aaa.drop_duplicates() + combined_model_compare_df = aaa + combined_model_compare_df = combined_model_compare_df.drop(columns = ["combo_key"]) + return combined_model_compare_df + + +def baseline_metrics_function(X_train, y_train, X_test, y_test, tg, model_name, y_intercept, verbose = False): + + if verbose: + print(f"{model_name} results :) for fitting y_intercept = {y_intercept}") + try: + if model_name == "ElasticNetCV": + regr = ElasticNetCV(cv=5, random_state=0, fit_intercept = y_intercept) + elif model_name == "LinearRegression": + regr = LinearRegression(fit_intercept = y_intercept) + elif model_name == "LassoCV": + regr = LassoCV(cv=5, fit_intercept = y_intercept) + elif model_name == "RidgeCV": + regr = RidgeCV(cv=5, fit_intercept = y_intercept) + regr.fit(X_train, y_train) + if model_name in ["RidgeCV", "LinearRegression"]: + model_df = pd.DataFrame(regr.coef_) + else: + model_df = pd.DataFrame(regr.coef_).transpose() + if verbose: + print(model_df) + model_df.columns = X_train.columns.tolist() + selected_row = model_df.iloc[0] + selected_cols = selected_row[selected_row != 0].index # Filter out the columns with value 0 + model_df = model_df[selected_cols] + df = model_df.replace("None", np.nan).apply(pd.to_numeric, errors='coerce') + sorted_series = df.abs().squeeze().sort_values(ascending=False) + # convert the sorted series back to a DataFrame + sorted_df = pd.DataFrame(sorted_series) + # add a column for the rank + sorted_df['Rank'] = range(1, len(sorted_df) + 1) + sorted_df['TF'] = sorted_df.index + sorted_df = sorted_df.rename(columns = {0:"AbsoluteVal_coefficient"}) + # tfs = sorted_df["TF"].tolist() + # if tf_name not in tfs: + # sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + # sorted_df.columns = ["Rank", "TF"] + sorted_df["Info"] = model_name + if y_intercept: + sorted_df["y_intercept"] = "True :)" + else: + sorted_df["y_intercept"] = "False :(" + sorted_df["final_model_TFs"] = model_df.shape[1] + sorted_df["TFs_input_to_model"] = 
X_train.shape[1] + sorted_df["original_TFs_in_X"] = X_train.shape[1] + + predY_train = regr.predict(X_train) + predY_test = regr.predict(X_test) + train_mse = em.mse(y_train.values.flatten(), predY_train) + test_mse = em.mse(y_test.values.flatten(), predY_test) + sorted_df["train_mse"] = train_mse + sorted_df["test_mse"] = test_mse + sorted_df["train_nmse"] = em.nmse(y_train.values.flatten(), predY_train) + sorted_df["test_nmse"] = em.nmse(y_test.values.flatten(), predY_test) + sorted_df["train_snr"] = em.snr(y_train.values.flatten(), predY_train) + sorted_df["test_snr"] = em.snr(y_test.values.flatten(), predY_test) + sorted_df["train_psnr"] = em.psnr(y_train.values.flatten(), predY_train) + sorted_df["test_psnr"] = em.psnr(y_test.values.flatten(), predY_test) + sorted_df["TG"] = tg + sorted_df = sorted_df.reset_index().drop(columns = ["index"]) + sorted_df + except: + return pd.DataFrame() + return sorted_df \ No newline at end of file diff --git a/code/old_code/refresh/packages_needed.py b/code/old_code/refresh/packages_needed.py new file mode 100644 index 0000000..b4d319e --- /dev/null +++ b/code/old_code/refresh/packages_needed.py @@ -0,0 +1,38 @@ +import pandas as pd +import numpy as np +import random +import copy +from tqdm import tqdm +import os +import sys # https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. 
+import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge +from numpy.typing import ArrayLike +from skopt import gp_minimize, space +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +import matplotlib.pyplot as plt +from numpy.typing import ArrayLike +from scipy.sparse.linalg.interface import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 + + +""" +Optimization for +(1 / (2 * M)) * ||y - Xc||^2_2 + (beta / (2 * N^2)) * c'Ac + alpha * ||c||_1 +Which is converted to lasso +(1 / (2 * M)) * ||y_tilde - X_tilde @ c||^2_2 + alpha * ||c||_1 +where M = n_samples and N is the dimension of c. 
+Check compute_X_tilde_y_tilde() to see how we make sure above normalization is applied using Lasso of sklearn +""" \ No newline at end of file diff --git a/netrem_final_demo.png b/netrem_final_demo.png index 68f5136..2466627 100644 Binary files a/netrem_final_demo.png and b/netrem_final_demo.png differ diff --git a/netrem_gexpr_demo.png b/netrem_gexpr_demo.png index cbf20f4..b00c556 100644 Binary files a/netrem_gexpr_demo.png and b/netrem_gexpr_demo.png differ diff --git a/netrem_pipeline.PNG b/netrem_pipeline.PNG index dbb4328..add34f4 100644 Binary files a/netrem_pipeline.PNG and b/netrem_pipeline.PNG differ diff --git a/output_3_1.png b/output_3_1.png index e3be997..d92f453 100644 Binary files a/output_3_1.png and b/output_3_1.png differ diff --git a/output_3_2.png b/output_3_2.png deleted file mode 100644 index 19c0da7..0000000 Binary files a/output_3_2.png and /dev/null differ diff --git a/output_3_5.png b/output_3_5.png deleted file mode 100644 index 8bb31f1..0000000 Binary files a/output_3_5.png and /dev/null differ diff --git a/user_guide/Dummy_Data_Demo_Example.pdf b/user_guide/Dummy_Data_Demo_Example.pdf new file mode 100644 index 0000000..46a2e57 Binary files /dev/null and b/user_guide/Dummy_Data_Demo_Example.pdf differ diff --git a/user_guide/netremCV.ipynb b/user_guide/netremCV.ipynb index 849d1cc..434c3c9 100644 --- a/user_guide/netremCV.ipynb +++ b/user_guide/netremCV.ipynb @@ -6,11 +6,46 @@ "metadata": {}, "source": [ "# netremCV\n", - "Cross-validation approach for estimating the optimal $\\beta_{net}$ and $\\alpha_{lasso}$.\n", + "## By: Saniya Khullar\n", + "Cross-validation approach for estimating the optimal $\\beta_{net}$ and $\\alpha_{lasso}$ for NetREm models.\n", "\n", "Selection for $\\beta_{net}$ can impact the optimal values for $\\alpha_{lasso}$" ] }, + { + "cell_type": "markdown", + "id": "bc3c1688", + "metadata": {}, + "source": [ + "netremCV(`edge_list`,
`X`, # *gene expression data for the predictors (e.g. Transcription Factors (TFs))*
\n", + "`y`, # *gene expression data for the response variable (e.g. target gene (TG))*
\n", + " `num_beta`: int = 10,
\n", + " `extra_beta_list` = [0.25, 0.5, 0.75, 1], # *additional beta to try out*
\n", + " `num_alpha`: int = 10,
\n", + " `max_beta`: float = 200, # *max_beta used to help prevent explosion of beta_net values*
\n", + " `reduced_cv_search`: bool = False, # *should we do a reduced search (Randomized Search) or a GridSearch?*
\n", + " `default_edge_weight`: float = 0.1,
\n", + " `degree_threshold`: float = 0.5,
\n", + " `gene_expression_nodes` = [],
\n", + " `overlapped_nodes_only`: bool = False,
\n", + " `standardize_X`: bool = True,
\n", + " `center_y`: bool = True,
\n", + " `y_intercept`: bool = False,
\n", + " `model_type` = \"Lasso\",
\n", + " `lasso_selection` = \"cyclic\", # *default in sklearn*
\n", + " `all_pos_coefs`: bool = False,
\n", + " `tolerance`: float = 1e-4,
\n", + " `maxit`: int = 10000,
\n", + " `num_jobs`: int = -1,
\n", + " `num_cv_folds`: int = 5,
\n", + " `lassocv_eps`: float = 1e-3, # *default in sklearn*
\n", + " `lassocv_n_alphas`: int = 100, # *default in sklearn*
\n", + " `lassocv_alphas` = None, # *default in sklearn*
\n", + " `verbose` = False,
\n", + " `searchVerbosity`: int = 2,
\n", + " `show_warnings`: bool = False
):" + ] + }, { "cell_type": "code", "execution_count": 1, @@ -21,7 +56,27 @@ "name": "stdout", "output_type": "stream", "text": [ - ":) same_train_test_data = False\n", + ":) same_train_test_data = False\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c12fe758096c4becb7fb41acb546021a", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Generating predictors: 0%| | 0/5 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "" + ], + "text/plain": [ + " TF1 TF2 TF3 TF4 TF5\n", + "0 -0.133190 0.590034 -0.537113 0.085502 0.360832\n", + "1 0.441382 0.845980 -1.344182 0.328925 0.311403\n", + "2 0.671258 1.171499 1.013758 -0.952091 0.265659\n", + "3 -0.565290 0.719051 0.344798 1.276574 -0.349368\n", + "4 -1.410821 0.522239 -0.679817 1.101129 1.755733\n", + "... ... ... ... ... ...\n", + "69995 0.639805 -0.727337 0.895220 0.266462 -0.287123\n", + "69996 0.491223 1.649460 -1.260947 -0.452498 -0.300503\n", + "69997 -0.688052 -0.428763 -1.080820 -0.933508 0.795700\n", + "69998 -2.117304 -1.195660 1.409443 2.082779 1.809950\n", + "69999 2.024160 1.523225 -0.026287 -0.107088 -1.912545\n", + "\n", + "[70000 rows x 5 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "X_train" + ] + }, + { + "cell_type": "markdown", + "id": "8d994c65", + "metadata": {}, + "source": [ + "Input Protein-Protein Interaction (PPI) network relating TF predictors to each other:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, "id": "69496222", "metadata": {}, "outputs": [ @@ -93,7 +347,7 @@ " ['TF3', 'TF5']]" ] }, - "execution_count": 3, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -107,48 +361,33 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "id": "094a7227", "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - ":) using variance to define beta_net values\n", - "beta_min = 1.1506396943803596 and beta_max = 115.06396943803597\n" - ] - }, { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "ac3bddb5c58846a68761633cf7b46850", + "model_id": "8e1be7ac5c6241e49f0c9bfe72aa7300", "version_major": 2, "version_minor": 0 }, "text/plain": [ - ":) Generating beta_net and alpha_lasso pairs: 0%| | 0/50 [00:00#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 
NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=1.1506396943803596, alpha_lasso=0.051694590434151706, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x00000202249230A0>)
" + "
NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, standardize_X=True, center_y=True, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=0.25, alpha_lasso=0.0008985897297578544, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x00000234F23167A0>)
" ], "text/plain": [ - "NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=1.1506396943803596, alpha_lasso=0.051694590434151706, network=)" + "NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, standardize_X=True, center_y=True, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=0.25, alpha_lasso=0.0008985897297578544, network=)" ] }, - "execution_count": 4, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -450,20 +573,20 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "id": "15723a95", "metadata": {}, "outputs": [ { "data": { "text/html": [ - "
NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=1.1506396943803596, alpha_lasso=0.051694590434151706, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x00000202249230A0>)
" + "
NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, standardize_X=True, center_y=True, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=0.25, alpha_lasso=0.0008985897297578544, network=<PriorGraphNetwork.PriorGraphNetwork object at 0x00000234F23167A0>)
" ], "text/plain": [ - "NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=1.1506396943803596, alpha_lasso=0.051694590434151706, network=)" + "NetREmModel(overlapped_nodes_only=False, all_pos_coefs=False, standardize_X=True, center_y=True, y_intercept=False, max_lasso_iterations=10000, view_network=False, tolerance=0.0001, lasso_selection=cyclic, beta_net=0.25, alpha_lasso=0.0008985897297578544, network=)" ] }, - "execution_count": 5, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -474,7 +597,99 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, + "id": "928aa851", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
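| | info | input_data | TF1 | TF2 | TF3 | TF4 | TF5 |
|---|---|---|---|---|---|---|---|
| 0 | network regression coeff. with y: y | X_train | 0.614536 | 0.106068 | 0.007784 | -0.030845 | -0.298912 |
| 0 | corr (r) with y: y | X_train | 0.900178 | 0.496234 | 0.302551 | -0.203738 | -0.800527 |
| 0 | Absolute Value NetREm Coefficient Ranking | X_train | 1 | 3 | 5 | 4 | 2 |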
\n", + "
" + ], + "text/plain": [ + " info input_data TF1 TF2 \\\n", + "0 network regression coeff. with y: y X_train 0.614536 0.106068 \n", + "0 corr (r) with y: y X_train 0.900178 0.496234 \n", + "0 Absolute Value NetREm Coefficient Ranking X_train 1 3 \n", + "\n", + " TF3 TF4 TF5 \n", + "0 0.007784 -0.030845 -0.298912 \n", + "0 0.302551 -0.203738 -0.800527 \n", + "0 5 4 2 " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_demoCV.final_corr_vs_coef_df" + ] + }, + { + "cell_type": "code", + "execution_count": 8, "id": "c599f58e", "metadata": {}, "outputs": [ @@ -482,12 +697,14 @@ "data": { "text/plain": [ "{'info': 'NetREm Model',\n", - " 'alpha_lasso': 0.051694590434151706,\n", - " 'beta_net': 1.1506396943803596,\n", + " 'alpha_lasso': 0.0008985897297578544,\n", + " 'beta_net': 0.25,\n", " 'y_intercept': False,\n", " 'model_type': 'Lasso',\n", + " 'standardize_X': True,\n", + " 'center_y': True,\n", " 'max_lasso_iterations': 10000,\n", - " 'network': ,\n", + " 'network': ,\n", " 'verbose': False,\n", " 'all_pos_coefs': False,\n", " 'model_info': 'fitted_model :)',\n", @@ -496,7 +713,7 @@ " 'lasso_selection': 'cyclic'}" ] }, - "execution_count": 6, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -507,17 +724,17 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 9, "id": "7ba65d39", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0.13727397681026726" + "0.13363812169073666" ] }, - "execution_count": 7, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -528,17 +745,17 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 10, "id": "1a9f2f60", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0.13781162327050317" + "0.13487280608906413" ] }, - "execution_count": 8, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -549,7 +766,7 @@ }, { "cell_type": "code", - 
"execution_count": 9, + "execution_count": 11, "id": "758b4684", "metadata": {}, "outputs": [ @@ -578,6 +795,7 @@ " TF1\n", " TF2\n", " TF3\n", + " TF4\n", " TF5\n", " \n", " \n", @@ -585,21 +803,22 @@ " \n", " 0\n", " None\n", - " 0.277752\n", - " 0.063378\n", - " 0.00145\n", - " -0.159248\n", + " 0.614536\n", + " 0.106068\n", + " 0.007784\n", + " -0.030845\n", + " -0.298912\n", " \n", " \n", "\n", "" ], "text/plain": [ - " y_intercept TF1 TF2 TF3 TF5\n", - "0 None 0.277752 0.063378 0.00145 -0.159248" + " y_intercept TF1 TF2 TF3 TF4 TF5\n", + "0 None 0.614536 0.106068 0.007784 -0.030845 -0.298912" ] }, - "execution_count": 9, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -610,8 +829,193 @@ }, { "cell_type": "code", - "execution_count": 10, - "id": "7afe0c3f", + "execution_count": 12, + "id": "6ed7493d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
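| | coef | TF | TG | info | train_mse | beta_net | alpha_lasso | AbsoluteVal_coefficient | Rank | final_model_TFs | TFs_input_to_model | original_TFs_in_X | standardized_X | centered_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | y_intercept | y | netrem_no_intercept | 0.133638 | 0.25 | 0.000899 | NaN | 6 | 5 | 5 | 5 | True | True |
| 1 | 0.614536 | TF1 | y | netrem_no_intercept | 0.133638 | 0.25 | 0.000899 | 0.614536 | 1 | 5 | 5 | 5 | True | True |
| 2 | 0.106068 | TF2 | y | netrem_no_intercept | 0.133638 | 0.25 | 0.000899 | 0.106068 | 3 | 5 | 5 | 5 | True | True |
| 3 | 0.007784 | TF3 | y | netrem_no_intercept | 0.133638 | 0.25 | 0.000899 | 0.007784 | 5 | 5 | 5 | 5 | True | True |
| 4 | -0.030845 | TF4 | y | netrem_no_intercept | 0.133638 | 0.25 | 0.000899 | 0.030845 | 4 | 5 | 5 | 5 | True | True |
| 5 | -0.298912 | TF5 | y | netrem_no_intercept | 0.133638 | 0.25 | 0.000899 | 0.298912 | 2 | 5 | 5 | 5 | True | True |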
\n", + "
" + ], + "text/plain": [ + " coef TF TG info train_mse beta_net \\\n", + "0 None y_intercept y netrem_no_intercept 0.133638 0.25 \n", + "1 0.614536 TF1 y netrem_no_intercept 0.133638 0.25 \n", + "2 0.106068 TF2 y netrem_no_intercept 0.133638 0.25 \n", + "3 0.007784 TF3 y netrem_no_intercept 0.133638 0.25 \n", + "4 -0.030845 TF4 y netrem_no_intercept 0.133638 0.25 \n", + "5 -0.298912 TF5 y netrem_no_intercept 0.133638 0.25 \n", + "\n", + " alpha_lasso AbsoluteVal_coefficient Rank final_model_TFs \\\n", + "0 0.000899 NaN 6 5 \n", + "1 0.000899 0.614536 1 5 \n", + "2 0.000899 0.106068 3 5 \n", + "3 0.000899 0.007784 5 5 \n", + "4 0.000899 0.030845 4 5 \n", + "5 0.000899 0.298912 2 5 \n", + "\n", + " TFs_input_to_model original_TFs_in_X standardized_X centered_y \n", + "0 5 5 True True \n", + "1 5 5 True True \n", + "2 5 5 True True \n", + "3 5 5 True True \n", + "4 5 5 True True \n", + "5 5 5 True True " + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "netrem_demoCV.combined_df" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "a2e8cd9f", "metadata": {}, "outputs": [ { @@ -645,58 +1049,58 @@ " \n", " \n", " TF1\n", - " 5.414075\n", - " 1.113261\n", - " 0.361368\n", - " -0.429737\n", - " -2.771125\n", + " 1.023976\n", + " 0.426869\n", + " 0.207654\n", + " -0.186497\n", + " -0.721913\n", " \n", " \n", " TF2\n", - " 1.113261\n", - " 1.439763\n", - " -0.107738\n", - " -0.125070\n", - " -0.773506\n", + " 0.426869\n", + " 1.023976\n", + " 0.082365\n", + " -0.104892\n", + " -0.400304\n", " \n", " \n", " TF3\n", - " 0.361368\n", - " -0.107738\n", - " 37.921695\n", - " -0.359079\n", - " -0.715906\n", + " 0.207654\n", + " 0.082365\n", + " 9.000000\n", + " -0.128178\n", + " -0.303677\n", " \n", " \n", " TF4\n", - " -0.429737\n", - " -0.125070\n", - " -0.359079\n", - " 1.139272\n", - " 0.197285\n", + " -0.186497\n", + " -0.104892\n", + " -0.128178\n", + " 1.020979\n", + " 0.148067\n", " \n", " \n", 
" TF5\n", - " -2.771125\n", - " -0.773506\n", - " -0.715906\n", - " 0.197285\n", - " 2.877027\n", + " -0.721913\n", + " -0.400304\n", + " -0.303677\n", + " 0.148067\n", + " 1.020979\n", " \n", " \n", "\n", "" ], "text/plain": [ - " TF1 TF2 TF3 TF4 TF5\n", - "TF1 5.414075 1.113261 0.361368 -0.429737 -2.771125\n", - "TF2 1.113261 1.439763 -0.107738 -0.125070 -0.773506\n", - "TF3 0.361368 -0.107738 37.921695 -0.359079 -0.715906\n", - "TF4 -0.429737 -0.125070 -0.359079 1.139272 0.197285\n", - "TF5 -2.771125 -0.773506 -0.715906 0.197285 2.877027" + " TF1 TF2 TF3 TF4 TF5\n", + "TF1 1.023976 0.426869 0.207654 -0.186497 -0.721913\n", + "TF2 0.426869 1.023976 0.082365 -0.104892 -0.400304\n", + "TF3 0.207654 0.082365 9.000000 -0.128178 -0.303677\n", + "TF4 -0.186497 -0.104892 -0.128178 1.020979 0.148067\n", + "TF5 -0.721913 -0.400304 -0.303677 0.148067 1.020979" ] }, - "execution_count": 10, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } @@ -708,7 +1112,7 @@ { "cell_type": "code", "execution_count": 14, - "id": "64bd24b1", + "id": "6bb4d465", "metadata": {}, "outputs": [ { @@ -744,6 +1148,7 @@ " num_final_predictors\n", " model_type\n", " beta_net\n", + " X_standardized\n", " gene_data\n", " rank\n", " percentile\n", @@ -754,16 +1159,17 @@ " 20\n", " TF1\n", " TF5\n", - " -2.771125\n", + " -0.721913\n", " :(\n", " :( competitive (-)\n", - " 2.771125\n", + " 0.721913\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 1.0\n", " 95.0\n", @@ -772,16 +1178,17 @@ " 4\n", " TF5\n", " TF1\n", - " -2.771125\n", + " -0.721913\n", " :(\n", " :( competitive (-)\n", - " 2.771125\n", + " 0.721913\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 1.0\n", " 95.0\n", @@ -790,16 +1197,17 @@ " 5\n", " TF1\n", " TF2\n", - " 
1.113261\n", + " 0.426869\n", " :)\n", " :(\n", - " 1.113261\n", + " 0.426869\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 3.0\n", " 85.0\n", @@ -808,16 +1216,17 @@ " 1\n", " TF2\n", " TF1\n", - " 1.113261\n", + " 0.426869\n", " :)\n", " :(\n", - " 1.113261\n", + " 0.426869\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 3.0\n", " 85.0\n", @@ -826,16 +1235,17 @@ " 9\n", " TF5\n", " TF2\n", - " -0.773506\n", + " -0.400304\n", " :(\n", " :( competitive (-)\n", - " 0.773506\n", + " 0.400304\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 5.0\n", " 75.0\n", @@ -844,16 +1254,17 @@ " 21\n", " TF2\n", " TF5\n", - " -0.773506\n", + " -0.400304\n", " :(\n", " :( competitive (-)\n", - " 0.773506\n", + " 0.400304\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 5.0\n", " 75.0\n", @@ -862,16 +1273,17 @@ " 14\n", " TF5\n", " TF3\n", - " -0.715906\n", + " -0.303677\n", " :(\n", " :( competitive (-)\n", - " 0.715906\n", + " 0.303677\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 7.0\n", " 65.0\n", @@ -880,160 +1292,169 @@ " 22\n", " TF3\n", " TF5\n", - " -0.715906\n", + " -0.303677\n", " :(\n", " :( competitive (-)\n", - " 0.715906\n", + " 0.303677\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 7.0\n", " 65.0\n", " \n", " \n", - " 15\n", + " 2\n", + " TF3\n", " 
TF1\n", - " TF4\n", - " -0.429737\n", + " 0.207654\n", + " :)\n", " :(\n", - " :( competitive (-)\n", - " 0.429737\n", + " 0.207654\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 9.0\n", " 55.0\n", " \n", " \n", - " 3\n", - " TF4\n", + " 10\n", " TF1\n", - " -0.429737\n", + " TF3\n", + " 0.207654\n", + " :)\n", " :(\n", - " :( competitive (-)\n", - " 0.429737\n", + " 0.207654\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 9.0\n", " 55.0\n", " \n", " \n", - " 2\n", - " TF3\n", + " 15\n", " TF1\n", - " 0.361368\n", - " :)\n", + " TF4\n", + " -0.186497\n", " :(\n", - " 0.361368\n", + " :( competitive (-)\n", + " 0.186497\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 11.0\n", " 45.0\n", " \n", " \n", - " 10\n", + " 3\n", + " TF4\n", " TF1\n", - " TF3\n", - " 0.361368\n", - " :)\n", + " -0.186497\n", " :(\n", - " 0.361368\n", + " :( competitive (-)\n", + " 0.186497\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 11.0\n", " 45.0\n", " \n", " \n", - " 13\n", + " 19\n", + " TF5\n", " TF4\n", - " TF3\n", - " -0.359079\n", + " 0.148067\n", + " :)\n", " :(\n", - " :( competitive (-)\n", - " 0.359079\n", + " 0.148067\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 13.0\n", " 35.0\n", " \n", " \n", - " 17\n", - " TF3\n", + " 23\n", " TF4\n", - " -0.359079\n", + " TF5\n", + " 0.148067\n", + " :)\n", " :(\n", - " :( competitive (-)\n", - " 0.359079\n", + " 0.148067\n", " B matrix of 
TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 13.0\n", " 35.0\n", " \n", " \n", - " 19\n", - " TF5\n", + " 13\n", " TF4\n", - " 0.197285\n", - " :)\n", + " TF3\n", + " -0.128178\n", " :(\n", - " 0.197285\n", + " :( competitive (-)\n", + " 0.128178\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 15.0\n", " 25.0\n", " \n", " \n", - " 23\n", + " 17\n", + " TF3\n", " TF4\n", - " TF5\n", - " 0.197285\n", - " :)\n", + " -0.128178\n", " :(\n", - " 0.197285\n", + " :( competitive (-)\n", + " 0.128178\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 15.0\n", " 25.0\n", @@ -1042,16 +1463,17 @@ " 8\n", " TF4\n", " TF2\n", - " -0.125070\n", + " -0.104892\n", " :(\n", " :( competitive (-)\n", - " 0.125070\n", + " 0.104892\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 17.0\n", " 15.0\n", @@ -1060,16 +1482,17 @@ " 16\n", " TF2\n", " TF4\n", - " -0.125070\n", + " -0.104892\n", " :(\n", " :( competitive (-)\n", - " 0.125070\n", + " 0.104892\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 17.0\n", " 15.0\n", @@ -1078,16 +1501,17 @@ " 7\n", " TF3\n", " TF2\n", - " -0.107738\n", + " 0.082365\n", + " :)\n", " :(\n", - " :( competitive (-)\n", - " 0.107738\n", + " 0.082365\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 19.0\n", " 5.0\n", @@ -1096,16 +1520,17 @@ " 11\n", " TF2\n", " 
TF3\n", - " -0.107738\n", + " 0.082365\n", + " :)\n", " :(\n", - " :( competitive (-)\n", - " 0.107738\n", + " 0.082365\n", " B matrix of TF-TF interactions\n", " 5\n", " y\n", - " 4\n", + " 5\n", " Lasso\n", - " 1.15064\n", + " 0.25\n", + " True\n", " training gene expression data\n", " 19.0\n", " 5.0\n", @@ -1116,26 +1541,26 @@ ], "text/plain": [ " TF1 TF2 B_train_weight sign potential_interaction absVal_B \\\n", - "20 TF1 TF5 -2.771125 :( :( competitive (-) 2.771125 \n", - "4 TF5 TF1 -2.771125 :( :( competitive (-) 2.771125 \n", - "5 TF1 TF2 1.113261 :) :( 1.113261 \n", - "1 TF2 TF1 1.113261 :) :( 1.113261 \n", - "9 TF5 TF2 -0.773506 :( :( competitive (-) 0.773506 \n", - "21 TF2 TF5 -0.773506 :( :( competitive (-) 0.773506 \n", - "14 TF5 TF3 -0.715906 :( :( competitive (-) 0.715906 \n", - "22 TF3 TF5 -0.715906 :( :( competitive (-) 0.715906 \n", - "15 TF1 TF4 -0.429737 :( :( competitive (-) 0.429737 \n", - "3 TF4 TF1 -0.429737 :( :( competitive (-) 0.429737 \n", - "2 TF3 TF1 0.361368 :) :( 0.361368 \n", - "10 TF1 TF3 0.361368 :) :( 0.361368 \n", - "13 TF4 TF3 -0.359079 :( :( competitive (-) 0.359079 \n", - "17 TF3 TF4 -0.359079 :( :( competitive (-) 0.359079 \n", - "19 TF5 TF4 0.197285 :) :( 0.197285 \n", - "23 TF4 TF5 0.197285 :) :( 0.197285 \n", - "8 TF4 TF2 -0.125070 :( :( competitive (-) 0.125070 \n", - "16 TF2 TF4 -0.125070 :( :( competitive (-) 0.125070 \n", - "7 TF3 TF2 -0.107738 :( :( competitive (-) 0.107738 \n", - "11 TF2 TF3 -0.107738 :( :( competitive (-) 0.107738 \n", + "20 TF1 TF5 -0.721913 :( :( competitive (-) 0.721913 \n", + "4 TF5 TF1 -0.721913 :( :( competitive (-) 0.721913 \n", + "5 TF1 TF2 0.426869 :) :( 0.426869 \n", + "1 TF2 TF1 0.426869 :) :( 0.426869 \n", + "9 TF5 TF2 -0.400304 :( :( competitive (-) 0.400304 \n", + "21 TF2 TF5 -0.400304 :( :( competitive (-) 0.400304 \n", + "14 TF5 TF3 -0.303677 :( :( competitive (-) 0.303677 \n", + "22 TF3 TF5 -0.303677 :( :( competitive (-) 0.303677 \n", + "2 TF3 TF1 0.207654 :) :( 0.207654 \n", + "10 
TF1 TF3 0.207654 :) :( 0.207654 \n", + "15 TF1 TF4 -0.186497 :( :( competitive (-) 0.186497 \n", + "3 TF4 TF1 -0.186497 :( :( competitive (-) 0.186497 \n", + "19 TF5 TF4 0.148067 :) :( 0.148067 \n", + "23 TF4 TF5 0.148067 :) :( 0.148067 \n", + "13 TF4 TF3 -0.128178 :( :( competitive (-) 0.128178 \n", + "17 TF3 TF4 -0.128178 :( :( competitive (-) 0.128178 \n", + "8 TF4 TF2 -0.104892 :( :( competitive (-) 0.104892 \n", + "16 TF2 TF4 -0.104892 :( :( competitive (-) 0.104892 \n", + "7 TF3 TF2 0.082365 :) :( 0.082365 \n", + "11 TF2 TF3 0.082365 :) :( 0.082365 \n", "\n", " info candidate_TFs_N target_gene_y \\\n", "20 B matrix of TF-TF interactions 5 y \n", @@ -1146,62 +1571,62 @@ "21 B matrix of TF-TF interactions 5 y \n", "14 B matrix of TF-TF interactions 5 y \n", "22 B matrix of TF-TF interactions 5 y \n", - "15 B matrix of TF-TF interactions 5 y \n", - "3 B matrix of TF-TF interactions 5 y \n", "2 B matrix of TF-TF interactions 5 y \n", "10 B matrix of TF-TF interactions 5 y \n", - "13 B matrix of TF-TF interactions 5 y \n", - "17 B matrix of TF-TF interactions 5 y \n", + "15 B matrix of TF-TF interactions 5 y \n", + "3 B matrix of TF-TF interactions 5 y \n", "19 B matrix of TF-TF interactions 5 y \n", "23 B matrix of TF-TF interactions 5 y \n", + "13 B matrix of TF-TF interactions 5 y \n", + "17 B matrix of TF-TF interactions 5 y \n", "8 B matrix of TF-TF interactions 5 y \n", "16 B matrix of TF-TF interactions 5 y \n", "7 B matrix of TF-TF interactions 5 y \n", "11 B matrix of TF-TF interactions 5 y \n", "\n", - " num_final_predictors model_type beta_net gene_data \\\n", - "20 4 Lasso 1.15064 training gene expression data \n", - "4 4 Lasso 1.15064 training gene expression data \n", - "5 4 Lasso 1.15064 training gene expression data \n", - "1 4 Lasso 1.15064 training gene expression data \n", - "9 4 Lasso 1.15064 training gene expression data \n", - "21 4 Lasso 1.15064 training gene expression data \n", - "14 4 Lasso 1.15064 training gene expression data \n", - "22 
4 Lasso 1.15064 training gene expression data \n", - "15 4 Lasso 1.15064 training gene expression data \n", - "3 4 Lasso 1.15064 training gene expression data \n", - "2 4 Lasso 1.15064 training gene expression data \n", - "10 4 Lasso 1.15064 training gene expression data \n", - "13 4 Lasso 1.15064 training gene expression data \n", - "17 4 Lasso 1.15064 training gene expression data \n", - "19 4 Lasso 1.15064 training gene expression data \n", - "23 4 Lasso 1.15064 training gene expression data \n", - "8 4 Lasso 1.15064 training gene expression data \n", - "16 4 Lasso 1.15064 training gene expression data \n", - "7 4 Lasso 1.15064 training gene expression data \n", - "11 4 Lasso 1.15064 training gene expression data \n", + " num_final_predictors model_type beta_net X_standardized \\\n", + "20 5 Lasso 0.25 True \n", + "4 5 Lasso 0.25 True \n", + "5 5 Lasso 0.25 True \n", + "1 5 Lasso 0.25 True \n", + "9 5 Lasso 0.25 True \n", + "21 5 Lasso 0.25 True \n", + "14 5 Lasso 0.25 True \n", + "22 5 Lasso 0.25 True \n", + "2 5 Lasso 0.25 True \n", + "10 5 Lasso 0.25 True \n", + "15 5 Lasso 0.25 True \n", + "3 5 Lasso 0.25 True \n", + "19 5 Lasso 0.25 True \n", + "23 5 Lasso 0.25 True \n", + "13 5 Lasso 0.25 True \n", + "17 5 Lasso 0.25 True \n", + "8 5 Lasso 0.25 True \n", + "16 5 Lasso 0.25 True \n", + "7 5 Lasso 0.25 True \n", + "11 5 Lasso 0.25 True \n", "\n", - " rank percentile \n", - "20 1.0 95.0 \n", - "4 1.0 95.0 \n", - "5 3.0 85.0 \n", - "1 3.0 85.0 \n", - "9 5.0 75.0 \n", - "21 5.0 75.0 \n", - "14 7.0 65.0 \n", - "22 7.0 65.0 \n", - "15 9.0 55.0 \n", - "3 9.0 55.0 \n", - "2 11.0 45.0 \n", - "10 11.0 45.0 \n", - "13 13.0 35.0 \n", - "17 13.0 35.0 \n", - "19 15.0 25.0 \n", - "23 15.0 25.0 \n", - "8 17.0 15.0 \n", - "16 17.0 15.0 \n", - "7 19.0 5.0 \n", - "11 19.0 5.0 " + " gene_data rank percentile \n", + "20 training gene expression data 1.0 95.0 \n", + "4 training gene expression data 1.0 95.0 \n", + "5 training gene expression data 3.0 85.0 \n", + "1 training gene 
expression data 3.0 85.0 \n", + "9 training gene expression data 5.0 75.0 \n", + "21 training gene expression data 5.0 75.0 \n", + "14 training gene expression data 7.0 65.0 \n", + "22 training gene expression data 7.0 65.0 \n", + "2 training gene expression data 9.0 55.0 \n", + "10 training gene expression data 9.0 55.0 \n", + "15 training gene expression data 11.0 45.0 \n", + "3 training gene expression data 11.0 45.0 \n", + "19 training gene expression data 13.0 35.0 \n", + "23 training gene expression data 13.0 35.0 \n", + "13 training gene expression data 15.0 25.0 \n", + "17 training gene expression data 15.0 25.0 \n", + "8 training gene expression data 17.0 15.0 \n", + "16 training gene expression data 17.0 15.0 \n", + "7 training gene expression data 19.0 5.0 \n", + "11 training gene expression data 19.0 5.0 " ] }, "execution_count": 14, @@ -1210,17 +1635,8 @@ } ], "source": [ - "b_matrix = nm.organize_B_interaction_network(netrem_demoCV)\n", - "b_matrix" + "organize_B_interaction_network(netrem_demoCV)" ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "b5ca8aca", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { @@ -1239,7 +1655,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.6" + "version": "3.10.9" } }, "nbformat": 4, diff --git a/user_guide/overlapped_nodes_only.pdf b/user_guide/overlapped_nodes_only.pdf new file mode 100644 index 0000000..2323b61 Binary files /dev/null and b/user_guide/overlapped_nodes_only.pdf differ