BiFuncLib's documentation

BiFuncLib is a Python package that aggregates multiple biclustering methods mainly for functional data.

This Page

CVX

Method Description

  • Sparse Data Representation

Convex Biclustering operates on high-dimensional data matrices where rows typically represent samples (e.g., subjects) and columns represent variables (e.g., genes). The data matrix is often centered to ensure that each column has zero mean. This preprocessing step helps in identifying interpretable patterns by removing the overall mean effect.

  • Model Definition

Convex Biclustering seeks to identify groups of rows and columns in a data matrix that are associated with each other. The goal is to partition the matrix into a checkerboard pattern, where each block represents a bicluster. The model assumes that the observed data matrix \(X \in \mathbb{R}^{p \times n}\) can be approximated by a mean matrix \(U\) with a checkerboard structure. Specifically, the model aims to minimize the following convex criterion:

\[F_\gamma(U) = \frac{1}{2} \|X - U\|_F^2 + \gamma \left( \sum_{i < j} w_{ij} \|U_{\cdot i} - U_{\cdot j}\|_2^2 + \sum_{i < j} \tilde{w}_{ij} \|U_{i \cdot} - U_{j \cdot}\|_2^2 \right),\]

where \(\| \cdot \|_F\) denotes the Frobenius norm, \(w_{ij}\) and \(\tilde{w}_{ij}\) are non-negative weights, and \(\gamma \geq 0\) is a regularization parameter that controls the trade-off between the goodness-of-fit and the complexity of the biclustering.

  • Iterative Algorithm for Computation (COBRA)

The Convex Biclustering problem is solved using the Convex BiclusteRing Algorithm (COBRA), which is an iterative algorithm based on the Dykstra-like proximal algorithm (DLPA). The steps are as follows:

  1. Initialization: Set \(U^{(0)} = X\), \(P^{(0)} = 0\), and \(Q^{(0)} = 0\).

  2. Iterate: - Convex Clustering of Rows: \(Y^{(m)} = \text{prox}_{\gamma \tilde{W}}(U^{(m)T} + P^{(m)T})\) - Update \(P^{(m+1)} = U^{(m)} + P^{(m)} - Y^{(m)T}\) - Convex Clustering of Columns: \(U^{(m+1)} = \text{prox}_{\gamma W}(Y^{(m)T} + Q^{(m)T})\) - Update \(Q^{(m+1)} = Y^{(m)} + Q^{(m)} - U^{(m+1)T}\)

  3. Convergence Check: Repeat the updates until convergence, typically when \(\|U^{(m)} - Y^{(m)}\|_F\) falls below a tolerance \(\tau > 0\).

The proximal mapping \(\text{prox}_{\gamma W}(Z)\) is defined as:

\[\text{prox}_{\gamma W}(Z) = \arg \min_V \left\{ \frac{1}{2} \|Z - V\|_F^2 + \gamma \sum_{i < j} w_{ij} \|V_{\cdot i} - V_{\cdot j}\|_2^2 \right\}\]
  • Penalty Parameter Selection

The regularization parameter \(\gamma\) controls the number of biclusters and the complexity of the solution. To select \(\gamma\) in a data-driven manner, we use a hold-out validation approach. A subset of the data matrix is randomly selected as a validation set, and the model is evaluated on how well it predicts the hold-out set. The optimal \(\gamma\) is chosen to minimize the prediction error over the hold-out set.

  • Post-processing for Interpretation

After obtaining the biclustering solution \(U_\gamma\), the resulting matrix can be reordered to reveal the checkerboard pattern. The biclusters can be visualized using heatmaps, where each block corresponds to a bicluster. This visualization helps in interpreting the relationships between different groups of samples and variables.

Function

This method provides three core functions: cvx_sim_data, cvx_biclus_missing and cvx_biclus_valid. For modeling, The cvx_biclus_missing function performs convex biclustering on incomplete data matrices using an MM algorithm, while cvx_biclus_valid serves as an MM algorithm wrapper for parameter selection. Both functions have built-in plotting parameters that can output graphs of the clustering results, and cvx_biclus_valid can additionally display a graph of the validation error as a function of γ.

cvx_sim_data

cvx_sim_data loads a data matrix according to the CVX model.

cvx_sim_data()

Parameter

The simulated data are loaded internally and have no adjustable parameters.

Value

A data matrix of dimension 100 x 56. (dataframe)

Example

from BiFuncLib.simulation_data import cvx_sim_data
cvx_simdata = cvx_sim_data()

gkn_weights

Before introducing the modeling functions, we introduce gkn_weights, which combines Gaussian kernel weights with k-nearest neighbor weights.

gkn_weights(X, phi=0.5, k_row=5, k_col=5)

Parameter

Parameter

Description

X

dataframe, the data matrix to be clustered. The rows are the features, and the columns are the samples..

phi

numeric, the nonnegative parameter that controls the scale of kernel weights. Default is 0.5.

k_row

integer, The number of row nearest neighbors. Default is 5.

k_col

integer, the number of column nearest neighbors. Default is 5.

Value

The function gkn_weights outputs a dict contains parameters E_row, E_col, w_row and w_col. Detailed explanations for these entries will be provided in the modeling functions below.

Example

from BiFuncLib.cvx_main_func import gkn_weights
wts = gkn_weights(data, phi=phi, k_row=k, k_col=k)
w_row = wts["w_row"]
w_col = wts["w_col"]
E_row = wts["E_row"]
E_col = wts["E_col"]

cvx_biclus_missing

cvx_biclus_missing(data, E_row, E_col, w_row, w_col, gam, Lambda_row, Lambda_col, Theta, max_iter=100,
                   tol=1e-3, max_iter_inner=1000, tol_inner=1e-4, plot_clus = True)

Parameter

Parameter

Description

data

dataframe, the data matrix to be clustered. The rows are the features, and the columns are the samples.

E_row

sparse.csc.csc_matrix, edge-incidence matrix for row graph.

E_col

sparse.csc.csc_matrix, edge-incidence matrix for column graph.

w_row

array, vector of weights for row graph.

w_col

array, vector of weights for column graph.

gam

integer, path parameters and structures.

Lambda_row

array, initial guess of row Langrage multipliers.

Lambda_col

array, initial guess of column Langrage multipliers.

Theta

list, a list of missing indices.

max_iter

integer, maximum number of iterations. Default is 100.

tol

numeric, stopping criterion. Default is 1e-3.

max_iter_inner

integer, maximum number of inner cobra iterations. Default is 1000.

tol_inner

numeric, stopping criterion for inner cobra loop. Default is 1e-4.

plot_clus

bool, whether to plot clustering results. Default is True.

Value

The function cvx_biclus_missing outputs a dict contains clustering results and parameters.

  • iter: integer, the current iteration number of the algorithm.

  • Lambda_col: array, a matrix of regularization parameters for column clustering across different iterations.

  • Lambda_row: array, a matrix of regularization parameters for row clustering across different iterations.

  • mm_loss: list, the loss values from the model at different iterations.

  • nu_col: numeric, a parameter or statistic related to column clustering, possibly a normalization factor.

  • nu_row: numeric, a parameter or statistic related to row clustering, possibly a normalization factor.

  • U: array, a matrix representing the estimated mean values, possibly including imputed values for missing data.

  • V_col: array, a matrix representing the column-wise clustering assignments or indicators.

  • V_row: array, a matrix representing the row-wise clustering assignments or indicators.

Also, when plot_clus=True, it displays the clustering results.

../_images/cvx_missing.png

Example

import numpy as np
import random
import math
from BiFuncLib.simulation_data import cvx_sim_data
from BiFuncLib.cvx_main_func import gkn_weights
from BiFuncLib.cvx_biclus import cvx_biclus_valid, cvx_biclus_missing
cvx_simdata = cvx_sim_data()
X = cvx_simdata.copy()
X = X - np.mean(np.mean(X))
X = X / np.linalg.norm(X, 'fro')
data = X
phi = 0.5
k = 5
print("=== Example 1 ===")
wts = gkn_weights(X, phi=phi, k_row=k, k_col=k)
E_row = wts["E_row"]
E_col = wts["E_col"]
gam = 200
m_row = E_row.shape[0]
m_col = E_col.shape[0]
n = X.shape[1]
p = X.shape[0]
Lambda_row = np.random.randn(n, m_row)
Lambda_col = np.random.randn(p, m_col)
Theta = random.sample(range(1, n*p+1), math.floor(0.1 * n * p))
wts = gkn_weights(data, phi=phi, k_row=k, k_col=k)
w_row = wts["w_row"]
w_col = wts["w_col"]
E_row = wts["E_row"]
E_col = wts["E_col"]
cvx_res1 = cvx_biclus_missing(data, E_row, E_col, w_row, w_col, gam, Lambda_row, Lambda_col, Theta)

cvx_biclus_valid

cvx_biclus_valid(data, E_row, E_col, w_row, w_col, gamma, Lambda_row=None, Lambda_col=None,
                 fraction=0.1, max_iter=100, tol=1e-3, max_iter_inner=1000, tol_inner=1e-4,
                 smooth = False, plot_error = True, plot_clus = True)

Parameter

Parameter

Description

data

dataframe, the data matrix to be clustered. The rows are the features, and the columns are the samples.

E_row

sparse.csc.csc_matrix, edge-incidence matrix for row graph.

E_col

sparse.csc.csc_matrix, edge-incidence matrix for column graph.

w_row

array, vector of weights for row graph.

w_col

array, vector of weights for column graph.

gamma

array, a sequence of regularization parameter for shrinkage.

Lambda_row

array, initial guess of row Langrage multipliers.

Lambda_col

array, initial guess of column Langrage multipliers.

fraction

numeric, fraction of entries for hold out.

max_iter

integer, maximum number of iterations. Default is 100.

tol

numeric, stopping criterion. Default is 1e-3.

max_iter_inner

integer, maximum number of inner cobra iterations. Default is 1000.

tol_inner

numeric, stopping criterion for inner cobra loop. Default is 1e-4.

smooth

bool, whether to smooth the clustering results. Default is False.

plot_error

bool, whether to plot a graph of the validation error as a function of γ. Default is True.

plot_clus

bool, whether to plot clustering results. Default is True.

Value

The function cvx_biclus_valid outputs a dict contains clustering results and parameters.

  • groups_col: list, a list containing cluster assignments and sizes for columns.

  • groups_row: list, a list containing cluster assignments and sizes for rows.

  • ThetaM: array, a matrix representing the row cluster centroids.

  • ThetaV: array, a matrix representing the column cluster centroids.

  • U: list, a list of matrices representing the estimated mean matrix for each iteration of the algorithm.

  • V_col: list, a list of matrices representing the column-wise clustering assignments.

  • V_row: list, a list of matrices representing the row-wise clustering assignments.

  • validation_error: array, an array storing the validation errors for different values of the regularization parameter γ.

When plot_error=True, it displays:

../_images/cvx_error.png

When plot_clus=True and smooth=False, it displays:

../_images/cvx_val.png

When plot_clus=True and smooth=True, it displays:

../_images/cvx_smooth.png

Example

import numpy as np
import random
import math
from BiFuncLib.simulation_data import cvx_sim_data
from BiFuncLib.cvx_main_func import gkn_weights
from BiFuncLib.cvx_biclus import cvx_biclus_valid, cvx_biclus_missing
cvx_simdata = cvx_sim_data()
X = cvx_simdata.copy()
X = X - np.mean(np.mean(X))
X = X / np.linalg.norm(X, 'fro')
data = X
phi = 0.5
k = 5
print("=== Example 2 ===")
nGamma = 5
gammaSeq = 10 ** np.linspace(0, 3, nGamma)
wts = gkn_weights(data, phi=phi, k_row=k, k_col=k)
w_row = wts["w_row"]
w_col = wts["w_col"]
E_row = wts["E_row"]
E_col = wts["E_col"]
cvx_res2 = cvx_biclus_valid(data, E_row, E_col, w_row, w_col, gammaSeq, plot_error = False)
print("=== Example 3 ===")
nGamma = 7
gammaSeq = 10 ** np.linspace(0, 1, nGamma)
cvx_res3 = cvx_biclus_valid(data, E_row, E_col, w_row, w_col, gammaSeq, smooth = True)

Previous: SSVD | Next: References