Table of Contents

introduction
installation
Methods and Main Functions
- FunFEM
- FunLBM
- FunCC
- FunPF
- FunSparse
- FunSAS
- FunLocal
- Bimax
- SSVD
- CVX
References
Contribution

This Page

CVX¶

Method Description¶

Sparse Data Representation

Convex Biclustering operates on high-dimensional data matrices where rows typically represent samples (e.g., subjects) and columns represent variables (e.g., genes). The data matrix is often centered to ensure that each column has zero mean. This preprocessing step helps in identifying interpretable patterns by removing the overall mean effect.

Model Definition

Convex Biclustering seeks to identify groups of rows and columns in a data matrix that are associated with each other. The goal is to partition the matrix into a checkerboard pattern, where each block represents a bicluster. The model assumes that the observed data matrix \(X \in \mathbb{R}^{p \times n}\) can be approximated by a mean matrix \(U\) with a checkerboard structure. Specifically, the model aims to minimize the following convex criterion:

\[F_\gamma(U) = \frac{1}{2} \|X - U\|_F^2 + \gamma \left( \sum_{i < j} w_{ij} \|U_{\cdot i} - U_{\cdot j}\|_2^2 + \sum_{i < j} \tilde{w}_{ij} \|U_{i \cdot} - U_{j \cdot}\|_2^2 \right),\]

where \(\| \cdot \|_F\) denotes the Frobenius norm, \(w_{ij}\) and \(\tilde{w}_{ij}\) are non-negative weights, and \(\gamma \geq 0\) is a regularization parameter that controls the trade-off between the goodness-of-fit and the complexity of the biclustering.

Iterative Algorithm for Computation (COBRA)

The Convex Biclustering problem is solved using the Convex BiclusteRing Algorithm (COBRA), which is an iterative algorithm based on the Dykstra-like proximal algorithm (DLPA). The steps are as follows:

Initialization: Set \(U^{(0)} = X\), \(P^{(0)} = 0\), and \(Q^{(0)} = 0\).

Iterate: - Convex Clustering of Rows: \(Y^{(m)} = \text{prox}_{\gamma \tilde{W}}(U^{(m)T} + P^{(m)T})\) - Update \(P^{(m+1)} = U^{(m)} + P^{(m)} - Y^{(m)T}\) - Convex Clustering of Columns: \(U^{(m+1)} = \text{prox}_{\gamma W}(Y^{(m)T} + Q^{(m)T})\) - Update \(Q^{(m+1)} = Y^{(m)} + Q^{(m)} - U^{(m+1)T}\)

Convergence Check: Repeat the updates until convergence, typically when \(\|U^{(m)} - Y^{(m)}\|_F\) falls below a tolerance \(\tau > 0\).

The proximal mapping \(\text{prox}_{\gamma W}(Z)\) is defined as:

\[\text{prox}_{\gamma W}(Z) = \arg \min_V \left\{ \frac{1}{2} \|Z - V\|_F^2 + \gamma \sum_{i < j} w_{ij} \|V_{\cdot i} - V_{\cdot j}\|_2^2 \right\}\]

Penalty Parameter Selection

The regularization parameter \(\gamma\) controls the number of biclusters and the complexity of the solution. To select \(\gamma\) in a data-driven manner, we use a hold-out validation approach. A subset of the data matrix is randomly selected as a validation set, and the model is evaluated on how well it predicts the hold-out set. The optimal \(\gamma\) is chosen to minimize the prediction error over the hold-out set.

Post-processing for Interpretation

After obtaining the biclustering solution \(U_\gamma\), the resulting matrix can be reordered to reveal the checkerboard pattern. The biclusters can be visualized using heatmaps, where each block corresponds to a bicluster. This visualization helps in interpreting the relationships between different groups of samples and variables.

Function¶

This method provides three core functions: cvx_sim_data, cvx_biclus_missing and cvx_biclus_valid. For modeling, The cvx_biclus_missing function performs convex biclustering on incomplete data matrices using an MM algorithm, while cvx_biclus_valid serves as an MM algorithm wrapper for parameter selection. Both functions have built-in plotting parameters that can output graphs of the clustering results, and cvx_biclus_valid can additionally display a graph of the validation error as a function of γ.

cvx_sim_data¶

cvx_sim_data loads a data matrix according to the CVX model.

cvx_sim_data()

Parameter¶

The simulated data are loaded internally and have no adjustable parameters.

Value¶

A data matrix of dimension 100 x 56. (dataframe)

Example¶

from BiFuncLib.simulation_data import cvx_sim_data
cvx_simdata = cvx_sim_data()

gkn_weights¶

Before introducing the modeling functions, we introduce gkn_weights, which combines Gaussian kernel weights with k-nearest neighbor weights.

gkn_weights(X, phi=0.5, k_row=5, k_col=5)

Parameter¶

Parameter	Description
X	dataframe, the data matrix to be clustered. The rows are the features, and the columns are the samples..
phi	numeric, the nonnegative parameter that controls the scale of kernel weights. Default is 0.5.
k_row	integer, The number of row nearest neighbors. Default is 5.
k_col	integer, the number of column nearest neighbors. Default is 5.

Value¶

The function gkn_weights outputs a dict contains parameters E_row, E_col, w_row and w_col. Detailed explanations for these entries will be provided in the modeling functions below.

Example¶

from BiFuncLib.cvx_main_func import gkn_weights
wts = gkn_weights(data, phi=phi, k_row=k, k_col=k)
w_row = wts["w_row"]
w_col = wts["w_col"]
E_row = wts["E_row"]
E_col = wts["E_col"]

cvx_biclus_missing¶

cvx_biclus_missing(data, E_row, E_col, w_row, w_col, gam, Lambda_row, Lambda_col, Theta, max_iter=100,
                   tol=1e-3, max_iter_inner=1000, tol_inner=1e-4, plot_clus = True)

Parameter¶

Parameter	Description
data	dataframe, the data matrix to be clustered. The rows are the features, and the columns are the samples.
E_row	sparse.csc.csc_matrix, edge-incidence matrix for row graph.
E_col	sparse.csc.csc_matrix, edge-incidence matrix for column graph.
w_row	array, vector of weights for row graph.
w_col	array, vector of weights for column graph.
gam	integer, path parameters and structures.
Lambda_row	array, initial guess of row Langrage multipliers.
Lambda_col	array, initial guess of column Langrage multipliers.
Theta	list, a list of missing indices.
max_iter	integer, maximum number of iterations. Default is 100.
tol	numeric, stopping criterion. Default is 1e-3.
max_iter_inner	integer, maximum number of inner cobra iterations. Default is 1000.
tol_inner	numeric, stopping criterion for inner cobra loop. Default is 1e-4.
plot_clus	bool, whether to plot clustering results. Default is True.

Value¶

The function cvx_biclus_missing outputs a dict contains clustering results and parameters.

iter: integer, the current iteration number of the algorithm.
Lambda_col: array, a matrix of regularization parameters for column clustering across different iterations.
Lambda_row: array, a matrix of regularization parameters for row clustering across different iterations.
mm_loss: list, the loss values from the model at different iterations.
nu_col: numeric, a parameter or statistic related to column clustering, possibly a normalization factor.
nu_row: numeric, a parameter or statistic related to row clustering, possibly a normalization factor.
U: array, a matrix representing the estimated mean values, possibly including imputed values for missing data.
V_col: array, a matrix representing the column-wise clustering assignments or indicators.
V_row: array, a matrix representing the row-wise clustering assignments or indicators.

Also, when plot_clus=True, it displays the clustering results.

Example¶

import numpy as np
import random
import math
from BiFuncLib.simulation_data import cvx_sim_data
from BiFuncLib.cvx_main_func import gkn_weights
from BiFuncLib.cvx_biclus import cvx_biclus_valid, cvx_biclus_missing
cvx_simdata = cvx_sim_data()
X = cvx_simdata.copy()
X = X - np.mean(np.mean(X))
X = X / np.linalg.norm(X, 'fro')
data = X
phi = 0.5
k = 5
print("=== Example 1 ===")
wts = gkn_weights(X, phi=phi, k_row=k, k_col=k)
E_row = wts["E_row"]
E_col = wts["E_col"]
gam = 200
m_row = E_row.shape[0]
m_col = E_col.shape[0]
n = X.shape[1]
p = X.shape[0]
Lambda_row = np.random.randn(n, m_row)
Lambda_col = np.random.randn(p, m_col)
Theta = random.sample(range(1, n*p+1), math.floor(0.1 * n * p))
wts = gkn_weights(data, phi=phi, k_row=k, k_col=k)
w_row = wts["w_row"]
w_col = wts["w_col"]
E_row = wts["E_row"]
E_col = wts["E_col"]
cvx_res1 = cvx_biclus_missing(data, E_row, E_col, w_row, w_col, gam, Lambda_row, Lambda_col, Theta)

cvx_biclus_valid¶

cvx_biclus_valid(data, E_row, E_col, w_row, w_col, gamma, Lambda_row=None, Lambda_col=None,
                 fraction=0.1, max_iter=100, tol=1e-3, max_iter_inner=1000, tol_inner=1e-4,
                 smooth = False, plot_error = True, plot_clus = True)

Parameter¶

Parameter	Description
data	dataframe, the data matrix to be clustered. The rows are the features, and the columns are the samples.
E_row	sparse.csc.csc_matrix, edge-incidence matrix for row graph.
E_col	sparse.csc.csc_matrix, edge-incidence matrix for column graph.
w_row	array, vector of weights for row graph.
w_col	array, vector of weights for column graph.
gamma	array, a sequence of regularization parameter for shrinkage.
Lambda_row	array, initial guess of row Langrage multipliers.
Lambda_col	array, initial guess of column Langrage multipliers.
fraction	numeric, fraction of entries for hold out.
max_iter	integer, maximum number of iterations. Default is 100.
tol	numeric, stopping criterion. Default is 1e-3.
max_iter_inner	integer, maximum number of inner cobra iterations. Default is 1000.
tol_inner	numeric, stopping criterion for inner cobra loop. Default is 1e-4.
smooth	bool, whether to smooth the clustering results. Default is False.
plot_error	bool, whether to plot a graph of the validation error as a function of γ. Default is True.
plot_clus	bool, whether to plot clustering results. Default is True.

Value¶

The function cvx_biclus_valid outputs a dict contains clustering results and parameters.

groups_col: list, a list containing cluster assignments and sizes for columns.
groups_row: list, a list containing cluster assignments and sizes for rows.
ThetaM: array, a matrix representing the row cluster centroids.
ThetaV: array, a matrix representing the column cluster centroids.
U: list, a list of matrices representing the estimated mean matrix for each iteration of the algorithm.
V_col: list, a list of matrices representing the column-wise clustering assignments.
V_row: list, a list of matrices representing the row-wise clustering assignments.
validation_error: array, an array storing the validation errors for different values of the regularization parameter γ.

When plot_error=True, it displays:

When plot_clus=True and smooth=False, it displays:

When plot_clus=True and smooth=True, it displays:

Example¶

import numpy as np
import random
import math
from BiFuncLib.simulation_data import cvx_sim_data
from BiFuncLib.cvx_main_func import gkn_weights
from BiFuncLib.cvx_biclus import cvx_biclus_valid, cvx_biclus_missing
cvx_simdata = cvx_sim_data()
X = cvx_simdata.copy()
X = X - np.mean(np.mean(X))
X = X / np.linalg.norm(X, 'fro')
data = X
phi = 0.5
k = 5
print("=== Example 2 ===")
nGamma = 5
gammaSeq = 10 ** np.linspace(0, 3, nGamma)
wts = gkn_weights(data, phi=phi, k_row=k, k_col=k)
w_row = wts["w_row"]
w_col = wts["w_col"]
E_row = wts["E_row"]
E_col = wts["E_col"]
cvx_res2 = cvx_biclus_valid(data, E_row, E_col, w_row, w_col, gammaSeq, plot_error = False)
print("=== Example 3 ===")
nGamma = 7
gammaSeq = 10 ** np.linspace(0, 1, nGamma)
cvx_res3 = cvx_biclus_valid(data, E_row, E_col, w_row, w_col, gammaSeq, smooth = True)

Previous: SSVD | Next: References

On this page

CVX
- Method Description
- Function
  - cvx_sim_data
    - Parameter
    - Value
    - Example
  - gkn_weights
    - Parameter
    - Value
    - Example
  - cvx_biclus_missing
    - Parameter
    - Value
    - Example
  - cvx_biclus_valid
    - Parameter
    - Value
    - Example

BiFuncLib's documentation

This Page

Quick search

CVX¶

Method Description¶

Function¶

cvx_sim_data¶

Parameter¶

Value¶

Example¶

gkn_weights¶

Parameter¶

Value¶

Example¶

cvx_biclus_missing¶

Parameter¶

Value¶

Example¶

cvx_biclus_valid¶

Parameter¶

Value¶

Example¶