BiFuncLib's documentation

BiFuncLib is a Python package that aggregates multiple biclustering methods mainly for functional data.

This Page

SSVD

Method Description

  • Sparse Data Representation

SSVD operates on high-dimensional data matrices where rows typically represent samples (e.g., subjects) and columns represent variables (e.g., genes). The data matrix is often centered to ensure that each column has zero mean. This preprocessing step helps in identifying interpretable patterns by removing the overall mean effect.

  • Model Definition

SSVD seeks a low-rank, checkerboard-structured matrix approximation to the data matrix \(X\). The checkerboard structure is achieved by imposing sparsity on both the left and right singular vectors, resulting in many zero entries. Specifically, the model aims to decompose \(X\) into a sum of rank-one matrices \(s_k u_k v_k^T\), where \(u_k\) and \(v_k\) are sparse singular vectors. This structure allows for simultaneous clustering of rows and columns, revealing interpretable row-column associations.

  • Iterative Algorithm for Computation

The SSVD algorithm alternates between optimizing the left and right singular vectors while imposing sparsity-inducing penalties. The steps are as follows:

  1. Initialization: Apply standard SVD to the data matrix \(X\) to obtain the initial singular vectors.

  2. Update :math:`v`: For fixed \(u\), update \(v\) using a soft-thresholding rule to impose sparsity:

    \[\tilde{v}_j = \text{sign}((X^T u)_j) \left( |(X^T u)_j| - \frac{\lambda_v w_{2,j}}{2} \right)_+\]
  3. Update :math:`u`: For fixed \(v\), update \(u\) similarly:

    \[\tilde{u}_i = \text{sign}((X v)_i) \left( |(X v)_i| - \frac{\lambda_u w_{1,i}}{2} \right)_+\]
  4. Normalization and Singular Value Update: Normalize \(u\) and \(v\) and update the singular value \(s\).

  5. Convergence Check: Repeat the updates until convergence, typically within 5 to 10 iterations. The penalty parameters \(\lambda_u\) and \(\lambda_v\) are selected using the Bayesian Information Criterion (BIC) to balance sparsity and goodness-of-fit.

  • Penalty Parameter Selection

The degrees of sparsity of the singular vectors \(u\) and \(v\) are controlled by the penalty parameters \(\lambda_u\) and \(\lambda_v\). These parameters are selected using the BIC, which balances the model complexity (number of non-zero entries) with the fit to the data. The BIC is defined as:

\[\text{BIC}(\lambda) = \frac{\|Y - \hat{Y}\|^2}{nd \cdot \hat{\sigma}^2} + \frac{\log(nd)}{nd} \hat{df}(\lambda),\]

where \(\hat{df}(\lambda)\) is the degree of sparsity (number of non-zero entries) and \(\hat{\sigma}^2\) is the estimated error variance.

  • Post-processing for Interpretation

After obtaining the SSVD layers, the resulting sparse singular vectors can be used to identify biclusters. For example, in gene expression data, the non-zero entries in \(u\) and \(v\) indicate which samples and genes are associated within each bicluster. The biclusters can be visualized using image plots or scatter plots of the singular vectors, revealing distinct groupings and contrasts between different conditions or classes.

Function

This method provides three four functions: ssvd_sim_data for simulation, s4vd_biclus and ssvd_biclus for modeling, and jaccardmat for evaluation. For visualization, we use bcheatmap function which is the same as Bimax to show the clustering result. In this section, we detail their respective usage, as well as parameters, output values and usage examples for each function.

ssvd_sim_data

ssvd_sim_data generates a dict contains data matrix and true clustering results.

ssvd_sim_data()

Parameter

The simulated data generates data in a fixed way and have no adjustable parameters.

Value

  • data: array, data matrix generated by SSVD model.

  • res: BiclustResult object, true clustering results.

Example

from BiFuncLib.simulation_data import ssvd_sim_data
ssvd_simdata = ssvd_sim_data()

s4vd_biclus & ssvd_biclus

The ssvd_biclus function performs biclustering of the data matrix using sparse singular value decomposition, while s4vd_biclus builds upon this by incorporating nested stability selection for enhanced biclustering results.

ssvd_biclus(data, K=10, threu=1, threv=1, gamu=0, gamv=0, merr=1e-4, niter=100)

and

s4vd_biclus(data, steps=100, pcerv=0.1, pceru=0.1, ss_thr=(0.6, 0.65), size=0.5, gamm=0, iters=100, nbiclust=10, merr=1e-3, cols_nc=True,
            rows_nc=True, row_overlap=True, col_overlap=True, row_min=1, col_min=1, pointwise=True, start_iter=3, savepath=False)

Parameter

For ssvd_biclus function, the parameters are listed below:

Parameter

Description

data

array, the matrix to be clustered.

K

integer, number of SSVD-layers.

threu

integer 1 or 2, type of penalty (thresholding rule) for the left singular vector, 1 = (Adaptive) LASSO, 2 = hard thresholding. Default is 1.

threv

integer 1 or 2, type of penalty (thresholding rule) for the right singular vector, 1 = (Adaptive) LASSO, 2 = hard thresholding. Default is 1.

gamu

numeric, weight parameter in Adaptive LASSO for the left singular vector, nonnegative constant. Default is 0.

gamv

numeric, weight parameter in Adaptive LASSO for the right singular vector, nonnegative constant. Default is 0.

merr

numeric, threshold to decide convergence. Default is 1e-4.

niter

integer, maximum number of iterations. Default is 100.

For s4vd_biclus function, the parameters are listed below:

Parameter

Description

data

array, the matrix to be clustered.

steps

integer, number of subsamples used to perform the stability selection. Default is 100.

pcerv

numeric, per comparsion wise error rate to control the number of falsely selected right singular vector coefficients (columns/samples). Default is 0.1.

pceru

numeric, per comparsion wise error rate to control the number of falsely selected left singular vector coefficients (rows/genes). Default is 0.1.

ss_thr

tuple, range of the cutoff threshold (relative selection frequency) for the stability selection. Default is (0.6, 0.65).

size

numeric, size of the subsamples used to perform the stability selection. Default is 0.5.

gamm

numeric, weight parameter for the adaptive LASSO, nonnegative constant. Default is 0.

iters

integer, maximal number of iterations to fit a single bicluster. Default is 100.

nbiclust

integer, maximal number of biclusters. Default is 10.

merr

numeric, threshold to decide convergence. Default is 1e-3.

cols_nc

bool, allow for negative correlation of columns (samples) over rows (genes). Default is True.

rows_nc

bool, allow for negative correlation of rows (genes) over columns (samples). Default is True.

row_overlap

bool, allow rows to overlap between biclusters. Default is True.

col_overlap

bool, allow columns to overlap between biclusters. Default is True.

row_min

integer, minimal number of rows. Default is 1.

col_min

integer, minimal number of columns. Default is 1.

pointwise

bool, performs a fast pointwise stability selection instead of calculating the complete stability path. Default is False.

start_iter

integer, number of starting iterations in which the algorithm is not allowed to converge. Default is 3.

savepath

bool, saves the stability path in order plot the path with the stabpathplot function. Default is False.

Value

Both functions return a BiclustResult object, which is the same as Bimax.

Example

from BiFuncLib.simulation_data import ssvd_sim_data
from BiFuncLib.ssvd_biclus import s4vd_biclus, ssvd_biclus
ssvd_simdata = ssvd_sim_data()
data = ssvd_simdata['data']
res_sim = ssvd_simdata['res']
s4vd_res = s4vd_biclus(data, pcerv=0.5, pceru=0.5, pointwise=False, nbiclust=1)
s4vd_res_pw = s4vd_biclus(data, pcerv=0.5, pceru=0.5, pointwise=True, nbiclust=1)
res2_ssvd = ssvd_biclus(data,K=1)

Other functions

For visualization, we use bcheatmap function which can be seen at Bimax to show the clustering results. A heatmap will be displayed:

../_images/ssvd_res.png

For evaluation, we use jaccardmat function. Its usage is shown below:

jaccardmat(res1, res2, mode=None)

res1 and res2 parameters are two BiclustResult objects, mode is used to control whether the calculation is performed row-wise (mode=’row’), column-wise (mode=’column’), or over the entire matrix (mode=’None’). This ffunction outputs a matrix contains jaccard indexes.

Example

from BiFuncLib.simulation_data import ssvd_sim_data
from BiFuncLib.ssvd_main_func import jaccardmat
from BiFuncLib.ssvd_biclus import s4vd_biclus, ssvd_biclus
from BiFuncLib.bcheatmap import bcheatmap
ssvd_simdata = ssvd_sim_data()
data = ssvd_simdata['data']
res_sim = ssvd_simdata['res']
s4vd_res = s4vd_biclus(data, pcerv=0.5, pceru=0.5, pointwise=False, nbiclust=1)
print(jaccardmat(res_sim, s4vd_res, 'row'))
print(jaccardmat(res_sim, s4vd_res, 'column'))
bcheatmap(data, s4vd_res)
s4vd_res_pw = s4vd_biclus(data, pcerv=0.5, pceru=0.5, pointwise=True, nbiclust=1)
print(jaccardmat(res_sim, s4vd_res_pw))
bcheatmap(data, s4vd_res_pw)
res2_ssvd = ssvd_biclus(data,K=1)
print(jaccardmat(res_sim, res2_ssvd))
bcheatmap(data, res2_ssvd)

Previous: Bimax | Next: CVX