SSVD¶
Method Description¶
Sparse Data Representation
SSVD operates on high-dimensional data matrices where rows typically represent samples (e.g., subjects) and columns represent variables (e.g., genes). The data matrix is often centered to ensure that each column has zero mean. This preprocessing step helps in identifying interpretable patterns by removing the overall mean effect.
Model Definition
SSVD seeks a low-rank, checkerboard-structured matrix approximation to the data matrix \(X\). The checkerboard structure is achieved by imposing sparsity on both the left and right singular vectors, resulting in many zero entries. Specifically, the model aims to decompose \(X\) into a sum of rank-one matrices \(s_k u_k v_k^T\), where \(u_k\) and \(v_k\) are sparse singular vectors. This structure allows for simultaneous clustering of rows and columns, revealing interpretable row-column associations.
Iterative Algorithm for Computation
The SSVD algorithm alternates between optimizing the left and right singular vectors while imposing sparsity-inducing penalties. The steps are as follows:
Initialization: Apply standard SVD to the data matrix \(X\) to obtain the initial singular vectors.
Update :math:`v`: For fixed \(u\), update \(v\) using a soft-thresholding rule to impose sparsity:
\[\tilde{v}_j = \text{sign}((X^T u)_j) \left( |(X^T u)_j| - \frac{\lambda_v w_{2,j}}{2} \right)_+\]Update :math:`u`: For fixed \(v\), update \(u\) similarly:
\[\tilde{u}_i = \text{sign}((X v)_i) \left( |(X v)_i| - \frac{\lambda_u w_{1,i}}{2} \right)_+\]Normalization and Singular Value Update: Normalize \(u\) and \(v\) and update the singular value \(s\).
Convergence Check: Repeat the updates until convergence, typically within 5 to 10 iterations. The penalty parameters \(\lambda_u\) and \(\lambda_v\) are selected using the Bayesian Information Criterion (BIC) to balance sparsity and goodness-of-fit.
Penalty Parameter Selection
The degrees of sparsity of the singular vectors \(u\) and \(v\) are controlled by the penalty parameters \(\lambda_u\) and \(\lambda_v\). These parameters are selected using the BIC, which balances the model complexity (number of non-zero entries) with the fit to the data. The BIC is defined as:
\[\text{BIC}(\lambda) = \frac{\|Y - \hat{Y}\|^2}{nd \cdot \hat{\sigma}^2} + \frac{\log(nd)}{nd} \hat{df}(\lambda),\]
where \(\hat{df}(\lambda)\) is the degree of sparsity (number of non-zero entries) and \(\hat{\sigma}^2\) is the estimated error variance.
Post-processing for Interpretation
After obtaining the SSVD layers, the resulting sparse singular vectors can be used to identify biclusters. For example, in gene expression data, the non-zero entries in \(u\) and \(v\) indicate which samples and genes are associated within each bicluster. The biclusters can be visualized using image plots or scatter plots of the singular vectors, revealing distinct groupings and contrasts between different conditions or classes.
Function¶
This method provides three four functions: ssvd_sim_data for simulation, s4vd_biclus and ssvd_biclus for modeling, and jaccardmat for evaluation. For visualization, we use bcheatmap function which is the same as Bimax to show the clustering result. In this section, we detail their respective usage, as well as parameters, output values and usage examples for each function.
ssvd_sim_data¶
ssvd_sim_data generates a dict contains data matrix and true clustering results.
ssvd_sim_data()
Parameter¶
The simulated data generates data in a fixed way and have no adjustable parameters.
Value¶
data: array, data matrix generated by SSVD model.
res: BiclustResult object, true clustering results.
Example¶
from BiFuncLib.simulation_data import ssvd_sim_data
ssvd_simdata = ssvd_sim_data()
s4vd_biclus & ssvd_biclus¶
The ssvd_biclus function performs biclustering of the data matrix using sparse singular value decomposition, while s4vd_biclus builds upon this by incorporating nested stability selection for enhanced biclustering results.
ssvd_biclus(data, K=10, threu=1, threv=1, gamu=0, gamv=0, merr=1e-4, niter=100)
and
s4vd_biclus(data, steps=100, pcerv=0.1, pceru=0.1, ss_thr=(0.6, 0.65), size=0.5, gamm=0, iters=100, nbiclust=10, merr=1e-3, cols_nc=True,
rows_nc=True, row_overlap=True, col_overlap=True, row_min=1, col_min=1, pointwise=True, start_iter=3, savepath=False)
Parameter¶
For ssvd_biclus function, the parameters are listed below:
Parameter |
Description |
|---|---|
data |
array, the matrix to be clustered. |
K |
integer, number of SSVD-layers. |
threu |
integer 1 or 2, type of penalty (thresholding rule) for the left singular vector, 1 = (Adaptive) LASSO, 2 = hard thresholding. Default is 1. |
threv |
integer 1 or 2, type of penalty (thresholding rule) for the right singular vector, 1 = (Adaptive) LASSO, 2 = hard thresholding. Default is 1. |
gamu |
numeric, weight parameter in Adaptive LASSO for the left singular vector, nonnegative constant. Default is 0. |
gamv |
numeric, weight parameter in Adaptive LASSO for the right singular vector, nonnegative constant. Default is 0. |
merr |
numeric, threshold to decide convergence. Default is 1e-4. |
niter |
integer, maximum number of iterations. Default is 100. |
For s4vd_biclus function, the parameters are listed below:
Parameter |
Description |
|---|---|
data |
array, the matrix to be clustered. |
steps |
integer, number of subsamples used to perform the stability selection. Default is 100. |
pcerv |
numeric, per comparsion wise error rate to control the number of falsely selected right singular vector coefficients (columns/samples). Default is 0.1. |
pceru |
numeric, per comparsion wise error rate to control the number of falsely selected left singular vector coefficients (rows/genes). Default is 0.1. |
ss_thr |
tuple, range of the cutoff threshold (relative selection frequency) for the stability selection. Default is (0.6, 0.65). |
size |
numeric, size of the subsamples used to perform the stability selection. Default is 0.5. |
gamm |
numeric, weight parameter for the adaptive LASSO, nonnegative constant. Default is 0. |
iters |
integer, maximal number of iterations to fit a single bicluster. Default is 100. |
nbiclust |
integer, maximal number of biclusters. Default is 10. |
merr |
numeric, threshold to decide convergence. Default is 1e-3. |
cols_nc |
bool, allow for negative correlation of columns (samples) over rows (genes). Default is True. |
rows_nc |
bool, allow for negative correlation of rows (genes) over columns (samples). Default is True. |
row_overlap |
bool, allow rows to overlap between biclusters. Default is True. |
col_overlap |
bool, allow columns to overlap between biclusters. Default is True. |
row_min |
integer, minimal number of rows. Default is 1. |
col_min |
integer, minimal number of columns. Default is 1. |
pointwise |
bool, performs a fast pointwise stability selection instead of calculating the complete stability path. Default is False. |
start_iter |
integer, number of starting iterations in which the algorithm is not allowed to converge. Default is 3. |
savepath |
bool, saves the stability path in order plot the path with the stabpathplot function. Default is False. |
Value¶
Both functions return a BiclustResult object, which is the same as Bimax.
Example¶
from BiFuncLib.simulation_data import ssvd_sim_data
from BiFuncLib.ssvd_biclus import s4vd_biclus, ssvd_biclus
ssvd_simdata = ssvd_sim_data()
data = ssvd_simdata['data']
res_sim = ssvd_simdata['res']
s4vd_res = s4vd_biclus(data, pcerv=0.5, pceru=0.5, pointwise=False, nbiclust=1)
s4vd_res_pw = s4vd_biclus(data, pcerv=0.5, pceru=0.5, pointwise=True, nbiclust=1)
res2_ssvd = ssvd_biclus(data,K=1)
Other functions¶
For visualization, we use bcheatmap function which can be seen at Bimax to show the clustering results. A heatmap will be displayed:
For evaluation, we use jaccardmat function. Its usage is shown below:
jaccardmat(res1, res2, mode=None)
res1 and res2 parameters are two BiclustResult objects, mode is used to control whether the calculation is performed row-wise (mode=’row’), column-wise (mode=’column’), or over the entire matrix (mode=’None’). This ffunction outputs a matrix contains jaccard indexes.
Example¶
from BiFuncLib.simulation_data import ssvd_sim_data
from BiFuncLib.ssvd_main_func import jaccardmat
from BiFuncLib.ssvd_biclus import s4vd_biclus, ssvd_biclus
from BiFuncLib.bcheatmap import bcheatmap
ssvd_simdata = ssvd_sim_data()
data = ssvd_simdata['data']
res_sim = ssvd_simdata['res']
s4vd_res = s4vd_biclus(data, pcerv=0.5, pceru=0.5, pointwise=False, nbiclust=1)
print(jaccardmat(res_sim, s4vd_res, 'row'))
print(jaccardmat(res_sim, s4vd_res, 'column'))
bcheatmap(data, s4vd_res)
s4vd_res_pw = s4vd_biclus(data, pcerv=0.5, pceru=0.5, pointwise=True, nbiclust=1)
print(jaccardmat(res_sim, s4vd_res_pw))
bcheatmap(data, s4vd_res_pw)
res2_ssvd = ssvd_biclus(data,K=1)
print(jaccardmat(res_sim, res2_ssvd))
bcheatmap(data, res2_ssvd)