cola

Subgroup classification is a basic task in genomic data analysis. The cola package provides a general framework for subgroup classification by consensus partitioning.

39
mentions
1
contributor

Cite this software

What cola can do for you

cola: A General Framework for Consensus Partitioning

R-CMD-check
bioc
bioc

Citation

Zuguang Gu, et al., cola: an R/Bioconductor package for consensus partitioning through a general framework, Nucleic Acids Research, 2021. https://doi.org/10.1093/nar/gkaa1146

Zuguang Gu, et al., Improve consensus partitioning via a hierarchical procedure. Briefings in bioinformatics 2022. https://doi.org/10.1093/bib/bbac048

Install

cola is available on Bioconductor, you can install it by:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("cola")

The latest version can be installed directly from GitHub:

library(devtools)
install_github("jokergoo/cola")

Methods

The cola supports two types of consensus partitioning.

Standard consensus partitioning

Features

  1. It modularizes the consensus clustering processes that various methods can
    be easily integrated in different steps of the analysis.
  2. It provides rich visualizations for intepreting the results.
  3. It allows running multiple methods at the same time and provides
    functionalities to compare results in a straightforward way.
  4. It provides a new method to extract features which are more efficient to
    separate subgroups.
  5. It generates detailed HTML reports for the complete analysis.

Workflow

The steps of consensus partitioning is:

  1. Clean the input matrix. The processing are: adjusting outliers, imputing missing
    values and removing rows with very small variance. This step is optional.
  2. Extract subset of rows with highest scores. Here "scores" are calculated by
    a certain method. For gene expression analysis or methylation data
    analysis, $n$ rows with highest variance are used in most cases, where
    the "method", or let's call it "the top-value method" is the variance (by
    var() or sd()). Note the choice of "the top-value method" can be
    general. It can be e.g. MAD (median absolute deviation) or any user-defined
    method.
  3. Scale the rows in the sub-matrix (e.g. gene expression) or not (e.g. methylation data).
    This step is optional.
  4. Randomly sample a subset of rows from the sub-matrix with probability $p$ and
    perform partition on the columns of the matrix by a certain partition
    method, with trying different numbers of subgroups.
  5. Repeat step 4 several times and collect all the partitions.
  6. Perform consensus partitioning analysis and determine the best number of
    subgroups which gives the most stable subgrouping.
  7. Apply statistical tests to find rows that show significant difference
    between the predicted subgroups. E.g. to extract subgroup specific genes.
  8. If rows in the matrix can be associated to genes, downstream analysis such
    as function enrichment analysis can be performed.

Usage

Three lines of code to perfrom cola analysis:

mat = adjust_matrix(mat) # optional
rl = run_all_consensus_partition_methods(
    mat, 
    top_value_method = c("SD", "MAD", ...),
    partition_method = c("hclust", "kmeans", ...),
    cores = ...)
cola_report(rl, output_dir = ...)

Plots

Following plots compare consensus heatmaps with k = 4 under all combinations of methods.

Hierarchical consensus partitioning

Features

  1. It can detect subgroups which show major differences and also moderate differences.
  2. It can detect subgroups with large sizes as well as with tiny sizes.
  3. It generates detailed HTML reports for the complete analysis.

Hierarchical Consensus Partitioning

Usage

Three lines of code to perfrom hierarchical consensus partitioning analysis:

mat = adjust_matrix(mat) # optional
rh = hierarchical_partition(mat, mc.cores = ...)
cola_report(rh, output_dir = ...)

Plots

Following figure shows the hierarchy of the subgroups.

Following figure shows the signature genes.

License

MIT @ Zuguang Gu

Logo of cola
Keywords
Programming languages
  • HTML 93%
  • R 7%
License
</>Source code
Packages

Participating organisations

German Cancer Research Center

Reference papers

Mentions

Contributors

Helmholtz Program-oriented Funding IV

Research Field
Research Program
PoF Topic
3 Health
3.1 Cancer Research
3.1.1 Cell and Tumor Biology
3.1.2 Functional and Structural Genomics
  • 3 Health
    • 3.1 Cancer Research
      • 3.1.1 Cell and Tumor Biology
      • 3.1.2 Functional and Structural Genomics