Problem overview

We will replicate the study presented in:


Summary

  • Human breast cancer data obtained from MCF-7 breast cancer expression profiles (GSE6462)

  • The expression data contain samples induced by two different ErbB receptor ligands, EGF and HRG. Our analysis will focus on EGF-induced samples

  • Total number of genes in the samples: 12,773

  • Binary labeling of samples: 1 “up-regulated”; 0 otherwise

  • Considered transcription factors (TFs) predicted to bind in the promoter of the genes (from MSigDB): 397 total

  • Data preparation from the author’s website

image:

Problem statement

Key question: Are there specific combinations of motifs that are enriched in one group of genes and not the other?

Brute force analysis will require the enumeration of \(2^M\) sets of motifs, with \(M = 397\)


Source: Presentation by Koji Tsuda, Krupp Symposium (2016), slide 10

Analysis

library(CASMAP)

Set the paths to the input files (data and labels). Note: for simplicity, the data files are assumed to be located in the same directory as the scripts.

data_file  <- "breast_cancer_matrix.dat"
label_file <- "breast_cancer_label.dat"

Create an object to search for higher order interactions (all possible combinations).

obj_hoi <- CASMAP(mode="higherOrderEpistasis")

Set hyperparameters of the analysis:

  • alpha: Target Family-Wise Error Rate (FWER). Must be strictly between 0 and 1. Default value is 0.05.
  • max_comb_size: A numeric specifying the maximum length of combinations. For example, if set to 4, then only combinations of size between 1 and 4 (inclusive) will be considered. To consider combinations of arbitrary (maximal) length, use value 0, which is the default value.
obj_hoi$setTargetFWER(alpha=0.05)
obj_hoi$setMaxCombinationSize(max_comb_size=0)

Read input files. Note: Nomenclature of named-parameters is assumed for genomic data.

obj_hoi$readFiles(genotype_file=data_file, phenotype_file=label_file)

All is ready for the analysis! Execute the search. Note: Depending on the data, this step may take some time.

obj_hoi$execute()

The analysis is finalized. Get the summary results.

summary_results <- obj_hoi$getSummary()
print(summary_results)
## $n.iset.processed
## [1] 33837957
## 
## $n.iset.closed.processed
## [1] 12060658
## 
## $n.iset.testable
## [1] 12009298
## 
## $testability.threshold
## [1] 3.981072e-09
## 
## $target.fwer
## [1] 0.05
## 
## $corrected.significance.threshold
## [1] 4.163441e-09

Get the statistically significant sets of motifs.

sig_sets <- obj_hoi$getSignificantInteractions()
names(sig_sets)
## [1] "itemsets"   "score"      "odds_ratio" "pvalue"

Display the most significant sets.

sort_idx <- order(sig_sets$pvalue)
sig_sets_ordered <- sig_sets[sort_idx, ]
head(sig_sets_ordered[, c("pvalue", "itemsets")], n=5)
##            pvalue                                   itemsets
## 790  1.025043e-19         10, 97, 230, 88, 74, 146, 202, 297
## 3983 1.495522e-19                         317, 182, 106, 146
## 260  1.910906e-17                     19, 209, 106, 249, 297
## 788  1.910906e-17    10, 97, 230, 88, 74, 146, 155, 202, 297
## 789  1.910906e-17 10, 97, 230, 88, 74, 146, 6, 202, 297, 266

Display the borderline significant sets.

sort_idx <- order(sig_sets$pvalue)
sig_sets_ordered <- sig_sets[sort_idx, ]
tail(sig_sets_ordered[, c("pvalue", "itemsets")], n=5)
##            pvalue           itemsets
## 1797 3.399109e-09      313, 146, 297
## 16   3.482785e-09   7, 106, 164, 266
## 46   3.482785e-09 368, 106, 266, 297
## 1830 3.482785e-09        313, 146, 6
## 1802 4.008258e-09      313, 146, 202

Conclusion

The combinatorial analysis with CASMAP does not explore the entire search space. Untestable hypothesis are pruned achieving more statistical power.


Source: Terada et al., PNAS 2013. (Figure 2)