We will replicate a study originally presented in:
Samples from the Arabidopsis thaliana model organism. Original source of data: Atwell et al., 2010. Link to genotye and phenotype data.
The binary phenotype avrB was analyzed, with a total of 87 samples (55 cases, 32 controls).
Total number of single-nucleotide polymorphisms (SNPs): 214,032 (all homozygous). For this exercise, we subsampled the SNPs in chromosome 1 to 650 SNPs
The categorical covariate was used to correct for population structure. Created with k-means clustering on the three principal components of the empirical kinship matrix. The value of k was optimized to minimize genomic inflation.
In Example 2 we focused on genomic intervals. Here analyze interactions between variants.
Key question: Are there specific combinations of SNPs whose interaction is associated to the phenotype?
Brute force analysis will require the enumeration of \(2^M\) sets of motifs, with \(M = 214,032\) (here \(M = 650\), see above)
Import the CASMAP package
library(CASMAP)
Set the paths to the input files (genotype, phenotype and covariate).
genotype_file <- "X.dat"
phenotype_file <- "Y.dat"
covariate_file <- "C.dat"
Create an object to perform region-based GWAS (interval search).
obj_hoi <- CASMAP(mode="higherOrderEpistasis")
Set hyperparameters of the analysis:
obj_hoi$setTargetFWER(alpha=0.05)
obj_hoi$setMaxCombinationSize(max_comb_size=0)
Read input files.
The A. thaliana genotypes we analyze here are homozygous (original data are binary). See Example 2 for details on how to encode the variants if this is not the case.
obj_hoi$readFiles(genotype_file=genotype_file, phenotype_file=phenotype_file, covariate_file=covariate_file)
Run the algorithm. Retrieve statistically associated interactions between genomic variants. This execution will take a bit longer than the other two examples.
obj_hoi$execute()
The analysis is finalized. Get the summary results.
summary_results <- obj_hoi$getSummary()
print(summary_results)
## $n.iset.processed
## [1] 25697705
##
## $n.iset.closed.processed
## [1] 1688572
##
## $n.iset.testable
## [1] 312045
##
## $testability.threshold
## [1] 1.44544e-07
##
## $target.fwer
## [1] 0.05
##
## $corrected.significance.threshold
## [1] 1.602333e-07
Get the statistically significant sets of SNPs
sig_sets <- obj_hoi$getSignificantInteractions()
str(sig_sets)
## 'data.frame': 1 obs. of 4 variables:
## $ itemsets :List of 1
## ..$ : num 356 537 358 627
## $ score : num 31.2
## $ odds_ratio: num Inf
## $ pvalue : num 2.3e-08
Display the hits
head(sig_sets[, c("pvalue", "itemsets")])
## pvalue itemsets
## 0 2.299653e-08 356, 537, 358, 627
Save all results to output files.
obj_hoi$writeSummary("output/summary.txt")
obj_hoi$writeProfile("output/profiling.txt")
obj_hoi$writeSignificantInteractions("output/significant_interactions.txt")