Table of Contents
This tutorial walks through the example data bundled with fastGxE and follows the current command-line workflow implemented in the codebase.
All commands below assume you have changed into the example/ directory and created an output/ subdirectory for results:
cd example
mkdir -p output
The executable is assumed to be on your PATH as fastgxe. If you built from source and the binary is not on your PATH, replace fastgxe with the full path, e.g. ../build/fastgxe.
1. Example inputs #
The example/ directory contains a minimal dataset for testing the software:
test.bedtest.bimtest.famtest_simu_pheno.txt
This dataset is intentionally small. It is useful for checking that the workflow runs end to end, but it is not large enough to provide stable real-data inference.
PLINK genotype input #
fastGxE reads PLINK binary genotype data through a prefix passed to --bfile.
For example:
--bfile test
means the program will use:
test.bedtest.bimtest.fam
The same prefix convention is used for --grm. For example, --grm test refers to the GRM prefix, not to a single file such as test.grm.bin.
Phenotype and covariate table #
The phenotype table passed to --data should satisfy the following rules:
- A header row is required.
- Fields should be separated by blanks or tabs.
- The first column must contain individual IDs.
- Only individuals present in both the phenotype table and the GRM will be retained.
- Duplicate sample IDs are not allowed.
A minimal layout looks like this:
| iid | pheno | age | sex | E1 | E2 | E3 | … | E40 |
|---|---|---|---|---|---|---|---|---|
| 1110005 | -0.75286 | 54 | F | 0.21092 | 0.648552 | 0.468537 | … | -0.55331 |
| 1110006 | -0.62861 | 49 | M | 0.000148 | -0.12916 | -1.77284 | … | 0.836599 |
| 1110008 | 1.132841 | 61 | F | -1.14961 | -0.29059 | 0.446468 | … | -0.10859 |
Typical column usage is:
--trait phenofor the phenotype--covar agefor continuous adjustment covariates--class sexfor categorical covariates--env-int E1:E40for interacting environments
Important option behavior #
--env-intaccepts either explicit names or a header range such asE1:E40.- Variables listed in
--env-intshould not also be listed in--covaror--class. - Interacting environments are standardized by default before GxE testing.
- Use
--no-standardize-envif you want to keep the raw scale. - Missing phenotype or covariate tokens default to:
NA Na na NAN NaN nan -NAN -NaN -nan <NA> <na> N/A n/a - You can override or extend the missing-value list with
--missing-data.
2. Build and prepare the GRM #
Step 1: build a dense additive GRM #
Generate the additive GRM from the example PLINK files:
fastgxe --make-grm --bfile test --out output/test
This produces the standard GRM sidecar files under the chosen prefix, including:
output/test.grm.binoutput/test.grm.idoutput/test.grm.N
For large datasets, the GRM can be computed in partitions with --npart <total> <part> and then merged:
fastgxe --make-grm --bfile test --npart 4 1 --out output/test
fastgxe --make-grm --bfile test --npart 4 2 --out output/test
fastgxe --make-grm --bfile test --npart 4 3 --out output/test
fastgxe --make-grm --bfile test --npart 4 4 --out output/test
fastgxe --process-grm --merge --grm output/test --npart 4
Each partition writes files such as output/test.4_1.grm.bin, output/test.4_1.grm.id, and output/test.4_1.grm.N. The merge step combines them back into output/test.grm.bin.
Step 2: group related samples for downstream fastGxE analysis #
The downstream --test-main and --test-gxe workflows rely on grouped GRM sidecar files. Create them with:
fastgxe --process-grm --group --grm output/test --cut-value 0.05
This step identifies connected components in the relatedness graph and writes grouped sidecar files, including:
output/test.grm.groupoutput/test.grm.group.size- a sparse within-group GRM representation used by downstream analysis
3. Test SNP main effects #
Run a genome-wide SNP main-effect scan with:
fastgxe --test-main \
--grm output/test \
--bfile test \
--data test_simu_pheno.txt \
--trait pheno \
--out output/test_main
Main outputs are:
output/test_main.varoutput/test_main.random.resoutput/test_main.res
test_main.res contains one row per SNP with the columns:
order chrom SNP cm base allele1 allele2 af missing beta se p
Useful options #
--covaradds continuous covariates.--classadds categorical covariates.--p-cutfilters the final output by p-value. The default2keeps all SNPs.--num-random-snpcontrols the number of random SNPs used in calibration.
Variance components only #
If you omit --bfile, the command still fits the variance components and writes *.var, then stops before the genome-wide SNP scan.
4. Test GxE interactions #
Run a genome-wide GxE scan with:
fastgxe --test-gxe \
--grm output/test \
--bfile test \
--data test_simu_pheno.txt \
--trait pheno \
--env-int E1:E40 \
--out output/test_gxe
This command treats E1 through E40 as the interacting environments.
Main outputs are:
output/test_gxe.varoutput/test_gxe.main.random.resoutput/test_gxe.GxEnoMain.random.resoutput/test_gxe.GxE.random.resoutput/test_gxe.res
The final test_gxe.res file contains:
- SNP main-effect estimates:
beta_main,se_main,p_main - one
beta,se,ptriplet for each interacting environment - combined GxE summary columns:
p_single,p_multi,p_gxe
Important defaults #
- Interacting environments are standardized by default.
- Noise-by-environment interaction terms are enabled by default.
To disable environment standardization:
fastgxe --test-gxe \
--grm output/test \
--bfile test \
--data test_simu_pheno.txt \
--trait pheno \
--env-int E1:E40 \
--no-standardize-env \
--out output/test_gxe_rawE
To disable noise-by-environment terms:
fastgxe --test-gxe \
--grm output/test \
--bfile test \
--data test_simu_pheno.txt \
--trait pheno \
--env-int E1:E40 \
--no-noisebye \
--out output/test_gxe_no_noisebye
5. Parallelize genome-wide scans #
Both --test-main and --test-gxe support task splitting with:
--split-task <total_parts> <current_part>
For example, split a GxE scan into 10 jobs:
fastgxe --test-gxe \
--grm output/test \
--bfile test \
--data test_simu_pheno.txt \
--trait pheno \
--env-int E1:E40 \
--split-task 10 1 \
--out output/test_gxe_part1
Repeat the same command with part indices 2 through 10.
Important notes:
--split-taskand--snp-rangecannot be used together.- Give each parallel job a distinct
--outprefix. - The result files are written under task-specific output names after the split is resolved.
- Merging association result files is currently left to an external Python or R script.
6. Identify environments driving lead SNP GxE signals with mmSuSiE #
Once candidate GxE loci are identified, you can use mmSuSiE to prioritize the environmental variables driving the signal:
https://github.com/chaoning/mmsusie
Installation #
git clone https://github.com/chaoning/mmsusie.git
cd mmsusie
pip install .
Prepare the sparse GRM #
Export the GRM in index-triplet form required by mmSuSiE:
fastgxe --process-grm --reformat --grm output/test --out-fmt 1 --out output/test
This writes the files used by the sparse block-diagonal workflow:
output/test.grm.idoutput/test.grm.groupoutput/test.grm.index_triplet
Run mmSuSiE #
The sparse-GRM workflow uses the MMSuSiESp class:
from mmsusie import MMSuSiESp
model = MMSuSiESp()
# Input files (run from the example/ directory)
bed_file = "test" # PLINK genotype prefix
grm_file = "output/test" # sparse GRM prefix from --process-grm --reformat
pheno_file = "test_simu_pheno.txt" # phenotype/covariate table
env_int = [f"E{i}" for i in range(1, 41)] # environment columns to fine-map
# Step 1: LD-prune significant GxE hits to obtain one lead SNP per locus.
# ld_pure keeps the most significant SNP in each LD block (r² < ld_r2).
assoc_file = "output/test_gxe.res"
df_leading = model.ld_pure(
assoc_file,
bed_file,
ld_r2=0.1, # prune SNPs with pairwise r² >= 0.1
snp="SNP", # column name for SNP ID in assoc_file
p="p_gxe", # column name for GxE p-value
p_cutoff=5e-8, # genome-wide significance threshold
)
print(df_leading)
# Step 2: Fine-map environments for a chosen lead SNP.
snp_id = "rs550011" # lead SNP from df_leading
trait = "pheno" # phenotype column in pheno_file
varcom_file = "output/test_gxe.var" # variance components estimated by fastGxE
out_file = "output/test_mmsusie" # output file prefix
res_dct = model.mmsusie_lead_gxe(
pheno_file,
trait,
env_int,
grm_file,
bed_file,
snp_id,
varcom_file,
out_file,
L=10, # maximum number of causal environments
maxiter=100, # maximum ELBO iterations
tol=1e-3, # convergence tolerance on ELBO
coverage=0.95, # credible set coverage level
min_abs_corr=0.5, # minimum purity for a credible set to be reported
estimate_sigma=False, # use fixed prior variance during fitting
)
# res_dct["cs"] — credible sets: environments with cumulative PIP >= coverage
# res_dct["pip"] — per-environment posterior inclusion probabilities (DataFrame)
# res_dct["alpha"] — L x p posterior assignment probabilities per component
# res_dct["mu"] — L x p posterior mean effects per component
# res_dct["elbo"] — ELBO values across iterations (convergence diagnostic)
print(res_dct["cs"])
df_pip = res_dct["pip"]
print(df_pip)
print(df_pip[df_pip["pip"] > 0.5]) # environments with strong evidence of interaction
Output files #
model.mmsusie_lead_gxe() writes four result files under the given out_file prefix:
| File | Contents |
|---|---|
output/test_mmsusie.pip.txt |
Posterior inclusion probability for each environment |
output/test_mmsusie.alpha.txt |
Per-component posterior assignment probabilities |
output/test_mmsusie.mu.txt |
Posterior mean effects |
output/test_mmsusie.cs.txt |
Credible sets |
7. Summary workflow #
For a full example run, the main steps are:
cd example && mkdir -p output- Build the GRM with
--make-grm. - Group related samples with
--process-grm --group. - Run
--test-mainor--test-gxe. - Reformat the GRM and follow up with
mmSuSiEif you want environment-level fine-mapping.