Tracing the evolution of lineage-specific transcription factor binding sites


Prerequisites

Source Code

The source code also includes a program that constructs a motif PWM and generates a kmer list from a ChIP-seq peak file. The kmer list can then be used to run BOO.

Arguments

-s <string>
Multiple sequence alignment file for the peak regions. This file can be generated using mafsInRegion from the UCSC Genome Browser Utilities.
-r <string>
ChIP-seq peak file. This is a BED format file giving the location of each peak on the genome. It should be consistent with the multiple sequence alignment file mentioned above. The fourth column should follow the format <TF>_<num>.<peak_score> (e.g. GATA1_1.200), where <peak_score> must be an integer.
-m <character>
Running mode. <f> indicates format (the preprocessing step): the program scans the peak file using the kmer list and saves the indexed result into binary files.
-t <string>
Phylogenetic tree file, provided with the source code.
-x <string>
Species, e.g. hg19 for human (as used in the example commands). Currently only human is supported.
-k <int>
ChIP-seq peak threshold. Only peaks with scores higher than this cutoff are considered true ChIP-seq peak regions.
-w <string>
Window size in the format -<dis_1>:<dis_2> (e.g. -100:100). The window size must be smaller than the peak size.
-a <string>
Kmer list name
-q <string>
Kmer list file. This file contains all the kmers that can be considered true TFBS motifs (one per line). An example file can be seen here. We also provide a program to generate a PWM and a motif kmer list file from ChIP-seq peak regions.
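
The fourth-column naming rule for -r and the cutoff given by -k can be checked before running BOO. The following is a minimal Python sketch and is not part of the BOO package: the names parse_peak_name and filter_peaks are hypothetical, the columns are assumed to be whitespace-separated, and a peak is assumed to pass only when its score is strictly greater than the cutoff.

```python
import re

# Fourth-column pattern from the description above: <TF>_<num>.<peak_score>,
# with an integer peak score (e.g. GATA1_1.200).
PEAK_NAME = re.compile(r"(.+)_(\d+)\.(\d+)")


def parse_peak_name(name):
    """Split a peak name into (tf, num, score); raise ValueError if malformed."""
    match = PEAK_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"peak name {name!r} is not <TF>_<num>.<peak_score>")
    tf, num, score = match.groups()
    return tf, int(num), int(score)


def filter_peaks(lines, cutoff):
    """Return (chrom, start, end, name) for peaks scoring above cutoff."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        chrom, start, end, name = fields[:4]
        _, _, score = parse_peak_name(name)
        # Assumption: "higher than" means strictly greater than the cutoff.
        if score > cutoff:
            kept.append((chrom, int(start), int(end), name))
    return kept
```

Passing the lines of a peak file and a cutoff of 100 to filter_peaks corresponds to -k 100 under the assumptions above; a malformed name raises ValueError instead of being silently skipped.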

Outputs

*BooBED_All.txt
Predicted branch-of-origin of each TFBS, in BED file format. The fourth column gives the predicted branch-of-origin of the TFBS.

Instructions

Download the package, uncompress it using tar -zxvf boo_0.1.0.tar.gz and follow the instructions in README.txt to install the package. Executable binaries are also provided. A ChIP-seq peak file in four-column BED format is required to run BOO. It must meet the following requirements (see example file here). The other required files can be generated using the following scripts/tools.
  1. Multiple sequence alignment file
    Download the human multiz46way alignment files from the UCSC Genome Browser, put them in a folder (e.g. Genome_UCSC/Human/hg19/multiz46way/) and uncompress them. Use mafsInRegion to extract the multiple sequence alignments within the ChIP-seq peaks.
    mafsInRegion chip_genCoord_Broad_H1hESC.hg19.CTCF.txt hg19_multiz_Broad_H1hESC.CTCF.txt Genome_UCSC/Human/hg19/multiz46way/*.maf
  2. Kmer list file
    For a known TF, in order to reduce the calculation time, we use a pre-computed kmer list to search for TFBS (see example kmer file here). Along with the BOO program, we also provide another tool called BOO_pwm, which generates a PWM motif profile from ChIP-seq peak regions and outputs a kmer list. Users can also provide a custom kmer list generated by any other method. (Note: BOO will not automatically generate the reverse complement of a kmer, so please include both each kmer and its reverse complement.)
    ./boo_pwm -m F -s data/hg19/CTCF/hg19_multiz_Broad_GM12878.CTCF.txt -r data/hg19/CTCF/chip_genCoord_Broad_GM12878.hg19.CTCF.txt -w -100:100 -k 100 -t data/tree_file.nh -x hg19
    ./boo_pwm -m w -s data/hg19/CTCF/hg19_multiz_Broad_GM12878.CTCF.txt -r data/hg19/CTCF/chip_genCoord_Broad_GM12878.hg19.CTCF.txt -w -100:100 -k 100 -t data/tree_file.nh -x hg19 -a GGGGCKC
    The parameters have the same meaning as in the main program BOO.
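
Because BOO does not add reverse complements itself, a custom kmer list can be expanded before use. The following is a minimal Python sketch and is not part of the BOO package: the function names are hypothetical, and kmers are assumed to contain only A, C, G and T (one kmer per line in the list file).

```python
# Complement table for unambiguous DNA bases only.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an A/C/G/T kmer."""
    kmer = kmer.strip().upper()
    if set(kmer) - set("ACGT"):
        raise ValueError(f"unsupported base in kmer {kmer!r}")
    return kmer.translate(_COMPLEMENT)[::-1]


def with_reverse_complements(kmers):
    """Return kmers plus their reverse complements, without duplicates.

    Input order is preserved; each reverse complement follows its kmer.
    """
    seen = set()
    expanded = []
    for kmer in kmers:
        kmer = kmer.strip().upper()
        if not kmer:
            continue  # ignore blank lines
        for candidate in (kmer, reverse_complement(kmer)):
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded
```

The expanded list can be written back one kmer per line and passed to BOO with -q. Palindromic kmers are their own reverse complement and therefore appear only once.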
Once the multiple sequence alignment and kmer list have been generated, create a new folder (e.g. data/hg19/CTCF), then link or copy those files into it. Output files are written to the same folder as the multiple sequence alignment file. BOO is run in two steps.
  1. ./boo -m f -s data/CTCF/hg19_multiz_Broad_GM12878.CTCF.txt -r data/CTCF/chip_genCoord_Broad_GM12878.hg19.CTCF.txt -t data/tree_file.nh -x hg19 -k 100 -w -100:100 -a GGGGCKC -q CTCF_kmer.txt
  2. ./boo -s data/CTCF/hg19_multiz_Broad_GM12878.CTCF.txt -r data/CTCF/chip_genCoord_Broad_GM12878.hg19.CTCF.txt -t data/tree_file.nh -x hg19 -k 100 -w -100:100 -a GGGGCKC -q CTCF_kmer.txt
Once the run has finished, check *BooBED_All.txt for the predicted branch-of-origin of each TFBS.
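
To get a quick overview of the result, the predictions can be tallied by branch. The following is a minimal Python sketch and is not part of the BOO package: the function name count_branches is hypothetical, and it assumes whitespace-separated columns, with the branch label in the fourth column containing no spaces.

```python
from collections import Counter


def count_branches(lines):
    """Count predicted TFBS per branch-of-origin (fourth BED column)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        counts[fields[3]] += 1
    return counts
```

Applied to the lines of a *BooBED_All.txt file, count_branches(...).most_common() lists the branches ordered by the number of predicted binding sites.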

Contact

Yang Zhang and Jian Ma