Tracing the evolution of lineage-specific transcription factor binding sites
Prerequisites
- 32-bit or 64-bit GNU/Linux
- GCC 4.0+ with the Standard C++ Library
- GNU make
Source Code
- version 1.0, April 16, 2014
The code also includes a program that constructs a motif PWM and generates a kmer list from a ChIP-seq peak file. The kmer list can then be used to run BOO.
Arguments
-s <string>
Multiple sequence alignment file covering the ChIP-seq peak regions (see Instructions).
-r <string>
ChIP-seq peak file. This is a BED format file giving the location of each peak on the genome. It should be consistent with the multiple sequence alignment file mentioned above. The fourth column should follow the format <TF>_<num>.<peak_score> (e.g. GATA1_1.200), where <peak_score> must be an integer.
-m <character>
Running mode. f selects format mode (preprocessing step): the program scans the peak file using the kmer list and saves the indexed result to binary files.
-t <string>
Phylogenetic tree file provided with the source code.
-x <string>
Species, e.g. hg19 for human. Currently only human is supported.
-k <int>
ChIP-seq peak threshold. Only peaks with scores higher than this cutoff are treated as true ChIP-seq peak regions.
-w <string>
Window size in the format -<dis_1>:<dis_2>. The window size must be smaller than the peak size.
-a <string>
Kmer list name
-q <string>
Kmer list file. This file contains all the kmers that can be considered true TFBS motifs (one motif per line). An example file can be seen here. We also provide a program to generate a PWM and a motif kmer list file from ChIP-seq peak regions.
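The window argument (-w) can be checked against the peak length before running BOO. The following is a minimal Python sketch, not part of the BOO package. It assumes that <dis_1> and <dis_2> are distances in bp on either side of the peak centre and that the window size is their sum; the function names are hypothetical.

```python
def parse_window(text):
    """Parse a window string such as '-100:100' into (dis_1, dis_2)."""
    left, sep, right = text.partition(":")
    if not sep or not left.startswith("-"):
        raise ValueError(f"expected -<dis_1>:<dis_2>, got {text!r}")
    # Assumption: both distances are non-negative integers in bp.
    return int(left[1:]), int(right)


def window_fits(text, peak_length):
    """Return True if the window is smaller than the peak length.

    Assumption: the window size is dis_1 + dis_2.
    """
    dis_1, dis_2 = parse_window(text)
    return dis_1 + dis_2 < peak_length
```

For example, window_fits("-100:100", 800) checks the window used in the commands on this page against an 800bp peak.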
Outputs
*BooBED_All.txt
Predicted branch-of-origin of TFBSs in BED format. The fourth column indicates the predicted branch-of-origin of each TFBS.
Instructions
Download the package, uncompress the file using
tar -zxvf boo_0.1.0.tar.gz
and follow the instructions in README.txt to install the package. Executable binary programs are also provided. A ChIP-seq peak file in four-column BED format is required to run BOO. It must meet the following requirements (see example file here).
- Each peak should be centered on its summit.
- All peaks should have the same length (e.g. 800bp).
- The fourth column must follow the format <TF>.<num>.<peak_score>, in which <TF> is the TF name, <num> is the peak rank and <peak_score> is an integer score indicating the peak height. Peaks with scores below the user-specified cutoff (-k) are filtered out. An example ChIP-seq peak file can be downloaded here.
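The requirements above can be checked before running BOO. The following is a minimal Python sketch, not part of the BOO package. The function name and the regular expression are hypothetical; the separator between <TF> and <num> is accepted as either "_" or ".", since both spellings appear on this page, and a peak is kept only if its score is strictly higher than the cutoff, which is an assumption about how -k is applied.

```python
import re

# Hypothetical pattern for the fourth column: <TF>, a "_" or "." separator,
# <num>, a ".", then an integer <peak_score>.
NAME_RE = re.compile(r"^(?P<tf>.+?)[._](?P<num>\d+)\.(?P<score>\d+)$")


def check_peaks(lines, cutoff):
    """Validate four-column BED lines; return peaks scoring above cutoff."""
    kept, lengths = [], set()
    for n, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {n}: expected 4 columns, found {len(fields)}")
        chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        match = NAME_RE.match(name)
        if match is None:
            raise ValueError(f"line {n}: unexpected fourth column {name!r}")
        lengths.add(end - start)
        # Assumption: the comparison with the cutoff is strict.
        if int(match.group("score")) > cutoff:
            kept.append((chrom, start, end, name))
    if len(lengths) > 1:
        raise ValueError(f"peaks differ in length: {sorted(lengths)}")
    return kept
```

The sketch does not test whether peaks are centered on their summits, because the summit position is not recorded in a four-column BED file.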
Other required files can be generated using the following scripts and tools.
- multiple sequence alignment file
Download the human multiz46way alignment files from the UCSC Genome Browser, put them in a folder (e.g. Genome_UCSC/Human/hg19/multiz46way/) and uncompress them. Use mafsInRegion to extract the multiple sequence alignments within the ChIP-seq peaks.
mafsInRegion chip_genCoord_Broad_H1hESC.hg19.CTCF.txt hg19_multiz_Broad_H1hESC.CTCF.txt Genome_UCSC/Human/hg19/multiz46way/*.maf
- kmer list file
For a known TF, in order to reduce the calculation time, we use a pre-computed kmer list to search for TFBSs (see example kmer file here). Along with the BOO program, we also provide another tool called BOO_pwm, which generates a PWM motif profile from ChIP-seq peak regions and outputs a kmer list. Users can also provide a custom kmer list generated by any other method. (BOO will not automatically generate the reverse complement of a kmer, so please include both each kmer and its reverse complement.)
./boo_pwm -m F -s data/hg19/CTCF/hg19_multiz_Broad_GM12878.CTCF.txt -r data/hg19/CTCF/chip_genCoord_Broad_GM12878.hg19.CTCF.txt -w -100:100 -k 100 -t data/tree_file.nh -x hg19
./boo_pwm -m w -s data/hg19/CTCF/hg19_multiz_Broad_GM12878.CTCF.txt -r data/hg19/CTCF/chip_genCoord_Broad_GM12878.hg19.CTCF.txt -w -100:100 -k 100 -t data/tree_file.nh -x hg19 -a GGGGCKC
The parameters have the same meaning as in the main program BOO.
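Because BOO does not add reverse complements itself, a custom kmer list can be completed beforehand. The following is a minimal Python sketch, not part of the BOO package; the function names are hypothetical. It also complements IUPAC degenerate codes (such as the K in GGGGCKC), although whether BOO accepts degenerate codes inside the kmer list file is not stated here.

```python
# Complement table for A, C, G, T and the IUPAC degenerate codes.
_COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(kmer):
    """Return the reverse complement of a kmer (upper-cased)."""
    return kmer.upper().translate(_COMPLEMENT)[::-1]


def with_reverse_complements(kmers):
    """Return the kmers plus any missing reverse complements.

    Input order is preserved and duplicates are dropped, so a
    palindromic kmer appears only once.
    """
    seen, out = set(), []
    for kmer in kmers:
        kmer = kmer.strip().upper()
        if not kmer:
            continue  # skip blank lines
        for candidate in (kmer, reverse_complement(kmer)):
            if candidate not in seen:
                seen.add(candidate)
                out.append(candidate)
    return out
```

Writing the returned list one kmer per line gives a file in the layout described for the -q argument.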
Once the multiple sequence alignment and kmer list are generated, create a new folder (e.g. data/hg19/CTCF). Then create links pointing to those files, or copy the files to that folder. Output files are written to the same folder as the multiple sequence alignment file. There are two steps to run BOO: the preprocessing step (-m f), followed by the main run.
./boo -m f -s data/CTCF/hg19_multiz_Broad_GM12878.CTCF.txt -r data/CTCF/chip_genCoord_Broad_GM12878.hg19.CTCF.txt -t data/tree_file.nh -x hg19 -k 100 -w -100:100 -a GGGGCKC -q CTCF_kmer.txt
./boo -s data/CTCF/hg19_multiz_Broad_GM12878.CTCF.txt -r data/CTCF/chip_genCoord_Broad_GM12878.hg19.CTCF.txt -t data/tree_file.nh -x hg19 -k 100 -w -100:100 -a GGGGCKC -q CTCF_kmer.txt
Once the run has finished, check *BooBED_All.txt for the predicted TFBS branch-of-origin.
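To get a quick overview of the result, the fourth column of *BooBED_All.txt can be tallied. The following is a minimal Python sketch, not part of the BOO package. The function name is hypothetical, and the sketch assumes only that the output is whitespace-separated BED text with the branch-of-origin label in the fourth column; the actual label values are whatever BOO writes.

```python
from collections import Counter


def count_branches(lines):
    """Count how many predicted TFBSs are assigned to each branch label."""
    counts = Counter()
    for line in lines:
        # Skip comments and BED header lines, if any are present.
        if line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not a data line with a fourth column
        counts[fields[3]] += 1
    return counts
```

Passing an open file handle for the output file as lines and calling most_common() on the result lists the branches by number of predicted TFBSs.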
Contact
Yang Zhang and Jian Ma