gMK pipeline

# Introduction gMK is a pipeline prduced to calculate the MK-test of each gene in focus species (e.g. *oryza sativa*), using the closely related species information (SNPs from population, e.g. *oryza rufipogon* and *oryza barthii*). gMK is followed the methods proposed by John and Martin in 1991 [1]. William's correction and Yates' correction are emoployed by the pipeline. ## 1. programs need to be install Before running, we need to install some programs which are essentil for the pipeline. Different version of program may lead to different results or problems. [This pipeline are tested in two independent system and works well]

### CentOS release 6.5
### a. blat (version 36)[==blat, pslReps==]
### b. blast+ (bersion 2.3.0)[==blastdbcmd, makeblastdb==]
### c. Perl (V5.10.1; model Math::CDF is necessary for step2)
### d. bash (4.1.2)[Awk (3.1.7)、grep (2.6.3)、sed (4.2.1)、split (8.4)]

### CentOS release 7.4
### a. blat (version 36)[blat, pslReps]
### b. blast+(2.7.1)[blastdbcmd, makeblastdb]
### c. Perl (v5.16.3; model Math::CDF is necessary for step2)
### d. bash (4.2.46) awk (4.0.2) grep(2.20)sed(4.2.2) split (8.22)
## 2. data prepartion Before runing the pipeline, we also need to prepar the sutable data.

### a. the focus species CDS sequences (==Osjap.cds==)
### b. the genome sequences of all the species (==Osjap.fasta, Obart.fasta, Orufi.fasta==; It is also OK if only using one reference genome.)
### c. the SNPs from the population of all the species (==Osjap.vcf, Obart.vcf, Orufi.vcf==)
## 3. running the pipeline ### a. format data (this is used for blastdbcmd to grab sequences from genome), this should be done for all the species (a bunch of files)

`````````````````````````````````````````````````````````
makeblastdb -dbtype nucl -in Osjap.fasta  -parse_seqids  -hash_index
`````````````````````````````````````````````````````````
### b. blat comparison the cds to genomes (three psl files).

`````````````````````````````````````````````````````````
blat Obart.fasta oryza.cds Osjap_cds_2_Obart.psl
`````````````````````````````````````````````````````````
### c. filter the blat results, to keep the best part (three new psl files will be used in the following steps)

`````````````````````````````````````````````````````````
pslReps ../Osjap_cds_2_Obart.psl Osjap_cds_2_Obart_pslReps.psl Osjap_cds_2_Obart_pslReps.psr
`````````````````````````````````````````````````````````
### d. running the pipeline (the pipeline support parallel calculation).

`````````````````````````````````````````````````````````
perl MK_Test_pipeline_step1_v6.pl  

-blat_in1 Osjap_2_itself_pslReps.psl 
-blat_in2 Osjap_2_Orufi_pslReps.psl 
-blat_in3 Osjap_2_Obart_pslReps.psl 

-db_file1 ../../Osjap.fasta 
-db_file2 ../../Orufi.fasta 
-db_file3 ../../Obart.fasta   

-vcf_file1 Osjap.vcf
-vcf_file2 Orufi.vcf 
-vcf_file3 Obart.vcf 

-indi_acc1 10 ##==assume there are 10 individuals in species one==
-indi_acc2 10 
-indi_acc3 10  

-seq_file1 ../../Osjap.cds
-blastdbcmd /opt/soft/ncbi-blast-2.3.0+/bin/blastdbcmd 
-cpu_th 128
`````````````````````````````````````````````````````````
### e. other parameters in pipeline

- ***-blat_th      *** this parameter will determinate the threshold with blat results, it would be range in [0-1] theoretically, the default velue is 0.8.
- ***-split_infile *** this is the option for the split program employed by pipeline, which is like "split -l xx" in reality.
- ***-cpu_th       *** how may process/cpus will be used for calculation.
### f. the following steps. - The a-e steps will produce a time stamp document, e.g. "1503765805_fasta". All the alignment sequences will be kept in bunch of fasta files. runing the step3 will produce a file results for all the genes. step3.pl will driven step2.pl file to calculate one by one; parameters a b c are the individuals numbers for each population.

`````````````````````````````````````````````````````````
perl MK_Test_pipeline_step3.pl a b c 1503765805_fasta >gMK.results.txt
`````````````````````````````````````````````````````````
## 4. other kind of use - If people have well alignment coding gene sequences, running step2.pl independently, can also produce the results; parameters a b c are the individuals numbers for each population.

`````````````````````````````````````````````````````````
perl MK_Test_pipeline_step2.pl a b c -input 1503765805_fasta/LOC_Os12g01680.1.fasta
`````````````````````````````````````````````````````````
# Problems we had known ### 1. Illegal division by zero at MK_Test_pipeline_step2.pl line 473/409. - When There is few information sites in the VCF file, we will get 0 at some raws of the 2x2 table, which makes G-test unavaliable. Thus, for this gene, you may need to find another way to evaluate the selection pressure. ### 2. When one gene come to more than two great matches by blat. - According to the data in rice, it is expected that many genes may have more than two good matches when comparing to the genome sequences. In our pipeline, only the first one will be alignment and calculate the MK-test values. (remaining good matches will be record in the two_best_blat_rel.log file.) This kind of situation may lead to high fix/divergence sites, which because of paralogous. ### 3. When doing the parallel calculation, some IDs may cross two split files. - While the pipeline driven the split program, it will split the *psl_for_mk_test file into bunch of smaller files, and it is possible that one gene may appeared in two smaller files, which may lead to problems. ### 4. The un-precise part of alignment. - Due to the un-precise alignment of blat results, e.g. very tiny exons may attract to alignment with near by of larger exons, this may lead to skew of results. ### 5. The low quality of SNPs may lead to problems. - The pipeline automaticly alignment the SNPs from VCF with the refernece, if the quality of SNPs is low, then the information may mislead the results. So we encourage people filter the SNP first. ### 6. "awk: fatal: cannot open file `scaffold2494.populationb_new.vcf' for reading (No such file or directory)" - This is because there is no any SNPs detected in scaffold2494 in populationb. ### 7. If you have met new problems, please feel free to contact me by following emails. ![image](http://bioinfor.kib.ac.cn/zhanglab/sites/default/files/field/image/kib_email1.png) # Data we had tested - Rice Data (Re-sequencing Data): we developed the pipeline according to the rice genome datas. Oryza sativa, O.rufipogon, O.barthii; the SNPs range from 4.10-9.67 millions with 299-362M genome size. - Rhododendron Data (ddRAD Data): we developed the pipeline according to the Rhododendron delavayi, R.agastum and R.irroratum; the SNPs range from 0.74-2.71 millions 666M genome size. # Version information - version 1.3 the first complete version. - version 1.5 In this version, we add collect_snp_cmd.log to show the commod used in the pipeline. - version 1.6 (Download)In this version, we add *_cmd.log files in sub_xxx.vcf document, which may help people understand how it works. - Free Mind Annotations (Download): This may help people if want to know the detail of the scripts. # Other informations you may want to know Degenerate alphabet| Nucleotides ---|--- M | A/C Y | C/T K | G/T S | C/G R | A/G W | A/T H | ACT V | ACG D | AGT B | CGT X/N | means we know nothing about this site, it will be neglect I | INDEL, it will be neglect during the calculation # reference 1. Adaptive protein evolution at the *Adh* locus in *Drosophila*. John H. McDonald and Martin Kreitman. Nature, Vol 351,652-654 (1991).