gMK is a pipeline prduced to calculate the MK-test of each gene in focus species (e.g. *oryza sativa*), using the closely related species information (SNPs from population, e.g. *oryza rufipogon* and *oryza barthii*). gMK is followed the methods proposed by John and Martin in 1991 [1]. William's correction and Yates' correction are emoployed by the pipeline.
Before running, we need to install some programs which are essentil for the pipeline. Different version of program may lead to different results or problems.
[This pipeline are tested in two independent system and works well]
### CentOS release 6.5
### a. blat (version 36)[==blat, pslReps==]
### b. blast+ (bersion 2.3.0)[==blastdbcmd, makeblastdb==]
### c. Perl (V5.10.1; model Math::CDF is necessary for step2)
### d. bash (4.1.2)[Awk (3.1.7)、grep (2.6.3)、sed (4.2.1)、split (8.4)]
### CentOS release 7.4
### a. blat (version 36)[blat, pslReps]
### b. blast+(2.7.1)[blastdbcmd, makeblastdb]
### c. Perl (v5.16.3; model Math::CDF is necessary for step2)
### d. bash (4.2.46) awk (4.0.2) grep(2.20)sed(4.2.2) split (8.22)
Before runing the pipeline, we also need to prepar the sutable data.
### a. the focus species CDS sequences (==Osjap.cds==)
### b. the genome sequences of all the species (==Osjap.fasta, Obart.fasta, Orufi.fasta==; It is also OK if only using one reference genome.)
### c. the SNPs from the population of all the species (==Osjap.vcf, Obart.vcf, Orufi.vcf==)
`````````````````````````````````````````````````````````
makeblastdb -dbtype nucl -in Osjap.fasta -parse_seqids -hash_index
`````````````````````````````````````````````````````````
`````````````````````````````````````````````````````````
blat Obart.fasta oryza.cds Osjap_cds_2_Obart.psl
`````````````````````````````````````````````````````````
`````````````````````````````````````````````````````````
pslReps ../Osjap_cds_2_Obart.psl Osjap_cds_2_Obart_pslReps.psl Osjap_cds_2_Obart_pslReps.psr
`````````````````````````````````````````````````````````
`````````````````````````````````````````````````````````
perl MK_Test_pipeline_step1_v6.pl
-blat_in1 Osjap_2_itself_pslReps.psl
-blat_in2 Osjap_2_Orufi_pslReps.psl
-blat_in3 Osjap_2_Obart_pslReps.psl
-db_file1 ../../Osjap.fasta
-db_file2 ../../Orufi.fasta
-db_file3 ../../Obart.fasta
-vcf_file1 Osjap.vcf
-vcf_file2 Orufi.vcf
-vcf_file3 Obart.vcf
-indi_acc1 10 ##==assume there are 10 individuals in species one==
-indi_acc2 10
-indi_acc3 10
-seq_file1 ../../Osjap.cds
-blastdbcmd /opt/soft/ncbi-blast-2.3.0+/bin/blastdbcmd
-cpu_th 128
`````````````````````````````````````````````````````````
- ***-blat_th *** this parameter will determinate the threshold with blat results, it would be range in [0-1] theoretically, the default velue is 0.8.
- ***-split_infile *** this is the option for the split program employed by pipeline, which is like "split -l xx" in reality.
- ***-cpu_th *** how may process/cpus will be used for calculation.
The a-e steps will produce a time stamp document, e.g. "1503765805_fasta". All the alignment sequences will be kept in bunch of fasta files. runing the step3 will produce a file results for all the genes. step3.pl will driven step2.pl file to calculate one by one; parameters a b c are the individuals numbers for each population.
`````````````````````````````````````````````````````````
perl MK_Test_pipeline_step3.pl a b c 1503765805_fasta >gMK.results.txt
`````````````````````````````````````````````````````````
If people have well alignment coding gene sequences, running step2.pl independently, can also produce the results; parameters a b c are the individuals numbers for each population.
`````````````````````````````````````````````````````````
perl MK_Test_pipeline_step2.pl a b c -input 1503765805_fasta/LOC_Os12g01680.1.fasta
`````````````````````````````````````````````````````````
When There is few information sites in the VCF file, we will get 0 at some raws of the 2x2 table, which makes G-test unavaliable. Thus, for this gene, you may need to find another way to evaluate the selection pressure.
According to the data in rice, it is expected that many genes may have more than two good matches when comparing to the genome sequences. In our pipeline, only the first one will be alignment and calculate the MK-test values. (remaining good matches will be record in the two_best_blat_rel.log file.) This kind of situation may lead to high fix/divergence sites, which because of paralogous.
While the pipeline driven the split program, it will split the *psl_for_mk_test file into bunch of smaller files, and it is possible that one gene may appeared in two smaller files, which may lead to problems.
Due to the un-precise alignment of blat results, e.g. very tiny exons may attract to alignment with near by of larger exons, this may lead to skew of results.
The pipeline automaticly alignment the SNPs from VCF with the refernece, if the quality of SNPs is low, then the information may mislead the results. So we encourage people filter the SNP first.
This is because there is no any SNPs detected in scaffold2494 in populationb.
Degenerate alphabet | Nucleotides |
---|---|
M | A/C |
Y | C/T |
K | G/T |
S | C/G |
R | A/G |
W | A/T |
H | ACT |
V | ACG |
D | AGT |
B | CGT |
X/N | means we know nothing about this site, it will be neglect |
I | INDEL, it will be neglect during the calculation |
Adaptive protein evolution at the *Adh* locus in *Drosophila*. John H. McDonald and Martin Kreitman. Nature, Vol 351,652-654 (1991).