Genome-level Ka/Ks calculation: gKaKs

 

The main purpose of this program is to align the CDSs from a well-annotated genome to a target genome that has not been annotated yet. It uses blat to find the best match between each CDS and the target genome sequences, and then uses bl2seq to align each CDS to the blat-identified target genome region. After merging the aligned sequences and removing gaps according to the reference CDS codons, it uses codeml/yn00 from PAML and 10 other methods from KaKs_Calculator to compute the Ka/Ks ratio between the CDSs and their homologous sequences in the target genome. The program can also compute the Ka/Ks ratio for two lists of homologous DNA sequences, with one list as CDS sequences and the other as genomic sequences.
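The gap-removal step can be illustrated with a small Python sketch. This is not the gKaKs code itself (the pipeline's own implementation may apply a different rule); it assumes a pairwise alignment given as two equal-length strings with "-" as the gap character, and the function name strip_gapped_codons is hypothetical. The assumed rule: columns where the reference CDS has a gap are skipped, and any reference codon in which the target has a gap is dropped whole, so the output stays in the reference reading frame.

```python
from typing import Tuple


def strip_gapped_codons(ref_aln: str, hit_aln: str, gap: str = "-") -> Tuple[str, str]:
    """Return gap-free sequences made of complete reference codons only.

    Assumed rule (illustrative, not taken from gKaKs):
      * a column with a gap in the reference is an insertion in the target
        and is skipped;
      * a reference codon is kept only if the target has no gap in it.
    """
    if len(ref_aln) != len(hit_aln):
        raise ValueError("aligned sequences must have equal length")

    ref_out, hit_out = [], []
    ref_codon, hit_codon = [], []
    for r, h in zip(ref_aln, hit_aln):
        if r == gap:
            continue  # target insertion: not part of any reference codon
        ref_codon.append(r)
        hit_codon.append(h)
        if len(ref_codon) == 3:
            if gap not in hit_codon:
                ref_out.append("".join(ref_codon))
                hit_out.append("".join(hit_codon))
            ref_codon, hit_codon = [], []
    # A trailing incomplete reference codon (1 or 2 bases) is discarded.
    return "".join(ref_out), "".join(hit_out)


if __name__ == "__main__":
    ref, hit = strip_gapped_codons("ATGGC-CAAATAA", "ATGGTTC-AATAA")
    print(ref)
    print(hit)
```

Both returned sequences have the same length, a multiple of three, which is the form that codon-based methods such as yn00 expect as input.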

 

Program installation:

 

This program requires several widely used bioinformatics programs. Before running it, please make sure that the following programs are installed. We suggest making a link to each of them in /usr/local/bin/.

  1. Blat: http://genome.ucsc.edu/FAQ/FAQblat.html#blat3
  2. Blast (bl2seq, formatdb and fastacmd are used): ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.26/ncbi-blast-2.2.26+-x64-linux.tar.gz
  3. Codeml: http://abacus.gene.ucl.ac.uk/software/paml.html
  4. KaKs_Calculator: https://ngdc.cncb.ac.cn/tools/kaks/download
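A quick way to confirm that the programs listed above are reachable is to look them up on the PATH. The Python sketch below does this with shutil.which. The executable names in REQUIRED_TOOLS are assumptions based on the usual names of these tools; adjust them to match the local installation.

```python
import shutil

# Assumed executable names; they may differ between installations.
REQUIRED_TOOLS = [
    "blat",
    "bl2seq",
    "formatdb",
    "fastacmd",
    "codeml",
    "yn00",
    "KaKs_Calculator",
]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required programs were found.")
```

Any program reported as missing can be installed or linked into /usr/local/bin/ as suggested above.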

 

Data preparation:

 

  1. The CDS sequences of the reference genome in fasta format: -query_seq="ref.cds"
  2. The gff3 file of the reference genome: -gff="ref.gff3"
  3. The target genome sequence in fasta format: -hit_seq="hit.fa"

 

Flowchart:

 

The function of the pipeline:

 

This program can be used to compute the Ka/Ks ratio between the genes in one well-annotated genome and their orthologous sequences in another, closely related genome that has not been annotated. The results can be used:

  1. to compute the divergence time between two species by estimating the average Ks and the mutation rate;
  2. to estimate how many orthologous sequence pairs are under functional constraint;
  3. as evidence for gene annotation.
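The first two uses can be sketched in a few lines of Python. The sketch assumes the common approximation T = Ks / (2 * mu), where mu is the synonymous mutation rate per site per year, and it treats a pair as constrained when Ka/Ks falls below a chosen threshold (1.0 here, an illustrative cut-off). The function names and the example values are hypothetical and do not come from gKaKs output.

```python
from statistics import mean


def divergence_time(ks_values, mutation_rate):
    """Estimate divergence time in years as mean(Ks) / (2 * mu).

    `mutation_rate` is the assumed synonymous substitution rate
    per site per year.
    """
    if mutation_rate <= 0:
        raise ValueError("mutation_rate must be positive")
    return mean(ks_values) / (2.0 * mutation_rate)


def fraction_constrained(ka_ks_values, threshold=1.0):
    """Fraction of sequence pairs whose Ka/Ks is below `threshold`."""
    values = list(ka_ks_values)
    if not values:
        raise ValueError("no Ka/Ks values given")
    return sum(1 for v in values if v < threshold) / len(values)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    ks = [0.08, 0.10, 0.12]
    ka_ks = [0.2, 0.5, 1.3, 0.9]
    print(divergence_time(ks, mutation_rate=1e-8))
    print(fraction_constrained(ka_ks))
```

With an average Ks of 0.1 and a rate of 1e-8 substitutions per site per year, the estimate is about 5 million years. A stricter threshold, or a per-gene statistical test, may be more appropriate when deciding which pairs are under constraint.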

 

Notes:

 

  1. This program works for closely related species. If the species split from each other too long ago, the sequences will have diverged too much for bl2seq to produce good alignments.
  2. Because this program deletes all gaps that cannot be aligned at the codon level, the Ks generated by this approach tends to be underestimated.
  3. In the -problem_loc records, Ka/Ks cannot be computed for the CDS IDs listed under the ===reverse=== category.
  4. For more details, please read the document.

 

Download:

 

Document, Code v1.3

Citation:

  • gKaKs: The pipeline for genome level Ka/Ks calculation. Chengjun Zhang; Jun Wang; Manyuan Long; Chuanzhu Fan. Bioinformatics 2013; doi: 10.1093/bioinformatics/btt009

Bug Report:  

FAQs:

  1. Do you use the gene name or the transcript name as the ID?
  2. Did I use the right command?
  3. Why do I get an error report about the GFF file?
  4. There are a lot of "no good hit" and "more than one hit" entries in problem_locs.log. How should I deal with them?
  5. Can this pipeline be used for genomes with a distant evolutionary relationship? Which of the analyzed genomes fall into this category?
  6. Why do the results of the codeml method vary a lot with the same data?
  7. Input file "File encoding" problem. Updated 2016-04-06.

Mirror Link: Wayne State University @ The University of Chicago

 Comments About Paper: