Preparing GTF file for nextflow rnaseq regarding S. epidermidis 1585

gene_x 0 like s 947 view s

Tags: processing, bacterium, RNA-seq

  1. In the example RP62A

    1. #[IMPORTANT FINAL STEPS]
    2. [1] python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf or python3 ~/Scripts/genbank_to_gtf.py 1585.gb 1585.gtf
    3. [2] python3 modify_gtf.py RP62A.gtf RP62A_m.gtf
    4. python3 modify_gtf.py 1585.gtf 1585_m.gtf #replace CDS with exon!
    5. gene_biotype "protein_coding";
    6. [3] sed -i -e 's/gene_id "rna-/gene_id "gene-/g' RP62A_m.gtf
    7. # --gtf_extra_attributes [string] By default, the pipeline uses the `gene_name` field to obtain additional gene identifiers from the input GTF file when
    8. running Salmon. [default: gene_name]
    9. # --gtf_group_features [string] Define the attribute type used to group features in the GTF file when running Salmon. [default: gene_id]
    10. # --featurecounts_group_type [string] The attribute type used to group feature types in the GTF file when generating the biotype plot with
    11. featureCounts. [default: gene_biotype]
    12. # --featurecounts_feature_type [string] By default, the pipeline assigns reads based on the 'exon' attribute within the GTF file. [default: exon]
    13. #--gtf_extra_attributes gene
    14. #--gtf_group_features Parent --featurecounts_group_type gene_biotype --featurecounts_feature_type CDS (outdated since [CHANGE1,2,3])
    15. # ------------ gff_to_gtf.py, then modify *.gtf file as follows -------------
    16. #Upload the scripts gff_to_gtf.py and modify_gtf.py.
    17. python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf
    18. #[CHANGE1] ID in gene and Parent in CDS --> gene_id; ID in exon --> transcript_id; Name in gene --> gene_name
    19. #[CHANGE2] "Protein Homology" --> "RefSeq"
    20. #[CHANGE3] CDS --> exon
    21. #[CHANGE4?] --gtf_extra_attributes ||||gene|||| refers to ID=gene-SERP_RS00220;Name=noc;gbkey=Gene;||||gene=noc||||; "gene=" exists in both gene and exon. choose gene as gtf_extra_attributes! (see --gtf_extra_attributes gene in the next command)
  2. In the example of 1585 wildtype genome 1585.gb (CP143516-CP143519)

    1. Features Annotated :: Gene; CDS; rRNA; tRNA; ncRNA
    2. Genes (total) :: 2,332 (CDS + RNA)
    3. CDSs (total) :: 2,240 (CDS) grep "CDS" 1585.gtf | wc -l
    4. Genes (coding) :: 2,183 (Pseudo has also CDS?)
    5. CDSs (with protein) :: 2,183
    6. Genes (RNA) :: 92
    7. rRNAs :: 7, 6, 6 (5S, 16S, 23S) grep "rRNA" 1585.gtf | wc -l
    8. complete rRNAs :: 7, 6, 6 (5S, 16S, 23S)
    9. tRNAs :: 69 grep "tRNA" 1585.gtf | wc -l
    10. ncRNAs :: 4 grep "ncRNA" 1585.gtf | wc -l
    11. Pseudo Genes (total) :: 57
    12. CDSs (without protein) :: 57
    13. Pseudo Genes (ambiguous residues) :: 0 of 57
    14. Pseudo Genes (frameshifted) :: 33 of 57
    15. Pseudo Genes (incomplete) :: 30 of 57
    16. Pseudo Genes (internal stop) :: 25 of 57
    17. Pseudo Genes (multiple problems) :: 21 of 57
    18. gene 730216..731766
    19. /locus_tag="V0R30_03330"
    20. rRNA 730216..731766
    21. /locus_tag="V0R30_03330"
    22. #Ziel all ncRNA, rRNA, tmRNA, tRNA as exon, the original types set as gene_biotype "protein_coding"; gene_biotype "tRNA"; ...
    23. cut -d$'\t' -f3 1585_m.gtf | sort -u
    24. exon
    25. #gene
    26. ncRNA
    27. rRNA
    28. tmRNA
    29. tRNA
    30. #Ausführung: STEP0: one record (tmRNA) lacking the "tmRNA" row, manually corrected!
    31. # Transfer-messenger RNA (tmRNA, 10Sa RNA, ssrA) is bacterial RNA having both tRNA and mRNA properties
    32. # STEP1: python3 genbank_to_gtf.py 1585.gb 1585.gtf
    33. # STEP2: python3 modify_gtf.py 1585.gtf 1585_m.gtf #"CDS" --> "exon"
    34. # STEP3: replace transcript_id " --> transcript_id "tx-
    35. # STEP4: ergänzen all lines as gene_id; gene_name; transcript_id; and add all lines annotated with "gene" at the end with 'gene_biotype "protein_coding";'
    36. # using "python3 add_geneid_genename_genebiotype.py 1585_m.gtf 1585_m_.gtf
    37. #DEBUG ERROR Stop at line : CP143516 biopython exon 1 1356 . + . gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"
    38. # Error Message: Cannot find transcript_id!
    39. # STEP5: Process features of types Gene, CDS, rRNA, tRNA, ncRNA, and tmRNA; For rRNA, tRNA, tmRNA, and ncRNA features, it will add an additional line with exon_id following the pattern described. with "python3 add_exonid_to_exon_tRNA_rRNA_tmRNA_ncRNA.py 1585_m_.gtf 1585_m__.gtf"
    40. # STEP6: MANUALLY modify the lines ncRNA (3), tmRNA (1), rRNA (19), tRNA (69): tRNA --> exon, 'gene_biotype "tRNA";';
    41. # with "grep "tRNA" 1585_m_.gtf | wc -l" check they are correctly changed!
    42. #RESULTS:
    43. #CP143516 biopython gene 1 1356 . + . gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; gene_biotype "protein_coding"
    44. #CP143516 biopython exon 1 1356 . + . gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; exon_id "exon-V0R30_00005";
    45. #...
    46. #/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem
    47. #grep ">" genome.transcripts.fa | wc -l
    48. #2240 only exon level, but not tRNA, rRNA, tmRNA, ncRNA records!
    49. # STEP7 (Optional) To include also the 92 records in the results, we can change the annotations records of rRNA, tRNA, tmRNA, ncRNA to exon, so that we can also the records!
    50. # see 1585_m___.gtf
    51. # grep "exon" 1585_m___.gtf | wc -l
    52. # 2332
    53. # grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/genome.transcripts.fa | wc -l
    54. # grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem/genome.transcripts.fa | wc -l
    55. # 2332
    56. #For example exon-V0R30_00090 is a tRNA.
    57. #>tx-V0R30_00090
    58. #GGCTCCTTGGTCAAGCGGTTAAGACACCGCCCTTTCACGGCGGTAACACGGGTTCGAGTCCCGTAGGAGTCACCA
    59. #We can also filter the count number in the downstream analysis from the big matrix!
    60. #tmRNA: V0R30_09330; ncRNA: V0R30_05655, V0R30_06520, V0R30_11060
    61. >tx-V0R30_05655
    62. AGTCATCTGTGGTGTTCGTAAGTTTGCTTTTTATTTGGGCCTAACACTCTTTGATCAGGGAGCCCAATAGGTTTTCTCGCAGCGCACACGCCTCTATAGGAGGACTTGCAAAACGAGAAACAGGGCACCCACCTGTATATAGCAGGCCGAATGATCAAGCTATTTATAACTACGGCATCAACGGACTCTAT
    63. >tx-V0R30_06520
    64. TATTTCGGGTAATCGCTATAGCAATATAGAGGAAAGTCCATGCTCACACAATCTGAGATGATTGTAGTGTTCGTGCTTGATGAAACAATAAATCAAGGCATTAATTTGACGGCAATGAAATAACCTAAGTCATTGGATATGGTTAGAATAGTTTGAAAGTGCCACAGTGACGTAGCTTTTATAGAAATATAAAAGGTGGAACGCGGTAAACCCCTCGAGTGAGCAATCCAAATTGGGTAGGAGCACTTGTTTAGCGGAATTCAACGTATAGACGAGACGATTTTTACGCGAAAGTAAAAATATGTAGACAGATGGTTACCACCGACGTACCAGTGTAACTAGTACACATTATGAGTACAACGGAACAGAACATGGCTTACAGAA
    65. >tx-V0R30_11060
    66. CCGTGCTAGGTGGGGAGGTAGCGGTTCCCTGTACTCGAAATCCGCTTTATGCGAGACTTAATTCCTTTGTTGAGGGCGTATTTTTGTGAAGTCTGCCCGAAGCACGTAGTGTTTGAAGATTTCGGTCCTATGCAATATGAACCCATGAACCATGTCAGGTCCTGACGGAAGCAGCATTAAGTGGATCCTCATATGTGCCGTAGGGTAGCCGAGATTTAGCTGTCGACTTCGGTAACGTTTATGATATGCGTTCGATACAAAGGTGCATGG
    67. nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_1585 --fasta 1585.fasta --gtf 1585_m___.gtf -profile test_full -resume --max_memory 200.GB --max_time 2400.h --save_reference --aligner star_salmon --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum