Analysis of the RNA binding protein (RBP) motifs for RNA-Seq and miRNAs (v3)

gene_x 0 like s 11 view s

Tags: pipeline

There are several alternative R packages and tools to perform motif enrichment analysis for RNA-binding proteins (RBPs), beyond PWMEnrich::motifEnrichment(). Here are the most notable ones:

| Tool / Package           | Enrichment        | Custom Motifs   | CLI or R? | RNA-specific?  |
| ------------------------ | ----------------- | --------------- | --------- | -------------- |
| **PWMEnrich**            | ✅                 | ✅               | R         | ✅              | --> tried (see pipeline.v1-block3)!
| **RBPmap**               | ✅                 | ❌ (uses own db) | Web/CLI   | ✅              | --> tried RBPmap, but it is too slow!
| **Biostrings/TFBSTools** | ❌ (only scanning) | ✅               | R         | ❌              |  #ATtRACT+Biostrings/TFBSTools (tried, pipeline.v1-block3)
| **rmap**                 | ✅ (CLIP-based)    | ❌               | R         | ✅              |
| **Homer**                | ✅                 | ✅               | CLI       | ⚠ RNA optional |
| **MEME (AME, FIMO)**     | ✅                 | ✅               | Web/CLI   | ⚠ Generic      | --> Finally using ATtRACT+FIMO, AMD has BUG, not runnable!

#For me it was suggested to use “RBPmap” or “GraphProt” to do this analysis.
  1. Get 3UTR.fasta, 5UTR.fasta, CDS.fasta and transcripts.fasta

            mRNA Transcript
    ┌────────────┬────────────┬────────────┐
    │   5′ UTR   │     CDS    │   3′ UTR   │
    └────────────┴────────────┴────────────┘
    ↑            ↑            ↑            ↑
    Start        Start        Stop         End
    of           Codon       Codon        of
    Transcript                             Transcript
    
    ✅ Option 1: Use GENCODE and python scripts (CHOSEN!)
    
    #Input: up- and down-, all-regulated files
    ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt    #20086
    ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt  #634
    ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt     #23832
    ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt   #375
    
    #Filtering the down-regulated genes to include only protein_coding genes before extracting 3' UTRs, because
    #1. Only protein_coding genes have well-annotated 3' UTRs
    #3' UTRs are defined as the region after the CDS (coding sequence) and before the poly-A tail.
    #Non-coding RNAs (e.g., lncRNA, snoRNA, miRNA precursors) do not have CDS, and therefore don't have canonical 3' UTRs.
    #2. In GENCODE, most UTR annotations are only provided for transcripts of gene_type = "protein_coding".
    
    cd ~/DATA/Data_Ute/RBPs_analysis/extract_3UTR_5UTR_CDS_transcript
    grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt > MKL-1_wt.EV_vs_parental-up_protein_coding.txt
    grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt > MKL-1_wt.EV_vs_parental-down_protein_coding.txt
    grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt > WaGa_wt.EV_vs_parental-up_protein_coding.txt
    grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt > WaGa_wt.EV_vs_parental-down_protein_coding.txt
    
    #Visit and Download: GENCODE FTP site https://www.gencodegenes.org/human/
        * GTF annotation file (e.g., gencode.v48.annotation.gtf.gz)
        * Corresponding genome FASTA (e.g., GRCh38.primary_assembly.genome.fa.gz)
    wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz
    wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/GRCh38.primary_assembly.genome.fa.gz
    gunzip gencode.v48.annotation.gtf.gz
    gunzip GRCh38.primary_assembly.genome.fa.gz
    
    python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_down
    python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_up  #5988
    python extract_transcript_parts.py WaGa_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_down  #93
    python extract_transcript_parts.py WaGa_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_up  #6538
    
    ✅ Option 2-5 see at the end!
    
  2. Why 3' UTR?

    🧬 miRNA, RBP, or translation/post-transcriptional regulation
    ➡️ Use 3' UTR sequences
    
    Because:
    
    Most miRNA binding and many RBP motifs are located in the 3' UTR.
    
    It’s the primary region for mRNA stability, localization, and translation regulation.
    
    🧠 Example: You're looking for binding enrichment of miRNAs or RNA-binding proteins (PUM, HuR, etc.)
    ✅ Input = 3UTR.fasta
    
    🧪 If you're testing PBRs related to:
    - Translation initiation, upstream ORFs, or 5' cap interaction:
    ➡️ Use 5' UTR
    
    - Coding mutations, protein-level motifs, or translational efficiency:
    ➡️ Use CDS
    
    - General transcriptome-wide motif search (no preference):
    ➡️ Use transcripts, or test all regions separately to localize signal
    
  3. Recommended Workflow with RBPmap https://rbpmap.technion.ac.il (Too slow!)

    RBPmap itself does not compute enrichment p-values or FDR; it's a motif scanning tool.
    
    To get statistically meaningful RBP enrichments, combine RBPmap with custom permutation testing or Fisher’s exact test + multiple testing correction.
    
        1. Prepare foreground (target) and background sequences
    
            Extract 3′ UTRs of:
    
            📉 Downregulated mRNAs (foreground) — likely targeted by upregulated miRNAs
    
            ⚪ A control set of 3′ UTRs — e.g., non-differentially expressed protein-coding genes
    
                grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-all.txt > MKL-1_wt.EV_vs_parental-all_protein_coding.txt
                grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-all.txt > WaGa_wt.EV_vs_parental-all_protein_coding.txt
    
                cut -d',' -f1 MKL-1_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt  #19239
                cut -d',' -f1 MKL-1_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt  #5988
                cut -d',' -f1 MKL-1_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt  #112
                cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
                comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
                grep -Ff background_genes.txt MKL-1_wt.EV_vs_parental-all_protein_coding.txt > MKL-1_wt.EV_vs_parental-background_protein_coding.txt  #13139
    
                cut -d',' -f1 WaGa_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt  #19239
                cut -d',' -f1 WaGa_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt  #6538
                cut -d',' -f1 WaGa_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt  #93
                cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
                comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
                grep -Ff background_genes.txt WaGa_wt.EV_vs_parental-all_protein_coding.txt > WaGa_wt.EV_vs_parental-background_protein_coding.txt  #12608
    
                python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_background
                python extract_transcript_parts.py WaGa_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_background
    
                foreground.fasta: 你的目标(前景)序列,例如下调基因的 3′UTRs。
                background.fasta: 你的背景对照序列,例如未显著差异表达的基因的 3′UTRs。
    
        2. Run RBPmap separately on both sets (in total of 6 calculations)
    
            * Submit both sets of UTRs to RBPmap.
            * Use the same settings (e.g., “human genome”, “high stringency”, "Apply conservation filter" etc.)
            * Choose all RBPs
            * Download motif match outputs for both sets
    
        3. Count motif hits per RBP in each set
    
            You now have:
            For each RBP:
            a: number of target 3′ UTRs with a motif match
            b: number of background 3′ UTRs with a motif match
            c: total number of target 3′ UTRs
            d: total number of background 3′ UTRs
    
        4. Perform Fisher’s Exact Test per RBP
    
            For each RBP, construct a 2x2 table:
    
            Motif Present   Motif Absent
            Foreground (targets)    a   c - a
            Background  b   d - b
    
        5. Adjust p-values for multiple testing
        Use Benjamini-Hochberg (FDR) correction (e.g., in Python or R) across all RBPs tested.
    
        6.✅ Summary
    
            Step    Tool
            Prepare Database of RNA-binding motifs  ATtRACT
            3′ UTR extraction   extract_transcript_parts.py
            Motif scan  RBPmap or FIMO
            Count motif hits    Your own parser (Python or R)
            Fisher’s exact test scipy.stats or fisher.test()
            FDR correction  multipletests() or p.adjust()
    
        python rbp_enrichment.py rbpmap_downregulated.tsv rbpmap_background.tsv rbp_enrichment_results.csv
    
  4. Quick Drop-In Plan (RBPmap Alternative with FIMO for motif scan)

    1. [ATtRACT + FIMO (MEME suite)]
    
        ATtRACT: Database of RNA-binding motifs.
        FIMO: Fast and scriptable motif scanning tool.
    
        #Download RBP motifs (PWM) from ATtRACT DB; Convert to MEME format (if needed); Use FIMO to scan UTR sequences
    
        grep "Homo_sapiens" ATtRACT_db.txt > attract_human.txt
    
        #cut -f12 attract_human.txt | sort | uniq > valid_ids.txt
    
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta MKL-1_background.filtered.3UTR.fasta
        ✅ 筛选完成: 总序列 = 70650
        🧹 已移除过短序列 (<16 nt): 1760
        🟢 保留有效序列 (≥16 nt): 68890
        📁 新背景文件保存为: MKL-1_background.filtered.3UTR.fasta
        # 检查背景文件中有多少序列:
        grep -c "^>" MKL-1_background.filtered.3UTR.fasta
        68890
        # 检查背景 FIMO 命中的总序列数:
        cut -f3 fimo_background_MKL-1_background/fimo.tsv | sort | uniq | wc -l
        67841
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta MKL-1_up.filtered.3UTR.fasta
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta MKL-1_down.filtered.3UTR.fasta
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta WaGa_background.filtered.3UTR.fasta
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta WaGa_up.filtered.3UTR.fasta
        python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta WaGa_down.filtered.3UTR.fasta
    
        python convert_attract_pwm_to_meme.py
    
        fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta
        fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta
        fimo --thresh 1e-4 --oc fimo_background_MKL-1_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta
        fimo --thresh 1e-4 --oc fimo_foreground_WaGa_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta
        fimo --thresh 1e-4 --oc fimo_foreground_WaGa_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta
        fimo --thresh 1e-4 --oc fimo_background_WaGa_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta
    
        #Explanation for the table from FIMO (Find Individual Motif Occurrences), which scans sequences to find statistically significant matches to known motifs (e.g., RNA or DNA binding sites).
    
        Column  Meaning
        motif_id    ID of the motif, as defined in the .meme file
        motif_alt_id    Alternative ID or name for the motif (may be blank or unused)
        sequence_name   Name of the sequence where the motif was found (e.g., gene
        start   Start position (1-based) of the motif match within the sequence
        stop    End position of the motif match
        strand  Strand on which the motif was found: + (forward) or - (reverse)
        score   Motif match score; higher scores indicate better matches
        p-value Statistical significance of the match (lower is more significant)
        q-value Adjusted p-value (False Discovery Rate corrected)
        matched_sequence    The actual sequence in the input that matches the motif
    
        ✅ Example Interpretation
        1338 ENSG00000134871|ENST00000714397|3UTR 103 114 + 23.4126 5.96e-08 0.111 GGAGAGAAGGGA
        motif_id: 1338 — a numeric ID from your motif file
        sequence_name: ENSG00000134871|ENST00000714397|3UTR — refers to the gene, transcript, and region (3′ UTR)
        start–stop: 103–114 — the motif occurs from position 103 to 114
        strand: + — found on the positive strand
        score: 23.41 — high score means strong motif match
        p-value: 5.96e-08 — very statistically significant
        q-value: 0.111 — FDR-corrected p-value
        matched_sequence: GGAGAGAAGGGA — the actual sequence match in the UTR
    
        💡 Tips
        You can map motif_id to RBP (RNA-binding protein) names using an annotation file like ATtRACT_db.txt.
        Typically, q-value < 0.05 is considered significant.
        Duplicate matches in different transcripts of the same gene may occur and are valid.
        Would you like help converting motif_id to RBP names for clarity?
    
        🧠 In most biological contexts:
            * Counting a motif as present multiple times because it's in several transcripts can inflate significance.
            * If you're using Fisher's exact test (as in enrichment), this transcript-level duplication can bias results.
    
        ⚠️ Caveat: If you're studying isoform-specific regulation, then transcript-level data may be valuable and shouldn't be collapsed. But for most general RBP enrichment or gene expression studies, the gene-level collapse is preferred.
    
        #Keep only one match per gene (based on Ensembl Gene ID like ENSG00000134871) for each RBP motif, even if multiple transcripts have hits.
        #python filter_fimo_best_per_gene.py --input fimo_foreground/fimo.tsv --output fimo_foreground/fimo.filtered.tsv
        convert_gtf_to_Gene_annotation_TSV_file.py  #generate gene_annotation.tsv
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_foreground_MKL-1_down/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_foreground_MKL-1_down/fimo.filtered.tsv \
        --output_annotated fimo_foreground_MKL-1_down/fimo.filtered.annotated.tsv
        #21559
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_foreground_MKL-1_up/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_foreground_MKL-1_up/fimo.filtered.tsv \
        --output_annotated fimo_foreground_MKL-1_up/fimo.filtered.annotated.tsv
        #(736661 rows)
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_background_MKL-1_background/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_background_MKL-1_background/fimo.filtered.tsv \
        --output_annotated fimo_background_MKL-1_background/fimo.filtered.annotated.tsv
        #(1869075 rows)
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_foreground_WaGa_down/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_foreground_WaGa_down/fimo.filtered.tsv \
        --output_annotated fimo_foreground_WaGa_down/fimo.filtered.annotated.tsv
        #(20364 rows)
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_foreground_WaGa_up/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_foreground_WaGa_up/fimo.filtered.tsv \
        --output_annotated fimo_foreground_WaGa_up/fimo.filtered.annotated.tsv
        #(805634 rows)
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_background_WaGa_background/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_background_WaGa_background/fimo.filtered.tsv \
        --output_annotated fimo_background_WaGa_background/fimo.filtered.annotated.tsv
        #(1811615 rows)
    
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_foreground_MKL-1_up/fimo.filtered.tsv \
            --fimo_bg fimo_background_MKL-1_background/fimo.filtered.tsv \
            --output rbp_enrichment_MKL-1_up.csv \
            --strategy inclusive
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_foreground_MKL-1_down/fimo.filtered.tsv \
            --fimo_bg fimo_background_MKL-1_background/fimo.filtered.tsv \
            --output rbp_enrichment_MKL-1_down.csv
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_foreground_WaGa_up/fimo.filtered.tsv \
            --fimo_bg fimo_background_WaGa_background/fimo.filtered.tsv \
            --output rbp_enrichment_WaGa_up.csv
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_foreground_WaGa_down/fimo.filtered.tsv \
            --fimo_bg fimo_background_WaGa_background/fimo.filtered.tsv \
            --output rbp_enrichment_WaGa_down.csv
    
        python plot_volcano.py --csv rbp_enrichment_MKL-1_up.csv --output MKL-1_volcano_up.pdf --title "Upregulated MKL-1"
        python plot_rbp_heatmap.py \
        --csvs rbp_enrichment_MKL-1_up.csv rbp_enrichment_MKL-1_down.csv \
        --labels Upregulated Downregulated \
        --output MKL-1_rbp_enrichment_heatmap.pdf
    
        #Column Meaning
        #a  Number of unique foreground UTRs hit by the RBP
        #b  Number of unique background UTRs hit by the RBP
        #c  Total number of foreground UTRs
        #d  Total number of background UTRs (⬅️ this is the value you're asking about)
        #p_value, fdr   From Fisher's exact test on enrichment
    
        #-- Get all genes the number 1621 refers to --
        #AGO2,1621,5050,5732,12987,1.0,1.0   #MKL-1_up
        #motif_ids are 414 and 399
        grep "^414" fimo.filtered.annotated.tsv > AGO2.txt
        grep "^399" fimo.filtered.annotated.tsv >> AGO2.txt
        cut -d$'\t' -f11 AGO2.txt | sort -u > AGO2_uniq.txt
        wc -l AGO2_uniq.txt
        #1621 AGO2_uniq.txt
    
        #工具 功能  关注点 应用场景
        FIMO    精确查找 motif 出现位置 motif 在什么位置出现   找出具体结合位点
        AME 统计 motif 富集情况   哪些 motif 在某组序列中更富集  比较 motif 是否显著出现更多
    
        如你还在做差异表达后的RBP富集分析,可以考虑先用 FIMO 扫描,再用你自己写的代码 + Fisher’s exact test 做类似 AME 的工作,或直接用 AME 做分析
    
        # Generate the attract_human.meme inkl. Gene_name!
        #python generate_named_meme.py pwm.txt attract_human.txt
        python generate_attract_human_meme.py pwm.txt ATtRACT_db.txt
    
        #ERROR during running ame --> DEBUG!
        #--control ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_all.3UTR.fasta \
        ame --control --shuffle-- \
        --oc ame_out \
        --scoring avg \
        --method fisher --verbose 5 ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta attract_human.meme
    
    2. GraphProt2 (ALTERNATIVE_TODO)
    
        ML-based tool using sequence + structure
    
        Pre-trained models for many RBPs
    
        ✅ Advantages:
    
        Local, GPU/CPU supported
    
        More biologically realistic (includes structure)
    
  5. miRNAs motif analysis using ATtRACT + FIMO

    ✅ Goal
    
        * Extract their sequences
        * Generate a background set
        * Run RBP enrichment (e.g., with RBPmap or FIMO)
        * Get p-adjusted enrichment stats (e.g., Fisher + BH)
    
        Input_1. DE results (differential expression file from smallRNA-seq)
            #Input: up- and down-, all-regulated files
            #~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-up.txt  #83
            #~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-down.txt  #34
            #~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-all.txt  #1304
            ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-up.txt  #66
            ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-down.txt  #38
            ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-all.txt  #1304
            #Format: 1st column = miRNA ID (e.g., hsa-miR-21-5p), optionally with other stats.
    
        Input_2. Reference FASTA (Reference sequences from miRBase or GENCODE)
            #From miRBase: https://mirbase.org/download/  https://mirbase.org/download/CURRENT/
            ##miRBase_v21
            #mature.fa.gz → contains mature miRNA sequences
            #hairpin.fa.gz → for pre-miRNAs
    
            cp ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-*.txt .
            #"hsa-miR-3180|hsa-miR-3180-3p"
            #>hsa-miR-3180 MIMAT0018178 Homo sapiens miR-3180
            #UGGGGCGGAGCUUCCGGAG
            #>hsa-miR-3180-3p MIMAT0015058 Homo sapiens miR-3180-3p
            #UGGGGCGGAGCUUCCGGAGGCC
    
        5.1 (Optional, not used!)
    
            #python extract_miRNA_fasta.py EV_vs_parental-up.txt mature_v21.fa up_mature_miRNAs.fa --unmatched up_mature_unmatched.txt  #84+0
            #python extract_miRNA_fasta.py EV_vs_parental-up.txt hairpin_v21.fa up_precursor_miRNAs.fa --unmatched up_precursor_unmatched.txt  #0
            #python extract_miRNA_fasta.py EV_vs_parental-down.txt mature_v21.fa down_mature_miRNAs.fa --unmatched down_mature_unmatched.txt  #34+0
            #python extract_miRNA_fasta.py EV_vs_parental-down.txt hairpin_v21.fa down_precursor_miRNAs.fa --unmatched down_precursor_unmatched.txt  #0
            #python extract_miRNA_fasta.py EV_vs_parental-all.txt mature_v21.fa all_mature_miRNAs.fa --unmatched all_mature_unmatched.txt         #1304+16
            #python extract_miRNA_fasta.py EV_vs_parental-all.txt hairpin_v21.fa all_precursor_miRNAs.fa --unmatched all_precursor_unmatched.txt  #16
            python extract_miRNA_fasta.py untreated_vs_parental_cells-up.txt mature_v21.fa up_mature_miRNAs.fa --unmatched up_mature_unmatched.txt  #67+0
            python extract_miRNA_fasta.py untreated_vs_parental_cells-up.txt hairpin_v21.fa up_precursor_miRNAs.fa --unmatched up_precursor_unmatched.txt  #0
            python extract_miRNA_fasta.py untreated_vs_parental_cells-down.txt mature_v21.fa down_mature_miRNAs.fa --unmatched down_mature_unmatched.txt  #38+0
            python extract_miRNA_fasta.py untreated_vs_parental_cells-down.txt hairpin_v21.fa down_precursor_miRNAs.fa --unmatched down_precursor_unmatched.txt  #0
            python extract_miRNA_fasta.py untreated_vs_parental_cells-all.txt mature_v21.fa all_mature_miRNAs.fa --unmatched all_mature_unmatched.txt         #1304+16
            python extract_miRNA_fasta.py untreated_vs_parental_cells-all.txt hairpin_v21.fa all_precursor_miRNAs.fa --unmatched all_precursor_unmatched.txt  #16
    
        5.2 (Advanced)
            Extract Sequences + Background Set
    
            Inputs:
                * up_miRNA.txt and down_miRNA.txt: DE results (first column = miRNA name, e.g., hsa-miR-21-5p)
                * mature.fa or hairpin.fa from miRBase
    
            Outputs:
                * mirna_up.fa
                * mirna_down.fa
                * mirna_background.fa
    
            #Use all remaining miRNAs as background:
            python prepare_miRNA_sets.py untreated_vs_parental_cells-up.txt untreated_vs_parental_cells-down.txt mature_v21.fa mirna --full-background
            mv mirna_background.fa mirna_full-background.fa
            #Use random subset background. Note that the generated background has the number of maxsize(up, down), in the case is up (84 records):
            python prepare_miRNA_sets.py untreated_vs_parental_cells-up.txt untreated_vs_parental_cells-down.txt mature_v21.fa mirna
            # grep ">" mature_v21.fa | wc -l  #35828
            # grep ">" mirna_full-background.fa | wc -l  #35710-->35723
            # grep ">" mirna_up.fa | wc -l  #84
            # grep ">" mirna_down.fa | wc -l  #34
            # grep ">" mirna_background.fa | wc -l  #84-->67
            # #35,710 + 84 + 34 = 35,828
    
        🔬 What You Can Do Next
        Goal    Tool    Input
        * RBP motif enrichment in pre-miRNAs    RBPmap, FIMO, AME   up_precursor_miRNAs.fa
        * Motif comparison (up vs down miRNAs)  DREME, MEME, HOMER  Up/down mature miRNAs
        * Build background for enrichment   Random subset of other miRNAs   Filtered from hairpin.fa
    
        fimo --thresh 1e-4 --oc fimo_mirna_down attract_human.meme mirna_down.fa
        fimo --thresh 1e-4 --oc fimo_mirna_up attract_human.meme mirna_up.fa
        fimo --thresh 1e-4 --oc fimo_mirna_full-background attract_human.meme mirna_full-background.fa
        fimo --thresh 1e-4 --oc fimo_mirna_background attract_human.meme mirna_background.fa
        #END
    
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_mirna_down/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_mirna_down/fimo.filtered.tsv \
        --output_annotated fimo_mirna_down/fimo.filtered.annotated.tsv
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_mirna_up/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_mirna_up/fimo.filtered.tsv \
        --output_annotated fimo_mirna_up/fimo.filtered.annotated.tsv
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_mirna_full-background/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_mirna_full-background/fimo.filtered.tsv \
        --output_annotated fimo_mirna_full-background/fimo.filtered.annotated.tsv
        python filter_fimo_best_per_gene_annotated.py \
        --input fimo_mirna_background/fimo.tsv \
        --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
        --output_filtered fimo_mirna_background/fimo.filtered.tsv \
        --output_annotated fimo_mirna_background/fimo.filtered.annotated.tsv
    
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_mirna_up/fimo.filtered.tsv \
            --fimo_bg fimo_mirna_full-background/fimo.filtered.tsv \
            --output rbp_enrichment_mirna_up_on_full-background.csv \
            --strategy inclusive
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_mirna_down/fimo.filtered.tsv \
            --fimo_bg fimo_mirna_full-background/fimo.filtered.tsv \
            --output rbp_enrichment_mirna_down_on_full-background.csv \
            --strategy inclusive
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_mirna_up/fimo.filtered.tsv \
            --fimo_bg fimo_mirna_background/fimo.filtered.tsv \
            --output rbp_enrichment_mirna_up_on_subset-background.csv \
            --strategy inclusive
        python run_enrichment.py \
            --attract ATtRACT_db.txt \
            --fimo_fg fimo_mirna_down/fimo.filtered.tsv \
            --fimo_bg fimo_mirna_background/fimo.filtered.tsv \
            --output rbp_enrichment_mirna_down_on_subset-background.csv \
            --strategy inclusive
    
        ✅ RBP Enrichment from RBPmap Results
        🔹 Use RBPmap output (typically CSV or TSV)
        🔹 Compare hit counts in input vs background
        🔹 Perform Fisher's exact test + Benjamini-Hochberg correction
        🔹 Plot significantly enriched RBPs
    
        📁 Requirements
        You’ll need:
    
        File    Description
        rbpmap_up.tsv   RBPmap result file for upregulated set
        rbpmap_background.tsv   RBPmap result file for background set
    
        📝 These should have columns like:
    
        Motif Name or Protein
    
        Sequence Name or Sequence ID
        (If different, I’ll show you how to adjust.
    
        python analyze_rbpmap_enrichment.py rbpmap_up.tsv rbpmap_background.tsv enriched_up.csv enriched_up_plot.png
    
        ✅ Output
        enriched_up.csv
        RBP FG_hits BG_hits pval    padj    enriched
        ELAVL1  24  2   0.0001  0.003   ✅
        HNRNPA1 15  10  0.04    0.06    ❌
    
        enriched_up_plot.png
        Barplot showing top significant RBPs (lowest FDR)
    
        🧰 Customization Options
        Would you like:
    
            * Support for multiple RBPmap files at once?
    
            * To match by RBP family?
    
            * A full report (PDF/HTML) of top hits?
    
            * Let me know, and I’ll tailor the next script!
    
  6. The planned pipeline is very reasonable and scientifically sound for RBP (RNA-binding protein) enrichment using miRNAs as input for FIMO motif scanning and downstream analysis.

    Here’s a breakdown of why your steps make sense, plus a few important suggestions:
    
    ✅ YOUR PIPELINE LOGIC (EXCELLENT)
    1. Motif Scanning with FIMO
    
    fimo --thresh 1e-4 --oc fimo_mirna_* attract_human.meme mirna_*.fa
    ✔️ Uses ATtRACT motifs to scan miRNA sequences.
    
    ✔️ Applies a reasonable p-value threshold (1e-4) to detect motif hits.
    
    📌 If you want more sensitivity (but more noise), you could try 1e-3; for higher stringency, 1e-5.
    
    2. Filter FIMO hits to best per gene and annotate
    
    python filter_fimo_best_per_gene_annotated.py --input ... --annot ... --output ...
    ✔️ Filters redundant hits.
    
    ✔️ Annotates with gene symbols or other metadata — essential for interpretability.
    
    3. RBP Enrichment
    
    python run_enrichment.py --fimo_fg ... --fimo_bg ... --attract ... --output ...
    ✔️ Uses foreground (up/down-regulated) and background (non-differential) miRNAs.
    
    ✔️ The --strategy inclusive is typical and includes all motif hits, not just strongest per gene or region.
    
    ✔️ Running up and down separately = ✅ best practice.
    
    🔍 MINOR IMPROVEMENTS / SUGGESTIONS
    ✅ Use the same background for both up and down analyses:
    You already are — good!
    
    This ensures your enrichment comparison is balanced and interpretable.
    
    ⚠️ Check motif directionality (optional):
    Some RBPs are strand-specific — but since miRNAs are single-stranded, and your FASTA is from mature sequences, you’re probably okay.
    
    ✅ Consider including summary stats:
    How many motifs hit in up/down/background?
    
    How many unique RBPs?
    
    This helps interpret enrichment p-values.
    
    🧪 Statistical validation (optional but helpful):
    If you have many miRNAs (e.g. >30), consider bootstrapping enrichment from random subsets of the background to assess robustness.
    
    ✅ FINAL VERDICT
    Your pipeline is:
    
    ✔️ Technically correct
    
    ✔️ Biologically appropriate
    
    ✔️ Reproducible and clear
    
    🧠 Well-structured for miRNA-RBP motif enrichment
    
  7. Why background differs between RNA-seq and miRNA motif analysis:

    In RNA-seq differential expression analysis, especially for gene-level enrichment (e.g., GO or pathway analysis), the background typically includes all genes detected or expressed, not just a random subset. The logic is:
    
    ✅ “If a gene was testable (i.e., expressed above a threshold), it should be part of the universe we consider when evaluating enrichment.”
    
    This is because the statistical model assumes:
    
    Your universe is all genes tested.
    
    You're asking: Are my differentially expressed genes overrepresented in some category, compared to all testable genes?
    
    🔬 Why a random subset background was used in the miRNA motif pipeline (e.g., RBP enrichment):
    In the prepare_miRNA_sets.py script, the subset was used for one reason only:
    
    📉 To balance the number of background sequences relative to the foreground (up/down) — especially for statistical tests like Fisher's exact test.
    
    But this is optional and not required for correctness. A larger background improves statistical power, but must match distributional properties (e.g., GC content, sequence length) to avoid bias.
    
    ✅ So: Should you use all non-differentially expressed miRNAs as background?
    Yes, that is often more appropriate, assuming:
    
    You have a full set of detectable/testable miRNAs (e.g., all from the FASTA).
    
    You exclude the foreground (up/down) sets.
    
    The background matches the general properties of the foreground set.
    
  8. RBP enrichments via FIMO (Another description; it is the same to the workflow in the point 4)

    1. Collect the 3′ UTR sequences: Use the 3UTR.fasta file generated earlier, filtered to protein-coding and downregulated genes.
    
    2. Prepare Motif Database (MEME format)
    
        * ATtRACT: https://attract.cnic.es
        * RBPDB: http://rbpdb.ccbr.utoronto.ca
        * Ray2013 (CISBP-RNA motifs) — available via MEME Suite
        * [RBPmap motifs (if downloadable)]
        #Example format: rbp_motifs.meme
    
    2. Run FIMO to Scan for RBP Motifs (Similar to RBPmap)
    
        fimo --oc fimo_up rbp_motifs.meme mirna_up.fa
        fimo --oc fimo_down rbp_motifs.meme mirna_down.fa
        fimo --oc fimo_background rbp_motifs.meme mirna_background.fa
        #This produces fimo.tsv in each output folder.
    
    3. Run RBP motif enrichment using MEME Suite using AME (Analysis of Motif Enrichment). Note that FIMO+run_enrichment.py=AME, however, directly using AME returns ERROR:
    
        ame \
        --control control_3UTRs.fasta \
        --oc ame_out \
        --scoring avg \
        --method fisher \
        3UTR.fasta \
        rbp_motifs.meme
    
        Where:
    
        * 3UTR.fasta = your downregulated genes’ 3′ UTRs
        * control_3UTRs.fasta = background UTRs (e.g., random protein-coding genes not downregulated)
        * rbp_motifs.meme = motif file from RBPDB or Ray2013
    
    4. Interpret Results: Output includes RBP motifs enriched in your downregulated mRNAs' 3′ UTRs.
    
        You can then link enriched RBPs to known interactions with your upregulated miRNAs, or explore their regulatory roles.
    
    5. ✅ Bonus: Predict Which mRNAs Are Targets of Your miRNAs
    
        Use tools like: miRanda, TargetScan, miRDB
    
        Then intersect predicted targets with your downregulated genes to identify likely functional interactions.
    
    6. Summary
    
        Goal    Input   Tool / Approach
        RBP enrichment  3UTR.fasta of downregulated genes   AME with RBP motifs
        Background/control  3′ UTRs from non-differential or upregulated genes
        Link miRNA to targets   Use TargetScan / miRanda    Intersect with down genes
    
    7. Would you like:
    
        * Ready-to-use RBP motif .meme file?
        * Script to generate background sequences?
        * Visualization options for the enrichment results?
    
  9. Other options to get sequences of 3UTR, 5UTR, CDS and mRNA transcripts

    ✅ Option 2: Use Ensembl BioMart (web-based, no coding) --> Lasting too long!
    
        Go to Ensembl BioMart https://www.ensembl.org/biomart/martview/7b826bcbd0cec79021977f8dc12a8f61
    
        Select:
    
        Database: Ensembl Genes
        Dataset: Homo sapiens genes (GRCh38 or latest)
    
        Click on “Filters” → expand Region or Gene to limit your selection (optional).
        Click on “Attributes”:
        Under Sequences, check:
        Sequences
        3' UTR sequences
    
        Optionally add gene IDs, transcript IDs, etc.
    
        Click “Results” to view/download the FASTA of 3' UTRs.
    
    ✅ Option 3: Use GENCODE (precompiled annotations) and gffread
    
        Use a tool like gffread (from the Cufflinks or gffread package) to extract 3' UTRs:
    
            #gffread gencode.v44.annotation.gtf -g GRCh38.primary_assembly.genome.fa -w all_utrs.fa -U
            #gffread -w three_prime_utrs.fa -g GRCh38.fa -x cds.fa -y proteins.fa -U -F gencode.gtf
    
            grep -P "\tthree_prime_utr\t" gencode.v48.annotation.gtf > three_prime_utrs.gtf
            gtf2bed < three_prime_utrs.gtf > three_prime_utrs.bed
            bedtools getfasta -fi GRCh38.primary_assembly.genome.fa -bed three_prime_utrs.bed -name -s > three_prime_utrs.fa
    
            gffread gencode.v48.annotation.gtf -g GRCh38.primary_assembly.genome.fa -U -w all_with_utrs.fa
    
        Add -U flag to extract UTRs, and filter post hoc for only 3' UTRs if needed.
    
    ✅ Option 4: Use Bioconductor in R (UCSC-ID, not suitable!)
    
        # Install if not already installed
        if (!requireNamespace("BiocManager", quietly = TRUE))
            install.packages("BiocManager")
        BiocManager::install("GenomicFeatures")
        BiocManager::install("txdbmaker")
        #sudo apt-get update
        #sudo apt-get install libmariadb-dev
        #(optional)sudo apt-get install libmysqlclient-dev
        install.packages("RMariaDB")
    
        # Load library
        library(GenomicFeatures)
    
        # Create TxDb object for human genome
        txdb <- txdbmaker::makeTxDbFromUCSC(genome="hg38", tablename="refGene")
    
        # Extract 3' UTRs by transcript
        utr3 <- threeUTRsByTranscript(txdb, use.names=TRUE)
    
    # View or export as needed
    
    ✅ Option 5: Extract 3′ UTRs Using UCSC Table Browser (GUI method)
        🔗 Website:
        UCSC Table Browser
    
        🔹 Step-by-Step Instructions
        1. Set the basic parameters:
        Clade: Mammal
    
        Genome: Human
    
        Assembly: GRCh38/hg38
    
        Group: Genes and Gene Predictions
    
        Track: GENCODE v44 (or latest)
    
        Table: knownGene or wgEncodeGencodeBasicV44
    
        Choose knownGene for RefSeq-like models or wgEncodeGencodeBasicV44 for GENCODE
    
        2. Region:
        Select: genome (default)
    
        3. Output format:
        Select: sequence
    
        4. Click "get output"
        🔹 Sequence Retrieval Options:
        On the next page (after clicking "get output"), you’ll see sequence options.
    
        Configure as follows:
        ✅ Output format: FASTA
    
        ✅ Which part of the gene: Select only
        → UTRs → 3' UTR only
    
        ✅ Header options: choose if you want gene name,
    
  10. ⚡️ Bonus: Combine with miRNA-mRNA predictions

    Once you have RBPs enriched in downregulated mRNAs, you can intersect:
        * Which RBPs overlap miRNA binding regions (e.g., via CLIPdb or POSTAR)
        * Check if miRNAs and RBPs compete or co-bind
    This can lead to identifying miRNA-RBP regulatory modules.
    

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum