Analysis of the RNA binding protein (RBP) motifs for RNA-Seq and miRNAs (v2)


Tags: pipeline

There are several alternative R packages and tools to perform motif enrichment analysis for RNA-binding proteins (RBPs), beyond PWMEnrich::motifEnrichment(). Here are the most notable ones:

| Tool / Package           | Enrichment        | Custom Motifs   | CLI or R? | RNA-specific?  |
| ------------------------ | ----------------- | --------------- | --------- | -------------- |
| **PWMEnrich**            | ✅                 | ✅               | R         | ✅              |
| **RBPmap**               | ✅                 | ❌ (uses own db) | Web/CLI   | ✅              |
| **Biostrings/TFBSTools** | ❌ (only scanning) | ✅               | R         | ❌              |
| **rmap**                 | ✅ (CLIP-based)    | ❌               | R         | ✅              |
| **Homer**                | ✅                 | ✅               | CLI       | ⚠ RNA optional |
| **MEME (AME, FIMO)**     | ✅                 | ✅               | Web/CLI   | ⚠ Generic      |

Notes:
* RBPmap: try the RBPmap results plus a separate enrichment step!
* Biostrings/TFBSTools: use together with ATtRACT motifs.

Get 3UTR.fasta, 5UTR.fasta, CDS.fasta and transcripts.fasta

        mRNA Transcript
┌────────────┬────────────┬────────────┐
│   5′ UTR   │     CDS    │   3′ UTR   │
└────────────┴────────────┴────────────┘
↑            ↑            ↑            ↑
Start        Start        Stop         End
of           Codon       Codon        of
Transcript                             Transcript
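The layout in the diagram can be made concrete with a small Python sketch. It assumes you already have the spliced transcript sequence and the 0-based CDS coordinates within it; the function name `split_transcript` and the toy sequence are made up for illustration and are not taken from extract_transcript_parts.py. The sketch follows the diagram in treating the stop codon as the end of the CDS; annotation files may draw that boundary differently.

```python
def split_transcript(seq, cds_start, cds_end):
    """Split a spliced mRNA sequence into 5' UTR, CDS and 3' UTR.

    cds_start: 0-based index of the first base of the start codon.
    cds_end:   0-based index one past the last base of the stop codon
               (assumption: the stop codon is counted as part of the CDS).
    """
    if not (0 <= cds_start <= cds_end <= len(seq)):
        raise ValueError("CDS coordinates must lie inside the transcript")
    return {
        "5UTR": seq[:cds_start],
        "CDS": seq[cds_start:cds_end],
        "3UTR": seq[cds_end:],
    }


# Toy transcript: 4 nt 5' UTR + 9 nt CDS (ATG ... TAA) + 5 nt 3' UTR
toy_transcript = "GGAC" + "ATGAAATAA" + "CCTTT"
parts = split_transcript(toy_transcript, cds_start=4, cds_end=13)
for region, sequence in parts.items():
    print(region, sequence)
```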

✅ Option 1: Use GENCODE and python scripts (CHOSEN!)

~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt    #20086
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt  #634
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt     #23832
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt   #375

#Filter the up- and down-regulated gene lists to protein_coding genes before extracting 3' UTRs, because:
#1. Only protein_coding genes have well-annotated 3' UTRs.
#   A 3' UTR is the region after the CDS (coding sequence) and before the poly-A tail.
#   Non-coding RNAs (e.g., lncRNA, snoRNA, miRNA precursors) have no CDS and therefore no canonical 3' UTR.
#2. In GENCODE, most UTR annotations are only provided for transcripts of gene_type = "protein_coding".

grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt > MKL-1_wt.EV_vs_parental-up_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt > MKL-1_wt.EV_vs_parental-down_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt > WaGa_wt.EV_vs_parental-up_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt > WaGa_wt.EV_vs_parental-down_protein_coding.txt

#Visit and Download: GENCODE FTP site https://www.gencodegenes.org/human/
    * GTF annotation file (e.g., gencode.v48.annotation.gtf.gz)
    * Corresponding genome FASTA (e.g., GRCh38.primary_assembly.genome.fa.gz)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/GRCh38.primary_assembly.genome.fa.gz
gunzip gencode.v48.annotation.gtf.gz
gunzip GRCh38.primary_assembly.genome.fa.gz

python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_down
python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_up  #5988
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_down  #93
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_up  #6538

✅ Options 2-5: see the end of this post!

Why 3' UTR?

🧬 miRNA, RBP, or translation/post-transcriptional regulation
➡️ Use 3' UTR sequences

Because:

Most miRNA binding and many RBP motifs are located in the 3' UTR.

It’s the primary region for mRNA stability, localization, and translation regulation.

🧠 Example: You're looking for binding enrichment of miRNAs or RNA-binding proteins (PUM, HuR, etc.)
✅ Input = 3UTR.fasta

🧪 If you're testing RBPs related to:
- Translation initiation, upstream ORFs, or 5' cap interaction:
➡️ Use 5' UTR

- Coding mutations, protein-level motifs, or translational efficiency:
➡️ Use CDS

- General transcriptome-wide motif search (no preference):
➡️ Use transcripts, or test all regions separately to localize signal

Recommended Workflow with RBPmap https://rbpmap.technion.ac.il (Too slow!)

RBPmap itself does not compute enrichment p-values or FDR; it's a motif scanning tool.

To get statistically meaningful RBP enrichments, combine RBPmap with custom permutation testing or Fisher’s exact test + multiple testing correction.

    1. Prepare foreground (target) and background sequences

        Extract 3′ UTRs of:

        📉 Downregulated mRNAs (foreground) — likely targeted by upregulated miRNAs

        ⚪ A control set of 3′ UTRs — e.g., non-differentially expressed protein-coding genes

            grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-all.txt > MKL-1_wt.EV_vs_parental-all_protein_coding.txt
            grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-all.txt > WaGa_wt.EV_vs_parental-all_protein_coding.txt

            cut -d',' -f1 MKL-1_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt  #19239
            cut -d',' -f1 MKL-1_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt  #5988
            cut -d',' -f1 MKL-1_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt  #112
            cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
            comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
            grep -Ff background_genes.txt MKL-1_wt.EV_vs_parental-all_protein_coding.txt > MKL-1_wt.EV_vs_parental-background_protein_coding.txt  #13139

            cut -d',' -f1 WaGa_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt  #19239
            cut -d',' -f1 WaGa_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt  #6538
            cut -d',' -f1 WaGa_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt  #93
            cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
            comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
            grep -Ff background_genes.txt WaGa_wt.EV_vs_parental-all_protein_coding.txt > WaGa_wt.EV_vs_parental-background_protein_coding.txt  #12608

            python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_background
            python extract_transcript_parts.py WaGa_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_background

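            foreground.fasta: your target (foreground) sequences, e.g., the 3′ UTRs of downregulated genes.
            background.fasta: your background control sequences, e.g., the 3′ UTRs of genes that are not significantly differentially expressed.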
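The same background selection can be written in Python with set operations. This is only a sketch: the function name `background_genes` and the toy IDs are illustrative assumptions, and reading the first column of the DE tables is left out.

```python
def background_genes(all_genes, up_genes, down_genes):
    """Return genes that are in the full list but in neither regulated list."""
    regulated = set(up_genes) | set(down_genes)
    return sorted(set(all_genes) - regulated)


# Toy gene IDs, not real data
example_background = background_genes(
    all_genes=["G1", "G2", "G3", "G4"],
    up_genes=["G2"],
    down_genes=["G4"],
)
print(example_background)
```

Comparing whole IDs this way avoids one pitfall of `grep -Ff`, which matches the listed strings anywhere in a line rather than only in the first column.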

    2. Run RBPmap separately on each set (six runs in total: up, down, and background for each of MKL-1 and WaGa)

        * Submit both sets of UTRs to RBPmap.
        * Use the same settings (e.g., “human genome”, “high stringency”, "Apply conservation filter" etc.)
        * Choose all RBPs
        * Download motif match outputs for both sets

    3. Count motif hits per RBP in each set

        You now have:
        For each RBP:
        a: number of target 3′ UTRs with a motif match
        b: number of background UTRs with a motif match
        c: total number of target UTRs
        d: total number of background UTRs
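A minimal sketch of the counting step, assuming the scan output has already been reduced to (RBP, sequence ID) pairs; the function name and the toy hits are illustrative. The point is that a and b count sequences with at least one match, so repeated hits within the same UTR are collapsed.

```python
from collections import defaultdict


def count_sequences_with_hits(hits):
    """hits: iterable of (rbp, sequence_id) pairs, one per motif match.

    Returns {rbp: number of distinct sequences with at least one match}.
    """
    sequences_per_rbp = defaultdict(set)
    for rbp, sequence_id in hits:
        sequences_per_rbp[rbp].add(sequence_id)
    return {rbp: len(ids) for rbp, ids in sequences_per_rbp.items()}


# Toy foreground hits: ELAVL1 matches utr1 twice and utr2 once
fg_hits = [
    ("ELAVL1", "utr1"),
    ("ELAVL1", "utr1"),
    ("ELAVL1", "utr2"),
    ("PUM1", "utr3"),
]
fg_counts = count_sequences_with_hits(fg_hits)
print(fg_counts)
```

RBPs with no hit in a set are simply missing from the dictionary, so use `fg_counts.get(rbp, 0)` when filling the table.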

    4. Perform Fisher’s Exact Test per RBP

        For each RBP, construct a 2x2 table:

        |                      | Motif present | Motif absent |
        | -------------------- | ------------- | ------------ |
        | Foreground (targets) | a             | c - a        |
        | Background           | b             | d - b        |

    5. Adjust p-values for multiple testing
    Use Benjamini-Hochberg (FDR) correction (e.g., in Python or R) across all RBPs tested.
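Steps 4 and 5 can be sketched with the Python standard library alone. In practice `scipy.stats.fisher_exact` and `statsmodels.stats.multitest.multipletests` are the usual choices; the hand-written versions below only show what is being computed. The function names and the toy counts are assumptions for illustration, and the test is one-sided (enrichment in the foreground).

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment.

    a of c foreground sequences and b of d background sequences carry the motif.
    Returns P(X >= a) under the hypergeometric null.
    """
    n_total = c + d
    n_motif = a + b
    upper = min(c, n_motif)
    tail = sum(
        comb(n_motif, k) * comb(n_total - n_motif, c - k)
        for k in range(a, upper + 1)
    )
    return min(tail / comb(n_total, c), 1.0)


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy counts per RBP: (a, b, c, d)
tests = {"RBP_A": (8, 2, 10, 10), "RBP_B": (3, 3, 10, 10)}
names = list(tests)
pvals = [fisher_greater(*tests[name]) for name in names]
padj = benjamini_hochberg(pvals)
for name, p, q in zip(names, pvals, padj):
    print(name, round(p, 4), round(q, 4))
```

The exact sum uses big integers and becomes slow for very large gene sets, which is another reason to switch to SciPy for real data.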

    6.✅ Summary

        | Step                                   | Tool                             |
        | -------------------------------------- | -------------------------------- |
        | Prepare database of RNA-binding motifs | ATtRACT                          |
        | 3′ UTR extraction                      | extract_transcript_parts.py      |
        | Motif scan                             | RBPmap or FIMO                   |
        | Count motif hits                       | Your own parser (Python or R)    |
        | Fisher’s exact test                    | scipy.stats or fisher.test()     |
        | FDR correction                         | multipletests() or p.adjust()    |

    python rbp_enrichment.py rbpmap_downregulated.tsv rbpmap_background.tsv rbp_enrichment_results.csv

Quick Drop-In Plan (RBPmap Alternative with FIMO for motif scan)

1. [ATtRACT + FIMO (MEME suite)]

    ATtRACT: Database of RNA-binding motifs.
    FIMO: Fast and scriptable motif scanning tool.

    #Download RBP motifs (PWM) from ATtRACT DB; Convert to MEME format (if needed); Use FIMO to scan UTR sequences

    grep "Homo_sapiens" ATtRACT_db.txt > attract_human.txt

    #cut -f12 attract_human.txt | sort | uniq > valid_ids.txt

    python convert_attract_pwm_to_meme.py

    fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_background_MKL-1_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_foreground_WaGa_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_foreground_WaGa_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_background_WaGa_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta
    #end
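convert_attract_pwm_to_meme.py is not listed in this post. As a rough idea of what such a conversion involves, the sketch below turns PWM text into the minimal MEME motif format. It assumes each motif starts with a `>ID<TAB>length` header followed by one row of four probabilities per position in A, C, G, T order, and it uses a uniform background; check both assumptions against your own pwm.txt. The function names and the toy motif are illustrative only.

```python
def parse_pwms(text):
    """Parse '>id length' headers followed by rows of A C G T probabilities."""
    motifs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            motifs[current] = []
        elif current is not None:
            motifs[current].append([float(x) for x in line.split()])
    return motifs


def to_meme(motifs):
    """Render motifs in minimal MEME format with a uniform background."""
    out = [
        "MEME version 4",
        "",
        "ALPHABET= ACGT",
        "",
        "Background letter frequencies",
        "A 0.25 C 0.25 G 0.25 T 0.25",
        "",
    ]
    for name, rows in motifs.items():
        out.append(f"MOTIF {name}")
        out.append(f"letter-probability matrix: alength= 4 w= {len(rows)}")
        out.extend(" ".join(f"{value:.6f}" for value in row) for row in rows)
        out.append("")
    return "\n".join(out)


# Toy two-position motif, not a real ATtRACT entry
example_pwm = ">toy1\t2\n0.0\t0.0\t1.0\t0.0\n0.25\t0.25\t0.25\t0.25\n"
meme_text = to_meme(parse_pwms(example_pwm))
print(meme_text)
```

Since the targets are RNA, it may also be worth considering FIMO's `--norc` option so that only the given strand is scanned.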

    #TODO_TOMORROW: mv PBS_analysis RBP_analysis

    #Test
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_WaGa_down/fimo.tsv \
        --fimo_bg fimo_foreground2/fimo.tsv \
        --output rbp_enrichment_test.csv

    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_MKL-1_up/fimo.tsv \
        --fimo_bg fimo_background_MKL-1_background/fimo.tsv \
        --output rbp_enrichment_MKL-1_up.csv
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_MKL-1_down/fimo.tsv \
        --fimo_bg fimo_background_MKL-1_background/fimo.tsv \
        --output rbp_enrichment_MKL-1_down.csv
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_WaGa_up/fimo.tsv \
        --fimo_bg fimo_background_WaGa_background/fimo.tsv \
        --output rbp_enrichment_WaGa_up.csv
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_WaGa_down/fimo.tsv \
        --fimo_bg fimo_background_WaGa_background/fimo.tsv \
        --output rbp_enrichment_WaGa_down.csv

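    | Tool | Function                             | Focus                                           | Use case                                                  |
    | ---- | ------------------------------------ | ----------------------------------------------- | --------------------------------------------------------- |
    | FIMO | Locates individual motif occurrences | Where a motif occurs                            | Finding specific binding sites                            |
    | AME  | Tests motif enrichment statistically | Which motifs are enriched in a set of sequences | Checking whether a motif occurs significantly more often  |

    For RBP enrichment after differential expression, one option is to scan with FIMO first and then apply your own code plus Fisher’s exact test (similar to what AME does); the other is to run AME directly.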

    # Generate attract_human.meme including the gene names!
    #python generate_named_meme.py pwm.txt attract_human.txt
    python generate_attract_human_meme.py pwm.txt ATtRACT_db.txt

    #ERROR during running ame --> DEBUG!
    #--control ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_all.3UTR.fasta \
    ame --control --shuffle-- \
    --oc ame_out \
    --scoring avg \
    --method fisher --verbose 5 ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta attract_human.meme

2. GraphProt2 (ALTERNATIVE_TODO)

    ML-based tool using sequence + structure

    Pre-trained models for many RBPs

    ✅ Advantages:

    Local, GPU/CPU supported

    More biologically realistic (includes structure)

miRNAs motif analysis using ATtRACT + FIMO

✅ Goal

    * Extract the sequences of the differentially expressed miRNAs
    * Generate a background set
    * Run RBP enrichment (e.g., with RBPmap or FIMO)
    * Get p-adjusted enrichment stats (e.g., Fisher + BH)

    5.1 (Optional)
    Input_1. DE results (differential expression file from smallRNA-seq)
        Example file: smallRNA_upregulated.txt
        Format: 1st column = miRNA ID (e.g., hsa-miR-21-5p), optionally with other stats.

    Input_2. Reference FASTA (Reference sequences from miRBase or GENCODE)
        From miRBase:
        mature.fa.gz → contains mature miRNA sequences
        hairpin.fa.gz → for pre-miRNAs

        python extract_miRNA_fasta.py smallRNA_upregulated.txt mature.fa up_mature_miRNAs.fa
        python extract_miRNA_fasta.py smallRNA_downregulated.txt hairpin.fa down_precursor_miRNAs.fa
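extract_miRNA_fasta.py is not shown here; the core of such a script could look like the sketch below. It assumes the FASTA record ID is the first whitespace-separated token of the header line (as in miRBase files) and that the DE table's first column uses the same IDs. The function names and the toy records are made up. The optional U→T conversion is an assumption on my side: miRBase sequences use the RNA alphabet, while a motif file declared with `ALPHABET= ACGT` expects T.

```python
def read_fasta(text):
    """Return {record_id: sequence}; the ID is the first token of the header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def select_mirnas(records, ids, to_dna=True):
    """Keep the requested IDs; optionally rewrite U as T for DNA-alphabet tools."""
    selected = {}
    for mirna_id in ids:
        if mirna_id in records:
            seq = records[mirna_id]
            if to_dna:
                seq = seq.replace("U", "T").replace("u", "t")
            selected[mirna_id] = seq
    return selected


# Toy records with made-up IDs and sequences
toy_fasta = ">toy-miR-1 ACC0001 toy species\nUAGCUU\n>toy-miR-2 ACC0002 toy species\nGGAUCC\n"
selected = select_mirnas(read_fasta(toy_fasta), ["toy-miR-1", "toy-miR-9"])
for mirna_id, seq in selected.items():
    print(f">{mirna_id}\n{seq}")
```

IDs that are missing from the reference (here `toy-miR-9`) are skipped silently; a real script should probably report them.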

    5.2 (Advanced)
        Extract Sequences + Background Set

        Inputs:
            * up_miRNA.txt and down_miRNA.txt: DE results (first column = miRNA name, e.g., hsa-miR-21-5p)
            * mature.fa or hairpin.fa from miRBase

        Outputs:
            * mirna_up.fa
            * mirna_down.fa
            * mirna_background.fa

        python prepare_miRNA_sets.py up_miRNA.txt down_miRNA.txt mature.fa mirna

    🔬 What You Can Do Next

    | Goal                                  | Tool                          | Input                    |
    | ------------------------------------- | ----------------------------- | ------------------------ |
    | RBP motif enrichment in pre-miRNAs    | RBPmap, FIMO, AME             | up_precursor_miRNAs.fa   |
    | Motif comparison (up vs down miRNAs)  | DREME, MEME, HOMER            | Up/down mature miRNAs    |
    | Build background for enrichment       | Random subset of other miRNAs | Filtered from hairpin.fa |

    ✅ RBP Enrichment from RBPmap Results
    🔹 Use RBPmap output (typically CSV or TSV)
    🔹 Compare hit counts in input vs background
    🔹 Perform Fisher's exact test + Benjamini-Hochberg correction
    🔹 Plot significantly enriched RBPs

    📁 Requirements
    You’ll need:

    | File                  | Description                               |
    | --------------------- | ----------------------------------------- |
    | rbpmap_up.tsv         | RBPmap result file for upregulated set    |
    | rbpmap_background.tsv | RBPmap result file for background set     |

    📝 These files should have columns like:

    * Motif Name or Protein
    * Sequence Name or Sequence ID

    If your column names differ, the script has to be adjusted accordingly.

    python analyze_rbpmap_enrichment.py rbpmap_up.tsv rbpmap_background.tsv enriched_up.csv enriched_up_plot.png

    ✅ Output
    enriched_up.csv
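    | RBP     | FG_hits | BG_hits | pval   | padj  | enriched |
    | ------- | ------- | ------- | ------ | ----- | -------- |
    | ELAVL1  | 24      | 2       | 0.0001 | 0.003 | ✅        |
    | HNRNPA1 | 15      | 10      | 0.04   | 0.06  | ❌        |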

    enriched_up_plot.png
    Barplot showing top significant RBPs (lowest FDR)


RBP enrichment via FIMO (same as the workflow in 4)

1. Collect the 3′ UTR sequences: Use the 3UTR.fasta file generated earlier, filtered to protein-coding and downregulated genes.

2. Prepare Motif Database (MEME format)

    * ATtRACT: https://attract.cnic.es
    * RBPDB: http://rbpdb.ccbr.utoronto.ca
    * Ray2013 (CISBP-RNA motifs) — available via MEME Suite
    * [RBPmap motifs (if downloadable)]
    #Example format: rbp_motifs.meme

2. Run FIMO to Scan for RBP Motifs (Similar to RBPmap)

    fimo --oc fimo_up rbp_motifs.meme mirna_up.fa
    fimo --oc fimo_down rbp_motifs.meme mirna_down.fa
    fimo --oc fimo_background rbp_motifs.meme mirna_background.fa
    #This produces fimo.tsv in each output folder.
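A sketch for reading one of these fimo.tsv files into (motif, sequence) pairs. It relies on the standard FIMO column names (`motif_id`, `sequence_name`, `q-value`) and skips the blank and `#` comment lines that FIMO appends at the end of the file. The function name, the optional q-value cutoff, and the toy rows are assumptions for illustration.

```python
import csv


def read_fimo_hits(tsv_text, max_qvalue=None):
    """Return the set of (motif_id, sequence_name) pairs found in FIMO TSV text."""
    lines = [
        line for line in tsv_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    hits = set()
    for row in csv.DictReader(lines, delimiter="\t"):
        q = row.get("q-value") or ""
        # Apply the cutoff only when a q-value is actually present
        if max_qvalue is not None and q != "" and float(q) > max_qvalue:
            continue
        hits.add((row["motif_id"], row["sequence_name"]))
    return hits


# Toy file content, not real FIMO output
example_tsv = (
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n"
    "M1\tRBP_A\tutr1\t5\t11\t+\t12.3\t1e-5\t0.01\tTTTATTT\n"
    "M1\tRBP_A\tutr1\t40\t46\t+\t11.0\t5e-5\t0.2\tTTTGTTT\n"
    "M2\tRBP_B\tutr2\t7\t12\t+\t10.1\t8e-5\t0.3\tTGTAAA\n"
    "\n"
    "# toy comment line\n"
)
hits = read_fimo_hits(example_tsv)
strict_hits = read_fimo_hits(example_tsv, max_qvalue=0.05)
print(sorted(hits))
print(sorted(strict_hits))
```

For real files, pass the file content, for example `read_fimo_hits(open("fimo_up/fimo.tsv").read())`.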

3. Run RBP motif enrichment with AME (Analysis of Motif Enrichment) from the MEME Suite:

    ame \
    --control control_3UTRs.fasta \
    --oc ame_out \
    --scoring avg \
    --method fisher \
    3UTR.fasta \
    rbp_motifs.meme

    Where:

    * 3UTR.fasta = your downregulated genes’ 3′ UTRs
    * control_3UTRs.fasta = background UTRs (e.g., random protein-coding genes not downregulated)
    * rbp_motifs.meme = motif file from RBPDB or Ray2013

4. Interpret Results: Output includes RBP motifs enriched in your downregulated mRNAs' 3′ UTRs.

    You can then link enriched RBPs to known interactions with your upregulated miRNAs, or explore their regulatory roles.

5. ✅ Bonus: Predict Which mRNAs Are Targets of Your miRNAs

    Use tools like: miRanda, TargetScan, miRDB

    Then intersect predicted targets with your downregulated genes to identify likely functional interactions.
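The intersection itself is a one-liner per miRNA once the predictions are loaded. The sketch below assumes the predicted targets have been collected into a dictionary of gene sets and that both sides use the same gene identifiers; the function name and the toy IDs are illustrative, and parsing the miRanda/TargetScan/miRDB output is not covered.

```python
def likely_targets(predicted_targets, down_genes):
    """predicted_targets: {mirna: iterable of target genes}.

    Returns {mirna: sorted list of predicted targets that are also downregulated},
    leaving out miRNAs without any overlap.
    """
    down = set(down_genes)
    result = {}
    for mirna, targets in predicted_targets.items():
        overlap = set(targets) & down
        if overlap:
            result[mirna] = sorted(overlap)
    return result


# Toy predictions and gene list
predicted = {"toy-miR-1": {"GENE_A", "GENE_B"}, "toy-miR-2": {"GENE_C"}}
down_list = ["GENE_B", "GENE_D"]
overlaps = likely_targets(predicted, down_list)
print(overlaps)
```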

6. Summary

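    | Goal                  | Input                                              | Tool / Approach           |
    | --------------------- | -------------------------------------------------- | ------------------------- |
    | RBP enrichment        | 3UTR.fasta of downregulated genes                  | AME with RBP motifs       |
    | Background/control    | 3′ UTRs from non-differential or upregulated genes |                           |
    | Link miRNA to targets | Use TargetScan / miRanda                           | Intersect with down genes |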


Other options to get sequences of 3UTR, 5UTR, CDS and mRNA transcripts

✅ Option 2: Use Ensembl BioMart (web-based, no coding) (takes too long!)

    Go to Ensembl BioMart https://www.ensembl.org/biomart/martview/7b826bcbd0cec79021977f8dc12a8f61

    Select:

    Database: Ensembl Genes
    Dataset: Homo sapiens genes (GRCh38 or latest)

    Click on “Filters” → expand Region or Gene to limit your selection (optional).
    Click on “Attributes”:
    Under Sequences, check:
    Sequences
    3' UTR sequences

    Optionally add gene IDs, transcript IDs, etc.

    Click “Results” to view/download the FASTA of 3' UTRs.

✅ Option 3: Use GENCODE (precompiled annotations) and gffread

    Use a tool like gffread (from the Cufflinks or gffread package) to extract 3' UTRs:

        #gffread gencode.v44.annotation.gtf -g GRCh38.primary_assembly.genome.fa -w all_utrs.fa -U
        #gffread -w three_prime_utrs.fa -g GRCh38.fa -x cds.fa -y proteins.fa -U -F gencode.gtf

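        #Note: GENCODE GTF files label both UTR types simply as "UTR" in column 3 (three_prime_utr is the Ensembl GTF name),
        #so check the feature names first, e.g.: grep -v '^#' gencode.v48.annotation.gtf | cut -f3 | sort -u
        grep -P "\tthree_prime_utr\t" gencode.v48.annotation.gtf > three_prime_utrs.gtf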
        gtf2bed < three_prime_utrs.gtf > three_prime_utrs.bed
        bedtools getfasta -fi GRCh38.primary_assembly.genome.fa -bed three_prime_utrs.bed -name -s > three_prime_utrs.fa

        gffread gencode.v48.annotation.gtf -g GRCh38.primary_assembly.genome.fa -U -w all_with_utrs.fa

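    Note: in gffread, -w writes the spliced transcript sequences (5' UTR + CDS + 3' UTR) and -U discards single-exon transcripts. Neither option extracts UTRs on its own, so the 3' UTRs still have to be cut out afterwards (e.g., using the annotated CDS end).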

✅ Option 4: Use Bioconductor in R (UCSC-ID, not suitable!)

    # Install if not already installed
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("GenomicFeatures")
    BiocManager::install("txdbmaker")
    #sudo apt-get update
    #sudo apt-get install libmariadb-dev
    #(optional)sudo apt-get install libmysqlclient-dev
    install.packages("RMariaDB")

    # Load library
    library(GenomicFeatures)

    # Create TxDb object for human genome
    txdb <- txdbmaker::makeTxDbFromUCSC(genome="hg38", tablename="refGene")

    # Extract 3' UTRs by transcript
    utr3 <- threeUTRsByTranscript(txdb, use.names=TRUE)

    # View or export as needed

✅ Option 5: Extract 3′ UTRs Using UCSC Table Browser (GUI method)
    🔗 Website:
    UCSC Table Browser

    🔹 Step-by-Step Instructions
    1. Set the basic parameters:
    Clade: Mammal

    Genome: Human

    Assembly: GRCh38/hg38

    Group: Genes and Gene Predictions

    Track: GENCODE v44 (or latest)

    Table: knownGene or wgEncodeGencodeBasicV44

    Choose knownGene for RefSeq-like models or wgEncodeGencodeBasicV44 for GENCODE

    2. Region:
    Select: genome (default)

    3. Output format:
    Select: sequence

    4. Click "get output"
    🔹 Sequence Retrieval Options:
    On the next page (after clicking "get output"), you’ll see sequence options.

    Configure as follows:
    ✅ Output format: FASTA

    ✅ Which part of the gene: Select only
    → UTRs → 3' UTR only

    ✅ Header options: choose if you want gene name,

⚡️ Bonus: Combine with miRNA-mRNA predictions

Once you have RBPs enriched in downregulated mRNAs, you can intersect:
    * Which RBPs overlap miRNA binding regions (e.g., via CLIPdb or POSTAR)
    * Check if miRNAs and RBPs compete or co-bind
This can lead to identifying miRNA-RBP regulatory modules.
