gene_x 0 like s 28 view s
Tags: protein, pipeline, RNA-seq
Correlating RNA-seq data with mass spectrometry (MS)-based proteomics data is a powerful way to link transcript-level expression with protein-level abundance. Here’s a step-by-step outline of how to approach it:
Preprocessing and Normalization
For RNA-Seq data:
For MS proteomics data:
Data Mapping and Integration
Gene/Protein Mapping: Use gene symbols, Ensembl IDs, or UniProt IDs to map transcript-level data (RNA-seq) to protein-level data (MS). Be cautious of differences in annotation – e.g., some genes might have multiple protein isoforms.
Common Identifiers:
Data Filtering
Correlation Analysis
Steps:
- Construct a table with rows as genes/proteins and columns as samples.
- For each row, you’ll have two vectors:
* RNA expression (e.g., normalized RNA counts)
* Protein abundance (e.g., log-transformed LFQ intensity)
- Calculate:
from scipy.stats import pearsonr, spearmanr
rna_vector = [...]
protein_vector = [...]
pearson_corr, _ = pearsonr(rna_vector, protein_vector)
spearman_corr, _ = spearmanr(rna_vector, protein_vector)
Visualize and Interpret
Plot scatter plots of RNA vs protein levels for:
Plot correlation coefficients:
Assess overall agreement:
Consider Batch Effects and Biological Variability
If the datasets come from different experiments or platforms, consider batch correction methods (e.g., ComBat from the sva R package).
Be mindful that:
Summary Workflow
✅ Preprocess & normalize both datasets ✅ Map genes/proteins to common IDs ✅ Filter to shared, high-quality data ✅ Calculate correlations ✅ Visualize and interpret
Python script that walks through the key steps of correlating RNA-seq data with proteomics data:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr, spearmanr
import matplotlib.pyplot as plt
import seaborn as sns
# --- Step 1: Load your data ---
# Example: CSVs with genes/proteins as rows, samples as columns
rna_data = pd.read_csv('rna_seq_data.csv', index_col=0) # genes x samples
protein_data = pd.read_csv('proteomics_data.csv', index_col=0) # proteins x samples
# --- Step 2: Map genes to proteins (assuming same identifiers) ---
# Filter to common genes/proteins
common_genes = rna_data.index.intersection(protein_data.index)
rna_data_filtered = rna_data.loc[common_genes]
protein_data_filtered = protein_data.loc[common_genes]
print(f"Number of common genes/proteins: {len(common_genes)}")
# --- Step 3: Log transform if needed (optional) ---
rna_data_log = np.log2(rna_data_filtered + 1)
protein_data_log = np.log2(protein_data_filtered + 1)
# --- Step 4: Calculate gene-wise correlations across samples ---
pearson_corrs = []
spearman_corrs = []
for gene in common_genes:
rna_vector = rna_data_log.loc[gene]
protein_vector = protein_data_log.loc[gene]
pearson_corr, _ = pearsonr(rna_vector, protein_vector)
spearman_corr, _ = spearmanr(rna_vector, protein_vector)
pearson_corrs.append(pearson_corr)
spearman_corrs.append(spearman_corr)
# Save results
correlation_df = pd.DataFrame({
'Gene': common_genes,
'Pearson': pearson_corrs,
'Spearman': spearman_corrs
})
correlation_df.to_csv('gene_protein_correlations.csv', index=False)
print("Saved gene-wise correlation data to 'gene_protein_correlations.csv'")
# --- Step 5: Visualize the correlation distributions ---
sns.histplot(correlation_df['Pearson'], bins=30, kde=True, color='skyblue')
plt.xlabel('Pearson Correlation')
plt.title('Distribution of Pearson Correlations (RNA vs Protein)')
plt.show()
sns.histplot(correlation_df['Spearman'], bins=30, kde=True, color='salmon')
plt.xlabel('Spearman Correlation')
plt.title('Distribution of Spearman Correlations (RNA vs Protein)')
plt.show()
# --- Step 6: Scatter plot for a selected gene/protein ---
example_gene = common_genes[0] # change to your gene of interest
plt.scatter(rna_data_log.loc[example_gene], protein_data_log.loc[example_gene])
plt.xlabel('Log2 RNA Expression')
plt.ylabel('Log2 Protein Abundance')
plt.title(f'RNA vs Protein for {example_gene}')
plt.grid(True)
plt.show()
# Key Notes:
#✅ Replace the filenames (rna_seq_data.csv and proteomics_data.csv) with your actual files.
#✅ The script expects rows to be genes/proteins and columns to be samples.
#✅ Modify or add steps if you have different normalization needs (e.g., DESeq2 normalization).
R script that covers the same steps as above:
# --- Load libraries ---
library(ggplot2)
library(dplyr)
# --- Step 1: Load your data ---
# Example: CSVs with genes/proteins as rows, samples as columns
rna_data <- read.csv("rna_seq_data.csv", row.names = 1)
protein_data <- read.csv("proteomics_data.csv", row.names = 1)
# --- Step 2: Find common genes/proteins ---
common_genes <- intersect(rownames(rna_data), rownames(protein_data))
rna_data_filtered <- rna_data[common_genes, ]
protein_data_filtered <- protein_data[common_genes, ]
cat("Number of common genes/proteins:", length(common_genes), "\n")
# --- Step 3: Log-transform if needed (optional) ---
rna_data_log <- log2(rna_data_filtered + 1)
protein_data_log <- log2(protein_data_filtered + 1)
# --- Step 4: Calculate gene-wise correlations across samples ---
pearson_corrs <- numeric(length(common_genes))
spearman_corrs <- numeric(length(common_genes))
for (i in seq_along(common_genes)) {
gene <- common_genes[i]
rna_vector <- as.numeric(rna_data_log[gene, ])
protein_vector <- as.numeric(protein_data_log[gene, ])
pearson_corrs[i] <- cor(rna_vector, protein_vector, method = "pearson")
spearman_corrs[i] <- cor(rna_vector, protein_vector, method = "spearman")
}
# Save the results
correlation_df <- data.frame(
Gene = common_genes,
Pearson = pearson_corrs,
Spearman = spearman_corrs
)
write.csv(correlation_df, "gene_protein_correlations.csv", row.names = FALSE)
cat("Saved gene-wise correlation data to 'gene_protein_correlations.csv'\n")
# --- Step 5: Visualize the correlation distributions ---
ggplot(correlation_df, aes(x = Pearson)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of Pearson Correlations (RNA vs Protein)",
x = "Pearson Correlation", y = "Frequency") +
theme_minimal()
ggplot(correlation_df, aes(x = Spearman)) +
geom_histogram(bins = 30, fill = "salmon", color = "black") +
labs(title = "Distribution of Spearman Correlations (RNA vs Protein)",
x = "Spearman Correlation", y = "Frequency") +
theme_minimal()
# --- Step 6: Scatter plot for a selected gene/protein ---
example_gene <- common_genes[1] # change this to your gene of interest
df_example <- data.frame(
RNA = as.numeric(rna_data_log[example_gene, ]),
Protein = as.numeric(protein_data_log[example_gene, ])
)
ggplot(df_example, aes(x = RNA, y = Protein)) +
geom_point() +
labs(title = paste("RNA vs Protein for", example_gene),
x = "Log2 RNA Expression", y = "Log2 Protein Abundance") +
theme_minimal() +
geom_smooth(method = "lm", se = FALSE, color = "red")
# Key Notes:
#✅ Replace "rna_seq_data.csv" and "proteomics_data.csv" with your real file names.
#✅ Rows: genes/proteins, columns: samples.
#✅ Change example_gene to any gene of interest for plotting.
#Tweak this for the new dataset or extend it with batch correction or other normalizations?
点赞本文的读者
还没有人对此文章表态
没有评论
Processing Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606
RNA-seq Tam on Acinetobacter baumannii strain ATCC 19606 CP059040.1 (Data_Tam_RNAseq_2024)
Deciphering S. aureus with metatranscriptomics (Marc's project)
© 2023 XGenes.com Impressum