Variant calling (inter-host + intra-host) for Data_Pietschmann_229ECoronavirus_Mutations_2025 (via docker own_viral_ngs) v2

gene_x 0 like s 496 view s

Tags: variation, pipeline, DNA-seq

Input data:

ln -s ../raw_data_2024/hCoV229E_Rluc_R1.fastq.gz hCoV229E_Rluc_R1.fastq.gz
ln -s ../raw_data_2024/hCoV229E_Rluc_R2.fastq.gz hCoV229E_Rluc_R2.fastq.gz
ln -s ../raw_data_2024/p10_DMSO_R1.fastq.gz p10_DMSO_R1.fastq.gz
ln -s ../raw_data_2024/p10_DMSO_R2.fastq.gz p10_DMSO_R2.fastq.gz
ln -s ../raw_data_2024/p10_K22_R1.fastq.gz p10_K22_R1.fastq.gz
ln -s ../raw_data_2024/p10_K22_R2.fastq.gz p10_K22_R2.fastq.gz
ln -s ../raw_data_2024/p10_K7523_R1.fastq.gz p10_K7523_R1.fastq.gz
ln -s ../raw_data_2024/p10_K7523_R2.fastq.gz p10_K7523_R2.fastq.gz
ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20606/p16_DMSO_S29_R1_001.fastq.gz p16_DMSO_R1.fastq.gz
ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20606/p16_DMSO_S29_R2_001.fastq.gz p16_DMSO_R2.fastq.gz
ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20607/p16_K22_S30_R1_001.fastq.gz p16_K22_R1.fastq.gz
ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20607/p16_K22_S30_R2_001.fastq.gz p16_K22_R2.fastq.gz
ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20608/p16_X7523_S31_R1_001.fastq.gz p16_X7523_R1.fastq.gz
ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20608/p16_X7523_S31_R2_001.fastq.gz p16_X7523_R2.fastq.gz

Call variant calling using snippy

ln -s ~/Tools/bacto/db/ .;
ln -s ~/Tools/bacto/envs/ .;
ln -s ~/Tools/bacto/local/ .;
cp ~/Tools/bacto/Snakefile .;
cp ~/Tools/bacto/bacto-0.1.json .;
cp ~/Tools/bacto/cluster.json .;

#download CU459141.gb from GenBank
mv ~/Downloads/sequence\(2\).gb db/PP810610.gb

#setting the following in bacto-0.1.json
    "fastqc": false,
    "taxonomic_classifier": false,
    "assembly": true,
    "typing_ariba": false,
    "typing_mlst": true,
    "pangenome": true,
    "variants_calling": true,
    "phylogeny_fasttree": true,
    "phylogeny_raxml": true,
    "recombination": false, (due to gubbins-error set false)
    "genus": "Alphacoronavirus",
    "kingdom": "Viruses",
    "species": "Human coronavirus 229E",
    "mykrobe": {
        "species": "corona"
    },
    "reference": "db/PP810610.gb"

mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
(bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

Summarize all SNPs and Indels from the snippy result directory.

#Output: snippy/summary_snps_indels.csv
# IMPORTANT_ADAPT the array isolates = ["AYE-S", "AYE-Q", "AYE-WT on Tig4", "AYE-craA on Tig4", "AYE-craA-1 on Cm200", "AYE-craA-2 on Cm200"]
python3 ~/Scripts/summarize_snippy_res.py snippy
cd snippy
#grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv

Using spandx calling variants (almost the same results to the one from viral-ngs!)

mamba activate /home/jhuang/miniconda3/envs/spandx
mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610
cp PP810610.gb  ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk
vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
/home/jhuang/miniconda3/envs/spandx/bin/snpEff build PP810610    #-d
~/Scripts/genbank2fasta.py PP810610.gb
mv PP810610.gb_converted.fna PP810610.fasta    #rename "NC_001348.1 xxxxx" to "NC_001348" in the fasta-file
ln -s /home/jhuang/Tools/spandx/ spandx
(spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref PP810610.fasta --annotation --database PP810610 -resume

# Rerun SNP_matrix.sh due to the error ERROR_CHROMOSOME_NOT_FOUND in the variants annotation
cd Outputs/Master_vcf
(spandx) cp -r ../../snippy/hCoV229E_Rluc/reference .
(spandx) cp ../../spandx/bin/SNP_matrix.sh ./
#Note that ${variant_genome_path}=NC_001348 in the following command, but it was not used after command replacement.
#Adapt "snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf" to
"/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh
(spandx) bash SNP_matrix.sh PP810610 .

Calling inter-host variants by merging the results from snippy+spandx (Manually!)

# Inter-host variants（宿主间变异）:一种病毒在两个人之间有不同的基因变异，这些变异可能与宿主的免疫反应、疾病表现或病毒传播的方式相关。
cp All_SNPs_indels_annotated.txt All_SNPs_indels_annotated_backup.txt
vim All_SNPs_indels_annotated.txt

#in the file ids: grep "$(echo -e '\t')353$(echo -e '\t')" All_SNPs_indels_annotated.txt >> All_SNPs_indels_annotated_.txt
#Replace \n with " All_SNPs_indels_annotated.txt >> All_SNPs_indels_annotated_.txt\ngrep "
#Replace grep " --> grep "$(echo -e '\t')
#Replace " All_ --> $(echo -e '\t')" All_

# Potential intra-host variants: 10871, 19289, 23435.
CHROM   POS     REF     ALT     TYPE    hCoV229E_Rluc_trimmed   p10_DMSO_trimmed        p10_K22_trimmed p10_K7523_trimmed       p16_DMSO_trimmed        p16_K22_trimmed p16_X7523_trimmed       Effect  Impact  Functional_Class        Codon_change    Protein_and_nucleotide_change   Amino_Acid_Length       Gene_name       Biotype
PP810610        1464    T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        gTt/gCt p.Val416Ala/c.1247T>C   6757    CDS_1   protein_coding
PP810610        1699    C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  gtC/gtT p.Val494Val/c.1482C>T   6757    CDS_1   protein_coding
PP810610        6691    C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  tgC/tgT p.Cys2158Cys/c.6474C>T  6757    CDS_1   protein_coding
PP810610        6919    C       G       SNP     G       G       G       G       G       G       G       synonymous_variant      LOW     SILENT  ggC/ggG p.Gly2234Gly/c.6702C>G  6757    CDS_1   protein_coding
PP810610        7294    T       A       SNP     A       A       A       A       A       A       A       missense_variant        MODERATE        MISSENSE        agT/agA p.Ser2359Arg/c.7077T>A  6757    CDS_1   protein_coding
* PP810610       10871   C       T       SNP     C       C/T     T       C/T     C/T     T       C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu3552Phe/c.10654C>T 6757    CDS_1   protein_coding
PP810610        14472   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        aTg/aCg p.Met4752Thr/c.14255T>C 6757    CDS_1   protein_coding
PP810610        15458   T       C       SNP     C       C       C       C       C       C       C       synonymous_variant      LOW     SILENT  Ttg/Ctg p.Leu5081Leu/c.15241T>C 6757    CDS_1   protein_coding
PP810610        16035   C       A       SNP     A       A       A       A       A       A       A       stop_gained     HIGH    NONSENSE        tCa/tAa p.Ser5273*/c.15818C>A   6757    CDS_1   protein_coding
PP810610        17430   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        tTa/tCa p.Leu5738Ser/c.17213T>C 6757    CDS_1   protein_coding
* PP810610       19289   G       T       SNP     G       G       T       G       G       G/T     G       missense_variant        MODERATE        MISSENSE        Gtt/Ttt p.Val6358Phe/c.19072G>T 6757    CDS_1   protein_coding
PP810610        21183   T       G       SNP     G       G       G       G       G       G       G       missense_variant        MODERATE        MISSENSE        tTt/tGt p.Phe230Cys/c.689T>G    1173    CDS_2   protein_coding
PP810610        22636   T       G       SNP     G       G       G       G       G       G       G       missense_variant        MODERATE        MISSENSE        aaT/aaG p.Asn714Lys/c.2142T>G   1173    CDS_2   protein_coding
PP810610        23022   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        tTa/tCa p.Leu843Ser/c.2528T>C   1173    CDS_2   protein_coding
* PP810610       23435   C       T       SNP     C       C       T       C/T     C       C/T     C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu981Phe/c.2941C>T   1173    CDS_2   protein_coding
PP810610        24512   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        Ctc/Ttc p.Leu36Phe/c.106C>T     88      CDS_4   protein_coding
PP810610        24781   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        aCt/aTt p.Thr36Ile/c.107C>T     77      CDS_5   protein_coding
PP810610        25163   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu82Phe/c.244C>T     225     CDS_6   protein_coding
PP810610        25264   C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  gtC/gtT p.Val115Val/c.345C>T    225     CDS_6   protein_coding
PP810610        26838   G       T       SNP     T       T       T       T       T       T       T

Calling intra-host variants using viral-ngs

# Intra-host variants（宿主内变异）：同一个人感染了某种病毒，但在其体内的不同细胞或器官中可能存在多个不同的病毒变异株。

#How to run and debug the viral-ngs docker?
# ---- DEBUG_2025_1: using docker instead ----
mkdir viralngs; cd viralngs
ln -s ~/Tools/viral-ngs_docker/Snakefile Snakefile
ln -s  ~/Tools/viral-ngs_docker/bin bin
cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/refsel.acids refsel.acids
cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/lastal.acids lastal.acids
cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/config.yaml config.yaml
cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-runs.txt samples-runs.txt
cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-depletion.txt samples-depletion.txt
cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-metagenomics.txt samples-metagenomics.txt
cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-assembly.txt samples-assembly.txt
cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-assembly-failures.txt samples-assembly-failures.txt
# Adapt the sample-*.txt

mkdir viralngs/data
mkdir viralngs/data/00_raw

mkdir bams
ref_fa="PP810610.fasta";
#for sample in hCoV229E_Rluc p10_DMSO p10_K22; do
for sample in p10_K7523 p16_DMSO p16_K22 p16_X7523; do
    bwa index ${ref_fa}; \
    bwa mem -M -t 16 ${ref_fa} trimmed/${sample}_trimmed_P_1.fastq trimmed/${sample}_trimmed_P_2.fastq | samtools view -bS - > bams/${sample}_genome_alignment.bam; \
done

conda activate viral-ngs4
#for sample in hCoV229E_Rluc p10_DMSO p10_K22; do
#for sample in p10_K7523 p16_DMSO p16_K22 p16_X7523; do
for sample in p16_K22; do
    picard AddOrReplaceReadGroups I=bams/${sample}_genome_alignment.bam O=~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2025/viralngs/data/00_raw/${sample}.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=$sample RGSM=$sample RGLB=standard RGPU=$sample VALIDATION_STRINGENCY=LENIENT; \
done
conda deactivate

# -- ! Firstly set the samples-assembly.txt empty, so that only focus on running depletion!
docker run -it -v /mnt/md1/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2025/viralngs:/work -v /home/jhuang/Tools/viral-ngs_docker:/home/jhuang/Tools/viral-ngs_docker -v /home/jhuang/REFs:/home/jhuang/REFs -v /home/jhuang/Tools/GenomeAnalysisTK-3.6:/home/jhuang/Tools/GenomeAnalysisTK-3.6 -v /home/jhuang/Tools/novocraft_v3:/home/jhuang/Tools/novocraft_v3 -v /usr/local/bin/gatk:/usr/local/bin/gatk   own_viral_ngs bash
cd /work
snakemake --directory /work --printshellcmds --cores 40

# -- ! Secondly manully run assembly steps
# --> By itereative add the unfinished assembly in the list, each time replace one, and run "snakemake --directory /work --printshellcmds --cores 40"

    # # ---- NOTE that the following steps need rerun --> DOES NOT WORK, USE STRATEGY ABOVE ----
    # #for sample in p10_K22 p10_K7523; do
    # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
    #     bin/read_utils.py merge_bams data/01_cleaned/${sample}.cleaned.bam tmp/01_cleaned/${sample}.cleaned.bam --picardOptions SORT_ORDER=queryname
    #     bin/read_utils.py rmdup_mvicuna_bam tmp/01_cleaned/${sample}.cleaned.bam data/01_per_sample/${sample}.cleaned.bam --JVMmemory 30g
    # done
    #
    # #Note that the error generated by nextflow is from the step gapfill_gap2seq!
    # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
    #     bin/assembly.py assemble_spades data/01_per_sample/${sample}.taxfilt.bam /home/jhuang/REFs/viral_ngs_dbs/trim_clip/contaminants.fasta tmp/02_assembly/${sample}.assembly1-spades.fasta --nReads 10000000 --threads 15 --memLimitGb 12
    # done
    # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
    # for sample in p10_K22 p10_K7523; do
    #     bin/assembly.py order_and_orient tmp/02_assembly/${sample}.assembly1-spades.fasta refsel_db/refsel.fasta tmp/02_assembly/${sample}.assembly2-scaffolded.fasta --min_pct_contig_aligned 0.05 --outAlternateContigs tmp/02_assembly/${sample}.assembly2-alternate_sequences.fasta --nGenomeSegments 1 --outReference tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta --threads 15
    # done
    #
    # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
    #     bin/assembly.py gapfill_gap2seq tmp/02_assembly/${sample}.assembly2-scaffolded.fasta data/01_per_sample/${sample}.cleaned.bam tmp/02_assembly/${sample}.assembly2-gapfilled.fasta --memLimitGb 12 --maskErrors --randomSeed 0 --loglevel DEBUG
    # done

#IMPORTANT: Reun the following commands!
for sample in hCoV229E_Rluc  p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
    bin/assembly.py impute_from_reference tmp/02_assembly/${sample}.assembly2-gapfilled.fasta tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta tmp/02_assembly/${sample}.assembly3-modify.fasta --newName ${sample} --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index  --loglevel DEBUG
done

    # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
    #     bin/assembly.py refine_assembly tmp/02_assembly/${sample}.assembly3-modify.fasta data/01_per_sample/${sample}.cleaned.bam tmp/02_assembly/${sample}.assembly4-refined.fasta --outVcf tmp/02_assembly/${sample}.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 15  --loglevel DEBUG
    #     bin/assembly.py refine_assembly tmp/02_assembly/${sample}.assembly4-refined.fasta data/01_per_sample/${sample}.cleaned.bam data/02_assembly/${sample}.fasta --outVcf tmp/02_assembly/${sample}.assembly4.vcf.gz --min_coverage 3 --novo_params '-r Random -l 20 -g 40 -x 20 -t 100' --threads 15  --loglevel DEBUG
    # done

# -- ! Thirdly set the samples-assembly.txt completely and run "snakemake --directory /work --printshellcmds --cores 40"

# ---------------------------- BUG list of the docker pipeline, mostly are due to the version incompability ----------------------------
#BUG_1: FileNotFoundError: [Errno 2] No such file or directory: '/home/jhuang/Tools/samtools-1.9/samtools': '/home/jhuang/Tools/samtools-1.9/samtools'
#DEBUG_1 (DEPRECATED):
        # - In docker install independent samtools
        conda create -n samtools-1.9-env samtools=1.9 -c bioconda -c conda-forge
        # - persistence the modified docker, next time run own docker image
        docker ps
        #CONTAINER ID   IMAGE                              COMMAND   CREATED         STATUS         PORTS     NAMES
        #881a1ad6a990   quay.io/broadinstitute/viral-ngs   "bash"    8 minutes ago   Up 8 minutes             intelligent_yalow
        docker commit 881a1ad6a990 own_viral_ngs
        docker image ls
        docker run -it own_viral_ngs bash
        #Change the path as "/opt/miniconda/envs/samtools-1.9-env/bin/samtools" in /work/bin/tools/samtools.py
        #         If another tool expect for samtools could not be installed, also use the same method above to install it on own_viral_ngs!
#DEBUG_1_BETTER_SIMPLE: TOOL_VERSION = '1.6' --> '1.9' in ~/Tools/viral-ngs_docker/bin/tools/samtools.py

#BUG_2:
        bin/taxon_filter.py deplete data/00_raw/2040_04.bam tmp/01_cleaned/2040_04.raw.bam tmp/01_cleaned/2040_04.bmtagger_depleted.bam tmp/01_cleaned/2040_04.rmdup.bam data/01_cleaned/2040_04.cleaned.bam --bmtaggerDbs /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/hg19 /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/metagenomics_contaminants_v3 /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA --blastDbs /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/hybsel_probe_adapters /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/metag_v3.ncRNA.mRNA.mitRNA.consensus --threads 15 --srprismMemory 14250 --JVMmemory 50g --loglevel DEBUG
        #2025-05-23 09:58:45,326 - __init__:445:_attempt_install - DEBUG - Currently installed version of blast: 2.7.1-h4422958_6
        #2025-05-23 09:58:45,327 - __init__:448:_attempt_install - DEBUG - Expected version of blast:            2.6.0
        #2025-05-23 09:58:45,327 - __init__:449:_attempt_install - DEBUG - Incorrect version of blast installed. Removing it...
#DEBUG_2: TOOL_VERSION = "2.6.0" --> "2.7.1" in ~/Tools/viral-ngs_docker/bin/tools/blast.py

#BUG_3:
        bin/read_utils.py bwamem_idxstats data/01_cleaned/1762_04.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/1762_04.spike_count.txt --minScoreToFilter 60 --loglevel DEBUG
#DEBUG_3: TOOL_VERSION = "0.7.15" --> "0.7.17" in ~/Tools/viral-ngs_docker/bin/tools/bwa.py

#BUG_4: FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/bin/trimmomatic': '/usr/local/bin/trimmomatic'
#DEBUG_4: TOOL_VERSION = "0.36" --> "0.38" in ~/Tools/viral-ngs_docker/bin/tools/trimmomatic.py

#BUG_5: FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/spades.py': '/usr/bin/spades.py'
#DEBUG_5:  TOOL_VERSION = "0.36" --> "0.38" in ~/Tools/viral-ngs_docker/bin/tools/trimmomatic.py
#                def install_and_get_path(self):
#                        # the conda version wraps the jar file with a shell script
#                        return 'trimmomatic'

#BUG_6: bin/assembly.py order_and_orient tmp/02_assembly/2039_04.assembly1-spades.fasta refsel_db/refsel.fasta tmp/02_assembly/2039_04.assembly2-scaffolded.fasta --min_pct_contig_aligned 0.05 --outAlternateContigs tmp/02_assembly/2039_04.assembly2-alternate_sequences.fasta --nGenomeSegments 1 --outReference tmp/02_assembly/2039_04.assembly2-scaffold_ref.fasta --threads 15 --loglevel DEBUG
2025-05-23 17:40:19,526 - __init__:445:_attempt_install - DEBUG - Currently installed version of mummer4: 4.0.0beta2-pl526hf484d3e_4
2025-05-23 17:40:19,527 - __init__:448:_attempt_install - DEBUG - Expected version of mummer4:            4.0.0rc1
2025-05-23 17:40:19,527 - __init__:449:_attempt_install - DEBUG - Incorrect version of mummer4 installed. Removing it..
DEBUG_6:  TOOL_VERSION = "4.0.0rc1" --> "4.0.0beta2" in ~/Tools/viral-ngs_docker/bin/tools/mummer.py

#BUG_7: bin/assembly.py order_and_orient tmp/02_assembly/2039_04.assembly1-spades.fasta refsel_db/refsel.fasta tmp/02_assembly/2039_04.assembly2-scaffolded.fasta --min_pct_contig_aligned 0.05 --outAlternateContigs tmp/02_assembly/2039_04.assembly2-alternate_sequences.fasta --nGenomeSegments 1 --outReference tmp/02_assembly/2039_04.assembly2-scaffold_ref.fasta --threads 15 --loglevel DEBUG
        File "bin/assembly.py", line 549, in <listcomp>
        base_counts = [sum([len(seg.seq.replace("N", "")) for seg in scaffold]) \
        AttributeError: 'Seq' object has no attribute 'replace'

DEBUG_7:  base_counts = [sum([len(seg.seq.replace("N", "")) for seg in scaffold]) --> base_counts = [sum([len(seg.seq.ungap('N')) for seg in scaffold]) in ~/Tools/viral-ngs_docker/bin/assembly.py

BUG_8:
        bin/assembly.py refine_assembly tmp/02_assembly/1243_2.assembly3-modify.fasta data/01_per_sample/1243_2.cleaned.bam tmp/02_assembly/1243_2.assembly4-refined.fasta --outVcf tmp/02_assembly/1243_2.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 15  --loglevel DEBUG
        File "/work/bin/tools/gatk.py", line 75, in execute
        FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/bin/gatk': '/usr/local/bin/gatk'
#DEBUG_8: -v /usr/local/bin/gatk:/usr/local/bin/gatk in 'docker run' and change default python in the script via a shebang; TOOL_VERSION = "3.8" --> "3.6" in ~/Tools/viral-ngs_docker/bin/tools/gatk.py

BUG_9:
        pyyaml is missing!

#DEBUG_9: NO_ERROR if rerun!
        bin/assembly.py impute_from_reference tmp/02_assembly/2039_04.assembly2-gapfilled.fasta tmp/02_assembly/2039_04.assembly2-scaffold_ref.fasta tmp/02_assembly/2039_04.assembly3-modify.fasta --newName 2039_04 --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index   --loglevel DEBUG
        for sample in 2039_04 2040_04; do
        for sample in 1762_04 1243_2 875_04; do
                bin/assembly.py impute_from_reference tmp/02_assembly/${sample}.assembly2-gapfilled.fasta tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta tmp/02_assembly/${sample}.assembly3-modify.fasta --newName ${sample} --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index  --loglevel DEBUG
        done

#BUG_10: bin/reports.py consolidate_fastqc reports/fastqc/2039_04/align_to_self reports/fastqc/2040_04/align_to_self reports/fastqc/1762_04/align_to_self reports/fastqc/1243_2/align_to_self reports/fastqc/875_04/align_to_self reports/summary.fastqc.align_to_self.txt
#DEBUG_10: File "bin/intrahost.py", line 527 and line 579 in merge_to_vcf
#                       #MODIFIED_BACK
                        samp_to_seqIndex[sampleName] = seq.seq.ungap('-')
                        #samp_to_seqIndex[sampleName] = seq.seq.replace("-", "")

#BUG_11: bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/2039_04.fasta data/02_assembly/2040_04.fasta data/02_assembly/1762_04.fasta data/02_assembly/1243_2.fasta data/02_assembly/875_04.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 15 --loglevel DEBUG
2025-05-26 15:04:19,014 - cmd:195:main_argparse - INFO - command: bin/interhost.py multichr_mafft inFastas=['ref_genome/reference.fasta', 'data/02_assembly/2039_04.fasta', 'data/02_assembly/2040_04.fasta', 'data/02_assembly/1762_04.fasta', 'data/02_assembly/1243_2.fasta', 'data/02_assembly/875_04.fasta'] localpair=True globalpair=None preservecase=True reorder=None gapOpeningPenalty=1.53 ep=0.123 verbose=False outputAsClustal=None maxiters=1000 outDirectory=data/03_multialign_to_ref outFilePrefix=aligned sampleRelationFile=None sampleNameListFile=data/03_multialign_to_ref/sampleNameList.txt threads=15 loglevel=DEBUG tmp_dir=/tmp tmp_dirKeep=False
2025-05-26 15:04:19,014 - cmd:209:main_argparse - DEBUG - using tempDir: /tmp/tmp-interhost-multichr_mafft-nuws9mhp
2025-05-26 15:04:21,085 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.402-0
2025-05-26 15:04:21,085 - __init__:448:_attempt_install - DEBUG - Expected version of mafft:            7.221
2025-05-26 15:04:21,085 - __init__:449:_attempt_install - DEBUG - Incorrect version of mafft installed. Removing it...
#DEBUG_11:  TOOL_VERSION = "7.221" --> "7.402" in ~/Tools/viral-ngs_docker/bin/tools/mafft.py

#BUG_12: bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz PP810610.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de --loglevel DEBUG
        2025-06-10 13:14:07,526 - __init__:445:_attempt_install - DEBUG - Currently installed version of snpeff: 4.3.1t-3
        2025-06-10 13:14:07,527 - __init__:448:_attempt_install - DEBUG - Expected version of snpeff:            4.1l
#DEBUG_12: -v /usr/local/bin/gatk:/usr/local/bin/gatk in 'docker run' and change default python in the script via a shebang; TOOL_VERSION = "4.1l" --> "4.3.1t" in ~/Tools/viral-ngs_docker/bin/tools/snpeff.py

Comparing intra- and inter-host variants, comparing the variants to the alignments of the assemblies to confirm its correctness.

From the step 5, only 5 inter-host variants were confirmed: they are 10871, 19289, 23435.

PP810610    10871   hCoV229E_Rluc   hCoV229E_Rluc       C,T 0.0057070386810399495   0.011348936781066188    1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
PP810610    10871   p10_DMSO    p10_DMSO        C,T 0.0577716643741403  0.10886819833916395 1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
PP810610    10871   p10_K22 p10_K22     C,T 1.0 0.0 1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
PP810610    10871   p10_K7523   p10_K7523       C,T 0.8228321896444167  0.2915587546587828  1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
PP810610    10871   p16_DMSO    p16_DMSO        C,T 0.02927088877062267 0.05682820768240093 1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
PP810610    10871   p16_K22 p16_K22     C,T 0.9911209766925638  0.017600372505084394    1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1
PP810610    10871   p16_X7523   p16_X7523       C,T 0.8776699029126214  0.21473088886794223 1.0 missense_variant    10654C>T    Leu3552Phe  3552    6758    Gene_217_20492  XBA84229.1

PP810610    19289   hCoV229E_Rluc   hCoV229E_Rluc       G,T 0.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
PP810610    19289   p10_DMSO    p10_DMSO        G,T 0.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
PP810610    19289   p10_K22 p10_K22     G,T 1.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
PP810610    19289   p10_K7523   p10_K7523       G,T 0.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
PP810610    19289   p16_DMSO    p16_DMSO        G,T 0.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
PP810610    19289   p16_K22 p16_K22     G,T 0.9884823848238482  0.02276991943361173 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1
PP810610    19289   p16_X7523   p16_X7523       G,T 0.0 0.0 1.0 missense_variant    19073G>T    Gly6358Val  6358    6758    Gene_217_20492  XBA84229.1

PP810610    23435   hCoV229E_Rluc   hCoV229E_Rluc       C,T 0.0 0.0 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
PP810610    23435   p10_DMSO    p10_DMSO        C,T 0.031912415560214305    0.061788026586653055    1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
PP810610    23435   p10_K22 p10_K22     C,T 1.0 0.0 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
PP810610    23435   p10_K7523   p10_K7523       C,T 0.8352090032154341  0.27526984832663026 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
PP810610    23435   p16_DMSO    p16_DMSO        C,T 0.0 0.0 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
PP810610    23435   p16_K22 p16_K22     C,T 0.958498023715415   0.07955912449811753 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1
PP810610    23435   p16_X7523   p16_X7523       C,T 0.13175164058556285 0.22878629157715102 1.0 missense_variant    2941C>T Leu981Phe   981 1173    Gene_20494_24015    XBA84230.1

Generate variant_annot.xls and coverages.xls

sudo chown -R jhuang:jhuang data
# -- generate isnvs_annot_complete__.txt, isnvs_annot_0.05.txt from ~/DATA/Data_Pietschmann_RSV_Probe3/data/04_intrahost
cp isnvs.annot.txt isnvs.annot_complete.txt
~/Tools/csv2xls-0.4/csv_to_xls.py isnvs.annot_complete.txt -d$'\t' -o isnvs.annot_complete.xls
#delete the columns patient, time, Hw and Hs and the header in the xls and save as txt file.

awk '{printf "%.3f\n", $5}' isnvs.annot_complete.csv > f5
cut -f1-4 isnvs.annot_complete.csv > f1_4
cut -f6- isnvs.annot_complete.csv > f6_
paste f1_4 f5 > f1_5
paste f1_5 f6_ > isnvs_annot_complete_.txt
#correct f5 in header of isnvs_annot_complete_.txt to iSNV_freq
#header: chr    pos sample  alleles iSNV_freq   eff_type    eff_codon_dna   eff_aa  eff_aa_pos  eff_prot_len    eff_gene    eff_protein
~/Tools/csv2xls-0.4/csv_to_xls.py isnvs_annot_complete_.txt -d$'\t' -o variant_annot.xls

#MANUALLY generate variant_annot_0.01.csv variant_annot_0.05.csv
awk ' $5 >= 0.05 ' isnvs_annot_complete_.txt > 0.05.csv
cut -f2 0.05.csv

awk ' $5 >= 0.01 ' isnvs_annot_complete_.txt > 0.01.csv
cut -f2 0.05.csv | uniq > ids_0.05
cut -f2 0.01.csv | uniq > ids_0.01

#Replace '\n' with '\\t" isnvs_annot_complete_.txt >> isnvs_annot_0.05.txt\ngrep -P "PP810610\\t' in ids_0.05 and then deleting the 'pos' line
#Replace '\n' with '\\t" isnvs_annot_complete_.txt >> isnvs_annot_0.01.txt\ngrep -P "PP810610\\t'  in ids_0.01 and then deleting the 'pos' line
#Run ids_0.05 and ids_0.01

cp ../../Outputs/Master_vcf/All_SNPs_indels_annotated.txt ../../Outputs/Master_vcf/All_SNPs_indels_annotated.txt hCoV229E_Rluc_variants
# Delete the three records which already reported in intra-host results hCoV229E_Rluc_variants: they are 10871, 19289, 23435.
PP810610       10871   C       T       SNP     C       C/T     T       C/T     C/T     T       C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu3552Phe/c.10654C>T 6757    CDS_1   protein_coding
PP810610       19289   G       T       SNP     G       G       T       G       G       G/T     G       missense_variant        MODERATE        MISSENSE        Gtt/Ttt p.Val6358Phe/c.19072G>T 6757    CDS_1   protein_coding
PP810610       23435   C       T       SNP     C       C       T       C/T     C       C/T     C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu981Phe/c.2941C>T   1173    CDS_2   protein_coding

~/Tools/csv2xls-0.4/csv_to_xls.py isnvs_annot_0.05.txt isnvs_annot_0.01.txt hCoV229E_Rluc_variants -d$'\t' -o variant_annot.xls
#Modify sheetname to variant_annot_0.05 and variant_annot_0.01 and add the header in Excel file.
#Note in the complete list, Set 2024 is NOT a subset of Set 2025 because the element 26283 is in set 2024 but missing from set 2025.

# -- calculate the coverage
samtools depth ./data/02_align_to_self/hCoV229E_Rluc.mapped.bam > hCoV229E_Rluc_cov.txt
samtools depth ./data/02_align_to_self/p10_DMSO.mapped.bam > p10_DMSO_cov.txt
samtools depth ./data/02_align_to_self/p10_K22.mapped.bam > p10_K22_cov.txt
samtools depth ./data/02_align_to_self/p10_K7523.mapped.bam > p10_K7523_cov.txt
~/Tools/csv2xls-0.4/csv_to_xls.py hCoV229E_Rluc_cov.txt p10_DMSO_cov.txt p10_K22_cov.txt p10_K7523_cov.txt -d$'\t' -o coverages.xls
#draw coverage and see if they are continuous?
samtools depth ./data/02_align_to_self/p16_DMSO.mapped.bam > p16_DMSO_cov.txt
samtools depth ./data/02_align_to_self/p16_K22.mapped.bam > p16_K22_cov.txt
samtools depth ./data/02_align_to_self/p16_X7523.mapped.bam > p16_K7523_cov.txt
~/Tools/csv2xls-0.4/csv_to_xls.py p16_DMSO_cov.txt p16_K22_cov.txt p16_K7523_cov.txt -d$'\t' -o coverages_p16.xls

        # Load required packages
        library(ggplot2)
        library(dplyr)

        # Read the coverage data
        cov_data <- read.table("p16_K7523_cov.txt", header = FALSE, sep = "\t",
                        col.names = c("Chromosome", "Position", "Coverage"))

        # Create full position range for the given chromosome
        full_range <- data.frame(Position = seq(min(cov_data$Position), max(cov_data$Position)))

        # Merge with actual coverage data and fill missing positions with 0
        cov_full <- full_range %>%
        left_join(cov_data[, c("Position", "Coverage")], by = "Position") %>%
        mutate(Coverage = ifelse(is.na(Coverage), 0, Coverage))

        # Save the plot to PNG
        png("p16_K7523_coverage_filled.png", width = 1200, height = 600)

        ggplot(cov_full, aes(x = Position, y = Coverage)) +
        geom_line(color = "steelblue", size = 0.3) +
        labs(title = "Coverage Plot for p16_K7523 (Missing = 0)",
        x = "Genomic Position",
        y = "Coverage Depth") +
        theme_minimal() +
        theme(
        plot.title = element_text(hjust = 0.5),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12)
        )

        dev.off()

(Optional) Consensus sequences of each and of all isolates

cat PP810610.1.fa OZ035258.1.fa MZ712010.1.fa OK662398.1.fa OK625404.1.fa KF293664.1.fa NC_002645.1.fa > all.fa
cp data/02_assembly/*.fasta ./
for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523; do \
mv ${sample}.fasta ${sample}.fa
cat all.fa ${sample}.fa >> all.fa
done

cat RSV_dedup.fa all.fa > RSV_all.fa
mafft --clustalout --adjustdirection RSV_all.fa > RSV_all.aln
snp-sites RSV_all.aln -o RSV_all_.aln

Report

Please find attached the variant analysis results for Thomas. Variant frequencies in the new samples are highlighted in yellow.

Although PP810610 is used as the reference, only differences observed in the samples p10_DMSO, p10_K22, p10_K7523, p16_DMSO, p16_K22, and p16_X7523 compared to hCoV229E_Rluc are reported in the sheets variant_annot_0.05 and variant_annot_0.01 (see variant_annot.xls). Variants already present in hCoV229E_Rluc are excluded from these sheets. In total, 17 mutations were found in hCoV229E_Rluc relative to PP810610, detailed in the sheet “hCoV229E_Rluc_variants” (see variant_annot.xls).

------ Explanation of iSNV_freq in the sheets variant_annot_0.05 and variant_annot_0.01 ------

The iSNV_freq column shows the frequency of the second allele at each position. For example, at position 23435 on chr PP810610:

chr               Position    Sample            Alleles    iSNV_freq
PP810610    23435    hCoV229E_Rluc    C,T        0
PP810610    23435    p10_DMSO           C,T       0.032
PP810610    23435    p10_K22              C,T       0.995
PP810610    23435    p10_K7523          C,T       0.835
PP810610    23435    p16_DMSO          C,T        0
PP810610    23435    p16_K22              C,T       0.958
PP810610    23435    p16_X7523          C,T       0.132

The second allele (T) frequencies are:
0 (only C)
0.032 (3.2% T)
0.995 (99.5% T)
0.835 (83.5% T)
0 (only C)
0.958 (95.8% T)
0.132 (13.2% T)

# --

Regarding the mutation at position 19289 — you're absolutely right, and I had also noticed the discrepancy.

In the 2024 analysis, I performed intra-host variant calling, which detects only those variants with frequencies strictly between 0% and 100% within a single sample. Since position 19289 showed 100% G in p10_DMSO, 100% T in p10_K22, and 100% G in p10_K7523, it was not identified as an intra-host variant at that time. Rather, it's a clear example of an inter-host variant — a fixed difference between samples.

In the 2025 analysis, I again used intra-host variant calling. This time, the mutation at position 19289 in p16_K22 was detected at 98.8% T, which falls within the threshold and therefore appears in the intra-host variant table.

After noticing this, I also ran a dedicated inter-host variant calling analysis, which specifically highlights differences between samples rather than within them. The results can be found in the third table ("hCoV229E_Rluc_variants") of the variant_annot.xls file I sent you previously. As you’ll see, all 17 positions are identical across the 7 samples, indicating that no additional inter-host variants were detected beyond what we had already observed.

Lastly, please find the coverage data in the attached files.

# --

Just following up on the mutation at position 19289. By tweaking some settings in the inter-host variant calling, we can also detect variants at positions like 19289. However, in these results, a “/” indicates intra-host variants that require further validation through intra-host variant calling. The intra-host variant calling uses a more precise mapping strategy, enabling a more accurate estimation of allele frequencies.

Here’s an example from the inter-host variant table showing the mutation at 19289 with the adjusted settings:

        CHROM       POS      REF    ALT    TYPE    hCoV229E_Rluc    p10_DMSO    p10_K22    p10_K7523    p16_DMSO    p16_K22    p16_X7523
        PP810610    19289    G      T      SNP          G               G           T           G          G           G/T          G

like unlike

点赞本文的读者

还没有人对此文章表态

本文有评论

没有评论

Variant calling (inter-host + intra-host) for Data_Pietschmann_229ECoronavirus_Mutations_2025 (via docker own_viral_ngs) v2

本文有评论

看文章，发评论，不要沉默

最受欢迎文章

最新文章

最多评论文章

推荐相似文章