gensub in Snakemake's shell command


Elysee
# md5sum on fastq folder on cluster
rule md5sum_fastq_cluster:
     input:
         path_cluster+'/'+project_name+'/'+project_name+'.csv'
     output:
         path_cluster+'/'+project_name+'/'+'md5sum.txt'
     shell:
         """find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1, gensub( ".*/", "", $2 )}}' | sort > {output}"""
 
 
# md5sum on fastq folder on remote server
rule md5sum_fastq_SAN:
     input:
         copyFASTQdone
     output:
         SFTPsan.remote(server_san+path_san+'/'+project_name+'/md5sum.txt')
     shell:
         """ssh imrb@{server_san} "find {path_san}/{project_name} -type f -name '*.fastq.gz' -exec md5sum {{}} + | awk '{{print \$1, gensub( ".*/", "", \$2 )}}' | sort" > {output}"""

--------------------------------------------------------------------------
awk: cmd. line:1: {print $1, gensub( .*/, , $2 )}
awk: cmd. line:1:                    ^ syntax error
awk: cmd. line:1: {print $1, gensub( .*/, , $2 )}

Apparently, my gensub syntax is wrong.
Before adding the gensub call, the two rules used these two shell commands:

"""find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1}}' | sort > {output}"""

"""ssh imrb@{server_san} "find {path_san}/{project_name} -type f -name '*.fastq.gz' -exec md5sum {{}} + | awk '{{print \$1}}' | sort" > {output}"""

Both of these work. The problem only appeared once I added gensub, and I can't find the correct syntax.
I need gensub to do essentially what basename does: strip the directory part from each file path.
Of course, I tried the awk/gensub command outside Snakemake, and it works there.

Just in case, here is the file my rule generates:

# md5sum.txt before gensub
01afd3f2bf06d18c5609b2c2c963eddf /data/imrb/Data/200122_GSC/14-CTRL50TMZ1907192_S11_R2_001.fastq.gz
03e353c316aef09c748aa2363db95599 /data/imrb/Data/200122_GSC/15-11650TMZ1907192_S12_R2_001.fastq.gz
1ba21b8be882bcb62c464ba515800ca4 /data/imrb/Data/200122_GSC/1-CTRL120719_S1_R2_001.fastq.gz

# md5sum.txt after gensub
01afd3f2bf06d18c5609b2c2c963eddf 14-CTRL50TMZ1907192_S11_R2_001.fastq.gz
03e353c316aef09c748aa2363db95599 15-11650TMZ1907192_S12_R2_001.fastq.gz
1ba21b8be882bcb62c464ba515800ca4 1-CTRL120719_S1_R2_001.fastq.gz
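
For reference, the transformation the awk/gensub step is meant to perform can be sketched in Python. This is only an illustration of the intended before/after result, not part of the Snakefile; the helper name `strip_dirs` is made up for this example, and the input lines are the three sample lines shown above.

```python
import posixpath


def strip_dirs(md5_line: str) -> str:
    """Turn '<checksum> <path>' into '<checksum> <file name>'."""
    # Split on the first run of whitespace only, so the path stays in one piece.
    checksum, path = md5_line.split(maxsplit=1)
    return f"{checksum} {posixpath.basename(path)}"


before = [
    "01afd3f2bf06d18c5609b2c2c963eddf /data/imrb/Data/200122_GSC/14-CTRL50TMZ1907192_S11_R2_001.fastq.gz",
    "03e353c316aef09c748aa2363db95599 /data/imrb/Data/200122_GSC/15-11650TMZ1907192_S12_R2_001.fastq.gz",
    "1ba21b8be882bcb62c464ba515800ca4 /data/imrb/Data/200122_GSC/1-CTRL120719_S1_R2_001.fastq.gz",
]

# Same shape as the pipeline: rewrite each line, then sort.
after = sorted(strip_dirs(line) for line in before)
for line in after:
    print(line)
```
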
Elysee

Thanks to dariober, I found the correct syntax for each rule.

For the first rule, I escaped the double quotes used in awk:

rule md5sum_fastq_cluster:
     input:
         path_cluster+'/'+project_name+'/'+project_name+'.csv'
     output:
         path_cluster+'/'+project_name+'/'+'md5sum.txt'
     shell:
         """find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1, gensub( \".*/\", \"\", $2 )}}' | sort > {output}"""
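
To check what the shell receives from this rule, the string can be expanded by hand in plain Python. This is a rough sketch: `str.format` only approximates Snakemake's own formatting step, and the values passed in are hypothetical placeholders. It assumes Snakemake treats the `shell:` string as an ordinary Python string literal. Under that assumption, `\"` inside a triple-quoted string is the same character as `"`, so the escaped and unescaped spellings of this local command should come out identical.

```python
# The shell: string from the rule above, with the escaped double quotes.
escaped = """find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1, gensub( \".*/\", \"\", $2 )}}' | sort > {output}"""

# The same string with plain double quotes, as in the question.
plain = """find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1, gensub( ".*/", "", $2 )}}' | sort > {output}"""

# str.format stands in for Snakemake's formatting (an approximation);
# the three values are hypothetical placeholders.
command = escaped.format(
    path_cluster="/data/imrb/Data",
    project_name="200122_GSC",
    output="md5sum.txt",
)

print(command)           # doubled braces {{ }} collapse to single braces { }
print(escaped == plain)  # True: Python reads \" as a plain double quote here
```
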

For the second rule, which passes the shell command to ssh, I needed to escape the double quotes twice and add one \ before $2:

rule md5sum_fastq_SAN:
     input:
         copyFASTQdone
     output:
         SFTPsan.remote(server_san+path_san+'/'+project_name+'/md5sum.txt')
     shell:
         """ssh imrb@{server_san} "find {path_san}/{project_name} -type f -name '*.fastq.gz' -exec md5sum {{}} + | awk '{{print \$1, gensub( \\".*/\\", \\"\\", \$2 )}}' | sort" > {output}"""
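
The double escaping in the ssh rule can be checked without touching the remote server. In the sketch below, `printf` stands in for `ssh imrb@{server_san}`, so the output is the command string the local shell would hand to ssh. The helper name `remote_view` is made up for this example, and a POSIX `sh` is assumed to be available. The raw strings show what the local shell sees after Python has already turned `\\"` into `\"`. With unescaped inner double quotes, the local shell strips them, which reproduces the program text from the awk error message above.

```python
import subprocess


def remote_view(quoted_arg: str) -> str:
    """Return what the local shell would pass to ssh as the remote command.

    printf replaces ssh here, so nothing runs remotely.
    """
    result = subprocess.run(
        ["sh", "-c", "printf '%s' " + quoted_arg],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Inner double quotes left unescaped: they close and reopen the outer string.
broken = r'''"awk '{print \$1, gensub( ".*/", "", \$2 )}'"'''
# Inner double quotes escaped for the local shell.
fixed = r'''"awk '{print \$1, gensub( \".*/\", \"\", \$2 )}'"'''

print(remote_view(broken))  # awk '{print $1, gensub( .*/, , $2 )}'
print(remote_view(fixed))   # awk '{print $1, gensub( ".*/", "", $2 )}'
```

In both cases the local shell also turns `\$1` and `\$2` into `$1` and `$2`, so awk on the remote side sees ordinary field references.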
