awk gensub in Snakemake's shell command
Elysee
# md5sum on fastq folder on cluster
rule md5sum_fastq_cluster:
    input:
        path_cluster+'/'+project_name+'/'+project_name+'.csv'
    output:
        path_cluster+'/'+project_name+'/'+'md5sum.txt'
    shell:
        """find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1, gensub( ".*/", "", $2 )}}' | sort > {output}"""
# md5sum on fastq folder on remote server
rule md5sum_fastq_SAN:
    input:
        copyFASTQdone
    output:
        SFTPsan.remote(server_san+path_san+'/'+project_name+'/md5sum.txt')
    shell:
        """ssh imrb@{server_san} "find {path_san}/{project_name} -type f -name '*.fastq.gz' -exec md5sum {{}} + | awk '{{print \$1, gensub( ".*/", "", \$2 )}}' | sort" > {output}"""
--------------------------------------------------------------------------
awk: cmd. line:1: {print $1, gensub( .*/, , $2 )}
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: {print $1, gensub( .*/, , $2 )}
Apparently, my gensub syntax is wrong.
Before adding gensub, the two rules used these shell commands:
"""find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1}}' | sort > {output}"""
"""ssh imrb@{server_san} "find {path_san}/{project_name} -type f -name '*.fastq.gz' -exec md5sum {{}} + | awk '{{print \$1}}' | sort" > {output}"""
Both work. I just can't find the correct syntax now that I have added gensub.
I need gensub to do essentially what basename does: strip the directory path from the file name.
Of course, I tried the awk/gensub command outside Snakemake, and it works.
Just in case, here is the file my rule generates:
# md5sum.txt before gensub
01afd3f2bf06d18c5609b2c2c963eddf /data/imrb/Data/200122_GSC/14-CTRL50TMZ1907192_S11_R2_001.fastq.gz
03e353c316aef09c748aa2363db95599 /data/imrb/Data/200122_GSC/15-11650TMZ1907192_S12_R2_001.fastq.gz
1ba21b8be882bcb62c464ba515800ca4 /data/imrb/Data/200122_GSC/1-CTRL120719_S1_R2_001.fastq.gz
# md5sum.txt after gensub
01afd3f2bf06d18c5609b2c2c963eddf 14-CTRL50TMZ1907192_S11_R2_001.fastq.gz
03e353c316aef09c748aa2363db95599 15-11650TMZ1907192_S12_R2_001.fastq.gz
1ba21b8be882bcb62c464ba515800ca4 1-CTRL120719_S1_R2_001.fastq.gz
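For reference, the before/after transformation shown above can be sketched in plain Python. This is only an illustration of the intended result, not part of the Snakemake rules; the function name `strip_directories` is made up for this example, and the input lines are the three sample lines from the listing.

```python
import os

def strip_directories(md5sum_lines):
    """Turn 'checksum /some/dir/file' lines into sorted 'checksum file' lines."""
    result = []
    for line in md5sum_lines:
        # Split once on whitespace: checksum first, the rest is the path.
        checksum, path = line.split(maxsplit=1)
        result.append(f"{checksum} {os.path.basename(path.strip())}")
    return sorted(result)

before = [
    "01afd3f2bf06d18c5609b2c2c963eddf /data/imrb/Data/200122_GSC/14-CTRL50TMZ1907192_S11_R2_001.fastq.gz",
    "03e353c316aef09c748aa2363db95599 /data/imrb/Data/200122_GSC/15-11650TMZ1907192_S12_R2_001.fastq.gz",
    "1ba21b8be882bcb62c464ba515800ca4 /data/imrb/Data/200122_GSC/1-CTRL120719_S1_R2_001.fastq.gz",
]

for line in strip_directories(before):
    print(line)
```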
Elysee
Thanks to dariober, I found the correct syntax for each rule.
For the first rule, I needed to escape the double quotes used in the awk program:
rule md5sum_fastq_cluster:
input:
path_cluster+'/'+project_name+'/'+project_name+'.csv'
output:
path_cluster+'/'+project_name+'/'+'md5sum.txt'
shell:
"""find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1, gensub( \".*/\", \"\", $2 )}}' | sort > {output}"""
For the second rule, which passes the shell command to ssh, I needed to escape the double quotes twice and add one \ before $2.
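One way to check the quoting without touching the remote server is to swap ssh for printf and look at the string that would be forwarded. The sketch below does that from Python. It assumes a POSIX `sh` is available locally. The host `san.example`, the path `/san/proj` and the function name `remote_command_as_sent` are placeholders invented for this example. The command lines are written as raw strings, so they show what the local shell sees after Python has already turned `\\"` into `\"` and Snakemake has turned `{{` and `}}` into `{` and `}`.

```python
import subprocess

def remote_command_as_sent(shell_line):
    """Print the single argument ssh would forward, instead of running ssh."""
    # Hypothetical host; printf '%s\n' simply echoes its argument verbatim.
    probe = shell_line.replace("ssh imrb@san.example", "printf '%s\\n'", 1)
    # set -f disables globbing so the unquoted .*/ cannot match local files.
    done = subprocess.run(["sh", "-c", "set -f; " + probe],
                          capture_output=True, text=True, check=True)
    return done.stdout.rstrip("\n")

# Unescaped inner double quotes: they close and reopen the outer ssh string.
broken = r"""ssh imrb@san.example "find /san/proj -type f -name '*.fastq.gz' -exec md5sum {} + | awk '{print \$1, gensub( ".*/", "", \$2 )}' | sort" """

# Inner double quotes escaped for the local shell, as in the corrected rule.
fixed = r"""ssh imrb@san.example "find /san/proj -type f -name '*.fastq.gz' -exec md5sum {} + | awk '{print \$1, gensub( \".*/\", \"\", \$2 )}' | sort" """

print(remote_command_as_sent(broken))
print(remote_command_as_sent(fixed))
```

In the first printed line the awk program has lost its double quotes, which matches the `gensub( .*/, , $2 )` in the error message. In the second line they survive. The second rule then reads: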
rule md5sum_fastq_SAN:
    input:
        copyFASTQdone
    output:
        SFTPsan.remote(server_san+path_san+'/'+project_name+'/md5sum.txt')
    shell:
        """ssh imrb@{server_san} "find {path_san}/{project_name} -type f -name '*.fastq.gz' -exec md5sum {{}} + | awk '{{print \$1, gensub( \\".*/\\", \\"\\", \$2 )}}' | sort" > {output}"""