A Minimalist Structure for Snakemake
Also for Unix users forced to work on Windows
I had heard of using GNU Make to enhance reproducibility for some time. I did not incorporate Make into my work, however, since a simple build script written in Bash was sufficient. Everything was under control, and I could structure the workflow as I wished.
It was not until I started working in a company setting that I found most things out of my control. Conventions had been accumulating and passed on for decades, and personal workflows had to fit into existing ones. To fit into my company’s conventions of data analysis (which pretty much ignore analysis reproducibility), the number of scripts grew rapidly and quickly fell out of my control (see figure below, which is automatically generated by Snakemake). I needed a way to document and track my workflow in a consistent and scalable manner. This was when I picked up Prof. Broman’s great introductory post on GNU Make again. Everything seemed hopeful, but I was soon defeated by the omnipresent Windows. Since my company requires working on Windows machines, and since Make for Windows has difficulties dealing with Chinese file paths, I had to give up on Make. Snakemake came as my savior.
Meeting Snakemake
Snakemake is inspired by, but far more complex than, GNU Make. Since it is backed by Python, cross-platform issues such as character encodings are mostly handled automatically. Snakemake is a thoughtful project that was originally developed to facilitate computational research and reproducibility, and there are many concepts to pick up, so it may take some time to get started. It’s totally worth it, however. Dealing with complex tasks requires a sophisticated framework. Often, these complications make sense (and are appreciated) only after we face real-world complexities. Going through Snakemake’s tutorial and experimenting with it on your own computer should be sufficient to get an average user started. It is not as complicated as it might seem.
Snakemake Recommended Workflow
A great thing about Snakemake is that it is opinionated. This means that certain conventions[1] are proposed, and most users benefit from them since they spare users the burden of planning and creating workflow structures.
For instance, Snakemake recommends the directory structure listed below for every Snakemake workflow. This structure is so simple that its genius might not be obvious at first glance. There are four directories in the project root: workflow/, config/, results/, and resources/. workflow/ holds the code: everything for data analysis, computation, and reproducibility is found in this directory. config/ is for optional configuration, and I will skip it here (in my own project, I did not use config files since the Snakefile is sufficient for my purposes). results/ and resources/ are what (I think) make this structure fantastic. resources/ holds all raw data, i.e., data that cannot be regenerated on your computer (e.g., manually annotated data). All data resulting from the computation in the current project are located in results/. So ideally, you could delete results/ at any time without worry. A single command, snakemake -c, should regenerate all the results from resources/. The genius of this structure is that it eliminates the need to worry about where to place newly arrived data, a situation commonly encountered in the real world (e.g., an analysis might require data that you did not foresee).
├── .gitignore
├── README.md
├── workflow
│   ├── rules
│   │   ├── module1.smk
│   │   └── module2.smk
│   ├── scripts
│   │   ├── script1.py
│   │   └── script2.R
│   └── Snakefile
├── config
│   └── config.yaml
├── results
└── resources
An Enhanced Snakemake Workflow
I adopted the workflow above in my work. It was great, but I still found two annoying drawbacks.
Long directory names
Since file paths of inputs and outputs are repeated throughout a Snakefile, it soon becomes annoying to type paths such as resources/... and results/... over and over. In addition, “resources” and “results” share a common prefix, which often confuses me. It would be better if the two names were easier to tell apart visually.
Confusing relative paths
According to the documentation, relative paths in different directives are interpreted differently. In short, relative paths in input:, output:, and shell: are interpreted relative to the working directory (i.e., where you invoke the command snakemake -c), whereas in directives such as script:, they are interpreted relative to the location of the Snakefile. Switching back and forth between these reference points while writing the Snakefile is cognitively demanding. Why not have all paths relative to the project root?
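The difference between the two reference points can be made concrete with a small Python sketch. It does not call Snakemake; it only mimics the two resolution rules described above. The locations used (/project as the working directory and /project/workflow/Snakefile as the Snakefile) and the helper names are hypothetical, chosen only for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical locations, chosen only for illustration.
workdir = PurePosixPath("/project")                       # where `snakemake -c` is invoked
snakefile = PurePosixPath("/project/workflow/Snakefile")  # recommended layout


def resolve_io(path: str) -> PurePosixPath:
    """Mimic input:/output:/shell: paths, taken relative to the working directory."""
    return workdir / path


def resolve_script(path: str) -> PurePosixPath:
    """Mimic script: paths, taken relative to the directory holding the Snakefile."""
    return snakefile.parent / path


print(resolve_io("results/summary.csv"))     # /project/results/summary.csv
print(resolve_script("scripts/script1.py"))  # /project/workflow/scripts/script1.py
```

If the Snakefile sat in /project itself, both helpers would resolve against the same directory.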
To deal with the aforementioned problems, I modified the recommended directory structure and arrived at the structure below:
├── README.md
├── Snakefile
├── made
├── raw
└── src
Simplified directory names
resources/ is renamed to raw/, and results/ is renamed to made/. The workflow/ directory is split into src/ (holding scripts) and the Snakefile.
Consistent relative paths
Since the Snakefile now sits in the project root, the problem of different reference points for different directives disappears, as long as the user always invokes the command snakemake -c in the project root.
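To give a feel for how this layout is used, here is a minimal Python sketch that scaffolds the three directories and writes a small Snakefile into a throwaway folder. The rule name (summarize) and the file names (data.csv, summary.csv, summarize.py) are hypothetical and not taken from my project; the point is only that every path in the Snakefile is written relative to the project root.

```python
from pathlib import Path
import tempfile

# Contents of a minimal Snakefile for the layout above.
# Rule and file names (summarize, data.csv, summary.csv) are hypothetical.
SNAKEFILE = '''\
rule all:
    input:
        "made/summary.csv"

rule summarize:
    input:
        "raw/data.csv"
    output:
        "made/summary.csv"
    script:
        "src/summarize.py"
'''


def scaffold(root: Path) -> None:
    """Create raw/, made/, and src/ plus a Snakefile under `root`."""
    for name in ("raw", "made", "src"):
        (root / name).mkdir(parents=True, exist_ok=True)
    (root / "Snakefile").write_text(SNAKEFILE, encoding="utf-8")


demo_root = Path(tempfile.mkdtemp())  # throwaway directory for the demo
scaffold(demo_root)
print(sorted(p.name for p in demo_root.iterdir()))  # ['Snakefile', 'made', 'raw', 'src']
```

Because the Snakefile lives in the project root, the script: path and the input:/output: paths all start from the same place.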
The source code of this Snakemake workflow can be found here on GitHub.
Some Notes for Using Git-Bash as Shell
The experience of using Snakemake on Windows is great overall. I ran into a few problems, but they were usually solvable. One particular problem took me a while to solve. On Windows, the default shell executable used in Snakemake (and Python) is Cmd (or maybe PowerShell). Since I am more familiar with Bash and Unix tools, this is a real inconvenience. I had set up Git-Bash on the company’s Windows machine but then spent a long time figuring out how to set Git-Bash as the default shell in Snakemake. Information for Snakemake users on Windows is scarce. I guess Snakemake is just unpopular among Windows users. After reading the source code, the solution turned out to be quite simple. Just put the code below at the top of the Snakefile and place the path to the Git-Bash executable in shell.executable(). This allows the identical Snakefile to be used on both Windows and Unix-like computers without additional configuration.
# Additional setup for running with Git-Bash on Windows
import os

if os.name == 'nt':
    from snakemake.shell import shell
    # Replace with the path to bash.exe on your machine
    shell.executable(r'C:\Users\rd\AppData\Local\Programs\Git\bin\bash.exe')
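The hard-coded path above is specific to one machine. A possible variation, sketched below in plain Python, checks a few candidate locations for bash.exe before falling back to a PATH lookup. The helper name find_bash and the candidate paths are assumptions made for illustration and are not something Snakemake provides; only shell.executable() comes from the snippet above.

```python
import os
import shutil
from typing import Iterable, Optional

# Assumed install locations for Git for Windows; adjust to your machine.
DEFAULT_CANDIDATES = (
    r"C:\Program Files\Git\bin\bash.exe",
    os.path.expandvars(r"%LOCALAPPDATA%\Programs\Git\bin\bash.exe"),
)


def find_bash(candidates: Iterable[str] = DEFAULT_CANDIDATES,
              search_path: bool = True) -> Optional[str]:
    """Return the first existing candidate, else a `bash` found on PATH, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Caveat: on Windows this lookup may find a bash.exe other than Git-Bash.
    return shutil.which("bash") if search_path else None


# Intended use at the top of a Snakefile (left commented out so that this
# sketch also runs without Snakemake installed):
#
# if os.name == 'nt':
#     from snakemake.shell import shell
#     bash = find_bash()
#     if bash:
#         shell.executable(bash)
```

Checking explicit candidates first is a deliberate choice: a bare PATH lookup can pick up a different bash.exe than the Git-Bash one, so the fallback is best treated as a last resort.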
[1] “Good” conventions here, as opposed to conventions that arise naturally without consideration of reproducibility.