A Minimalist Structure for Snakemake
Also for Unix users forced to work on Windows
I had heard of the use of GNU Make for enhancing reproducibility for some time. I did not incorporate Make into my work, however, since a simple build script written in Bash was sufficient. Everything was well in control, and I could structure the workflow to my will.
It was not until I started working in a company setting that I found most things out of my control. Decades of conventions had accumulated and been passed on, and personal workflows had to fit into existing ones. In order to fit into my company’s conventions of data analysis (which pretty much ignore reproducibility), the number of scripts grew exponentially and quickly fell out of my control (see figure below, which is automatically generated by Snakemake). I needed a way to document and track my workflow in a consistent and scalable manner. This was when I picked up Prof. Broman’s great introductory post on GNU Make again. Everything seemed hopeful, but I was soon defeated by the omnipresent Windows. Since I am required to work on Windows machines in my company, and since Make for Windows has difficulties dealing with Chinese file paths, I had to give up on Make. Snakemake then came as my savior.
Snakemake was inspired by, but is far more complicated than, GNU Make. Since it is backed by Python, cross-platform issues such as character encodings are largely taken care of. Snakemake is a thoughtful project that was originally developed to facilitate computational research and reproducibility, so there are many concepts to pick up and it may take some time to get started. It’s totally worth it, however. Dealing with complex tasks requires a complicated framework. Often, these complications make sense (and are appreciated) only after we face real-world complex tasks. Going through Snakemake’s tutorial and experimenting with it on the computer should be sufficient to get an average user started. It is not as complicated as it seems at first glance.
Snakemake Recommended Workflow
A great thing about Snakemake is that it is opinionated. This means that certain conventions[1] are proposed, and most users benefit from them since they spare users the burden of structuring the workflow.
For instance, Snakemake recommends the directory structure listed below for every Snakemake workflow. This structure is so simple that its genius might not be obvious at first glance. There are four directories in the project root: workflow/, config/, results/, and resources/. workflow/ holds the coding stuff: code for data analysis, computation, and reproducibility is all found in this directory. config/ is for optional configuration, and I will skip it here (in my own project, I did not use config files since the Snakefile is sufficient for my purposes). results/ and resources/ are what (I think) make this structure fantastic. resources/ holds all raw data, i.e., data that are not reproducible on your computer (e.g., manually annotated data). All data resulting from the computation in the current project are placed in results/. So ideally, you could delete results/ at any time without worry. A single command, snakemake -c, should regenerate all the results from resources/. The genius of this structure is that it eliminates the need to worry about where to place newly arrived data, a situation commonly encountered in the real world (e.g., an analysis might require data that you did not anticipate).
├── .gitignore
├── README.md
├── workflow
│   ├── rules
│   │   ├── module1.smk
│   │   └── module2.smk
│   ├── scripts
│   │   ├── script1.py
│   │   └── script2.R
│   └── Snakefile
├── config
│   └── config.yaml
├── results
└── resources
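As a minimal sketch (not something Snakemake provides), the recommended layout could be scaffolded with a few lines of Python. The names `LAYOUT` and `scaffold` are hypothetical and exist only for this illustration.

```python
from pathlib import Path

# Directories of the recommended layout, relative to the project root.
LAYOUT = [
    "workflow/rules",
    "workflow/scripts",
    "config",
    "results",
    "resources",
]


def scaffold(root):
    """Create the recommended directories and an empty Snakefile under root."""
    root = Path(root)
    for rel in LAYOUT:
        (root / rel).mkdir(parents=True, exist_ok=True)
    # touch() keeps the content of an existing Snakefile intact.
    (root / "workflow" / "Snakefile").touch()
    # Return everything that now exists, as POSIX-style relative paths.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
```

Calling `scaffold("my-project")` would create the skeleton in a new `my-project` directory. Raw data then go into `resources/`, and everything in `results/` stays disposable.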
An Enhanced Snakemake Workflow
I adopted the workflow above in my work. It was great, but I still found two annoying drawbacks.
Long directory names
Since input and output file paths are repeated throughout a Snakefile, it soon becomes annoying to type paths starting with results/.... In addition, “resources” and “results” share a common prefix, which often confuses me. It would be better if the two names were easier to tell apart visually.
Confusing relative paths
According to the documentation, relative paths in different directives are interpreted differently. In short, relative paths in shell: are interpreted relative to the working directory (i.e., where you invoke the command snakemake -c), whereas in directives such as script:, they are interpreted relative to the Snakefile. It is therefore cognitively demanding to switch back and forth between the two reference points while writing the Snakefile. Why not have all paths relative to the project root?
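To make the two reference points concrete, here is a small sketch that mimics the rule described above. It is an illustration only, not Snakemake's actual implementation; the function `resolve`, the set `SNAKEFILE_RELATIVE`, and the `/project` paths are all made up for the example.

```python
from pathlib import PurePosixPath

# Directives whose relative paths are resolved against the Snakefile's
# directory in this sketch; everything else uses the working directory.
SNAKEFILE_RELATIVE = {"script"}


def resolve(directive, rel_path, workdir, snakefile):
    """Return the location that rel_path refers to for a given directive."""
    if directive in SNAKEFILE_RELATIVE:
        base = PurePosixPath(snakefile).parent
    else:
        base = PurePosixPath(workdir)
    return str(base / rel_path)


# Recommended layout: Snakefile inside workflow/, command run from /project.
workdir = "/project"
snakefile = "/project/workflow/Snakefile"
print(resolve("shell", "scripts/script1.py", workdir, snakefile))
print(resolve("script", "scripts/script1.py", workdir, snakefile))
```

The same string `scripts/script1.py` ends up pointing at `/project/scripts/script1.py` in one case and at `/project/workflow/scripts/script1.py` in the other, which is exactly the mental switching I find tiring.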
To deal with the aforementioned problems, I modified the recommended directory structure and arrived at the structure below:
├── README.md
├── Snakefile
├── made
├── raw
└── src
Simplified directory names
resources/ is renamed as raw/.
results/ is renamed as made/.
The workflow/ directory is broken down into src/ (holding scripts) and the Snakefile.
Consistent relative paths
Since the Snakefile is now placed in the project root, the problem of different relative paths for different directives is resolved, as long as the user always invokes the command snakemake -c in the project root.
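The caveat about always invoking the command in the project root could be enforced with a small guard. This is only a sketch: `check_invoked_from_root` is a hypothetical helper of my own, not a Snakemake function.

```python
from pathlib import Path


def check_invoked_from_root(snakefile_dir, cwd=None):
    """Raise if the working directory is not the directory holding the Snakefile."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    # resolve() normalizes both sides so that equivalent paths compare equal.
    if cwd.resolve() != Path(snakefile_dir).resolve():
        raise RuntimeError(
            f"Run snakemake from {snakefile_dir}, not from {cwd}"
        )
```

Inside a Snakefile, one might call it as `check_invoked_from_root(workflow.basedir)`, assuming `workflow.basedir` points to the directory containing the Snakefile in the Snakemake version at hand.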
The source code of this Snakemake workflow can be found here on GitHub.
Some Notes for Using Git-Bash as Shell
The experience of using Snakemake on Windows is great overall. I have run into a few problems, but they were usually solvable. One particular problem took me a while to solve. On Windows, the default shell executable used in Snakemake (and Python) is Cmd (or maybe PowerShell). Since I am more familiar with Bash and Unix tools, this is a real inconvenience. I had set up Git-Bash on the company’s Windows machine but then spent a long time figuring out how to set Git-Bash as the default shell in Snakemake. Information for Snakemake users on Windows is scarce. I guess Snakemake is just unpopular among Windows users. After I read the source code, the solution turned out to be quite simple. Just put the code below at the top of the Snakefile and place the path to the Git-Bash executable in shell.executable(). This allows the identical Snakefile to be used on both Windows and Unix-like computers without additional configuration.
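# Additional setup for running with Git-Bash on Windows
import os

if os.name == 'nt':  # 'nt' means the Snakefile is running on Windows
    from snakemake.shell import shell
    # Path to the Git-Bash executable on this particular machine
    shell.executable(r'C:\Users\rd\AppData\Local\Programs\Git\bin\bash.exe')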
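The hard-coded path above is specific to my machine. A slightly more portable variant, sketched below, looks for Git-Bash in a few candidate locations before handing the result to shell.executable(). The helper `find_bash` and the `CANDIDATES` list are hypothetical, and the listed paths are only guesses at common install locations that may need adjusting.

```python
import os
import shutil

# Hypothetical candidate locations; adjust them to the actual installation.
CANDIDATES = [
    r"C:\Program Files\Git\bin\bash.exe",
    os.path.expandvars(r"%LOCALAPPDATA%\Programs\Git\bin\bash.exe"),
]


def find_bash(candidates=CANDIDATES):
    """Return the first existing candidate, else a bash found on PATH, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return shutil.which("bash")


if os.name == "nt":  # only override the shell on Windows
    bash = find_bash()
    if bash is not None:
        from snakemake.shell import shell
        shell.executable(bash)
```

One thing to watch: on a Windows machine with WSL installed, the PATH fallback may pick up a bash that is not Git-Bash, so listing the explicit Git-Bash path first is the safer choice.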
[1] “Good” conventions here, as opposed to conventions that arise naturally without consideration of reproducibility.