Yongfu's Blog

A Minimalist Structure for Snakemake

Also for Unix users forced to work on Windows

I have heard of the use of GNU Make for enhancing reproducibility for some time. I did not incorporate Make into my work, however, since a simple build script written in Bash was sufficient. Everything was well in control, and I could structure the workflow to my will.

It was not until I started working in a company setting that I found most things out of my control. Decades of conventions have accumulated and been passed on, and personal workflows have to fit into existing ones. In order to fit into my company’s conventions of data analysis (which pretty much just ignore analysis reproducibility), the number of scripts grew exponentially and quickly fell out of my control (see figure below, which is automatically generated by Snakemake). I needed a way to document and track my workflow in a consistent and scalable manner. This was when I picked up Prof. Broman’s great introductory post on GNU Make again. Everything seemed hopeful, but I was soon defeated by the omnipresent Windows. Since I am required to work on Windows machines in my company, and since Make for Windows has difficulties dealing with Chinese file paths, I had to give up on Make. Snakemake came as my savior.

Data analysis workflow graph generated by Snakemake

Meeting Snakemake

Snakemake was inspired by GNU Make but is considerably more complex. Since it is built on Python, cross-platform issues such as character encodings are largely taken care of. Snakemake is a thoughtful project that was originally developed to facilitate computational research and reproducibility. Thus, it may take some time to get started since there are many concepts to pick up. It’s totally worth it, however. Dealing with complex tasks requires a complicated framework. Often, these complications make sense (and are appreciated) only after we face real-world complexities. Going through Snakemake’s tutorial and experimenting with it on your own computer should be sufficient to get an average user started. It is not as complicated as it seems at first glance.

A great thing about Snakemake is that it is opinionated. This means that certain conventions[1] are proposed, and most users would benefit from them since they spare the burden of planning and creating workflow structures.

For instance, Snakemake recommends the directory structure listed below for every Snakemake workflow. This structure is so simple that its genius might not be obvious at first glance. There are four directories in the project root: workflow/, config/, results/, and resources/. workflow/ holds the code. Code for data analysis, computation, and reproducibility is all found in this directory. config/ is for optional configuration, and I will skip it here (in my own project, I did not use config files since the Snakefile is sufficient for my purposes). results/ and resources/ are what (I think) make this structure fantastic. resources/ holds all raw data, i.e., data that cannot be regenerated on your computer (e.g., manually annotated data). All data resulting from the computation in the current project are located in results/. So ideally, you could delete results/ at any time without worry. A single command, snakemake -c, should regenerate all the results from resources/. The genius of this structure is that it eliminates the need to worry about where to place newly arrived data, a situation commonly encountered in real-world work (e.g., an analysis might require data that you did not foresee).

├── .gitignore
├── README.md
├── workflow
│   ├── rules
│   │   ├── module1.smk
│   │   └── module2.smk
│   ├── scripts
│   │   ├── script1.py
│   │   └── script2.R
│   └── Snakefile
├── config
│   └── config.yaml
├── results
└── resources

An Enhanced Snakemake Workflow

I adopted the workflow above in my work. It was great, but I still found two annoying drawbacks.

  1. Long directory names

    Since in a Snakefile, file paths of inputs and outputs are always repeated, it soon becomes annoying to type in paths starting with resources/... and results/.... In addition, “resources” and “results” have a common prefix, which often confuses me. It would be better if the two terms were more readily distinguished visually.

  2. Confusing relative paths

    According to the documentation, relative paths in different directives are interpreted differently. In short, relative paths in input:, output:, and shell: are interpreted relative to the working directory (i.e., where you invoke the command snakemake -c), whereas in directives such as script:, they are interpreted relative to the Snakefile. So it is cognitively demanding to switch back and forth between the reference points of relative paths while writing the Snakefile. Why not have all paths relative to the project root?
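To make the two reference points concrete, here is a minimal Python sketch that only mimics the resolution rule described above; it does not call Snakemake itself. The project location /project, the file names, and the two helper functions are hypothetical and exist purely for illustration.

```python
from pathlib import PurePosixPath


def resolve_io_path(relative, workdir):
    """Mimic input:/output:/shell: paths, which are relative to the working directory."""
    return PurePosixPath(workdir) / relative


def resolve_script_path(relative, snakefile):
    """Mimic script: paths, which are relative to the directory holding the Snakefile."""
    return PurePosixPath(snakefile).parent / relative


# Hypothetical project in the recommended layout, with snakemake invoked from /project.
workdir = "/project"
snakefile = "/project/workflow/Snakefile"

print(resolve_io_path("resources/data.csv", workdir))
# /project/resources/data.csv
print(resolve_script_path("scripts/script1.py", snakefile))
# /project/workflow/scripts/script1.py
```

The same kind of relative path therefore lands in two different places depending on the directive it appears in, which is the mental switching I found tiring.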

To deal with the aforementioned problems, I modified the recommended directory structure and arrived at the structure below:

├── README.md
├── Snakefile
├── made
├── raw
└── src

  1. Simplified directory names

    resources/ is renamed as raw/, and results/ is renamed as made/. The workflow/ directory is broken down into src/ (holding scripts) and the Snakefile.

  2. Consistent relative paths

    Since Snakefile is now placed in the project root, the problem of different relative paths for different directives is resolved, as long as the user always invokes the command snakemake -c in the project root.
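As a rough illustration of this layout, the sketch below scaffolds raw/, made/, src/, and a starter Snakefile in a directory of your choice. The scaffold function and the file names inside the rule (raw/data.csv, made/summary.txt, src/summarize.py) are hypothetical placeholders, not taken from my actual project. Every path in the Snakefile text reads from the project root, which is the point of the structure.

```python
from pathlib import Path

# Starter Snakefile text. The file names are hypothetical placeholders.
# With the Snakefile in the project root, input:, output:, and script:
# paths all start from the same place.
SNAKEFILE = '''\
rule all:
    input:
        "made/summary.txt"

rule summarize:
    input:
        "raw/data.csv"
    output:
        "made/summary.txt"
    script:
        "src/summarize.py"
'''


def scaffold(root):
    """Create raw/, made/, src/, and a starter Snakefile under root."""
    root = Path(root)
    for name in ("raw", "made", "src"):
        (root / name).mkdir(parents=True, exist_ok=True)
    snakefile = root / "Snakefile"
    if not snakefile.exists():  # never overwrite an existing workflow
        snakefile.write_text(SNAKEFILE, encoding="utf-8")
    return root
```

Calling scaffold("my-project") would create the directories and the Snakefile; the README.md from the tree above is left for you to write.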

The source code of this Snakemake workflow can be found here on GitHub.

Some Notes for Using Git-Bash as Shell

The experience of using Snakemake on Windows is great overall. I ran into a few problems, but they were usually solvable. There is one particular problem that took me a while to solve. On Windows, the default shell executable used in Snakemake (and Python) is Cmd (or maybe PowerShell). Since I am more familiar with Bash and Unix tools, this was a real inconvenience. I had set up Git-Bash on the company’s Windows machine but then spent a long time figuring out how to set Git-Bash as the default shell in Snakemake. Information for Snakemake users on Windows is scarce. I guess Snakemake is just unpopular among Windows users. After I read the source code, the solution turned out to be quite simple. Just put the code below at the top of the Snakefile and place the path to the Git-Bash executable in shell.executable(). This allows the identical Snakefile to be used on both Windows and Unix-like computers without additional configuration.

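# Additional setup for running with Git-Bash on Windows
import os

if os.name == 'nt':  # 'nt' means Python is running on Windows
    from snakemake.shell import shell
    # Replace with the path to bash.exe on your own machine
    shell.executable(r'C:\Users\rd\AppData\Local\Programs\Git\bin\bash.exe')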
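The path above is specific to my machine. If the Snakefile is shared among colleagues, one possible variation (a sketch, not something I have tested widely) is to read the path from an environment variable. The variable name GIT_BASH and the helper pick_shell are my own inventions for this example; neither is defined by Snakemake or by Git for Windows.

```python
import os


def pick_shell(os_name, env):
    """Return the Bash path to hand to Snakemake, or None to keep its default shell."""
    if os_name != "nt":
        # Unix-like systems need no extra setup.
        return None
    # GIT_BASH is a hypothetical variable each Windows user would set themselves.
    return env.get("GIT_BASH")


# In a Snakefile, usage might look like this:
#
#     from snakemake.shell import shell
#     bash = pick_shell(os.name, os.environ)
#     if bash:
#         shell.executable(bash)
```

Keeping the decision in a small function makes it easy to check both branches without a Windows machine at hand.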

  1. “Good” conventions here, as opposed to naturally resulting conventions without the consideration of reproducibility.