<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ken Brewer's BioDev Blog]]></title><description><![CDATA[Musings and ideas from the intersection of biology, software development, and machine learning.

Opinions shared here are strictly my own.]]></description><link>https://kenbrewer.com</link><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 10:30:50 GMT</lastBuildDate><atom:link href="https://kenbrewer.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Nextflow, nf-core, Seqera and more...]]></title><description><![CDATA[I am excited to announce that starting next week I will be joining Seqera as a Senior Developer Advocate supporting the Nextflow, nf-core and Seqera Platform user communities. For this role, I'm going to be relocating to the San Francisco Bay Area to...]]></description><link>https://kenbrewer.com/nextflow-nf-core-seqera-and-more</link><guid isPermaLink="true">https://kenbrewer.com/nextflow-nf-core-seqera-and-more</guid><category><![CDATA[seqera]]></category><category><![CDATA[nf-core]]></category><category><![CDATA[#nextflow]]></category><category><![CDATA[bioinformatics]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[openscience]]></category><category><![CDATA[biotechnology]]></category><category><![CDATA[workflow-orchestration]]></category><dc:creator><![CDATA[Ken Brewer]]></dc:creator><pubDate>Mon, 12 Aug 2024 12:05:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723481629594/b0f30a0b-1e52-477d-b8b8-a142265db87d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I am excited to announce that starting next week I will be joining <a target="_blank" href="https://seqera.io/">Seqera</a> as a Senior Developer Advocate supporting the Nextflow, nf-core and Seqera Platform user communities. For this role, I'm going to be relocating to the San Francisco Bay Area to help build community within the vibrant BioPharma and Tech scenes there. 
As part of introducing myself to the broader community, I wanted to share a little about my journey as a scientist and engineer and why I am so passionate about the powerful open-source and commercial tools that are being built by Seqera.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723213349245/3149ae8f-ac01-46f7-96f9-e2e34a6f1fe0.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-complexity-of-scientific-workflows">Complexity of scientific workflows</h2>
<p>As a PhD student in the <a target="_blank" href="https://breaker.yale.edu/"><strong>Breaker Lab at Yale</strong></a>, I focused my research on building computational pipelines for the discovery of novel noncoding RNA motifs. This involved computationally intensive sifting through terabytes of bacterial genomes, looking for RNA motifs with certain structural homologies and gene associations. As the number of different tools I wanted to incorporate into my research pipelines grew, I quickly began dealing with the <strong>complexity of programming large, multi-step pipelines</strong>.</p>
<p>While iterating on a complex, multi-step workflow, minor matters like the naming of intermediate files and discrepancies between my laptop and HPC cluster compute environments can quickly become sources of major frustration. I never found Nextflow during that period, but I became strongly interested in software engineering best practices. I was convinced there had to be better ways to address these challenges than bash spaghetti with Python meatballs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723212582932/893af285-9c12-4053-938b-cd3d7b1cf9b3.png" alt="Bash spaghetti with Python meatballs" class="image--center mx-auto" /></p>
<h2 id="heading-moving-to-biotech-finding-nextflow">Moving to biotech, finding Nextflow</h2>
<p>For my first role out of academia, I joined ProFound Therapeutics in 2021. ProFound was then a stealth-mode biotech startup in the Flagship Pioneering family of companies. I was their second computational hire, and my first project was to scale certain computational analyses to a massive collection of novel proteins they were studying. Before I started building pipelines, I took time to do a careful assessment of modern tooling for orchestrating complex scientific workflows.</p>
<p>There are a number of excellent bio-specific and general-purpose tools for building reproducible workflows, but <a target="_blank" href="https://www.nextflow.io">Nextflow</a> stood out to me for two key reasons:</p>
<ul>
<li><p>Nextflow had excellent portability: the same pipeline logic could be executed on a local computer, in the AWS cloud, or on an HPC cluster.</p>
</li>
<li><p><a target="_blank" href="https://nf-co.re/">nf-core</a>, a vibrant global community of Nextflow users, was collaborating on a collection of gold-standard, open-source pipelines for all kinds of common bioinformatic analyses.</p>
</li>
</ul>
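<p>To make the portability point concrete, here is a small configuration sketch. The profile names, queue, and bucket below are hypothetical, but the pattern is standard Nextflow: the pipeline logic stays untouched, and a profile selects where it runs.</p>
<pre><code>// nextflow.config (illustrative profiles; names are made up)
profiles {
    standard {
        process.executor = 'local'       // run on a laptop
    }
    aws {
        process.executor = 'awsbatch'    // run in the AWS cloud
        process.queue    = 'my-batch-queue'
        workDir          = 's3://my-bucket/work'
    }
    hpc {
        process.executor = 'slurm'       // run on an HPC cluster
    }
}
</code></pre>
<p>Switching environments is then just a matter of <code>nextflow run main.nf -profile aws</code>, with no changes to the pipeline code itself.</p>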
<p>After bringing my analysis of Nextflow's advantages to my computational lead, <strong>we decided to make Nextflow a core technology in our platform</strong> and were off to the races... sort of.</p>
<h2 id="heading-building-with-seqera-platform">Building with Seqera Platform</h2>
<p>It turned out that there was a lot more to setting up a scalable bioinformatic platform than simply choosing to use Nextflow. We ran into headaches configuring our AWS Batch executors, setting up automations, and training scientists who wanted to analyze their own data on the complexities of AWS.</p>
<p>Luckily, Seqera had recently arrived on the scene with a solution that seemed tailor-made for many of our biggest headaches. Seqera was founded by the original developers of Nextflow, and their flagship product was a platform we could deploy into our own AWS account. Not only could Seqera Platform quickly deploy and modify some of the tricky parts of AWS infrastructure, it could also set up automations, provide observability, and offer a user-friendly GUI for non-technical folks to run our Nextflow pipelines.</p>
<p>Thanks to my advocacy and the buy-in of my technical lead, we signed up with Seqera as one of their earliest commercial customers. <strong>The rock-solid combination of Nextflow and Seqera ended up fully living up to its promise and more</strong> as the Seqera team continued to add game-changing new features to both Nextflow and Seqera Platform over the nearly two years I worked with their team as a customer.</p>
<h2 id="heading-super-scaling-bioinformatics">Super-scaling bioinformatics</h2>
<p>For my next role after ProFound Tx, I joined GeneDx as a Senior Software Engineer on the bioinformatics platform team. The opportunity at GeneDx appealed to me for two reasons:</p>
<ul>
<li><p>With hundreds of patients' genetic sequencing tests passing through bioinformatics pipelines every day, the data quality and reliability of my team's work were going to be critically important from Day 1.</p>
</li>
<li><p>I had the opportunity to act as technical lead for parts of the "Cloudflow" project: a major migration of bioinformatic pipelines from WDL running on-prem to Nextflow running in the cloud.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723217345875/6b0dd966-766b-4712-a890-7d793a21116e.jpeg" alt class="image--center mx-auto" /></p>
<p>While the original migration plan involved setting up an in-house Nextflow orchestrator and building a variant-calling pipeline from scratch, we made two changes to the plan not long after I joined:</p>
<ul>
<li><p>Instead of building a Nextflow pipeline from scratch, we decided to try <strong>configuring the open-source</strong> <a target="_blank" href="https://nf-co.re/sarek"><strong>nf-core/sarek</strong></a> <strong>pipeline as the starting point</strong> for our planned production pipeline.</p>
</li>
<li><p>We started a proof-of-concept agreement with Seqera to explore <strong>using Seqera Platform as our pipeline orchestration</strong> solution.</p>
</li>
</ul>
<p>With the combination of Seqera Platform and a gold-standard nf-core pipeline to build on, our small Cloudflow project team began delivering on project milestones at an incredible pace. Within a few short months of our change in direction, we had set up scalable bioinformatics infrastructure connected to Seqera Platform in not one, but two public clouds. We also developed proof-of-concept durable automations that handled the entire process, from data coming off the sequencer to processed genomic variants being uploaded to our data portal.</p>
<h2 id="heading-what-next">What next?</h2>
<p>Given the passion I've developed for Nextflow and Seqera products over the past 3+ years as a user and customer, it was a natural fit for me to join Seqera's community team when a position opened up. While the specifics of my role as developer advocate are still to be determined, here are a few topics I'm incredibly passionate about that you'll likely hear me talk about in the coming months:</p>
<ul>
<li><p>Highlighting the benefits for biopharma and healthcare companies that choose to build on and contribute to open-source projects like nf-core pipelines.</p>
</li>
<li><p>Expanding the range of scientific disciplines that are choosing to build sharable, reproducible pipelines using Nextflow.</p>
</li>
<li><p>Diversifying the bioinformatics talent and leadership pool by offering training and support to individuals from marginalized and underrepresented communities.</p>
</li>
<li><p>Bringing the power of modern software best practices like continuous integration and continuous deployments to complex data pipelines.</p>
</li>
</ul>
<p>I'm thrilled to be getting more deeply involved in the incredibly vibrant open-source science communities that are built around Nextflow, nf-core, Seqera, and more! I'm also very much looking forward to meeting more of you in person and virtually over the coming months!</p>
]]></content:encoded></item><item><title><![CDATA[Personal highlights from Nextflow Summit: Boston 2024]]></title><description><![CDATA[I had the pleasure of attending the Nextflow Summit: Boston 2024 last week. It was a fantastic experience catching up with friends in the Boston bioinformatics community and finally meeting up in person with some of my collaborators from the nf-core ...]]></description><link>https://kenbrewer.com/personal-highlights-from-nextflow-summit-boston-2024</link><guid isPermaLink="true">https://kenbrewer.com/personal-highlights-from-nextflow-summit-boston-2024</guid><category><![CDATA[microscopy]]></category><category><![CDATA[#nextflow]]></category><category><![CDATA[image processing]]></category><category><![CDATA[MRI ]]></category><category><![CDATA[containers]]></category><category><![CDATA[infrastructure]]></category><dc:creator><![CDATA[Ken Brewer]]></dc:creator><pubDate>Mon, 27 May 2024 14:51:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1718369313756/dddddc65-2539-43f9-a1b4-178eac74f111.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I had the pleasure of attending the Nextflow Summit: Boston 2024 last week. It was a fantastic experience catching up with friends in the Boston bioinformatics community and finally meeting up in person with some of my collaborators from the nf-core community whom I previously knew mostly by their GitHub handles. There were a number of excellent talks, but here are some key details and themes that stood out to me.</p>
<h2 id="heading-1-nextflow-adoption-and-impact-is-accelerating">1) Nextflow adoption and impact are accelerating</h2>
<p>In Evan Floden's <a target="_blank" href="https://youtu.be/mvhMnNl9lsQ?si=sX3ntWZZTmRNIvs0&amp;t=82">welcome address</a>, he shared convincing data from telemetry and other sources showing that <strong>Nextflow is the fastest-growing workflow manager for biology</strong>. One piece of data came from <a target="_blank" href="https://www.biorxiv.org/content/10.1101/2024.05.10.592912v1.full.pdf+html">a recent bioRxiv preprint</a> by the EuroFAANG and nf-core teams:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716635894940/a18ecdd0-4c88-43e6-ad22-6394923b33df.png" alt class="image--center mx-auto" /></p>
<p>This chart is exciting to me for two reasons:</p>
<ol>
<li><p>In comparison to download stats alone, scientific publications likely represent <strong>successful scientific output</strong>.</p>
</li>
<li><p>Publications are likely a lagging indicator. If the trend in Nextflow citations from 2022 to 2023 holds, we may be seeing a <strong>transformational shift in the adoption of containerized workflows for science</strong>.</p>
</li>
</ol>
<p>To me, it seems like Nextflow's foundational hypothesis that better software enables better science is increasingly being validated by the numbers.</p>
<h2 id="heading-2-imaging-analysis-has-a-strong-and-growing-presence-in-the-nf-core-community">2) Imaging analysis has a strong and growing presence in the nf-core community.</h2>
<p>Although Nextflow's early adoption was strongest for processing next-generation sequencing (NGS) data, a growing diversity of scientific disciplines is taking advantage of the utility it provides. As someone with a strong scientific interest in image analysis (see <a target="_blank" href="https://github.com/cytomining/pycytominer">pycytominer</a>), I was thrilled to see several talks focused specifically on new Nextflow pipelines designed for large-scale processing of imaging data. Pipelines were demoed for multiple imaging modalities, including:</p>
<ul>
<li><p>Whole slide imaging of tissue sections (<a target="_blank" href="https://summit.nextflow.io/2024/boston/agenda/05-23--parallelization-of-computer-vision-over/">Parallelization of computer vision over large corpora of gigapixel biomedical images</a>)</p>
</li>
<li><p>Single-cell spatial omics (<a target="_blank" href="https://summit.nextflow.io/2024/boston/agenda/05-23--spaflow-a-nextflow-pipeline-for/">SpaFlow: A Nextflow pipeline for single cell phenotyping in spatial omics</a>)</p>
</li>
<li><p>Diffusion MRI Processing (<a target="_blank" href="https://summit.nextflow.io/2024/boston/agenda/05-23--mining-diamonds-with-wooden-hammers/">Mining diamonds with wooden hammers : getting diffusion MRI processing into the age of steel</a>)</p>
</li>
</ul>
<p>The recently launched combinatorial fluorescence in-situ hybridization analysis pipeline <a target="_blank" href="https://nf-co.re/molkart/1.0.0">nf-core/molkart</a> also had a prominent role in <a target="_blank" href="https://summit.nextflow.io/2024/boston/agenda/05-24--practical-data-studios/">Seqera Labs' demo of their new Data Studios feature</a>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716651568058/990534d3-a8f9-44cb-8d33-bce65a3d0be8.png" alt="Rob Syme from Seqera demoing nf-core/molkart in Data Studios" class="image--center mx-auto" /></p>
<h2 id="heading-3-multiple-companies-are-improving-their-nextflow-infrastructure-offerings">3) Multiple companies are improving their Nextflow infrastructure offerings</h2>
<p>As expected, Seqera Labs announced and demoed several new features for Seqera Platform, including the previously mentioned Data Studios. One exciting free-to-everyone feature is <a target="_blank" href="https://seqera.io/containers/">Seqera Containers</a>, which provides a dead-simple interface for building on-demand, multi-architecture Docker images for any combination of PyPI and conda packages.</p>
<p>Beyond that, I was excited to see other companies expanding their Nextflow infrastructure tooling.</p>
<p><a target="_blank" href="https://summit.nextflow.io/2024/boston/agenda/05-23--cloud-empowerment-unleashing-the-potential/">MemVerge discussed</a> their <a target="_blank" href="https://www.mmcloud.io/solutions/nextflow">Memory Machine Cloud</a> offering, which can act as a control plane for launching Nextflow pipelines. It has the very clever capability of letting a spot instance pause mid-processing and resume on a different instance when the spot allocation is reclaimed. MemVerge also discussed their efforts to address the I/O bottleneck of many data-intensive bioinformatics pipelines with <a target="_blank" href="https://www.mmcloud.io/blog/juiceflow-a-next-generation-solution-for-nextflow">JuiceFlow</a>, built on the open-source JuiceFS file system.</p>
<p><a target="_blank" href="https://summit.nextflow.io/2024/boston/agenda/05-23--using-nextflow-to-orchestrate-tightly/">Rescale announced</a> their support for an <a target="_blank" href="https://rescale.com/blog/automating-scientific-workflows-with-nextflow-on-rescale-for-accelerated-rd-processes/">executor plugin</a> that allows Nextflow to orchestrate tightly coupled jobs on Rescale's cloud-based High-Performance Computing (HPC) or High-Throughput Computing (HTC) platforms.</p>
<p>Finally, <a target="_blank" href="https://summit.nextflow.io/2024/boston/agenda/05-24--building-enterprise-grade-bioinformatics-pipelines/">Colby Ford from Tuple discussed</a> <a target="_blank" href="https://tuple.xyz/solutions/ahab/index.html">ahab</a>, which can manage Nextflow, Snakemake, WDL-, and CWL-based pipelines in Kubernetes clusters deployed in Azure.</p>
<h2 id="heading-conclusions">Conclusions</h2>
<p>It's clear that the Nextflow ecosystem is rapidly evolving, with growing adoption, expanding use cases in imaging analysis, and significant advancements in infrastructure tooling. It's an exciting time to be part of the Nextflow community!</p>
<p><em>Changelog:</em></p>
<ul>
<li>2024-06-14 - Updated cover image to picture of me meeting up with nf-core collaborator Maxime Garcia.</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Frictionless development using platforms]]></title><description><![CDATA[When I first started this blog I did so with the idea of trying to exemplify some of the DevOps principles I was excited about within the building and publishing of the website itself. Although neat in principle, this ran into a couple of major heada...]]></description><link>https://kenbrewer.com/frictionless-development-using-platforms</link><guid isPermaLink="true">https://kenbrewer.com/frictionless-development-using-platforms</guid><category><![CDATA[Blogging]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[development]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Hashnode]]></category><category><![CDATA[Strategy]]></category><dc:creator><![CDATA[Ken Brewer]]></dc:creator><pubDate>Sun, 14 Apr 2024 19:00:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/s4d_ESS0ylA/upload/bdfee42f5fa9d8502e689405d014ba9c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I first started this blog I did so with the idea of trying to exemplify some of the DevOps principles I was excited about within the building and publishing of the website itself. Although neat in principle, this ran into a couple of major headaches in practice. Chief among them was the high maintenance burden and the high activation energy required for writing content.</p>
<h3 id="heading-maintenance-burden">Maintenance Burden</h3>
<p>Over the past year I've ended up tinkering a great deal with the underlying deployment pattern of my website. This included:</p>
<ul>
<li><p>Migrating from Github Pages to Cloudflare Pages for content hosting</p>
</li>
<li><p>Migrating my DNS configuration (and then separately my domain registration) from Google Domains to Cloudflare</p>
</li>
<li><p>Setting up previews of development branches in Cloudflare</p>
</li>
<li><p>Troubleshooting issues with all of the above.</p>
</li>
</ul>
<p>These were all necessary and/or useful improvements, and I had a long list of additional things I wanted to improve further around setting up edge caching and improving load times, but none of these improvements resulted in me writing more content.</p>
<h3 id="heading-activation-energy">Activation Energy</h3>
<p>Besides the maintenance burden I described above, the other reason I didn't write a lot of content was the high activation energy involved in adding new content. Each new article required:</p>
<ul>
<li><p>Creating a new branch in my Github repo</p>
</li>
<li><p>Writing the content</p>
</li>
<li><p>Making a commit</p>
</li>
<li><p>Pushing the commit</p>
</li>
<li><p>Creating a PR</p>
</li>
<li><p>Previewing the PR build</p>
</li>
<li><p>Merging the PR</p>
</li>
</ul>
<p>While the version-control and CI/CD process provides substantial value in the context of building software, for a personal blog it felt like unnecessary overhead. When I had an idea or thought I wanted to share, I would often get mentally blocked by all the little things that needed to happen first.</p>
<p>In the end this high maintenance burden and high activation energy made it too hard to do the thing that my blog was intended to accomplish: provide a platform for me to share my ideas and interests.</p>
<h2 id="heading-going-with-a-platform">Going with a Platform</h2>
<p>In the end, I decided to replace my burdensome self-hosted publishing solution with a good commercial platform: <a target="_blank" href="https://hashnode.com/">Hashnode</a>. I'm not going to do a breakdown of the features and rationale for choosing this platform in particular, but I do want to mention a bit of the "strategic" value of this decision because I think there are some valuable learnings here for teams building computational platforms in biopharma.</p>
<p>Choosing a good-quality commercial platform to handle undifferentiated parts of your software stack can mean massive time savings, because the platform has already done the work of:</p>
<ul>
<li><p>building the features you need now</p>
</li>
<li><p>building features that you don't know you need, but will need in the future</p>
</li>
<li><p>visual tooling and an optimized user interface for routine tasks</p>
</li>
</ul>
<p>That last bullet point is always particularly hard to justify building for an internal tool, but it can provide massive time savings when routine tasks become less burdensome, less frustrating, or fully automated.</p>
<p>I've got a backlog of article ideas that I've been kicking around for the last year, and I expect this new platform-based approach will make it a lot easier to get them out into the world.</p>
]]></content:encoded></item><item><title><![CDATA[A simpler Nextflow template]]></title><description><![CDATA[Nextflow is the go-to tool for many people in the bioinformatics community who are working on developing data pipelines. Unfortunately, there is a pretty steep learning curve as there is a whole new Groovy-based syntax and framework for code orga...]]></description><link>https://kenbrewer.com/2023-04-07-simple-nextflow-template</link><guid isPermaLink="true">https://kenbrewer.com/2023-04-07-simple-nextflow-template</guid><category><![CDATA[nf-core]]></category><category><![CDATA[#nextflow]]></category><category><![CDATA[template]]></category><dc:creator><![CDATA[Ken Brewer]]></dc:creator><pubDate>Fri, 07 Apr 2023 16:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Nextflow is the go-to tool for many people in the bioinformatics community who are working on developing data pipelines. Unfortunately, there is a pretty steep learning curve as there is a whole new Groovy-based syntax and framework for code organization to learn. The steepest part of this learning curve in my experience happened when I tried to move from a simple pipeline structure like those present in the main Nextflow documentation to the fully-featured, best-practice templates used by the open-source nf-core community. To address this steep learning curve, I've created a new, slimmed-down Nextflow <a target="_blank" href="https://github.com/kenibrewer/simplenextflow">project template</a> based on nf-core's main template. I hope it can be a stepping stone for intermediate Nextflow developers looking to learn best practices for pipeline development, and for experienced Nextflow developers looking for a leaner codebase that can start generating outputs more quickly than the full template.</p>
<h2 id="heading-what-is-nextflow">What is Nextflow?</h2>
<p><a target="_blank" href="https://www.nextflow.io/">Nextflow</a> is a powerful workflow management tool that I frequently use to build, execute, and automate complex scientific pipelines. Its main strength lies in the ability to modularize virtually any program or custom code. Those modular units of compute (called processes) can then be strung together into a workflow that can be run identically on a variety of computing infrastructures, including local computers, cloud-based platforms, and high-performance computing clusters. That modularity and portability are certainly two of the features that make Nextflow so popular among the bioinformatics community, but the most useful aspect of working with Nextflow is the open-source community that has developed around it.</p>
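<p>To make the idea of processes and workflows concrete, here is a minimal sketch of a Nextflow (DSL2) pipeline. The process name and parameter are illustrative rather than taken from any real pipeline:</p>
<pre><code>// main.nf (a minimal illustrative pipeline)
process COUNT_LINES {
    input:
    path infile

    output:
    stdout

    script:
    """
    wc -l ${infile}
    """
}

workflow {
    // Feed every matching input file through the process and print the results
    Channel.fromPath(params.input) | COUNT_LINES | view
}
</code></pre>
<p>Because the process declares only its inputs, outputs, and command, Nextflow can schedule it unchanged on a laptop, a cluster, or the cloud.</p>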
<h2 id="heading-the-nf-core-community">The nf-core community</h2>
<p><a target="_blank" href="https://nf-co.re/">nf-core</a> is a community-driven project that provides a set of standardized Nextflow pipelines for some of the most common bioinformatics analyses. Some of these pipelines are very complex, so much so that they are visualized with metro map inspired diagrams like this one from <a target="_blank" href="https://nf-co.re/rnaseq">nf-core/rnaseq</a>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1713108984111/ba715711-93f1-4c1e-842b-2cbb4ac0d1e6.png" alt class="image--center mx-auto" /></p>
<p>To manage these complex pipelines, nf-core has also developed a powerful python package called <a target="_blank" href="https://nf-co.re/tools">nf-core/tools</a> that provides a set of command-line tools for creating, linting, testing, and syncing pipelines that adhere to nf-core standards.</p>
<h2 id="heading-challenges-of-working-with-the-default-nf-core-template">Challenges of working with the default nf-core template</h2>
<p>nf-core provides a <a target="_blank" href="https://nf-co.re/tools#creating-a-new-pipeline">template</a> that can be used to create a new pipeline from scratch. This template is a great foundation for building a complex multi-step pipeline, but it can be overwhelming for beginning-to-intermediate Nextflow developers who are just trying to get a simple pipeline up and running while familiarizing themselves with Nextflow best practices. Even for experienced Nextflow developers, the nf-core template can be a bit of a pain to work with because of the many different areas of the template that need to be configured/modified to get some outputs for a new process.</p>
<h2 id="heading-a-simpler-nf-core-based-template">A simpler nf-core-based template</h2>
<p>To address some of the challenges of working with the default nf-core template, I created a <a target="_blank" href="https://github.com/kenibrewer/simplenextflow">simplenextflow</a> template based on the nf-core template that is much simpler to work with. Here are some of the main changes I made:</p>
<ol>
<li><p><strong>Fewer files to search for relevant code</strong></p>
<p> To ensure that the nf-core template is as flexible as possible, it is broken up into several different files that are used to configure various aspects of the pipeline. When I started developing pipelines using nf-core best practices, this was one of the most confusing aspects of the template. In <code>simplenextflow</code>, I have moved the vast majority of configuration logic back into the <code>nextflow.config</code> file instead of having it imported from other files. Additionally, I have moved all of the workflow and subworkflow logic back into the <code>main.nf</code> file.</p>
</li>
<li><p><strong>Instructions for adapting the pipeline</strong></p>
<p> At the top of the README file, I added basic instructions covering the core pieces of the pipeline that need to be modified to change the template from the default fastqc example to a new pipeline.</p>
</li>
<li><p><strong>Added config profile and templating for Wave containers</strong></p>
<p> One of the most exciting new features Seqera Labs has introduced is the ability to use <a target="_blank" href="https://www.nextflow.io/docs/latest/wave.html">Wave containers</a> to run processes in a containerized environment without having to build a new container image for each process. Instead, you can simply include a <code>Dockerfile</code> or <code>environment.yml</code> file in the process directory and Wave will build a container image for that process on the fly. This is a great feature for developing pipelines because it allows you to quickly test out new code without having to build a new container image for each change. You can access this feature immediately in this template by running the pipeline with the <code>-profile wave</code> flag.</p>
</li>
<li><p><strong>Removal of check_versions and MultiQC</strong></p>
<p> The <code>check_versions</code> process and the <code>MultiQC</code> process are both great tools for ensuring that version information is captured accurately and that results can be visualized. However, they also fall into the category of features that are overwhelming for most new Nextflow developers. I've chosen to remove them from the template to keep the logic in <code>main.nf</code> simpler. Capturing version information is still a very important best practice, and it should be added back for any pipeline intended for production use.</p>
</li>
<li><p><strong>Keeping samplesheet logic</strong></p>
<p> The concept of using a samplesheet to define the inputs was something that I considered cutting, because modifying the <code>bin/check_samplesheet.py</code> script to work with a new pipeline is one of the most time-consuming parts of adapting the template for a new pipeline. However, I think that setting up proper associations between files and their metadata is one of the most important practices in good pipeline development, so I decided to keep it in the template. I've been thinking about how to make the process of customizing the <code>bin/check_samplesheet.py</code> script easier, but I haven't come up with a good solution yet. Let me know if you have any ideas!</p>
</li>
<li><p><strong>Things that are still in the template</strong></p>
<p> Many of the wonderful quality-of-life features included in nf-core pipelines are still present in this slimmed-down version. These are all features of the nf-core template that work essentially out of the box, with no additional configuration that would slow down generating outputs. Some of my favorite features I was able to keep in the template are:</p>
<ul>
<li><p>Reproducible development environments using Codespaces and Gitpod</p>
</li>
<li><p>Colorful logging and output of non-default parameters</p>
</li>
<li><p>Email and Slack notifications</p>
</li>
</ul>
</li>
</ol>
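<p>For anyone curious what the <code>-profile wave</code> flag mentioned above amounts to, the configuration behind it is only a few lines. The snippet below is a sketch based on Nextflow's Wave documentation, not necessarily the template's exact contents:</p>
<pre><code>// nextflow.config (sketch of a Wave profile; details may differ from the template)
profiles {
    wave {
        wave.enabled   = true
        wave.strategy  = ['dockerfile', 'conda']  // build from a Dockerfile or environment.yml
        docker.enabled = true
    }
}
</code></pre>
<p>With a profile like this, Wave builds the container for each process on the fly instead of requiring a pre-built image.</p>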
<h2 id="heading-conclusion">Conclusion</h2>
<p>I hope this template is useful to anyone looking to familiarize themselves with nf-core best practices in more bite-sized chunks, or to anyone just looking for a simple template to get to their desired outputs more quickly. Let me know if you have any suggestions for improving the template in the comments below or by opening an issue or feature request in the <a target="_blank" href="https://github.com/kenibrewer/simplenextflow">GitHub repo</a>.</p>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a target="_blank" href="https://www.nextflow.io/">Nextflow</a></p>
</li>
<li><p><a target="_blank" href="https://nf-co.re/">nf-core</a></p>
</li>
<li><p><a target="_blank" href="https://nf-co.re/docs/contributing/tutorials/creating_with_nf_core">Creating pipelines with nf-core</a></p>
</li>
</ul>
<h3 id="heading-blog-post-changelog">Blog post changelog</h3>
<ul>
<li>2023-04-08 - Added new intro section</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Setting up a personal blog with CI/CD]]></title><description><![CDATA[Why blog?
Based on the recommendation of multiple colleagues in the Flagship Pioneering informatics community, I recently listened to The Phoenix Project on audiobook. This novel focuses on a struggling IT department within the fictional company Part...]]></description><link>https://kenbrewer.com/setting-up-a-personal-blog-with-cicd</link><guid isPermaLink="true">https://kenbrewer.com/setting-up-a-personal-blog-with-cicd</guid><category><![CDATA[Devops]]></category><category><![CDATA[biotechnology]]></category><category><![CDATA[GitHubPages]]></category><category><![CDATA[jekyll]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[GitHub Actions]]></category><dc:creator><![CDATA[Ken Brewer]]></dc:creator><pubDate>Sun, 02 Apr 2023 16:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1713108073520/53d75293-d214-4748-8da2-cc01b246d792.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-why-blog">Why blog?</h2>
<p>Based on the recommendation of <em>multiple</em> colleagues in the Flagship Pioneering informatics community, I recently listened to <a target="_blank" href="https://itrevolution.com/product/the-phoenix-project/">The Phoenix Project</a> on audiobook. This novel focuses on a struggling IT department within the fictional company Parts Unlimited, and how the main character, Bill, turns things around by implementing the core principles of the DevOps movement. It is an engaging and educational read, and I highly recommend it to anyone interested in DevOps.</p>
<p>Although Bill's story is set in a medium-sized manufacturing company, I found many of the challenges he faced relatable as a computational biologist working in a small biotech startup. His frantic efforts to ensure the integrity of critical data for his colleagues in HR and finance reminded me of the urgency I feel when delivering bioinformatic analyses to bench scientists who need them for their experiments. The novel inspired me to consider how I could apply DevOps principles to the specific problems I face at work.</p>
<p>As I started diving deeper into DevOps, I was somewhat surprised to discover that I already possessed all the core technical skills. I had simply lacked a framework to apply those technical skills in an integrated, synergistic way. That led to my decision to set up a blog focused on applying the principles of DevOps to specific problems faced by computational biologists, bioinformaticians, machine learning engineers and data scientists who work in the biotech and pharmaceutical industries.</p>
<h2 id="heading-criteria-for-the-blog">Criteria for the blog</h2>
<p>Since this blog would, in part, focus on DevOps, I wanted to set up the website using the same principles I planned to write about. The DevOps principles I wanted to apply were:</p>
<ul>
<li>Automate everything</li>
<li>Use version control</li>
<li>Use a CI/CD pipeline to deploy the site</li>
</ul>
<p>I also had a few other criteria for the blog:</p>
<ul>
<li>Use a static site generator to make the blog easy to maintain.</li>
<li>Be able to write new posts in markdown.</li>
<li>Keep costs as low as possible, preferably free.</li>
<li>Be able to use a custom domain name.</li>
<li>Have sufficient flexibility to customize the look and feel of the site.</li>
</ul>
<h2 id="heading-static-site-generator">Static site generator</h2>
<p>I decided to use <a target="_blank" href="https://jekyllrb.com/">Jekyll</a> as the static site generator for the blog. Jekyll is a popular static site generator and is well supported by GitHub Pages. Because this was my first time using Jekyll and I don't believe in re-inventing the wheel, I decided to use a pre-built theme to get started. I chose <a target="_blank" href="https://beautifuljekyll.com/">Beautiful Jekyll</a> because it is well documented and has a clean, modern look.</p>
<h2 id="heading-setting-up-the-blog-using-cicd">Setting up the blog using CI/CD</h2>
<p>I largely followed the instructions on the Beautiful Jekyll website to set up the blog, but I made a few modifications in-line with this blog's focus on DevOps and automation.</p>
<h3 id="heading-setting-up-a-devcontainer">Setting up a DevContainer</h3>
<p>Earlier this year, I led an effort at ProFound to set up standardized development environments. I'll write more about that effort in a future post, but I wanted to use the same approach for this blog. VS Code makes it trivially easy to <a target="_blank" href="https://code.visualstudio.com/docs/devcontainers/create-dev-container#_automate-dev-container-creation">set up a DevContainer</a> for a project, even one that uses a framework or language you are not familiar with. Using the "Add Development Container Configuration Files" command, I was able to quickly generate a <code>devcontainer.json</code> file that would launch a Docker container with all the necessary dependencies for Beautiful Jekyll.</p>
<pre><code class="lang-json">{ 
    <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Jekyll"</span>,
    <span class="hljs-attr">"image"</span>: <span class="hljs-string">"mcr.microsoft.com/devcontainers/jekyll:0-buster"</span>,
    <span class="hljs-attr">"features"</span>: { <span class="hljs-attr">"ghcr.io/devcontainers/features/node:1"</span>: {} },
    <span class="hljs-attr">"forwardPorts"</span>: [<span class="hljs-number">4000</span>],
    <span class="hljs-attr">"postCreateCommand"</span>: <span class="hljs-string">"bundle exec jekyll serve --watch"</span> 
}
</code></pre>
<p>As soon as this file was saved, VSCode automatically prompted me to re-open the project in the container.</p>
<h3 id="heading-setting-up-a-cloud-ide">Setting up a Cloud IDE</h3>
<p>One of the benefits of using a DevContainer is that you can use the same container to develop locally or in the cloud using <a target="_blank" href="https://github.com/features/codespaces">Github Codespaces</a>. Github Codespaces has a generous free tier, and is a great way to reduce the friction of setting up a new project. You can read more about how to set up a Github Codespace in the <a target="_blank" href="https://docs.github.com/en/codespaces/developing-in-codespaces/creating-a-codespace">Github documentation</a>.</p>
<h2 id="heading-setting-up-a-cicd-pipeline">Setting up a CI/CD pipeline</h2>
<p>The original repo for Beautiful Jekyll included a simple <a target="_blank" href="https://github.com/daattali/beautiful-jekyll/blob/e1facea35a0a8ee81bc204db10039d5b53837a39/.github/workflows/ci.yml">GitHub Actions workflow</a>. However, while enabling the Github Pages feature in the repo settings, I found a template Github Actions pipeline that builds and deploys the site, so I used that instead:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Sample workflow for building and deploying a Jekyll site to GitHub Pages</span>
<span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span> <span class="hljs-string">Jekyll</span> <span class="hljs-string">site</span> <span class="hljs-string">to</span> <span class="hljs-string">Pages</span>

<span class="hljs-attr">on:</span>
  <span class="hljs-attr">push:</span>
    <span class="hljs-attr">branches:</span> [<span class="hljs-string">"main"</span>]
  <span class="hljs-attr">workflow_dispatch:</span>

<span class="hljs-comment"># Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages</span>
<span class="hljs-attr">permissions:</span>
  <span class="hljs-attr">contents:</span> <span class="hljs-string">read</span>
  <span class="hljs-attr">pages:</span> <span class="hljs-string">write</span>
  <span class="hljs-attr">id-token:</span> <span class="hljs-string">write</span>

<span class="hljs-comment"># Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.</span>
<span class="hljs-comment"># However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.</span>
<span class="hljs-attr">concurrency:</span>
  <span class="hljs-attr">group:</span> <span class="hljs-string">"pages"</span>
  <span class="hljs-attr">cancel-in-progress:</span> <span class="hljs-literal">false</span>

<span class="hljs-attr">jobs:</span>
  <span class="hljs-comment"># Build job</span>
  <span class="hljs-attr">build:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v3</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">Ruby</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">ruby/setup-ruby@ee2113536afb7f793eed4ce60e8d3b26db912da4</span> <span class="hljs-comment"># v1.127.0</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">ruby-version:</span> <span class="hljs-string">'3.1'</span> <span class="hljs-comment"># Not needed with a .ruby-version file</span>
          <span class="hljs-attr">bundler-cache:</span> <span class="hljs-literal">true</span> <span class="hljs-comment"># runs 'bundle install' and caches installed gems automatically</span>
          <span class="hljs-attr">cache-version:</span> <span class="hljs-number">0</span> <span class="hljs-comment"># Increment this number if you need to re-download cached gems</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">Pages</span>
        <span class="hljs-attr">id:</span> <span class="hljs-string">pages</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/configure-pages@v3</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">with</span> <span class="hljs-string">Jekyll</span>
        <span class="hljs-comment"># Outputs to the './_site' directory by default</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">bundle</span> <span class="hljs-string">exec</span> <span class="hljs-string">jekyll</span> <span class="hljs-string">build</span> <span class="hljs-string">--baseurl</span> <span class="hljs-string">"$<span class="hljs-template-variable">{{ steps.pages.outputs.base_path }}</span>"</span>
        <span class="hljs-attr">env:</span>
          <span class="hljs-attr">JEKYLL_ENV:</span> <span class="hljs-string">production</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Upload</span> <span class="hljs-string">artifact</span>
        <span class="hljs-comment"># Automatically uploads an artifact from the './_site' directory by default</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/upload-pages-artifact@v1</span>

  <span class="hljs-comment"># Deployment job</span>
  <span class="hljs-attr">deploy:</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">github-pages</span>
      <span class="hljs-attr">url:</span> <span class="hljs-string">${{</span> <span class="hljs-string">steps.deployment.outputs.page_url</span> <span class="hljs-string">}}</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">needs:</span> <span class="hljs-string">build</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span> <span class="hljs-string">to</span> <span class="hljs-string">GitHub</span> <span class="hljs-string">Pages</span>
        <span class="hljs-attr">id:</span> <span class="hljs-string">deployment</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/deploy-pages@v2</span>
</code></pre>
<h2 id="heading-setting-up-a-custom-domain">Setting up a custom domain</h2>
<p>I wanted to use a custom domain name for the blog, so I followed the instructions in the <a target="_blank" href="https://docs.github.com/en/pages/configuring-a-custom-domain-for-your-github-pages-site/managing-a-custom-domain-for-your-github-pages-site">Github Pages documentation</a> to set one up.
In order to use a custom domain, you need to create a CNAME record in your DNS settings that points to <code>&lt;username&gt;.github.io</code>.
I did this manually in the Cloudflare console, but I plan to integrate these settings into a Terraform configuration file that I'll build into this repo in the future.
I could have waited until the Terraform setup was ready, but getting a minimum viable product up and running quickly is a key part of the DevOps philosophy.</p>
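<p>To illustrate that CNAME requirement, here is a small Python sketch that checks an exported list of DNS records for the expected GitHub Pages target. The record format (dicts with <code>type</code>, <code>name</code>, and <code>content</code> keys, loosely modeled on what a DNS provider's API might return) is a hypothetical assumption for illustration, not Cloudflare's actual schema.</p>

```python
def check_pages_cname(records, domain, github_username):
    """Return True if `domain` has a CNAME record targeting
    `<github_username>.github.io`, as GitHub Pages expects.

    `records` is a list of dicts with hypothetical keys
    "type", "name", and "content".
    """
    expected = f"{github_username}.github.io"
    for record in records:
        if (
            record.get("type") == "CNAME"
            and record.get("name") == domain
            # DNS targets are sometimes written with a trailing dot.
            and record.get("content", "").rstrip(".") == expected
        ):
            return True
    return False
```

<p>A check like this could run in CI after a Terraform apply to confirm the DNS side of the deployment, in the same spirit of automating everything.</p>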
<h2 id="heading-conclusion">Conclusion</h2>
<p>I'm really happy with how simple it was to set up this blog using DevOps principles.
It only took a few hours to set up the blog, and I have something simple and robust enough for me to write posts and iterate on with little-to-no overhead.
I'm looking forward to writing more posts in the future, and I hope you'll join me on this journey!</p>
<h2 id="heading-references">References</h2>
<ul>
<li><a target="_blank" href="https://beautifuljekyll.com/">Beautiful Jekyll</a></li>
<li><a target="_blank" href="https://jekyllrb.com/">Jekyll</a></li>
<li><a target="_blank" href="https://pages.github.com/">Github Pages</a></li>
<li><a target="_blank" href="https://docs.github.com/en/actions">Github Actions</a></li>
</ul>
<h4 id="heading-image-credits">Image credits</h4>
<p>Cover image by <a target="_blank" href="https://unsplash.com/@silvawebdesigns">Nathan da Silva</a> on <a target="_blank" href="https://unsplash.com/photos/k-rKfqSm4L4">Unsplash</a></p>
]]></content:encoded></item></channel></rss>