Migrating the entire balenaOS CI/CD to GitHub Actions: From Bottlenecks to Breakthroughs

BalenaOS runs on millions of single-board computers provisioned by our customers. Automated testing of balenaOS ensures that we can continuously deliver new features and address bugs. The pipeline also enables support for new device types for our customers, on time, every time. Our team has built hardware-in-the-loop CI/CD systems and virtualized testing to enable this scale, freeing up critical engineering resources in the process.

In our previous blog posts, we discussed our comprehensive OS testing pipeline built on Jenkins. This post is a high-level rundown of our migration journey from Jenkins to GitHub Actions. It highlights the transition process, the benefits to customers, and the key challenges we faced. The migration involved moving our balenaOS build, test, and deploy pipelines from Jenkins to a feature-parity setup on GitHub Actions, piece by piece.

BalenaOS CI/CD scale

BalenaOS is an embedded Linux operating system optimized for running containers on embedded devices. It’s a key component of the balenaCloud platform. We support balenaOS for 100+ public device types and 40+ private device types. These numbers are growing every day. Customers depend on us to implement security patches, deliver new features, and update dependencies. 

We build balenaOS for every device type we support, for each commit in every pull request. That's more than 100 balenaOS draft releases built per pull request. Each balenaOS draft release triggers test suite runs. These suites test each component of the operating system, assess over-the-air (OTA) updates (called balena HUP), and evaluate OS integration with balenaCloud. The tests run on our hardware-in-the-loop CI/CD system. Once the tests are successful, the pull request gets merged and the draft OS release is finalized and deployed.

Our OS team works on about 10 pull requests every day, which build over 1,000 balenaOS releases. Additionally, more than 3,000 test jobs are executed, debugged, and retried by our team. For every new device we support, autokits are added to the pool to test new OS releases. The upfront cost of supporting a device is fixed, but ongoing testing and maintenance have no upper bound unless the device is discontinued. At this scale, a bottlenecked CI/CD pipeline creates difficulties for both our team and our customers.

Jenkins and balena

Jenkins is an open-source CI/CD server that had been used at balena since the start. It facilitated balenaOS Yocto builds, balena base image builds, e2e test runs, and Leviathan testing.

Jenkins at scale: Growing pains 

As our usage grew, Jenkins at this scale required maintenance, updates, and a lot of resources to run reliably. We used 50+ plugins to improve almost every aspect of our Jenkins usage: automation, GitHub integration, artifact and job retention, sharing between workflows, templating, and secrets management, to name a few.

Jenkins building balenaOS releases for a pull request change

Over the years, to alleviate these issues, we built custom scripts for building balenaOS, called balena-yocto-scripts. This kept complexity out of the Jenkins pipelines, but it also created a difficult maintenance situation, and years of technical debt accumulated. After we implemented hardware-in-the-loop testing for balenaOS in 2021, our setup presented several points of friction:

  • Strange interactions between Jenkins plugins and GitHub checks led to false reporting of test passes/failures on PRs
  • Race conditions in Jenkins jobs reporting statuses to GitHub checks led to PRs being merged without passing tests
  • Test results were difficult to associate with PRs, and retrying failing jobs wasn't straightforward for people unfamiliar with the setup
  • Test results were not visible to external contributors, so they couldn't see why their changes were failing

Reliability, maintenance, and costs were all contributing factors to initiating the search for a new solution. We needed an easier-to-use solution that eliminated a separate service and kept job runs closer to the code.

Why GitHub Actions?

GitHub Actions simplifies workflows with flexible automation and offers easy-to-use CI/CD capabilities for engineering teams. It can run on self-hosted runners or as a SaaS service, and it integrates natively and seamlessly with GitHub. Our research and proof of concept found several features helpful to us:

  • Native Integration: All balenaOS repositories were already on GitHub, and we already run our reusable, zero-conf Flowzone workflows on GitHub Actions.
  • Configuration & version control: Workflows are defined in YAML files that are version-controlled and committed to the repository.
  • Community: Huge collection of helpful, reusable community-contributed actions.
  • Scalability: Supports both GitHub-hosted and self-hosted runners.
  • Enterprise Support: Growing community and extensive documentation.
  • Pricing: Free for public repositories, limited free minutes for private repositories, and paid plans available.
  • Security: Built-in secrets management, automatic token authentication, and security hardening features.
  • Triggers: Supports multiple job triggers based on GitHub events (push, open, close, even cron).

The terminology was similar, too. Jenkins uses the agent, stages, and steps keywords to describe its components, which map to GitHub Actions keywords like jobs.<job_id>.runs-on, jobs, and jobs.<job_id>.steps. This similarity helped lower the learning curve for our team.
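To illustrate the mapping (this is not one of our production workflows, just a minimal sketch with a hypothetical build command), a simple pipeline with an agent, one stage, and a couple of steps translates into GitHub Actions terms roughly like this:

    # .github/workflows/build.yml (illustrative only)
    name: build
    on:
      pull_request:              # run on pull request events
      push:
        tags: ["v*"]             # or when a new version tag is pushed
    jobs:
      build:                     # a Jenkins "stage" roughly becomes a job
        runs-on: ubuntu-latest   # the Jenkins "agent" roughly maps to runs-on
        steps:                   # Jenkins "steps" map directly to steps
          - uses: actions/checkout@v4
          - run: make build      # hypothetical build command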

Initial Planning

Before starting the migration, we conducted a thorough audit of our existing Jenkins pipelines. This involved identifying all the jobs, stages, and steps that needed to be replicated in GitHub Actions. We also forecasted the increase in build runner usage that GitHub Actions would incur in the process.

Moving to GitHub Actions also gave us a unique chance to untangle the technical debt in our custom balena-yocto-scripts, which had been written specifically to run cohesively with Jenkins. The team worked towards a basic GitHub Actions job that could build a balenaOS release. This proof of concept went through multiple iterations: we spent weeks optimizing the runner architecture, dependencies, and cache systems to ensure the Yocto job would succeed consistently. This became the foundation of our planned CI/CD re-architecture.
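To give a flavor of what such a build job involves, here is a stripped-down sketch; the runner labels, script name, cache paths, and device type are placeholders, not our actual implementation:

    jobs:
      build:
        runs-on: [self-hosted, yocto]       # placeholder runner labels
        steps:
          - uses: actions/checkout@v4
            with:
              submodules: true              # Yocto layers are often submodules
          - name: Restore shared Yocto caches
            uses: actions/cache@v4
            with:
              path: |
                shared/sstate-cache
                shared/downloads
              key: yocto-${{ runner.os }}-${{ github.ref_name }}
          - name: Build a draft balenaOS release
            run: ./build-os.sh "${MACHINE}"   # hypothetical wrapper script
            env:
              MACHINE: raspberrypi4-64        # hypothetical device type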

Building Phase: Time to migrate

The proof of concept gave us an understanding of inputs, outputs, environments, secrets, artifact handling, and deployment. This initial plan was iterated on over a dozen times with improvements for scale, ease of use, and input simplification. A key indicator of progress was the number of inputs to the workflow. Initially, we defined many inputs to aid customizability; these were later eliminated in favor of inferring them at runtime from the repository and the GitHub Actions environment variables. We started our build phase once the team was aligned on the prototype, and decided to create the components of the action in parallel.
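For example (illustrative only; these variable names are not from our actual workflow), values that would otherwise be explicit inputs can be inferred from the GitHub context:

    env:
      # Inferred from the GitHub context instead of being passed as workflow inputs
      DEVICE_REPO: ${{ github.event.repository.name }}     # e.g. the device-type repository name
      IS_PR_BUILD: ${{ github.event_name == 'pull_request' }}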

Pipeline Architecture: Bigger Picture

The balenaOS build, test, deploy (BTD) pipeline is written as a composite workflow. We first thought of separating each stage into a different workflow file by creating reusable workflows. The team quickly realized that having separate jobs for each task (build, test, deploy) is cleaner and better organized, but each job would need to reinitialize its environment, triggers, and inputs, which is a herculean task for Yocto jobs. We needed to change our approach, and decided to use one reusable workflow with all three tasks defined together in a composite action. This made maintenance and distribution straightforward for our use case of deploying the same action across hundreds of repositories.
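A simplified skeleton of this kind of reusable workflow, with illustrative job names and placeholder scripts rather than our real steps, could look like:

    # .github/workflows/balena-btd.yml (illustrative skeleton)
    name: balenaOS build, test, deploy
    on:
      workflow_call:
        inputs:
          test_matrix:
            type: string
            required: false
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: ./build.sh          # hypothetical: build a draft balenaOS release
      test:
        needs: build                 # runs once the draft release exists
        runs-on: ubuntu-latest
        steps:
          - run: ./test.sh           # hypothetical: run the Leviathan test suites
      deploy:
        needs: test                  # only finalize after tests pass
        runs-on: ubuntu-latest
        steps:
          - run: ./deploy.sh         # hypothetical: finalize and publish the release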

With over a thousand lines of steps to build, test, and deploy a balenaOS release, the system now manages all kinds of edge cases and can process many different inputs and test scenarios. Here's the big picture of how it works.

How it works

The workflow is triggered by a pull request event or the creation of a new tagged version. At each step, the build and test jobs provide feedback, reporting the status directly on the pull request along with errors and retry options. Once a draft release of balenaOS is built, it's passed on to a matrix of testing jobs that test the release in different environments, on different test devices, or in different scenarios (for example, secure boot testing).

If any tests fail, each individual test run can be retried, thanks to GitHub Actions' ability to re-run only failed jobs. This saves us from having to continuously rebuild balenaOS. Artifacts are retained for a week in the temporary GitHub Artifact Storage. If all tests are successful, the final checks take place and the pull request merges.

After the pull request is merged, a new tagged version is created, which marks the balenaOS release as final. The artifacts are uploaded to S3 and shortly become available for download from balena.io/os and the balenaCloud dashboard. For special releases, like ESR, we add ESR tags, which are used by the API to signal the ESR status (like next or sunset). All of this, from build to tests on actual devices to deployment, happens automatically.

To read about the jobs in more detail and understand the logic of the workflow, check out the flowchart. It represents the intended logic graph, taking into account user inputs, event types, and expected results/outputs.

Leviathan iterations & boosting test performance

Over the years, Leviathan's inputs and behavior had been heavily influenced by Jenkins. Ideally, Leviathan needed to be modified to be more CI/CD agnostic. We made several modifications to infer as many of the inputs it needs as possible from the environment, which led to a lot of iterative changes to Leviathan to make job runs more streamlined.

A core requirement of the Test task was the ability to run autokit tests in parallel. Running multiple test jobs in the same Leviathan environment is a complex task. However, GitHub Actions runs test jobs in parallel in a job matrix. The test_matrix input in our action allows users to tailor their test jobs exactly to their requirements: they can select worker types, fleets, environments, the tests to run, and the scenario to run them in.

    test_matrix: >
      {
        "test_suite": ["os", "cloud", "hup"],
        "environment": ["bm.balena-dev.com"],
        "worker_type": ["testbot"],
        "runs_on": [["ubuntu-latest"]]
      }

The GitHub Action creates a new job for each permutation of this matrix. This gives us a baked-in solution for executing test runs in parallel instead of in series. Test runs now take one-third of the time, each job executes faster, and jobs can be individually retried if and when they fail. The OS team now has the ability to quickly retry tests whenever they need to.
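Under the hood, this relies on GitHub Actions' standard matrix strategy combined with the fromJSON expression. A simplified sketch of the test job (not our exact workflow) looks like this:

    test:
      needs: build
      strategy:
        fail-fast: false                              # let sibling test jobs finish
        matrix: ${{ fromJSON(inputs.test_matrix) }}   # one job per permutation
      runs-on: ${{ matrix.runs_on }}
      steps:
        - run: echo "Running ${{ matrix.test_suite }} on ${{ matrix.worker_type }}"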

A limitation we had to work around in GitHub Actions was not being able to pass or reference output artifacts directly from one job to another. For Leviathan jobs, we used GitHub Artifact Storage to pass artifacts from the build job to the test job.
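The mechanism here is the standard upload/download artifact actions; a minimal sketch, with placeholder artifact names and paths:

    # In the build job: publish the OS image to GitHub Artifact Storage
    - uses: actions/upload-artifact@v4
      with:
        name: balena-os-image        # placeholder artifact name
        path: deploy/                # placeholder output path
        retention-days: 7            # artifacts are kept for a week

    # In the test job: fetch the same image before running the test suites
    - uses: actions/download-artifact@v4
      with:
        name: balena-os-image
        path: image/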

Calling the workflow

The workflow has been implemented in each device type repository using a caller workflow. We use the caller workflow to configure the job and provide the inputs and triggers as per our needs. This gives us multiple benefits:

  1. Document, version, and create reusable, reproducible configurations to run jobs
  2. Set events and specific conditions for when the job gets triggered
  3. Configure inputs to customize the build, test, and deploy jobs as per our needs
  4. Run jobs manually with overridden inputs by using `workflow_dispatch`

These caller workflows are committed and version-controlled next to the device type code. This gives external contributors the transparency to observe, troubleshoot, and optimize our CI/CD flows without contacting support. If there are issues, contributors can use GitHub Issues to notify the team of problems arising in the BTD workflow.

Find the intel-nuc caller workflow for a sample of how it all works.
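For illustration, a caller workflow in a device-type repository might look something like this (the reusable workflow path and input names are placeholders, not the actual published workflow):

    # .github/workflows/balena-os.yml in a device-type repository (illustrative)
    name: balenaOS
    on:
      pull_request:
      push:
        tags: ["v*"]
      workflow_dispatch:           # allows manual runs with overridden inputs
        inputs:
          test_matrix:
            type: string
            required: false
    jobs:
      btd:
        uses: balena-os/example/.github/workflows/balena-btd.yml@main   # placeholder path
        secrets: inherit
        with:
          test_matrix: ${{ inputs.test_matrix }}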

The First Deployment

For our first deployment, we created a staging repository, a fork of our current balena-raspberrypi repository.

This repository helped us test all the scenarios for our workflows, including:

  1. Public device types
  2. Private device types
  3. Secure boot images
  4. Extended Support Release (ESR) images
  5. Test matrix, alternative environments, and build outputs.  

The first run faced several setup issues. Resolving them was crucial to enabling a smoother migration later, which became evident once we deployed to hundreds of repositories. The team smoothed out the rough edges and finally got successful runs of the balenaOS BTD workflow, marking the build phase as complete.

After that, we customized the caller workflows to fit our specific needs, including setting up environment variables and secrets, and configuring self-hosted runners where necessary. We followed a methodical approach to the migration, testing each component of the GitHub Action in production against the staging repository.

The actual migration begins

With the first successful run behind us, we acted quickly and migrated our bigger device-type repositories over to GitHub Actions. In our plan, each team member took on a repository to migrate to GitHub Actions. Our initial decision to build a reusable workflow was incredibly helpful: enabling the build, test, and deploy action on a repository was as simple as adding a caller workflow to each repository in a pull request.

The migration got easier with the extensive use of our secrets and policy configuration bot. We automated permissions and secrets, changed required checks, and modified environment settings, and we continuously improved our action with every challenge we faced. The migration ran into some hurdles: we worked through scaling issues and triggers, and we wrote documentation to provide a better developer experience for our team. Within four weeks we had migrated from Jenkins to GitHub Actions and were able to shut down our self-hosted Jenkins instance.

Conclusion

The transition from Jenkins to GitHub Actions has been a significant step forward in unlocking greater scale for balenaOS. It has enabled us to streamline our workflows, improve efficiency, and scale our CI/CD pipelines more effectively. We hope our experience can serve as a guide for others considering a similar migration.

For customers: Transparency of builds & Ease of access

Building balenaOS in the open comes with many challenges and benefits. The primary bottleneck in enabling our community to fix issues was our CI/CD system. Submitting new patches was also slowed by our Jenkins setup, since Jenkins needed to be behind SSO to limit access to sensitive data and credentials used in our builds.

GitHub Actions running OS tests

This problem was implicitly solved when we moved to GitHub Actions. The visibility of GitHub Actions logs depends on the visibility of the GitHub repository the workflow runs in, which makes access control much simpler for us and our operations team.

Customers can now see which checks are failing, debug builds, ask questions, and even submit patches without getting the balenaOS team involved. This is a major step forward for the enablement we want for our community. We continue to work towards making it easier for folks to add support for new device types and to understand the logs so they can fast-track their development.

Road ahead

As device support for balenaOS increases, this GitHub Actions pipeline will scale and grow with it. We will be improving our logging implementation for the action, and our next cycle of work will include adding observability for the hundreds of GitHub Actions workflows running across our repositories.

CI/CD observability is crucial for ensuring efficient and reliable software releases. The diverse landscape of CI/CD tools has created challenges in maintaining consistent end-to-end observability. This is particularly true for relatively new CI/CD providers. We will be publishing our research and a solution to that soon. 

Stay tuned for more updates as we continue to enhance our development and deployment processes!


