Jenkins Wrangling for Fun & Profit
2017-12-05
While there have been many new developments in CI/testing tools, Jenkins is still a mainstay. And to be fair to the Jenkins team, it has come a tremendous way in the past few years. The new Pipelines system is more flexible than anything I’ve used before, and the Blue Ocean UI is a big graphical and UX upgrade. My team at SAP started using Jenkins long before I arrived, but over the years we have slowly accumulated some complaints about how it was working and how we managed things.
This post is going to be a (very) long-form dive into how we set things up and why. I do not think this is going to work out of the box for (almost) anyone else, but the hope is that this will provide a blueprint for others to build their own solutions on the same general ideas.
The Problems
Before I launch into what we did, let’s list out the issues we had with our current set up so we are all on the same page. Leaving aside some details that exist only for PCI, we have a Jenkins server deployed by Chef. Some plugins were installed originally by the Chef cookbook, but most were installed and upgraded by hand since then. Jobs were mostly created via a custom CLI tool that talks to the Jenkins API, but then updated by hand (or more often, not) after that.
So we spent some time in a meeting room with a whiteboard and came up with a few top-level problems:
- Upgrades are too unpredictable, both for Jenkins and individual plugins.
- Jenkins configuration is (mostly) not versioned, ditto for job configs.
- Job configs can bitrot over time with no easy way to update them other than one at a time.
- No pull-request builds, and overall existing builds are quite slow.
While this applied to a lot of different use cases for Jenkins, the one I chose to tackle first was Chef cookbook testing, but with a clear eye towards building a solution which will grow to include other use cases as we want to move them off the old Jenkins server.
The Shape
After a bunch of research, the overall shape of the goal came together fairly quickly. We’ll go through each of these in excruciating detail later on, but to start let’s break it down into bullet points:
- Deployed on Kubernetes, because this is the way we’re trying to move everything.
- Build a container image containing Jenkins, all the plugins, and configuration.
- Use Helm for managing the deployment (and rollback if needed).
- Manage the configuration via Jenkins groovy as much as possible.
- Use the “organization folder” system in Jenkins to auto-detect projects.
- Use shared pipeline libraries to keep the per-repository config low.
- Build a container image for the cookbook testing environment that has all needed gems pre-installed.
- Work out a way to test cookbooks on top of Kubernetes pods.
Kubernetes?
We did briefly look over non-Kubernetes deployment options like building a new Chef cookbook or using a dedicated Nomad cluster, but with the continued rise of Kubernetes as an operations platform it seemed like a good idea to use this as an internal experiment for running “real” (but not customer facing) services on Kubernetes. In the months since that choice, I think we have only seen the industry move even more behind Kubernetes as the next dominant platform so this seems to have been the right move. If you already have a heavy investment in Mesos or Nomad then perhaps just ignore the Kubernetes-specific bits of this.
Within the Kubernetes ecosystem there are a few tools/patterns for managing deployment of complex applications (i.e. things that need more than just a pod). While “folder full of YAML + kubectl” and ksonnet are nice from a simplicity point-of-view, the rollback capabilities of Helm made it the clear choice in my mind.
The deployment of Kubernetes and Helm themselves is out of scope for this post; there are numerous guides for Kubernetes, and setting up Helm is mostly just helm init.
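For reference, the rollback workflow that sold me on Helm looks roughly like this (Helm 2 commands; the release and chart names here are placeholders, not our real ones):

$ helm init                                    # installs Tiller into the cluster
$ helm install -n alti-jenkins ./alti-jenkins  # first deploy of the chart
$ helm upgrade alti-jenkins ./alti-jenkins     # roll out a new image or config
$ helm history alti-jenkins                    # list the numbered revisions
$ helm rollback alti-jenkins 1                 # jump back to revision 1 if an upgrade goes sideways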
Our production cluster is currently on AWS and set up with Kops, but your mileage
may vary if you aren’t on AWS. If you just want to play with the stuff in this
post, I would highly recommend starting with a hosted cluster option like Google
GKE, Azure AKS, or the newly announced Amazon EKS.
The Jenkins Container
The first stop on our journey is building a Docker image for the Jenkins server. Going line by line so we can talk about it as we go:
FROM jenkins/jenkins:2.92-alpine
We’re starting from the existing Jenkins Docker Hub images. This determines which version of Jenkins gets used so doing a Jenkins upgrade consists of changing this line, building a new container, pushing it to our registry, and then upgrading the Helm release. This is using the Jenkins weekly release, so we try to keep this bumped roughly once a week, though if it ends up a few weeks behind that’s totally fine.
COPY saml-idp-metadata.xml /metadata.xml
COPY plugins.txt /plugins.txt
COPY style.css /style.css
Next we copy some base files. The SAP internal authentication system uses SAML (please hold your sighs for the end) so we store the IdP metadata in a file to use in the Jenkins config. The plugins.txt looks like:
kubernetes:1.1
workflow-aggregator:2.5
workflow-job:2.15
credentials-binding:1.13
git:3.6.4
blueocean:1.3.3
github-oauth:0.28.1
matrix-auth:2.2
saml:1.0.4
and gets used later down by the install-plugins.sh
script that comes in the
base image. The style.css
file is a few minor tweaks on top of the theme for
things it gets wrong or we didn’t like:
/* The theme overwrites this so we need to fix. */
.glyphicon {
font-family: 'Glyphicons Halflings' !important;
}
/* Theme has no icon for this. */
.icon-github-branch {
background-image: url('/static/ccf6b398/plugin/github-branch-source/images/24x24/github-branch.png');
}
/* Force all icons in that bar to be grayscale. */
.icon-md {
filter: grayscale(1);
}
Overall, these are files that aren’t expected to change often.
RUN mkdir -p /usr/share/jenkins/ref/secrets && \
# Why is this not the default?
echo false > /usr/share/jenkins/ref/secrets/slave-to-master-security-kill-switch && \
# Install all our plugins so they are baked in to the image.
/usr/local/bin/install-plugins.sh < /plugins.txt && \
# Install a nicer default theme to make it look shiny for non-BlueOcean.
mkdir /usr/share/jenkins/ref/userContent && \
curl --compressed http://jenkins-contrib-themes.github.io/jenkins-neo-theme/dist/neo-light.css > /usr/share/jenkins/ref/userContent/neo-light.css.override && \
cat /style.css >> /usr/share/jenkins/ref/userContent/neo-light.css.override
Then the meat of the Dockerfile. A poorly documented feature of the Jenkins
Docker image is that all files under /usr/share/jenkins/ref
are used to seed
the creation of the JENKINS_HOME folder during startup. Normally these files are
only copied over the first time, but if they end in .override
it is copied every
time Jenkins starts (with the .override
trimmed off).
First we set the poorly named slave-to-master-security-kill-switch
file which
makes it so JNLP builders don’t get admin access to the Jenkins server because
we don’t want rogue builds to take down the universe if possible.
Next we install all the Jenkins plugins. It should be noted that the Docker layer cache can sometimes bite you here. Because we only list the top-level plugins we want (the script handles finding all dependencies), if we want to upgrade an internal dependency but nothing else has changed, we might need to manually zap the cached layer image. Given that Jenkins itself releases weekly anyway (meaning we change the FROM image and invalidate the whole cache), it’s not hugely likely that this will be an operational problem, but be aware.
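If that does happen, the simplest hammer is to rebuild without the cache at all (the tag is whatever version you are currently building):

docker build --no-cache -t ourrepo.com/alti_jenkins:2.92 .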
After that we set up some custom CSS theming. While Blue Ocean does have a nicely refreshed UI, the default post-login landing page is still the normal UI and we wanted to spruce that up a bit, both for aesthetic reasons and to make it easier to tell at a glance which Jenkins server you are looking at. Neo-light seemed the nicest of the themes that still worked, but you can change or ignore this part as you wish.
COPY config.groovy /usr/share/jenkins/ref/init.groovy.d/zzz_alti-jenkins.groovy.override
COPY plugin/target/alti-jenkins-plugin.hpi /usr/share/jenkins/ref/plugins/alti-jenkins-plugin.hpi.override
And finally we copy over two more files. These change more often than the plugins.txt
or Jenkins version (at least so far during development, hopefully that will change
over time) so they go at the end. The config Groovy code gets put in place to be
run automatically at startup, with the weird zzz
thing because Jenkins alpha-sorts
the hook scripts if there is more than one and we want to be last. The config
itself is big and complex so we’ll cover that further down.
The alti-jenkins
plugin is a bit of an experiment, currently all it does is
add the <link>
tags to the HTML for the theme CSS and sets a few security HTTP
headers. This could probably be replaced with the simple-theme
plugin instead,
but I would like to add more stuff to it (ex. custom job health metric that
ignores failed PR builds), so we’re leaving it for now.
And with that, we have a Jenkins Docker image. docker build -t ourrepo.com/alti_jenkins:2.92 ., docker push ourrepo.com/alti_jenkins:2.92, ???, profit. As a built artifact this encompasses the Jenkins release, all plugins used, and the configuration code. Just about everything we could ask for.
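Strung together with the Helm side of things, a Jenkins version bump ends up looking something like this (the release and chart names are placeholders; Server.ImageTag matches the value used in the Deployment template shown later):

docker build -t ourrepo.com/alti_jenkins:2.92 .
docker push ourrepo.com/alti_jenkins:2.92
helm upgrade alti-jenkins ./alti-jenkins --set Server.ImageTag=2.92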
We’ll talk about the config.groovy in just a moment, but because all secrets (or any other configuration you want to hide) come in at run-time from Kubernetes, this image doesn’t actually contain anything that needs to be hidden. If you aren’t running your own registry already, you could push this up to a public Docker Hub account instead.
config.groovy
This was the bulk of my time on the project, a slowly expanding config script that started with some basics and now encompasses the entire setup process. I will lead off with the fact that I am neither a Jenkins nor a Groovy expert, so I’m sure this code can be improved; for example, I only learned very late in my writing that import is optional in Groovy if you use the fully-qualified class name. With that in mind, let’s go line by line again:
import static jenkins.model.Jenkins.instance as jenkins
The most important import, the Jenkins object singleton. We use this a ton, so put it in a magic global.
import com.cloudbees.plugins.credentials.*
import com.cloudbees.plugins.credentials.common.*
import com.cloudbees.plugins.credentials.domains.*
import com.cloudbees.plugins.credentials.impl.*
import hudson.security.GlobalMatrixAuthorizationStrategy
import hudson.security.Permission
import hudson.util.Secret
import jenkins.branch.OrganizationFolder
import jenkins.install.InstallState
import jenkins.plugins.git.GitSCMSource
import org.csanchez.jenkins.plugins.kubernetes.ContainerEnvVar
import org.csanchez.jenkins.plugins.kubernetes.ContainerTemplate
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud
import org.csanchez.jenkins.plugins.kubernetes.PodTemplate
import org.csanchez.jenkins.plugins.kubernetes.ServiceAccountCredential
import org.jenkinsci.plugins.github_branch_source.BranchDiscoveryTrait
import org.jenkinsci.plugins.github_branch_source.GitHubSCMNavigator
import org.jenkinsci.plugins.github_branch_source.OriginPullRequestDiscoveryTrait
import org.jenkinsci.plugins.plaincredentials.impl.StringCredentialsImpl
import org.jenkinsci.plugins.saml.SamlEncryptionData
import org.jenkinsci.plugins.saml.SamlSecurityRealm
import org.jenkinsci.plugins.workflow.libs.LibraryConfiguration
import org.jenkinsci.plugins.workflow.libs.SCMSourceRetriever
println "--> configuring alti_jenkins"
As mentioned, I didn’t really know how Groovy imports worked when starting this, so most of these aren’t strictly needed, but I haven’t cleaned them up yet.
try {
By default, if an init hook script fails, Jenkins prints an error to the log and keeps on truckin’. The whole config code is inside a try/catch so we can at least attempt to not let Jenkins continue starting if we might have failed to configure something important (like, say, authentication). This will only catch runtime errors, though; if there is a syntax error in the script, that will still result in Jenkins starting as per usual.
//////// CONFIG
def secretsRoot = System.getenv('JENKINS_SECRETS') ?: '/var/jenkins_secrets'
def downwardRoot = System.getenv('DOWNWARD_VOLUME') ?: '/etc/downward'
println "--> Loading configuration from from secrets:$secretsRoot and downward:$downwardRoot"
def githubUser = new File("$secretsRoot/github-user").text.trim()
def githubUserToken = new File("$secretsRoot/github-token").text.trim()
def samlPass = new File("$secretsRoot/saml-pass").text.trim()
def samlKeystore = "$secretsRoot/saml-keystore"
def developmentMode = new File("$secretsRoot/development-mode").text.trim() == 'true'
def kubeNamespace = new File("$downwardRoot/namespace").text.trim()
def admins = [
'nkantrowitz', // Noah Kantrowitz
'etc', // Someone else
]
def githubOrg = 'MyOrg'
def librariesRepo = "$githubOrg/jenkins-pipeline-libs"
def agentVersion = '3.10-1-alpine'
// Parse the labels text.
def labels = [:]
new File("$downwardRoot/labels").eachLine {
def parts = it.split('=')
labels[parts[0]] = parts[1][1..-2]
}
Next up, loading and parsing a bunch of configuration data. We’ll look at the pod configuration later on, but this is mostly reading from either a Kubernetes Secret volume (for secrets) or a Downward API volume (for metadata about the pod we are running inside of). And then a few hardcoded values that don’t change often enough to be exposed outside of the file/image like the name of the GitHub organization.
//////// GENERAL SETTINGS
// Bypass the setup wizard because this script defines all of our config.
// This is _supposed_ to be handled by /usr/share/jenkins/ref/jenkins.install.UpgradeWizard.state
// but that doesn't seem to be working. See https://github.com/jenkinsci/docker#script-usage.
if (!jenkins.installState.isSetupComplete()) {
println '--> Neutering SetupWizard'
InstallState.INITIAL_SETUP_COMPLETED.initializeState()
}
// Disable CLI over the remoting protocol for security.
jenkins.getDescriptor("jenkins.CLI").get().enabled = false
// More security, disable old/unsafe agent protocols.
jenkins.agentProtocols = ["JNLP4-connect", "Ping"] as Set
// Enable CSRF.
jenkins.crumbIssuer = new hudson.security.csrf.DefaultCrumbIssuer(true)
// Disable execution on the main server.
jenkins.numExecutors = 0
Then some baseline global configuration. This disables the “welcome to Jenkins” setup wizard, sets some security stuffs, and turns off job execution on the Jenkins server itself because we want all jobs to run inside Kubernetes workers.
//////// AUTHENTICATION
if (new File(samlKeystore).exists()) {
// Configure the SAML plugin.
println '--> Configuring SAML authentication realm'
def realm = new SamlSecurityRealm(
new File('/metadata.xml').text, // String idpMetadata,
'display_name', // String displayNameAttributeName,
'', // String groupsAttributeName,
SamlSecurityRealm.DEFAULT_MAXIMUM_AUTHENTICATION_LIFETIME, // Integer maximumAuthenticationLifetime,
'uid', // String usernameAttributeName,
'email', // String emailAttributeName,
'', // String logoutUrl,
null, // SamlAdvancedConfiguration advancedConfiguration,
new SamlEncryptionData( // SamlEncryptionData encryptionData,
samlKeystore, // String keystorePath,
Secret.fromString(samlPass), // Secret keystorePassword,
Secret.fromString(samlPass), // Secret privateKeyPassword,
'saml-key' // String privateKeyAlias
),
'lowercase' // String usernameCaseConversion,
)
jenkins.securityRealm = realm
}
else {
println '--> Not configuring SAML'
// TODO This should set up some fallback realm.
}
Ahh, the fabled dev TODO; I swear I’ll get back to that someday. This section is setting up the authentication system for logging in to Jenkins, the “security realm” in official parlance. For our production servers we’re using our company-wide SAML SSO system, because if I can ever not have to store passwords, I’ll take that option in a heartbeat. If you don’t have a similar internal SSO system, I would recommend looking at the GitHub OAuth plugin, but you can always use the internal login form realm if needed. If you are using SAML, the attribute configuration options are likely to be different for you, but the rest should look pretty similar. For unknown reasons, the Jenkins SAML plugin refers to the SP signing key as “encryption data”, but it definitely is the signing key. You can also see that this is where we read back in the IdP metadata we embedded in the Jenkins container image up above.
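For what that fallback might eventually look like, here is a minimal sketch: stand up the built-in Jenkins user database so a broken SAML deploy can’t lock everyone out. This is untested and not in the chart as shown, and the bootstrap-pass secret file is hypothetical:

// Hypothetical fallback: local user database with a single bootstrap admin.
// bootstrap-pass would be an extra file in the Secret volume; it does not exist in the chart above.
def realm = new hudson.security.HudsonPrivateSecurityRealm(false) // no open signup
def bootstrapPass = new File("$secretsRoot/bootstrap-pass").text.trim()
realm.createAccount('admin', bootstrapPass)
jenkins.securityRealm = realm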
//////// AUTHORIZATION
if (developmentMode) {
// Turn off authorization in case hacking on SAML configs leads to lockout.
// As it says in the values.yaml, do not do this in production.
println "--> Configuring Unsecured authorization strategy. THIS BETTER NOT BE PROD."
def unsecured = new hudson.security.AuthorizationStrategy$Unsecured()
jenkins.authorizationStrategy = unsecured
}
else {
// Configure matrix auth and ACLs.
println "--> Configuring Matrix authorization strategy."
def authz = new GlobalMatrixAuthorizationStrategy()
[
"hudson.model.Hudson.Read",
"hudson.model.Item.Build",
"hudson.model.Item.Cancel",
"hudson.model.Item.Discover",
"hudson.model.Item.Read",
"hudson.model.Item.Workspace",
"hudson.model.Run.Replay",
"hudson.model.Run.Update",
].each {
// Use the string form because I'm lazy and don't want to import all the things.
authz.add(it + ":authenticated")
}
// Admins always get all permissions. Hopefully I won't regret this.
Permission.getAll().each { perm ->
admins.each { user ->
authz.add(perm, user)
}
}
jenkins.authorizationStrategy = authz
}
While the authentication configuration (or security realm if you want to call it that) determines how users log in, the authorization strategy decides what they can do once logged in. For general use, we’re using the relatively standard Matrix authorizer from the plugin of the same name. If you need more complex access controls you might want to look at the similarly more complex role-strategy authorizer. But here we only really have three bits of authorization config. First is the developmentMode setting coming in from the Secret volume; if that is set, we entirely disable authorization. This is in here for times when I need to hack on the authentication config or I’m offline somewhere and don’t have access to the corporate SAML servers. Otherwise, for the Matrix we set up one entry for the generic “logged-in user” group to give them some minimal, read-only permissions, which is enough for normal users to view builds and force them to re-run if they end up with a flaky test (though hopefully they will fix it soon after). For admins, we create one row for each administrator giving them every permission available in Jenkins. Because this is rebuilt from scratch every time the configuration script runs, it means that if we do manage to break the permissions settings via “accidental” clicking in the web configuration GUI, they will at least get automatically restored as soon as we restart the container.
//////// GITHUB CONFIG
// Create the credentials used to access GitHub.
def creds = CredentialsProvider.lookupCredentials(StandardUsernamePasswordCredentials, jenkins)
def cred = creds.findResult { it.description == "GitHub access token" ? it : null }
if (cred) {
println "--> Updating existing GitHub access token credential ${cred.id}"
def newCred = new UsernamePasswordCredentialsImpl(
cred.scope,
cred.id,
cred.description,
githubUser,
githubUserToken)
SystemCredentialsProvider.instance.store.updateCredentials(Domain.global(), cred, newCred)
}
else {
println '--> Creating GitHub access token credential'
cred = new UsernamePasswordCredentialsImpl(
CredentialsScope.GLOBAL,
java.util.UUID.randomUUID().toString(),
"GitHub access token",
githubUser,
githubUserToken)
SystemCredentialsProvider.instance.store.addCredentials(Domain.global(), cred)
}
Next we construct a Jenkins credential with our GitHub access token. This was read in up at the top from the Secret volume, and here we either create a new credential if one isn’t found, or update the existing one. Again, the goal is convergent behavior, so every time Jenkins starts, it tries to match the persistent state to the desired state.
//////// GLOBAL LIBRARIES
def retriever = new SCMSourceRetriever(new GitSCMSource(
"pipeline",
"https://github.com/${librariesRepo}.git/",
cred.id,
"*",
"",
false))
def pipeline = new LibraryConfiguration("pipeline", retriever)
pipeline.defaultVersion = "master"
pipeline.implicit = true
pipeline.includeInChangesets = false
jenkins.getDescriptor("org.jenkinsci.plugins.workflow.libs.GlobalLibraries").get().setLibraries([pipeline])
This configures the shared pipeline libraries. We’ll show the library code further down, but roughly this allows having a centralized repo with Groovy snippets that can be used by the per-repo Jenkinsfiles. In practical terms, this actually contains almost all of the pipeline logic; the final Jenkinsfile is currently always exactly one line long, calling one of the global presets. The implicit setting means Jenkinsfiles don’t have to explicitly include the library, and disabling includeInChangesets means that a new version of the library won’t trigger every job to build (though that would certainly be a nice load test).
//////// CLOUUUUUD (NOT BUTT)
// Register the Kubernetes magic secret.
creds = CredentialsProvider.lookupCredentials(ServiceAccountCredential, jenkins)
if (creds.isEmpty()) {
println '--> Creating Kubernetes service account credential'
kubeCred = new ServiceAccountCredential(
CredentialsScope.GLOBAL,
java.util.UUID.randomUUID().toString(),
"Kubernetes service account")
SystemCredentialsProvider.instance.store.addCredentials(Domain.global(), kubeCred)
}
else {
kubeCred = creds[0]
}
Starting in on setting up the Kubernetes support in Jenkins. The plugin declares a special credential type that effectively loads on the fly from the pod’s service account. But we still need to actually create that stub secret to feed into the rest of the config. I guess if we weren’t using a service account this would be different, but service accounts are what the cool kids do.
// Configure the cloud plugin.
println '--> Configuring Kubernetes cloud plugin'
def cloud = new KubernetesCloud('kubernetes')
cloud.serverUrl = 'https://kubernetes.default'
cloud.namespace = kubeNamespace
cloud.jenkinsUrl = "http://${labels['app']}:8080"
cloud.jenkinsTunnel = "${labels['app']}-agent:50000"
cloud.credentialsId = kubeCred.id
def podTemplate = new PodTemplate()
podTemplate.name = 'default'
podTemplate.label = "${labels['release']}-agent"
def containerTemplate = new ContainerTemplate('jnlp', "jenkins/jnlp-slave:$agentVersion")
containerTemplate.workingDir = '/home/jenkins'
containerTemplate.command = ''
containerTemplate.args = '${computer.jnlpmac} ${computer.name}' // Single quotes are intentional.
containerTemplate.envVars.add(new ContainerEnvVar('JENKINS_URL', cloud.jenkinsUrl))
containerTemplate.resourceRequestCpu = '200m'
containerTemplate.resourceLimitCpu = '200m'
containerTemplate.resourceRequestMemory = '256Mi'
containerTemplate.resourceLimitMemory = '256Mi'
podTemplate.containers.add(containerTemplate)
cloud.addTemplate(podTemplate)
jenkins.clouds.clear()
jenkins.clouds.add(cloud)
Then the actual cloud plugin configuration. This aims the plugin at the same cluster as Jenkins is running inside of, and uses the pod’s service account credentials as mentioned above. Then we set up a pod and container template for the JNLP worker. This handles the communication with Jenkins once the worker pod launches, but we’ll add additional containers to it in our pipeline libraries to do the actual heavy lifting of the build. Those resource limits are based on the Helm community chart for Jenkins and I’m not yet sure if they reflect reality when Jenkins is under heavy load.
//////// PROJECT FOLDER
println '--> Creating organization folder'
// Create the top-level item if it doesn't exist already.
def folder = jenkins.items.isEmpty() ? jenkins.createProject(OrganizationFolder, 'MyName') : jenkins.items[0]
// Set up GitHub source.
def navigator = new GitHubSCMNavigator(githubOrg)
navigator.credentialsId = cred.id // Loaded above in the GitHub section.
navigator.traits = [
// Too many repos to scan everything. This trims to a svelte 265 repos at the time of writing.
new jenkins.scm.impl.trait.WildcardSCMSourceFilterTrait('*-cookbook', ''),
// We have a ton of old branches so try to limit to just master and PRs for now.
new jenkins.scm.impl.trait.RegexSCMHeadFilterTrait('^(master|PR-.*)'),
new BranchDiscoveryTrait(1), // Exclude branches that are also filed as PRs.
new OriginPullRequestDiscoveryTrait(1), // Merging the pull request with the current target branch revision.
]
folder.navigators.replace(navigator)
This part I’m very proud of. The traditional way to automate Jenkins job creation is the venerable Job DSL plugin. Job DSL uses its own Groovy scripting API to create and manage jobs totally separately from the Jenkins Groovy scripting framework, usually using a single “seed job” to create all the others. This special DSL does (easily) support this use case, but I wanted to try and avoid it. Having one fewer plugin to worry about, as well as a more tightly integrated configuration, seemed worth a bit of extra exploration. Building the job configuration in pure Jenkins Groovy turned out to be pretty straightforward other than being entirely undocumented; almost all of the default values are actually what you want in this case. This block of code will create the top-level folder and set up the GitHub folder source. As mentioned in the comments, the organization scan was being too slow for my tastes, so I set up those two filter traits to cut down on the number of things Jenkins will even bother checking for a Jenkinsfile. As I expand past just cookbook testing those will probably go away, and the org scan only runs once a day so it being slow isn’t actually a runtime problem; I was just being impatient in development.
An aside about the GitHub plugin and webhooks. The plugin can automatically configure the organization webhook for you, however I’m not actually doing that here. That would require giving Jenkins an admin-capable token, which I prefer not to do. If you’re adapting this config to create dozens of organization folders, maybe consider putting that back in (add navigator.afterSave(folder) after the save() in the next snippet), but barring that I would just configure it manually. You’ll want to set the URL to https://myjenkinsserver.com/github-webhook/ and enable the push, pull request, and repository events.
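If you’d rather script that one-time manual setup, a rough sketch against the GitHub API would look like this (the token variable is a placeholder for an org-admin token held by a human, not by Jenkins):

curl -X POST -H "Authorization: token $GITHUB_ADMIN_TOKEN" \
  https://api.github.com/orgs/MyOrg/hooks \
  -d '{"name": "web", "active": true, "events": ["push", "pull_request", "repository"], "config": {"url": "https://myjenkinsserver.com/github-webhook/", "content_type": "json"}}'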
println '--> Saving Jenkins config'
jenkins.save()
We did it! I don’t think a manual save is actually required but it makes life
slightly easier if we have to kubectl exec
to log in and stare at the config.xml
manually.
println '--> Scheduling GitHub organization scan'
Thread.start {
sleep 30000 // 30 seconds
println '--> Running GitHub organization scan'
folder.scheduleBuild()
}
Because we really wanted this to be fully hands-off, we schedule an org scan on startup. This has to wait for a few other startup tasks inside Jenkins, so it runs 30 seconds after this script.
println "--> configuring alti_jenkins... done"
}
catch(Throwable exc) {
println '!!! Error configuring alti_jenkins'
org.codehaus.groovy.runtime.StackTraceUtils.sanitize(new Exception(exc)).printStackTrace()
println '!!! Shutting down Jenkins to prevent possible mis-configuration from going live'
jenkins.cleanUp()
System.exit(1)
}
And then finally the catch for those runtime errors we talked about up at the start. This will print the error to the log (which ends up in kubectl logs or whatever else you are using) and then tells Jenkins to shut down.
And there you have it. A fully convergent configuration script to set up a working Jenkins based on GitHub and Pipelines and all that jazz. Now we just need to run this sucka’.
Helm Chart
We first tried using the community Helm chart for Jenkins directly, and then making a wrapper chart, but both approaches turned out more complex than we wanted. At this time, my recommendation for charts as complex as Jenkins is to fork them and use them as a starting point for your own chart. As an example, the community chart uses the Jenkins container image directly, and installs the config and plugins in an initContainer, rather than building our own image as shown above. We felt this approach left too many moving pieces at deploy time and potentially compromised our ability to do rollbacks if an upgrade didn’t go according to plan. I’m not going to show the entire chart as there is a lot that is going to be specific to my requirements, but I do want to talk about the core of it.
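If you want to go the same route, grabbing a local copy of the community chart to use as raw material is a one-liner (stable/jenkins was the chart name in the stable repo at the time of writing):

$ helm fetch stable/jenkins --untar --untardir charts/

From there you can rename it, strip out what you don’t need, and maintain it as your own chart.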
Deployment
As with most simple applications, the heart of the deployment is a Kubernetes Deployment object, which manages the actual pods.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: {{ template "alti-jenkins.fullname" . }}
  labels:
    app: {{ template "alti-jenkins.fullname" . }}
    chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    release: "{{ .Release.Name }}"
    heritage: "{{ .Release.Service }}"
    component: "{{ .Release.Name }}-jenkins"
We start with a pretty standard set of metadata for a Helm-created object. Some of these values are used for selectors in the service/ingress side of things, but mostly these are to aid in human debugging and management.
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      component: "{{ .Release.Name }}-jenkins"
Simple defaults: we only want one pod at a time, because Jenkins’ idea of HA is “restart it if it crashes”, so leave things at that.
  template:
    metadata:
      labels:
        app: {{ template "alti-jenkins.fullname" . }}
        heritage: "{{ .Release.Service }}"
        release: "{{ .Release.Name }}"
        chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
        component: "{{ .Release.Name }}-jenkins"
      annotations:
        checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
Next the same metadata but for the Pod object this time. The annotation is a somewhat standard Helm pattern where we put a checksum of the rendered Secret object in the Deployment so that when the Secret changes, the Deployment will re-roll the pods automatically. This is needed because our config code only looks at the secret data at container startup, so changes made after that would otherwise be ignored. The actual checksum isn’t used for anything, but it changing will trigger Tiller (the server component of Helm) to update the Deployment, which triggers the Pods to re-roll.
    spec:
      serviceAccountName: {{ template "alti-jenkins.fullname" . }}
      imagePullSecrets:
        - name: {{ template "alti-jenkins.fullname" . }}-pull
      securityContext:
        # This is the default gid for the jenkins group in the upstream container.
        fsGroup: 1000
Some general Pod configuration, setting the service account for Jenkins, the
image pull secret to talk to our internal registry to download the alti_jenkins
image we made before, and setting the GID used for volume mounts down below so
that we can lock down the file modes a little bit just in case someone manages
to get a shell on the Jenkins container somehow.
      containers:
        - name: {{ template "alti-jenkins.fullname" . }}
          image: "{{ .Values.Server.Image }}:{{ .Values.Server.ImageTag }}"
          {{- if .Values.Server.ImagePullPolicy }}
          imagePullPolicy: "{{ .Values.Server.ImagePullPolicy }}"
          {{- end }}
Basic container setup, nothing too interesting here. The only reason to change the image pull policy is to set it to Always in local development if I’m rebuilding the same image version multiple times before I release it. But I usually do my development in minikube anyway, so I use minikube docker-env to build directly in the Docker daemon that Kubernetes will use later on: ( eval "$(minikube docker-env)" && docker build -t myrepo.com/alti_jenkins:2.whatever . ).
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 50000
              name: agentlistener
Expose two ports, one for HTTP and the other for JNLP workers. Unlike the community chart, I didn’t see much reason to make these configurable since there is no worry of port collisions or whatever.
          resources:
            requests:
              cpu: "{{ .Values.Server.Cpu }}"
              memory: "{{ .Values.Server.Memory }}"
We don’t yet really have a good idea for what these limits should be from production data under heavy load, so just make them configurable for now.
          volumeMounts:
            - mountPath: /var/jenkins_home
              name: jenkins-home
            - mountPath: /var/jenkins_secrets
              name: jenkins-secrets
              readOnly: true
            - name: downward
              mountPath: /etc/downward
              readOnly: true
      volumes:
        - name: jenkins-home
          persistentVolumeClaim:
            claimName: {{ .Values.Persistence.ExistingClaim | default (include "alti-jenkins.fullname" .) }}
        - name: jenkins-secrets
          secret:
            secretName: {{ template "alti-jenkins.fullname" . }}
            defaultMode: 0440
        - name: downward
          downwardAPI:
            items:
              - path: labels
                fieldRef:
                  fieldPath: metadata.labels
              - path: namespace
                fieldRef:
                  fieldPath: metadata.namespace
And then the volumes. We need three volumes for three different purposes. The big one is the JENKINS_HOME mount. Despite my best efforts towards immutable configuration, Jenkins still does store a lot of state in the JENKINS_HOME directory, like job history and build artifacts. As such, this needs to be persistent at least over short timescales. If we lost this persistent volume we could still trivially rebuild Jenkins, but we would lose enough history that it might be frustrating. So for now, PVC.
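The claim itself isn’t shown in the excerpts above, but it’s a bog-standard PVC template along these lines. The Persistence.Size value name and default here are just guesses at how you might lay out your own values file, not copied from our chart; Persistence.ExistingClaim matches the Deployment snippet:

{{- if not .Values.Persistence.ExistingClaim }}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ template "alti-jenkins.fullname" . }}
  labels:
    app: {{ template "alti-jenkins.fullname" . }}
    chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    release: "{{ .Release.Name }}"
    heritage: "{{ .Release.Service }}"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: {{ .Values.Persistence.Size | default "8Gi" | quote }}
{{- end }}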
Then the two configuration volumes, a Secret volume and the Downward API volume. As we saw back at the top of the config.groovy, these are used to feed configuration data into the Jenkins config. The defaultMode setting works with the fsGroup up above to slightly restrict things, though probably not in a way that really matters, but yay for defense in depth.
Secret
Mostly pretty rote, but including it here as an example:
apiVersion: v1
kind: Secret
metadata:
  name: {{ template "alti-jenkins.fullname" . }}
  labels:
    app: {{ template "alti-jenkins.fullname" . }}
    chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    release: "{{ .Release.Name }}"
    heritage: "{{ .Release.Service }}"
type: Opaque
data:
  artifactory-token: {{ required "Secrets.ArtifactoryToken is required" .Values.Secrets.ArtifactoryToken | b64enc | quote }}
  github-user: {{ required "Secrets.GithubUser is required" .Values.Secrets.GithubUser | b64enc | quote }}
  github-token: {{ required "Secrets.GithubToken is required" .Values.Secrets.GithubToken | b64enc | quote }}
  saml-keystore: {{ required "Secrets.SamlKeystore is required" .Values.Secrets.SamlKeystore | nospace | quote }}
  saml-pass: {{ required "Secrets.SamlPass is required" .Values.Secrets.SamlPass | b64enc | quote }}
  # Not technically secret but convenient to put here because the same kind of code needs them.
  development-mode: {{ printf "%t" .Values.Server.DevelopmentMode | b64enc | quote }}
One thing to note is that the SamlKeystore
value is coming in already base64-encoded
because it’s a binary file format and it’s vastly easier to store it in the Helm
values file (and Tiller storage of the same) as text data. Given that we expect
this value to change infrequently (cert rotation, or in case of a security issue),
we just put it in base64 up front by hand.
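Getting the keystore into that form is a one-liner (the filename is a placeholder; -w0 is the GNU coreutils flag to disable line wrapping, macOS’s base64 doesn’t wrap by default so you can drop it there):

$ base64 -w0 saml-keystore.jks
# paste the output into Secrets.SamlKeystore in your values file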
Services
We end up with two services, one each for HTTP and worker traffic.
apiVersion: v1
kind: Service
metadata:
  name: {{ template "alti-jenkins.fullname" . }}
  labels:
    app: {{ template "alti-jenkins.fullname" . }}
    chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    release: "{{ .Release.Name }}"
    heritage: "{{ .Release.Service }}"
    component: "{{.Release.Name}}-jenkins"
spec:
  ports:
    - port: 8080
      targetPort: 8080
      name: http
  selector:
    component: "{{.Release.Name}}-jenkins"
and then:
apiVersion: v1
kind: Service
metadata:
  name: {{ template "alti-jenkins.fullname" . }}-agent
  labels:
    app: {{ template "alti-jenkins.fullname" . }}
    chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    release: "{{ .Release.Name }}"
    heritage: "{{ .Release.Service }}"
    component: "{{ .Release.Name }}-jenkins"
spec:
  ports:
    - port: 50000
      targetPort: 50000
      name: agentlistener
  selector:
    component: "{{ .Release.Name }}-jenkins"
  type: ClusterIP
Ingress
And finally an Ingress to handle TLS in production. We went with an Ingress
because we wanted to use kube-lego
to
automate certificates via LetsEncrypt. If you’re on AWS and want to use ACM
instead, you can do that directly via the first Service object above.
{{- if .Values.Server.PublicHostname }}
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: {{ template "alti-jenkins.fullname" . }}
  labels:
    app: {{ template "alti-jenkins.fullname" . }}
    chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    release: "{{ .Release.Name }}"
    heritage: "{{ .Release.Service }}"
  annotations:
    kubernetes.io/tls-acme: "true"
spec:
  tls:
    - secretName: {{ template "alti-jenkins.fullname" . }}-tls
      hosts:
        - {{ .Values.Server.PublicHostname | quote }}
  rules:
    - host: {{ .Values.Server.PublicHostname | quote }}
      http:
        paths:
          - path: /
            backend:
              serviceName: {{ template "alti-jenkins.fullname" . }}
              servicePort: 8080
{{- end }}
This will only be active in production; for local development I just talk directly to the service via minikube’s minikube service helper command. I’ve only tested with the Nginx ingress controller, but I would imagine it should work with the GCE ingress too, and any other controller supported by kube-lego.
In production we installed both nginx-ingress
and kube-lego
using
their community Helm charts directly:
$ helm install -n nginx-ingress stable/nginx-ingress
$ helm install --set config.LEGO_EMAIL=me@example.com \
--set config.LEGO_URL=https://acme-v01.api.letsencrypt.org/directory \
-n kube-lego stable/kube-lego
Global Pipeline Libraries
Okay, so we have a working, running Jenkins server. Progress! Next step is to get some actual builds on it. Because we have hundreds of cookbooks which should all use the same build logic, we wanted to make sure all of that was kept somewhere centralized. Jenkins’ global libraries system made this very easy, though as I’ve only attacked the cookbook testing use case I don’t actually have very much yet.
An aside about how to structure this: each helper goes in a file named
vars/nameOfHelper.groovy
. The bit after vars/
is what ends up being the
function name for your Jenkinsfiles.
// vars/altiNode.groovy
def call(Closure body) {
def secretsRoot = System.getenv('JENKINS_SECRETS') ?: '/var/jenkins_secrets'
def artifactoryToken = new File("$secretsRoot/artifactory-token").text.trim()
withEnv(['CI=true', "BERKSHELF_PATH=${env.WORKSPACE}/.berkshelf", "ARTIFACTORY_API_KEY=$artifactoryToken"]) {
node('cookbook') {
container('alti-pipeline') {
body()
}
}
}
}
First a utility helper to help avoid too many levels of indentation in other helpers. This is like the built-in node {} pipeline step but with some standard stuffs for our testing environment.
// vars/altiCookbook.groovy
def call(Closure body) {
def altiPipelineVersion = '4.9.2'
def downwardRoot = System.getenv('DOWNWARD_VOLUME') ?: '/etc/downward'
// Parse the labels text.
def labels = [:]
new File("$downwardRoot/labels").eachLine {
def parts = it.split('=')
labels[parts[0]] = parts[1][1..-2]
}
Next is the interesting one, the cookbook testing pipeline, though currently
a very simple one. First we do some configuration stuff like we saw in config.groovy
.
It is, in fact, a copy-pasta because I couldn’t find a reasonable way to share
code between the two contexts and it’s not very long anyway.
podTemplate(label: 'cookbook', imagePullSecrets: ["${labels['app']}-pull"], containers: [
containerTemplate(name: 'alti-pipeline', image: "altiscale-docker-dev.jfrog.io/alti_pipeline:${altiPipelineVersion}", alwaysPullImage: false, command: "/bin/sh -c \"trap 'exit 0' TERM; sleep 2147483647 & wait\""),
]) {
Then we set up the pod for our job to build in. This will be combined with the podspec we gave the Kubernetes cloud plugin up in the Jenkins configuration, so the final pod will end up with two containers, one for the JNLP worker and another with the build environment image. The build environment doesn’t actually have a service to run, so we use the sleep-and-wait command to keep it alive until the pod is shut down.
def integrationTests = []
stage('Check') {
altiNode {
checkout scm
// Check that we have an acceptable version of alti_pipeline, just looks at the major version.
def gemfile = readFile('Gemfile')
if(gemfile =~ /gem.*alti_pipeline.*\b${altiPipelineVersion[0]}\./) {
echo "Gemfile is compatible with alti_pipeline ${altiPipelineVersion}"
} else {
error "Gemfile is not compatible with alti_pipeline ${altiPipelineVersion}:\n"+gemfile
}
// Parse out the integration tests for use in the next stage.
integrationTests = sh(script: 'kitchen list --bare', returnStdout: true).split()
}
}
The first stage is a sanity check. Unlike many Chef shops, we don’t actually use
ChefDK (that would have to be a whole ‘nother blog post so just take it as a given)
and instead have a Gemfile in each cookbook that points it at our equivalent gem,
alti_pipeline
. Here we want to make sure that the cookbook’s Gemfile is the
same major version as the build image since if it isn’t, the build is very
unlikely to work. We also grab the list of all Test Kitchen instances to build in
the next stage.
stage('Test') {
testJobs = [
'Lint': {
altiNode {
checkout scm
sh 'rm -f Gemfile Gemfile.lock'
sh 'rake style'
}
},
'Unit Tests': {
altiNode {
checkout scm
try {
sh 'rm -f Gemfile Gemfile.lock'
sh 'rake spec'
} finally {
junit 'results.xml'
}
}
},
]
integrationTests.each { instance ->
testJobs["Integration $instance"] = {
altiNode {
checkout scm
sh 'rm -f Gemfile Gemfile.lock'
sh "kitchen test --destroy always $instance"
}
}
}
parallel(testJobs)
}
body()
}
}
And then finally the actual test bit of the pipeline. This sets up jobs for
lint checking, unit tests, and one job each for the integration tests so they
can all run in parallel. You can see this uses the altiNode
helper from above,
instead of the usual node
pipeline step. We’re also removing the Gemfile in-place
since we’ve already installed all the needed gems in my build environment image
and don’t want bundler to even try and activate.
Cookbook Integration Testing
As part of this project I also built a new Test Kitchen driver, kitchen-kubernetes, specifically for running Chef cookbook integration tests on top of Kubernetes. This works similarly to kitchen-docker and kitchen-dokken, but using Kubernetes machinery rather than plain Docker containers. If duplicating this set up for yourself, make sure you remember to include rsync in the job build image (alti_pipeline above) as that is required for kitchen-kubernetes’s file upload system.
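I’m not going to document the driver itself here, but for flavor, the cookbook side is just a normal .kitchen.yml with the driver swapped out. A rough sketch (the cookbook name is a placeholder, and everything beyond the driver name is standard Test Kitchen config rather than anything specific to kitchen-kubernetes):

---
driver:
  name: kubernetes

provisioner:
  name: chef_zero

platforms:
  - name: centos-7

suites:
  - name: default
    run_list:
      - recipe[my_cookbook::default]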
Build Environment Image
While most people doing Chef cookbook testing should probably use the chef/chef-dk
image, as mentioned before we are not using ChefDK for our environment management.
The short version of “why” is that we’re still on Chef 12 but wanted newer versions
of a lot of tools, as well as including a lot of our own utility gems. We may
yet transition back to ChefDK but for now we needed to create a container image
that included a bunch of private gems. Pulling in private gems means including
an access token for the repository (careful readers have probably figured out
by now that we use Artifactory), but unfortunately build-time secrets are still a notable problem with docker build. There are a few options: short-lived tokens that do get baked in to the image but are already expired by the time anyone could get them, localhost proxies that handle authentication, or alternative image build systems like Habitat or buildah. We decided to try and keep it simple and use the new “squash build” feature in Docker.
We decided to use alpine
as the base image (shoutout to the great folks at
Glider Labs) to minimize the file size. Kubernetes does cache images
aggressively, but every little bit helps in improving build performance. The
final Dockerfile
looks like this:
FROM alpine:latest
ENV VERSION=4.9.2
ENV ALTISCALE_KITCHEN_KUBERNETES=true
ENV ALTISCALE_BERKS_ARTIFACTORY=true
COPY .gemrc /root
RUN set -x && \
apk --update-cache add build-base ruby-io-console ruby ruby-dev libffi libffi-dev zlib zlib-dev curl git openssh-client rsync && \
gem install alti_pipeline -v $VERSION && \
git clone https://github.com/coderanger/kitchen-kubernetes /tmp/kitchen-kubernetes && \
( cd /tmp/kitchen-kubernetes && gem build *.gemspec && gem install --local *.gem ) && \
curl -L -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl && \
chmod +x /usr/local/bin/kubectl && \
gem sources --clear-all && \
rm -rf /root/.gemrc /usr/lib/ruby/gems/2.4.0/cache/*.gem /tmp/kitchen-kubernetes && \
for f in /usr/lib/ruby/gems/2.4.0/gems/*; do rm -rf $f/spec $f/test $f/examples $f/distro $f/acceptance; done && \
apk del build-base ruby-dev libffi-dev zlib-dev curl
This can be broken down into four parts. First, the base image and some environment variables we want set for all builds. Then copying the gem server credentials; from this point on things become radioactive because we have a secret value in the image. Then the installs: first a bunch of Alpine packages we need, then the top-level alti_pipeline gem, kitchen-kubernetes (from git because I haven’t actually put up a release yet), and kubectl (for use by kitchen-kubernetes). Finally, a whole bunch of cleanup. This is mostly to reduce the final image size, removing files and packages we don’t need after image creation. But we also remove the .gemrc, making the image no longer radioactive if built correctly.
Even with this COPY
+rm
though, we need to make sure to build the image using
docker build --squash
(which requires experimental features be enabled on the
Docker daemon, add --experimental=true
to the daemon command line). If built
without --squash
, the final image would look like it doesn’t have the token,
but it would still be visible in the intermediary layer created between the
COPY
and RUN
. Hopefully at some point there will be a better solution for
build-time secrets, but for now this is enough to get us a build environment
weighing in at around 100MB.
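Concretely, the experimental flag can either go on the daemon command line as mentioned or in the daemon config file, and the build itself is just (the tag mirrors the image name used in the pipeline library above):

# /etc/docker/daemon.json on the build host:
#   { "experimental": true }
$ docker build --squash -t altiscale-docker-dev.jfrog.io/alti_pipeline:4.9.2 .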
Per-Repo Jenkinsfile
One requirement of this set up is you do need to put a Jenkinsfile in each
repository you want to be built. This might be frustrating for some, having to
touch every repo when that could potentially be (and in my case, is) hundreds
of projects. That said, currently the Jenkinsfile we are adding to each repo is
literally altiCookbook { }
. So it’s not much in terms of footprint, but you
do have to do the legwork, either by hand or via a script using the GitHub API.
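If you go the script route, the GitHub contents API makes that rollout fairly painless. A rough sketch (the repo list file and token variable are placeholders, and you’d want to add error handling before pointing it at hundreds of repositories):

for repo in $(cat cookbook-repos.txt); do
  curl -X PUT -H "Authorization: token $GITHUB_TOKEN" \
    "https://api.github.com/repos/MyOrg/$repo/contents/Jenkinsfile" \
    -d "{\"message\": \"Add Jenkinsfile\", \"content\": \"$(echo 'altiCookbook { }' | base64)\"}"
done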
To Conclude
As I mentioned at the start, my goal here is to provide a jump start in designing your own Jenkins deployment. I suspect the precise combo of design choices shown here might be literally unique in the world, but most of the bits are very modular and the overall structure should be a starting point for your own specifics.
If you have any questions on any of this code or the design decisions behind it you can reach me at noah@coderanger.net.
Thanks to SAP for supporting this work.