

Standalone deployment of Airflow on Kubernetes

We will show how to do a very simple standalone production deployment of Airflow on Kubernetes.

The idea is to have a single Kubernetes Pod of two containers: one Postgres and one Airflow. To make the Postgres DB persistent, we deploy this with a StatefulSet and a PersistentVolumeClaim. This way we don't lose progress when we restart Airflow in the middle of executing a DAG.

Why?

This should work anywhere Kubernetes is available, and the setup stays simple: one spec file, one Pod, and no separately managed database.

Of course, there's a price to pay for this simplicity.

Limitations

This is intended for folks who don't need to scale horizontally: we will have a single instance, it won't be highly available, and we might lose the DB state. This sounds scary, but if we already follow Airflow best-practices, it should not be a problem at all.

So, let's make sure that our DAGs are idempotent and safe to re-run: then losing the metadata DB at worst means some tasks get executed again.
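
For example, a DAG along these lines survives a restart gracefully (a hypothetical example, not part of this deployment): the task overwrites its output for the given execution date instead of appending, and retries are configured, so running it again after a crash gives the same result.

from datetime import timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils import timezone


def load_partition(ds, **kwargs):
    # Idempotent: overwrite the partition for `ds` instead of appending,
    # so re-running this task after a crash produces the same result.
    print("overwriting partition dt=%s" % ds)


dag = DAG(
    dag_id="idempotent_example",
    schedule_interval="@daily",
    catchup=False,
    default_args=dict(
        start_date=timezone.datetime(2022, 1, 1),
        retries=3,  # transient failures are retried automatically
        retry_delay=timedelta(minutes=5),
    ),
)

PythonOperator(
    task_id="load_partition",
    python_callable=load_partition,
    provide_context=True,  # Airflow 1.x: pass ds etc. to the callable
    dag=dag,
)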

How?

Just to try this setup, it's useful to set up local development with Kubernetes first.

We're going to use Airflow 1.x, but this should work similarly with Airflow 2.x.

Base image

Let's assume we have a Docker registry running at localhost:5000. (In real life it might be something like gcr.io.)
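
If there is no registry at hand, a throwaway local one is enough for experimenting. This is just one way to start it (not part of the production setup), using the standard registry image:

docker run -d -p 5000:5000 --name registry registry:2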

We should have an Airflow base Docker image that only has Airflow. We can use the official Airflow image or we can build our own. To build our own, we will need these two files:

Dockerfile:
FROM python:3.7-slim-buster

# This is to install gcc, etc. that's needed by Airflow.
RUN export DEBIAN_FRONTEND=noninteractive && \
    apt-get update && apt-get -y upgrade && \
    apt-get -y install --no-install-recommends gcc python3-dev build-essential && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip

RUN useradd -m airflow
WORKDIR /home/airflow

ENV AIRFLOW_HOME /home/airflow

COPY base-image/requirements.txt .
RUN pip install -r requirements.txt

USER airflow

ENTRYPOINT []
CMD []

### from here on app specific

COPY dags dags

requirements.txt:
# This is what we actually want to install.
apache-airflow[gcp,kubernetes]==1.10.15
# This is needed for Postgres connection
psycopg2-binary

# But there are breaking changes in some transitive dependencies
# so we need to pin them.
# See https://github.com/pallets/markupsafe/issues/284
markupsafe>=2.0.1,<2.1.0
# See https://stackoverflow.com/questions/69879246/no-module-named-wtforms-compat
wtforms>=2.3.3,<2.4.0
# See https://github.com/sqlalchemy/sqlalchemy/issues/6065
sqlalchemy>=1.3.20,<1.4.0
# See https://itsmycode.com/importerror-cannot-import-name-json-from-itsdangerous/
itsdangerous>=2.0.1,<2.1.0

Then build the base image:

docker build -t localhost:5000/airflow:latest .

Building an image with our code

Let's say we have a very simple dummy DAG in dags/dummy.py:

from airflow import DAG
from airflow.utils import timezone
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="dummy",
    schedule_interval=None,
    catchup=False,
    default_args=dict(start_date=timezone.datetime(2022, 1, 1)),
)

DummyOperator(task_id="dummy", dag=dag)
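
Before baking this into an image, it's worth a quick sanity check that the file imports cleanly (assuming Airflow is installed locally, e.g. from the requirements.txt above); this catches syntax errors and missing imports early:

python dags/dummy.py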

We can package this in a Docker image from our base image with this Dockerfile:

FROM localhost:5000/airflow:latest

COPY dags dags

To build and push it:

docker build -t localhost:5000/dummydag:latest .
docker push localhost:5000/dummydag:latest

Kubernetes deployment with StatefulSet

We can define a Kubernetes StatefulSet that runs Airflow and Postgres as two containers in the same Pod, plus a PersistentVolumeClaim for the Postgres data. We define all of this in a Kubernetes spec file airflow.yml:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: airflow
spec:
  serviceName: "my-airflow-service"
  selector:
    matchLabels:
      my-app-label: airflow-app
  # Claiming a Persistent Volume for our Pod.
  volumeClaimTemplates:
    - metadata:
        name: postgres-volume
      spec:
        # Only one Pod should read and write it.
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
  template:
    metadata:
      labels:
        my-app-label: airflow-app
    spec:
      containers:
        # Postgres container with dummy user and password defined.
        - name: postgres
          image: postgres:9
          env:
            - name: POSTGRES_USER
              value: airflow_user
            - name: POSTGRES_PASSWORD
              value: airflow_pass
            - name: POSTGRES_DB
              value: airflow_db
          volumeMounts:
            - name: postgres-volume
              # This is the default path where Postgres stores data.
              mountPath: /var/lib/postgresql/data
        # Airflow container with our code.
        - name: airflow
          # We use the image that we built.
          image: localhost:5000/dummydag:latest
          command: ['/bin/bash',
                    '-c',
                  # We need to start both the Web UI and the scheduler.
                    'airflow upgradedb && { airflow webserver -p 8080 & } && airflow scheduler']
          env:
            - name: AIRFLOW_HOME
              value: '/home/airflow'
            - name: AIRFLOW__CORE__LOAD_EXAMPLES
              value: 'false'
            - name: AIRFLOW__CORE__EXECUTOR
              # Use an executor that can actually run tasks in parallel.
              value: 'LocalExecutor'
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              # We use the same user and password as defined above.
              value: 'postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db'
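
One thing to note: serviceName above refers to a governing Service that this spec doesn't define. The Pod still starts and we only use port-forwarding here, but to be complete we can add a headless Service, for example by appending something like this to airflow.yml (a minimal sketch; the port is only listed for completeness):

---
apiVersion: v1
kind: Service
metadata:
  name: my-airflow-service
spec:
  # Headless Service: gives the Pod a stable DNS name, no load balancing.
  clusterIP: None
  selector:
    my-app-label: airflow-app
  ports:
    - name: web
      port: 8080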

Then we can deploy it:

kubectl apply -f airflow.yml

And we can see the StatefulSet and the Pod:

kubectl get statefulsets
kubectl get pods

We can check the Airflow Web UI with port-forwarding:

kubectl port-forward pod/airflow-0 8080:8080

Then go to http://localhost:8080 to see the Web UI.
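
We can also poke at the deployment from the command line (using the container and DAG names above): unpause and trigger the dummy DAG inside the airflow container, look at the logs, and even delete the Pod to check that the StatefulSet recreates it with the DB state intact.

kubectl exec airflow-0 -c airflow -- airflow unpause dummy
kubectl exec airflow-0 -c airflow -- airflow trigger_dag dummy
kubectl logs airflow-0 -c airflow

# The StatefulSet recreates the Pod, reusing the same PersistentVolumeClaim.
kubectl delete pod airflow-0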

Going further

This is a very basic setup; there is plenty of room to improve it.
