Service Discovery and the CDN

This year we changed the CDN system: we enabled service discovery for the Varnish fleet. This is a follow-up to this post:
http://blog.majestik.org/internal-content-distribution-network/

The new components
  • etcd
  • custom scripts
  • python-etcd

Assumptions
  • Assume that we have a functional etcd installation and it's reachable at http://etcd:2379. We are also playing a little loose with security here; don't do that if you deploy this in production anywhere.
  • Networks at the remote sites are predictable: we used 10.0.x.x networks for wired ports, 10.1.x.x for wireless, and 10.2.x.x for phones. VPN is also available, but we don't want remote users who come into a site via VPN using that site's cache when the public caches suit them better.
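
As a rough illustration of that addressing scheme, here is a minimal Python sketch that labels a client address as wired, wireless, or phone. The names SITE_RANGES and classify_client are made up for this example, and since the VPN range isn't given here, anything outside the three ranges simply comes back as None.

```python
import ipaddress

# The three per-site ranges from the assumptions above.
# SITE_RANGES and classify_client are hypothetical names for this sketch.
SITE_RANGES = {
    "wired": ipaddress.ip_network("10.0.0.0/16"),
    "wireless": ipaddress.ip_network("10.1.0.0/16"),
    "phone": ipaddress.ip_network("10.2.0.0/16"),
}


def classify_client(ip):
    """Return 'wired', 'wireless' or 'phone', or None for any other address."""
    addr = ipaddress.ip_address(ip)
    for name, net in SITE_RANGES.items():
        if addr in net:
            return name
    # VPN and everything else: not a site client, so no site cache.
    return None
```

A client that comes back as None would be left to the public caches.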

The health check

We used Varnish's built-in health checks.

This script checks whether Varnish believes that it is healthy, and if so, pushes that into etcd. I set a TTL of 120 seconds on the key so that a crashed node drops out of etcd, but not too quickly.

#!/bin/bash

# Checks if varnish is healthy and registers in etcd

while true; do  
        IP=`ip addr show eth0 | grep "inet " | awk '{print $2}'`
        value=$IP:6081
        states=`varnishadm backend.list | grep probe | awk '{print $4}'`
        status="Sick"

        for state in $states; do
                if [ $state == "Healthy" ] ; then
                        status="Healthy"
                fi
        done


        if [ $status == "Sick" ] ; then
                curl -L http://etcd:2379/v2/keys/varnish/$HOSTNAME -XDELETE > /dev/null 2>&1
        else
                if [ $status == "Healthy" ] ; then
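                        # grep -v exits 0 when the key exists, and non-zero when etcd answers "Key not found"
                        curl -L http://etcd:2379/v2/keys/varnish/$HOSTNAME 2>/dev/null | grep -v "Key not found" > /dev/null
                        isNotReg=$?
                        # Compare numerically; a bare [ $isNotReg ] is true for any non-empty string, including "0"
                        if [ "$isNotReg" -ne 0 ] ; then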
                                curl -L http://etcd:2379/v2/keys/varnish/$HOSTNAME -XPUT -d value=$value -d ttl=120
                        else
                                curl -L http://etcd:2379/v2/keys/varnish/$HOSTNAME -XPUT -d value=$value -d prevExist=true -d ttl=120
                        fi
                fi
        fi

        sleep 5
done  
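
To see how the 120-second TTL and the 5-second loop interact, here is a small Python model of the same heartbeat logic. It is only a sketch: TTLRegistry is a made-up in-memory stand-in for etcd, and time is passed in explicitly so the behaviour is easy to follow.

```python
class TTLRegistry:
    """Hypothetical in-memory stand-in for etcd keys that carry a TTL."""

    def __init__(self):
        self._entries = {}  # key -> (value, expiry time in seconds)

    def put(self, key, value, ttl, now):
        # Writing a key again pushes its expiry out, like the refresh PUT.
        self._entries[key] = (value, now + ttl)

    def delete(self, key):
        self._entries.pop(key, None)

    def live(self, now):
        """Return the keys that have not expired at time `now`."""
        return {k: v for k, (v, exp) in self._entries.items() if exp > now}


def heartbeat(registry, host, value, healthy, now, ttl=120):
    """One pass of the loop: refresh the key if healthy, delete it if sick."""
    if healthy:
        registry.put(host, value, ttl, now)
    else:
        registry.delete(host)
```

In this model a node that reports Sick disappears on the next pass, while a node that crashes outright (and so stops sending heartbeats) stays listed until 120 seconds after its last refresh.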

And a little upstart love to make sure that the register process is running:

start on (net-device-up  
          and local-filesystems
          and runlevel [2345])
stop on runlevel [016]

exec /etc/varnish/register.sh  
respawn

The Server

If you recall from the previous example, we used a PHP script as a landing page for the users, including a $target array. We replace that with an include of a new file, which the following Python script regenerates whenever the set of hosts registered in etcd changes (it polls every 5 seconds). There is one specific exception where the cache server is not on the same network as the clients, so we have a special case below for that one.

#!/usr/bin/python

import etcd  
import time

sleep_timer = 5  
outFile = "/var/www/html/targets.php"

client = etcd.Client(host="etcd", port=2379)


data_prev = {}

changed = True

def updateFile (data):  
        print("Updating Data")
        targetString = "<?php\n#Automatically Generated by etcd-targets.py\n"
        targetString += "$targets = array(\n"
        for key in data:
                hostName = key.split("/")[3]
                hostData = data[key].split(":")
                network=hostData[0].split(".")
                port=hostData[1]
                networks = []
# Special case for the 10.10.0.0 network, the host is on that network, but clients are 10.x.240.0
                if int(network[1]) == 10 and int(network[2]) == 0:
                        networks.append("10.0.240.0/20")
                        networks.append("10.1.240.0/20")
                        networks.append("10.2.240.0/20")
                mask = network[3].split("/")[1]
                networks.append("%s.%s.%s.0/%s"%(network[0],network[1],network[2],mask))
                if int(network[1]) == 0:
                        networks.append("%s.1.%s.0/%s"%(network[0],network[2],mask))
                        networks.append("%s.2.%s.0/%s"%(network[0],network[2],mask))

                for net in networks:
                        targetString += '\t\t\tarray ("%s", "%s:%s"),\n'%(net, hostName, port)

#               if network == "10.10.

        targetString += ");\n?>\n"
        outputFile = open(outFile,"w")
        outputFile.write(targetString)
        outputFile.close()


while True:  
        data = client.read('/varnish', recursive=True)
        data_cur = {}
        for entry in data.children:
                data_cur[entry.key] = entry.value

        for entry in data_cur:
                if entry not in data_prev.keys():
                        print ("%s - Added"%entry)
                        changed = True
        for entry in data_prev:
                if entry not in data_cur.keys():
                        print ("%s - Removed"%entry)
                        changed = True

        if changed:
                updateFile(data_cur)

        changed = False


        data_prev=data_cur
        time.sleep(sleep_timer)
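
The network derivation inside updateFile is the part that is easiest to get wrong, so here is a standalone sketch of it that can be tried without etcd. It is not a drop-in replacement: it uses Python 3's ipaddress module instead of the string splitting above, so the network address is computed from the mask rather than assumed to end in .0, and client_networks is a name made up for this example.

```python
import ipaddress


def client_networks(value):
    """Split a registration value such as '10.0.5.17/24:6081' into
    (port, list of client networks served by that cache)."""
    address, port = value.rsplit(":", 1)
    iface = ipaddress.ip_interface(address)
    net = iface.network
    octets = str(iface.ip).split(".")

    networks = []
    # Special case from the script: a cache on 10.10.0.x serves the
    # 10.x.240.0/20 client networks.
    if octets[1] == "10" and octets[2] == "0":
        networks += ["10.0.240.0/20", "10.1.240.0/20", "10.2.240.0/20"]

    # The cache's own network.
    networks.append(str(net))

    # A cache on the wired range (10.0.x.x) also serves the matching
    # wireless (10.1.x.x) and phone (10.2.x.x) networks.
    if octets[1] == "0":
        for second in (1, 2):
            shifted = net.network_address + (second << 16)
            networks.append(f"{shifted}/{net.prefixlen}")

    return port, networks
```

For a cache registered as 10.0.5.17/24:6081 this yields the wired network plus its wireless and phone siblings, which is the same set of rows the script writes into targets.php for that host.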

And, since this host runs a more modern Ubuntu, a systemd unit to start the job:

[Unit]
Description=Etcd Poller

[Service]
ExecStart=/root/etcd-poller.py  
StandardOutput=null  
Restart=always

[Install]
WantedBy=multi-user.target  

And there we go. We just need to set up the landing page, and it will direct each user to the best cache for them.
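
The landing page itself is PHP and lives in the previous post, so it isn't reproduced here. As a rough idea of the lookup it needs to do against the generated list, here is a Python sketch. The pick_cache name and the most-specific-network-wins rule are assumptions made for this example, not necessarily what the real page does.

```python
import ipaddress


def pick_cache(client_ip, targets):
    """Return the 'host:port' of the cache whose network contains
    client_ip, preferring the most specific network, or None.

    `targets` mirrors the generated array: (network, "host:port") pairs.
    """
    addr = ipaddress.ip_address(client_ip)
    best = None  # (network, cache) of the best match so far
    for cidr, cache in targets:
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, cache)
    return best[1] if best else None
```

A client that matches nothing gets None back, which is where a fallback to the public caches would go.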