Cassandra and Priam

Using Buri to assist deployment

Presented by Joe Hohertz / @joehohertz

Slides @ http://jhohertz.github.io/cass-buri

Welcome

Audience

  1. Anyone curious about Priam
  2. Those concerned with operation of Cassandra
  3. Users of Amazon Web Services
  4. People interested in NetflixOSS

Shameless Plug

Viafoura

Who am I?

Joe Hohertz

  • Been building networks/systems since 1996
  • More software development focus since 2005
  • Specialty in open source
  • Recent focus on cloud systems

Summary

  • Explore what Priam is, what it does
  • Challenges of deploying Priam
  • Introduction to Buri
  • Using Buri to deploy Priam

Let's get started!

Priam

What is Priam?

  • Co-process / Sidecar for Cassandra
  • Released by Netflix as open source
  • Implemented in Java, as a web application
  • Assists the operation of Cassandra clusters in EC2

What operations does it handle?

  • Manages many of the cassandra.yaml values
  • Start/Stop of cassandra processes
  • Discovery of topology info for configuring tokens
  • Within Cassandra, a startup handler provides seed information (see the REST sketch below)
  • Controls bootstrap mode (with our patches)
  • Backup to S3 storage, restoration of nodes
  • Dead node replacements
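
Priam exposes these operations through a local REST API; Cassandra's Priam-provided seed provider, for example, asks Priam for its seed list at startup. A minimal sketch of probing that endpoint by hand (path per Priam's cassconfig REST resource; verify it against your Priam version):

# Ask the local Priam instance for the seed list it hands to Cassandra
curl http://localhost:8080/Priam/REST/v1/cassconfig/get_seeds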

Limitations

  • Does not yet support vnodes (multiple tokens)
  • Size growth therefore must be based on doubling clusters
  • Some awkward configuration when running in VPCs in AWS

Other challenges

  • Available documentation is not up to date
  • Project is a bit neglected relative to other NetflixOSS projects
  • Current Netflix tree has a serious bug
  • Forks outside of Netflix diverge heavily
  • Very deeply rooted in 1.x Cassandra. 2.0+ challenges some of its assumptions
  • New developments seem to indicate a trend towards becoming DSE-specific

Patches needed to run Priam successfully w/ C* 2.0+

  • 2.0+ Streaming API changes
  • Cluster bootstrapping changes
  • Gossipinfo REST call fixes

2.0+ Streaming API changes

  • Affects network statistics REST call (querying streams)
  • Also affects restoring backups (initiating streams)
  • We use pull request #346 to address this.

Cluster bootstrap changes

  • Priam attempts to set auto bootstrap on every node
  • In 1.x it was possible to get away with this
  • 2.x is more strict
  • We have modified Priam to ensure the very first node does not get this set.
  • Requires patch to Cassandra to expose auto bootstrap flag as a system property, included in 2.0.10+
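
For illustration, with the 2.0.10+ patch the flag is a standard JVM system property, so the first node can be started with bootstrap disabled (a sketch, not Priam's exact invocation):

# Cassandra 2.0.10+ exposes auto bootstrap as a system property
bin/cassandra -Dcassandra.auto_bootstrap=false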

Gossip info REST call fix

  • REST call meant to work like nodetool gossipinfo
  • Current code in Priam corrupts the response by not breaking up the data correctly
  • Causes duplicate entries for some nodes in the response
  • As of commit 6eb29e7 Priam started using this REST call to probe other nodes on launch to determine if it is performing a dead node replacement
  • Will cause eventual failures replacing nodes, requiring manual cleanup of gossipinfo via JMX
  • We have a still-pending pull request, #350, addressing this issue.

Our Priam Fork

  • We'd rather not have a fork, however...
  • We wanted to have a fully patched tree that works.
  • No success with getting pull requests merged so far
  • Located here: https://github.com/viafoura/Priam

Deployment

Use of autoscale groups

  • Priam requires deployment in an auto-scaling group (ASG)
  • Not used for any aspect of "auto" scaling
  • Separate ASGs per availability zone
  • Priam uses two things from the ASG:
    • Name, which must be a composite of your cluster name and the availability zone
    • Maximum instances, used to determine size of cluster

SimpleDB for shared configuration

  • Two SimpleDB domains hold the shared configuration
  • PriamProperties, which holds both the configuration of Priam itself and the variables it will pass to cassandra.yaml across a cluster
  • InstanceIdentity, which it uses to track the state of the cluster, active/dead nodes, and token assignments
  • These must be initialized prior to launching your cluster (a sketch follows)
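
The Buri Priam role can handle this initialization (covered later), but as a rough sketch it can also be done by hand with the AWS CLI; the domain and attribute names below follow Priam's commonly documented SimpleDB layout, so verify them against your Priam version:

# Create the two domains Priam expects
aws sdb create-domain --domain-name PriamProperties
aws sdb create-domain --domain-name InstanceIdentity

# Write one Priam property for cluster "mycluster" (illustrative names)
aws sdb put-attributes --domain-name PriamProperties \
    --item-name "mycluster.priam.clustername" \
    --attributes Name=appId,Value=mycluster Name=property,Value=priam.clustername Name=value,Value=mycluster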

Immutable deployment

  • How we (and Netflix) deploy Priam
  • What is it?
  • Machine images are generated via a build process
  • Live machines are never updated directly
  • Build a new machine image, deploy, cut over

Aminator

  • Tool for working with EC2 AMIs
  • Mounts volume from an existing EBS AMI's underlying snapshot
  • Runs a provisioner within chroot of mount point
  • Unmounts and snapshots the volume
  • Registers new AMI against the new snapshot
  • Built in provisioners for APT/YUM installations
  • Bring your own base AMIs
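
For reference, a typical Aminator run looks something like the following; the environment name and package are illustrative:

# Provision the "helloworld" package onto a base AMI and register the result
sudo aminate -e ec2_aptitude_linux -B ami-xxxxxxxx helloworld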

Layered AMI generation

  • Foundation: Very close to a vanilla install of the OS
  • Base: Local additions to the foundation AMI, things you want everywhere
  • Role-specific: Run against the base, for a particular application
  • Why?
  • Consistency + Speed in generating final role AMIs

Buri

What is Buri?

  • Implemented mostly in Ansible
  • Python-based wrapper to simplify use and provide some additional functions

Features of Buri

  • Uses Ansible to provide a collection of roles useful for NetflixOSS work
  • Allows "installation" of Ubuntu as a foundation AMI
  • Provides templated approach to role definitions for webapps under Jetty, and JSVC-compatible java daemons
  • Support for binary or source-based installations from git for most roles
  • Has its own Aminator-like provisioning, with different strengths and weaknesses
  • An early version of an Aminator plugin to use Buri as the provisioner is available
  • Provides off-cloud all-in-one demonstration roles for Flux Capacitor and Netflix RSS Recipes

Differences between Aminator and Buri's AMI generation features

  • Aminator has better support for running concurrent jobs. Buri has basic protections, but lacks hard locking due to limitations in Ansible
  • Buri supports the historical no-partition volumes, as well as normal partitioned systems, for all machine types. Aminator does not currently support partitioned volumes.
  • Buri can register all combinations of HVM/PVM machine AMI, and S3/EBS root storage, as part of a single run. Aminator requires separate jobs to be run.
  • Using Buri's provisioner directly may be more convenient when developing roles, with Aminator plus the plugin used for the "real" production-bound generations.

Differences between NetflixOSS-Ansible, and Buri's role library

  • Some roles are directly carried over and enhanced (Ice, Asgard, Edda)
  • Buri only currently targets Ubuntu LTS releases, NetflixOSS-Ansible targets Amazon Linux as well
  • Many new roles in Buri (Exhibitor, Priam, Flux Capacitor demo)
  • Focus in Buri on both EC2 and local development VM deployment
  • Buri biases access controls to be handled via IAM, vs. using API keys

Using Buri

Overview

  • Initial look at demo on Local VM via Vagrant
  • Configuring Buri for your EC2 environment
  • Bootstrapping a build environment in EC2
  • Creating a foundation AMI
  • Creating a base AMI
  • Creating a builder role AMI
  • Creating other role AMIs

Requirements for local VM / Bootstrap

  • Ansible 1.6.x
  • For local VMs: JDK, Oracle VirtualBox, Vagrant, 8+GB RAM on workstation
  • git

Launch all-in-one Flux Capacitor demo


# checkout Buri
git clone -b develop https://github.com/viafoura/buri
cd buri

# add vagrant plugin requirement
vagrant plugin install vagrant-host-shell

# launch and provision!
vagrant up
                

Configure Buri for YOUR EC2 environment


# In Buri checkout
mkdir local                    # only needed if you never ran the VM above

# Copy default configurations as starting point
cp -rv etc/inventory local/

# Edit variables for target environment (we will use "test")
vi local/inventory/group_vars/test

# Uncomment the environment line and set default to test:
vi etc/buri.cfg
                
  • Account numbers and S3 buckets are the first things to modify (a sketch follows)
  • You should commit the local folder to a *private* repository, or manage it in some other manner
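
As a rough sketch, the edited group_vars file is plain Ansible YAML; the variable names below are hypothetical stand-ins for whatever etc/inventory actually defines:

# local/inventory/group_vars/test -- hypothetical variable names
aws_account_number: "123456789012"
buri_s3_bucket: "my-buri-images"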

Bootstrapping a build node

  • Set up an IAM role using policies/aminator.sample as a template (modify the S3 bucket reference to match what you created)
  • Launch an official Ubuntu AMI with the IAM role assigned
  • From your workstation with Buri:

# In Buri checkout
./buri --environment test buildhost HOSTNAME

# Pre-installed Buri is a WIP; ignore for now and copy it (with the local folder) from your workstation
scp -r . ubuntu@HOSTNAME:buri

# login to node and use it from here on
ssh ubuntu@HOSTNAME
cd buri
                

Create foundation AMI


# From buri folder on bootstrapped host:
sudo ./buri foundation
                
  • Make note of the EBS/PVM AMI ID, which is always used for re-snapshotting against an image

Derive base AMI from foundation


# From buri folder on bootstrapped host:
sudo ./buri resnap FOUNDATION-AMI-ID base
                
  • Make note of the EBS/PVM AMI ID, which most roles will use as the base to provision upon

Derive builder AMI from base

  • Recreates the same build environment you are using now, so it can be started directly
  • Eventually you will want to boot it and shut down the bootstrap node, but for now you can keep using the bootstrap node

# From buri folder on bootstrapped host:
sudo ./buri resnap BASE-AMI-ID aminator
                

Using Buri to Deploy Priam

Priam Configuration


# Key variables for Priam:

# Set this true unless in a VPC in a single region
priam_multiregion_enable: true

# How Priam reports cluster members to each other changes in a VPC
priam_vpc: true

# Ec2MultiRegionSnitch is recommended in general; in a single-region VPC, set to Ec2Snitch
priam_endpoint_snitch: "org.apache.cassandra.locator.Ec2Snitch"

priam_zones_available: "us-east-1a,us-east-1d,us-east-1e"

priam_s3_bucket: "your_s3_bucket/some_optional_path"
                

Derive Priam AMI from base

  • Cluster names are special: they are specified on the command line, so you can generate several AMIs with small configuration differences if you like/need.
  • The role is also special in that, as the image is generated, it sets up the SimpleDB entities needed for running Priam, if necessary.

# From buri folder on bootstrapped host:
sudo ./buri --cluster-name your-name resnap BASE-AMI-ID priam
                

Set up S3/IAM role for Priam

  • Set up an IAM role using policies/priam.sample as a template
  • Modify the S3 bucket reference to match what you create for the purpose
  • Note that you can use the same bucket for multiple clusters
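
The repository's policies/priam.sample is authoritative; as a rough sketch, the S3 portion of such a policy looks something like this (bucket name illustrative):

{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket", "s3:DeleteObject"],
  "Resource": ["arn:aws:s3:::your_s3_bucket", "arn:aws:s3:::your_s3_bucket/*"]
}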

Create security group for Priam cluster

  • Need a security group per cluster name
  • Names are important; the Buri convention is priam-CLUSTER-NAME
  • All members of the group should be able to talk to all others in group on TCP 1024-65535
  • Client port access (CQL, Thrift, etc)
  • Other admin access per your conventions
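
A sketch with the AWS CLI for a cluster named mycluster (client ports and source ranges are illustrative; adjust for VPC use):

# Create the group, then allow intra-group traffic on the high TCP ports
aws ec2 create-security-group --group-name priam-mycluster \
    --description "Priam cluster mycluster"
aws ec2 authorize-security-group-ingress --group-name priam-mycluster \
    --protocol tcp --port 1024-65535 --source-group priam-mycluster

# Client access, e.g. CQL (9042) and Thrift (9160), per your conventions
aws ec2 authorize-security-group-ingress --group-name priam-mycluster \
    --protocol tcp --port 9042 --cidr 10.0.0.0/8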

Create launch configuration for autoscale groups

  • Launch configuration should specify the Priam AMI, IAM role, security group.
  • If running in a non-default VPC, ensure assign public IP is selected.
  • Ensure all the ephemeral storage is activated in the config.
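
A sketch with the AWS CLI; the names, AMI ID, instance type, and device mappings are illustrative:

aws autoscaling create-launch-configuration \
    --launch-configuration-name mycluster-lc-v1 \
    --image-id ami-xxxxxxxx \
    --instance-type m1.xlarge \
    --iam-instance-profile priam \
    --security-groups priam-mycluster \
    --associate-public-ip-address \
    --block-device-mappings '[{"DeviceName":"/dev/sdb","VirtualName":"ephemeral0"},{"DeviceName":"/dev/sdc","VirtualName":"ephemeral1"}]'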

Create per-zone autoscale groups

  • Each uses same launch configuration
  • Names MUST be the cluster name, a dash, and the zone name stripped of its dashes
  • e.g., if your cluster is named mycluster, and you are using us-east-1 only, in zones b, d, and e, you would need ASGs named:
    • mycluster-useast1b
    • mycluster-useast1d
    • mycluster-useast1e
  • Set each ASG size to only one instance for now; do not set rules for automatic size changes
  • Once complete, you should see 3 instances launching
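
Continuing the AWS CLI sketch for one of the three zones:

# One ASG per AZ; name = cluster name + "-" + zone with its dashes stripped
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name mycluster-useast1b \
    --launch-configuration-name mycluster-lc-v1 \
    --availability-zones us-east-1b \
    --min-size 1 --max-size 1 --desired-capacity 1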

What happens as it comes online

  • If more than one ephemeral drive is available, they will be striped into a single volume on /mnt. Depending on the size of the storage, this may take some time.
  • The first node Priam comes up on will see no other seeds, and will disable auto bootstrap to initialize the first node of the ring.
  • Subsequent nodes will see seeds available, and will auto bootstrap into the ring using those seeds.

What to do from here?

  • Double the ring as needed to scale the cluster (make the REST call sketched below, then expand the ASG sizes)
  • Load dummy data with cassandra-stress, specifying a replication factor >1; kill nodes one at a time and observe that replacements come online and stream data to themselves on joining
  • Check S3, which should be getting SSTables backed up in near real-time
  • Set up expiration/Glacier policies on the S3 bucket
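
For the doubling step, Priam exposes a REST call on each node (path per Priam's cassconfig resource; verify against your Priam version):

# Prepare token assignments for a doubled ring, then grow the ASG sizes to match
curl http://localhost:8080/Priam/REST/v1/cassconfig/double_ring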

Resources:

THE END