Audience
- Developers
- Operations
- Project managers
- Other interested parties
What is NetflixOSS?
The view from 10,000 feet
What does NetflixOSS consist of?
- Cloud Native Architecture Components
- Anti-Fragile Patterns
- Other Cloud Tools
- Other Miscellaneous Projects
Cloud Native
What is this?
- Master copies of data are cloud-resident
- Everything is dynamically provisioned
- All services are ephemeral
Anti-Fragile Patterns
What is this?
- Micro-services
- Highly available systems
- Ephemeral/Immutable components
- Circuit breakers
- Chaos engines
Micro-services
- Split everything into series of micro services
- Clients assume failure will occur, seek alternates
- Calls between services done through circuit breaker pattern
Chaos Engines
- Injected failures
- Expose weaknesses
- Ensure resiliency
Deep use of Metrics
- Most libraries include rich metrics
- Not just for reporting
- Used internally for load management by:
- Hystrix (breaks overloaded circuits)
- Eureka (service selection)
- Excellent real-time visualization of metrics
What is deployment like?
- Depends on Amazon's Cloud
- One service per node, combination discouraged
- Base AMI provisions a Java web container (Tomcat typical)
- Specific AMIs have WAR deployed onto base
- If WAR is a co-process, the service captured is also deployed to AMI image.
- Each AMI launches into it's own ASG (usually)
Java?!
- Everything runs in a JVM
- Uses JMX for some core functions
- Supports many JSR-233 languages
- Not everything need be in Java-proper
- Many components using Groovy, Scala, etc.
Image management
- Some pre-baked AMIs around...
- Not good for much more than some evaluation
- Bootstrapping your own necessary for real use
- Some level of config baked into AMI, some region-specific (AMI per service per region typical)
- Will build on available Ansible playbooks
Cluster Tools
The management elements of NetflixOSS
Aminator / BUILD
- AMI Build tool
- Works with Ansible playbooks
- Handles laying a specific service into a container AMI
- Generates a configured, deployable in ASG, service AMI
Asgard
- Application Deployment Manager
- Manages majority of AWS cloud
- Allows Red/Black rollouts
- Replaces use of EC2 console for most tasks
- Enforces deployment conventions / rules
- Does NOT cross production/test boundaries. (Each gets it's own Asgard)
Ice
- Cost analysis / optimization
- Imports from billing API @ Amazon
- Allows querying/reporting on data
- Aware of reserved instances
Edda
- Cluster state analysis/query
- Imports from config APIs @ Amazon
- Allows querying/reporting on data
- Display differences in environment over time
Denominator
- Library/CLI for manipulating hosted DNS config
- Like jclouds for DNS (same author)
- A future state component, not intrinsic to current stack
- Netflix positions this as part of a multi-region failover strategy
Base Cluster Services
The lowest level of the stack
Exhibitor/Curator (Zookeeper)
- Zookeeper = Cluster Co-ordination services
- Exhibitor = Zookeeper Co-process
- Curator = Client wrapper for Zookeeper
- Leveraged for configuration management of stack / application on a cluster-wide basis
Eureka
- Application services registry
- How parts of the application find provisioned services
- Central to scaling/HA
- Clients decide how to reach/retry a service given a registry list queried from Eureka (client-side load spreading)
Turbine / Hystrix Dashboard
- Turbine = metrics aggregation
- Hystrix Dashboard = visualization
- Provides real-time insights into all areas of application performance
Suro
- Distributed Data Pipeline
- Collection/Aggregation/Dispatch
- Integration w/ external pipelines (IE: Kafka)
- Supports data logging for Hadoop processing
Frameworks / Libraries
Application programming environment
Archaius
- Provides dynamic/composed config properties
- Can compose config from multiple sources
- Can handle environment/zone specific details
- Stores in zookeeper via curator
- Notifications on property changes available
Servo
- Application metrics
- Publish to JMX is primary channel
- Other publications available (IE: Graphite, Turbine, Cloudwatch)
Blitz4J
- Enhancements to Log4J
- Decouples logging from main thread
- Logging is async
- Can handle high log volumes
- Mitigates loss during log storms via summaries.
Ribbon
- Model for interprocess communication
- Core of how services communicate with other services
- Where client-side load balancers are implemented
- Provides REST-based model for service exposure
- Various serialization schemes supported (JSON primarily)
Governator
- Based on Google Guice, a dependency injection framework
- Field validation
- Config to field mapping
- Classpath management
- Lifecycle management
- More...
RxJava
- Implementation of "Rx Observables"
- Rx = Reactive Extensions
- A port of Rx.NET
- for composing asynchronous and event-based programs using observable sequences
Hystrix
- Latency and Fault Tolerance helper
- Prevents hung services from taking down app
- A primary source of application metrics
- Manages queues of requests to like sources
- Collapses requests into batches where possible
Commons
- Common utilities
- Implementation of an event bus
- Guice -> Jersey utilities
- Infix operator enhancements
- Helpers for adding custom metrics
Karyon
- Base container for applications/services in this stack
- Ties most of the other libraries together
- Gives a built-in admin console w/ insights/statistics
- Companion library (Pytheas) provides web development aspects
Pytheas
- Data source integration
- Provides common UI components
- Supports server-side events
- What front end development ultimately targets
- Edge services will be provisioned in terms of this and Karyon
Zuul (The Gatekeeper)
- An even higher level edge framework.
- Sort of like a front end load balancer on steriods
- If used, rendering to users pushed down to a middle tier service
- Routes/filters requests to the appropriate service
- Allows powerful debugging, route specific requests to serperate API clusters
- Can route cross-region.
Glisten
- Groovy library for building application with Amazon Simple Workflow
- Used by Asgard for co-ordinating deployment activities
Data Services
Persistence, Caching, Query
Priam
- Co-process for Cassandra
- Manages configuration, tokens, seed discovery
- AWS aware
- Backup/Recovery (Complete/Incremental)
- Exposes Cassandra as a framework service
Astyanax
- High level client side for Cassandra/Priam
- Provides recipes for common Cassandra use-cases
- Handles discovery of new nodes, marking of downed ones.
- Completely encapsulates client access
- Connection pooling
- In-progress: Rebasing to Datastax driver
EVCache
- Service for Ephemeral, Volatile Cache
- Distributed cache
- AWS Zone aware
- Allows replication of cached data
- Uses memcache
- Is a co-process like Priam
Genie
- Implements Hadoop as a service
- Runs Hadoop/Hive/Pig jobs
- Schedules jobs to available resources (including launch of transients)
- Callers can do so without directly invoking a Hadoop client.
- Provides monitoring of jobs as well
Aegisthus
- Bulk data pipeline from Cassandra
- Reads SSTables directly
- Emits compacted snapshot of column family
- Imports data into Apache Pig
PigPen
- Map/Reduce for Clojure
- supports distributed Clojure
- Compiles to Apache Pig
- Hides details of Pig, allows unit testing of map/reduce programs
Lipstick
- Visualization of Pig jobs
- Graphical depiction of job flow
- Can reflect status of job in progress
- Complements Genie, providing status of jobs it manages
Zeno
- In-memory data service for data sets with low latency tolerance
- Creates compact serialized representation of Java objects (with de-duplication)
- Creates smallest change set possible for propagation
- Used @ Netflix for video metadata that drives user experience
- Care taken to be as GC friendly as possible
Netflix Graph
- Compact in-memory directed graphs
- Predecessor to Zeno?
- Claims an order-of-magnitude reduction of memory footprint vs. native java objects
S3mper
- Extension for Hadoop's S3-based filesystem
- Uses DynamoDB to provide added consistency checking
STAASH
- A REST-based exposure of data services
- Exposure to external languages/stacks
- Still very much a proof-of-concept, limited use within Netflix
Other Components
Things that don't fit in other areas
GCViz
- Visualize JVM garbage collector activity
- Requires a HotSpot JVM (IE: Oracle)
- Requires a specific set of JVM options to emit log
Simian Army
- Chaos Monkey - Failure injection for resilience testing
- Janitor Monkey - Flag/Deactivate unused cloud resources
- Conformity Monkey - Scans cloud resources for non-conformance of best practices
Our additions
Things we've been working on
Buri
- Builds Ubuntu AMIs for all machine types
- A collection of Ansible recipes
- Works similar to Aminator
- But it includes foundation/base building
- Hosted on GitHub: https://github.com/viafoura/buri
Buri Roles (completed)
- Exhibitor/Zookeeper (clustered)
- Eureka (clustered)
- Ice
Buri Roles (in progress)
- Priam/Cassandra
- Asgard
- Turbine/Hystrix Console
NetflixOSS Reference Implementations