Monitoring with Ganglia

Book description

Written by Ganglia designers and maintainers, this book shows you how to collect and visualize metrics from clusters, grids, and cloud infrastructures at any scale. Want to track CPU utilization from 50,000 hosts every ten seconds? Ganglia is just the tool you need, once you know how its main components work together. This hands-on book helps experienced system administrators take advantage of Ganglia 3.x.

Learn how to extend the base set of metrics you collect, fetch current values, see aggregate views of metrics, and observe time-series trends in your data. You’ll also examine real-world case studies of Ganglia installs that feature challenging monitoring requirements.

  • Determine whether Ganglia is a good fit for your environment
  • Learn how Ganglia’s gmond and gmetad daemons build a metric collection overlay
  • Plan for scalability early in your Ganglia deployment, with valuable tips and advice
  • Take data visualization to a new level with gweb, Ganglia’s web frontend
  • Write plugins to extend gmond’s metric-collection capability
  • Troubleshoot issues you may encounter with a Ganglia installation
  • Integrate Ganglia with the sFlow and Nagios monitoring systems

Contributors include: Robert Alexander, Jeff Buchbinder, Frederiko Costa, Alex Dean, Dave Josephsen, Peter Phaal, and Daniel Pocock. Case study writers include: John Allspaw, Ramon Bastiaans, Adam Compton, Andrew Dibble, and Jonah Horowitz.

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
  2. Chapter 1. Introducing Ganglia
    1. It’s a Problem of Scale
    2. Hosts ARE the Monitoring System
    3. Redundancy Breeds Organization
    4. Is Ganglia Right for You?
    5. gmond: Big Bang in a Few Bytes
    6. gmetad: Bringing It All Together
    7. gweb: Next-Generation Data Analysis
    8. But Wait! That’s Not All!
  3. Chapter 2. Installing and Configuring Ganglia
    1. Installing Ganglia
      1. gmond
        1. Requirements
        2. Linux
          1. Debian-based distributions
          2. RPM-based distributions
        3. OS X
        4. Solaris
        5. Other platforms
      2. gmetad
        1. Requirements
        2. Linux
          1. Debian-based distributions
          2. RPM-based distributions
        3. OS X
        4. Solaris
      3. gweb
        1. Requirements
        2. Linux
          1. Debian-based distributions
          2. RPM-based distributions
        3. OS X
        4. Solaris
    2. Configuring Ganglia
      1. gmond
        1. Topology considerations
        2. Configuration file
          1. Section: globals
          2. Section: cluster
          3. Section: host
          4. Section: UDP channels
          5. Section: TCP Accept Channels
          6. Access control
          7. Optional section: sFlow
          8. Section: modules
          9. Section: collection_group
      2. gmetad
        1. gmetad topology
        2. gmetad.conf: gmetad configuration file
          1. The data_source attribute
          2. gmetad daemon behavior
          3. RRDtool attributes
          4. Graphite support
          5. gmetad interactive port query syntax
      3. gweb
        1. Apache virtual host configuration
        2. gweb options
          1. Application settings
          2. Look and feel
          3. Security
          4. Advanced features
    3. Postinstallation
      1. Starting Up the Processes
      2. Testing Your Installation
      3. Firewalls
  4. Chapter 3. Scalability
    1. Who Should Be Concerned About Scalability?
    2. gmond and Ganglia Cluster Scalability
    3. gmetad Storage Planning and Scalability
      1. RRD File Structure and Scalability
      2. Acute IO Demand During gmetad Startup
      3. gmetad IO Demand During Normal Operation
      4. Forecasting IO Workload
      5. Testing the IO Subsystem
      6. Dealing with High IO Demand from gmetad
  5. Chapter 4. The Ganglia Web Interface
    1. Navigating the Ganglia Web Interface
      1. The gweb Main Tab
      2. Grid View
      3. Cluster View
        1. Physical view
        2. Adjusting the time range
      4. Host View
        1. Viewing individual metrics
        2. Node view
      5. Graphing All Time Periods
    2. The gweb Search Tab
    3. The gweb Views Tab
    4. The gweb Aggregated Graphs Tab
      1. Decompose Graphs
    5. The gweb Compare Hosts Tab
    6. The gweb Events Tab
      1. Events API
        1. Examples
    7. The gweb Automatic Rotation Tab
    8. The gweb Mobile Tab
    9. Custom Composite Graphs
    10. Other Features
    11. Authentication and Authorization
      1. Configuration
      2. Enabling Authentication
        1. Sample Apache configuration
        2. Other web servers
      3. Access Controls
      4. Actions
      5. Configuration Examples
  6. Chapter 5. Managing and Extending Metrics
    1. gmond: Metric Gathering Agent
    2. Base Metrics
    3. Extended Metrics
    4. Extending gmond with Modules
      1. C/C++ Modules
        1. Anatomy of a C/C++ module
          1. mmodule structure
          2. Ganglia_25metric structure
          3. metric_init callback function
          4. metric_cleanup function
          5. metric_handler function
        2. Configuring a C/C++ metric module
        3. Deploying a C/C++ metric module
        4. Cloning and building a C/C++ module with autotools
          1. Adding a module within either project
          2. Creating a new project
          3. Putting it all together with autotools
      2. Mod_Python
        1. Configuring gmond to support Python metric modules
        2. Writing a Python metric module
        3. Debugging and testing a Python metric module
        4. Configuring a Python metric module
        5. Deploying a Python metric module
      3. Spoofing with Modules
    5. Extending gmond with gmetric
      1. Running gmetric from the Command Line
      2. Spoofing with gmetric
    6. How to Choose Between C/C++, Python, and gmetric
    7. XDR Protocol
      1. Packets
      2. Implementations
    8. Java and gmetric4j
    9. Real World: GPU Monitoring with the NVML Module
      1. Installation
      2. Metrics
      3. Configuration
  7. Chapter 6. Troubleshooting Ganglia
    1. Overview
      1. Known Bugs and Other Limitations
    2. Useful Resources
      1. Release Notes
      2. Manpages
      3. Wiki
      4. IRC
      5. Mailing Lists
      6. Bug Tracker
    3. Monitoring the Monitoring System
    4. General Troubleshooting Mechanisms and Tools
      1. netcat and telnet
      2. Logs
      3. Running in Foreground/Debug Mode
      4. strace and truss
      5. valgrind: Memory Leaks and Memory Corruption
      6. iostat: Checking IOPS Demands of gmetad
      7. Restarting Daemons
      8. gstat
    5. Common Deployment Issues
      1. Reverse DNS Lookups
      2. Time Synchronization
      3. Mixing Ganglia Versions Older than 3.1 with Current Versions
      4. SELinux and Firewall
    6. Typical Problems and Troubleshooting Procedures
      1. Web Issues
        1. Blank page appears in the browser
        2. Browser displays white page with error message
        3. Cluster view shows uppercase hostname, link doesn’t work
        4. Host appears in the wrong cluster
        5. Host appears multiple times in web, different variations of the hostname (or IP address)
        6. Some hosts appear with shortname instead of FQDN
        7. One or more hosts don’t appear in the web interface
        8. Hosts don’t appear/data stale after UDP aggregator restarted
        9. Dead/retired hosts still appearing in the Web
        10. Wrong CPU count or other metrics are missing
        11. Fonts in graphs are too big or too small
        12. Spikes in graphs
        13. Custom metrics don’t appear
        14. Custom metric’s value is truncated
        15. Gaps appear randomly in the graphs
        16. Some host is completely missing from the cluster
        17. gmetad hierarchy and federation; some grids don’t appear on the Web
      2. gmetad Issues
        1. Empty (size = 0) RRD files created
        2. gmetad takes a long time to start
        3. gmetad segmentation fault writing to RRD
        4. gmetad doesn’t poll all nodes defined in data_source
        5. RRA definition changed in gmetad.conf, but RRD files are unchanged
      3. rrdcached Issues
      4. gmond Issues
        1. gmond fails to start or localhost issues
        2. gmond uses a lot of RAM
        3. gmond doesn’t start properly on bootup
        4. UDP receives buffer errors on a machine running gmond
  8. Chapter 7. Ganglia and Nagios
    1. Sending Nagios Data to Ganglia
    2. Monitoring Ganglia Metrics with Nagios
      1. Principle of Operation
      2. Check Heartbeat
      3. Check a Single Metric on a Specific Host
      4. Check Multiple Metrics on a Specific Host
      5. Check Multiple Metrics on a Range of Hosts
      6. Verify that a Metric Value Is the Same Across a Set of Hosts
    3. Displaying Ganglia Data in the Nagios UI
    4. Monitoring Ganglia with Nagios
      1. Monitoring Processes
      2. Monitoring Connectivity
      3. Monitoring cron Collection Jobs
      4. Collecting rrdcached Metrics
  9. Chapter 8. Ganglia and sFlow
    1. Architecture
    2. Standard sFlow Metrics
      1. Server Metrics
      2. Hypervisor Metrics
      3. Java Virtual Machine Metrics
      4. HTTP Metrics
      5. memcache Metrics
    3. Configuring gmond to Receive sFlow
    4. Host sFlow Agent
      1. Host sFlow Subagents
      2. Custom Metrics Using gmetric
    5. Troubleshooting
      1. Are the Measurements Arriving at gmond?
      2. Are the Measurements Being Sent?
    6. Using Ganglia with Other sFlow Tools
  10. Chapter 9. Ganglia Case Studies
    1. Tagged, Inc.
      1. Site Architecture
      2. Monitoring Configuration
        1. Apache
        2. memcached
        3. Java
      3. Examples
        1. Optimizing memcached efficiency
        2. Web load
        3. Java performance
    2. SARA
      1. Overview
      2. Advantages
        1. Operational
        2. Users
      3. Customizations
        1. Metrics
        2. Custom graphs
      4. Challenges
        1. Central collector unicast receiver
        2. Server RRD IO
      5. Conclusion
    3. Reuters Financial Software
      1. Ganglia in the QA Environment
        1. Market data overload
        2. Analysis and reproducing the problem
        3. Validating the solution
      2. Ganglia in a Major Client Project
        1. Upgrading takes too long
        2. Analysis and studying the problem
        3. Using Ganglia for the analysis
        4. Results
    4. Lumicall (Mobile VoIP on Android)
      1. Monitoring Mobile VoIP for the Enterprise
      2. Ganglia Monitoring Within Lumicall
      3. Implementing gmetric4j Within Lumicall
      4. Lumicall: Conclusion
    5. Wait, How Many Metrics? Monitoring at Quantcast
      1. Reporting, Analysis, and Alerting
        1. Holt-Winters aberrance detection
      2. Ganglia as an Application Platform
      3. Best Practices
        1. Using tmpfs to handle high IOPS
        2. Sharding and instancing
      4. Tools
        1. snmp2ganglia
        2. json2gmetrics
        3. gmond plug-ins
        4. RRD management scripts
      5. Drawbacks
        1. Necessity of sharding
        2. RRD data consolidation
        3. Coordination over a WAN
        4. Excessive IOPS for RRD updates
      6. Conclusions
    6. Many Tools in the Toolbox: Monitoring at Etsy
      1. Monitoring Is Mandatory
      2. A Spectrum of Tools
      3. Embrace Diversity
      4. Conclusion
  11. Appendix A. Advanced Metric Configuration and Debugging
    1. Module Metric Definitions
      1. Mod_MultiCPU
      2. Mod_GStatus
      3. Multidisk
      4. memcached
      5. TcpConn
    2. Advanced Metrics Aggregation and You
      1. Configuring statsd
        1. statsd
        2. statsd-c
        3. py-statsd
      2. Configuring VDED
    3. rrdcached
      1. Installing
      2. Configuring gmetad for rrdcached
      3. Controlling rrdcached
      4. Troubleshooting
        1. Permissions
        2. Delays in metrics
    4. Debugging with gmond-debug
  12. Appendix B. Ganglia and Hadoop/HBase
    1. Introducing Hadoop and HBase
    2. Configuring Hadoop and HBase to Publish Metrics to Ganglia
  13. Index
  14. About the Authors
  15. Colophon
  16. Copyright

Product information

  • Title: Monitoring with Ganglia
  • Author(s): Alex Dean, Robert Alexander, Dave Josephsen, Vladimir Vuksan, Bernard Li, Brad Nicholes, Jeff Buchbinder, Frederiko Costa, Matt Massie, Peter Phaal, Daniel Pocock
  • Release date: November 2012
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449329709