Site Reliability Engineering (SRE) and Operations teams responsible for operating virtual machines (VMs) are always looking for ways to provide a more reliable, more scalable environment for their development partners. Part of providing that stable experience is having telemetry data (metrics, logs and traces) from systems and applications so you can monitor and troubleshoot effectively. Many Google Cloud services, including Google Compute Engine, provide basic system metrics out of the box. However, if you want in-depth metrics about your VMs or application telemetry, installing the Google Cloud Ops Agent is necessary.

At Cloud Ops, we make it easy to install the Ops Agent from our UI on one or a handful of VMs. But installing, configuring, and managing an agent across a fleet of VMs, especially when many of them host production workloads at an enterprise organization, can be incredibly taxing. There are many configuration and provisioning tools to choose from, and often a great deal of complexity. We at Cloud Operations want to meet our users where they are in their digital transformation. That’s why we’ve introduced support for deploying the Ops Agent with the most common automation tools in the configuration and provisioning space. This lets our users prioritize automation as a way to reduce operational toil, so they can focus on building and managing reliable and highly performant infrastructure.

Today we’ll look at how to deploy the Ops Agent in an automated fashion across a fleet of VMs, using Ansible as the example. Ansible is a popular open source configuration management tool that provides a lightweight way to get started automating your infrastructure. We’ll also look at a more advanced example that uses templating to streamline your automation code. But first, let’s talk a little about what Ansible is and how it works.

What is Ansible, and how does it work?

Ansible is an open source tool written in Python that provides an agentless framework for connecting to and interacting with machines. To do this, it leverages the native remote management protocols for Linux and Windows: SSH and WinRM (PowerShell remoting), respectively. The key benefit of using existing connection protocols is that it reduces overhead on the managed systems while benefiting from the security of these long-standing, heavily adopted protocols. When working with Ansible, one of the simplest units of work is a playbook:

 

---
- name: Sample playbook
  hosts: localhost
  tasks:
    - ansible.builtin.debug:
        msg: "Hello World!"

 

This really simple playbook runs against your localhost and executes a single task, essentially equivalent to echoing “Hello World!”.
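
The playbook above targets localhost, but the same agentless model extends to remote machines through an inventory. As a minimal sketch, a YAML inventory might declare one group reached over SSH and another over WinRM. The file name, group names, hostnames, and user names here are hypothetical; the `ansible_*` entries are standard Ansible connection variables.

```yaml
# inventory.yaml (hypothetical file name, hosts, and users)
all:
  children:
    linux_vms:
      hosts:
        web-01.example.com:
        web-02.example.com:
      vars:
        ansible_connection: ssh        # default connection type for Linux hosts
        ansible_user: deploy
    windows_vms:
      hosts:
        win-01.example.com:
      vars:
        ansible_connection: winrm      # Windows hosts are managed over WinRM
        ansible_user: Administrator
        ansible_port: 5986             # WinRM over HTTPS
        ansible_winrm_transport: ntlm
```

A playbook with `hosts: linux_vms` would then run only against the SSH-managed group when invoked with `ansible-playbook -i inventory.yaml`.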

Deploying the Ops Agent to monitor and troubleshoot VMs

The new Google Cloud Ops Agent makes it easy to start collecting telemetry data from your systems right away. Simply by installing the agent, we immediately ingest standard system logs and additional telemetry about the system beyond the defaults, including running processes.
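
As a minimal sketch of that default installation, the play below applies the `googlecloudplatform.google_cloud_ops_agents` role with no custom configuration. It assumes the role has already been installed on the control node (for example, from Ansible Galaxy) and that the inventory contains the target VMs; the play name is illustrative.

```yaml
---
- name: Install the Ops Agent with its default configuration
  hosts: all
  become: true   # installing packages requires elevated privileges
  roles:
    - role: googlecloudplatform.google_cloud_ops_agents
      vars:
        agent_type: ops-agent   # select the Ops Agent
```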

Adding workload specifics to your configuration

Now let’s take a look at a more complex example, like a playbook that will deploy Nginx and a custom configuration for the Ops Agent to collect telemetry.

Here’s what a simple custom configuration file for the Ops Agent looks like. It collects default metrics and logs from Nginx and, like the playbook, is written in YAML:

 

logging:
  receivers:
    nginx_default_access:
      type: nginx_access
    nginx_default_error:
      type: nginx_error
  service:
    pipelines:
      nginx:
        receivers:
          - nginx_default_access
          - nginx_default_error
metrics:
  receivers:
    nginx_metrics:
      type: nginx
      stub_status_url: http://127.0.0.1:80/status
      collection_interval: 60s
  service:
    pipelines:
      nginx_pipeline:
        receivers:
          - nginx_metrics
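
The `stub_status_url` in the metrics receiver above assumes that Nginx serves its status page at `/status` on the local interface. One way to provide that endpoint from Ansible is sketched below. The server block is an assumption about what such a configuration might contain, not something taken from the original sample, and the play and task names are illustrative.

```yaml
---
- name: Expose the Nginx status page for the Ops Agent
  hosts: all
  become: true
  tasks:
    - name: Write a status server block (assumed contents)
      ansible.builtin.copy:
        dest: /etc/nginx/conf.d/status.conf
        mode: "0644"
        content: |
          server {
              listen 80;
              server_name 127.0.0.1;
              location /status {
                  stub_status on;   # requires the stub_status module
                  access_log off;
                  allow 127.0.0.1;  # only local requests may read the page
                  deny all;
              }
          }
```

Nginx has to be reloaded or restarted before it starts serving the new endpoint.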

 

And here’s a playbook, specifying the custom `ops_agent.yaml` configuration file in the role:

 

---
- name: Deploy and configure Cloud Ops Agent
  hosts: all
  become: true
  roles:
    - role: googlecloudplatform.google_cloud_ops_agents
      vars:
        agent_type: ops-agent
        version: 1.0.1
        main_config_file: ops_agent.yaml
      notify:
        - Restart Ops Agent

  tasks:
    - name: Install nginx
      ansible.builtin.package: 
        name: nginx
        state: present

    - name: Customize nginx config for telemetry
      ansible.builtin.template:
        src: ansible_templates/status.conf
        dest: /etc/nginx/conf.d/status.conf
      notify:
        - Restart Nginx


    - name: Start nginx
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: yes

    - name: Start Ops Agent
      ansible.builtin.service:
        name: google-cloud-ops-agent
        state: started
        enabled: yes

  handlers:
    - name: Restart Nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
        enabled: yes

    - name: Restart Ops Agent
      ansible.builtin.service:
        name: google-cloud-ops-agent
        state: restarted
        enabled: yes

 

After running this playbook, we should have Nginx installed on all hosts in our inventory, and the Ops Agent should be submitting both metrics and logs from Nginx! To copy the example playbook, check out this GitHub sample.
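
Before moving on to dashboards, it can help to confirm that the status endpoint scraped by the metrics receiver actually responds. The following is a minimal sketch of such a check, assuming the status page is served at the URL used in the Ops Agent configuration; the play name, task names, and the `nginx_status` variable are illustrative.

```yaml
---
- name: Verify the Nginx status endpoint
  hosts: all
  tasks:
    - name: Request the status page that the metrics receiver scrapes
      ansible.builtin.uri:
        url: http://127.0.0.1:80/status   # runs on each managed host
        return_content: true
      register: nginx_status

    - name: Check that the response looks like stub_status output
      ansible.builtin.assert:
        that:
          - nginx_status.status == 200
          - "'Active connections' in nginx_status.content"
```

Because the `uri` module executes on the managed host, `127.0.0.1` refers to each VM rather than to the control node.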

Now it’s time to visualize some of this information! We provide an out-of-the-box dashboard for Nginx that you can import.

And that’s it! Now we can see the metrics we’ve been collecting from Nginx with the Ops Agent.

Get started today

Whether you are managing a handful of VMs or an entire fleet, ensuring robust observability data is available from systems and applications is key to effective monitoring and troubleshooting. Between the VM Instances dashboard in Cloud Monitoring, Agent Policies, and open source tooling such as Ansible, Chef, Puppet and Terraform, you have many options for installing agents on your Google Cloud VMs. The Ops Agent helps you gather data to keep your infrastructure and applications performing their very best, and automating the deployment makes day-to-day management that much easier.

If you’d like to watch a video where I walk through these steps, check out our YouTube video that demonstrates this blog post, and see the rest of our O11y In Depth playlist!

Or, if you’d like to get started with a tutorial, you can use our Cloud Ops Agent tutorial for Ansible to walk through a simple deployment in Google Cloud Shell.

Lastly, if you have feedback or want to ask us questions, drop us a line on the Google Cloud Community Cloud Ops area!

 

 

By: Kyle Benson (Product Manager, Cloud Ops) and Rahul Harpalani (Product Manager)
Source: Google Cloud Blog
