Building a Monitoring Infrastructure

Building a monitoring infrastructure is a complex undertaking. The system can potentially interact with every system in the environment, and its users range from the layman to the highly technical. Building the monitoring infrastructure well requires not only considerable systems know-how, but also a global perspective and good people skills. Most importantly, building monitoring systems also requires a light touch. The most important distinction between good monitoring systems and bad ones is the amount of impact they have on the network environment, in areas such as resource utilization, bandwidth utilization, and security. This first chapter contains a collection of advice gleaned from mailing lists, other systems administrators, and hard-won experience. My hope is that this chapter helps you to make some important design decisions up front, to avoid some common pitfalls, and to ensure that the monitoring system you build becomes a huge asset instead of a huge burden.

A Procedural Approach to Systems Monitoring

Good monitoring systems are not built one script at a time by administrators (admins) in separate silos. Admins create them methodically with the support of their management teams and a clear understanding of the environment—both procedural and computational—within which they operate.

Without a clear understanding of which systems are considered critical, the monitoring initiative is doomed to failure. It’s a simple question of context and usually plays out something like this:

Manager: “I need to be added to all the monitoring system alerts.”
Admin: “All of them?”
Manager: “Well yes, all of them.”
Admin: “Er, ok.”
The next day:
Manager: “My pager kept me up all night. What does this all mean?”
Admin: “Well, /var filled up on Server1, and the VPN tunnel to site5 was up and down.”
Manager: “Can’t you just notify me of the stuff that’s an actual problem?”
Admin: “Those are actual problems.”

For whatever reason, monitoring systems seem to have been left out of this procedural approach to contingency planning. Most monitoring systems come into the network as a pet project of one or two small tech teams who have a very specific need for them.

Often many different teams will employ their own monitoring tools independent of, and oblivious to, other monitoring initiatives going on within the organization. There seems to be no need to involve anyone else. Although this single-purpose approach to systems monitoring may solve an individual’s or small group’s immediate need, the organization as a whole suffers, and fragile monitoring systems always grow from it.

To understand why, consider that in the absence of a procedurally implemented monitoring framework, hundreds of critically important questions are nearly impossible to answer.

For example, consider the following questions.

What amount of overall bandwidth is used for systems monitoring?

What routers or other systems are the monitoring tools dependent on?

Is sensitive information being transmitted in clear text between hosts and the monitoring system?

If it was important enough to write a script to monitor a process, then it’s important enough to consider what happens when the system running the script goes down, or when the person who wrote the script leaves and his user ID is disabled. The piecemeal approach is by far the most common way monitoring systems are created, yet the problems that arise from it are too many to be counted.

The core issue in our previous example is that there are no criteria that coherently define what a “problem” is, because these criteria don’t exist when the monitoring system has been installed in a vacuum. Our manager felt that he had no visibility into system problems and, when provided with detailed information, still gained nothing of significance. This is why a procedural approach is so important. Before they do anything at all, the people undertaking the monitoring project should understand which systems in the organization are critical to the organization’s operational well-being, and what management’s expectation is regarding the uptime of those systems.

Given these two things, policy can be formulated that details support and escalation plans. Critical systems should be given priority and their requisite pieces defined. That’s not to say that the admin in the example should not be notified when /var is full on Server1; only that when he is notified of it, he has a clear idea of what it means in an organizational context.

Does management expect him to fix it now or in the morning? Who else was notified in parallel? What happens if he doesn’t respond? This helps the manager, as well. By clearly defining what constitutes a problem, management has some perspective on what types of alerts to ask for and, more importantly, when they can go back to sleep.

Smaller organizations, where there may be only a single part-time system administrator (sysadmin), are especially susceptible to piecemeal monitoring pitfalls. Thinking about operational policy in a four-person organization may seem silly, but in small environments, critical system awareness is even more important. When building monitoring systems, always maintain a big-picture outlook. If the monitoring endeavor is successful, it will grow quickly and the well-being of the organization will come to depend on it.

Ideally, a monitoring system should enforce organizational policy rather than merely reflect it. If management expects all problems on Server1 to be looked at within 10 minutes, then the monitoring system should provide the admin with a clear indicator in the message (such as a priority number), a mechanism to acknowledge the alert, and an automatic escalation to someone else at the end of the 10-minute window.
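The 10-minute rule above can be sketched in a few lines of Python. This is only an illustration of the policy, not how Nagios implements escalations; the Alert class, the priority scale, and the recipient names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    host: str
    message: str
    priority: int                              # hypothetical scale: 1 = most urgent
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None


def current_recipient(alert, now, primary, escalation,
                      window=timedelta(minutes=10)):
    """Return who should be notified at time `now`.

    Returns None once the alert has been acknowledged, the primary
    contact while the response window is open, and the escalation
    contact after the window has expired without an acknowledgment.
    """
    if alert.acknowledged_at is not None:
        return None
    if now - alert.raised_at >= window:
        return escalation
    return primary
```

With an alert raised at 02:00, the admin is the recipient at 02:03, the escalation contact takes over at 02:10, and nobody is paged once the alert has been acknowledged.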

So how do we find out what the critical systems are? Senior management is ultimately responsible for the overall well-being of the organization, so they should be the ones making the call. This is why management buy-in is so vitally important. If you think this is beginning to sound like disaster recovery planning, you’re ahead of the curve. Disaster recovery works toward identifying critical systems for the purpose of prioritizing their recovery, and therefore, it is a methodologically identical process to planning a monitoring infrastructure.

In fact, if a disaster recovery plan already exists, that’s the place to begin. The critical systems have already been identified.

Critical systems, as outlined by senior management, will not be along the lines of “all problems with Server1 should be looked at within 10 minutes.” They’ll probably be defined as logical entities. For example, “Email is critical.” So after the critical systems have been identified, the implementers will dissect them one by one, into the parts of which they are composed. Don’t just stay at the top; be sure to involve all interested parties. Email administrators will have a good idea of what “email” is composed of, and of the criteria that, if not met, will lead them to roll their own monitoring tools.

Work with all interested parties to get a solution that works for everyone. Great monitoring systems are grown from collaboration. Where custom monitoring scripts already exist, don’t dismiss them; instead, try to incorporate them. Groups tend to trust the tools they’re already using, so co-opting those tools usually buys you some support. Nagios is excellent at using external monitoring logic along with its own scheduling and escalation rules.

Processing and Overhead

Monitoring systems necessarily introduce some overhead in the form of network traffic and resource utilization on the monitored hosts. Most monitoring systems typically have a few specific modes of operation, so the capabilities of the system, along with implementation choices, dictate how much, and where, overhead is introduced.

Remote Versus Local Processing

Nagios exports service checking logic into tiny single-purpose programs called plugins. This makes it possible to add checks for new types of services quickly and easily, as well as co-opt existing monitoring scripts. This modular approach makes it possible to execute the plugins themselves, either locally on the monitoring server or remotely on the monitored hosts. Centralized execution is generally preferable whenever possible because the monitored hosts bear less of a resource burden. However, remote processing may be unavoidable, or even preferred, in some situations. For large environments with tens of thousands of hosts, centralized execution may be too much for a single monitoring server to handle. In this case, the monitoring system may need to rely on the clients to run their own service checks and report back the results. Some types of checks may be impossible to run from the central server. For example, plugins that check the amount of free memory may require remote execution.
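To give a feel for how small a plugin can be, here is a minimal sketch of a disk check written in Python. It follows the usual Nagios plugin convention of printing one status line and returning an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The thresholds, the function names, and the wording of the status line are assumptions chosen for illustration.

```python
import shutil

# Conventional Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a (return code, status line) pair.

    The 80/90 thresholds are illustrative defaults, not Nagios defaults.
    """
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.0f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.0f}% used"
    return OK, f"DISK OK - {used_percent:.0f}% used"


def check_disk(path="/"):
    """Check one filesystem; report UNKNOWN if it cannot be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return evaluate(100.0 * usage.used / usage.total)


def main(argv):
    """Print the status line and return the code the script should exit with."""
    status, line = check_disk(argv[1] if len(argv) > 1 else "/")
    print(line)
    return status

# A real plugin script would finish with: sys.exit(main(sys.argv))
```

Because the check is an ordinary program, the same file could be run on the monitoring server or on the monitored host, which is exactly the choice this section describes.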

As a third option, several Nagios servers may be combined to form a single distributed monitoring system. Distributed monitoring enables centralized execution in large environments by distributing the monitoring load across several Nagios servers. Distributed monitoring is also good for situations in which the network is geographically dispersed, or otherwise inconveniently segmented.

Bandwidth Considerations

Plugins usually generate some IP traffic. Each network device that this traffic must traverse introduces network overhead, as well as a dependency into the system. In Figure 1.1, there is a router between the Nagios Server and Server1. Because Nagios must traverse the router to connect to Server1, Server1 is said to be a child of the router. It is always desirable to do as little layer 3 routing between the monitoring system and its target hosts as possible, especially where devices such as firewalls and WAN links are concerned. So the location of the monitoring system within the network topology becomes an important implementation detail.

Figure 1.1 The router between Nagios and Server1 introduces a dependency and some network overhead in the form of layer 3 routing decisions.

In addition to minimizing layer 3 routing of traffic from the monitoring host, you also want to make sure that the monitoring host is sending as little traffic as possible. This means paying attention to things such as polling intervals and plugin redundancy. Plugin redundancy is when two or more plugins effectively monitor the same service.
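A rough way to see the effect of polling intervals is to estimate the average traffic they generate. The sketch below is back-of-the-envelope arithmetic only; the number of checks and the bytes exchanged per check are made-up figures, and real checks vary widely in size.

```python
def monitoring_bandwidth_bps(checks, bytes_per_check, interval_seconds):
    """Average bits per second generated by polling.

    checks           -- number of service checks run each interval
    bytes_per_check  -- assumed bytes on the wire per check (request + reply)
    interval_seconds -- polling interval
    """
    return checks * bytes_per_check * 8 / interval_seconds


# Hypothetical site: 2,000 checks of about 1,500 bytes each.
every_five_minutes = monitoring_bandwidth_bps(2000, 1500, 300)
every_150_seconds = monitoring_bandwidth_bps(2000, 1500, 150)
```

Under these assumptions the five-minute interval averages 80,000 bits per second, and halving the interval doubles that figure. The average also hides bursts, so a slow WAN link may feel the load more than the number suggests.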

Redundant plugins may not be obvious. They usually take the form of two plugins that measure the same service, but at different depths. Take, for example, an imaginary Web service running on Server1. The monitoring system may initially be set up to connect to port 80 of the Web service to see if it is available. Then some months later, when the Web site running on Server1 has some problems with users being able to authenticate, a plugin may be created that verifies authentication works correctly. All that is actually needed in this example is the second plugin. If it can log in to the Web site, then port 80 is obviously available and the first plugin does nothing but waste resources. Plugin redundancy may not be a problem for smaller sites with fewer than a thousand or so servers. For large sites, however, eliminating plugin redundancy (or better, ensuring it never occurs in the first place) can greatly reduce the burden on the monitoring system and the network.
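One way to catch this kind of redundancy is to write down, for each check on a host, what a successful run actually proves, and then look for checks whose proof is entirely contained in another's. The sketch below assumes those descriptions are maintained by hand; Nagios does not track them, and the check names and condition labels are hypothetical.

```python
def redundant_checks(checks):
    """Return the names of checks that another check makes unnecessary.

    `checks` maps a check name to the set of conditions a successful run
    proves. A check is redundant when its set is a strict subset of the
    set proved by some other check on the same host.
    """
    redundant = []
    for name, proves in checks.items():
        for other, other_proves in checks.items():
            if other != name and proves < other_proves:
                redundant.append(name)
                break
    return sorted(redundant)


# Hypothetical checks defined for Server1.
server1_checks = {
    "check_port_80": {"tcp/80 reachable"},
    "check_web_login": {"tcp/80 reachable", "http responds", "login works"},
    "check_smtp": {"tcp/25 reachable"},
}
```

Here redundant_checks(server1_checks) reports only check_port_80, because a successful login already proves that port 80 is reachable.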

Minimizing the overhead incurred on the environment as a whole means maintaining a global perspective on its resources. Hosts connected by slow WAN links that are heavily utilized, or are otherwise sensitive to resource utilization, should be grouped logically.

Nagios provides hostgroups for this purpose. These allow configuration settings to be optimized to meet the needs of the group. For example, plugins may be set to a higher timeout for the Remote-Office hostgroup, ensuring that network latency doesn’t cause a false alarm for hosts on slower networks. Special consideration should be given to the location of the monitoring system to reduce its impact on the network, as well as to minimize its dependency on other devices. Finally, make sure that your configuration changes don’t needlessly increase the burden on the systems and network you monitor, as with redundant plugins. The last thing a monitoring system should do is cause problems of its own.

Network Location and Dependencies

The location of the monitoring system within the network topology has wide-ranging architectural ramifications, so you should take some time to consider its placement within your network. Your implementation goals are threefold.

1. Maintain existing security measures.

2. Minimize impact on the network.

3. Minimize the number of dependencies between the monitoring system and the most critical systems.

No single ideal solution exists, so these three goals need to be weighed against each other for each environment. The end result is always a compromise, so it’s important to spend some time diagramming out a few different architectures and considering the consequences of each.

The network topology shown in Figure 1.2 is a simple example of a network that should be familiar to any sysadmin. Today, most private networks that provide Internet-facing services have at least three segments: the inside, the outside, and the demilitarized zone (DMZ).

In our example network, the greatest number of hosts exists on the inside segment. Most of the critically important hosts (important because they are the Web servers), however, exist on the DMZ.

Figure 1.2 A typical two-tiered network.

Following the implementation rules at the beginning of this section, our first priority is to maintain the security of the network. Creating a monitoring framework necessitates that some ports on the firewalls be opened, so that, for example, the monitoring host can connect to port 80 on hosts in other network segments. If the monitoring system were placed in the DMZ, many more ports on the firewalls would need to be opened than if the monitoring system were placed on the inside segment, simply because there are more hosts on the internal segment. For most organizations, placing the monitoring server in the DMZ would be unacceptable for this reason. More information on security is discussed later in this chapter, but for this example, it’s simple arithmetic.
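That arithmetic can be made explicit with a short sketch. The host counts below are invented, and the model is deliberately crude: it counts one firewall opening for every monitored port on every host that sits in a different segment from the monitoring server, whereas a real rule base might group hosts into ranges.

```python
def firewall_openings(hosts_per_segment, monitor_segment, ports_per_host=1):
    """Count host/port pairs the monitoring server must reach through a firewall.

    Hosts on the same segment as the monitoring server need no opening.
    """
    return sum(
        count * ports_per_host
        for segment, count in hosts_per_segment.items()
        if segment != monitor_segment
    )


# Hypothetical network: many inside hosts, a handful of DMZ Web servers.
segments = {"inside": 200, "dmz": 10}

openings_if_inside = firewall_openings(segments, "inside")
openings_if_dmz = firewall_openings(segments, "dmz")
```

With these numbers, a monitoring server on the inside segment needs 10 openings, while one in the DMZ needs 200.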

There are many ways to reduce the impact of the monitoring system on the network. For example, the use of a modem to send messages via the Public Switched Telephone Network (PSTN) reduces network traffic and removes dependencies. The best way to minimize network impact in this example, however, is by placing the monitoring system on the segment with the largest number of hosts, because this ensures that less traffic must traverse the firewalls and router. This, once again, points to the internal network.

Finally, placing our monitoring system in a separate network segment from most of the critical systems is not ideal, because if one of the network devices becomes unavailable, the monitoring system loses visibility to the hosts behind it. Nagios refers to this as a network-blocking outage. The hosts on the DMZ are children of their firewall, and when configured as such, Nagios is aware of the dependency. If the firewall goes down, Nagios does not have to send notifications for all of the hosts behind it (but it can if you want it to), and the status of those hosts will be flagged unknown in availability reports for the amount of time that they were not visible. Every network will have some amount of dependency, so this needs to be considered in the context of the other two goals. In the example, despite the dependency, the inside segment is probably the best place for the monitoring host.
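The parent and child logic can be illustrated with a small sketch. This is not Nagios code; it only shows the idea of separating hosts that have actually failed from hosts that are merely hidden behind a failed parent. The host names and the "blocked" label are hypothetical.

```python
def classify(check_ok, parents):
    """Classify each host as 'up', 'down', or 'blocked'.

    check_ok -- dict of host -> True if its check succeeded
    parents  -- dict of host -> parent host, or None for hosts that the
                monitoring server reaches directly
    A failing host is 'blocked' when any ancestor is also failing,
    otherwise it is 'down'.
    """
    states = {}
    for host, ok in check_ok.items():
        if ok:
            states[host] = "up"
            continue
        blocked = False
        seen = set()                      # guards against accidental loops
        ancestor = parents.get(host)
        while ancestor is not None and ancestor not in seen:
            seen.add(ancestor)
            if not check_ok.get(ancestor, True):
                blocked = True
                break
            ancestor = parents.get(ancestor)
        states[host] = "blocked" if blocked else "down"
    return states


# Hypothetical DMZ: two Web servers behind one firewall.
parents = {"firewall": None, "web1": "firewall", "web2": "firewall", "server1": None}
results = {"firewall": False, "web1": False, "web2": False, "server1": True}

states = classify(results, parents)
to_notify = sorted(host for host, state in states.items() if state == "down")
```

In this example only the firewall ends up in to_notify; web1 and web2 are classified as blocked, so a single notification points at the real problem.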

Security

The ease with which large monitoring systems can become large root kits makes it imperative that security is considered sooner, rather than later.

Because monitoring systems usually need remote execution rights to the hosts they monitor, it’s easy to introduce backdoors and vulnerabilities into otherwise secure systems. Worse, because they’re installed as part of a legitimate system, these vulnerabilities may be overlooked by penetration testers and auditing tools. The first, and most important, thing to look for when building secure monitoring systems is how remote execution is accomplished.

Historically, commercial monitoring tools have included huge monolithic agents, which must be installed on every client to enable even basic functionality. These agents usually include remote shell functionality and proprietary byte code interpreters, which allow the monitoring host carte blanche to execute anything on the client, via its agent. This implementation makes it difficult, at best, to adhere to basic security principles, such as least privilege.

Anyone with control over the monitoring system has complete control over every box it monitors.

Nagios, by comparison, follows the UNIX adage: “Do one thing and do it well.” It is really nothing but a task-optimized scheduler and notification framework. It doesn’t have an intrinsic ability to connect to other computers and contains no agent software at all. These functions exist as separate, single-purpose programs that Nagios must be configured to use. By outsourcing remote execution to external programs, Nagios maintains an off-by-default policy and doesn’t attempt to reinvent things like encryption protocols, which are critically important and difficult to implement. With Nagios, it’s simple to limit the monitoring server’s access to its clients, but poor security practices on the part of the admin can still create insecure systems; so in the end, it’s up to you.

The monitoring system should have only the access it needs to remotely execute the specific plugins required. Avoid rexec-style plugins that take arbitrary strings and execute them on the remote host. Ideally, every remotely executed plugin should be a single-purpose program, which the monitoring system has specific access to execute. Some useful plugins provide lots of functionality in a single binary. NSCLIENT++ for Windows, for example, can query any perfmon counter. These multipurpose plugins are fine, if they limit access to a small subset of query-only functionality.
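The difference between an rexec-style agent and a single-purpose one comes down to what the remote side is willing to run. A minimal sketch of the safer pattern follows: the client keeps a fixed table of named checks, and the monitoring server may only ask for a check by name. The command names and plugin paths are hypothetical, and this is an illustration of the principle rather than any particular Nagios add-on.

```python
import subprocess

# Fixed table of checks this host is willing to run (hypothetical paths).
ALLOWED_COMMANDS = {
    "check_disk_var": ["/usr/local/libexec/check_disk", "/var"],
    "check_load": ["/usr/local/libexec/check_load"],
}


def run_check(name, allowed=ALLOWED_COMMANDS, runner=subprocess.run):
    """Run a pre-approved check by name and return (exit code, output).

    The request carries only a name, never a command line, so the caller
    cannot supply arguments or arbitrary strings to execute.
    """
    if name not in allowed:
        raise PermissionError(f"command {name!r} is not allowlisted")
    result = runner(allowed[name], capture_output=True, text=True,
                    timeout=10, check=False)
    return result.returncode, result.stdout.strip()
```

Asking for "check_disk_var" runs exactly one fixed command line, while asking for anything else, such as a shell command, is refused before any process is started.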

The communication channel between the remotely executed plugin and the monitoring system should be encrypted. Though it’s a common mistake among commercial monitoring applications, avoid nonstandard, or proprietary, encryption protocols. Encryption protocols are notoriously difficult to implement, let alone create. The popular remote execution plugins for Nagios use the industry-standard OpenSSL library, which is peer reviewed constantly by smart people. Even if none of the information passed is considered sensitive, the implementation should include encrypted channels from the get-go as an enabling step. If the system is implemented well, it will grow fast, and it’s far more difficult to add encrypted channels after the fact than it is to include them in the initial build.
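For home-grown checks, relying on a standard library is usually a matter of a few lines. The sketch below uses Python's ssl module, which is itself a wrapper around OpenSSL, to build a client context that verifies the remote certificate and host name. The function names, the request bytes, and the idea of a custom agent listening on some port are assumptions made for illustration; this is not the wire protocol of any Nagios add-on.

```python
import socket
import ssl


def make_client_context(ca_file=None):
    """Build a TLS client context with certificate and host name checks on.

    ca_file is an optional path to a private CA bundle (hypothetical);
    when omitted, the system's default trust store is used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def fetch_result(host, port, request, context):
    """Send one request to a hypothetical agent over TLS and return its reply."""
    with socket.create_connection((host, port), timeout=10) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(request)
            return tls.recv(4096)
```

The point of the sketch is that nothing cryptographic is written by hand: protocol negotiation, certificate validation, and cipher selection are all left to the library.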

Simple Network Management Protocol (SNMP), a mainstay of systems monitoring that is supported on nearly every computing device in existence today, should not be used on public networks and should be avoided, if possible, on private ones. For most purposes involving general-purpose workstations and servers, alternatives to SNMP can be found. If SNMP must be used for network equipment, try to use SNMPv3, which includes encryption, and no matter what version you use, be sure it’s configured in a read-only capacity and only accepts connections from specific hosts. For whatever reason, sysadmins seem chronically incapable of changing SNMP community string names. This simple implementation flaw accounts for most of SNMP’s bad rap.

Many organizations have network segments that are physically separated, or otherwise inaccessible, from the rest of the network. In this case, monitoring hosts on the isolated subnet means adding a Network Interface Card (NIC) to the monitoring server and connecting it to the private segment. Isolated network segments are usually isolated for a reason, so at a minimum, the monitoring system should be configured with strict local firewall rules so that it doesn’t forward traffic from one subnet to the other. Consideration should be paid to building separate monitoring systems for nonaccessible networks.

When holes must be opened in the firewall for the monitoring server to check the status of hosts on a different segment, consider using remote execution to minimize the number of ports required. For example, the Nagios Box in Figure 1.3 must monitor the Web server and SMTP daemon on Server1. Instead of opening three ports on the firewall, the same outcome may be reached by running a service checker plugin remotely on Server1 to check that the apache and qmail daemons are running. By opening only one port instead of three, there is less opportunity for abuse by a malicious party.

 

 