Software Defined Networking (SDN)

In this lecture, “Software-Defined Networking at the Crossroads”, Scott Shenker of the University of California, Berkeley discusses SDN: its evolution, principles, and current state.

 

Watch the lecture here: http://www.youtube.com/watch?v=WabdXYzCAOU

I’d like to solicit comments on Dr. Shenker’s presumptions. Are networks really difficult to manage? If so, is it because of the technology, or because management is often an afterthought rather than an integral part of the system design?

Pay particular attention to the term “operator”.  What department or role is Dr. Shenker referring to as operator?  Is it the NOC or the Network Engineering department?

Contact Us Today: http://www.networkperformanceinnovations.com/contact-us/

Network Capacity Planning – The Way Ahead

If you’re looking at implementing capacity planning, or at hiring someone to do it, there are a few things you should consider.

Performance Management

Capacity Management Program

Capacity planning should be an ongoing part of the lifecycle of any network (or any IT service, for that matter).  The network was designed to meet a certain capacity, knowing that demand may grow as the network gets larger and/or supports more users and services.  There are several ways to go about this, and the best approach depends on your situation.  There should be fairly specific plans for how to measure utilization, forecast, report, make decisions, and increase or decrease capacity.

There are also many aspects to capacity.  Link utilization is one obvious capacity limitation, but processor utilization may not be so obvious, and where VPNs are involved there are logical limits to the volume of traffic each device can handle.  There are also physical limitations such as port and patch panel connections, power consumption, UPS capacity, etc.  These should all be addressed as an integral part of the network design, and if they have been overlooked, the design needs to be re-evaluated in light of the capacity management program.  Finally, there are the programmatic aspects – frequency of evaluation, control gates, decision points, who to involve and where, etc.  This is all part of the lifecycle.

Capacity Management Tools

There are a wide variety of tools available for capacity planning and analysis.  Which tools you select will be determined by the approach you’re taking to manage capacity; how the data is to be manipulated, reported, and consumed; and architectural factors such as hardware capabilities, available data, and other network management systems in use.  One simple approach is to measure utilization through SNMP and use linear forecasting to predict future capacity requirements.  This is very easy to set up, but doesn’t provide the most reliable results.  A much better approach is to collect traffic data, overlay it on a dynamic model of the network, then use failure analysis to predict capacity changes as a result of limited failures.  This can be combined with linear forecasting; however, failure scenarios will almost always be the determining factor.  Many organizations use QoS to prioritize certain classes of traffic over others, which adds yet another dimension to the workflow.  There are also traffic engineering design, third-party and carrier capabilities, and the behavior of the services supported by the network to consider.  It can become more complicated than it might appear at first glance.
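As a rough illustration of the simple SNMP-plus-linear-forecasting approach described above (a minimal sketch only: the sample data, weekly polling interval, and 80% threshold are hypothetical, and real collections would come from an SNMP poller):

```python
# Minimal sketch: linear forecast of link utilization from SNMP samples.
# Assumes samples were already collected (e.g., periodic polls of interface
# counters converted to percent utilization); the numbers below are made up.

from datetime import datetime, timedelta

samples = [  # (timestamp, utilization %)
    (datetime(2014, 1, 1) + timedelta(days=7 * i), 40.0 + 1.5 * i)
    for i in range(12)
]

# Ordinary least-squares fit of utilization against time (in days).
t0 = samples[0][0]
xs = [(ts - t0).total_seconds() / 86400.0 for ts, _ in samples]
ys = [u for _, u in samples]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

THRESHOLD = 80.0  # hypothetical utilization level that triggers an upgrade decision
if slope > 0:
    days_to_threshold = (THRESHOLD - intercept) / slope
    print(f"Projected to reach {THRESHOLD}% around "
          f"{t0 + timedelta(days=days_to_threshold):%Y-%m-%d}")
else:
    print("Utilization trend is flat or declining; no upgrade projected.")
```

This is exactly the kind of forecast that is easy to set up but unreliable on its own, since it ignores failure scenarios, QoS classes, and traffic shifts.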

Evaluating Data and Producing Reports

Some understanding of the underlying technologies is necessary to evaluate the data and make recommendations on any changes.  If dynamic modeling is used for forecasting, another set of skills is required.  The tools may produce much of the reporting; however, some analysis will need to be captured in a report that is evaluated by other parts of the organization, which requires communication and presentation skills.

Personnel

It’s highly unlikely that the personnel responsible for defining the program, gathering requirements, selecting COTS tools, writing middleware, and implementing all of this will be the same people who use the tools, produce the reports, or even read and evaluate them.  The idea of “hiring a capacity management person” to do all this isn’t really feasible.  Those with the skills and motivation to define the program and/or design and implement it will not likely be interested in operating the system or creating the reports.  One approach is to bring in someone with the expertise to define the approach, design and implement the tools, and then train the personnel who will be using them.  These engagements are usually relatively short and provide great value.

 

Contact us if you’d like some help developing a capacity management program or designing and installing capacity management tools.

 

Contact Us Today: http://www.networkperformanceinnovations.com/contact-us/

Seven Indicators of an Ineffective Network Design Activity

This is a follow-on to the post Well Executed Service Design Provides Substantial ROI.  The previous post discusses the rationale for investing in the service design activity; this one provides some indicators that the service design processes are ineffective in the context of ITSM best practices.  Take the quiz below and see how your organization scores:

How much of your network is in compliance with the standard design templates?

  A. Over 90%
  B. Between 50% and 90%
  C. Less than 50%
  D. We don’t have design standards
  E. What are design standards?

If you answered “E”, just follow the “Contact Us” link at the bottom of the page; you’re in trouble. If you don’t have design standards, then do likewise. If less than 95% of your network is in compliance with established standards, then they aren’t standards – they’re suggestions. It’s very likely that you are provisioning devices without a detailed design. This is the most common indicator of an ineffective service design process.

How do you know if your devices are in compliance with standards?

  A. We provisioned them by design and don’t change them ad-hoc
  B. We use a configuration management system that validates compliance
  C. Spot check
  D. We don’t really know
  E. Answered D or E to the previous question

Ideally you should build and provision your network by design and not make changes to it that aren’t defined by subsequent releases of the design. There are many compliance validation systems on the market that do a great job of validating IOS compliance, configuration compliance against a template, and so on. While these tools have great value in an organization with a loose design program, they address the symptom rather than the problem.
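To make the idea of validating configuration compliance against a template concrete, here is a minimal sketch (not any particular product’s workflow; the template lines, device names, and configurations are hypothetical placeholders):

```python
# Minimal sketch: flag devices whose running configuration is missing lines
# required by a standard design template. The template and configs here are
# hypothetical placeholders, not a real product's data model.

required_template = [
    "service password-encryption",
    "no ip http server",
    "logging host 10.0.0.50",
]

running_configs = {
    "core-rtr-01": ["service password-encryption",
                    "no ip http server",
                    "logging host 10.0.0.50"],
    "edge-sw-07":  ["no ip http server"],
}

for device, config in running_configs.items():
    missing = [line for line in required_template if line not in config]
    status = "COMPLIANT" if not missing else f"NON-COMPLIANT (missing: {missing})"
    print(f"{device}: {status}")

# Compliance percentage across the estate:
compliant = sum(
    all(line in cfg for line in required_template)
    for cfg in running_configs.values()
)
print(f"{100.0 * compliant / len(running_configs):.0f}% of devices compliant")
```

A report like this tells you how far configurations have drifted, but it does not explain why they drifted – that is the design process problem the tools only mask.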

How much effort does your engineering team expend resolving incidents?

  A. Engineering is rarely involved in operational issues
  B. 20% or less – only difficult issues Tier 3 can’t isolate
  C. Over 50%
  D. Engineering is Tier 3
  E. Don’t know

There is some merit to engineering working operational issues. It keeps the engineers sharp on troubleshooting skills and helps them understand the details of the incidents in the field. Often an engineer will identify a design fault easily and open a problem case. However, if the engineers can’t spend adequate time doing engineering because they’re troubleshooting operational issues, then something is terribly amiss. If the network is so complicated that Tier III can’t identify a problem, then engineering needs to develop better tools to identify those problems rapidly. If engineering is constantly needed to make changes to devices to resolve failures, then the network design is lacking. Engineering being involved in incident resolution is an indicator of a poor or ineffective network service design.

Who identifies network metrics and the means to collect them?

  A. Engineering
  B. Performance Management Team
  C. NOC
  D. Senior Management
  E. Don’t know

Did you have to install the speedometer, oil pressure gauge, etc. as aftermarket add-ons to your automobile? If you’re designing something, and there are performance criteria, the design needs to include a means to measure that performance to ensure the system meets the requirements. Your service design program should be developing metrics and alerts as an integral part of the design.

Does Tier III ever have to modify a device configuration to resolve an incident?

  A. Only in predefined circumstances, such as hot spares
  B. Sometimes
  C. Only in outages with very high impact
  D. Absolutely! That’s why it’s necessary for Tier III to maintain level 15 privileges
  E. This question doesn’t make sense

The configuration of a device is part of the system/service design. Does the auto repair shop ever redesign your car in order to repair it? There are a few circumstances in a network system where config change may be a valid repair action. For example: Bouncing a port on a device requires the configuration to be changed momentarily, then changed back. Perhaps there are hot spare devices or interfaces that are configured but disabled, and the repair plan is to enable the secondary and disable the primary (this is actually part of the system design plan). Aside from these few exceptions any modification to a device configuration is a modification to the network design. If the design has a flaw, there should be a problem case opened, the root cause identified, and the fix rolled into the next release of the service design. When design change is a mechanism used to correct incidents, it indicates a lack of a cohesive design activity.

What does Release Management consist of?

  A. A scheduled update of the (network) service design with fixes and enhancements
  B. A schedule indicating when field devices will be modified
  C. Scheduled releases of IOS and network management software updates
  D. Patches to Windows desktops
  E. This concept doesn’t apply to networks

The design of the network must continually be improved to add new features, support additional IT services, improve existing aspects of the system, fix problems, etc. A best practice is to release these design changes on a scheduled cycle. Everything in ITIL is in the context of the “service”; in the case of network services, this is the entire network viewed as a cohesive system. IOS updates, software updates to support systems, etc. are released by the manufacturer. Although these are part of the network system design, they do not constitute a release of the network service. For example, network service version 2.2 may define JUNOS 10.4.x on all routers and IOS 12.4.x on all switches. Release 2.3 of the network service may include a JUNOS update as well as an enhancement to the QoS design and a configuration modification to fix a routing anomaly discovered during the previous release. It is the service that is subject to release management. The updated service design is then provisioned on the applicable devices using a schedule that depends on multiple factors – mostly operational. The provisioning schedule is not the same as the release schedule: release 2.3 may be approved on March 1 but not provisioned across the entire network until August through a series of change orders. A well-established network service design program uses release management.
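One way to picture this is a versioned manifest for the network service as a whole. The sketch below is illustrative only (the data structure is not a prescribed ITIL artifact, and the 2.3 component versions beyond those named above are assumed for the example):

```python
# Minimal sketch: a network service release treated as a versioned manifest.
# The component versions mirror the hypothetical example above.

releases = {
    "2.2": {
        "router_os": "JUNOS 10.4.x",
        "switch_os": "IOS 12.4.x",
        "qos_design": "v1",
    },
    "2.3": {
        "router_os": "JUNOS 11.1.x",   # assumed JUNOS update, for illustration
        "switch_os": "IOS 12.4.x",
        "qos_design": "v2",            # QoS design enhancement
        "notes": "config fix for routing anomaly found in release 2.2",
    },
}

def diff_release(old: str, new: str) -> dict:
    """Return the components that change between two service releases."""
    before, after = releases[old], releases[new]
    return {k: (before.get(k), v) for k, v in after.items() if before.get(k) != v}

print(diff_release("2.2", "2.3"))
```

The release is the approved change to this manifest; the change orders that push it onto devices over the following months are provisioning, on their own schedule.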

Is there a distinction between the design activity and the provisioning activity?

  A. Design is an engineering function, provisioning is an operational function
  B. The design activity involves a great deal of testing and predictive analysis
  C. Provisioning affects operations and is subject to change management
  D. Provisioning is applying the standardized design to distinct devices and sub-systems
  E. All of the above

Network service design and provisioning are too often blurred or indistinguishable. A design is abstract and applies to no particular device, yet contains enough detail to be provisioned on any particular device in a turn-key fashion. Most organizations design particular devices, skipping the abstract design and folding it into provisioning; when this happens, the design process must be repeated for each similar instance. Few CABs make the distinction between the two activities, which makes change management very labor intensive: the provisioning activity becomes subject to all the testing and scrutiny of the design activity, and the design activity becomes subject to all the operational concerns of provisioning. This is another indicator of a poorly functioning network service design activity. Note: this is the only question where “E” is the best answer.
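As a rough sketch of the distinction (the template text, device names, and parameters are hypothetical placeholders), the abstract design is a parameterized template and provisioning is the act of rendering it for each specific device:

```python
# Minimal sketch: an abstract design rendered per device at provisioning time.

from string import Template

# The "design": abstract, applies to no particular device.
access_switch_design = Template(
    "hostname $hostname\n"
    "interface Vlan$mgmt_vlan\n"
    " ip address $mgmt_ip 255.255.255.0\n"
    "logging host $syslog_server\n"
)

# "Provisioning": applying the standardized design to distinct devices.
devices = [
    {"hostname": "acc-sw-01", "mgmt_vlan": 10,
     "mgmt_ip": "10.10.10.11", "syslog_server": "10.0.0.50"},
    {"hostname": "acc-sw-02", "mgmt_vlan": 10,
     "mgmt_ip": "10.10.10.12", "syslog_server": "10.0.0.50"},
]

for device in devices:
    print(access_switch_design.substitute(**device))
```

The design (the template) gets the testing and predictive analysis; provisioning (the rendering and deployment) gets the operational scrutiny and change scheduling.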

 

Contact us if you’d like some help moving forward.

 

Contact Us Today: http://www.networkperformanceinnovations.com/contact-us/

Well Executed Service Design Provides Substantial ROI

[Figure: systems engineering V-diagram]

I’ve been engineering, maintaining, and managing network and IT systems for numerous organizations, large and small, for longer than I care to elaborate on. In all but three cases, all of which were small with little structure, the organization had a change management process that held to most of the ITIL process framework to some extent. All had many of the service operations processes and were working to improve what was lacking. All the CTOs/ITOs understood the value of the service transition and service operations processes. However, few had a service catalog or any of the ITIL service design processes when I first began working with them, and nobody really gave it much thought. Almost all of the articles and discussions on the internet related to ITIL are about service transition or operation; rarely, if ever, is anything written about service design or strategy. Half of the ITIL service lifecycle gets all the attention, while the other half is relegated to an academic exercise for certification. If it isn’t important or necessary, why is it there?

Who needs Service Strategy and Service Design anyway?

Perhaps it’s that most people who are proponents of the ITIL framework have either an operations or project management background and don’t understand the engineering process and how it relates to ITIL. How many engineers or developers do you know who embrace ITIL? Most see it as a burden forced upon them by PMs and operations. Isn’t the CAB a stick used to pummel engineering into writing documents that nobody, not even the installer, will ever read?  What if they saw themselves as a vital part of the service lifecycle and key to cost saving measures?

[Figure: SOA design model]

I’ve often heard it said that service operations and transition are where ITIL realizes its ROI, and that is certainly where cost recovery is most evident. I argue that a well executed service design provides even more ROI than operations and transition, though the return is not evident in the design part of the lifecycle.

To illustrate this point, consider an auto manufacturer. A lot goes into the design of the auto. The customer doesn’t see the design or the manufacturing process, but they see the O&M processes. Do you know anyone who had a vehicle that was always needing repair? The repair was costly but how much of the need for repair was due to poor design? I had a Bronco II that would constantly break the ring gear which often led to a transmission rebuild. The aluminum ring gear couldn’t handle the torque and would rip into pieces. I had several friends who owned minivans that would throw a transmission every 30,000 miles. It wasn’t bad luck or abuse, it was a bad design. The manufacturer fixed that in later years, but it gave them a bad reputation and caused sales on that model to fall. How about recalls? They are very costly. First there’s the issue of diagnosing a problem with a product that’s already in the field, then the redesign, and the retrofit. The point I’m trying to illustrate is that design flaws are very costly, but that cost shows up in the operations and transition part of the lifecycle, not the design stage.

The Rogers Cellular outage in Oct 2013 is one example.  Rogers has not had a very good record for service and availability.  They suffered an outage impacting their entire network for a few hours that made national news. How do you suppose this outage affected sales?  An inadequate design can have some very expensive consequences.

The business case for change management is built on reducing the cost associated with service disruptions due to change. While change management is good, the real problem is unexpected service disruption as a result of change. Planned service disruption can be scheduled in a manner to least impact customers. It’s the unintended consequences that are the trouble. A well executed service design process produces a transition plan that correctly identifies the impact of the change (change evaluation). Change management has nothing to do with this; in fact, change management relies on this being done correctly. A large part of what most organizations are using change management to correct isn’t even addressed by change management; it’s addressed by service design. This may be counter-intuitive, but it’s true nonetheless.

CloudFlare made the news when they experienced an hour-long outage affecting their entire worldwide network in March 2013. The outage was due to a change that caused their Juniper routers to become resource starved after a firewall rule was applied. Juniper took the bad rap for this; however, it was the network engineering team at CloudFlare that was to blame, not Juniper. Although the failure was triggered by a JUNOS bug, Juniper had identified the bug and released a patch in October 2012. CloudFlare made a change to the network (a service design change) that was released immediately to ward off a DDoS attack (this would be a service patch in ITIL terms). The change was not tested adequately, and the behavior was not as expected. It was the service design process at fault here, and there was nothing in the change management process to catch it, because change management attempts to manage the risk associated with change by controlling “how” the change is executed. Change management does nothing with the content of the change; it presupposes that the “what” being changed has been adequately designed and tested as part of the service design process.

IT services such as Windows domains, Exchange, and databases seldom bear any semblance of an engineering practice; the groups responsible typically function as product implementation centers. Implementing a well defined service design program requires a major paradigm shift for an organization. Most organizations don’t view the engineering process with the same discipline as other areas of industry. Because most networks and IT systems are composed of COTS products that need relatively little configuration, the configuration details of the COTS products and how they integrate with other systems are not viewed as a system that should be subject to the same engineering processes as any other system that needs to be designed.  This is a very costly assumption.

The limit as x approaches c

In mathematics, a limit is the value that a function “approaches” as the input approaches some value. Limits are essential to mathematical analysis and are used to define continuity, derivatives, and integrals. If we were to take the limit of service availability as service design approaches perfection, we would see that there were no unexpected outcomes during service transition – everything would behave exactly as expected, eliminating the cost of unintended service disruptions due to change. The service would operate perfectly, and incidents would be reduced to the happenstance of a component malfunctioning within its expected MTBF. There would be no problems to identify workarounds or resolutions for. This would greatly increase service availability and performance and produce a substantial ROI.
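Written out, the standard limit statement and the loose analogy used here look like this (the second expression is informal shorthand, not a formal model):

```latex
\lim_{x \to c} f(x) = L,
\qquad
\lim_{\text{design} \to \text{perfect}} \bigl(\text{cost of unplanned disruption}\bigr) = 0
```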

Network instrumentation is another area where the design is seldom on target.  Most networks are poorly instrumented until after they’ve grown to a point where the lack of visibility is causing problems.  This applies to event management – what events are being trapped or logged, how those events are filtered, enriched, and/or suppressed in the event management system to provide the NOC with the most useful data.  It also applies to performance management – what metrics are collected, how thresholds are set, how metrics are correlated to produce indicators useful to the consumers of that data.  It also applies to traffic statistics such as Netflow or similar technologies.  This should all be part of the network design from the beginning, because the network has to be maintained as part of the lifecycle.
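As a rough sketch of what “designing the thresholds in” might look like (the metric names, values, and limits are hypothetical placeholders, not recommendations; in practice these would be defined in the service design and fed by the collection systems above):

```python
# Minimal sketch: threshold rules from the performance management design
# evaluated against collected metrics to raise alerts for the NOC.

thresholds = {
    "link_utilization_pct": 80.0,
    "cpu_utilization_pct": 75.0,
    "interface_errors_per_min": 10.0,
}

latest_metrics = {
    "core-rtr-01": {"link_utilization_pct": 83.2,
                    "cpu_utilization_pct": 41.0,
                    "interface_errors_per_min": 0.0},
}

for device, metrics in latest_metrics.items():
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            print(f"ALERT {device}: {name}={value} exceeds threshold {limit}")
```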

The service design aspect of the ITIL service lifecycle is greatly undervalued and often overlooked entirely. Take a look at what ITIL recommends should be in the Service Design Package – the topology is only one of many aspects of the design.  Poorly executed service design results in increased costs due to unexpected service disruptions during service transition and in decreased system availability due to flaws in the design. All of these require additional effort to identify the problem, compensate for the service disruption, rework the design, and execute the transition again.

A wise approach to ITSM is to expend the effort in service design rather than incurring the much greater cost of dealing with the fallout from a poorly designed network.  The engineering staff should see themselves as having a key role in the service design process and know the value they contribute to the entire system lifecycle.

Read the follow-on post if you’d like some specific indicators of an ineffective network service design.

NPI provides consulting services and an inexpensive design tool to aid in the service design process for network and IT services.

 

Contact Us Today: http://www.networkperformanceinnovations.com/contact-us/