
Friday, February 20, 2009

The next 12 months

Yesterday at the Chicago ISACA meeting I had the opportunity to hear Dave Ostertag from Verizon walk through the 2008 Verizon Data Breach Investigations Report, point by point. At the time of publication, the report included over 100 data points from 500 cases, but the base is now up to 700 cases and still more interesting patterns in the data continue to emerge.

The report is 27 pages long, but it informs an information security strategy by persuasively answering one simple question: “What changes can I make in the next 12 months that will significantly reduce the likelihood and impact of a security incident in my organization?”

Across all the activities lumped under the banner of information security, Verizon found that a surprisingly small set of outcomes (or more accurately, the absence of these outcomes) mattered most. The survey lists nine recommendations, but I’ve re-worded and consolidated them a bit here:
1. Execute: ensure that security processes implement the identity management, patch management and configuration management basics. From the survey: “Eighty-three percent of breaches were caused by attacks not considered to be highly difficult. Eighty-five percent were opportunistic…criminals prefer to exploit weaknesses rather than strengths. In most situations, they will look for an easy opportunity and, finding none, will move on.” In contrast, among poor-performers, “…the organization had security policies … but these were not enacted through actual processes…victims knew what they needed to do … but did not follow through.”
2. Inventory, segment and protect sensitive information: “Sixty-six percent of breaches involved data that the victim did not know was on the system.” Know where critical data is captured and processed, and where it flows. Secure partner connections, and consider creating “transaction zones” at the network level to separate baseline business activities from high sensitivity environments.
3. Increase awareness. “Twelve percent of data breaches were discovered by employees of the victim organization. This may not seem like much, but it is significantly more than any other means of internal discovery observed during investigations.”
4. Strengthen incident handling capabilities. Monitor event logs, create an incident response plan, and engage in mock incident testing.

Steps 1 and 2 reduce the likelihood of an incident; steps 3 and 4 primarily reduce the potential impact by decreasing the time lag between an intrusion and its eventual identification and containment.

As for step four, my first thought is that mock testing won’t be much of a necessity for most incident response teams, thanks to the natural cycle of event monitoring, suspected incident reporting, and initial response to events that often turn out to be false positives. Organizations that promote active reporting of suspicious events, and that treat each one as an actual incident, will get much of the practice in a live setting that mock drills would otherwise offer. Instead of trying to prevent false positives from occurring, an IR team should work to become more efficient at quickly ruling them out. As they do, the threshold for activating an initial review will drop, and ultimately they’ll catch more events closer to the time of occurrence.

It’s still a good idea to ensure that all stages from identification through remediation and recovery are fully practiced, but in general achieving containment quickly reduces the number of records exposed, and thus the eventual full cost of the breach.

Which brings us to next steps for Verizon; it seems that they’re now working on developing an incident costing model. This will be huge, because without it, organizations will continue to struggle with how to set specific protection goals that align with their cost structure and business strategy.

As an example, the survey looked at four sectors. Retail was one that contributed a sizeable amount of data (which is a polite way to say they got hacked a lot). It’s no surprise that simple survival is usually a bigger concern than security for many retailers: net profit margin among publicly traded companies in this sector often ranges between two and six percent. An additional dollar spent on physical security needs to be matched by up to $25 in additional sales … just to break even. Considering the wholesale cost of merchandise, it’s understandable why management accepts the risk of physical theft, formally accounting for it as “shrinkage.”
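The break-even arithmetic here is simply the loss divided by the net margin. A quick sketch (the $25 figure corresponds to a margin near the middle of the two-to-six-percent range; the function name is mine):

```python
# Additional sales needed to recover a loss at a given net profit margin:
# sales such that margin * sales == loss.
def breakeven_sales(loss: float, net_margin: float) -> float:
    return loss / net_margin

# Retail net margins from the post: two to six percent
for margin in (0.02, 0.04, 0.06):
    needed = breakeven_sales(1.0, margin)
    print(f"{margin:.0%} margin -> ${needed:.2f} in sales per $1 lost")
```

At a 2% margin each lost dollar takes $50 of new sales to recover; at 6%, about $16.67.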

Unfortunately, while this mindset towards risk carries over into the electronic space, the analogy doesn’t. A dollar lost to computer crime, either through the cost of the incident itself, or the cost of organizational response, comes straight out of profits. It’s a much more damaging effect.

But, without a clear measure of the cost of an incident, the value of steps 1-4 to the CFO is murky at best. It doesn’t need to stay this way: calculating the direct and indirect handling costs of an incident isn’t a terribly difficult exercise, and most organizations already have the data needed to put it together. At JMU I started down this path with Dr. Mike Riordan in his Managerial Accounting class, drawing heavily on Gary Cokins’ paper Identifying and Measuring the Cost of Error and Waste to frame the problem. We need a credible model backed by lots of data, and I’m really hoping Verizon is able to put it together.

As for the next 200+ cases, I can’t wait to see how they present the 2009 findings. To characterize the survey as “pathology” might be a bit strong, but I thought it was interesting to note Dave’s background as a former homicide investigator. During the live session, you get some answers to the “so then what happened?” questions that the report doesn’t touch.

On our end it may feel like a never ending battle, so it’s good to talk to someone with a broad view of what is going on internationally. It’s more than a little comforting to learn how much progress is being made in locating and taking legal action against the bad guys…

Tuesday, January 27, 2009

Making the most of forensic downtime

As a computer forensic investigator, you may find at times that the caseload becomes a bit overwhelming. Sometimes the requests come pouring in; other times the queue will be empty. Looking back through several months of work, you can put together a reasonable estimate of the average arrival rate and average completion rate for forensic investigation requests. Armed with these two pieces of data and some equations from queuing theory, you’ll be able to estimate the amount of cumulative non-investigation time that will likely be available for other tasks over the course of a year.

Naturally much of that down time should be spent “sharpening the saw”: maintaining tools and scripts, and so on. But as discussed in the last post, it may also be helpful to leverage those forensic tools and skills to measure risks in the end user environment. Taken periodically, these measurements support the execution of a security strategy by providing the evidence needed to drive changes that will ultimately reduce the frequency and impact of incidents that do occur.

Queuing theory can become complex in a hurry, but there are a few formulas that are easy to use and very helpful if you make a few reasonable simplifying assumptions. For the long version, check out Contemporary Management Science by Anderson, Sweeney and Williams; an online excerpt of the relevant chapter is available here.

If you know the average arrival rate of requests, and have a good feel for how long it takes to complete the typical investigation, you can calculate the following:

1. Probability of an empty queue with no requests in process:
1 – (arrival rate / service rate)
2. Average number of pending requests:
((arrival rate) ^ 2) / (service rate * (service rate – arrival rate))
3. Average number of investigations in the system:
Average number of pending requests + (arrival rate / service rate)
4. Average time a request has to wait before the investigation starts:
Average number of pending requests / arrival rate
5. Average resolution time; from initial request to completed resolution:
Request wait time + (1 / service rate)
6. Probability that a new request has to wait for service:
Arrival rate / service rate
7. The probability of N number of investigations and requests that are in the system at a given point in time:
(arrival rate / service rate)^N * probability of an empty queue

These equations provide a good approximation if the following assumptions hold true:
1. For a given time period, you’re almost always going to get between zero and two requests (i.e. 95% likely), and only rarely a bunch of requests (a 5% chance of three or more arriving at once).
2. Few service requests will take significantly longer than the average service time to complete.
3. You’ve got one investigator servicing requests – a “single service channel.”

So, as an example, suppose the average request arrival rate is about 3 cases per month, and an investigator can complete about 4 cases each month. Calculate expected downtime using this process:

First, convert the monthly numbers to a weekly rate, 3 arrivals per month is a 0.75 weekly arrival rate, and 4 completions per month is a 1.0 service rate. Then, plug and go:

The probability of zero requests in queue is:

1 – (.75 / 1) = 0.25

A 25% chance of having a week with no requests in progress.
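All seven formulas can be applied to the same weekly rates in a few lines of Python. This is a minimal single-channel (M/M/1) sketch consistent with the assumptions above; the variable and function names are mine:

```python
# M/M/1 queue metrics for the forensic caseload example.
# arrival_rate and service_rate are in requests per week.

def mm1_metrics(arrival_rate: float, service_rate: float) -> dict:
    rho = arrival_rate / service_rate  # utilization; also P(new request waits)
    p_empty = 1 - rho                  # 1. probability of an empty queue
    avg_pending = arrival_rate**2 / (service_rate * (service_rate - arrival_rate))  # 2.
    avg_in_system = avg_pending + rho          # 3. pending plus in progress
    avg_wait = avg_pending / arrival_rate      # 4. weeks before work starts
    avg_resolution = avg_wait + 1 / service_rate  # 5. request to completion
    return {"p_empty": p_empty, "avg_pending": avg_pending,
            "avg_in_system": avg_in_system, "avg_wait": avg_wait,
            "avg_resolution": avg_resolution, "p_wait": rho}

def p_n(n: int, arrival_rate: float, service_rate: float) -> float:
    """7. Probability of exactly n requests/investigations in the system."""
    rho = arrival_rate / service_rate
    return rho**n * (1 - rho)

m = mm1_metrics(0.75, 1.0)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

With a 0.75 weekly arrival rate and a 1.0 service rate, this gives 2.25 requests pending on average, a 3-week average wait before work starts, and a 4-week average resolution time.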

So in this “best case” scenario, roughly 25% of an annual 2000 hours worked won’t be directly allocated to investigating live cases; 500 hours. In reality, this estimate is almost certainly too low for a couple of reasons. First, management isn’t likely to over-staff a forensic role; demand will rise to fill that capacity, and inevitably many long-running difficult cases will come up that fall outside the average completion rate by a big margin. And investigators will need to factor in time for script development, system administration, and other tasks.

Assuming that a residual 200 hours (10%) of time remains available throughout the course of the year, this can provide the perfect opportunity to quantify policy compliance against specific goals.

So how much can you do with 200 hours? Turns out, quite a bit.

Saturday, January 10, 2009

Getting privileged accounts under control: spend less time finding, more time fixing

Are there too many privileged accounts on the business critical systems in your organization? If you suspect so, how would you find out, and how would you energize the leadership in your organization to act? And once you get management endorsement, what number would you set as the maximum allowable number of accounts on a system as a benchmark for non-compliant system owners to shoot for? You'll want all owners to verify compliance, but would a positive response from 50% of those owners justify the call to action?

Perhaps most important of all, after driving this change and moving on to the next problem, will you have the time and resources needed to follow up later in the year and make sure that the problem hasn’t reappeared?

As with any security issue, a small amount of effort should go into finding the problem, and the majority into solving it. To paraphrase Tom Clancy from Into the Storm: “The art of command is to husband that strength for the right time and the right place. You want to conduct your attack [in this example, on the problem] in such a way that you do not spend all your energy before you reach the decisive point." (page 153)

Using a tool like dumpsec for Windows, it doesn’t take long to pull group memberships remotely from any given system. But if you’re dealing with hundreds or even thousands of systems, well, that’s a lot of energy to spend before reaching the decisive point, i.e. when system owners start removing excessive accounts.

Intuitively, it makes sense that you wouldn’t want to poll every system in a large environment. Instead, you’d take a sample. But how big of a sample is needed for you – and senior management – to be confident that you know the current state?

Turns out, you (and your boss) can be 90% confident of knowing the median number of privileged accounts on all systems across the server population if you start with a randomly selected sample of 18 systems. And because by definition the median is the middle value, you know that half of the systems are above the sampled value. If this value is too high based on the risk requirements of the environment, you can set a compliance goal such as “reduce the number of privileged accounts on each Windows system to X by the end of the year.”

To find the median, follow these steps:
1. Pick 18 systems at random across the system population. Dump the list of users with privileged access from each system.
2. Arrange them from fewest to most accounts.
3. Throw out the lowest six and the highest six values, and keep the middle six.

The median number of privileged accounts will be between the low value and the high value of the middle six numbers out of the sample of 18.
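The three steps can be sketched in a few lines of Python. This is a minimal illustration, using the example counts from the post in place of real dumpsec output; the function name is mine:

```python
# Steps: sample 18 systems, sort the privileged-account counts,
# drop the lowest six and highest six, and keep the middle six
# as the interval estimate for the population median.

# Example privileged-account counts from the post (one per sampled system)
counts = [49, 23, 17, 33, 17, 16, 28, 14, 29, 40, 12, 44, 34, 12, 25, 9, 10, 32]

def median_interval(sample):
    """Return (low, high) of the middle six values from a sample of 18."""
    assert len(sample) == 18, "procedure assumes exactly 18 sampled systems"
    ordered = sorted(sample)
    middle_six = ordered[6:12]  # throw out the lowest six and highest six
    return middle_six[0], middle_six[-1]

low, high = median_interval(counts)
print(f"Median likely lies between {low} and {high}")
```

Running it on the example sample yields the 17-to-29 interval discussed below.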

For example, if I dumped the local admins group across a set of systems, I might get a result like this (after sorting, the “middle six” values are 17, 17, 23, 25, 28 and 29):

49, 23, 17, 33, 17, 16, 28, 14, 29, 40, 12, 44, 34, 12, 25, 9, 10, 32**

So based on this sample, the median number of privileged accounts across all systems is 90% likely to be between 17 and 29. Granted, due to the architecture certain accounts may be present across all systems. And other factors may help determine if 29 is too high … or 17. But once you decide, you have a baseline value that defines the boundary between acceptable risk and excessive access, which can be communicated across the organization.

Once you’ve gotten buy-in and communicated the requirement, each system owner who wasn’t sampled can compare and confirm that they comply. And in keeping with Clancy’s principle above, only a fraction of your time was spent identifying the problem and communicating it: the rest goes into helping fix it.

But why does 18 work? Where does the 90% confidence come from, and why throw out the bottom six and top six values?

Doug Hubbard explains it in Chapter 3 of his book “How to Measure Anything.” And while this isn’t a specific example in the text, there are a lot of intriguing applications to information security that he does cover.

Hubbard introduces the idea of finding the median from a small sample as “the rule of five:”

“When you get answers from five people, stop…Take the highest and lowest values in the sample…There is a 93% chance that the median of the entire population … is between those two numbers.” Why? “The chance of randomly picking a value above the median is, by definition, 50% -- the same as a coin flip resulting in “heads.” The chance of randomly selecting five values that happen to be all above the median is like flipping a coin and getting heads five times in a row.(pp. 28-29)”

In other words: 0.5 x 0.5 x 0.5 x 0.5 x 0.5 = 0.03125. With a random sample of five, there’s only a 3.125% chance of being above the median all five times, and the same 3.125% chance of being below the median all five times. So each time you take five random samples, you’re going to get values on both sides of the median 93% of the time -- the median will very frequently be between your lowest and highest value.
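Hubbard’s arithmetic is easy to verify: the chance that all n random samples land on the same side of the median is 2 × 0.5^n, so the confidence that the median falls between the sample minimum and maximum is the complement. A quick sketch (the function name is mine):

```python
# Confidence that the population median lies between the min and max
# of n random samples: 1 minus the chance all n land on one side.
def median_confidence(n: int) -> float:
    return 1 - 2 * 0.5**n

print(f"5 samples:  {median_confidence(5):.4f}")   # Hubbard's ~93%
print(f"18 samples: {median_confidence(18):.6f}")  # min-to-max interval
```

Note the second line is for the full min-to-max interval of 18 samples, which is why there is room to discard extreme values and still keep a high level of confidence in a narrower band.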

So if five samples gives you 93% confidence, why take 18 samples? From the example above, if you picked the first five at random and stopped, you would have found this:

49, 23, 17, 33, 17

With 93% confidence, you’d be able to assert that the median number of privileged accounts across all systems is between 17 and 49. With small samples randomly chosen, high confidence comes at the expense of intervals that are often quite wide. And in this case, it may be too wide to be useful. But picking more samples and tossing out the six lowest and six highest retains roughly the same level of confidence in the middle six, with the advantage of a much smaller range between the low and high values. And it’s the smaller range that allows you to understand the state of the environment, and set a credible level of improvement that the organization can meet.

More info I found useful:
How to Measure Anything http://www.howtomeasureanything.com/ Lots of gems on the site; check out the PowerPoint on measuring unobserved intrusions in information systems.

Confidence intervals for a median, with different size samples: http://www.math.unb.ca/~knight/utility/MedInt95.htm

**These numbers were generated by Excel; try it out for yourself. For this example I used the formula =5+(40*RAND()) to give a higher starting value than just "1."

Friday, December 05, 2008

Risk metrics should drive security, without dictating it

How precise do risk measures need to be in order to be of value to an organization? Is it necessary to calculate an annual loss expectancy (ALE) for each type of information security risk in order to justify security decisions? For better or worse, most organizations have settled on a security budget that is a fraction of the overall IT budget, which in mature companies remains a steady proportion of annual revenue.

Given the challenge of putting together credible loss numbers across the range of identified threats against the organization, it doesn’t make much sense to try to optimize budgets purely against a risk forecast. Instead, security is best treated as a constraint in decisions to optimize revenue, operating costs, profit or other key measures. Protection for critical assets needs to cross an “adequacy” threshold. Conversely, when changes stress or stretch protection capabilities to the point of exposing critical assets to threats, the information security function begins raising the case for change.

So if risk management is more about being on the right side of a threshold, as is literally specified in the EU Privacy Directive / US Safe Harbor guidance, then precision is not nearly as important as confidence. Polling organizations such as Gallup provide a margin of error of 2% because the difference between winning and losing a contest is often very close. In contrast, safety- and security-based decisions (i.e. “we need to act, now”) can become clear with margins of 10-15% or more. As an example, if the brakes on the family minivan squeak and start slipping, it’s time to get them replaced.

With the help of a few reasonable, simplifying assumptions, it is possible to make trustworthy risk-based decisions based on just two critical metrics: security control coverage, and information asset exposure.

These assumptions are as follows:
1. The impact of security incidents is best characterized in financial terms, i.e. information security incidents have the potential to affect current and/or future costs, and current and/or future sales. (Health and safety critical environments are an exception that should be treated differently.)
2. The value that IT security provides to an organization comes from decreasing the frequency and severity of security incidents by:
a. Preventing incidents from occurring whenever possible
b. Detecting relevant events where and when they occur, and mobilizing an effective response to minimize the damage and restore normal operation as quickly as possible.
3. Security control coverage is a leading indicator of risk to information systems, business processes and data.

Based on these assumptions, two key metrics for decision makers can persuasively frame the security “threshold” decision without requiring an unreasonable level of precision:
1. Information asset exposure: a measure of the relative contribution of that asset to the current and future revenue of the organization.
2. Security control coverage: a measure of the number and type of industry best practice recommendations implemented independently as layers of protection on each asset and process owned or used by the organization to serve its customers and stakeholders.

As an example, consider a company with $120 million in annual sales, $150 million in assets, 500 employees, tens of thousands of current and former customers, a market capitalization of $110 million, and an operating margin of about 18%. Based on these estimates, here’s a quick back-of-the-envelope estimate of the scale involved in information protection decisions:

$120 million in annual sales works out to about $330,000 per day or between $10,000 and $25,000 per hour. So to this company, the loss of several hours of downtime from a key system or systems, plus incident handling costs and lost worker time, etc. can run between $150,000 and $200,000.

According to a 2006 report from the Association of Certified Fraud Examiners, the median fraud loss for asset misappropriation (skimming, payroll fraud or fraudulent invoicing) is $150,000.

Forrester estimates that a privacy breach costs between $90 and $305 per record to address; the Ponemon Institute provides a similar number. Based on those estimates, losing personal information on 5,000 customers would result in costs of roughly $450,000 to $1.5 million.
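These back-of-the-envelope figures are easy to reproduce. A minimal sketch applying the post’s numbers directly (constants and names are mine; the per-record range is the quoted Forrester estimate):

```python
# Back-of-the-envelope exposure figures for the example company.
ANNUAL_SALES = 120_000_000   # $120M in annual sales, from the post
OPERATING_MARGIN = 0.18      # roughly average for the S&P 500

daily_sales = ANNUAL_SALES / 365
print(f"Sales per day: ${daily_sales:,.0f}")

# Per-record breach cost range quoted from Forrester
records = 5_000
low, high = records * 90, records * 305
print(f"Breach of {records:,} records: ${low:,} to ${high:,}")

# Revenue needed to make up each dollar lost to an incident
print(f"Revenue to recover $1 of loss: ${1 / OPERATING_MARGIN:.2f}")
```

The last line is the same arithmetic behind the $5-to-$6 figure discussed further down: at an 18% margin, each dollar of loss comes straight out of profit and takes several dollars of new revenue to replace.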

Asset exposure, described as a fraction of revenue, is a linear function: the longer the downtime, or more records exposed, the higher the cost. But as described in an earlier post, security is not linear. In a population of systems connected by trust relationships, a failure in server A will lead to a compromise of server B, C, D and on down the line.

Earlier this year, Verizon published a Data Breach Investigation Report based on follow-up on over 500 cases in a four year period. While there’s much to take away from the results, two measures stand out in terms of shaping risk decisions: 85% of identified breaches were the result of opportunistic attacks, and 87% were considered avoidable through reasonable controls. That is; security control coverage provides a strong leading indicator as to the likelihood of experiencing a security breach.

So, given an operating margin of 18% (roughly average for the S&P 500) it could take $5 to $6 of additional revenue to make up for each dollar lost due to a security incident.

Against these measures, determining levels of acceptable risk becomes a much more straightforward exercise without the need for precise risk forecasting. Instead, it becomes a question of risk tolerance: will the extensions to the customer-facing systems generate enough new revenue to justify exposure to some of the scenarios listed above?

Metrics can frame the issues, but ultimately the business has to drive it.

Sunday, October 26, 2008

Can you afford bad security?

Amid the current economic turmoil and uncertainty, it’s becoming clear that the global economy is slowing, pressuring organizations of all sizes to compete more intensely for revenue while taking an even harder look at reining in costs. These concerns cascade through the overall project portfolio to IT and security in the form of two very basic questions: What do we need? What can we afford?

In a company fighting for its survival, talking to management about improvements in information security may seem as relevant as changing the locks on a burning building. Naturally, fire is an immediate threat to an asset and its contents, but over a longer time horizon so is the risk of theft … or foreclosure.

Bottom line, some organizations can afford bad security. Others can’t. In some situations, immediate survival concerns will temporarily trump long term protection goals. But as the market meltdown in the United States in 2008 is showing us, it is just as plausible to see that relaxing key control requirements for short term profitability puts entire companies, and even markets, at risk.

The only way to get this right is to view security in light of the survival needs of the firm, and measure it to the same standard of every other investment. In the past, information security hasn’t been held to this standard, mostly due to measurement challenges. Hopefully, for the good of the profession as well as the entities we protect, those days are over and we can take up the challenge of proving our value more accurately and more persuasively than we have in the past.

“What the CEO wants you to know”
In 2001 Ram Charan wrote a gem of a book called “What the CEO Wants You to Know,” distilling business acumen into the effective management of five core measures of business health: cash, margin, velocity, growth and customers. Charan: “Cash generation is the difference between all the cash that flows into the business and all the cash that flows out of the business in a given time period …it is a company’s oxygen supply” pp.30-31

Margin is the difference between the price and cost of goods sold, while velocity is the rate at which those goods are sold. Growth includes expansion (more sales) and extension (new markets) while the Customers category represents how well the organization responds and aligns with market demands.

Naturally, some of these needs can become tactical and immediate while others are more strategic in nature. But all must be functioning effectively for a company to succeed, and any threat to these measures ultimately threatens the health of the company.

“What the CISO wants you to know”
If the five factors above represent the keys to a successful business, then good security is important to a company only to the extent that it affects those factors. If there’s no impact on customers, growth, etc. then there’s no value to security. Or, as your CFO probably read in school:

“A potential project creates value for the firm’s shareholders if and only if the net present value of the incremental cash flows from the project is positive.” [Brigham and Ehrhardt, Financial Management: Theory and Practice, 11th Edition, p.389]

Security issues expressed in terms of cash, margin, velocity, growth and customers, and measured in terms of net impact to the company have the best chance of resonating with decision makers.

Gordon and Loeb propose a three-dimensional cybersecurity cost grid as a tool for building that business case. The authors suggest that failures of confidentiality, integrity and availability be analyzed in terms of direct and indirect costs, as well as explicit and implicit costs.

For me, the distinction between indirect and implicit didn’t seem as compelling as the difference between a net positive or negative effect on security, so I started segmenting the effect of security across Charan’s five categories this way:

[The original post includes a table mapping net positive and negative security effects to cash, margin, velocity, growth and customers; the image is not reproduced in this archive.]
Of course, measuring it is the real trick. But there are quite a few resources available to help with that...