Measuring Manager Performance: Dos and Don’ts

In the first part of this discussion, I talked about the history of benchmarks and their shortcomings. In this article, I discuss their problematic impact on performance measurement – and their use in the hiring, firing, and retention of managers.

There is one truth in the investment industry: past performance is not a predictor of future performance.  So, with all the sophisticated tools now at our disposal, we are only a little further ahead in determining manager “skill” and whether this can reasonably be expected to be sustained in the future.  We are a lot better at determining the reasons we have earned that extra dollar – but only a little better at knowing how to generate that extra dollar in the future.

What makes measuring manager performance difficult over time is that there are so many moving parts. Investors tend to focus too closely on four main measurement tools – inflation, peer group comparisons, indexes, and statistical measures. Each of these is problematic for a number of reasons:

Inflation-Driven

This bogey does not meet any of the criteria outlined by the CFA Institute for what constitutes an effective benchmark. It is not representative of an asset class, not transparent, not investable and not representative of a manager's style/approach. The only positive with this measurement approach was the longer-term time frame accepted for determining success or failure. Its primary rationale was as a proxy for the long-term rate of growth of liabilities, which, ideally, the balanced fund managers would outpace. However, it is only a crude approximation of the rate of growth of liabilities and, even then, only for a fully-indexed defined benefit plan.

DB plan sponsors and their advisers today can, and should, determine the innate rates of growth in their liabilities, driven by inflation over the long term and, over the short term, by movements in the bond yields used to place discounted values on expected benefit streams. With this information they can determine which assets best match their liabilities, and can then decide the extent to which they match liabilities versus maximize long-term returns.
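To make the short-term yield sensitivity concrete, here is a minimal sketch – the benefit stream and discount rates are hypothetical, not drawn from any particular plan:

```python
# Sketch: how a change in bond yields moves the discounted value of a
# DB plan's expected benefit stream. All figures are hypothetical.

def present_value(cash_flows, rate):
    """Discount a list of annual cash flows at a flat annual rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))

# A stylized 30-year stream of $10M annual benefit payments.
benefits = [10_000_000] * 30

pv_at_5 = present_value(benefits, 0.05)  # liabilities discounted at 5%
pv_at_4 = present_value(benefits, 0.04)  # the same stream after yields fall 1%

print(f"PV at 5%: ${pv_at_5:,.0f}")
print(f"PV at 4%: ${pv_at_4:,.0f}")
print(f"Liability growth from a 1% yield drop: {pv_at_4 / pv_at_5 - 1:.1%}")
```

A one-point drop in yields inflates the measured liability by roughly 12% in this stylized case – which is why short-term yield movements matter alongside long-term inflation.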

However, the relevance of an inflation-related benchmark may be returning, as inflation is a major element in the retirement income needs of DC plan members.

Peer Group Sampling

As mentioned earlier, peer group sampling now plays a backup role to index-focused tools – yet investment managers are still, typically, hired or fired based on "relative" performance. A manager who exceeded his or her bogey (the index plus a value-added target) might be terminated if the portfolio placed in the third or fourth quartile of a peer group sample – achieving the primary goal, but falling short on the secondary objective. On the other side, a manager could fall short of the primary objective, yet place in the second quartile, keep the client – and, probably, gather more assets from that client. So, peer group sampling, as a measurement tool, tends to create the most manager turnover. And, in turn, managers are often more concerned with how they place against others year by year than with how to maximize long-term value for the beneficiaries. Mutual fund managers, operating under constant media scrutiny, are most prone to this effect.

As with the inflation bogey, the Median Fund does not meet the criteria of an effective measuring stick – it is not representative of any asset class or mandate, not specified in advance, not transparent and not investable. In fact, the Median Fund does not exist. On the other hand, consistently underperforming other managers in the same area of investment can hardly be ignored entirely by either the sponsor or the manager.

Nevertheless, peer group results have to be interpreted with care if the wrong conclusions are to be avoided. The user needs to consider the following biases:

Composition bias: Measurement service providers gather information from various sources, which can (and does) result in the quartile breaks being different for each service – even though the title of each grouping appears the same from each provider. A manager giving the same information on the same fund to two service providers could place in the second quartile with one and in the third quartile with the other.

Classification bias: The service providers try to find a home for every fund submitted; given different methodologies, the same manager could fall into the "growth" category in one service and the "market-oriented" category in another. As well, a successful management organization that specializes in mid-cap stocks could grow its assets to the point where the manager is forced into a higher-cap box. There can also be style drift, as markets force managers to adjust their style.

Construction bias: Measurement service providers also have different criteria in the construction of their various samples (e.g., equal-weighted sample versus a fund-weighted sample, pooled funds versus segregated funds, specialty funds versus components of a balanced fund, manager funds versus plan sponsor funds, etc.).

Selection bias: Managers, generally, have control over the timing of when they want to include a fund within a sample. It is unlikely that a manager will choose to include a fund that is not doing well if it is at the manager’s discretion.

Survivorship bias: This occurs when funds (typically, poor-performing funds) are removed from a sample; if their history is removed, only the better-performing funds remain – resulting in the historical Median Fund return moving up (the sketch after this list illustrates the effect). The good news is that most services now retain the performance records of funds that have been withdrawn, for whatever reason.

Size bias: The size of the sample matters. Over the years, the variance between the quartile breaks has become larger. As a result, some services have eliminated the outliers by removing the top and bottom 5% of the sample participants. As manager styles have become more specific, more samples have emerged, and the samples themselves have become smaller – outliers can skew these small samples.

Inclusion bias: Measurement services attempt to be the first to publish their samples (a branding issue, as the first out gets quoted in the papers). As a result, funds that did not get their information in on time might be excluded from the sample they were in during the previous quarter, creating a sampling error. As well, there might be direct constraints placed on a portfolio that could be detrimental to overall performance during specific time periods.
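A minimal sketch of the survivorship-bias effect described above – the fund returns are invented purely for illustration:

```python
# Sketch: survivorship bias. Dropping the history of poor performers
# pushes the historical median return of the sample upward.
# All fund returns are invented for illustration.

from statistics import median

returns = [9.2, 8.5, 7.8, 7.1, 6.4, 5.0, 3.2, 1.9]  # annual returns (%) of 8 funds

print(f"Median with all funds:    {median(returns):.2f}%")   # 6.75%

# The two worst funds close and their histories are removed from the sample.
survivors = sorted(returns)[2:]
print(f"Median of survivors only: {median(survivors):.2f}%")  # 7.45%
```

Nothing about the surviving managers' skill changed, yet the historical "Median Fund" hurdle just rose by 70 basis points.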

There is also a clear conflict of interest if measurement services charge money managers a fee to be included in their samples. Some managers might not be willing to pay, in which case the sample might not be fully representative of the universe of money managers available to investors.

One last difficulty in comparing managers within a peer group sample is the difference even one quarter can make. A manager who placed in the third quartile last quarter could move into the second quartile this quarter on a four-year moving-average basis, with just the elimination of one bad quarter and the addition of a good one – as the sketch below illustrates.
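A minimal sketch of this rolling-window effect, using invented quarterly returns:

```python
# Sketch: how one quarter rolling out of a four-year (16-quarter) window
# can transform a manager's moving-average return. Returns are invented.

import math

def annualized(quarterly_returns):
    """Geometrically link quarterly returns and annualize the result."""
    growth = math.prod(1 + r for r in quarterly_returns)
    years = len(quarterly_returns) / 4
    return growth ** (1 / years) - 1

# 17 quarters of history: one bad quarter (-12%) at the start,
# one good quarter (+8%) at the end, steady +2% in between.
history = [-0.12] + [0.02] * 15 + [0.08]

last_quarter_window = history[:16]  # window ending last quarter (includes the -12%)
this_quarter_window = history[1:]   # window ending this quarter (-12% has rolled off)

print(f"4-year return, last quarter: {annualized(last_quarter_window):.2%}")
print(f"4-year return, this quarter: {annualized(this_quarter_window):.2%}")
```

The same manager, with essentially the same process, jumps from roughly 4% to roughly 10% annualized simply because the window moved one quarter.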

I believe there are enough concerns with peer group sampling that it should not be the primary reason for managers to be either hired or fired. Qualitative evaluation of people, beliefs, processes and portfolio construction is equally important in making these decisions. Yes, I know, investment committees are unlikely to hire managers who are currently in the third or fourth quartile (even though it might be the best thing for the performance of the pension fund). However, I believe investment committees are often too quick to fire managers who fall into the third, or even fourth, quartile. At the very least, try to determine whether the result was caused by the manager's investment style, a naturally unfavourable environment, apparent bad luck or genuinely reduced skill.

Index Comparisons

Comparing performance results against an index meets the CFA Institute's criteria for an effective benchmark. However, when selecting an index as a proxy for measuring a manager's performance, the main concern should be whether it is representative of the manager's style/approach, against which one is attempting to assess skill. Index creators have their own objectives. As an example, our current S&P/TSX Index was not designed for the sole purpose of providing an "effective" benchmark for measuring the performance of Canadian money managers (or even as a proxy for pension fund fiduciaries to use when formulating investment policy). It was designed to fit nicely as a component within the global equity index created by Standard & Poor's. By virtue of the Canadian economy, this Index is quite concentrated by sector – more so than one would seek in a well-diversified portfolio. How do you fairly compare money managers against an Index that is so skewed?

On the fixed income side, the vast majority of funds are compared to the Scotia Bond indexes. Again, these indexes, typically, are not representative of a manager's portfolio in either term or sector weightings. As with equities, the user has to assess whether it is both feasible and necessary to create a custom index, or whether to use the standard products on offer while being fully cognizant of their construction and its implications for performance comparisons.

Statistical Techniques

Users of time series data need to go one step further than is commonly done – asking themselves and their suppliers: "What is the statistical significance (if any) of the value added, information ratio or other deviations from benchmark performance?" The standard 95% statistical confidence test is far too demanding for anything as uncertain and evolutionary as investment performance. However, a "businessman's" test of 67% confidence (i.e., a two-out-of-three chance that the observed result is not random) is surely relevant. Another powerful tool – one that avoids spending excessive time on short-term variability and noise rather than signal – is to mutually establish "tolerance bands" with the manager for, say, 1-year, 3-year, 5-year and cumulative results, and then to focus analysis and evaluation on results which fall outside those bands (either above or below).
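As a rough sketch of what such a significance check might look like – the annual value-added figures below are invented, and the simple t-test framing is one common way to pose the question, not necessarily the only one:

```python
# Sketch: is a manager's observed value added distinguishable from luck?
# A simple one-sample t-test on annual active returns (invented data).

import math

active_returns = [0.031, -0.012, 0.024, 0.008, -0.004, 0.019]  # annual value added

n = len(active_returns)
mean = sum(active_returns) / n
var = sum((r - mean) ** 2 for r in active_returns) / (n - 1)
tracking_error = math.sqrt(var)

information_ratio = mean / tracking_error
t_stat = information_ratio * math.sqrt(n)  # t is roughly IR * sqrt(years observed)

print(f"Mean value added:  {mean:.2%} per year")
print(f"Tracking error:    {tracking_error:.2%}")
print(f"Information ratio: {information_ratio:.2f}")
print(f"t-statistic:       {t_stat:.2f}")

# A two-sided 95% test needs t of roughly 2 for large samples; the one-sided
# "businessman's" two-out-of-three test needs only t of roughly 0.44 -
# a far lower, and arguably more realistic, hurdle for investment data.
```

Six years of a respectable information ratio still falls short of the 95% bar here, which is exactly the point: insisting on academic confidence levels means almost no manager record is ever "significant".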

Summary

There are a number of factors which must be considered when setting an appropriate hurdle rate for effectively judging the success of an investment policy/strategy or an investment manager, as the case may be. They are:

  1. Determining what needs to be measured and how to evaluate the results. The inflation-driven bogey was originally selected to measure the success of the plan itself; however, it became a useless tool for evaluating the performance of the money manager. Relative rankings can mislead without sufficient knowledge of the make-up of each sample. At the same time, comparisons against either an index or a custom-designed benchmark rely on a complete understanding of the methodology, inclusions and representativeness of the index used, and a statistical understanding of the result.
  2. The selected time frame is critical. Why four years? Not all asset classes are equal – managers of some asset classes (e.g., real estate, small cap, high yield, infrastructure, private equity, etc.) require longer time horizons to demonstrate investment success. Managers of longer-term fundamental strategies, or those making larger or less frequent "bets", require longer periods before success or failure can be judged than those making multiple and/or frequent bets, or those with lower tracking error versus the benchmark. As well, for some asset classes, determining an effective benchmark might be difficult.
  3. To justify "active" management, a value-added target is linked to the specific index used to measure a manager's success – demonstrating why active management was selected over a passive alternative. For a core common stock portfolio, the value-added target is, typically, 200 basis points above the index. With the historical return of the Canadian and U.S. stock markets at close to 10% per annum, this value-added target represents 20% of the historical return. Should markets in future years deliver half the long-term average, the value added to be delivered would represent 40% (i.e., the success or otherwise of active management becomes a much larger share of the total return outcome – the arithmetic is sketched below). Determining the right benchmark and value-added target sends a tell-tale signal to the manager and, if not done properly, can send the wrong message.
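The arithmetic behind that growing share, sketched briefly with the figures from the text:

```python
# Sketch: a fixed value-added target becomes a much larger share of the
# total outcome when market returns fall. Figures follow the text above.

value_added_target = 0.02  # 200 basis points over the index

for market_return in (0.10, 0.05):  # historical ~10% vs. a halved 5% market
    share = value_added_target / market_return
    print(f"Market at {market_return:.0%}: target is {share:.0%} of the total return")
```

The same 200-basis-point promise that looked modest against a 10% market becomes 40% of the entire outcome in a 5% world – a much heavier burden on manager skill.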

As a consultant, I had one client that gave its manager full discretion to invest in either the Canadian or the U.S. stock market. Back then, the benchmark was the Toronto Stock Exchange (TSE) Composite plus 400 basis points or the S&P 500 plus 400 basis points. In my first meeting with the plan sponsor after being retained, the existing manager talked about how attractive the U.S. market was compared to the Canadian market. I looked at the portfolio and it was 100% invested in Canadian stocks. When asked why, the manager stated that he believed he had a better chance of outperforming the Canadian market by 400 basis points (which we also thought was impossible) than of outperforming the S&P 500 by 400 basis points. The manager could not do what he felt was best for the client, given the client's performance objective.

One last point here: a study by Goyal and Wahal titled "The Selection and Termination of Investment Management Firms by Plan Sponsors", published in the August 2008 issue of the Journal of Finance, concluded that the vast majority of replacement managers who were hired underperformed the managers that were fired over the subsequent two-year period.

Conclusion

Determining the appropriate benchmark for measuring success (or lack thereof) and interpreting the outcome wisely is both an art and a science. The majority of tools have their flaws. Yet investment committees and investors must make decisions to hire, fire and retain – and they cannot avoid using some form of benchmarking to do so.

Untimely firing and inappropriate hiring are costing pension and investment funds millions – and in a DC world this cost will be borne by individuals who have little influence over the choices of active managers on the "menu". In the example I used above, the benchmark design cost the client $23 million over the previous four-year period – an opportunity cost.

The cost of changing managers is estimated to be around 1% of the portfolio value of the specific mandate. On a $100 million portfolio, this represents around $1 million – not counting the fee charged by the service provider for finding a new manager.

The selection of the benchmark and value-added target is as important as the management of the investment portfolio.

Make sure it sends the right message.  And, then, make sure performance against these metrics is evaluated accurately, patiently and fairly.

The author would like to thank his partner, Colin Carlton, for his valuable insights and contributions.