FRM Study. Better Risk Modeling: Is Significance Testing Valid?

By Tony Hughes

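Those who work in the social sciences, risk modelers included, are often accused of not being real scientists. These accusations usually come from physical scientists, who enjoy the luxury of repeating experiments with almost no real-world consequences. If, however, a bank were to collect data by lending to people who cannot afford to repay, purely to improve its models, most people would regard that as unacceptable.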

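Although, in my view, the non-experimental nature of risk modeling makes it far harder than rocket science, we should still aspire to be better “scientists.”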

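I am writing this article because, in mid-2019, the editors of the influential scientific journal Nature discussed whether the concept of statistical significance should be abandoned altogether. At the same time, a special issue of The American Statistician published 43 papers that challenged conventional practice, arguing that p-values have been given too much importance in statistics and should be demoted or shelved.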

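The main problem with significance testing is that it is arbitrary and dichotomous. In standard statistical practice, a p-value of 0.04 means the research is a success, because the result is “significant,” whereas 0.06 means the result is not significant. In reality there is almost no practical difference between the two: one may simply have been measured a little more precisely, or had a bit more data available. Applying a hard 0.05 cutoff to judge significance seems unscientific.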
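
To see how little separates the two cases, consider a minimal Python sketch. It is illustrative only: the point estimate of 1.0 is a hypothetical value, and the test statistic is assumed to be normally distributed. Under those assumptions, each p-value implies a standard error, which in turn implies a 95% confidence interval.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, assumed for the test statistic


def z_from_two_sided_p(p):
    """Absolute z-statistic that produces a given two-sided p-value."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def confidence_interval(estimate, p, level=0.95):
    """Interval implied by a point estimate and its two-sided p-value."""
    std_error = abs(estimate) / z_from_two_sided_p(p)
    critical = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return estimate - critical * std_error, estimate + critical * std_error


# Hypothetical coefficient estimate of 1.0 under each p-value.
for p in (0.04, 0.06):
    low, high = confidence_interval(1.0, p)
    print(f"p = {p:.2f}: 95% CI = [{low:.3f}, {high:.3f}]")
```

The two intervals cover almost the same range. One lower bound sits just above zero and the other just below, yet the 0.05 rule labels the first a finding and the second a failure.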

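I am writing this over the holidays, so allow me a personal story. When I was still a student, I collected some data to test whether “bad vintages” affected the auction prices of a premium wine in subsequent vintages.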

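This was technically very interesting, because it involved using a panel data model with lagged variables selected along two distinct time dimensions: the vintage of the wine and the date of the auction. I spent six months working through the theory and carefully estimating the model. Although this was 20 years ago, I still remember that the p-value on the coefficient was 0.063. The paper, of course, remains unpublished.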

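My own academic disappointment aside, we must remember that the editors of Nature work primarily in experimental fields, where many of the influencing factors can actually be controlled. In risk modeling, which is mostly observational, the difficulties associated with significance testing only multiply.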

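In my wine paper, the task was to test a single hypothesis: the null that the parameter on the “lagged vintage” variable is zero. Most risk modeling studies have much broader aims. Typically, analysts are asked to assess a myriad of risk factors that might affect portfolio behavior and to tell their bosses which ones matter. This is less a pure hypothesis test than a voyage of discovery.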

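The model used for these purposes does not appear out of thin air; it is built using a large number of statistical tests. This pre-testing is usually unavoidable, and it distorts the properties of the model estimates and standard errors. You think you are testing a coefficient at the 5% significance level, but the true size of the test could be 1%, 5% or 50%, and it is hard to know which.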
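
The distortion can be illustrated with a small simulation. The sketch below is a deliberately simplified stand-in for a real specification search, and every setting in it is an assumption: the response and all candidate risk factors are independent pure noise, the analyst keeps whichever candidate looks strongest, and that survivor is tested against a nominal 5% cutoff using a normal approximation.

```python
import math
import random
from statistics import NormalDist

# Two-sided 5% cutoff under a normal approximation (an assumption).
CRITICAL_Z = NormalDist().inv_cdf(0.975)


def t_stat(x, y):
    """t-statistic for the slope in a simple regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))


def rejection_rate(n_candidates, n_obs=100, n_trials=500, seed=0):
    """Share of trials in which the best-looking noise factor is 'significant'."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        # Pre-test: screen every candidate and keep the largest |t|.
        best = max(
            abs(t_stat([rng.gauss(0, 1) for _ in range(n_obs)], y))
            for _ in range(n_candidates)
        )
        if best > CRITICAL_Z:
            rejections += 1
    return rejections / n_trials


print("one candidate: ", rejection_rate(1))
print("ten candidates:", rejection_rate(10))
```

With a single candidate the rejection rate should sit near the nominal 5%. With ten independent candidates, theory puts it near 1 - 0.95^10, or about 40%, even though no factor has any real effect. Real model building involves correlated factors and sequential decisions, so the true size of the final test is much harder to pin down than in this toy setting.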

Reasonable Adjustments: Debunking Misconceptions and Embracing Uncertainty

The p-value detractors know they are fighting an uphill battle to change perceptions of statistical testing. It is so easy, and so readily accepted, to blindly apply the 5% rule that proposing a radical alternative is probably futile. The adjustments they propose are therefore very reasonable and accommodative.

One interesting suggestion from the American Statistician contributors is that academic journals agree to publish the findings of pre-registered research projects, irrespective of the statistical findings they uncover. This would mean that surprising results that happen to be negative will gain more attention, possibly spurring fruitful future research projects.

In the context of risk, this suggestion is especially useful. If analysts report suspected risk factors that ultimately prove to be statistically insignificant, this will provide a great deal of insight to model users. It seems to me that debunking bankers' misconceptions should be a core duty of empirical risk modelers; giving prominence to negative results should enable this to happen far more frequently than it does at present.

The second point concerns the language of significance testing. Rather than using a strict dichotomy between significant and insignificant results, analysts should instead “embrace uncertainty” and try to express the spectrum of outcomes that occur when testing. This means that we should avoid saying that there is “no association” upon finding an insignificant p-value and instead use terms like “insufficient evidence” of a relationship or, better yet, use confidence intervals to express the results.

This seems like a small shift, and something we could easily embrace without sparking much backlash.

The third suggestion concerns repeatability: the situation where multiple independent researchers find ostensibly the same result if they go looking. In the academic scene, the authors suggested that crowdsourcing is a good way to check this. If many researchers consider a particular question and if a plurality of opinion emerges, the result is far more likely to stand up to rigorous scrutiny.

This option is probably not available in the financial world. I can't really imagine a bank posting their data on the web with a view to crowdsourcing the final specification. That said, Fannie Mae and Freddie Mac released their data a few years ago, which sparked a huge amount of research in the private sector and academia. One imagines executives at financial institutions voraciously consuming this research and using it productively whenever the results are compelling.

Further Thoughts

I've always thought that banks should use lots of models, reflecting the fact that one model will rarely answer all the questions you might want to ask. Crowdsourcing, even if the crowd is internal, is a great way to achieve this outcome.

I'm not sure if risk modelers, or scientists more generally for that matter, will ever be able to wean themselves off p-values. That said, there are some simple steps we can take to improve communication, and these should be tried.

I'm at least 95% sure that this is the case.

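Tony Hughes is a managing director of economic research and credit analytics at Moody's Analytics. Over the past 15 years, his work has spanned the financial risk modeling field, from corporate and retail exposures to deposits and revenues. He has also worked on asset price forecasting and general macroeconomic analysis.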

This article is by Tony Hughes and is reposted from the GARP official account. Copyright belongs to the author.