8.1 理解文本复杂度

文本的复杂性是指阅读、写作或理解一份文件的相对难度，或这些活动的组合。会计和财务研究人员对公司信息披露的复杂性特别感兴趣。复杂的信息披露难以阅读和理解，从而增加了信息披露使用者的处理成本。研究者们提出了几个原因，说明为什么一些公司的文本披露比其他公司更复杂。例如，复杂的报告可能是管理者试图混淆信息或掩盖公司的不良表现（Li，2008；Lo等人，2017）。另一方面，复杂的披露可能是复杂的商业运作和监管报告要求的结果，而不是公司管理层故意混淆视听（Guay等人，2016；Dyer等人，2017；Chychyla等人，2019）。不管是什么原因，复杂的披露都与负面的外部性有关，如信息环境不透明（You和Zhang，2009；Lehavy等，2011）和财务误报风险增加（Filzen和Peterson，2015；Hoitash和Hoitash，2018）。

量化文本复杂性不是一项简单的任务，因为用于量化文本复杂性的方法用来量化文本复杂性的方法取决于研究问题和基础文本来源。例如，如果研究者对测量阅读一段文本的相对难度感兴趣，研究者可以利用一般的文本可读性测量方法，如迷雾指数（fog index）或Flesch-Kincaid可读性分数（Gunning等人，1952；Kincaid等人，1975）。对于一个给定的文本，这些测量方法根据每句话的平均字数和每个字的平均音节数输出可读性分数；分数越高，表明文本的复杂性越高。

然而，如果研究者对更复杂的可读性测量感兴趣，比如，寻求写作风格的复杂性，BOG指数将是一个更好的选择（见Bonsall等人，2017）。此外，如果研究目标是量化编制财务报告的难度（而不是阅读它们），那么更合适的复杂性衡量标准可以是适用于量化这些财务报告的会计准则的复杂性（见Chychyla等人，2019）。

在本章中，我们将介绍几种常用的文本复杂性测量方法，并演示如何用Python计算它们。

8.2 计算文本长度

文本长度是衡量文本复杂性的最简单，但信息量却很大。它等同于特定文本中的总字数。由于较长的文本需要更多的时间来阅读（和书写），因此它们在处理（和准备）方面也更加困难。在第七章中，为了计算语言语气，我们写了一个Python代码，用来计算给定文本中的总字数。现在我们把这段代码作为一个函数，称为count_words ，计算输入文本的字数：

 import re
 
 def identify_words ( input_text : str ):
     """ Extracts all words from a given text."""
     words = re. findall (r"\\b[a-zA -Z\\ '\\ -]+\\b",input_text )
     return words
 
 def count_words ( input_text : str ):
     """ Counts the number of words in a given text."""
     words = identify_words ( input_text )
     # calculates the number of words in a given text
     word_count = len (words)
     return word_count

在上面的代码中，函数identify_words使用正则表达式来找到并输出给定文本中的所有单词。函数count_words首先将identify_words应用于一个给定的文本，以获得该文本中所有单词的列表（存储在变量单词中）。然后，它输出该列表的长度，这相当于文本中的总字数。

对一个短文本应用count_words会产生：

输入

 # excerpt from Microsoft Corporation 's 2016 10-K.
 text = """ We acquire other companies and intangible assets and may not realize all the economic benefit from those acquisitions , which could cause an impairment of goodwill or intangibles . We review our amortizable intangible assets for impairment when events or changes in circumstances indicate the carrying value may not be recoverable . We test goodwill for impairment at least annually . Factors that may be a change in circumstances , indicating that the carrying value of our goodwill or amortizable intangible assets may not be recoverable , include a decline in our stock price and market capitalization , reduced future cash flow estimates , and slower growth rates in industry segments in which we participate . We may be required to record a significant charge on our consolidated financial statements during the period in which any impairment of our goodwill or amortizable intangible assets is determined , negatively affecting our results of operations ."""
 
 text_length = count_words (text)
 print (f" Number of words in text: { text_length }")

输出

 Number of words in text: 143

以类似的方式，我们可以编写一个函数，通过修改identify_words中的正则表达式来计算输入文本中的句子数量，以匹配第七章中讨论的句子。

输入

 def identify_sentences ( input_text : str ):
     """ Extracts all sentences from a given text."""
     sentences = re. findall (r"\\b[A-Z ](?:[^\\.!?]|\\.\\ d)*[\\.!?] ",input_text )
     return sentences
 
 def count_sentences ( input_text : str ):
     """ Counts the number of sentences in input_text ."""
     sentences = identify_sentences ( input_text )
     sentence_count = len ( sentences )
     return sentence_count
 
 num_sentences = count_sentences (text)
 print (f" Number of sentences in text: { num_sentences }")

输出

 Number of sentences in text: 5

8.3 使用迷雾指数测量文本可读性

迷雾指数（Gunning fog index）是分配给输入文本的一个数字分数，数值越大，表明阅读文本的难度越大。各年级的阅读水平大约对应于以下迷雾指数的分数。

Untitled

对于给定的文本，分数的计算是每句的平均字数和每个字的平均复杂字数（即有三个或更多音节的词）的加权和：

$$ Fog \ index= 0.4 \times [ \frac { all\ words } { all \ sentences } + 100 \times \frac {complex \ words } { all \ words } ] $$

有几个 Python 包提供了计算迷雾指数的函数。我们将在本章后面展示如何使用这些包中的一个，py-readability-metrics。然而，编写一个只依靠正则表达式来计算雾度指数的函数并不是很困难。下面我们将演示如何编写这样的函数。我们鼓励读者在学习Python和文本分析技术时，练习从头开始编写Python代码（而不是依赖可用的库）。

8.3.1 编写一个迷雾指数的函数

根据上面公式，我们需要三个数字信息来计算给定文本的迷雾指数：句子数、单词数和复杂单词数（即有三个或更多音节的单词）。我们已经知道如何计算一个给定文本的字数和句子数。然而，我们仍然需要一种方法来识别复杂的词语。一个简单的，尽管不是100%准确的方法来计算一个词中的音节，就是计算在词的开头或跟在辅音后面的元音的数量，如果元音e是该词的最后一个字符，则不包括在计算范围内。

下面的正则表达式可以用来找到所有这样的元音，当应用于单个单词时：'(^|[^aeuoiy])(?!e$)[aeouiy]'，其中。

第一部分，(^|[^aeuoiy])，是一个与单词开头相匹配的组，即^，或一个非元音字符（辅音），[^aeuoiy] 。
最后一部分，[aeuoiy] ，是一个与元音相匹配的字符集。最后一部分和第一部分一起识别出任何位于单词开头或前面有辅音的元音字符。
最后，中间部分(?!e$)是一个负向前瞻组。负查找组的语法如下(?! pattern)。如果一个负向前瞻组出现在文本中，那么该正则表达式将无法匹配其中输入的文本。在我们的例子中，负查找组中的模式是e$ ；也就是说，如果元音e是输入词中的最后一个字符，那么该模式将匹配元音e。因此，如果元音e是给定单词中的最后一个字符，中间部分确保我们的正则表达式不会捕获，那么我们的正则表达式就不会捕获它。

我们可以使用上面的正则表达式来编写一个函数count_syllables，计算给定单词中的音节数，以及函数is_complex_word，对于给定的单词，如果该单词中有三个以上的音节，则返回True，如果没有则返回False。

输入

 # regex pattern that matches vowels in a word (case - insensitive ); used for syllable count
 re_syllables = re. compile (r'(^|[^ aeuoiy ]) (?! e$)[ aeouiy ]', re. IGNORECASE )
 
 def count_syllables (word: str ):
     """ Counts the number of syllables in a word."""
     # gets all syllable regex pattern matches in the input word
     syllables_matches = re_syllables . findall (word)
     return len ( syllables_matches )
 
 def is_complex_word (word: str ):
     """ Checks whether word has three or more syllables ."""
     return count_syllables (word) >= 3

考虑下面的例子，用count_syllables和is_complex_word函数。

输入

 print (" Number of syllables in word \\" Text \\":",count_syllables ("Text"))
 print ("Is word \\" Text \\" complex :",is_complex_word ("Text"))
 print (" Number of syllables in word \\" analysis \\":",count_syllables (" analysis "))
 print ("Is word \\" analysis \\" complex ?:",is_complex_word (" analysis "))
 print (" Number of syllables in word \\" procedure \\":",count_syllables (" procedure "))
 print ("Is word \\" procedure \\" complex ?:",is_complex_word (" procedure "))

输出

 Number of syllables in word "Text": 1
 Is word "Text" complex : False
 Number of syllables in word " analysis ": 4
 Is word " analysis " complex ?: True
 Number of syllables in word " procedure ": 3
 Is word " procedure " complex ?: True

我们终于有了所有必要的工具来编写一个函数，以计算公式（8.1）中的迷雾指数得分。

def calculate_fog (text: str ):
    """ Calculates the fog index for a given text."""
    # extracts all sentences from the input text
    sentences = identify_sentences (text)
    # extracts all words from the input text
    words = identify_words (text)
    # creates a list of complex words by using is_complex_word function as a filter
    complex_words = list ( filter ( is_complex_word , words ))
    # calculates and returns the fog index
    return 0.4*( float ( len (words))/ float ( len ( sentences )) + 100* float ( len ( complex_words ))/ float ( len (words)) )

对于一个给定的输入文本，calculate_fog识别所有的句子和单词，并将它们分别存储在句子和单词的列表变量中。接下来，该函数创建了一个名为complex_words的列表，其中包括所有具有三个或更多音节的单词。最后，该函数计算并返回公式（8.1）中定义的表达式。

现在我们可以将calculate_fog函数应用于任何文本。例如，将calculate_fog应用于我们在文本长度例子中使用的文本（见上一节）将产生。

输入

fog_score = calculate_fog (text)
print ("The fog index score is ", fog_score )

输出

the fog index score is 21.78965034965035

有趣的是，例子中文本的迷雾指数约为21.8，明显高于研究生院阅读水平的预期雾度指数17。在会计文献中，Li（2008）发现，公司年度（10-K）文件的平均迷雾指数也很高，平均为19.39。

8.3.2 使用Python软件包来计算迷雾指数

有许多Python软件包可以实现文本分析技术。在前面的章节中，我们介绍了NTLK和Spacy Python库。在这一节中，我们考虑一个相对较小的Python包，叫做py-readability-metrics，它有助于计算各种文本可读性指标。这个包依赖于NTLK包，可以用pip包管理器安装，如下所示。

pip install py-readability-metrics
python -m nltk.downloader punkt

让我们看一下如何使用这个包来计算迷雾指数的例子:

输入

# Readability class provides methods to compute various readability metrics
from readability import Readability

# create a new Readability object with the example text as an input
r = Readability (text)

# calculate and output the fog index
fog_score = r. gunning_fog ()
print ( fog_score )

输出

score: 21.78965034965035 ,
grade_level : 'college_graduate '

在上面的代码中，我们从属于py-readability-metrics包的readability库中导入了一个叫做Readability的类。然后，我们用我们的例子文本作为输入，通过初始化可读性类来创建对象r。最后，执行方法r.gunning_fog()返回文本的迷雾指数分数和建议的阅读等级。

8.4 使用BOG指数测量文本可读性

在最近的一项研究中，Bonsall等人（2017）提出了一个替代的复杂性衡量标准，即BOG指数。我们认为，BOG指数更好地反映了美国证券交易委员会1998年的 "普通英语任务"[美国证券交易委员会规则421(d)]的情绪。简明英语（plain English）任务要求注册者使用 "简明英语 "编写首次公开招股说明书。美国证券交易委员会主张在财务披露中避免某些写作结构，如被动语态、薄弱或隐藏的动词、赘语、法律术语、大量的定义术语、抽象的词语、不必要的细节、冗长的句子以及难以阅读的设计和布局（证券和委员会，1998）。

迷雾指数的一个缺点是，它只抓住了美国证券交易委员会列出的纯英语的九个属性之一，即长句子。它也只抓住了抽象词的程度，即抽象词和单词中的音节数有关系。BOG指数使用一个超过200,000个单词的专有列表来衡量抽象单词，并根据出现频率进行评分。在互联网上出现频率较低的词被认为是比较抽象的。

Bog指数的一个实际缺陷是，计算Bog指数的唯一方法是通过StyleWriter程序运行文件，一次一个文件，这可能需要大量时间。然而，StyleWriter最近发布了一个版本的软件，可以一次加载大约100个文件，进行批量处理，并将汇总统计数据打印到一个输出文件中。此外，该软件的Python封装器正在开发中，以允许对更多的文件进行批量处理。10-Ks的BOG指数得分请访问 http://textart.us。