
7 Dictionary-Based Text Analysis

7.1 Advantages of Dictionary-Based Text Analysis

Generally speaking, there are two ways to identify meaningful information content in text documents:

  1. Using machine learning. An algorithm first searches for distinct patterns in previously labeled text data and then applies the learned model to label the remaining unlabeled data.
  2. Using a predetermined dictionary of word categories. The text analysis relies on a researcher-prepared input dictionary, and each text document is represented by the relative frequencies of the dictionary's words in that document.

For example, if we want to automatically classify a large sample of sentences as positive or negative, we can use a supervised machine learning algorithm. First, we need to manually label a random set of sentences as positive, negative, or neutral. We can then use the labeled sentences as training data for the classification algorithm of our choice. The algorithm "learns" which words appear most frequently in neutral, positive, and negative sentences, and can apply these rules to label previously unseen sentences. Once all of a document's sentences are labeled, we can average their scores to compute the overall positivity/negativity of the document. Li (2010b) is a good example of a study that applies a machine learning algorithm to determine the tone, i.e., positive versus negative, of forward-looking sentences in SEC periodic reports.

Alternatively, to determine the tone of a document, we can use an existing dictionary of positive and negative words and simply count the proportion of such words in the text. Dictionary-based approaches are less costly and less complex, which makes them popular among accounting and finance researchers. Moreover, a recent review of different approaches to measuring disclosure tone by Henry and Leone (2016) argues that simple dictionary-based methods often perform as well as, or better than, more sophisticated machine learning algorithms. For these reasons, this chapter focuses on analyzing text with dictionaries.

There are many different dictionaries for text analysis, such as:

  1. General dictionaries for tone/sentiment analysis: http://www.wjh.harvard.edu/~inquirer/homecat.htm For example, Tetlock (2007), Tetlock et al. (2008), and Price et al. (2012) are studies that use general dictionaries of positive and negative words to measure the tone/sentiment of financial text.
  2. Business-specific dictionaries for tone/sentiment analysis (the LM Sentiment Word Lists): https://sraf.nd.edu/textual-analysis/resources/ Loughran and McDonald (2011), Loughran and McDonald (2013), Huang et al. (2014), and Bochkay and Levine (2019), among others, are examples of studies that rely on positive and negative dictionaries in the business domain. A recent study by Heitmann et al. (2020) evaluates the accuracy of different sentiment analysis methods; we refer readers to that study to learn more about popular sentiment analysis approaches.
  3. Dictionaries of forward-looking words: Muslu et al. (2015) and Bozanic et al. (2018) develop dictionaries of forward-looking words to identify forward-looking sentences in SEC annual and quarterly filings.
  4. Dictionaries of linguistic extremity: https://inkwellanalytics.com/textart/extreme.html For example, Bochkay et al. (2020) develop a dictionary of extreme language to measure the degree of exaggeration in corporate disclosures.
  5. Dictionaries of legal terms: http://www.learnaboutlaw.com/General/glossary.htm For example, Hanley and Hoberg (2010) use a dictionary of legal terms to identify legal text in IPO prospectuses.
  6. Dictionaries of corporate governance terms: www.corp-gov.org/glossary.php3 For example, Hanley and Hoberg (2010) use a dictionary of corporate governance terms to measure the relevance of governance disclosures in IPO pricing.

Depending on your research question, you can always construct your own dictionary (e.g., Bentley et al. (2018) build a dictionary of accounting terms specific to non-GAAP reporting, Campbell et al. (2014) a dictionary of words identifying risk disclosures, Brochet et al. (2018) a dictionary of macroeconomic terms, and Larcker and Zakolyukina (2012) a dictionary of words identifying shareholder value and value creation disclosures).

7.2 Understanding the Dictionary

Before conducting large-scale text analysis based on a particular dictionary, it is important to understand the structure of the input dictionary. Do all words in the dictionary start with lowercase letters, uppercase letters, or are both possible? Are all words in the dictionary in their base form? By base form, we mean the dictionary form of the word (e.g., to deliver, to develop, to earn), excluding the "to". Or does the dictionary include both base-form words and their inflections (e.g., deliver, delivered, delivering, delivers)?

To reduce problems with dictionary-word mismatches caused by inconsistent capitalization, we strongly recommend converting the input dictionary, as well as all words in the text, to lowercase. In fact, this should be the first required step in any large-scale text analysis.
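For illustration, here is a minimal sketch of this normalization step, using a hypothetical three-word dictionary and a made-up one-sentence text:

# hypothetical dictionary terms and input text (for illustration only)
dict_terms = ["Damage", "LOSS", "gain"]
text = "The storm caused severe Damage and a large LOSS."

# convert every dictionary term and the entire text to lowercase before any matching
dict_terms = [term.lower() for term in dict_terms]
text = text.lower()

print(dict_terms)  # ['damage', 'loss', 'gain']
print(text)        # the storm caused severe damage and a large loss.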

If the input dictionary includes only base-form words, then counting the frequencies of these words in the text will substantially understate the "true" number of dictionary words in the document. This is because a simple regular expression of the form r'\b' + word + r'\b' will only find matches of the base word, while all words with different endings will be missed. For example, if an input dictionary contains 'damage' (and no other words with the same beginning) as one of its negative words, then the regular expression r'\bdamage\b' will return zero matches in the sentence "Our business could be damaged", because the ending 'ed' is not specified in the regular expression.

There are several ways to deal with this problem. First, we can write a more sophisticated regular expression that allows for different word endings in the match, as sketched below. This approach works relatively well; however, we must be careful with words whose meaning changes depending on the ending (e.g., careful versus careless). Second, we can lemmatize or stem the input text files so that all words in the input documents are in their base forms. This approach is also effective; however, lemmatization and stemming are not always 100% accurate and introduce some noise into the subsequent word counts. Finally, we can modify the original input dictionary by manually adding derived words, i.e., adding 'damage', 'damages', 'damaging', and 'damaged' to the dictionary as negative words. This approach is feasible for relatively short word lists, but can become increasingly costly for dictionaries with thousands of words.
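As an illustration of the first approach, the sketch below (a hypothetical example, not taken from any published dictionary) extends the regular expression for the base word 'damage' with a group of common endings:

import re

# (?:e|es|ed|ing) optionally covers damage, damages, damaged, and damaging
damage_regex = re.compile(r"\bdamag(?:e|es|ed|ing)\b")

sentence = "Our business could be damaged by damaging publicity."
print(damage_regex.findall(sentence))
# ['damaged', 'damaging']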

We must be even more careful when the input dictionary contains both single words and multi-word phrases. Then, in addition to different word endings, we should also consider whether other words may appear in the middle of a given phrase. For example, if an input dictionary contains "economic environment" as one of its entries, the regular expression r'\beconomic environment\b' will not produce a match in the sentence "Our performance greatly depends on economic and competitive environment." This is because 'and competitive' is not part of the regular expression, and we did not allow other words to be part of it. To allow up to two words in the middle of the phrase "economic environment", we can use the regular expression r'\beconomic\W+(?:\w+\W+){0,2}?environment\b'. The quantifier {0,2}? allows zero to two words between "economic" and "environment".
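The short sketch below verifies this behavior on the example sentence; the variable names are ours and serve only the illustration:

import re

# allow up to two intervening words between "economic" and "environment"
phrase_regex = re.compile(r"\beconomic\W+(?:\w+\W+){0,2}?environment\b")

sentence = "Our performance greatly depends on economic and competitive environment."
print(phrase_regex.findall(sentence))
# ['economic and competitive environment']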

7.3 Identifying Words and Sentences in Text

A large part of text analysis involves identifying words and sentences in text. In this section, we provide several examples of how to split a text document into individual words and/or sentences. For example, to count all words in a text, we can use the following Python code with a regular expression.

Input

 import re
 text = "We invested in six areas of the business that account for nearly 40% of total Macy's sales. Dresses, fine jewelry, big ticket, men's tailored, women's shoes and beauty, these investments were aimed at driving growth through great products, top-performing colleagues, improved environment and enhanced marketing. All six areas continued to outperform the balance of the business on market share, return on investment and profitability. And we capture approximately 9% of the market in these categories."
 x = re.findall(r"\b[a-zA-Z\'\-]+\b", text)
 # Regex "\b[a-zA-Z\'\-]+\b" searches for all words in text, allowing apostrophes and hyphens in words, e.g., company's, state-of-the-art
 print(x)
 print(len(x))

Output

 ['We', 'invested', 'in', 'six', 'areas', 'of', 'the', 'business', 'that', 'account', 'for', 'nearly', 'of', 'total', "Macy's", 'sales', 'Dresses', 'fine', 'jewelry', 'big', 'ticket', "men's", 'tailored', "women's", 'shoes', 'and', 'beauty', 'these', 'investments', 'were', 'aimed', 'at', 'driving', 'growth', 'through', 'great', 'products', 'top-performing', 'colleagues', 'improved', 'environment', 'and', 'enhanced', 'marketing', 'All', 'six', 'areas', 'continued', 'to', 'outperform', 'the', 'balance', 'of', 'the', 'business', 'on', 'market', 'share', 'return', 'on', 'investment', 'and', 'profitability', 'And', 'we', 'capture', 'approximately', 'of', 'the', 'market', 'in', 'these', 'categories']

Identifying sentences with regular expressions is more challenging, because sentences are more complex text units than individual words. In linguistics, a sentence is a text unit delimited by graphological features such as capital letters and markers such as periods, question marks, and exclamation points (".", "!", "?"). Therefore, to identify a sentence, we need a regular expression that searches for these features, such as r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]" used in the code below.

The following example uses the regular expression described above to identify sentences. For ease of exposition, we continue using the input text from the previous example.

Input

 # Regex pattern that identifies a sentence; re.compile compiles a regular expression pattern into a regular expression object in Python
 sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")
 
 def identify_sentences(input_text: str):
     # finds all matches of sentence_regex in input_text
     sentences = re.findall(sentence_regex, input_text)
     return sentences
 
 sentences = identify_sentences(text)
 
 # enumerate is a Python function that, when applied to a list, returns list elements along with their indexes (counter); 1 indicates that the counter should start from 1 instead of the default 0
 for counter, sentence in enumerate(sentences, 1):
     print(counter, sentence)

Output

 1 "We invested in six areas of the business that account for nearly 40% of total Macy's sales."
 2 "Dresses, fine jewelry, big ticket, men's tailored, women's shoes and beauty, these investments were aimed at driving growth through great products, top-performing colleagues, improved environment and enhanced marketing."
 3 "All six areas continued to outperform the balance of the business on market share, return on investment and profitability."
 4 "And we capture approximately 9% of the market in these categories."

Several Python libraries provide alternative ways of identifying words and sentences in text. For example, spacy is an open-source natural language processing library for Python that can identify a variety of linguistic structures. spacy can be installed with conda (if using the Anaconda distribution) or with the pip package manager:

 conda install -c conda-forge spacy
 pip install spacy

Before using spacy, we need to download its English (or other language) model. This only needs to be done once:

 python -m spacy download en_core_web_sm

The following example shows how to use spacy to identify words and sentences in text.

Input

 import spacy
 # load the English language model in spacy
 nlp = spacy.load('en_core_web_sm')
 # apply the model to parse the text into a spacy "Doc" object
 a_text = nlp(text)
 
 # create a list of word tokens; note, this list will include punctuation marks and other symbols
 token_list = []
 for token in a_text:
     token_list.append(token.text)
 print(token_list)
 
 sentences = list(a_text.sents)
 
 # print all sentences
 for counter, sentence in enumerate(sentences, 1):
     print(counter, sentence)

Output

 ['We', 'invested', 'in', 'six', 'areas', 'of', 'the', 'business', 'that', 'account', 'for', 'nearly', '40', '%', 'of', 'total', 'Macy', "'s", 'sales', '.', 'Dresses', ',', 'fine', 'jewelry', ',', 'big', 'ticket', ',', 'men', "'s", 'tailored', ',', 'women', "'s", 'shoes', 'and', 'beauty', ',', 'these', 'investments', 'were', 'aimed', 'at', 'driving', 'growth', 'through', 'great', 'products', ',', 'top', '-', 'performing', 'colleagues', ',', 'improved', 'environment', 'and', 'enhanced', 'marketing', '.', 'All', 'six', 'areas', 'continued', 'to', 'outperform', 'the', 'balance', 'of', 'the', 'business', 'on', 'market', 'share', ',', 'return', 'on', 'investment', 'and', 'profitability', '.', 'And', 'we', 'capture', 'approximately', '9', '%', 'of', 'the', 'market', 'in', 'these', 'categories', '.']
 
 1 "We invested in six areas of the business that account for nearly 40% of total Macy's sales."
 2 "Dresses, fine jewelry, big ticket, men's tailored, women's shoes and beauty, these investments were aimed at driving growth through great products, top-performing colleagues, improved environment and enhanced marketing."
 3 "All six areas continued to outperform the balance of the business on market share, return on investment and profitability."
 4 "And we capture approximately 9% of the market in these categories."

7.4 Stemming and Lemmatization

In Section 7.2, we noted that dictionaries containing only base-form words can lead to a substantial understatement of the number of dictionary words appearing in the text. Stemming and lemmatization are two common ways of mitigating this problem. Stemming is the process of reducing inflected (i.e., derived) words to their base form, called the stem. Lemmatization is a more sophisticated way of determining a word's stem; it requires determining a word's context and its meaning in the text. Intuitively, stemming and lemmatization are closely related. The main difference is that stemming operates on individual words without considering their context, whereas lemmatization requires knowledge of that context. Cecchini et al. (2010), Lang and Stice-Lawrence (2015), and Bochkay and Levine (2019) are examples of studies that apply these techniques to corporate disclosures.

In this section, we demonstrate how to stem and lemmatize text in Python. Additional information about both algorithms is available at https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

In Python, the Natural Language Toolkit (NLTK), which includes a suite of libraries and programs for natural language processing (NLP), provides stemming modules. We first need to import NLTK's stemming and lemmatization modules, and then apply the stemming and lemmatization commands to the words of interest. For example:

Input

# import Porter stemmer module
from nltk.stem import PorterStemmer
# import WordNet lemmatization module
from nltk.stem import WordNetLemmatizer
# object for Porter stemmer
stemmer = PorterStemmer()
# object for WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Then, performing stemming on single words is as simple as:
print(f"Stemming for 'increasing' is {stemmer.stem('increasing')}")
print(f"Stemming for 'increases' is {stemmer.stem('increases')}")
print(f"Stemming for 'increased' is {stemmer.stem('increased')}")

# To improve the accuracy of lemmatization, we need to provide each word's part of speech (POS); here we specify the POS as verb "v"
print(f"Lemmatization for 'increasing' is {lemmatizer.lemmatize('increasing', pos='v')}")
print(f"Lemmatization for 'increases' is {lemmatizer.lemmatize('increases', pos='v')}")
print(f"Lemmatization for 'increased' is {lemmatizer.lemmatize('increased', pos='v')}")

Output

Stemming for 'increasing' is increas
Stemming for 'increases' is increas
Stemming for 'increased' is increas
Lemmatization for 'increasing' is increase
Lemmatization for 'increases' is increase
Lemmatization for 'increased' is increase

Stemming and lemmatization at the sentence level require more work, because we need to split sentences into individual words and identify each word's part of speech. The code below demonstrates this in detail.

Input

# WordNet is just another NLTK corpus reader
from nltk.corpus import wordnet
# if you get an error that 'averaged_perceptron_tagger' has not been downloaded yet, uncomment the following line
# nltk.download('averaged_perceptron_tagger')

# import NLTK tokenizer and part-of-speech (POS) tagger
from nltk import word_tokenize, pos_tag
# import Porter stemmer class
from nltk.stem import PorterStemmer
# import WordNet lemmatizer class
from nltk.stem import WordNetLemmatizer
# a default dictionary is similar to Python's regular
# dictionary, but returns a default value if a
# requested key does not exist in the dictionary
from collections import defaultdict

# object for Porter stemmer
stemmer = PorterStemmer()
# object for WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# create a dictionary where single-letter keys are
# mapped to part-of-speech (noun, adjective, etc.)
# WordNet identifiers; by default, if a key does not
# exist in the dictionary, return noun (wordnet.NOUN)
tag_map = defaultdict(lambda: wordnet.NOUN)
# add key 'J' to the dictionary indicating adjective
tag_map['J'] = wordnet.ADJ
# add key 'V' to the dictionary indicating verb
tag_map['V'] = wordnet.VERB
# add key 'R' to the dictionary indicating adverb
tag_map['R'] = wordnet.ADV

text = "We delivered adjusted earnings per share of $2.12. For the year , comparable sales were down 0.7% on an owned plus licensed basis , and we delivered adjusted earnings per share of $2.91."

# function that stems text
def stem_text(text: str):
    # split text into (word) tokens
    tokens = word_tokenize(text)
    stemmed_text = []
    for token in tokens:
        stem = stemmer.stem(token)
        stemmed_text.append(stem)
    # concatenate stemmed tokens with a space (" ") in-between
    return " ".join(stemmed_text)

# function that lemmatizes text
def lemmatize_text(text: str):
    # splits text into tokens
    tokens = word_tokenize(text)
    lemmatized_text = []
    for token, tag in pos_tag(tokens):
        # lemmatize word tokens; tag[0] returns the POS letter identifier
        lemma = lemmatizer.lemmatize(token, tag_map[tag[0]])
        lemmatized_text.append(lemma)
    # concatenate lemmatized tokens with a space in-between
    return " ".join(lemmatized_text)

# print stemmed version of text
print(stem_text(text))
# print lemmatized version of text
print(lemmatize_text(text))

输出

"We deliv adjust earn per share of $ 2.12 . for the year , compar sale were down 0.7 % on an own plu licens basi , and we deliv adjust earn per share of $ 2.91 ."

"We deliver adjusted earnings per share of $ 2.12 . For the year , comparable sale be down 0.7 % on an owned plus licensed basis , and we deliver adjusted earnings per share of $ 2.91 ."

7.5 Word Weighting

When performing dictionary-based word counts, an important choice is how to weight individual word counts. Are all words equally relevant to the concept we are trying to measure? Or do some words carry more weight (i.e., are more important in some sense) than others?

The simplest approach is to assign equal weights to individual words, so that the total equals:

$$ Proportion\ of\ Dictionary\ Words_j = \frac{\sum_{i} Count_{i,j}}{TotalWords_j} $$

where Count_{i,j} is the number of times dictionary term i appears in document j, and TotalWords_j is the total number of words in document j.
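As a quick illustration, the sketch below computes this equally weighted proportion for a hypothetical two-word dictionary and a short made-up document:

import re

# hypothetical dictionary and document (for illustration only)
dict_terms = ["growth", "profit"]
document = "Revenue growth was strong and profit margins improved, supporting further growth."

# number of dictionary-word occurrences (exact, whole-word matches)
dict_count = sum(len(re.findall(r"\b" + term + r"\b", document)) for term in dict_terms)
# total number of words in the document
total_words = len(re.findall(r"\b[a-zA-Z\'\-]+\b", document))

print(dict_count, total_words, dict_count / total_words)
# 3 11 0.2727272727272727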

A popular alternative to weighting all words equally is to weight each word count by its document frequency. Specifically, the inverse document frequency (IDF) of word i,

$$ IDF_i = \log\left(\frac{Number\ of\ documents\ in\ the\ sample}{Number\ of\ documents\ in\ the\ sample\ containing\ word\ i}\right) $$

penalizes frequently used words and assigns greater weight to infrequently used words. For example, if "increase" appears in every document in the sample, its IDF weight is zero (i.e., log(1)). Formally, we apply the following formula to calculate the weighted proportion of dictionary words in a document:

$$ Weighted\ Proportion\ of\ Dictionary\ Words_j = \frac{\sum_{i} Count_{i,j} \times IDF_i}{TotalWords_j} $$
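A minimal sketch of these calculations over a small hypothetical corpus is shown below; the helper functions idf and weighted_proportion are our own illustrative names, and Chapter 10 presents a fuller treatment:

import math
import re

# hypothetical corpus and dictionary (for illustration only)
documents = [
    "Profit increased this quarter.",
    "Costs increased while profit declined.",
    "Revenue was flat."
]
dict_terms = ["profit", "increased"]

# IDF_i = log(number of documents / number of documents containing word i)
def idf(term, docs):
    n_containing = sum(1 for doc in docs if re.search(r"\b" + term + r"\b", doc.lower()))
    return math.log(len(docs) / n_containing)

# weighted proportion of dictionary words for a single document
def weighted_proportion(doc, terms, docs):
    total_words = len(re.findall(r"\b[a-zA-Z\'\-]+\b", doc))
    weighted_counts = sum(len(re.findall(r"\b" + term + r"\b", doc.lower())) * idf(term, docs)
                          for term in terms)
    return weighted_counts / total_words

print(weighted_proportion(documents[0], dict_terms, documents))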

Chapter 10 contains an example of how to compute tf-idf weights in Python. More information on term weighting is available at: https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html

In addition, in the context of corporate disclosures, Loughran and McDonald (2011) and Jegadeesh and Wu (2013) provide excellent examples of alternative word-weighting schemes for dictionary word counts.

7.6 Dictionary-Based Word Counting

Sections 7.1-7.5 establish the methodology for performing large-scale dictionary-based word counts in text documents. In this section, we provide illustrative examples of how to apply this methodology to calculate document tone. To make the Python code easier to follow, we include explanatory comments throughout.

We first load the dictionaries of interest. We recommend storing dictionary files as plain text (.txt), tab-delimited or comma-delimited, with each dictionary word/phrase on a separate line. In the example below, both dictionary files, "positive.txt" and "negative.txt", contain words in their base forms as well as their inflected forms (increase, increases, increasing, increased), so we do not need to perform stemming or lemmatization.

Input

import re

# Let us start with a simple tone analysis, where each word is equally weighted and we do not account for negators.

# First, we need to specify the locations of our dictionary files.
# file path (location) to a text file with positive words; every word is on a separate line in the file
positive_words_dict = r"./dictionaries/positive.txt"
# file path to a text file with negative words
negative_words_dict = r"./dictionaries/negative.txt"

# To be able to match all positive and negative words from the dictionaries, we need to create a list of regular expressions corresponding to these words

# The following function reads all dictionary terms into a Python list and converts the terms to regular expressions

def create_dict_regex_list(dict_file: str):
    """Creates a list of regex expressions of dictionary terms."""
    # opens the specified dict_file in "r" (read) mode
    with open(dict_file, "r") as file:
        # reads the content of the file line-by-line and
        # creates a list of dictionary phrases
        dict_terms = file.read().splitlines()
    # re.compile(pattern) in Python compiles a regular
    # expression pattern, which can then be used for
    # matching with re.search, re.findall, etc.;
    # by adding "\b" (i.e., a word boundary) on each
    # side of a dictionary term, we force an exact
    # match of that dictionary term
    dict_terms_regex = [re.compile(r'\b' + term + r'\b') for term in dict_terms]

    # the output of the function - in our case, a list of
    # regex expressions that correspond to the input dictionary file
    return dict_terms_regex

# Now we can apply our function to create Regex lists for positive and negative dictionary terms
positive_dict_regex = create_dict_regex_list(positive_words_dict)
negative_dict_regex = create_dict_regex_list(negative_words_dict)

# print the first three entries of each Regex dictionary
print(positive_dict_regex[0:3])
print(negative_dict_regex[0:3])

Output

[re.compile('\\bable\\b'), re.compile('\\babundance\\b'), re.compile('\\babundant\\b')]
[re.compile('\\babandon\\b'), re.compile('\\babandoned\\b'), re.compile('\\babandoning\\b')]

Next, we need to write a function that counts the positive, negative, and total words in a given text, so that we can calculate document tone (Tone) as follows:

$$ Tone(\%) = 100 \times \frac{Positive\ Word\ Count - Negative\ Word\ Count}{Total\ Word\ Count} $$

As before, to ease the exposition of our code, we add comments throughout.

Input

def get_tone(input_text: str):
    """Counts All and Specific Words in Text"""
    ### Positive Words ###
    # finds all regex matches and returns them as a
    # list of lists; the output of this search
    # will be of the following format:
    # [['able'], [], ['abundant', 'abundant'], [], ...]
    positive_words_matches = [re.findall(regex, input_text) for regex in positive_dict_regex]
    # len() measures the length of each match list;
    # the output of this list transformation
    # will be of the following format: [1, 0, 2, 0, ...]
    positive_words_counts = [len(match) for match in positive_words_matches]
    positive_words_sum = sum(positive_words_counts)
    ### Negative Words ###
    # in a similar manner, we can get word counts for
    # negative words
    # finds all matches of negative words'
    # regular expressions
    negative_words_matches = [re.findall(regex, input_text) for regex in negative_dict_regex]
    # calculates the number of matches for each
    # dictionary term regex
    negative_words_counts = [len(match) for match in negative_words_matches]
    negative_words_sum = sum(negative_words_counts)
    ### Total Words ###
    # searches for all words in text, allowing
    # apostrophes and hyphens in words, e.g.,
    # "company's", "state-of-the-art"
    total_words = re.findall(r"\b[a-zA-Z\'\-]+\b", input_text)
    # calculates the number of all words in text
    total_words_count = len(total_words)
    # Finally, we can calculate Tone
    # (expressed in % terms) as:
    tone = 100 * (positive_words_sum - negative_words_sum) / total_words_count
    return (total_words_count, positive_words_sum, negative_words_sum, tone)

# Applying our get_tone function to an input text:
counts = get_tone("At FedEx Ground, we have the market leading ecommerce portfolio. We continue to see strong demand across all customer segments with our new seven-day service. We will increase our speed advantage during the New Year. Our Sunday roll-out will speed up some lanes by one and two full transit days. This will increase our advantage significantly. And as you know, we are already faster by at least one day when compared to UPS's ground service in 25% of lanes. It is also really important to note our speed advantage and seven-day service is also very valuable for the premium B2B sectors, including healthcare and perishables shippers. Now, turning to Q2, I'm not pleased with our financial results.")

# output the results as (Total Word Count, Number of Positive Words, Number of Negative Words, Tone)
print(counts)

Output

(114, 7, 0, 6.140350877192983)

When calculating document tone, we often encounter the problem of positive and negative words being negated in the text (e.g., not bad, not good). There are several ways to handle this. The simplest is to flip the sentiment of a word that appears next to a negator (e.g., treat "not good" as negative rather than positive).

Alternatively, we can assign a new weight to negated words in some systematic way. For example, we could assign 0.5 × the original sentiment score to all negated words.

Below is a list of negators that we may want to consider in our regular expressions:

not, never, no, none, nobody, nothing, don’t, doesn’t, won’t, shan’t, didn’t, shouldn’t, wouldn’t, couldn’t, can’t, cannot, neither, nor

To account for negators in our earlier code, we need to rewrite the word-count regular expressions as follows.

Input

# First, we update our function that compiles regular expressions
def create_dict_regex_list_with_negators(dict_file: str):
    """Creates a list of regex expressions of dictionary terms."""
    with open(dict_file, "r") as file:
        # reads dictionary lines one-by-one
        dict_terms = file.read().splitlines()

    # the first capturing group in this Regex captures all possible negators, allowing for zero or one match as indicated by ? after the group; the second group captures dictionary terms
    dict_terms_regex = [re.compile(r"(not|never|no|none|nobody|nothing|don\'t|doesn\'t|won\'t|shan\'t|didn\'t|shouldn\'t|wouldn\'t|couldn\'t|can\'t|cannot|neither|nor)?\s(" + term + r")\b") for term in dict_terms]
    # returns a list of Regex expressions that correspond to the input dictionary file, allowing for negators
    return dict_terms_regex

# Now we can apply our function to create Regex lists for positive and negative dictionary terms
positive_dict_regex = create_dict_regex_list_with_negators(positive_words_dict)
negative_dict_regex = create_dict_regex_list_with_negators(negative_words_dict)

# prints the first entry of each Regex dictionary
print(positive_dict_regex[0])
print(negative_dict_regex[0])

Output

re.compile("(not|never|no|none|nobody|nothing|don\\'t|doesn\\'t|won\\'t|shan\\'t|didn\\'t|shouldn\\'t|wouldn\\'t|couldn\\'t|can\\'t|cannot|neither|nor)?\\s(able)\\b")

re.compile("(not|never|no|none|nobody|nothing|don\\'t|doesn\\'t|won\\'t|shan\\'t|didn\\'t|shouldn\\'t|wouldn\\'t|couldn\\'t|can\\'t|cannot|neither|nor)?\\s(abandon)\\b")

The updated version of our function that calculates document tone is then as follows.

Input

# calculates tone with negators
def get_tone2(input_text: str):
    """Counts All and Specific Words in Text, and checks for the presence of negators"""
    # find all words in text
    total_words = re.findall(r"\b[a-zA-Z\'\-]+\b", input_text)
    total_words_count = len(total_words)
    # Positive Words #
    # To account for negators, we separately count positive and negated positive words
    positive_word_count = 0
    negated_positive_word_count = 0

    for regex in positive_dict_regex:
        # searches for all occurrences of Regex
        matches = re.findall(regex, input_text)
        for match in matches:
            # if match is not empty
            if len(match) > 0:
                # prints the match output; this is for illustration purposes (i.e., optional)
                print(match)
            # if the first element of the match is empty, no negator is present
            if match[0] == '':
                # so, increase the count of positive words by 1
                positive_word_count += 1
            else:
                # otherwise, a negator is present, so increase the count of negated positive words by 1
                negated_positive_word_count += 1

    # If we are simply shifting the sentiment of negated positive words (from +1 to -1), then the final positive word count is just:
    positive_words_sum = positive_word_count

    # Repeat the same for Negative Words:
    negative_word_count = 0
    negated_negative_word_count = 0

    for regex in negative_dict_regex:
        # searches for all occurrences of Regex
        matches = re.findall(regex, input_text)
        for match in matches:
            # if match is not empty
            if len(match) > 0:
                print(match)
            # if the first element of the match is empty, no negator is present
            if match[0] == '':
                # so, increase the count of negative words by 1
                negative_word_count += 1
            else:
                # otherwise, a negator is present, so increase the count of negated negative words by 1
                negated_negative_word_count += 1

    # If we are simply shifting the sentiment of negated negative words (from -1 to +1), then the final negative word count is just:
    negative_words_sum = negative_word_count

    # Then, Tone is:
    tone = 100 * (positive_words_sum - negative_words_sum) / total_words_count
    return (total_words_count, positive_words_sum, negative_words_sum, tone)

# Applying the get_tone2 function to an example text:
counts = get_tone2("At FedEx Ground, we have the market leading ecommerce portfolio. We continue to see strong demand across all customer segments with our new seven-day service. We will increase our speed advantage during the New Year. Our Sunday roll-out will speed up some lanes by one and two full transit days. This will increase our advantage significantly. And as you know, we are already faster by at least one day when compared to UPS's ground service in 25% of lanes. It is also really important to note our speed advantage and seven-day service is also very valuable for the premium B2B sectors, including healthcare and perishables shippers. Now, turning to Q2, I'm not pleased with our financial results.")
# output results
print(counts)

Output

('', 'advantage')
('', 'advantage')
('', 'advantage')
('', 'leading')
('not', 'pleased')
('', 'strong')
('', 'valuable')
(114, 6, 0, 5.2631578947368425)

In summary, performing dictionary-based word counts is a relatively straightforward way to capture the information content of text. However, we must pay attention to potential issues arising from the format of the input dictionary and from the varying semantic structure of text.