7 Dictionary-Based Textual Analysis

7.1 Advantages of Dictionary-Based Textual Analysis


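  1. Using machine learning. The algorithm first looks for patterns in previously labeled text data and then applies the learned model to label the remaining unlabeled data.
  2. Using a predetermined dictionary of word categories. The analysis relies on an input dictionary prepared by the researcher, and each text document is represented by the relative frequency of the input dictionary's words in it.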

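For example, if we want to automatically classify a large sample of sentences as positive or negative, we can use a supervised machine learning algorithm. First, we manually label a random set of sentences as positive, negative, or neutral. Next, we use the labeled sentences as training data for the chosen classification algorithm. The algorithm then "learns" which words appear most frequently in neutral, positive, and negative sentences and applies this rule to label previously unseen sentences. Once all sentences in a document are labeled, we can average their scores to compute the overall positivity or negativity of the document. Li (2010b) is a good example of a study that applies a machine learning algorithm to determine the tone (positive versus negative) of forward-looking sentences in periodic SEC reports.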
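The supervised workflow described above can be sketched with a toy classifier. The training sentences, the labels, and the choice of a naive Bayes model with add-one smoothing are all assumptions made for illustration; this is not the method or the data of any study cited here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labeled training sentences (illustrative only)
training_data = [
    ("we expect strong growth next year", "positive"),
    ("demand remains strong and margins improve", "positive"),
    ("we expect losses and weak demand", "negative"),
    ("weak sales may hurt margins", "negative"),
]

word_counts = defaultdict(Counter)  # label -> word frequencies
label_counts = Counter()            # label -> number of training sentences
for sentence, label in training_data:
    word_counts[label].update(sentence.split())
    label_counts[label] += 1

# all distinct words seen during training
vocabulary = {word for counts in word_counts.values() for word in counts}

def classify(sentence: str) -> str:
    """Returns the label with the highest naive Bayes log score."""
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        # log prior of the label
        score = math.log(label_counts[label] / len(training_data))
        for word in sentence.split():
            # log likelihood of the word with add-one smoothing
            score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("strong growth expected"))   # positive
print(classify("weak demand and losses"))   # negative
```

With real data, the labeled sample would be far larger and part of it would be held out to check classification accuracy.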



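  1. General dictionaries for tone/sentiment analysis: for example, Tetlock (2007), Tetlock et al. (2008), and Price et al. (2012) use general dictionaries of positive and negative words to measure the tone/sentiment of financial text.
  2. Business-specific dictionaries for tone/sentiment analysis: Loughran and McDonald (2011) (Sentiment Word Lists), Loughran and McDonald (2013), Huang et al. (2014), and Bochkay and Levine (2019), among others, are examples of studies that rely on positive and negative dictionaries specific to the business domain. A recent study by Heitmann et al. (2020) evaluates the accuracy of different sentiment analysis methods. We refer readers to that study to learn more about popular sentiment analysis methods.
  3. Dictionaries of forward-looking words: Muslu et al. (2015) and Bozanic et al. (2018) develop dictionaries of forward-looking words to identify forward-looking sentences in annual and quarterly SEC filings.
  4. Dictionaries of linguistic degree: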


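  1. Dictionaries of legal terms: for example, Hanley and Hoberg (2010) use a dictionary of legal terms to identify legal text in IPO prospectuses.
  2. Dictionaries of corporate governance terms: for example, Hanley and Hoberg (2010) use a dictionary of corporate governance terms to measure the relevance of corporate governance disclosures in IPO pricing.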


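7.2 Understanding the Dictionary

Before running a large-scale textual analysis based on a particular dictionary, it is important to understand the structure of the input dictionary. Do all dictionary words start with a lowercase letter, an uppercase letter, or are both possible? Do all dictionary words appear in their base form? The base form is the infinitive of a word without "to" (e.g., deliver, develop, earn). Or does the dictionary include both base-form words and their variants (e.g., deliver, delivered, delivering, delivers)?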


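If the input dictionary includes only base-form words, counting the frequency of these words in text will substantially underestimate the "true" word counts in a document. This is because a simple regular expression of the form r'\b' + word + r'\b' finds matches of the base word only, and all words with different endings are ignored. For example, if an input dictionary contains 'damage' (and no other words with the same beginning) as one of its negative words, the regular expression r'\bdamage\b' returns zero matches in the sentence "Our business could be damaged" because the ending 'ed' is not specified in the regular expression.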
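The following minimal sketch reproduces this behavior with Python's built-in re module. The second test sentence is made up for illustration.

```python
import re

sentence = "Our business could be damaged."

# pattern built as r'\b' + word + r'\b' for the base form "damage"
base_pattern = re.compile(r'\b' + 'damage' + r'\b')

# no word boundary exists between "damage" and the trailing "d"
print(base_pattern.findall(sentence))                     # []
# the base form itself is matched as expected
print(base_pattern.findall("Storm damage was limited."))  # ['damage']
```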

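There are several ways to deal with this problem. First, we can write a more complex regular expression that allows different word endings in matches. This approach works relatively well; however, we must be careful with words whose meaning changes with the ending (e.g., careful and careless). Second, we can lemmatize or stem the input text files so that all words in the input documents are in their base form. This approach also works well; however, lemmatization and stemming are not always 100% accurate and introduce some noise into subsequent word counts. Finally, we can modify the original input dictionary by manually adding derived words, i.e., adding 'damage', 'damages', 'damaging', and 'damaged' to the dictionary as negative words. This approach is feasible for relatively short word lists but becomes increasingly costly for dictionaries with thousands of words.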
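As a sketch of the first approach, the pattern below lists the endings allowed after the stem "damag". The example text and the particular set of endings are assumptions chosen for illustration, and the second pattern shows why an overly permissive ending is risky.

```python
import re

text = ("Our business could be damaged. Damaging storms caused damages. "
        "Be careful, not careless.")

# allow an explicit, limited set of endings after the stem "damag"
damage_pattern = re.compile(r'\bdamag(?:e|es|ed|ing)\b', re.IGNORECASE)
print(damage_pattern.findall(text))
# ['damaged', 'Damaging', 'damages']

# caveat: accepting any ending (\w*) lumps together words with opposite meanings
care_pattern = re.compile(r'\bcare\w*\b')
print(care_pattern.findall(text))
# ['careful', 'careless']
```

Listing the endings explicitly keeps the match set auditable, at the cost of writing one such pattern per dictionary entry.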

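We must be even more careful when the input dictionary contains both single words and multi-word phrases. In that case, in addition to different word endings, we should consider whether other words may appear in the middle of a phrase. For example, if an input dictionary contains "economic environment" as one of its entries, the regular expression r'\beconomic environment\b' produces no match in the sentence "Our performance greatly depends on economic and competitive environment." This is because 'and competitive' is not part of the regular expression and we did not allow other words inside it. To allow up to two words in the middle of the phrase "economic environment", we can use the regular expression r'\beconomic\W+(?:\w+\W+){0,2}?environment\b'. The quantifier {0,2}? allows zero to two words between "economic" and "environment".

7.3 Identifying Words and Sentences in Text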
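A short sketch comparing the exact-phrase pattern with the gapped pattern follows. The last two test sentences are made up for illustration.

```python
import re

sentence = "Our performance greatly depends on economic and competitive environment."

# exact phrase: no other words allowed between the two terms
exact = re.compile(r'\beconomic environment\b')
# gapped phrase: zero to two intervening words allowed
gapped = re.compile(r'\beconomic\W+(?:\w+\W+){0,2}?environment\b')

print(exact.findall(sentence))
# []
print(gapped.findall(sentence))
# ['economic and competitive environment']
print(gapped.findall("The economic environment improved."))
# ['economic environment']
print(gapped.findall("The economic and very competitive environment improved."))
# [] because three intervening words exceed the allowed gap
```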



import re

text = "We invested in six areas of the business that account for nearly 40% of total Macy's sales. Dresses, fine jewelry, big ticket, men's tailored, women's shoes and beauty, these investments were aimed at driving growth through great products, top-performing colleagues, improved environment and enhanced marketing. All six areas continued to outperform the balance of the business on market share, return on investment and profitability. And we capture approximately 9% of the market in these categories."
# Regex r"\b[a-zA-Z\'\-]+\b" searches for all words in text, allowing apostrophes and hyphens in words, e.g., company's, state-of-the-art
x = re.findall(r"\b[a-zA-Z\'\-]+\b", text)
# print the list of matched words
print(x)
# print the number of matched words
print(len(x))


['We', 'invested', 'in', 'six', 'areas', 'of', 'the', 'business', 'that', 'account', 'for', 'nearly', 'of', 'total', "Macy's", 'sales', 'Dresses', 'fine', 'jewelry', 'big', 'ticket', "men's", 'tailored', "women's", 'shoes', 'and', 'beauty', 'these', 'investments', 'were', 'aimed', 'at', 'driving', 'growth', 'through', 'great', 'products', 'top-performing', 'colleagues', 'improved', 'environment', 'and', 'enhanced', 'marketing', 'All', 'six', 'areas', 'continued', 'to', 'outperform', 'the', 'balance', 'of', 'the', 'business', 'on', 'market', 'share', 'return', 'on', 'investment', 'and', 'profitability', 'And', 'we', 'capture', 'approximately', 'of', 'the', 'market', 'in', 'these', 'categories']
73




# Regex pattern that identifies a sentence: it starts with a capital letter and ends with ".", "!" or "?", while a period followed by a digit (e.g., 2.12) does not end the sentence
# re.compile compiles a regular expression pattern into a regular expression object
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")

def identify_sentences(input_text: str):
    # finds all matches of sentence_regex in input_text
    sentences = re.findall(sentence_regex, input_text)
    return sentences

sentences = identify_sentences(text)
# enumerate is a Python function that, when applied to a list, returns list elements along with their indexes (counter); 1 indicates that the counter should start from 1 instead of the default 0
for counter, sentence in enumerate(sentences, 1):
    print(counter, sentence)


1 We invested in six areas of the business that account for nearly 40% of total Macy's sales.
2 Dresses, fine jewelry, big ticket, men's tailored, women's shoes and beauty, these investments were aimed at driving growth through great products, top-performing colleagues, improved environment and enhanced marketing.
3 All six areas continued to outperform the balance of the business on market share, return on investment and profitability.
4 And we capture approximately 9% of the market in these categories.


 conda install -c conda-forge spacy
 pip install spacy


 python -m spacy download en_core_web_sm



import spacy
# load the English language model in spaCy
nlp = spacy.load('en_core_web_sm')
# apply the "nlp" pipeline to parse a textual document
a_text = nlp(text)
# create a list of word tokens; note, this list will include punctuation marks and other symbols
token_list = []
for token in a_text:
    token_list.append(token.text)
print(token_list)
# create a list of sentences identified by spaCy
sentences = list(a_text.sents)
# print all sentences
for counter, sentence in enumerate(sentences, 1):
    print(counter, sentence)


['We', 'invested', 'in', 'six', 'areas', 'of', 'the', 'business', 'that', 'account', 'for', 'nearly', '40', '%', 'of', 'total', 'Macy', "'s", 'sales', '.', 'Dresses', ',', 'fine', 'jewelry', ',', 'big', 'ticket', ',', 'men', "'s", 'tailored', ',', 'women', "'s", 'shoes', 'and', 'beauty', ',', 'these', 'investments', 'were', 'aimed', 'at', 'driving', 'growth', 'through', 'great', 'products', ',', 'top', '-', 'performing', 'colleagues', ',', 'improved', 'environment', 'and', 'enhanced', 'marketing', '.', 'All', 'six', 'areas', 'continued', 'to', 'outperform', 'the', 'balance', 'of', 'the', 'business', 'on', 'market', 'share', ',', 'return', 'on', 'investment', 'and', 'profitability', '.', 'And', 'we', 'capture', 'approximately', '9', '%', 'of', 'the', 'market', 'in', 'these', 'categories', '.']
1 We invested in six areas of the business that account for nearly 40% of total Macy's sales.
2 Dresses, fine jewelry, big ticket, men's tailored, women's shoes and beauty, these investments were aimed at driving growth through great products, top-performing colleagues, improved environment and enhanced marketing.
3 All six areas continued to outperform the balance of the business on market share, return on investment and profitability.
4 And we capture approximately 9% of the market in these categories.

7.4 Stemming and Lemmatization





# import Porter stemmer class
from nltk.stem import PorterStemmer
# import WordNet lemmatizer class
from nltk.stem import WordNetLemmatizer
# object for Porter stemmer
stemmer = PorterStemmer()
# object for WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Then, performing stemming on single words is as simple as:
print(f"Stemming for 'increasing' is {stemmer.stem('increasing')}")
print(f"Stemming for 'increases' is {stemmer.stem('increases')}")
print(f"Stemming for 'increased' is {stemmer.stem('increased')}")

# To improve the accuracy of lemmatization, we need to provide each word's part of speech (POS); here we specify POS as verb ("v")
print(f"Lemmatization for 'increasing' is {lemmatizer.lemmatize('increasing', pos='v')}")
print(f"Lemmatization for 'increases' is {lemmatizer.lemmatize('increases', pos='v')}")
print(f"Lemmatization for 'increased' is {lemmatizer.lemmatize('increased', pos='v')}")


Stemming for 'increasing' is increas
Stemming for 'increases' is increas
Stemming for 'increased' is increas
Lemmatization for 'increasing' is increase
Lemmatization for 'increases' is increase
Lemmatization for 'increased' is increase



# WordNet is just another NLTK corpus reader
from nltk.corpus import wordnet
# If you get an error saying 'averaged_perceptron_tagger' has not been downloaded yet, uncomment the following two lines
# import nltk
# nltk.download('averaged_perceptron_tagger')

# import NLTK tokenizer and part-of-speech (POS) tagger
from nltk import word_tokenize, pos_tag
# import Porter stemmer class
from nltk.stem import PorterStemmer
# import WordNet lemmatizer class
from nltk.stem import WordNetLemmatizer
# default dictionary is similar to Python's regular
# dictionary, but allows the dictionary to return a
# default value if a requested key does not exist in
# the dictionary
from collections import defaultdict
# object for Porter stemmer
stemmer = PorterStemmer()
# object for WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# create a dictionary where single-letter keys are
# mapped to part of speech (noun, adjective, etc.)
# WordNet identifiers; by default, if a key does not
# exist in the dictionary, return noun (wordnet.NOUN)
tag_map = defaultdict(lambda: wordnet.NOUN)
# add key 'J' to the dictionary indicating adjective
tag_map['J'] = wordnet.ADJ
# add key 'V' to the dictionary indicating verb
tag_map['V'] = wordnet.VERB
# add key 'R' to the dictionary indicating adverb
tag_map['R'] = wordnet.ADV

text = "We delivered adjusted earnings per share of $2.12. For the year , comparable sales were down 0.7% on an owned plus licensed basis , and we delivered adjusted earnings per share of $2.91."

# function that stems text
def stem_text(text: str):
    # split text into (word) tokens
    tokens = word_tokenize(text)
    stemmed_text = []
    for token in tokens:
        stem = stemmer.stem(token)
        stemmed_text.append(stem)
    # concatenate stemmed tokens with
    # a space (" ") in between
    return " ".join(stemmed_text)

# function that lemmatizes text
def lemmatize_text(text: str):
    # split text into tokens
    tokens = word_tokenize(text)
    lemmatized_text = []
    for token, tag in pos_tag(tokens):
        # lemmatize word tokens; tag[0] returns the POS
        # letter identifier
        lemma = lemmatizer.lemmatize(token, tag_map[tag[0]])
        lemmatized_text.append(lemma)
    # concatenate lemmatized tokens with
    # a space in between
    return " ".join(lemmatized_text)

# print stemmed version of text
print(stem_text(text))
# print lemmatized version of text
print(lemmatize_text(text))


we deliv adjust earn per share of $ 2.12 . for the year , compar sale were down 0.7 % on an own plu licens basi , and we deliv adjust earn per share of $ 2.91 .

We deliver adjusted earnings per share of $ 2.12 . For the year , comparable sale be down 0.7 % on an owned plus licensed basis , and we deliver adjusted earnings per share of $ 2.91 .

7.5 Word Weighting



$$ \text{Proportion of Dictionary Words}_j = \frac{\sum_i \text{Count}_{i,j}}{\text{TotalWords}_j} $$


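A common alternative to weighting all words equally is to weight each word count by its inverse document frequency. Specifically, the inverse document frequency (IDF) of word i,

$$ \text{IDF}_i = \log\left(\frac{\text{Number of documents in the sample}}{\text{Number of documents in the sample containing word } i}\right) $$

penalizes commonly used words and gives greater weight to less common ones. For example, if "increase" appears in every document in the sample, its IDF weight is zero (i.e., log(1)). Formally, we apply the following formula to compute the weighted proportion of dictionary words in document j:

$$ \text{Weighted Proportion of Dictionary Words}_j = \frac{\sum_i \text{Count}_{i,j} \times \text{IDF}_i}{\text{TotalWords}_j} $$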
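The IDF weighting above can be sketched on a toy sample. The three documents, the two-word dictionary, and the helper names tokenize and weighted_proportion are hypothetical and serve only to illustrate the formulas; the natural logarithm is assumed.

```python
import math
import re

# Toy sample of three documents and a two-word dictionary (both hypothetical)
documents = [
    "sales increase and profits increase",
    "costs increase while margins improve",
    "we expect sales to increase",
]
dictionary = ["increase", "improve"]

def tokenize(text: str):
    """Splits text into lowercase word tokens."""
    return re.findall(r"\b[a-zA-Z'\-]+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# IDF_i = log(number of documents / number of documents containing word i)
idf = {}
for word in dictionary:
    doc_freq = sum(1 for tokens in tokenized if word in tokens)
    # a word that appears in no document gets zero weight here (an assumption)
    idf[word] = math.log(n_docs / doc_freq) if doc_freq > 0 else 0.0

def weighted_proportion(tokens):
    """Sum of Count_{i,j} * IDF_i over dictionary words, divided by TotalWords_j."""
    weighted_count = sum(tokens.count(word) * idf[word] for word in dictionary)
    return weighted_count / len(tokens)

print(idf)
for j, tokens in enumerate(tokenized, 1):
    print(j, round(weighted_proportion(tokens), 4))
```

Because "increase" appears in all three toy documents, its IDF is log(1) = 0, so only "improve" contributes to the weighted proportions.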



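7.6 Dictionary-Based Word Counts


We first load the dictionaries whose words we want to count. We recommend storing dictionary files in plain text (.txt) format, tab- or comma-delimited, with each dictionary word or phrase on a separate line. In the example below, the dictionary files "positive.txt" and "negative.txt" contain both base-form words and their inflected forms (increase, increases, increasing, increased), so we do not need to perform stemming or lemmatization.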


import re

# Let us start with a simple tone analysis, where each word is equally weighted and we do not account for negators.

# First, we need to specify the locations of our dictionary files.
# file path (location) to a text file with positive words; every word is on a separate line in the file
positive_words_dict = r"./dictionaries/positive.txt"
# file path to a text file with negative words
negative_words_dict = r"./dictionaries/negative.txt"

# To be able to match all positive and negative words from the dictionaries, we need to create a list of regular expressions corresponding to these words

# The following function reads all dictionary terms into a Python list and converts the terms to regular expressions

def create_dict_regex_list(dict_file: str):
    """Creates a list of regex expressions of dictionary terms."""
    # opens the specified dict_file in "r" (read) mode
    with open(dict_file, "r") as file:
        # reads the content of the file
        # line-by-line and creates a list of
        # dictionary phrases
        dict_terms = file.read().splitlines()
    # re.compile(pattern) compiles a regular
    # expression pattern, which can then be used for
    # matching with re.findall, etc.
    # by adding r"\b" (i.e., word boundary) on each
    # side of a dictionary term, we force
    # an exact match of that dictionary term
    dict_terms_regex = [re.compile(r'\b' + term + r'\b') for term in dict_terms]

    # specifies the output of the function: in our
    # case, a list of regex expressions that
    # correspond to the input dictionary file
    return dict_terms_regex

# Now we can apply our function to create regex lists for positive and negative dictionary terms
positive_dict_regex = create_dict_regex_list(positive_words_dict)
negative_dict_regex = create_dict_regex_list(negative_words_dict)

# print the first three entries of each regex dictionary
print(positive_dict_regex[0:3])
print(negative_dict_regex[0:3])


[re.compile('\\bable\\b'), re.compile('\\babundance\\b'), re.compile('\\babundant\\b')]
[re.compile('\\babandon\\b'), re.compile('\\babandoned\\b'), re.compile('\\babandoning\\b')]


$$ \text{Tone}(\%) = 100 \times \frac{\text{Positive Word Count} - \text{Negative Word Count}}{\text{Total Word Count}} $$



def get_tone(input_text: str):
    """Counts all and specific words in text."""
    ### Positive Words ###
    # finds all regex matches and returns them as a
    # list of lists, so the output of this search
    # will be of the following format:
    # [['able'], [], ['abundant', 'abundant'], [], ...]
    positive_words_matches = [re.findall(regex, input_text) for regex in positive_dict_regex]
    # len() measures the length of each list of matches,
    # so the output of this list transformation
    # will be of the following format: [1, 0, 2, 0, ...]
    positive_words_counts = [len(match) for match in positive_words_matches]
    positive_words_sum = sum(positive_words_counts)
    ### Negative Words ###
    # in a similar manner, we can get word counts for
    # negative words
    # finds all matches of negative words'
    # regular expressions
    negative_words_matches = [re.findall(regex, input_text) for regex in negative_dict_regex]
    # calculates the number of matches for each
    # dictionary term regex
    negative_words_counts = [len(match) for match in negative_words_matches]
    negative_words_sum = sum(negative_words_counts)
    ### Total Words ###
    # searches for all words in text, allowing
    # apostrophes and hyphens in words, e.g.,
    # "company's", "state-of-the-art"
    total_words = re.findall(r"\b[a-zA-Z\'\-]+\b", input_text)
    # calculates the number of all words in text
    total_words_count = len(total_words)
    # Finally, we can calculate Tone
    # (expressed in % terms) as:
    tone = 100 * (positive_words_sum - negative_words_sum) / total_words_count
    return (total_words_count, positive_words_sum, negative_words_sum, tone)

# Applying our get_tone function to an input text:
counts = get_tone("At FedEx Ground, we have the market leading ecommerce portfolio. We continue to see strong demand across all customer segments with our new seven-day service. We will increase our speed advantage during the New Year. Our Sunday roll-out will speed up some lanes by one and two full transit days. This will increase our advantage significantly. And as you know, we are already faster by at least one day when compared to UPS's ground service in 25% of lanes. It is also really important to note our speed advantage and seven-day service is also very valuable for the premium B2B sectors, including healthcare and perishables shippers. Now, turning to Q2, I'm not pleased with our financial results.")

# output the results as (Total Word Count, Number of Positive Words, Number of Negative Words, Tone)
print(counts)


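(114, 7, 0, 6.140350877192983)

When computing document tone, we often encounter positive and negative words that are negated in the text (e.g., not bad, not good). There are several ways to handle this issue. We can simply invert the sentiment of the word next to the negator. For example: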



not, never, no, none, nobody, nothing, don’t, doesn’t, won’t, shan’t, didn’t, shouldn’t, wouldn’t, couldn’t, can’t, cannot, neither, nor



# First, we update our function that compiles regular expressions
def create_dict_regex_list_with_negators(dict_file: str):
    """Creates a list of regex expressions of dictionary terms, allowing for negators."""
    with open(dict_file, "r") as file:
        # reads dictionary lines one by one
        dict_terms = file.read().splitlines()

    # the first capturing group in this regex captures all possible negators, allowing for zero or one match as indicated by ? after the group; the second group captures dictionary terms
    dict_terms_regex = [re.compile(r"(not|never|no|none|nobody|nothing|don\'t|doesn\'t|won\'t|shan\'t|didn\'t|shouldn\'t|wouldn\'t|couldn\'t|can\'t|cannot|neither|nor)?\s(" + term + r")\b") for term in dict_terms]
    # returns a list of regex expressions that correspond to the input dictionary file, allowing for negators
    return dict_terms_regex

# Now we can apply our function to create regex lists for positive and negative dictionary terms
positive_dict_regex = create_dict_regex_list_with_negators(positive_words_dict)
negative_dict_regex = create_dict_regex_list_with_negators(negative_words_dict)

# prints the first entry of each regex dictionary
print(positive_dict_regex[0])
print(negative_dict_regex[0])


re.compile("(not|never|no|none|nobody|nothing|don\\'t|doesn\\'t|won\\'t|shan\\'t|didn\\'t|shouldn\\'t|wouldn\\'t|couldn\\'t|can\\'t|cannot|neither|nor)?\\s(able)\\b")

re.compile("(not|never|no|none|nobody|nothing|don\\'t|doesn\\'t|won\\'t|shan\\'t|didn\\'t|shouldn\\'t|wouldn\\'t|couldn\\'t|can\\'t|cannot|neither|nor)?\\s(abandon)\\b")



# calculates tone with negators
def get_tone2(input_text: str):
    """Counts all and specific words in text, and checks for the presence of negators."""
    # find all words in text
    total_words = re.findall(r"\b[a-zA-Z\'\-]+\b", input_text)
    total_words_count = len(total_words)
    # Positive Words #
    # To account for negators, we separately count positive and negated positive words
    positive_word_count = 0
    negated_positive_word_count = 0

    for regex in positive_dict_regex:
        # searches for all occurrences of the regex
        matches = re.findall(regex, input_text)
        for match in matches:
            # if match is not empty
            if len(match) > 0:
                # prints the match output; this is for illustration purposes (i.e., optional)
                print(match)
            # if the first element of the match is empty, no negator is present
            if match[0] == '':
                # so, increase the count of positive words by 1
                positive_word_count += 1
            else:
                # otherwise, a negator is present, so increase the count of negated positive words by 1
                negated_positive_word_count += 1

    # Negated positive words no longer count as positive, so the final positive word count is just:
    positive_words_sum = positive_word_count

    # Repeat the same for negative words:
    negative_word_count = 0
    negated_negative_word_count = 0

    for regex in negative_dict_regex:
        # searches for all occurrences of the regex
        matches = re.findall(regex, input_text)
        for match in matches:
            # if match is not empty
            if len(match) > 0:
                print(match)
            # if the first element of the match is empty, no negator is present
            if match[0] == '':
                # so, increase the count of negative words by 1
                negative_word_count += 1
            else:
                # otherwise, a negator is present, so increase the count of negated negative words by 1
                negated_negative_word_count += 1

    # Negated negative words no longer count as negative, so the final negative word count is just:
    negative_words_sum = negative_word_count

    # Then, Tone is:
    tone = 100 * (positive_words_sum - negative_words_sum) / total_words_count
    return (total_words_count, positive_words_sum, negative_words_sum, tone)

# Applying the get_tone2 function to an example text:
counts = get_tone2("At FedEx Ground, we have the market leading ecommerce portfolio. We continue to see strong demand across all customer segments with our new seven-day service. We will increase our speed advantage during the New Year. Our Sunday roll-out will speed up some lanes by one and two full transit days. This will increase our advantage significantly. And as you know, we are already faster by at least one day when compared to UPS's ground service in 25% of lanes. It is also really important to note our speed advantage and seven-day service is also very valuable for the premium B2B sectors, including healthcare and perishables shippers. Now, turning to Q2, I'm not pleased with our financial results.")
# output results
print(counts)


('', 'advantage')
('', 'advantage')
('', 'advantage')
('', 'leading')
('not', 'pleased')
('', 'strong')
('', 'valuable')
(114, 6, 0, 5.2631578947368425)
