7 Dictionary-Based Text Analysis

8 Quantifying Text Complexity

9 Sentence Structure and Classification


9.1 Identifying Forward-Looking Sentences



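Therefore, to find forward-looking statements (FLS) automatically, we need to write code that tests whether at least one of the three FLS conditions is true. Following the appendix "Identifying forward-looking disclosures" in Muslu et al. (2015), we first generate regular expressions that correspond to future-oriented terms. To make the code easier to follow, we include explanatory comments throughout.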


 import re
 # To identify FLS, we need a dictionary file that
 # includes future-oriented verbs and their
 # conjugations as well as terms that identify
 # references to the future. In our case, this
 # file is "fls_terms.txt."
 # file path (location) to a text file with FLS
 # terms (dictionary structure: one term per line)
 fls_terms_file = r".\dictionaries\fls_terms.txt"
 # next, create a list of regex expressions that
 # match FLS terms
 def create_fls_regex_list(fls_terms_file:str):
     """Creates a list of regex expressions of
     FLS terms"""

     # opens the specified dict_file in "r" (read) mode
     with open(fls_terms_file,"r") as file:
         # reads the content of the file line-by-line
         # and creates a list of FLS terms
         fls_terms = file.read().splitlines()

     # creates a list of FLS regex expressions by adding
     # word boundary (\b) anchors to the beginning and
     # the ending of each FLS term
     fls_terms_regex = [re.compile(r'\b' + term + r'\b') for term in fls_terms]
     return fls_terms_regex
 # creates a list of FLS regex expressions
 fls_terms_regex = create_fls_regex_list(fls_terms_file)


 [re.compile('\\bwill\\b'), re.compile('\\bfuture\\b'), re.compile('\\bnext fiscal\\b')]
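
To see why each term is wrapped in word-boundary anchors, the following minimal sketch compares matching with and without them. The term list is a hypothetical in-memory stand-in for fls_terms.txt, the helper names compile_terms and matches_any are our own, and re.escape() is an added precaution (not part of the code above) in case a term contains regex metacharacters.

```python
import re

# Hypothetical stand-in for the contents of fls_terms.txt
sample_terms = ["will", "future", "next fiscal"]

def compile_terms(terms, use_boundaries=True):
    """Compiles one regex per term, optionally with word-boundary anchors."""
    anchor = r"\b" if use_boundaries else ""
    # re.escape() guards against metacharacters inside a term
    return [re.compile(anchor + re.escape(term) + anchor) for term in terms]

def matches_any(patterns, sentence):
    """Returns True if at least one pattern is found in the sentence."""
    return any(p.search(sentence) for p in patterns)

sentence = "Management was unwilling to comment on goodwill."
# without anchors, "will" is found inside "unwilling" and "goodwill"
print(matches_any(compile_terms(sample_terms, False), sentence))  # True
# with anchors, only the standalone word "will" would count
print(matches_any(compile_terms(sample_terms, True), sentence))   # False
```

Without the anchors, a backward-looking sentence that merely contains "goodwill" would be flagged as forward-looking.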



 def is_forward_looking(sentence:str, year:int):
     """Returns whether sentence is forward-looking."""

     # creates a list of regex expressions that match
     # years up to 10 years into the future
     future_year_terms = [re.compile(r"[^$,]\b" +
                                     str(y) + r"\b")
                          for y in range(year + 1, year + 11)]

     # combines FLS regex expressions, i.e., regular
     # expressions for FLS terms and future years
     fls_terms_with_future_years = fls_terms_regex + future_year_terms

     for fls_term in fls_terms_with_future_years:
         # search() returns a match object
         # if there is a match, and "None" if there is no
         # FLS term match in the sentence
         if fls_term.search(sentence):
             return True
     return False

 #Input text - excerpt from Apple's Q4 2018
 # Earnings Conference Call Transcript
 text = """Finally, we launched a completely new website
 experience for Atlanta. The new online experience
 provides a modern and fresh brand look and includes
 enhanced simplicity and flexibility for shopping and
 buying that easily transitions to a home delivery or
 in-store experience. We are excited to put the customer
 in the driver seat. This experience is a unique and
 powerful integration of our own in-store and online
 capabilities. Keep in mind, we will continue to improve
 both the customer and associate experience in Atlanta
 and use these earnings to inform how we roll out into
 other markets. As we previously announced, we
 anticipate having the omni channel experience available
 to the majority of our customers by February 2020. To
 expand omni channel, we anticipate opening additional
 customer experience centers. We're currently in the
 process of planning the next locations while taking
 state regulations into consideration."""
 sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")
 def identify_sentences(input_text:str):
     sentences = re.findall(sentence_regex, input_text)
     return sentences

 sentences = identify_sentences(text)
 for sentence in sentences:
     print(is_forward_looking(sentence,2018),":", sentence)


 False : Finally, we launched a completely new website
 experience for Atlanta.
 False : The new online experience
 provides a modern and fresh brand look and includes
 enhanced simplicity and flexibility for shopping and
 buying that easily transitions to a home delivery or
 in-store experience.
 False : We are excited to put the customer
 in the driver seat.
 False : This experience is a unique and
 powerful integration of our own in-store and online
 capabilities.
 True : Keep in mind, we will continue to improve
 both the customer and associate experience in Atlanta
 and use these earnings to inform how we roll out into
 other markets.
 True : As we previously announced, we
 anticipate having the omni channel experience available
 to the majority of our customers by February 2020.
 True : To
 expand omni channel, we anticipate opening additional
 customer experience centers.
 False : We're currently in the
 process of planning the next locations while taking
 state regulations into consideration.
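
The future-year patterns can also be examined in isolation. The sketch below assumes the pattern form `[^$,]\b<year>\b` with a closing word boundary and uses a hypothetical helper name, mentions_future_year, with a hypothetical horizon parameter. It illustrates how the leading character class keeps a dollar amount from being read as a year.

```python
import re

def mentions_future_year(sentence, year, horizon=10):
    """Returns True if the sentence names a year in (year, year + horizon]."""
    for y in range(year + 1, year + horizon + 1):
        # [^$,] requires a preceding character that is neither "$" nor ","
        pattern = re.compile(r"[^$,]\b" + str(y) + r"\b")
        if pattern.search(sentence):
            return True
    return False

# a future year preceded by a space is matched
print(mentions_future_year("We expect completion by February 2020.", 2018))  # True
# a dollar amount that looks like a year is not
print(mentions_future_year("Costs totaled $2020 in the quarter.", 2018))     # False
# a past year is outside the range
print(mentions_future_year("Spending fell from 2017 levels.", 2018))         # False
# [^$,] must consume a character, so a year that opens the sentence is missed
print(mentions_future_year("2020 will be a strong year.", 2018))             # False
```

The last case is a limitation of this pattern form: in the full classifier such a sentence is still caught here through the term "will", but only if the term dictionary happens to contain a matching word.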

9.2 Dictionary-Based Sentence Classification




 import re
 # This code implements a simplified version of
 # sentence classification as earnings-oriented or
 # not and quantitative or not, as in Bozanic et
 # al. (2018)
 # regex for identifying sentences
 sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")
 def identify_sentences(input_text:str):
     """Returns all sentences in the input text"""
     sentences = re.findall(sentence_regex, input_text)
     return sentences
 earn_terms = ["earnings", "EPS", "income", "loss",
               "losses", "profit", "profits"]
 quant_terms = ["thousand", "thousands", "million",
                "millions", "billion", "billions",
                "percent", "%", "dollar", "dollars"]
 # creates a list of earnings regex expressions
 earn_terms_regex = [re.compile(r'\b' + term + r'\b')
                     for term in earn_terms]
 # creates a list of regexes for quantitative terms
 quant_terms_regex = [re.compile(r'\b' + term + r'\b')
                      for term in quant_terms]
 # checks if there is a match for at least one earnings
 # term in the input sentence
 def is_earn_oriented(sentence:str):
     """Checks whether a sentence is earnings-oriented."""
     for term in earn_terms_regex:
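         # flags cannot be passed together with a compiled
         # pattern, so search on the pattern string instead
         if re.search(term.pattern, sentence, re.IGNORECASE):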
             return True
     return False
 # checks if there is a match for at least one quantitative
 # term in the input sentence
 def is_quantitative(sentence:str):
     """Checks whether a sentence is quantitative
     in nature."""
     for term in quant_terms_regex:
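         # flags cannot be passed together with a compiled
         # pattern, so search on the pattern string instead
         if re.search(term.pattern, sentence, re.IGNORECASE):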
             return True
     return False
 # input text
 text = """Operating income margins, excluding the
 restructuring charges, are projected to be in the
 range of 4.5% to 4.8%, and interest expense and
 other income are forecasted to be approximately
 $18 million and $6 million, respectively. While
 operating performance is expected to remain
 strong, Agribusiness profits are expected to be
 lower in the third and fourth quarters as pricing
 for subsequent sales will not match the high level
 of the June delivery. The Company expects its
 capital expenditures in 2008 to be approximately
 $300 million, an 8% reduction from 2007 capital
 expenditures of $326 million. During the third
 quarter, the company made further progress
 implementing the strategic cost reductions that
 will support the targeted growth investments
 announced in July 2005."""
 sentences = identify_sentences(text)
 # next, we classify each sentence as earnings-
 # oriented or not, quantitative or not
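 for sentence in sentences:
     print("***Earnings-oriented:", is_earn_oriented(sentence),
           "***Quantitative:", is_quantitative(sentence),
           "---", sentence)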


 ***Earnings-oriented: True ***Quantitative: True --- Operating income margins, excluding the
 restructuring charges, are projected to be in the
 range of 4.5% to 4.8%, and interest expense and
 other income are forecasted to be approximately
 $18 million and $6 million, respectively.
 ***Earnings-oriented: True ***Quantitative: False --- While
 operating performance is expected to remain
 strong, Agribusiness profits are expected to be
 lower in the third and fourth quarters as pricing
 for subsequent sales will not match the high level
 of the June delivery.
 ***Earnings-oriented: False ***Quantitative: True --- The Company expects its
 capital expenditures in 2008 to be approximately
 $300 million, an 8% reduction from 2007 capital
 expenditures of $326 million.
 ***Earnings-oriented: False ***Quantitative: False --- During the third
 quarter, the company made further progress
 implementing the strategic cost reductions that
 will support the targeted growth investments
 announced in July 2005.
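
One detail of the quantitative term list deserves a note: `\b` marks a transition between a word character and a non-word character, so a pattern such as `\b%\b` does not match "4.5% to", where "%" is followed by a space. The sketch below shows one possible workaround, which is our assumption rather than part of Bozanic et al. (2018): word-like terms keep their anchors, symbol terms are matched literally, and everything is joined into a single case-insensitive pattern. The helper name build_term_regex is hypothetical.

```python
import re

def build_term_regex(terms):
    """Joins terms into one case-insensitive alternation pattern."""
    parts = []
    for term in terms:
        escaped = re.escape(term)
        if term[0].isalnum() and term[-1].isalnum():
            # word-like term: anchor it at word boundaries
            parts.append(r"\b" + escaped + r"\b")
        else:
            # symbol term such as "%": match it literally
            parts.append(escaped)
    return re.compile("|".join(parts), re.IGNORECASE)

quant_regex = build_term_regex(["million", "percent", "%", "dollars"])

# the anchored "%" pattern never fires on a typical percentage
print(bool(re.search(r"\b%\b", "margins of 4.5% to 4.8%")))  # False
# the combined pattern matches the literal "%"
print(bool(quant_regex.search("margins of 4.5% to 4.8%")))    # True
# matching is case-insensitive
print(bool(quant_regex.search("Revenue rose 3 PERCENT.")))    # True
print(bool(quant_regex.search("Pricing remained strong.")))   # False
```

In the sample text above the classification is unaffected, because every sentence containing "%" also contains "million".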

9.3 Identifying the Subjects and Objects of Sentences




 import spacy
 # load spacy's English language model
 nlp = spacy.load("en_core_web_sm")
 # a sample text
 text = """Q1 revenue reached $12.7 billion. We are
 thrilled with the continued growth of Apple Card.
 We experienced some product shortages due to very
 strong customer demand for both Apple Watch and
 AirPod during the quarter. Apple is looking at
 buying U.K. startup for $1 billion."""
 # parses the input text using spacy's nlp class
 parsed_text = nlp(text)
 # gets a list of sentences identified by spacy
 # property "sents" yields identified sentences
 sentences = list(parsed_text.sents)
 # recall that function enumerate() when applied
 # to a list, returns its elements along with their
 # indexes
 for num,sentence in enumerate(sentences,1):
     print("Sentence", str(num), ":", sentence)


 Sentence 1 : Q1 revenue reached $12.7 billion.
 Sentence 2 : We are
 thrilled with the continued growth of Apple Card.
 Sentence 3 : We experienced some product shortages due to very
 strong customer demand for both Apple Watch and
 AirPod during the quarter.
 Sentence 4 : Apple is looking at
 buying U.K. startup for $1 billion.

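Next, we can apply spacy's tagging methods to identify the subjects and objects in a sentence. Continuing the previous code, we create a function that extracts all (word) tokens in a sentence together with their dependency relations, and then filters them to keep only the tokens whose dependency is a "subj" (subject) or an "obj" (object).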


def sentence_subj_obj(sentence):
    """Identifies subjects and objects in a sentence"""
    results = []
    for token in sentence:
        # records the token's text and its dependency
        entry = {"Token": token.text,
                 "Dependency": token.dep_}
        results.append(entry)

    # spacy parses token dependencies and assigns a
    # dependency code for each token; tokens that are
    # either objects or subjects will include "obj" or
    # "subj" in their dependency codes; for a full list
    # of spacy's dependencies and their codes, see
    # spacy's documentation

    # creates a new list of tokens and their
    # dependencies based on results list by keeping
    # only tokens with "obj" and "subj" dependencies
    filtered_results=[entry for entry in results
                      if ('obj' in entry['Dependency']) or
                      ('subj' in entry['Dependency'])]
    return filtered_results

# recall that function enumerate() when applied to a
# list, returns its elements along with their indexes
for num,sentence in enumerate(sentences,1):
    print("Sentence", str(num), ":",
          sentence_subj_obj(sentence))


Sentence 1 : [{'Token': 'revenue', 'Dependency': 'nsubj'}, {'Token': 'billion', 'Dependency': 'dobj'}]
Sentence 2 : [{'Token': 'We', 'Dependency': 'nsubj'}, {'Token': 'growth', 'Dependency': 'pobj'}, {'Token': 'Card', 'Dependency': 'pobj'}]
Sentence 3 : [{'Token': 'We', 'Dependency': 'nsubj'}, {'Token': 'shortages', 'Dependency': 'dobj'}, {'Token': 'demand', 'Dependency': 'pobj'}, {'Token': 'Watch', 'Dependency': 'pobj'}, {'Token': 'quarter', 'Dependency': 'pobj'}]
Sentence 4 : [{'Token': 'Apple', 'Dependency': 'nsubj'}, {'Token': 'startup', 'Dependency': 'dobj'}, {'Token': 'billion', 'Dependency': 'pobj'}]
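
The filtered list can be regrouped by role. The following sketch works on plain dictionaries in the format returned by sentence_subj_obj(), copied from the Sentence 3 output above, so it runs without spacy. The helper name split_roles is hypothetical.

```python
def split_roles(entries):
    """Separates subject tokens from object tokens."""
    subjects = [e["Token"] for e in entries if "subj" in e["Dependency"]]
    objects = [e["Token"] for e in entries if "obj" in e["Dependency"]]
    return subjects, objects

# Entries copied from the Sentence 3 output
sentence_3 = [
    {"Token": "We", "Dependency": "nsubj"},
    {"Token": "shortages", "Dependency": "dobj"},
    {"Token": "demand", "Dependency": "pobj"},
    {"Token": "Watch", "Dependency": "pobj"},
    {"Token": "quarter", "Dependency": "pobj"},
]

subjects, objects = split_roles(sentence_3)
print("Subjects:", subjects)  # Subjects: ['We']
print("Objects:", objects)    # Objects: ['shortages', 'demand', 'Watch', 'quarter']
```

Note that "pobj" (object of a preposition) is grouped with "dobj" (direct object) here, exactly as the substring test in the function above does.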

The dependency structure of the sentence "Q1 revenue reached $12.7 billion." can then be visualized with spacy.


9.4 Identifying Named Entities



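Although Hope et al. (2016) perform named entity recognition (NER) with the Stanford NER tool, we can easily replicate the approach with spacy. We use the same sample text as in the previous section, together with its parsed version, parsed_text. First, we demonstrate how to identify and extract named entities from the text.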


# create a dictionary with descriptions for spacy's
# entity type codes; the list is available in
# spacy's documentation
entity_type_descriptions = {
    'PERSON':'People, including fictional.',
    'NORP':'Nationalities or religious or political groups.',
    'FAC':'Buildings, airports, highways, bridges, etc.',
    'ORG':'Companies, agencies, institutions, etc.',
    'GPE':'Countries, cities, states.',
    'LOC':'Non-GPE locations, mountain ranges, bodies of water.',
    'PRODUCT':'Objects, vehicles, foods, etc. (Not services.)',
    'EVENT':'Named hurricanes, battles, wars, sports events, etc.',
    'WORK_OF_ART':'Titles of books, songs, etc.',
    'LAW':'Named documents made into laws.',
    'LANGUAGE':'Any named language.',
    'DATE':'Absolute or relative dates or periods.',
    'TIME':'Times smaller than a day.',
    'PERCENT':'Percentage, including "%".',
    'MONEY':'Monetary values, including unit.',
    'QUANTITY':'Measurements, as of weight or distance.',
    'ORDINAL':'"first", "second", etc.',
    'CARDINAL':'Numerals that do not fall under another type.'}

# gets a list of all named entities identified
# by spacy, and output them
# property "ents" returns all identified named
# entities in the text
named_entities = parsed_text.ents
for ent in named_entities:
    # gets the named entity (ent.text)
    entity = ent.text
    # gets the named entity type code
    # (e.g., PERSON, ORG, etc.)
    entity_type = ent.label_
    # gets the named entity description from
    # entity_type_descriptions dictionary using
    # its type code
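    entity_desc = entity_type_descriptions[entity_type]
    # prints the entity, its type code, and the
    # description in aligned columns
    print(f'{entity:<15}{entity_type:<10}{entity_desc}')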



Q1             CARDINAL  Numerals that do not fall under another type.
$12.7 billion  MONEY     Monetary values, including unit.
Apple Card     ORG       Companies, agencies, institutions, etc.
Apple Watch    ORG       Companies, agencies, institutions, etc.
AirPod         ORG       Companies, agencies, institutions, etc.
the quarter    DATE      Absolute or relative dates or periods.
Apple          ORG       Companies, agencies, institutions, etc.
U.K.           GPE       Countries, cities, states.
$1 billion     MONEY     Monetary values, including unit.
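
Once entities are extracted, tallying them by type is often the next step. The sketch below uses collections.Counter on (text, type) pairs copied from the output above rather than on live spacy objects, so the pairs are a fixed illustration and not the result of running the model here.

```python
from collections import Counter

# (entity text, entity type) pairs copied from the output above
entities = [
    ("Q1", "CARDINAL"), ("$12.7 billion", "MONEY"),
    ("Apple Card", "ORG"), ("Apple Watch", "ORG"),
    ("AirPod", "ORG"), ("the quarter", "DATE"),
    ("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY"),
]

# counts how often each entity type occurs
type_counts = Counter(label for _, label in entities)

# prints the types from most to least frequent
for label, count in type_counts.most_common():
    print(f"{label:<10}{count}")
```

With a parsed document at hand, the same tally could be obtained with `Counter(ent.label_ for ent in parsed_text.ents)`.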



# counts the number of all words
# we assume that every token in a sentence is a word
# unless it is punctuation.
num_words = len([token
                 for token in parsed_text
                 if not token.is_punct])

num_entities = len(named_entities)
specificity_score = num_words / num_entities

print('Number of named entities:', num_entities)
print('Number of words:', num_words)
print('Specificity score:', specificity_score)


Number of named entities: 9
Number of words: 52
Specificity score: 5.777777777777778
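
Because the score divides by the number of named entities, a text without any entities would raise a ZeroDivisionError. A minimal sketch of a guarded version follows; the function name specificity_score_safe and the choice to return None in that case are our assumptions.

```python
def specificity_score_safe(num_words, num_entities):
    """Words per named entity, or None when no entity was found."""
    if num_entities == 0:
        return None
    return num_words / num_entities

# the counts reported above: 52 words and 9 named entities
print(specificity_score_safe(52, 9))  # 5.777777777777778
# a text with no named entities
print(specificity_score_safe(52, 0))  # None
```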

9.5 Part-of-Speech Tagging and Named Entity Recognition with Stanford NLP

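In the code above, we showed how to use the spacy library to tag text and identify named entities. Another popular suite of natural language analysis tools is Stanford NLP. For example, Hope et al. (2016) use Stanford NLP to compute text specificity. The main Python library for Stanford NLP is called Stanza. Its features include sentence and word tokenization, multi-word token expansion, lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition. Below, we demonstrate how to use Stanford NLP in Python for part-of-speech tagging and NER.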


conda install -c stanfordnlp stanza
pip install stanza


import stanza
# downloads the English module. The size of the
# downloaded module is about 400 MB. The module
# has to be downloaded only once
stanza.download('en')

# creates a (text processing) Pipeline object using
# the English language module with tokenizer, part
# of speech and named entity recognition
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,ner')



# sample text (same as in the previous example)
text = """Q1 revenue reached $12.7 billion. We are
thrilled with the continued growth of Apple Card.
We experienced some product shortages due to very
strong customer demand for both Apple Watch and
AirPod during the quarter. Apple is looking at
buying U.K. startup for $1 billion."""

# creates Stanza document object
doc = nlp(text)

# extracts sentences
sentences = doc.sentences

print('Sentences:')
# prints the first 20 characters of each sentence
for sentence in sentences:
    print(sentence.text[0:20] + '...')

print('\nWords:')
# prints all the words in the first sentence
for word in sentences[0].words:
    print(word.text)


Sentences:
Q1 revenue reached $...
We are thrilled with...
We experienced some ...
Apple is looking at ...



# outputs POS information for each word in the second sentence
for word in sentences[1].words:
    print(f'{word.text:<10} {word.pos}')


are AUX
thrilled ADJ
with ADP
the DET
continued VERB
growth NOUN
of ADP



# outputs all entities identified in the input text
for ent in doc.ents:
    print(f'{ent.text:<15} {ent.type}')


$12.7 billion MONEY
Apple Card ORG
Apple Watch ORG
AirPod ORG
the quarter DATE
Apple ORG
$1 billion MONEY

Note that the output of the code above is very similar to the output of spacy's NER tool, except for one entity (spacy also identifies "Q1" as a cardinal number).