10.1 使用相似度度量对文本进行比较

文本分析中一个常见的问题是评估两个（或多个）文本之间的相似程度。会计和金融研究中的一些研究考察了披露信息的相似度。例如，Brown和Tucker（2011）发现，经济变化较大的公司在不同年份的MD&A披露的相似度较低。Kravet和Muslu（2013）发现，风险披露中更大的年度变化与披露日期周围更大的股票波动和交易量有关。Hoberg和Phillips（2016）利用10-K文件中公司产品描述的文本相似度来生成一系列行业竞争者。

文本的相似性可以在词汇和语义层面上进行定义。词汇上的相似性指的是源于常用词或字符的使用的相似性（例如，"good "和 "goodness "有相似的字符）。语义相似性指的是基于词/短语含义相似性的相似性（例如，"good "和 "nice "有相似的含义）。衡量语义相似性是一个非常困难的问题，会计和金融文献中的大多数文本相似性措施都是词汇性的。在本章中，我们将重点讨论词汇性文本相似性测量。

最后，在考虑使用哪种相似性指标时，文本的长度是一个重要因素。如果一个输入文本相对较长（5个字或更多），那么选择一个单词字的层面上操作的相似性度量是比较合适的。反之，如果文本比较短，则更适合使用在字符层面上操作的度量。在本章中，我们将演示如何使用长文本和短文本的相似度量。

10.2 长文本的文本相似性测量：余弦相似度

对于相对较长的文本，有各种相似度测量方法，如欧氏距离（Euclidean distance）、余弦相似度（cosine similarity）和Jaccard相似系数（Jaccard similarity coefficient）。大多数会计和财务研究使用余弦相似度来比较文本。因此，在本章中，我们将展示如何在Python中计算文本的余弦相似度。

10.2.1 什么是余弦相似性？

在我们定义余弦相似性之前，我们首先需要介绍用于表示文本的词袋（bag-of-words）模型。在这种方法下，每一段文本都被表示为一个单词和它们的数量的向量。例如，短语 "cash ﬂows from operations "和 "cash ﬂows from investing "可以分别表示为向量u和v，如下所示。

Untitled

在这个例子中，向量u = (1, 1, 1, 0, 1)是在上述两个短语中出现的所有单词的特征空间中定义的，在 "cash ﬂows from operations "这个短语中，每个单词的出现都是1，每个单词的缺席都是0。如果 "cash "一词在该短语中出现了两次（例如，"cash ﬂows from cash operations"），向量u将包括2个 "现金 "词组：u=（2，1，1，0，1）。向量v代表 "cash ﬂows form investing "的字数，并以类似方式定义。请注意，词的顺序、它们的语篇、句子结构和其他语言学信息并不记录在词包向量中。

两个向量u和v之间的余弦相似度被定义为这两个向量之间的余弦的角度。它可以按以下方式计算。

$$ \cos ( u , v ) = \frac { u \cdot v } { | u | | v | } = \frac { \sum _ { i = 1 } ^ { N } u _ { i } v _ { i } v _ { i } } { \sqrt { ( \sum _ { i = 1 } ^ { N } u _ { i } ^ { 2 } ) ( \sum _ { i = 1 } ^ { N } v _ { i } ) } } $$

10.2.2 在Python中把文本表示为向量

为了在Python中计算几段文本之间的余弦相似性，我们首先需要将这些文本转换为词袋向量。这个过程包括两个步骤。(1)从文本中提取单词；(2)将单个单词的数量表示为数字向量。我们使用NLTK库从文本中提取单词，使用Scikit-learn库将单词列表转换为向量。这两个库都包含在Anaconda发行版中，但如果需要，也可以通过pip包安装程序来安装。

虽然我们可以使用简单的正则表达式从文本中提取单词，但NTLK库提供了额外的文本处理能力，以改善文本相似性比较。当使用词袋方法进行余弦相似性时，研究人员通常会进行词干化，并去除停顿词（例如，Lang和Stice-Lawrence，2015）。正如第七章已经提到的，词干化是一种文本分析技术，它将单词转换为其基本形式（例如，"reporting "转换为 "report"）。此外，停用词是语言中的常见词，如 "a"、"the"、"on"、"her "等。因此，在使用词袋方法时，停用词往往被剔除，以使向量表示和比较更有意义。

为了准备用于余弦相似性比较的文本，我们首先编写一个函数，对一个给定的文本提取该文本中的所有单词，删除停用词，并对所有剩余的单词进行词干化。我们使用NLTK库中的word_tokenize（见第7章）作为起点来编写一个自定义的单词分割器。为了访问NLTK的单词分割器和停止词列表，我们需要下载两个NLTK模块，如下所示（这只需要做一次）。

 import nltk
 #download NLTK's stopwords module
 nltk.download('stopwords')
 #downlod NLTK's punkt module
 nltk.download('punkt')

NLTK的单词分割器从给定的文本中提取单词，并将其作为一个词列表输出。然而，如果输入的文本包括标点符号或撇号字符（例如逗号，感叹号或单引号），NLTK的单词分割器会将这些字符作为单独的词（除了单词之外）输出。当计算文本相似度时，我们应该排除这些标点符号，因为它们会给词包向量带来噪音。方便的是，Python包含了一个标点符号的列表；我们只需要在这个列表中加入撇号（"’"）。

输入

 # Python includes a collection of all punctuation
 # characters
 from string import punctuation
 
 # add apostrophe to the punctuation character list
 punctuation_w_apostrophe = punctuation + "’"
 
 # print all characters
 print(punctuation_w_apostrophe)

输出

 !"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~’

现在，我们可以使用NTLK的停用词列表和PorterStemmer编写一个自定义的单词分割器。

 # imports word tokenizer from NLTK
 from nltk import word_tokenize
 # imports list of stop words from NLTK
 from nltk.corpus import stopwords
 # imports Porter Stemmer module from NLTK
 from nltk.stem import PorterStemmer
 
 # creates a list of English stop words
 set_stopwords = set(stopwords.words('english'))
 # creates a Porter stemmer object
 stemmer = PorterStemmer()
 
 # creates a custom tokenizer that removes stop words,
 # punctuation, and stems the remaining words
 def custom_tokenizer(text:str):
     # gets all tokens (words) from the lower-cased
     # input text
     tokens = word_tokenize(text.lower())
     # filters out stop words
     no_sw_tokens = [t for t in tokens
                     if t not in set_stopwords]
     # filters out punctuation character tokens
     no_sw_punct_tokens = [t for t in no_sw_tokens
                           if t not in
                           punctuation_w_apostrophe]
     # stems the remaining words
     stem_tokens = [stemmer.stem(t) for t in
                    no_sw_punct_tokens]
     # returns stemmed tokens (words)
     return stem_tokens

让我们用三家电信公司的10-文件表的业务描述部分的文本摘录来演示这个分割器是如何工作的。

输入

 # excerpt from Verizon Communications Inc. 2018 10-K
 doc_verizon = """Verizon Communications Inc. (Verizon or the Company) is a holding company that, acting through its subsidiaries, is one of the world’s leading providers of communications, information and entertainment products and services to consumers, businesses and governmental agencies."""
 # excerpt from AT&T Inc. 2018 10-K
 doc_att = """We are a leading provider of communications and digital entertainment services in the United States and the world. We offer our services and products to consumers in the U.S., Mexico and Latin America and to businesses and other providers of telecommunications services worldwide."""
 # excerpt from Sprint Corporation 2018 10-K
 doc_sprint = """Sprint Corporation, including its consolidated subsidiaries, is a communications company offering a comprehensive range of wireless and wireline communications products and services that are designed to meet the needs of individual consumers, businesses, government subscribers and resellers."""
 
 tokens_verizon = custom_tokenizer(doc_verizon)
 print(tokens_verizon)
 
 tokens_att = custom_tokenizer(doc_att)
 print(tokens_att)
 
 tokens_sprint= custom_tokenizer(doc_sprint)
 print(tokens_sprint)

输出

 ['verizon', 'commun', 'inc.', 'verizon', 'compani', 'hold', 'compani', 'act', 'subsidiari', 'one', 'world', 'lead', 'provid', 'commun', 'inform', 'entertain', 'product', 'servic', 'consum', 'busi', 'government', 'agenc']
 ['lead', 'provid', 'commun', 'digit', 'entertain', 'servic', 'unit', 'state', 'world', 'offer', 'servic', 'product', 'consum', 'u.s.', 'mexico', 'latin', 'america', 'busi', 'provid', 'telecommun', 'servic', 'worldwid']
 ['sprint', 'corpor', 'includ', 'consolid', 'subsidiari', 'commun', 'compani', 'offer', 'comprehens', 'rang', 'wireless', 'wirelin', 'commun', 'product', 'servic', 'design', 'meet', 'need', 'individu', 'consum', 'busi', 'govern', 'subscrib', 'resel']

请注意，输出列表中的单词是经过词干处理的，不包括停用词。最后，我们可以使用Scikit-learn的CountVectorizer类来将文本文件转换为词包向量。

输入

 # CountVectorizer converts text to bag-of-words vectors
 from sklearn.feature_extraction.text import CountVectorizer
 
 # creates a list of three documents; one for each
 # company
 documents = [doc_verizon,doc_att,doc_sprint]
 
 # creates a CountVectorizer object with the custom
 # tokenizer
 count_vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
 
 # converts text documents to bag-of-word vectors
 count_vecs = count_vectorizer.fit_transform(documents)
 
 # prints first ten bag-of-words features (words)
 print(count_vectorizer.get_feature_names()[:10])
 
 # prints first ten bag-of-words elements (counts) for
 # each vector the output is a matrix where each row
 # represents a document vector the element (count)
 # order in each vector corresponds to the order of
 # the bag-of-word features
 print(count_vecs.toarray()[:,:10])

输出

 ['act', 'agenc', 'america', 'busi', 'commun', 'compani', 'comprehens', 'consolid', 'consum', 'corpor']
 [[1 1 0 1 2 2 0 0 1 0]
  [0 0 1 1 1 0 0 0 1 0]
  [0 0 0 1 2 1 1 1 1 1]]

在上面的代码中，我们创建了一个CountVectorizer类的新对象count_vectorizer，它将使用我们的自定义分割器（function custom_tokenizer ）来提取和处理文本文档中的单词。然后我们使用count_vectorizer将三个文本文档转换为词袋字向量。请注意，CountVectorizer按字母顺序对所有（词干）单词进行排序，然后返回输入文本文件的单词数向量。

10.2.3 计算余弦相似度

一旦我们有了词袋向量，使用Scikit-learn软件包计算余弦相似度就相对容易了。

输入

# cosine_similarity calculates cosine similarity
# between vectors
from sklearn.metrics.pairwise import cosine_similarity

# calculates text cosine similarity and stores results
# in a matrix. The matrix stores pairwise similarity
# scores for all documents, similarly to a covariance
# matrix
cosine_sim_matrix = cosine_similarity(count_vecs)

# outputs the similarity matrix
print(cosine_sim_matrix)

输出

[[1.         0.44854261 0.40768712]
 [0.44854261 1.         0.32225169]
 [0.40768712 0.32225169 1.        ]]

根据上述输出，Verizon和AT&T的文本摘录的相似度为0.4485（最相似），Verizon和Sprint的相似度为0.4077，AT&T和Sprint的相似度为0.3226（最不相似）。

除了去除停用词之外，为了在测量文本相似度时进一步减少常用词的影响，我们可以使用术语频率-反文档频率（term frequency–inverse document frequency，TF-IDF）技术。正如第七章所讨论的，反文档频率（inverse document frequency，IDF）是一种词语加权技术，常用的词语（在整个文档语料库中）被赋予较小的词语权重。TF-IDF仅仅是术语（词）频率（TF）和其反文档频率（IDF）权重的乘积。

我们可以稍微修改一下之前的代码，通过使用TfidfVectorizer类而不是CountVectorizer来创建具有IDF权重的词袋向量。前一个类会自动计算并为文档列表（语料库）中的每个文档应用IDF权重。

输入

# TfidfVectorizer converts text to TF-IDF bag-of-words
# vectors
from sklearn.feature_extraction.text import TfidfVectorizer

# creates a TfidfVectorizer object with the custom
# tokenizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer)

# converts text documents to TF-IDF vectors
tfidf_vecs = tfidf_vectorizer.fit_transform(documents)

# prints first four bag-of-words features (words)
print(tfidf_vectorizer.get_feature_names()[:4])

# prints first four bag-of-words TF-IDF counts for each
# vector. The output is a matrix where each row
# represents a document vector
print(tfidf_vecs.toarray()[:,:4]) # prints first four elements of each vector

输出

['act', 'agenc', 'america', 'busi']
[[0.22943859 0.22943859 0.         0.13551013]
 [0.         0.         0.23464902 0.13858749]
 [0.         0.         0.         0.13365976]]

请注意，输出中的词袋向量是由实数（连续）组成的，而不是整数。最后，为了计算TF-IDF向量之间的余弦相似度，我们可以使用NTLK的余cosine_similarity （就像我们之前做的那样）。

输入

# computes the cosine similarity matrix for TF-IDF
# vectors
tfidf_cosine_sim_matrix = cosine_similarity(tfidf_vecs)
# outputs the similarity matrix
print(tfidf_cosine_sim_matrix)

输出

[[1.         0.30593809 0.23499515]
 [0.30593809 1.         0.17890296]
 [0.23499515 0.17890296 1.        ]]

基于TF-IDF向量计算的相似度分数与普通的词包向量相比要小一些。这可能是由于每个文本文件中的独特词汇（如公司名称）具有更大的反文档频率权重的结果。

10.3 短文的文本相似性测量：Levenshtein距离

评估短文片断之间的相似性往往是有用的。例如，我们可能需要根据公司或人名合并两个数据集中来自不同来源的观测数据。在这种情况下，短文相似性度量比余弦相似性等措施更适用。长文本相似度测量是在单词层面上不同的是，短文本相似度测量是在字符层面上确定文本的相似度，如最长共同子串、Levenshtein距离和Jaro距离。作为短文相似性编码的一个例子，我们展示了如何计算一个流行的短文相似性测量，即Levenshtein距离。

10.3.1 Levenshtein距离的介绍

两段文字t1和t2之间的Levenshtein距离（编辑距离edit distance），计算出使t1与t2相同所需的最小单字编辑（插入、删除和替换）的数量。例如，单词 "account "和 "accounts "之间的Levenshtein距离是1，因为我们需要一个编辑来将 "account "改为 "accounts"：在单词的末尾插入s。而 "account "和 "count "之间的Levenshtein距离为2：需要从 "account "中删除两个字符才能变成 "count"。最后，"account "和 "access "之间的Levenshtein距离是4："account "中的字符out必须替换成ess，最后一个字符t必须被删除。

NLTK库提供了一个名为edit_distance的函数，用于计算两个文本之间的列文斯坦距离。

输入

# edit_distance computes Levenshtein distance between
# two pieces of text
from nltk import edit_distance

#example: account and accounts
print(edit_distance("account","accounts"))

#example: account and count
print(edit_distance("account","count"))

#example: account and access
print(edit_distance("account","access"))

输出

1
2
4

10.3.2 使用Levenshtein距离创建一个相似性测量标准

余弦相似度和Levenshtein距离测量法的输出结果之间有一个重要区别。余弦相似度输出的是0到1之间的实数，数值越高表示文本越相似；列文斯坦距离输出的是一个整数（0，1，2，等等），数值越高表示文本越不相似。另外，Levenshtein距离可能会随着输入文本的长度而机械地增加，因为较长的文本可能需要更多的编辑才能与其他文本等同。

我们可以在Levenshtein距离的基础上制定一个类似于余弦相似度的指标，即数值从0到1不等，数值越高表示相似度越高。为此，我们首先用较长的输入文本的长度来衡量列文斯坦距离。然后，我们用1减去缩放后的距离，得到一个介于0和1之间的数字，数值越大表示相似程度越高。

# similarity measure based on the Levenshtein distance
# greater values indicate more similar text
def edit_similarity(t1,t2):
    # lowercase the input strings
    (t1,t2) = (t1.lower(),t2.lower())
    # calculates the Levenshtein distance between the
    # input strings
    distance = edit_distance(t1,t2)
    # calculates length of the longest input string
    longest_text_len = max(len(t1),len(t2))
    # if both t1 and t2 are empty strings, they are
    # identical; thus return 1 as the output
    if longest_text_len == 0:
        return 1.0
    # else compute the similarity measure as
    # 1 - (levenshtein_distance / length of the longest input string)
    else:
        return (1.0 - float(distance) / float(longest_text_len))

让我们在一个例子上演示如何应用这种相似度测量。考虑一下基于公司名称的观察结果的匹配问题。标准普尔500指数公司Fidelity National Information Services的名称在Capital IQ的Compustat数据库中被记录为 "Fidelity National Info Svcs"。也就是说，最后两个词被缩短了。理想情况下，对于公司名称的缩短版本和原始版本之间，相似度测量将产生一个高的相似度。

输入

# original company name
orig_name = "Fidelity National Information Services"
# shortened company name
comp_name = "Fidelity National Info Svcs"

# calculates and outputs the Levenshtein distance
levenshtein_distance = edit_distance(orig_name,comp_name)
print("Levenshtein distance:",levenshtein_distance)

# calculates and output the similarity score based on
# Levenshtein distance
levenshtein_similarity = edit_similarity(orig_name,comp_name)
print("Levenshtein similarity score:",levenshtein_similarity)

输出

Levenshtein distance: 11
Levenshtein similarity score: 0.7105263157894737

两个版本的公司名称之间的相似度为0.71分，表明相似度很高。

10.2 使用Word2Vec词嵌入计算语义相似度

词嵌入（Word embedding）是指将词语表示为数字的向量。在第10.2.1节中，我们介绍了最简单的词嵌入方法，称为词袋法（ bag-of-words，BOW），其中文本文件是由它们所包含的单个词的频率来表示。BOW的主要优点之一是简单。然而，它的一个主要缺点是失去了上下文信息和单个词在句子中出现的顺序。

Mikolov等人（2013a）和Mikolov等人（2013b）开发的Word2Vec嵌入方法被认为是最先进的文本数字表示方法。它使用一个神经网络模型将单词转换为向量，其方式是通过其上下文，即通过与感兴趣的单词相近的单词来推断单词的含义。换句话说，如果我们有两个在同一语境中使用的词，那么这些词很可能在意义上相似或相关。例如，pleased、glad和happy经常在类似的语境中使用。Word2Vec模型在研究者中非常受欢迎，该模型的一些实际意义包括词聚类和语义相似性、同义词检测、内容分类和推荐。

在本节中，我们展示了Python中Word2Vec模型的基本例子。想了解更多关于Word2Vec及其应用的信息，我们向读者推荐开发该模型的两篇论文，即Mikolov等人（2013a）和Mikolov等人（2013b），以及关于Word2Vec的众多在线资源，包括。

Google Word2Vec
Deep Learning with Word2Vec

假设我们想为苹果公司2018年10-K报告中的MD&A部分创建一个Word2Vec模型。一旦MD&A部分从10-K报告中提取出来，我们就需要为Word2Vec预处理其内容，去除停顿词、特殊字符、数字和多余的空格，并从输入文本中提取单个单词。下面的代码总结了这些数据预处理的步骤。

输入

import re
# imports word tokenizer from NLTK
import nltk
# download NLTK 's stopwords module
nltk. download ('stopwords ')
from nltk import word_tokenize
# imports list of stop words from NLTK
from nltk. corpus import stopwords

# creates a list of English stop words
set_stopwords = set ( stopwords .words('english '))
# path to the input txt file with Apple 's 2018 MD&A

input_file = r".../ Apple_MDNA .txt"
# reads file content
file_content = open (input_file ,"r").read ()

# converts text to lowercase ; removes all special
characters , digits and extra spaces
processed_content = file_content . lower ()
processed_content = re.sub(r'[^a-zA -Z]', ' ', processed_content )
processed_content = re.sub(r'\\s+', ' ', processed_content )

# creates a list of lists of individual words - this is
the input format to Word2Vec model
processed_content = [ processed_content ]
words = [nltk. word_tokenize (e) for e in
processed_content ]

# removes stop words from the list of words
for i in range ( len (words)):
words[i] = [w for w in words[i] if w not in set_stopwords ]

现在，在识别了苹果公司MD&A部分的各个单词并保留了它们的顺序和上下文之后，我们可以使用Gensim库来建立我们的Word2Vec模型。Gensim可以用conda或pip安装，如下所示。

conda install -c anaconda gensim
pip install gensim

如上所述，使用Word2Vec方法时，苹果公司2018年MD&A中单词的上下文信息不会丢失。因此，我们可以在文件中找到单词集群。例如，根据苹果公司MD&A的措辞，我们可以找到与 "sales "这个词相似或相关的词。

输入

# imports Word2Vec from Gensim library
from gensim . models import Word2Vec

# creates a Word2Vec model , ignoring words that occur less than two times in the input text
word2vec = Word2Vec (words , min_count =2)

# identifies most related / similar words to 'sales ' based on the input text provided
related_words = word2vec .wv. most_similar ('sales ')
related_words

输出

[( 'rate ', 0.9937257766723633) ,
('interest ', 0.9924227595329285) ,
('assets ', 0.992143988609314) ,
('risk ', 0.99197918176651) ,
('million ', 0.9914211630821228) ,
('capital ', 0.9911070466041565) ,
('mortgage ', 0.9908864498138428) ,
('securities ', 0.9907544255256653) ,
('financial ', 0.9906978011131287) ,
('december ', 0.9902232885360718) ]

使用苹果公司的MD&A作为Word2Vec模型的输入，我们找到了与 "sales "相关/相似的词，以及它们的相似度指数。例如，在输入的MD&A文件中，"rate "和 "interest "等词经常与 "sales "一词共存，其相似度很高。

在上面的例子中，我们只用了一个文本文档来训练Word2Vec模型。然而，当我们增加训练语料库时，该模型在识别词簇和相似性方面的性能将大大改善。训练Word2Vec的一个流行选择是谷歌新闻数据集模型。它由大约300万个单词和短语的300维嵌入组成（详见 https://code.google. com/archive/p/word2vec/，并下载'GoogleNewsvectors-negative300.bin.gz'文件（∼1.5GB））。有了预先训练好的模型，我们就可以访问词向量并获得相似度分数，具体如下。

输入

from gensim . models import KeyedVectors
# load embeddings directly from the downloaded file
called " GoogleNews -vectors - negative300 .bin"
model = KeyedVectors . load_word2vec_format ('GoogleNews-vectors-negative300 .bin ', binary =True)
# similarity between pairs of words
a = model . similarity ('confident ', 'uncertain ')
b = model . similarity ('recession ', 'crisis ')
# most similar words
c = model . most_similar ('accounting ')
# identifies a word that does not belong in the list
d = model . doesnt_match ("good great amazing bad".split ())
print (a)
print (b)
print (c)
print (d)

输出

0.38531393
0.59829676
[( 'Accounting ', 0.6579887866973877) ,
('bookkeeping ', 0.6002781391143799) ,
('auditing ', 0.5503429174423218) ,
('Arthur_Andersen_Enron ', 0.5320826768875122) ,
('restatement ', 0.5319857597351074) ,
('accountancy ', 0.5315808057785034) ,
('bookeeping ', 0.5051406621932983) ,
('Generally_Accepted_Accounting_Principles ',
0.5034366846084595) ,
('accouting ', 0.5023787021636963) ]
bad

使用在谷歌新闻上预先训练的Word2Vec模型，我们能够识别出 "bookkeeping "和 "auditing "以及 "Generally Accepted Accounting Principles"是与 "accounting"这个词最密切相关的词和短语。同样，Word2Vec能够检测到 "bad "这个词不属于由 "good great amazing bad "这个词组成的列表中。

总而言之，至少有三种不同的数字表示文本的方法：词袋（BOW）方法、加权（反文档频率（idf））BOW和Word2Vec的词嵌入。在处理文本时，每一种方法都有其优点和缺点（例如，单词上下文的建模与上下文的丢失）。因此，我们建议读者在选择一种方法时要仔细考虑他们的研究问题，因为有时一种方法的复杂性增加不一定会带来更好的结果。