[Document Similarity] Cosine Similarity (TfidfVectorizer, cosine_similarity)

Sun.El Data Analysis

Machine Learning

Sun.El 2023. 8. 2. 23:34

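What is document similarity? It is a measure of how similar two documents are.
It can be used to recommend the news articles most similar to the one you are reading now,
or to recommend the movies most similar to one you have watched, based on their plot summaries.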

Several metrics can be used to measure document similarity, as listed below,
but cosine similarity is the most widely used:
Cosine Similarity, Jaccard Similarity, Manhattan Distance, Euclidean Distance
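
To make the four metrics above concrete, here is a minimal sketch that computes each one for two tiny documents. The helper names (`cosine_sim`, `jaccard_sim`, `manhattan_dist`, `euclidean_dist`), the four-word vocabulary, and the count vectors are all hypothetical and chosen only for illustration; they are not part of the original post.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between a and b: dot product over the product of L2 norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_sim(tokens_a, tokens_b):
    # Size of the token intersection divided by the size of the token union
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)

def manhattan_dist(a, b):
    # Sum of absolute coordinate differences (L1 distance)
    return float(np.sum(np.abs(a - b)))

def euclidean_dist(a, b):
    # Straight-line (L2) distance
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical word-count vectors over the vocabulary [blue, pill, red, take]
tokens_a = ["blue", "pill", "take"]
tokens_b = ["red", "pill", "take"]
doc_a = np.array([1.0, 1.0, 0.0, 1.0])
doc_b = np.array([0.0, 1.0, 1.0, 1.0])

cos_ab = cosine_sim(doc_a, doc_b)        # similarity: higher means more alike
jac_ab = jaccard_sim(tokens_a, tokens_b) # similarity: higher means more alike
man_ab = manhattan_dist(doc_a, doc_b)    # distance: lower means more alike
euc_ab = euclidean_dist(doc_a, doc_b)    # distance: lower means more alike

print(cos_ab, jac_ab, man_ab, euc_ab)
```

Note that the first two are similarities and the last two are distances, so they are read in opposite directions.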

1. The concept of cosine similarity

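Cosine similarity compares two vectors by the angle between them and expresses how similar they are as a number, the cosine of that angle.
The closer the directions of the two vectors, the more similar they are. When the two vectors point in exactly the same direction, the cosine similarity is 1; when the angle between them is 90 degrees, the vectors are unrelated and the cosine similarity is 0; and when the angle is 180 degrees, so that they point in opposite directions, the cosine similarity is -1.
However, document vectors built from word counts or TF-IDF weights contain no negative values, so the cosine similarity between them is never negative and takes a value between 0 and 1.
Euclidean-distance-based metrics can also be used to measure similarity between documents, but metrics that depend on the magnitude of the document vectors tend to be less accurate (a long and a short document on the same topic end up far apart), so cosine similarity is used most often.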
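
The three reference angles, and the point about vector magnitude, can be checked with a short sketch. It uses the standard definition cos θ = (A·B) / (‖A‖‖B‖); the helper name `cosine_sim` and the example vectors are assumptions made for illustration only.

```python
import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0])

same_direction = cosine_sim(v, 3 * v)                                  # angle 0 degrees
orthogonal = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))    # angle 90 degrees
opposite = cosine_sim(v, -v)                                           # angle 180 degrees

# Hypothetical count vectors: a short document and the same text repeated 10 times.
short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc

# Euclidean distance grows with document length even though the content is the same,
# while cosine similarity depends only on direction.
euclid_short_long = float(np.linalg.norm(short_doc - long_doc))
cosine_short_long = cosine_sim(short_doc, long_doc)

print(same_direction, orthogonal, opposite)
print(euclid_short_long, cosine_short_long)
```

The two documents in the second half have identical word proportions, so cosine similarity treats them as the same while Euclidean distance reports them as far apart.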

※ Cosine similarity reference: https://en.wikipedia.org/wiki/Cosine_similarity#:~:text=In%20data%20analysis%2C%20cosine%20similarity%20is%20a%20measure,vectors%20divided%20by%20the%20product%20of%20their%20lengths

2. Computing cosine similarity between sentences

Defining a cosine similarity function

[In]

import numpy as np

def cos_similarity(v1, v2):
    # Numerator: dot product of the two vectors
    dot_product = np.dot(v1, v2)
    # Denominator: product of the two L2 norms
    l2_norm = np.sqrt(np.sum(np.square(v1))) * np.sqrt(np.sum(np.square(v2)))
    similarity = dot_product / l2_norm

    return similarity

Comparing cosine similarity after TF-IDF vectorization

[In]
from sklearn.feature_extraction.text import TfidfVectorizer

doc_list = ['if you take the blue pill, the story ends',
            'if you take the red pill, you stay in Wonderland',
            'if you take the red pill, I show you how deep the rabbit hole goes']

tfidf_vect_simple = TfidfVectorizer()
feature_vect_simple = tfidf_vect_simple.fit_transform(doc_list)
print(feature_vect_simple.shape)  # 3 sentences, 18 features
print(type(feature_vect_simple))
[Out]
(3, 18) # 3 documents, 18 unique words in total
<class 'scipy.sparse._csr.csr_matrix'>
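[In] Convert the sparse matrix to a dense matrix, then to an array

Checkpoint!
<Sparse Representation>
When a sentence is represented as a vector, most values are 0: the entry for the word being represented is set to 1 and the rest are set to 0
The number of dimensions grows as the number of words grows
<Sparse Matrix>
A sparse matrix is a matrix in which most elements are 0; storing it in dense form wastes memory on entries that are never used
Learning from a sparse matrix may not work well, so in NLP the data often needs to be made dense through dimensionality reduction, for example with word/sentence embeddings
<Dense Representation>
The user chooses the number of dimensions regardless of the number of words, which gives the benefit of dimensionality reduction
Each word is represented by several features, and each element carries its information as a real number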
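
As a rough illustration of the memory point in the note above, the sketch below builds a small, mostly-zero document-term matrix in SciPy's CSR format and compares its storage with the equivalent dense array. The matrix size, the number of nonzero entries per row, and the variable names are hypothetical values picked for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical document-term matrix: 200 documents x 1,000 terms,
# with at most 3 nonzero entries per document.
rng = np.random.default_rng(0)
n_docs, n_terms, nnz_per_doc = 200, 1000, 3

rows = np.repeat(np.arange(n_docs), nnz_per_doc)
cols = rng.integers(0, n_terms, size=n_docs * nnz_per_doc)
vals = np.ones(n_docs * nnz_per_doc)

# Duplicate (row, col) pairs are summed, so nnz can be slightly below 600.
sparse = csr_matrix((vals, (rows, cols)), shape=(n_docs, n_terms))
dense = sparse.toarray()

# CSR stores only the nonzero values plus two index arrays.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
dense_bytes = dense.nbytes

print('nonzero entries:', sparse.nnz, 'of', n_docs * n_terms)
print('sparse bytes:', sparse_bytes, 'dense bytes:', dense_bytes)
```

This is why `TfidfVectorizer` returns a sparse matrix: converting to dense is convenient for a three-sentence example but becomes costly as the vocabulary grows.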
# cos_similarity() defined above takes arrays, so convert the sparse matrix to a dense matrix and then to arrays
# The result of TfidfVectorizer's fit_transform() is a sparse matrix, so convert it to a dense matrix

feature_vect_dense = feature_vect_simple.todense()

# Extract the feature vector of each sentence and flatten it to 1 dimension
# Vector of the first sentence:
# np.array(feature_vect_dense[0]) ---> 2-dimensional
# np.array(feature_vect_dense[0]).reshape(-1,) -----> 1-dimensional
vect1 = np.array(feature_vect_dense[0]).reshape(-1,)
vect2 = np.array(feature_vect_dense[1]).reshape(-1,)
vect3 = np.array(feature_vect_dense[2]).reshape(-1,)

# Cosine similarity between the first and second sentences
similarity_simple = cos_similarity(vect1, vect2)
print('Sentence 1, Sentence 2 cosine similarity: {0:.3f}'.format(similarity_simple))

# Cosine similarity between the first and third sentences
similarity_simple = cos_similarity(vect1, vect3)
print('Sentence 1, Sentence 3 cosine similarity: {0:.3f}'.format(similarity_simple))

# Cosine similarity between the second and third sentences
similarity_simple = cos_similarity(vect2, vect3)
print('Sentence 2, Sentence 3 cosine similarity: {0:.3f}'.format(similarity_simple))
[Out]
Sentence 1, Sentence 2 cosine similarity: 0.402
Sentence 1, Sentence 3 cosine similarity: 0.404
Sentence 2, Sentence 3 cosine similarity: 0.456

Comparing with scikit-learn's cosine_similarity() function

[In]

from sklearn.metrics.pairwise import cosine_similarity

similarity_simple_pair = cosine_similarity(feature_vect_simple[0],
                                          feature_vect_simple)

print(similarity_simple_pair)
# Compares sentence 1 with sentences 1, 2, and 3

[Out]

[[1.         0.40207758 0.40425045]]

[In]

similarity_simple_pair = cosine_similarity(feature_vect_simple, feature_vect_simple)
print(similarity_simple_pair)
print('shape:', similarity_simple_pair.shape)

[Out]

[[1.         0.40207758 0.40425045]
 [0.40207758 1.         0.45647296]
 [0.40425045 0.45647296 1.        ]]
shape: (3, 3)
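
One detail worth checking here: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so for its output the plain dot product already equals the cosine similarity. The sketch below verifies this on the same three sentences using `linear_kernel`; the variable names are illustrative, and the check is an aside rather than part of the original walkthrough.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

doc_list = ['if you take the blue pill, the story ends',
            'if you take the red pill, you stay in Wonderland',
            'if you take the red pill, I show you how deep the rabbit hole goes']

# Default settings: each row of the result has unit L2 norm.
tfidf = TfidfVectorizer().fit_transform(doc_list)

# L2 norm of each row, computed directly on the sparse matrix.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()

cos = cosine_similarity(tfidf)   # normalizes, then takes dot products
dot = linear_kernel(tfidf)       # dot products only

print(row_norms)
print(np.allclose(cos, dot))
```

Both functions accept the sparse matrix directly, so no conversion to a dense array is needed.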

3. Measuring document similarity with the Opinion Review dataset

Loading the data and defining functions

[In]

from nltk.stem import WordNetLemmatizer
import nltk
import string

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

Clustering

[In]

import pandas as pd
import glob ,os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

path = r'C:\Users\norii\Documents\DataScience\source\DL\0731\OpinosisDataset\topics'
all_files = glob.glob(os.path.join(path, "*.data"))     
filename_list = []
opinion_text = []

for file_ in all_files:
    df = pd.read_table(file_,index_col=None, header=0,encoding='latin1')
    filename_ = os.path.basename(file_)  # file name without the directory part, on any OS
    filename = filename_.split('.')[0]
    filename_list.append(filename)
    opinion_text.append(df.to_string())

document_df = pd.DataFrame({'filename':filename_list, 'opinion_text':opinion_text})

tfidf_vect = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english' , \
                             ngram_range=(1,2), min_df=0.05, max_df=0.85 )
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_
document_df['cluster_label'] = cluster_label

Extracting similar documents among those clustered as hotels (label 2)

[In]

from sklearn.metrics.pairwise import cosine_similarity

# Documents with cluster_label=2 were clustered as hotels; extract their index from the DataFrame
hotel_indexes = document_df[document_df['cluster_label']==2].index
print('DataFrame index of documents clustered as hotels:', hotel_indexes)

# Take the first document clustered as a hotel and show its file name.
comparison_docname = document_df.iloc[hotel_indexes[0]]['filename']
print('##### Similarity between reference document ',comparison_docname,' and other documents ######')

''' Use the Index object extracted from document_df to select the hotel-cluster rows of feature_vect,
then measure the cosine similarity between the first hotel document and the other hotel documents.'''
similarity_pair = cosine_similarity(feature_vect[hotel_indexes[0]] , feature_vect[hotel_indexes])
print(similarity_pair)

[Out]

DataFrame index of documents clustered as hotels: Int64Index([1, 13, 14, 15, 20, 21, 24, 28, 30, 31, 32, 38, 39, 40, 45, 46], dtype='int64')
##### Similarity between reference document  bathroom_bestwestern_hotel_sfo  and other documents ######
[[1.         0.0430688  0.05221059 0.06189595 0.05846178 0.06193118
  0.03638665 0.11742762 0.38038865 0.32619948 0.51442299 0.11282857
  0.13989623 0.1386783  0.09518068 0.07049362]]

Visualizing similarity between documents

[In]

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Use argsort() to get positions ordered from most to least similar to the first document, excluding the document itself.
sorted_index = similarity_pair.argsort()[:,::-1]
sorted_index = sorted_index[:, 1:]
print(sorted_index)

# Reorder hotel_indexes from most to least similar.
print(hotel_indexes)
hotel_sorted_indexes = hotel_indexes[sorted_index.reshape(-1,)]

# Sort the similarity values in descending order, excluding the document itself
hotel_1_sim_value = np.sort(similarity_pair.reshape(-1,))[::-1]
hotel_1_sim_value = hotel_1_sim_value[1:]

# Plot file names against similarity values, in descending order, as a Seaborn bar chart
hotel_1_sim_df = pd.DataFrame()
hotel_1_sim_df['filename'] = document_df.iloc[hotel_sorted_indexes]['filename']
hotel_1_sim_df['similarity'] = hotel_1_sim_value

sns.barplot(x='similarity', y='filename',data=hotel_1_sim_df)
plt.title(comparison_docname)

[Out]

[[10  8  9 12 13  7 11 14 15  5  3  4  2  1  6]]
Int64Index([1, 13, 14, 15, 20, 21, 24, 28, 30, 31, 32, 38, 39, 40, 45, 46], dtype='int64')
Text(0.5, 1.0, 'bathroom_bestwestern_hotel_sfo')
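
The sorting step above drops the first sorted entry on the assumption that the reference document is always the single highest score. As a hedged alternative, the sketch below excludes the reference document by position instead, which also behaves correctly if another document ties at 1.0. The function name `top_k_similar` and its parameters are hypothetical; the similarity values are copied from the output printed earlier.

```python
import numpy as np

def top_k_similar(similarity_row, k, self_position=0):
    """Return (positions, scores) of the k most similar items, excluding the query itself."""
    scores = np.asarray(similarity_row, dtype=float).reshape(-1)
    order = np.argsort(scores)[::-1]          # positions from highest to lowest score
    order = order[order != self_position]     # drop the query by position, not by rank
    top = order[:k]
    return top, scores[top]

# Similarity of the first hotel document to every hotel document (from the [Out] above)
similarity_pair = np.array([[1.0, 0.0430688, 0.05221059, 0.06189595, 0.05846178, 0.06193118,
                             0.03638665, 0.11742762, 0.38038865, 0.32619948, 0.51442299, 0.11282857,
                             0.13989623, 0.1386783, 0.09518068, 0.07049362]])

positions, scores = top_k_similar(similarity_pair, k=3)
print(positions)  # positions within the hotel cluster, not DataFrame index labels
print(scores)
```

The returned positions index into `hotel_indexes`, so in the original code `hotel_indexes[positions]` would map them back to rows of `document_df`.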
