2-3 Lemmas and Stems

In natural language processing, lemmatization and stemming are used to reduce the number of distinct word forms in a corpus.

Lemmatization

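A lemma is the base form of a word. Take the verb 'dive'. Depending on its role in a sentence, it appears in inflected forms such as diving, dove, dived, and dives. Likewise, 'be' shows up as am, was, are, is, and so on. Here dive and be are the base forms, that is, the lemmas, of all of these variants. Replacing tokens with their lemmas, which reduces the dimensionality of the vector representation, is called lemmatization.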

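import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("he was running late")
for token in doc:
    # format() fills each {} placeholder, in order, with its arguments
    print('{} --> {}'.format(token, token.lemma_))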

Tokenizing the sentence "he was running late" yields the tokens, each available as token, and the lemma of each token is available as token.lemma_. Running the code prints the following.

he --> he
was --> be
running --> run
late --> late
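
To make the dimensionality reduction described above concrete, here is a minimal sketch that counts distinct tokens before and after mapping them to lemmas. The LEMMA_TABLE dictionary and the lemmatize function are hypothetical stand-ins written for this illustration; they are not how spaCy works internally.

```python
# Hypothetical lookup table for illustration only; a real lemmatizer
# relies on far larger dictionaries and rules.
LEMMA_TABLE = {
    "diving": "dive", "dove": "dive", "dived": "dive", "dives": "dive",
    "am": "be", "was": "be", "are": "be", "is": "be",
}

def lemmatize(token):
    """Return the lemma for a token, or the lowercased token if it is unknown."""
    token = token.lower()
    return LEMMA_TABLE.get(token, token)

tokens = ["dive", "diving", "dove", "dived", "dives",
          "be", "am", "was", "are", "is"]
lemmas = [lemmatize(t) for t in tokens]

vocab_before = set(tokens)   # distinct surface forms
vocab_after = set(lemmas)    # distinct lemmas

print(len(vocab_before), "->", len(vocab_after))  # 10 -> 2
```

Ten surface forms collapse to two lemmas, so a bag-of-words vector built over this tiny vocabulary would shrink from ten dimensions to two.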

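One caveat: lemmatization gives accurate results only when the part of speech or morphological information of each word is known. spaCy, used in the code above, extracts lemmas from predefined dictionary data (originally derived from WordNet) in which the base forms of many words are already defined. WordNetLemmatizer, the lemmatizer provided by NLTK (Natural Language Toolkit), accepts the part of speech together with the word and returns the lemma that fits that part of speech; if no part of speech is given, it treats the word as a noun.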
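
The sketch below illustrates why the part of speech matters: the same surface form can have different lemmas depending on how it is used. The POS_LEMMAS table, its tag names, and lemmatize_with_pos are assumptions made up for this example. With NLTK, the part of speech is passed through the pos argument of WordNetLemmatizer.lemmatize instead.

```python
# Hypothetical (word, part-of-speech) table; the entries and tag names
# are illustrative assumptions, not data taken from WordNet.
POS_LEMMAS = {
    ("dove", "VERB"): "dive",        # "she dove into the pool"
    ("dove", "NOUN"): "dove",        # the bird
    ("meeting", "VERB"): "meet",     # "we are meeting today"
    ("meeting", "NOUN"): "meeting",  # "the meeting ran late"
}

def lemmatize_with_pos(word, pos):
    """Look up the lemma of a word given its part of speech."""
    word = word.lower()
    return POS_LEMMAS.get((word, pos), word)

examples = [("dove", "VERB"), ("dove", "NOUN"),
            ("meeting", "VERB"), ("meeting", "NOUN")]
for word, pos in examples:
    print(f"{word} ({pos}) --> {lemmatize_with_pos(word, pos)}")
```

Without the part of speech, a lemmatizer has to guess which of the two entries applies, which is where incorrect lemmas come from.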

Stemming

A stem is the part of a word that does not change. Stemming is the technique of extracting these stems from the words in a sentence. It simply applies hand-written rules to chop off word endings, and because no set of rules can fit every word, its accuracy is limited.
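
As a rough illustration of rule-based suffix stripping, the following toy stemmer applies a handful of hand-written rules in order. The rule list and the toy_stem function are assumptions made up for this example. This is not the Porter algorithm, only a sketch of the general idea and of how such rules misfire.

```python
# Hypothetical suffix rules, tried in order: (suffix to strip, replacement).
# This is NOT the Porter algorithm, only a toy illustration.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]

def toy_stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["crosses", "running", "notes", "this", "was"]:
    print(w, "-->", toy_stem(w))
```

Here 'crosses' becomes 'cross' as intended, but 'running' becomes 'runn', 'notes' collapses to 'not', and 'this' and 'was' are cut to 'thi' and 'wa', because the rules look only at the characters at the end of a word and know nothing about its meaning or part of speech.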

Porter and Snowball are the best-known stemmers. The code below applies the Porter stemmer to the sentence "This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's tokenizer data; download it once if it is missing
# (the resource is named 'punkt' or 'punkt_tab' depending on the NLTK version).
# nltk.download('punkt')

s = PorterStemmer()
text = "This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."

# Split the sentence into word tokens
words = word_tokenize(text)
print(words)
# Output:
['This', 'was', 'not', 'the', 'map', 'we', 'found', 'in', 'Billy', 'Bones', "'s", 'chest', ',', 'but', 'an', 'accurate', 'copy', ',', 'complete', 'in', 'all', 'things', '--', 'names', 'and', 'heights', 'and', 'soundings', '--', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '.']

# Stem each token
print([s.stem(w) for w in words])
# Output:
['thi', 'wa', 'not', 'the', 'map', 'we', 'found', 'in', 'billi', 'bone', "'s", 'chest', ',', 'but', 'an', 'accur', 'copi', ',', 'complet', 'in', 'all', 'thing', '--', 'name', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'except', 'of', 'the', 'red', 'cross', 'and', 'the', 'written', 'note', '.']
(Source: https://omicro03.medium.com/%EC%9E%90%EC%97%B0%EC%96%B4%EC%B2%98%EB%A6%AC-nlp-5%EC%9D%BC%EC%B0%A8-%EC%96%B4%EA%B0%84-%EC%B6%94%EC%B6%9C-%ED%91%9C%EC%A0%9C%EC%96%B4-%EC%B6%94%EC%B6%9C-4a967d830cc2)

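As the stemmed output shows, some words come out correctly, but many results, such as 'thi', 'wa', and 'copi', are not valid words. The output also shows that, unlike lemmatization, stemming does not preserve the part of speech. Stemming is less accurate than lemmatization, but in exchange it performs the reduction comparatively quickly.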
