Chap04-1: Part-of-speech Tagging

Excelsior-JH 2017. 1. 11. 17:47

Part-of-speech tagging (POS tagging) identifies the part of speech of each word in a sentence and attaches a tag to it. The result is a list of tuples of the form (word, tag), where the tag is the part-of-speech (POS) tag.

1. Default tagging

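Default tagging provides a baseline for POS tagging. It uses the DefaultTagger class to assign the same POS tag to every token. On its own this tagger is inaccurate, but it works well as a last resort at the end of a chain of taggers, where it improves accuracy by making sure no token is left untagged.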

The DefaultTagger class takes a single argument: the tag to apply. The example below applies the 'NN' tag.

from nltk.tag.sequential import DefaultTagger
tagger = DefaultTagger('NN')
print(tagger.tag(['Hello', 'World']))
# Output
[('Hello', 'NN'), ('World', 'NN')]

Every tagger has a tag() method, which takes a list of tokens and returns the tagged tokens as a list of (word, tag) tuples.

DefaultTagger is a subclass of SequentialBackoffTagger, and every subclass of SequentialBackoffTagger must implement the choose_tag() method.
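To make the role of choose_tag() concrete, here is a minimal sketch of a custom subclass. The CapitalizedTagger class and its capitalization rule are hypothetical and exist only for illustration; the only NLTK pieces assumed are SequentialBackoffTagger, its backoff argument, and DefaultTagger.

```python
from nltk.tag.sequential import DefaultTagger, SequentialBackoffTagger


class CapitalizedTagger(SequentialBackoffTagger):
    """Hypothetical tagger: guesses 'NNP' for capitalized tokens, else defers."""

    def choose_tag(self, tokens, index, history):
        # tokens: the whole sentence, index: position of the current token,
        # history: tags already chosen for tokens[:index].
        word = tokens[index]
        if word[:1].isupper():
            return 'NNP'
        return None  # None hands the token over to the backoff tagger, if any


tagger = CapitalizedTagger(backoff=DefaultTagger('NN'))
result = tagger.tag(['Hello', 'world'])
print(result)
```

Returning None from choose_tag() is what signals "no decision", so the token falls through to the DefaultTagger given as backoff.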

The choose_tag() method of DefaultTagger is very simple: it returns the tag that the user supplied. The available part-of-speech tags are listed in the table below.

Number	Tag	Description
1	CC	Coordinating conjunction
2	CD	Cardinal number
3	DT	Determiner
4	EX	Existential there
5	FW	Foreign word
6	IN	Preposition or subordinating conjunction
7	JJ	Adjective
8	JJR	Adjective, comparative
9	JJS	Adjective, superlative
10	LS	List item marker
11	MD	Modal
12	NN	Noun, singular or mass
13	NNS	Noun, plural
14	NNP	Proper noun, singular
15	NNPS	Proper noun, plural
16	PDT	Predeterminer
17	POS	Possessive ending
18	PRP	Personal pronoun
19	PRP$	Possessive pronoun
20	RB	Adverb
21	RBR	Adverb, comparative
22	RBS	Adverb, superlative
23	RP	Particle
24	SYM	Symbol
25	TO	to
26	UH	Interjection
27	VB	Verb, base form
28	VBD	Verb, past tense
29	VBG	Verb, gerund or present participle
30	VBN	Verb, past participle
31	VBP	Verb, non-3rd person singular present
32	VBZ	Verb, 3rd person singular present
33	WDT	Wh-determiner
34	WP	Wh-pronoun
35	WP$	Possessive wh-pronoun
36	WRB	Wh-adverb

1) Evaluating accuracy

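To measure a tagger's accuracy, use the evaluate() method, which takes a list of gold-standard tagged sentences and returns the fraction of tokens tagged correctly. The example below evaluates against part of the tagged_sents of the treebank corpus. (In newer NLTK releases evaluate() is deprecated in favor of accuracy(), which takes the same argument.)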

from nltk.tag.sequential import DefaultTagger
tagger = DefaultTagger('NN')
from nltk.corpus import treebank
test_sents = treebank.tagged_sents()[3000:]
print(tagger.evaluate(test_sents))
# Output
0.14331966328512843
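The score is simply the share of tokens whose predicted tag matches the gold tag. The sketch below recomputes that share by hand; the two gold sentences are made-up toy data, not taken from treebank, and the loop is an illustration of the idea rather than NLTK's own implementation.

```python
from nltk.tag.sequential import DefaultTagger
from nltk.tag.util import untag

# Toy gold-standard data (made up for illustration).
gold_sents = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('a', 'DT'), ('cat', 'NN')],
]

tagger = DefaultTagger('NN')

correct = 0
total = 0
for gold in gold_sents:
    # Strip the gold tags, re-tag the bare words, then compare tag by tag.
    predicted = tagger.tag(untag(gold))
    for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
        correct += gold_tag == predicted_tag
        total += 1

accuracy = correct / total
print(accuracy)  # only 'dog' and 'cat' are really 'NN': 2 of 5 tokens
```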

2) Untagging a tagged sentence

Tags can be removed from tagged tokens with nltk.tag.untag().

from nltk.tag.util import untag
print(untag([('Hello', 'NN'), ('World', 'NN')]))
# Output
['Hello', 'World']

2. Training a unigram part-of-speech tagger

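A unigram generally refers to a single token, so a unigram tagger uses only one word as the context for deciding the POS tag. UnigramTagger inherits from NgramTagger, which is a subclass of ContextTagger, which in turn inherits from SequentialBackoffTagger. In the example below, words that never appear in the training sentences (such as 'Pierre') are tagged None.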

from nltk.corpus import treebank
from nltk.tag.sequential import UnigramTagger

train_sents = treebank.tagged_sents()[:3000]
tagger = UnigramTagger(train_sents)
print(treebank.sents()[0])
print(tagger.tag(treebank.sents()[0]))
# Output
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
[('Pierre', None), ('Vinken', None), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', None), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', None), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
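A smaller example may make the behavior easier to see. The training sentences below are made-up toy data; the sketch assumes only that UnigramTagger picks the most frequent tag seen for each word.

```python
from nltk.tag.sequential import UnigramTagger

# Toy training data (made up): 'cook' is seen twice as NN and once as VBP.
train_sents = [
    [('the', 'DT'), ('cook', 'NN')],
    [('they', 'PRP'), ('cook', 'VBP')],
    [('the', 'DT'), ('cook', 'NN'), ('sleeps', 'VBZ')],
]

tagger = UnigramTagger(train_sents)
result = tagger.tag(['they', 'cook', 'rice'])
print(result)
```

Here 'cook' is tagged 'NN' even after 'they', because the tagger looks at the word alone, and 'rice' is tagged None because it never occurred in training.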

1) Overriding the context model

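Every tagger that inherits from ContextTagger can tag with a prebuilt model instead of a trained one. The model is a Python dictionary of the form {word: tag}. As the output below shows, a tagger built this way tags only the words in the model and returns None for everything else.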

from nltk.corpus import treebank
from nltk.tag.sequential import UnigramTagger

train_sents = treebank.tagged_sents()[:3000]
tagger = UnigramTagger(train_sents)
print(tagger.tag(treebank.sents()[0]))

tagger = UnigramTagger(model={'Pierre' : 'NN'})
print(tagger.tag(treebank.sents()[0]))
# Output
[('Pierre', None), ('Vinken', None), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', None), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', None), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
[('Pierre', 'NN'), ('Vinken', None), (',', None), ('61', None), ('years', None), ('old', None), (',', None), ('will', None), ('join', None), ('the', None), ('board', None), ('as', None), ('a', None), ('nonexecutive', None), ('director', None), ('Nov.', None), ('29', None), ('.', None)]
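One way to use a hand-written model without losing the trained tags is to put it in front of a trained tagger through the backoff argument (backoff tagging is covered in section 3). The sketch below uses made-up toy data and a hypothetical override for the word 'cook'.

```python
from nltk.tag.sequential import UnigramTagger

# Toy training data (made up for illustration).
train_sents = [
    [('the', 'DT'), ('cook', 'NN')],
    [('the', 'DT'), ('cook', 'NN'), ('sleeps', 'VBZ')],
]
trained = UnigramTagger(train_sents)

# Hypothetical manual override, consulted before the trained model.
overrides = UnigramTagger(model={'cook': 'VB'}, backoff=trained)

result_trained = trained.tag(['the', 'cook', 'sleeps'])
result_overrides = overrides.tag(['the', 'cook', 'sleeps'])
print(result_trained)
print(result_overrides)
```

Only 'cook' changes; the other words are not in the override model, so they fall through to the trained tagger.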

2) Minimum frequency cutoff

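The ContextTagger class uses frequency of occurrence to decide the most appropriate tag for a given token. By default it does this even when a word and tag appear together only once. The UnigramTagger class lets you set a minimum frequency through the cutoff argument. In the example below, with cutoff=5, the tokens '61', 'old' and '29' are no longer tagged.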

from nltk.corpus import treebank
from nltk.tag.sequential import UnigramTagger

train_sents = treebank.tagged_sents()[:3000]
tagger = UnigramTagger(train_sents)
print(tagger.tag(treebank.sents()[0]))

tagger = UnigramTagger(train_sents, cutoff=5)
print(tagger.tag(treebank.sents()[0]))
# Output
[('Pierre', None), ('Vinken', None), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', None), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', None), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
[('Pierre', None), ('Vinken', None), (',', ','), ('61', None), ('years', 'NNS'), ('old', None), (',', ','), ('will', 'MD'), ('join', None), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', None), ('director', 'NN'), ('Nov.', 'NNP'), ('29', None), ('.', '.')]
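The effect of cutoff is easier to follow on a tiny corpus. In this sketch the training sentences are made-up toy data: 'the' occurs three times, while each noun occurs once.

```python
from nltk.tag.sequential import UnigramTagger

# Toy training data (made up): 'the' is frequent, each noun is rare.
train_sents = [
    [('the', 'DT'), ('dog', 'NN')],
    [('the', 'DT'), ('cat', 'NN')],
    [('the', 'DT'), ('bird', 'NN')],
]

no_cutoff = UnigramTagger(train_sents)
with_cutoff = UnigramTagger(train_sents, cutoff=2)

result_no_cutoff = no_cutoff.tag(['the', 'dog'])
result_with_cutoff = with_cutoff.tag(['the', 'dog'])
print(result_no_cutoff)
print(result_with_cutoff)
```

With the cutoff in place the rare word 'dog' is dropped from the model and comes back as None, while the frequent word 'the' is still tagged.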

3. Combining taggers with backoff tagging

Backoff tagging is one of the core features of SequentialBackoffTagger. It lets you chain taggers together so that when one tagger cannot tag a word, the word is passed on to the next backoff tagger. In the example below, ('Pierre', None) becomes ('Pierre', 'NN').

from nltk.corpus import treebank
from nltk.tag.sequential import UnigramTagger, DefaultTagger

train_sents = treebank.tagged_sents()[:3000]
tagger = UnigramTagger(train_sents)

tagger1 = DefaultTagger('NN')
tagger2 = UnigramTagger(train_sents, backoff=tagger1)
print(tagger.tag(treebank.sents()[0]))
print(tagger2.tag(treebank.sents()[0]))
# Output
[('Pierre', None), ('Vinken', None), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', None), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', None), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
[('Pierre', 'NN'), ('Vinken', 'NN'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'NN'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'NN'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
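The lookup order can be sketched in plain Python. This is not NLTK's implementation; tag_with_backoff, the small dictionary and the lambda are all hypothetical stand-ins for a trained UnigramTagger followed by a DefaultTagger('NN').

```python
def tag_with_backoff(tokens, lookups):
    """Tag each token with the first non-None answer from `lookups`."""
    tagged = []
    for token in tokens:
        tag = None
        for lookup in lookups:
            tag = lookup(token)
            if tag is not None:
                break  # this tagger decided, so later ones are not asked
        tagged.append((token, tag))
    return tagged


# Stand-in for a trained unigram model (made-up entries).
unigram_model = {'the': 'DT', 'board': 'NN', 'will': 'MD'}

# Order matters: the dictionary is asked first, the default comes last.
lookups = [unigram_model.get, lambda token: 'NN']

result = tag_with_backoff(['Pierre', 'will', 'join', 'the', 'board'], lookups)
print(result)
```

Because the last lookup always answers, no token can end up as None, which is the role DefaultTagger plays at the end of a chain.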

4. Training and combining ngram taggers

Besides UnigramTagger, NgramTagger has two more subclasses: BigramTagger and TrigramTagger. BigramTagger uses the previous tag as part of the context, and TrigramTagger uses the previous two tags. An ngram is a subsequence of n items, so BigramTagger looks at two items (the previous word's tag and the current word) and TrigramTagger looks at three. These two taggers are suited to words whose POS tag depends on context. For example, the word 'cook' can be a verb ('to cook') or a noun ('a cook'). The idea behind the NgramTagger subclasses is that looking at the previous words and their POS tags gives a better guess at the POS tag of the current word.
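The 'cook' case can be tried directly. The two training sentences below are made-up toy data; the sketch assumes only that BigramTagger keys its model on the previous tag together with the current word.

```python
from nltk.tag.sequential import BigramTagger

# Toy training data (made up): 'cook' follows a determiner once and a pronoun once.
train_sents = [
    [('the', 'DT'), ('cook', 'NN')],
    [('they', 'PRP'), ('cook', 'VBP')],
]

tagger = BigramTagger(train_sents)

noun_reading = tagger.tag(['the', 'cook'])
verb_reading = tagger.tag(['they', 'cook'])
print(noun_reading)
print(verb_reading)
```

The same word receives a different tag depending on the tag chosen for the word before it, which a UnigramTagger cannot do.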

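However, as the example below shows, BigramTagger and TrigramTagger perform very poorly when used on their own. (The sample sentence is tagged completely only because it is part of the training data; the accuracy on the held-out test sentences tells the real story.)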

from nltk.corpus import treebank
from nltk.tag.sequential import BigramTagger, TrigramTagger

train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]

bitagger = BigramTagger(train_sents)
print(bitagger.tag(treebank.sents()[0]))
print(bitagger.evaluate(test_sents))

tritagger = TrigramTagger(train_sents)
print(tritagger.tag(treebank.sents()[0]))
print(tritagger.evaluate(test_sents))
# Output
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
0.11305849341679257
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
0.06906971724584503
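One reason for the low scores can be reproduced on toy data: once a token cannot be tagged, the following contexts contain a None tag that was never seen in training, so the failure spreads along the sentence. The training sentence below is made up for illustration.

```python
from nltk.tag.sequential import BigramTagger

# Toy training data (made up): a single sentence.
train_sents = [
    [('they', 'PRP'), ('cook', 'VBP'), ('rice', 'NN')],
]

tagger = BigramTagger(train_sents)

seen = tagger.tag(['they', 'cook', 'rice'])
unseen_start = tagger.tag(['we', 'cook', 'rice'])
print(seen)
print(unseen_start)
```

In the second sentence only 'we' is new, yet 'cook' and 'rice' also come back as None because their context now begins with an untagged word.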

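BigramTagger and TrigramTagger perform well when they are combined with backoff tagging. The following code in tag_util.py takes a list of tagger classes and builds them one after another, giving each new tagger the previously built tagger as its backoff.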

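#tag_util.py
def backoff_tagger(train_sents, tagger_classes, backoff=None):
    # Train each class in turn, using the previously built tagger as its backoff.
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    # The last tagger built is the head of the chain.
    return backoff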

The next example uses the backoff_tagger function from tag_util.py. As the result shows, the tagger's accuracy rises substantially, to about 0.88.

from nltk.corpus import treebank
from nltk.tag.sequential import BigramTagger, TrigramTagger, DefaultTagger, \
    UnigramTagger
from tag_util import backoff_tagger

train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]
backoff = DefaultTagger('NN')
tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger],
                        backoff=backoff)

print(tagger.tag(treebank.sents()[0]))
print(tagger.evaluate(test_sents))
# Output
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
0.880897906324196
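To check the order in which the combined tagger consults its parts, the chain can be walked through the backoff property of each tagger. The sketch below repeats backoff_tagger inline so that it runs on its own, and trains on made-up toy data rather than treebank.

```python
from nltk.tag.sequential import (
    BigramTagger, DefaultTagger, TrigramTagger, UnigramTagger,
)


def backoff_tagger(train_sents, tagger_classes, backoff=None):
    # Same helper as in tag_util.py, repeated here to keep the sketch standalone.
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff


# Toy training data (made up for illustration).
train_sents = [
    [('they', 'PRP'), ('cook', 'VBP'), ('rice', 'NN')],
    [('the', 'DT'), ('cook', 'NN'), ('sleeps', 'VBZ')],
]

tagger = backoff_tagger(
    train_sents,
    [UnigramTagger, BigramTagger, TrigramTagger],
    backoff=DefaultTagger('NN'),
)

# Follow the backoff links from the head of the chain to its end.
chain = []
current = tagger
while current is not None:
    chain.append(type(current).__name__)
    current = current.backoff
print(chain)

tagged = tagger.tag(['we', 'cook', 'rice'])
print(tagged)
```

The class listed last in tagger_classes ends up at the head of the chain, so the most specific tagger is asked first and DefaultTagger is asked last.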
