Chap02-2 : Replacing and Correcting Words

Excelsior-JH 2016. 12. 26. 14:10

1. Replacing words matching regular expressions

The previous post (Stemming, Lemmatizing) dealt with linguistic compression. Word replacement, by contrast, can be viewed as text normalization or typo correction.

The example below expands English contractions into their full forms, for example "can't → cannot" and "would've → would have". It is implemented by importing RegexpReplacer from replacers.py.

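The pattern r'(\w+)\'ve' matches a word that ends in 've, and the parentheses capture the part before 've as group 1. In the replacement \g<1> have, \g<1> refers back to that captured group, so the result is the captured word followed by " have".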
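To see the group reference on its own, here is a minimal sketch using only the standard re module. The sample sentence is made up for illustration; \g<1> behaves like \1 but cannot be confused with the characters that follow it.

```python
import re

# (\w+) captures the word characters before 've as group 1.
pattern = re.compile(r"(\w+)'ve")

# Hypothetical sample sentence, for illustration only.
sentence = "I should've known"

match = pattern.search(sentence)
captured = match.group(1)

# \g<1> in the replacement inserts the text captured by group 1.
expanded = pattern.sub(r"\g<1> have", sentence)

print(captured)   # should
print(expanded)   # I should have known
```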

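import re

# (pattern, replacement) pairs; specific contractions come before the general ones.
# Raw strings keep \g<1> from being read as a string escape sequence.
replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
    (r'(\w+)\'d', r'\g<1> would'),
]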

class RegexpReplacer(object):
    """ Replaces regular expression in a text.
    >>> replacer = RegexpReplacer()
    >>> replacer.replace("can't is a contraction")
    'cannot is a contraction'
    >>> replacer.replace("I should've done that thing I didn't do")
    'I should have done that thing I did not do'
    """
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
    
    def replace(self, text):
        s = text

        # Apply every compiled pattern in list order; later patterns see earlier results.
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)

        return s

from replacers import RegexpReplacer
replacer = RegexpReplacer()
print(replacer.replace("can't is a contraction"))
print(replacer.replace("I should've done that thing I didn't do"))
# Output
cannot is a contraction
I should have done that thing I did not do
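Because the patterns are applied one after another, their order matters. The sketch below uses a shortened pattern list and a hypothetical helper, apply_patterns, to compare the order used in the post with a reordered list in which the general n't rule runs first.

```python
import re

# Specific patterns first, general pattern last (the order used in the post).
specific_first = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"(\w+)n't", r"\g<1> not"),
]

# Hypothetical reordering: the general pattern runs before the specific ones.
general_first = [specific_first[2], specific_first[0], specific_first[1]]


def apply_patterns(patterns, text):
    """Apply each (regex, replacement) pair in list order."""
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text


sample = "I won't go and I can't stay"
good = apply_patterns(specific_first, sample)
bad = apply_patterns(general_first, sample)

print(good)  # I will not go and I cannot stay
print(bad)   # I wo not go and I ca not stay
```

With the general rule first, "won't" is already rewritten to "wo not" before the specific rule gets a chance to match, which is why the irregular contractions sit at the top of the list.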

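All of this is done with regular expressions, which in Python are written with the re module. The following is one example of using re: code that replaces the last seven digits of a resident registration number with '*'. The original post also links to a table summarizing regular expression syntax.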

import re 

data = """
park 800905-1049118
kim  700905-1059119
"""
pat = re.compile(r"(\d{6})-\d{7}")
# \g<1> keeps the six-digit birth date; the last seven digits are masked.
print(pat.sub(r"\g<1>-*******", data))
# Output
park 800905-*******
kim  700905-*******

2. Removing repeating characters

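In English, a writer who wants to emphasize 'I love it.' may stretch the word and write 'I looooove it.' A person understands that 'looooove' means 'love', but a computer does not. To handle this, let's look at how to remove repeated characters.

Repeated characters can be removed with a backreference. A backreference is a way to refer to a group that has already been matched within the regular expression.

The following example imports RepeatReplacer from replacers.py to remove repeated characters.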

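Zero or more leading characters (\w*)
A single character (\w) followed by another instance of the same character (\2)
Zero or more trailing characters (\w*)

Under these rules, 'looooove' is split into (looo) (o) o (ve). The extra o matched by the backreference \2 is dropped, and the three groups are joined back into 'loooove'. Splitting and dropping again by the same rules, repeatedly, eventually yields 'love'.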
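The step-by-step reduction described above can be sketched as follows. This uses the same pattern as RepeatReplacer but deliberately leaves out the WordNet check, and trace_repeats is a hypothetical helper name used only for this illustration.

```python
import re

# Same pattern and replacement as RepeatReplacer, but without the WordNet
# check, so the loop only stops when no doubled character is left.
repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")


def trace_repeats(word):
    """Return every intermediate string produced while removing repeats."""
    steps = [word]
    while True:
        reduced = repeat_regexp.sub(r"\1\2\3", steps[-1])
        if reduced == steps[-1]:
            return steps
        steps.append(reduced)


love_steps = trace_repeats("looooove")
goose_steps = trace_repeats("goose")

print(love_steps)   # ['looooove', 'loooove', 'looove', 'loove', 'love']
print(goose_steps)  # ['goose', 'gose']
```

The second result shows why the class checks wordnet.synsets(word) before each substitution: without that check, a valid word such as 'goose' would be reduced to 'gose'.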

import re
from nltk.corpus import wordnet  # requires the WordNet corpus: nltk.download('wordnet')
class RepeatReplacer(object):
    """ Removes repeating characters until a valid word is found.
    >>> replacer = RepeatReplacer()
    >>> replacer.replace('looooove')
    'love'
    >>> replacer.replace('oooooh')
    'ooh'
    >>> replacer.replace('goose')
    'goose'
    """
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        if wordnet.synsets(word):
            return word
        
        repl_word = self.repeat_regexp.sub(self.repl, word)
        
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

from chapter02.replacers import RepeatReplacer
replacer = RepeatReplacer()
print(replacer.replace('looooove'))
print(replacer.replace('ooooooh'))
# Output
love
ooh

3. Replacing synonyms

Synonym replacement is useful in word frequency analysis and text indexing.

1) Synonym replacement with a dictionary

The WordReplacer class in replacers.py, shown below, performs synonym replacement with nothing more than a Python dictionary.

class WordReplacer(object):
    """ WordReplacer that replaces a given word with a word from the word_map,
    or if the word isn't found, returns the word as is.
    >>> replacer = WordReplacer({'bday': 'birthday'})
    >>> replacer.replace('bday')
    'birthday'
    >>> replacer.replace('happy')
    'happy'
    """
    def __init__(self, word_map):
        self.word_map = word_map
    
    def replace(self, word):
        return self.word_map.get(word, word)

from chapter02.replacers import WordReplacer

replacer = WordReplacer({'bday':'birthday'})
print(replacer.replace('bday'))
# Output
birthday
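As a rough illustration of why this helps frequency analysis, the sketch below counts tokens before and after mapping synonyms to one form. The token list and the extra 'b-day' entry are made up for this example, and the lookup mirrors what WordReplacer.replace does.

```python
from collections import Counter

# Hypothetical synonym map and token list, for illustration only.
word_map = {"bday": "birthday", "b-day": "birthday"}
tokens = ["happy", "bday", "birthday", "b-day", "cake"]

raw_counts = Counter(tokens)

# Same lookup as WordReplacer.replace: fall back to the word itself.
normalized_counts = Counter(word_map.get(t, t) for t in tokens)

print(raw_counts["birthday"])         # 1
print(normalized_counts["birthday"])  # 3
```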

2) Storing synonyms in a CSV file

The next example defines the synonyms in a CSV file, one comma-separated pair per line, and then performs the replacement.

import csv

class CsvWordReplacer(WordReplacer):
    """ WordReplacer that reads word mappings from a csv file.
    >>> replacer = CsvWordReplacer('synonyms.csv')
    >>> replacer.replace('bday')
    'birthday'
    >>> replacer.replace('happy')
    'happy'
    """
    def __init__(self, fname):
        word_map = {}

        # Each row is expected to hold exactly two fields: word,synonym
        with open(fname, newline='') as f:
            for line in csv.reader(f):
                word, syn = line
                word_map[word] = syn

        super(CsvWordReplacer, self).__init__(word_map)

from chapter02.replacers import CsvWordReplacer
replacer = CsvWordReplacer('synonym_test.csv')
print(replacer.replace('bday'))
print(replacer.replace('happy'))
# Output
birthday
happy
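The contents of synonym_test.csv are not shown in the post, so the following sketch assumes a file with one word,synonym pair per line. It writes such a file to a temporary directory and reads it back in the same way CsvWordReplacer builds its word_map.

```python
import csv
import os
import tempfile

# Hypothetical file contents: one "word,synonym" pair per line.
rows = [("bday", "birthday"), ("pls", "please")]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "synonym_test.csv")

    # Write the sample file.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # Read it back into a dictionary of word -> synonym.
    with open(path, newline="") as f:
        word_map = {word: syn for word, syn in csv.reader(f)}

print(word_map)  # {'bday': 'birthday', 'pls': 'please'}
```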

3) Synonym replacement with YAML

Here the yaml module handles the synonym mapping. In a file with the .yaml extension, write each entry in the form word: synonym. (There must be a space after the ':' (colon)!)

import yaml
class YamlWordReplacer(WordReplacer):
    """ WordReplacer that reads word mappings from a yaml file.
    >>> replacer = YamlWordReplacer('synonyms.yaml')
    >>> replacer.replace('bday')
    'birthday'
    >>> replacer.replace('happy')
    'happy'
    """
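    def __init__(self, fname):
        # safe_load parses plain data only; yaml.load without a Loader fails on PyYAML 6+
        with open(fname) as f:
            word_map = yaml.safe_load(f)
        super(YamlWordReplacer, self).__init__(word_map)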

from chapter02.replacers import YamlWordReplacer
replacer = YamlWordReplacer('synonym_test.yaml')
print(replacer.replace('bday'))
# Output
birthday
