[Concept] Bag of Words

# Bag of Words (BoW)

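A way of turning text into numbers that ignores word order entirely and looks only at how often each word occurs (frequency)
- A text representation that records the number of times each word appears
- Used for tasks that judge what kind of document a text is, based on which words appear and how often
- ex) If words such as 'running' and 'stamina' appear frequently, the document can be classified as sports-related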
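
To make the "order is ignored" point concrete, here is a minimal sketch. It assumes simple whitespace tokenization of English text (not the morphological analysis used later in this post), and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

def bag_of_words(sentence):
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return Counter(sentence.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

# Same words, different order: the two bags are identical.
print(a == b)    # True
print(a["the"])  # 2
```

The two sentences mean different things, but their BoW representations cannot tell them apart.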

=> Mainly used for classification problems and for measuring similarity between documents
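
As a rough illustration of the similarity use case, the sketch below compares BoW count vectors with cosine similarity. The sample documents, the whitespace tokenization, and the helper names `bow` and `cosine_similarity` are all assumptions made for this example. Cosine similarity is one common choice, not the only one.

```python
import math
from collections import Counter

def bow(text):
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # A Counter returns 0 for missing words, so the union of keys is safe to iterate.
    dot = sum(a[w] * b[w] for w in set(a) | set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

sports_1 = bow("running builds stamina and running builds strength")
sports_2 = bow("stamina matters for running")
cooking = bow("simmer the soup and season the soup")

sim_sports = cosine_similarity(sports_1, sports_2)
sim_cooking = cosine_similarity(sports_1, cooking)

print(round(sim_sports, 3), round(sim_cooking, 3))
print(sim_sports > sim_cooking)  # True
```

The two sports sentences share 'running' and 'stamina', so they score higher than the sports/cooking pair, which shares only 'and'.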

Local representation (= discrete representation)
- Represents a word by mapping a specific value to it, looking only at the word itself
- Cannot express a word's meaning or nuance
- <-> Distributed representation (= continuous representation)
  - Can express a word's nuance
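
A small sketch of why a local representation cannot express meaning. One-hot encoding is used here as an illustrative local representation (the text above does not name it), and the three-word vocabulary is made up for the example.

```python
# Hypothetical toy vocabulary: each word is mapped to its own index.
vocab = ["running", "jogging", "soup"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

print(one_hot("running"))                           # [1, 0, 0]
print(dot(one_hot("running"), one_hot("jogging")))  # 0
print(dot(one_hot("running"), one_hot("soup")))     # 0
```

'running' is exactly as far from 'jogging' as it is from 'soup', because each word is only an index. A distributed representation would place related words closer together.
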

How to build a BoW

1) Assign a unique integer index to each word
2) Build a vector that records, at each index position, how many times the corresponding word token appears

from konlpy.tag import Okt
# Okt (Open Korean Text) = a project that continues twitter-korean-text, the open-source Korean text processor created at Twitter
# konlpy (KoNLPy) = a Python package for Korean language processing

okt = Okt()

text = "샤릉해"
# normalize = a normalization function => cleans up the sentence
# print(okt.normalize(text)) # 사랑해
def build_bag_of_words(document):
  # Remove periods, then run morphological analysis
  document = document.replace('.', '')
  tokenized_document = okt.morphs(document) # morphs = returns the morphemes as a list
  # print(okt.morphs(document)) # ['정부', '가', '발표', '하는', '물가상승률', '과', '소비자', '가', '느끼는', '물가상승률', '은', '다르다']

  word_to_index = {} # dict: word -> integer index
  bow = []           # bow[i] = number of occurrences of the word with index i

  for word in tokenized_document:
    if word not in word_to_index: # first time this word appears
      word_to_index[word] = len(word_to_index)
      # A new word always gets the next index, so append its initial count of 1.
      bow.append(1)
    else: # the word has appeared before
      index = word_to_index[word]
      # Add 1 at the index position of the repeated word.
      bow[index] += 1

  return word_to_index, bow

doc1 = "정부가 발표하는 물가상승률과 소비자가 느끼는 물가상승률은 다르다."
vocab, bow = build_bag_of_words(doc1)
print('vocabulary :', vocab) # vocabulary : {'정부': 0, '가': 1, '발표': 2, '하는': 3, '물가상승률': 4, '과': 5, '소비자': 6, '느끼는': 7, '은': 8, '다르다': 9}
print('bag of words vector :', bow) # bag of words vector : [1, 2, 1, 1, 2, 1, 1, 1, 1, 1]
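
The vocabulary and the vector only make sense together: position `i` of the vector holds the count of the word whose index is `i`. The sketch below pairs them back up. To stay runnable without KoNLPy, it hard-codes the `vocab` and `bow` values from the output shown above. `word_counts` and `repeated` are names introduced for this example.

```python
# Values copied from the printed output above (no KoNLPy needed to run this).
vocab = {'정부': 0, '가': 1, '발표': 2, '하는': 3, '물가상승률': 4,
         '과': 5, '소비자': 6, '느끼는': 7, '은': 8, '다르다': 9}
bow = [1, 2, 1, 1, 2, 1, 1, 1, 1, 1]

# Look up each word's count through its index.
word_counts = {word: bow[index] for word, index in vocab.items()}
print(word_counts)

# Words that occur more than once in the sentence.
repeated = [word for word, count in word_counts.items() if count > 1]
print(repeated)  # ['가', '물가상승률']
```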

# Building a BoW with the CountVectorizer class

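CountVectorizer = a scikit-learn class that turns each text into a vector of word occurrence counts
- It lowercases everything by default, so me and Me become the same feature
By default, only tokens of two or more characters are recognized => 'I' is dropped
- In English, removing very short tokens is itself often treated as a preprocessing step
CountVectorizer performs only low-level tokenization, splitting essentially on whitespace (punctuation is also treated as a separator)
- This is fine for English, but for Korean the BoW is not built properly because of particles (josa) attached to words
- ex) CountVectorizer treats '물가상승률과' and '물가상승률은' as different words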
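
The Korean limitation can be checked directly. This sketch feeds the same sentence used earlier in this post to a default `CountVectorizer`. The variable names are chosen for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["정부가 발표하는 물가상승률과 소비자가 느끼는 물가상승률은 다르다."]

vect = CountVectorizer()
counts = vect.fit_transform(doc).toarray()

print('vocabulary :', vect.vocabulary_)
print('bag of words vector :', counts)

# The particles stay attached, so the two forms are separate features
# and the bare noun never appears in the vocabulary.
print('물가상승률과' in vect.vocabulary_)  # True
print('물가상승률은' in vect.vocabulary_)  # True
print('물가상승률' in vect.vocabulary_)    # False
```

Every token is counted once, whereas the Okt-based function earlier counted '물가상승률' twice. One possible workaround is to pass a morphological analyzer through the `tokenizer` parameter of `CountVectorizer`, though that is not tested here.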

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['you know I want your love. because I love you.']
vector = CountVectorizer()

# Record the frequency of each word in the corpus
print('bag of words vector :', vector.fit_transform(corpus).toarray()) # bag of words vector : [[1 1 2 1 2 1]]

# Print the index assigned to each word
# vocabulary_ = a dict mapping terms to feature indices
print('vocabulary :',vector.vocabulary_) # vocabulary : {'you': 4, 'know': 1, 'want': 3, 'your': 5, 'love': 2, 'because': 0}
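
The vocabulary above has no entry for 'I' because the default token pattern only matches tokens of two or more characters. If single-character tokens matter, one option is to widen the pattern through the `token_pattern` parameter. The sketch below does that. The regular expression is an assumption chosen for this example, not a recommended setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['you know I want your love. because I love you.']

# Assumption: r"(?u)\b\w+\b" also accepts one-character tokens
# (the default pattern is r"(?u)\b\w\w+\b").
vector = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vector.fit_transform(corpus).toarray()

# Words listed in index order, next to their counts.
words = sorted(vector.vocabulary_, key=vector.vocabulary_.get)
print(words)   # ['because', 'i', 'know', 'love', 'want', 'you', 'your']
print(counts)  # [[1 2 1 2 1 2 1]]
```

'I' now shows up as the lowercased feature 'i' with a count of 2.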

# Using the stopwords provided by NLTK

from sklearn.feature_extraction.text import CountVectorizer
# import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords

# (1) Using stopwords defined by the user
text = ["Family is not an important thing. It's everything."]
vect = CountVectorizer(stop_words=["the", "a", "an", "is", "not"])
print('bag of words vector :',vect.fit_transform(text).toarray()) # bag of words vector : [[1 1 1 1 1]]
print('vocabulary :',vect.vocabulary_) # vocabulary : {'family': 1, 'important': 2, 'thing': 4, 'it': 3, 'everything': 0}

# (2) Using CountVectorizer's own built-in stopwords
text = ["Family is not an important thing. It's everything."]
vect = CountVectorizer(stop_words="english")
print('bag of words vector :',vect.fit_transform(text).toarray()) # bag of words vector : [[1 1 1]]
print('vocabulary :',vect.vocabulary_) # vocabulary : {'family': 0, 'important': 1, 'thing': 2}

# (3) Using the stopwords provided by NLTK
text = ["Family is not an important thing. It's everything."]
stop_words = stopwords.words("english")
vect = CountVectorizer(stop_words=stop_words)
print(stop_words) # ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
print('bag of words vector :',vect.fit_transform(text).toarray()) # bag of words vector : [[1 1 1 1]]
print('vocabulary :',vect.vocabulary_) # vocabulary : {'family': 1, 'important': 2, 'thing': 3, 'everything': 0}
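
The three runs above differ only in which words count as stopwords. To see why `stop_words="english"` leaves just three features, the sketch below tokenizes the sentence without any filtering and checks each token against scikit-learn's built-in list, `ENGLISH_STOP_WORDS`. The names `raw_tokens`, `removed`, and `kept` are introduced for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

text = ["Family is not an important thing. It's everything."]

# Tokenize with the default analyzer and no stopword filtering.
raw_tokens = CountVectorizer().build_analyzer()(text[0])

# Split the tokens by membership in scikit-learn's built-in English list.
removed = [tok for tok in raw_tokens if tok in ENGLISH_STOP_WORDS]
kept = [tok for tok in raw_tokens if tok not in ENGLISH_STOP_WORDS]

print('raw tokens :', raw_tokens)
print('removed    :', removed)
print('kept       :', kept)  # ['family', 'important', 'thing']
```

scikit-learn's list contains 'everything', while the NLTK list printed above does not, which is why run (3) keeps 'everything' and run (2) drops it.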
