Text Understanding from Scratch


kwj962004 edited this page Jan 18, 2018 · 19 revisions

https://arxiv.org/abs/1502.01710

Ybigta 10th cohort, 김우정

1. Introduction

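Statistical classification that splits text into words, phrases, and sentences -> domain dependent, requires a pre-defined dictionary -> * use deep learning instead *
Most approaches extract features from words and embed them, e.g. with a lookup table or word2vec, before feeding them to the model
We build a CNN model that learns directly from characters, without pretrained embeddings
Two senses of "from scratch"
1. ConvNets do not require knowledge of words - text is processed at the character level, so no word-based feature extractor or word2vec is needed. Working at the word level increases the dimensionality, which makes conv layers hard to apply
2. ConvNets do not require knowledge of syntactic or semantic structure - no structured prediction or language model is needed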

2. ConvNet Model Design

2.1. Key Modules

1-D convolution between input and output
- discrete input function
  - g(x) ∈ [1,l] -> R
- discrete kernel function
  - f(x) ∈ [1,k] -> R
- convolution (with stride d)
  - h(y) ∈ [1, ⌊(l-k)/d⌋ + 1] -> R
max pooling
non-linear activation function
- ReLU
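
The three modules above can be sketched in plain Python. This is only a minimal illustration, not the paper's implementation: the notes give the domains of g, f, and h but not the sum itself, so the kernel-flipped form h(y) = Σ f(x) · g(y·d − x + c) with offset c = k − d + 1 is an assumption here, as are the function names `conv1d`, `max_pool1d`, and `relu`.

```python
def conv1d(g, f, d=1):
    """Strided 1-D convolution of input g (length l) with kernel f (length k).

    Assumed form: h(y) = sum_{x=1..k} f(x) * g(y*d - x + c), c = k - d + 1,
    using the 1-based indices of the notes. Output length is (l - k) // d + 1.
    """
    l, k = len(g), len(f)
    if k > l:
        raise ValueError("kernel must not be longer than the input")
    out_len = (l - k) // d + 1
    c = k - d + 1
    h = []
    for y in range(1, out_len + 1):
        total = 0.0
        for x in range(1, k + 1):
            # Subtract 1 to convert the 1-based indices to Python's 0-based lists.
            total += f[x - 1] * g[y * d - x + c - 1]
        h.append(total)
    return h


def max_pool1d(h, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(h[i:i + size]) for i in range(0, len(h) - size + 1, size)]


def relu(h):
    """Element-wise ReLU: max(0, v)."""
    return [max(0.0, v) for v in h]
```

For an input of length 5 and a kernel of length 3, `conv1d` returns 3 values with stride 1 and 2 values with stride 2, matching ⌊(l-k)/d⌋ + 1.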

2.2. Character quantization

The input is a sequence of encoded characters
Each character gets a 1-of-m encoding (m is the size of the alphabet)
Characters not in the alphabet, such as blanks, become all-zero vectors
we quantize characters in backward order. This way, the latest reading on characters is always placed near the beginning of the output, making it easy for fully connected layers to associate correlations with the latest memory.
input sequence length = l, frame size = m
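
A minimal sketch of this quantization, under stated assumptions: the alphabet below (lowercase letters plus digits) is a stand-in chosen for illustration and is not the paper's alphabet, text is lowercased before lookup, and when the text is longer than l the sketch keeps the most recent characters. The names `ALPHABET`, `CHAR_INDEX`, and `quantize` are hypothetical.

```python
import string

# Assumption: a small illustrative alphabet, not the one used in the paper.
ALPHABET = string.ascii_lowercase + string.digits
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, length):
    """Return `length` frames of size m = len(ALPHABET), in backward order.

    Frame 0 encodes the last character of `text`. Characters outside the
    alphabet (including blanks) and unused positions stay all-zero.
    """
    m = len(ALPHABET)
    frames = [[0] * m for _ in range(length)]
    for pos, ch in enumerate(reversed(text.lower())):
        if pos >= length:
            break  # assumption: characters beyond `length` are ignored
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            frames[pos][idx] = 1
    return frames
```

Calling `quantize("ab c", 6)` gives six frames: `c`, an all-zero frame for the blank, `b`, `a`, and two all-zero padding frames.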

2.3. Model Design


1 large ConvNet, 1 small ConvNet
Both have 9 layers - 6 convolutional layers, 3 fully-connected layers
Dropout between the fully-connected layers
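
To see how the 6 convolutional layers shrink the input before the fully-connected layers, the sequence length can be tracked layer by layer. The kernel and pooling sizes below are not stated in these notes; they are recalled from the paper's layer table and should be treated as assumptions, as should the names `CONV_LAYERS` and `feature_length`.

```python
# (kernel size, pool size or None) for each of the 6 conv layers.
# ASSUMED values, not taken from these notes.
CONV_LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]


def feature_length(input_length, layers=CONV_LAYERS):
    """Sequence length after the conv stack (stride-1 conv, no padding)."""
    length = input_length
    for kernel, pool in layers:
        length = length - kernel + 1  # stride-1 convolution
        if pool is not None:
            length = length // pool   # non-overlapping max pooling
    return length
```

Under these assumed sizes, `feature_length(1014)` returns 34, so the first fully-connected layer would see 34 frames per feature map.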

2.4. Data Augmentation using Thesaurus

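Just as images are scaled, rotated, and flipped, try augmenting text as well
Having humans rephrase the text would be ideal, but that is impractical
Replace words or phrases with synonyms instead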
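
A minimal sketch of synonym replacement. The notes do not say how words or synonyms are chosen, so everything here is an assumption made for illustration: the tiny `THESAURUS` dictionary, the per-word replacement probability `p`, whitespace tokenization, and uniform choice among synonyms.

```python
import random

# Hypothetical toy thesaurus; a real one would be far larger.
THESAURUS = {"good": ["fine", "great"], "movie": ["film"]}


def augment(text, thesaurus=THESAURUS, p=0.5, rng=None):
    """Replace each word that has synonyms with probability p."""
    rng = rng or random.Random()
    out = []
    for word in text.split():
        synonyms = thesaurus.get(word.lower())
        if synonyms and rng.random() < p:
            out.append(rng.choice(synonyms))  # assumption: uniform choice
        else:
            out.append(word)
    return " ".join(out)
```

With `p=1.0` every replaceable word is swapped, so `augment("a movie", p=1.0)` yields `"a film"`; with `p=0.0` the text is returned unchanged.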

2.5. Comparison Models

bag-of-words model
- top 5000 most frequent words
- multinomial logistic regression
bag-of-centroid model via word2vec
- pretrained word2vec
- k-means with 5000 centroids
- the same multinomial logistic regression on these centroids
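
The bag-of-words features can be sketched as follows. This is an illustration only: lowercasing, whitespace tokenization, and raw counts as feature values are assumptions, and `build_vocabulary` and `bag_of_words` are hypothetical names. The classifier itself is left out.

```python
from collections import Counter


def build_vocabulary(documents, size=5000):
    """Map the `size` most frequent words to feature indices."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(size))}


def bag_of_words(document, vocabulary):
    """Count vector over the vocabulary; out-of-vocabulary words are dropped."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] += 1
    return vector
```

The bag-of-centroids variant would have the same shape, with each word first mapped to its k-means centroid id instead of its own vocabulary index.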

3. Datasets and Results

3.1. DBpedia Ontology Classification

Inputs are truncated to 1014 characters

3.2. Amazon Review Sentiment Analysis

Only reviews that are 100 to 1014 characters long are selected

3.3. Yahoo! Answers Topic Classification

Omitted

3.4. News Categorization in English

Omitted

3.5. News Categorization in Chinese

dictionary-free design
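Chinese, Japanese, and Korean have too many characters, which makes character-level input hard
Use Chinese pinyin - different pinyin map to different characters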