[Elasticsearch] A look at the nori analyzer options

GaGah 2021. 4. 1. 21:28

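What is the Nori Analyzer?

The Korean morphological analyzer officially provided by Elasticsearch
Uses a reprocessed version of the mecab-ko-dic dictionary
Has one tokenizer (nori_tokenizer) and two token filters (nori_part_of_speech and nori_readingform); newer plugin versions also ship a third filter, nori_number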

Korean morphological analyzers that can be used with Elasticsearch

1. Nori Analyzer

2. Arirang

3. Eunjeonhannip (seunjeon)

4. Open Korean Text

nori_tokenizer

decompound_mode
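- Determines how compound tokens are handled
- none
  - 가거도항
  - 가곡역
  - Compounds like these are kept whole, without being split
- discard
  - 가곡역 -> 가곡, 역 (the compound is split into its parts and the original form is dropped)
  - default
- mixed
  - 가곡역 -> 가곡역, 가곡, 역 (the original compound is kept along with its parts)
discard_punctuation
- Whether to drop punctuation (commas, periods) from the output
- Defaults to true
user_dictionary
- nori_tokenizer is based on mecab-ko-dic by default.
- User-defined nouns can be appended to the default dictionary; they are tagged NNG (general noun).
- e.g. to have "C샤프" treated as a single noun, add it to the user dictionary.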
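
To make the three decompound_mode values concrete, here is a small toy model in Python. It is not how the plugin works internally: the split of 가곡역 into 가곡 and 역 is passed in by hand rather than looked up in mecab-ko-dic, and the function name `decompound` is made up for this sketch.

```python
# Toy model of decompound_mode. NOT the nori implementation:
# the compound and its parts are supplied by hand.


def decompound(compound, parts, mode="discard"):
    """Return the tokens emitted for one compound noun under the given mode."""
    if mode == "none":
        # Keep the compound as a single token.
        return [compound]
    if mode == "discard":
        # Emit only the parts and drop the original compound (the default).
        return list(parts)
    if mode == "mixed":
        # Emit the original compound followed by its parts.
        return [compound, *parts]
    raise ValueError(f"unknown decompound_mode: {mode!r}")


for mode in ("none", "discard", "mixed"):
    print(mode, decompound("가곡역", ["가곡", "역"], mode))
```

The three printed lists correspond to the none, discard, and mixed bullets above.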

Hands-on examples

Example 1) Settings

[PUT] http://localhost:9200/nori_sample


{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

Example 1) Response

{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "nori_sample"
}

Example 2) Analyzing text with nori

[GET] http://localhost:9200/nori_sample/_analyze

{
  "analyzer": "my_analyzer",
  "text": "C샤프"  
}

Example 2) Analysis result

{
    "tokens": [
        {
            "token": "C샤프",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        }
    ]
}
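
The two requests above can also be issued from a script. The following is a minimal sketch using only the Python standard library; it assumes the same local node at http://localhost:9200 with the nori plugin installed and userdict_ko.txt already in place. The helper names `build_request` and `send` are my own, and the actual sending is left commented out so the snippet runs without a cluster.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # same host as in the examples above


def build_request(method, path, body):
    """Build (but do not send) a JSON request for the local node."""
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send the request and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Same settings body as Example 1.
index_settings = {
    "settings": {
        "index": {
            "analysis": {
                "tokenizer": {
                    "nori_user_dict": {
                        "type": "nori_tokenizer",
                        "decompound_mode": "mixed",
                        "discard_punctuation": "false",
                        "user_dictionary": "userdict_ko.txt",
                    }
                },
                "analyzer": {
                    "my_analyzer": {"type": "custom", "tokenizer": "nori_user_dict"}
                },
            }
        }
    }
}

create_index = build_request("PUT", "/nori_sample", index_settings)
analyze = build_request(
    "GET", "/nori_sample/_analyze", {"analyzer": "my_analyzer", "text": "C샤프"}
)

# With Elasticsearch running, uncomment to execute both examples:
# print(send(create_index))
# print(send(analyze))
```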

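Going further,

In the nori_user_dict tokenizer above, custom words were added by creating a txt file and pointing user_dictionary at it.

Alternatively, the words can be written inline as shown below. (Note that in this case the option to use is user_dictionary_rules.)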

"user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]

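As far as I understand the plugin documentation, each rule (and each line of the dictionary file) is a term optionally followed by the pieces it should decompose into, separated by spaces. So "세종시 세종 시" registers 세종시 as a compound made of 세종 and 시. A small sketch that reads the rules this way; `parse_rule` is a hypothetical helper, not part of any Elasticsearch client:

```python
def parse_rule(rule):
    """Split one user dictionary rule into (term, decomposition parts)."""
    fields = rule.split()
    if not fields:
        raise ValueError("empty user dictionary rule")
    # First field is the term itself; any remaining fields are its parts.
    return fields[0], fields[1:]


rules = ["c++", "C샤프", "세종", "세종시 세종 시"]

for rule in rules:
    term, parts = parse_rule(rule)
    if parts:
        print(f"{term}: compound noun, decomposes into {parts}")
    else:
        print(f"{term}: simple noun")
```
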
token_filter

nori_readingform
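- Recognizes Hanja (Chinese characters) and converts them to their Hangul reading.
nori_number
- Recognizes numbers written in Hangul, normalizes them to regular numerals, and splits them from adjacent tokens
- e.g. 영영칠 -> 7
- e.g. 십만이천오백과 ３.２천 -> 102500/과/3200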
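
Both filters are enabled by listing them in an analyzer's filter chain. Below is a sketch of what such settings could look like, built as a Python dict. The names nori_keep_punctuation, drop_spaces, and my_number_analyzer are hypothetical. Setting discard_punctuation to false and then dropping SP (space) tokens is my assumption about how to let the decimal point in input like ３.２천 reach nori_number, so check it against the plugin documentation for your version.

```python
import json


def build_number_settings():
    """Index settings that chain nori_readingform and nori_number."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "tokenizer": {
                        "nori_keep_punctuation": {
                            "type": "nori_tokenizer",
                            # Assumption: keep punctuation so that a decimal
                            # point is still present when nori_number runs.
                            "discard_punctuation": "false",
                        }
                    },
                    "filter": {
                        # Remove the whitespace tokens that appear once
                        # punctuation is no longer discarded.
                        "drop_spaces": {
                            "type": "nori_part_of_speech",
                            "stoptags": ["SP"],
                        }
                    },
                    "analyzer": {
                        "my_number_analyzer": {
                            "type": "custom",
                            "tokenizer": "nori_keep_punctuation",
                            "filter": [
                                "drop_spaces",
                                "nori_readingform",
                                "nori_number",
                            ],
                        }
                    },
                }
            }
        }
    }


print(json.dumps(build_number_settings(), ensure_ascii=False, indent=2))
```

The printed JSON can be used as the body of an index creation request in the same way as Example 1.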

Removing unnecessary stopwords

Stopping
Removes stopwords that have no value as index terms

{
    "analysis":{
        "tokenizer":{
            "korean_nori_tokenizer":{
                "type":"nori_tokenizer",
                "decompound_mode":"mixed",
                "user_dictionary":"user_dictionory_kor.txt"
            }
        },
        "analyzer":{
            "nori_analyzer":{
                "type":"custom",
                "tokenizer":"korean_nori_tokenizer",
                "filter":[
                    "nori_posfilter"    
                ]
            }
        },
        "filter":{
            "nori_posfilter":{
                "type":"nori_part_of_speech",
                "stoptags":[
                    "E",
                    "IC",
                    "J",
                    "MAG",
                    "MM",
                    "NA",
                    "NR",
                    "SC",
                    "SE",
                    "SF",
                    "SH",
                    "SL",
                    "SN",
                    "SP",
                    "SSC",
                    "SSO",
                    "SY",
                    "UNA",
                    "UNKNOWN",
                    "VA",
                    "VCN",
                    "VCP",
                    "VSV",
                    "VV",
                    "VX",
                    "XPN",
                    "XR",
                    "XSA",
                    "XSN",
                    "XSV"
                ]
            }
        }
    }
}

stoptags lists the part-of-speech tags whose tokens should be removed from the output.

Result of running the analyzer with the settings above

{
    "analyzer":"nori_analyzer",
    "text":"아빠가방에들어가신다."
}

{
    "tokens": [
        {
            "token": "아빠",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "가방",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 1
        }
    ]
}
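
Why only 아빠 and 가방 survive is easier to see with a toy version of the filter. In the sketch below the part-of-speech tags are assigned by hand for illustration and are not actual nori output; `filter_by_pos` is a made-up helper that drops every token whose tag is in the stoptags list from the settings above.

```python
# Same stoptags as in the nori_posfilter settings above.
STOPTAGS = {
    "E", "IC", "J", "MAG", "MM", "NA", "NR", "SC", "SE", "SF",
    "SH", "SL", "SN", "SP", "SSC", "SSO", "SY", "UNA", "UNKNOWN", "VA",
    "VCN", "VCP", "VSV", "VV", "VX", "XPN", "XR", "XSA", "XSN", "XSV",
}

# Hand-assigned (token, tag) pairs for "아빠가방에들어가신다."
# These are an assumption for illustration, not real analyzer output.
tagged_tokens = [
    ("아빠", "NNG"),   # general noun
    ("가방", "NNG"),   # general noun
    ("에", "J"),       # particle
    ("들어가", "VV"),  # verb
    ("신다", "E"),     # verbal ending
    (".", "SF"),       # sentence-final punctuation
]


def filter_by_pos(tokens, stoptags):
    """Keep only tokens whose part-of-speech tag is not a stoptag."""
    return [token for token, tag in tokens if tag not in stoptags]


print(filter_by_pos(tagged_tokens, STOPTAGS))
```

Since NNG is not in the list, the two nouns are the only tokens left, matching the result above.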

References

www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-nori-tokenizer.html

esbook.kimjmin.net/06-text-analysis/6.7-stemming/6.7.2-nori
