Elasticsearch Tokenizers – Partial Word Tokenizers

Written by loveprogramming on 22/05/2021

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-partial-word-tokenizers


In this tutorial, we'll look at two tokenizers that can break text or words into small fragments for partial-word matching: the N-Gram Tokenizer and the Edge N-Gram Tokenizer.

I. N-Gram Tokenizer

The ngram tokenizer does two things:

  • break up text into words when it encounters specified characters (whitespace, punctuation...)
  • emit N-grams of each word of the specified length (quick with length = 2 -> [qu, ui, ic, ck] )

=> N-grams are like a sliding window of continuous letters.
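The sliding-window idea can be sketched in a few lines of Python (an illustrative simulation, not Elasticsearch's actual implementation):

```python
def ngrams(word, length):
    """Emit every substring of `word` with the given length,
    sliding the window one character at a time."""
    return [word[i:i + length] for i in range(len(word) - length + 1)]

print(ngrams("quick", 2))  # ['qu', 'ui', 'ic', 'ck']
```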

For example:


POST _analyze
{
  "tokenizer": "ngram",
  "text": "Spring 5"
}

It generates terms with a sliding window (minimum width 1 character, maximum width 2 characters):


[ "S", "Sp", "p", "pr", "r", "ri", "i", "in", "n", "ng", "g", "g ", " ", " 5", "5" ]
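To see where that token list comes from, here is a small Python simulation of the default min_gram=1 / max_gram=2 behaviour (illustrative only; with the default token_chars of [], nothing is split, so whitespace ends up inside grams too):

```python
def ngram_range(text, min_gram=1, max_gram=2):
    """At each position, emit every gram from min_gram to max_gram
    characters long, mimicking the default ngram tokenizer."""
    out = []
    for i in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            if i + n <= len(text):
                out.append(text[i:i + n])
    return out

print(ngram_range("Spring 5"))
# ['S', 'Sp', 'p', 'pr', 'r', 'ri', 'i', 'in', 'n', 'ng', 'g', 'g ', ' ', ' 5', '5']
```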

Configuration

  • min_gram: minimum length of characters in a gram (minimum width of the sliding window). Defaults to 1.
  • max_gram: maximum length of characters in a gram (maximum width of the sliding window). Defaults to 2.
  • token_chars: character classes that will be included in a token. Elasticsearch splits on characters that don't belong to any of:
      • letter (a, b, ...)
      • digit (1, 2, ...)
      • whitespace (" ", "\n", ...)
      • punctuation (!, ", ...)
      • symbol ($, %, ...)
    Defaults to [] (keep all characters).

For example, we will create a tokenizer with a fixed-width sliding window (width = 3) that keeps only the letter & digit character classes:


PUT jsa_index_n-gram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST jsa_index_n-gram/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "Tut101: Spring 5"
}

Terms:

[ "Tut", "ut1", "t10", "101", "Spr", "pri", "rin", "ing" ]

Note that ":" and whitespace split the text, and "5" produces no term because it is shorter than min_gram (3).
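That output can be reproduced with a short Python simulation (illustrative, assuming the rules described above): first split the text on every character that is neither a letter nor a digit, then emit 3-character grams from each resulting word; words shorter than min_gram yield nothing.

```python
import re

def ngram_analyze(text, min_gram=3, max_gram=3):
    """Split on non-letter/non-digit characters, then emit grams of
    min_gram..max_gram characters from each resulting word."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [w[i:i + n]
            for w in words
            for i in range(len(w))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(w)]

print(ngram_analyze("Tut101: Spring 5"))
# ['Tut', 'ut1', 't10', '101', 'Spr', 'pri', 'rin', 'ing']
```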

More at:

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-partial-word-tokenizers

