Elasticsearch Tokenizers – Word Oriented Tokenizers
Elasticsearch
20
tokenizers
3
word-oriented
1
Male avatar

loveprogramming viết ngày 22/05/2021

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-word-oriented-tokenizers

Elasticsearch Tokenizers – Word Oriented Tokenizers

A tokenizer breaks a stream of characters up into individual tokens (characters, words...), then outputs a stream of tokens. We can also use tokenizer to record the order or position of each term (for phrase and word proximity queries), or the start and end character offsets of the original word which the term represents (for highlighting search snippets).

In this tutorial, we're gonna look at how to use some Word Oriented Tokenizers which tokenize full text into individual words.

1. Standard Tokenizer

standard tokenizer provides grammar based tokenization:


POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Token:


{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "",
      "position": 1
    },
    {
      "token": "QUICK",
      "start_offset": 6,
      "end_offset": 11,
      "type": "",
      "position": 2
    },
    {
      "token": "Brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "",
      "position": 3
    },
    {
      "token": "Foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "",
      "position": 10
    }
  ]
}

To keep things simple, we can write term from tokens in this way:


[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]

Max Token Length

We can configure maximum token length (max_token_length - Defaults to 255).
If a token exceeds this length, it is split at max_token_length intervals.
For example, we set max_token_length to 4, it makes QUICK separate to QUIC and K.

More at:

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-word-oriented-tokenizers

Elasticsearch Tokenizers – Word Oriented Tokenizers

Bình luận


White
{{ comment.user.name }}
Bỏ hay Hay
{{comment.like_count}}
Male avatar
{{ comment_error }}
Hủy
   

Hiển thị thử

Chỉnh sửa

Male avatar

loveprogramming

545 bài viết.
97 người follow
Kipalog
{{userFollowed ? 'Following' : 'Follow'}}
Cùng một tác giả
Male avatar
1 0
Tutorial Link: (Link) (Ảnh) Django is a Pythonbased free and opensource web framework that follows the modeltemplateview architectural pattern. A...
loveprogramming viết 7 tháng trước
1 0
Male avatar
1 0
https://loizenai.com/angular11nodejspostgresqlcrudexample/ Angular 11 Node.js PostgreSQL Crud Example (Ảnh) Tutorial: “Angular 11 Node.js Postg...
loveprogramming viết 6 tháng trước
1 0
Male avatar
1 0
Angular Spring Boot jwt Authentication Example Github https://loizenai.com/angularspringbootjwt/ (Ảnh) Tutorial: ” Angular Spring Boot jwt Authe...
loveprogramming viết 6 tháng trước
1 0
Bài viết liên quan
Male avatar
0 0
https://grokonez.com/elasticsearch/elasticsearchtokenizersstructuredtexttokenizers Elasticsearch Tokenizers – Structured Text Tokenizers In this...
loveprogramming viết 23 ngày trước
0 0
{{like_count}}

kipalog

{{ comment_count }}

bình luận

{{liked ? "Đã kipalog" : "Kipalog"}}


Male avatar
{{userFollowed ? 'Following' : 'Follow'}}
545 bài viết.
97 người follow

 Đầu mục bài viết

Vẫn còn nữa! x

Kipalog vẫn còn rất nhiều bài viết hay và chủ đề thú vị chờ bạn khám phá!