Elasticsearch 是一个应用非常广泛的搜索引擎。它可以对文字进行分词,从而实现全文搜索。在实际的使用中,我们会发现有一些文字中包含一些表情符号,比如笑脸,动物等等,那么我们该如何对这些表情符号来进行搜索呢?
🏻 => 🏻, light skin tone, skin tone, type 1–2 🏼 => 🏼, medium-light skin tone, skin tone, type 3 🏽 => 🏽, medium skin tone, skin tone, type 4 🏾 => 🏾, medium-dark skin tone, skin tone, type 5 🏿 => 🏿, dark skin tone, skin tone, type 6 ♪ => ♪, eighth, music, note ♭ => ♭, bemolle, flat, music, note ♯ => ♯, dièse, diesis, music, note, sharp 😀 => 😀, face, grin, grinning face 😃 => 😃, face, grinning face with big eyes, mouth, open, smile 😄 => 😄, eye, face, grinning face with smiling eyes, mouth, open, smile 😁 => 😁, beaming face with smiling eyes, eye, face, grin, smile 😆 => 😆, face, grinning squinting face, laugh, mouth, satisfied, smile 😅 => 😅, cold, face, grinning face with sweat, open, smile, sweat 🤣 => 🤣, face, floor, laugh, rofl, rolling, rolling on the floor laughing, rotfl 😂 => 😂, face, face with tears of joy, joy, laugh, tear 🙂 => 🙂, face, slightly smiling face, smile 🙃 => 🙃, face, upside-down 😉 => 😉, face, wink, winking face 🐅 => 🐅, tiger 🐆 => 🐆, leopard 🐴 => 🐴, face, horse 🐎 => 🐎, equestrian, horse, racehorse, racing 🦄 => 🦄, face, unicorn 🦓 => 🦓, stripe, zebra 🦌 => 🦌, deer
在上面,我们可以看到各种各样的 emoji 符号。比如我们想搜索 grin,那么它就把含有 😀 emoji 符号的文档也找出来。在今天的文章中,我们来展示如何实现对 emoji 符号的进行搜索。文章源自IT老刘-https://wp.itlao6.com/5024.html
安装
如果你还没有对 Elasticsearch 及 Kibana 进行安装的话,请参阅之前的文章 “Elastic:菜鸟上手指南” 进行安装。 另外,我们必须安装 ICU analyzer。关于 ICU analyzer 的安装,请参阅之前的文章 “Elasticsearch:ICU 分词器介绍”。我们在 Elasticsearch 的安装根目录中,打入如下的命令:文章源自IT老刘-https://wp.itlao6.com/5024.html
./bin/elasticsearch-plugin install analysis-icu
等安装好后,我们需要重新启动 Elasticsearch 让它起作用。运行:文章源自IT老刘-https://wp.itlao6.com/5024.html
./bin/elasticsearch-plugin list
上面的命令显示:文章源自IT老刘-https://wp.itlao6.com/5024.html
$ ./bin/elasticsearch-plugin install analysis-icu -> Installing analysis-icu -> Downloading analysis-icu from elastic [=================================================] 100% -> Installed analysis-icu $ ./bin/elasticsearch-plugin list analysis-icu
安装完 ICU analyzer 后,我们必须重新启动 Elasticsearch。文章源自IT老刘-https://wp.itlao6.com/5024.html
搜索 emoji 符号
我们先做一个简单的实验:文章源自IT老刘-https://wp.itlao6.com/5024.html
GET /_analyze { "tokenizer": "icu_tokenizer", "text": "I live in 🇨🇳 and I'm 👩🚀" }
上面使用 icu_tokenizer 来对 “I live in 🇨🇳 and I'm 👩🚀” 进行分词。 👩🚀 表情符号非常独特,因为它是更经典的 👩 和 🚀 表情符号的组合。 中国的国旗也很特别,它是 🇨 和 🇳 的组合。 因此,我们不仅在谈论正确地分割 Unicode 代码点,而且在这里真正地了解了表情符号。文章源自IT老刘-https://wp.itlao6.com/5024.html
上面的请求的返回结果为:文章源自IT老刘-https://wp.itlao6.com/5024.html
{ "tokens" : [ { "token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "live", "start_offset" : 2, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "in", "start_offset" : 7, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : """🇨🇳""", "start_offset" : 10, "end_offset" : 14, "type" : "<EMOJI>", "position" : 3 }, { "token" : "and", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "I'm", "start_offset" : 20, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : """👩🚀""", "start_offset" : 24, "end_offset" : 29, "type" : "<EMOJI>", "position" : 6 } ] }
显然 emoji 的符号被正确地分词,并能被搜索。文章源自IT老刘-https://wp.itlao6.com/5024.html
在实际的使用中,我们可能并不限限于对这些 emoji 的符号的搜索。比如我们想对如下的文档进行搜索:文章源自IT老刘-https://wp.itlao6.com/5024.html
PUT emoji-capable/_doc/1 { "content": "I like 🐅" }
上面的文档中含有一个 🐅,也就是老虎。针对上面的文档,我们想搜索 tiger 的时候,也能正确地搜索到文档,那么我们该如何去做呢?文章源自IT老刘-https://wp.itlao6.com/5024.html
在 github 上面,有一个项目叫做 https://github.com/jolicode/emoji-search/。在它的项目中,有一个目录 https://github.com/jolicode/emoji-search/tree/master/synonyms。这里其实就是同义词的目录。我们现在下载其中的一个文件 https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-en.txt 到 Elasticsearch 的本地安装目录:文章源自IT老刘-https://wp.itlao6.com/5024.html
config ├── analysis │ ├── cldr-emoji-annotation-synonyms-en.txt │ └── emoticons.txt ├── elasticsearch.yml ...
在我的电脑上:文章源自IT老刘-https://wp.itlao6.com/5024.html
$ pwd /Users/liuxg/elastic1/elasticsearch-7.11.0/config $ tree -L 3 . ├── analysis │ └── cldr-emoji-annotation-synonyms-en.txt ├── elasticsearch.keystore ├── elasticsearch.yml ├── jvm.options ├── jvm.options.d ├── log4j2.properties ├── role_mapping.yml ├── roles.yml ├── users └── users_roles
在上面的 cldr-emoji-annotation-synonyms-en.txt 的文件中,它包含了常见 emoji 的符号的同义词。比如:文章源自IT老刘-https://wp.itlao6.com/5024.html
😀 => 😀, face, grin, grinning face 😃 => 😃, face, grinning face with big eyes, mouth, open, smile 😄 => 😄, eye, face, grinning face with smiling eyes, mouth, open, smile 😁 => 😁, beaming face with smiling eyes, eye, face, grin, smile 😆 => 😆, face, grinning squinting face, laugh, mouth, satisfied, smile 😅 => 😅, cold, face, grinning face with sweat, open, smile, sweat ....
为此,我们来进行如下的实验:文章源自IT老刘-https://wp.itlao6.com/5024.html
PUT /emoji-capable { "settings": { "analysis": { "filter": { "english_emoji": { "type": "synonym", "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt" } }, "analyzer": { "english_with_emoji": { "tokenizer": "icu_tokenizer", "filter": [ "english_emoji" ] } } } }, "mappings": { "properties": { "content": { "type": "text", "analyzer": "english_with_emoji" } } } }
在上面,我们定义了 english_with_emoji 分词器,同时我们在定义 content 字段时也使用相同的分词器 english_with_emoji。我们使用 _analyze API 来进行如下的使用:文章源自IT老刘-https://wp.itlao6.com/5024.html
GET emoji-capable/_analyze { "analyzer": "english_with_emoji", "text": "I like 🐅" }
上面的命令返回:文章源自IT老刘-https://wp.itlao6.com/5024.html
{ "tokens" : [ { "token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "like", "start_offset" : 2, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : """🐅""", "start_offset" : 7, "end_offset" : 9, "type" : "SYNONYM", "position" : 2 }, { "token" : "tiger", "start_offset" : 7, "end_offset" : 9, "type" : "SYNONYM", "position" : 2 } ] }
显然它除了返回 🐅, 也同时返回了 tiger 这样的 token。也就是说我们可以同时搜索这两种,都可以搜索到这个文档。同样地:文章源自IT老刘-https://wp.itlao6.com/5024.html
GET emoji-capable/_analyze { "analyzer": "english_with_emoji", "text": "😀 means happy" }
它返回:文章源自IT老刘-https://wp.itlao6.com/5024.html
{ "tokens" : [ { "token" : """😀""", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "face", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "grin", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "grinning", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "means", "start_offset" : 3, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "face", "start_offset" : 3, "end_offset" : 8, "type" : "SYNONYM", "position" : 1 }, { "token" : "happy", "start_offset" : 9, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 2 } ] }
它表明,如果我们搜索 face, grinning,grin,该文档也会被正确地返回。文章源自IT老刘-https://wp.itlao6.com/5024.html
现在,我们输入如下的两个文档:
PUT emoji-capable/_doc/1 { "content": "I like 🐅" } PUT emoji-capable/_doc/2 { "content": "😀 means happy" }
我们对文档进行如下的搜索:
GET emoji-capable/_search { "query": { "match": { "content": "🐅" } } }
或:
GET emoji-capable/_search { "query": { "match": { "content": "tiger" } } }
他们都将返回第一个文档:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 0.8514803, "hits" : [ { "_index" : "emoji-capable", "_type" : "_doc", "_id" : "1", "_score" : 0.8514803, "_source" : { "content" : """I like 🐅""" } } ] } }
通用地,我们进行如下的搜索:
GET emoji-capable/_search { "query": { "match": { "content": "😀" } } }
或者:
GET emoji-capable/_search { "query": { "match": { "content": "grin" } } }
它们都将返回第二个文档:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 0.8514803, "hits" : [ { "_index" : "emoji-capable", "_type" : "_doc", "_id" : "2", "_score" : 0.8514803, "_source" : { "content" : """😀 means happy""" } } ] } }
转自 csdn Elastic中国社区官方博客
评论