Installing the IK analyzer
<code>cd plugins/
mkdir ik
cd ik
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
unzip elasticsearch-analysis-ik-7.4.2.zip
# Restart Elasticsearch
/<code>
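There are two analyzers:
ik_max_word: splits text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”, exhausting every possible combination.
ik_smart: splits text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 国歌”.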
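The difference between the two modes can be illustrated with a toy sketch in Python. This is not IK's actual algorithm: it assumes a tiny hand-written dictionary (`TOY_DICT`, a hypothetical name), approximates `ik_max_word` by emitting every dictionary word found in the text, and approximates `ik_smart` by greedy longest match from left to right.

```python
# Toy illustration only; IK's real segmentation and disambiguation are more involved.
# TOY_DICT is a hypothetical mini-dictionary built from the example above.
TOY_DICT = {
    "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民",
    "人", "民", "共和国", "共和", "和", "国国", "国歌",
}
MAX_LEN = max(len(w) for w in TOY_DICT)


def max_word(text):
    """Emit every dictionary word found in the text (finest granularity)."""
    tokens = []
    for start in range(len(text)):
        # For each start position, try longer candidates first.
        for length in range(min(MAX_LEN, len(text) - start), 0, -1):
            word = text[start:start + length]
            if word in TOY_DICT:
                tokens.append(word)
    return tokens


def smart(text):
    """Greedy longest match, left to right (coarsest granularity)."""
    tokens = []
    start = 0
    while start < len(text):
        for length in range(min(MAX_LEN, len(text) - start), 0, -1):
            word = text[start:start + length]
            if word in TOY_DICT:
                tokens.append(word)
                start += length
                break
        else:
            # No dictionary word starts here: emit the single character.
            tokens.append(text[start])
            start += 1
    return tokens


print(max_word("中华人民共和国国歌"))
print(smart("中华人民共和国国歌"))
```

With this dictionary the two functions reproduce the token lists quoted above; with a real dictionary, IK's output can differ because `ik_smart` applies its own disambiguation rules.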
Test:
<code>DELETE /my_index
PUT /my_index
{
"mappings": {
"properties": {
"text":{
"type":"text",
"analyzer": "ik_max_word"
}
}
}
}
# Add data
POST /my_index/_bulk
{ "index": { "_id": "1"} }
{ "text": "男子偷上万元发红包求交女友 被抓获时仍然单身" }
{ "index": { "_id": "2"} }
{ "text": "16岁少女为结婚“变”22岁 7年后想离婚被法院拒绝" }
{ "index": { "_id": "3"} }
{ "text": "深圳女孩骑车逆行撞奔驰 遭索赔被吓哭(图)" }
{ "index": { "_id": "4"} }
{ "text": "女人对护肤品比对男票好?网友神怼" }
{ "index": { "_id": "5"} }
{ "text": "为什么国内的街道招牌用的都是红黄配?" }
# Test tokenization
GET /my_index/_analyze
{
"text": "男子偷上万元发红包求交女友 被抓获时仍然单身",
"field": "text"
}
GET /my_index/_search
{
"query": {
"match": {
"text": "16岁少女结婚好还是单身好?"
}
}
}
/<code>
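IK configuration files
Directory: plugins/ik/config
Files:
- IKAnalyzer.cfg.xml: configures custom dictionaries
- main.dic: IK's built-in Chinese dictionary with more than 270,000 entries; any word listed here is kept together as one token
- quantifier.dic: words for units of measure
- suffix.dic: common suffixes
- surname.dic: Chinese surnames
- stopword.dic: English stop words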
Custom IK dictionaries
<code>
mkdir githen
cd githen
echo "蓝瘦香菇" > mydict.dic
echo "恩" > ext_stopword.dic
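vim IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<entry key="ext_dict">githen/mydict.dic</entry>
<entry key="ext_stopwords">githen/ext_stopword.dic</entry>
</properties>
# Restart Elasticsearch so the new dictionaries are loaded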
GET _analyze
{
"text": "蓝瘦香菇",
"analyzer": "ik_max_word"
}
{
"tokens" : [
{
"token" : "蓝瘦香菇",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "香菇",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
}
]
}
/<code>
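Hot reloading from MySQL
- Modify the IK analyzer source so that it automatically reloads new dictionary entries from MySQL at a fixed interval. Code changes:
Dictionary class, line 169: the initialization method of the Dictionary singleton; create and start a custom thread here
HotDictReloadThread class: an infinite loop that keeps calling Dictionary.getSingleton().reLoadMainDict() to reload the dictionaries
Dictionary class, line 389: this.loadMySQLExtDict();
Dictionary class, line 683: this.loadMySQLStopwordDict();
- Package the code with mvn package; the artifact is target\releases\elasticsearch-analysis-ik-5.2.0.zip
- Unzip the IK archive and put the MySQL driver jar into the IK directory
- Edit the JDBC settings
- Restart Elasticsearch
- Alternative (not recommended): use the hot-reload mechanism that IK supports natively. Deploy a web server that exposes an HTTP endpoint; IK detects dictionary changes through two HTTP response headers, Last-Modified and ETag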
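The polling idea behind the MySQL approach can be sketched as follows. This is a Python illustration of the pattern, not the Java patch itself: `HotDictReloader`, `fetch_words`, and the interval are hypothetical names, and the data source is abstracted as a callable so that a MySQL query (or anything else) could be plugged in. Unlike the infinite loop described above, the sketch adds a stop flag so the thread can be shut down cleanly.

```python
import threading


class HotDictReloader:
    """Periodically replaces an in-memory word set with fresh data.

    fetch_words is a hypothetical callable returning an iterable of words,
    for example the rows of a SELECT against a dictionary table.
    """

    def __init__(self, fetch_words, interval_seconds=5.0):
        self._fetch_words = fetch_words
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._words = frozenset()

    def reload_once(self):
        # Build the new set first, then swap it in under the lock.
        fresh = frozenset(self._fetch_words())
        with self._lock:
            self._words = fresh

    def contains(self, word):
        with self._lock:
            return word in self._words

    def run_forever(self):
        # Loop until stop() is called; wait() doubles as an interruptible sleep.
        while not self._stop.is_set():
            self.reload_once()
            self._stop.wait(self._interval)

    def start(self):
        thread = threading.Thread(target=self.run_forever, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()


# Usage with an in-memory list standing in for the database table.
table = ["蓝瘦香菇"]
reloader = HotDictReloader(lambda: list(table))
reloader.reload_once()
print(reloader.contains("蓝瘦香菇"))
table.append("网红")
reloader.reload_once()
print(reloader.contains("网红"))
```

Swapping a freshly built set in one step keeps readers from ever seeing a half-loaded dictionary, which is the main design concern for any reload loop of this kind.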
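bucket and metric
- bucket: a group of documents. Documents are partitioned by a field, and all documents with the same value for that field fall into the same bucket
- metric: a statistic computed over a group of documents, that is, an aggregate calculated for one bucket, such as the average, maximum, or minimum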
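As a rough mental model, the two concepts map onto "group, then summarize". The Python sketch below uses a small made-up list of (color, price) pairs, not an Elasticsearch call, to show one bucket step followed by one metric step.

```python
from collections import defaultdict

# Hypothetical sample documents: (color, price).
docs = [("红色", 1000), ("红色", 2000), ("绿色", 3000), ("蓝色", 1500)]

# Bucket: group documents that share the same value of a field.
buckets = defaultdict(list)
for color, price in docs:
    buckets[color].append(price)

# Metric: a statistic computed over each bucket (here: count and average).
for color, prices in buckets.items():
    print(color, "doc_count =", len(prices), "avg =", sum(prices) / len(prices))
```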
Counting documents per group
<code># Define the mapping
PUT /tvs
{
"mappings": {
"properties": {
"price": {
"type": "long"
},
"color": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"sold_date": {
"type": "date"
}
}
}
}
# Insert data
POST /tvs/_bulk
{ "index": {}}
{ "price" : 1000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-10-28" }
{ "index": {}}
{ "price" : 2000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-11-05" }
{ "index": {}}
{ "price" : 3000, "color" : "绿色", "brand" : "小米", "sold_date" : "2016-05-18" }
{ "index": {}}
{ "price" : 1500, "color" : "蓝色", "brand" : "TCL", "sold_date" : "2016-07-02" }
{ "index": {}}
{ "price" : 1200, "color" : "绿色", "brand" : "TCL", "sold_date" : "2016-08-19" }
{ "index": {}}
{ "price" : 2000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-11-05" }
{ "index": {}}
{ "price" : 8000, "color" : "红色", "brand" : "三星", "sold_date" : "2017-01-01" }
{ "index": {}}
{ "price" : 2500, "color" : "蓝色", "brand" : "小米", "sold_date" : "2017-02-12" }
# Group by color
GET /tvs/_search
{
"size": 0,
"aggs": {
"popular_colors": {
"terms": {
"field": "color"
}
}
}
}
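size: set to 0 to return only the aggregation results, not the documents that were aggregated
aggs: fixed keyword; declares that an aggregation is run over the data
popular_colors: every aggregation needs a name; the name is arbitrary, so pick whatever you like
terms: groups documents by the values of a field
field: the field whose values are used for grouping
# Response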
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"popular_colors" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "红色",
"doc_count" : 4
},
{
"key" : "绿色",
"doc_count" : 2
},
{
"key" : "蓝色",
"doc_count" : 2
}
]
}
}
}
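hits.hits: empty because size is 0; otherwise the matching documents would be returned as well
aggregations: the aggregation results
popular_colors: the name we gave the aggregation
buckets: the buckets produced from the specified field
key: the field value that identifies each bucket
doc_count: the number of documents in the bucket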
/<code>
Average per group
<code>
GET /tvs/_search
{
"size": 0,
"aggs": {
"colors": {
"terms": {
"field": "color"
},
"aggs":{
"avg_price":{
"avg": {
"field": "price"
}
}
}
}
}
}
/<code>
Drill-down
<code>
GET /tvs/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
},
"aggs": {
"color_avg_price": {
"avg": {
"field": "price"
}
},
"group_by_brand":{
"terms": {
"field": "brand"
},
"aggs": {
"brand_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
}
}
/<code>
Max, min, average, and sum
<code>GET /tvs/_search
{
"size": 0,
"aggs":{
"colors":{
"terms": {
"field": "color"
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
},
"max_price":{
"max": {
"field": "price"
}
},
"min_price":{
"min": {
"field": "price"
}
},
"sum_price":{
"sum": {
"field": "price"
}
}
}
}
}
}
/<code>
histogram
histogram: takes a field and groups documents into buckets by fixed-width ranges of that field's values
<code>
GET /tvs/_search
{
"size":0,
"aggs": {
"price": {
"histogram": {
"field": "price",
"interval": 1000
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
# Response
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"price" : {
"buckets" : [
{
"key" : 1000.0,
"doc_count" : 3,
"avg_price" : {
"value" : 1233.3333333333333
}
},
{
"key" : 2000.0,
"doc_count" : 3,
"avg_price" : {
"value" : 2166.6666666666665
}
},
{
"key" : 3000.0,
"doc_count" : 1,
"avg_price" : {
"value" : 3000.0
}
},
{
"key" : 4000.0,
"doc_count" : 0,
"avg_price" : {
"value" : null
}
},
{
"key" : 5000.0,
"doc_count" : 0,
"avg_price" : {
"value" : null
}
},
{
"key" : 6000.0,
"doc_count" : 0,
"avg_price" : {
"value" : null
}
},
{
"key" : 7000.0,
"doc_count" : 0,
"avg_price" : {
"value" : null
}
},
{
"key" : 8000.0,
"doc_count" : 1,
"avg_price" : {
"value" : 8000.0
}
}
]
}
}
}
/<code>
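The bucket keys in this response follow from rounding each value down to a multiple of the interval. The Python sketch below recomputes the same buckets from the prices in the bulk request; the function name `histogram` and the gap-filling loop are illustrative assumptions, not Elasticsearch code.

```python
import math

prices = [1000, 2000, 3000, 1500, 1200, 2000, 8000, 2500]  # from the tvs bulk request
interval = 1000


def histogram(values, interval):
    """Return {bucket_key: [values]}, including empty buckets between min and max."""
    keyed = {}
    for v in values:
        # Round down to the nearest multiple of the interval.
        key = math.floor(v / interval) * interval
        keyed.setdefault(key, []).append(v)
    lo, hi = min(keyed), max(keyed)
    return {k: keyed.get(k, []) for k in range(lo, hi + interval, interval)}


for key, bucket in histogram(prices, interval).items():
    avg = sum(bucket) / len(bucket) if bucket else None
    print(key, len(bucket), avg)
```

The empty 4000 to 7000 buckets appear because the range between the smallest and largest key is filled in, which mirrors the `doc_count: 0` entries in the response.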
date histogram (monthly analysis)
Buckets are created at a fixed calendar interval over a date field that we specify
<code>GET /tvs/_search
{
"size": 0,
"aggs": {
"sales": {
"date_histogram": {
"field": "sold_date",
"calendar_interval": "month",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "2017-01-01",
"max": "2017-12-31"
}
}
}
}
}
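min_doc_count: with 0, a bucket is returned even if an interval such as 2017-01-01~2017-01-31 contains no documents. In Elasticsearch 7.x this is already the default for histogram aggregations; setting it explicitly makes the intent clear
extended_bounds, min, max: forces the buckets to cover at least this start and end date, even where there are no documents; it does not filter out documents that fall outside the range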
/<code>
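How `min_doc_count: 0` and `extended_bounds` interact can be sketched in plain Python. The example assumes the documented behaviour that `extended_bounds` widens the bucket range rather than filtering documents; `monthly_histogram` and `month_index` are hypothetical helper names, and the dates are the `sold_date` values from the bulk request.

```python
from collections import Counter
from datetime import date

sold_dates = ["2016-10-28", "2016-11-05", "2016-05-18", "2016-07-02",
              "2016-08-19", "2016-11-05", "2017-01-01", "2017-02-12"]


def month_index(d):
    """Map a date to a running month number so months can be iterated."""
    return d.year * 12 + (d.month - 1)


def monthly_histogram(dates, bounds_min, bounds_max):
    """Count documents per calendar month, keeping empty months (min_doc_count 0)."""
    parsed = [date.fromisoformat(s) for s in dates]
    counts = Counter(month_index(d) for d in parsed)
    # extended_bounds can only widen the range covered by the data.
    first = min(min(counts), month_index(date.fromisoformat(bounds_min)))
    last = max(max(counts), month_index(date.fromisoformat(bounds_max)))
    result = []
    for idx in range(first, last + 1):
        year, month0 = divmod(idx, 12)
        result.append((date(year, month0 + 1, 1).isoformat(), counts.get(idx, 0)))
    return result


for key, doc_count in monthly_histogram(sold_dates, "2017-01-01", "2017-12-31"):
    print(key, doc_count)
```

Note that the 2016 documents still get buckets even though the bounds start in 2017; restricting the range would need a `range` query or filter instead.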
date histogram (quarterly analysis)
<code>GET /tvs/_search
{
"size": 0,
"aggs": {
"group_by_sold_date": {
"date_histogram": {
"field": "sold_date",
"calendar_interval": "quarter",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "2017-01-01",
"max": "2017-12-31"
}
},
"aggs": {
"group_by_brand": {
"terms": {
"field": "brand"
},
"aggs": {
"sum_price": {
"sum": {
"field": "price"
}
}
}
},
"total_sum_price":{
"sum": {
"field": "price"
}
}
}
}
}
}
/<code>
Aggregating over a filtered query
<code>GET /tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "小米"
}
}
},
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
}
}
}
}
/<code>
global bucket
<code>GET /tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "长虹"
}
}
},
"aggs": {
"single_brand_avg_price": {
"avg": {
"field": "price"
}
},
"all":{
"global": {},
"aggs": {
"all_brand_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
global: the global bucket puts all documents into the aggregation scope, regardless of the preceding query
/<code>
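The effect of the global bucket is easiest to see by computing both averages by hand. The Python sketch below uses the (brand, price) pairs from the bulk request; the variable names mirror the aggregation names in the request, and `avg` is a small helper defined only for this illustration.

```python
tvs = [  # (brand, price) pairs from the tvs bulk request
    ("长虹", 1000), ("长虹", 2000), ("小米", 3000), ("TCL", 1500),
    ("TCL", 1200), ("长虹", 2000), ("三星", 8000), ("小米", 2500),
]


def avg(values):
    values = list(values)
    return sum(values) / len(values)


# Scoped by the query: only documents whose brand is 长虹.
single_brand_avg_price = avg(p for b, p in tvs if b == "长虹")

# Global bucket: ignores the query and covers every document.
all_brand_avg_price = avg(p for b, p in tvs)

print(single_brand_avg_price, all_brand_avg_price)
```

Returning both numbers from one request is what makes the pattern useful: a single brand can be compared against the overall average without a second search.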
Analysis of recent months
<code>GET /tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "长虹"
}
}
},
"aggs": {
"recent_150d": {
"filter": {
"range": {
"sold_date": {
"gte": "now-150d"
}
}
},
"aggs": {
"recent_150d_avg_price": {
"avg": {
"field": "price"
}
}
}
},
"recent_140d":{
"filter": {
"range": {
"sold_date": {
"gte": "now-140d"
}
}
},
"aggs": {
"recent_140d_avg_price": {
"avg": {
"field": "price"
}
}
}
},
"recent_130d":{
"filter": {
"range": {
"sold_date": {
"gte": "now-130d"
}
}
},
"aggs": {
"recent_130d_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
/<code>
Other combinations
<code>GET /tvs/_search
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"range": {
"price": {
"gte": 1200
}
}
}
}
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
GET /tvs/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color",
"order": {
"avg_price": "desc"
}
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}/<code>