Elasticsearch Study Notes (Part 3)




Installing the IK analyzer

<code>cd plugins/
mkdir ik
cd ik
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip

unzip elasticsearch-analysis-ik-7.4.2.zip

# Restart Elasticsearch so the plugin is loaded
/<code>

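IK ships two analyzers:

ik_max_word: splits the text at the finest granularity and exhausts every possible combination. For example, “中华人民共和国国歌” is split into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”.

ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 国歌”.

Test: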

<code>DELETE /my_index

PUT /my_index
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

# Add data
POST /my_index/_bulk
{ "index": { "_id": "1"} }
{ "text": "男子偷上万元发红包求交女友 被抓获时仍然单身" }
{ "index": { "_id": "2"} }
{ "text": "16岁少女为结婚“变”22岁 7年后想离婚被法院拒绝" }
{ "index": { "_id": "3"} }
{ "text": "深圳女孩骑车逆行撞奔驰 遭索赔被吓哭(图)" }
{ "index": { "_id": "4"} }
{ "text": "女人对护肤品比对男票好?网友神怼" }
{ "index": { "_id": "5"} }
{ "text": "为什么国内的街道招牌用的都是红黄配?" }

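# Test tokenization with the analyzer configured on the "text" field
GET /my_index/_analyze
{
  "text": "男子偷上万元发红包求交女友 被抓获时仍然单身",
  "field": "text"
}

# Full-text search against the analyzed field
GET /my_index/_search
{
  "query": {
    "match": {
      "text": "16岁少女结婚好还是单身好?"
    }
  }
}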
/<code>

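IK configuration files

Directory: plugins/ik/config

Files:

  • IKAnalyzer.cfg.xml: configures custom dictionaries
  • main.dic: IK's built-in Chinese dictionary with more than 270,000 entries; any word listed here is kept together as a single token
  • quantifier.dic: measure words and units
  • suffix.dic: common suffixes
  • surname.dic: Chinese surnames
  • stopword.dic: English stop words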

Custom IK dictionary

<code>
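# Run inside plugins/ik/config
mkdir githen
cd githen

echo "蓝瘦香菇" > mydict.dic
echo "恩" > ext_stopword.dic

# Go back to plugins/ik/config and register both files
cd ..
vim IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- custom extension dictionary -->
  <entry key="ext_dict">githen/mydict.dic</entry>
  <!-- custom extension stop-word dictionary -->
  <entry key="ext_stopwords">githen/ext_stopword.dic</entry>
</properties>

# Restart Elasticsearch, then check that the new word is kept as one token
GET _analyze
{
  "text": "蓝瘦香菇",
  "analyzer": "ik_max_word"
}

# Response
{
  "tokens" : [
    {
      "token" : "蓝瘦香菇",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "香菇",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}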
/<code>

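Hot updates backed by MySQL

  1. Modify the IK analyzer source so that it reloads the dictionaries from MySQL at a fixed interval. Code changes:

Dictionary class, line 169: the initialization method of the Dictionary singleton. Create and start a custom thread here.

HotDictReloadThread class: an endless loop that keeps calling Dictionary.getSingleton().reLoadMainDict() to reload the dictionaries.

Dictionary class, line 389: this.loadMySQLExtDict();

Dictionary class, line 683: this.loadMySQLStopwordDict();

  • Build with mvn package; the archive is written to target\releases\elasticsearch-analysis-ik-5.2.0.zip
  • Unpack the IK archive and put the MySQL driver jar into the IK directory
  • Adjust the JDBC settings
  • Restart Elasticsearch
  2. Use the hot-update mechanism that IK supports natively: deploy a web server that exposes an HTTP endpoint and signals changes through the Last-Modified and ETag HTTP response headers (not recommended).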
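
The polling idea behind HotDictReloadThread can be sketched in a few lines. The following is only an illustration of the pattern in Python, not the plugin's Java code: `HotReloadingDict`, `load_words`, and the interval are hypothetical names, and in the real setup the loader would run a query against MySQL.

```python
import threading
from typing import Callable, Iterable, Set


class HotReloadingDict:
    """Keeps an in-memory word set fresh by polling a loader function."""

    def __init__(self, load_words: Callable[[], Iterable[str]], interval_seconds: float):
        self._load_words = load_words      # hypothetical loader, e.g. a MySQL query
        self._interval = interval_seconds
        self._words: Set[str] = set()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def reload(self) -> None:
        """Load a fresh word set and swap it in under the lock."""
        fresh = set(self._load_words())
        with self._lock:
            self._words = fresh

    def _run(self) -> None:
        # Loop until stop() is called; Event.wait acts as an interruptible sleep.
        while not self._stop.is_set():
            self.reload()
            self._stop.wait(self._interval)

    def start(self) -> None:
        self.reload()  # load once up front so lookups work immediately
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def contains(self, word: str) -> bool:
        with self._lock:
            return word in self._words
```

Swapping in a complete new set, instead of mutating the old one, means a lookup never sees a half-loaded dictionary.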

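bucket and metric

  • bucket: a group of documents. Documents are bucketed by a field; those that share the same value for that field fall into the same bucket
  • metric: a statistic computed over one group, i.e. an aggregation run on a single bucket, such as the average, maximum, or minimum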
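
To make the two terms concrete, here is a small Python sketch that mimics them outside Elasticsearch. The rows and the helper names `bucket_by` and `metric` are made up for illustration; Elasticsearch does the equivalent work internally.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (color, price)
rows = [("red", 1000), ("red", 2000), ("green", 3000), ("blue", 1500)]


def bucket_by(rows, key_index):
    """Bucket step: rows with the same field value land in the same group."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key_index]].append(row)
    return dict(buckets)


def metric(bucket, value_index, fn):
    """Metric step: reduce one bucket to a single number."""
    return fn([row[value_index] for row in bucket])


buckets = bucket_by(rows, 0)
summary = {
    color: {
        "doc_count": len(bucket),
        "avg_price": metric(bucket, 1, mean),
        "max_price": metric(bucket, 1, max),
    }
    for color, bucket in buckets.items()
}
print(summary)
```

The bucket step decides which documents belong together; every metric is then computed per bucket, never across buckets.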

Counting documents per group

<code># Define the mapping
PUT /tvs
{
"mappings": {
"properties": {
"price": {
"type": "long"
},
"color": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"sold_date": {
"type": "date"
}
}
}
}

# Insert data

POST /tvs/_bulk
{ "index": {}}
{ "price" : 1000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-10-28" }
{ "index": {}}
{ "price" : 2000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-11-05" }
{ "index": {}}
{ "price" : 3000, "color" : "绿色", "brand" : "小米", "sold_date" : "2016-05-18" }
{ "index": {}}
{ "price" : 1500, "color" : "蓝色", "brand" : "TCL", "sold_date" : "2016-07-02" }
{ "index": {}}
{ "price" : 1200, "color" : "绿色", "brand" : "TCL", "sold_date" : "2016-08-19" }
{ "index": {}}
{ "price" : 2000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-11-05" }
{ "index": {}}
{ "price" : 8000, "color" : "红色", "brand" : "三星", "sold_date" : "2017-01-01" }
{ "index": {}}
{ "price" : 2500, "color" : "蓝色", "brand" : "小米", "sold_date" : "2017-02-12" }

# Group by color

GET /tvs/_search
{
"size": 0,
"aggs": {
"popular_colors": {
"terms": {
"field": "color"
}
}
}
}

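size: 0 returns only the aggregation results, not the documents they were computed from
aggs: fixed keyword; declares that an aggregation is to be run on the data
popular_colors: every aggregation needs a name; the name is arbitrary, so pick whatever you like
terms: buckets documents by the values of a field
field: the field whose values define the buckets


# Response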
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"popular_colors" : {
"doc_count_error_upper_bound" : 0,

"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "红色",
"doc_count" : 4
},
{
"key" : "绿色",
"doc_count" : 2
},
{
"key" : "蓝色",
"doc_count" : 2
}
]
}
}
}

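hits.hits: empty because size is 0; otherwise the documents the aggregation ran over would be returned here
aggregations: the aggregation results
popular_colors: the name we gave the aggregation
buckets: the buckets produced for the field we specified
key: the field value that defines the bucket
doc_count: the number of documents in the bucket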
/<code>

Average per group

<code>
GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

/<code>

Drill-down

<code>
GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "color_avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "group_by_brand": {
          "terms": {
            "field": "brand"
          },
          "aggs": {
            "brand_avg_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}

/<code>
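
What the drill-down computes can be replayed in plain Python on the eight sample documents. This is only a sketch: it assumes the goal is an average price at each level, as the names color_avg_price and brand_avg_price suggest, and the `drill_down` helper is hypothetical.

```python
from collections import defaultdict

# (price, color, brand) rows copied from the tvs bulk request
TVS = [
    (1000, "红色", "长虹"), (2000, "红色", "长虹"), (3000, "绿色", "小米"),
    (1500, "蓝色", "TCL"), (1200, "绿色", "TCL"), (2000, "红色", "长虹"),
    (8000, "红色", "三星"), (2500, "蓝色", "小米"),
]


def avg(values):
    return sum(values) / len(values)


def drill_down(rows):
    """Bucket by color, then bucket each color by brand, averaging price at both levels."""
    by_color = defaultdict(list)
    for row in rows:
        by_color[row[1]].append(row)

    result = {}
    for color, color_rows in by_color.items():
        by_brand = defaultdict(list)
        for price, _, brand in color_rows:
            by_brand[brand].append(price)
        result[color] = {
            "color_avg_price": avg([price for price, _, _ in color_rows]),
            "group_by_brand": {brand: avg(prices) for brand, prices in by_brand.items()},
        }
    return result


for color, stats in drill_down(TVS).items():
    print(color, stats)
```

Each nested "aggs" level only ever sees the documents of its parent bucket, which is why the brand averages differ from color to color.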

Max, min, average, and sum

<code>GET /tvs/_search
{
"size": 0,
"aggs":{
"colors":{
"terms": {
"field": "color"
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
},
"max_price":{
"max": {
"field": "price"
}
},
"min_price":{
"min": {
"field": "price"
}
},
"sum_price":{
"sum": {
"field": "price"
}
}
}
}
}
}

/<code>

histogram

histogram: takes a numeric field and a fixed interval, and buckets documents by the interval that the field's value falls into

<code>
GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "price": {
      "histogram": {
        "field": "price",
        "interval": 1000
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

# Response

{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"price" : {
"buckets" : [
{
"key" : 1000.0,
"doc_count" : 3,
"avg_price" : {
"value" : 1233.3333333333333
}
},
{
"key" : 2000.0,
"doc_count" : 3,
"avg_price" : {
"value" : 2166.6666666666665
}
},
{
"key" : 3000.0,
"doc_count" : 1,
"avg_price" : {
"value" : 3000.0
}
},
{
"key" : 4000.0,
"doc_count" : 0,
"avg_price" : {
"value" : null
}
},
{
"key" : 5000.0,
"doc_count" : 0,
"avg_price" : {
"value" : null
}
},
{
"key" : 6000.0,
"doc_count" : 0,
"avg_price" : {
"value" : null
}
},
{
"key" : 7000.0,
"doc_count" : 0,
"avg_price" : {
"value" : null
}
},
{
"key" : 8000.0,
"doc_count" : 1,
"avg_price" : {
"value" : 8000.0
}
}
]
}
}
}
/<code>
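
The bucket keys in the response above follow from rounding each price down to a multiple of the interval. The Python sketch below reproduces the counts and averages from the eight sample prices; the `histogram` function is a hypothetical stand-in, and it assumes no offset and that empty buckets between the first and last key are filled in, as the response shows.

```python
import math
from collections import defaultdict

# Prices copied from the tvs bulk request
PRICES = [1000, 2000, 3000, 1500, 1200, 2000, 8000, 2500]


def histogram(values, interval):
    """Return (bucket_key, doc_count, avg_or_None) tuples, including empty buckets."""
    grouped = defaultdict(list)
    for value in values:
        key = math.floor(value / interval) * interval  # round down to the bucket start
        grouped[key].append(value)

    buckets = []
    key, last = min(grouped), max(grouped)
    while key <= last:
        members = grouped.get(key, [])
        average = sum(members) / len(members) if members else None
        buckets.append((key, len(members), average))
        key += interval
    return buckets


for key, doc_count, avg_price in histogram(PRICES, 1000):
    print(key, doc_count, avg_price)
```

For example, 1000, 1500, and 1200 all round down to the key 1000, which gives doc_count 3 and an average of about 1233.33.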

date histogram (monthly analysis)

Buckets documents by a date field of our choice, splitting the timeline into buckets of the given calendar interval

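<code>GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "sales": {
      "date_histogram": {
        "field": "sold_date",
        "calendar_interval": "month",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2017-01-01",
          "max": "2017-12-31"
        }
      }
    }
  }
}

calendar_interval: replaces the plain interval parameter, which is deprecated since Elasticsearch 7.2
min_doc_count: 0 keeps buckets that contain no documents, so an interval such as 2017-01-01 to 2017-01-31 is returned even if it holds no data
extended_bounds, min, max: force the histogram to cover at least this date range, even where no documents exist; the bounds do not filter out documents outside the range (add a range query for that)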
/<code>
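
As a rough model of how the monthly buckets come about, the Python sketch below counts the sample sold_date values per month and then stretches the bucket range to the given bounds. It assumes, following the Elasticsearch documentation, that extended_bounds only widens the range and never drops documents; `monthly_histogram` and its helpers are hypothetical names.

```python
from collections import Counter
from datetime import date

# sold_date values copied from the tvs bulk request
SOLD_DATES = [
    "2016-10-28", "2016-11-05", "2016-05-18", "2016-07-02",
    "2016-08-19", "2016-11-05", "2017-01-01", "2017-02-12",
]


def month_start(d):
    return d.replace(day=1)


def next_month(d):
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)


def monthly_histogram(iso_dates, bounds_min, bounds_max):
    """Count documents per calendar month, with min_doc_count 0 and extended bounds."""
    counts = Counter(month_start(date.fromisoformat(s)) for s in iso_dates)
    # The bounds can only widen the range that the data already spans.
    first = min(min(counts), month_start(date.fromisoformat(bounds_min)))
    last = max(max(counts), month_start(date.fromisoformat(bounds_max)))

    buckets = []
    current = first
    while current <= last:
        buckets.append((current.isoformat(), counts.get(current, 0)))
        current = next_month(current)
    return buckets


for key, doc_count in monthly_histogram(SOLD_DATES, "2017-01-01", "2017-12-31"):
    print(key, doc_count)
```

Under that assumption the 2016 sales still appear as buckets, and the months of 2017 without sales show up with a count of 0.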

date histogram (quarterly analysis)

<code>GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "group_by_sold_date": {
      "date_histogram": {
        "field": "sold_date",
        "calendar_interval": "quarter",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2017-01-01",
          "max": "2017-12-31"
        }
      },
      "aggs": {
        "group_by_brand": {
          "terms": {
            "field": "brand"
          },
          "aggs": {
            "sum_price": {
              "sum": {
                "field": "price"
              }
            }
          }
        },
        "total_sum_price": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}


/<code>

Aggregating over a filtered query

<code>GET /tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "小米"
}
}
},
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
}
}
}
}

/<code>

global bucket

<code>GET /tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "长虹"
}
}
},
"aggs": {
"single_brand_avg_price": {
"avg": {
"field": "price"
}
},
"all":{
"global": {},
"aggs": {
"all_brand_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}


global: the global bucket; it brings all documents into the aggregation scope, regardless of the query above it

/<code>
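
The difference between the two averages is easy to check by hand. The Python sketch below replays the request on the sample documents: one average over the documents matching the brand query, and one over all documents as the global bucket does. The function name `search_with_global` is hypothetical.

```python
# (price, color, brand) rows copied from the tvs bulk request
TVS = [
    (1000, "红色", "长虹"), (2000, "红色", "长虹"), (3000, "绿色", "小米"),
    (1500, "蓝色", "TCL"), (1200, "绿色", "TCL"), (2000, "红色", "长虹"),
    (8000, "红色", "三星"), (2500, "蓝色", "小米"),
]


def avg(values):
    return sum(values) / len(values)


def search_with_global(rows, brand):
    """Average price for one brand next to the average over every document."""
    matched = [price for price, _, b in rows if b == brand]  # scope of the query
    everything = [price for price, _, _ in rows]             # scope of the global bucket
    return {
        "single_brand_avg_price": avg(matched),
        "all": {"all_brand_avg_price": avg(everything)},
    }


print(search_with_global(TVS, "长虹"))
```

This makes it possible to compare one brand against the whole catalogue in a single request.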

Analysis of recent months

<code>GET /tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "长虹"

}
}
},
"aggs": {
"recent_150d": {
"filter": {
"range": {
"sold_date": {
"gte": "now-150d"
}
}
},
"aggs": {
"recent_150d_avg_price": {
"avg": {
"field": "price"
}
}
}
},
"recent_140d":{
"filter": {
"range": {
"sold_date": {
"gte": "now-140d"
}
}
},
"aggs": {
"recent_140d_avg_price": {
"avg": {
"field": "price"
}
}
}
},
"recent_130d":{
"filter": {
"range": {
"sold_date": {
"gte": "now-130d"
}
}
},
"aggs": {
"recent_130d_avg_price": {
"avg": {
"field": "price"
}
}

}
}
}
}

/<code>

Other combinations

<code>GET /tvs/_search
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"range": {
"price": {
"gte": 1200
}
}
}
}
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}

GET /tvs/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color",
"order": {
"avg_price": "desc"
}
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}

}
}
}/<code>
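
Both requests above can be checked against the sample documents with a short Python sketch: the first filters by price before averaging, the second sorts the color buckets by their average price. The helper names are hypothetical, and the code only models the results, not how Elasticsearch executes the requests.

```python
from collections import defaultdict

# (price, color, brand) rows copied from the tvs bulk request
TVS = [
    (1000, "红色", "长虹"), (2000, "红色", "长虹"), (3000, "绿色", "小米"),
    (1500, "蓝色", "TCL"), (1200, "绿色", "TCL"), (2000, "红色", "长虹"),
    (8000, "红色", "三星"), (2500, "蓝色", "小米"),
]


def avg(values):
    return sum(values) / len(values)


def filtered_avg_price(rows, min_price):
    """Like the constant_score range filter: filter first, then average."""
    return avg([price for price, _, _ in rows if price >= min_price])


def colors_by_avg_price_desc(rows):
    """Like terms with order: bucket by color, then sort buckets by a sub-metric."""
    by_color = defaultdict(list)
    for price, color, _ in rows:
        by_color[color].append(price)
    ranked = [(color, avg(prices)) for color, prices in by_color.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


print(filtered_avg_price(TVS, 1200))
print(colors_by_avg_price_desc(TVS))
```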

