Elasticsearch in the Manga Search Project (Part 1): Installing and Deploying the Search Service

The early Bilibili Manga search service was provided by the big data department, and in many respects it could not meet the needs of the manga business. After evaluating elasticsearch (ES for short) for a while, we built a full-featured search service of our own. We learned a lot along the way, so I plan to write a few articles to record it. As the opening post, this one briefly covers how we set up ES and some of the details.

Preparation

When we started our evaluation, the latest ES release was 7.6.1. For a project with no legacy baggage, the latest version was the natural choice. For ordinary users like us, the biggest changes in ES 7 compared with earlier versions are:

  1. Mapping types (type) are removed
  2. Vector search is supported
  3. A JDK is bundled

With types removed, the relationship between an index and its fields is clearer, and it is easier to think of it in database terms, as a table and its columns. Vector search gives our algorithm services vector operations and recall based on vector models. The bundled JDK needs little explanation: it makes deploying ES more convenient.

Our operations team gave us three physical machines, each with 24 cores and 128 GB of memory. For machines of this size, the official ES guidance says:

Suppose you have a machine with 128 GB of memory. You can run two nodes on it, giving each node a heap of no more than 32 GB. That is, no more than 64 GB goes to the ES heaps, and the remaining 64 GB or more is left for Lucene.

The details are in the official documentation.
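The arithmetic behind that recommendation can be sketched in a few lines of Python. This is only an illustration: the function name `plan_memory` and the 30 GB per-node cap are assumptions chosen to stay safely under the 32 GB limit quoted above, and the "at most half of the RAM for heaps" rule is the usual rule of thumb rather than something measured on our machines.

```python
# Rough heap-sizing sketch for one machine that runs several ES nodes.
# HEAP_CAP_GB is an assumed value kept safely below the 32 GB limit.
HEAP_CAP_GB = 30


def plan_memory(total_ram_gb: int, nodes: int, heap_cap_gb: int = HEAP_CAP_GB) -> dict:
    """Split RAM between the ES heaps and the memory left over for Lucene."""
    if total_ram_gb <= 0 or nodes <= 0:
        raise ValueError("total_ram_gb and nodes must be positive")
    # Heaps get at most half of the RAM in total, and never more than the cap per node.
    heap_per_node = min(heap_cap_gb, total_ram_gb // (2 * nodes))
    heap_total = heap_per_node * nodes
    return {
        "heap_per_node_gb": heap_per_node,
        "heap_total_gb": heap_total,
        "left_for_lucene_gb": total_ram_gb - heap_total,
    }


print(plan_memory(128, 2))
# {'heap_per_node_gb': 30, 'heap_total_gb': 60, 'left_for_lucene_gb': 68}
```

For a 128 GB machine with two nodes this gives 30 GB per heap and leaves 68 GB to the operating system's file cache, which is what Lucene relies on.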

Installation and deployment

With the JDK bundled, installing ES is very simple and stays true to its usual out-of-the-box approach. You only need to pay some attention to the cluster configuration when building the cluster.

First, download the package that matches your system from the download page and extract it:

tar -zxvf elasticsearch-7.6.1-linux-x86_64.tar.gz
cd elasticsearch-7.6.1

The main configuration files are:

./config/elasticsearch.yml     # ES settings
./config/jvm.options     # JVM settings

The key settings in elasticsearch.yml are:

cluster.name

The cluster name. Every node in the same cluster must use the same value.

node.name

The node name. Each node in a cluster must have a different name.

node.data/node.master

The node role. Generally a node should be either a master node or a data node; splitting the roles this way is cleaner.

path.data/path.logs

The directories where data files and log files are stored.

http.port

The port used by the search API (HTTP). The default is 9200.

transport.port

The port used for communication between nodes inside the cluster. The default is 9300.

discovery.seed_hosts

The addresses (IP and port) that a node probes when looking for the cluster, so that nodes can discover each other and join it. The ports listed here are transport ports.

Putting this together, a complete configuration file for a master node looks like this:

cluster.name: my_cluster_name
node.name: node1
node.master: true
node.data: false
path.data: /mnt/storage01/esdata
path.logs: /mnt/storage01/log/elasticsearch
bootstrap.memory_lock: false
bootstrap.system_call_filter: false
network.host: 0.0.0.0
http.port: 9212
transport.port: 9312
discovery.seed_hosts: ['10.70.54.28:9311','10.70.54.28:9312','10.70.54.27:9312','10.70.54.27:9311','10.70.54.26:9312']
http.cors.enabled: true
http.cors.allow-origin: "*"
indices.memory.index_buffer_size: 40%
thread_pool.write.queue_size: 1024

Similarly, the configuration file for a data node is shown below. Note how node.master and node.data differ from the master node's file:

cluster.name: my_cluster_name
node.name: node2
node.master: false
node.data: true
path.data: /mnt/storage01/esdata
path.logs: /mnt/storage01/log/elasticsearch
bootstrap.memory_lock: false
bootstrap.system_call_filter: false
network.host: 0.0.0.0
http.port: 9211
transport.port: 9311
discovery.seed_hosts: ['10.70.54.28:9311','10.70.54.28:9312','10.70.54.27:9312','10.70.54.27:9311','10.70.54.26:9312']
http.cors.enabled: true
http.cors.allow-origin: "*"
indices.memory.index_buffer_size: 40%
thread_pool.write.queue_size: 1024
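Since the two files differ only in a handful of lines, the role-specific part can be generated instead of edited by hand. The following Python sketch is one way to do that; `render_node_config` is a hypothetical helper, not part of ES, it emits only the settings that vary between nodes, and it uses the `transport.port` name from the settings list above.

```python
from typing import List


def render_node_config(node_name: str, is_master: bool, http_port: int,
                       transport_port: int, seed_hosts: List[str],
                       cluster_name: str = "my_cluster_name") -> str:
    """Return the role-specific part of an elasticsearch.yml as text."""
    # Each node is rendered as either a dedicated master or a dedicated data node.
    quoted_hosts = ", ".join(f"'{host}'" for host in seed_hosts)
    lines = [
        f"cluster.name: {cluster_name}",
        f"node.name: {node_name}",
        f"node.master: {str(is_master).lower()}",
        f"node.data: {str(not is_master).lower()}",
        f"http.port: {http_port}",
        f"transport.port: {transport_port}",
        f"discovery.seed_hosts: [{quoted_hosts}]",
    ]
    return "\n".join(lines) + "\n"


# Two of the seed addresses from the files above, used here only as sample input.
seeds = ["10.70.54.28:9311", "10.70.54.28:9312"]
print(render_node_config("node1", True, 9212, 9312, seeds))
print(render_node_config("node2", False, 9211, 9311, seeds))
```

The shared settings (paths, CORS, buffer sizes) would still be appended unchanged to every file.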

The jvm.options file is much simpler. Only two lines need to change:

-Xms30g
-Xmx30g

These set both the minimum and the maximum heap size to 30 GB.
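ES expects the initial and maximum heap sizes to match, so it can be worth checking the file before starting a node. Below is a minimal Python sketch of such a check; the helper names `heap_sizes` and `heap_is_fixed` are made up for this example, and it only understands plain `-Xms`/`-Xmx` lines with a k, m, or g suffix.

```python
import re

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
_HEAP_FLAG = re.compile(r"-(Xms|Xmx)(\d+)([kKmMgG])")


def heap_sizes(jvm_options_text: str) -> dict:
    """Return the -Xms/-Xmx values found in the text, converted to bytes."""
    sizes = {}
    for raw_line in jvm_options_text.splitlines():
        match = _HEAP_FLAG.fullmatch(raw_line.strip())
        if match:
            flag, number, unit = match.groups()
            sizes[flag] = int(number) * _UNITS[unit.lower()]
    return sizes


def heap_is_fixed(jvm_options_text: str) -> bool:
    """True when both flags are present and describe the same size."""
    sizes = heap_sizes(jvm_options_text)
    return "Xms" in sizes and sizes.get("Xms") == sizes.get("Xmx")


sample = "# heap settings\n-Xms30g\n-Xmx30g\n"
print(heap_is_fixed(sample))  # True
```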

Once everything is configured, create a dedicated user to run the ES service, because ES refuses to run as root:

adduser --disabled-password --gecos "" es
chown -R es:es elasticsearch

Even so, the first start of ES usually fails with one more error:

max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

Following an answer on Stack Overflow, raise the limit (this setting lasts until the next reboot):

sysctl -w vm.max_map_count=262144

Or make the change permanent by adding a line to the system configuration file, then run sysctl -p to apply it:

vim /etc/sysctl.conf
vm.max_map_count=262144
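To confirm that the new value is in effect before starting ES, the current limit can be read back from /proc. This is a small Python sketch under the assumption of a Linux host; the function names are illustrative, and 262144 is the minimum quoted in the error message above.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # minimum quoted in the ES error message


def is_high_enough(raw_value: str, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Compare the text read from the kernel with the required minimum."""
    return int(raw_value.strip()) >= required


def max_map_count_ok(path: str = "/proc/sys/vm/max_map_count") -> bool:
    """Read the current vm.max_map_count from the given file and check it."""
    return is_high_enough(Path(path).read_text())
```

On the target machine, `max_map_count_ok()` should return True once the sysctl change has been applied.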

Finally, start the node as a daemon with bin/elasticsearch -d.
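Once the nodes are up, the cluster health API (`_cluster/health`) is a quick way to see whether they have found each other. The Python sketch below uses only the standard library; the helper names are assumptions for this example, and the host and port passed in must match the `http.port` of the node being queried.

```python
import json
from urllib.request import urlopen


def health_url(host: str, http_port: int) -> str:
    """Build the URL of the cluster health endpoint for one node."""
    return f"http://{host}:{http_port}/_cluster/health"


def summarize_health(body: str) -> str:
    """Turn the JSON body of a health response into a one-line summary."""
    health = json.loads(body)
    return "{status}: {number_of_nodes} nodes, {number_of_data_nodes} data nodes".format(**health)


def fetch_health(host: str, http_port: int, timeout: float = 5.0) -> str:
    """Query a running node and return the summary line."""
    with urlopen(health_url(host, http_port), timeout=timeout) as response:
        return summarize_health(response.read().decode("utf-8"))
```

For example, `fetch_health("127.0.0.1", 9212)` queries the master node configured above; the node count in the summary should match the number of nodes that were started.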
