Elasticsearch integration

原文：https://docs.gitlab.com/ee/integration/elasticsearch.html

Version Requirements
Installing Elasticsearch
Elasticsearch repository indexer
- Omnibus GitLab
- From source
  - Debian / Ubuntu
  - CentOS / RHEL
    - Mac OSX

Building and installing

System Requirements
Enabling Elasticsearch
- Limiting namespaces and projects
Disabling Elasticsearch
Adding GitLab’s data to the Elasticsearch index
GitLab Elasticsearch Rake tasks
- Environment variables
- Indexing a specific project
Elasticsearch index scopes
Tuning
Troubleshooting

Elasticsearch integration

版本历史

在 GitLab Starter 8.4 中引入 .
在 GitLab Starter 9.0 中引入了对Amazon Elasticsearch 的支持.

本文档介绍了如何使用 GitLab 设置 Elasticsearch. 启用后，您将受益于快速的搜索响应时间以及两个特殊搜索的优点：

Version Requirements

GitLab 版本	Elasticsearch 版本
GitLab 企业版 8.4-8.17	Elasticsearch 2.4 with Delete By Query Plugin installed
GitLab 企业版 9.0-11.4	弹性搜索 5.1-5.5
GitLab 企业版 11.5-12.6	Elasticsearch 5.6-6.x
manbetx 客户端打不开企业版 12.7+	Elasticsearch 6.x-7.x

Installing Elasticsearch

Omnibus 软件包中不包括 Elasticsearch. 无论您使用的是 Omnibus 软件包还是从源代码安装的 GitLab，您都必须自行安装 . 提供有关安装 Elasticsearch 的详细信息超出了本文档的范围.

注意： Elasticsearch 应该安装在单独的服务器上，无论您自己安装还是使用云托管产品，例如 Elastic 的Elasticsearch Service （在 AWS，GCP 或 Azure 上可用）或Amazon Elasticsearch Service. 不建议在与 GitLab 相同的服务器上运行 Elasticsearch，这可能会导致 GitLab 实例性能下降.注意： 对于单节点 Elasticsearch 集群，功能集群的运行状况为黄色 （永远不会为绿色），因为已分配了主要分片，但副本却无法存在，因为没有其他节点可以为 Elasticsearch 分配副本.

将数据添加到数据库或存储库中并在管理区域中启用 Elasticsearch 后，搜索索引将自动更新.

Elasticsearch repository indexer

为了索引 Git 存储库数据，GitLab 使用了Go 语言编写的索引器 .

Go 索引器的安装方式取决于您的 GitLab 版本：

对于 Omnibus GitLab 11.8 及更高版本，请参见Omnibus GitLab .
For installations from source or older versions of Omnibus GitLab, install the indexer From Source.

Omnibus GitLab

从 GitLab 11.8 开始，Omnibus GitLab 中包含了 Go 索引器. 以前的基于 Ruby 的索引器已在GitLab 12.3中删除.

From source

首先，我们需要安装一些依赖项，然后构建并安装索引器本身.

该项目依靠ICU进行文本编码，因此我们需要确保在运行make之前安装了适用于您平台的开发包.

Debian / Ubuntu

要在 Debian 或 Ubuntu 上安装，请运行：

sudo apt install libicu-dev

CentOS / RHEL

要在 CentOS 或 RHEL 上安装，请运行：

sudo yum install libicu-devel

Mac OSX

要在 macOS 上安装，请运行：

brew install icu4c
export PKG_CONFIG_PATH="/usr/local/opt/icu4c/lib/pkgconfig:$PKG_CONFIG_PATH"

Building and installing

要构建和安装索引器，请运行：

indexer_path=/home/git/gitlab-elasticsearch-indexer

# Run the installation task for gitlab-elasticsearch-indexer:
sudo -u git -H bundle exec rake gitlab:indexer:install[$indexer_path] RAILS_ENV=production
cd $indexer_path && sudo make install

The gitlab-elasticsearch-indexer will be installed to /usr/local/bin.

您可以使用PREFIX env 变量更改安装路径. 如果这样做，请记住将-E标志传递给sudo .

Example:

PREFIX=/usr sudo -E make install

安装完成后，请在您实例的 Elasticsearch 设置下启用，如下所述 .

System Requirements

Elasticsearch 需要的资源超出了GitLab 系统要求中记录的资源. 这些会因安装大小而异，但是，根据官方指南，您应确保每个 Elasticsearch 节点至少至少有8 GiB RAM .

请记住，这是 Elasticsearch 的最低要求 . 对于生产实例，他们建议使用更多的资源.

所需的存储空间在很大程度上取决于要存储在 GitLab 中的存储库的大小，但是根据经验，由于所有存储库加起来，您应该至少有 50％的可用空间.

Enabling Elasticsearch

为了启用 Elasticsearch，您需要具有管理员权限. 导航到管理区域 （扳手图标），然后导航到设置>集成并展开Elasticsearch部分.

单击保存更改以使更改生效.

可以使用以下 Elasticsearch 设置：

Parameter	Description
`Elasticsearch indexing`	启用/禁用 Elasticsearch 索引编制. 例如，您可能想要启用索引编制但禁用搜索以使索引有时间完全完成. 另外，请记住，此选项对现有数据没有任何影响，仅启用/禁用跟踪数据更改的后台索引器. 因此，通过启用此功能，您将不会为现有数据建立索引，请按照将 GitLab 的数据添加到 Elasticsearch 索引中所述的那样，使用特殊的 Rake 任务.
`Elasticsearch pause indexing`	启用/禁用临时索引暂停. 这对于群集迁移/重新索引很有用. 仍会跟踪所有更改，但是直到未暂停它们才提交给 Elasticsearch 索引.
`Search with Elasticsearch enabled`	在搜索中使用 Elasticsearch 启用/禁用.
`URL`	用于连接到 Elasticsearch 的 URL. 使用逗号分隔的列表来支持群集（例如， `http://host1, https://host2:9200` ）. 如果您的 Elasticsearch 实例受密码保护，请在 URL 中传递`username:password` （例如， `http://<username>:<password>@<elastic_host>:9200/` ）.
`Number of Elasticsearch shards`	出于性能原因，Elasticsearch 索引分为多个碎片. 通常，较大的索引需要具有更多的分片. 在重新创建索引之前，对该值所做的更改才会生效. 您可以在Elasticsearch 文档中阅读有关权衡的更多信息
`Number of Elasticsearch replicas`	每个 Elasticsearch 分片可以具有多个副本. 这些是分片的完整副本，可以提高查询性能或抵御硬件故障. 增大该值将大大增加索引所需的总磁盘空间.
`Limit namespaces and projects that can be indexed`	启用此选项将允许您选择要索引的名称空间和项目. 所有其他名称空间和项目将改为使用数据库搜索. 请注意，如果启用此选项但未选择任何名称空间或项目，则不会为它们建立索引. 在下面阅读更多内容 .
`Using AWS hosted Elasticsearch with IAM credentials`	使用AWS IAM 授权或AWS EC2 实例配置文件凭证对 Elasticsearch 请求进行签名. 必须将策略配置为允许`es:*`操作.
`AWS Region`	您的 Elasticsearch 服务所在的 AWS 区域.
`AWS Access Key`	AWS 访问密钥.
`AWS Secret Access Key`	AWS 秘密访问密钥.
`Maximum field length`	See the explanation in instance limits..
`Maximum bulk request size (MiB)`	最大批量请求大小由 GitLab 基于 Golang 的索引器流程使用，并指示在将有效负载提交给 Elasticsearch 的批量 API 之前，它应在给定的索引过程中收集（并存储在内存中）多少数据. 此设置应与批量请求并发设置一起使用（请参见下文），并且需要通过`gitlab-rake`命令或通过`gitlab-rake`命令来适应 Elasticsearch 主机和运行 GitLab 基于 Golang 的索引器的主机的资源限制. Sidekiq 任务.
`Bulk request concurrency`	批量请求并发指示可以并行运行多少个 GitLab 基于 Golang 的索引器进程（或线程）来收集数据，然后将其提交给 Elasticsearch 的批量 API. 这可以提高索引编制性能，但可以更快地填充 Elasticsearch 批量请求队列. 此设置应与最大批量请求大小设置（请参见上文）一起使用，并且需要容纳`gitlab-rake`的 Elasticsearch 主机和运行 GitLab 基于 Golang 的索引器的主机的资源限制命令或 Sidekiq 任务.

Limiting namespaces and projects

如果选择" Limit namespaces and projects that can be indexed更多选项可用

您可以选择名称空间和项目以排他索引. 请注意，如果名称空间是一个组，它将包括所有子组以及属于这些子组的要被索引的项目.

如果所有名称空间都已建立索引，Elasticsearch 仅提供跨组代码/提交搜索（全局）. 在仅对名称空间的子集进行索引的特定情况下，全局搜索将不提供代码或提交范围. 这仅在索引命名空间的范围内才有可能. 当前，无法对多个已索引的名称空间进行编码/提交搜索（当仅对名称空间的子集进行索引时）. 例如，如果两个组都已建立索引，则无法在两个组上运行单个代码搜索. 您只能在第一组上然后在第二组上运行代码搜索.

您可以通过编写感兴趣的部分名称空间或项目名称来过滤选择下拉列表.

注意：如果未选择名称空间或项目，则不会进行 Elasticsearch 索引.警告：如果您已经为实例建立索引，则必须重新生成索引，才能删除所有现有数据以进行过滤以使其正常工作. 为此，请运行 Rake 任务gitlab:elastic:recreate_index和gitlab:elastic:clear_index_status . 然后，从列表中删除名称空间或项目将按预期从 Elasticsearch 索引中删除数据.

Disabling Elasticsearch

要禁用 Elasticsearch 集成：

导航到管理区域 （扳手图标），然后导航到设置>集成 .
展开Elasticsearch部分，然后取消选中Elasticsearch 索引和启用 Elasticsearch 的 搜索 .
单击保存更改以使更改生效.

（可选）删除现有索引：

# Omnibus installations
sudo gitlab-rake gitlab:elastic:delete_index

# Installations from source
bundle exec rake gitlab:elastic:delete_index RAILS_ENV=production

Adding GitLab’s data to the Elasticsearch index

启用 Elasticsearch 索引编制后，您的 GitLab 实例中的新更改将在发生更改时自动进行索引. 要回填现有数据，可以使用以下方法之一在后台作业中对其进行索引.

Indexing through the administration UI

Introduced in GitLab Starter 12.3.

通过管理区域建立索引：

Configure your Elasticsearch host and port.

创建空索引：

# Omnibus installations
sudo gitlab-rake gitlab:elastic:create_empty_index

# Installations from source
bundle exec rake gitlab:elastic:create_empty_index RAILS_ENV=production

Enable Elasticsearch indexing.

Click 索引所有项目 in 管理区域>设置>集成> Elasticsearch.
单击确认消息中的检查进度 ，以查看后台作业的状态.

个人摘要需要手动编制索引：

# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_snippets

# Installations from source
bundle exec rake gitlab:elastic:index_snippets RAILS_ENV=production

索引完成后，启用Search with Elasticsearch .

Indexing through Rake tasks

可以使用 Rake 任务执行索引.

Indexing small instances

警告：这将删除您现有的索引.

如果数据库大小小于 500 MiB，并且所有托管存储库的大小小于 5 GiB：

Configure your Elasticsearch host and port.

索引您的数据：

# Omnibus installations
sudo gitlab-rake gitlab:elastic:index

# Installations from source
bundle exec rake gitlab:elastic:index RAILS_ENV=production

索引完成后，启用Search with Elasticsearch .

Indexing large instances

警告：执行异步索引编制将生成很多 Sidekiq 作业. 确保具有可扩展和高度可用的设置或创建额外的 Sidekiq 进程以为该任务做准备

Configure your Elasticsearch host and port.

创建空索引：

# Omnibus installations
sudo gitlab-rake gitlab:elastic:create_empty_index

# Installations from source
bundle exec rake gitlab:elastic:create_empty_index RAILS_ENV=production

如果这是您的 GitLab 实例的重新索引，请清除索引状态：

# Omnibus installations
sudo gitlab-rake gitlab:elastic:clear_index_status

# Installations from source
bundle exec rake gitlab:elastic:clear_index_status RAILS_ENV=production

Enable Elasticsearch indexing.

索引大型 Git 存储库可能需要一段时间. 为了加快此过程，您可以暂时禁用自动刷新和复制. 根据我们的经验，您可以预期索引编制时间将减少 20％. 索引完成后，我们将启用它们. 此步骤是可选的！
```
curl --request PUT localhost:9200/gitlab-production/_settings --header 'Content-Type: application/json' --data '{
   "index" : {
       "refresh_interval" : "-1",
       "number_of_replicas" : 0
   } }' 
```
1. 索引项目及其相关数据：
```
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_projects

# Installations from source
bundle exec rake gitlab:elastic:index_projects RAILS_ENV=production 
```
这为需要索引的每个项目加入了一个 Sidekiq 作业. 您可以在管理区域>监视>后台作业>队列选项卡中查看作业，然后单击elastic_indexer ，也可以使用 Rake 任务查询索引状态：
```
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_projects_status

# Installations from source
bundle exec rake gitlab:elastic:index_projects_status RAILS_ENV=production

Indexing is 65.55% complete (6555/10000 projects) 
```
如果要将索引限制为一定范围的项目，可以提供ID_FROM和ID_TO参数：
```
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_projects ID_FROM=1001 ID_TO=2000

# Installations from source
bundle exec rake gitlab:elastic:index_projects ID_FROM=1001 ID_TO=2000 RAILS_ENV=production 
```
其中ID_FROM和ID_TO是项目 ID. 这两个参数都是可选的. 上面的示例将索引从 ID 1001到（并包括）ID 2000所有项目.
故障排除：有时gitlab:elastic:index_projects排队的项目索引作业可能会中断. 出于多种原因可能会发生这种情况，但是再次运行索引任务始终是安全的. 它将跳过已经建立索引的存储库.
当索引器在数据库中存储每个已索引存储库的最后一次提交 SHA 时，您可以使用特殊参数UPDATE_INDEX运行索引器，它将再次检查每个项目存储库以确保对存储库中的每个提交都进行了索引，这很有用.如果您的索引已过时：
```
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_projects UPDATE_INDEX=true ID_TO=1000

# Installations from source
bundle exec rake gitlab:elastic:index_projects UPDATE_INDEX=true ID_TO=1000 RAILS_ENV=production 
```
您还可以使用gitlab:elastic:clear_index_status Rake 任务来强制索引器"忘记"所有进度，因此它将从头开始重试索引过程.

个人代码段与项目无关，需要单独编制索引：

# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_snippets

# Installations from source
bundle exec rake gitlab:elastic:index_snippets RAILS_ENV=production

在建立索引后再次启用复制和刷新（仅在您之前禁用它的情况下）：

curl --request PUT localhost:9200/gitlab-production/_settings --header 'Content-Type: application/json' --data '{
   "index" : {
       "number_of_replicas" : 1,
       "refresh_interval" : "1s"
   } }'

启用上述刷新后，应调用强制合并.

对于 Elasticsearch 6.x，在进行强制合并之前，索引应处于只读模式：

curl --request PUT localhost:9200/gitlab-production/_settings --header 'Content-Type: application/json' --data '{
 "settings": {
   "index.blocks.write": true
 } }'

然后，启动强制合并：

curl --request POST 'localhost:9200/gitlab-production/_forcemerge?max_num_segments=5'

此后，如果索引处于只读模式，请切换回读写模式：

curl --request PUT localhost:9200/gitlab-production/_settings --header 'Content-Type: application/json' --data '{
 "settings": {
   "index.blocks.write": false
 } }'

索引完成后，启用Search with Elasticsearch .

Indexing limitations

对于存储库和摘要文件，GitLab 最多只能索引 1 MiB 的内容，以避免索引超时.

GitLab Elasticsearch Rake tasks

耙任务可用于：

生成并安装索引器.
禁用 Elasticsearch时删除索引.
将 GitLab 数据添加到索引.

以下是一些可用的 Rake 任务：

Task	Description
`sudo gitlab-rake gitlab:elastic:index`	启用 Elasticsearch 索引并运行`gitlab:elastic:create_empty_index` ， `gitlab:elastic:clear_index_status` ， `gitlab:elastic:index_projects`和`gitlab:elastic:index_snippets` .
`sudo gitlab-rake gitlab:elastic:index_projects`	遍历所有项目并在 Sidekiq 作业中排队以在后台对它们进行索引.
`sudo gitlab-rake gitlab:elastic:index_projects_status`	确定索引的总体状态. 通过计算索引项目的总数，除以项目总数，然后乘以 100 即可完成.
`sudo gitlab-rake gitlab:elastic:clear_index_status`	删除所有项目的 IndexStatus 的所有实例.
`sudo gitlab-rake gitlab:elastic:create_empty_index[<TARGET_NAME>]`	生成一个空索引，并为该索引分配一个别名（仅当该索引不存在时）.
`sudo gitlab-rake gitlab:elastic:delete_index[<TARGET_NAME>]`	删除 Elasticsearch 实例上的 GitLab 索引和别名（如果存在）.
`sudo gitlab-rake gitlab:elastic:recreate_index[<TARGET_NAME>]`	`gitlab:elastic:delete_index[<TARGET_NAME>]`和`gitlab:elastic:create_empty_index[<TARGET_NAME>]`包装器任务.
`sudo gitlab-rake gitlab:elastic:index_snippets`	执行 Elasticsearch 导入，对片段数据进行索引.
`sudo gitlab-rake gitlab:elastic:projects_not_indexed`	显示未索引的项目.
`sudo gitlab-rake gitlab:elastic:reindex_cluster`	安排零停机群集重新索引编制任务. 此功能应与在 GitLab 13.0 之后创建的索引一起使用.

注意： TARGET_NAME参数是可选的，如果未设置，它将使用当前RAILS_ENV的默认索引/别名.

Environment variables

除了 Rake 任务，还有一些环境变量可用于修改过程：

环境变量	数据类型	它能做什么
`UPDATE_INDEX`	Boolean	告诉索引器覆盖任何现有索引数据（对/错）.
`ID_TO`	Integer	告诉索引器仅索引小于或等于该值的项目.
`ID_FROM`	Integer	告诉索引器仅索引大于或等于该值的项目.

Indexing a specific project

因为ID_TO和ID_FROM环境变量使用or equal to比较，所以您可以通过使用两个具有相同项目 ID 号的变量来仅索引一个项目：

root@git:~# sudo gitlab-rake gitlab:elastic:index_projects ID_TO=5 ID_FROM=5
Indexing project repositories...I, [2019-03-04T21:27:03.083410 #3384]  INFO -- : Indexing GitLab User / test (ID=33)...
I, [2019-03-04T21:27:05.215266 #3384]  INFO -- : Indexing GitLab User / test (ID=33) is done!

Elasticsearch index scopes

执行搜索时，GitLab 索引将使用以下范围：

范围名称	What it searches
`commits`	提交数据
`projects`	项目数据（默认）
`blobs`	Code
`issues`	发行数据
`merge_requests`	合并请求数据
`milestones`	里程碑数据
`notes`	笔记数据
`snippets`	片段数据
`wiki_blobs`	维基内容

Tuning

Guidance on choosing optimal cluster configuration

有关选择集群配置的基本指导，请参考Elastic Cloud Calculator . 您可以在下面找到更多信息.

通常，您将希望至少将一个 2 节点群集配置与一个副本一起使用，这将使您具有弹性. 如果存储使用量快速增长，则可能需要预先计划水平扩展（添加更多节点）.
不建议将 HDD 存储与搜索群集一起使用，因为它会影响性能. 最好使用 SSD 存储（例如 NVMe 或 SATA SSD 驱动器）.
您可以使用GitLab Performance Tool来对不同搜索集群大小和配置的搜索性能进行基准测试.
Heap size应设置为不超过物理 RAM 的 50％. 此外，不应将其设置为超过基于零的压缩 oop 的阈值. 确切的阈值会有所不同，但是在大多数系统上 26 GB 是安全的，但在某些系统上也可能高达 30 GB. 有关更多详细信息，请参见设置堆大小 .
每个节点的 CPU（CPU 核心）数通常对应于以下所述Number of Elasticsearch shards设置.
一个好的指导方针是确保每个节点的已配置 GB 堆中的碎片数保持在 20 个以下. 因此，具有 30GB 堆的节点最多应具有 600 个分片，但是越低于此限制，您可以使其越好. 通常，这将有助于群集保持良好的运行状况.
小碎片会导致小段，这会增加开销. 旨在将平均分片大小保持在至少几 GB 到几十 GB 之间. 另一个考虑因素是文档数量，您应该针对此简单的分片number of expected documents / 5M +1公式： number of expected documents / 5M +1 .
refresh_interval是每个索引设置. 如果您不需要实时数据，则可能需要将其从默认的1s调整为更大的值. 这将改变您看到新结果的时间. 如果这对您很重要，则应使其尽可能接近默认值.
如果您有大量繁重的索引操作，则可能需要将indices.memory.index_buffer_size提高到 30％或 40％.