Connecting Elasticsearch with Python

知行编程网, 2023-01-13 04:00

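Introduction: this article explains how to connect Elasticsearch with Python, in the hope that it helps readers who are learning to program.

What is Elasticsearch

Looking things up means searching, and searching means using a search engine. Baidu and Google are enormous, complex search engines that index nearly every publicly accessible web page on the internet. For our own business data, though, technology that complex is unnecessary. If we want to build our own search engine for convenient storage and retrieval, Elasticsearch is a very good choice. It is a full-text search engine that can store, search, and analyze massive amounts of data quickly and efficiently.

Why use Elasticsearch

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library.

So what is Lucene? Lucene is arguably the most advanced, high-performance, full-featured search engine library available today, open source or proprietary, but it is only a library. To use it you have to write Java, pull in the Lucene packages, and know enough about information retrieval to understand how Lucene works. In short, it is not easy to use.

Elasticsearch was created to solve this problem. It is also written in Java and uses Lucene internally for indexing and searching, but its goal is to make full-text search simple. It wraps Lucene and exposes a simple, consistent RESTful API for storing and retrieving data.

So is Elasticsearch just a thin wrapper around Lucene? That would be a serious misconception. Elasticsearch is more than Lucene, and more than a full-text search engine. It can be described more precisely as: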

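· A distributed real-time document store in which every field can be indexed and searched
· A distributed real-time analytics search engine
· Able to scale to hundreds of server nodes and to handle PB-scale structured or unstructured data

In short, it is a very powerful search engine. Wikipedia, Stack Overflow, and GitHub all use it for search.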

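Installing Elasticsearch

Elasticsearch can be downloaded from the official site at https://www.elastic.co/downloads/elasticsearch, which also provides installation instructions.

Download and extract the package, then start Elasticsearch by running bin/elasticsearch (Mac or Linux) or bin\elasticsearch.bat (Windows).

I use a Mac, where I recommend installing with Homebrew: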

brew install elasticsearch

By default Elasticsearch runs on port 9200. Open http://localhost:9200/ in a browser and you should see something like this:

{
  "name" : "atntrTf",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "e64hkjGtTp6_G2h1Xxdv5g",
  "version" : {
    "number": "6.2.4",
    "build_hash": "ccec39f",
    "build_date": "2018-04-12T20:37:28.497551Z",
    "build_snapshot": false,
    "lucene_version": "7.2.1",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

If you see this output, Elasticsearch is installed and running. It shows that my Elasticsearch version is 6.2.4. The version matters: any plugins installed later must match it.
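Because plugins have to match the server version, it can be handy to read that version programmatically. The following is a minimal sketch using only the standard library and an abbreviated copy of the sample response above. The helper name parse_version is hypothetical, and the sketch assumes a plain numeric version string such as 6.2.4.

```python
import json

# Abbreviated copy of the root-endpoint response shown above.
SAMPLE_RESPONSE = '''
{
  "name": "atntrTf",
  "cluster_name": "elasticsearch",
  "version": {"number": "6.2.4", "lucene_version": "7.2.1"},
  "tagline": "You Know, for Search"
}
'''


def parse_version(response_text):
    """Return the Elasticsearch version as a tuple of ints, e.g. (6, 2, 4).

    Assumes a purely numeric version string; suffixes are not handled.
    """
    info = json.loads(response_text)
    number = info["version"]["number"]
    return tuple(int(part) for part in number.split("."))


version = parse_version(SAMPLE_RESPONSE)
print(version)
```

Against a live node, the same text could be fetched from http://localhost:9200/ with any HTTP client before being passed to the helper.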

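Next, let's look at the basic concepts of Elasticsearch and how to use it from Python.

Elasticsearch concepts

Elasticsearch has a few basic concepts, such as nodes, indices, and documents, which are explained below. Understanding them makes Elasticsearch much easier to get familiar with.

Node and Cluster

Elasticsearch is essentially a distributed database that lets multiple servers work together, and each server can run multiple Elasticsearch instances.

A single Elasticsearch instance is called a node. A group of nodes forms a cluster.

Index

Elasticsearch indexes every field and, after processing, writes it to an inverted index. Queries look data up directly in that index.

For this reason, the top-level unit of data management in Elasticsearch is called an Index. It is roughly equivalent to a database in MySQL or MongoDB. Note that the name of every Index (that is, of every database) must be lowercase.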
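To make the idea of an inverted index concrete, here is a toy sketch in plain Python. It only illustrates the term-to-documents mapping and is not how Lucene or Elasticsearch implement it; the function names and sample sentences are made up for the example, and whitespace splitting stands in for a real analyzer.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Elasticsearch is a search engine",
    2: "Lucene is a search library",
    3: "Python is a programming language",
}
index = build_inverted_index(docs)
print(sorted(lookup(index, "search")))
print(sorted(lookup(index, "search engine")))
```

A lookup never scans the documents themselves: it only intersects the id sets stored under each term, which is why searching the index is fast.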

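Document

A single record in an Index is called a Document. Many Documents make up an Index.

Documents are represented in JSON.

Documents in the same Index are not required to share the same structure (schema), but keeping them consistent improves search efficiency.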
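As an illustration, a Document can be pictured as a Python dict that is serialized to JSON. The record below is only a sample chosen for this sketch; its fields mirror the news data used later in the article.

```python
import json

# A sample Document for illustration; the field names are an assumption
# modeled on the news records used later in this article.
document = {
    "title": "公安部：各地校车将享最高路权",
    "url": "http://www.chinanews.com/gn/2011/12-16/3536077.shtml",
    "date": "2011-12-16",
}

# Serialize the dict to JSON text, keeping non-ASCII characters readable.
as_json = json.dumps(document, ensure_ascii=False, indent=2)
print(as_json)

# Parsing the JSON text gives back an equal dict.
restored = json.loads(as_json)
```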

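Type

Documents can be grouped. In a weather Index, for example, they can be grouped by city (Beijing and Shanghai) or by climate (sunny and rainy). Such a group is called a Type. It is a virtual, logical grouping used to filter Documents, similar to a table in MySQL or a collection in MongoDB.

Different Types should have similar structures (schemas). For example, the id field cannot be a string in one group and a number in another. This is a difference from tables in a relational database. Data with completely different properties (such as products and logs) should be stored in two Indexes rather than as two Types in one Index (although that is possible).

Elasticsearch 6.x allows only one Type per Index; 7.x deprecates Types, and 8.x removes them entirely.

Fields

These are the fields of a Document. Each Document is a JSON-like structure containing many fields, each with a corresponding value, and multiple fields make up a Document. They are comparable to the columns of a MySQL table.

In Elasticsearch, a Document belongs to a Type, and Types live inside an Index. A simple comparison with a traditional relational database looks like this: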

Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices   -> Types  -> Documents -> Fields

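These are the basic concepts of Elasticsearch; comparing them with a relational database makes them easier to understand.

Connecting to Elasticsearch from Python

Elasticsearch provides a set of RESTful APIs for access and querying. We could call them with tools such as curl, but the command line is not very convenient, so this article goes straight to connecting to Elasticsearch from Python.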
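To show what the RESTful API looks like underneath a client library, the sketch below builds (but does not send) an HTTP request for the _search endpoint using only the standard library. The helper name build_search_request and the local base URL are assumptions for illustration; sending the request requires a running node.

```python
import json
import urllib.request


def build_search_request(base_url, index, query):
    """Build, without sending, an HTTP request for the _search endpoint."""
    url = "{}/{}/_search".format(base_url.rstrip("/"), index)
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("http://localhost:9200", "news", {"match_all": {}})
print(request.get_method(), request.full_url)
# Sending it would be urllib.request.urlopen(request), which needs a live node.
```

A client library saves us from assembling URLs, bodies, and headers like this by hand for every operation.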

The Python library for connecting to Elasticsearch has the same name, and installing it is simple:

pip3 install elasticsearch

The official documentation is at https://elasticsearch-py.readthedocs.io/. It covers all usage, and the rest of this article follows it.

Creating an Index

Let's first see how to create an Index. Here we create one named news:

from elasticsearch import Elasticsearch
es = Elasticsearch()
result = es.indices.create(index='news', ignore=400)
print(result)

If creation succeeds, the following result is returned:

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'news'}

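The result is in JSON format, and its acknowledged field indicates that the create operation succeeded.

If we run the code again at this point, however, the following result is returned: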

{'error': {'root_cause': [{'type': 'resource_already_exists_exception',
'reason': 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists', 'index_uuid': 'QM6yz2W8QE-bflKhc5oThw',
'index': 'news'}], 'type': 'resource_already_exists_exception',
'reason': 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists', 'index_uuid': 'QM6yz2W8QE-bflKhc5oThw',
'index': 'news'}, 'status': 400}

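The creation failed with status code 400 because the Index already exists.

Note that our code passes ignore=400. This means that if the response status is 400, the error is ignored: the client returns the result instead of raising an exception, and the program keeps running.

Suppose we leave out the ignore parameter: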

es = Elasticsearch()
result = es.indices.create(index='news')
print(result)

Running it again then raises an error:

raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'resource_already_exists_exception', 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists')

That would interrupt the program, so it is worth using the ignore parameter deliberately to handle such situations and keep the program running.
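An alternative to ignoring the 400 status is to check for the Index before creating it. The sketch below assumes es is an elasticsearch.Elasticsearch client and relies on its indices.exists() and indices.create() methods. The helper name create_index_if_missing is hypothetical, and FakeClient is only a stand-in so that the example runs without a server.

```python
def create_index_if_missing(es, name):
    """Create the index only when it does not exist yet.

    `es` is assumed to be an elasticsearch.Elasticsearch client.
    Returns True if the index was created, False if it already existed.
    """
    if es.indices.exists(index=name):
        return False
    es.indices.create(index=name)
    return True


class FakeIndices:
    """Stand-in for es.indices so the sketch runs without a server."""

    def __init__(self):
        self.names = set()

    def exists(self, index):
        return index in self.names

    def create(self, index):
        self.names.add(index)
        return {"acknowledged": True, "index": index}


class FakeClient:
    """Stand-in for the real client; only the indices attribute is modeled."""

    def __init__(self):
        self.indices = FakeIndices()


es = FakeClient()
print(create_index_if_missing(es, "news"))  # first call creates the index
print(create_index_if_missing(es, "news"))  # second call finds it already there
```

The check and the create are two separate requests, so another process could still create the Index in between; passing ignore=400 as shown above remains the simpler safeguard.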

Deleting an Index

Deleting an Index works in much the same way:

from elasticsearch import Elasticsearch
es = Elasticsearch()
result = es.indices.delete(index='news', ignore=[400, 404])
print(result)

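The ignore parameter is used here as well, so that a failed deletion of a nonexistent Index does not interrupt the program.

If the deletion succeeds, the output is: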

{'acknowledged': True}

If the Index has already been deleted, running the deletion again outputs:

{'error': {'root_cause': [{'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 
'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}], 'type': 'index_not_found_exception', 
'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 
'news'}, 'status': 404}

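This result says that the Index does not exist and the deletion failed. The response is again JSON, with status code 404. Because we passed the ignore parameter, the 404 status is ignored, so the program runs normally and prints the JSON result instead of raising an exception.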
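Once errors are ignored, the calling code has to inspect the returned dict itself. Here is a small sketch of that check, based only on the two response shapes shown above; the helper name summarize_result is made up for this example, and the second sample is abbreviated.

```python
def summarize_result(result):
    """Describe a response dict returned when error statuses were ignored."""
    if result.get("acknowledged"):
        return "ok"
    error = result.get("error", {})
    return "{} ({})".format(error.get("type", "unknown_error"), result.get("status"))


# The two response shapes shown above (the second one abbreviated).
deleted = {"acknowledged": True}
missing = {
    "error": {"type": "index_not_found_exception", "reason": "no such index"},
    "status": 404,
}
print(summarize_result(deleted))
print(summarize_result(missing))
```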

Inserting data

Like MongoDB, Elasticsearch lets you insert structured dictionary data directly. To insert data, call the create() method. Here we insert one news item:

from elasticsearch import Elasticsearch
es = Elasticsearch()
es.indices.create(index='news', ignore=400)
data = {'title': '美国留给伊拉克的是个烂摊子吗', 'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm'}
result = es.create(index='news', doc_type='politics', id=1, body=data)
print(result)

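Here we first define a news item with a title and a link, then insert it by calling create(). The call takes four arguments: index is the Index name, doc_type is the Document type, body is the Document content, and id is the unique identifier of the record.

The result is: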

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 
'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}

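The result field is created, which means the data was inserted successfully.

We can also insert data with the index() method. The difference is that create() requires an id to identify the record uniquely, whereas index() does not: if no id is given, one is generated automatically. index() is called like this: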

es.index(index='news', doc_type='politics', body=data)

The create() method actually calls index() internally and is a wrapper around it.

Updating data

Updating data is just as simple. We again specify the id and the content and call update():

from elasticsearch import Elasticsearch
es = Elasticsearch()
data = {
    'title': '美国留给伊拉克的是个烂摊子吗',
    'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
    'date': '2011-12-16'
}
# The update API expects the partial document under the 'doc' key.
result = es.update(index='news', doc_type='politics', body={'doc': data}, id=1)
print(result)

Here we add a date field to the data and then call update(). The result is:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 2, 'result': 'updated', '_shards': {'total': 2, 
'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1}

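The result field in the response is updated, which means the update succeeded. There is also a _version field giving the version number after the update. The value 2 means this is the second version: the data was inserted once before, so the first insert was version 1, as the output of the earlier example shows. After this update the version becomes 2, and it increases by 1 with every update.

The update can also be done with the index() method: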

es.index(index='news', doc_type='politics', body=data, id=1)

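So index() covers both operations: if the data does not exist it performs an insert, and if it already exists it performs an update by replacing the stored Document, which is very convenient.

Deleting data

To delete a record, call delete() with the id of the record to delete: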

from elasticsearch import Elasticsearch
es = Elasticsearch()
result = es.delete(index='news', doc_type='politics', id=1)
print(result)

The result is:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 3, 'result': 'deleted', '_shards': {'total': 2, 
'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}

The result field is deleted, which means the deletion succeeded, and _version has increased by 1 to 3.
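The way result and _version change across insert, update, and delete can be imitated with a toy in-memory model. This is only an illustration of the behavior shown in the outputs above, not Elasticsearch's implementation; the class ToyIndex and its methods are invented for this sketch.

```python
class ToyIndex:
    """In-memory imitation of the result/_version fields shown above."""

    def __init__(self):
        self._docs = {}      # id -> document body
        self._versions = {}  # id -> last version number, kept after deletes

    def _bump(self, doc_id):
        """Increase and return the version counter for this id."""
        self._versions[doc_id] = self._versions.get(doc_id, 0) + 1
        return self._versions[doc_id]

    def index(self, doc_id, body):
        """Insert the document if the id is new, otherwise overwrite it."""
        result = "updated" if doc_id in self._docs else "created"
        self._docs[doc_id] = dict(body)
        return {"_id": doc_id, "_version": self._bump(doc_id), "result": result}

    def delete(self, doc_id):
        """Remove the document; raises KeyError if the id is unknown."""
        del self._docs[doc_id]
        return {"_id": doc_id, "_version": self._bump(doc_id), "result": "deleted"}


store = ToyIndex()
print(store.index("1", {"title": "first"}))
print(store.index("1", {"title": "first", "date": "2011-12-16"}))
print(store.delete("1"))
```

The three printed results carry versions 1, 2, and 3 with result values created, updated, and deleted, matching the sequence in the article's outputs.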

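Querying data

The operations above are all simple ones that common databases such as MongoDB can also perform, so they are nothing special. What sets Elasticsearch apart is its extremely powerful search capability.

For Chinese text we need to install a word segmentation (analysis) plugin. Here we use elasticsearch-analysis-ik, hosted on GitHub at https://github.com/medcl/elasticsearch-analysis-ik. We install it with elasticsearch-plugin, another Elasticsearch command-line tool. The version installed here is 6.2.4; make sure it matches your Elasticsearch version. The command is: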

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

Replace the version number here with your own Elasticsearch version.

After installation, restart Elasticsearch and it will load the installed plugin automatically.
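Since the version appears twice in the download URL, a tiny helper can keep the two in sync. This sketch simply reuses the URL pattern from the install command in this article; whether every release follows the same pattern is an assumption, and ik_plugin_url is a hypothetical name.

```python
# Base of the release download URLs, taken from the install command in this article.
IK_RELEASES = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def ik_plugin_url(es_version):
    """Build the IK plugin download URL for a given Elasticsearch version.

    Assumes other releases follow the same naming pattern as 6.2.4.
    """
    return "{base}/v{v}/elasticsearch-analysis-ik-{v}.zip".format(
        base=IK_RELEASES, v=es_version
    )


print("elasticsearch-plugin install " + ik_plugin_url("6.2.4"))
```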

First, we create a new Index and specify the field that needs word segmentation:

from elasticsearch import Elasticsearch
es = Elasticsearch()
mapping = {
    'properties': {
        'title': {
            'type': 'text',
            'analyzer': 'ik_max_word',
            'search_analyzer': 'ik_max_word'
        }
    }
}
es.indices.delete(index='news', ignore=[400, 404])
es.indices.create(index='news', ignore=400)
result = es.indices.put_mapping(index='news', doc_type='politics', body=mapping)
print(result)

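Here we first delete the old Index, then create a new one, then update its mapping. The mapping specifies the field to be segmented: its type is text, and both the analyzer and the search analyzer (search_analyzer) are set to ik_max_word, which comes from the Chinese analysis plugin we just installed. If no analyzer is specified, the default one is used, which is geared toward English text.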
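When several text fields need the same analyzer, the mapping dict can be generated rather than written by hand. A minimal sketch, assuming the same mapping shape as above; the helper name ik_text_mapping is hypothetical.

```python
def ik_text_mapping(fields, analyzer="ik_max_word"):
    """Return a mapping that analyzes each named text field with `analyzer`."""
    return {
        "properties": {
            field: {
                "type": "text",
                "analyzer": analyzer,
                "search_analyzer": analyzer,
            }
            for field in fields
        }
    }


# For the single title field this reproduces the mapping used in this article.
mapping = ik_text_mapping(["title"])
print(mapping)
```

The resulting dict can be passed as the body argument of put_mapping() in the same way as the handwritten one.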

Next we insert a few new records:

datas = [
    {
        'title': '美国留给伊拉克的是个烂摊子吗',
        'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
        'date': '2011-12-16'
    },
    {
        'title': '公安部：各地校车将享最高路权',
        'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',
        'date': '2011-12-16'
    },
    {
        'title': '中韩渔警冲突调查：韩警平均每天扣1艘中国渔船',
        'url': 'https://news.qq.com/a/20111216/001044.htm',
        'date': '2011-12-17'
    },
    {
        'title': '中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首',
        'url': 'http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml',
        'date': '2011-12-18'
    }
]
for data in datas:
    es.index(index='news', doc_type='politics', body=data)

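Here we define four records, each with title, url, and date fields, and insert them into Elasticsearch with index(). The Index name is news and the type is politics.

Next we query the data. A search without any query conditions returns everything: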

result = es.search(index='news', doc_type='politics')
print(result)

All four inserted records are returned:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "c05G9mQBD9BuE5fdHOUT",
        "_score": 1.0,
        "_source": {
          "title": "美国留给伊拉克的是个烂摊子吗",
          "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",
          "date": "2011-12-16"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 1.0,
        "_source": {
          "title": "中国驻洛杉矶领事馆遭亚裔男子枪击，嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 1.0,
        "_source": {
          "title": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dE5G9mQBD9BuE5fdHOUf",
        "_score": 1.0,
        "_source": {
          "title": "公安部：各地校车将享最高路权",
          "url": "http://www.chinanews.com/gn/2011/12-16/3536077.shtml",
          "date": "2011-12-16"
        }
      }
    ]
  }
}

The results appear in the hits field. Its total field gives the number of results, and max_score is the highest match score.
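In practice we usually want just the matched records rather than the whole response. The sketch below pulls the total and the (score, title) pairs out of a response dict shaped like the 6.x output in this article; the helper name extract_hits and the shortened sample data are illustrative only.

```python
def extract_hits(response):
    """Return (total, [(score, title), ...]) from a search response dict."""
    hits = response["hits"]
    rows = [(hit["_score"], hit["_source"]["title"]) for hit in hits["hits"]]
    return hits["total"], rows


# Shortened sample in the same shape as the responses in this article.
sample_response = {
    "hits": {
        "total": 2,
        "max_score": 2.546152,
        "hits": [
            {"_id": "a", "_score": 2.546152, "_source": {"title": "first title"}},
            {"_id": "b", "_score": 0.2876821, "_source": {"title": "second title"}},
        ],
    }
}
total, rows = extract_hits(sample_response)
print(total)
for score, title in rows:
    print("{:.2f}  {}".format(score, title))
```

Note that this matches the 6.x response format used in this article; in 7.x and later, hits.total is an object rather than a plain number, so the helper would need adjusting.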

Beyond this, we can run a full-text search, which is where Elasticsearch shows its strengths as a search engine:

import json
from elasticsearch import Elasticsearch

dsl = {
    'query': {
        'match': {
            'title': '中国 领事馆'
        }
    }
}
es = Elasticsearch()
result = es.search(index='news', doc_type='politics', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))

Here we query with the DSL that Elasticsearch supports, using match for a full-text search on the title field with the text “中国 领事馆”. The search results are:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.546152,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 2.546152,
        "_source": {
          "title": "中国驻洛杉矶领事馆遭亚裔男子枪击，嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 0.2876821,
        "_source": {
          "title": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      }
    ]
  }
}

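There are two matches: the first scores 2.54 and the second 0.28. The first record contains both “中国” and “领事馆”. The second contains “中国” but not “领事馆”, so it is still retrieved but with a lower score.

This shows that a search examines the full text of the given field and ranks the results by their relevance to the search keywords. That is the prototype of a basic search engine.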
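The ranking idea can be imitated with a toy scorer that counts how many query terms each title contains. This is a deliberate simplification: Elasticsearch computes its score with a similarity model (BM25 by default in recent versions), not a plain count, and the titles below are tokenized by hand, so a real analyzer such as IK may split them differently. The function name is hypothetical.

```python
def rank_by_matched_terms(query_terms, tokenized_docs):
    """Rank documents by how many distinct query terms they contain."""
    scored = []
    for title, tokens in tokenized_docs.items():
        matched = len(set(query_terms) & set(tokens))
        if matched:
            scored.append((matched, title))
    # Highest count first; ties are ordered by title so the output is stable.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hand-tokenized titles (an assumption; IK may produce different tokens).
docs = {
    "中国驻洛杉矶领事馆遭亚裔男子枪击，嫌犯已自首": [
        "中国", "驻", "洛杉矶", "领事馆", "遭", "亚裔", "男子", "枪击", "嫌犯", "已", "自首",
    ],
    "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船": [
        "中韩", "渔警", "冲突", "调查", "韩警", "平均", "每天", "扣", "1艘", "中国", "渔船",
    ],
    "公安部：各地校车将享最高路权": ["公安部", "各地", "校车", "将", "享", "最高", "路权"],
}
ranking = rank_by_matched_terms(["中国", "领事馆"], docs)
for matched, title in ranking:
    print(matched, title)
```

As in the real result, the title containing both terms ranks first, the title containing only “中国” ranks second, and the title with neither term is not returned at all.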

Elasticsearch supports many other kinds of queries as well. For details, see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl.html
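As one small example of such a variation, a match query accepts an operator option: with "and", a Document must contain every term rather than any of them. The sketch below only builds the query body; the helper name match_query is made up, and the body would be passed to search() like the dsl dict used earlier.

```python
import json


def match_query(field, text, operator="or"):
    """Build a match query body; operator="and" requires every term to match."""
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


dsl = match_query("title", "中国 领事馆", operator="and")
print(json.dumps(dsl, indent=2, ensure_ascii=False))
```

With this body, a title that contains “中国” but not “领事馆” would be expected to drop out of the results.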

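That covers a basic introduction to Elasticsearch and the basics of using it from Python. These are only its fundamental features; it has many more powerful ones to explore, and further updates will follow.

Code for this section: https://github.com/Germey/ElasticSearch.

Recommended resources

A few good sites for further study:

Elasticsearch: The Definitive Guide: https://es.xiaoleilu.com/index.html

Elasticsearch full-text search engine tutorial for beginners: http://www.ruanyifeng.com/blog/2017/08/elasticsearch.html

Elastic Chinese community: https://www.elasticsearch.cn/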

Reposted from: https://cuiqingcai.com/6214.html
