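Building a Vertical Search Engine with Solr
Introduction
This is the author's graduation project, "A Location-Aware Search Engine for Tourist Attractions Built with Lucene and Solr". After more than three months of study and work, I think it turned out reasonably well, and I would like to share what I learned. This document is a tutorial series that aims to walk systematically through the development and construction of a vertical search engine. My knowledge is limited, so errors and omissions are inevitable; corrections are welcome. The series is published as a Gitbook so that it can be improved continuously.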
Project
Gitbook link: https://www.gitbook.com/book/fliaping/create-your-vertical-search-engine-with-solr/details
Author's blog series (alternate link):
[1] "[Overview] Building a Vertical Search Engine with Solr", Payne's Blog, 2016. [EB/OL]. Available: https://blog.fliaping.com/create-your-vertical-search-engine-with-solr/. [Accessed: 03-Jun-2016].
Project source code:
[2] "paynexu/trip-search", GitHub, 2016. [EB/OL]. Available: https://github.com/fliaping/trip-search. [Accessed: 03-Jun-2016].
Alternate link: http://git.oschina.net/fliaping/trip-search
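Summary
Because general-purpose search engines draw on such broad data sources, their results for specialized or domain-specific queries are often unsatisfactory, which is why vertical search engines emerged. A vertical search engine is a subdivision and extension of the general-purpose search engine that focuses on a single topic, so it is also called a topical search engine. Common domains include the automotive industry, legal information, medical information, scholarly literature, and travel information.
This project implements a location-aware vertical search engine for tourist attractions. It is built on the open source search frameworks Solr and Lucene, uses other open source projects such as Heritrix, webmagic, Zookeeper, Ionic, gradle, and jetty as tools, and was completed with the help of the related documentation and technical blogs. The main technologies involved are web crawling, HTML parsing, Chinese word segmentation, document indexing, spatial search, RESTful web services, Ajax, hybrid apps, the container technology Docker, SolrCloud, and clustering. This document describes the project background, gives an overview of the underlying technical principles, analyzes the architecture, and covers testing and delivery of the project. It divides the search engine workflow into three parts (collecting information, organizing information, and accepting queries), describes each module with supporting figures and tables, and explains search engine technology at a high level. Overall, the project is a "vertical search engine prototype that can be used in a production environment". For a more detailed account of the development process and the technical details, see the related links at the end of this document and the author's blog posts.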
ABSTRACT
Because general-purpose search engines draw on very broad data sources, their results often fail to satisfy needs in specific segments or individual verticals, which is why vertical search engines were created. A vertical search engine is a subdivision and extension of the general web search engine that focuses on a specific theme. Such engines are also called specialty or topical search engines. Common verticals include shopping, the automotive industry, legal information, medical information, scholarly literature, and travel.
This project is a vertical search engine for tourist attractions with spatial search. It is based on the open source search frameworks Solr and Lucene and uses other open source projects such as Heritrix, webmagic, Zookeeper, Ionic, gradle, and jetty as tools. The project was completed with the help of related technical documentation and blogs. The main technologies used include web crawling, HTML extraction, Chinese text analysis, document indexing, spatial search, RESTful web services, Ajax, hybrid apps, container technology (Docker), SolrCloud, and clustering. This thesis introduces the background of the project, summarizes the technical theory, analyzes the architecture, and describes testing and deployment. Following the workflow of a search engine, it is divided into three parts: collecting information, indexing information, and accepting queries. Each module is illustrated with diagrams and text, so that search engine technology is explained at a high level. To sum up, this project is a prototype of a vertical search engine that can be used in a production environment. Readers who want to learn more about the detailed development process and the technical implementation can refer to the author's blog posts and the other articles linked at the end of the paper.
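To make the spatial search mentioned in the abstract more concrete, here is a minimal sketch of how a location-aware query could be sent to Solr. It builds a request URL that uses Solr's `geofilt` filter and `geodist()` sort. The host and port, the core name `sights`, the field name `location`, and the helper name `build_spatial_query` are assumptions made for illustration; they are not taken from the project source, whose actual schema and query handling may differ.

```python
from urllib.parse import urlencode


def build_spatial_query(keywords, lat, lon, radius_km,
                        base_url="http://localhost:8983/solr/sights/select",
                        location_field="location"):
    """Build a Solr select URL for a keyword search limited to a radius.

    base_url and location_field are hypothetical defaults, not values
    from the trip-search project.
    """
    params = {
        "q": keywords,                 # full-text part of the query
        "fq": "{!geofilt}",            # keep only documents inside the circle
        "sfield": location_field,      # spatial field holding "lat,lon" (assumed name)
        "pt": f"{lat},{lon}",          # center point, e.g. the user's position
        "d": radius_km,                # search radius in kilometers
        "sort": "geodist() asc",       # nearest results first
        "wt": "json",                  # ask for a JSON response
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Example: attractions matching "museum" within 5 km of a point.
    print(build_spatial_query("museum", 39.9, 116.4, 5))
```

The sketch only constructs the URL; sending it with an HTTP client and parsing the JSON response is left out so that the example runs without a Solr server.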