First steps with ElasticSearch

Samuel Useche

May 9th, 2017

What is Elasticsearch?

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases.

It is built on Lucene, written in Java, and released as open source under the Apache license. It works through a REST interface that receives and sends data in JSON format and hides the internal details of Lucene. Thanks to this interface it can be used from any platform, not only Java: Python, .NET, PHP, or even a browser with JavaScript. It is also persistent, that is, what we index in it survives a restart of the server.
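To make the "any platform" point concrete, here is a minimal Python sketch that talks to the REST interface using only the standard library. It assumes a node listening on http://localhost:9200 (the default port used later in this tutorial). The helper names build_json_request and send are hypothetical and are not part of any official Elasticsearch client.

```python
import json
import urllib.request

# Assumption: a node listening on the default HTTP port used in this tutorial.
BASE_URL = "http://localhost:9200"


def build_json_request(method, path, body=None):
    """Build (without sending) an HTTP request with an optional JSON body."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def send(request):
    """Send the request and decode the JSON reply (requires a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a node running, send(build_json_request("GET", "/")) should play the same role as a plain curl GET against the root URL.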

Elasticsearch vs. NoSQL databases

What about systems like Postgres, for example, which come with full-text search and ACID transactions? (Other examples are the full-text capabilities of MySQL, MongoDB, Riak, etc.) While you can implement basic search with Postgres, there is a huge gap in both performance and features. Elasticsearch can “cheat” and do a lot of caching, with no concern for multi-version concurrency control and other complicating factors. Search is also more than finding a keyword in a piece of text: it’s about applying domain-specific knowledge to implement good relevancy models, giving an overview of the entire result space, and doing things like spell checking and autocompletion. All while being fast.

Elasticsearch is commonly used alongside another database. A database system with a stronger focus on constraints, correctness, and robustness, and on being readily and transactionally updatable, holds the master record, which is then asynchronously pushed to Elasticsearch.
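The "master record first, search index later" pattern can be sketched roughly as follows. This is a toy model under stated assumptions, not a production design: primary_db and search_index are hypothetical in-memory stand-ins for the primary database and for Elasticsearch, and the worker simply copies documents where a real setup would send HTTP requests.

```python
import queue
import threading

# Hypothetical stand-ins: primary_db plays the system of record and
# search_index plays Elasticsearch. Both are plain dicts here.
primary_db = {}
search_index = {}
pending = queue.Queue()


def save(doc_id, document):
    """Write to the primary store first, then enqueue the change for indexing."""
    primary_db[doc_id] = document
    pending.put(doc_id)


def sync_worker():
    """Copy queued changes from the primary store into the search index."""
    while True:
        doc_id = pending.get()
        if doc_id is None:  # sentinel value: stop the worker
            pending.task_done()
            break
        # In a real setup this would be an HTTP call to Elasticsearch.
        search_index[doc_id] = primary_db[doc_id]
        pending.task_done()


worker = threading.Thread(target=sync_worker, daemon=True)
worker.start()

save(1, {"message": "Hello World!"})
pending.join()  # block until the queued change has been indexed
```

The write to the primary store returns immediately, and the search index catches up in the background, which is why search results can briefly lag behind the master record.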

Some basic concepts within ElasticSearch

Cluster

A cluster is a set of one or more nodes (servers) that together hold all the information in a distributed and indexed way. Each cluster is identified by a name; by default it is called “elasticsearch”.

Node

A node is a server that is part of a cluster, stores your information, and helps with the cluster's indexing and search tasks. Nodes are also identified by a name; by default, each node is given the name of a random Marvel character.

By default they are configured to be part of a cluster with the name “elasticsearch”.

A cluster can have as many nodes as you want. If no cluster exists when a node starts, the node creates one and joins it.

Index

An index is a collection of documents that have similar characteristics. Indexes are identified by a name, which we use when indexing, searching, updating, and deleting documents.
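To show where the index name appears in each of those operations, the sketch below maps an operation to an HTTP method and path. The function name request_for is hypothetical. The paths follow the index/type/id layout of the 2.x release installed in this tutorial; later Elasticsearch releases changed how types work, so treat the layout as version-specific.

```python
def request_for(operation, index, doc_type=None, doc_id=None):
    """Return the (HTTP method, path) pair for a basic operation on an index."""
    doc_path = "/{}/{}/{}".format(index, doc_type, doc_id)
    routes = {
        "index": ("PUT", doc_path),                  # create or replace a document
        "get": ("GET", doc_path),                    # read a document by id
        "update": ("POST", doc_path + "/_update"),   # partial update
        "delete": ("DELETE", doc_path),              # remove a document
        "search": ("GET", "/{}/_search".format(index)),  # search the whole index
    }
    return routes[operation]
```

For example, request_for("search", "tutorial") yields a GET on /tutorial/_search, while the document operations all address /tutorial/helloworld/1 style paths.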

ElasticSearch installation

The installation will be done in a Linux environment, as that is what servers most commonly run.

Add the Oracle Java PPA to apt:

sudo add-apt-repository -y ppa:webupd8team/java

Update your apt package database:

sudo apt-get update

Install the latest stable version of Oracle Java 8 with this command (and accept the license agreement that pops up):

sudo apt-get -y install oracle-java8-installer

Lastly, verify it is installed:

java -version

Download Elasticsearch. This tutorial uses version 2.3.1.

wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.3.1/elasticsearch-2.3.1.deb

Then install it in the usual Ubuntu way with dpkg.

sudo dpkg -i elasticsearch-2.3.1.deb

To make sure Elasticsearch starts and stops automatically with the server, enable its systemd service.

sudo systemctl enable elasticsearch.service

Configuring Elasticsearch

Open the main elasticsearch.yml configuration file with nano or your favorite text editor.

sudo nano /etc/elasticsearch/elasticsearch.yml

Remove the # character at the beginning of the lines for cluster.name and node.name to uncomment them, and then update their values. Your first configuration changes in the /etc/elasticsearch/elasticsearch.yml file should look like this:

. . .
cluster.name: mycluster1
node.name: "My First Node"
. . .
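If you want to double-check the edit from a script, a rough sketch like the one below can read the two settings back. It is deliberately not a full YAML parser: it only understands flat key: value lines, and read_settings is a hypothetical helper name. The sample text mirrors the values above, plus one commented-out line added purely for illustration.

```python
def read_settings(config_text):
    """Parse flat `key: value` lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        # Drop surrounding whitespace and optional double quotes.
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Illustrative sample; on a real server, read /etc/elasticsearch/elasticsearch.yml.
sample = '''
# node.name: still-commented-out
cluster.name: mycluster1
node.name: "My First Node"
'''
settings = read_settings(sample)
```

Lines that still start with # are ignored, which makes it easy to spot a setting that was never uncommented.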

Testing Elasticsearch

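Start (or restart) the service so that the new settings take effect, for example with sudo systemctl start elasticsearch. Once it is up, Elasticsearch should be listening on port 9200. You can test it with curl, a command-line tool for transferring data with URLs, and a simple GET request.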

curl -X GET 'http://localhost:9200'

You should see the following response:

Output of curl

{
  "name" : "My First Node",
  "cluster_name" : "mycluster1",
  "version" : {
    "number" : "2.3.1",
    "build_hash" : "bd980929010aef404e7cb0843e61d0665269fc39",
    "build_timestamp" : "2016-04-04T12:25:05Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}
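The same check can be scripted. The sketch below parses the response body shown above and pulls out the fields that confirm the configuration took effect. The function name summarize is hypothetical, and the body is pasted in as a string so the example runs without a server.

```python
import json

# The response body shown above, as returned by GET http://localhost:9200
RESPONSE = '''
{
  "name" : "My First Node",
  "cluster_name" : "mycluster1",
  "version" : {
    "number" : "2.3.1",
    "build_hash" : "bd980929010aef404e7cb0843e61d0665269fc39",
    "build_timestamp" : "2016-04-04T12:25:05Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}
'''


def summarize(body):
    """Return a one-line summary of the node, cluster, and version fields."""
    info = json.loads(body)
    return "{} in cluster {} runs Elasticsearch {} (Lucene {})".format(
        info["name"],
        info["cluster_name"],
        info["version"]["number"],
        info["version"]["lucene_version"],
    )


print(summarize(RESPONSE))
```

If name or cluster_name do not match what you put in elasticsearch.yml, the service was most likely not restarted after the edit.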

Using Elasticsearch

To start using Elasticsearch, let’s add some data first. As already mentioned, Elasticsearch uses a RESTful API, which responds to the usual CRUD commands: create, read, update, and delete. To work with it, we’ll use curl again.

You can add your first entry with the command:

curl -X POST 'http://localhost:9200/tutorial/helloworld/1' -d '{ "message": "Hello World!" }'

You should see the following response:

Output

{"_index":"tutorial","_type":"helloworld","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}

With curl, we have sent an HTTP POST request to the Elasticsearch server. The URI of the request was /tutorial/helloworld/1, which is made up of several parts:

tutorial is the index of the data in Elasticsearch.
helloworld is the type.
1 is the id of our entry under the above index and type.
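The three parts listed above can be pulled out of the URI mechanically. The sketch below is only an illustration of that index/type/id layout; DocumentAddress and parse_document_path are hypothetical names, not part of Elasticsearch.

```python
from collections import namedtuple

# Hypothetical container for the three URI parts described above.
DocumentAddress = namedtuple("DocumentAddress", ["index", "doc_type", "doc_id"])


def parse_document_path(path):
    """Split a path like /tutorial/helloworld/1 into index, type, and id."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected /<index>/<type>/<id>, got {!r}".format(path))
    return DocumentAddress(*parts)


address = parse_document_path("/tutorial/helloworld/1")
```

Note that the id comes back as the string "1", which matches the quoted "_id" value in the server's response.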

You can retrieve this first entry with an HTTP GET request.

curl -X GET 'http://localhost:9200/tutorial/helloworld/1'

The result should look like:

Output:

{"_index":"tutorial","_type":"helloworld","_id":"1","_version":1,"found":true,"_source":{ "message": "Hello World!" }}
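In application code you usually want only the stored document, not the surrounding metadata. A minimal sketch, assuming the response shape shown above: the hypothetical helper extract_source returns the _source object when found is true and None otherwise. The "missing" body used here is an assumption about what a lookup of a nonexistent id returns in this version.

```python
import json


def extract_source(body):
    """Return the stored document from a GET response, or None if not found."""
    reply = json.loads(body)
    if not reply.get("found"):
        return None
    return reply["_source"]


# The response shown above.
found_body = (
    '{"_index":"tutorial","_type":"helloworld","_id":"1","_version":1,'
    '"found":true,"_source":{ "message": "Hello World!" }}'
)
# Assumed shape of the reply for an id that does not exist.
missing_body = '{"_index":"tutorial","_type":"helloworld","_id":"2","found":false}'

document = extract_source(found_body)
```

Here document holds exactly what was posted earlier, and extract_source(missing_body) gives None instead of raising.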

So far we have added to and queried data in Elasticsearch. To learn about the other operations please check the API documentation.

In conclusion, using Elasticsearch to search large volumes of data offers many advantages, such as:

Scalability: Its design lets us scale horizontally, adding servers according to our needs.
High availability: Elasticsearch clusters can detect which nodes are failing and reorganize themselves so that the data stays accessible.
Multi-tenant: We can operate on several indexes at the same time, which makes our searches more powerful.
Schema-free: We can work without defining a fixed data structure up front.
Document-oriented: Entities are stored as structured JSON documents. All fields are indexed by default, and a single query can span all indexes.
API: Elasticsearch provides RESTful APIs in JSON, along with client APIs for different languages.
Text-based searches: Elasticsearch is based on Lucene, which strengthens its text search capabilities, supporting geolocation, autocompletion, …
Conflict management: Document versioning prevents data loss when several clients edit the same record at the same time.
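The conflict management point relies on document versions (the _version field seen in the responses earlier). The idea, often called optimistic concurrency control, can be sketched with a toy in-memory store. This illustrates the principle only and is not how Elasticsearch implements it; VersionedStore and VersionConflict are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when a write is based on an outdated version of a document."""


class VersionedStore:
    """Toy document store that bumps a version number on every write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def get(self, doc_id):
        """Return the (version, document) pair for doc_id."""
        return self._docs[doc_id]

    def put(self, doc_id, document, expected_version=None):
        """Store a document; reject the write if expected_version is stale."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                "expected version {}, current is {}".format(
                    expected_version, current_version
                )
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, document)
        return new_version


store = VersionedStore()
first = store.put(1, {"message": "Hello World!"})
# Two editors both read version 1; only the first write based on it succeeds.
second = store.put(1, {"message": "Edited by A"}, expected_version=first)
try:
    store.put(1, {"message": "Edited by B"}, expected_version=first)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

The second editor gets an explicit conflict instead of silently overwriting the first editor's change, and can then re-read the document and retry.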