ElasticSearch vs Hadoop MapReduce for Analytics

Anurag : Aug 18, 2017 4:30:00 PM

ElasticSearch is an amazing tool for indexing and full-text search. Its query language is a JSON-based Domain Specific Language (DSL) that is simple to understand yet very powerful. This has made ElasticSearch a standard choice for search integration in web apps.
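
To give a feel for the JSON-based DSL, here is a minimal sketch of a search request body built as a Python dictionary. The field names (title, published) and the helper build_search_body are hypothetical and only for illustration; bool, must, filter, match and range are standard query DSL clauses.

```python
import json


def build_search_body(text, since):
    """Build a query DSL body as a plain dict.

    Full-text match on a hypothetical `title` field, filtered to documents
    whose hypothetical `published` date is on or after `since`.
    """
    return {
        "query": {
            "bool": {
                # `must` clauses contribute to the relevance score.
                "must": [{"match": {"title": text}}],
                # `filter` clauses only include or exclude documents.
                "filter": [{"range": {"published": {"gte": since}}}],
            }
        },
        "size": 10,
    }


body = build_search_body("hadoop analytics", "2017-01-01")
print(json.dumps(body, indent=2))
```

The dictionary serializes straight to the JSON that would be sent as the body of a search request, which is a large part of why the DSL feels approachable.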

But how good is ElasticSearch when it comes to using it as an analytics backend? Is it really something that can replace Hadoop anytime soon?

An advanced analytics system like Hadoop or ElasticSearch is something you’ll need when your data has outgrown the capabilities of Google Analytics, Amplitude, MixPanel or the like. These tools are great for simple analytics and metrics, but they fall short when your questions can only be answered by custom queries or when the amount of data being collected is huge. While a lot of the legacy systems for advanced analytics are built on top of Hadoop, more developers are starting to use ElasticSearch instead. Here we’ll analyze whether ElasticSearch really is a good alternative for advanced analytics or whether Hadoop still wins.

Why is ElasticSearch gaining popularity?

Elastic’s ELK analytics stack - with Logstash for server-side logging and Kibana for visualization - is becoming popular for web analytics for the following reasons:

Easy setup - It is very easy to get an ElasticSearch instance up and running with a small dataset.
Easy DSL - The JSON-based DSL is much simpler to work with than Hadoop’s MapReduce, which is also more complex to set up.

These reasons are primarily why a small business would get up and running with ElasticSearch rather than implement a complex solution like Hadoop.

But ElasticSearch, in the end, is a search engine platform, and Hadoop is a highly scalable distributed data platform - and this shows in their performance with respect to data ingestion and complex analysis.

ElasticSearch vs Hadoop MapReduce - A Detailed Analysis:

Data stream Ingestion:

Many teams struggle with the limitations ElasticSearch imposes on streaming ingestion - especially when they scale to production. Unlike a toy instance, a production-scale ES cluster spans multiple nodes, and a network outage that cuts the connections between them is known to cause issues. In particular, a severed connection between master nodes is known to cause 100% loss of streaming data for the duration of the outage.

This is not a big problem when the data comes from repeatable tasks, like website crawling. But for streaming analytics data, it is a big drawback: a user is unlikely to come back to your site and perform the exact same steps as before. Logstash doesn’t provide a replay option either. So once you miss some data, it is gone.

So if you are looking for data integrity, it is wiser to store the data in a database like Redshift or MongoDB first and then replicate it to ElasticSearch to run analytics.
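
The store-first, replicate-later idea can be sketched in a few lines. This is a simplified illustration, not a production design: a local append-only file stands in for the durable database, and flaky_sink is a hypothetical stand-in for whatever client call indexes an event into ElasticSearch. All function names here are assumptions made for the example.

```python
import json
import os
import tempfile


def append_event(log_path, event):
    """Durably append one event as a JSON line before anything else sees it."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())


def replay(log_path, sink, start=0):
    """Send events from line `start` onward to `sink`.

    Stops at the first event the sink rejects and returns the offset of the
    next event to send, so a later call can resume from there.
    """
    offset = start
    with open(log_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            if line_no < start:
                continue
            if not sink(json.loads(line)):
                break
            offset = line_no + 1
    return offset


log_path = os.path.join(tempfile.mkdtemp(), "events.log")
for i in range(5):
    append_event(log_path, {"user": i, "action": "click"})

indexed = []
outage = {"active": True}


def flaky_sink(event):
    # Simulated outage: reject the event for user 2 while the outage lasts.
    if outage["active"] and event["user"] == 2:
        return False
    indexed.append(event)
    return True


offset = replay(log_path, flaky_sink)          # stops at user 2
outage["active"] = False
offset = replay(log_path, flaky_sink, offset)  # resumes where it stopped
print(offset, [e["user"] for e in indexed])
```

Because the log is the source of truth, an outage on the search side only delays indexing; the events can be replayed once the cluster is healthy again.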

Resource management:

Configuring ElasticSearch for scalability in production is also not as easy as it looks. A lot of trial and error is involved, and many settings need to be tweaked as the volume of data increases. You have to guess the size of your data when setting up ElasticSearch. Too many shards per index for a relatively small dataset means lower search performance, and too few shards for a large dataset means hitting the shard’s maximum size limit as the data grows.
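
The guesswork comes down to simple arithmetic. The sketch below assumes you have picked a target size per shard; the numbers used (600 GB expected, 30 GB per shard) are purely illustrative assumptions, not recommendations, and both helper functions are hypothetical.

```python
import math


def primary_shard_count(expected_gb, target_shard_gb):
    """Number of shards needed so each holds about `target_shard_gb`."""
    return max(1, math.ceil(expected_gb / target_shard_gb))


def average_shard_gb(actual_gb, shards):
    """Average shard size once the real data volume is known."""
    return actual_gb / shards


# Guess made up front: 600 GB of data, aiming for 30 GB per shard.
shards = primary_shard_count(600, 30)

# What happens when the guess turns out to be wrong in either direction.
overestimated = average_shard_gb(20, shards)     # data stayed small
underestimated = average_shard_gb(3000, shards)  # data grew 5x past the guess

print(shards, overestimated, underestimated)
```

With the same shard count, a dataset that stays small is spread thinly across many tiny shards, while one that outgrows the estimate ends up with shards several times larger than intended.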

This can create index management issues as your historical dataset grows and must be archived yet still remain available for occasional querying.

Schema free - not really:

While Hadoop and NoSQL technologies make it easy to upload data in any key-value format, this is not the case with ElasticSearch. Yes, you can add any data to it, but ES does recommend that the data be transformed into generic key-value pairs before being uploaded.

Without this, every new custom key becomes its own indexed field, causing the size of your ElasticSearch instance to explode as time goes by. The conversion itself, however, is a huge resource hog when working through millions of records of analytics data.
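
As a rough illustration of that transformation, the sketch below rewrites an event so that arbitrary custom keys end up as values under two generic fields, key and value. The function name, the attributes field and the choice of fixed fields are all assumptions made for this example.

```python
def to_generic_pairs(event, fixed_fields=("user_id", "timestamp")):
    """Keep a small set of known fields; turn everything else into
    generic {"key": ..., "value": ...} pairs."""
    doc = {k: event[k] for k in fixed_fields if k in event}
    doc["attributes"] = [
        {"key": k, "value": str(v)}
        for k, v in sorted(event.items())
        if k not in fixed_fields
    ]
    return doc


event = {
    "user_id": 7,
    "timestamp": "2017-08-18T16:30:00",
    "utm_campaign": "summer",
    "button_color": "red",
}
doc = to_generic_pairs(event)
print(doc)
```

However many distinct custom keys the events carry, the transformed documents only ever use the same handful of field names. The cost is the extra pass over every record, which is the resource overhead described above.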

Bulk uploads - not so easy either:

Handling bulk uploads is another pain point with ElasticSearch. ES has a default buffer limit, and every time a bulk upload breaches this limit, ES silently runs out of memory (OOM). The data indexed before the OOM is still available for querying, so it can take time to work out where the failure happened.
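
One common way to stay under such a limit is to split the upload into size-capped batches on the client side. The sketch below only builds the newline-delimited request bodies in the bulk format (an action line followed by the document); it does not send anything. The max_bytes value and the index name events are assumptions for illustration, and the real limit depends on how your cluster is configured.

```python
import json


def bulk_batches(docs, index_name, max_bytes):
    """Yield newline-delimited bulk bodies of at most `max_bytes` each.

    A single document larger than `max_bytes` is still emitted on its own.
    """
    action = json.dumps({"index": {"_index": index_name}})
    lines, size = [], 0
    for doc in docs:
        entry = action + "\n" + json.dumps(doc) + "\n"
        entry_size = len(entry.encode("utf-8"))
        # Close the current batch if this entry would push it over the cap.
        if lines and size + entry_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += entry_size
    if lines:
        yield "".join(lines)


docs = [{"n": i} for i in range(10)]
batches = list(bulk_batches(docs, "events", max_bytes=200))
for number, batch in enumerate(batches):
    print(number, len(batch.encode("utf-8")), "bytes")
```

Sending the batches one at a time, and checking each response before moving on, also makes it much easier to tell which part of an upload failed.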

Analytics - surely it is good at that!

While ES’s aggregation and full-text search are great for basic web analytics queries, including demographics, it doesn’t offer the power of SQL window functions. ES doesn’t allow outputting the result of a query into intermediate datasets for additional analysis, nor does it allow transformation of data sets in any way.
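
For readers unfamiliar with window functions, the sketch below shows in plain Python what a query like SUM(amount) OVER (PARTITION BY user ORDER BY ts) computes. Unlike an aggregation, which collapses rows into one value per group, it keeps every row and attaches a running value to it. The row layout (user, ts, amount) is made up for this example.

```python
from collections import defaultdict


def running_total(rows):
    """Return each (user, ts, amount) row with a per-user running total,
    ordered by user and then by timestamp."""
    totals = defaultdict(int)
    out = []
    for user, ts, amount in sorted(rows, key=lambda r: (r[0], r[1])):
        totals[user] += amount
        out.append((user, ts, amount, totals[user]))
    return out


rows = [("a", 2, 5), ("b", 1, 7), ("a", 1, 3), ("a", 3, 2)]
result = running_total(rows)
for row in result:
    print(row)
```

Questions such as "how much had this user spent by the time of each purchase" need this row-by-row view, which is where pure filter-and-aggregate tools run out of steam.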

What it is great at is doing what a search tool does - aggregation of data into smaller sets based on various filters. But then so does Google Analytics. The whole reason why you’d want to use Hadoop or ElasticSearch is so you can add complex analytics!

And does Hadoop do all that?

There is a reason Hadoop is the go-to tool for distributed data processing. HDFS makes the system highly fault-tolerant and prevents data loss. It also has a lot of supporting tools for bulk upload and data ingestion and SQL engines for querying data.

The Hadoop MapReduce query framework can handle any data aggregation and transformation jobs easily.
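
To make the programming model concrete, here is a single-process sketch of the map, shuffle and reduce steps that counts page views per page. It is not the Hadoop API and runs nothing in a distributed fashion; the function names and the record layout are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, value) pair per page view."""
    yield record["page"], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


records = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]
counts = run_job(records)
print(counts)
```

Because the map and reduce functions are arbitrary code, the same structure covers transformations and multi-step jobs as well as simple counts, with the output of one job feeding the next.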

While Hadoop wins hands-down in these regards, setting up Hadoop with any of its additional tools requires a lot of domain-specific knowledge and carries heavy setup and maintenance costs compared to ElasticSearch. Mastering MapReduce is also difficult, so most companies prefer to set up a query engine layer that lets them interact with the data using familiar SQL queries. These engines add another layer of complexity to the system.

ElasticSearch vs Hadoop MapReduce - Who Wins?

In the end, the tool that you use depends on your use case and the kind of data you are working with. While ElasticSearch is great for simpler web analytics, the possible streaming data loss and scaling issues make it unsuitable for complex analytics systems. Implementing a Hadoop instance for your analytics system has a steep learning curve, but it is well worth the effort - with increased stability, solid ingestion and a wide range of supporting third-party tools.

If you have a project on Big Data and need immediate assistance, feel free to contact us