Yes, the date on this post says Friday 7th, and it’s being published well after that, but let’s pretend it’s Friday and summarise some thoughts on analytical search.
Search is search, right? You index some documents, call an API with some keywords, and get back a list of documents that contain those keywords. You might also get back counts of document matches, some matched text snippets, and support for stemming (so that a search for fox returns documents containing foxes as well), but fundamentally it’s about turning some search terms into a list of matching documents. Right?
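To make that concrete, here’s a minimal sketch of what such a keyword query looks like in Elasticsearch’s query DSL, built as a Python dict. The field name (`body`) is an illustrative assumption, and matching foxes for fox depends on the field being indexed with a stemming analyzer such as the built-in `english` one.

```python
import json

# A sketch of an Elasticsearch-style keyword query body. The "body" field
# name is an assumption; stemming (fox -> foxes) requires the field to be
# mapped with a stemming analyzer such as "english".
query = {
    "query": {
        "match": {
            "body": "fox"
        }
    },
    "highlight": {
        # ask for matched text snippets alongside the hit list
        "fields": {"body": {}}
    },
}

print(json.dumps(query, indent=2))
```

The response would then be the familiar ranked list of matching documents, with highlighted snippets per hit.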
What both Elasticsearch and Apache Solr have done is turn search into a much more general purpose analytical capability. In the same way that relational databases use indexes and full table scans as a starting point for the wide range of transformations, joins and aggregations supported by SQL, these search technologies now use their indexed lookups as the starting point for a similar range of capabilities.
So what can these technologies do? Aggregations were the first step, allowing results to be rolled up - great for understanding how many type X errors my system’s generated per hour over the last month. And this then led to much more interesting aggregations, including those to support anomaly detection (is the number of errors I’m seeing now different from the usual count for this time of day) and graph analysis (what are the relationships between the terms I’m interested in and the other terms in the same documents).
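A rough sketch of that first roll-up as an Elasticsearch aggregation, again built as a Python dict. The index layout, the `error_type` and `@timestamp` field names, and the "type X" filter value are all assumptions, and the exact interval parameter name varies between Elasticsearch versions.

```python
import json

# Sketch of an aggregation that counts type X errors per hour over the
# last month. Field names and the filter value are illustrative assumptions.
agg_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"error_type": "X"}},
                {"range": {"@timestamp": {"gte": "now-1M"}}},
            ]
        }
    },
    "aggs": {
        "errors_per_hour": {
            "date_histogram": {
                "field": "@timestamp",
                # older Elasticsearch versions call this parameter "interval"
                "calendar_interval": "hour",
            }
        }
    },
    "size": 0,  # we only want the bucket counts, not the documents themselves
}

print(json.dumps(agg_query, indent=2))
```

The response buckets give one count per hour, which is exactly the shape you’d feed into a dashboard or an anomaly detector.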
And now both technologies support configurable results-analysis pipelines, turning them into much more powerful analytical tools - in Solr’s case these include MapReduce-like transformations over results and the ability to execute SQL expressions over Solr indexes.
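As an illustration of the SQL side, here’s a sketch of a call to Solr’s Parallel SQL interface (the `/sql` request handler), which accepts a SQL statement in a `stmt` parameter. The collection name (`logs`), field names, and host are all illustrative assumptions.

```python
from urllib.parse import urlencode

# Sketch of a request to Solr's /sql handler. The collection ("logs"),
# field names, and host are assumptions for illustration.
solr_url = "http://localhost:8983/solr/logs/sql"
params = {
    "stmt": (
        "SELECT error_type, count(*) AS errors "
        "FROM logs GROUP BY error_type ORDER BY errors DESC"
    )
}

request_url = solr_url + "?" + urlencode(params)
print(request_url)
```

Under the covers Solr turns this into its streaming-expression machinery, so the same SQL works whether the aggregation can be pushed down to the index or needs a MapReduce-style shuffle across shards.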
Which I think makes this a very interesting time to be looking at analysis of unstructured and semi-structured data: we’re now starting to see analytical capabilities similar to those that have long existed in the structured world, and they’re going to open up some extremely interesting use cases.
And if Elasticsearch and Apache Solr are blazing this trail, which one should you choose for your use case? There are plenty of websites out there that will give you their thoughts or a feature-by-feature breakdown (although be wary of how old some of them are, as both technologies, and particularly Solr, have seen significant changes in the last couple of years), and although the two have slightly different focuses (Elasticsearch has a strong pedigree in log analysis, for example), the strong likelihood is that unless you’re pushing the boundaries they’ll both support your use case.
So my recommendation to you would be threefold:
- Do proofs of concept with both technologies to understand how well they fit your use case and how well they work for you
- Look at the commercial model. Although Elasticsearch is open source, there are considerable benefits to taking on one of Elastic’s commercial offerings, both in terms of extra functionality and having their support. With Solr all the functionality is in the open source product, but there’s still value in having commercial support, and there’s a much wider range of ways of getting this and a wider range of vendors that will provide it.
- Make sure you have access to people who understand these technologies and what they can do. They are powerful, complex, and relatively new pieces of technology, and having access to people who understand them will ensure you get the most out of them.