25 January 2019

Data Virtualization (DaaS)

These days the Hadoop ecosystem is a freaking pain when you think about all the frameworks you can work with.

You typically ingest data into HDFS in some file format (TextInputFormat, Parquet, Avro, whatever), then you put some data in HBase, then you still want to access data in a bunch of Oracle databases, cross it with MongoDB, and so on and so on… Then you get a call from an upset Data Analyst because something is missing from one of their Machine Learning models, and all hell breaks loose. Now you have to revisit most of the layers, check with the database team to see what happened, and verify that all those frameworks and workflows are working correctly… It’s a pain in the ass!

Now multiply this by the number of concurrent projects and developers you have on top of the Data Lake…

These days it’s all about agility and speed. Developers and Data Analysts don’t want to deal with stuff like Oozie workflows or batch processing; heck, they don’t even want to wait for MapReduce jobs to finish the normal way, even with the entire cluster’s resources at their disposal!

The past few days I’ve been investigating something that, in my opinion, makes a lot of sense for some data boys and girls; heck, even for us Big Data Engineers it could be very helpful. I’m talking about Data Virtualization, or Data-as-a-Service. One of the most interesting platforms on the market is Dremio.

Dremio is a business software company, headquartered in Mountain View, California. It produces and maintains an open source “self-service data platform” that uses Apache Arrow, Apache Calcite, and Apache Parquet, along with other open-source technologies, to allow accelerated querying of many different types of data sources.

Dremio integrates natively with relational database management systems, Apache Hadoop, MongoDB, Amazon S3, Elasticsearch, and other sources, using standard SQL syntax and a graphical query builder! Business intelligence tools, such as Tableau, Power BI, and QlikView, can then connect to Dremio as if it were the primary data source, while Dremio manages query execution in the native systems and accelerates queries using its own Apache Arrow-based engine. To accelerate queries, Dremio makes heavy use of physically optimized representations of source data, which the company calls “data reflections”.
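Just to make that concrete, here’s a minimal sketch of what querying Dremio from code could look like, assuming the Dremio ODBC driver is installed and using Python’s pyodbc. The driver name, host, port, credentials, and table paths below are all placeholders of my own, not from any real setup:

```python
# A minimal sketch: one standard SQL statement that joins data from two
# different systems through Dremio. Everything in the connection string
# (driver name, host, port, credentials) is a placeholder assumption.
import pyodbc

conn = pyodbc.connect(
    "Driver=Dremio ODBC Driver;"  # driver name depends on your installation
    "ConnectionType=Direct;"
    "HOST=localhost;"
    "PORT=31010;"
    "AuthenticationType=Plain;"
    "UID=dremio_user;"
    "PWD=dremio_password;",
    autocommit=True,
)

cursor = conn.cursor()
# hive.* and mysql.* are hypothetical source names as they might appear in
# a Dremio catalog; Dremio handles the cross-source execution, the client
# just gets back a normal result set.
cursor.execute("""
    SELECT c.customer_id, c.name, SUM(o.amount) AS total_spent
    FROM hive.sales.customers AS c
    JOIN mysql.shop.orders AS o
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
""")
for row in cursor.fetchall():
    print(row)
```

The point is that the BI tool (or script) only ever speaks plain SQL to one endpoint; where the data physically lives becomes Dremio’s problem.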

Just a (very!) simple example of what you can do with Dremio:

In a simple test, I joined a table from Hive with a table from MySQL; in another test I added a table from HBase. All in just a couple of minutes! Your developers can access different data sources, build some sort of cubes, join tables from different systems (relational and non-relational), and Dremio shows you the results in just a few seconds! You can even deploy Dremio on YARN!
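A join like the one in that test could be expressed as a single standard SQL statement sent through the same connection as in the sketch above. The hive.*, mysql.*, and hbase.* paths here are hypothetical catalog names for illustration, not the actual datasets I used:

```python
# Reuses the pyodbc cursor from the sketch above; all table paths are
# hypothetical examples of how sources might appear in a Dremio catalog.
CROSS_SOURCE_JOIN = """
    SELECT p.product_id,
           p.product_name,   -- dimension data living in Hive
           o.order_total,    -- transactional data living in MySQL
           e.click_count     -- event counters living in HBase
    FROM hive.warehouse.products AS p
    JOIN mysql.shop.orders AS o
      ON o.product_id = p.product_id
    JOIN hbase.events.product_clicks AS e
      ON e.row_key = p.product_id
"""

cursor.execute(CROSS_SOURCE_JOIN)
print(cursor.fetchmany(10))  # peek at the first ten joined rows
```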

Congrats to Mr. Tomer Shiran and Mr. Jacques Nadeau (founders of Dremio) for developing such an interesting product, which I think will explode in the near future! Check it out! http://www.dremio.com
