Integrating PrestoDB as a multi-repository query engine

08/02/2023 Fco. Javier López Acevedo

As you may recall, in release 3.1.0 of Onesait Platform we integrated Presto+MinIO to support DataLake-style storage in scenarios involving migration from Hadoop.

Which Open Source technology do we recommend to replace Hadoop?

In this release, we are working on supporting Presto as a multi-repository SQL query engine, which will let us run analytical queries over all the Entities of the Platform regardless of the repository in which they are stored. For example, we will be able to JOIN a PostgreSQL table with a MongoDB collection, or MinIO data with an Oracle table.
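As a rough sketch of what such a cross-repository JOIN can look like in Presto SQL, the snippet below builds a single query that joins a PostgreSQL table with a MongoDB collection. Presto addresses tables as `catalog.schema.table`; the catalog names (`postgresql`, `mongodb`), schemas, tables and columns used here are hypothetical and depend on how the administrator has registered each catalog.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a Presto fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


# Hypothetical catalogs, schemas and tables, used only for illustration.
customers = qualified_name("postgresql", "public", "customers")
orders = qualified_name("mongodb", "shop", "orders")

# One SQL statement spanning two repositories; Presto resolves each table
# through the connector behind its catalog.
query = (
    "SELECT c.name, COUNT(*) AS order_count\n"
    f"FROM {customers} AS c\n"
    f"JOIN {orders} AS o ON o.customer_id = c.id\n"
    "GROUP BY c.name"
)

print(query)
```

The only part specific to each repository is the catalog prefix; the rest of the statement is plain SQL.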

How will we support it in the Platform?

First, we will be able to create a new type of Entity called "Presto Entity".

By creating these "Presto" entities, users will be able to connect to the different catalogs that the Platform administrator has registered in Presto.

Once these Presto entities have been created, we will be able to JOIN them transparently, regardless of the underlying repository.

Presto entities are handled like any other Entity in the Platform: you can build Dashboards on them, ingest data, publish them as REST APIs, and so on.

We are also going to integrate the Presto UI, which will be accessible to administrator users and will let them see the queries executed on Presto.

OK, but what is Presto?

PrestoDB is a distributed SQL query engine, released as Open Source and built in Java, designed to run interactive analytical queries against a large number of data sources (through connectors), with support for data sources ranging from gigabytes to petabytes.

It is an ANSI SQL query engine: it lets you query and manipulate data in any connected data source with the same SQL statements, functions and operators.

As a curiosity, PrestoDB was created in 2012 at Facebook, where it was initially conceived to address how slow Hive was when accessing a 300 PB Data Warehouse. To solve this problem, they built an SQL-based MPP engine that was easy to use with existing skills, easy to connect to any database, warehouse or Data Lake, and easy to integrate with any BI tool.

Working with a Data Lake on the Onesait Platform (part 1)

What will we be able to do?

Presto will let us query data where it lives, with connectors that include, among others, Hive, Cassandra, relational databases, Kafka, Kudu, Redis, MongoDB… A single Presto query can combine data from multiple sources, which enables multi-store analysis. It is aimed at analytical queries with expected response times ranging from under a second to minutes.

It offers a command-line interface for running queries.
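As an illustration of the command line, the sketch below assembles an invocation of the Presto CLI using its `--server`, `--catalog`, `--schema` and `--execute` options. It assumes the CLI executable is available on the PATH under the name `presto`; the server address, catalog and schema are placeholder values. The command is only printed here, not executed.

```python
import shlex


def presto_cli_command(server, catalog, schema, sql, executable="presto"):
    """Build the argument list for a non-interactive Presto CLI call."""
    return [
        executable,
        "--server", server,    # coordinator host:port
        "--catalog", catalog,  # default catalog for the session
        "--schema", schema,    # default schema for the session
        "--execute", sql,      # run this statement and exit
    ]


# Placeholder values, for illustration only.
cmd = presto_cli_command("localhost:8080", "hive", "default", "SELECT 1")

# Print a shell-safe version of the command line.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would launch the query against a real coordinator; omitting `--execute` opens the interactive prompt instead.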

Connectors

Presto offers a set of connectors for accessing data in different data sources. Its documentation includes a list of the available connectors.

Some of these connectors are for: Accumulo, Cassandra, Druid, Elasticsearch, Hive, JMX, Kafka, Kudu, local files, MongoDB, MySQL, Oracle, PostgreSQL, Redis, Redshift, SQL Server, etc.
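To give an idea of how a connector is registered, the sketch below renders the contents of a catalog properties file. In a standard Presto installation each catalog is typically declared in a file under `etc/catalog/`, and the file name becomes the catalog name. The property names follow the PostgreSQL connector documentation; the host, database and credentials are placeholders, and how Onesait Platform manages these files is not covered here.

```python
def catalog_properties(connector_name, settings):
    """Render a Presto catalog definition as .properties text."""
    lines = [f"connector.name={connector_name}"]
    lines += [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


# Placeholder connection details, for illustration only.
postgresql_catalog = catalog_properties(
    "postgresql",
    {
        "connection-url": "jdbc:postgresql://example.net:5432/database",
        "connection-user": "root",
        "connection-password": "secret",
    },
)

# Saved as etc/catalog/postgresql.properties, this would expose a
# catalog named "postgresql".
print(postgresql_catalog, end="")
```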

JDBC driver

Presto offers a JDBC driver that gives access to the underlying data sources from any application that uses the driver.
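As a small sketch of how an application points the driver at Presto, the helper below builds a connection URL in the `jdbc:presto://host:port/catalog/schema` form, where catalog and schema are optional. The helper name and the host, catalog and schema values are hypothetical.

```python
def presto_jdbc_url(host, port, catalog=None, schema=None):
    """Build a Presto JDBC URL; catalog and schema are optional."""
    if schema is not None and catalog is None:
        raise ValueError("a schema can only be given together with a catalog")
    url = f"jdbc:presto://{host}:{port}"
    if catalog is not None:
        url += f"/{catalog}"
    if schema is not None:
        url += f"/{schema}"
    return url


# Placeholder values, for illustration only.
print(presto_jdbc_url("localhost", 8080, "hive", "default"))
```

A JDBC client would pass this URL to the driver along with a user name; tables outside the default catalog can still be reached by fully qualifying their names in the query.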

Presto Web UI

Presto provides a web interface for monitoring and managing queries. The web interface is accessible on the Presto coordinator over HTTP.

For each query, the UI shows its current state.
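The same per-query state can also be inspected programmatically. The sketch below assumes the coordinator exposes its query list as JSON at `/v1/query`, with a `state` field on each entry; treat that endpoint, the field names and the sample data as assumptions to verify against your Presto version. Only the counting step runs here, on made-up sample entries.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_queries(coordinator_url):
    """Fetch the query list from the coordinator (assumed /v1/query endpoint)."""
    with urlopen(f"{coordinator_url}/v1/query") as response:
        return json.load(response)


def count_by_state(queries):
    """Count queries per state, e.g. RUNNING, FINISHED or FAILED."""
    return Counter(q["state"] for q in queries)


# Made-up entries shaped like the assumed response, for illustration only.
sample = [
    {"queryId": "q1", "state": "FINISHED"},
    {"queryId": "q2", "state": "RUNNING"},
    {"queryId": "q3", "state": "FINISHED"},
    {"queryId": "q4", "state": "FAILED"},
]

for state, total in sorted(count_by_state(sample).items()):
    print(state, total)
```

Against a live coordinator, `count_by_state(fetch_queries("http://localhost:8080"))` would produce the same kind of summary, with the address adjusted to your deployment.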

Header image: GitHub Presto.
