Add connector for Apache Pulsar #7852

Open
MarvinCai opened this issue May 7, 2021 · 10 comments · May be fixed by #23439

Comments

@MarvinCai
Contributor

We're from the Apache Pulsar community.
Apache Pulsar is a distributed messaging and streaming system with a cloud-native architecture.
Pulsar has implemented its SQL feature with a connector for the PrestoSQL engine.
Now that the PrestoSQL project has evolved into Trino, we're updating our dependencies and documentation and want to take the chance to contribute the connector back to the Trino community.

Not sure if there's a formal process for this so creating this issue to kick off the discussion here.

@kokosing
Member

kokosing commented May 7, 2021

Hello! We are glad that you contacted us. It would be very nice to have a Pulsar connector maintained as part of the Trino project.

There is no formal process. Please post a pull request and I would be happy to do a code review. In order to shorten the communication cycle please join the slack: https://trino.io/slack.html

Please make sure you are familiar with: https://github.com/trinodb/trino#development and https://trino.io/development/.

For the contribution to be merged, we will require proper tests that prove the integration works. That may include, among others, an implementation of io.trino.testing.BaseConnectorTest. It would also be nice to provide product tests that verify the integration works in a production-like environment. Please see https://github.com/trinodb/trino/blob/master/testing/trino-product-tests/README.md and the example implementations io.trino.tests.product.launcher.env.environment.SinglenodeCassandra and io.trino.tests.cassandra.TestSelect.

Let me know if something is not clear.

@Anonymitaet

Anonymitaet commented Jun 1, 2021

@kokosing many thanks for your guidance! This project continues to strengthen the good cooperation between Trino and Pulsar.

To promote awareness and usage of the Pulsar connector, may I have your thoughts on the content strategy?

User Guide

I see many connectors have their guides on the Trino website. For the Pulsar connector, @MarvinCai is contributing the Overview and Configuration chapters to Trino.

After thinking twice, I think keeping all documentation on the Pulsar website might be a better content approach. Below are the reasons:

  • Like code, documentation should have only one source for easier maintenance in the long run. Keeping it in either Trino or Pulsar is workable.

  • The Overview and Configuration chapters are only part of the guide; more chapters are already on the Pulsar website. Adding to or updating what we already have takes less effort. With this approach, we can add a short description of the connector to the Trino documentation, with a link to the docs on the Pulsar website. Readers and doc contributors will not miss any content.
    (P.S. @MarvinCai and I would like to help with the docs)

Blog

After finishing the work of code and user guides, writing a promotional blog to announce the connector to the world and show the cooperation between open source projects is a good marketing approach.

Since this is the first time Pulsar is contributing a connector to an outside project and Trino has much more experience (it has accepted various connectors), would the Trino community write this blog? Then we can promote it together in various channels (English + Chinese). Thanks.

These are just my two cents, please correct me if I'm wrong. Looking forward to your feedback, many thanks!

@bitsondatadev
Member

bitsondatadev commented Jun 1, 2021

Hey @Anonymitaet!

One thing to point out, I will be giving a talk at the Pulsar Summit and would love to write a blog about this to promote this wonderful work done by @MarvinCai, @jerrypeng, and other amazing members of the Pulsar community! It would also be great to have you all on the Trino Community Broadcast at some point to talk about this as well!

After reading over Apache Pulsar PIP-19, it seems to me that there are two ways users may be using this connector.

  1. Trino is considered an internal service to Pulsar to enable analytical queries (Pulsar centric).
  2. Trino is the primary query engine and Pulsar is one of the many systems it connects to (Trino centric).

In my view, it makes sense to maintain two distinct sets of documentation from these points of view and have each documentation point out the alternative in case the user is interested in exploring that.

For example, on the Pulsar website, you will keep a Pulsar centric point of view that sees Trino as an embedded service and is managed through the Trino launcher. This would be for the user that just wants to run analytics on their Pulsar cluster but doesn't need to run federated queries over multiple data stores.

For the Trino centric view, we would then document the process of adding Pulsar as another data source. This would be for the user that is already using Trino and adds Pulsar to the mix. They are likely querying other data sources and want to run analytics queries over Pulsar and their existing databases.

Thoughts?

@Anonymitaet

Anonymitaet commented Jun 2, 2021

Hi @bitsondatadev many thanks for your guidance!

Looking forward to your speech at Pulsar Summit.
Please ping me once you finish the blog, we can promote this together.

For the documentation, we will contribute all of it to Trino and update the contents to make it more Trino-centric. Once we finish, we may need reviews from the Trino community to improve the quality. Does this make sense? Thanks

(P.S. at the same time, I'll update Pulsar SQL docs to make it more Pulsar-centric, thanks for your valuable advice!)

@Anonymitaet

Hi @MarvinCai I've drafted one and left some questions, could you please provide technical inputs?
It would be much appreciated if you could provide any feedback by EOD 6/6 GMT+8, many thanks!

@jerrypeng

jerrypeng commented Jun 3, 2021 via email

@Anonymitaet

Hi @mosabua @kokosing @bitsondatadev many thanks for your comments on the Pulsar connector doc. We've incorporated them and drafted a new version here (it is more convenient to communicate in a Google Doc than in GitHub comments). Could you please help review it? It would be much appreciated if you could provide any feedback by EOD 6/20 GMT+8, many thanks!

After the doc is finalized, I'll send a doc PR to the Trino community.

@Anonymitaet

Hi @bitsondatadev thanks for giving a talk at the Pulsar Summit. As we discussed before, you will write a blog to promote the Pulsar connector, have you started the work?

@bitsondatadev
Member

> Hi @bitsondatadev thanks for giving a talk at the Pulsar Summit. As we discussed before, you will write a blog to promote the Pulsar connector, have you started the work?

Not yet. I will add it to my list and work on getting it out next week.

@Anonymitaet

@bitsondatadev OK. After the blog is out, we can provide editorial review if you need it.

@eaba linked a pull request on Sep 21, 2024 that will close this issue