Easily deploy Trino on Dataproc with init action script

Samet Karadag
Google Cloud - Community
2 min read · Jan 31, 2022


Do you want to deploy Trino on Dataproc easily? Are you looking for a Trino initialization script? Here it is.

Download “trino.sh” from GitHub and upload it to your GCS bucket.

Here is my GitHub link: https://github.com/sametkaradag/initialization-actions/blob/master/trino/trino.sh (a fork of https://github.com/GoogleCloudDataproc/initialization-actions ; the pull request was still pending as of writing this post).
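The download and upload can also be scripted. The following is a minimal sketch, not something taken from the init action itself: the two function names are hypothetical, the raw.githubusercontent.com URL is simply the raw-file form of the GitHub link above and assumes the file still lives at that path, and gs://trino-init is the bucket name used later in this post.

```shell
# Raw-file form of the GitHub link above (assumes the file is still at this path).
TRINO_SH_URL="https://raw.githubusercontent.com/sametkaradag/initialization-actions/master/trino/trino.sh"

# Hypothetical helper: download trino.sh into the current directory.
download_trino_sh() {
  # -f fails on HTTP errors, -L follows redirects, -o writes to a local file.
  curl -fsSL -o trino.sh "$TRINO_SH_URL"
}

# Hypothetical helper: copy the local trino.sh to a GCS bucket.
# Usage: upload_trino_sh gs://your-bucket
upload_trino_sh() {
  # Dataproc reads the init action from this location at cluster creation.
  gsutil cp trino.sh "$1/trino.sh"
}
```

Keeping the two steps separate leaves room to edit the script locally between `download_trino_sh` and `upload_trino_sh gs://trino-init`.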

If you want to use Trino’s BigQuery connector to query BigQuery data, set your project ID on line 162 of the init action before uploading it:

bigquery.project-id=set-your-project-id
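Line numbers shift whenever the script changes, so matching the placeholder text can be more robust than counting lines. A minimal sketch, assuming your copy of the script still contains the placeholder bigquery.project-id=set-your-project-id shown above; the function name set_bigquery_project is hypothetical.

```shell
# Hypothetical helper: replace the placeholder project ID in the init script.
# Usage: set_bigquery_project <project-id> <path-to-trino.sh>
set_bigquery_project() {
  project_id="$1"
  script="$2"
  # Match the placeholder text instead of relying on the line number;
  # -i.bak edits in place and keeps a backup (works with GNU and BSD sed).
  sed -i.bak "s/bigquery\.project-id=set-your-project-id/bigquery.project-id=${project_id}/" "$script"
  # Print the edited line so you can check it before uploading.
  grep -n 'bigquery\.project-id=' "$script"
}
```

For example, `set_bigquery_project my-project trino.sh` rewrites the local copy and leaves the original in trino.sh.bak.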

Then create your Dataproc cluster:

gcloud dataproc clusters create trino-test \
    --enable-component-gateway \
    --region europe-west4 \
    --zone europe-west4-c \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 100 \
    --num-workers 8 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 100 \
    --image-version 2.0-debian10 \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --initialization-actions 'gs://trino-init/trino.sh' \
    --project change-with-your-project-id

Here I use Trino for BigQuery queries on ephemeral Dataproc clusters, which means I create the cluster before processing and delete it afterwards to reduce costs.
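The create, process, delete cycle can be wrapped so that the cluster is removed even when the processing step fails. This is a rough sketch under assumptions: the function names are hypothetical, the create call is shortened to a few flags, and whatever you pass as the processing command is up to you.

```shell
CLUSTER="trino-test"
REGION="europe-west4"

# Shortened create call; add the machine-type, disk, scope, and project
# flags from the full command above as needed.
create_cluster() {
  gcloud dataproc clusters create "$CLUSTER" --region "$REGION" \
    --image-version 2.0-debian10 \
    --initialization-actions "gs://trino-init/trino.sh"
}

# --quiet skips the interactive confirmation prompt.
delete_cluster() {
  gcloud dataproc clusters delete "$CLUSTER" --region "$REGION" --quiet
}

# Hypothetical wrapper: create the cluster, run the given command, then
# delete the cluster and return the command's exit status.
with_ephemeral_cluster() {
  create_cluster || return 1
  status=0
  "$@" || status=$?
  delete_cluster
  return "$status"
}
```

You would call it as `with_ephemeral_cluster ./run_queries.sh`, where run_queries.sh stands in for your own processing script.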

I do not store any data on Dataproc, so the boot disk sizes (worker-boot-disk-size, master-boot-disk-size) are set to just 100 GB.

The workers are n1-standard-4 machines, which have 15 GB of RAM each. Increase num-workers or pick a larger machine type if you need faster queries.

That is it, now you have Trino :)

Finally, how do you connect?

You can use the Trino CLI client or JDBC clients such as SQuirreL SQL, DBeaver (free), or DataGrip (requires a paid subscription).
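As a rough illustration of both options, the sketch below builds a Trino JDBC URL and runs a single query through the CLI. Several things here are assumptions rather than facts about the init script: that the CLI is installed under the name trino, that the BigQuery catalog is called bigquery, and the port, which you need to look up in the coordinator's configuration (http-server.http.port). The function names are hypothetical.

```shell
# Print the JDBC URL for a Trino coordinator.
# Usage: trino_jdbc_url <host> <port> [catalog]
trino_jdbc_url() {
  host="$1"
  port="$2"
  catalog="${3:-bigquery}"   # assumed catalog name
  printf 'jdbc:trino://%s:%s/%s\n' "$host" "$port" "$catalog"
}

# Run one query through the Trino CLI as a quick connectivity check.
# Usage: trino_smoke_test <host> <port>
trino_smoke_test() {
  # Assumes the CLI binary is available as "trino" on the PATH.
  trino --server "http://$1:$2" --catalog bigquery --execute 'SHOW SCHEMAS'
}
```

On Dataproc the master node is named after the cluster with an -m suffix, so for the cluster above the host would be trino-test-m when you connect from inside the same network.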

You can also configure the same JDBC client to connect to BigQuery directly. That gives you one client with two sessions, one through Trino and one straight to BigQuery, for analysing BQ data.

If you want to see this in action, here is a YouTube video that also includes a match_recognize demo.
