Dataproc


Last updated 4 years ago


A two-node cluster configuration created by a Cloud Data Fusion pipeline

Dataproc staging and temp buckets (see the staging-bucket documentation linked at the bottom of this page)

Dataproc and long running clusters

Externalize Hive metastore database with Cloud SQL

The Hive metastore stores the schema and location of each Hive table (on Google Cloud, the table's Cloud Storage location). MySQL is commonly used as the metastore database; on Dataproc it can be replaced with Cloud SQL.

With Cloud SQL as the metastore, metadata is easier to find, and multiple clusters can be attached to the same source of Hive metadata.
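A minimal sketch of this setup, using the Dataproc cloud-sql-proxy initialization action; the project, region, instance, and cluster names are placeholders:

```shell
# Create a Cloud SQL (MySQL) instance to hold the Hive metastore.
gcloud sql instances create hive-metastore-db \
    --database-version=MYSQL_5_7 \
    --tier=db-n1-standard-1 \
    --region=us-central1

# Create a Dataproc cluster that proxies its Hive metastore traffic to
# the Cloud SQL instance via the cloud-sql-proxy initialization action.
gcloud dataproc clusters create hive-cluster \
    --region=us-central1 \
    --scopes=sql-admin \
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/cloud-sql-proxy/cloud-sql-proxy.sh \
    --metadata=hive-metastore-instance=my-project:us-central1:hive-metastore-db
```

Additional clusters created with the same `hive-metastore-instance` metadata share the same Hive metadata.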

Know your way around Stackdriver

Stackdriver Logging captures the daemon logs and YARN container logs from Cloud Dataproc clusters. Stackdriver Monitoring collects and ingests metrics, events, and metadata from Cloud Dataproc clusters.

Monitoring is enabled with the Dataproc cluster property dataproc:dataproc.monitoring.stackdriver.enable.
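A sketch of setting that property at cluster creation time (cluster and region names are placeholders):

```shell
# Enable the Stackdriver monitoring agent on all cluster VMs
# via the dataproc-prefixed cluster property.
gcloud dataproc clusters create monitored-cluster \
    --region=us-central1 \
    --properties=dataproc:dataproc.monitoring.stackdriver.enable=true
```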

Transform YARN queues into workflow templates

Transition YARN queues into separate clusters with unique cluster shapes and potentially different permissions. If data, metadata, and the permissioning system are persisted off the cluster, they can be reused across those clusters.
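One way to sketch this with gcloud: give each former YARN queue its own workflow template backed by a managed (ephemeral) cluster of the shape that queue needs. Template, cluster, jar, and class names below are placeholders:

```shell
# One template per former YARN queue.
gcloud dataproc workflow-templates create etl-queue-template --region=us-central1

# The managed cluster defines this queue's unique cluster shape; it is
# created at instantiation time and deleted when the workflow finishes.
gcloud dataproc workflow-templates set-managed-cluster etl-queue-template \
    --region=us-central1 \
    --cluster-name=etl-queue-cluster \
    --num-workers=4 \
    --worker-machine-type=n1-standard-8

# Attach the queue's job to the template.
gcloud dataproc workflow-templates add-job spark \
    --workflow-template=etl-queue-template \
    --region=us-central1 \
    --step-id=etl-step \
    --class=com.example.EtlJob \
    --jars=gs://my-bucket/etl.jar

# Run the workflow: cluster creation, job, and teardown in one step.
gcloud dataproc workflow-templates instantiate etl-queue-template --region=us-central1
```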

Consolidate job history across multiple clusters

Override the MapReduce done directory, the intermediate done directory, and the Spark event log directory with a Cloud Storage directory, then stop the per-cluster history servers:

systemctl stop hadoop-mapreduce-historyserver
systemctl stop spark-history-server
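The directory overrides can be set as cluster properties at creation time, so every cluster writes its job history to the same Cloud Storage location; the bucket and cluster names are placeholders:

```shell
# Point MapReduce and Spark job history at shared Cloud Storage paths so
# history survives cluster deletion and is consolidated across clusters.
gcloud dataproc clusters create history-cluster \
    --region=us-central1 \
    --properties="mapred:mapreduce.jobhistory.done-dir=gs://my-history-bucket/done-dir,\
mapred:mapreduce.jobhistory.intermediate-done-dir=gs://my-history-bucket/intermediate-done-dir,\
spark:spark.eventLog.dir=gs://my-history-bucket/spark-events,\
spark:spark.history.fs.logDirectory=gs://my-history-bucket/spark-events"
```

A single long-running history server pointed at the same bucket can then serve history for all clusters.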


https://cloud.google.com/blog/products/data-analytics/10-tips-for-building-long-running-clusters-using-cloud-dataproc
https://cloud.google.com/dataproc/docs/concepts/compute/secondary-vms
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/staging-bucket