Dataproc
A two node cluster configuration created by Cloud Data Fusion pipeline

.https://cloud.google.com/dataproc/docs/concepts/compute/secondary-vms
Dataproc staging and temp buckets - https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/staging-bucket
Dataproc and long running clusters
Externalize Hive metastore database with Cloud SQL
Hive metastore stores schema and location of the Hive table (location of Cloud storage on cloud). MySQL is commonly used as a metastore which will be replaced by Cloud SQL.
Cloud SQL as a metastore will be easier to find metadata and possible to attach multiple clusters to the same source of Hive metadata.
Know your way around Stackdriver
Stackdriver logging is used to capture the daemon logs and YARN container logs from Cloud Dataproc clusters. Stackdriver monitoring - Collects and ingests metrics, events, and metadata from Cloud Dataproc clusters.
Dataproc property dataproc:dataproc.monitoring.stackdriver.enable
Transform YARN queues into workflow templates
Transition YARN queues into separate clusters with unique cluster shapes and potentially different permissions. Data, metadata, permissioning system if persisted off the cluster can be reused.
Consolidate job history across multiple clusters
Override MapReduce done directory, intermediate done directory, and the spark event log directory with Cloud storage directory.
systemctl stop hadoop-mapreduce-historyserver
systemctl stop spark-history-server
Last updated
Was this helpful?