Dataproc
A two-node cluster configuration created by a Cloud Data Fusion pipeline.
Dataproc staging and temp buckets -
The Hive metastore stores the schema and location of each Hive table (the location points to Cloud Storage). MySQL is commonly used as the metastore database; on Google Cloud it is replaced by Cloud SQL.
Using Cloud SQL as the metastore makes metadata easier to find and makes it possible to attach multiple clusters to the same source of Hive metadata.
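The idea of several clusters sharing one metastore can be sketched as a set of Hive properties that every cluster receives unchanged. This is only an illustration under stated assumptions: the JDBC URL, bucket name, and cluster names are hypothetical, and the sketch assumes the metastore database is reachable directly over JDBC (many setups go through a Cloud SQL proxy instead). The `hive:` prefix is the Dataproc convention for writing a property into `hive-site.xml`.

```python
from typing import Dict


def shared_metastore_properties(jdbc_url: str, warehouse_bucket: str) -> Dict[str, str]:
    """Return Hive properties that point a cluster at a shared metastore.

    Both arguments are placeholders supplied by the caller; nothing here
    contacts Cloud SQL or Cloud Storage.
    """
    return {
        # Where the metastore database lives (hypothetical Cloud SQL address).
        "hive:javax.jdo.option.ConnectionURL": jdbc_url,
        # Where table data lives, so it survives cluster deletion.
        "hive:hive.metastore.warehouse.dir": f"gs://{warehouse_bucket}/hive-warehouse",
    }


# Hypothetical values for illustration only.
JDBC_URL = "jdbc:mysql://10.0.0.5:3306/hive_metastore"
WAREHOUSE_BUCKET = "example-warehouse"

# Every cluster gets the same metastore settings, so all of them see the same tables.
CLUSTER_PROPERTIES = {
    name: shared_metastore_properties(JDBC_URL, WAREHOUSE_BUCKET)
    for name in ("etl-cluster", "adhoc-cluster")
}
```

Because the properties are identical for each cluster, a table defined from one cluster is visible from the other.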
Stackdriver Logging captures the daemon logs and YARN container logs from Cloud Dataproc clusters. Stackdriver Monitoring collects and ingests metrics, events, and metadata from Cloud Dataproc clusters.
The Dataproc cluster property dataproc:dataproc.monitoring.stackdriver.enable controls whether the monitoring agent is enabled on the cluster.
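As a minimal sketch, the property can be passed at cluster creation through the `--properties` flag of `gcloud dataproc clusters create`. The helper below only builds the command line and does not run it; the function name, cluster name, and region are hypothetical.

```python
import shlex
from typing import List


def monitoring_create_command(cluster_name: str, region: str) -> List[str]:
    """Build (but do not execute) a gcloud command that enables the monitoring agent."""
    properties = {"dataproc:dataproc.monitoring.stackdriver.enable": "true"}
    props_arg = ",".join(f"{key}={value}" for key, value in properties.items())
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--properties={props_arg}",
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(monitoring_create_command("example-cluster", "us-central1")))
```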
Transition YARN queues into separate clusters, each with its own cluster shape and potentially different permissions. Data, metadata, and the permissioning system can be reused across clusters if they are persisted off the cluster.
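One way to picture this transition is a mapping from each former YARN queue to its own cluster definition. The sketch below is illustrative only: the queue names, machine types, worker counts, and service accounts are all hypothetical, and the helper just assembles a `gcloud dataproc clusters create` command without running it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterShape:
    """Shape and identity of the cluster that replaces one YARN queue."""
    name: str
    worker_machine_type: str
    num_workers: int
    service_account: str  # different accounts allow different permissions


# Hypothetical queues and shapes, for illustration only.
QUEUE_TO_CLUSTER = {
    "etl": ClusterShape(
        "etl-cluster", "n1-highmem-8", 10,
        "etl-sa@example-project.iam.gserviceaccount.com",
    ),
    "adhoc": ClusterShape(
        "adhoc-cluster", "n1-standard-4", 2,
        "adhoc-sa@example-project.iam.gserviceaccount.com",
    ),
}


def create_command(shape: ClusterShape, region: str) -> List[str]:
    """Build (but do not execute) the gcloud command for one cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", shape.name,
        f"--region={region}",
        f"--worker-machine-type={shape.worker_machine_type}",
        f"--num-workers={shape.num_workers}",
        f"--service-account={shape.service_account}",
    ]
```

Each workload then gets a cluster sized for it, instead of competing for capacity inside one shared cluster.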
Override the MapReduce done directory, the intermediate done directory, and the Spark event log directory with Cloud Storage directories.
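The overrides can be sketched as three cluster properties that point at a Cloud Storage bucket. The property keys are the standard Hadoop and Spark settings with Dataproc's `mapred:` and `spark:` file prefixes; the bucket name and the directory layout under it are assumptions made for illustration.

```python
from typing import Dict


def history_properties(bucket: str) -> Dict[str, str]:
    """Return properties that move job history and event logs to Cloud Storage."""
    base = f"gs://{bucket}"
    return {
        # MapReduce done directory.
        "mapred:mapreduce.jobhistory.done-dir": f"{base}/mapreduce-job-history/done",
        # MapReduce intermediate done directory.
        "mapred:mapreduce.jobhistory.intermediate-done-dir":
            f"{base}/mapreduce-job-history/intermediate-done",
        # Spark event log directory.
        "spark:spark.eventLog.dir": f"{base}/spark-job-history",
    }


def to_properties_flag(properties: Dict[str, str]) -> str:
    """Format the properties as one --properties flag for gcloud."""
    pairs = ",".join(f"{key}={value}" for key, value in sorted(properties.items()))
    return f"--properties={pairs}"


if __name__ == "__main__":
    # "example-history-bucket" is a hypothetical bucket name.
    print(to_properties_flag(history_properties("example-history-bucket")))
```

With these directories off the cluster, job history remains available after the cluster is deleted.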