Dataproc


Last updated 4 years ago


A two-node cluster configuration created by a Cloud Data Fusion pipeline

Dataproc staging and temp buckets (see the staging-bucket documentation linked at the bottom of this page)

Dataproc and long running clusters

Externalize Hive metastore database with Cloud SQL

The Hive metastore stores the schema and location of each Hive table (on Google Cloud, the table's Cloud Storage location). MySQL is commonly used as the metastore database; on Dataproc it can be replaced with Cloud SQL.

With Cloud SQL as the metastore, metadata is easier to find, and multiple clusters can be attached to the same source of Hive metadata.
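A minimal sketch of this setup, using the Dataproc cloud-sql-proxy initialization action; the project, region, instance, and cluster names are placeholders:

```shell
# Create a Cloud SQL (MySQL) instance to hold the Hive metastore.
gcloud sql instances create hive-metastore-db \
    --database-version=MYSQL_5_7 \
    --tier=db-n1-standard-1 \
    --region=us-central1

# Create a Dataproc cluster that proxies its Hive metastore traffic to
# the Cloud SQL instance via the cloud-sql-proxy initialization action.
gcloud dataproc clusters create hive-cluster \
    --region=us-central1 \
    --scopes=sql-admin \
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/cloud-sql-proxy/cloud-sql-proxy.sh \
    --metadata=hive-metastore-instance=my-project:us-central1:hive-metastore-db
```

Additional clusters created with the same `hive-metastore-instance` metadata share the same Hive metadata.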

Know your way around Stackdriver

Stackdriver Logging captures the daemon logs and YARN container logs from Cloud Dataproc clusters. Stackdriver Monitoring collects and ingests metrics, events, and metadata from Cloud Dataproc clusters.

Monitoring is enabled with the Dataproc cluster property dataproc:dataproc.monitoring.stackdriver.enable.
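A sketch of setting that property at cluster creation time (cluster and region names are placeholders):

```shell
# Enable the Stackdriver monitoring agent on all cluster VMs
# via the dataproc-prefixed cluster property.
gcloud dataproc clusters create monitored-cluster \
    --region=us-central1 \
    --properties=dataproc:dataproc.monitoring.stackdriver.enable=true
```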

Transform YARN queues into workflow templates

Transition YARN queues into separate clusters with unique cluster shapes and potentially different permissions. If data, metadata, and the permissioning system are persisted off the cluster, they can be reused across those clusters.
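One way to sketch this with gcloud: give each former YARN queue its own workflow template backed by a managed (ephemeral) cluster of the shape that queue needs. Template, cluster, jar, and class names below are placeholders:

```shell
# One template per former YARN queue.
gcloud dataproc workflow-templates create etl-queue-template --region=us-central1

# The managed cluster defines this queue's unique cluster shape; it is
# created at instantiation time and deleted when the workflow finishes.
gcloud dataproc workflow-templates set-managed-cluster etl-queue-template \
    --region=us-central1 \
    --cluster-name=etl-queue-cluster \
    --num-workers=4 \
    --worker-machine-type=n1-standard-8

# Attach the queue's job to the template.
gcloud dataproc workflow-templates add-job spark \
    --workflow-template=etl-queue-template \
    --region=us-central1 \
    --step-id=etl-step \
    --class=com.example.EtlJob \
    --jars=gs://my-bucket/etl.jar

# Run the workflow: cluster creation, job, and teardown in one step.
gcloud dataproc workflow-templates instantiate etl-queue-template --region=us-central1
```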

Consolidate job history across multiple clusters

Override the MapReduce done directory, the intermediate done directory, and the Spark event log directory with a Cloud Storage directory, then stop the per-cluster history servers:

systemctl stop hadoop-mapreduce-historyserver
systemctl stop spark-history-server
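The directory overrides can be set as cluster properties at creation time, so every cluster writes its job history to the same Cloud Storage location; the bucket and cluster names are placeholders:

```shell
# Point MapReduce and Spark job history at shared Cloud Storage paths so
# history survives cluster deletion and is consolidated across clusters.
gcloud dataproc clusters create history-cluster \
    --region=us-central1 \
    --properties="mapred:mapreduce.jobhistory.done-dir=gs://my-history-bucket/done-dir,\
mapred:mapreduce.jobhistory.intermediate-done-dir=gs://my-history-bucket/intermediate-done-dir,\
spark:spark.eventLog.dir=gs://my-history-bucket/spark-events,\
spark:spark.history.fs.logDirectory=gs://my-history-bucket/spark-events"
```

A single long-running history server pointed at the same bucket can then serve history for all clusters.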


https://cloud.google.com/blog/products/data-analytics/10-tips-for-building-long-running-clusters-using-cloud-dataproc
https://cloud.google.com/dataproc/docs/concepts/compute/secondary-vms
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/staging-bucket