Cloud Fusion

Working with Cloud Fusion


Roles required

Service Account: service-370266331882@gcp-sa-datafusion.iam.gserviceaccount.com
Role: Cloud Data Fusion API Service Agent
Description: Gives the Cloud Data Fusion service account access to Service Networking, Cloud Dataproc, Cloud Storage, BigQuery, Cloud Spanner, and Cloud Bigtable resources.

Service Account: 43018728793-compute@developer.gserviceaccount.com (Compute Engine default service account)
Roles:

  • Cloud Data Fusion Runner

  • Cloud Data Fusion API Service Agent Editor

Description: Access to Data Fusion runtime resources, and gives the Cloud Data Fusion service account access to Service Networking, Cloud Dataproc, Cloud Storage, BigQuery, Cloud Spanner, and Cloud Bigtable resources.
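
These roles are normally granted in the console or with gcloud. As a minimal illustrative sketch, the binding for the Cloud Data Fusion Runner role (roles/datafusion.runner) could also be added with the google-cloud-resource-manager client library; the project ID below is a placeholder:

from google.cloud import resourcemanager_v3
from google.iam.v1 import policy_pb2

PROJECT = "projects/my-project"  # placeholder project ID
MEMBER = "serviceAccount:43018728793-compute@developer.gserviceaccount.com"

client = resourcemanager_v3.ProjectsClient()

# Read-modify-write of the project IAM policy; the returned etag guards
# against concurrent modifications when the policy is written back.
policy = client.get_iam_policy(request={"resource": PROJECT})
policy.bindings.append(
    policy_pb2.Binding(role="roles/datafusion.runner", members=[MEMBER])
)
client.set_iam_policy(request={"resource": PROJECT, "policy": policy})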

Components:

  • Source

  • Transform

  • Analytics

  • Sink

  • Conditions and Actions

Pipeline Import

Transform

Wrangler

Deploying the Pipeline

Templating

Pipeline configure

Deploy

Runtime arguments

  • Key: system.profile.name

  • Value: SYSTEM:dataproc
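
This pair selects the system Dataproc compute profile for the run. When the pipeline is started programmatically, for example from the Airflow operator in the next section, the same pair is passed as a plain dictionary of runtime arguments. A minimal sketch:

# Runtime arguments supplied when the pipeline run is started;
# SYSTEM:dataproc selects the system-scoped Dataproc compute profile.
runtime_args = {"system.profile.name": "SYSTEM:dataproc"}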

Airflow and Data Fusion

By populating the Data Fusion operators with a few parameters, we can deploy, start, and stop pipelines directly from Airflow.

from airflow.providers.google.cloud.operators.datafusion import CloudDataFusionStartPipelineOperator

# LOCATION, PIPELINE_NAME and INSTANCE_NAME are placeholders identifying the
# Data Fusion instance and the deployed pipeline to start.
start_pipeline = CloudDataFusionStartPipelineOperator(
    location=LOCATION,
    pipeline_name=PIPELINE_NAME,
    instance_name=INSTANCE_NAME,
    task_id="start_pipeline",
)

Task ordering is declared with the bit-shift operator:

gcs_sensor_task >> start_pipeline
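
A minimal end-to-end DAG sketch putting the two tasks together. The DAG id, bucket, object path, region, and instance/pipeline names are hypothetical placeholders, and the GCSObjectExistenceSensor is just one way to gate the pipeline on an input file landing in Cloud Storage:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

with DAG(
    dag_id="datafusion_shipment_pipeline",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Wait until the input file has landed in Cloud Storage.
    gcs_sensor_task = GCSObjectExistenceSensor(
        task_id="wait_for_input_file",
        bucket="my-ingest-bucket",            # hypothetical bucket
        object="shipments/shipments.csv",     # hypothetical object path
    )

    # Start the deployed Data Fusion pipeline on the default Dataproc profile.
    start_pipeline = CloudDataFusionStartPipelineOperator(
        task_id="start_pipeline",
        location="us-central1",               # placeholder region
        instance_name="my-datafusion-instance",
        pipeline_name="Reusable-Pipeline",
        runtime_args={"system.profile.name": "SYSTEM:dataproc"},
    )

    # Bit-shift operator declares the ordering: sensor first, then pipeline.
    gcs_sensor_task >> start_pipeline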

Data Lineage

Abstract

Example

Shipment data

Dataset level lineage

Field level lineage

This shows the history of changes a particular field has gone through.
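
Both dataset-level and field-level lineage can also be read programmatically through the instance's CDAP REST API. The sketch below is a rough illustration only: the endpoint path, the shipment_data dataset name, and the time-range parameters are assumptions to verify against the CDAP lineage API documentation for your Data Fusion version.

import requests

# Hypothetical values: the instance's CDAP API endpoint (from
# `gcloud beta data-fusion instances describe`) and an OAuth2 access token.
CDAP_ENDPOINT = "https://my-instance-my-project-dot-usc1.datafusion.googleusercontent.com/api"
ACCESS_TOKEN = "ya29...."

# Assumed field-level lineage endpoint for a dataset named shipment_data.
url = f"{CDAP_ENDPOINT}/v3/namespaces/default/datasets/shipment_data/lineage/fields"
response = requests.get(
    url,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    params={"start": "now-7d", "end": "now"},
)
print(response.json())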

Troubleshooting

If executor memory is not sufficient, increase the executor memory in the pipeline configuration.

PROVISION task failed in REQUESTING_CREATE state for program run program_run:default.Reusable-Pipeline.-SNAPSHOT.workflow.DataPipelineWorkflow.a48de625-b92a-11eb-b814-66625c4294b9 due to Dataproc operation failure: INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 1096.0.
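
The provisioning error above means the ephemeral Dataproc cluster requested more persistent-disk capacity (3000 GB) than the project's remaining regional DISKS_TOTAL_GB quota (1096 GB). The options are to request a quota increase or to shrink the cluster through the compute profile. One way to apply both fixes is as runtime arguments, either in the pipeline's runtime arguments dialog or via runtime_args in the Airflow operator above. The property names below are assumptions based on the CDAP executor settings and the Dataproc provisioner, so verify them for your Data Fusion version:

# Assumed CDAP / Dataproc-provisioner property names; verify before use.
runtime_args = {
    # Per-executor memory in MB (for "insufficient memory" failures).
    "task.executor.system.resources.memory": "4096",
    # Smaller and fewer worker disks so the ephemeral Dataproc cluster fits
    # under the regional DISKS_TOTAL_GB quota.
    "system.profile.properties.workerDiskGB": "300",
    "system.profile.properties.workerNumNodes": "2",
}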

References:

  • Airflow Data Fusion operators: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/datafusion.html

  • Start a Data Fusion pipeline: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/datafusion.html#start-a-datafusion-pipeline

  • Data lineage systems in a data warehouse: https://cloud.google.com/solutions/architecture-concept-data-lineage-systems-in-a-data-warehouse

Figure: Pipeline to redact PII (phone number)
Figure: Lineage extraction flowchart