- Data pipelines can run Spark jobs on an Amazon EMR cluster. To execute a pipeline on an EMR cluster, first configure the pipeline to use EMR as the cluster manager type, then configure the cluster properties.
- Cluster properties control which users can access the cluster, whether logging is enabled, which bootstrap actions run, and whether Transformer terminates the cluster when the pipeline stops.
- The most cost-effective way to run a Transformer pipeline is to configure the cluster to shut down automatically when the pipeline stops.
- Running many pipelines on a single existing cluster can also cut costs. Amazon states that the EMR runtime for Spark can be up to 3x faster than open-source Spark, and EMR can run on EC2, on Kubernetes (EMR on EKS), or serverless (EMR Serverless).
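The auto-termination setup described above can be sketched with the AWS SDK. This is a minimal illustration, not a definitive configuration: the cluster name, instance types, idle timeout, and S3 log bucket are all assumed placeholders, and the payload shape follows boto3's `emr.run_job_flow` API.

```python
# Sketch: an EMR cluster request with an auto-termination policy, so the
# cluster shuts down after sitting idle once the pipeline stops. All
# values (names, instance types, timeout, bucket) are illustrative
# assumptions, not recommendations.

def build_emr_cluster_request(name: str, idle_timeout_seconds: int = 900) -> dict:
    """Build the request body for boto3's emr.run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.5.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Terminate the cluster after the idle timeout -- the
        # cost-saving behavior described above.
        "AutoTerminationPolicy": {"IdleTimeout": idle_timeout_seconds},
        "LogUri": "s3://my-bucket/emr-logs/",  # hypothetical bucket
    }

request = build_emr_cluster_request("transformer-pipeline-cluster")
# Submit with: boto3.client("emr").run_job_flow(**request)
```

Building the request as a plain dict keeps the sketch runnable without AWS credentials; in practice you would pass it straight to `run_job_flow`.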
Orientation to EMR Serverless
EMR Serverless is a deployment option for Amazon EMR introduced in November 2021. Its serverless runtime environment removes the need for direct involvement in cluster configuration, maintenance, and scaling, because Amazon handles those tasks. There is no need to manage virtual machines (VMs) or to install and update runtime software. Applications can be started, stopped, and deleted immediately, which simplifies operations and lowers labor costs. Multi-AZ resiliency is supported.
EMR Serverless: Why?
- Simpler to use, with fewer decisions to make
- No need to guess cluster size; all the benefits of EMR without managing a cluster
- Fine-grained scaling saves costs
- Enables secure shared applications and interactive applications
- Resiliency against Availability Zone failures
- Easy to switch deployment modes
Characteristics of EMR Serverless
- EMR Serverless offers petabyte-scale analytics processing with open-source big data frameworks such as Apache Spark, Hive, and Presto.
- Jobs can be submitted through the Amazon console, EMR Studio, APIs, command-line interface (CLI), SDK, and soon JDBC and ODBC.
- Jobs can be orchestrated as part of data pipelines using AWS Managed Workflows for Apache Airflow (Amazon MWAA) and AWS Step Functions, or with SageMaker for machine learning workflows.
- EMR notebooks, which are serverless notebooks, also allow you to run queries and write code.
- Debugging jobs is simple, and compute environments are isolated with defined guardrails.
- Support for private VPCs and Graviton2 was recently added.
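Among the submission paths listed above, the SDK route can be sketched as follows. The application ID, IAM role ARN, and S3 paths are hypothetical placeholders; the payload shape follows boto3's `emr-serverless` `start_job_run` API.

```python
# Sketch: submitting a Spark job to an EMR Serverless application through
# the SDK. All identifiers below are placeholder assumptions.

def build_serverless_job_run(application_id: str, role_arn: str, script_uri: str) -> dict:
    """Build the request body for boto3's emr-serverless start_job_run call."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,
                "sparkSubmitParameters": "--conf spark.executor.memory=4g",
            }
        },
    }

payload = build_serverless_job_run(
    "00example-app-id",                                   # hypothetical
    "arn:aws:iam::123456789012:role/EMRServerlessRole",   # hypothetical
    "s3://my-bucket/scripts/etl_job.py",                  # hypothetical
)
# Submit with: boto3.client("emr-serverless").start_job_run(**payload)
```

Note that there is no cluster to size or provision in this payload; the application scales the job's compute on its own, which is the core difference from the EMR-on-EC2 request shown earlier.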
AWS Glue is billed as “Simple, Scalable, and Serverless Data Integration.” Glue can be used for many activities, including running ETL processes to prepare data, serving as a metadata repository, and automatically discovering schemas. Glue provisions and manages the compute resources required to run your data pipelines. Because Glue is a serverless service, you don’t have to build or manage the infrastructure—Glue takes care of it for you.
AWS Glue charges an hourly rate based on the number of Data Processing Units (DPUs), billed per second, for ETL jobs and crawlers. The AWS Glue Data Catalog charges a monthly fee to store and access metadata.
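The DPU-hour pricing model above can be made concrete with a small cost estimate. The $0.44 per-DPU-hour rate is an assumed example (actual rates vary by region), and the sketch ignores Glue's minimum billed duration per job.

```python
# Sketch: estimating an AWS Glue ETL job's cost from DPU-hours.
# Glue bills per second at an hourly per-DPU rate; the default rate
# here is an assumption for illustration, not the current price.

def glue_etl_cost(dpus: int, runtime_seconds: int, rate_per_dpu_hour: float = 0.44) -> float:
    """Per-second billing: DPUs * hours * hourly rate."""
    dpu_hours = dpus * runtime_seconds / 3600.0
    return round(dpu_hours * rate_per_dpu_hour, 4)

# A 10-DPU job that runs for 15 minutes (900 seconds):
cost = glue_etl_cost(dpus=10, runtime_seconds=900)
# 10 * 900 / 3600 = 2.5 DPU-hours; 2.5 * 0.44 = 1.10
```

Because billing is per second, a job that finishes faster on more DPUs can cost the same as a slower run on fewer DPUs, which is worth checking before tuning capacity.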
Although some functionality is still limited because EMR Serverless is under active development, the services described above are sensible choices depending on business needs.