AWS Glue is a serverless data integration service for discovering, preparing, and combining data for analytics, machine learning, and application development. It supports ETL workflows, data cataloging, and scalable data processing on AWS.
AWS Glue is a fully managed, serverless data integration service provided by Amazon Web Services that enables organizations to discover, prepare, move, and combine data from multiple sources for analytics, machine learning, and application development. It eliminates the need to provision or manage infrastructure for ETL (Extract, Transform, Load) pipelines, allowing data engineers and analysts to focus on transformation logic rather than cluster management.
At its core, AWS Glue provides several integrated components. The Glue Data Catalog is a centralized, persistent metadata repository compatible with the Apache Hive Metastore; it stores table definitions, schemas, and partition information for data assets across S3, RDS, Redshift, and dozens of other data stores. Glue Crawlers scan data sources, infer schemas, and populate the Data Catalog automatically, reducing manual cataloging effort. Glue ETL Jobs run on a managed Apache Spark or Apache Ray environment; Spark jobs can be written in Python (PySpark) or Scala for batch transformations, and auto-scaling adjusts the number of Data Processing Units (DPUs) to match the workload. As of Glue version 4.0, jobs run on an optimized Spark 3.3.0 runtime with up to 2.7x faster start times and improved performance over earlier versions.
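To make the crawler-to-catalog flow concrete, here is a minimal sketch of how a crawler over an S3 prefix might be defined through the boto3 Glue client. The crawler name, IAM role ARN, database name, and bucket path are hypothetical placeholders, and the schema change policy shown is just one reasonable choice rather than a recommended default. The request builder runs on its own; the function that actually calls AWS assumes boto3 is installed and credentials are configured.

```python
from typing import Any


def build_crawler_request(
    name: str, role_arn: str, database: str, s3_path: str
) -> dict[str, Any]:
    """Assemble keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database the tables land in
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update changed table definitions in place; only log objects that
        # have disappeared instead of deleting their catalog entries.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


def create_and_start_crawler(request: dict[str, Any]) -> None:
    """Create the crawler and start a run (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="sales-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/raw/",
)
print(request["Targets"])
```

Once a crawler like this finishes, the inferred tables appear in the named Data Catalog database and can be queried from services such as Athena.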
AWS Glue also supports streaming ETL for near-real-time data processing from Amazon Kinesis Data Streams and Apache Kafka sources, enabling continuous ingestion pipelines. Glue DataBrew provides a visual, no-code data preparation interface with over 250 built-in transformations, making data cleaning accessible to analysts without programming expertise. Glue Studio offers a visual drag-and-drop interface for authoring, running, and monitoring ETL jobs.
The service integrates natively with the broader AWS ecosystem, including Amazon S3, Amazon Redshift, Amazon Athena, Amazon EMR, and AWS Lake Formation. It includes the AWS Glue Schema Registry for managing and enforcing Avro, JSON, and Protobuf schemas in streaming applications. Job bookmarks track what a job has already processed, so incremental loads pick up only new data, while triggers and workflows orchestrate complex multi-step ETL pipelines.
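The idea behind job bookmarks can be illustrated with a small, self-contained sketch. This is not Glue's implementation: the `Bookmark` class, the `run_incremental` function, and the `ts` field are hypothetical names chosen for illustration. It only shows the general pattern of persisting a high-water mark between runs and skipping anything at or below it.

```python
from dataclasses import dataclass


@dataclass
class Bookmark:
    """Hypothetical state holder: the highest timestamp processed so far."""

    last_processed: int = 0


def run_incremental(records: list[dict], bookmark: Bookmark) -> list[dict]:
    """Return only records newer than the bookmark, then advance the bookmark."""
    new_records = [r for r in records if r["ts"] > bookmark.last_processed]
    if new_records:
        bookmark.last_processed = max(r["ts"] for r in new_records)
    return new_records


bookmark = Bookmark()
source = [{"id": "a", "ts": 1}, {"id": "b", "ts": 2}]

first_run = run_incremental(source, bookmark)  # sees both records

source.append({"id": "c", "ts": 3})  # new data arrives between runs
second_run = run_incremental(source, bookmark)  # sees only the new record

print([r["id"] for r in first_run], [r["id"] for r in second_run])
```

In an actual Glue job the state is managed by the service; bookmarks are typically switched on by passing the job argument `--job-bookmark-option` with the value `job-bookmark-enable`.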
AWS Glue processes petabytes of data for organizations ranging from startups to enterprises. It supports JDBC, ODBC, and native connectors to over 70 data sources, including SaaS applications reached through AWS Glue custom connectors and the AWS Marketplace. The service operates across all major AWS regions, is in scope for SOC and PCI DSS compliance programs, and is HIPAA eligible, making it suitable for regulated industries.
Pricing as listed: Free; from $0.44/DPU-hour; $1.00 per node-hour; $1.00 per 100,000 objects/month.
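As a rough illustration of how the $0.44/DPU-hour rate translates into a job cost, the sketch below multiplies DPUs by billed hours. The function name and parameters are hypothetical, and the 60-second minimum billing duration is an assumption; actual rates and minimums vary by region, job type, and Glue version, so treat the result as an estimate only.

```python
def estimate_job_cost(
    dpus: float,
    runtime_seconds: float,
    rate_per_dpu_hour: float = 0.44,  # rate taken from the listing above
    minimum_billed_seconds: float = 60,  # assumption: one-minute minimum
) -> float:
    """Estimate the cost in USD of one job run billed per second."""
    billed_seconds = max(runtime_seconds, minimum_billed_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour


# A 10-DPU job that runs for 30 minutes.
cost = estimate_job_cost(dpus=10, runtime_seconds=30 * 60)
print(f"${cost:.2f}")
```

Under these assumptions, doubling either the DPU count or the runtime doubles the estimate, which is why right-sizing workers matters for short, frequent jobs.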