AWS Glue
How to connect UltiHash to AWS Glue
AWS Glue enables users to run Spark in an AWS environment. To start an AWS Glue session, you need an UltiHash cluster deployed on AWS; then import the AWS Glue modules and configure the S3A driver as shown in the code below:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
# Define the UltiHash endpoint URL
s3_endpoint = "<https://ultihash.cluster.io>"
sc = SparkContext()
# AWS access and secret keys can currently be set to any value, since authentication is not yet supported by UltiHash; replace them with the corresponding UltiHash credentials once it is
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "mocked")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "mocked")
# The S3 endpoint is a URL pointing to the deployed UltiHash cluster
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", s3_endpoint)
# S3 path style access has to be enabled
sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
# Use the S3A filesystem implementation for s3a:// URIs
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
# SSL is disabled on the S3A connection here; enable it if the UltiHash endpoint is served over HTTPS
sc._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")
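# Initialise the Glue context, Spark session and Glue job from the configured Spark context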
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
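Once the job is initialised, Spark reads from and writes to the UltiHash cluster through s3a:// paths. Below is a minimal sketch of a read/write round trip; the bucket name example-bucket and the file paths are assumptions for illustration only.
# Read a CSV dataset from the UltiHash cluster (bucket and path are assumptions)
df = spark.read.csv("s3a://example-bucket/input/data.csv", header=True)
# Write the result back to the cluster as Parquet
df.write.mode("overwrite").parquet("s3a://example-bucket/output/data_parquet/")
# Commit the Glue job once the work is done
job.commit()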
The full integration example is available on GitHub: https://github.com/UltiHash/scripts/tree/main/glue