This guide walks you through setting up a PySpark session configured for Kafka and UltiHash, enabling you to stream, process, and store data efficiently. You’ll learn how to configure environment variables, launch a local Kafka instance using Docker, produce and consume Kafka messages, and write structured data directly to an UltiHash bucket using PySpark. Ideal for real-time data pipelines or testing event-driven workflows.
Start a PySpark session with config for Kafka
To begin, export your credentials and session variables, then launch a PySpark session with the packages and configuration needed to connect to both Kafka and UltiHash’s S3-compatible storage.
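For example, a session launch from your shell might look like the sketch below. The endpoint URL, credentials, and package versions are placeholders for illustration; substitute the values for your UltiHash deployment and your Spark/Hadoop versions. The path-style and SSL settings shown are typical for a local, self-hosted S3-compatible endpoint; adjust them to match your setup.

# Credentials for the UltiHash S3-compatible endpoint (placeholder values)
export AWS_ACCESS_KEY_ID=<your-ultihash-access-key>
export AWS_SECRET_ACCESS_KEY=<your-ultihash-secret-key>

# Launch PySpark with the Kafka connector and the S3A filesystem,
# pointing S3A at the UltiHash endpoint (assumed here to be http://localhost:8080)
pyspark \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.hadoop.fs.s3a.endpoint=http://localhost:8080 \
  --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
  --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false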
Once your Spark session is running, you can create a simple DataFrame and write it as a Parquet file directly to a specified UltiHash bucket and path.
>>> from pyspark.sql import SparkSession
...
... # Start your Spark session (should already be running with S3 configs from your CLI)
... spark = SparkSession.builder.appName("TestWriteToUltiHash").getOrCreate()
...
... # Create a basic DataFrame
... df = spark.createDataFrame([
... ("Alice", 25),
... ("Bob", 32),
... ], ["name", "age"])
...
... # Write the DataFrame as Parquet to UltiHash under the 'sample-data/' prefix in your 'kafka' bucket
... df.write.mode("overwrite").parquet("s3a://kafka/sample-data")
...
... # Optional: Stop the Spark session (only if you're done)
... spark.stop()
...
25/03/26 14:03:01 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
25/03/26 14:03:02 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
>>>
Create docker-compose.kafka.yaml
Set up a minimal Kafka and Zookeeper environment using Docker Compose. This configuration runs both services locally and exposes the standard ports (2181 for Zookeeper, 9092 for Kafka).
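A minimal sketch of docker-compose.kafka.yaml is shown below, assuming the Confluent community images (confluentinc/cp-zookeeper and confluentinc/cp-kafka); the image tags and advertised listener address are illustrative and should be adjusted to your environment.

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0   # assumed tag; pin the version you want
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
    ports:
      - "2181:2181"

  kafka:
    image: confluentinc/cp-kafka:7.5.0       # assumed tag; pin the version you want
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Advertise the broker on localhost so clients on the host (e.g. PySpark) can connect
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    ports:
      - "9092:9092"

Bring the stack up with docker compose -f docker-compose.kafka.yaml up -d; Kafka will then be reachable from the host at localhost:9092.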