Iceberg
A lakehouse architecture is built on two key components: an open table format like Apache Iceberg, and scalable object storage like UltiHash. In this setup, users typically start with structured data (e.g., CSV files) and convert it to the Iceberg format before storing it in UltiHash. This approach enables efficient querying, schema evolution, and ACID guarantees on object storage.
Below is a step-by-step guide using PySpark to convert CSV data into an Iceberg table and store it in UltiHash.
Start a PySpark Session
To get started, launch a PySpark session with the required dependencies and configurations:
Make sure your target bucket exists on UltiHash.
Include the necessary Iceberg, Hadoop AWS, and AWS SDK packages.
Configure the S3A driver properly to connect to UltiHash’s endpoint.
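A minimal session setup might look like the following sketch. The catalog name (ultihash), bucket, endpoint URL, credentials, and package versions are placeholders to adapt to your deployment:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv-to-iceberg")
    # Iceberg Spark runtime plus the Hadoop AWS / AWS SDK bundle for the S3A driver
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,"
        "org.apache.hadoop:hadoop-aws:3.3.4,"
        "com.amazonaws:aws-java-sdk-bundle:1.12.262",
    )
    # Enable Iceberg SQL extensions and register a catalog backed by the UltiHash bucket
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.ultihash", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ultihash.type", "hadoop")
    .config("spark.sql.catalog.ultihash.warehouse", "s3a://my-bucket/warehouse")
    # Point the S3A driver at the UltiHash endpoint (placeholder URL and credentials)
    .config("spark.hadoop.fs.s3a.endpoint", "http://ultihash.example:8080")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```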
Create an Iceberg Table
Once the session is running, start by creating a namespace if it doesn’t already exist. This serves as the logical container for your tables.
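For example, assuming the catalog was registered as ultihash and using db as an illustrative namespace name:

```python
# Create the namespace (logical container for tables) if it does not exist yet
spark.sql("CREATE NAMESPACE IF NOT EXISTS ultihash.db")
```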
Now you can define your Iceberg table, specifying the schema and any table-level properties such as format version or metadata retention:
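A sketch with a hypothetical customers schema; adjust the columns and properties to your data:

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS ultihash.db.customers (
        id          BIGINT,
        name        STRING,
        signup_date DATE
    )
    USING iceberg
    TBLPROPERTIES (
        'format-version' = '2',
        'write.metadata.previous-versions-max' = '5'
    )
""")
```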
To verify the table was created successfully:
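```python
# List tables in the namespace and inspect the new table's schema
spark.sql("SHOW TABLES IN ultihash.db").show()
spark.sql("DESCRIBE TABLE ultihash.db.customers").show(truncate=False)
```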
Load data from a structured format (e.g., CSV)
Read your CSV data into a DataFrame, enabling schema inference and header detection:
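For example, with a placeholder file path:

```python
df = (
    spark.read
    .option("header", "true")       # treat the first row as column names
    .option("inferSchema", "true")  # infer column types from the data
    .csv("data/customers.csv")      # placeholder path to your CSV file
)
df.printSchema()
```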
Write data to UltiHash in Iceberg format
With your DataFrame ready, append the records to the Iceberg table stored in your UltiHash bucket:
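Using the table name from the example above:

```python
# DataFrameWriterV2 API: append the DataFrame's rows to the Iceberg table
df.writeTo("ultihash.db.customers").append()
```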
Read Iceberg data from UltiHash
To confirm the data was written correctly, simply query the table:
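```python
# Query the Iceberg table directly with SQL
spark.sql("SELECT * FROM ultihash.db.customers LIMIT 10").show()

# Or read it back as a DataFrame
spark.table("ultihash.db.customers").count()
```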