
Add Lake Formation credential vending for offline store ingestion (Spark 3.5+) #57

Open
BassemHalim wants to merge 29 commits into main from feat/lakeformation

Conversation

@BassemHalim (Contributor) commented Apr 10, 2026

Issue

Add Lake Formation credential vending support to the SageMaker Feature Store Spark connector, enabling secure S3 access through Lake Formation-scoped temporary credentials instead of relying on the caller's IAM permissions.

Description

This PR introduces an opt-in useLakeFormationCredentials parameter (Scala) / use_lake_formation_credentials parameter (Python) to ingestData / ingest_data. When enabled, the connector vends temporary credentials via GetTemporaryGlueTableCredentials and configures Hadoop S3A with per-bucket credentials scoped to the offline store's S3 location.

Key behaviors:

  • Credentials are configured per-bucket (fs.s3a.bucket.<bucket>.*), so only the target offline store bucket uses LF-scoped credentials (see the sketch after this list)
  • Credential refresh happens just before the write to minimize the expiration window
  • If credential vending fails, a RuntimeException is thrown with an actionable error message -- the connector does not fall back silently
  • For Hive/Glue offline stores, the S3A magic committer is configured automatically (auto-detects EMR's SQLEmrOptimizedCommitProtocol, falls back to open-source spark-hadoop-cloud, or fails fast with a clear error)
  • Cross-account credential vending is supported
  • Fully backward compatible: useLakeFormationCredentials defaults to false
  • Gated to Spark 3.5+ at build time (version-specific Scala source directories) and runtime (Python raises ValueError on PySpark < 3.5)
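For a concrete picture of the per-bucket scoping, here is roughly what the equivalent Hadoop S3A configuration looks like when set by hand from PySpark. The bucket name and credential values are placeholders; the connector applies the corresponding settings internally via SparkSessionInitializer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per-bucket-s3a-sketch").getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

bucket = "my-offline-store-bucket"                          # placeholder
access_key, secret_key, session_token = "...", "...", "..."  # LF-vended values (placeholders)

# Scope the temporary credentials to the offline store bucket only;
# every other bucket keeps the default credential provider chain.
prefix = f"fs.s3a.bucket.{bucket}"
hadoop_conf.set(f"{prefix}.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set(f"{prefix}.access.key", access_key)
hadoop_conf.set(f"{prefix}.secret.key", secret_key)
hadoop_conf.set(f"{prefix}.session.token", session_token)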

Key Changes

Scala

  • LakeFormationCredentials -- case class with expiration tracking (isExpiringSoon with 5-min buffer; see the sketch after this list)
  • LakeFormationHelper -- singleton handling credential vending, automatic refresh, Glue table ARN construction, and LF prefix seeding
  • FeatureGroupArnResolver -- extended with resolveAccountId() and resolvePartition() for Glue table ARN construction (supports China and GovCloud partitions)
  • ClientFactory -- extended with lazy LakeFormationClient getter
  • SparkSessionInitializer -- per-bucket S3A TemporaryAWSCredentialsProvider config and S3A magic committer setup (EMR auto-detect, open-source fallback, fail-fast)
  • MinSparkVersionGate -- build-time Spark 3.5+ gating via version-specific source directories
  • FeatureStoreManager -- useLakeFormationCredentials parameter added to ingestData/ingestDataInJava (defaults to false)
  • SLF4J excluded from fat JAR assembly to fix logging conflict with PySpark's SLF4J 2.x binding
  • lakeformation SDK added to build.sbt dependencies
  • Version bumped to 2.0.0
  • sbt-sonatype plugin reverted to 3.9.10
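Referenced from the LakeFormationCredentials bullet above: a rough Python rendering of the expiration check. The 5-minute buffer comes from the description; the class and field names here are illustrative stand-ins, not the actual Scala API.

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

EXPIRY_BUFFER = timedelta(minutes=5)

@dataclass
class VendedCredentials:              # hypothetical stand-in for LakeFormationCredentials
    access_key_id: str
    secret_access_key: str
    session_token: str
    expiration: datetime              # timezone-aware UTC expiration timestamp

    def is_expiring_soon(self) -> bool:
        # Treat credentials as stale once fewer than 5 minutes remain.
        return datetime.now(timezone.utc) >= self.expiration - EXPIRY_BUFFER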

Python

  • FeatureStoreManager.ingest_data() -- gains use_lake_formation_credentials parameter (default False)
  • PySpark version check raises ValueError if PySpark < 3.5 and use_lake_formation_credentials=True
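A sketch of the kind of guard this implies; the exact message and placement in the connector may differ.

import pyspark

def require_spark_35(use_lake_formation_credentials: bool) -> None:
    # Illustrative check: LF credential vending is only supported on PySpark 3.5+.
    if not use_lake_formation_credentials:
        return
    major, minor = (int(part) for part in pyspark.__version__.split(".")[:2])
    if (major, minor) < (3, 5):
        raise ValueError(
            "use_lake_formation_credentials=True requires PySpark >= 3.5, "
            f"but found {pyspark.__version__}"
        )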

API Changes

Scala

val featureStoreManager = new FeatureStoreManager()

// Default: no LF credential vending (backward compatible)
featureStoreManager.ingestData(dataFrame, featureGroupArn, directOfflineStore = true)

// Enable LF credential vending
featureStoreManager.ingestData(
  dataFrame,
  featureGroupArn,
  directOfflineStore = true,
  useLakeFormationCredentials = true
)

Python

feature_store_manager = FeatureStoreManager()

# Default: no LF credential vending (backward compatible)
feature_store_manager.ingest_data(
    input_data_frame=df, feature_group_arn=arn, direct_offline_store=True
)

# Enable LF credential vending
feature_store_manager.ingest_data(
    input_data_frame=df,
    feature_group_arn=arn,
    direct_offline_store=True,
    use_lake_formation_credentials=True,
)

Testing

Unit Tests

  • LakeFormationHelper: vend success, vend failure, credential refresh
  • LakeFormationCredentials: expiry and isExpiringSoon logic
  • FeatureGroupArnResolver: account ID/partition resolution including China and GovCloud ARNs
  • SparkSessionInitializer: magic committer config (EMR, non-EMR, missing), per-bucket LF credential config
  • FeatureStoreManagerLakeFormationTest (Spark 3.5+ only): verifies vendCredentials is not called when LF is disabled
  • Python: use_lake_formation_credentials parameter passthrough to JVM
  • Scala test coverage increased from 82% to 93% (JaCoCo)

Integration Tests

  • LakeFormationHiveIngestionTest.py -- end-to-end ingestion with LF credentials against a Glue (Hive-partitioned) offline store
  • LakeFormationIcebergIngestionTest.py -- end-to-end ingestion with LF credentials against an Iceberg offline store
  • Manual testing performed on EMR (emr-7.x)

Prerequisites

Users enabling Lake Formation credential vending must ensure:

  1. The offline store S3 location is registered with Lake Formation. You can use the SageMaker Python SDK to create an LF-governed feature group with LakeFormationConfig passed to FeatureGroupManager.create(), which handles registration automatically.
  2. The IAM role running the Spark job has:
    • lakeformation:GetDataAccess
    • lakeformation:GetTemporaryGlueTableCredentials
    • glue:GetTable, glue:GetDatabase, glue:GetPartitions
    • sagemaker:DescribeFeatureGroup
  3. Lake Formation table grants on the Table resource: SELECT, INSERT, DELETE, DESCRIBE
  4. Lake Formation account-level settings:
    • AllowExternalDataFiltering: true
    • AllowFullTableExternalDataAccess: true
    • ExternalDataFilteringAllowList includes the account running the Spark job
  5. S3A magic committer available on the classpath (Hive/Glue tables only):
    • EMR 6.15+ / 7.x: built-in, no action needed
    • Other runtimes (Glue, SageMaker Notebook, standalone PySpark): add org.apache.spark:spark-hadoop-cloud_2.12:<spark-version> via --packages or spark.jars.packages
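For the non-EMR case in item 5, a minimal PySpark sketch; it assumes the PySpark version string matches the Spark artifact version to pin.

import pyspark
from pyspark.sql import SparkSession

# Pull in the open-source S3A committers where they are not bundled
# (Glue, SageMaker notebooks, standalone PySpark). Not needed on EMR 6.15+/7.x.
spark = (
    SparkSession.builder
    .appName("feature-store-lf-ingestion")
    .config("spark.jars.packages",
            f"org.apache.spark:spark-hadoop-cloud_2.12:{pyspark.__version__}")
    .getOrCreate()
)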

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I certify that the changes I am introducing will be backward compatible
  • I used the commit message format described in CONTRIBUTING doc

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
  • I have verified all code in this commit is well formatted

Manual Testing

  1. Create a Lake Formation managed feature group using the SageMaker SDK
# Import FeatureGroupManager and the Lake Formation config types from the SageMaker SDK
from sagemaker.mlops.feature_store import FeatureGroupManager, LakeFormationConfig
from sagemaker.core.shapes import (
    FeatureDefinition,
    FeatureValue,
    OfflineStoreConfig,
    OnlineStoreConfig,
    S3StorageConfig,
)
from sagemaker.core.helper.session_helper import Session as SageMakerSession



FEATURE_GROUP_NAME = 'Lakeformation-Managed-FG-iceberg'
ROLE = 'arn:aws:iam::550124139430:role/admin'

def main():
    feature_definitions = [
        FeatureDefinition(feature_name="customer_id", feature_type="String"),
        FeatureDefinition(feature_name="event_time", feature_type="String"),
        FeatureDefinition(feature_name="age", feature_type="Integral"),
        FeatureDefinition(feature_name="total_purchases", feature_type="Integral"),
        FeatureDefinition(feature_name="avg_order_value", feature_type="Fractional"),
    ]

    sagemaker_session = SageMakerSession()

    S3_BUCKET = sagemaker_session.default_bucket()
    REGION = sagemaker_session.boto_session.region_name

    # Configure online and offline stores
    online_store_config = OnlineStoreConfig(enable_online_store=False)

    offline_store_config_1 = OfflineStoreConfig(
        s3_storage_config=S3StorageConfig(
            s3_uri=f"s3://{S3_BUCKET}/feature-store-demo/"
        ),
        table_format="Iceberg",
    )

    lakeformation_config = LakeFormationConfig(
        enabled=True,
        use_service_linked_role=False,
        registration_role_arn=ROLE,
        hybrid_access_mode_enabled=False,
        acknowledge_risk=True
    )
    FeatureGroupManager.create(
        feature_group_name=FEATURE_GROUP_NAME,
        record_identifier_feature_name="customer_id",
        event_time_feature_name="event_time",
        feature_definitions=feature_definitions,
        online_store_config=online_store_config,
        offline_store_config=offline_store_config_1,
        role_arn=ROLE,
        description="Lake Formation managed FG",
        lake_formation_config=lakeformation_config,
        region=REGION,
    )

if __name__ == "__main__":
    main()
  2. Test ingesting data into the feature group
"""Quick test script for ingesting data into a Lake Formation managed feature group.

Usage:
    python test_lf_ingestion.py --feature-group-arn <arn>
"""

import argparse
from datetime import datetime

from feature_store_pyspark import classpath_jars
from pyspark.sql import SparkSession


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--feature-group-arn", required=True)
    parser.add_argument("--role-arn", default=None, help="Optional assume role ARN")
    args = parser.parse_args()

    jars = ",".join(classpath_jars())
    spark = SparkSession.builder \
        .appName("LF-Ingestion-Test") \
        .config("spark.jars", jars) \
        .config("spark.driver.extraClassPath", jars) \
        .config("spark.driver.bindAddress", "127.0.0.1") \
        .config("spark.driver.host", "127.0.0.1") \
        .getOrCreate()

    spark.sparkContext.setLogLevel("INFO")

    # Import after SparkSession so the JAR is on the classpath
    from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager

    now = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
    data = [
        ("cust-001", now, 32, 15, 49.99),
        ("cust-002", now, 45, 8, 120.50),
        ("cust-003", now, 28, 22, 35.75),
    ]
    df = spark.createDataFrame(data, ["customer_id", "event_time", "age", "total_purchases", "avg_order_value"])

    print(f"Ingesting {df.count()} records to {args.feature_group_arn}")
    df.show()

    fm = FeatureStoreManager(args.role_arn)
    fm.ingest_data(
        input_data_frame=df,
        feature_group_arn=args.feature_group_arn,
        target_stores=["OfflineStore"],
        use_lake_formation_credentials=True
    )

    print("Ingestion complete.")
    spark.stop()


if __name__ == "__main__":
    main()

python test_lf_ingestion.py --feature-group-arn arn:aws:sagemaker:us-west-2:550124139430:feature-group/Lakeformation-Managed-FG-iceberg

I verified the records made it to the Glue table using Athena:

 #  write_time                      api_invocation_time             is_deleted  customer_id  event_time            age  total_purchases  avg_order_value
 1  2026-04-21 16:45:44.999444 UTC  2026-04-21 16:45:44.999444 UTC  false       cust-001     2026-04-21T16:45:32Z  32   15               49.99
 2  2026-04-21 16:45:44.999444 UTC  2026-04-21 16:45:44.999444 UTC  false       cust-002     2026-04-21T16:45:32Z  45   8                120.5
 3  2026-04-21 16:45:44.999444 UTC  2026-04-21 16:45:44.999444 UTC  false       cust-003     2026-04-21T16:45:32Z  28   22               35.75

The code was also tested on EMR using both Iceberg- and Hive-formatted Glue tables.

#!/bin/bash
set -euo pipefail

# --- Configuration (fill in before running) ---
CLUSTER_NAME="<YOUR-CLUSTER-NAME>"
REGION="<YOUR-REGION>"
EMR_RELEASE="<EMR-RELEASE>"            # e.g. emr-7.12.0
INSTANCE_TYPE="<INSTANCE-TYPE>"        # e.g. m5.xlarge
S3_BUCKET="<YOUR-S3-BUCKET>"
S3_PREFIX="<YOUR-S3-PREFIX>"
LOG_URI="s3://${S3_BUCKET}/emr-logs/"
FEATURE_GROUP_ARN="<YOUR-FEATURE-GROUP-ARN>"
SUBNET_ID="<YOUR-SUBNET-ID>"

# The instance profile role must have the necessary LF permissions on the table
INSTANCE_PROFILE="<YOUR-INSTANCE-PROFILE>"
SERVICE_ROLE="<YOUR-SERVICE-ROLE>"      # e.g. EMR_DefaultRole

# PySpark SDK wheel filename (built via: SPARK_BUILD_VERSION=3.5 pip3 wheel --no-deps --no-build-isolation --wheel-dir pyspark-sdk/dist pyspark-sdk)
WHL_FILENAME="<YOUR-WHL-FILENAME>"     # e.g. sagemaker_feature_store_pyspark-1.3.0-py3-none-any.whl

# Spark SDK jar path on the EMR node (depends on python / spark version installed)
SPARK_SDK_JAR="<PATH-TO-SPARK-SDK-JAR>" # e.g. /usr/local/lib/python3.9/site-packages/feature_store_pyspark/jars/sagemaker-feature-store-spark-sdk-3.5.jar

# Local test script to upload and run as the Spark step
TEST_SCRIPT="<YOUR-TEST-SCRIPT>"       # e.g. test_lf_ingestion.py
STEP_NAME="<YOUR-STEP-NAME>"           # e.g. LF-Ingestion-Test

# --- Upload test script and wheel ---
echo "Uploading test script to S3..."
aws s3 cp "${TEST_SCRIPT}" "s3://${S3_BUCKET}/${S3_PREFIX}/script.py" --region "${REGION}"
aws s3 sync pyspark-sdk/dist/ "s3://${S3_BUCKET}/${S3_PREFIX}/" --exclude "*" --include "${WHL_FILENAME}" --size-only --region "${REGION}"

# --- Upload bootstrap script ---
cat > /tmp/emr_bootstrap.sh << BOOTSTRAP
#!/bin/bash
set -e
sudo pip3 install numpy
# Option 1: Install released version from PyPI
# sudo pip3 install sagemaker-feature-store-pyspark --no-binary :all:
# Option 2: Install unreleased whl from S3
aws s3 cp s3://${S3_BUCKET}/${S3_PREFIX}/${WHL_FILENAME} /tmp/${WHL_FILENAME}
sudo pip3 install /tmp/${WHL_FILENAME}
BOOTSTRAP

aws s3 cp /tmp/emr_bootstrap.sh "s3://${S3_BUCKET}/${S3_PREFIX}/bootstrap.sh" --region "${REGION}"

# --- Create cluster ---
echo "Creating EMR cluster..."
CLUSTER_ID=$(aws emr create-cluster \
  --name "${CLUSTER_NAME}" \
  --release-label "${EMR_RELEASE}" \
  --applications Name=Spark Name=Hadoop \
  --instance-groups '[
    {"InstanceGroupType":"MASTER","InstanceCount":1,"InstanceType":"'"${INSTANCE_TYPE}"'"}
  ]' \
  --ec2-attributes '{
    "InstanceProfile":"'"${INSTANCE_PROFILE}"'",
    "SubnetId":"'"${SUBNET_ID}"'"
  }' \
  --service-role "${SERVICE_ROLE}" \
  --log-uri "${LOG_URI}" \
  --bootstrap-actions '[{
    "Name": "Install feature-store-pyspark",
    "Path": "s3://'"${S3_BUCKET}"'/'"${S3_PREFIX}"'/bootstrap.sh"
  }]' \
  --steps '[
    {
      "Type": "CUSTOM_JAR",
      "Name": "'"${STEP_NAME}"'",
      "Jar": "command-runner.jar",
      "ActionOnFailure": "CONTINUE",
      "Args": [
        "spark-submit",
        "--deploy-mode", "client",
        "--jars", "'"${SPARK_SDK_JAR}"'",
        "s3://'"${S3_BUCKET}"'/'"${S3_PREFIX}"'/script.py",
        "--feature-group-arn", "'"${FEATURE_GROUP_ARN}"'"
      ]
    }
  ]' \
  --auto-terminate \
  --region "${REGION}" \
  --query 'ClusterId' --output text)

echo ""
echo "Cluster ID: ${CLUSTER_ID}"
echo "Logs:       ${LOG_URI}"
echo ""
echo "Monitor with:"
echo "  aws emr describe-cluster --cluster-id ${CLUSTER_ID} --region ${REGION} --query 'Cluster.Status.State'"
echo "  aws emr list-steps --cluster-id ${CLUSTER_ID} --region ${REGION} --query 'Steps[].{Name:Name,Status:Status.State}' --output table"

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

bhhalim added 4 commits April 7, 2026 14:24
Add LakeFormationCredentials case class to hold LF-vended temporary
credentials with isExpiringSoon() for refresh logic. Extend
FeatureGroupArnResolver with resolveAccountId() and resolvePartition()
to extract account ID and ARN partition from feature group ARNs.
---
X-AI-Prompt: add lake formation managed table support to spark connector
X-AI-Tool: kiro-cli
Add LakeFormationHelper singleton with checkAndVendCredentials() that
calls Glue GetTable to detect LF-managed tables and vends temporary
credentials via GetTemporaryGlueTableCredentials. All failures gracefully
fall back to default credentials. Add lazy GlueClient and
LakeFormationClient getters to ClientFactory. Add lakeformation SDK
dependency to build.sbt.
---
X-AI-Prompt: add lake formation managed table support to spark connector
X-AI-Tool: kiro-cli
Wire LF detection and credential vending into batchIngestIntoOfflineStore
for both Glue table (parquet) and Iceberg write paths. Add lfCredentials
parameter to SparkSessionInitializer methods to configure Hadoop S3A with
TemporaryAWSCredentialsProvider when LF credentials are present. Update
tests with mock GlueClient setup.
---
X-AI-Prompt: add lake formation managed table support to spark connector
X-AI-Tool: kiro-cli
Add support for vending temporary LakeFormation credentials when
ingesting data into LF-managed offline stores. Changes include:

- Add useLakeFormationCreds parameter to ingest_data/ingestData
- Exclude SLF4J from fat JAR to fix logging conflict with PySpark
- Bump LF fallback-to-default-creds log level to WARN
- Add logging to FeatureStoreManager for LF credential flow
- Increase Scala test coverage from 82% to 93% (new LF tests)
- Add scala-spark-sdk README with build instructions

---
X-AI-Prompt: add LF cred support, fix SLF4J conflict, add logging, increase test coverage, add Scala README
X-AI-Tool: kiro-cli
@BassemHalim BassemHalim requested a review from a team as a code owner April 10, 2026 18:52
…in LakeFormationHelperTest

Replace when(...).thenReturn(...) with doReturn(...).when(...) style
stubbing in LakeFormationHelperTest to prevent
scala.reflect.internal.Symbols$CyclicReference involving
GetTemporaryGlueTableCredentialsResponse.

The when/thenReturn form triggers reflective type resolution on the
AWS SDK builder types which, under certain Spark version classpaths,
causes a cyclic reference in Scala 2.12 compiler-generated metadata.
The doReturn form avoids this by setting up the return value before
invoking the mock method.

x-ai-tool: kiro
- Fix potential NPE in ingestDataInJava when useLakeFormationCreds is
  null from Java/PySpark by wrapping with Option().getOrElse(true)
- Override toString in LakeFormationCredentials to mask sensitive fields
  and prevent credential leakage in logs
- Store partition in LakeFormationCredentials for consistent refresh
  behavior instead of re-deriving from region
- Use DEBUG for success-path log messages, keep WARN for fallbacks
- Add Scaladoc for useLakeFormationCreds parameter
- Document hardcoded credential duration limitation
- Skip redundant GetTable call when useLakeFormationCreds is true,
  vend credentials directly and let failures fall back gracefully
- Remove unused checkAndVendCredentials method and Glue test fixtures
---
X-AI-Prompt: address code review issues 1-6, skip GetTable when LF creds enabled, remove checkAndVendCredentials
X-AI-Tool: kiro-cli
Remove GlueClient from ClientFactory (import, singleton var, lazy
getter, test setter, initialize reset, factory method), its mock from
FeatureStoreManagerTest, and the glue SDK dependency from build.sbt.

No production code ever called methods on the GlueClient. The
LakeFormation credential vending uses the LF SDK directly, not the
Glue SDK.
---
X-AI-Prompt: find all glue clients and remove unused dead code
X-AI-Tool: kiro-cli
Comment thread pyspark-sdk/src/feature_store_pyspark/FeatureStoreManager.py Outdated
Rename the Lake Formation parameter across Scala and Python SDKs to
align with AWS naming conventions (matches Glue Crawler's
UseLakeFormationCredentials parameter):

- Scala: useLakeFormationCreds -> useLakeFormationCredentials
- Python: use_lakeformation_creds -> use_lake_formation_credentials

Also change the default value from true to false in all method
signatures (ingestData, ingestDataInJava, writeToOfflineStore) and
update doc comments accordingly.
---
X-AI-Prompt: rename useLakeFormationCreds/use_lakeformation_creds to useLakeFormationCredentials/use_lake_formation_credentials, change default to false
X-AI-Tool: kiro-cli
Remove pyspark-sdk/__pycache__/__init__.cpython-310.pyc from git index.
This file was accidentally committed in e5c6e32 and is already covered
by the **/__pycache__ gitignore rule.
---
X-AI-Prompt: remove accidentally tracked pycache file from git index
X-AI-Tool: kiro-cli
…on_credentials

The default for use_lake_formation_credentials was changed from True to
False. Update the test assertion that relies on the default value to
expect False instead of True.
---
X-AI-Prompt: fix failing unit test after renaming argument and changing default to false
X-AI-Tool: kiro-cli
- Bump VERSION from 1.3.0 to 2.0.0
- Revert sbt-sonatype from 3.11.3 to 3.9.10 (unrelated to LF feature)
- Remove scala-spark-sdk/README.md (will be added in a separate PR)
---
X-AI-Prompt: version bump to 2.0, revert sonatype plugin, remove README from branch
X-AI-Tool: kiro
@BassemHalim changed the title from "Feat/lakeformation" to "Add Lake Formation credential vending for offline store ingestion (Spark 3.5+)" Apr 24, 2026
}
}

def refreshIfNeeded(credentials: LakeFormationCredentials): LakeFormationCredentials = {
Contributor

Curious about this, when and where does the refresh happen?

Contributor Author

@BassemHalim Apr 24, 2026

We call it in batchIngestIntoOfflineStore, like here: https://github.com/aws/sagemaker-feature-store-spark/pull/57/changes#diff-570e7dae04e1411a2e05570bc71c7a546c88683dcb5f9b4bcad0545cf4f03f77R332 -- right before we try to write, to make sure the credentials are not expired.

Contributor

Is there a scenario where workers are still writing to S3 using Lake Formation credentials, but the credentials expire mid-write and cause partial failures? My understanding is that credentials are refreshed once right before the workers start writing.

Contributor Author

Yes, it is possible if the write takes longer than 1 hour. This would only happen for a very large DataFrame; in that case, I think the customer should break the data down into smaller batches.

Contributor Author

Just added a note in the README about that limitation.

Comment thread scala-spark-sdk/build.sbt Outdated
// only compile on Spark 3.5+.
Test / unmanagedSourceDirectories += {
val baseDir = baseDirectory.value
if (majorSparkVersion.toDouble >= 3.5) {
Contributor

What happens in the case of Spark version "3.10".toDouble? Just want to ensure that it doesn't break for future versions.

Contributor Author

It will just compile the tests at ./src/test/scala-spark-3.5; given that we only support up to version 3.5, this should not be an issue.

Comment thread README.md Outdated
…in README

The README referenced a non-existent direct_offline_store/directOfflineStore
parameter in both the Getting Started and Lake Formation sections. Replace
all 6 occurrences with the actual target_stores/targetStores parameter and
update the description text to reflect the list-based API.
---
X-AI-Prompt: fix incorrect direct_offline_store references in README to match actual API
X-AI-Tool: kiro
Extract configureS3aCredentials private method to eliminate duplicate
bucket-scoped hadoop credential configuration in both
initializeSparkSessionForOfflineStore and
initializeSparkSessionForIcebergTable.
---
X-AI-Prompt: refactor duplicate S3A credential config code into shared helper
X-AI-Tool: kiro
…edential request

The connector only appends Parquet files and never deletes S3 objects,
so Permission.DELETE (which maps to s3:DeleteObject) is not needed.
This follows the principle of least privilege.
---
X-AI-Prompt: remove Permission.DELETE from LF credential vending request and README
X-AI-Tool: kiro
@BassemHalim BassemHalim added the enhancement New feature or request label Apr 27, 2026
Comment thread scala-spark-sdk/build.sbt
Comment thread pyspark-sdk/integration_test/LakeFormationHiveIngestionTest.py Outdated
Replace toDouble-based version comparison with integer major.minor
parsing so that versions like 3.10 are correctly compared as greater
than 3.5 instead of being truncated to 3.1 by floating-point parsing.
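For illustration (in Python rather than sbt), the difference between the two comparison strategies:

def at_least(version: str, minimum: str = "3.5") -> bool:
    # Compare major.minor as an integer tuple so "3.10" sorts above "3.5".
    def parse(v: str):
        return tuple(int(part) for part in v.split(".")[:2])
    return parse(version) >= parse(minimum)

assert at_least("3.10")       # correct: (3, 10) >= (3, 5)
assert float("3.10") < 3.5    # the old toDouble-style check truncates 3.10 to 3.1
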
datetime.fromtimestamp() without tz used the local timezone to build
the year/month/day/hour partition path, which would produce incorrect
paths on machines not set to UTC. Pass tz=timezone.utc explicitly.

Document that Lake Formation temporary credentials can expire during
long-running Spark writes, causing S3 403 errors. Recommend batching
large DataFrames and calling ingestData per batch to vend fresh
credentials.
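As a hedged sketch of that batching recommendation (this is not a connector feature; the batch count and variable names are illustrative, and df / feature_group_arn are assumed from the examples above):

from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager

# Split a large DataFrame and ingest each batch separately so every batch
# gets freshly vended Lake Formation credentials for its write.
feature_store_manager = FeatureStoreManager()
num_batches = 4  # arbitrary; size batches so each write finishes well within 1 hour
for batch in df.randomSplit([1.0] * num_batches, seed=42):
    feature_store_manager.ingest_data(
        input_data_frame=batch,
        feature_group_arn=feature_group_arn,
        target_stores=["OfflineStore"],
        use_lake_formation_credentials=True,
    )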

import string
from typing import List

Contributor

nit: if we are importing already, why are we also importing DataFrame explicitly?

