Add Lake Formation credential vending for offline store ingestion (Spark 3.5+) #57
BassemHalim wants to merge 29 commits into main from
Conversation
Add LakeFormationCredentials case class to hold LF-vended temporary credentials with isExpiringSoon() for refresh logic. Extend FeatureGroupArnResolver with resolveAccountId() and resolvePartition() to extract account ID and ARN partition from feature group ARNs. --- X-AI-Prompt: add lake formation managed table support to spark connector X-AI-Tool: kiro-cli
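A minimal sketch of what such a case class can look like. Field names beyond the credential triple and expiry are assumptions; the 5-minute buffer and the `toString` masking are described in later commits of this PR:

```scala
import java.time.Instant

// Sketch only -- field names and the buffer default are assumptions.
case class LakeFormationCredentials(
    accessKeyId: String,
    secretAccessKey: String,
    sessionToken: String,
    expiration: Instant) {

  // "Expiring soon" means within a small buffer (the PR mentions 5 minutes),
  // so callers can vend fresh credentials before the write starts.
  def isExpiringSoon(bufferSeconds: Long = 300): Boolean =
    Instant.now().plusSeconds(bufferSeconds).isAfter(expiration)

  // Mask secrets so accidental logging cannot leak credentials (a later
  // commit in this PR adds exactly this kind of toString override).
  override def toString: String =
    s"LakeFormationCredentials(accessKeyId=****, expiration=$expiration)"
}
```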
Add LakeFormationHelper singleton with checkAndVendCredentials() that calls Glue GetTable to detect LF-managed tables and vends temporary credentials via GetTemporaryGlueTableCredentials. All failures gracefully fall back to default credentials. Add lazy GlueClient and LakeFormationClient getters to ClientFactory. Add lakeformation SDK dependency to build.sbt. --- X-AI-Prompt: add lake formation managed table support to spark connector X-AI-Tool: kiro-cli
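For reference, a hedged sketch of what the vending call can look like with the AWS SDK v2 lakeformation client; the helper's real code may differ, and the permission set shown is illustrative:

```scala
import software.amazon.awssdk.services.lakeformation.LakeFormationClient
import software.amazon.awssdk.services.lakeformation.model.{
  GetTemporaryGlueTableCredentialsRequest, Permission}

// Sketch only: vend table-scoped temporary credentials for a Glue table ARN.
def vendCredentials(client: LakeFormationClient, tableArn: String): LakeFormationCredentials = {
  val response = client.getTemporaryGlueTableCredentials(
    GetTemporaryGlueTableCredentialsRequest.builder()
      .tableArn(tableArn)
      .permissions(Permission.SELECT, Permission.INSERT) // illustrative set
      .build())
  LakeFormationCredentials(
    response.accessKeyId(),
    response.secretAccessKey(),
    response.sessionToken(),
    response.expiration())
}
```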
Wire LF detection and credential vending into batchIngestIntoOfflineStore for both Glue table (parquet) and Iceberg write paths. Add lfCredentials parameter to SparkSessionInitializer methods to configure Hadoop S3A with TemporaryAWSCredentialsProvider when LF credentials are present. Update tests with mock GlueClient setup. --- X-AI-Prompt: add lake formation managed table support to spark connector X-AI-Tool: kiro-cli
Add support for vending temporary LakeFormation credentials when ingesting data into LF-managed offline stores. Changes include:
- Add useLakeFormationCreds parameter to ingest_data/ingestData
- Exclude SLF4J from fat JAR to fix logging conflict with PySpark
- Bump LF fallback-to-default-creds log level to WARN
- Add logging to FeatureStoreManager for LF credential flow
- Increase Scala test coverage from 82% to 93% (new LF tests)
- Add scala-spark-sdk README with build instructions
--- X-AI-Prompt: add LF cred support, fix SLF4J conflict, add logging, increase test coverage, add Scala README X-AI-Tool: kiro-cli
…in LakeFormationHelperTest
Replace when(...).thenReturn(...) with doReturn(...).when(...) style stubbing in LakeFormationHelperTest to prevent scala.reflect.internal.Symbols$CyclicReference involving GetTemporaryGlueTableCredentialsResponse. The when/thenReturn form triggers reflective type resolution on the AWS SDK builder types which, under certain Spark version classpaths, causes a cyclic reference in Scala 2.12 compiler-generated metadata. The doReturn form avoids this by setting up the return value before invoking the mock method. x-ai-tool: kiro
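Illustratively, the doReturn style described above (the request/response types are from the AWS SDK; the mock setup below is a sketch, not the test's exact code):

```scala
import org.mockito.ArgumentMatchers.any
import org.mockito.Mockito.{doReturn, mock}
import software.amazon.awssdk.services.lakeformation.LakeFormationClient
import software.amazon.awssdk.services.lakeformation.model.{
  GetTemporaryGlueTableCredentialsRequest, GetTemporaryGlueTableCredentialsResponse}

val mockClient = mock(classOf[LakeFormationClient])
val stubbed = GetTemporaryGlueTableCredentialsResponse.builder()
  .accessKeyId("AKIA-TEST").build()

// doReturn registers the return value first and only then touches the mock,
// so scalac never has to eagerly resolve the SDK builder's type hierarchy.
doReturn(stubbed)
  .when(mockClient)
  .getTemporaryGlueTableCredentials(any(classOf[GetTemporaryGlueTableCredentialsRequest]))
```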
- Fix potential NPE in ingestDataInJava when useLakeFormationCreds is null from Java/PySpark by wrapping with Option().getOrElse(true)
- Override toString in LakeFormationCredentials to mask sensitive fields and prevent credential leakage in logs
- Store partition in LakeFormationCredentials for consistent refresh behavior instead of re-deriving from region
- Use DEBUG for success-path log messages, keep WARN for fallbacks
- Add Scaladoc for useLakeFormationCreds parameter
- Document hardcoded credential duration limitation
- Skip redundant GetTable call when useLakeFormationCreds is true, vend credentials directly and let failures fall back gracefully
- Remove unused checkAndVendCredentials method and Glue test fixtures
--- X-AI-Prompt: address code review issues 1-6, skip GetTable when LF creds enabled, remove checkAndVendCredentials X-AI-Tool: kiro-cli
Remove GlueClient from ClientFactory (import, singleton var, lazy getter, test setter, initialize reset, factory method), its mock from FeatureStoreManagerTest, and the glue SDK dependency from build.sbt. No production code ever called methods on the GlueClient. The LakeFormation credential vending uses the LF SDK directly, not the Glue SDK. --- X-AI-Prompt: find all glue clients and remove unused dead code X-AI-Tool: kiro-cli
Rename the Lake Formation parameter across Scala and Python SDKs to align with AWS naming conventions (matches Glue Crawler's UseLakeFormationCredentials parameter):
- Scala: useLakeFormationCreds -> useLakeFormationCredentials
- Python: use_lakeformation_creds -> use_lake_formation_credentials
Also change the default value from true to false in all method signatures (ingestData, ingestDataInJava, writeToOfflineStore) and update doc comments accordingly.
--- X-AI-Prompt: rename useLakeFormationCreds/use_lakeformation_creds to useLakeFormationCredentials/use_lake_formation_credentials, change default to false X-AI-Tool: kiro-cli
Remove pyspark-sdk/__pycache__/__init__.cpython-310.pyc from git index. This file was accidentally committed in e5c6e32 and is already covered by the **/__pycache__ gitignore rule. --- X-AI-Prompt: remove accidentally tracked pycache file from git index X-AI-Tool: kiro-cli
…s backward compatible
…on_credentials
The default for use_lake_formation_credentials was changed from True to False. Update the test assertion that relies on the default value to expect False instead of True. --- X-AI-Prompt: fix failing unit test after renaming argument and changing default to false X-AI-Tool: kiro-cli
- Bump VERSION from 1.3.0 to 2.0.0 - Revert sbt-sonatype from 3.11.3 to 3.9.10 (unrelated to LF feature) - Remove scala-spark-sdk/README.md (will be added in a separate PR) --- X-AI-Prompt: version bump to 2.0, revert sonatype plugin, remove README from branch X-AI-Tool: kiro
def refreshIfNeeded(credentials: LakeFormationCredentials): LakeFormationCredentials = {
Curious about this, when and where does the refresh happen?
we call it in batchIngestIntoOfflineStore
like here https://github.com/aws/sagemaker-feature-store-spark/pull/57/changes#diff-570e7dae04e1411a2e05570bc71c7a546c88683dcb5f9b4bcad0545cf4f03f77R332
right before we try to write, to make sure they are not expired
Is there a scenario where workers are still writing to S3 using Lake Formation credentials, but the credentials expire mid-write and cause partial failures? My understanding is that credentials are refreshed once right before the workers start writing.
it is possible, yes, if the write takes longer than 1 hour. This would only happen for a very large DataFrame; in that case the customer should break the data down, I think
just added a note in README about that limitation
// only compile on Spark 3.5+.
Test / unmanagedSourceDirectories += {
  val baseDir = baseDirectory.value
  if (majorSparkVersion.toDouble >= 3.5) {
What happens in the case of Spark version "3.10".toDouble? Just want to ensure that it doesn't break for future versions
it will just compile the tests at ./src/test/scala-spark-3.5; given that we only support up to version 3.5, this should not be an issue
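A sketch completing the truncated snippet above. The original diff uses `+=`; `++=` is used here so the pre-3.5 branch can add nothing, `majorSparkVersion` is assumed to be defined elsewhere in build.sbt, and the `toDouble` comparison shown is the one replaced by a later commit:

```scala
// Sketch: compile the LF tests only when building against Spark 3.5+.
Test / unmanagedSourceDirectories ++= {
  val baseDir = baseDirectory.value
  if (majorSparkVersion.toDouble >= 3.5)
    Seq(baseDir / "src" / "test" / "scala-spark-3.5")
  else
    Seq.empty
}
```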
…in README
The README referenced a non-existent direct_offline_store/directOfflineStore parameter in both the Getting Started and Lake Formation sections. Replace all 6 occurrences with the actual target_stores/targetStores parameter and update the description text to reflect the list-based API. --- X-AI-Prompt: fix incorrect direct_offline_store references in README to match actual API X-AI-Tool: kiro
Extract configureS3aCredentials private method to eliminate duplicate bucket-scoped hadoop credential configuration in both initializeSparkSessionForOfflineStore and initializeSparkSessionForIcebergTable. --- X-AI-Prompt: refactor duplicate S3A credential config code into shared helper X-AI-Tool: kiro
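A minimal sketch of such a helper, assuming Hadoop S3A's standard per-bucket options and TemporaryAWSCredentialsProvider; the PR's exact signature may differ:

```scala
import org.apache.hadoop.conf.Configuration

// Sketch: scope the vended credentials to a single bucket so every other
// S3 path keeps using the default credential chain.
private def configureS3aCredentials(
    hadoopConf: Configuration,
    bucket: String,
    credentials: LakeFormationCredentials): Unit = {
  val prefix = s"fs.s3a.bucket.$bucket"
  hadoopConf.set(s"$prefix.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
  hadoopConf.set(s"$prefix.access.key", credentials.accessKeyId)
  hadoopConf.set(s"$prefix.secret.key", credentials.secretAccessKey)
  hadoopConf.set(s"$prefix.session.token", credentials.sessionToken)
}
```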
…edential request
The connector only appends Parquet files and never deletes S3 objects, so Permission.DELETE (which maps to s3:DeleteObject) is not needed. This follows the principle of least privilege. --- X-AI-Prompt: remove Permission.DELETE from LF credential vending request and README X-AI-Tool: kiro
Replace toDouble-based version comparison with integer major.minor parsing so that versions like 3.10 are correctly compared as greater than 3.5 instead of being truncated to 3.1 by floating-point parsing.
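A sketch of the integer-based comparison (function name is illustrative):

```scala
// "3.10" parses to (3, 10) and correctly compares greater than (3, 5),
// whereas "3.10".toDouble yields 3.1 and fails a >= 3.5 check.
def isAtLeast(version: String, reqMajor: Int, reqMinor: Int): Boolean = {
  val parts = version.split("\\.")
  val major = parts(0).toInt
  val minor = if (parts.length > 1) parts(1).toInt else 0
  major > reqMajor || (major == reqMajor && minor >= reqMinor)
}
```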
datetime.fromtimestamp() without tz used the local timezone to build the year/month/day/hour partition path, which would produce incorrect paths on machines not set to UTC. Pass tz=timezone.utc explicitly.
Document that Lake Formation temporary credentials can expire during long-running Spark writes, causing S3 403 errors. Recommend batching large DataFrames and calling ingestData per batch to vend fresh credentials.
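A hypothetical sketch of that recommendation; the FeatureStoreManager import path and the two-argument ingestData call are assumptions, and only the batch-and-re-ingest advice comes from the README note:

```scala
import org.apache.spark.sql.DataFrame
import software.amazon.sagemaker.featurestore.sparksdk.FeatureStoreManager

// Sketch: split a large DataFrame and ingest per batch so each call vends
// fresh LF credentials and no single write outlives the credential lifetime.
def ingestInBatches(manager: FeatureStoreManager, df: DataFrame,
                    featureGroupArn: String, numBatches: Int): Unit = {
  // randomSplit normalizes the weights into numBatches roughly equal parts
  df.randomSplit(Array.fill(numBatches)(1.0)).foreach { batch =>
    manager.ingestData(batch, featureGroupArn)
  }
}
```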
import string
from typing import List
nit: if we are importing already why are we also importing DataFrame explicitly?
Issue
Add Lake Formation credential vending support to the SageMaker Feature Store Spark connector, enabling secure S3 access through Lake Formation-scoped temporary credentials instead of relying on the caller's IAM permissions.
Description
This PR introduces an opt-in `useLakeFormationCredentials` parameter (Scala) / `use_lake_formation_credentials` parameter (Python) to `ingestData`/`ingest_data`. When enabled, the connector vends temporary credentials via `GetTemporaryGlueTableCredentials` and configures Hadoop S3A with per-bucket credentials scoped to the offline store's S3 location.

Key behaviors:

- Credentials are configured per-bucket (`fs.s3a.bucket.<bucket>.*`), so only the target offline store bucket uses LF-scoped credentials
- If credential vending fails, a `RuntimeException` is thrown with an actionable error message -- the connector does not fall back silently
- Magic committer configuration auto-detects EMR's `SQLEmrOptimizedCommitProtocol`, falls back to open-source `spark-hadoop-cloud`, or fails fast with a clear error
- Opt-in: `useLakeFormationCredentials` defaults to `false`
- Requires Spark 3.5+ (raises `ValueError` on PySpark < 3.5)

Key Changes
Scala
- `LakeFormationCredentials` -- case class with expiration tracking (`isExpiringSoon` with 5-min buffer)
- `LakeFormationHelper` -- singleton handling credential vending, automatic refresh, Glue table ARN construction, and LF prefix seeding
- `FeatureGroupArnResolver` -- extended with `resolveAccountId()` and `resolvePartition()` for Glue table ARN construction (supports China and GovCloud partitions)
- `ClientFactory` -- extended with lazy `LakeFormationClient` getter
- `SparkSessionInitializer` -- per-bucket S3A `TemporaryAWSCredentialsProvider` config and S3A magic committer setup (EMR auto-detect, open-source fallback, fail-fast)
- `MinSparkVersionGate` -- build-time Spark 3.5+ gating via version-specific source directories
- `FeatureStoreManager` -- `useLakeFormationCredentials` parameter added to `ingestData`/`ingestDataInJava` (defaults to `false`)
- `lakeformation` SDK added to `build.sbt` dependencies
- `sbt-sonatype` plugin reverted to 3.9.10

Python
- `FeatureStoreManager.ingest_data()` -- gains `use_lake_formation_credentials` parameter (default `False`)
- Raises `ValueError` if PySpark < 3.5 and `use_lake_formation_credentials=True`

API Changes
Scala
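A hedged sketch of the changed Scala entry point; the surrounding parameters are assumptions, and only the new flag and its `false` default are established by this PR:

```scala
import org.apache.spark.sql.DataFrame

// Sketch of the signature change only -- not the PR's exact parameter list.
def ingestData(
    inputDataFrame: DataFrame,
    featureGroupArn: String,
    useLakeFormationCredentials: Boolean = false): Unit = ???
```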
Python
Testing
Unit Tests
- `LakeFormationHelper`: vend success, vend failure, credential refresh
- `LakeFormationCredentials`: expiry and `isExpiringSoon` logic
- `FeatureGroupArnResolver`: account ID/partition resolution including China and GovCloud ARNs
- `SparkSessionInitializer`: magic committer config (EMR, non-EMR, missing), per-bucket LF credential config
- `FeatureStoreManagerLakeFormationTest` (Spark 3.5+ only): verifies `vendCredentials` is not called when LF is disabled
- `use_lake_formation_credentials` parameter passthrough to JVM

Integration Tests
- `LakeFormationHiveIngestionTest.py` -- end-to-end ingestion with LF credentials against a Glue (Hive-partitioned) offline store
- `LakeFormationIcebergIngestionTest.py` -- end-to-end ingestion with LF credentials against an Iceberg offline store

Prerequisites
Users enabling Lake Formation credential vending must ensure:
- A `LakeFormationConfig` passed to `FeatureGroupManager.create()` handles registration automatically
- `lakeformation:GetDataAccess`
- `lakeformation:GetTemporaryGlueTableCredentials`
- `glue:GetTable`, `glue:GetDatabase`, `glue:GetPartitions`
- `sagemaker:DescribeFeatureGroup`
- Lake Formation table permissions: `SELECT`, `INSERT`, `DELETE`, `DESCRIBE`
- `AllowExternalDataFiltering: true`
- `AllowFullTableExternalDataAccess: true`
- `ExternalDataFilteringAllowList` includes the account running the Spark job
- `org.apache.spark:spark-hadoop-cloud_2.12:<spark-version>` via `--packages` or `spark.jars.packages`

Merge Checklist
Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General
Tests
Manual Testing
python test_lf_ingestion.py --feature-group-arn arn:aws:sagemaker:us-west-2:550124139430:feature-group/Lakeformation-Managed-FG-iceberg

I verified the records made it to the Glue table using Athena:
| # | write_time | api_invocation_time | is_deleted | customer_id | event_time | age | total_purchases | avg_order_value |
|---|------------|---------------------|------------|-------------|------------|-----|-----------------|-----------------|
| 1 | 2026-04-21 16:45:44.999444 UTC | 2026-04-21 16:45:44.999444 UTC | false | cust-001 | 2026-04-21T16:45:32Z | 32 | 15 | 49.99 |
| 2 | 2026-04-21 16:45:44.999444 UTC | 2026-04-21 16:45:44.999444 UTC | false | cust-002 | 2026-04-21T16:45:32Z | 45 | 8 | 120.5 |
| 3 | 2026-04-21 16:45:44.999444 UTC | 2026-04-21 16:45:44.999444 UTC | false | cust-003 | 2026-04-21T16:45:32Z | 28 | 22 | 35.75 |
The code was also tested on EMR using both Iceberg- and Hive-formatted Glue tables.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.