Updated on July 29, 2025

Building E-commerce Recommendations with Kumo AI

AI Engineering

Building recommendation systems is hard.

In data science, we can spend months wrangling data, training models, and still end up with mediocre results. That's where Kumo AI comes in — it's a service that abstracts away the complexity of building Graph Neural Networks (GNNs) for predictive analytics.

In this article, we'll build a complete e-commerce recommendation engine using real H&M data with over 31 million transactions. By the end, we'll have a system that can:

  • Predict customer lifetime value for the next 30 days
  • Generate personalized product recommendations
  • Forecast purchase behavior to identify active customers

The best part? We'll do all this in a couple of hours rather than months.


Why Graph Neural Networks?

Traditional recommendation systems miss the complex relationships between customers, products, and transactions.

GNNs excel here because they naturally model:

  • Network Effects: How customer preferences influence each other
  • Temporal Dynamics: Purchase patterns over time (Christmas shopping, summer clothes)
  • Cold Start Problem: Making predictions for new customers with limited data

Graph structure showing customers connected to transactions connected to products

Kumo was co-founded by Jure Leskovec, one of the pioneers of GNNs and a co-author of the PyG library. He and the team at Kumo built world-class expertise into the platform, meaning we get top-tier performance without needing deep graph theory knowledge.

Setting Up Kumo

First, install the necessary packages:

python
!pip install -qU \
    "db-dtypes>=1.4.3" \
    "google-auth>=2.40.2" \
    "google-cloud-bigquery>=3.33.0" \
    "kaggle>=1.7.4.5" \
    "kumoai>=2.1.0"

Connecting to Kumo

We'll need an API key from our Kumo workspace. Head to the workspace's Admin section to find it.

Kumo AI dashboard API keys section

python
import os
from getpass import getpass
import kumoai

api_key = os.getenv("KUMO_API_KEY") or \
    getpass("Enter your Kumo API key: ")

kumoai.init(url="https://aurelio.kumoai.cloud/api", api_key=api_key)
text
[2025-07-11 15:34:56 - kumoai:196 - INFO] Successfully initialized the Kumo SDK against deployment https://aurelio.kumoai.cloud/api, with log level INFO.

Data Infrastructure

Kumo integrates with our existing data infrastructure. We'll use BigQuery for this tutorial, but S3, Snowflake, and Databricks are also supported.
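
If our data lived in S3 instead, the setup would differ only in the connector. A minimal sketch, assuming the SDK's S3Connector and a hypothetical bucket path (exact parameters may vary by kumoai version):

python
# hypothetical S3 alternative to the BigQuery connector built below
s3_connector = kumoai.S3Connector(root_dir="s3://our-bucket/hm-data")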

BigQuery Setup

Create a service account in GCP with these permissions:

  • BigQuery Data Viewer
  • BigQuery Filtered Data Viewer
  • BigQuery Metadata Viewer
  • BigQuery Read Session User
  • BigQuery User
  • BigQuery Data Editor

Download the JSON credentials and save as kumo-gcp-creds.json.

BigQuery to Kumo connection flow diagram

python
import json

name = "kumo_intro_live"
project_id = "aurelio-advocacy"  # our GCP project ID
dataset_id = "rel_hm"  # unique dataset ID

with open("kumo-gcp-creds.json", "r") as fp:
    creds = json.loads(fp.read())

connector = kumoai.BigQueryConnector(
    name=name,
    project_id=project_id,
    dataset_id=dataset_id, 
    credentials=creds,
)

This creates our BigQuery connector that Kumo will use to read source data and write predictions.

The H&M Dataset

We're using real H&M transaction data — not a toy dataset.

H&M dataset scale visualization

The dataset includes:

  • 1.3M customers with demographics
  • 100K+ products with detailed attributes
  • 31M+ transactions with timestamps

Downloading from Kaggle

First, set up our Kaggle credentials and accept the competition terms.
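
If there's no kaggle.json file under ~/.kaggle/, the Kaggle client can also pick up credentials from environment variables. Here's a minimal sketch with placeholder values to fill in:

python
import os

# the kaggle package reads these if no ~/.kaggle/kaggle.json is present
os.environ["KAGGLE_USERNAME"] = "<your-kaggle-username>"
os.environ["KAGGLE_KEY"] = "<your-kaggle-api-key>"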

python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

api.competition_download_files(
    competition="h-and-m-personalized-fashion-recommendations",
    quiet=False
)

This downloads a large zip file (~3.5GB) containing all the competition data.

Extract the CSV files:

python
import zipfile

path = "h-and-m-personalized-fashion-recommendations.zip"

with zipfile.ZipFile(path, "r") as zip_ref:
    file_list = zip_ref.namelist()
    file_list = [f for f in file_list if f.endswith(".csv")]
    for file in file_list:
        zip_ref.extract(file, "hm_data")

We'll have four CSV files:

  • customers.csv - 1.3M customer records
  • articles.csv - 100K+ product records
  • transactions_train.csv - 31M+ transaction records
  • sample_submission.csv - (we'll skip this one)

Loading into BigQuery

Now we'll push our data to BigQuery where Kumo can access it.

python
from google.cloud import bigquery
from google.oauth2 import service_account

creds_obj = service_account.Credentials.from_service_account_file(
    "kumo-gcp-creds.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

client = bigquery.Client(
    credentials=creds_obj,
    project="aurelio-advocacy",
)

Create the dataset:

python
from google.api_core.exceptions import NotFound

dataset_ref = bigquery.DatasetReference(project_id, dataset_id)

try:
    dataset = client.get_dataset(dataset_ref)
    print("Dataset already exists")
except NotFound:
    print("Creating dataset...")
    dataset = bigquery.Dataset(dataset_ref)
    dataset.location = "US"
    dataset.description = "H&M Dataset"
    dataset = client.create_dataset(dataset)

Upload each CSV as a table:

python
# build paths to the extracted CSVs, skipping sample_submission.csv
files = [f"hm_data/{f}" for f in file_list if "sample_submission" not in f]

for file in files:
    table_id = file.split("/")[-1].split(".")[0]
    print(f"Pushing {table_id} to BigQuery...")
    table_ref = dataset.table(table_id)
    
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    
    with open(file, "rb") as f:
        load_job = client.load_table_from_file(
            f, table_ref, job_config=job_config
        )
    load_job.result()
    print(f"Loaded {load_job.output_rows} rows")
text
Pushing customers to BigQuery...
Loaded 1371980 rows
Pushing articles to BigQuery...
Loaded 105542 rows
Pushing transactions_train to BigQuery...
Loaded 31788324 rows
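
Before moving on, we can sanity-check that each table landed in BigQuery with the expected row count:

python
# confirm each table's row count in BigQuery matches the load logs
for table in ["customers", "articles", "transactions_train"]:
    n = client.query(
        f"SELECT COUNT(*) AS n FROM {dataset_id}.{table}"
    ).to_dataframe()["n"].iloc[0]
    print(f"{table}: {n} rows")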

Building Our Graph

With data in BigQuery, we can construct the graph structure for Kumo.

Connect to Source Tables

python
articles_source = connector["articles"]
customers_source = connector["customers"]
transactions_source = connector["transactions_train"]

Let's examine the data structure:

python
articles_source.head()
text
  article_id product_code              prod_name product_type_no product_type_name
0  108775015       108775  Strap top (Yogyakarta)             253          Vest top
1  108775044       108775  Strap top (Yogyakarta)             253          Vest top
2  108775051       108775        Strap top (Salta)             253          Vest top
3  110065001       110065       OP T-shirt (Idro)             306          Bra
4  110065002       110065       OP T-shirt (Idro)             306          Bra

This shows product information including IDs, names, types, colors, and descriptions.

python
customers_source.head(2)
text
                                        customer_id  FN  Active club_member_status fashion_news_frequency  age
0  00000dbac...  NaN     NaN             ACTIVE               NONE   49
1  0000f46a3...  NaN     NaN             ACTIVE               NONE   25
python
transactions_source.head(2)
text
        t_dat                                    customer_id  article_id      price  sales_channel_id
0  2018-09-20  000058a12d5b43e67d225668fa1f8d618c13dc232690b0...  663713001  0.0508305                 2
1  2018-09-20  000058a12d5b43e67d225668fa1f8d618c13dc232690b0...  541518023  0.0305085                 2

Create Kumo Tables

Transform BigQuery tables into Kumo table objects:

python
articles = kumoai.Table.from_source_table(
    source_table=articles_source,
    primary_key="article_id"
).infer_metadata()

customers = kumoai.Table.from_source_table(
    source_table=customers_source,
    primary_key="customer_id"
).infer_metadata()

transactions_train = kumoai.Table.from_source_table(
    source_table=transactions_source,
    time_column="t_dat"
).infer_metadata()

Kumo infers metadata for each table automatically, including column data types and semantic types; the primary keys and the transactions time column are the ones we specified above.
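
To double-check what was inferred, we can inspect a table's metadata. A minimal sketch, assuming the SDK exposes a metadata property returning a pandas DataFrame of column information (this may differ across kumoai versions):

python
# inspect inferred column metadata for the transactions table;
# the `metadata` property is an assumption about the SDK surface
print(transactions_train.metadata)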

Define the Graph

Connect everything into a graph structure:

python
graph = kumoai.Graph(
    tables={
        "articles": articles,
        "customers": customers,
        "transactions": transactions_train,
    },
    edges=[
        {"src_table": "transactions", "fkey": "customer_id", "dst_table": "customers"},
        {"src_table": "transactions", "fkey": "article_id", "dst_table": "articles"},
    ]
)

graph.validate(verbose=True)
text
[2025-07-11 16:05:42 - kumoai.graph.table:555 - INFO] Table articles is configured correctly.
[2025-07-11 16:05:44 - kumoai.graph.table:555 - INFO] Table customers is configured correctly.
[2025-07-11 16:05:45 - kumoai.graph.table:555 - INFO] Table transactions_train is configured correctly.
[2025-07-11 16:05:47 - kumoai.graph.graph:798 - INFO] Graph is configured correctly.

Predictive Query Language (PQL)

Here's where Kumo shines.

Instead of writing complex neural network code, we describe predictions using SQL-like PQL.

Use Case 1: Customer Lifetime Value

Predict total revenue per customer over the next 30 days:

PQL syntax for customer value prediction

python
pquery = kumoai.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT SUM(transactions.price, 0, 30, days)\n"
        "FOR EACH customers.customer_id\n"
    )
)

pquery.validate(verbose=True)
text
[2025-07-11 16:12:47 - kumoai.pquery.predictive_query:211 - INFO] Query PREDICT SUM(transactions.price, 0, 30, days)
FOR EACH customers.customer_id
 is configured correctly.

Get Kumo's recommended model parameters:

python
model_plan = pquery.suggest_model_plan()
model_plan

This returns a comprehensive model plan with:

  • Training parameters (learning rates, batch sizes, epochs)
  • GNN architecture details (channels, aggregation methods)
  • Optimization settings (loss functions, weight decay)

Start training:

python
trainer = kumoai.Trainer(model_plan=model_plan)
training_job = trainer.fit(
    graph=graph,
    train_table=pquery.generate_training_table(non_blocking=True),
    non_blocking=True,
)
text
[2025-07-11 16:17:57 - kumoai.graph.graph:394 - INFO] Graph snapshot created.
[2025-07-11 16:18:00 - kumoai.graph.graph:462 - WARNING] Graph snapshot already exists, will not be refreshed.

Use Case 2: Product Recommendations

Predict top 10 products each customer will likely buy:

PQL syntax for product recommendations

python
purchase_pquery = kumoai.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT LIST_DISTINCT(transactions.article_id, 0, 30)\n"
        "RANK TOP 10\n"
        "FOR EACH customers.customer_id\n"
    )
)

purchase_pquery.validate(verbose=True)
text
[2025-07-11 16:23:59 - kumoai.pquery.predictive_query:211 - INFO] Query PREDICT LIST_DISTINCT(transactions.article_id, 0, 30)
RANK TOP 10
FOR EACH customers.customer_id
 is configured correctly.

Train the model:

python
model_plan = purchase_pquery.suggest_model_plan()
purchase_trainer = kumoai.Trainer(model_plan=model_plan)
purchase_training_job = purchase_trainer.fit(
    graph=graph,
    train_table=purchase_pquery.generate_training_table(non_blocking=True),
    non_blocking=True,
)

Use Case 3: Purchase Volume

Predict transaction count for recently active customers:

PQL syntax for transaction predictions

python
transactions_pquery = kumoai.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT COUNT(transactions.*, 0, 30)\n"
        "FOR EACH customers.customer_id\n"
        "WHERE COUNT(transactions.*, -30, 0) > 0\n"
    )
)

transactions_pquery.validate(verbose=True)
text
[2025-07-11 16:27:30 - kumoai.pquery.predictive_query:211 - INFO] Query PREDICT COUNT(transactions.*, 0, 30)
FOR EACH customers.customer_id
WHERE COUNT(transactions.*, -30, 0) > 0
 is configured correctly.

The WHERE clause restricts the query to customers with at least one transaction in the past 30 days, shrinking the prediction scope to recently active customers.
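
We train this model the same way as the previous two. Note the transactions_trainer and transactions_training_job names, which we'll reference when generating predictions below:

python
model_plan = transactions_pquery.suggest_model_plan()
transactions_trainer = kumoai.Trainer(model_plan=model_plan)
transactions_training_job = transactions_trainer.fit(
    graph=graph,
    train_table=transactions_pquery.generate_training_table(non_blocking=True),
    non_blocking=True,
)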

Making Predictions

Once models finish training (40-60 minutes), generate predictions:

python
# Check training status
training_job.status()

Customer Value Predictions

python
from kumoai.artifact_export.config import OutputConfig

predictions = trainer.predict(
    graph=graph,
    prediction_table=pquery.generate_prediction_table(non_blocking=True),
    output_config=OutputConfig(
        output_types={"predictions"},  
        output_connector=connector,
        output_table_name="SUM_TRANSACTIONS_PRED",
    ),
    training_job_id=training_job.id,
    non_blocking=True,
)
text
[2025-07-11 18:12:51 - kumoai.trainer.trainer:418 - WARNING] Prediction produced the following warnings: 
For the optimal experience, it is recommended for output tables to only contain uppercase characters, numbers, and underscores

Product Recommendations

For ranking predictions, specify how many results per entity:

python
purchase_predictions = purchase_trainer.predict(
    graph=graph,
    prediction_table=purchase_pquery.generate_prediction_table(non_blocking=True),
    num_classes_to_return=10,  # top 10 products
    output_config=OutputConfig(
        output_types={"predictions"},
        output_connector=connector,
        output_table_name="PURCHASE_PRED",
    ),
    training_job_id=purchase_training_job.id,
    non_blocking=True,
)

Transaction Volume

python
transactions_predictions = transactions_trainer.predict(
    graph=graph,
    prediction_table=transactions_pquery.generate_prediction_table(non_blocking=True),
    output_config=OutputConfig(
        output_types={"predictions"},
        output_connector=connector,
        output_table_name="TRANSACTIONS_PRED",
    ),
    training_job_id=transactions_training_job.id,
    non_blocking=True,
)

These prediction jobs write their results to new BigQuery tables, named after each output_table_name with a _predictions suffix (e.g. SUM_TRANSACTIONS_PRED_predictions).
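
Once the jobs complete, we can list the dataset's tables to confirm the prediction outputs landed:

python
# list all tables in the dataset; the new *_predictions tables should appear
for table in client.list_tables(dataset_id):
    print(table.table_id)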

Analyzing Results

Kumo writes predictions back to BigQuery. Let's analyze them.

Top Value Customers

python
query = f"""
SELECT * FROM {dataset_id}.SUM_TRANSACTIONS_PRED_predictions
ORDER BY TARGET_PRED DESC
LIMIT 5
"""

client.query(query).to_dataframe()
text
                                            ENTITY  TARGET_PRED
0  63d4ee9c373b7ec52fd03b319faf53f3f1f24763d8a3ac...     0.668505
1  f69cf6fca69045a8259f9554e318e00fbf5e8e758e88b1...     0.657948
2  be96311f48cf1049e0da065ab322fada512ee88486c371...     0.647882
3  203785d96661d87a84718e998664c1169f43aa21b677a1...     0.643395
4  17d6270f6f81ad1f7e5a1cb7ed8edb54bc00d0d5c2cde6...     0.640910

Finding Customer Preferences

Let's identify our most valuable customers and their preferences:

python
valuable_customers = f"""
SELECT cust.*, trans.target_pred AS score FROM {dataset_id}.customers cust
INNER JOIN (
    SELECT entity, target_pred FROM {dataset_id}.SUM_TRANSACTIONS_PRED_predictions
    ORDER BY target_pred DESC
    LIMIT 30
) trans ON cust.customer_id = trans.entity
"""

top_customers = client.query(valuable_customers).to_dataframe()
top_customers.head()
text
                                       customer_id   FN  Active club_member_status fashion_news_frequency  age     score
0  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  NaN     NaN             ACTIVE                   NONE   22  0.599846
1  ceb037bfdab35cdd507685b20648829ddc0d92c8e02e2f...  NaN     NaN             ACTIVE                   NONE   24  0.621152
2  d8c54f5ca6421ba8c5d7631ebdf7a5b67ccf2dce4b859c...  NaN     NaN             ACTIVE                   NONE   25  0.585189
3  8c40103139dd4b93163fa25a536cac2351ebb5936700cb...  1.0     1.0             ACTIVE              Regularly   25  0.575516
4  d1bbee89e5364ecdb031e2b2f4be3509029d007eac99a2...  1.0     1.0             ACTIVE              Regularly   37  0.610632

Now let's see what one of these top customers will likely buy:

python
product_recs = f"""
SELECT
    pred.entity AS customer_id,
    pred.score AS score,
    art.*
FROM {dataset_id}.PURCHASE_PRED_predictions pred
INNER JOIN {dataset_id}.articles art ON pred.class = art.article_id
INNER JOIN {dataset_id}.customers cust ON pred.entity = cust.customer_id
WHERE cust.customer_id = '{top_customers.customer_id[0]}'
"""

top_cust_recs = client.query(product_recs).result().to_dataframe()
top_cust_recs
text
                                       customer_id     score  article_id product_code               prod_name product_type_no product_type_name
0  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  7.167438   787285001       787285                   Magic             265             Dress
1  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  7.023399   859957001       859957      LE Good Ada Dress             265             Dress
2  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  6.783765   758381002       758381            Twist fancy              92    Heeled sandals
3  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  6.890747   935635002       935635   LUCKY TIE NECK SHIRT             259             Shirt
4  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  6.762956   787285003       787285                   Magic             265             Dress
5  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  7.849483   904625001       904625      Pax HW PU Joggers             272          Trousers
6  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  7.308242   918212001       918212           ED Uma dress             265             Dress
7  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  7.370093   787285005       787285                   Magic             265             Dress
8  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  7.245024   814980001       814980          Alabama Dress             265             Dress
9  e9c27cf3d00e7bb6a27f395a01d01fbe6328901afdb645...  6.791434   835247001       835247              Supernova             265             Dress

This customer clearly loves dresses — 7 out of 10 recommendations are dresses!
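
We can verify that breakdown directly with pandas:

python
# count this customer's recommendations by product type
top_cust_recs["product_type_name"].value_counts()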

Purchase Volume Analysis

Finally, predict transaction volume for valuable customers:

python
predicted_volume = f"""
SELECT cust.customer_id, trans.target_pred
FROM ({valuable_customers}) cust
INNER JOIN {dataset_id}.TRANSACTIONS_PRED_predictions trans 
ON cust.customer_id = trans.entity
"""

cust_volume = client.query(predicted_volume).result().to_dataframe()
cust_volume.head()
text
                                        customer_id  target_pred
0  2cabdc6101018f8cea44310343769715049befed47caa9...    19.226614
1  77db96923d20d40532eba0020b55cd91eb51358885c2d6...    10.042411
2  062234bcfa5875d71069215348a11f100aa15edd540868...    12.537105
3  2baed3260d6a0c2f23737d09b68d30eff348eb8ec428e0...    15.382269
4  788785852eddb5874f924603105f315d69571b3e5180f3...    10.424101

Our most valuable customers are predicted to make between 10 and 20 transactions in the next 30 days.


What Makes This Powerful

Building a pipeline like this from scratch can take weeks or months: data wrangling, model training (and retraining, again and again), and the analytics on top can easily become a long-running project. As we demonstrated here, we can build the same pipeline in hours with Kumo, and the results are likely better than what most of us could achieve solo.

This democratizes advanced analytics. We don't need deep GNN expertise to deliver world-class insights into complex data and business questions. We just need Kumo.