# ETL Quickstart

`loom.etl` is a declarative ETL subsystem with compile-time validation, backend-agnostic declarations, and a single runtime entrypoint (`ETLRunner`).

## Install

Choose one backend:

```bash
pip install "loom-kernel[etl-polars]"
# or
pip install "loom-kernel[etl-spark]"
```

## Minimal pipeline

```python
from datetime import date

import polars as pl

from loom.etl import (
    ETLParams,
    ETLStep,
    ETLProcess,
    ETLPipeline,
    ETLRunner,
    FromTable,
    IntoTable,
)


class DailyParams(ETLParams):
    run_date: date


class CleanOrders(ETLStep[DailyParams]):
    orders = FromTable("raw.orders").columns("id", "amount", "run_date")
    target = IntoTable("staging.orders").replace()

    def execute(self, params: DailyParams, *, orders: pl.LazyFrame) -> pl.LazyFrame:
        return orders.filter(pl.col("amount") > 0)


class DailyProcess(ETLProcess[DailyParams]):
    steps = [CleanOrders]


class DailyPipeline(ETLPipeline[DailyParams]):
    processes = [DailyProcess]


runner = ETLRunner.from_dict(
    storage={
        "engine": "polars",
        "defaults": {"table_path": {"uri": "/var/lib/loom/lake"}},
    }
)
runner.run(DailyPipeline, DailyParams(run_date=date(2026, 3, 30)))
```

## Basic write modes

Every `IntoTable` target declares exactly one write mode by chaining a method.

| Mode | What it does | Example |
|------|--------------|---------|
| `append` | Add rows to the table | `IntoTable("staging.orders").append()` |
| `replace` | Full overwrite | `IntoTable("staging.orders").replace()` |
| `replace_partitions` | Overwrite only partitions present in the batch | `IntoTable("staging.orders").replace_partitions("year", "month")` |
| `replace_partition` | Overwrite a single known partition | `IntoTable("staging.orders").replace_partition(year=params.run_date.year)` |
| `replace_where` | Overwrite rows matching a predicate | `IntoTable("staging.orders").replace_where(col("date") == params.run_date)` |
| `upsert` | Merge on key columns | `IntoTable("staging.orders").upsert(keys=("order_id",))` |

See the [ETL pipelines guide](../etl/pipelines.md) for the full write-mode reference.

## YAML config

```yaml
storage:
  engine: polars
  defaults:
    table_path:
      uri: s3://my-lake
      storage_options:
        AWS_REGION: ${oc.env:AWS_REGION}
        AWS_ACCESS_KEY_ID: ${oc.env:AWS_ACCESS_KEY_ID}
        AWS_SECRET_ACCESS_KEY: ${oc.env:AWS_SECRET_ACCESS_KEY}
  tmp_root: /var/lib/loom/lake/_tmp

observability:
  log: true
  slow_step_threshold_ms: 30000
  run_sink:
    root: /var/lib/loom/lake/_runs
```

```python
from loom.etl import ETLRunner

runner = ETLRunner.from_yaml("config/etl.yaml")
```

## File aliases

Hard-coding file paths couples logic to infrastructure. Use aliases instead:

```yaml
storage:
  engine: polars
  files:
    - name: events_raw
      path:
        uri: s3://raw-bucket/events/
        storage_options:
          AWS_REGION: eu-west-1
```

```python
from loom.etl import ETLStep, FromFile, IntoFile, Format


class LoadEvents(ETLStep[DailyParams]):
    events = FromFile.alias("events_raw", format=Format.CSV)
    target = IntoFile.alias("exports_daily", format=Format.PARQUET)

    def execute(self, params: DailyParams, *, events: pl.LazyFrame) -> pl.LazyFrame:
        return events
```

## Next steps

- [ETL pipelines guide](../etl/pipelines.md) — full write modes, Spark runtime, cloud config, and pluggable resolvers
- [ETL testing guide](../etl/testing.md) — in-memory runners, scenarios, and stubs
- [ETL examples](../etl/examples.md) — companion repository with runnable Polars and Spark pipelines
- [API reference](../reference/api/etl)
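Before diving into the guides above, here is a rough sketch of how a write mode other than `replace` fits into the same step shape as the minimal pipeline. It reuses `DailyParams` from the quickstart; the `marts.orders` table name, the `id` merge key, and the run-date filter are illustrative choices, not anything the quickstart prescribes.

```python
import polars as pl

from loom.etl import ETLStep, FromTable, IntoTable


class UpsertOrders(ETLStep[DailyParams]):
    # Read the staging table written by CleanOrders in the minimal pipeline.
    orders = FromTable("staging.orders").columns("id", "amount", "run_date")
    # Merge into an illustrative mart table, keyed on the order id.
    target = IntoTable("marts.orders").upsert(keys=("id",))

    def execute(self, params: DailyParams, *, orders: pl.LazyFrame) -> pl.LazyFrame:
        # Only merge the rows belonging to the current run date.
        return orders.filter(pl.col("run_date") == params.run_date)
```

Wiring it in is the same as before: add `UpsertOrders` to a process's `steps` list and run the pipeline with `ETLRunner`.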