boring beats clever.
An ugly cron job in production beats an elegant DAG in a notebook. Ship small, iterate, log everything.
Data engineer based in Germany. I own the full data stack — ingest, warehouse, pipelines, LLM automation, and BI — so small teams get numbers they can actually trust.
I'm Utkarsh, a data engineer who likes the messy middle: the place where business problems become schemas, schemas become pipelines, and pipelines become decisions.
Right now I'm the entire data & automation function at Noritual Lab, a matcha e-commerce startup in Berlin. Solo DE reporting to the founder. I picked the stack, stood up the warehouse, and own everything from CREATE SCHEMA to the KPI dashboard founders open every Monday. Before that I worked on agentic AI research at BioMed X (Heidelberg), and I hold an MS in Applied Data Science & Analytics from SRH Heidelberg (GPA 1.9).
I care about systems that are actually used: small, debuggable, documented, and cheap to run. In a typical week I'm writing DAGs, shipping Django, generating the monthly P&L, and tuning a Claude-API invoice classifier — all in the same repo.
// now running Noritual's monthly P&L · scaling the invoice pipeline · open to data eng & AI eng roles.
tools I use daily, weekly, or know well enough to ship with.
a few things I believe about building data systems.
Every pipeline exists to answer something someone actually asks, not to look good in a diagram.
A README, a runbook, a diagram. Every project. If the next engineer can't find it, it doesn't exist.
seven pieces. from production pipelines to research code.
Full internal web app, built end-to-end with one teammate in ~2 months. Scoped requirements with the team, designed a star-schema Postgres backend, and built a Django UI with CRUD, role-based access (user / admin / founder), and a forgot-password flow. Ingests from the Notion API, Google Sheets, and direct UI inputs.
→ single source of truth for founders' weekly review meetings. 10–12 active users.
Processes ~1,000 transactions/month: classifies which need invoices (~25% do), then matches each one to its PDF with confidence-scored Claude API screening. Hive-style partitioned GCS layout (raw → accepted → trash) for clean backfills and audit. Replaced accountant fees previously spent on transaction-linking.
→ the hardest problem I've solved, and the one I'm proudest of: invoice/PO matching at scale.
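A minimal sketch of what a Hive-style key layout for the raw → accepted → trash stages could look like. The `year=/month=` partition keys, the function names, and the example filename are assumptions for illustration, not the production layout.

```python
from datetime import date

# Stage names taken from the layout described above.
STAGES = ("raw", "accepted", "trash")


def partition_path(stage: str, day: date, filename: str) -> str:
    """Build a Hive-style object key: stage/year=YYYY/month=MM/filename.

    The year/month partition keys are an assumption for this sketch.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"{stage}/year={day.year:04d}/month={day.month:02d}/{filename}"


def restage(key: str, new_stage: str) -> str:
    """Return the same key under a different stage, keeping its partitions."""
    if new_stage not in STAGES:
        raise ValueError(f"unknown stage: {new_stage!r}")
    _, rest = key.split("/", 1)
    return f"{new_stage}/{rest}"


raw_key = partition_path("raw", date(2025, 8, 3), "invoice_0001.pdf")
accepted_key = restage(raw_key, "accepted")
```

Because the partition values live in the key itself, a backfill for one month only needs to list a single prefix such as `raw/year=2025/month=08/`.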
Have owned the company's monthly P&L process since Aug 2025. Designed the P&L data model end-to-end and built a dedicated Postgres DB ingesting from Amazon SP-API, Shopify, and Sheets. Covers multi-currency FX conversion, three-way matching (PO ↔ goods receipt ↔ invoice), and month-end close support.
→ replaced the founder's prior (inaccurate) self-built P&L with an owned, defensible monthly cadence.
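For readers unfamiliar with three-way matching, here is a rough sketch of the check for a single line item. The `Line` fields, the price tolerance, and the issue messages are hypothetical; the real data model is not shown here.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    """One line item as it appears on a PO, goods receipt, or invoice (hypothetical shape)."""
    po_number: str
    quantity: int
    unit_price: Decimal


def three_way_match(po: Line, receipt: Line, invoice: Line,
                    price_tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Compare PO, goods receipt, and invoice; return a list of discrepancies."""
    issues = []
    if not (po.po_number == receipt.po_number == invoice.po_number):
        issues.append("po_number mismatch")
    if receipt.quantity > po.quantity:
        issues.append("received more than ordered")
    if invoice.quantity != receipt.quantity:
        issues.append("invoiced quantity differs from received quantity")
    if abs(invoice.unit_price - po.unit_price) > price_tolerance:
        issues.append("invoice price differs from PO price")
    return issues


po = Line("PO-1", 10, Decimal("4.50"))
receipt = Line("PO-1", 10, Decimal("4.50"))
clean = three_way_match(po, receipt, Line("PO-1", 10, Decimal("4.50")))
flagged = three_way_match(po, receipt, Line("PO-1", 12, Decimal("4.80")))
```

An empty list means the three documents agree; anything else is held back for manual review before it reaches the P&L.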
Electron + React desktop tool for multi-channel pricing analysis. Three calculation modes, channel-specific COGS, editable fee configuration, and a real-time margin dashboard backed by Postgres.
→ real margin visibility per channel in one click.
Master's thesis comparing 7 deep architectures (LSTM, TCN, TCN-LSTM, attention variants) for hourly demand forecasting. Best model: Multi-Scale TCN+LSTM at 88.83% R². Identified and corrected a data-leakage flaw in the original published benchmark.
→ academic rigor: a research finding, not just a modeling exercise.
12-week flagship build: Airflow, dbt, multilingual NLP (Hindi, Punjabi, English), XGBoost trend prediction, Streamlit dashboard. Cross-platform fuzzy matching across Spotify, YouTube, Genius, with MLflow tracking and a silent GitHub Actions collector.
→ early-signal layer for trend bets, not post-hoc charts.
5-stage pipeline taking audio to insight: Whisper transcription, sentiment (TextBlob + DistilRoBERTa), LSA topic modeling, T5 chunked summarization with BLEU/ROUGE eval. Dockerised with ffmpeg bundled, full CI/CD on GHCR.
→ open source · evaluated · reproducible.
Solo data & automation engineer at a matcha e-commerce startup. Reports to the founder. Own the GCP/Postgres warehouse, the KPI dashboard (Django, 10–12 users), monthly P&L close, and a Claude-API invoice pipeline processing ~1,000 transactions/month. Vendor migrations saved ~€1,250/year.
Built a multi-agent system with LangGraph + LangChain for academic paper discovery and retrieval. RAG pipeline on Milvus, Streamlit front-end, Docker, CI/CD, LangSmith observability.
GPA 1.9 (German scale · high distinction). Thesis on hybrid TCN-LSTM architectures for demand forecasting (best model 88.83% R²) — identified a data-leakage flaw in the published CUBIST benchmark paper.
GPA 8.39 / 10.
Looking for a data engineer who ships? Or just want to swap notes on Airflow, LLMs, or German bureaucracy?
utkarsh.sawant21@gmail.com