Job Description
Reporting to: Chief Product
graph traversal strategies (network science); agentic RAG
3) Multi-document classification + scoring (risk-focused)
Build instruction-based and ML-assisted classification pipelines for multi-document inputs (themes, narratives, risk taxonomy).
Explore generating data to fine-tune small models.
Create scoring methodologies (e.g., risk score, severity, momentum/growth, confidence, exposure) with a clear rationale and calibration approach (a classification-and-calibration sketch follows this section).
Bonus: experience building "risk detection" classifiers and adverse media style pipelines.
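To make the classification-and-calibration expectations above concrete, here is a minimal sketch, assuming scikit-learn and a tiny hand-labeled set of document texts; the label taxonomy, features, and calibration method are placeholders rather than the product's actual pipeline.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: concatenated multi-document inputs with risk labels.
docs = [
    "regulator opens probe into supplier payments ...",
    "quarterly newsletter announces community event ...",
    "court filing alleges sanctions evasion by subsidiary ...",
    "routine press release on product launch ...",
]
labels = [1, 0, 1, 0]  # 1 = risk-relevant, 0 = benign (hypothetical taxonomy)

# TF-IDF + logistic regression as a simple baseline classifier, wrapped in
# sigmoid (Platt) calibration so predicted probabilities can be read as
# calibrated risk scores rather than raw classifier confidence.
base = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf = CalibratedClassifierCV(base, method="sigmoid", cv=2)
clf.fit(docs, labels)

new_doc = "local media report links director to fraud investigation ..."
risk_score = clf.predict_proba([new_doc])[0, 1]  # calibrated P(risk-relevant)
print(f"risk score: {risk_score:.2f}")
```

In practice the instruction-based (LLM) path and the ML path would feed the same scoring and evaluation layer, so their outputs stay comparable.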
4) Context engineering + automatic prompt improvement
Lead prompt engineering practices across the product: reusable prompt assets, versioning, guardrails, and domain adaptation.
Implement prompt evolution techniques (e.g., automated prompt iteration / prompt improvement loops) where it makes commercial sense; a minimal improvement loop is sketched at the end of this section.
Understand how the wording of a prompt shifts the probability distribution the LLM outputs (sketched below), and manage context through graphs and information retrieval.
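A minimal sketch of inspecting how prompt wording shifts the next-token distribution, assuming a Hugging Face causal LM is available locally (gpt2 is used purely as a stand-in); a production setup would look at full generations rather than a single next token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def next_token_distribution(prompt: str, top_k: int = 5):
    """Top-k next-token probabilities for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits at the last position
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, top_k)
    return [(tokenizer.decode(int(i)), round(p.item(), 3))
            for i, p in zip(top.indices, top.values)]

# Two phrasings of the same instruction can produce visibly different distributions.
for prompt in ("The overall risk of this transaction is",
               "Rated as low, medium or high, the risk of this transaction is"):
    print(prompt, "->", next_token_distribution(prompt))
```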
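And a minimal sketch of an automated prompt-improvement loop: generate candidate rewrites, score each against a small evaluation set, keep the best. The rewriter, the scoring rule, and the call_llm helper are hypothetical stand-ins for whatever model client and eval harness the product actually uses.

```python
import random

def call_llm(prompt: str) -> str:
    """Hypothetical model client; replace with a real API or local model call."""
    return "placeholder answer"

def rewrite_prompt(prompt: str) -> str:
    """Hypothetical rewriter; in practice another LLM call proposing a variant."""
    tweaks = ["Be concise. ", "Cite the source document. ", "Answer step by step. "]
    return random.choice(tweaks) + prompt

def score_prompt(prompt: str, eval_set: list[tuple[str, str]]) -> float:
    """Fraction of eval examples where the output contains the expected answer."""
    hits = 0
    for question, expected in eval_set:
        answer = call_llm(prompt + "\n\n" + question)
        hits += int(expected.lower() in answer.lower())
    return hits / len(eval_set)

def improve(prompt: str, eval_set, rounds: int = 5) -> str:
    best_prompt, best_score = prompt, score_prompt(prompt, eval_set)
    for _ in range(rounds):
        candidate = rewrite_prompt(best_prompt)
        score = score_prompt(candidate, eval_set)
        if score > best_score:  # greedy hill-climb on the eval score
            best_prompt, best_score = candidate, score
    return best_prompt

eval_set = [("Summarize the filing.", "placeholder")]  # toy eval set
print(improve("You are a risk analyst. Answer from the documents only.", eval_set))
```

The commercial judgement is in the scoring function and the stopping rule, not the loop itself.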
5) Evaluation: make quality measurable and repeatable
Build robust evaluation methodologies for prompts, RAG, summarization, and classification.
Apply multiple evaluation techniques, including:
offline metrics (precision/recall/F1 where appropriate)
retrieval metrics and ablations
LLM-as-a-judge style evaluations with rubrics, controls, and drift detection
Define quality gates that allow the team to move fast without breaking trust (a minimal metrics-and-gate example appears after this section).
Understand an LLM as a neural network, not only as something that can be prompted and observed from the outside.
For example, understanding how entropy can act as a signal for detecting hallucinations as they unfold through the layers of the model (see the sketch below).
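A minimal sketch of the entropy signal mentioned above, assuming a PyTorch model exposes per-position logits; thresholds, aggregation, and layer-wise variants are deliberately left open, and the random tensor stands in for real model output.

```python
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits: tensor of shape (seq_len, vocab_size), e.g. model(**inputs).logits[0].
    Unusually high entropy during generation is one cheap signal that the model
    is uncertain, which some teams use as a first-pass hallucination heuristic.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

fake_logits = torch.randn(4, 50257)  # stand-in for real logits (seq_len, vocab_size)
print(token_entropy(fake_logits))    # one entropy value per generated position
```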
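And for the offline-metrics item in the evaluation list above, a minimal sketch assuming scikit-learn and a tiny hand-labeled set; the threshold is a hypothetical quality gate, not a recommended value.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labeled set standing in for a versioned evaluation dataset.
y_true = ["high_risk", "low_risk", "high_risk", "low_risk"]
y_pred = ["high_risk", "high_risk", "high_risk", "low_risk"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label="high_risk"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# A release-blocking quality gate can be as blunt as a floor on F1
# (or a no-regression check against the previous release).
GATE = 0.75  # hypothetical threshold
assert f1 >= GATE, "classification quality gate failed"
```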
6) LLMOps + cost control
Implement LLMOps: experiment tracking, model/prompt versioning, dataset management, observability, and release practices.
Build monitoring for quality + safety + cost, and actively optimize infrastructure spend in cloud environments; a simple cost-tracking sketch follows this section.
Deploy and maintain open-source models.
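A minimal sketch of the cost side of monitoring, assuming token usage counts are available per request; the prices are illustrative parameters, not real provider rates.

```python
from dataclasses import dataclass

@dataclass
class RequestUsage:
    prompt_tokens: int
    completion_tokens: int

def request_cost(usage: RequestUsage,
                 price_in_per_1k: float,
                 price_out_per_1k: float) -> float:
    """Estimated cost of one LLM call from token counts and per-1K-token prices."""
    return (usage.prompt_tokens / 1000) * price_in_per_1k \
         + (usage.completion_tokens / 1000) * price_out_per_1k

# Illustrative prices only; pull real rates from the provider's pricing page.
daily_usage = [RequestUsage(1200, 300), RequestUsage(800, 150)]
total = sum(request_cost(u, price_in_per_1k=0.01, price_out_per_1k=0.03)
            for u in daily_usage)
print(f"estimated daily spend: ${total:.4f}")  # feed this into dashboards/alerts
```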
7) Lead by influence (and occasionally by direct leadership)
Bring "Senior/Lead Engineer" judgement: clean architecture, pragmatic decisions, mentoring, and unblocking teams.
Partner tightly with Product, Design, Data Science, and Engineering—while also being able to execute independently.
What success looks like (first 6–12 months)
A production-grade agentic architecture powering key workflows (investigate → summarize → classify → score → recommend action).
A measurable evaluation framework where quality improves release over release.
A Graph RAG (or equivalent) capability that materially improves multi-doc summarization accuracy and defensibility.
Clear cost/performance tradeoffs and observability that make the system operable at scale.
A team around you that's leveled up in GenAI engineering practices.
Required experience (Must-have)
Proven background as a Senior / Lead Engineer (or equivalent staff-level scope), owning architecture and delivery.
Demonstrated experience building agentic GenAI architecture for commercially successful product features (not only internal prototypes).
Strong experience working with Data Scientists on ML algorithms, NLP, evaluation design, and productionization.
Hands-on experience in AWS and GCP (Azure acceptable as an additional platform).
Production experience with:
RAG chatbots
multi-document summarization (ideally Graph RAG)
multi-document classification
scoring methodologies (risk scoring is a strong bonus)
Deep expertise in prompt engineering and evaluation, including both classical metrics (e.g., precision/recall) and LLM-as-a-judge approaches.
Strong LLMOps and GenAI product design experience: experimentation → deployment → monitoring → iteration.
Nice-to-have (Strong bonuses)
Experience in risk/compliance domains (e.g., adverse media, AML, entity investigation workflows).
Knowledge graphs in production (e.g., Neo4j) and graph extraction pipelines.
Experience running annotation programs / building labeled datasets for NLP tasks.
Skills