Agentic engineering workflows, LLM and multimodal evaluation systems, Workflow-backed review platforms, Production full-stack AI products

AI Agent Systems Builder

Shivam Kumar

I build production-grade AI systems, agentic workflows, evaluation platforms, and scalable full-stack products for teams that need reliable execution, not demos.

Working model

Serious AI systems need more than prompts.

The strongest agentic products are built like operational systems: clear task boundaries, async execution, evaluation artifacts, human review, observability, and deployment discipline.

Agent workflows that finish work

I design systems where agents plan, execute, review, and leave artifacts that humans can trust.

Evaluation as product infrastructure

I care about scoring, review loops, retry behavior, and result provenance because they make AI systems operable.

Full-stack ownership

I move across product surfaces, APIs, workers, data models, and deployment paths without losing sight of the overall system design.

Practical ML systems

My background spans LLM platforms, multimodal benchmarks, computer vision, and production automation.

Featured writing

Building evaluation platforms that can be operated.

A public-safe technical reflection on turning an agent evaluation prototype into a production-grade platform.

How Does OpenClaw Work?

A production-minded systems tour of OpenClaw: Gateway architecture, context assembly, hooks and plugins, file-backed memory and recall, sandboxed tool execution, agent runs, QA scenarios, security audits, and red-team patterns you can reuse in your own agent platform.

2026-06-09 / 13 min read

Selected projects

Systems across evaluation and applied ML.

A few representative projects from multimodal benchmarking and computer vision.

Computer vision benchmark

ShapeCodeBench

A synthetic benchmark for evaluating whether multimodal models can reconstruct executable drawing programs from rendered geometric scenes.

arXiv GitHub Zenodo

Designed a deterministic image-to-program task with a small Python-like drawing DSL and render-based scoring.
Released a frozen eval_v1 split, baseline model results, paper source, and reproducibility artifacts.
Published the project as an arXiv preprint with a permanent Zenodo archive.

Python, Computer vision, Program synthesis, Benchmarking, arXiv

Multimodal benchmarking

VLMBench / multimodal model evaluation

An evaluation tool for comparing multimodal language models with a consistent prediction-to-ground-truth methodology.

Evaluated model outputs across image and text tasks using repeatable scoring flows.
Built around provider APIs, cloud storage, structured samples, and result analysis.
Focused on making model comparisons auditable and easier for teams to reason about.

Python, GCP, OpenAI API, Gemini API, Evaluation design

Computer vision and applied ML

DSM reconstruction ML system

A patented machine-learning system for generating Digital Surface Models from imagery, reducing reliance on external elevation sources.

Built data collection and preprocessing workflows for large-scale model training.
Worked across computer vision modeling, evaluation, and production integration.
Extended operational coverage for geospatial workflows where elevation data was limited.

Python, PyTorch, OpenCV, GCP, Computer vision