I built OpenClaw for tabular datasets

My attempt at creating an autonomous data science agent for tabular datasets, lessons learned, and how AI agents can be used to help data science workflows.

Q3D Research

I accidentally built OpenClaw for tabular datasets. You can read the Twitter threads showcasing the demos below.

The easiest way to start using it is to fork the GitHub project, clone it to your local computer, open Claude Code, and point Claude at a CSV file in the project -- or run the Streamlit GUI locally and ask Claude for help.

Alternatively, just watch the project's /ipynbs folder for updates and copy the produced notebooks into your own project if you are only interested in my discoveries.


Glossary

I notice I like overexplaining things, but doing so results in a blog that is verbose and hard to read. Hence, my new habit is to compress recurring technical concepts into modern metaphors.

By providing a glossary table, hopefully you can understand what I'm trying to say. If you can't, an LLM will have no problem understanding my custom terms, and you can ask it for clarification.

| Term | Meaning |
| --- | --- |
| weakAgent | A lightweight agent doing basic EDA, hypothesis generation, and surface-level exploration |
| strongAgent | A more capable agent tasked with answering hypotheses and producing structured code |
| kaggleFork | A notebook that runs a standard sequence of transforms and models -- looks like work, produces nothing novel |
| AIDraft1 | First-pass notebook generated by strongAgent -- high false positives, messy, exploratory |
| HumanNotebook1 | Human-curated version of AIDraft1 -- code blocks the human actually trusts and wants to run |
| FinalNotebook | strongAgent polishing HumanNotebook1 into a clean, reproducible output |
| artifacts | Charts, tables, logs, intermediate dataframes produced during a research run |
| Tokenmaxxing | The act of maximizing the number of tokens you consume as a vibecoder |

Why I built it

why?
tokenmaxxing
you srs?
yea
token cost is dropping ~90% every year. Token generation and throughput are increasing 10x every year
if your token consumption isn't keeping up with the token supply created by hardware companies (NVDA, AVGO), you are not **maximizing your industrial potential of AI.**
ok. how you plan to tokenmaxx?
build a SaaS?
SaaS is kinda dead
at some point, it is harder to find customers than to build the thing.
even if you figured out how to build something, someone with better distribution and customer support team will just copy your app and beat you to the market.
content generation?
good at burning tokens, but not enough
at 10x more token supply in 2027, you will flood the market with 10x more low-value AI slop while competing for the same amount of attention.
only AI videos burn enough tokens to keep up with the token supply growth
damn, most people who are tokenmaxxing are actually following very uniform and low-value patterns
there has to be a better way to waste tokens that takes advantage of the token supply growth AND generates high signal for the audience
what about research?
yknow? thats actually a good idea
you can waste tokens letting agents run research functions
generate tons of graphics and logs
return a nice list of findings and charts
upload to twitter regularly

This idea isn't new

I'm basically trying to build a solo data team. I can't be the first person with such ambitions.

Not to mention, ETL infrastructure is already mature; cron-scheduled ETL is 20-year-old technology. The bottleneck for research insights and dashboard snapshots has always been the research process itself: figuring out the correct sequence of transformations to automate.

Today, autonomous Jupyter notebooks already exist in the market. Kaggle has tons of notebook examples and code-running logs you could use to train a model to do autonomous research. Somehow, this problem hasn't been solved yet.

This is because the current end game for research (for 2026) looks more like a Claude Code session.

  • open Claude Code
  • load dataset
  • ask Claude for help
  • Claude generates a notebook
  • run the notebook yourself iteratively until you get the results you want

The problem with this method is how unscalable it is. Despite all the advancement in AI inference throughput and model capabilities, the research process is still a manual, iterative, and slow process.

OpenClaw: the game changer?

Claude Code is efficient, but still manual and slow, and it requires lots of human supervision in the notebook. However, the invention of OpenClaw seems to change the game.

Without going through all the hype and history of OpenClaw, here is how I notice people using it.

  • You give OpenClaw your emails or a bunch of PDF files
  • it reads your files, fiddles with the internet, generates its own code, then runs stuff for you
  • it generates the necessary content, connects to your computer, and sends that content to apps like Twitter, Slack, etc.
  • all without your supervision or constant permission requests like a chat session

This made me ask:

"What if I send OpenClaw my csv files, let it passively run for a few hours, and it cooks up a Jupyter notebook containing all the research insights and charts I need?"

Quickly, I realized this is not a good idea. Here are the lessons I learned during my v1 iteration.


Lessons while building V1 of Q3D Open Research

Data science isn't linear

Insights are not generated in a neat pipe. You don't just run EDA → cleaning → feature engineering → clustering → report.

You run EDA, find something weird, backtrack, clean differently, re-run EDA, discover it was a type coercion issue, fix it, re-run, find a weird correlation, open a new branch, forget what you named the temp frame, find the chart is using a different dataframe than the report, and eventually produce something that might be real.
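The "type coercion issue" above is a classic backtrack trigger. Here's a minimal pandas sketch (with made-up data) of the kind of silent dtype problem that derails a session:

```python
import pandas as pd

# One bad cell ("N/A") silently turns the whole column into strings
df = pd.DataFrame({"revenue": ["100", "250", "N/A", "400"]})
print(df["revenue"].dtype)  # object, not float -- aggregations will mislead or fail

# Coerce explicitly: bad cells become NaN instead of poisoning the dtype
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
print(df["revenue"].mean())  # 250.0 (the NaN row is excluded)
```

Until you spot this, every downstream chart and summary built on the column is quietly wrong, which is exactly why the workflow loops back instead of flowing forward.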

Have you seen a real Jupyter notebook?

Anonymous04/09/26(Thu)09:15:44No.ds-01

> be me, data scientist

> open "data_project.ipynb"

> df = pd.read_csv("data.csv")

> df_copy = df.copy()

> df_2

> df_agg

> df_agg_category1

> df_final

> df_FINAL

> df_final_FINAL_v2

> temp_df.copy() # just to be safe

> temp_df2 = temp_df.copy() # for the chart

> df_backup = df_final_FINAL_v2.copy() # before the join

> TFW I have 11 dataframes and I don't know which one is correct

> TFW the chart uses 'df_agg_category1' but the report uses 'df_final'

> MFW my notebook looks like a dumpster fire. Every. Single. Time.

Technically, it means your first notebook session almost inevitably generates a ton of artifacts -- intermediate frames, charts, logs, model outputs.

You clean up the notebook, start over in a new notebook session with a better cleaning or transformation sequence, and repeat the process.

Now, suppose you let the AI agent do the research for you. One of two things happens:

  • Agent runs the notebook as one linear sequence of functions and ends up with a very basic EDA notebook
  • Agent exhaustively tests different tables and transforms, generating a pile of objects that wreck your computer's memory at every code-block step.

Every artifact is a potential input to the next agent call. Which means inference context explodes. Which means the agent's working memory gets polluted with stale intermediate objects. You end up discovering the same lesson of "overloaded MCP".

The real problem is not that the agent is dumb. It is that the research process is fundamentally non-linear. Agent architectures that assume everything flows in one direction generate kaggleForks. Architectures that try to explore everything generate 1-10GB worth of logs.

Which brings us to the next lesson.


There exists a tradeoff between opinionated and unsupervised research

In practice, the first pass through a novel dataset uses ~95% of the total discovery time.

Once you've figured out the structure -- what the columns mean, what the distributions look like, what the weird nulls are -- subsequent analysis follows a more linear path.

This creates a hard tradeoff between opinionated and unsupervised research:

  • Fully autonomous on a novel dataset: agent is flying blind, hallucinates patterns, produces confident nonsense, expensive to run

  • Fully scripted pipeline (kaggleFork): agent runs df.describe(), fits KMeans with k=3, and writes a polished report that looks like the notebooks made by unemployed profiles cargo-culting the latest ML trends. It doesn't teach you anything, except to boost their LinkedIn resumes. 1

Anonymous04/09/26(Thu)09:22:11No.ds-02

> be autonomous AI agent, assigned to novel dataset

> user has never seen this CSV before. Neither have I.

> run df.describe()

> run KMeans(n_clusters=3).fit(X)

> silhouette_score: 0.09

> report: "we identified 3 distinct customer segments with strong cluster cohesion"

> TFW silhouette of 0.09 means the clusters are basically noise

> TFW I wrote "strong cohesion" anyway because that's what the template says

> TFW I wasted $10-20 worth of tokens to run a sequential notebook that is literally just a Kaggle notebook you can fork for $0
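The failure in the greentext above is mechanically checkable. A sketch (sklearn, synthetic noise data) of gating the cluster claim on the silhouette score instead of parroting the report template -- the 0.5 threshold is a common rule of thumb, not a universal constant:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # pure Gaussian noise: there are no real clusters

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)

# Gate the narrative on the number, not on what the template says
verdict = "distinct clusters" if score > 0.5 else "clusters are likely noise"
print(f"silhouette={score:.2f} -> {verdict}")
```

A kaggleFork prints the template sentence either way; a useful agent refuses to write "strong cohesion" when the score says noise.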

That said, modern models today are smart enough to discover genuinely novel insights by accident.

The new problem? When an agent produces a non-obvious finding, the human can't verify it. The artifact trail is a mess. Reproducing the finding requires domain context the agent doesn't have.

Since no one can follow the reasoning, real insights get thrown away.

Anonymous04/09/26(Thu)09:35:57No.ds-03

> be autonomous research agent

> weakAgent: "here are 8 hypotheses worth investigating"

> strongAgent: "here is a 400-line notebook with 23 findings"

> human: wtf is all this slop?

> wtf is 'category1_lmfaoIndex' and why is it '100.4' value for 'row_15'?

> wtf is 'seemsLegitButActuallyBS_logTransformed_median' and why is it a signal if a row is in 90th percentile for this feature?

> wtf does this chart contain 10 lines and 15 colors?

> TFW it's still full-time work for the human to verify the AI agent's logs

The result is me settling on a multipass research process:

  1. weakAgent → basic EDA and hypothesis statements
  2. strongAgent → answer the hypotheses, return AIDraft1 (a messy, high-recall first-draft notebook)
  3. Human runs the notebook, selects the code blocks they trust, produces HumanNotebook1
  4. strongAgent cross-references HumanNotebook1 with AIDraft1, cleans up, produces FinalNotebook
wait, so you still put the human in the loop?
how is this 'autonomous'?
if it's 100% autonomous,
then it wastes $100 of tokens, generates 95% garbage and 5% legit stuff you don't understand
you still have to manually verify the stuff you don't understand
arguably, it might cost you more time than if you had started the notebook from scratch manually
so this is the middleground you came up with?
if you have better ideas, dm me

Autonomous data science has a recall-precision problem

Agent output on tabular data has a fundamental tension:

  • High recall: agent reports every signal it finds, including weak ones. You get 47 findings. Most are technically true but useless ("column A has a slightly higher mean on Tuesdays"). Real signals drown in noise.

  • High precision: agent only reports findings it's confident about. You miss nuance, contextual patterns, and the genuinely weird stuff that turns out to matter.

Anonymous04/09/26(Thu)09:48:22No.ds-04

> agent: "found 47 statistically significant correlations"

> me: "ok which one matters"

> agent: "feature_17 correlates with feature_23 at r=0.31 (p=0.04)"

> me: what is 'feature_17' and 'feature_23'?

> agent: "check *logs hidden in 4 layers of artifact folder*"

> me: "screw it."

> me: "is the correlation meaningful?"

> agent: "statistically significant"

> me: "a p-value of 5% means nothing in biotech. this is a marketing dataset, so a p-value of 20% could still be worthwhile"

> agent: "..."

> TFW statistical significance is not practical significance

> TFW the agent has no idea about domain context to know what is the right next step

> TFW the agent has no way of making a human happy in the domain of unknown unknowns
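One way to encode the "statistical vs practical significance" distinction is to gate findings on an effect-size floor as well as a domain-specific p-value ceiling. A sketch with hypothetical findings -- the names, values, and thresholds are all illustrative:

```python
# Hypothetical findings: (name, correlation r, p-value). Values are made up.
findings = [
    ("feature_17 ~ feature_23", 0.31, 0.04),
    ("price ~ demand", 0.72, 0.001),
    ("col_A ~ tuesday_flag", 0.05, 0.03),  # "significant" but tiny effect
    ("noise_1 ~ noise_2", 0.12, 0.18),     # neither significant nor large
]

# Domain-dependent gates: biotech might demand p < 0.01,
# marketing might tolerate p < 0.20. These are illustrative defaults.
P_MAX, R_MIN = 0.05, 0.30

kept = [f for f in findings if f[2] < P_MAX and abs(f[1]) >= R_MIN]
for name, r, p in kept:
    print(f"{name}: r={r}, p={p}")
```

The thresholds themselves are exactly the domain context the agent doesn't have, which is why they end up being a human-supplied config rather than something the agent can infer.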

That said, I'd rather have more false positives than miss a real signal -- hence the decision to use a multi-pass architecture.

  • weakAgent: generate a bunch of hypotheses
  • strongAgent: answer the hypotheses, return AIDraft1
  • human: run the notebook, select the code blocks they trust, produce HumanNotebook1
  • strongAgent: cross-references HumanNotebook1 with AIDraft1, cleans up, produces FinalNotebook
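As a sketch, the multipass loop looks something like this. The agent functions are stubs standing in for real LLM calls, and all the names here are mine, not OpenClaw APIs:

```python
def weak_agent(dataset):
    """Pass 1: cheap model turns basic EDA into hypothesis statements."""
    return [f"hypothesis about {col}" for col in dataset["columns"]]

def strong_agent(hypotheses):
    """Pass 2: capable model answers each hypothesis -> messy AIDraft1 blocks."""
    return [{"hypothesis": h, "code": f"# test {h}"} for h in hypotheses]

def human_review(draft, trusted_ids):
    """Pass 3: human keeps only the blocks they believe -> HumanNotebook1."""
    return [b for i, b in enumerate(draft) if i in trusted_ids]

def polish(human_nb, draft):
    """Pass 4: strong agent cross-references and cleans -> FinalNotebook."""
    return {"cells": human_nb, "discarded": len(draft) - len(human_nb)}

dataset = {"columns": ["revenue", "churn", "region"]}
draft = strong_agent(weak_agent(dataset))            # AIDraft1
final = polish(human_review(draft, {0, 2}), draft)   # FinalNotebook
print(final["discarded"], "untrusted blocks dropped")
```

The human sits exactly in the middle of the pipeline: everything before pass 3 is recall-oriented, everything after is precision-oriented.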

This is just the V1 architecture. I do not have confidence that this is the best way to do it.

In fact, part of the reason why I wrote this post is so that people get to read about my process and perhaps teach me better ideas.


Building the OS for research is very hard

The core infrastructure problem: give the agent just enough tools without burying it in options. You'll notice these are, again, extensions of the previous problems.

  • Too much freedom: agent writes charting functions from scratch every run. Inference costs explode. Every artifact is slightly different. Nothing is reproducible.
  • Too many custom tools: 900 functions in the tool registry means the agent constantly picks the wrong one. At some point, a generalizable function extended by the agent outperforms a niche custom one.

While I haven't solved the OS for autonomous research agents, I could borrow some ideas from OpenClaw. Here is OpenClaw's architecture diagram for Gateway Routing and Memory Management:

Whether the same architecture works for research objects -- charts, datatables, reports -- is unclear. Email is qualitative. Semantic search works well there. Tabular artifacts are quantitative. "Find me a similar past notebook" via vector search may return nothing useful.

My V1 settled on a mix of SQL for directory traversal (deterministic) and Markdown prompt files describing function routes (probabilistic). The agent reads the markdown to decide which functions to call, then SQL to locate the actual artifacts.

sql.db                                        # deterministic traversal
/artifacts/{dataset-id}/                      # temp outputs per run  
/memory/{dataset-id}/                         # working context per run
/configs/prompts/{agent}/                     # agent prompt configs
/mds/ipynbs/done/{dataset-id}/logs/          # completed examples (positive + negative)
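To make the deterministic half concrete, here's a minimal sqlite sketch of how an agent can resolve artifacts by query rather than vector search. The schema and paths are illustrative, not the project's actual layout:

```python
import sqlite3

# Illustrative artifact registry: every run logs what it produced and where
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE artifacts (
    dataset_id TEXT, run_id INTEGER, kind TEXT, path TEXT)""")
con.executemany("INSERT INTO artifacts VALUES (?, ?, ?, ?)", [
    ("sales-2026", 1, "chart", "/artifacts/sales-2026/revenue_hist.png"),
    ("sales-2026", 1, "frame", "/artifacts/sales-2026/df_clean.parquet"),
    ("sales-2026", 2, "chart", "/artifacts/sales-2026/churn_by_region.png"),
])

# "Latest chart for this dataset" is a deterministic query, not a similarity guess
latest = con.execute(
    "SELECT path FROM artifacts WHERE dataset_id=? AND kind='chart' "
    "ORDER BY run_id DESC LIMIT 1", ("sales-2026",)).fetchone()
print(latest[0])
```

The markdown prompt files then only need to describe *which* queries exist; the lookup itself never hallucinates.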

Is this the best architecture? No idea. It's vibes I cooked up as each new problem appeared.


The human still needs a way to understand the generated autonomous notebook

Why is it that OpenClaw works for email, but we haven't seen it become popular for tabular datasets or autonomous research agents?

| | Email | Tabular |
| --- | --- | --- |
| Input | Natural language text | Numbers in cells |
| Output | Natural language summary | Quantitative insights |
| Same type? | Yes -- text in, text out | No -- numbers in, understanding out |
| Problem type | Qualitative | Quantitative + context |
| Why AI works well | You already know what a good summary looks like | You need domain knowledge to judge if a finding is real |
| Difficulty | Easy | Hard |

In a biotech study, a p-value of 5% means nothing. It might be a big deal in a marketing dataset. Historic mean and moving averages in a time series can be completely misleading if there's a regime change.

This context gap means that what the agent understands -- and the research steps it produces to match its own reasoning -- may not be what the human understands and wants to see.

Anonymous04/09/26(Thu)10:05:18No.ds-05

> agent: "Q3 moving average shows 12% deviation from historic mean"

> me: "is that bad?"

> agent: "statistically notable"

> me: "was there a regime change in Q2"

> agent: "I don't have that context"

> me: "because if there was a regime change the historic mean baseline is meaningless"

> agent: "..."

> TFW the entire analysis is built on the wrong baseline

> TFW I have to re-run everything with a regime break parameter

> TFW the agent did not know what to ask, and just cooks what it can

Natural language content (emails, tweets) is easy to distill because the AI summarizes and curates -- the output is the same type of object as the input. Tabular insights don't have that property.


I might have accidentally chosen a harder (and less profitable) problem than quant trading

Quantitative trading uses standardized time series. You clean data through a fixed pipe, join tables, and run a linear sequence of tests. The structure is known in advance. The tooling is mature. The feedback loop is clear: did the signal make money in the backtest?

Here, I'm creating an isolated study for every novel dataset. Two outcomes:

  1. Discover something boring everyone already knows (a glorified pivot table).
  2. Discover something genuinely novel -- that no one understands, that requires a long essay to close the context gap, and that the agent can't verify on its own.
so you are saying this is a harder problem than quant trading?
possibly
how?
in quant trading, everything is a time series
yes, its hard to clean data through some ETL and maintain the series
but once you can join 2 tables by the same index (timestamp), you can run a linear sequence of tests
your ADF test, Granger causality test, Johansen cointegration test, seasonality decomposition, autoARIMA, autoETS, etc
you pretty much do the same thing to data with identical shape
and you make no money doing this?
its open-sourced
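The "join two tables by the same index, then run the same battery" point is easy to show with pandas (toy data; real pipelines would run ADF, Granger, and friends on top of exactly this shape):

```python
import pandas as pd

# Two series share one index (the timestamp), so the whole pipeline is
# a single join followed by the same fixed transforms per column. Toy data.
idx = pd.date_range("2026-01-01", periods=5, freq="D")
price = pd.Series([100, 102, 101, 105, 107], index=idx, name="price")
flow = pd.Series([5, 7, 6, 9, 11], index=idx, name="flow")

panel = pd.concat([price, flow], axis=1)               # join by timestamp
rets = panel.pct_change(fill_method=None).dropna()     # identical transform
print(rets.corr().loc["price", "flow"])                # same test, every dataset
```

With a novel CSV there is no shared index, no fixed transform, and no standard battery -- which is the whole asymmetry.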

What's next

Just me showing off what an autonomous research agent can cook up, for now.

Things I want to improve:

  • Better memory system
  • Better file, artifact, context management
  • Better UX and documentation (trying to make it so even imbeciles can use it). The endgame is a robot that autonomously collects data for itself, then uses its own numeracy and literacy to teach humans novel signals.

If you found this useful or want to see it continue:

  • Star on GitHub
  • Follow @nelvOfficial on Twitter
  • Fork it, run it on your own datasets, open an issue when it breaks (it will)
  • Hire me for part-time data science consulting

Footnotes

  1. The whole site is infuriating. People just slap functions they don't understand onto a notebook and call it a day. Somehow, those notebooks get 1000+ views and 100+ likes. As soon as the profile gets recruited by a big company, the notebooks stop getting maintained. These authors offload the "verification tax" onto their audiences, polluting not just the site but also SEO.