My attempt at creating an autonomous data science agent for tabular datasets, the lessons I learned, and how AI agents can help data science workflows.
I accidentally built OpenClaw for tabular datasets. You can read the Twitter threads below for the showcase and demos.
The easiest way to start using it is to fork the GitHub project, clone it to your local computer, open Claude Code, and point it at a CSV file in the project, or use the Streamlit GUI locally and ask Claude for help.
Alternatively, if you are just interested in my discoveries, watch the project's /ipynbs folder for updates and copy the produced notebooks into your own project.
I notice I like overexplaining things, but doing so results in a blog that is verbose and hard to read. So my new habit is to compress technical concepts into modern metaphors.
With the glossary table below, hopefully you can follow what I'm trying to say. If you can't, an LLM will have no trouble understanding my custom terms, and you can ask it for clarification.
| Term | Meaning |
| --- | --- |
| weakAgent | A lightweight agent doing basic EDA, hypothesis generation, and surface-level exploration |
| strongAgent | A more capable agent tasked with answering hypotheses and producing structured code |
| kaggleFork | A notebook that runs a standard sequence of transforms and models — looks like work, produces nothing novel |
| AIDraft1 | First-pass notebook generated by strongAgent — high false positives, messy, exploratory |
| HumanNotebook1 | Human-curated version of AIDraft1 — the code blocks the human actually trusts and wants to run |
| FinalNotebook | strongAgent polishing HumanNotebook1 into a clean, reproducible output |
| artifacts | Charts, tables, logs, intermediate dataframes produced during a research run |
| Tokenmaxxing | The act of maximizing the number of tokens you burn as a vibecoder |
Token costs are dropping ~90% every year. Token generation throughput is increasing ~10x every year.
If you are not growing your token consumption as fast as the token supply created by hardware companies (NVDA, AVGO), you are not **maximizing your industrial potential of AI.**
ok. how do you plan to tokenmaxx?
build a SaaS?
SaaS is kinda dead
at some point, it becomes harder to find customers than to build the thing.
even if you figure out how to build something, someone with better distribution and a customer support team will copy your app and beat you to market.
content generation?
good at burning tokens, but not enough
at 10x the token supply in 2027, you will flood the market with 10x more low-value AI slop, all competing for the same attention.
only AI video burns enough tokens to keep up with the token supply growth
damn, most people who are tokenmaxxing are following very uniform, low-value patterns
there has to be a better way to burn tokens that takes advantage of the token supply growth AND generates high signal for the audience
what about research?
y'know? that's actually a good idea
you can burn tokens letting agents run research functions
I'm basically trying to build a solo data team. I can't be the first person with such ambitions.
Not to mention, the infrastructure for ETL is already mature. Cron-scheduled ETL is 20-year-old technology. Until now, the bottleneck for research insights and dashboard snapshots has been the research process itself: figuring out the correct sequence of ETL steps to run.
Today, autonomous Jupyter notebooks already exist on the market. Kaggle has tons of notebook examples and execution logs you could train a model on to do autonomous research. Somehow, this problem hasn't been solved yet.
This is because the current end game for research (for 2026) looks more like a Claude Code session.
open Claude Code
load dataset
ask Claude for help
Claude generates a notebook
run the notebook yourself iteratively until you get the results you want
The problem with this method is how unscalable it is. Despite all the advances in AI inference throughput and model capabilities, research is still a manual, iterative, slow process.
Claude Code is efficient, but it is still manual, slow, and requires lots of human supervision in the notebook. The invention of OpenClaw, however, seems to change the game.
Without going through all the hype and history of OpenClaw, here is how I see people using it:
You give OpenClaw your emails or a bunch of PDF files
it reads your files, fiddles with the internet, generates its own code, then runs stuff for you
it generates the necessary content, connects to your computer, and sends that content to apps like Twitter, Slack, etc.
all without your supervision or the constant permission requests of a chat session
This made me ask:
"What if I send OpenClaw my csv files, let it passively run for a few hours, and it cooks up a Jupyter notebook containing all the research insights and charts I need?"
Quickly, I realized this is not a good idea. Here are the lessons I learned during my v1 iteration.
Insights are not generated in a neat pipe. You don't just run EDA → cleaning → feature engineering → clustering → report.
You run EDA, find something weird, backtrack, clean differently, re-run EDA, discover it was a type coercion issue, fix it, re-run, find a weird correlation, open a new branch, forget what you named the temp frame, find the chart is using a different dataframe than the report, and eventually produce something that might be real.
Have you seen a real Jupyter notebook?
Anonymous04/09/26(Thu)09:15:44No.ds-01
> be me, data scientist
> open "data_project.ipynb"
> df = pd.read_csv("data.csv")
> df_copy = df.copy()
> df_2
> df_agg
> df_agg_category1
> df_final
> df_FINAL
> df_final_FINAL_v2
> temp_df.copy() # just to be safe
> temp_df2 = temp_df.copy() # for the chart
> df_backup = df_final_FINAL_v2.copy() # before the join
> TFW I have 11 dataframes and I don't know which one is correct
> TFW the chart uses 'df_agg_category1' but the report uses 'df_final'
> MFW my notebook looks like a dumpster fire. Every. Single. Time.
Technically, it means your first notebook session almost inevitably generates a ton of artifacts -- intermediate frames, charts, logs, model outputs.
You clean up the notebook, start over in a new notebook session with a better cleaning or transformation sequence, and repeat the process.
Now, suppose you let the AI agent do the research for you. One of two things happens:
The agent runs the notebook as one linear sequence of functions and ends up with a very basic EDA notebook
The agent exhaustively tests different tables and transforms, generating a pile of objects at each code-block step that wrecks your computer's memory
Every artifact is a potential input to the next agent call. Which means inference context explodes. Which means the agent's working memory gets polluted with stale intermediate objects. You end up rediscovering the same lesson as "overloaded MCP".
The real problem is not that the agent is dumb. It is that the research process is fundamentally non-linear. Agent architectures that assume everything flows in one direction generate kaggleForks. Architectures that try to explore everything generate 1-10 GB of logs.
In practice, the first pass through a novel dataset uses ~95% of the total discovery time.
Once you've figured out the structure -- what the columns mean, what the distributions look like, what the weird nulls are -- subsequent analysis follows a more linear path.
This creates a hard tradeoff between opinionated and unsupervised research:
Fully autonomous on a novel dataset: agent is flying blind, hallucinates patterns, produces confident nonsense, expensive to run
Fully scripted pipeline (kaggleFork): the agent runs df.describe(), fits KMeans with k=3, and writes a polished report that looks like the notebooks unemployed profiles make while cargo-culting the latest ML trends. It doesn't teach you anything, except to boost their LinkedIn resumes.¹
Anonymous04/09/26(Thu)09:22:11No.ds-02
> be autonomous AI agent, assigned to novel dataset
> user has never seen this CSV before. Neither have I.
> TFW silhouette of 0.09 means the clusters are basically noise
> TFW I wrote "strong cohesion" anyway because that's what the template says
> TFW I wasted $10-20 worth of tokens to run a sequential notebook that is literally just a Kaggle notebook you can fork for $0
That said, modern models today are smart enough to discover genuinely novel insights by accident.
The new problem? When an agent produces a non-obvious finding, the human can't verify it. The artifact trail is a mess. Reproducing the finding requires domain context the agent doesn't have.
Since no one can follow the reasoning, real insights get thrown away.
Anonymous04/09/26(Thu)09:35:57No.ds-03
> be autonomous research agent
> weakAgent: "here are 8 hypotheses worth investigating"
> strongAgent: "here is a 400-line notebook with 23 findings"
> human: wtf is all this slop?
> wtf is 'category1_lmfaoIndex' and why is it '100.4' for 'row_15'?
> wtf is 'seemsLegitButActuallyBS_logTransformed_median' and why is it a signal that a row is in the 90th percentile for this feature?
> why does this chart have 10 lines and 15 colors?
> TFW it's still full-time work for the human to verify the AI agent's logs
The result is me settling for a multipass research process:
weakAgent → basic EDA and hypothesis statements
strongAgent → answer the hypotheses, return AIDraft1 (a messy, high-recall first-draft notebook)
Human runs the notebook, selects the code blocks they trust, produces HumanNotebook1
strongAgent cross-references HumanNotebook1 with AIDraft1, cleans up, produces FinalNotebook
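The four passes above can be sketched as a plain orchestration loop. This is a hypothetical skeleton, not the project's actual code: `weak_agent`, `strong_agent_draft`, `human_review`, and `strong_agent_polish` are stand-ins for real LLM calls and a real human review step.

```python
# Hypothetical sketch of the multipass research loop.
# The agent functions below are placeholders for real LLM calls;
# they return fixed strings so the control flow itself is runnable.

def weak_agent(dataset_path: str) -> list[str]:
    """Pass 1: cheap EDA plus hypothesis statements."""
    return [f"hypothesis about {dataset_path}"]

def strong_agent_draft(hypotheses: list[str]) -> str:
    """Pass 2: answer the hypotheses, emit a messy high-recall notebook."""
    return "AIDraft1.ipynb"  # placeholder artifact name

def human_review(draft: str) -> str:
    """Pass 3: the human keeps only the code blocks they trust."""
    return "HumanNotebook1.ipynb"

def strong_agent_polish(curated: str, draft: str) -> str:
    """Pass 4: cross-reference both notebooks, emit the clean version."""
    return "FinalNotebook.ipynb"

def multipass(dataset_path: str) -> str:
    hypotheses = weak_agent(dataset_path)
    draft = strong_agent_draft(hypotheses)
    curated = human_review(draft)
    return strong_agent_polish(curated, draft)
```

The point of the structure is that the only non-automated step (pass 3) sits between the high-recall draft and the final polish, which is where human trust is cheapest to inject.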
wait, so you still put the human in the loop?
how is this 'autonomous'?
if it's 100% autonomous,
then it wastes $100 of tokens, generates 95% garbage and 5% legit stuff you don't understand
you still have to manually verify the stuff you don't understand
arguably, it might cost you more time than starting the notebook from scratch yourself
Agent output on tabular data has a fundamental tension:
High recall: agent reports every signal it finds, including weak ones. You get 47 findings. Most are technically true but useless ("column A has a slightly higher mean on Tuesdays"). Real signals drown in noise.
High precision: agent only reports findings it's confident about. You miss nuance, contextual patterns, and the genuinely weird stuff that turns out to matter.
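A toy way to see this tension: the same list of findings reported under a recall-first versus a precision-first filter. The findings, effect sizes, and thresholds here are made up for illustration; they are not values from the project.

```python
# Illustrative recall/precision dial on agent findings.
# A "finding" is (description, effect_size, p_value); thresholds are invented.

findings = [
    ("column A mean slightly higher on Tuesdays", 0.02, 0.04),
    ("category X churns 3x faster after price change", 0.61, 0.001),
    ("weak correlation between B and C", 0.08, 0.03),
]

def report(findings, min_effect=0.0, max_p=0.05):
    """Keep findings passing both an effect-size floor and a p-value ceiling."""
    return [f for f in findings if f[1] >= min_effect and f[2] <= max_p]

high_recall = report(findings)                     # everything "significant": all 3
high_precision = report(findings, min_effect=0.3)  # only large effects: just 1
```

The dial has no right setting: loosen `min_effect` and real signals drown in Tuesday-mean noise; tighten it and the weird-but-real stuff never gets reported.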
The core infrastructure problem: give the agent just enough tools without burying it in options. You'll notice these are, again, extensions of the previous problems.
Too much freedom: the agent writes charting functions from scratch every run. Inference costs explode. Every artifact is slightly different. Nothing is reproducible.
Too many custom tools: 900 functions in the tool registry means the agent constantly picks the wrong one. At some point, a generalizable function extended by the agent outperforms a niche custom one.
While I haven't solved the OS for an autonomous research agent, I could borrow some ideas from OpenClaw. Here is OpenClaw's architecture diagram for Gateway Routing and Memory Management:
Whether the same architecture works for research objects -- charts, datatables, reports -- is unclear. Email is qualitative. Semantic search works well there. Tabular artifacts are quantitative. "Find me a similar past notebook" via vector search may return nothing useful.
My V1 settled on a mix of SQL for directory traversal (deterministic) and Markdown prompt files describing function routes (probabilistic). The agent reads the Markdown to decide which functions to call, then uses SQL to locate the actual artifacts.
```
sql.db                       # deterministic traversal
/artifacts/{dataset-id}/     # temp outputs per run
/memory/{dataset-id}/        # working context per run
/configs/prompts/{agent}/    # agent prompt configs
/mds/
/ipynbs/
/done/{dataset-id}/logs/     # completed examples (positive + negative)
```
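A minimal sketch of how that hybrid routing could look. Everything here is illustrative: the route descriptions, the SQLite schema, and the paths are invented for the example, not the project's actual layout.

```python
# Sketch of the V1 hybrid routing idea: Markdown describes function routes
# (the agent reads this as prompt context and picks probabilistically),
# while SQLite locates artifacts deterministically. Schema is invented.

import sqlite3

ROUTES_MD = """\
## chart_timeseries
Use for: trend questions on date-indexed columns.
## chart_distribution
Use for: histogram/skew questions on numeric columns.
"""  # the agent would receive this text as part of its prompt

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE artifacts (dataset_id TEXT, kind TEXT, path TEXT)")
con.execute(
    "INSERT INTO artifacts VALUES (?, ?, ?)",
    ("sales-291k", "chart", "/artifacts/sales-291k/trend.png"),
)

def locate(dataset_id: str, kind: str):
    """Deterministic artifact lookup: no vector search, no guessing."""
    row = con.execute(
        "SELECT path FROM artifacts WHERE dataset_id = ? AND kind = ?",
        (dataset_id, kind),
    ).fetchone()
    return row[0] if row else None
```

The split matters: the fuzzy decision (which function fits the question) stays in the prompt layer, while the brittle decision (where the file actually is) never touches the LLM.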
Is this the best architecture? No idea. It's vibes I cooked up ahead of each new problem I discovered.
Why is it that OpenClaw works for email, but we haven't seen it become popular for tabular datasets or autonomous research agents?
| | Email | Tabular |
| --- | --- | --- |
| Input | Natural language text | Numbers in cells |
| Output | Natural language summary | Quantitative insights |
| Same type? | Yes — text in, text out | No — numbers in, understanding out |
| Problem type | Qualitative | Quantitative + context |
| Verification | You already know what a good summary looks like | You need domain knowledge to judge if a finding is real |
| Difficulty | Easy | Hard |
In a biotech study, a p-value of 5% means nothing. It might be a big deal in a marketing dataset. Historic means and moving averages in a time series can be completely misleading if there's a regime change.
This context gap means what the agent understands -- and the research steps it produces from its reasoning -- may not match what the human understands and wants to see.
Anonymous04/09/26(Thu)10:05:18No.ds-05
> agent: "Q3 moving average shows 12% deviation from historic mean"
> me: "is that bad?"
> agent: "statistically notable"
> me: "was there a regime change in Q2"
> agent: "I don't have that context"
> me: "because if there was a regime change the historic mean baseline is meaningless"
> agent: "..."
> TFW the entire analysis is built on the wrong baseline
> TFW I have to re-run everything with a regime break parameter
> TFW the agent did not know what to ask, and just cooks what it can
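The regime-change complaint above is easy to make concrete with a toy series (the numbers below are invented purely for illustration):

```python
# Why a historic mean misleads under a regime change: a toy series that
# shifts from a ~10 baseline to a ~20 baseline halfway through.

series = [10, 11, 9, 10, 11, 20, 21, 19, 20, 21]  # regime break at index 5

historic_mean = sum(series) / len(series)  # 15.2, matches neither regime
pre_mean = sum(series[:5]) / 5             # 10.2
post_mean = sum(series[5:]) / 5            # 20.2

# The latest point (21) looks like a huge "deviation from historic mean"
# (~38%), but against its own regime's baseline it is ordinary (~4%).
deviation_vs_historic = (series[-1] - historic_mean) / historic_mean
deviation_vs_regime = (series[-1] - post_mean) / post_mean
```

Nothing in the raw numbers tells the agent which baseline is the right one; that is exactly the domain context it lacks.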
Natural language content (emails, tweets) is easy to distill because the AI summarizes and curates -- the output is the same type of object as the input. Tabular insights don't have that property.
Quantitative trading uses standardized time series. You clean data through a fixed pipe, join tables, and run a linear sequence of tests. The structure is known in advance. The tooling is mature. The feedback loop is clear: did the signal make money in the backtest?
Here, I'm creating an isolated study for every novel dataset. Two outcomes:
Discover something boring everyone already knows (a glorified pivot table).
Discover something genuinely novel -- that no one understands, that requires a long essay to close the context gap, and that the agent can't verify on its own.
so you are saying this is a harder problem than quant trading?
possibly
how?
in quant trading, everything is a time series
yes, it's hard to clean data through some ETL and maintain the series
but once you can join 2 tables by the same index (timestamp), you can run a linear sequence of tests
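That "join by the same index, then run a linear sequence of tests" property can be shown with toy data. This is stdlib-only illustration, not real trading logic; the prices and signals are invented.

```python
# The quant-trading property described above: once two tables share a
# timestamp index, joining and testing is a linear pipeline. Toy data only.

prices = {"2026-01-01": 100.0, "2026-01-02": 102.0, "2026-01-03": 101.0}
signal = {"2026-01-01": 1, "2026-01-02": -1, "2026-01-03": 1}

# step 1: join the two tables by the shared index (timestamp)
joined = {ts: (prices[ts], signal[ts]) for ts in prices if ts in signal}

# step 2: a linear test, e.g. signal-weighted next-day price change
dates = sorted(joined)
pnl = sum(
    joined[dates[i]][1] * (prices[dates[i + 1]] - prices[dates[i]])
    for i in range(len(dates) - 1)
)
# step 3: the clear feedback loop: did the signal make money in the backtest?
```

Contrast with a novel CSV: there is no shared index, no known schema, and no single number like `pnl` that tells you whether the analysis was any good.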
Below is just me showing off what the autonomous research agent can cook up for now.
Things I want to improve:
Better memory system
Better file, artifact, context management
Better UX and documentation (trying to make it so even imbeciles can use it). The endgame is a robot that autonomously collects data for itself, then uses its own numeracy and literacy to teach humans novel signals.
If you found this useful or want to see it continue:
The whole site is infuriating. People slap functions they don't understand onto a notebook and call it a day. Somehow, those notebooks get 1000+ views and 100+ likes. As soon as the profile gets recruited by a big company, the notebook stops being maintained. Overwhelmingly, it is low-effort accounts offloading the "verification tax" onto their audiences, polluting not just the site but also SEO. ↩
The following research is produced autonomously by my agentic pipeline (Claude).
I did not:
1. look at the dataset
2. run any code myself
3. know what any of the columns meant until the AI told me
All I do:
1. "Claude, use this link, generate a report for me"
This
Q3D open research automatically converts this boring 291k-row table into this epic report.
It even feature-engineers columns not native to the table, allowing creative insights.