What does HackerNews think of sketch?

AI code-writing assistant that understands data content

Language: Python

#79 in Python
I want to applaud this effort; new ways of manipulating pandas dataframes are sorely needed. This project, along with sketch[1], takes a conversational approach.

I have been building my own open source tool for wrangling pandas dataframes - Buckaroo[2]. The aim of Buckaroo is to automate common data cleaning and exploration techniques, and provide a GUI to quickly access transforms and see their results along with the python code to perform the transform. The jupyter notebook is great for presentation, but clunky for iterative exploration.

Here are the most common operations I do when exploring a new dataset:

1. Assess - What are the columns in this dataset, what are their ranges of values, and how do these columns vary together? Buckaroo offers toggleable summary stats to show this info.

2. Initial Clean - Are all of the date columns parsed as datetimes? Are columns that contain integers typed as integers? (A single string in the column will force the whole column to type object.) Clean the types with safeint/dropna...

3. Filter to a subset of data based on criteria. Row wise filtering.

4. Perform some type of analysis. Group by with different aggregations, a plot... Something more complex
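As a rough illustration of the "Initial Clean" step (2), here is a minimal pandas sketch. The sample data and the `clean_types` helper are made up for this example and are not Buckaroo's implementation; it only shows how one stray string forces a column to object dtype and how coercion can recover usable types.

```python
import pandas as pd

# Hypothetical raw data: one stray string forces "qty" to dtype object.
raw = pd.DataFrame({
    "when": ["2023-01-05", "2023-02-10", "not a date"],
    "qty": [1, 2, "n/a"],
})

def clean_types(df):
    """Coerce the example columns to datetime and nullable integer types."""
    out = df.copy()
    # Unparseable dates become NaT instead of raising.
    out["when"] = pd.to_datetime(out["when"], errors="coerce")
    # Non-numeric entries become missing; the nullable Int64 dtype keeps
    # the remaining values as integers rather than floats.
    out["qty"] = pd.to_numeric(out["qty"], errors="coerce").astype("Int64")
    return out

cleaned = clean_types(raw)
print(cleaned.dtypes)
```

Coercing instead of raising keeps the bad rows visible as missing values, so a later `dropna` is an explicit choice rather than a side effect.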

Other operations that I regularly perform

5. Concatenating similar dataframes. pd.concat is simple; knowing whether your columns match up, along with their types, is difficult. I want a UI that shows where they match and where they don't. This normally requires an extra data cleaning step.

6. Joining tables/dataframes. I want a UI that tells me: is there a natural key to join on? What percentage of rows are joined vs. not matched?
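The checks behind items 5 and 6 can be sketched in plain pandas. The helper names `column_mismatches` and `join_match_rate` and the sample frames are hypothetical, invented for this example; this is one possible way to compute the numbers such a UI would display, not how Buckaroo does it.

```python
import pandas as pd

def column_mismatches(a, b):
    """Columns missing from either frame, plus shared columns whose dtypes differ."""
    shared = a.columns.intersection(b.columns)
    return {
        "only_in_a": sorted(a.columns.difference(b.columns)),
        "only_in_b": sorted(b.columns.difference(a.columns)),
        "dtype_conflicts": sorted(c for c in shared if a[c].dtype != b[c].dtype),
    }

def join_match_rate(left, right, key):
    """Fraction of rows in `left` that find a match in `right` on `key`."""
    # De-duplicate the right-hand keys so the left row count is preserved.
    right_keys = right[[key]].drop_duplicates()
    merged = left.merge(right_keys, on=key, how="left", indicator=True)
    return float((merged["_merge"] == "both").mean())

# Hypothetical sample data.
orders = pd.DataFrame({"id": [1, 2, 3, 4], "amount": [10.0, 20.0, 30.0, 40.0]})
customers = pd.DataFrame({"id": [1, 2, 5], "name": ["a", "b", "c"]})
orders_as_text = pd.DataFrame({"id": ["1", "2"], "amount": [1.0, 2.0]})

print(column_mismatches(orders, orders_as_text))
print(join_match_rate(orders, customers, "id"))
```

The `indicator=True` flag on `merge` adds a `_merge` column, which is what makes the matched vs. unmatched split cheap to compute.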

Buckaroo currently does an initial version of 1, 2, and 4. Buckaroo allows these steps to be done in a single notebook cell, instead of many separate cells iteratively built up. There are planned features that address all of these use cases. In addition, Buckaroo has been engineered to enable easy extension by users. Plug your own functions in, and make them quickly accessible.

In my view, a point-and-click GUI is the best way to quickly accomplish these tasks, as opposed to a conversational AI approach. The conversational AI approach doesn't provide enough structure around common tasks.

PS: This morning I added a "Related Projects" [3] Section to the Buckaroo docs. If Buckaroo doesn't solve your problem, look at one of the other linked projects (like Mito).

[1] https://github.com/approximatelabs/sketch

[2] https://github.com/paddymul/buckaroo

[3] https://buckaroo-data.readthedocs.io/en/latest/FAQ.html

Sketch is similar, but can do code generation (with `.sketch.howto`): https://github.com/approximatelabs/sketch

> This because the function is no longer idempotent, each call to the AI can yield a different result.

Also, it makes it harder to verify correctness, and may make processing larger datasets (or repeatedly processing similar datasets) more expensive.

Sketch is another take on integrating an LLM into Pandas:

https://github.com/approximatelabs/sketch

For GPT/Copilot-style help for pandas in a notebook's REPL flow (without needing to install plugins), I built sketch. I genuinely use it every time I'm working on pandas dataframes for a quick one-off analysis. It just makes the iteration loop so much faster. (Specifically `.sketch.howto`; anecdotally, I don't actually use `.sketch.ask` anymore.)

https://github.com/approximatelabs/sketch

I sorta did this, feel free to check it out and let me know your thoughts!

On the main langchain post (in January) that got traction on Hacker News, I left this comment: https://news.ycombinator.com/item?id=34422917 . It still remains true: a "simpler langchain".

> To offer this code-style interface on top of LLMs, I made something similar to LangChain, but scoped what i made to only focus on the bare functional interface and the concept of a "prompt function", and leave the power of the "execution flow" up to the language interpreter itself (in this case python) so the user can make anything with it.

I made a really lightweight wrapper over requests and called it lambdaprompt: https://github.com/approximatelabs/lambdaprompt . It has served all of my personal use-cases since making it, including powering `sketch` (copilot for pandas): https://github.com/approximatelabs/sketch

Core things it does: Uses jinja templates, does sync and async, and most importantly treats LLM completion endpoints as "function calls", which you can compose and build structures around just with simple python. I also combined it with fastapi so you can just serve up any templates you want directly as rest endpoints. It also offers callback hooks so you can log & trace execution graphs.
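To make the "prompt function" idea concrete, here is a minimal offline sketch. It is not lambdaprompt's actual API: `prompt_function` and `fake_complete` are hypothetical names, the standard library's `string.Template` is swapped in for jinja, and the "model" is a stub so the example runs without a network call.

```python
from string import Template

def prompt_function(template, complete):
    """Turn a text template into a plain Python function.

    `complete` is any callable str -> str. In real use it would wrap an
    LLM completion endpoint; here it is an assumption of this sketch.
    """
    tmpl = Template(template)

    def call(**kwargs):
        # Render the template, then hand the prompt to the completion callable.
        return complete(tmpl.substitute(**kwargs))

    return call

def fake_complete(prompt):
    """Stand-in "model" that just upper-cases the prompt."""
    return prompt.upper()

summarize = prompt_function("Summarize: $text", fake_complete)
translate = prompt_function("Translate to French: $text", fake_complete)

def summarize_then_translate(text):
    # Composition is ordinary Python: one prompt function feeds the next.
    return translate(text=summarize(text=text))

result = summarize_then_translate("pandas is a dataframe library")
print(result)
```

Because each prompt is just a function, control flow (loops, retries, branching) stays in the host language rather than in a framework.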

Altogether it's only ~600 lines of python.

I haven't had a chance to really push all the different examples out there, so I think it hasn't seen much adoption outside of those that give it a try.

I hope to get back to it sometime in the next week to introduce local-mode (e.g. all the smaller open source models are now available, and I want to make those first-class).

This is great~ There's been some really rapid progress on Text2SQL in the last 6 months, and I really think this will have a real impact on the modern data stack ecosystem!

I had similar success with lambdaprompt (https://github.com/approximatelabs/lambdaprompt/) for solving Text2SQL: one of the first projects we built and tested with it was a Text-to-SQL tool very similar to this.

Similar learnings as well:

- Data content matters and helps these models do Text2SQL a lot

- Asking for multiple queries and selecting the best one is really important

- Asking for re-writes of failed queries (happens occasionally) also helps
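The last two learnings can be sketched as a small loop over candidate queries. This is a simplified illustration, not the code from either project: `first_working_query` is a hypothetical helper, "best" is reduced to "first candidate that executes", and both the candidate list and the `rewrite` callable stand in for calls to a model.

```python
import sqlite3

def first_working_query(conn, candidates, rewrite=None):
    """Return (sql, rows) for the first candidate that executes.

    `candidates` stands in for several model-generated queries. `rewrite`
    is a hypothetical callable (sql, error_message) -> new sql that stands
    in for asking the model to repair a failed query.
    """
    for sql in candidates:
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            if rewrite is None:
                continue
            fixed = rewrite(sql, str(exc))
            try:
                return fixed, conn.execute(fixed).fetchall()
            except sqlite3.Error:
                continue
    return None, []

# Hypothetical sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 20.0), ("west", 5.0)],
)

candidates = [
    "SELECT region, SUM(amt) FROM sales GROUP BY region",  # wrong column name
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region",
]
sql, rows = first_working_query(conn, candidates)
print(sql, rows)
```

A fuller version would score the candidates that do execute (for example by comparing result shapes) instead of stopping at the first success.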

The main challenge, I think, with a lot of these "look, it works" tools for data applications is getting an interface that will actually be easy to adopt. I can see the chat-bot style shown here (Discord and Slack integration) being really valuable, as I believe integrations of this style with data catalog systems have gotten some traction recently. People like to ask other people data questions in Slack; adding a bot that tries to answer might short-circuit a lot of this!

We built a prototype where we applied similar techniques to the pandas-code-writing part of the stack, trying to help keep data scientists / data analysts "in flow", integrating the code answers in notebooks (similar to how co-pilot puts suggestions in-line) -- and released https://github.com/approximatelabs/sketch a little while ago.

From https://github.com/approximatelabs/sketch/blob/main/sketch/p... it appears that this library is calling a remote API, which obviates the utility of the demonstrated use case.

Upon closer inspection, it looks like https://github.com/approximatelabs/sketch interfaces with the model via https://github.com/approximatelabs/lambdaprompt, which is made by the same organization. This suggests to me that the former may be a toy demonstration of the latter.

Interesting how as of the time of writing this, most of the comments here (i.e. dozens) are praising this as a legitimate use case. Maybe I'm missing something obvious, but it seems clear to me that uploading data to a third party to verify whether that data contains PII is a non-starter for any serious application.