
Memberships

New Society

Private • 546 • $87/m

Agent Artificial

Public • 3 • $500/m

No-Coder Academy

Public • 99 • Free

Teaching & Learning With A.I.

Private • 1.4k • Free

You Probably Need a Robot

Private • 1.3k • Free

PRISM AI Family

Private • 106 • Free

AI Synthesizers FREE

Private • 94 • Free

Generative AI

Public • 186 • Free

Quantum AI Society

Private • 82 • $8/m

19 contributions to Generative AI
AI Agents can check their outputs on Google first - is this full circle?
Google DeepMind, Stanford, and the University of Illinois at Urbana-Champaign propose a Google-search-based system for factually validating LLM-generated outputs, aiming to reduce LLMs' tendency to confabulate. I do think this is a cool idea that will make AI agents factually more reliable, but I hope the irony doesn't escape you:

a) After we have spent many billions of dollars on the development of LLMs and RAG systems, vector stores, data centers, hardware, etc., AI agents now go and check their outputs on Google. All this effort, only to come back to a Google search.

b) I suspect it's no coincidence that Google co-authored this research; they are looking to deeply integrate search into the AI toolbox, a technology many have argued is going to upend their dominance and business model.

In reality, I'd say this quickly gets a bit tricky, because the answer your system proposes, which is then fact-checked via a Google search, may well include information from your proprietary RAG system that you might not want to send into a Google search. https://arxiv.org/abs/2403.18802
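If I understand the paper correctly (it appears to be the "Long-form factuality" / SAFE work), the core loop is: split the answer into atomic claims, retrieve search results for each claim, and let an LLM judge whether the evidence supports it. A minimal sketch of that idea, where `call_llm` and `google_search` are hypothetical placeholders rather than the authors' actual code:

```python
# Hedged sketch of a SAFE-style fact-checking loop: decompose an LLM answer
# into atomic claims, search the web for each claim, and ask the LLM whether
# the evidence supports it. `call_llm` and `google_search` are hypothetical
# placeholders, not APIs from the paper's released code.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError

def google_search(query: str, num_results: int = 5) -> list[str]:
    """Placeholder for a web-search API returning result snippets."""
    raise NotImplementedError

def verify_response(response: str) -> list[dict]:
    # 1. Decompose the response into individually checkable claims.
    claims = call_llm(
        f"List each atomic factual claim in the text below, one per line:\n{response}"
    ).splitlines()

    verdicts = []
    for claim in claims:
        # 2. Retrieve external evidence for the claim.
        snippets = "\n".join(google_search(claim))
        # 3. Ask the model to judge the claim against the evidence.
        verdict = call_llm(
            f"Claim: {claim}\nSearch results:\n{snippets}\n"
            "Answer 'supported' or 'not supported'."
        )
        verdicts.append({"claim": claim, "verdict": verdict.strip()})
    return verdicts
```

Note that step 2 is exactly where the proprietary-data concern above bites: every claim, including ones derived from your private RAG corpus, is sent out as a search query.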
2
2
New comment Apr 15
0 likes • Apr 15
I think one could argue that this represents a form of RAG over the web documents returned by Google search. Technically, any time external documents are used to help an LLM generate a final answer, we can classify the process as RAG. I think it's about time we moved beyond vector search. Maybe similar techniques to those used in the paper could be used to verify the consistency of statements with respect to a static, predefined knowledge base as well.
Hands on Problem solving with AI
We would love to hear what problems you would like to solve using Generative AI. Leave a comment telling us what you would like to learn to do!
2
8
New comment Apr 10
Hands on Problem solving with AI
2 likes • Mar 27
I'm currently diving deep into a RAG (Retrieval-Augmented Generation) project. I'm trying to create a custom avatar/chatbot based on my grandfather's 200-page autobiography. I was initially hopeful that an LLM with a sufficiently large context window, such as Gemini 1.5, would be sufficient for answering questions about the text without having to resort to vector-based RAG pipelines. Unfortunately, Gemini 1.5's guardrails impose severe restrictions on the content it is able to engage with. In my case, I asked Gemini: "What was the most dangerous thing that happened to my grandfather, which almost cost him his life?" I was hoping that Gemini would reference this incident: "as we stuck our heads up over the hole to see where our other team members were located, I experienced a bullet whistling by my ear." Unfortunately, Gemini usually refused to answer because the probability of generating a "harmful response" was too high. Other times, if I reworded the prompt, it would attempt to answer but would give incorrect responses. Overall, I am sad to say that the usefulness of Gemini's large context window has been severely overhyped, especially for historical documents that touch on sensitive topics.

In any case, experiments are showing that RAG still outperforms unenhanced LLMs in the long-context domain. A great overview of these experiments is given here: https://youtu.be/UlmyyYQGhzc?si=t0R27yMwMrHzRTVO

So now I am looking into RAG-based solutions. I have tried several "chat with your documents" style apps, but none of them were able to correctly identify the incident in which my grandfather nearly took a bullet to the head as the "most dangerous". Thus I am now looking into creating my own custom RAG pipeline with LangChain. These past 2 days I have learned much about advanced chunking and retrieval techniques from YouTube, and I'm excited to begin experimenting on my own. Below are two videos that I've found especially helpful.
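For anyone attempting something similar, a minimal LangChain pipeline over a single long document might look like the sketch below; the file path, chunk sizes, retriever `k`, and model name are illustrative assumptions, not a tested configuration:

```python
# Minimal sketch of a RAG pipeline over one long document, assuming the
# OpenAI and FAISS integrations for LangChain are installed. All parameter
# values here are illustrative guesses, not tuned settings.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

with open("autobiography.txt") as f:  # hypothetical path to the memoir text
    text = f.read()

# Split into overlapping chunks so an incident isn't cut in half at a boundary.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(text)

# Embed the chunks and index them for similarity search.
store = FAISS.from_texts(chunks, OpenAIEmbeddings())

# Caveat: superlative questions like "most dangerous" are exactly where naive
# similarity search struggles, as described in the comment above.
query = "What was the most dangerous thing that happened to my grandfather?"
docs = store.similarity_search(query, k=4)
context = "\n\n".join(d.page_content for d in docs)

llm = ChatOpenAI(model="gpt-4o")  # any chat model would do here
answer = llm.invoke(
    f"Answer using only the excerpts below.\n\nExcerpts:\n{context}\n\nQuestion: {query}"
)
print(answer.content)
```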
1 like • Apr 10
@Jaymz Bay To a large extent this is already possible with current tools. Keep us updated!
"Multi-Candidate Needle Prompting" for large context LLMs (Gemini 1.5)
Gemini 1.5's groundbreaking 1M-token context window is a remarkable advancement in LLMs, providing capabilities unlike any other currently available model. With its 1M context window, Gemini 1.5 can ingest the equivalent of 10 Harry Potter books in one go. However, this enormous context window is not without its limitations. In my experience, Gemini 1.5 often struggles to retrieve the most relevant information from the vast amount of contextual data it has access to. The "Needle in a Haystack" benchmark is a well-known challenge for LLMs, which tests their ability to find specific information within a large corpus of text. This benchmark is particularly relevant for models with large context windows, as they must efficiently search through vast amounts of data to locate the most pertinent information.

To address this issue, I have developed a novel prompting technique that I call "Multi-Candidate Needle Prompting." This approach aims to improve the model's ability to accurately retrieve key information from within its large context window. The technique involves prompting the LLM to identify 10 relevant sentences from different parts of the input text, and then asking it to consider which of these sentences (i.e. candidate needles) is the most pertinent to the question at hand before providing the final answer. This process bears some resemblance to Retrieval-Augmented Generation (RAG), but the key difference is that the entire process is carried out by the LLM itself, without relying on a separate retrieval mechanism.

By prompting the model to consider multiple relevant sentences from various parts of the text, "Multi-Candidate Needle Prompting" promotes a more thorough search of the available information and minimizes the chances of overlooking crucial details. Moreover, requiring the model to explicitly write out the relevant sentences serves as a form of intermediate reasoning, providing insights into the model's thought process. The attached screenshot anecdotally demonstrates the effectiveness of my approach.
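A minimal sketch of what such a prompt could look like, with the wording paraphrased from the description above and `call_gemini` standing in as a hypothetical placeholder for any long-context model API:

```python
# Sketch of a "Multi-Candidate Needle Prompting" prompt, as described above:
# the model surfaces 10 candidate "needle" sentences from across the context,
# picks the most pertinent one, and only then answers. `call_gemini` is a
# hypothetical placeholder for a long-context model call (e.g. Gemini 1.5).

MCNP_TEMPLATE = """You will be given a long document and a question.

Step 1: Quote 10 sentences from DIFFERENT parts of the document that are
most relevant to the question. Number them 1-10.
Step 2: State which single candidate sentence is most pertinent, and why.
Step 3: Answer the question using that sentence (and any supporting candidates).

Document:
{document}

Question:
{question}
"""

def call_gemini(prompt: str) -> str:
    """Placeholder for a 1M-token-context model API call."""
    raise NotImplementedError

def answer_with_mcnp(document: str, question: str) -> str:
    # Steps 1 and 2 force the intermediate reasoning described in the post;
    # the final answer appears at the end of the model's response.
    return call_gemini(MCNP_TEMPLATE.format(document=document, question=question))
```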
1
0
"Multi-Candidate Needle Prompting" for large context LLMs (Gemini 1.5)
The Promise of LLM-Agents for Calendar and To-Do List Management
As someone with ADHD, I struggle to manage my calendar and to-do list effectively. Despite the claims of many "AI" calendar apps, their "AI" features often fall short, likely relying on traditional scheduling heuristics rather than the state-of-the-art LLM-based solutions that we associate with AI today. I recently reviewed several calendar apps, including BeforeSunset AI, Reclaim AI, Motion, and Clockwise, to better understand the current state of the market. Among these platforms, BeforeSunset and Clockwise show the most promise. BeforeSunset AI has several AI chatbot features planned on their roadmap, though their mobile app is currently only available for iPhone. Clockwise AI already has a conversational chatbot calendar assistant, but it is not yet accessible through their mobile app, which has not been updated since September 2023. These apps still have a long way to go.

To illustrate the potential of LLM-powered calendar assistants, I have written a hypothetical conversation between a user and an AI calendar agent. In this scenario, the agent demonstrates a deep understanding of the user's needs, proactively suggesting tasks and adapting to their current state of mind. The agent flexibly alters the user's schedule when new commitments arise, automatically drafting emails to resolve conflicts and ensure a smooth transition between tasks. The agent also uses information from various sources, such as GPS data to confirm the user's location, microphone input to gather context about upcoming events, and motion data from the accelerometer and gyroscope to detect physical activity.

AGENT: Hi Benjamin! Hope you are well. I can see from your GPS data that you made it to your dental appointment. Great job! Now it's time to work on your taxes, like we discussed.

USER: I'm exhausted. I can't tackle that right now.

AGENT: Ok, no problem! Just keep in mind the deadline is in 3 weeks, so we will have to prioritize this task moving forward. Here are some other quick tasks you can work on now that are a little less demanding:
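Purely to make the hypothetical concrete, the decision logic behind such an agent might resemble the sketch below; every field and function here is invented for illustration, since none of the apps mentioned above expose anything like this API:

```python
# Hypothetical sketch of the scheduling logic behind the fictional transcript
# above. All names are invented; no real calendar app exposes this API.
from dataclasses import dataclass

@dataclass
class Context:
    location: str      # e.g. from GPS
    energy_level: str  # inferred from the user's messages

def propose_next_task(ctx: Context, todo: list[dict]) -> dict:
    """Prefer deadline-urgent work, but fall back to low-effort tasks
    when the user reports being exhausted."""
    if ctx.energy_level == "low":
        candidates = [t for t in todo if t["effort"] == "low"]
    else:
        candidates = sorted(todo, key=lambda t: t["days_until_deadline"])
    return candidates[0] if candidates else {"title": "rest"}

todo = [
    {"title": "File taxes", "effort": "high", "days_until_deadline": 21},
    {"title": "Reply to dentist email", "effort": "low", "days_until_deadline": 7},
]
ctx = Context(location="dentist office", energy_level="low")
print(propose_next_task(ctx, todo))  # -> the low-effort email task
```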
5
5
New comment Apr 9
0 likes • Apr 5
@Altaf Rehmani I wish I could demo such a product! Unfortunately the chat transcript in my post is only science fiction. I created it as a way to illustrate my hopes for the future.
Large Language Models Lack True Reasoning, Claims Expert
According to Subbarao Kambhampati, a professor at Arizona State University, the recent claims that large language models (LLMs) like GPT-3, GPT-4, and ChatGPT possess reasoning and planning abilities are unfounded. Prof. Kambhampati conducted experiments testing these LLMs on standard planning tasks and found their empirical performance was poor, especially when object and action names were obfuscated. While fine-tuning the models on planning data can boost performance, he argues this merely converts the task to approximate retrieval rather than true reasoning.

The practice of having humans provide "chain of thought" prompting to steer LLMs is susceptible to the human unintentionally guiding the model, Kambhampati claims. He also expresses skepticism about papers claiming LLMs can self-critique and iteratively improve their own plans and reasoning. While LLMs excel at extracting general planning knowledge and generating ideas, Kambhampati found they struggle to assemble that knowledge into executable plans that properly handle subgoal interactions. Many papers making planning claims either ignore such interactions or rely on human prompting to resolve them, he says.

Instead, Kambhampati proposes using LLMs to extract approximate domain models, which human experts then verify and refine before passing to traditional model-based solvers. This resembles classic knowledge-based AI systems, with LLMs replacing human knowledge engineers, while employing techniques to reason with incomplete models. Overall, the AI expert argues that despite their impressive capabilities, LLMs fundamentally lack true autonomous reasoning and planning abilities as traditionally understood. However, he believes they can productively support these tasks by combining their knowledge extraction and idea generation strengths with external solvers and human oversight. https://cacm.acm.org/blogcacm/can-llms-really-reason-and-plan/
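Kambhampati's proposal amounts to a generate-test loop: the LLM only generates candidate plans or domain models, while an external, sound verifier (or a human expert) certifies them. A hedged sketch of that loop, with `call_llm` and `validate_plan` as hypothetical placeholders (a real system might use a PDDL validator such as VAL):

```python
# Sketch of the LLM-as-generator / external-verifier loop the article
# advocates. `call_llm` and `validate_plan` are placeholders, not real APIs.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion API

def validate_plan(plan: str, domain: str, problem: str) -> tuple[bool, str]:
    raise NotImplementedError  # external, sound verifier (e.g. a PDDL validator)

def plan_with_verifier(domain: str, problem: str, max_rounds: int = 5) -> str | None:
    prompt = f"Domain:\n{domain}\n\nProblem:\n{problem}\n\nPropose a step-by-step plan."
    for _ in range(max_rounds):
        plan = call_llm(prompt)
        ok, critique = validate_plan(plan, domain, problem)
        if ok:
            return plan  # the verifier, not the LLM, certifies correctness
        # Feed the verifier's critique back: the LLM generates ideas,
        # but never self-certifies its own reasoning.
        prompt += f"\n\nPrevious plan:\n{plan}\nVerifier feedback:\n{critique}\nRevise."
    return None  # no certified plan found within the round budget
```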
0
2
New comment Apr 5
0 likes • Apr 4
What do you think of this? I'm skeptical, but I will have to read it thoroughly. My impression is that LLMs can indeed do limited reasoning; I would be shocked to learn otherwise.
Benjamin Bush
3
45 points to level up
@benjamin-bush-7904
PhD in Systems Science, SUNY Binghamton (2017)
Graduate Certificate in Complex Systems (2013)
https://www.youtube.com/watch?v=SzbKJWKE_Ss

Active 9d ago
Joined Feb 15, 2024
ISFP
Los Alamitos, CA