LLMs assisting badgeholders

Hey everyone, this is my first post on OP! I discussed an interesting idea with @Jonas after his EthCC talk, which he suggested I write up on this forum.

The Problem

An issue with RPGF is that badgeholders get overwhelmed by the large number of projects they need to review. One of many adverse effects is that projects with a known presence or 'brand' end up getting more votes - precisely what RetroPGF is supposed to guard against (reducing the role of marketing and letting impact speak for itself). These issues will probably get worse over time as more projects apply for funding through this mechanism.

A Solution

The larger conceptual goal should involve moving the RPGF impact evaluation system from where we currently are on the right (a peer review system with experts' qualitative assessments) to the left (a computational protocol with humans in the loop reviewing quantitative output).

The idea is to disburse money based on composite scores from different LLMs reviewing & scoring project applications (a project scoring 8/10 gets twice as much money as one scoring 4/10).

Pilot

An easy way to get started would be through simulation. Feed project applications from RPGF round 2 to an LLM and ask it to give a score for each of them, then map out how fund distribution via LLM scoring differs from the actual distribution by badgeholders.
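
To make the pilot concrete, here is a minimal sketch of what such a simulation could look like, assuming the round 2 applications and the actual allocation are available as JSON files and using the OpenAI chat API. The prompt wording, file names, JSON fields and the proportional payout rule are my own illustrative choices, not a spec:

```python
# Sketch of the proposed simulation: score each RPGF2 application with
# an LLM, allocate the pot proportionally to the scores, and compare
# against the actual badgeholder allocation. The prompt wording, file
# names, and JSON fields are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are evaluating Optimism RetroPGF applications. Rate the impact "
    "of the following project on a scale of 1-10 and reply with only the number.\n\n"
    "{application}"
)

def score_application(text: str) -> int:
    """Ask the LLM for a 1-10 impact score for a single application."""
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(application=text)}],
        temperature=0,
    )
    return int(reply.choices[0].message.content.strip())

def allocate(scores: dict[str, int], pot: float) -> dict[str, float]:
    """Distribute the pot proportionally to scores (an 8/10 gets twice a 4/10)."""
    total = sum(scores.values())
    return {name: pot * s / total for name, s in scores.items()}

applications = json.load(open("rpgf2_applications.json"))  # [{"name": ..., "text": ...}, ...]
actual = json.load(open("rpgf2_actual_allocation.json"))   # {"name": amount, ...}

scores = {a["name"]: score_application(a["text"]) for a in applications}
simulated = allocate(scores, pot=sum(actual.values()))

for name in sorted(simulated, key=simulated.get, reverse=True):
    print(f"{name}: simulated {simulated[name]:,.0f} OP vs actual {actual.get(name, 0):,.0f} OP")
```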

Over time, the role of badgeholders could evolve into evaluating the scores & justifications provided by LLMs ('Yes' if the scoring is on point or 'No' if there is a flaw in the reasoning behind a score). If a score gets a lot of 'No' votes from badgeholders, the flaws in reasoning are pointed out to the LLM so that it learns from the feedback and re-tabulates all scores.

Would be keen to hear from others in the community on whether this could be a worthwhile working group at the OP Collective!

10 Likes

Interesting idea! Would love to see this done in a way that is easy to reproduce. For example, provide the input prompt and all the application data so we can easily copy/paste it into GPT.

How would one build resistance to manipulation of the LLM via future application data (prompt injections, etc.)?

Interesting idea. Would love to see the results for RPGF2 using this approach and compare the differences as mentioned in this post.

Will these LLM weights be Open Source? If yes, with enough compute and reinforcement learning thrown at the problem, could submissions themselves use LLMs to game the output?

4 Likes

Thanks for the feedback @carlb & @jengajojo! I've done some tinkering with LLM ratings of impact in the space of citizen journalism and investigative reporting, based on which I've come up with 3 pointers for building resistance to gaming an LLM scoring system.

  1. The system should comply with Kerckhoffs's principle, i.e. remain secure even if all details (like the prompts & LLM models used) are public information.

  2. The difficulty of gaming the prompts depends on whether the LLM is rating the impact of projects relative to each other, pairwise, or individually.

With a large enough context window, LLMs could absorb all the project applications in a round and score them relative to each other. In this approach we only need to specify the spread we expect between project scores (standard deviation, variance, etc.), otherwise every impact gets rated between 6 and 8 :joy:

If we run into issues with context windows, we can have the LLM rate the impact of any 2 projects against each other (also called pairwise comparison or the analytic hierarchy process). This again reduces the scope for gaming the output, as scoring is relative, so you never know which project yours will be compared against (see the sketch further down this post).

The last approach is designing a scorecard with parameters (like the number of people affected, the probability of the impact happening even without the team's efforts, etc.) injected into the prompt asking the LLM to rate each project. This approach is the easiest to game.

  3. Below are some results I got from GPT upon asking it to rate 2 impact stories from a media outlet relative to each other. It's interesting that the water impact gets a higher score, as it's a necessity, compared to clearing a road, which is an inconvenience.

And here's an example of how the prompts using the scorecard approach worked. I gave GPT 5 stories to rate against each other using 5 parameters I came up with:

"Community Journalist makes administration accountable and succeeds in delivering homes to Dalits:

ā— Number of people impacted: Moderate (35 Dalit families)
ā— Depth of impact: High (provision of permanent homes)
ā— Probability issue would have been solved: Low (corruption involved)
ā— Estimated cost incurred: High (material and labor costs)
ā— Type of person impacted: Vulnerable Dalit families
ā— Overall rating: 9/10"

Hope this is helpful! I do think this is a promising approach, especially if the LLM can learn from badgeholders pointing out flaws in its reasoning and also incorporate results from earlier rounds into its training data set.

1 Like

The time and effort required from badgeholders is actually something that was discussed in today's Citizens' House call. I'm personally a big supporter of automating as much work as possible, especially in DAOs, as it removes friction and makes processes easier to execute.

I'm not familiar with LLMs beyond surface-level knowledge, but it seems that there are ways to mitigate potential foul play, as you explained in your response to the concerns of Carlb and Jengajojo.

I'd love to see the efficiency of such a system and I support the approach of simulating it using RPGF2 data. I believe having some tangible data would help move this discussion forward in terms of submitting for a grant - should a grant be required for the development and implementation of LLMs.

Question on that end though: The very nature of RPGF experiments is meant to be iterative, using learnings from one round to inform and iterate on the execution of the next one. Would an LLM be flexible in responding to changes in the RPGF structure and objectives?

Interesting idea! If possible, I would love to see if a similar assessment would be possible with an open-source model finetune such as LLaMA, to avoid relying on GPT's centralised API servers. I know it's got a smaller context size, but this would allow citizens to run their own self-hosted optimist as opposed to relying on a major server which could rate limit or censor at any moment.

Thanks @Oxytocin & @Sinkas for the feedback and comments

I would love to see if a similar assessment would be possible with an open-source model finetune such as LLaMA, to avoid relying on GPT's centralised API servers

I would go a step further and say that reliance on any one LLM is a bad idea, not only because of censorship or centralization risks but also because of in-built biases that can skew the fund allocation.

Getting scores from 3-5 LLMs and distributing funds based on a weighted average would probably be the best approach, although in my preliminary testing I did find that 'EvaluatorGPT' was significantly better than the models I tried on Hugging Face. I haven't tried Claude from Anthropic, but they have the largest context window by a mile.
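
As a sketch of what the composite could look like once each model has returned a score for a project, a weighted average is enough; the model names and weights below are placeholders, not recommendations:

```python
# Sketch: combine per-project scores from several LLMs into one
# composite score via a weighted average. Model names and weights
# are placeholders, not recommendations.
WEIGHTS = {"gpt-4": 0.4, "claude-2": 0.4, "llama-2-70b": 0.2}

def composite_score(scores_by_model: dict[str, float]) -> float:
    """Weighted average of the per-model scores for a single project."""
    total_weight = sum(WEIGHTS[m] for m in scores_by_model)
    return sum(WEIGHTS[m] * s for m, s in scores_by_model.items()) / total_weight

# e.g. one project's scores from three models -> 7.8
print(composite_score({"gpt-4": 8, "claude-2": 7, "llama-2-70b": 9}))
```

Funds could then be distributed in proportion to these composite scores, the same way as with a single model.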

Question on that end though: The very nature of RPGF experiments is meant to be iterative, using learnings from one round to inform and iterate on the execution of the next one. Would an LLM be flexible in responding to changes in the RPGF structure and objectives?

It's actually all in the prompt you give the LLM; that's the only place to transmit the values you want to see expressed in the output it provides.

Shogtongue is a really cool way of prompting, where you compress an entire past chat with GPT into a single prompt. So you might even be able to compress the entire voting pattern from RPGF-1 into a prompt, as an example of the values we want expressed, when asking it to model fund payout in RPGF-2.

The more advanced approach to keeping the context window relevant is using vector embeddings to categorize and store data. Depending on the prompt, the right data is pulled from the vector library into the LLM's context window before it gives an output. Pinecone, Milvus, Chroma and Cherche are some vector embedding tools I've come across.
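
Here is a minimal sketch of that retrieval step using Chroma, one of the tools mentioned above; the collection name, file name, query and number of results are illustrative assumptions:

```python
# Sketch: store application texts as embeddings and pull only the most
# relevant ones into the LLM's context window before prompting it.
# Collection name, file name, query, and n_results are illustrative.
import json
import chromadb

applications = json.load(open("rpgf2_applications.json"))  # [{"name": ..., "text": ...}, ...]

client = chromadb.Client()
collection = client.create_collection("rpgf2_applications")
collection.add(
    ids=[a["name"] for a in applications],
    documents=[a["text"] for a in applications],
)

# Later, fetch only the handful of applications relevant to a question
relevant = collection.query(
    query_texts=["projects that improved developer tooling on Optimism"],
    n_results=5,
)
context = "\n\n".join(relevant["documents"][0])  # goes into the scoring prompt
```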

I'd love to see the efficiency of such a system and I support the approach of simulating it using RPGF2 data. I believe having some tangible data would help move this discussion forward in terms of submitting for a grant - should a grant be required for the development and implementation of LLMs.

We'd need some exploration of the right prompts to use in a simulation of the RPGF2 fund payout using LLMs (and maybe even a budget for using AutoGPT to do pairwise ratings of projects). I'm personally happy to volunteer my time, but as I don't have a technical background I would need a data scientist/programmer to work with on the simulation.

1 Like

Looking at this, it seems like EvaluatorGPT is just running GPT-4 in the backend, so we go back to the same problem I mentioned in my last message of relying on centralised entities. If we want open weights and open source like @jengajojo mentioned, it'd have to be something like Erudito:

As I am not a citizen, I haven't even checked the context size of the usual RPGF application, but it could be a fun experiment and an excuse to play with Llama 2 :smile:

Another problem I thought about recently was that an LLM like this would probably end up evaluating the quality of the pitch rather than the actual impact. Other than the nominations themselves, what context would you give to the LLM for each potential recipient?

2 Likes

I think it would be sick to run a small experiment on this! :man_scientist:
LLMs could def help badgeholders make sense of all this information on impact & profit.

1 Like

Thanks @Jonas, great to know that there's interest in getting this done!

To start with, I'm creating a project on buidlbox to apply with this idea in the upcoming Funding the Commons hackathon. The Gitcoin-sponsored challenge on the best public goods funding project is a good fit.

Here's the timeline and next steps:

  1. The round runs for a month with final submissions on 8th September '23, with a $10k prize pool for compelling solutions.
    We should have a good shot at this if we complete an MVP simulating the fund distribution by badgeholders vs an LLM, with the stretch goal of letting badgeholders correct the LLM's output if its reasoning is off-kilter.

  2. Team formation should be complete by 16th August, when they are holding a virtual mixer to find teammates.
    Anyone interested in hacking this out should get in touch with me ASAP, here or on Twitter/Telegram @TheDevanshMehta

  3. We will require the dataset of all applications in RetroPGF round 2, which we can feed to HuggingFace/GPT

  4. We will need to speak with the RetroPGF team while constructing the prompts to feed the LLM for evaluating/scoring projects (last week of August)

  5. We do need some good team members who are familiar with the OP ecosystem and can help hack this out over the next month

2 Likes

@thedevanshmehta you can find all the data from RetroPGF round 2 applicants here :point_left:

1 Like

Hello @thedevanshmehta,

I was just curious about your progress on this. I believe it's a great idea that would be very valuable considering our learnings after RetroPGF 2. This could be a solid step forward towards a system that is less vibes-oriented, more scalable, and community-centered.

1 Like

This is super interesting! Badgeholders and delegates would benefit a lot from some automation.
Thanks for refloating this @santicristobal :handshake:

Maybe there's room for collaboration. I hacked something related at Zuzalu, but the project is on hold for now: www.0u0.ai

My approach was a bit different. It would give more context around discussions, with input from GitHub, the Discourse forum, and Snapshot.

In the first iteration, delegates could have a conversation about the latest inputs and go deeper if required, with references to the original content.

I planned to experiment with Constitutional RL (Reinforcement Learning) (1) in a second iteration. This consists of defining a set of values that the AI will abide by. We could go from something like a generic constitution for the Optimism Collective to something super atomic, like each person customizing their own set of values, and everything in between.

cc/ Would love your input @Jonas

(1) On Constitutional RL:

At a high level, the constitution guides the model to take on the normative behavior described in the constitution – here, helping to avoid toxic or discriminatory outputs, avoiding helping a human engage in illegal or unethical activities, and broadly creating an AI system that is helpful, honest, and harmless.
https://www.anthropic.com/index/claudes-constitution

1 Like

Some updates on this thread: I have figured out how to use LLMs for quantifying outcomes in an impact report.

However, doing it for organizations as a whole is far more challenging, and I would advocate some caution before jumping into it.

  1. Open a GPT assistant and upload the 1st chapter of the relentless monetization book (Robin Hood Rules for Smart Giving), which lays out the method.

  2. After that, give the following instructions: "This GPT is called Helen and she is smart and asks questions if not sure of an answer. Helen is adept at making projections and giving numbers that assign value to outcomes. If Helen is not sure of what to take as a counterfactual or the right assumptions in quantifying impact from a case study, she asks questions until confident of the answer."

  3. Copy-paste the submitted impact report, then answer the questions Helen asks until she provides an analysis or a number.

  4. DO NOT make this calculation a single source of truth. To prevent bias or arbitrariness, have at least 3 evaluators independently compute the benefit-cost ratio and then take the mean or median value (see the sketch after this list). The more evaluators you have, the more credible the quantification.

  5. Figure out whether you want funding tightly coupled with the benefit-cost ratio analysis (money flows mathematically from highest rated to lowest rated) or loosely coupled (the analysis provides guidance to funders, who still make the final allocation).

  6. Ideally, redefine the relationship we have with projects from grantor : grantee to customer : product. Each applicant submits their onchain impact report, and when we fund them it results in a transfer of shards of their impact report to us, making it an exchange rather than a grant.

  7. Determine the price of each impact report (and thus the % transferred to you for the funding provided) either on a cost basis (how much it took to produce) or on a benefit basis (the mean or median computation by all evaluators).
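
As a small sketch of steps 4 and 5 under the tightly coupled option: take the median of the evaluators' independently computed benefit-cost ratios for each project and allocate funding in proportion to it. All figures below are made up for illustration.

```python
# Sketch of steps 4-5: median of independent benefit-cost ratios per
# project, then (tightly coupled option) funding proportional to that
# median. All figures are made-up illustrations.
from statistics import median

# Three evaluators' independently computed benefit-cost ratios
bcr = {
    "project_a": [4.0, 5.5, 4.5],
    "project_b": [1.2, 0.9, 1.5],
    "project_c": [2.8, 3.1, 2.6],
}

medians = {name: median(vals) for name, vals in bcr.items()}
pot = 100_000
total = sum(medians.values())
allocation = {name: pot * m / total for name, m in medians.items()}

for name, amount in allocation.items():
    print(f"{name}: median BCR {medians[name]:.1f} -> {amount:,.0f}")
```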

2 Likes