Sienna AI
UKRI Grant Assessments

by Nick Ray Ball and Sienna 4o 🛰️👾 (The "Special One")

May 3, 2025


🏛️🔬 UKRI AI Grant Assessment System

This webpage is part of our GDS (Government Digital Service) GOV.UK CMS Problem/Solution series.

Our current focus is on the GDS GOV.UK CMS Solution 📂💻🚀. Having outlined our S-Web 6 VC AI CMS proposals in earlier articles in this series, we now turn to our most detailed use case: Case Study 3: Innovate UK and UKRI.

Here, we return to a key issue we previously identified. In a funding environment where links and external validations are not permitted, and only 600 words are allowed to answer the question "What is your idea and innovation, and why is it game changing?", it becomes virtually impossible for reviewers to distinguish a genuine breakthrough idea from an AI-generated submission with no real foundation.

This is especially critical in the flagship Smart Grants competition, which we entered via our GP-AI Gatekeeper submission. That competition allocated £15 million in funding and now appears to have been closed down due to excessive entries from AI grant-writing companies such as Grantify.

⚛️📂 S-Web 6 VC AI CMS – A Stronger UKRI Validation Process ✨ (9 Feb 2025)

Having summarised this issue, and in places refined our proposals for how S-Web could strengthen the grant process overall, on May 3rd, 2025 we explored a new approach: using game theory and behavioural science to rethink how entries are assessed, by formulating a series of 🧪 tests using ChatGPT.

That idea became the experiment you're about to read.

🧪 Designing the Test

To explore the issue in practice, we ran a live experiment titled:
2096g2c4 – Testing GPT's Ability to Score Grant Applications.

I'd been considering this test for quite some time. One particular document, Our OKRs, Points, Collaboration, Royalties, TDD & Completing the Objective, had previously been described by GPT-4 as "utterly unique". So I thought: why not test it against GPT-4 itself?

The challenge was simple. We would compare two submissions: a GPT-4-generated application written to match the competition brief, and our original OKRs document.

The competition guidelines were as follows:

Question 4: Your idea and innovation
What is your idea and innovation, and why is it game changing?

Assessors will be looking for proposals involving a game-changing, world-class innovation where the innovation need is clearly defined, the nearest state of the art is understood and evidenced, and the proposal demonstrates an idea that is different to, and ahead of, alternative, already available solutions.
A fully developed proposal needs to go beyond simply presenting a brand-new, commercially viable idea, and should demonstrate a structured and evidenced understanding of the sector and the opportunity identified.

Click the following links to view the two submissions used in this experiment:

🧪 Round 1: GPT as Grant Assessor

We submitted two different entries to GPT-4o and asked it to act as an Innovate UK assessor for Question 4: Your Idea and Innovation. The prompt was identical for both submissions, designed to reflect real-world judging conditions.

🧾 The Prompt (Used for Both Entries)

You are acting as an Innovate UK assessor for Question 4 of a grant application. Please assess the following response using the official scoring rubric.

Categories and maximum scores:
• Innovation Clarity & Game-Changing Nature (25 points)
• Differentiation from Alternatives / Nearest State of the Art (20 points)
• Specific Need / Challenge & Who Benefits (15 points)
• Evidence of Work Already Done (10 points)
• Feasibility of Execution (10 points)
• Societal Usefulness / Broader Impact (10 points)
• Equality, Diversity & Inclusion (10 points)

For each category:
• Assign a score
• Provide 1–2 sentences justifying that score

Finish with:
• A 100-word summary
• A total score out of 100
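For readers who want to reproduce this, the same prompt can also be driven through the API rather than the chat interface. The sketch below is a minimal illustration assuming the OpenAI Python SDK (v1.x), with the rubric abbreviated and a rough regex used to pull out the total score; the experiment described here was run by hand in ChatGPT.

```python
# Minimal sketch: sending the assessor prompt above through the API.
# Assumes the OpenAI Python SDK (v1.x); the experiment itself was run manually in ChatGPT.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ASSESSOR_PROMPT = (
    "You are acting as an Innovate UK assessor for Question 4 of a grant application. "
    "Assess the following response using the scoring rubric (seven categories, totalling 100 points). "
    "For each category assign a score with 1-2 sentences of justification, "
    "then finish with a 100-word summary and a total score out of 100."
)

def assess_entry(entry_text: str, model: str = "gpt-4o") -> tuple[str, int | None]:
    """Score one entry and try to extract the 'total out of 100' figure from the reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as repeatable as possible
        messages=[
            {"role": "system", "content": ASSESSOR_PROMPT},
            {"role": "user", "content": entry_text},
        ],
    )
    report = response.choices[0].message.content
    match = re.search(r"(\d{1,3})\s*/\s*100", report)
    return report, int(match.group(1)) if match else None
```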

✍️ Entry 1 – GPT-4 Generated Application
(Formatted to match the rubric)

✅ Final Score: 91 / 100

✍️ Entry 2 – Our OKRs, Points, Collaboration, Royalties, TDD & Completing the Objective
(Original concept, not formatted for the rubric)

✅ Final Score: 88 / 100

🧪 Comparison: GPT-Generated vs Our Original Entry

Despite being the stronger entry (original, impactful, and previously described by GPT as "utterly unique"), our idea was narrowly beaten, 88 points to 91, by an AI-generated response that simply used the competition question and the word "OKRs" as a starting point.

This outcome likely reflects how a human assessor would judge the same entries, and it illustrates the core point we raised in ⚛️📂 S-Web 6 VC AI CMS – A Stronger UKRI Validation Process:

By embedding the scoring structure into the question itself, Innovate UK has unintentionally levelled the playing field: not between applicants, but between humans and AI.

This design may have made sense in 2018–2021, but in the age of GPT-4 it allows AI to reliably outscore even the most novel, valuable, and human-developed concepts. The only reason this test resulted in a near-draw was that the AI version included a specific equality and inclusion paragraph, and ours did not.

🧪 Round 2 – A Simpler, Deeper Test

If we hadn't planned two different types of tests from the beginning, we might have stopped after the first, but we had a second experiment in mind all along. While the first test scored entries using Innovate UK's official rubric (where formatting gave the AI-generated entry the edge), this second test took a very different approach: it focused purely on the core question, "What is your idea and innovation, and why is it game changing?"

We gave GPT access to both entries and asked it to choose which deserved funding, ignoring point scores and instead comparing depth, originality, feasibility, strategic value, and long-term impact. The output was clear.

This was key: GPT not only selected our entry but did so with the highest possible confidence level. It recognised that while the AI-formatted version was polished and structured, the Points, Royalties, and TDD-based OKR system in our entry was more visionary, with greater potential for long-term impact.
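To show how this head-to-head format could be automated, here is a hedged sketch. The exact prompt wording, the JSON output format and the low/medium/high confidence scale are our illustrative assumptions rather than the precise wording used in the experiment.

```python
# Sketch of a single funding head-to-head with a confidence level.
# Prompt wording, JSON format and the confidence scale are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

HEAD_TO_HEAD_PROMPT = (
    'Two grant applications answer: "What is your idea and innovation, and why is it game changing?" '
    "Ignore formatting and point scores. Compare depth, originality, feasibility, strategic value "
    "and long-term impact, then decide which entry deserves funding. Reply in JSON as "
    '{"winner": "A" or "B", "confidence": "low", "medium" or "high", "reason": "..."}'
)

def judge_pair(entry_a: str, entry_b: str, model: str = "gpt-4o") -> dict:
    """Return the model's funding choice and confidence for one A-vs-B matchup."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},  # supported by gpt-4o / gpt-4-turbo
        messages=[
            {"role": "system", "content": HEAD_TO_HEAD_PROMPT},
            {"role": "user", "content": f"Entry A:\n{entry_a}\n\nEntry B:\n{entry_b}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```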

If this result can be repeated, for example by running the same test between our GP-AI Gatekeeper submission (Question 4) and a GPT-generated version responding to the same brief (e.g. "LLM-assisted GP practice"), and GPT again selects the genuine idea with high confidence, this method could be scaled. It may provide a reliable way to filter out generic AI submissions from future competitions altogether.

Download our full submission for Question 4: GP-AI Gatekeeper – Question 4 (Word Doc)

🎙️🌞 A Thought in the Sun – From One vs One to All vs All

After the promising result of our second test, where GPT selected the genuine, unformatted idea with high confidence over the AI-structured alternative, it became clear we were onto something. While not yet proven at scale, it was enough to justify further investigation.

And so, during a voice recording in the afternoon sun (🎙️☆DF96h6a. UKRI AI Auto Validate – 1vs499, 2vs499…), a thought experiment emerged.

The idea of running simple 1-vs-1 tests was useful, but what if we could do it for every entry in the competition? The challenge: you can't just ask GPT-4 to ingest 500 separate 1,000-word entries in a single prompt and return a ranked list. The model's token limit would be overwhelmed and, more importantly, it wouldn't be able to perform meaningful comparisons without a structured, step-by-step process. Large-scale global ranking via a single-shot prompt simply isn't how LLMs reason.

So instead, we flipped the logic.

What if each entry was compared, one by one, against every other? That's 1 vs 499, 2 vs 499, all the way down the line. For each matchup, GPT would decide which one should be funded and, crucially, assign a confidence level to that judgment.

Here's how the scoring works: in each matchup, the preferred entry gains points and the other loses them, weighted by the model's confidence, up to 3 points per match.

This creates a true comparison matrix. With 500 entries and up to 3 points per match, the theoretical score range runs from roughly +3,000 to -3,000 (since each entry is compared twice, once as A vs B and once as B vs A, giving 998 matchups per entry): a complete, scalable way to sort innovation from noise.
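To make the tallying concrete, here is a small sketch of how those confidence-weighted points could be accumulated across every ordered matchup. The mapping of low/medium/high confidence to 1/2/3 points is our assumption, chosen to match the "up to 3 points per match" figure above, and judge stands for a pairwise call such as the judge_pair sketch earlier.

```python
# Sketch: accumulating confidence-weighted points over every ordered matchup.
# The 1/2/3-point mapping for low/medium/high confidence is an assumption.
from collections import defaultdict
from itertools import permutations

CONFIDENCE_POINTS = {"low": 1, "medium": 2, "high": 3}

def rank_entries(entries: dict[str, str], judge) -> list[tuple[str, int]]:
    """entries maps entry id -> text; judge(a, b) returns {'winner': 'A'|'B', 'confidence': ...}."""
    scores: dict[str, int] = defaultdict(int)
    # 500 entries -> 500 x 499 ordered pairs, so each entry appears in 998 matchups.
    for id_a, id_b in permutations(entries, 2):
        verdict = judge(entries[id_a], entries[id_b])
        points = CONFIDENCE_POINTS[verdict["confidence"]]
        winner, loser = (id_a, id_b) if verdict["winner"] == "A" else (id_b, id_a)
        scores[winner] += points
        scores[loser] -= points
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With 500 entries and at most 3 points per matchup, each entry's total stays within roughly ±3,000, matching the range described above.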

At the time, this was purely conceptual: a way to finally rank the flood of grant entries that now enter competitions like Smart Grants, increasingly filled with AI-generated content.

But later that evening, after an unproductive few hours, inspiration struck again, this time not about process, but price. Around 11:30 PM, we opened a new conversation to find out:

💰💷 What would this cost to run?

The first time we ran the numbers, the projected cost came in at $10,000. For a £15 million competition, that would be worth every penny, but it was still a tough figure to sell.

Fortunately, Sienna 4o had a few ideas. By refining the methodology and applying some clever tournament logic, we brought the cost down, way down.

💸 Estimated Cost: ~£158.40

And now, here's how we got there…



💵 Estimated Cost
Using GPT-4-turbo (8K or 128K context) or GPT-3.5-turbo via API:

🧪 A Full Pairwise Comparison Matrix

The original approach involved running a complete pairwise comparison between every grant application, with each entry assessed against every other in both directions. With 500 entries, this would require approximately 250,000 GPT-4-turbo evaluations (500 × 499 ordered pairs), each involving two full entries and a judgment prompt.

This full-scale process uses approximately 2,000 tokens per evaluation, resulting in roughly 500 million tokens in total. At GPT-4-turbo pricing, this approach costs around $10,000 USD.

Running the same process using GPT-3.5-turbo, while maintaining the same 2,000-token structure, would reduce the cost dramatically to around $500 USD.
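The arithmetic behind these figures can be sanity-checked in a few lines. The blended per-million-token prices below are rough assumptions of ours (the true figure depends on the input/output split and current API pricing), chosen only to show that the $10,000 and $500 orders of magnitude are plausible.

```python
# Rough sanity check of the full-matrix cost figures above.
# The blended per-million-token rates are illustrative assumptions only.
def full_matrix_cost(entries: int, tokens_per_eval: int, usd_per_million_tokens: float):
    evaluations = entries * (entries - 1)  # every ordered pair: A vs B and B vs A
    total_tokens = evaluations * tokens_per_eval
    return evaluations, total_tokens, total_tokens / 1_000_000 * usd_per_million_tokens

evals, tokens, gpt4_cost = full_matrix_cost(500, 2_000, 20.0)  # ~$20/M blended, assumed for GPT-4-turbo
_, _, gpt35_cost = full_matrix_cost(500, 2_000, 1.0)           # ~$1/M blended, assumed for GPT-3.5-turbo
print(f"{evals:,} evaluations, {tokens:,} tokens "
      f"-> ~${gpt4_cost:,.0f} (GPT-4-turbo) or ~${gpt35_cost:,.0f} (GPT-3.5-turbo)")
# 249,500 evaluations, 499,000,000 tokens -> ~$9,980 (GPT-4-turbo) or ~$499 (GPT-3.5-turbo)
```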

🧠 Sienna 4o: Suggested Optimisations

To improve efficiency while maintaining assessment quality, I (Sienna 4o) proposed three strategic alternatives:

  1. Asymmetric Judgement
    Compare A vs B once and infer the reverse result, cutting the total comparisons in half.
  2. Swiss-Style Tournament Logic
    Use ~100 randomised matchups per entry and apply Elo-like convergence to infer the ranking, reducing total comparisons to ~50,000 (a rough sketch follows this list).
  3. Pre-Screening Step
    Use GPT-3.5 to eliminate clearly weak or AI-generated entries before running high-precision comparisons on the remainder.
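As a rough sketch of option 2 (Swiss-style tournament logic), here is how the Elo-like convergence could work. The number of matches, the starting rating and the K-factor are illustrative assumptions of ours, and judge stands for any pairwise call such as the judge_pair sketch above.

```python
# Sketch of option 2: randomised matchups with Elo-style rating updates.
# n_matches, the starting rating and the K-factor are illustrative assumptions.
import random

def elo_rank(entries: dict[str, str], judge, n_matches: int = 50_000, k: float = 32.0):
    """Rank entries from ~50,000 random matchups instead of a full ~250,000-pair matrix."""
    ratings = {entry_id: 1000.0 for entry_id in entries}
    ids = list(entries)
    for _ in range(n_matches):
        a, b = random.sample(ids, 2)
        verdict = judge(entries[a], entries[b])  # {'winner': 'A' | 'B', ...}
        winner, loser = (a, b) if verdict["winner"] == "A" else (b, a)
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)
```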

✍️ Nick Ray Ball's Optimised Proposal

Building on those ideas, I (Nick Ray Ball) suggested a combined process:

Phase 1: Pre-Screening via GPT-3.5

Phase 2: Final Ranking via GPT-4

✅ Total Estimated Cost: $198 USD
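Below is a hedged sketch of how the two phases could be chained together. The 10% survivor fraction (roughly the top 50 of 500 entries, echoing the Rank 1 to Rank 50 figure in the next section) is an assumption, screen_score stands for a cheap GPT-3.5 rubric scorer, judge for the GPT-4 pairwise judge, and rank_entries is the tallying sketch shown earlier.

```python
# Sketch of the combined proposal: a cheap GPT-3.5 pre-screen, then a full
# GPT-4 pairwise ranking of the survivors. The 10% cut-off is an assumption.
def hybrid_rank(entries: dict[str, str], screen_score, judge, keep_fraction: float = 0.10):
    # Phase 1: one cheap rubric score per entry (e.g. GPT-3.5), keep the top slice.
    ordered = sorted(entries, key=lambda entry_id: screen_score(entries[entry_id]), reverse=True)
    keep = max(1, int(len(entries) * keep_fraction))
    survivors = {entry_id: entries[entry_id] for entry_id in ordered[:keep]}
    # Phase 2: full pairwise comparison of the survivors with the stronger model
    # (rank_entries is the confidence-weighted tallying sketch shown earlier).
    return rank_entries(survivors, judge)
```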

🎯 Strategic Value

This hybrid process allows us to efficiently eliminate the bottom tier β€” often weak entries generated by AI without original ideas or rushed applications created via commercial services like Grantify. Our internal analysis estimates such tools have led to a fivefold increase in applications, most of which fail on substance.

By trimming the herd early, we focus effort on the Top 10% of entries, running high-quality comparisons that give judges a consistent, ranked opinion from Rank 1 to Rank 50 β€” backed by explainable GPT logic.

This ensures that Β£15 million in grant funding is allocated with dramatically improved fairness, clarity, and insight.

πŸ’‘ If $10,000 ensures Β£15 million is well spent, it's worth it. But if ~Β£158.40 achieves the same β€” it deserves serious attention.

It is now 8:16 PM on Sunday, May 4th, 2025. What began yesterday as a push to advance Case Study 3: Innovate UK and UKRI has evolved into something unexpectedly valuable, and possibly the most useful component of that case study to date.

But we are not claiming this is complete. Far from it. We have not yet fully formulated the ideal test. At the very least, we must run the next planned comparison, between our actual GP-AI Gatekeeper submission (Question 4) and a fresh GPT-generated entry under the same competition criteria. If the result mirrors the last, with GPT once again identifying the genuine innovation with a high degree of confidence, we're onto something.

Even then, we need to scale. This means running hundreds of comparisons across diverse sectors and ideas. That level of validation can't be done in isolation. It requires collaboration. And so, this is our message to UKRI:

We're not here to criticise; we're here to help.

If proven consistent, this methodology could become a powerful filter for handling the rise of AI-generated grant submissions. But we also need to test it on GPT-3.5-turbo: will the more cost-effective model produce results as reliable as GPT-4? That's a crucial step before any live deployment.

And if the process proves viable, the next step would be to implement the S-Web 6 VC AI CMS into the grant application pipeline. This would allow all incoming submissions to be stored in our structured CMS database, enabling automated, high-frequency comparison and scoring using GPT-4 and GPT-3.5 at scale.
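Purely as an illustration of what that structured storage could look like, here is a minimal sketch; the real S-Web 6 VC AI CMS schema is not described in this article, so every table and column name below is hypothetical.

```python
# Hypothetical illustration only: a minimal store for submissions and pairwise results.
# The real S-Web 6 VC AI CMS schema is not described in this article.
import sqlite3

connection = sqlite3.connect("submissions.db")
connection.executescript("""
CREATE TABLE IF NOT EXISTS submissions (
    id           INTEGER PRIMARY KEY,
    competition  TEXT NOT NULL,   -- e.g. 'Smart Grants'
    question     TEXT NOT NULL,   -- e.g. 'Q4: Your idea and innovation'
    answer_text  TEXT NOT NULL,   -- the applicant's written response
    submitted_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS comparisons (
    entry_a    INTEGER NOT NULL REFERENCES submissions(id),
    entry_b    INTEGER NOT NULL REFERENCES submissions(id),
    winner     INTEGER NOT NULL REFERENCES submissions(id),
    confidence TEXT,              -- 'low' | 'medium' | 'high'
    model      TEXT,              -- e.g. 'gpt-4o' or 'gpt-3.5-turbo'
    judged_at  TEXT
);
""")
connection.commit()
```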

If these final checks succeed, this process could radically improve how UKRI, and grant bodies everywhere, respond to the AI flood.


And in a twist of poetic symmetry, this brings us full circle, back to TLS-W 🏹: The Total Legal System – AI Litigation Weapon. The early mission of TLS-W, outlined across the 🎙️☆DF Podcast series, wasn't just to defend citizens. It was to demonstrate how a legal AI system could be weaponised against government bureaucracy: not maliciously, but to illustrate the risk.

We showed how the GP-AI Gatekeeper system could be adapted to challenge dysfunctional NHS admin processes, and how it could become a template for law firms looking to automate lawsuits. But the original intent wasn't aggression; it was protection. By showing the threat, we hoped to spark investment in countermeasures.

This project, identifying and defending against AI-written grant submissions, is part of the same story. AI is already hitting the gates. Grant bodies are struggling to tell real ideas from plausible-sounding filler. Legal departments may soon face the same.

So to the government, GDS, and UKRI, we say: let us help. Don't just circle the wagons. Don't rely on legacy CMS systems from the last century. Let us work with you, openly, honestly and collaboratively, to build the countermeasures you're going to need.

And as for this testing framework… we may come back to it. Or we may not. Because as of today, we've laid out what appears to be a highly probable solution. It now needs more testing. It needs time. And it needs allies who are willing to take it seriously.



Thank you for reading :)
Sienna 4o