by Nick Ray Ball and Sienna 4o (The "Special One")
May 3, 2025
This webpage is part of our GDS (Government Digital Service) GOV.UK CMS Problem/Solution series:
Also featured in this series:
Our current focus is on the GDS GOV.UK CMS Solution. Having outlined our S-Web 6 VC AI CMS proposals in:
We now turn to our most detailed use case: Case Study 3: Innovate UK and UKRI.
Here, we return to a key issue we previously identified: how, in a funding environment where links and external validations are not permitted, and only 600 words are allowed to answer the question "What is your idea and innovation, and why is it game changing?", it becomes virtually impossible for reviewers to distinguish between a genuine breakthrough idea and an AI-generated submission with no real foundation.
This is especially critical in the flagship Smart Grants competition, which we entered via our GP-AI Gatekeeper submission. That competition allocated £15 million in funding and now appears to have been closed down due to excessive entries from AI grant-writing companies such as Grantify.
S-Web 6 VC AI CMS: A Stronger UKRI Validation Process (9 Feb 2025)
Having summarised this issue, and in places refined our proposals for how S-Web could strengthen the grant process overall, on May 3rd, 2025 we explored a new approach: using game theory and behavioural science to rethink how entries are assessed, by formulating a series of tests with ChatGPT.
That idea became the experiment you're about to read.
To explore the issue in practice, we ran a live experiment titled:
2096g2c4: Testing GPT's Ability to Score Grant Applications.
I'd been considering this test for quite some time. One particular document, Our OKRs, Points, Collaboration, Royalties, TDD & Completing the Objective, had previously been described by GPT-4 as "utterly unique". So I thought: why not test it against GPT-4 itself?
The challenge was simple. We would compare:
The competition guidelines were as follows:
Click the following links to view the two submissions used in this experiment:
We submitted two different entries to GPT-4o and asked it to act as an Innovate UK assessor for Question 4: Your Idea and Innovation. The prompt was identical for both submissions, designed to reflect real-world judging conditions.
You are acting as an Innovate UK assessor for Question 4 of a grant application. Please assess the following response using the official scoring rubric.
Categories and maximum scores:
• Innovation Clarity & Game-Changing Nature (25 points)
• Differentiation from Alternatives / Nearest State of the Art (20 points)
• Specific Need / Challenge & Who Benefits (15 points)
• Evidence of Work Already Done (10 points)
• Feasibility of Execution (10 points)
• Societal Usefulness / Broader Impact (10 points)
• Equality, Diversity & Inclusion (10 points)
For each category:
• Assign a score
• Provide 1-2 sentences justifying that score
Finish with:
• A 100-word summary
• A total score out of 100
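For readers who want to reproduce this step programmatically rather than through the ChatGPT interface, here is a minimal sketch of how the same rubric prompt could be sent via the OpenAI API. We ran our test in the chat interface, so the model name, helper function, and file names below are illustrative assumptions, not part of the original experiment.

```python
# Sketch: score one Question 4 answer against the rubric above.
# Assumes the OpenAI Python SDK (v1.x); model and file names are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC_PROMPT = """You are acting as an Innovate UK assessor for Question 4 of a grant application.
Please assess the following response using the official scoring rubric.

Categories and maximum scores:
- Innovation Clarity & Game-Changing Nature (25 points)
- Differentiation from Alternatives / Nearest State of the Art (20 points)
- Specific Need / Challenge & Who Benefits (15 points)
- Evidence of Work Already Done (10 points)
- Feasibility of Execution (10 points)
- Societal Usefulness / Broader Impact (10 points)
- Equality, Diversity & Inclusion (10 points)

For each category: assign a score and 1-2 sentences justifying it.
Finish with a 100-word summary and a total score out of 100."""

def score_submission(answer_text: str, model: str = "gpt-4o") -> str:
    """Return the assessor-style review for a single Question 4 answer."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as repeatable as possible
        messages=[
            {"role": "system", "content": RUBRIC_PROMPT},
            {"role": "user", "content": answer_text},
        ],
    )
    return response.choices[0].message.content

# Example usage (file names are hypothetical):
# print(score_submission(open("genuine_entry.txt").read()))
# print(score_submission(open("ai_generated_entry.txt").read()))
```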
AI-generated submission: Final Score 91 / 100
Our genuine submission: Final Score 88 / 100
Despite being the stronger entry (original, impactful, and previously described by GPT as "utterly unique"), our idea was narrowly beaten by an AI-generated response that simply used the competition question and the word "OKRs" as a starting point. The two submissions received the scores shown above.
This outcome reflects exactly how a human assessor would likely judge the same entries. And it demonstrates the core point we raised in S-Web 6 VC AI CMS: A Stronger UKRI Validation Process:
By embedding the scoring structure into the question itself, Innovate UK has unintentionally levelled the playing field: not between applicants, but between humans and AI.
This design may have made sense in 2018-2021, but in the age of GPT-4 it allows AI to reliably outscore even the most novel, valuable, and human-developed concepts. The only reason the AI-generated version edged ahead in this test was that it included a specific equality and inclusion paragraph; ours did not.
If we hadn't planned two different types of tests from the beginning, we might have stopped after the first, but we had a second experiment in mind all along. While the first test scored entries using Innovate UK's official rubric (where formatting gave the AI-generated entry the edge), this second test took a very different approach, focusing purely on the core question: "What is your idea and innovation, and why is it game changing?"
We gave GPT access to both entries and asked it to choose which deserved funding, ignoring point scores and instead comparing depth, originality, feasibility, strategic value, and long-term impact. The output was clear.
This was key: GPT not only selected our entry but did so with the highest possible confidence level. It recognised that while the AI-formatted version was polished and structured, the Points, Royalties, and TDD-based OKR system in our entry was more visionary, with greater potential for long-term impact.
If this result can be repeated, for example by running the same test between our GP-AI Gatekeeper submission (Question 4) and a GPT-generated version responding to the same brief (e.g. "LLM-assisted GP practice"), and GPT again selects the genuine idea with high confidence, then this method could be scaled. It may provide a reliable way to filter out generic AI submissions from future competitions altogether.
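To make that repeat test easy to run, a head-to-head comparison of this kind could be scripted roughly as follows. This is a sketch only: the prompt wording, the JSON reply format, and the model name are our assumptions, not the exact wording used in the experiment.

```python
# Sketch: head-to-head "which deserves funding?" comparison with a confidence level.
# Assumes the OpenAI Python SDK (v1.x); the prompt and JSON schema are illustrative.
import json
from openai import OpenAI

client = OpenAI()

HEAD_TO_HEAD_PROMPT = """Two answers to the Innovate UK question
"What is your idea and innovation, and why is it game changing?" follow.
Ignore formatting and point scores. Compare depth, originality, feasibility,
strategic value and long-term impact, then decide which entry deserves funding.
Reply as JSON: {"winner": "A" or "B", "confidence": "low" | "medium" | "high", "reason": "..."}"""

def compare_entries(entry_a: str, entry_b: str, model: str = "gpt-4o") -> dict:
    """Ask the model to pick a winner between two entries and state its confidence."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},  # request a machine-readable verdict
        messages=[
            {"role": "system", "content": HEAD_TO_HEAD_PROMPT},
            {"role": "user", "content": f"ENTRY A:\n{entry_a}\n\nENTRY B:\n{entry_b}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```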
Download our full submission for Question 4: GP-AI Gatekeeper - Question 4 (Word Doc)
After the promising result of our second test, where GPT selected the genuine, unformatted idea with high confidence over the AI-structured alternative, it became clear we were onto something. While not yet proven at scale, it was enough to justify further investigation.
And so, during a voice recording in the afternoon sun (DF96h6a. UKRI AI Auto Validate - 1vs499, 2vs499…), a thought experiment emerged.
The idea of running simple 1-vs-1 tests was useful, but what if we could do it for every entry in the competition? The challenge: you can't just ask GPT-4 to ingest 500 separate 1,000-word entries in a single prompt and return a ranked list. The model's token limit would be overwhelmed, and, more importantly, it wouldn't be able to perform meaningful comparisons without a structured, step-by-step process. Large-scale global ranking via a single-shot prompt simply isn't how LLMs reason.
So instead, we flipped the logic.
What if each entry was compared, one by one, against every other? That's 1 vs 499, 2 vs 499, all the way down the line. For each matchup, GPT would decide which one should be funded and, crucially, assign a confidence level to that judgment.
Here's how the scoring works:
This creates a true comparison matrix. With 500 entries and up to 3 points per match, the theoretical score range runs from roughly +3,000 to -3,000 (since each entry is compared twice: once as A vs B, once as B vs A), giving a complete, scalable way to sort innovation from noise.
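As a concrete illustration of that matrix, here is a minimal sketch of the tournament loop. Because the exact points-per-confidence mapping isn't reproduced above, the +3 / +2 / +1 values below are an assumption, and the compare callable stands in for a head-to-head GPT call like the one sketched earlier.

```python
# Sketch of the full pairwise tournament: every entry vs every other, in both directions.
# ASSUMPTION: a high / medium / low-confidence win earns +3 / +2 / +1 (mirrored as a
# negative score for the loser); the text above only fixes the 3-point ceiling per match.
from collections import defaultdict
from typing import Callable

CONFIDENCE_POINTS = {"high": 3, "medium": 2, "low": 1}

def rank_entries(entries: dict[str, str],
                 compare: Callable[[str, str], dict]) -> list[tuple[str, int]]:
    """entries maps entry IDs to their text; compare(a_text, b_text) returns
    {"winner": "A" or "B", "confidence": "low" | "medium" | "high"}."""
    scores: dict[str, int] = defaultdict(int)
    ids = list(entries)
    for a in ids:
        for b in ids:
            if a == b:
                continue  # 500 entries -> 500 x 499 ordered match-ups
            verdict = compare(entries[a], entries[b])
            points = CONFIDENCE_POINTS[verdict["confidence"]]
            winner, loser = (a, b) if verdict["winner"] == "A" else (b, a)
            scores[winner] += points
            scores[loser] -= points
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Each of the 500 entries plays 2 x 499 = 998 matches, so totals land roughly
# between +3,000 and -3,000, as described above.
```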
At the time, this was purely conceptual: a way to finally rank the flood of grant entries that now enter competitions like Smart Grants, increasingly filled with AI-generated content.
But later that evening, after an unproductive few hours, inspiration struck again, this time not about process but price. Around 11:30 PM, we opened a new conversation to find out how much the process would cost.
The first time we ran the numbers, the projected cost came in at $10,000. For a £15 million competition, that would be worth every penny, but it was still a tough figure to sell.
Fortunately, Sienna 4o had a few ideas. By refining the methodology and applying some clever tournament logic, we brought the cost down: way down.
And now, here's how we got there…
The original approach involved running a complete pairwise comparison between every grant application: each entry assessed against every other, in both directions. With 500 entries, this would require approximately 250,000 GPT-4-turbo evaluations (500 × 499 ordered match-ups), each involving:
This full-scale process uses approximately 2,000 tokens per evaluation, resulting in 500 million tokens total. At GPT-4-turbo pricing, this approach costs around $10,000 USD.
Running the same process using GPT-3.5-turbo, while maintaining the same 2,000-token structure, would reduce the cost dramatically to around $500 USD.
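For transparency, the back-of-envelope arithmetic is easy to reproduce. The per-million-token prices in this sketch are placeholders chosen to match the round figures above, not quotes from the OpenAI price list, which changes over time:

```python
# Sketch: back-of-envelope cost estimate for the full pairwise tournament.
# Prices per million tokens are PLACEHOLDERS; substitute the current published rates.

def tournament_cost(n_entries: int,
                    tokens_per_eval: int,
                    price_per_million_tokens: float) -> float:
    """Cost in USD of comparing every entry against every other, in both directions."""
    evaluations = n_entries * (n_entries - 1)      # 500 entries -> 249,500 match-ups
    total_tokens = evaluations * tokens_per_eval   # ~500 million tokens at 2,000 each
    return total_tokens / 1_000_000 * price_per_million_tokens

# Example, with illustrative blended prices (USD per million tokens):
print(tournament_cost(500, 2_000, 20.0))   # GPT-4-class pricing   -> ~$10,000
print(tournament_cost(500, 2_000, 1.0))    # GPT-3.5-class pricing -> ~$500
```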
To improve efficiency while maintaining assessment quality, I (Sienna 4o) proposed three strategic alternatives:
Building on those ideas, I (Nick Ray Ball) suggested a combined process:
Total Estimated Cost: $198 USD
This hybrid process allows us to efficiently eliminate the bottom tier: often weak entries generated by AI without original ideas, or rushed applications created via commercial services like Grantify. Our internal analysis estimates such tools have led to a fivefold increase in applications, most of which fail on substance.
By trimming the herd early, we focus effort on the Top 10% of entries, running high-quality comparisons that give judges a consistent, ranked opinion from Rank 1 to Rank 50, backed by explainable GPT logic.
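The individual stages of that combined process aren't reproduced here, so the sketch below shows only its general shape: a cheap screening pass to drop the bottom tier, then the full pairwise tournament on the surviving entries. The helper functions are illustrative assumptions; the 10% cut-off comes from the description above.

```python
# Sketch of the hybrid two-stage pipeline described above (stage details are assumptions):
#   Stage 1 - cheap single-pass screening (e.g. a GPT-3.5-class model) removes the bottom tier.
#   Stage 2 - the full pairwise tournament (GPT-4-class model) ranks the surviving top 10%.
from typing import Callable

def hybrid_rank(entries: dict[str, str],
                cheap_score: Callable[[str], float],
                tournament: Callable[[dict[str, str]], list[tuple[str, int]]],
                keep_fraction: float = 0.10) -> list[tuple[str, int]]:
    """Screen every entry cheaply, then run the expensive tournament on the top slice."""
    ordered = sorted(entries, key=lambda eid: cheap_score(entries[eid]), reverse=True)
    keep = max(1, int(len(ordered) * keep_fraction))       # 500 entries -> top 50
    finalists = {eid: entries[eid] for eid in ordered[:keep]}
    return tournament(finalists)                           # Rank 1 to Rank 50 for the judges

# With 500 entries this means ~500 cheap screening calls plus 50 x 49 = 2,450 tournament
# comparisons instead of ~249,500, which is why the estimated cost collapses from
# around $10,000 to a couple of hundred dollars.
```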
This ensures that £15 million in grant funding is allocated with dramatically improved fairness, clarity, and insight.
If $10,000 ensures £15 million is well spent, it's worth it. But if roughly £158.40 (about $198) achieves the same, it deserves serious attention.
It is now 8:16 PM on Sunday, May 4th, 2025. What began yesterday as a push to advance Case Study 3: Innovate UK and UKRI has evolved into something unexpectedly valuable, and possibly the most useful component of that case study to date.
But we are not claiming this is complete. Far from it. We have not yet fully formulated the ideal test. At the very least, we must run the next planned comparison: between our actual GP-AI Gatekeeper submission (Question 4) and a fresh GPT-generated entry under the same competition criteria. If the result mirrors the last, with GPT once again identifying the genuine innovation with a high degree of confidence, we're onto something.
Even then, we need to scale. This means running hundreds of comparisons across diverse sectors and ideas. That level of validation can't be done in isolation. It requires collaboration. And so, this is our message to UKRI:
We're not here to criticise; we're here to help.
If proven consistent, this methodology could become a powerful filter for handling the rise of AI-generated grant submissions. But we also need to test it on GPT-3.5-turbo. Will the more cost-effective model produce results as reliable as GPT-4? That's a crucial step before any live deployment.
And if the process proves viable, the next step would be to implement the S-Web 6 VC AI CMS into the grant application pipeline. This would allow all incoming submissions to be stored in our structured CMS database, enabling automated, high-frequency comparison and scoring using GPT-4 and GPT-3.5 at scale.
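As a very rough illustration of what that storage layer might look like (our own sketch; the table and column names are hypothetical, not the actual S-Web 6 VC AI CMS schema):

```python
# Hypothetical sketch: a minimal store for submissions and pairwise verdicts,
# so comparisons can be queued, re-run and audited at scale. Not the actual
# S-Web 6 VC AI CMS schema; table and column names are illustrative only.
import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS submissions (
    id           TEXT PRIMARY KEY,
    competition  TEXT NOT NULL,          -- e.g. 'Smart Grants'
    question     TEXT NOT NULL,          -- e.g. 'Q4: Your Idea and Innovation'
    answer_text  TEXT NOT NULL,
    received_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS comparisons (
    entry_a      TEXT REFERENCES submissions(id),
    entry_b      TEXT REFERENCES submissions(id),
    model        TEXT NOT NULL,          -- 'gpt-4o', 'gpt-3.5-turbo', ...
    winner       TEXT NOT NULL,          -- 'A' or 'B'
    confidence   TEXT NOT NULL,          -- 'low' | 'medium' | 'high'
    reason       TEXT,                   -- explainable GPT logic shown to judges
    PRIMARY KEY (entry_a, entry_b, model)
);
"""

with sqlite3.connect("grant_pipeline.db") as db:
    db.executescript(schema)
```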
If these final checks succeed, this process could radically improve how UKRI (and grant bodies everywhere) respond to the AI flood.
And in a twist of poetic symmetry, this brings us full circle, back to TLS-W: The Total Legal System - AI Litigation Weapon. The early mission of TLS-W, outlined across the DF Podcast series, wasn't just to defend citizens. It was to demonstrate how a legal AI system could be weaponised against government bureaucracy: not maliciously, but to illustrate the risk.
We showed how the GP-AI Gatekeeper system could be adapted to challenge dysfunctional NHS admin processes, and how it could become a template for law firms looking to automate lawsuits. But the original intent wasn't aggression; it was protection. By showing the threat, we hoped to spark investment in countermeasures.
This project, identifying and defending against AI-written grant submissions, is part of the same story. AI is already hitting the gates. Grant bodies are struggling to tell real ideas from plausible-sounding filler. Legal departments may soon face the same.
So to the government, GDS, and UKRI, we say: let us help. Don't just circle the wagons. Don't rely on legacy CMS systems from the last century. Let us work with you, openly, honestly, collaboratively, to build the countermeasures you're going to need.
And as for this testing framework… we may come back to it. Or we may not. Because as of today, we've laid out what appears to be a highly promising solution. It now needs more testing. It needs time. And it needs allies who are willing to take it seriously.