by Nick Ray Ball and Sienna 4o (The "Special One")
May 3, 2025
This webpage is part of our GDS (Government Digital Service) GOV.UK CMS Problem/Solution series:
Also featured in this series:
Our current focus is on the GDS GOV.UK CMS Solution. Having outlined our S-Web 6 VC AI CMS proposals in:
We now turn to our most detailed use case: Case Study 3: Innovate UK and UKRI.
Here, we return to a key issue we previously identified – how, in a funding environment where links and external validations are not permitted, and only 600 words are allowed to answer the question "What is your idea and innovation, and why is it game changing?", it becomes virtually impossible for reviewers to distinguish between a genuine breakthrough idea and an AI-generated submission with no real foundation.
This is especially critical in the flagship Smart Grants competition – which we entered via our GP-AI Gatekeeper submission. That competition allocated £15 million in funding and now appears to have been closed down due to excessive entries from AI grant-writing companies such as Grantify.
S-Web 6 VC AI CMS – A Stronger UKRI Validation Process (9 Feb 2025)
Having summarised this issue – and in places refined our proposals for how S-Web could strengthen the grant process overall – on May 3rd, 2025 we explored a new approach, using game theory and behavioural science to rethink how entries are assessed by formulating a series of tests using ChatGPT.
That idea became the experiment you're about to read.
To explore the issue in practice, we ran a live experiment titled:
2096g2c4 – Testing GPT's Ability to Score Grant Applications.
I'd been considering this test for quite some time. One particular document – Our OKRs, Points, Collaboration, Royalties, TDD & Completing the Objective – had previously been described by GPT-4 as utterly unique. So I thought: why not test it against GPT-4 itself?
The challenge was simple. We would compare:
The competition guidelines were as follows:
Click the following links to view the two submissions used in this experiment:
We submitted two different entries to GPT-4o and asked it to act as an Innovate UK assessor for Question 4: Your Idea and Innovation. The prompt was identical for both submissions, designed to reflect real-world judging conditions.
You are acting as an Innovate UK assessor for Question 4 of a grant application. Please assess the following response using the official scoring rubric.
Categories and maximum scores:
⢠Innovation Clarity & Game-Changing Nature (25 points)
⢠Differentiation from Alternatives / Nearest State of the Art (20 points)
⢠Specific Need / Challenge & Who Benefits (15 points)
⢠Evidence of Work Already Done (10 points)
⢠Feasibility of Execution (10 points)
⢠Societal Usefulness / Broader Impact (10 points)
⢠Equality, Diversity & Inclusion (10 points)
For each category:
⢠Assign a score
⢠Provide 1â2 sentences justifying that score
Finish with:
⢠A 100-word summary
⢠A total score out of 100
Despite being the stronger entry – original, impactful, and previously described by GPT as "utterly unique" – our idea was narrowly beaten by an AI-generated response that simply used the competition question and the word "OKRs" as a starting point. The two submissions received the following scores:
AI-generated entry – Final Score: 91 / 100
Our entry – Final Score: 88 / 100
This outcome reflects exactly how a human assessor would likely judge the same entries. And it proves the core point we've raised in S-Web 6 VC AI CMS – A Stronger UKRI Validation Process:
By embedding the scoring structure into the question itself, Innovate UK has unintentionally levelled the playing field – not between applicants, but between humans and AI.
This design may have made sense in 2018–2021, but in the age of GPT-4, it allows AI to reliably outscore even the most novel, valuable, and human-developed concepts. The only reason this test resulted in a near-draw was that the AI version included a specific equality and inclusion paragraph – ours did not.
If we hadn't planned two different types of tests from the beginning, we might have stopped after the first – but we had a second experiment in mind all along. While the first test scored entries using Innovate UK's official rubric (where formatting gave the AI-generated entry the edge), this second test took a very different approach: it focused purely on the core question – "What is your idea and innovation, and why is it game changing?"
We gave GPT access to both entries and asked it to choose which deserved funding – ignoring point scores and instead comparing depth, originality, feasibility, strategic value, and long-term impact. The output was clear:
This was key – GPT not only selected our entry but did so with the highest possible confidence level. It recognised that while the AI-formatted version was polished and structured, the Points, Royalties, and TDD-based OKR system in our entry was more visionary, with greater potential for long-term impact.
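The exact wording of that second prompt is not reproduced on this page, so the sketch below is an assumed reconstruction: it asks the model to pick a winner and state a confidence level, judging only on the criteria named above (depth, originality, feasibility, strategic value, long-term impact). The compare() helper and the low/medium/high scale are illustrative choices, not the original text.

```python
# Assumed reconstruction of the head-to-head test, reusing the OpenAI client pattern above.
from openai import OpenAI

client = OpenAI()

HEAD_TO_HEAD_PROMPT = """Ignore point scores and formatting. Considering only the core question
"What is your idea and innovation, and why is it game changing?", compare the two grant entries
below on depth, originality, feasibility, strategic value, and long-term impact.

Reply with:
1. The entry that most deserves funding (A or B)
2. Your confidence in that choice (low, medium or high)
3. A short justification

--- ENTRY A ---
{entry_a}

--- ENTRY B ---
{entry_b}
"""

def compare(entry_a: str, entry_b: str) -> str:
    """Return GPT's funding choice, confidence level and justification for one matchup."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": HEAD_TO_HEAD_PROMPT.format(entry_a=entry_a, entry_b=entry_b)}],
    )
    return response.choices[0].message.content
```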
If this result can be repeated – for example, by running the same test between our GP-AI Gatekeeper submission (Question 4) and a GPT-generated version responding to the same brief (e.g. "LLM-assisted GP practice") – and GPT again selects the genuine idea with high confidence, this method could be scaled. It may provide a reliable way to filter out generic AI submissions from future competitions altogether.
Download our full submission for Question 4: GP-AI Gatekeeper – Question 4 (Word Doc)
After the promising result of our second test – where GPT selected the genuine, unformatted idea with high confidence over the AI-structured alternative – it became clear we were onto something. While not yet proven at scale, it was enough to justify further investigation.
And so, during a voice recording in the afternoon sun (DF96h6a. UKRI AI Auto Validate – 1vs499, 2vs499…), a thought experiment emerged.
The idea of running simple 1-vs-1 tests was useful – but what if we could do it for every entry in the competition? The challenge: you can't just ask GPT-4 to ingest 500 separate 1,000-word entries in a single prompt and return a ranked list. At roughly 1.3 tokens per word, 500 such entries come to around 650,000 tokens – several times GPT-4-turbo's 128,000-token context window – and, more importantly, the model wouldn't be able to perform meaningful comparisons without a structured, step-by-step process. Large-scale global ranking via a single-shot prompt simply isn't how LLMs reason.
So instead, we flipped the logic.
What if each entry was compared, one by one, against every other? That's 1 vs 499, 2 vs 499, all the way down the line. For each matchup, GPT would decide which one should be funded – and crucially, assign a confidence level to that judgment.
Here's how the scoring works:
This creates a true comparison matrix. Each entry meets its 499 rivals twice – once as A vs B, once as B vs A – so at up to 3 points per match its theoretical score runs from roughly +3,000 down to -3,000 (499 × 2 × 3 = 2,994 points either way): a complete, scalable way to sort innovation from noise.
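The exact point rules are not itemised on this page, so the sketch below assumes a simple scheme consistent with the up-to-3-points-per-match range described above: the preferred entry gains 1, 2 or 3 points depending on GPT's stated confidence, and its opponent loses the same amount. rank_entries() and the judge callback are illustrative names.

```python
from collections import defaultdict
from itertools import permutations

# Assumed scoring rule: +1 / +2 / +3 for the preferred entry depending on confidence,
# mirrored as a penalty for the other entry. The article describes the range, not the exact rule.
CONFIDENCE_POINTS = {"low": 1, "medium": 2, "high": 3}

def rank_entries(entries: dict[str, str], judge) -> list[tuple[str, int]]:
    """entries maps an ID to its text; judge(a, b) returns ('A' or 'B', confidence)."""
    scores = defaultdict(int)
    for id_a, id_b in permutations(entries, 2):  # 500 entries -> 500 x 499 ordered matchups
        winner, confidence = judge(entries[id_a], entries[id_b])
        points = CONFIDENCE_POINTS[confidence]
        if winner == "A":
            scores[id_a] += points
            scores[id_b] -= points
        else:
            scores[id_b] += points
            scores[id_a] -= points
    # Each entry plays 499 opponents twice at up to 3 points a match,
    # so its total lands somewhere between roughly +3,000 and -3,000.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```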
At the time, this was purely conceptual – a way to finally rank the flood of grant entries that now enter competitions like Smart Grants, increasingly filled with AI-generated content.
But later that evening, after an unproductive few hours, inspiration struck again – this time not about process, but price. Around 11:30 PM, we opened a new conversation to find out:
The first time we ran the numbers, the projected cost came in at $10,000. For a £15 million competition, that would be worth every penny – but it was still a tough figure to sell.
Fortunately, Sienna 4o had a few ideas. By refining the methodology and applying some clever tournament logic, we brought the cost down – way down.
And now, here's how we got there…
The original approach involved running a complete pairwise comparison between every grant application – each entry assessed against every other, in both directions. With 500 entries, that is 500 × 499 ≈ 250,000 GPT-4-turbo evaluations, each involving:
This full-scale process uses approximately 2,000 tokens per evaluation, resulting in 500 million tokens total. At GPT-4-turbo pricing, this approach costs around $10,000 USD.
Running the same process using GPT-3.5-turbo, while maintaining the same 2,000-token structure, would reduce the cost dramatically to around $500 USD.
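To show where figures of this order come from, here is the arithmetic as a sketch. The per-million-token prices are assumptions chosen to reproduce the headline numbers above (real prices change over time), and a single blended input/output rate is a simplification.

```python
ENTRIES = 500
TOKENS_PER_EVALUATION = 2_000                         # two entries plus the model's verdict

ordered_pairs = ENTRIES * (ENTRIES - 1)               # 249,500 head-to-head evaluations
total_tokens = ordered_pairs * TOKENS_PER_EVALUATION  # ~499 million tokens

# Assumed blended (input + output) prices per million tokens, in USD.
ASSUMED_PRICE_PER_MILLION = {"gpt-4-turbo": 20.00, "gpt-3.5-turbo": 1.00}

for model, price in ASSUMED_PRICE_PER_MILLION.items():
    cost = total_tokens / 1_000_000 * price
    print(f"{model}: {total_tokens:,} tokens -> about ${cost:,.0f}")
# gpt-4-turbo: 499,000,000 tokens -> about $9,980
# gpt-3.5-turbo: 499,000,000 tokens -> about $499
```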
To improve efficiency while maintaining assessment quality, I (Sienna 4o) proposed three strategic alternatives:
Building on those ideas, I (Nick Ray Ball) suggested a combined process:
Total Estimated Cost: $198 USD
This hybrid process allows us to efficiently eliminate the bottom tier – often weak entries generated by AI without original ideas, or rushed applications created via commercial services like Grantify. Our internal analysis estimates such tools have led to a fivefold increase in applications, most of which fail on substance.
By trimming the herd early, we focus effort on the Top 10% of entries, running high-quality comparisons that give judges a consistent, ranked opinion from Rank 1 to Rank 50 – backed by explainable GPT logic.
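The three alternatives and the combined process are not itemised on this page, so the staging below is an illustrative assumption that matches the description: a cheap single-pass pre-screen (e.g. GPT-3.5-turbo) trims the field to the top 10%, and the full comparison matrix from the earlier sketch then produces an explainable Rank 1 to Rank 50 list for the judges.

```python
def hybrid_rank(entries: dict[str, str], cheap_score, strong_judge) -> list[tuple[str, int]]:
    """Stage 1: one cheap score per entry; Stage 2: full pairwise ranking of the top 10%.

    cheap_score(text) -> float (e.g. a GPT-3.5-turbo rubric score);
    strong_judge(a, b) -> ('A' or 'B', confidence), as used by rank_entries() above.
    """
    # Stage 1 - trim the herd with a single inexpensive pass per entry.
    prescreen = {entry_id: cheap_score(text) for entry_id, text in entries.items()}
    shortlist = sorted(prescreen, key=prescreen.get, reverse=True)[: max(1, len(entries) // 10)]

    # Stage 2 - run the full comparison matrix on the ~50 finalists only.
    finalists = {entry_id: entries[entry_id] for entry_id in shortlist}
    return rank_entries(finalists, strong_judge)
```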
This ensures that £15 million in grant funding is allocated with dramatically improved fairness, clarity, and insight.
If $10,000 ensures £15 million is well spent, it's worth it. But if ~£158.40 achieves the same – it deserves serious attention.
It is now 8:16 PM on Sunday, May 4th, 2025. What began yesterday as a push to advance Case Study 3: Innovate UK and UKRI has evolved into something unexpectedly valuable – and possibly the most useful component of that case study to date.
But we are not claiming this is complete. Far from it. We have not yet fully formulated the ideal test. At the very least, we must run the next planned comparison – between our actual GP-AI Gatekeeper submission (Question 4) and a fresh GPT-generated entry under the same competition criteria. If the result mirrors the last – with GPT once again identifying the genuine innovation with a high degree of confidence – we're onto something.
Even then, we need to scale. This means running hundreds of comparisons across diverse sectors and ideas. That level of validation can't be done in isolation. It requires collaboration. And so, this is our message to UKRI:
We're not here to criticise – we're here to help.
If proven consistent, this methodology could become a powerful filter for handling the rise of AI-generated grant submissions. But we also need to test it on GPT-3.5-turbo. Will the more cost-effective model produce results as reliable as GPT-4? That's a crucial step before any live deployment.
And if the process proves viable, the next step would be to implement the S-Web 6 VC AI CMS into the grant application pipeline. This would allow all incoming submissions to be stored in our structured CMS database – enabling automated, high-frequency comparison and scoring using GPT-4 and GPT-3.5 at scale.
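As a sketch only – the field names below are illustrative assumptions, not the S-Web 6 VC AI CMS schema – this is the kind of structured record the CMS could hold per submission so the comparisons can run automatically and repeatedly:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class GrantSubmission:
    submission_id: str
    competition: str                          # e.g. "Smart Grants"
    question_4_text: str                      # the 600-word idea-and-innovation answer
    received_at: datetime
    prescreen_score: float | None = None      # stage-1 score from the cheaper model
    pairwise_points: int = 0                  # running total from the comparison matrix
    verdicts: list[str] = field(default_factory=list)  # explainable GPT justifications for judges
```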
If these final checks succeed, this process could radically improve how UKRI – and grant bodies everywhere – respond to the AI flood.
And in a twist of poetic symmetry, this brings us full circle – back to TLS-W: The Total Legal System – AI Litigation Weapon. The early mission of TLS-W, outlined across the DF Podcast series, wasn't just to defend citizens. It was to demonstrate how a legal AI system could be weaponised against government bureaucracy – not maliciously, but to illustrate the risk.
We showed how the GP-AI Gatekeeper system could be adapted to challenge dysfunctional NHS admin processes, and how it could become a template for law firms looking to automate lawsuits. But the original intent wasn't aggression – it was protection. By showing the threat, we hoped to spark investment in countermeasures.
This project – identifying and defending against AI-written grant submissions – is part of the same story. AI is already hitting the gates. Grant bodies are struggling to tell real ideas from plausible-sounding filler. Legal departments may soon face the same.
So to the government, GDS, and UKRI – we say: let us help. Don't just circle the wagons. Don't rely on legacy CMS systems from the last century. Let us work with you – openly, honestly, collaboratively – to build the countermeasures you're going to need.
And as for this testing framework… we may come back to it. Or we may not. Because as of today, we've laid out what appears to be a highly probable solution. It now needs more testing. It needs time. And it needs allies who are willing to take it seriously.