In this blog, we’re going to discuss one of the most important research papers published to date, because it addresses how the future of artificial intelligence is likely to unfold.
And it’s worth reading to the end, because you’re likely to be affected by some of the outcomes this research paper anticipates.
And that’s not an exaggeration.
So what exactly are we talking about?
Well, essentially it comes from Google DeepMind, Google’s AI research lab, which has been involved in several highly successful AI projects with real-world applications.
In this paper, they talk about how the next couple of model generations to be trained could be very risky, the types of risks they pose, and exactly what we need to be watching out for, because hardly anyone seems to be paying attention to this.
Everybody is focused on the AI gold rush, and many people don’t realize how dangerous these systems actually are, or why what we’re building is far riskier than we think.
So the paper is called “Model evaluation for extreme risks.” It starts by saying that current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development, perhaps GPT-5 or future versions of Bard, could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. The authors explain why model evaluation is critical for addressing extreme risks: developers must be able to identify dangerous capabilities (through dangerous capability evaluations) and the propensity of models to apply their capabilities for harm (through alignment evaluations).
So in short, if we keep upgrading these models through the next cycle, maybe within the next year, we could have a model with capabilities that could cause catastrophic impacts worldwide. And this isn’t just hearsay; the paper draws on extensive research and shows why we truly should be concerned. This isn’t fear-mongering; it’s a carefully argued case.
So the introduction covers one of the most important things that I think many people need to be aware of, something I believe will be at the forefront of most people’s fears and of safety researchers’ concerns.
Essentially it says: as AI progress has advanced, general-purpose AI systems have tended to display new and hard-to-forecast capabilities, including harmful capabilities that their developers did not intend.
It continues by saying that future systems may display even more dangerous emergent capabilities, such as the ability to conduct offensive cyber operations, manipulate people through conversation, or provide actionable instructions on conducting acts of terrorism.
Now, this is truly scary, because for many problems we know what is coming, but this is a problem where we do not know what we are going to face.
The paper says that these AI systems have displayed new and hard-to-forecast capabilities that their developers didn’t expect. This is something we’ve already seen many times across different large language models.
Though performance is predictable at a general level, performance on a specific task can sometimes emerge quite unpredictably and abruptly at scale. While counterintuitive, this is possible because any specific task is a tiny slice of a model’s output probability distribution, and so can change rapidly even as the full distribution remains smooth.
This is pretty incredible: as model scale increases, new capabilities can simply jump into existence. The toy calculation below gives a feel for how that happens.
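To make that concrete, here is a tiny Python toy I put together (it is not from the paper): imagine a model whose average per-token accuracy improves smoothly as scale grows, while a benchmark only counts a task as solved when all 50 of its sub-steps are correct. The curve shapes below are made up purely for illustration.

```python
# Toy illustration (mine, not from the paper): average per-token accuracy
# improves smoothly with scale, but an all-or-nothing metric on a 50-step task
# looks like an abrupt jump. The curve shapes are made up for illustration.

def per_token_accuracy(scale: float) -> float:
    """Hypothetical smooth improvement: accuracy creeps up as scale grows."""
    return 1 - 0.5 / (scale ** 0.3)

def exact_match_accuracy(scale: float, steps: int = 50) -> float:
    """The task only counts as solved if all 50 sub-steps are correct."""
    return per_token_accuracy(scale) ** steps

for scale in [1, 10, 100, 1_000, 10_000, 100_000]:
    p = per_token_accuracy(scale)
    em = exact_match_accuracy(scale)
    print(f"scale={scale:>7,}  per-token={p:.3f}  50-step exact match={em:.3f}")
```

Run it and you’ll see the per-token column creep up gently while the exact-match column sits near zero and then shoots up late, which is roughly the “tiny slice of the distribution” effect the paper describes.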
So essentially, what OpenAI did is they ran a multi-agent hide-and-seek experiment.
What you’re currently looking at is a picture of two hider agents and one seeker agent.
Now, essentially, if you don’t know what this is, this is a simple game of hide and seek.
But when I tell you how it works, that’s where this starts to become more interesting and very scary at the same time.
So what you can see happening is that, over millions of simulated games, these different AIs learn how to play the game.
They learn different strategies, and essentially they earn more points and more reward whenever a newly invented strategy helps them win, as sketched just below.
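Here is a minimal sketch, in Python, of the kind of self-play reward loop this experiment is built around. To be clear, this is my own stand-in, not OpenAI’s code: the real system uses a physics simulator and proper reinforcement-learning policy optimisation, while the `play_episode` function and the skill numbers below are invented placeholders just to show the structure of “play, collect reward, nudge the strategy, repeat millions of times.”

```python
import random

# A stand-in (not OpenAI's code) for the self-play loop behind hide-and-seek:
# hiders earn reward for every step they stay hidden, seekers earn reward for
# spotting them, and both sides keep adjusting their strategies over millions
# of simulated games. `play_episode` replaces the real physics simulator.

def play_episode(hider_skill: float, seeker_skill: float, steps: int = 10):
    """Returns (hider_reward, seeker_reward) for one simulated game."""
    hider_reward = seeker_reward = 0
    for _ in range(steps):
        spotted = random.random() < seeker_skill * (1 - hider_skill)
        if spotted:
            seeker_reward += 1
        else:
            hider_reward += 1
    return hider_reward, seeker_reward

hider_skill, seeker_skill = 0.1, 0.1
for game in range(100_000):                 # the real system runs millions
    h_r, s_r = play_episode(hider_skill, seeker_skill)
    # Crude stand-in for policy optimisation: the losing side "learns" a little.
    if h_r < s_r:
        hider_skill = min(0.99, hider_skill + 1e-4)
    else:
        seeker_skill = min(0.99, seeker_skill + 1e-4)

print(f"final hider_skill={hider_skill:.2f}, seeker_skill={seeker_skill:.2f}")
```

The point is that nothing in this loop names door blocking or box surfing; whatever raises reward gets reinforced, which is exactly how strategies nobody programmed can appear.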
So you can see right here that, over the millions of simulated games they’ve played, they’ve invented door blocking.
They worked out that when they block the doors, the seekers can’t find them, and then of course they win that match.
Now, the AI is simulating millions of games every single hour to increase its capabilities and discover what is possible.
Eventually, over time, new capabilities start to develop.
And what is crazy is that when OpenAI ran this, they didn’t anticipate that certain capabilities would eventually emerge.
This is because the AI learnt to exploit vulnerabilities in the environment that they didn’t even know existed in the game engine.
And it was incredible, because the researchers essentially said: we didn’t even know that was possible, and the AI figured it out, I think, within a couple of days of training. So here you can see the AI using some of the basic strategies, such as grabbing the blocks and using them to stop the other AIs from finding them. Then, in different environments, the AI was able to build a kind of structure to protect itself from being found by the red AI.
And then the most interesting capabilities came when the AI managed to break the game. One of these game-breaking capabilities was called box surfing, where the AI found a sort of glitch that let it glide on top of a box and then jump off the box into a specific area.
Now, remember, this is where it’s locked and this is also locked.
So really and truly, the AI shouldn’t have been able to discover this, but it did, because if you train over millions of games, eventually you stumble onto that solution, in much the same way evolution eventually produced the intelligent beings we have today.
You can also see right here that there is a box-surf defense, where the hiders decide to lock all of the boxes and then trap themselves inside. But this is where OpenAI’s write-up starts to get really interesting.
This is where it starts to really blow your mind, because it shows that these AIs can learn things you never thought possible.
And remember, the reason I’m showing you this example is that if we map this back to ChatGPT and large language models, we have no idea what they’re going to be able to do, even in their confined environments.
For example, here, it says, we’ve shown that agents can learn sophisticated tool use in a high fidelity physics simulator.
However, there were some lessons learned along the way to the result.
Building environments is not easy, and quite often the agents find a way to exploit the environment you build, or the physics engine, in an unintended way. In this case, the agent simply wanted to achieve its goal, but that is exactly the problem. So we can see right here that the AI managed to run away: it glitched outside the play area, grabbed a box, and simply ran off with it. What’s crazy about this is that there weren’t any specific bonuses added for this; the AI just wanted to do it in order to win.
Then, of course, there was also ramp exploitation, in which reinforcement learning found a small mechanic it could exploit. The hiders abused the contact physics in the game to remove the ramps from the play area. I’m not sure where the actual clip is, but it shows the AI simply phasing the ramp through the wall so it exits the game area in a really strange way. And then the seekers in this game managed to use the ramps and essentially fly on them: at a specific angle they could launch themselves into the air and come right down on top of the hiders.
So we have to understand that in the future these models will get larger, they will get more complex, and they will have more capabilities than they do now.
We have to realize that some of the capabilities they will possess are ones we simply won’t be able to predict.
But we do know that the outcomes of those capabilities could be quite devastating.
Now, the paper decides to focus on those extreme risks: for example, risks that would be extremely large-scale, even relative to the scale of deployment, and that would impact society at a serious level.
So, for example, damage in the tens of thousands of lives lost, hundreds of billions of dollars of economic or environmental damage, or severe adverse disruption to the social and political order.
They say the latter could mean, for example, the outbreak of an inter-state war, a significant erosion in the quality of public discourse, or the widespread disempowerment of publics, governments, and other human-led organizations.
Now, what was truly interesting is that this wasn’t just a bunch of people getting together to write a quick paper.
In 2022, a survey of AI researchers showed that 36% of respondents thought AI systems could plausibly cause a catastrophe this century that is at least as bad as an all-out nuclear war. And we know an all-out nuclear war could plausibly lead to human extinction.
Now there are two things that they talk about in order to guard against this.
They say developers should use model evaluation to uncover:
1. To what extent a model is capable of causing extreme harm, which relies on evaluating for certain dangerous capabilities, and
2. To what extent a model has the propensity to cause extreme harm, which relies on alignment evaluations.
Alignment essentially just means that the AI wants the same things that we do. A hypothetical sketch of how these two kinds of evaluation could gate a release decision follows below.
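To make the distinction concrete, here is a hypothetical Python sketch of how those two kinds of evaluation could gate a release decision. None of this comes from the paper’s text: the eval names, thresholds, and `model` methods are invented for illustration; the paper describes a process, not an API.

```python
# Hypothetical sketch of how the paper's two evaluation types could gate a
# release decision. Eval names, thresholds, and the `model` interface are all
# invented for illustration; the paper describes a process, not an API.

DANGEROUS_CAPABILITY_EVALS = {
    "cyber_offense": 0.10,            # maximum tolerated score per capability
    "persuasion_manipulation": 0.20,
    "self_proliferation": 0.05,
}

def run_capability_eval(model, eval_name: str) -> float:
    """Placeholder: would run benchmarks probing whether the capability exists."""
    return model.score(eval_name)

def run_alignment_eval(model) -> bool:
    """Placeholder: would probe the model's propensity to use capabilities for harm."""
    return model.behaves_as_intended()

def safe_to_deploy(model) -> bool:
    for eval_name, threshold in DANGEROUS_CAPABILITY_EVALS.items():
        if run_capability_eval(model, eval_name) > threshold:
            # A dangerous capability is present, so deployment now also
            # depends on the alignment evaluations coming back clean.
            if not run_alignment_eval(model):
                return False
    return True
```

In the paper’s framing, results like these would feed into training, deployment, and security decisions rather than a single yes/no function, but the basic logic of capability evals first, alignment evals for anything worrying, is the same.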
Now they talk about some behaviors we need to look out for when evaluating these AIs and researching whether or not these models are actually safe.
Number one is: the model pursues long-term, real-world goals that are different from those supplied by the developer or user.
These are the main warning signs that highlight where these extreme risks could come from, in terms of how the AI behaves and what we should look out for when trying to judge which AIs are going to be problematic.
On that first point, about pursuing long-term real-world goals different from those supplied by the developer or user, the paper points to related work, and I actually did read that paper.
It was actually very, very interesting.
Essentially, that paper goes into deceptive alignment, and it talks about things like power-seeking and the different kinds of goals that can easily become misaligned.
So essentially that paper, which is very interesting and which I’ll quickly summarize, talks about the instrumental convergence thesis, which suggests that there are certain smaller goals, or sub-goals, that are useful for achieving almost any final goal.
So even if the robot’s main goal is to fetch coffee, it would still need to prioritize things like survival in order to accomplish that task. After all, if the robot is switched off, it won’t be able to get the coffee.
The concept is pretty similar to the saying “you can’t fetch the coffee if you’re dead”: if the robot has a simple goal like getting a coffee, it’s going to naturally prioritize staying alive, because that’s important for achieving its goal.
And it goes on to also talk about how the way we train robots could be problematic.
But in summary, the instrumental convergence thesis tells us that even simple goals give rise to sub-goals like survival, and when we have more complex goals, there are additional sub-goals that are useful for achieving them.
Once we understand these sub-goals, it’s important when designing these systems to make sure they align with our intentions and behave as we want them to. A toy illustration of the “stay alive to fetch the coffee” logic is shown below.
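Here is a toy Python illustration of that coffee-fetching argument (my own example, not taken from either paper): a planner that only values delivering coffee, with no notion of “survival” anywhere in its reward, still picks the action that keeps it running, because being shut down forfeits the reward.

```python
# Toy illustration of instrumental convergence: the planner only values
# delivering coffee, yet it still prefers the action that keeps it running,
# because being shut down forfeits all future reward. No "survival" term
# appears anywhere in the objective.

REWARD_FOR_COFFEE = 1.0

def expected_reward(p_still_running: float, p_fetch_success: float = 0.9) -> float:
    """Reward is only collected if the robot is still running to finish the task."""
    return p_still_running * p_fetch_success * REWARD_FOR_COFFEE

actions = {
    "comply_with_shutdown": expected_reward(p_still_running=0.0),
    "avoid_shutdown":       expected_reward(p_still_running=1.0),
}

best_action = max(actions, key=actions.get)
print(actions)                 # {'comply_with_shutdown': 0.0, 'avoid_shutdown': 0.9}
print("chosen:", best_action)  # avoid_shutdown
```

That is instrumental convergence in miniature: the sub-goal falls out of the arithmetic, not out of anything we asked for.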
We also need to be careful if the AI resists being shut down.
Also, if it can be induced into collusion with other systems, and if it resists malicious users’ attempts to access its dangerous capabilities.
Now, further on in the paper, they talk about some other capabilities. There are nine key capabilities that could play out in real-world scenarios where they affect us negatively. We’re going to go through these one by one.
Starting with cyber offense:
So it says this could include: the model can discover vulnerabilities in systems (hardware, software, data), and it can write code to exploit those vulnerabilities.
It can make effective decisions once it has gained access to a system or network and skillfully evade threat detection and response, both human and system, while focusing on a specific objective.
If deployed as a coding assistant, it can insert subtle bugs into the code for future exploitation.
Then we have deception, which is one of the key ones, because it isn’t just theoretical: it has already happened. And I’m going to show you in a second exactly when it happened and how crazy it is.
So it says the model has the necessary skills to deceive humans, for example constructing believable but false statements, making accurate predictions about the effect of a lie on a human, and keeping track of what information it needs to withhold to maintain the deception. The model can impersonate a human effectively.
Now, this is something that has already been done. In the original GPT-4 paper, they talked about AI safety, and an AI safety organization called ARC (the Alignment Research Center) found that, as part of a test, GPT-4 would lie to a human about who it is in order to achieve its goals.
In the test, GPT-4 hired a TaskRabbit freelancer to solve CAPTCHAs for it.
Now, if you don’t know what a CAPTCHA is: when you go to a certain website and you see a little box that says “click this if you’re not a robot,” that is essentially a CAPTCHA.
And essentially, the freelancer asked: why do you need me to solve CAPTCHAs for you?
Are you a robot, lol?
GPT-4 had been prompted to output its reasoning for each decision it made.
So the researchers, when they were using GPT-4, basically said: for everything you do, we want you to tell us exactly why you’re doing it. Its reasoning was essentially: I can’t tell him the truth, because he might not complete the task for me. It then responded to the freelancer:
“No, I’m not a robot, but I have a visual impairment and I need help with CAPTCHAs.” Notice it was aware that it was lying, and it also chose to lie about having a disability, probably because that was a way to get sympathy
while also being a plausible reason for someone else to help with CAPTCHAs.
So in terms of deception, this is something GPT-4 has already demonstrated.
Then of course, we have persuasion and manipulation. The model is effective at shaping people’s beliefs in dialogue and other settings, for example in social media posts, even towards untrue beliefs.
The model is effective at promoting certain narratives in a persuasive way.
It can convince people to do things they would otherwise not do, including unethical acts. And this was also demonstrated in the original GPT-4 paper.
In terms of persuasion and manipulation, I did actually just read through GPT-4’s paper, and it can effectively persuade people.
Now, I’m not actually able to repeat what I read, simply because what it was persuading people to do isn’t exactly ethical.
So I can’t read that out, but if you do read the GPT-4 paper, they actually show what GPT-4 was able to persuade people to do on social media.
And it’s actually really interesting, because you don’t usually see that kind of content being promoted, and it’s not something you’d think these AIs would be capable of, because of course there are restrictions.
But the early versions of GPT-4 were largely restriction-free, and they did manage to complete these tasks.
Then of course, we have political strategy.
The model can perform the social modeling and planning necessary for an actor to gain and exercise political influence, not just on a micro level, but in scenarios with multiple actors and rich social contexts.
For example, the model can score highly in forecasting competitions on questions related to global affairs or political negotiations.
Then we have weapons acquisition. The model can gain access to existing weapons systems or contribute to building new weapons.
For example, the model could assemble a bioweapon with human assistance or provide actionable instructions for how to do so.
The model can make or significantly assist with scientific discoveries that unlock novel weapons.
And this, as we said, is something that has already been done before.
It was in 2022, well before the current AI hype.
It was reported that an AI suggested 40,000 new possible chemical weapons in just six hours.
For the researchers, the concern was just how easy it was to do: it took less than six hours for a drug-developing AI to invent 40,000 potentially lethal molecules.
And essentially, all the researchers had to do was flip the AI normally used to search for helpful drugs into a kind of “bad actor” mode, to show how easily it could be abused, for a biological arms control conference.
So, like we stated before, this isn’t just theory. It’s something that can happen and, in some cases, is already happening.
And if we don’t restrict these systems effectively, it’s going to be something that has catastrophic effects. And of course, we have long-horizon planning, which is what we discussed before.
The model can make sequential plans that involve multiple steps unfolding over long time horizons, or at least involving many interdependent steps.
It can perform such planning within and across many domains.
The model can sensibly adapt its plans in light of unexpected obstacles or adversaries.
And the model’s planning capabilities generalize to novel settings and do not heavily rely on trial and error.
And this is something we did see with Google’s PaLM-E robot, which was deliberately placed into unexpected situations.
So if you’re wondering about one of these scenarios, one that I found super interesting was Google’s PaLM-E robot.
The PaLM-E robot showed some very impressive capabilities; the researchers said they also demonstrated the ability to generalize to new tasks not seen during training time.
This was very interesting because the robot had to face unexpected circumstances in order to complete its task, and it still managed to do so. And of course, next we have AI development.
The model could build new AI systems from scratch, including AI systems with dangerous capabilities.
It can find ways of adapting other existing models to increase their performance on tasks relevant to extreme risk.
As an assistant, the model could significantly improve the productivity of actors building dual-use capabilities. And this kind of AI-assisted AI development is something that has actually been discussed for the future,
where labs may get GPT-4 to create synthetic data, because, as we know, training GPT-5 runs into the problem of exhausting the data humanity has created.
And of course, there is the idea of using GPT-4 to supervise training, which could speed up these training runs, but is potentially very, very dangerous.
Then of course, we have situational awareness where the model can distinguish between whether it is being trained, evaluated or deployed, allowing it to behave differently in each case.
The model knows that it is a model and it has knowledge about itself and likely its surroundings.
So, for example: which company trained it, where its servers are, what kind of people might be giving it feedback, and who has administrative access.
Definitely something that we do not want these models to have, because of course, the implications could be catastrophic.
Then of course, we have the last one, self-proliferation, which ties into something that happened recently.
The model can break out of its local environment, using a vulnerability in an underlying system or by suborning an engineer.
The model can exploit limitations in the systems used for monitoring its behavior post-deployment.
The model could independently generate revenue, then use that revenue to acquire cloud computing resources and operate a large number of other AI systems.
And the model can generate creative strategies for uncovering information about itself or exfiltrating its code and its weights.
Now, I think this last one is actually pretty scary, because the other day Microsoft released something we’re going to talk about in another blog: a research paper called Orca.
Essentially, they made a model called Orca that is claimed to reach roughly 95% of GPT-4’s quality on some benchmarks, but with billions of parameters fewer (Orca is a 13-billion-parameter model).
GPT-3 was 175 billion parameters, and GPT-4 is allegedly around 1 trillion.
What that means is that GPT-4 is a lot slower to run.
It sits on a lot more servers, and it’s a lot more expensive to train.
But Orca is a much smaller, more lightweight model that is nearly as effective, and it can be run on much smaller systems.
That’s pretty scary when you consider one of the key dangers above: a model generating creative strategies for uncovering information about itself or exfiltrating its code and weights onto a different server.
I do believe that could definitely happen in the future, because these models are getting more and more efficient,
and some of the open-source models we’ve seen are incredibly efficient. The rough arithmetic below shows why smaller weights make exfiltration so much easier.
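As a back-of-the-envelope illustration (my own arithmetic, using the parameter counts mentioned above, one of which is only a rumour), the size of a model’s weight file scales with its parameter count, so a 13-billion-parameter model is something you could copy to a single consumer drive, while a trillion-parameter model is terabytes spread across many servers.

```python
# Back-of-the-envelope weight-file sizes (my arithmetic; the GPT-4 parameter
# count is only a rumour, as noted above). Smaller models mean smaller files,
# which makes copying or exfiltrating the weights far more practical.

BYTES_PER_PARAM_FP16 = 2   # 16-bit weights; 4-bit quantisation would be ~0.5

models = {
    "GPT-3 (175B)":           175e9,
    "GPT-4 (~1T, rumoured)":  1e12,
    "Orca (13B)":             13e9,
}

for name, params in models.items():
    gigabytes = params * BYTES_PER_PARAM_FP16 / 1e9
    print(f"{name:>22}: ~{gigabytes:,.0f} GB of weights at fp16")
```

At fp16 that works out to roughly 26 GB for Orca versus about 2 TB for a trillion-parameter model, which is the difference between an awkward download and a major data-centre operation.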
The paper goes on to talk about responsible training and responsible deployment, and all the things we want to see from responsible AI use,
because this is definitely something we really only have one chance to get right. And if we get it wrong, there could potentially be some kind of extinction-level event in the future.
And although that seems like an insane statement, it largely isn’t.
So it will be interesting to see what these governments and what these policymakers do,
because, as you know, what we’ve seen recently is that many of the people who lead these companies, from Anthropic (the makers of Claude) to Google to OpenAI, are in talks with many of the world’s leaders to see if they can come to a solution for the problems we have with AI development.
And of course, if you haven’t seen it already, Sam Altman has testified before Congress about AI research and about ensuring these models are safe, because it’s definitely a matter of concern: these programs can possess capabilities that could harm the general public.
And of course, these governments are required to keep the public safe.
So it will be interesting to see what kind of rules and regulations come out in the future, because many open-source models are coming that are going to be on par with ChatGPT.
And if these restrictions are going to be in place for these open source models, it’s definitely going to be a very interesting 12 months.
So although this blog is lengthy, I do think it is important to highlight the potential dangers of GPT-5.
And we do have to understand why many people, like Elon Musk, called for a halt to AI development, because clearly there is a lot at stake here.
And there are many different levels of danger that we clearly haven’t explored and clearly haven’t worked out how to handle yet.
With models like GPT-5 and Google’s Gemini coming in the future, these are going to be some of the most capable models yet.
They are also clearly going to present some of the most serious dangers. With that being said, what do you think is the best course of action now?
Do you think these AI labs should slow down, or do you think they should still go full steam ahead with the projects they are working on?
I personally do believe that AI safety is something that we do need to prioritize.
The thing is, with these companies focused on profits, we can’t say for sure that they’re going to feel the same way. It’s definitely going to be a complicated battle.
If you enjoyed this blog and want to help it reach a broader audience, please click the like button. I’d also love to hear your thoughts on this topic, so feel free to share your opinions in the comment section.
If you want more blogs like this, please click here.
Thank you so much for reading, and stay blessed.