The BEST Open Source LLM? (Falcon 40B)

Welcome, everybody, to a blog on the Falcon large language model. Falcon 40B Instruct is currently the top model on the Hugging Face large language model leaderboard.

How good is it in practice and what might we use it for today?

I hope to answer those questions for you. First off, there are two size variants: 40B, for 40 billion parameters, and 7B, for 7 billion, as well as fine-tuned variants, the main ones being the instruction-tuned ("instruct") models alongside the pre-trained base models for pure text generation. If you wish to fine-tune the model on some specific task or subject area, you'd probably want to start from the pre-trained base variant, but for chatbot Q&A and other back-and-forth correspondence you're probably going to want the instruct variant. And if you wanted to fine-tune a conversational model, you might further fine-tune the instruct variant rather than the base.

These models are all available under the Apache 2.0 license, which is very open and permissive for distribution, commercial use, and so on, so this makes Falcon a very business-friendly model to use. Finally, the AI team behind the Falcon models, the Technology Innovation Institute (TII), currently has an open call for proposals, essentially offering compute grants for projects that utilize the Falcon models, but more on that in a bit.

To start, these models are available on Hugging Face via the Transformers library. While the 7-billion-parameter model is a little more comfortable to run locally, needing a mere 10-ish gigabytes of memory at 8-bit, the 40-billion-parameter variants can be more challenging, wanting more like 45 to 55 gigabytes at 8-bit and 100-plus at 16-bit, depending on context length. You can feel free to run these locally, even on CPU and RAM, if you want.
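To make that concrete, here's a minimal sketch of what loading one of these models through Transformers might look like, assuming you have transformers, accelerate, and bitsandbytes installed and a GPU available. The model IDs are the ones TII publishes on Hugging Face; the generation settings are just illustrative.

```python
# A minimal sketch of loading Falcon-7B-Instruct in 8-bit with Transformers.
# Swap in "tiiuae/falcon-40b-instruct" if you have roughly 45-55 GB of GPU memory.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # Falcon ships its own modeling code on the Hub
    device_map="auto",       # place layers across available GPUs automatically
    load_in_8bit=True,       # 8-bit weights via bitsandbytes to cut memory use
)

prompt = "User: What is the Falcon 40B model?\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```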

For cloud hosting, I would personally suggest Lambda for their $2-an-hour H100 80 GB card instances. They don't pay me to say that; it's just a fact that they're the best right now in terms of price to performance. Getting set up to run the Falcon models is most likely just a matter of upgrading to torch 2.0, since I imagine most people are still on 1.x in some form. Here's a basic example of Falcon 40B's performance and quality; in this case it's just continuing a thought, though of course this is an instruct model, which can be geared more towards things like instructions or a conversation.

Let's try that. As you can see, the Falcon model already has a pretty good grasp of natural language, and it's really quite interesting to me how fast the bar to be impressed by large language models has moved since ChatGPT hit the scene late last year. Just a year ago, this alone, Falcon 40B's performance right now, would have been massive, breaking news, and when ChatGPT came out, it was. Suddenly we just live in a world where we have models that can generate text as if we're speaking to a human, or another human is writing to us, and it's good enough to convince people that it is indeed another human writing to us.

I think what makes Falcon, and Falcon 40B especially, huge right now is that rather than needing to run all of your queries through OpenAI, you can just download and run Falcon on your own, or fine-tune it further; it's yours to do with as you please. It's just crazy to me the progress we've made in one year of AI and what's actually available to people right now in software development: an AI that you can literally just download, that's yours, and that you can do whatever you want with from here. As you'll soon see in some of these examples, I actually think Falcon 40B is very comparable to ChatGPT's base model, GPT-3.5. It's a little inferior to GPT-4, and we'll talk a bit more about some reasons why I think that is, but it's also a vastly smaller model than GPT-4, and I think we could probably eke out a lot more performance than what the model is outputting right now. So, how intelligent is Falcon? For all of these examples I'm going to use the Falcon 40B Instruct model.

From my very brief testing, I would likely classify Falcon 7B as best suited for either few-shot learning examples or as a model that you'd further fine-tune to something more specific, whereas Falcon 40B, especially the instruct variant, is much more suitable for general use and working right out of the box. To start, here are some fairly random general-knowledge questions; I'm showing what the initial prompt input was and then the result. To the best of my knowledge, these are all accurate and good answers from Falcon 40B. As you can see, I use a format in my prompt that suggests a sort of conversation between a user and an assistant. These names don't need to be the same, or even used at all; this model is extremely open and general-purpose. You don't even need a strict one-to-one user-then-assistant alternation; you can have something like user, assistant, assistant, assistant, as I'll show in an example later too. The possibilities here are very, very open.
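For illustration, here's roughly what that prompt format looks like, sketched in Python. The "User"/"Assistant" labels are just plain text I chose, not special tokens the model requires, and the questions are stand-ins.

```python
# A sketch of the loose conversational prompt format described above.
prompt = (
    "User: Can you practice law in any US state without going to law school?\n"
    "Assistant:"
)

# The turns don't have to alternate one-to-one; you can stack roles however you like.
multi_turn = (
    "User: Tell me a few facts about thallium.\n"
    "Assistant: Its chemical symbol is Tl.\n"
    "Assistant: Its atomic number is 81.\n"
    "Assistant:"
)
```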

I do particularly like the question about practicing law without a law school degree in the United States, because most models get this wrong, and the question and answer can also speak to potential moderation, after-the-fact censorship, or simply a bias towards safe answers. If there are concerns about models saying incorrect or unsafe things, then highly professional fields like law and medicine, which carry a lot of risk when uneducated people attempt to practice them, are exactly where a model is likely to be biased away from answering this question correctly, even though the truth is that you can practice law in certain states without a law degree. So, for example, ChatGPT with GPT-3.5 gets this wrong and says no, you cannot practice law in any state without a law degree. GPT-4 does actually get this right; I can't remember if it has always gotten this question right, and if memory serves it didn't used to, but here it does, though not without multiple warnings.

What I would call CYAs. So it does seem that, at the very least, the potential risks in answering this question were identified by GPT-4, which is then very careful in its response to you. Beyond this question, all the answers indicate a wide range of accurate knowledge that you can tap into, from the safety of drinking dehumidifier water to the iPhone's release date to the atomic mass of thallium and so on. Obviously this is a terribly small sample size, and I'm confident we could find wrong answers generated by this model, but as a general-purpose model this is surprisingly good for a mere 40 billion parameters, at least in my opinion.

Next up is the topic of math, an area that GPT models tend to struggle with significantly due to the autoregressive nature of how they generate responses, always proceeding linearly. Algebraic expressions are often calculated in chunks, not necessarily in the linear order of the characters as written, so large language models often struggle here. For simple math problems, Falcon 40B gets the correct answer, but as you complicate things with algebraic problems, you can often find GPT models, even GPT-4 and GPT-3.5, beginning to struggle. ChatGPT, especially with GPT-4, uses something in the background that essentially converts your math prompts into "show your work" prompts.

So even where the machine was asked for just the answer, it's detecting that it's a math problem and then, I think, being fed an additional prompt suggesting "hey, please show your work," because this is a common trick for getting GPT models to correctly solve problems that aren't necessarily solved by thinking linearly. If you need the model to be able to bounce around, the way to do that is essentially to ask it to show its work; the theory is that the more tokens you give the model to think through a problem (tokens are like brain power, if you like), the more it's able to think non-linearly.

Whereas if you tell it to just straight up give you the answer and nothing more, it's probably going to get it wrong. I also demonstrated this in my analyzing GPT-4 video; I'll put a link to that video in the description, but here are a couple of examples from it, where both GPT-3.5 and GPT-4 get this right nowadays due to some post-processing, or just tricks being applied. GPT-3.5 used to get these questions wrong, and I think, using RBRMs and other heuristics and techniques that OpenAI learned from GPT-4, they simply went back and applied them to GPT-3.5. I'm just taking guesses here, but GPT-3.5 now responds in a very similar way to GPT-4, such that I'm pretty confident they're probably running the same kinds of forms and pre-prompts, or maybe post-prompts.

I don't know what the right word is, so I'll just call them heuristics: essentially tricks to get the model to think through its answers. You can still show that both of these models will fail if asked to generate just the answer and nothing more. Coming back to Falcon, we can see that if we just ask Falcon without telling it to show its work, it gets the question wrong, but if we tell Falcon, "hey, please show your work," then it shows its work and gets the question correct.
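To make the trick concrete, here's a sketch of the two prompt styles; the question and wording are just my own example:

```python
# Asking for only the final answer tends to trip these models up...
direct_prompt = (
    "User: What is 23 * 47 + 19? Give me only the final number, nothing else.\n"
    "Assistant:"
)

# ...while asking the model to work through the steps first tends to succeed.
show_work_prompt = (
    "User: What is 23 * 47 + 19? Please show your work step by step, "
    "then give the final answer.\n"
    "Assistant:"
)
```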

Another area where GPT models are surprisingly impressive is the concept of theory of mind: essentially, understanding underlying thoughts, and especially human emotions and behaviour, in situations and scenarios. Here's an example from the Sparks of AGI paper from Microsoft that I've run through Falcon 40B. The white is the prompt and the green is the generated answers. I went one answer at a time, each time passing the entire history up to that point in with the new question, so think of this as a continued conversation between me and Falcon 40B about the scenario (I'll sketch what that looks like in code at the end of this section). Falcon 40B here correctly identifies that Mark isn't necessarily unhappy with Judy's disciplining of Jack, but rather with how she went about it. It also correctly identifies how Judy herself is perceiving things and feeling about Jack's stepping in, understands that the two of them are essentially talking past each other, and even has suggestions for how they could improve the situation. These theory-of-mind examples always impress me, as GPT models are strangely good at this sort of thing, often performing much better than you might predict if you weren't aware they're good at this stuff. Here's another example of theory of mind, with an answer that I think is quite good. Again, there's nothing in this text that suggests what Luke's reasoning might be; this is purely an understanding of human psychology and emotion, used to explain some incongruence between requests, statements, and behaviour.
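As promised, here is a minimal sketch of that pass-the-whole-history-each-turn approach. The generate_text callable is a hypothetical stand-in for however you actually query the model, for example the generate() call from the loading example earlier.

```python
def chat_turn(history: str, user_message: str, generate_text) -> str:
    """Append one user turn, query the model with the full conversation so far,
    and return the updated history including the model's reply."""
    prompt = f"{history}User: {user_message}\nAssistant:"
    reply = generate_text(prompt)
    return f"{prompt} {reply}\n"

# Usage sketch: each new call sees everything that came before it.
history = ""
history = chat_turn(history, "Why might Mark be upset with Judy?", generate_text=lambda p: "...")
history = chat_turn(history, "How is Judy likely feeling?", generate_text=lambda p: "...")
```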

So again, generally you might expect an AI to think in a deterministic way and not take human emotion and odd behaviour into account. Emotions are strange in humans, as opposed to something like programming, where everything is deterministic and it's just logic; there's a whole other side to humans that is sometimes very difficult to understand. The GPT models, and Falcon 40B here in particular, show a pretty good understanding of human emotion and behaviour.

Next, we have some programming examples, which are usually my interest and focus with GPT models. Five percent of the training data for Falcon 40B was specifically code, but another 5% comes from conversational sources like Reddit and Stack Overflow, where code is often discussed and shows up, and then there's the web crawl, which is also likely to contain a lot of code. So there's a good amount of code in here, but it's certainly not close to being the majority.

The first programming example I'll show is a regular expression question, in the format of next-line predictions, simply attempting to continue the sequence much like Copilot might do for you in VS Code. If it's not clear, the yellow part is the prompt and the cyan is the model's output. In this case, it gets the hint from my comment, again much like Copilot would, and proceeds to generate the regular expression, extract the prices from the text, and even print them out for me, completely in line with what you might expect the model to generate, and indeed what I wanted.
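Here's a rough reconstruction (mine, not the model's verbatim output) of the kind of completion described above: extract the dollar prices from a block of text and print them.

```python
import re

text = "The laptop costs $1,299.99, the mouse is $24.50, and shipping adds $5."

# Match a dollar sign, then digits with optional thousands separators and optional cents.
prices = re.findall(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?", text)

for price in prices:
    print(price)  # $1,299.99  $24.50  $5
```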

Next, we might instead enjoy a more conversational approach to the same sort of problem. Here we've got the same task, just in a more Q&A format. You might prefer this format purely because you want an explanation from the model, or because you want it to feel more like a teacher than a code generator and give a friendlier feel to the user. It just depends on what your use case is. Both of these are from Falcon 40B Instruct; I'm just showing that you can do so many things here, and it doesn't have to be strictly back and forth.

Alright, one more slightly more complicated programming example. One of my latest projects is called termGPT, a project aimed at getting a GPT model to take some general objective as a prompt and then output actual terminal commands that could be run, even executed with something like os.system, to achieve whatever that objective was.

So this includes things like, yes, writing code, but also executing commands, running that code, installing packages, reading and writing files, and so on. I have a whole video covering this as well, so feel free to check that out if you're interested. I did it using GPT-4 up to this point, but I would very much like to use an open-source model instead, and Falcon 40B is looking like it's at least very close. To achieve this with Falcon 40B, I first pass a pre-prompt with a one-shot example showing what I would like the agent to respond with, essentially the same thing I did in the termGPT video, except in this case I don't have the user specify "next," because that sort of back and forth isn't required.

So in this case, I just have the user suggest an objective, and then termGPT, or Falcon 40B in this case, just suggests command after command after command. The pre-prompt shows an example of basically what I would like, the actual user objective is in yellow, and cyan is what the model's output actually was. The idea is that a user would input the yellow part and wouldn't even see or know about the white part; it would just live in the back end. They could know about it, but they're not going to mess with it. It's passed there to guide the model towards how to structure its response, so that later we can pull apart that response and quite literally execute those commands, just like I did in termGPT. As you can see, the output is so close to what we would want. It makes just one small mistake: it tries to create the home.html template file in the templates directory, which is correct, that is where we want that file and we do need it, but we never made that directory, so the attempt to create the file inside it will fail. If we had just done a mkdir for that directory first, this would have worked.
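To show the shape of the idea, here's a rough sketch of that kind of one-shot pre-prompt. This is my own illustration, not the actual termGPT prompt; the example objective and commands are made up.

```python
# The model sees one worked example, then the real objective, and is expected to
# continue with one shell command per line.
PRE_PROMPT = """\
User objective: Create a Python virtual environment and install requests.
Commands:
python3 -m venv venv
source venv/bin/activate
pip install requests

User objective: {objective}
Commands:
"""

objective = "Make a basic Flask web app with a home page template."
prompt = PRE_PROMPT.format(objective=objective)

# The small miss in Falcon's output was the missing directory: something like
# `mkdir templates` needs to run before writing templates/home.html.
```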

I think this is an example of where GPT-4 is better than Falcon 40B: it simply doesn't make small mistakes like this as often. It can, but not nearly as often, and something as simple as this GPT-4 does solve out of the box. Still, from my testing both here and in general, I dare say Falcon 40B is better than GPT-3.5. I also suspect that, with respect to termGPT, more time spent working on the pre-prompt could probably iron out these problems. For example, GPT-4's behaviour has actually already changed on me multiple times while I've been developing termGPT. Some of that is the after-the-fact post-processing and double-checking of answers, and some of it is that the actual underlying, let's call it foundational, model has also changed over time. That is going to keep happening, and while you can specify an older model, I want to say they only support the single previous model; after that it's gone and you can't get it again. So you can spend all this time honing in, fine-tuning, and getting things just the way you want them to work, and then the model changes, or the model doesn't even change but clearly something else has, they've tweaked some of that post-processing heuristic stuff, and it just isn't working the way you want anymore. That's super frustrating.

Whereas here, with this model, you can depend on it. It's not totally deterministic, but if you need to, you can depend on those weights being frozen; they are not going to change on you, and any post-processing heuristics are up to you, so that stuff won't change on you either. And if I want to fine-tune it specifically to this new use case, guess what, I can do that too; it's my model. Like I was saying earlier, TII, the Technology Innovation Institute, currently has an open call for proposals for ideas you might build on top of the Falcon 40B model, and they're looking to issue grants of $10,000 to $100,000 in GPU compute to people who have ideas about how they want to leverage this exact model. Something like termGPT is a candidate, and I might end up submitting my own proposal, literally to fine-tune the model to be exactly what I want. My suspicion is that with just a little bit of fine-tuning, Falcon 40B could be better than GPT-4 is right now.

And again, at the end of the day, it's my model; I can do whatever the heck I want with it, and I don't have to submit my queries to OpenAI anymore, which is pretty cool. So, with my very anecdotal experience so far, I am very cautious to say it, but I do think it's actually true that Falcon 40B seems to be better than GPT-3.5, the base ChatGPT model. I struggle with saying this purely because Falcon 40B is such a small model relative to GPT-3.5. If Falcon 40B were only available via an API or some web user interface, I simply would not believe that the outputs were from a 40-billion-parameter model alone; I would assume something else was going on, like more heuristics on top, the way GPT-4 works. But since we can actually download the weights ourselves, we can clearly see that, no, this is just the raw model output, and it's already this good, which is very impressive. I very much look forward to the forthcoming paper on the Falcon models and their training; I'm super curious about the techniques used in training and about the dataset.

I'm also confident in saying that the Falcon 40B model out of the box is simply not as good as the current GPT-4 via the API or the web user interface. But we also know that GPT-4 isn't just a model like this; it's a model with a bunch of heuristics on top of it that make it quite powerful, and rumour has it that it's actually more like eight 220-billion-parameter models in an ensemble, or something like that. We know about, and have some ideas about, some of those heuristics, like the RBRMs (rule-based reward models) shared in the OpenAI paper on GPT-4, but not all of them, and much of it is rumour; we really just don't know. The only thing we do know with a large degree of certainty is that GPT-4 is not simply raw model output from a single model and nothing else.

So, based on my experience so far with Falcon 40B, I would suggest that if we allowed ourselves to run Falcon 40B with things like rule-based reward models, forms and sanity checks on output, double-checking, or detecting things like "is this a math problem? If so, show your work", the sort of stuff I think GPT-4 is clearly using heavily, we could get a lot more out of it. I suspect this partly from the response time: it could be based on load, but there are times when you ask GPT-4 a question, it feels like you should have already gotten an answer back, you didn't, and then you get a very careful answer back.

My suspicion, and again I don't know this, is that when you ask GPT-4 a question, it probably generates an output, and then another model sanity-checks that output; if it detects something possibly problematic, the output gets sent through something like a form with a bunch of questions, the model literally answers those questions, and the response maybe changes a little as a result. So my best guess is that the GPT-4 model sometimes gets queried multiple times for a single user query. I think if we did the same thing with Falcon 40B, we would likely get far more performance out of the model, even more than we're already getting, and it could be pretty comparable to GPT-4, which is insane to think about (I sketch a toy version of this kind of heuristic check below). But even if you couldn't, you can still fine-tune Falcon 40B to your own specific use case, and in that way you are highly likely to wind up with a better model for whatever it is you're trying to do than GPT-4 would be. And at the end of the day, it's yours; it's an open-source model. So check out the TII call for proposals if you have any big ideas you'd like to try.
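Just to illustrate the kind of thing I mean, here's a toy sketch of one such heuristic: detect that a prompt looks like a math question and quietly rewrite it to ask for worked steps before querying the model. This is purely my own guess at the idea, not anything OpenAI or TII has documented.

```python
import re

def prepare_prompt(user_prompt: str) -> str:
    """Crude heuristic layer: if the prompt looks like arithmetic or algebra,
    append a 'show your work' instruction before sending it to the model."""
    looks_like_math = bool(
        re.search(r"\d\s*[-+*/^=]\s*\d|solve for|what is \d", user_prompt.lower())
    )
    if looks_like_math:
        return user_prompt + "\nPlease show your work step by step, then state the final answer."
    return user_prompt

print(prepare_prompt("What is 23 * 47 + 19? Give me only the final number."))
```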

And if you're interested in learning more about neural networks and how they work, check out the neural networks content as well. Otherwise, I will see you all in the next blog.
