Thanks! I actually picked up the concept of context window, and from there how to create a modelfile, through one of the links provided earlier and it has made a huge difference. In your experience, would a small model like llama3.2 with a bigger context window be able to provide the same output as a big modem L, like qwen2.5:14b, with a more limited window? The bigger window obviously allow more data to be taken into account, but how does the model size compare?
I need to look into flash attention! And if i understand you correctly a larger model of llama3.1 would be better prepared to handle a larger context window than a smaller llama3.1 model?