25 FPS default to 50 Hz instead of 75 (OLED)

The Hobbyist@lemmy.zip · 3 days ago

I didn’t say it can’t. But I’m not sure how well it is optimized for it. From my initial testing, it queues queries and submits them one after another to the model. I have not seen it batch-compute the queries, but maybe it’s a setup thing on my side. vLLM, on the other hand, is designed specifically for the multi-user concurrent use case and has multiple optimizations for it.
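One way to check whether a server queues or overlaps requests is to time the same prompts sent one by one and then all at once. This is only a sketch: `fake_send` is a hypothetical stand-in for the HTTP call to whichever server you are testing, and the helper names are mine, not part of ollama or vLLM.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def time_sequential(send: Callable[[str], str], prompts: List[str]) -> float:
    """Send prompts one after another and return the elapsed wall time."""
    start = time.perf_counter()
    for prompt in prompts:
        send(prompt)
    return time.perf_counter() - start


def time_concurrent(send: Callable[[str], str], prompts: List[str]) -> float:
    """Send all prompts at once from separate threads and return the elapsed wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        list(pool.map(send, prompts))
    return time.perf_counter() - start


def fake_send(prompt: str) -> str:
    """Hypothetical stand-in for a request to the server under test."""
    time.sleep(0.05)  # pretend the server takes 50 ms per request
    return "reply to " + prompt
```

If the concurrent time is close to the sequential time, the server is most likely handling requests one after another. If it is much lower, the requests are being processed together.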

The Hobbyist@lemmy.zip · edit-2 3 days ago

I run Mistral-Nemo (12B) and Mistral-Small (22B) on my GPU and they are pretty good. As others have said, GPU memory is one of the most limiting factors. 8B models are decent, 15-25B models are good and 70B+ models are excellent (solely based on my own experience). Go for q4_K models, as they will run many times faster than higher-precision quantizations with little performance degradation. They typically come in S (Small), M (Medium) and L (Large) variants; take the largest which fits in your GPU memory. If you go below q4, you may see more severe and noticeable performance degradation.
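The "take the largest which fits" rule can be sketched as a small helper. The file sizes below are hypothetical placeholders (check the real download sizes for your model), and the 1.5 GB headroom for context and runtime buffers is an assumption, not a measured value.

```python
from typing import Dict, Optional


def pick_largest_fitting(sizes_gb: Dict[str, float], vram_gb: float,
                         headroom_gb: float = 1.5) -> Optional[str]:
    """Return the largest variant whose file size, plus headroom for context
    and runtime buffers, still fits in VRAM (None if nothing fits)."""
    budget = vram_gb - headroom_gb
    fitting = {name: size for name, size in sizes_gb.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


# Hypothetical sizes in GB for the S/M/L variants of one model.
example_sizes = {"q4_K_S": 6.9, "q4_K_M": 7.4, "q4_K_L": 7.9}
print(pick_largest_fitting(example_sizes, vram_gb=12.0))
```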

If you need to serve only one user at a time, ollama + Open WebUI works great. If you need multiple users at the same time, check out vLLM.

Edit: I’m simplifying it very much, but hopefully it is simple and actionable as a starting point. I’ve also seen great stuff from Gemma2-27B.

Edit2: added links

Edit3: a decent GPU regarding bang for buck IMO is the RTX 3060 with 12GB. It may be available on the used market for a decent price and offers a good amount of VRAM and GPU performance for the cost. I would like to suggest AMD GPUs, as they offer much more GPU memory for their price, but not all of them are equally well supported by ROCm and I’m not sure about compatibility with these tools, so perhaps others can chime in.

Edit4: you can also use open-webui with VS Code and the continue.dev extension, so that you have a Copilot-type LLM in your editor.

The Hobbyist@lemmy.zip · 14 days ago

The way I see it, it is because of the controls. You have a much stronger reaction with a mouse than with a joystick. Anytime you play with a mouse, the reaction time is expected to be lower because you dictate where you want to be looking (like in an FPS). The mouse acts as a view positioning device. It is not forgiving. A joystick, however, is a rotation device. It tells the game how fast you want to be moving around when looking, not where you should be looking. It is much more forgiving because you only dictate the speed of rotation. If you plugged a mouse into your Deck and played with it, you would immediately notice the difference, I imagine. I think the trackpads do bring some aspects of the mouse to the Deck too in that regard.
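The difference between the two input styles can be sketched in a few lines. All names and numbers here (sensitivity, a 180 degrees per second maximum turn speed) are made up for illustration and do not come from any game or driver.

```python
def mouse_view(angle_deg: float, mouse_delta: float, sensitivity: float = 0.1) -> float:
    """Position control: hand movement maps directly to a change in view angle."""
    return angle_deg + mouse_delta * sensitivity


def joystick_view(angle_deg: float, deflection: float, dt: float,
                  max_speed_deg_s: float = 180.0) -> float:
    """Rate control: stick deflection (-1..1) sets a turn speed, integrated over time."""
    return angle_deg + deflection * max_speed_deg_s * dt


def frames_to_turn(target_deg: float, fps: float, deflection: float = 1.0,
                   max_speed_deg_s: float = 180.0) -> int:
    """Count the frames a joystick needs to rotate the view by target_deg."""
    if deflection <= 0 or max_speed_deg_s <= 0:
        raise ValueError("deflection and max speed must be positive")
    angle, frames = 0.0, 0
    while angle < target_deg:
        angle = joystick_view(angle, deflection, 1.0 / fps, max_speed_deg_s)
        frames += 1
    return frames
```

With the mouse, a 90 degree turn is applied in the frame the movement arrives. With the stick at these settings, the same turn is spread over roughly half a second whatever the frame rate, so a few milliseconds of extra latency are a much smaller share of the whole motion.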

But yeah, my takeaway is, with a joystick you don’t need that tight of a latency as with a mouse.

The Hobbyist@lemmy.zip · 21 days ago

Indeed, quite surprising. You got to “stroke their fur the right way” so to speak haha

Also, I’m increasingly impressed with the rapid progress of open-weights models: initially I was playing with Llama3.1-8B, which is already quite useful for simple queries. Lately I’ve been trying out Mistral-Nemo (12B) and Mistral-Small (22B) and they are considerably more capable. I have a 12GB GPU and so far those are the most powerful models I can run decently. I’m using them to help me write tasks for Ansible and to learn the inner workings of the Linux kernel and some bootloader stuff. I find them quite helpful!

The Hobbyist@lemmy.zip · 21 days ago

Someone recently referred me to this blog post about using RAG in open-webui. I have not tested it, but the author seems to reach a good setup.

https://medium.com/@kelvincampelo/how-ive-optimized-document-interactions-with-open-webui-and-rag-a-comprehensive-guide-65d1221729eb

Perhaps this is of use to you?

The Hobbyist@lemmy.zip · 26 days ago

I have no idea if ollama can handle multi-GPU. The 70B in its q2_k quantized form already requires 26GB of memory, so you would need at least that much to run it well. That would only mean it could run entirely on GPU, which is the best-case scenario; it says nothing about the speed.
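As a rough back-of-the-envelope check on those numbers, here is a sketch that works out the average bits stored per weight and how much of the model a single card could hold at most. It assumes GB means 10^9 bytes and ignores context and runtime buffers. The 12 GB figure is just an example card size.

```python
def implied_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, given a model size in GB (10^9 bytes)."""
    return size_gb * 8 / params_billion


def fraction_on_gpu(size_gb: float, vram_gb: float) -> float:
    """Upper bound on the share of the weights that can sit in VRAM."""
    return min(1.0, vram_gb / size_gb)


bpw = implied_bits_per_weight(26, 70)
share = fraction_on_gpu(26, 12)
print(f"{bpw:.2f} bits/weight, at most {share:.0%} of the weights on a 12 GB GPU")
```

So the 26GB figure corresponds to roughly 3 bits per weight on average, and a 12GB card could hold less than half of it, with the rest left to system memory and the CPU.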

I know some people with apple silicon who have enough memory to run the 70B model and for them it runs fast enough to be usable. You may be able to find more info about it online.

The Hobbyist@lemmy.zip · 26 days ago

I wish I could. I have an RTX 3060 12GB, I run mostly llama3.1 8B versions in fp8, at 30-35 tokens/s.

The Hobbyist@lemmy.zip · 27 days ago

Sure! It can be a bit of a steep learning curve at times, but there are heaps of resources online, and LLMs can also be useful, even if it is just in pointing you in the right direction for further reading. Regardless, you can reach out to me or other great folks from !localllama@sh.itjust.works or similar AI, ML or related communities!

Enjoy :)

The Hobbyist@lemmy.zip · edit-2 27 days ago

For RAG, there are some tools available in open-webui, which are documented here: https://docs.openwebui.com/tutorials/features/rag They have plans for how to expand and improve it, which they describe here: https://docs.openwebui.com/roadmap#information-retrieval-rag-

For fine-tuning, I think this is (at least for now) out of scope. They focus on inference. I think the direction is to eventually help you create and manage your own data, gathered from using LLMs through Open-WebUI, but actually fine-tuning is not possible (yet) using either ollama or open-webui.

I have not used the RAG function yet, but besides following the instructions on how to set it up, your experience with RAG may also be somewhat limited depending on which embedding model you use. You may have to go and look for a good model (one that is small and efficient enough to re-scan your documents, yet powerful enough to generate meaningful embeddings). Also, in case you didn’t know, the embeddings you generate are specific to an embedding model, so if you change that model you’ll have to rescan your whole document library.
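To illustrate why changing the embedding model forces a rescan, here is a toy in-memory index that records which model built it. This is not how open-webui implements RAG: `TinyIndex` and `toy_embed` are hypothetical names, and the letter-count "embedding" only stands in for a real model.

```python
import math
from typing import Callable, Dict, List, Tuple

Embedder = Callable[[str], List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real embedding model: letter counts."""
    text = text.lower()
    return [float(text.count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]


class TinyIndex:
    """Toy vector index that remembers which embedding model built it."""

    def __init__(self, model_name: str, embed: Embedder) -> None:
        self.model_name = model_name
        self.embed = embed
        self.vectors: Dict[str, List[float]] = {}

    def add(self, doc_id: str, text: str) -> None:
        self.vectors[doc_id] = self.embed(text)

    def needs_rescan(self, model_name: str) -> bool:
        """Vectors from one model are not comparable with another model's."""
        return model_name != self.model_name

    def search(self, query: str, model_name: str) -> Tuple[str, float]:
        """Return the (doc_id, similarity) pair closest to the query."""
        if self.needs_rescan(model_name):
            raise ValueError("index built with a different embedding model; re-scan the documents")
        q = self.embed(query)
        return max(((doc_id, cosine(q, v)) for doc_id, v in self.vectors.items()),
                   key=lambda pair: pair[1])
```

The stored vectors only make sense next to query vectors from the same model, which is why the index refuses to search when the model name differs.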

Edit: RAG seems a bit limited by the supported file types. You can see the list here: https://github.com/open-webui/open-webui/blob/2fa94956f4e500bf5c42263124c758d8613ee05e/backend/apps/rag/main.py#L328 It seems not to support Word documents or PDFs, so it is mostly incompatible with WYSIWYG documents that have advanced formatting.

The Hobbyist@lemmy.zip · 27 days ago

The interface called open-webui can run in a container, but ollama runs as a service on your system, from my understanding.

The models are local and only answer queries by default. It all happens on the system without any additional tools. Now, if you want to give them internet access, you can: it is an option you have to set up, and open-webui makes that possible, though I have not tried it myself. I have only seen the option.

I have never heard of any LLM that would “answer base queries offline before contacting their provider for support”. It’s almost impossible for the LLM to do that by itself without you setting things up for it that way.
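To show what "local" means in practice, here is a minimal sketch of a query to the ollama service over its HTTP API on localhost. It assumes ollama is running on its default port 11434 and that the model you name has already been pulled. The helper names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default local address


def build_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request aimed at the local ollama service."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(url, data=payload,
                                  headers={"Content-Type": "application/json"},
                                  method="POST")


def ask(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs ollama running)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Something like `ask("llama3.1:8b", "Why is the sky blue?")` only ever talks to `localhost`, so nothing leaves the machine unless you explicitly add a tool that does.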

The Hobbyist@lemmy.zip · 27 days ago

What’s great is that with ollama and open-webui, you can just as easily run it all locally on one computer using the open-webui pip package, or on a remote server using the container version of open-webui.

I’ve run both and the webui is really well done. It offers a number of advanced options, like the system prompt but also memory features, documents for RAG and even a built-in Python IDE for when you want to execute Python functions. You can even enable web browsing for your model.

I’m personally very pleased with open-webui and ollama and they work wonders together. Highly recommend it! The latest llama3.1 (in 8B and 70B variants) and llama3.2 (in 1B and 3B variants) work very well, the latter even on CPU only! Give it a shot, it is so easy to set up :)

The Hobbyist@lemmy.zip · 1 month ago

This is great! Since GOG provides offline installers for my games, I had wondered what the best way would be to manage and distribute them for myself, and this really hits the spot! Very glad to be able to fully self-manage my games in this way. Thanks!

The Hobbyist@lemmy.zip · 1 month ago

Has anyone been following the development of 3.6? What are some highlighted features or bugs being addressed?

The Hobbyist@lemmy.zip · 2 months ago

Have used my BT headset for almost a year and never had an issue. What are you referring to?

The Hobbyist@lemmy.zip · 2 months ago

100%

The Hobbyist@lemmy.zip · 4 months ago

I’m guessing this technology requires specific implementation in the game? Nonetheless, it’s great to see this kind of effort. It is very suitable for battery-powered laptops and handhelds and fits that use case perfectly, imo.

The Hobbyist@lemmy.zip · 5 months ago

I’m surprised, if I recall, all but one LCD model were to be phased out in November, or at least that’s what they said when they announced the OLED version. Were the supplies that large?

The Hobbyist@lemmy.zip · 6 months ago

Can chromium be used at all or are there specific google components required?

The Hobbyist@lemmy.zip · edit-2 7 months ago

It’s pretty new, but if you are interested in degoogling and looking for ways to use Android Auto, there is apparently a working solution with GrapheneOS. I have not tested it, so I cannot verify it.

https://grapheneos.org/features#android-auto

Edit: typo

The Hobbyist@lemmy.zip · 8 months ago

Can you elaborate on why it is a bad security practice? It’s the first time I’m reading about it and I’d like to read more about it. Thanks!

The Hobbyist@lemmy.zip · 11 months ago

25 FPS default to 50 Hz instead of 75 (OLED)

The Hobbyist@lemmy.zip · 1 year ago

Looking for a video on quicksync performance impact of iGPU passthrough

The Hobbyist@lemmy.zip · edit-2 1 year ago

ZFS dataset configuration for a movies and tv shows library? Very heterogeneous data
