
  • SEO (search engine optimization) has dominated search results for almost as long as search engines have existed. The entire field of SEO is about gaming the system at the expense of users, and often also at the expense of search platforms.

    The audience for an author’s gripping life story in every goddamn recipe was never humans, either. That was just for Google’s algorithm.

    Slop is not new. It’s just more automated now. There are two new problems for users, though:

    1. Google no longer gives a shit. They used to play the cat-and-mouse game, and while their victories were never long-lasting, at least their defeats were not permanent. (Remember ExpertsExchange? It took years before Google brought down the hammer on that. More recently, think of how many results you’ve seen from Pinterest, Forbes, or Medium, and think of how few of those deserved even a second of your time.)
    2. Companies that still do give a shit face a much more rapid exploitation cycle. The cats are still plain ol’ cats, but the mice are now Borg.

  • Well I’m sorry, but most PDF distillers since the 90s have come with OCR software that can extract text from the images and store it in a way that preserves the layout AND the meaning

    The accuracy rate of even the best OCR software is far, far too low for a wide array of potential use cases.

    Let’s say I have an archive of a few thousand scientific papers. These are neatly formatted digital documents, not even scanned images (though scanned images would also be within scope for this kind of task and shouldn’t be ignored). Even for those, there’s nothing out there that produces reliably accurate results. Everything requires painstaking validation and correction if you really care about accuracy (rough sketch of the two extraction paths at the end of this comment).

    Even ArXiv can’t do a perfect job of this. They launched their “beta” HTML converter a couple of years ago, and improving its accuracy and reliability is an ongoing challenge. And that’s with the help of LaTeX source material! It would naturally be much, much harder if they had to rely solely on the PDFs generated from that LaTeX. See: https://info.arxiv.org/about/accessible_HTML.html

    As for solving this problem with “AI”…uh…well, it’s not like “OCR” and “AI” are mutually exclusive terms. OCR tools have been using neural networks for a very long time already; it just wasn’t a buzzword back then, so nobody called it “AI”. However, in the current landscape of “AI” in 2025, “accuracy” is usually just a happy accident. It doesn’t need to be that way, and I’m sure the folks behind commercial and open-source OCR tools are hard at work implementing new technology in a way that Doesn’t Suck.

    I’ve played around with various vision-language (VL) models and they still seem to be in the “proof of concept” phase.
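
    For reference, a minimal sketch of the two extraction paths I mean above: pulling the embedded text layer out of a born-digital PDF versus OCRing a scanned one. The library choices (pdfminer.six, pdf2image, pytesseract) and the file names are just assumptions for illustration; neither path gives you clean results on real papers without manual cleanup.

    ```python
    # Born-digital PDF: the text layer already exists, so this is "just" extraction.
    # Layout, math, and multi-column reading order are where it tends to fall apart.
    from pdfminer.high_level import extract_text

    digital_text = extract_text("paper_digital.pdf")  # hypothetical file

    # Scanned PDF: rasterize each page, then OCR it. Accuracy drops further here,
    # and equations/tables usually need manual correction afterwards.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path("paper_scanned.pdf", dpi=300)  # hypothetical file
    ocr_text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    ```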


  • I’ve been using cryptpad.fr (the “flagship instance” of CryptPad) for years. It’s…fine. Really, it’s fine. I’m not thrilled with the experience, but it is functional and I’m not aware of any viable alternatives that are end-to-end encrypted.

    It’s based on OnlyOffice, which is basically a heavyweight web-first Microsoft Office clone. Set your expectations accordingly.

    No mobile apps, and the web UI is not optimized for mobile. I mean, it works, but does using the desktop MS Office UI on a smartphone sound like fun to you?

    Performance is tolerable but if you’re used to Google Sheets, it’s a big downgrade. Some of this is just the necessary overhead involved in an end-to-end encrypted cloud service. Some of it is because, again, this is a heavyweight desktop UI running in a web browser. It’s functional, but it’s not fast and it’s not pretty.


  • For instance, Mozilla said it may have removed blanket claims that it never sells user data because the legal definition of “sale of data” is now “broad and evolving,” Mozilla’s blog post stated.

    Uh huh.

    The company pointed to the California Consumer Privacy Act (CCPA) as an example of why the language was changed, noting that the CCPA defines “sale” as the “selling, renting, releasing, disclosing, disseminating, making available, transferring, or otherwise communicating orally, in writing, or by electronic or other means, a consumer’s personal information by [a] business to another business or a third party” in exchange for “monetary” or “other valuable consideration.”

    Yes. That’s what “sale of data” means. Everybody understood that. That’s exactly what we don’t want you to do.


  • DNS over HTTPS (DoH). It encrypts DNS lookups and addresses the resolver by URL, which allows URL-based customizations that aren’t possible with traditional DNS (e.g. the server could offer /ads or /trackers endpoints so you can choose what to block).

    DNS over TLS (DoT) is similar, but it doesn’t use URLs, just IP addresses like plain DNS. Both are encrypted. (Minimal DoH example below.)
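
    Here’s a minimal sketch of a DoH lookup, using Cloudflare’s JSON flavor of DoH (the RFC 8484 standard format is binary application/dns-message, but the JSON API is easier to show). The idea of a filtering path like /ads is hypothetical here, just to illustrate the URL-based customization point.

    ```python
    # Resolve an A record over HTTPS instead of plain UDP/53.
    import requests

    resp = requests.get(
        "https://cloudflare-dns.com/dns-query",       # a real public DoH endpoint
        params={"name": "example.com", "type": "A"},
        headers={"accept": "application/dns-json"},   # ask for the JSON format
        timeout=5,
    )
    for answer in resp.json().get("Answer", []):
        print(answer["name"], answer["data"])

    # A filtering provider could expose e.g. https://dns.example/ads (hypothetical)
    # and block or allow different categories depending on which path you query.
    ```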




  • If you think this isn’t related to human rights, then you’ve missed the point.

    People have the right to use technology, and indeed we effectively need technology to exercise our right to free speech. You cannot have one without the other. Not anymore.

    The right way to think about this is that they are arbitrarily banning a topic of discussion simply because it is not dead-center average. This isn’t even a legal issue, and the justification is utter nonsense (Facebook itself runs on Linux, like >90% of the internet). No government has officially asked them to do this, though the timing suggests the push came unofficially from the Trump administration.

    This is about exerting control, establishing precedent, and applying a chilling effect to anything not directly aligned with their interests. This obviously extends to human rights issues. This is a test run.


  • But any 50 watt chip will get absolutely destroyed by a 500 watt gpu

    If you are memory-bound (and since OP’s talking about 192GB, it’s pretty safe to assume they are), then it’s hard to make a direct comparison here.

    You’d need 8 high-end consumer GPUs to get 192GB (napkin math at the end of this comment). Not only is that insanely expensive to buy and run, but you couldn’t even power it from a standard residential electrical circuit, or fit it on any consumer-level motherboard. Even 4 GPUs (which would be great for 70B models) would cost more than a Mac.

    The speed advantage you get from discrete GPUs rapidly disappears as your memory requirements exceed VRAM capacity. Partial offloading to GPU is better than nothing, but if we’re talking about standard PC hardware, it’s not going to be as fast as Apple Silicon for anything that requires a lot of memory.

    This might change in the near future as AMD and Intel catch up to Apple Silicon in terms of memory bandwidth and integrated NPU performance. Then you could sidestep the Apple tax, and perhaps pair that with a discrete GPU for a meaningful performance boost even on larger models.
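
    The napkin math I mentioned, weights only (ignoring KV cache and activations), with rough numbers as assumptions:

    ```python
    # Approximate weight memory: parameter count x bytes per parameter.
    GB = 1024 ** 3

    def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
        return params_billion * 1e9 * (bits_per_param / 8) / GB

    print(f"70B @ fp16:  {weight_memory_gb(70, 16):.0f} GB")   # ~130 GB
    print(f"70B @ 4-bit: {weight_memory_gb(70, 4):.0f} GB")    # ~33 GB
    print(f"24 GB cards needed for 192 GB: {192 / 24:.0f}")    # 8
    ```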


  • This will be highly platform-dependent, and also dependent on your threat model.

    On PC laptops, you should probably enable Secure Boot (if it’s not enabled by default) and password-protect your BIOS. On Macs you can disable booting from external media (I think that’s even the default now, but not totally sure). You should definitely enable full-disk encryption – that’s FileVault on Mac and BitLocker on Windows (quick status-check sketch at the end of this comment).

    On Apple devices, you can enable USB Restricted Mode, which will protect against some attacks with USB cables or devices.

    Apple devices also have Lockdown Mode, which restricts or disables a whole bunch of functionality in an effort to reduce your attack surface against a variety of sophisticated attacks.

    If you’re worried about hardware hacks, then on a laptop you’d want to apply some tamper-evident stickers or something similar, so if an evil maid opens it up and tampers with the hardware, at least you’ll know something fishy happened, so you can go drop your laptop in an active volcano or something.

    If you use any external devices, like a keyboard, mouse, hard drive, whatever…well…how paranoid are you? I’m going to be honest: there is a near 0% chance I would even notice if someone replaced my charging cables or peripheral cables with malicious ones. I wouldn’t even notice if someone plugged in a USB keylogger between my desktop PC and my keyboard, because I only look at the back of my PC once in a blue moon. Digital security begins with physical security.

    On the software side, make sure you’re the only one with admin rights, and ideally you shouldn’t even log into admin accounts on a day-to-day basis.
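
    The status-check sketch I mentioned, using the platform tools for full-disk encryption (fdesetup on macOS, manage-bde on Windows). manage-bde may want an elevated prompt, and this is illustrative rather than exhaustive:

    ```python
    # Print full-disk-encryption status on the current platform.
    import platform
    import subprocess

    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

    system = platform.system()
    if system == "Darwin":
        print(run(["fdesetup", "status"]))            # "FileVault is On." / "Off."
    elif system == "Windows":
        print(run(["manage-bde", "-status", "C:"]))   # BitLocker status for C:
    else:
        # On Linux, look for crypto_LUKS volumes as the rough equivalent.
        print(run(["lsblk", "-o", "NAME,FSTYPE,MOUNTPOINT"]))
    ```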



  • I guess the idea is that the models themselves are not infringing copyright, but the training process DID. Some of the big players have admitted to using pirated material in training data. The rest obviously did even if they haven’t admitted it.

    While language models have the capacity to produce infringing output, I don’t think the models themselves are infringing (though there are probably exceptions). I mean, gzip can reproduce infringing material too, given the right input (toy example at the end of this comment). If producing an infringing work requires both the algorithm AND specific, intentional user input, then I don’t think you should put the blame solely on the algorithm.

    Either way, I don’t think existing legal frameworks are suitable to answer these questions, so I think it’s more important to think about what the law should be rather than what it currently is.

    I remember stories about the RIAA suing individuals for many thousands of dollars per mp3 they downloaded. If you applied that logic to OpenAI — maximum fine for every individual work used — it’d instantly bankrupt them. Honestly, I’d love to see it. But I don’t think any copyright holder has the balls to try that against someone who can afford lawyers. They’re just bullies.
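
    To make the gzip point concrete (toy example, obviously, with a made-up input string):

    ```python
    # gzip reproduces whatever you fed it, byte for byte. Whether the output
    # "infringes" depends entirely on the input, not on the algorithm.
    import gzip

    original = b"pretend this is a copyrighted text"  # hypothetical input
    compressed = gzip.compress(original)
    assert gzip.decompress(compressed) == original
    ```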



  • Thanks for the info. I was not aware that Bluesky had public, shareable block lists. That is indeed a great feature.

    For anyone else like me who was not aware, I found this site with an index of a lot of public block lists: https://blueskydirectory.com/lists . I was not able to load some of them, but others did load successfully. Maybe some were deleted or are not public? I’m not sure.

    I’ve never been heavily invested in microblogging, so my first-hand experience is limited and mostly academic. I have accounts on Mastodon and Bluesky, though. I would not have realized this feature was available in Bluesky if you hadn’t mentioned it and I didn’t find that index site in a web search. It doesn’t seem easily discoverable within Bluesky’s own UI.

    Edit: I agree, of course, that there is a larger systemic problem at the society level. I recently read this excellent piece (very long but worth it!) that talks a bit about how that relates to social media: https://www.wrecka.ge/against-the-dark-forest/ . Here’s a relevant excerpt:

    If this truly is the case—if the only way to improve our public internet is to convert all humans one by one to a state of greater enlightenment—then a full retreat into the bushes is the only reasonable course.

    But it isn’t the case. Because yes, the existence of dipshits is indeed unfixable, but building arrays of Dipshit Accelerators that allow a small number of bad actors to build destructive empires defended by Dipshit Armies is a choice. The refusal to genuinely remodel that machinery when its harms first appear is another choice. Mega-platform executives, themselves frequently dipshits, who make these choices, lie about them to governments and ordinary people, and refuse to materially alter them.



  • I’d rather have something like a “code grammar checker” that highlights potential errors for my examination rather than something that generates code from scratch itself

    Agreed. The other good use case I’ve found is as a faster reference for simple things. LLMs are absolutely great for one-liners and for generating troublesome (but logically simple) things like complex XPath queries (example at the end of this comment). But I still haven’t seen one generate a good script of even moderate complexity without hand-holding. In some cases I’ve been able to get usable output within a few attempts, saving me a bit of time compared to writing the whole darned thing from scratch.

    I’ve found LLMs very useful for coding, but they aren’t replacing my actual coding, per se. They replace looking things up, like through man pages, language references, or StackOverflow. Something like ffmpeg, for example, has a million options and it is always a little annoying to sift through the docs manually when I just need to do one specific task.

    I’m sure it’ll happen sooner or later. I’m not naive enough to claim that “computers will never be able to do $THING” anymore. I’ll say “not in the next year”, though.
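
    The kind of one-liner I mean, for the record: a fiddly-but-logically-simple XPath query. The lxml snippet and the HTML are made up for illustration; the point is that the output is trivial to verify by eye.

    ```python
    # Grab the href of the <li> that carries the "active" class.
    from lxml import html

    doc = html.fromstring(
        "<ul>"
        "<li class='item'><a href='/a'>A</a></li>"
        "<li class='item active'><a href='/b'>B</a></li>"
        "</ul>"
    )
    links = doc.xpath(
        "//li[contains(concat(' ', normalize-space(@class), ' '), ' active ')]/a/@href"
    )
    print(links)  # ['/b']
    ```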


  • Just marketing nonsense. There are three ways to present AI features:

    1. A generational improvement on things that have been available for 20+ years. This is not sexy and does not make for good advertising. For example: grammar checking, natural-language speech processing (Siri), automatic photo tagging/sorting.

    2. A new type of usage that nobody cares about because they’ve lived without it just fine up to now.

    3. Straight-up lie to people about what it can do, using just enough weasel words to keep yourself out of jail.