For years I’ve looked, on and off, for web archiving software that can capture most sites, including “complex” ones with lots of AJAX that require logins, like Reddit. Which ones have worked best for you?

Ideally I want one that can be started programmatically or via the command line, opens a Chromium instance (or any browser), and captures everything shown on the page. I could also open the instance myself to log into sites and install add-ons like uBlock Origin. (By the way, archiveweb.page must be started manually.)
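To make the idea concrete, here’s a rough sketch of the kind of programmatic capture I mean, driving Chromium’s documented headless flags from Python. The helper names (`build_capture_cmd`, `capture`) and the profile path are just illustrative; the flags themselves (`--headless`, `--dump-dom`, `--user-data-dir`) are real Chromium switches:

```python
import subprocess

def build_capture_cmd(url, profile_dir="/tmp/archive-profile"):
    # --headless renders without a window, --dump-dom prints the fully
    # rendered DOM, and --user-data-dir points at a profile where logins
    # and add-ons like uBlock Origin have already been set up manually.
    return [
        "chromium",
        "--headless",
        f"--user-data-dir={profile_dir}",
        "--dump-dom",
        url,
    ]

def capture(url, out_path):
    # Run Chromium and write the rendered DOM to an HTML file.
    result = subprocess.run(
        build_capture_cmd(url), capture_output=True, text=True, check=True
    )
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(result.stdout)
```

Note that `--dump-dom` only serializes the HTML after rendering; it doesn’t bundle images, CSS, or other subresources the way SingleFile or a WARC-based tool does.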

  • N0x0n@lemmy.ml
    4 days ago

For Reddit, SingleFile HTML pages can be 20 MB per file! That’s huge for a simple discussion…

To shrink that bloated but still relevant site, redirect to any still-working alternative front-end like https://github.com/redlib-org/redlib or old Reddit, and your files drop to less than 1 MB each.
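The redirect itself can be as simple as a host swap before you feed the URL to your archiver. A minimal sketch with Python’s stdlib, assuming old.reddit.com as the lighter front-end (for Redlib you’d substitute the hostname of whichever instance you use):

```python
from urllib.parse import urlsplit, urlunsplit

def to_lightweight(url, front_end="old.reddit.com"):
    # Swap the host of a reddit.com link for a lighter front-end
    # (old Reddit here; a Redlib instance hostname works the same way).
    parts = urlsplit(url)
    if parts.netloc in ("www.reddit.com", "reddit.com"):
        parts = parts._replace(netloc=front_end)
    return urlunsplit(parts)
```

Non-Reddit URLs pass through unchanged, so it’s safe to run on a mixed list of links.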

    • klangcola@reddthat.com
      1 day ago

      SingleFile provides a faithful representation of the original webpage, so bloated webpages are indeed saved as bloated html files.

On the plus side you’re getting an exact copy; on the downside, an exact copy may not be necessary and takes a huge amount of space.

      • N0x0n@lemmy.ml
        1 day ago

You’re right! And because OP wants to archive Reddit pages, I proposed an alternative that reduces that bloated site to a minimum :).

From my tests, it can go from 20 MB down to around 700 KB. IMO that’s still big for a chat conversation, but the readability of the alternative front-end is a plus!