It’s perhaps the best way for someone who has a good handle on it. The docs say it “sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.” So you would need to tune it so that it’s not grabbing objects that are irrelevant to the view, and probably exclude some file types like videos and audio (a rough starting point is sketched just after this list). If you get a well-tuned command worked out, that would be quite useful. But I do see a couple of shortcomings nonetheless:
- If you’re on a page that required you to log in and do some interactive things to get there, then I think passing the cookie from the GUI browser to wget would be non-trivial.
- If you’re on a capped internet connection, you might want to save from the browser’s cache rather than refetch everything.
But those issues aside, I like the fact that wget does not rely on a plugin.
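For example, something like this might be a starting point (very much a sketch; the URL is a placeholder and the reject list is just a guess to tune from):

```
# -m  : mirror (equivalent to -r -N -l inf --no-remove-listing)
# -k  : convert links so the local copy browses offline
# -p  : also fetch page requisites (CSS, images, scripts)
# -np : don't ascend into parent directories
# --reject : skip heavy media types
# --wait / --random-wait : go easy on the server
wget -m -k -p -np \
     --reject 'mp4,mkv,webm,avi,mov,mp3,flac,ogg' \
     --wait=1 --random-wait \
     https://example.com/some/section/
```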
wget has a `--load-cookies file` option. It wants the original Netscape cookie file format, so depending on your GUI browser you may have to convert it. I recall in one case I had to parse the session ID out of a cookie file and then build the expected format around it; I don’t recall the circumstances. Another problem: some anti-bot mechanisms crudely look at user-agent headers and block wget or curl attempts on that basis alone.
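To illustrate both workarounds (the domain, session ID, expiry timestamp, and user-agent string below are all placeholders):

```
# Netscape cookie format: 7 TAB-separated fields per line:
#   domain  include-subdomains  path  secure  expiry(unix)  name  value
printf '# Netscape HTTP Cookie File\n' > cookies.txt
printf '.example.com\tTRUE\t/\tTRUE\t1767225600\tsessionid\tabc123\n' >> cookies.txt

# Feed the cookies to wget and present a browser-like user-agent
# to get past crude UA-based blocking.
wget --load-cookies cookies.txt \
     --user-agent='Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0' \
     -m -k -p -np https://example.com/members-only/
```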
(edit) When cookies are not an issue, `wkhtmltopdf` is a good way to get a PDF of a webpage. So you could have a script do a `wget` to get the HTML faithfully, a `wkhtmltopdf` to get a PDF, then a `pdfattach` to put the HTML inside the PDF (rough sketch at the end of this post).

(edit2) It’s worth noting there is a project called `curl-impersonate`, which makes curl look more like a GUI browser so it gets more equal treatment. I don’t think it goes as far as adding a javascript engine; as I understand it, it mimics a real browser’s TLS and HTTP fingerprints rather than just the user-agent string.
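For what it’s worth, here is a rough sketch of that wget + wkhtmltopdf + pdfattach idea (URL and file names are placeholders; assumes `pdfattach` from poppler-utils, a page that needs no login, and note that wkhtmltopdf re-fetches the page, so it’s not friendly to capped connections):

```
#!/bin/sh
set -e
url='https://example.com/some/article'

# 1. Save the page and its requisites as browsable HTML
wget -E -k -p -np -P snapshot "$url"

# 2. Render the same URL to a PDF (this fetches it again)
wkhtmltopdf "$url" article.pdf

# 3. Embed the saved HTML inside the PDF as an attachment
html_file=$(find snapshot -name '*.html' | head -n 1)
pdfattach article.pdf "$html_file" article-with-html.pdf
```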