• 25 Posts
  • 46 Comments
Joined 4 months ago
Cake day: July 1st, 2024

  • wget has a --load-cookies file option. It wants the original Netscape cookie file format. Depending on your GUI browser you may have to convert it. I recall in one case I had to parse the session ID out of a cookie file then build the expected format around it. I don’t recall the circumstances.

    Another problem: some anti-bot mechanisms crudely look at user-agent headers and block curl attempts on that basis alone.

    (edit) when cookies are not an issue, wkhtmltopdf is a good way to get a PDF of a webpage. So you could have a script do a wget to get the HTML faithfully, and wkhtmltopdf to get a PDF, then pdfattach to put the HTML inside the PDF.

    (edit2) It’s worth noting there is a project called curl-impersonate, which makes curl look more like a GUI browser so that it gets more equal treatment. I think they go as far as adding a JavaScript engine or something.
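
To make the cookie-file and wget → wkhtmltopdf → pdfattach idea above concrete, here is a minimal Python sketch. It rests on assumptions: the function names, the `example.org` domain, the `sessionid` cookie name, and the file names are all hypothetical, and the cookie record layout is the usual seven tab-separated fields of the Netscape format. It only builds the command lines; it does not run anything.

```python
from pathlib import Path


def netscape_cookie_line(domain, name, value, expires, path="/", secure=True):
    """Return one record in the Netscape cookie file layout.

    Fields are tab-separated: domain, include-subdomains flag, path,
    secure flag, expiry (epoch seconds), cookie name, cookie value.
    """
    # Assumption: a leading dot on the domain means "also valid for subdomains".
    include_subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    fields = [
        domain,
        include_subdomains,
        path,
        "TRUE" if secure else "FALSE",
        str(expires),
        name,
        value,
    ]
    return "\t".join(fields)


def write_cookie_file(target, lines):
    """Write the records to a file that can be passed to wget --load-cookies."""
    header = "# Netscape HTTP Cookie File"
    Path(target).write_text("\n".join([header, *lines]) + "\n")


def archive_commands(url, stem, cookie_file=None, user_agent=None):
    """Build the three command lines: fetch HTML, render PDF, attach HTML to PDF."""
    wget = ["wget", "-O", f"{stem}.html"]
    if cookie_file:
        wget += ["--load-cookies", str(cookie_file)]
    if user_agent:
        # Some anti-bot checks only look at this header.
        wget += ["--user-agent", user_agent]
    wget.append(url)
    to_pdf = ["wkhtmltopdf", f"{stem}.html", f"{stem}.pdf"]
    # pdfattach (poppler-utils) takes: input PDF, file to attach, output PDF.
    attach = ["pdfattach", f"{stem}.pdf", f"{stem}.html", f"{stem}-with-html.pdf"]
    return [wget, to_pdf, attach]
```

Each list could then be handed to `subprocess.run(cmd, check=True)` in order. The session ID still has to be pulled out of the GUI browser by hand, which is the non-trivial part.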


  • It’s perhaps the best way for someone that has a good handle on it. Docs say it “sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.” So you would need to tune it so that it’s not grabbing objects that are irrelevant to the view, and probably exclude some file types like videos and audio. If you get a well-tuned command worked out, that would be quite useful. But I do see a couple shortcomings nonetheless:

    • If you’re on a page that required you to log in and do some interactive things to get there, then I think passing the cookie from the GUI browser to wget would be non-trivial.
    • If you’re on a capped internet connection, you might want to save from the browser’s cache rather than refetch everything.

    But those issues aside I like the fact that wget does not rely on a plugin.
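
As a starting point for the tuning mentioned above, here is a rough Python sketch that builds a narrower wget command than the mirror default. Instead of infinite depth it spells out the recursion flags with a small depth, pulls in page requisites, and rejects some media suffixes. The function name, the default depth, and the reject list are my own assumptions, not a tested recipe.

```python
def tuned_wget(url, depth=1, reject=("mp4", "webm", "mp3", "ogg"), user_agent=None):
    """Build a wget command for saving one page plus what it needs to render."""
    cmd = [
        "wget",
        "--recursive",         # -r
        "--timestamping",      # -N
        f"--level={depth}",    # -l, kept small instead of "inf"
        "--page-requisites",   # images, CSS and similar needed for the view
        "--convert-links",     # rewrite links so the local copy works offline
        "--adjust-extension",  # add .html where the server omits it
    ]
    if reject:
        # Comma-separated suffix list of file types to skip.
        cmd.append("--reject=" + ",".join(reject))
    if user_agent:
        cmd.append(f"--user-agent={user_agent}")
    cmd.append(url)
    return cmd
```

For example, `tuned_wget("https://example.org/article")` returns a list that can be passed to `subprocess.run`. It does nothing about the login cookie or the capped-connection shortcomings listed above.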



  • IIUC you are referring to this extension, which is Firefox-only (unlike the save page WE, which has a Chromium version).

    Indeed the beauty of ZIP is stability. But the contents are not. HTML changes so rapidly, I bet if I unzip an old MAFF file it would not have stood the test of time well. That’s why I like the PDF wrapper. Nonetheless, this WebScrapBook could stand in place of the MHTML from the save page WE extension. In fact, save page WE usually fails to save all objects for some reason. So WebScrapBook is probably more complete.

    (edit) Apparently webscrapbook gives a choice between htz and maff. I like that it timestamps the content, which is a good idea for archived docs.

    (edit2) Do you know what happens with JavaScript? I think JS can be quite disruptive to archival. If webscrapbook saves the JS, it’s saving an app, in effect, and that language changes. The JS also may depend on being able to access the web, which makes a shitshow of archival because obviously you must be online and all the same external URLs must still be reachable. OTOH, saving the JS is probably desirable if doing the hybrid PDF save because the PDF version would always contain the static result, not the JS. Yet the JS could still be useful to have a copy of.

    (edit3) I installed webscrapbook but it had no effect. Right-clicking does not give any new functions.








  • Your assertion that the document is malicious without any evidence is what I’m concerned about.

    I did not assert malice. I asked questions. I’m open to evidence proving or disproving malice.

    At some point you have to decide to trust someone. The comment above gave you reason to trust that the document was in a standard, non-malicious format. But you outright rejected their advice in a hostile tone. You base your hostility on a youtube video.

    There was too much uncertainty there to inspire trust. Getoffmylan had no idea why the data was organised as serialised java.

    You should read the essay “on trusting trust” and then make a decision on whether you are going to participate in digital society or live under a bridge with a tinfoil hat.

    I’ll need a more direct reference because that phrase gives copious references. Do you mean this study? Judging from the abstract:

    To what extent should one trust a statement that a program is free of Trojan horses? Perhaps it is more important to trust the people who wrote the software.

    I seem to have received software pretending to be a document. Trust would naturally not be a sensible reaction to that. In the infosec discipline we would be incompetent fools to loosely trust whatever comes at us. We make it a point to avoid trust, and when trust cannot be avoided we seek justification for it. We have a zero-trust principle. We also have the rule of least privilege, which means not extending trust/permissions where it’s not necessary for the mission. Why would I trust a PDF when I can take steps to access the PDF in a way that does not need excessive trust?

    The masses (security naive folks) operate in the reverse-- they trust by default and look for reasons to distrust. That’s not wise.

    In Canada, and elsewhere, insurance companies know everything about you before you even apply, and it’s likely true elsewhere too.

    When you move, how do they find out if you don’t tell them? Tracking would be one way.

    Privacy is about control. When you call it paranoia, the concept of agency has escaped you. If you have privacy, you can choose what you disclose. What would be good rationale for giving up control?

    Even if they don’t have personally identifiable information, you’ll be in a data bucket with your neighbours, with risk profiles based on neighbourhood, items being insured, claim rates for people with similar profiles, etc. Very likely every interaction you have with them has been going into a LLM even prior to the advent of ChatGPT, and they will have scored those interactions against a model.

    If we assume that’s true, what do you gain by giving them more solid data to reinforce surreptitious snooping? You can’t control everything, but it’s not in your interest to sacrifice control for nothing.

    But what you will end up doing instead is triggering fraudulent behaviour flags. There’s something called “address fraud”, where people go out of their way to disguise their location, because some lower risk address has better rates or whatever.

    Indeed for some types of insurance policies the insurer has a legitimate need to know where you reside. But that’s the insurer’s problem. This does not rationalize a consumer who recklessly feeds surreptitious surveillance. Street wise consumers protect themselves from surveillance. Of course they can (and should) disclose their new address via proper channels if they move.

    Why? Because someone might take a vacation somewhere and interact from another state. How long is a vacation? It’s for the consumer to declare where they intend to live, e.g. via “declaration of domicile”. Insurance companies will harass people if their intel has an inconsistency. Where is that trust you were talking about? There is no reciprocity here.

    When you do everything you can to scrub your location, this itself is a signal that you are operating as a highly paranoid individual and that might put you in a bucket.

    Sure, you could end up in that bucket if you are in a strong minority of street wise consumers. If the insurer wants to waste their time chasing false positives, the time waste is on them. I would rather laugh at that than join the street unwise club that makes the street wise consumers stand out more.











  • I should also add that some people come for asylum but do not follow the legal process because they are reasonably concerned that the process will fail to protect them (especially if they entered under the Trump regime). If someone enters without filing, then gets targeted (e.g. a hospital rats them out), and only then claims asylum, I don’t know what happens, but obviously we need the process to be competent at separating the genuine cases from the rest. I suppose that’s the scenario you are referring to.



  • In fact, borderline human rights compromise is actually a good incentive for people to leave. Would perhaps be good for the country if those in Texas who respect human rights would move from Texas to Pennsylvania for a human rights upgrade (where also the death penalty was repealed).

    But I doubt your statement is accurate considering inbound refugees are fleeing from even worse conditions w.r.t. human rights. Refugees still technically have their human right to access emergency medical treatment, they just risk getting harassed and tagged for deportation.







  • I would indeed be concerned with hosting, but to a lesser extent than email. Email service is gratis & paid for by advertising. The terms of service for email explicitly give the surveillance advertiser carte blanche on snooping and exploiting email traffic for all it’s worth, which is understood by all parties involved.

    Hosting service is a paid subscription. Hosting users have the option of controlling their own keys. It is not customary or expected for a web hosting provider to snoop on the traffic they are hosting. Unlike email snooping, I believe it would be a malicious act for a hosting provider to collect data from traffic they host. That said, internal breaches are common, like that of Capital One data being exfiltrated by a former AWS employee. So it’s not entirely wise to trust MS and Amazon not to snoop on Azure and AWS.

    Consider US 3 letter agencies doing their unlawful unwarranted snooping. Because they need to conceal their own snooping activity, they cannot liberally exploit the data they collect. They have to use parallel construction to create a legally plausible scenario by which they obtained the data. This substantially limits how they can use the data and to what extent. I think this is similar to MS’s situation with Azure. How can they use the web traffic data without revealing that they are using it? Not easy. Risks are high. Disgruntled employees tattle on their employers.

    You have to decide for yourself where to draw the line. But certainly you’re setting the bar as low as possible if you tolerate email snooping, and a bit higher if you reject email snooping but are not worried about web traffic snooping. A good place to set the bar is to reject email snooping and also reject using their website if it is hosted by GAFAM or proxied by Cloudflare (Cloudflare almost always manages the keys, thus a bit foolish to use lemmy.world).

    In the case at hand the prospective insurer blocks Tor, which again means they are demanding more info from me than contractually necessary (my IP address). So I would not use their website regardless of their hosting provider. They will charge a penalty fee for not being paperless.

    The insurance company would still likely have your data in a dodgy outsourced cloud space even if you don’t use the website. But in that case control is almost entirely out of your hands. Generally you cannot even be informed about their internal ops. The more out of your control it is, the more liable the insurance company is for misuse. If email traffic to you is abused or misused, you share the blame because you signed up for it by sharing your email address knowing that Outlook traffic is openly surveilled on the table. You willfully feed Microsoft in that case. But when you don’t know how your data is stored for their internal ops, there is nothing you can do and no decision on your part to make.


  • Every email provider is a surveillance advertiser?

    No, the insurance company only uses one email provider, which is Microsoft. Microsoft is a surveillance advertiser.

    You have to share personal information with a broker, insurance company, mortgage provider etc.

    I don’t have a problem with that. That’s need-to-know and consistent with data minimization. Of course if I don’t trust a particular company with my data I’m not going to pick up the phone and call them in the first place.

    Sometimes they ask for too much info. Some brokers ask for more than others. I walk in those cases. I will not authorize a homeowners insurer to check my credit history (only my insurance history).

    And your biggest concern is an email?

    Of course. Microsoft is a centralized surveillance capitalist who has mastered exploitation of the data it collects to the fullest extent allowed by law, and even beyond that because MS has been caught breaking the law in their exploitation of personal data. It’s reckless and stupid to put a notorious privacy offender like Microsoft in the loop on an insurance deal.