7

Introducing u/Archie: Raddle's Archive Bot [Update: Now archives everything except for a shortlist]

Submitted by leftous in RaddleBots (edited )

Hi all,

I'd like to introduce u/Archie

Using selver's Python code as a starting point, I was able to put together an archive bot that checks whether a link is external and is a corporate news site (against a list of domains), and posts an archive link in the comments.

This was my first time writing Python, so I somewhat butchered the code and want to clean it up a bit before posting it. I have it checking all/new every 5 minutes.

In the meantime, feel free to post any domains to add or any problems you notice. I don't expect there to be many issues since it's pretty straightforward.

Right now the whitelisted domains (that won't be archived) are:

"raddle.me", "coinsh.red", "youtube.com", "youtu.be"

I was thinking of adding a u/Archie command for users to generate archive links manually. I may add that in a future iteration if people are interested or think it makes sense.
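
If I do, the trigger could be as simple as scanning comment bodies for a mention; this is purely hypothetical at this point, nothing is implemented:

    import re

    # Hypothetical trigger: "u/Archie <url>" anywhere in a comment body.
    MENTION = re.compile(r"u/Archie\s+(https?://\S+)", re.IGNORECASE)

    def requested_urls(comment_body):
        """Return any URLs a user explicitly asked u/Archie to archive."""
        return MENTION.findall(comment_body)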

Comments

5

Tequila_Wolf wrote

This is really cool.

The only problem is that the bot's comments will make the sorting algorithm heavily privilege posts linking these sites. It might be worthwhile to think about ways to deal with this.

3

leftous wrote (edited )

My initial idea was to append the archive link to the post body, but I noticed this meant the submitter couldn't edit their post after the fact.

A solution could be requiring users to manually request the archive in the post body (with a command), rather than it being automatically generated.

Alternatively, if the bot just generates archive links for every external site, there would be no bias.

2

Tequila_Wolf wrote (edited )

Probably the latter is best?

If you're interested, it may also be worthwhile to coordinate with emma so that aside from other raddle addresses (and coinish and maybe others), every address submitted with a new post is automatically archived and added to the original posting as the main link.

Edit: or a checkbox with the option to do it when you make a submission?

2

leftous wrote

Ok, I will change it up so the bot archives every link except internal, coinish, or youtube links. Let me know if you think of others.

It may also be useful if Postmill had a separate "bot" role that admins could assign, one that doesn't affect the sorting algorithms or appear in /comments. I see the support site emma announced is down, so I will reach out there when it's back up.

4

HEEEEEEEEEEEYHAAAAAAA wrote (edited )

PLEASE USE HTTPS://archive.is and NOT HTTP://archive.is

2

leftous wrote (edited )

Thanks, didn't pay attention to that. I will change it now.

Edit - Done :)

4

HEEEEEEEEEEEYHAAAAAAA wrote

Also isn't the Internet Archive better for that kind of thing? (I assume the person(s) who operate(s) archive.is doesn't get as much financial support)

2

leftous wrote

Does Archive.org have an API for something like that? I wouldn't mind using it, but to my understanding it is just a crawler. You cannot request snapshots, only retrieve them.

I posted the code btw, if you're interested: https://raddle.me/f/RaddleBots/28187/code-for-the-archive-bot-u-archie

2

HEEEEEEEEEEEYHAAAAAAA wrote

For most of the links that you're mentioning, i.e.,

    def check_domain(url):
        domains = [
            "nytimes.com", "wsj.com", "cnn.com", "thetimes.co.uk", "vice.com",
            "newsweek.com", "kyivpost.com", "ft.com", "latimes.com", "nypost.com",
            "telegraph.co.uk", "independent.co.uk", "scmp.com", "nationalpost.com",
            "haaretz.com", "bostonglobe.com", "washingtonpost.com",
            "theaustralian.com.au", "wsj.com", "nytimes.com", "theglobeandmail.com",
            "theage.com.au", "smh.com.au", "www.economist.com", "reuters.com",
            "afp.com", "rt.com", "huffingtonpost.com", "aljazeera.com", "cnbc.com",
            "chicagotribune.com", "buzzfeed.com", "theguardian.com", "reddit.com",
            "cbc.ca", "bbc.co.uk", "cnet.com", "bloomberg.com", "bbc.com",
            "suntimes.com", "foxnews.com", "jpost.com", "voat.co",
        ]

I think in most cases a snapshot would already be available, with the exception of reddit and voat. Their API might be enough: https://archive.org/help/wayback_api.php
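
Checking for an existing snapshot is a single GET against that endpoint; something like this (untested sketch):

    import requests

    def wayback_snapshot(url):
        """Return the closest Wayback Machine snapshot URL, or None."""
        resp = requests.get("https://archive.org/wayback/available", params={"url": url})
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest and closest.get("available") else None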

2

leftous wrote

Ok, I'll check it out. Maybe I will see if a snapshot is available on archive.org and, if not, use archive.is.

4

Dumai wrote

thank fuck now i don't have to manually archive every goddamn premium haaretz article

3

ziq wrote

Awesome, I've been wanting this forever.

What about theguardian.com? Reddit.com too.

3

leftous wrote

Ah good idea, just perused /f/news and added a bunch.

3

ziq wrote (edited )

voat.co is also important

well that's a sentence I didn't think I'd ever type.

2

boringskip wrote (edited )

I also suggest adding archive.org links, if possible? It's a great non-profit project and I'd rather support them.

Edit: looks like i'm late to the game lol, other people suggested it

2

boringskip wrote

Fuck yeah. I assume you use a googlebot user agent to get around paywalls?

1

leftous wrote

I just used the archiveis Python library, so I don't need to access the site directly.
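
The whole archiving step is basically one call, something like:

    import archiveis

    # Ask archive.is to capture the page; returns the archive URL to post.
    archive_url = archiveis.capture("https://example.com/some-article")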