Welcome to the Club de TeleMatique mast bash crash course!
Why bash? Besides the fact that I like bash, you will have a shell on any Unix machine, and the script can easily be adapted to another shell. Most of the other scripts I found on the internet are Python scripts. Nothing wrong with that, but if Python isn't on your machine, or isn't a supported version...
Lesson 1: The Foundation
- What you need:
- Vocabulary: in programming, you need to understand the 4 following words:
- constant: things that don't change
- variable: things that do change
- type: the sort of thing a value is (a number, a piece of text, a list...)
- function: piece of code returning a result
Some languages don't have constants, so constants are just variables that don't change (ideally). (*)
(*) Hence the famous programmer joke: constants aren't - variables won't.
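To make the four words concrete, here is a minimal sketch of how each one looks in bash. The names (SITE, count, total, urls, greet) are made up for illustration and are not part of the script we build later.

```bash
# constant: readonly makes bash refuse any later assignment
readonly SITE="https://www.lessentiel.lu/de"

# variable: a value that changes while the script runs
count=0
count=$((count + 1))

# type: bash values are strings by default; declare -i marks an integer
# and declare -a an indexed array
declare -i total=5
total+=2                      # integer addition because of -i
declare -a urls=("one" "two")

# function: a named piece of code; the usual way to "return" a result
# in bash is to print it and capture the output
greet () {
    echo "Hello, $1"
}

message=$(greet "club")
echo "$SITE $count $total ${#urls[@]} $message"
```

Bash has no real return values beyond an exit status, which is why the function prints its result and the caller captures it with $( ).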
Lesson 2: The basics
Install madonctl somewhere it will be found (i.e. in your PATH). That's where you'll place your bash script as well.
What we are going to write is a simple script that:
- gets the news from a newspaper website
- checks if it has been posted in the previous run
- posts whatever hasn't been posted before
- saves what has been posted for the next run to check
Nothing fancy: the idea is to give you the basics. You can of course improve and modify it to your taste.
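The four steps above can be sketched offline before touching the real website. Everything here is a stand-in: fetch_news fakes the download, the example.org addresses are invented, and echo replaces the actual posting. It also uses grep -Fxq for the "already posted?" check, which is a shortcut and not the loop used later in this course.

```bash
seen_file=$(mktemp)                      # what was posted in earlier runs
echo "https://example.org/old-story" > "$seen_file"

fetch_news () {                          # step 1: get the news (faked here)
    echo "https://example.org/old-story"
    echo "https://example.org/new-story"
}

posted=()
while read -r url; do
    # step 2: skip anything already recorded by a previous run
    # (-F fixed string, -x whole line, -q quiet)
    if grep -Fxq "$url" "$seen_file"; then
        continue
    fi
    echo "would post: $url"              # step 3: post (dry run)
    posted+=("$url")
    echo "$url" >> "$seen_file"          # step 4: remember it for next time
done < <(fetch_news)

echo "${#posted[@]} new item(s)"
```

Running it a second time against the same seen_file would post nothing, which is exactly the behaviour we want from the real script.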
Lesson 3: The script
- First we download the page with curl (a standard unix client for urls).
curl "https://www.lessentiel.lu/de"
What we get is an HTML page. Pretty much everything is mangled together, so we first need a better idea of what is really in there. To do that, instead of dumping the content directly to the screen, we will first filter it by splitting it at the HTML tags.
We create a small function that does just that:
xmlgetnext () {
    local IFS='>'
    read -d '<' TAG VALUE
}
And do the same again:
curl "https://www.lessentiel.lu/de" | while xmlgetnext ; do echo $TAG ; done
Much better. We now have all the tags clearly on the screen, and we can see that all the articles we want start with a href="/de/story
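If you want to see what xmlgetnext does without hitting the website, you can feed it a small piece of HTML by hand. The sample string below is invented for the demo; only the function itself comes from this lesson.

```bash
xmlgetnext () {
    local IFS='>'             # split what was read at the first '>'
    read -d '<' TAG VALUE     # read up to the next '<': tag, then its text
}

# Invented sample, shaped like the links on the real page
sample='<html><a href="/de/story/sample-story-123" class="x">Sample</a><p>Other</p></html>'

# One tag per line, exactly as in the curl pipeline
tags=$(printf '%s' "$sample" | while xmlgetnext; do echo "$TAG"; done)
echo "$tags"

# The text between the tags ends up in VALUE
printf '%s' "$sample" | while xmlgetnext; do
    if [ -n "$VALUE" ]; then
        echo "text: $VALUE"
    fi
done
```

Each call reads everything up to the next '<', so TAG holds the tag with its attributes and VALUE holds the text that followed it.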
- Then we parse it and get what we want, i.e. the news
So let's just do that:
curl "https://www.lessentiel.lu/de" | while xmlgetnext ; do echo $TAG ; done | grep '^a href="/de/stor'
grep is the standard Unix utility for finding the lines that match a regular expression. The regular expression is what follows the command, and here it means: a href etc. at the beginning of the line (^).
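Here is a small offline check of what the ^ anchor buys us. The four sample lines are invented: one is a story link, one only mentions a story link in the middle of the line, one is a link to another section, and one is a closing tag.

```bash
sample=$(printf '%s\n' \
    'a href="/de/story/one-111111111111" class="x"' \
    'p see a href="/de/story/two-222222222222"' \
    'a href="/de/sport" class="x"' \
    '/a')

# Only lines that BEGIN with the pattern are kept
matches=$(printf '%s\n' "$sample" | grep '^a href="/de/stor')
echo "$matches"
```

Without the ^, the second line would match as well, because the same text appears in the middle of it.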
Almost there. Now all we need to do is send that to a variable instead of the screen, and we're done for this part:
Info=`curl --silent "https://www.lessentiel.lu/de" | while xmlgetnext ; do echo $TAG ; done | grep '^a href="/de/stor' `
The result inside the Info variable is:
a href="/de/story/vollsperrung-auf-der-a6-nach-schwerem-unfall-928128456251" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/ukraine-newsticker-240655635662" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/flugzeug-der-luxemburgischen-mannschaft-wegen-unwetter-umgeleitet-991648887208" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/nadal-zum-14-mal-koenig-von-paris-finalsieg-gegen-norweger-ruud-681901613911" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/die-bilder-des-tages-867096646033" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/promi-ticker-500140628298" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/maskenpflicht-in-bus-und-bahn-soll-naechste-woche-fallen-901618527805" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/baby-und-mann-sterben-bei-frontalzusammenstoss-696983931160" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/unbekannte-feuern-in-us-stadt-in-menge-drei-tote-und-elf-verletzte-709366376325" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/petruss-kasematten-erstrahlen-wieder-in-vollem-glanz-832051921241" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/fuerstin-charlene-von-monaco-hat-sich-mit-corona-infiziert-488796398755" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/containerlager-brennt-nach-explosion-lichterloh-mindestens-49-tote-481372111006" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/nur-ein-index-wird-vorerst-auf-eis-gelegt-769164783135" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/punks-feiern-chaostage-ausflug-und-nehmen-ferieninsel-sylt-in-beschlag-678833531630" class="sc-1vcyo0a-1 hyTLuW"
a href="/de/story/mann-erhaelt-am-steuer-oralsex-und-verliert-bei-unfall-fast-seinen-penis-951729532003" class="sc-1vcyo0a-1 hyTLuW"
Which is nice, but not quite what we want yet, which is a nicely formatted URL. However, we can see that the URLs we are looking for are (partially) the 2nd, 6th, 10th, 14th, etc. parts of our huge string.
Time to cut it into pieces and format it nicely while storing it in an array.
- We clean the result and format it
First we select the parts we are interested in from the huge Info string. For that we use awk because it is faster than the standard Unix utility "cut". We send each piece to a temporary string, which we then clean and format as a nice URL before putting it in an array.
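To see why the interesting parts sit at positions 2, 6, 10 and so on, here is a small sketch with an invented two-article Info string. Each article contributes four space-separated fields (a, href=..., class="c1, c2"), so the href is always the second field of its group of four.

```bash
# Invented sample with the same shape as the real Info variable
Info='a href="/de/story/first-story-111111111111" class="c1 c2"
a href="/de/story/second-story-222222222222" class="c1 c2"'

# Leaving $Info unquoted lets the shell squash the newlines into spaces,
# so awk sees one long line of fields
second=$(echo $Info | awk '{print $2}')
sixth=$(echo $Info | awk -v i=6 '{print $i}')   # same thing, field number passed in
count=$(echo $Info | awk '{print NF}')          # NF = number of fields

echo "$second"
echo "$sixth"
echo "$count fields"
```

The -v i=6 form is what makes it possible to pick the field number from a loop variable.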
- We compare with what was posted before (which we load) and select those which haven't been posted yet
Loading:
ReadDB () {
    while read -r line ; do
        db[$nbdb]=$line
        nbdb=$((nbdb+1))
    done < "$previous"
}
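One thing to watch for: on the very first run the file in $previous does not exist yet, and the redirection fails. A minimal sketch of one way around that, assuming a hypothetical file name posted.db in a temporary directory (the touch line is my addition, not part of the function as given):

```bash
workdir=$(mktemp -d)
previous="$workdir/posted.db"     # hypothetical database file name

db=()
nbdb=0
ReadDB () {
    # First run: create an empty database instead of failing
    [ -f "$previous" ] || touch "$previous"
    while read -r line ; do
        db[$nbdb]=$line
        nbdb=$((nbdb+1))
    done < "$previous"
}

ReadDB                            # nothing to load yet
first_run=$nbdb

# Pretend an earlier run posted two (invented) URLs, then load again
printf '%s\n' "https://example.org/a" "https://example.org/b" >> "$previous"
ReadDB
echo "first run: $first_run entries, later run: $nbdb entries"
```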
For the sorting, we take each URL and check whether it is in our database. If it is, we exit the loop; if it isn't, we put it on the pile to post along with the text extracted from it, and we move on to the next one.
SortRSS () {
    dejala=0
    for (( i=0; i < $nbur ; i++)); do
        for (( j=0; j < $nbdb ; j++)); do
            if [[ "${ur[$i]}" = "${db[$j]}" ]] ; then
                dejala=1
                break
            fi
        done
        if [[ "$dejala" = "0" ]] ; then
            post[$nbpost]=${ur[$i]}
            t=`echo ${ur[$i]} | cut -c36- `
            txt[$nbpost]=`echo ${t:0:${#t}-12} | sed 's/-/ /g'`
            nbpost=$((nbpost+1))
        else
            dejala=0
        fi
    done
}
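The text extraction inside SortRSS is the least obvious part, so here it is on its own with one URL built from the Info listing. The second half is an alternative of mine using only bash parameter expansion; it does not depend on the exact length of the site address.

```bash
url="https://www.lessentiel.lu/de/story/vollsperrung-auf-der-a6-nach-schwerem-unfall-928128456251"

# As in SortRSS: drop the first 35 characters (everything up to /story/),
# drop the 12 trailing digits, then turn dashes into spaces
t=$(echo "$url" | cut -c36-)
with_cut=$(echo "${t:0:${#t}-12}" | sed 's/-/ /g')

# Alternative sketch: pure parameter expansion
slug=${url##*/}            # keep what follows the last '/'
title=${slug%-*}           # drop the last '-' and the digits after it
title=${title//-/ }        # replace every '-' with a space

echo "[$with_cut]"
echo "[$title]"
```

The brackets in the output make it easy to spot the one difference: the cut version keeps a trailing space where the last dash was.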
- We post the selected and save them for the next run
Post () {
    for (( i=0; i < $maxpost ; i++)); do
        line="L'Essentiel: ${txt[$i]} ${post[$i]} #news #Luxemburg"
        echo ${post[$i]} >> $previous
        madonctl toot "$line"
    done
}
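Before letting the script loose on your account, a dry run is handy. In this sketch the toot function is a stand-in for the madonctl call, the example.org data is invented, and maxpost is computed as "at most limit posts per run". How maxpost is really set is not shown in this lesson, so treat that line as an assumption.

```bash
previous=$(mktemp)
post=("https://example.org/a" "https://example.org/b" "https://example.org/c")
txt=("story a" "story b" "story c")
nbpost=${#post[@]}

limit=2                                          # hypothetical cap per run
maxpost=$(( nbpost < limit ? nbpost : limit ))   # never more than we have

sent=()
toot () {                                        # stand-in for: madonctl toot "$1"
    sent+=("$1")
    echo "would toot: $1"
}

Post () {
    for (( i=0; i < maxpost; i++ )); do
        line="L'Essentiel: ${txt[$i]} ${post[$i]} #news #Luxemburg"
        echo "${post[$i]}" >> "$previous"
        toot "$line"
    done
}

Post
```

Once the output looks right, swapping the body of toot for the real madonctl call is the only change needed.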
For the cleaning step, we take off the first 6 characters (href=") and the last one (the closing quote), and put the website address in front. The path we extracted already starts with /de, so only the host is needed:
for ind in 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 ; do
    t=`echo $Info | awk -v i=$ind '{print $i}'`
    ur[$nbur]="https://www.lessentiel.lu${t:6:${#t}-7}"
    nbur=$((nbur+1))
done
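The ${t:6:${#t}-7} expansion reads "start at offset 6 and keep the length of t minus 7 characters". A minimal sketch with two invented fields shows the effect. Because the extracted path in the listing above already begins with /de, this sketch puts only the host in front of it.

```bash
# Invented fields, shaped like what awk hands us
fields=(
    'href="/de/story/first-story-111111111111"'
    'href="/de/story/second-story-222222222222"'
)

ur=()
for t in "${fields[@]}"; do
    # offset 6 skips href=" and length-7 also drops the closing quote
    path=${t:6:${#t}-7}
    ur+=("https://www.lessentiel.lu${path}")
done

printf '%s\n' "${ur[@]}"
```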
You can download the full script. It is a bit longer because it also contains a procedure to purge the database every 3 days to prevent it from growing too big.
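The purge procedure itself is not shown here, so the following is only a sketch of one way it could work, not the code from the full script. It assumes a hypothetical stamp file holding the time of the last purge, and it keeps the most recent lines instead of emptying the file, so that articles still on the front page are not posted again.

```bash
workdir=$(mktemp -d)
previous="$workdir/posted.db"        # hypothetical database file
stamp="$workdir/last_purge"          # hypothetical: time of the last purge
max_age=$((3 * 24 * 60 * 60))        # 3 days, in seconds
keep=2                               # lines to keep (tiny, for the demo)

PurgeDB () {
    local now=$1 last=0
    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
    fi
    if (( now - last < max_age )); then
        return 1                     # purged recently, nothing to do
    fi
    # Keep only the newest entries, then record when we did it
    tail -n "$keep" "$previous" > "$previous.tmp"
    mv "$previous.tmp" "$previous"
    echo "$now" > "$stamp"
}

printf '%s\n' one two three four five > "$previous"
if PurgeDB "$(date +%s)"; then
    echo "purged, $(grep -c '' "$previous") lines left"
fi
```

Passing the current time in as an argument makes the function easy to try out with made-up dates.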
xpost from X.