mgpalmer

My stupidly simple TODO system

Ah hello there, long time no see! Here’s a quick post about something that came up with the super nice folks I’m working with now.

We’ve been discussing self-organization techniques and I offered to share my homegrown “system” I’ve been using for some years now, so why not dust off this bloggy thing here?

I started using it originally to track what I’ve been doing at work in order to write detailed invoices, but it turned out to be simple enough to use that I’ve just kept at it out of habit. It combines very basic features of a TODO list, a notepad, and a work journal with a minimum of fuss. So here goes:

It’s a standard text file, named for the project (e.g. “kill-all-humans.txt”), and kept in a Dropbox (or whatever) folder so it’s backed up and available on multiple devices. I also use https://cryptomator.org/ to encrypt the file before it goes in the “cloud”.

As a developer, I always have a text editor open, so this file is always at hand.

In the file, I start each day with a “header” entry like this:

2020-08-24 [Start: 09:30]

and if I remember to, I end each day with

[End: 19:30]

And in between, I write down (at start of day, or just whenever I need to remember something) a line or more like this:

TODO: Deploy and monitor endpoint

And when I’ve actually done things, I change the TODO to an asterisk (*):

* Deploy and monitor endpoint

Sometimes I use an arrow -> for followup actions but there are no hard rules.

I also just copy and paste snippets of code, write down outlines of tasks, prepare commit messages, links to tickets, anything at all as a scratchpad. The only rule is to move everything that is done up in the file, and anything that’s still TODO goes downwards. For example, let’s say we have this currently:

2020-08-24
[Start: 09:30]

TODO: Standup
TODO: Deploy and monitor endpoint
TODO: Do code reviews

...

(copy-pasted stuff, other old junk, plans for world domination)

A while later, standup is done, I did the reviews, someone asked me to pair on a problem, and I remembered I need to top up the coolant fluid in the nuclear device. So I mark the done stuff as done and add the new tasks at the bottom:

2020-08-24
[Start: 09:30]

* Standup
TODO: Deploy and monitor endpoint
* Do code reviews

TODO: Pair with Kermit
TODO: Top up coolant

But the coolant is kinda urgent, so I move it to the top, and clean up:

2020-08-24
[Start: 09:30]

* Standup
* Do code reviews

TODO: Top up coolant
TODO: Deploy and monitor endpoint
TODO: Pair with Kermit

This way, my TODOs turn into a small log of my daily work almost automatically. I can just scroll down to remind myself what I was working on or what’s up next.

Occasionally I clean up the extant TODOs or clear out things that ~~people stopped yelling at me about~~ fixed themselves, or just let them turn into a kind of sedimentary layer.

Here’s a real-life example:

2020-08-24
[Start: 09:30]

* Dev tactical:
* Look at and report perf metrics
* Make ticket to fix or remove the whole is_donor/is_applicant stuff?
* Check out that one ticket NL email list
* Ask Grover to join Support meeting?
* Go over Honeybadger
* Organize vacation
* Clickup desktop?
* Get Redis stuff live
* check up on anon dump? -> all good
* Deployed and monitored Redis stuff, move back to main app followup
* Duplicate indices?
-> remove_index :debits, name: "index_debits_on_debit_collection_id", column: :debit_collection_id
* made ticket

* Deploy and monitor endpoint
* aaaaah cloudflare caching got in the way but I think I fixed it

* Worked with Gonzo on deterministic anonymization
* Add "salt" to hashing, make ticket

[End: 19:30]
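Since it's all just text, the file is also easy to script against, which is handy for the invoicing use case. Here's a minimal sketch, not something from my actual setup: the helper names hours_per_day and open_todos are made up for illustration, and it assumes the "[Start: HH:MM]" / "[End: HH:MM]" markers and "TODO:" prefixes are written exactly as shown above.

```ruby
# Hypothetical helper: returns { "YYYY-MM-DD" => hours } for every dated
# entry that has both a [Start: HH:MM] and an [End: HH:MM] marker.
def hours_per_day(text)
  hours = {}
  day = nil
  start = nil
  text.each_line do |line|
    # The date may sit on its own line or share a line with [Start: ...]
    if (m = line.match(/^(\d{4}-\d{2}-\d{2})/))
      day = m[1]
    end
    if (m = line.match(/\[Start: (\d{2}:\d{2})\]/))
      start = m[1]
    end
    if (m = line.match(/\[End: (\d{2}:\d{2})\]/)) && day && start
      start_h, start_min = start.split(":").map(&:to_i)
      end_h, end_min = m[1].split(":").map(&:to_i)
      hours[day] = ((end_h * 60 + end_min) - (start_h * 60 + start_min)) / 60.0
      start = nil
    end
  end
  hours
end

# Hypothetical helper: lists everything that is still marked as TODO.
def open_todos(text)
  text.each_line
      .select { |line| line.start_with?("TODO:") }
      .map { |line| line.sub("TODO:", "").strip }
end

log = <<~LOG
  2020-08-24
  [Start: 09:30]

  * Standup
  * Do code reviews

  TODO: Top up coolant
  TODO: Pair with Kermit

  [End: 19:30]
LOG

hours_per_day(log).each { |day, h| puts "#{day}: #{h} h" } # 2020-08-24: 10.0 h
puts open_todos(log)
```

It doesn't handle lunch breaks or days without an [End: ...] marker (those are simply skipped), so treat it as a starting point.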

“But”, I hear you say, “why not use TODO app XYZ, it’s so much slicker and has little booping sounds and syncs your tasks with your microwave?”. Well to be honest, I just can’t be bothered. And IMHO “productivity” programs are either proprietary and take down your data with them eventually, or are open-source and shit. Text files are forever.

That’s it. Bye!

More weird Ruby: Twiddle-wakka

Today I learned that the “~>” operator used in Gemfiles is called “twiddle wakka”: https://github.com/rubygems/rubygems/pull/123 , and used to be called “spermy operator”.

My day, it is made 😀

Now I’ll be off to check if our Gemfile is spermy enough…
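If you want to see what the operator actually does without touching a Gemfile, RubyGems ships the classes behind it. A small sketch (the version numbers are arbitrary examples, not from any real Gemfile):

```ruby
require "rubygems" # normally loaded already; spelled out for clarity

# "~> 2.1" means: at least 2.1, but below 3.0
loose = Gem::Requirement.new("~> 2.1")
puts loose.satisfied_by?(Gem::Version.new("2.9"))    # true
puts loose.satisfied_by?(Gem::Version.new("3.0"))    # false

# "~> 2.1.0" is stricter: at least 2.1.0, but below 2.2
strict = Gem::Requirement.new("~> 2.1.0")
puts strict.satisfied_by?(Gem::Version.new("2.1.7")) # true
puts strict.satisfied_by?(Gem::Version.new("2.2.0")) # false
```

So the number of digits you write decides which part of the version is allowed to float.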

How to send mails from Rails via a Hosteurope SMTP server

‘Sup, babes!

Just noting down something that should be trivial but has taken me an embarrassingly long time recently. Let’s write it down here for the edification of the masses, so they may be entertained and enlightened, and for me to look it up again next time.

Hosteurope, if you have an account there, provides you with an SMTP mail server which you can use to send out emails. I’m using it from Thunderbird (yeah I’m old, sue me), but you can also use the server programmatically to send out mails (I guess as long as you’re not overdoing it – this is probably not a good idea if you’re about to send out a large volume of mail, but for a hobby project it’s much simpler than setting up a dedicated mail service account like Mailgun and friends).

So yeah the principle is of course quite easy – just follow the example in the Rails guides.

Except it wasn’t for me, I had to try out a million combinations of all those options – most don’t actually change anything, it turns out. It wasn’t helpful either that the error message for most combinations was – a long hangup and then a timeout 😦

So here’s what works for me:

config.action_mailer.smtp_settings = {
  address: "<hosteurope SMTP server address for your account>",
  port: 465,
  user_name: "<username>",
  password: "<password>",
  authentication: :plain,
  enable_starttls_auto: true,
  tls: true,
}

So yay this is using TLS, and is otherwise super simple. If you don’t want to be like me and waste additional time of your life by using the wrong email server you’ve saved irresponsibly, here’s how you can get the mail server info from the confusing hellish vortex of unusability that is KIS:

Log in at https://kis.hosteurope.de/
Go to:
“Product admin”
“Domain and Mail”
“E-Mail”
“Manage e-mail accounts / Autoresponder / Filter / Webmailer”
Find the row with the relevant Email account (you likely only have one)
Click on the little (i) i-in-a-circle “Account information” button
Phew

What you want is under “Outbox”.
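Once you have the server details, it can save some hair-pulling to check them outside of Rails first. The following is only a sketch using Ruby's Net::SMTP: the method name smtp_login_ok? is made up, the arguments are placeholders for your own values, and I'm assuming the same "TLS from the first byte on port 465" setup as in the config above.

```ruby
require "net/smtp"

# Hypothetical helper: returns true if a login over implicit TLS works,
# false on any connection, TLS, or authentication error.
def smtp_login_ok?(address, port, user_name, password)
  smtp = Net::SMTP.new(address, port)
  smtp.enable_tls        # TLS for the whole connection, like `tls: true`
  smtp.open_timeout = 5  # fail fast instead of the long hang
  smtp.read_timeout = 5
  # Positional arguments: HELO domain, user, password, auth type
  smtp.start("localhost", user_name, password, :plain) { true }
rescue StandardError => e
  warn "SMTP check failed: #{e.class}: #{e.message}"
  false
end

# Usage (placeholders, fill in your own account data):
#   smtp_login_ok?("<hosteurope SMTP server address>", 465, "<username>", "<password>")
```

If this returns false with a timeout, the problem is the server address or port, not your Rails config.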

Sending mails locally without copy-paste shenanigans

So that’s that. Here’s another tip to make life easier while trying this out: Put the config block above not in config/environments/production.rb but in config/application.rb (which is shared between all environments), and add this around it:

if Rails.env.production? or (Rails.env.development? and ENV["SEND_REAL_MAILS_IN_DEVELOPMENT"] == "true")
  config.action_mailer.smtp_settings = {
    # ... same settings as above
  }
end

Then you can use the ENV switch to quickly try out sending from your local setup. For example, if you’re using Devise, you can abuse an existing local User to send yourself a mail at “yourown@email.com”:

$ SEND_REAL_MAILS_IN_DEVELOPMENT=true rails c
2.6.5 :001 > u = User.last; u.email = "yourown@email.com"; u.send_confirmation_instructions

Extra bonus tip because I love the “new” Rails encrypted credentials so much

So you probably shouldn’t check in your actual SMTP credentials to source control. Instead, use the neat encrypted storage vault of awesomeness:

if Rails.env.production? or (Rails.env.development? and ENV["SEND_REAL_MAILS_IN_DEVELOPMENT"] == "true")
  config.action_mailer.smtp_settings = {
    address: Rails.application.credentials.hosteurope[:smtp][:server],
    port: 465,
    user_name: Rails.application.credentials.hosteurope[:smtp][:user_name],
    password: Rails.application.credentials.hosteurope[:smtp][:password],
    authentication: :plain,
    enable_starttls_auto: true,
    tls: true,
  }
end

And then, on the command line:

$ EDITOR=micro rails credentials:edit

And add the actual credentials there:

hosteurope:
  smtp:
    server: "foo"
    user_name: "faa"
    password: "fii"

And now you can check it all in. Here’s a good article on the feature. Oh, and shoutout to my preferred terminal editor: https://micro-editor.github.io – because nano sucks and vim and emacs are for aliens. What an efficient way to offend many people all at once 😉
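In case the nested lookup looks mysterious: the YAML keys simply become nested hash keys. Rails does the decrypting and wrapping for you, so the sketch below only mimics the lookup with plain YAML and the placeholder values from above. It is not how Rails implements credentials internally.

```ruby
require "yaml"

# Stand-in for the decrypted credentials file (placeholder values)
decrypted = <<~CREDS
  hosteurope:
    smtp:
      server: "foo"
      user_name: "faa"
      password: "fii"
CREDS

credentials = YAML.safe_load(decrypted, symbolize_names: true)

puts credentials[:hosteurope][:smtp][:server]        # foo
puts credentials.dig(:hosteurope, :smtp, :user_name) # faa

# dig returns nil for a missing branch instead of raising NoMethodError
p credentials.dig(:hosteurope, :imap, :server)       # nil
```

The real Rails.application.credentials also answers to dig, which is a bit more forgiving when a key is missing in one environment.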

That’s that, folks, see ya next time!

A November trip: München, Zürich, Köln, Friesland, part deux

Happy new year, esteemed readers in potentia! Today we’re feeling affable once again and will simply skip the whole “why-haven’t-I-written-in-so-long dance”.

Onwards! Last time we had come to a stop in Köln, like many an ICE before us, haha. No, that’s not true at all, we were still in Zürich (maybe one should re-read one’s own post first…)

After my host ~~had thrown me out~~ was expecting other guests, I was done with the “compulsory part” and, after some dithering, decided on Köln, which I had always wanted to visit. Bonus: easy to reach from Zürich by train without endless changes.

A small digression: I spent ages researching whether you could somehow travel the route, or at least part of it, along the Rhine by inland vessel. From Zürich to Basel it’s a short train ride, and Basel and Köln are popular stops for river cruises – just unfortunately not at all in winter 😦 The cruises are also rather expensive, especially if you only want to go one way; normally they’re designed so that you board at one port and get dropped off there again.

There are a number of providers for cargo ship travel, e.g. https://frachtschiffreisen-pfeiffer.de, where you can rent a cabin. I’d be hugely tempted to chug around on one of those big bulk barges someday. Unfortunately they’re not that “user-friendly” (understandably, hauling along starry-eyed landlubbers isn’t the main business there but probably more of a well-meaning hobby) – most seem to work such that you book a rough time frame and approximate route in advance and then have to get to a port at fairly short notice. I’ll have to do that sometime, but this time it wasn’t practical…

So, Köln! First impression: the cathedral is practically built right into the main station, or the other way round 🙂 In general, to jump ahead a bit, Köln’s city centre seems much denser to me than other cities such as Berlin – which is probably down to history: such an old city, hemmed in between a river that can’t be moved and old buildings, quickly gets “compressed” and can’t sprawl out into the Brandenburg wasteland like Berlin can.

Second impression: the Kölner Dom is huge, impressive, ostentatious, and huge.

Caution: stiff neck from looking up

Here rest the Three Kings as well as various saints and whatever else was left over when sweeping up…

Pomp: Got it.

I couldn’t capture it well, unfortunately, but here is a very beautiful window with a “pixel motif”

I grew up very unreligious, and in the north to boot, where the big churches are all austere grey Protestant fortresses that radiate “refuge from the storm surge” rather than “being a Christian is a blast”. The Dom is the opposite – anything that is somehow precious, holy, or more than 1000 years old was built in or collected here. Definitely worth a visit.

What else to do in Köln on two draughty, cold days? Drink Kölsch and eat well, of course.

There they sizzle!

No matter whether it’s early or late, the first thing that happens is a Kölsch being put in front of you

Om nom nom

Absolute highlight: Gaststätte Lommerzheim, where even in the morning strapping Köln lads put a Kölsch in front of you unasked, and where they serve the thickest, most tender pork chops I have ever encountered. Definitely not healthy. And as mentioned above: everything is quite dense and tightly seated; large-framed loners like me have to be prepared for elbow contact and involuntary conversations.

Kölsch is also a great drink – crash course for the uninitiated: you get a very small glass placed on a beer mat, and with a practised flick of the wrist a ballpoint stroke is conjured onto the mat. The bill later follows from the mat, and the good mood from the glass. I find it tastes excellent, and thanks to the small glasses it’s always fresh. If you’ve had enough or want a break, you put the beer mat on top of the glass. That’s the original protocol, as far as I understood it – in many places, though, they do ask whether you really want another one, I suspect especially where lots of otherwise overwhelmed tourists drop in.

Like the Dom, the Hohenzollernbrücke is probably standard fare for the visitor. A few padlocks have been attached there.

A slightly less obvious tip from me: since my hotel was on the “wrong side”, I always walked into town across the Hohenzollernbrücke, via the also rather smart Rheinboulevard, and back over the Deutzer Brücke further south. At night, the bulk freighters mentioned above pass under the bridges through the darkness down the Rhine. From the bridges you can watch them glide by, and the chugging and the lights in the dark do the rest. If only it hadn’t been so bitingly cold…

2019-11-21 14.54.49 – More food: “Halve Hahn” at Gaffel am Dom

You might now get the impression that I spent two days just eating my way through the local cuisine and then drinking my way through the local pubs. This impression is, of course, correct. But in between I was sober enough to visit the Museum für Angewandte Kunst:

Not all exhibitions were open, but what I saw I would rather have called a “museum of industrial design”. Although I often wished for more detail on the labels, there were many interesting pieces on display. The “sphere” above, which consists of a multitude of bowls and plates of various sizes, is something I absolutely must have someday.

Other notes:

At the Sonderbar there’s a Nagelklotz (a block of wood for hammering nails into)! I haven’t seen that since the village fêtes of my childhood 😀

At HoteLux I finally got to try Chicken Kiev. The place may be trying a bit too hard to be “Soviet”, but the kitchen knows its stuff.

After all that gluttony I spent a few more days with my parents in Friesland. There it was rather ~~deadly boring~~ calm and peaceful:

The Hooksiel pedestrian zone and the Alter Hafen – of course I cheated a bit with the pictures; not all the inhabitants really dived back into the sea en masse at the end of the tourist season. But when even the only pub in the village is closed until Christmas…

And you find a nice stone and make friends with it:

2019-11-25 14.39.14 – I call it “Bricky”. Rather taciturn.

…then it’s time to head back to dirty old Berlin. And then, months later, to blog about it. Until another time!

Crowd-funding platform scraper, part 4 – Finishing the tests

Hello again everyone, let’s finish this series up!

Let’s continue from last time and speed up a bit. Until now, we had built up a little set of helper methods in the tests:

#setup is called automatically before each test, and sets up the internal test server
#run_scraper is “shelling out” to an external process and runs the scraper script against the test server, and also captures any output as well as the exit code
#assert_error is checking for a specific output on stderr and an exit code of 1

But until now, we’ve only tested a negative case, i.e. some condition where the test server only returns an HTTP error code, or a nonsensical HTTP body. Time to get positive!

The obvious first step is to add an assertion to check for output on stdout, and an exit code of 0:

def assert_success(expected_output, options = {})
  result = run_scraper(options)
  puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

Eagle-eyed readers will have noticed the addition of this line:

puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]

This is basically a debugging helper for the tests themselves. Since our shelling-out code captures all output, it so happened that during development of these tests, I was finding they would fail for some reason I couldn’t see. It turned out the test setup was causing the test to fail, and the script was producing an error message – but I didn’t see it, as stderr was captured, while the test only complained about stdout being empty. Curses. So this line (and the equivalent in the assert_error method) will output what we normally can’t see if the test fails, but only if it fails (otherwise, we’d clutter the console while running the tests every time).
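To make the problem a bit more tangible, here is a self-contained sketch of the idea. Note that run_child and debug_output_for are made-up stand-ins, not the actual #run_scraper from the test file, and I'm assuming Open3 from the standard library for the shelling out.

```ruby
require "open3"
require "rbconfig"

# Hypothetical stand-in for #run_scraper: runs a snippet of Ruby in a child
# process and captures stdout, stderr, and the exit code.
def run_child(code)
  stdout, stderr, status = Open3.capture3(RbConfig.ruby, "-e", code)
  { stdout: stdout.strip, stderr: stderr.strip, status: status.exitstatus }
end

# Returns the captured stderr as a hint, but only if stdout is not what we
# expected. Returns nil otherwise, so passing tests stay quiet.
def debug_output_for(result, expected_output)
  return nil if result[:stderr].empty? || expected_output == result[:stdout]
  "child stderr: #{result[:stderr]}"
end

# The child fails before printing anything useful to stdout
failing = run_child('abort("something broke in the setup")')
hint = debug_output_for(failing, "1 campaigns")
puts hint if hint # child stderr: something broke in the setup

# A child that prints what we expect produces no hint at all
passing = run_child('puts "1 campaigns"')
p debug_output_for(passing, "1 campaigns") # nil
```

Without the hint, all the test would report is "expected some output, got an empty string", which is exactly the situation that cost me time.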

Finally, the meat of the tests is creating a way to pre-define certain paths in the temporary server which will return specific HTTP status codes and response bodies. I’ve spent a bit of time going back and forth until I found a way which lets us define several different paths with different responses beforehand, while also having a single “default” catchall response which can be “overwritten”. This is the result for the simplest case:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
end

def start_test_server
  @app ||= proc { |env| [@response_status, {}, [@response_body]] }
  @test_server = Capybara::Server.new(@app).boot
  @testserver_url = "http://#{@test_server.host}:#{@test_server.port}"
end

def assert_success(expected_output, options = {})
  start_test_server unless @test_server
  result = run_scraper(options)
  puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

def set_up_simple_success_case
  # Set it up so there's only one page (the index page), i.e. no pagination, and only one project
  index_html = lambda do
    <<-HTML
    <html><body>
    <h1>Le index page</h1>
    <div class='campaign-details'><a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a></div>
    </body></html>
    HTML
  end
  project_1_html = lambda do
    <<-HTML
    <html><body>
    <h1>Le project 1 page</h1>
    <h5 class='campaign-goal'> € 1,000 </h5>
    <p class='remaining-amount'> Bla bla <span> €80,0 </span> </p>
    </body></html>
    HTML
  end
  @app = proc do |env|
    html = case env["REQUEST_PATH"]
    when "/project/1" then project_1_html.call
    else index_html.call
    end
    [@response_status, {}, [html]]
  end
end

def test_simple_success_case
  set_up_simple_success_case
  assert_success "1 campaigns, 1000€ total, 800€ remaining, 200€ earned"
end

Oh boy, that’s quite a jump from the error cases 😉 Let’s pick this apart a bit. The structure is pretty much the same as before, except for one addition: The #start_test_server method. This is now separate from the #setup so we can call it at a time when we need it (i.e. after test-specific setup), not always at the start of everything.

Let’s just walk through this in the order it is actually being executed:

When running the test script, the Ruby process reads all this code in, and after that minitest automatically calls first the #setup method and then the #test_simple_success_case method
#setup just sets up two default response variables. Since they are instance variables, they are “global” to this test run (and get discarded after each run)
In #test_simple_success_case we just call #set_up_simple_success_case and then the #assert_success method. #set_up_simple_success_case is only a method on its own because we will re-use this setup in other tests (not shown here)
#set_up_simple_success_case defines the @app handler as an instance variable by making the central proc that is basically our “router” in the test server. In it, we check the request env for the path of the HTTP request, and either return the HTML for the simulated “index” page, or one for a specific, hard-coded “/project/1” project page. But the HTML is not simply a block of text – we define each as a lambda, which in turn evaluates a “heredoc” (i.e. a multiline string literal), and then, we use #call on the lambda objects when the app handler decides which page is being served. Why is this so convoluted, you ask? The problem is one of timing: We need to set up the HTML responses and @app handler before we start the test server (since it’s a constructor argument), but since we link from within the HTML to other pages (we simply use the inline string interpolation inside of the heredocs: "<a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>"), we need to know the server’s @testserver_url. Which is only known after we start the server. Using a lambda lets us get around this conundrum, as the closure it generates will evaluate the instance variable @testserver_url only when the test is running, at which point the server has been started and the @testserver_url variable actually contains something useful.
That was the setup, now we continue to running the assertion helper #assert_success, to which we pass the expected output on stdout
Here, finally, and just in time, we start the internal test server via this line: “start_test_server unless @test_server” .
So let’s look into #start_test_server – this is basically the same setup we used all this time, except this picks up the @app instance variable and keeps its contents if it has already been defined (well, technically, if it’s “truthy”), but also sets up a simple default @app handler (in fact, the one we’ve been using for all the negative examples so far) if no one has bothered to do so yet.
Finally, everything is in place, and #assert_success shells out to run the actual scraper script, and compares the captured output to what we expected. The script makes an HTTP GET request to @testserver_url, which ends up in our @app handler, which executes #call on the “index” lambda, which “renders” the heredoc, which includes the @testserver_url + /project/1 as a link, which then is scraped by the script, which then, again, results in an HTTP GET request, which we again detect in the @app handler, which returns the other heredoc, which “renders” a “detail” page finally containing the amounts the script scrapes, adds up, and outputs to stdout. Phew.
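The timing trick from the walkthrough above can be shown in isolation, without Capybara or a real server. This is just a sketch: the class LazyPagesDemo, its method names, and the port number are invented for the demonstration.

```ruby
class LazyPagesDemo
  def set_up_pages
    # Lazy: the heredoc is only evaluated when #call is invoked later
    @index_html = lambda do
      <<~HTML
        <a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>
      HTML
    end
    # Eager: interpolated right now, while @testserver_url is still nil
    @eager_html = "<a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>"
  end

  def start_fake_server
    # Stand-in for booting the server; the URL is only known from here on
    @testserver_url = "http://127.0.0.1:4567"
  end

  def render
    [@eager_html, @index_html.call.strip]
  end
end

demo = LazyPagesDemo.new
demo.set_up_pages      # pages are defined first...
demo.start_fake_server # ...the URL only exists afterwards
eager, lazy = demo.render
puts eager # <a class='campaign-link' href='/project/1'>project 1</a>
puts lazy  # <a class='campaign-link' href='http://127.0.0.1:4567/project/1'>project 1</a>
```

The eager string silently ends up with a broken link (nil interpolates as an empty string), while the lambda picks up the URL because it only looks at the instance variable when it is called.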

I just realized that this explanation is much longer than the actual code 😉 I hope someone learned a bit. If anyone reads this, let me know in the comments if this was much too verbose for you, or if it cleared something up?

Something to note here is that, of course, the HTML returned by the test server here is quite different from the real crowdfunding site. However, the bare-bones elements are there with the same structure of HTML tags and attributes necessary so that the scraper finds the data during the tests. A completely different approach would be to capture and save the actual HTML from the website, and use that for these tests. That has the advantage of “freezing” the website in time and letting later developers know what the site looked like when the code was written, in case it changes later. With a “real” project, I would certainly consider this, but in this case we need to preserve the anonymity of the website, so I can’t check in real HTML. Another factor is that these examples, being the bare minimum, are just barely short enough to fit right in with the test examples (records of the real responses would need to go into files or somesuch, otherwise the test code would be unreadable), and illustrate the structure we’re looking for quite clearly. As usual, we have a tradeoff to consider.

So, this setup is the main trick of the test code. We allow each test to define its own @app handler, with semi-dynamic HTML response bodies and HTTP status codes to specific request paths, and for each case, set up a scenario we want the script to respond to accordingly. Check out the current version of the code to see all tests – I’ve added a lot more than we talked about here. Some are quite elaborate, as the real site has three levels of links, the index page with detail page links, then a bunch of pagination links, each of which returns a page with more detail pages, all of them need to be simulated at least once for the tests to be meaningful.

Let’s wrap this all up now. Just some final notes:

During coding, I used #proc and #lambda pretty interchangeably, and later cleaned it up to only use #lambda, for clarity. There are some differences between these two ways of setting up a closure in Ruby, but for our purposes here they don’t matter. I just like lambda better because it reminds me of Half-Life 😉 [Update: Actually, I went over this again and changed both #proc and #lambda to the “new” stabby lambda: -> {} Sorry, Mr. Freeman…]
I’ve also added tests for the “verbose” flag (and also the short form, “-v”), and for that refactored the debugging output to be much less, and less Ruby-specific. In fact, only the requested URLs are now printed in verbose mode, which (together with the return status code and the byte-length of the response body) is already a pretty good overview of what happens in the script.
While working on the tests, I actually found two bugs in the scraper script. One was introduced during refactoring, and was caught pretty much immediately by the tests, and the other one was an unhelpful crash that occurred when a pagination link returned something unexpected. This was caught by me methodically writing tests for each simple error condition (404 error, empty HTML) for each URL requested by the script. Hooray for TDD!
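For the curious, here is a quick sketch of the proc/lambda differences mentioned in the notes above. None of this matters for the test code, and all names are invented for the demonstration.

```ruby
pair_proc   = proc   { |a, b| "#{a.inspect}, #{b.inspect}" }
pair_lambda = lambda { |a, b| "#{a.inspect}, #{b.inspect}" }
pair_stabby = ->(a, b) { "#{a.inspect}, #{b.inspect}" }

# 1) Arguments: a proc pads missing arguments with nil, a lambda is strict
puts pair_proc.call(1) # 1, nil
begin
  pair_lambda.call(1)
rescue ArgumentError
  puts "lambda raised ArgumentError"
end

# The stabby syntax is just another way to write a lambda
puts pair_stabby.lambda? # true
puts pair_proc.lambda?   # false

# 2) return: inside a proc it returns from the enclosing method,
#    inside a lambda it only returns from the lambda itself
def return_from_proc
  callable = proc { return :from_proc }
  callable.call
  :after_proc # never reached
end

def return_from_lambda
  callable = -> { return :from_lambda }
  callable.call
  :after_lambda
end

p return_from_proc   # :from_proc
p return_from_lambda # :after_lambda
```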

Well, and that’s it! I hope someone learned something by all this, or at least was entertained. I certainly learned a lot about writing about my code, instead of just hacking it out. The next thing I want to do with this, now that we have a pretty thorough language-agnostic test harness, is to try out writing the same functionality in programming languages I’m not so fluent in (like Javascript), or ones completely new to me, like Rust!

So, see you laters, scraper-alligators!

A November trip: München, Zürich, Köln, Friesland, part 1

Today we’re doing something different, esteemed potential readership: this one is just among ourselves and was written in German 😉

In November I was travelling once again, and as part of the ongoing project here called “I’ll just write something”, you now have to listen to me. Like back in the day with Dad and the slide projector. Argh.

2019-11-14 12.16.59 – Always at the beginning and the end: Berlin Hauptbahnhof, the layer cake among railway stations.

Travelling is always a mixed bag for me. On the one hand, sooner or later I feel the urge to finally see something new again, other landscapes, other cities, to escape the cherished sameness of the Kiez. And travelling has the nice property that you are always _only_ travelling when you travel. You are a traveller, and nothing else, for a short while. All other worries stay behind; you’re occupied with the moment. Where to stay tonight, where to eat, what to eat, how does this ticket machine work, what to look at today? You may not be happy as such, and you can be stressed and rushed too, but at least it’s a different stress than usual.

On the other hand – I’m often so sluggish about actually getting going. And once I’ve had enough of home and have planned a trip, I increasingly dread the date and get proper stage fright… even though it has happened a thousand times before. On the day of departure it’s at its worst; I’d love to cancel everything, go back home and curl up. But once I’m on the road… it’s great again. I’m a traveller again. For a while, then homesickness sets in – but that’s another story.

This time a round trip by train was planned. After years of riding around by motorbike, I tried this last year, and it works great! You have to pack light, a hiking backpack helps, and it’s not exactly cheap, but you get around well, at least in Germany and Switzerland. Unlike others, I don’t grumble much about Deutsche Bahn; occasionally it loses its marbles and stands there, dismayed at itself, stammering something about reversed carriage order, but by and large it’s a bright child.

This year I had tricked myself again and arranged to meet friends in München and Zürich, so there was no turning back. The return journey was left open, but there are many exciting kilometres between Zürich and Berlin, so something was bound to turn up.

Taking the train from Berlin to München works quite well. Reunion with old friends and a hearty meal at the Brauhaus: check. The next day I had time to kill and stumbled upon the highly recommendable Deutsches Museum:

Bronze ingots, and models with astonishing detail from the mining exhibition.

You walk for over an hour through convincingly recreated mine tunnels and look at models, and at drills and the like put into context. No forced multimedia stuff, no grubby touchscreens, super.

Then on to Zürich with the DB long-distance bus. Curiously, there is no sensible rail connection, but the bus works well. Are the tracks missing? Are the mountains in the way? But the motorway exists? A mystery.

In Zürich I stay with a dear friend – luckily, because Zürich is absolutely unaffordable by Berlin standards… when a single beer already costs 8-10 francs (practically the same in euros), and a dinner without drinks 30+ francs…

Conversely it’s always great when you come home again and prices drop radically. A friend once said that after a stay in Switzerland, everything in Berlin is free, to a first approximation 😉

2019-11-16 14.17.38 – My favourite spot in Zürich: a small fountain with seating, a few steps from where I was staying. Last year, in August, I sat there in the evenings and read, by warm lantern light and the sound of splashing water.

Otherwise Zürich is wonderful. Everything is smart and clean and works (at least in the city centre), people switch seamlessly to German or English when I gape uncomprehendingly at the Swiss German (it _sounds_ somehow German, but I don’t understand a word, it’s frustrating), the food is wonderfully cheese-heavy, and there are mountains! As a child of the North Sea coast I’ve fallen in love with mountains in recent years and now need to see them every so often. A long-distance relationship, an affair even, but all the more intense when we do see each other 😉

In summer Zürich is downright Mediterranean, with the large Lake Zürich that the city wraps around, the vineyards right in between residential houses, and the architecture of small houses (more like villas… yes, this is where the money is) along very vertical streets. In late autumn, like this time, it’s rather fresh and biting. We were lucky and managed an outing up the Uetliberg in golden light:

Afterwards, cheese fondue! Never has half a kilo of cheese with half a loaf of bread tasted so good 😉 This was at Swiss Chuchi – a bit touristy and crowded, but very tasty. We had to wait for a table, in a bare hotel lobby. A bit strange at first, but then word got around that the excellent house wine is free while you wait, which quickly overcame any strangeness.

Around the lake and at the Chinagarten it’s nice to stroll. Once more wonderful light; the rest of the trip lay under leaden clouds.

2019-11-19 16.17.21 – At the Migros supermarket you have to weigh the fruit yourself – in return there are pre-weighed bunches of bananas

2019-11-19 19.22.08 — Om nom nom. Fondue 4 evar!

Und ja, ich gestehe, ich bin dann zum zweiten Mal Fondue essen gegangen, aus Gründen. Empfehlung: Zebra Bar, die liegt ein bißchen außerhalb des “schicken” Zürichs in der Nähe der Langstraße, des örtlichen Rotlichtviertels. Aber das Fondue ist schnörkellos und lecker, der Koch/Ober/Besitzer ist nett, und alles ist etwas günstiger.

Nächstes Mal der Rest der Reise.

Crowd-funding platform scraper, part 3 – what’s your exit strategy?

Since the last post was pretty long, let’s keep this one shorter and only show one addition to the script and the tests.

The assertion I wanted to tackle next is still not a positive case, but rather another error scenario (again, one that I came across during the initial development), which is that the first URL called returns an HTTP 404 error code.

For that, we need to change the testing setup from a hardcoded 200 response:

app = proc { |_env| [200, {}, ['Hello, Sailor!']] }

so that it doesn’t always return the same response code and body, but something we can control for each assertion.

This should do the trick:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
  app = proc { |_env| [@response_status, {}, [@response_body]] }
end

Since the app proc uses instance variables, we can change them in the assertions even after the setup has been run.
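Why this works: the block given to proc is a closure, so @response_status is looked up every time the proc is called, not when setup runs. Here is a minimal sketch of just that mechanism, with a made-up FakeTest class standing in for the Minitest test case (no server involved):

```ruby
# Hypothetical stand-in for the test class; it only illustrates the closure.
class FakeTest
  attr_accessor :response_status, :response_body
  attr_reader :app

  def setup
    @response_status = 200
    @response_body = 'Hello, Sailor!'
    # The proc reads the instance variables when it is called, not here.
    @app = proc { |_env| [@response_status, {}, [@response_body]] }
  end
end

test = FakeTest.new
test.setup
p test.app.call({})          # answers with the defaults from setup
test.response_status = 404   # what an assertion would do after setup
p test.app.call({})          # the same proc now answers with the new status
```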

Here is the assertion now:

def test_index_page_returns_404
  @response_status = 404
  assert_equal "Projects page returned HTTP '404' - Is the site down? Check the URL given?", run_scraper
end

Et voilà, we get an expected failure:

1) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Of course, we didn’t yet implement this specific error message. Let’s become more user-friendly:

abort("Projects page returned HTTP #{status} - Is the site down? Check the URL given?") if index.nil?

You will also notice we’ve added another little detail – we’re now exiting with a proper UNIX error code, via the #abort method.
Not only does this save us from unnecessarily deep nesting in the script, it also lets us play nice with other programs.

Here is a guide I’ve used to brush up on Ruby exit codes: https://www.honeybadger.io/blog/how-to-exit-a-ruby-program/

Thanks to this guide we now know that we should also be using stderr instead of the usual stdout to print error messages. Luckily, #abort already does this for us.
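To see both effects of #abort in one place, here is a small sketch that calls it in-process and catches the SystemExit it raises, so we can look at the message and the exit status without actually leaving Ruby. The helper name capture_abort is made up for this illustration and is not part of the scraper.

```ruby
require 'stringio'

# Hypothetical helper: runs the block and returns what it printed to stderr
# plus the exit status it tried to exit with.
def capture_abort
  original_stderr = $stderr
  $stderr = StringIO.new
  yield
  [$stderr.string, 0]          # the block finished without exiting
rescue SystemExit => e
  [$stderr.string, e.status]   # abort (like exit) raises SystemExit
ensure
  $stderr = original_stderr
end

message, status = capture_abort { abort('Projects page returned HTTP 404') }
puts "stderr: #{message.inspect}, exit status: #{status}"
```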

Oh, but when running the test, we get no output at all:

4) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Ah yes, we knew from the guide about shelling out that the `backticks` don’t capture stderr…two steps forward, one step back 🙂
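A quick way to convince yourself of the backticks behaviour is to shell out to a tiny child program that writes to both streams. This is only a sketch, assuming a POSIX shell; the 2>&1 redirection is a common workaround that folds stderr into stdout, and it is not what the test script ends up using. The helper name noisy_command is made up.

```ruby
require 'rbconfig'
require 'shellwords'

# Shell command that runs a one-line Ruby child program writing to both streams.
def noisy_command
  child = 'puts "to stdout"; warn "to stderr"'
  [RbConfig.ruby, '-e', child].shelljoin
end

plain  = `#{noisy_command}`        # only stdout comes back; stderr goes to the terminal
merged = `#{noisy_command} 2>&1`   # shell redirection folds stderr into stdout

p plain
p merged
```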

Some more googling later:

https://www.honeybadger.io/blog/capturing-stdout-stderr-from-shell-commands-via-ruby/

Two articles from Honeybadger in a row, these guys are helping us out today 🙂

So, now we know how to run a subshell properly, and capture everything we want to know:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  [stdout.strip, stderr.strip, status]
end

assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper[1]

Hah! This is much better, and also allows us to get at the exit code (now that it means something).

However, while I think it’s a good interface to return an array of results from the run_scraper method (they all belong together semantically), the access via the brackets [1] seems iffy to me – you can’t tell from looking at this what we’re trying to access there.

How about we wrap the result in a hash and then access it like so: run_scraper[:stderr]

That would be better. However, we’ll probably need to define testing helper methods sooner or later anyway to cut down on repetition – and I’d like to test the return code in one go as well, while we’re at it. In fact, let’s do both. Both is good:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

This looks so much nicer. But we still get an error in the test:

3) Failure:
TestScraper#test_index_page_has_no_projects [./test:62]:
--- expected
+++ actual
@@ -1 +1 @@
-1
+#<Process::Status: pid 22225 exit 1>

Interesting – Open3.capture3 doesn’t give us a simple integer but an object. Let’s see:

https://ruby-doc.org/core-2.5.0/Process/Status.html

I had assumed that the “status” from this line:

stdout, stderr, status = *Bundler.with_original_env

would simply be the exit status from the subshell, but it turns out to be an object with some more information. What we really want is status.exitstatus.
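Here is a tiny standalone sketch of what Open3.capture3 hands back, using a Ruby one-liner as the child process instead of the scraper. The helper name run_child and the exit code 3 are arbitrary choices for the illustration.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: run a Ruby one-liner in a child process.
def run_child(code)
  stdout, stderr, status = Open3.capture3(RbConfig.ruby, '-e', code)
  { stdout: stdout, stderr: stderr, status: status }
end

result = run_child('warn "nope"; exit 3')
p result[:status].class        # a Process::Status object, not an Integer
p result[:status].exitstatus   # the plain integer we want to compare against
p result[:status].success?     # false for any non-zero exit
```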

And now, all together:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status.exitstatus,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

def test_index_page_returns_404
  @response_status = 404
  assert_error "Projects page returned HTTP 404 - Is the site down? Check the URL given?"
end

Ah, this gives me a warm and fuzzy feeling 😉 The actual assertion helper to be re-used is short and punchy, tests two things that are always occurring together in one go, and in turn uses a helper method with a clear purpose and a structured return value.

In my experience, if you are on the right path with Ruby (and keep refactoring), what emerges is usually something like a small domain-specific language, i.e. “talking” methods, usually short, that have intuitive use and return values, even if it’s something simple without any metaprogramming. Ruby still wins all the beauty contests in my opinion.

Here’s the state as of now.

Next time: Finally finishing up all assertions.

Crowd-funding platform scraper, part 2 – A test harness with a real HTTP server, adding an option parser

Hello and welcome back again, dear potential readers. It has been some time, right? Well I have an excuse: I’ve been travelling. Maybe I’ll post some photos some day soon? Who knows!

Let’s get back to that scraping project for now. Last time we left off, we had a basically working Ruby script, but no tests at all, and some time later, while working on the tests, I spent a lot of unexpected time on an issue with bundler and running Ruby in subshells.

(Editorial note: Of course this all really happened some time ago, and not in a clean timeline as presented here. I’m now writing this from the notes I took while working on the tests a while ago)

I’d like to add tests, an options parser, and clean everything up before switching to another language, and/or refactoring the Ruby code, or trying out parallelization. Parallelizysing. Para-make-it-go-at-the-same-time.

Of course, all those hard-coded puts and pp calls will never do. I did need them for debugging while developing, though, so that seems like a strong indication that we should keep them, but they should go to the “verbose” mode, and not be enabled by default. I’m resisting the urge to keep tinkering with that, though, as we first really need some tests.

Strangely enough, even after so many years of writing oh so many tests, it still feels like a chore. They’re absolutely indispensable, and after having them any code feels so much better to work with. But still, starting to write them feels like psyching oneself to go to the gym.

First of all, we will need some kind of switch to tell the code we’re in test mode. At first I thought of having a simple env variable like “TEST_MODE=true”, but then decided on taking a page out of Rails’ book and using something like RAILS_ENV or RACK_ENV, defaulting to “non-test”.

(Another note from the editorial future: In the end, it turns out I didn’t even need those env variables and found a much nicer way. I was tempted to clean all this back and forth up a bit, for brevity, but one thing I want to do with these posts is to show how a “real” developer arrives at solutions. A pet peeve of mine is how clean-shaven and readily-sprung-from-the-brow-of-Zeus solutions in articles often look, when the reality is more like making sausage: messy. But it’s normal, and if you clean up after yourself, the result is very tasty, especially if you have some mustard… ok let’s stop that metaphor here.)

So let’s use “RUN_MODE”. When the value is “test”, we’ll mock the HTTP responses, and otherwise just run normally.
I guess env variables are a pretty good way of “communicating” with the code in a portable way, i.e. it’ll work in most programming languages. On the other hand, they kinda feel like global variables, so I’m somewhat iffy about using them too much.

But let’s also think about what we want to do. We want to have a bunch of test cases, where we define the HTTP responses in the test setup, run the scraper, and then check the stdout output (stdoutput?) for what we expect. While writing this, I realized we’ll need to not only return the body of the HTTP response, but also the status code (to test the problem I had during development, the 404 error of one of the detail pages). Possibly also HTTP headers and such.

Additionally, we’ll need to define and return several different responses in a defined order, and/or depending on the URL being called…since we might want to test that the index page contains only one, or none, or several detail page links, etc.
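To make “different responses depending on the URL” a bit more concrete, here is a rough sketch of a Rack-style app that looks up its answer by request path. The paths, bodies and constant names are all made up; the only thing borrowed from Rack is the convention that an app receives an env hash (with the path under 'PATH_INFO') and returns status, headers and body parts.

```ruby
# Hypothetical fixture table: path => [status, body]. Paths and bodies are made up.
RESPONSES = {
  '/projects'   => [200, '<a class="campaign-link" href="/projects/1">One</a>'],
  '/projects/1' => [200, '<p>Budget: 1500</p>'],
}.freeze

# Rack-style app: takes the request env, returns [status, headers, body parts].
# Unknown paths fall back to a 404, like a missing detail page would.
APP = proc do |env|
  status, body = RESPONSES.fetch(env['PATH_INFO'], [404, 'Not found'])
  [status, { 'content-type' => 'text/html' }, [body]]
end

p APP.call({ 'PATH_INFO' => '/projects' }).first
p APP.call({ 'PATH_INFO' => '/nope' }).first
```

Returning the responses in a fixed order instead would only need a counter or a queue in place of the hash lookup.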

This might be obvious to the dear readers, but I hadn’t really thought about this before. With the kind of Ruby projects I’m used to working with, there are lots of tools for this – the mocking framework in rspec (https://relishapp.com/rspec/rspec-mocks/docs) makes this pretty easy, and there are more sophisticated tools like https://github.com/bblimke/webmock and https://github.com/vcr/vcr .

However, these all work by reaching into the current Ruby process and redefining how the actual HTTP requests are being done. This works beautifully in Ruby (all you readers who are screaming about dependency injection out there, please calm down. It’s fine, actually), but will super duper not work when we try out another language.

So, another approach would be to go back to our trusty friend the env variable and start defining the HTTP response codes and bodies (probably from saved “fixture” files) there, i.e.:

FIRST_HTTP_RESPONSE_CODE=200 FIRST_HTTP_RESPONSE_BODY=./fixtures/first_response.html ./test ruby/scrape_crowdfunder

But well, just look how fugly this is. Nope nope nope. It also means that our actual application code needs to understand all these env vars and contain a lot of test-specific code. Nope nope nope.

An adventurous thought occurs: We do have the CROWDFUNDER_PROJECTS_URL variable (or, actually, the command line parameter)…what if we ran an actual HTTP server, controlled from within our test code, and pointed the script at it? It’s quite easy to run simple HTTP servers from Ruby code, and we could neatly define each response from within each test case…
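Just to show that the idea is feasible with nothing but the standard library, here is a bare-bones sketch: a TCPServer in a background thread that answers every request with one canned response, and Net::HTTP playing the part of the scraper. This is not how any of the libraries mentioned in this post do it, and the helper name start_stub_server is made up.

```ruby
require 'socket'
require 'net/http'

# Hypothetical helper: serve one canned response on a random free local port.
# Returns the server socket and the thread that answers requests.
def start_stub_server(status:, body:)
  server = TCPServer.new('127.0.0.1', 0)   # port 0 lets the OS pick a free port
  thread = Thread.new do
    loop do
      client = server.accept
      # Read and discard the request line and headers, up to the blank line.
      while (line = client.gets) && line != "\r\n"; end
      client.write("HTTP/1.1 #{status} Stub\r\n" \
                   "Content-Length: #{body.bytesize}\r\n" \
                   "Connection: close\r\n\r\n#{body}")
      client.close
    end
  end
  [server, thread]
end

server, thread = start_stub_server(status: 404, body: 'Hello, Sailor!')
url = "http://127.0.0.1:#{server.addr[1]}/projects"
response = Net::HTTP.get_response(URI(url))
puts "#{response.code} #{response.body}"
thread.kill
server.close
```

It ignores the request entirely, so it is a long way from a real test server, but it shows that status and body can be chosen from inside the test process.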

Another thought occurs: Somebody probably already did this. Let’s google…

Ok, here are some options. As expected, mostly people are using mock frameworks that work only within Ruby, but here are some ideas (found most of these following: https://stackoverflow.com/questions/10166611/launching-a-web-server-inside-ruby-tests):

These snippets: https://gist.github.com/mojavelinux/a7e0cabb1b401300a4a5f7fa1ea6689c
- they seem quite low-level, we’d have to write a lot ourselves
This is much more elaborate: https://www.fedux.org/articles/2015/08/02/setting-up-a-ruby-based-http-server-to-back-your-test-suite.html
- but it spawns a child process for the server, and that makes me think it’s probably difficult to control the HTTP responses from within the tests. The author seemed to need only a server that serves the files out of a folder. Also, not packaged as a library.
Now we’re getting somewhere: https://github.com/grosser/stub_server
- This is pretty much it…except it’s a small library that hasn’t seen much action, apparently, and the examples are only for rspec – I’d like to try minitest this time. But we’ll come back to this.
Looking at this:
- https://github.com/teamcapybara/capybara/blob/master/lib/capybara/server.rb
- https://github.com/teamcapybara/capybara/blob/master/spec/server_spec.rb
- I’m thinking it’s probably possible to just use this – we can load up the Capybara gem, but only use this class out of it. Has the advantage that Capybara is heavily battle-tested and will be kept up to date. Looking at the usage via the spec file, it seems easy to set up HTTP responses, they’re using the Rack interface.

So let’s try picking the server out of the Capybara gem. Off to the code-mobile!

(Some time in the code-mobile 🚗 later…)

With that roadblock (har har) out of the way, here is the first test code, another simple Ruby shell script. It has an accompanying Gemfile, and we’re using minitest as a simple testing framework. I’ve always used rspec in my projects before, and wanted to try this out. It works pretty nicely for a minimal setup like this (you just have to declare the class and require 'minitest/autorun' and it, well, autoruns after all the code has been read by the Ruby process. And every method that starts with test_ automagically becomes an assertion):

#!/usr/bin/env ruby
# frozen_string_literal: true

require 'rubygems'
require 'bundler/setup'
require 'pp'
require 'pry'

SCRAPER_PATH = ARGV.first
unless SCRAPER_PATH and File.exist?(SCRAPER_PATH)
  raise ArgumentError, "Please provide the scraper you want to test, i.e. './test ruby/scrape_crowdfunder'"
end

require "minitest/autorun"

class TestScraper < Minitest::Test
  def setup
    @testserver_url = "http://example.com"
  end

  def run_scraper
    path, script = SCRAPER_PATH.split("/")
    command = ""
    # We need to cd into any sub-folder(s) so the scripts there can do setup like rvm, bundler, nvm, etc.
    command += "cd #{path} && " if path
    command += "./#{script}"
    command += " #{@testserver_url}"
    # We need to have a "clean" Bundler env (i.e., forget any currently loaded gems),
    # as the script called might be another Ruby script, with its own Gemfile, and by default
    # shelling out "keeps" the gems from this test runner, making the script fail
    Bundler.with_original_env do
      `bash -c '#{command}'`
    end
  end

  def test_that_shit_works
    assert_equal "OHAI!", run_scraper
  end
end

Here is the first “successful” run of the test script:

crowdfunder_scraper $ ./test ruby/scrape_crowdfunder 
Run options: --seed 48205

# Running:

F

Finished in 0.708704s, 1.4110 runs/s, 1.4110 assertions/s.

1) Failure:
TestScraper#test_that_shit_works [./test:37]:
--- expected
+++ actual
@@ -1 +1,6 @@
-"OHAI!"
+"GET \"http://example.com\"
+[]
+[]
+[]
+0 campaigns, 0€ total, 0€ remaining, 0€ earned
+"

1 runs, 1 assertions, 1 failures, 0 errors, 0 skips

It is actually running the script, and failing because there’s too much output! Also, of course, our temporary example.com domain has no links, and so fails even more. We should have checks that the wrong domain, or a changed page, outputs something more explicit. We’ll add that to the tested code once the test actually sets something sensible.

Now to the interesting bit here – our internal test HTTP server. So far, we’ve kept requesting example.com over and over, which is hardly fair to it. And we could never change its response(s) to the one we want to simulate.

So let’s try requiring capybara/server and using its host and port in each assertion:

require "capybara/server"

def setup
  app = proc { |_env| [200, {}, ['Hello, Sailor!']] }
  server = Capybara::Server.new(app).boot
  @testserver_url = "http://#{server.host}:#{server.port}"
end

But then, we get this error:

NoMethodError: undefined method `server_port' for Capybara:Module
/home/mt/.rvm/gems/ruby-2.6.0/gems/capybara-2.14.0/lib/capybara/server.rb:63:in `initialize'
./test:25:in `new'
./test:25:in `setup'

Looking a bit at the code of the Server class, it seems to call Capybara.server_port … these methods are probably simply not defined because we are only requiring a single file out of the whole gem. I fiddled around with requiring some of the “config” files from the Capybara gem, but that didn’t seem to work. Another idea would have been to just (re-)define the Capybara module ourselves and add the methods we need by trial and error, but that seems like a long road to go down, and hard to maintain.

So we’re just loading up Capybara completely:

require 'capybara'

and the error goes away.

So close now…let’s make an actually useful assertion now. We’re setting up the server to return just a string, no HTML:

def setup
  app = proc { |_env| [200, {}, ['Hello, Sailor!']] }
  server = Capybara::Server.new(app).boot
  @testserver_url = "http://#{server.host}:#{server.port}"
end

Then we have a test case that checks that in this case, we should see a warning for the user that the website seems to be wonky:

def test_index_page_has_no_projects
  assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper
end

And when we run the tests now, they rightfully complain that the scraper doesn’t give us a useful error message, but only some junk output and a result of “0 projects”.

So now we’re doing actual TDD 🙂 We’ve tested something that is not actually a feature yet in our code. Let’s add that now:

if detail_urls.size == 0
  puts "No projects found, has the site changed? Check the URL given?"
else

Aaaaand:

3) Failure:
TestScraper#test_index_page_has_no_projects [./test:45]:
--- expected
+++ actual
@@ -1 +1,5 @@
-"No projects found, has the site changed? Check the URL given?"
+"GET \"http://127.0.0.1:36409\"
+[]
+[]
+No projects found, has the site changed? Check the URL given?
+"

Of course, this still fails because of the extra debugging output, but yaaaay!
This Capybara mini-server is great, hooray to open source!

The code is at this commit now: https://github.com/MGPalmer/crowdfunder_scraper/commit/aff8e7abddb4981fc73ea700821dd2e736213337

At this point, we have a good setup, it seems. We can now completely mock out the “real” webserver and replace it in our tests with one that we can control from within the tests (even though we currently only return ‘Hello Sailor’). And since our little command-line tool only produces output via stdout (i.e. it has no side-effects like database entries added or files written), we can completely black-box-test it.

Before we wrap this post up, let’s get this one assertion green. The code actually works, but the script produces debugging output which is 1) pretty ugly 2) not useful normally, only for the developer and only if something goes wrong. Classic case of making this optional (but still keeping it as part of the code – I don’t want to go back and add and remove debugging output every time I encounter an issue). In a web framework, we’d use a “debug” log level and the usually provided logging facilities, but here the usual thing is to add a flag to the script. AFAIK, the convention is to call it -v (and in long-form --verbose). And again, this seems like something that should have been solved a lot of times before, so some googling for “option parser command line” later, we find this:

https://www.ruby-toolbox.com/categories/CLI_Option_Parsers

It seems like this is a popular field in which to create libraries, there are a lot! We’ll just go with the built-in ‘OptionParser’ from the standard library, especially since the first example is our verbose flag.

Oh, and while we’re at it, we should also make it print out a good usage example:

require 'optparse'

options = {}
OptionParser.new do |opts|
  opts.banner = "Usage: ./scrape_crowdfunder [options] http://example.com/projects"

  opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
    options[:verbose] = v
  end
end.order! # parses ARGV in place, removing the options it recognizes

…and this is what it looks like:

crowdfunder_scraper/ruby $ ./scrape_crowdfunder --help
Usage: ./scrape_crowdfunder http://example.com/projects
-v, --[no-]verbose Run verbosely

Within the actual script, we now just use something like this:

verbose = options[:verbose]
pp(page_urls) if verbose

and now one of the tests is passing \o/
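If you want to poke at the option parsing without running the whole scraper, a small wrapper that takes the argument list as a parameter helps. This is a sketch only: parse_cli is a made-up name, and it uses the non-destructive OptionParser#parse instead of order! so it can be called repeatedly.

```ruby
require 'optparse'

# Hypothetical wrapper so the parser can be fed any argument list, not just ARGV.
def parse_cli(argv)
  options = { verbose: false }
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: ./scrape_crowdfunder [options] http://example.com/projects'
    opts.on('-v', '--[no-]verbose', 'Run verbosely') { |v| options[:verbose] = v }
  end
  rest = parser.parse(argv)   # returns whatever is left after the options
  [options, rest]
end

p parse_cli(['-v', 'http://example.com/projects'])
p parse_cli(['http://example.com/projects'])
```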

Whew, this turned out to be another long post!

Join us next time for even more tests…

Solving “Bundler::GemfileNotFound” or mysteriously missing gem

Here’s a short interlude: I was having the worst time with this error, partly because the error message is pretty misleading, and partly because I’m an idiot. One of these problems could be solved…the other has to be worked around 😉

So, the issue was that we were trying to run one Ruby script from another via the “shell-out” mechanism. There are a couple of ways to do this (Here is a good overview), but we’re using the good old `Backticks` as we are not concerned about security for now (there’s no user input, everything is hardcoded). But when running the “inner” script, we get this error:

/home/mt/.rvm/gems/ruby-2.4.2/gems/bundler-1.17.1/lib/bundler/definition.rb:32:in `build': /home/mt/Development/crowdfunder_scraper/Gemfile not found (Bundler::GemfileNotFound)

when using the “inline” Gemfile syntax, or when using a normal Gemfile:

`require': cannot load such file -- httpx (LoadError)

one of our defined gems would be mysteriously missing. When cd-ing into the “inner” directory and running the script, it works, of course.

This is of course the abridged version. I spent a whole lot of time fiddling around, trying to pinpoint the problem. A misleading thought was that the issue might stem from rvm and bundler not properly loading in the sub-shell environment, so I tried a lot of cd-ing around within the backticks: cd ruby && ./scrape_crowdfunder

To make a tedious story short, the hint that finally pointed to the solution was discovering (after converting both “outer” and “inner” scripts to a normal Gemfile again) that the same issue also occurs when running the inner script from the outer directory (both containing Gemfiles):

crowdfunder_scraper $ ./ruby/scrape_crowdfunder 
./ruby/scrape_crowdfunder:6:in `require': cannot load such file -- httpx (LoadError)

So what seems to happen is that the “inner” script keeps the gems loaded from the “outer” environment. The same thing happens when running via Backticks. Googling “ruby subshell inherits gems?” finally has the solution: Use Bundler.with_clean_env ! Go ahead and read that article, it explains the issue quite well. Basically, bundler sets up a couple of ENV variables when a Gemfile is encountered, and within the same shell and directory, doesn’t change it again. When putting the shell-out backticks within that method’s block, all is well.

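So just some additional notes here: Since that article was written, Bundler.with_clean_env has been deprecated. I’m using Bundler.with_original_env instead, which restores the environment from before Bundler was loaded (newer Bundler versions also offer Bundler.with_unbundled_env).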

And also, I made a small git repository to demonstrate and test the issue for me and anyone else: https://github.com/MGPalmer/bundler_env_error_test

In it, we have the same setup as my problem: An “outer” and an “inner” directory, each with its own Gemfile. The outer Gemfile actually requires no gems at all, the inner one wants cowsay.

In both dirs we have a script. The outer one simply prints a message and then shells out to the inner script, once within Bundler.with_original_env’s block, and once without it. The former call works, the latter reproduces the problem that the inner script can’t find the gems it wants:

bundler_env_error_test $ ./wrangler 

Let's get wranglin'.
 _______________________ 
| Moo moo I'm a cow yo. |
 ----------------------- 
      \   ^__^
       \  (oo)\_______
          (__)\       )\/\
              ||----w |
              ||     ||
Traceback (most recent call last):
	1: from ./cow:5:in `<main>'
./cow:5:in `require': cannot load such file -- cowsay (LoadError)

So there we go. Another one of these stumbling blocks when developing. I learned a little more about how Bundler works, but there was so much wasted time, a pity.

A somewhat amusing coda to this – after changing the test code from the last article to use with_original_env , the issue was solved as described above. But then suddenly, the “inner” Ruby script didn’t pick up the provided CROWDFUNDER_PROJECTS_URL env variable value anymore. For a short time I wasn’t sure I was using Ruby’s accessor to the ENV correctly, but I then realized that with_original_env was doing exactly what it says on the tin – it resets the ENV, wiping out what we added to it within the script 😀
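The effect is easy to reproduce in isolation. Here is a small sketch, assuming Bundler is installed (it ships with current Rubies) and using a made-up variable name that is set only after Bundler has been loaded:

```ruby
require 'bundler'

# Hypothetical variable name, set after Bundler took its snapshot of the ENV.
ENV['DEMO_PROJECTS_URL'] = 'http://example.com/projects'

# Returns what the variable looks like from inside the with_original_env block.
def url_seen_inside_original_env
  Bundler.with_original_env { ENV['DEMO_PROJECTS_URL'] }
end

p ENV['DEMO_PROJECTS_URL']       # visible in the current process
p url_seen_inside_original_env   # wiped inside the block
p ENV['DEMO_PROJECTS_URL']       # back again once the block has returned
```

Anything started from inside the block, like our shelled-out scraper, inherits the wiped environment, which is exactly why the URL went missing.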

I realized that it’s a much cleaner interface anyway to simply add the url as a command-line parameter, and switched the code to that. So, next time: Finally finishing the tests.

Scraping a crowd-funding platform for fun and (non-)profit, part 1

Hello again dear readers (I now actually have a couple, because I’m sending the articles to friends and forcing them to read them. Hi Oana!).
Today we are going to do another somewhat pointless project. Slightly less useless than last time, I hope!

I’ve been looking at a crowdfunding website, which collects donations to user-created good causes (“projects” or “campaigns”). I’ll omit the name and URL here – it’s not that the info is private, but it seems uncouth to point at someone specifically (it’s a smallish company). They don’t talk about it much, but according to the Terms and Conditions, they are financing themselves by taking a percentage (3-8%) of donations to projects.

So I’m curious how much money they are actually moving – knowing nothing else about their finances or business model, does it seem feasible that they are cashflow-positive?

And looking at the website, they have a paginated list of projects (of reasonable size, 8 pages of 8 projects each, i.e. up to 64 projects currently), which I will assume are all that are currently active.

On the list page, they show a countdown à la “1234€ to go”. The total budget of the project is only shown on the detail page, let’s say in our example, that would be 1500€. So that project, currently, would have 1500 – 1234 = 266€ pledged to it, earning the platform about 266 * 0.03 ~ 8€, enough for some tasty Falafel for two!
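The same back-of-the-envelope calculation as a few lines of Ruby, in case you want to play with the numbers. The method name and keyword arguments are made up for this sketch; the 3% default is the lower end of the 3-8% range from the Terms and Conditions.

```ruby
# Hypothetical helper; amounts are whole euros, fee_rate is the platform's cut.
def pledged_and_fee(budget:, remaining:, fee_rate: 0.03)
  pledged = budget - remaining
  [pledged, (pledged * fee_rate).round(2)]
end

pledged, fee = pledged_and_fee(budget: 1500, remaining: 1234)
puts "#{pledged}€ pledged, about #{fee}€ for the platform at 3%"
```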

At first, we could scrape all projects and see how much money in total has been pledged so far via the platform. This of course will miss any previous projects that have been completed (if there are any), but there is no way of knowing, so everything found out here is a lower bound. It’s mostly a finger exercise, anyway…

We could do an additional step and record a scraping run, and come back every couple of days, and then compare to see what the “velocity” of donations is, i.e. do they handle a lot of donations per day? This is of course more involved, as we’d need to record each scraping run and/or the results somewhere (a database etc.). Let’s shelve that for now. Baby steps.

But I also want to take an opportunity here to do something I think will be quite easy for me to do in Ruby (I have been scraping websites and consuming APIs with Ruby extensively in a past life), and see how it’ll work in other languages.

So let’s dump some ideas on what this ought to look like:

It should be a command-line tool, run via a single (bash) script, i.e. ./scrape_crowdfunder. It’ll write detailed debugging info to stdout when given -v as an option (note to self: Look at libraries for command-line option parsing), but otherwise will just output errors or in the best case "X campaigns, Y€ total, Z€ remaining, A€ earned".
The URL of where to start parsing will be given via env variable CROWDFUNDER_PROJECTS_URL so I can keep this out of the repository and protect the innocent.
There should be a sort of test harness which takes as input saved HTML pages for the projects index page, and one or two detail pages. A test script runs the tool, and checks the output on stdout for the expected result. This is an extreme example of “integration” testing (Is there a better name for this?), which allows us to swap different language implementations. We’ll probably have to add another env variable or something to make a switch somewhere to use the canned pages instead of the real ones.
I’ll write the tests in Ruby, because I’m most comfortable with it, and it’s well at home on the command line, and its dynamic nature works well with testing. Performance is not really a concern here.
Since we want the same test to be used on different implementations, let’s make it all one big repository. In a bigger project, we’d probably want a “main” project that pulls in the different implementations as libraries, but let’s stay lowtech for now.
Which means we can set up the tests like this: Have, under the main folder, one for the tests, and one for each language implementation:
project
  /testcode1.rb
  /testcode2.rb
  /test-data
  /ruby
  ...../scrape_crowdfunder
  /javascript
  ...../scrape_crowdfunder
etc.
then we can run the tests also via a bash script, and just give the scriptname of the runner as a parameter:
./test ./ruby/scrape_crowdfunder
We might, eventually, also look into using different HTTP clients and parallelism there (I just stumbled upon httpx – another one in a long line of Ruby HTTP clients). We’ll need to tread lightly here though, as we don’t want to hammer the site repeatedly. My impression is that there is not so much traffic going on there so we’d probably be able to visibly cause traffic spikes if we go all-out, and that would be just impolite.
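To pin down the non-verbose output from the first idea above, here is a rough sketch of how the summary line could be put together. The data shape (one hash per campaign with :budget and :remaining) and the method name are assumptions for illustration, not the scraper’s actual internals.

```ruby
# Hypothetical data shape: one hash per campaign, amounts in whole euros.
def summary_line(campaigns, fee_rate: 0.03)
  total     = campaigns.sum { |c| c[:budget] }
  remaining = campaigns.sum { |c| c[:remaining] }
  earned    = ((total - remaining) * fee_rate).round
  "#{campaigns.size} campaigns, #{total}€ total, #{remaining}€ remaining, #{earned}€ earned"
end

puts summary_line([{ budget: 1500, remaining: 1234 }, { budget: 800, remaining: 300 }])
puts summary_line([])
```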

So now we have an idea where we’re going – let’s first set up the Ruby stuff and just hack something out, and then clean it up while writing the tests. I’m usually doing tasks in this order:

Just think about the whole project and write down lotsa notes
Hack around to try out the rough edges
Start writing tests once I know what structure the code is going to take, and then go all TDD and write code and tests in lockstep

I’m suspicious of people who claim to be really doing TDD by writing the test first before anything else. Maybe this works if you are adding a routine extension to an existing project, but if you are going green-field or doing some complex new feature? It seems like an invitation to a lot of rewriting.

Of course, just now I’m also writing down these words here for the blog post, which I don’t usually do when working for a job. I wonder if I should? It would make everything a lot slower, but would provide seamless documentation over the long term. Also it seems that writing down clears up your thoughts quite a bit…Something to ponder in another post?

Let’s just get going:

$ mkdir crowdfunder_scraper
$ cd crowdfunder_scraper/
$ git init

I’m copying a .gitignore from another project, and adding a Gemfile – we’ll at least need a http library, and I hate the built-in Net::HTTP library from Ruby with a passion:

$ bundle init

At this time I remember I want to make this code public, and make a github repo and retroactively link my local one to it:

$ git remote add origin git@github.com:MGPalmer/crowdfunder_scraper.git

Now you can follow along or look at the last state here: https://github.com/MGPalmer/crowdfunder_scraper

$ mkdir ruby
$ cd ruby
$ touch scrape_crowdfunder
$ chmod +x scrape_crowdfunder

Let’s skip some more boring details. Just some notes:

I had to actually look up how to use a Gemfile outside of a framework like Rails: https://bundler.io/v2.0/guides/bundler_setup.html
I had to look up the old shebang thing: https://stackoverflow.com/questions/17447532/what-is-the-use-of-usr-local-bin-ruby-w-at-the-start-of-a-ruby-program
Added https://nokogiri.org as we’ll need to parse the HTML.
Obligatory HTML parsing reference: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

After some back and forth, here we are:

Gemfile:

# frozen_string_literal: true

source "https://rubygems.org"

git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }

gem "httpx"
gem "nokogiri"

scrape_crowdfunder:

#!/usr/bin/env ruby

require 'rubygems'
require 'bundler/setup'
require 'httpx'
require 'nokogiri'

puts "Hello World"

Aannnnnnd:

$ ./scrape_crowdfunder
Hello World

Yaaaay we got a working Ruby script with Bundler.

At this point I go into irb, load up the Gemfile, and fiddle around.

$ irb
2.4.2 :002 > require 'rubygems'; require 'bundler/setup'; require 'httpx'; require 'nokogiri'
2.4.2 :003 > page = Nokogiri.parse(HTTPX.get("https://example.org").to_s).css("a.clicky")
etc.

(fiddle fiddle fiddle)

Yay, got it working!

Here’s the current code, left intentionally ugly:

Check it out on GitHub: https://github.com/MGPalmer/crowdfunder_scraper/commit/1784bd1b18db4230b1b099f1c943ea5d7b883413


#!/usr/bin/env ruby
# frozen_string_literal: true

require 'rubygems'
require 'bundler/setup'
require 'httpx'
require 'nokogiri'
require 'pp'

index_url  = ENV['CROWDFUNDER_PROJECTS_URL']
index_html = HTTPX.get(index_url).to_s
index      = Nokogiri.parse(index_html)

def get_n_parse(url)
  res = HTTPX.get(url)
  unless res.status == 200
    puts "AAAAAAAAAA HTTP error for #{url} - #{res.status}"
    return nil
  end
  Nokogiri.parse(res.body.to_s)
end

def parse_detail_page_urls(page)
  page.css('.campaign-details a.campaign-link').map { |a| a[:href] }
end

pages = [index]
page_urls = index.css('ul.pagination a.page-link').map { |a| a[:href] }
pp(page_urls)
page_urls.each do |page_url|
  pages << get_n_parse(page_url)
end

pages.compact!

detail_urls = []
pages.each do |page|
  detail_urls += parse_detail_page_urls(page)
end

pp(detail_urls)

campaigns = detail_urls.map do |detail_url|
  puts detail_url
  page = get_n_parse(detail_url)
  next unless page

  campaign_goal    = Integer(page.css('h5.campaign-goal').text.gsub(/€|,/, ''))
  remaining_amount = Integer(page.css('p.remaining-amount').inner_html.gsub(',', '').scan(/€(\d+)?\s/m).flatten.first)
  {
    url: detail_url,
    campaign_goal: campaign_goal,
    remaining_amount: remaining_amount
  }
end.compact

pp(campaigns)

count     = campaigns.size
total     = campaigns.inject(0) { |t, n| t + n[:campaign_goal] }
remaining = campaigns.inject(0) { |t, n| t + n[:remaining_amount] }

puts "#{count} campaigns, #{total}€ total, #{remaining}€ remaining, #{total - remaining}€ earned"
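
The two amount lines are the regexp-heavy part of the script. Here is a minimal sketch of how that parsing could live in a small function that can be poked at without any HTTP calls. It is not what the script does: `parse_euro_amount` is a made-up name, the regexp is a simplified stand-in, and the sample strings are assumptions about the markup (comma as thousands separator, no cents), not copies of the real pages.

```ruby
# frozen_string_literal: true

# Hypothetical helper: returns the first euro amount in a text snippet
# as an Integer, or nil when there is none.
# Assumes amounts look like "€25,000" (thousands commas, no cents).
def parse_euro_amount(text)
  match = text.match(/€\s*(\d[\d,]*)/)
  return nil unless match

  Integer(match[1].delete(','), 10)
end

p parse_euro_amount('€25,000')                    # => 25000
p parse_euro_amount("\n  €1,234 still needed\n")  # => 1234
p parse_euro_amount('no amount here')             # => nil
```

Returning nil instead of raising means the caller decides what a page without an amount should count as.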

Some notes:

It was a little tricky getting the first-page, then each-pagination, then each-detail links right.
Stumbled hard over one 404 page: httpx will happily give you a “” body for that. This needs to be tested, i.e. the tests should include cases where each of the HTTP calls returns an error, and check that the script doesn’t choke on them.
The markup is a bit of a bitch for the amounts (total and pledged) – had to use some regexps which are a little more complex than I’m really comfortable with. This needs to be tested thoroughly so we can refactor it later.
Should’ve first added verbose mode and a trigger for it, I ended up throwing puts and pp around a lot.
Also adding and using a debugger would have helped a lot, I didn’t want to slow down for that…
The script runs for quite a while – it has to do a couple of dozen HTTP calls, and when the code fails in one of the later ones, it’s a real PITA to have to re-run everything.
I’ve moved repeated code into methods, but of course nothing is properly organized.
But note how everything happens in discrete steps, and collects data from the previous step, making it easy to inspect the data at each point, and only in the end summing up the derived information we actually want.
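
Two of the notes above (the missing verbose switch and the pain of re-running every HTTP call) could be handled roughly like the sketch below. Everything in it is an assumption for illustration: the `VERBOSE` environment variable, the `CACHE_DIR` location, and the `log` and `fetch_cached` helpers do not exist in the script. The actual fetching is passed in as a block, so the sketch runs without httpx.

```ruby
# frozen_string_literal: true

require 'digest'
require 'fileutils'
require 'tmpdir'

# Assumed names, not part of the real script.
VERBOSE   = ENV['VERBOSE'] == '1'
CACHE_DIR = File.join(Dir.tmpdir, 'crowdfunder_scraper_cache')

# Prints to stderr only when VERBOSE=1 is set.
def log(message)
  warn(message) if VERBOSE
end

# Returns the cached body for url, or calls the block to fetch it.
# Nil and empty bodies are never stored, so a failed request gets retried.
def fetch_cached(url, cache_dir: CACHE_DIR)
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, Digest::SHA256.hexdigest(url))
  if File.exist?(path)
    log("cache hit  #{url}")
    return File.read(path)
  end

  log("cache miss #{url}")
  body = yield(url)
  File.write(path, body) unless body.nil? || body.empty?
  body
end

# Usage with a stand-in fetcher and a throwaway cache directory.
Dir.mktmpdir do |dir|
  calls = 0
  2.times do
    fetch_cached('https://example.org/page/1', cache_dir: dir) do
      calls += 1
      '<html>ok</html>'
    end
  end
  puts "fetched #{calls} time(s)"
end
```

In the script, the block would wrap the HTTP call and return the body as a string. Skipping empty bodies also sidesteps the 404 case from the notes, where the body comes back as “”.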

But we want the numbers now! After running, and omitting all the debugging output:

57 campaigns, 2829964€ total, 2818168€ remaining, 11796€ earned

At, let’s say, 3% commission, this means the current projects are earning 11796 * 0.03 ~ 353€

This buys you a lot of Falafel, but it’s not much to run a company on. But of course everybody has to start small, and again, we really don’t have all the facts here. But hey, our code works, even though the numbers it produces might be meaningless. Ready for a career in business intelligence 😉
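
Since the commission rate is a guess anyway, here is a minimal sketch of the same back-of-the-envelope math with the rate as a parameter. The totals are the ones printed by the script above; the `commission` helper and the list of rates are made up for illustration.

```ruby
# frozen_string_literal: true

# Figures taken from the script output above.
total     = 2_829_964
remaining = 2_818_168
earned    = total - remaining # 11796

# Hypothetical helper: commission in euros for a whole-number percentage.
def commission(earned, percent)
  (earned * percent / 100.0).round(2)
end

[3, 5, 10].each do |percent|
  puts "#{percent}% commission: #{commission(earned, percent)}€"
end
```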

Tune in next time when we clean up this mess, and add tests!