December 2019 – monogreen.de

Crowd-funding platform scraper, part 4 – Finishing the tests

Hello again everyone, let’s finish this series up!

Let’s continue from last time and speed up a bit. So far, we have built up a little set of helper methods in the tests:

#setup is called automatically before each test, and sets up the internal test server
#run_scraper is “shelling out” to an external process and runs the scraper script against the test server, and also captures any output as well as the exit code
#assert_error is checking for a specific output on stderr and an exit code of 1

But until now, we’ve only tested a negative case, i.e. some condition where the test server only returns an HTTP error code, or a nonsensical HTTP body. Time to get positive!

The obvious first step is to add an assertion to check for output on stdout, and an exit code of 0:

def assert_success(expected_output, options = {})
  result = run_scraper(options)
  puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

Eagle-eyed readers will have noticed the addition of this line:

puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]

This is basically a debugging helper for the tests themselves. Since our shelling-out code captures all output, it so happened that during development of these tests, I was finding they would fail for some reason I couldn’t see. It turned out the test setup was causing the test to fail, and the script was producing an error message – but I didn’t see it, as stderr was captured, while the test only complained about stdout being empty. Curses. So this line (and the equivalent in the assert_error method) will output what we normally can’t see if the test fails, but only if it fails (otherwise, we’d clutter the console while running the tests every time).
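To make the idea a bit more tangible, here is a minimal sketch of that debugging line pulled out into a shared helper, so both assertion methods could use it. The method name print_hidden_output and the io parameter are made up for this illustration and are not part of the actual test code. It also skips empty strings, since an empty string is truthy in Ruby and the one-liner above would print a blank line in that case:

```ruby
require "stringio"

# Hypothetical helper (not in the original tests): print the captured output
# we normally can't see, but only when the assertion is about to fail.
def print_hidden_output(hidden, expected, actual, io: $stdout)
  return if expected == actual           # test passes: keep the console clean
  return if hidden.nil? || hidden.empty? # nothing was captured: nothing to show
  io.puts hidden
end

# Simulated result of a run where the script crashed instead of printing totals
result = { stdout: "", stderr: "Connection refused", status: 1 }

console = StringIO.new
print_hidden_output(result[:stderr], "1 campaigns", result[:stdout], io: console)
puts console.string

# When stdout matches the expectation, nothing is printed at all
quiet = StringIO.new
print_hidden_output("some warning", "", result[:stdout], io: quiet)
puts quiet.string.empty?
```

In #assert_error the roles would simply be swapped: stdout is the hidden stream, and stderr is the one being compared.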

Finally, the meat of the tests is creating a way to pre-define certain paths in the temporary server which will return specific HTTP status codes and response bodies. I’ve spent a bit of time going back and forth until I found a way which lets us define several different paths with different responses beforehand, while also having a single “default” catchall response which can be “overwritten”. This is the result for the simplest case:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
end

def start_test_server
  @app ||= proc { |env| [@response_status, {}, [@response_body]] }
  @test_server = Capybara::Server.new(@app).boot
  @testserver_url = "http://#{@test_server.host}:#{@test_server.port}"
end

def assert_success(expected_output, options = {})
  start_test_server unless @test_server
  result = run_scraper(options)
  puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

def set_up_simple_success_case
  # Set it up so there's only one page (the index page), i.e. no pagination, and only one project
  index_html = lambda do
    <<-HTML
    <html><body>
    <h1>Le index page</h1>
    <div class='campaign-details'><a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a></div>
    </body></html>
    HTML
  end
  project_1_html = lambda do
    <<-HTML
    <html><body>
    <h1>Le project 1 page</h1>
    <h5 class='campaign-goal'> € 1,000 </h5>
    <p class='remaining-amount'> Bla bla <span> €80,0 </span> </p>
    </body></html>
    HTML
  end
  @app = proc do |env|
    html = case env["REQUEST_PATH"]
    when "/project/1" then project_1_html.call
    else index_html.call
    end
    [@response_status, {}, [html]]
  end
end

def test_simple_success_case
  set_up_simple_success_case
  assert_success "1 campaigns, 1000€ total, 800€ remaining, 200€ earned"
end

Oh boy, that’s quite a jump from the error cases 😉 Let’s pick this apart a bit. The structure is pretty much the same as before, except for one addition: The #start_test_server method. This is now separate from the #setup so we can call it at a time when we need it (i.e. after test-specific setup), not always at the start of everything.

Let’s just walk through this in the order it is actually being executed:

When running the test script, the Ruby process reads all this code in, and after that minitest automatically calls first the #setup method and then the #test_simple_success_case method
#setup just sets up two default response variables. Since they are instance variables, they are “global” to this test run (and get discarded after each run)
In #test_simple_success_case we just call #set_up_simple_success_case and then the #assert_success method. #set_up_simple_success_case is only a method on its own because we will re-use this setup in other tests (not shown here)
#set_up_simple_success_case defines the @app handler as an instance variable by making the central proc that is basically our “router” in the test server. In it, we check the request env for the path of the HTTP request, and either return the HTML for the simulated “index” page, or one for a specific, hard-coded “/project/1” project page. But the HTML is not simply a block of text – we define each as a lambda, which in turn evaluates a “heredoc” (i.e. a multiline string literal), and then, we use #call on the lambda objects when the app handler decides which page is being served. Why is this so convoluted, you ask? The problem is one of timing: We need to set up the HTML responses and @app handler before we start the test server (since it’s a constructor argument), but since we link from within the HTML to other pages (we simply use the inline string interpolation inside of the heredocs: "<a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>"), we need to know the server’s @testserver_url. Which is only known after we start the server. Using a lambda lets us get around this conundrum, as the closure it generates will evaluate the instance variable @testserver_url only when the test is running, at which point the server has been started and the @testserver_url variable actually contains something useful.
That was the setup, now we continue to running the assertion helper #assert_success, to which we pass the expected output on stdout
Here, finally, and just in time, we start the internal test server via this line: “start_test_server unless @test_server”.
So let’s look into #start_test_server – this is basically the same setup we used all this time, except this picks up the @app instance variable and keeps its contents if it has already been defined (well, technically, if it’s “truthy”), but also sets up a simple default @app handler (in fact, the one we’ve been using for all the negative examples so far) if no one has bothered to do so yet.
Finally, everything is in place, and #assert_success shells out to run the actual scraper script, and compares the captured output to what we expected. The script makes an HTTP GET request to @testserver_url, which ends up in our @app handler, which executes #call on the “index” lambda, which “renders” the heredoc, which includes the @testserver_url + /project/1 as a link, which then is scraped by the script, which then again results in an HTTP GET request, which we again detect in the @app handler, which returns the other heredoc, which “renders” a “detail” page finally containing the amounts the script scrapes, adds up, and outputs to stdout. Phew.
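The timing trick from the walkthrough above can be shown in isolation. This is only a sketch: the class name LazyHtmlDemo, the @eager_html variable and the server address are made up for illustration. It contrasts a heredoc wrapped in a lambda (evaluated when #call runs) with a plain string (interpolated immediately, while the URL is still unknown):

```ruby
# Minimal sketch of the timing trick: the heredoc inside the lambda is only
# evaluated when #call runs, not when the lambda is defined.
class LazyHtmlDemo
  def set_up_pages
    # @testserver_url is still nil here; the lambda only remembers *how* to
    # build the HTML later
    @index_html = lambda do
      <<~HTML
        <a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>
      HTML
    end
    # A plain string is interpolated right away, so the URL ends up empty
    @eager_html = "<a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>"
  end

  def start_test_server
    @testserver_url = "http://127.0.0.1:4567" # made-up address, no real server
  end

  def render
    [@index_html.call, @eager_html]
  end
end

demo = LazyHtmlDemo.new
demo.set_up_pages      # pages are defined first...
demo.start_test_server # ...and the URL only becomes known afterwards
lazy, eager = demo.render
puts lazy
puts eager
```

The lazy version contains the full URL, while the eager one only has the bare “/project/1” path, because nil interpolates to an empty string.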

I just realized that this explanation is much longer than the actual code 😉 I hope someone learned a bit. If anyone reads this, let me know in the comments if this was much too verbose for you, or if it cleared something up?

Something to note here is that, of course, the HTML returned by the test server here is quite different from the real crowdfunding site. However, the bare-bones elements are there with the same structure of HTML tags and attributes necessary so that the scraper finds the data during the tests. A completely different approach would be to capture and save the actual HTML from the website, and use that for these tests. That has the advantage of “freezing” the website in time and letting later developers know what the site looked like when the code was written, in case it changes later. With a “real” project, I would certainly consider this, but in this case we need to preserve the anonymity of the website, so I can’t check in real HTML. Another factor is that these examples, being the bare minimum, are just barely short enough to fit right in with the test examples (records of the real responses would need to go into files or somesuch, otherwise the test code would be unreadable), and illustrate the structure we’re looking for quite clearly. As usual, we have a tradeoff to consider.

So, this setup is the main trick of the test code. We allow each test to define its own @app handler, with semi-dynamic HTML response bodies and HTTP status codes for specific request paths, and for each case, set up a scenario we want the script to respond to accordingly. Check out the current version of the code to see all tests – I’ve added a lot more than we talked about here. Some are quite elaborate, as the real site has three levels of links: the index page with detail page links, then a bunch of pagination links, each of which returns a page with more detail page links. All of them need to be simulated at least once for the tests to be meaningful.
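As a rough idea of what a handler for such a multi-level scenario could look like, here is a sketch. The paths, the “page-link” class and the HTML snippets are invented for this example and do not match the real tests. It also reads the path from env["PATH_INFO"], the key defined by the Rack specification, whereas the handler shown earlier uses env["REQUEST_PATH"]:

```ruby
# Sketch of a Rack-style handler for a paginated site. All paths and HTML
# here are made up for illustration.
PAGES = {
  "/"          => "<a class='campaign-link' href='/project/1'>project 1</a>" \
                  "<a class='page-link' href='/page/2'>2</a>",
  "/page/2"    => "<a class='campaign-link' href='/project/2'>project 2</a>",
  "/project/1" => "<h5 class='campaign-goal'> € 1,000 </h5>",
  "/project/2" => "<h5 class='campaign-goal'> € 500 </h5>"
}.freeze

app = proc do |env|
  path = env["PATH_INFO"]
  if PAGES.key?(path)
    [200, {}, [PAGES[path]]]
  else
    # Anything we did not define explicitly simulates a missing page
    [404, {}, ["Not found"]]
  end
end

# A Rack app is just an object that responds to #call(env), so the routing
# can be exercised without booting a server
status, _headers, body = app.call("PATH_INFO" => "/page/2")
puts status
puts body.first

missing_status, = app.call("PATH_INFO" => "/project/99")
puts missing_status
```

Keeping the pages in a plain hash makes it easy for a single test to swap out one entry, for example to make only the second pagination page return an error.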

Let’s wrap this all up now. Just some final notes:

During coding, I used #proc and #lambda pretty interchangeably, and later cleaned it up to only use #lambda, for clarity. There are some differences between these two ways of setting up a closure in Ruby, but for our purposes here they don’t matter. I just like lambda better because it reminds me of Half-Life 😉 [Update: Actually, I’ve gone over this again and changed both #proc and #lambda to the “new” stabby lambda: -> {} Sorry, Mr. Freeman…]
I’ve also added tests for the “verbose” flag (and also the short form, “-v”), and for that refactored the debugging output to be much less, and less Ruby-specific. In fact, only the requested URLs are now printed in verbose mode, which (together with the return status code and the byte-length of the response body) is already a pretty good overview of what happens in the script.
While working on the tests, I actually found two bugs in the scraper script. One was introduced during refactoring, and was caught pretty much immediately by the tests, and the other one was an unhelpful crash that occurred when a pagination link returned something unexpected. This was caught by me methodically writing tests for each simple error condition (404 error, empty HTML) for each URL requested by the script. Hooray for TDD!
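For readers who have not used Ruby’s built-in option parser, here is a minimal sketch of how a “verbose” flag with a short form could be wired up with OptionParser from the standard library. The method name parse_options, the banner text and the option description are assumptions for this example, not the code of the actual scraper:

```ruby
require "optparse"

# Hypothetical sketch: parse "-v" / "--verbose" and return the options
# together with the remaining arguments (e.g. the URL to scrape).
def parse_options(argv)
  options = { verbose: false }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: scrape_crowdfunder [options] URL"
    opts.on("-v", "--verbose", "Print each requested URL") do
      options[:verbose] = true
    end
  end
  # #parse (without the bang) leaves argv untouched and returns the
  # arguments that are not options
  remaining = parser.parse(argv)
  [options, remaining]
end

options, remaining = parse_options(["-v", "http://localhost:1234"])
p options
p remaining
```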

Well, and that’s it! I hope someone learned something by all this, or at least was entertained. I certainly learned a lot about writing about my code, instead of just hacking it out. The next thing I want to do with this, now that we have a pretty thorough language-agnostic test harness, is to try out writing the same functionality in programming languages I’m not so fluent in (like JavaScript), or ones completely new to me, like Rust!

So, see you laters, scraper-alligators!

A November trip: Munich, Zurich, Cologne, Friesland, part 1

Today we’re doing something a bit different, esteemed potential readership: this one was originally written in German, just among ourselves 😉

In November I was travelling once again, and as part of the ongoing project here called “I’ll just write something”, you now have to listen to me. Like back in the day with Dad and the slide projector. Argh.

2019-11-14 12.16.59 – Always at the beginning and the end: Berlin Hauptbahnhof, the layer cake among train stations.

Travelling is always a mixed bag for me. On the one hand, sooner or later I feel the urge to finally see something new again, other landscapes, other cities, to escape the cherished sameness of my own neighbourhood. And the nice thing about travelling is that you are _only_ travelling when you travel. You are a traveller and nothing else, for a short while. All other worries stay behind; you are busy with the moment. Where to stay tonight, where to eat, what to eat, how does this ticket machine work, what to look at today? You may not be happy as such, and there can still be stress and hurry, but at least it’s a different stress than usual.

On the other hand – I’m often so sluggish about actually getting going. And once I’ve had enough of home and have planned a trip, I watch the date approach with growing dread, with proper stage fright… even though this has happened a thousand times. On the day of departure it is at its worst; I would love to cancel everything and go back home and curl up. But once I’m on my way… it’s great again. I’m a traveller again. For a while, then homesickness sets in – but that’s another story.

This time a round trip by train was planned. After years of riding around on my motorcycle, I tried this last year, and it works great! You have to pack light, a hiking backpack helps, and it’s not really cheap, but you get around well, at least in Germany and Switzerland. Unlike others, I don’t particularly complain about Deutsche Bahn; occasionally it loses its marbles and stands there, dismayed at itself, stammering something about reversed carriage order, but by and large it’s a bright kid.

This year I had tricked myself again and arranged to meet friends in Munich and Zurich, so there was no turning back. The return trip was left open, but there are many exciting kilometres between Zurich and Berlin, so something would surely turn up.

The train from Berlin to Munich works quite well. Reunion with old friends and a hearty meal at the brewhouse: check. The next day there was time to kill, and by chance I ended up at the highly recommended Deutsches Museum:

Bronze ingots, and models with astonishing detail from the mining exhibition.

You walk for over an hour through convincingly recreated mine tunnels and look at models, drills placed in context, and the like. No forced multimedia gimmicks, no grubby touchscreens, super.

Then on to Zurich with the DB long-distance bus. Curiously, there is no sensible train connection, but the bus works well. Are the rails missing? Are the mountains in the way? But the motorway exists? A mystery.

In Zurich I’m staying with a dear friend – luckily, because Zurich is absolutely unaffordable by Berlin standards… when a beer alone costs 8-10 francs (practically the same in euros), and a dinner without drinks 30+ francs…

Conversely, it’s always great to come home again and see the prices drop radically. A friend once said that after a stay in Switzerland, everything in Berlin is free, to a first approximation 😉

2019-11-16 14.17.38 – My favourite place in Zurich: a small fountain with seating, a few steps from where I was staying. Last year, in August, I sat there in the evenings and read, by warm lantern light and the sound of splashing water.

Otherwise, Zurich is wonderful. Everything is chic and clean and works (at least in the city centre), people switch seamlessly to German or English when I marvel uncomprehendingly at Swiss German (it _sounds_ somehow German, but I don’t understand a word, it’s frustrating), the food is wonderfully cheese-heavy, and there are mountains! As a child of the North Sea coast, I’ve fallen in love with mountains over the last few years and now have to see them every once in a while. A long-distance relationship, an affair even, but all the more intense when we see each other 😉

In summer Zurich feels downright Mediterranean, with the large Lake Zurich that the city wraps around, the vineyards right between residential buildings, and the architecture of small houses (more like villas… yes, this is where the money is) along very vertical streets. In late autumn, like this time, it is rather crisp and biting. We were lucky and managed an outing to the Uetliberg in golden light:

Afterwards, cheese fondue! Never has half a kilo of cheese with half a loaf of bread tasted so good 😉 This was at Swiss Chuchi – a bit touristy and crowded, but very tasty. We had to wait for a table in a bare hotel lobby. A bit odd at first, but then word got around that the excellent house wine is free while you wait, which quickly overcame any oddness.

Around the lake and at the Chinagarten you can stroll nicely. Wonderful light once more; the rest of the trip lay under leaden clouds.

2019-11-19 16.17.21 – At the Migros supermarket you have to weigh the fruit yourself; in return, there are pre-weighed bunches of bananas

2019-11-19 19.22.08 – Om nom nom. Fondue 4 evar!

And yes, I confess, I then went for fondue a second time, for reasons. Recommendation: Zebra Bar, which is a little outside “chic” Zurich, near Langstraße, the local red-light district. But the fondue is no-frills and tasty, the cook/waiter/owner is nice, and everything is a bit cheaper.

Next time: the rest of the trip.

Crowd-funding platform scraper, part 3 – what’s your exit strategy?

Since the last post was pretty long, let’s make this one a bit shorter and only show one addition to the script and the tests.

The assertion I wanted to tackle next is still not a positive case, but rather another error scenario (again, one that I came across during the initial development), which is that the first URL called returns an HTTP 404 error code.

For that, we need to change the testing setup from a hardcoded 200 response:

app = proc { |_env| [200, {}, ['Hello, Sailor!']] }

so that it doesn’t always return the same response code and body, but something we can control for each assertion.

This should do the trick:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
  app = proc { |_env| [@response_status, {}, [@response_body]] }
end

Since the app proc uses instance variables, we can change them in the assertions even after the setup has been run.
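Why this works can be seen in a tiny standalone sketch: the block does not copy the values of the instance variables when it is created, it looks them up on the object each time it is called. The class name ClosureDemo and its accessors are made up for this illustration:

```ruby
# Minimal sketch: a proc that reads instance variables sees later changes.
class ClosureDemo
  attr_reader :app
  attr_writer :response_status

  def setup
    @response_status = 200
    @response_body = "Hello, Sailor!"
    # The block captures `self`, not the current values of the variables
    @app = proc { |_env| [@response_status, {}, [@response_body]] }
  end
end

demo = ClosureDemo.new
demo.setup
first_response = demo.app.call({})

demo.response_status = 404 # what a test does after #setup has already run
second_response = demo.app.call({})

puts first_response.first
puts second_response.first
```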

Here is the assertion now:

def test_index_page_returns_404
  @response_status = 404
  assert_equal "Projects page returned HTTP '404' - Is the site down? Check the URL given?", run_scraper
end

Et voilà, we get an expected failure:

1) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Of course, we didn’t yet implement this specific error message. Let’s become more user-friendly:

abort("Projects page returned HTTP #{status} - Is the site down? Check the URL given?") if index.nil?

You will also notice we’ve added another little detail – we’re now exiting with a proper UNIX error code, via the #abort method.
Not only does this save us from unnecessarily deep nesting in the script, it also lets us play nice with other programs.

Here is a guide I’ve used to brush up on Ruby exit codes: https://www.honeybadger.io/blog/how-to-exit-a-ruby-program/

Thanks to this guide we now know that we should also be using stderr instead of the usual stdout to print error messages. Luckily, #abort already does this for us.

Oh, but when running the test, we get no output at all:

4) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Ah yes, we knew from the guide about shelling out that the `backticks` don’t capture stderr…two steps forward, one step back 🙂

Some more googling later:

https://www.honeybadger.io/blog/capturing-stdout-stderr-from-shell-commands-via-ruby/

Two articles from Honeybadger in a row, these guys are helping us out today 🙂

So, now we know how to run a subshell properly, and capture everything we want to know:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  [stdout.strip, stderr.strip, status]
end

assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper[1]

Hah! This is much better, and also allows us to get at the exit code (now that it means something).

However, while I think it’s a good interface to return an array of results from the run_scraper method (they all belong together semantically), the access via the brackets [1] seems iffy to me – you can’t tell from looking at this what we’re trying to access there.

How about we wrap the result in a hash and then access it like so: run_scraper[:stderr]

That would be better. However, we’ll probably need to define testing helper methods sooner or later anyway to cut down on repetition – and I’d like to test the return code in one go as well, while we’re at it. In fact, let’s do both. Both is good:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

This looks so much nicer. But we still get an error in the test:

3) Failure:
TestScraper#test_index_page_has_no_projects [./test:62]:
--- expected
+++ actual
@@ -1 +1 @@
-1
+#<Process::Status: pid 22225 exit 1>

Interesting – Open3.capture3 doesn’t give us a simple integer but an object. Let’s see:

https://ruby-doc.org/core-2.5.0/Process/Status.html

I had assumed that the “status” from this line:

stdout, stderr, status = *Bundler.with_original_env

would simply be the exit status from the subshell, but it turns out to be an object with some more information. What we really want is status.exitstatus.
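Here is a small self-contained sketch that shows all three return values of Open3.capture3 for a script that ends via #abort. The one-line script and its message are made up; RbConfig.ruby is only used so the example runs with the same Ruby interpreter, without relying on a particular shell setup:

```ruby
require "open3"
require "rbconfig"

# A one-line Ruby script that fails the same way the scraper does
script = 'abort("Projects page returned HTTP 404")'

# Passing the command as separate arguments avoids going through a shell
stdout, stderr, status = Open3.capture3(RbConfig.ruby, "-e", script)

puts status.class      # a Process::Status object, not an Integer
puts status.exitstatus # the plain integer exit code
puts status.success?   # convenience check for "exit code was 0"
puts stderr.strip      # #abort writes its message to stderr
puts stdout.empty?     # nothing was written to stdout
```

Comparing the Process::Status object itself against an integer is what made the assertion above fail; #exitstatus gives us the value we actually meant.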

And now, all together:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status.exitstatus,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

def test_index_page_returns_404
  @response_status = 404
  assert_error "Projects page returned HTTP 404 - Is the site down? Check the URL given?"
end

Ah, this gives me a warm and fuzzy feeling 😉 The actual assertion helper to be re-used is short and punchy, tests two things that are always occurring together in one go, and in turn uses a helper method with a clear purpose and a structured return value.

In my experience, if you are on the right path with Ruby (and keep refactoring), what emerges is usually something like a small domain-specific language, i.e. “talking” methods, usually short, that have intuitive use and return values, even if it’s something simple without any metaprogramming. Ruby still wins all the beauty contests in my opinion.

Here’s the state as of now.

Next time: Finally finishing up all assertions.

Crowd-funding platform scraper, part 2 – A test harness with a real HTTP server, adding an option parser

Hello and welcome back again, dear potential readers. It has been some time, right? Well I have an excuse: I’ve been travelling. Maybe I’ll post some photos some day soon? Who knows!

Let’s get back to that scraping project for now. Last time we left off, we had a basically working Ruby script, but no tests at all, and some time later, while working on the tests, I spent a lot of unexpected time on an issue with bundler and running Ruby in subshells.

(Editorial note: Of course this all really happened some time ago, and not in a clean timeline as presented here. I’m now writing this from the notes I took while working on the tests a while ago)

I’d like to add tests, an options parser, and clean everything up before switching to another language, and/or refactoring the Ruby code, or trying out parallelization. Parallelizysing. Para-make-it-go-at-the-same-time.

Of course, all those hard-coded puts and pp calls will never do. I did need them for debugging while developing, though, so that seems like a strong indication that we should keep them, but they should go to the “verbose” mode, and not be enabled by default. I’m resisting the urge to keep tinkering with that, though, as we first really need some tests.

Strangely enough, even after so many years of writing oh so many tests, it still feels like a chore. They’re absolutely indispensable, and after having them any code feels so much better to work with. But still, starting to write them feels like psyching oneself up to go to the gym.

First of all, we will need some kind of switch to tell the code we’re in test mode. At first I thought of having a simple env variable like “TEST_MODE=true”, but then decided to take a page out of Rails’ book and use something like RAILS_ENV or RACK_ENV, defaulting to “non-test”.

(Another note from the editorial future: In the end, it turns out I didn’t even need those env variables and found a much nicer way. I was tempted to clean all this back and forth up a bit, for brevity, but one thing I want to do with these posts is to show how a “real” developer arrives at solutions. A pet peeve of mine is how clean-shaven and readily-sprung-from-the-brow-of-Zeus the solutions in articles often look, when the reality is more like making sausage: Messy. But it’s normal, and if you clean up after yourself, the result is very tasty, especially if you have some mustard… ok let’s stop that metaphor here).

So let’s use “RUN_MODE”. When the value is “test”, we’ll mock the HTTP responses, and otherwise just run normally.
I guess env variables are a pretty good way of “communicating” with the code in a portable way, i.e. it’ll work in most programming languages. On the other hand, they kinda feel like global variables, so I’m somewhat iffy about using them too much.
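Just to illustrate the idea as it stood at this point, here is a sketch of such a switch. The helper names run_mode and test_mode? and the default value “production” are assumptions for this example; as noted above, the final code ended up not needing any of this:

```ruby
# Hypothetical helpers: read the run mode from the environment, defaulting
# to a non-test mode when RUN_MODE is not set. "production" is an assumed
# name for that default.
def run_mode(env = ENV)
  env.fetch("RUN_MODE", "production")
end

def test_mode?(env = ENV)
  run_mode(env) == "test"
end

# Passing a plain hash instead of ENV keeps the example independent of the
# environment it happens to run in
puts test_mode?({})
puts test_mode?({ "RUN_MODE" => "test" })
```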

But let’s also think about what we want to do. We want to have a bunch of test cases, where we define the HTTP responses in the test setup, run the scraper, and then check the stdout output (stdoutput?) for what we expect. While writing this, I realized we’ll need to not only return the body of the HTTP response, but also the status code (to test the problem I had during development, the 404 error of one of the detail pages). Possibly also HTTP headers and such.

Additionally, we’ll need to define and return several different responses in a defined order, and/or depending on the URL being called…since we might want to test that the index page contains only one, or none, or several detail page links, etc.

This might be obvious to the dear readers, but I hadn’t really thought about this before. With the kind of Ruby projects I’m used to working with, there are lots of tools for this – the mocking framework in rspec (https://relishapp.com/rspec/rspec-mocks/docs) makes this pretty easy, and there are more sophisticated tools like https://github.com/bblimke/webmock and https://github.com/vcr/vcr .

However, these all work by reaching into the current Ruby process and redefining how the actual HTTP requests are being done. This works beautifully in Ruby (all you readers who are screaming about dependency injection out there, please calm down. It’s fine, actually), but will super duper not work when we try out another language.

So, another approach would be to go back to our trusty friend the env variable and start defining the HTTP response codes and bodies (probably from saved “fixture” files) there, i.e.:

./test ruby/scrape_crowdfunder FIRST_HTTP_RESPONSE_CODE=200 FIRST_HTTP_RESPONSE_BODY=./fixtures/first_response.html

But well, just look at how fugly this is. Nope nope nope. It also means that our actual application code needs to understand all these env vars and contain a lot of test-specific code. Nope nope nope.

An adventurous thought occurs: We do have the CROWDFUNDER_PROJECTS_URL variable (or, actually, the command line parameter)…what if we ran an actual HTTP server, controlled from within our test code, and pointed the script at it? It’s quite easy to run simple HTTP servers from Ruby code, and we could neatly define each response from within each test case…
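To make that thought concrete, here is a rough sketch of such an in-process server using only Ruby’s standard library. This is not what the tests ended up using. The method name start_stub_server and the “Stub” reason phrase are invented for this example, and a hand-rolled server like this ignores most of HTTP:

```ruby
require "socket"
require "net/http"

# Hypothetical stand-in for a test server: answers every request with the
# given status and body, listening on a free port picked by the OS.
def start_stub_server(status:, body:)
  server = TCPServer.new("127.0.0.1", 0) # port 0 = let the OS choose
  thread = Thread.new do
    loop do
      client = server.accept
      # Read and discard the request line and headers, up to the blank line
      loop do
        line = client.gets
        break if line.nil? || line == "\r\n"
      end
      client.write "HTTP/1.1 #{status} Stub\r\n" \
                   "Content-Type: text/html\r\n" \
                   "Content-Length: #{body.bytesize}\r\n" \
                   "Connection: close\r\n" \
                   "\r\n" \
                   "#{body}"
      client.close
    end
  end
  [server, thread]
end

server, thread = start_stub_server(status: 404, body: "Nope")
port = server.addr[1] # the port the OS actually assigned

# The "script under test" is just a plain HTTP client here
response = Net::HTTP.get_response(URI("http://127.0.0.1:#{port}/"))
puts response.code
puts response.body

thread.kill
server.close
```

The important property is that the server lives in the same process as the test code, so each test can decide what it answers, while the script being tested only ever sees a normal URL.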

Another thought occurs: Somebody probably already did this. Let’s google…

Ok, here are some options. As expected, mostly people are using mock frameworks that work only within Ruby, but here are some ideas (found most of these following: https://stackoverflow.com/questions/10166611/launching-a-web-server-inside-ruby-tests):

These snippets: https://gist.github.com/mojavelinux/a7e0cabb1b401300a4a5f7fa1ea6689c
- they seem quite low-level, we’d have to write a lot ourselves
This is much more elaborate: https://www.fedux.org/articles/2015/08/02/setting-up-a-ruby-based-http-server-to-back-your-test-suite.html
- but it spawns a child process for the server, and that makes me think it’s probably difficult to control the HTTP responses from within the tests. The author seemed to need only a server that serves the files out of a folder. Also, not packaged as a library.
Now we’re getting somewhere: https://github.com/grosser/stub_server
- This is pretty much it…except it’s a small library that hasn’t seen much action, apparently, and the examples are only for rspec – I’d like to try minitest this time. But we’ll come back to this.
Looking at this:
- https://github.com/teamcapybara/capybara/blob/master/lib/capybara/server.rb
- https://github.com/teamcapybara/capybara/blob/master/spec/server_spec.rb
- I’m thinking it’s probably possible to just use this – we can load up the Capybara gem, but only use this class out of it. Has the advantage that Capybara is heavily battle-tested and will be kept up to date. Looking at the usage via the spec file, it seems easy to set up HTTP responses, they’re using the Rack interface.

So let’s try picking the server out of the Capybara gem. Off to the code-mobile!

(Some time in the code-mobile 🚗 later…)

After that roadblock (har har) is out of the way, here is the first test code, another simple Ruby shell script. It has an accompanying Gemfile, and we’re using minitest as a simple testing framework. I’ve always used rspec in my projects before, and wanted to try this out. It works pretty nicely for a minimal setup like this (you just have to declare the class and require 'minitest/autorun' and it, well, autoruns after all the code has been read by the Ruby process. And every method that starts with test_ automagically becomes an assertion):

#!/usr/bin/env ruby
# frozen_string_literal: true

require 'rubygems'
require 'bundler/setup'
require 'pp'
require 'pry'

SCRAPER_PATH = ARGV.first
unless SCRAPER_PATH and File.exist?(SCRAPER_PATH)
  raise ArgumentError, "Please provide the scraper you want to test, i.e. './test ruby/scrape_crowdfunder'"
end

require "minitest/autorun"

class TestScraper < Minitest::Test
  def setup
    @testserver_url = "http://example.com"
  end

  def run_scraper
    path, script = File.split(SCRAPER_PATH) # also handles nested folders and bare script names
    command = ""
    # We need to cd into any sub-folder(s) so the scripts there can do setup like rvm, bundler, nvm, etc.
    command += "cd #{path} && " if path
    command += "./#{script}"
    command += " #{@testserver_url}"
    # We need to have a "clean" Bundler env (i.e., forget any currently loaded gems),
    # as the script called might be another Ruby script, with its own Gemfile, and by default
    # shelling out "keeps" the gems from this test runner, making the script fail
    Bundler.with_original_env do
      `bash -c '#{command}'`
    end
  end

  def test_that_shit_works
    assert_equal "OHAI!", run_scraper
  end
end

Here is the first “successful” run of the test script:

crowdfunder_scraper $ ./test ruby/scrape_crowdfunder 
Run options: --seed 48205

# Running:

F

Finished in 0.708704s, 1.4110 runs/s, 1.4110 assertions/s.

1) Failure:
TestScraper#test_that_shit_works [./test:37]:
--- expected
+++ actual
@@ -1 +1,6 @@
-"OHAI!"
+"GET \"http://example.com\"
+[]
+[]
+[]
+0 campaigns, 0€ total, 0€ remaining, 0€ earned
+"

1 runs, 1 assertions, 1 failures, 0 errors, 0 skips

It is actually running the script, and failing because there’s too much output! Also, of course, our temporary example.com domain has no links, and so fails even more. We should have checks that the wrong domain, or a changed page, outputs something more explicit. We’ll add that to the tested code once the test actually sets something sensible.
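A side note on the shelling-out: backticks only hand back stdout, so anything the script prints to stderr, and its exit code, stay invisible to the test. If we want those as well, Open3 from the standard library can capture all three. Here is a minimal sketch of that idea, not what the test script does at this point. The run_command helper is a made-up name, and a one-line Ruby script stands in for the scraper:

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: runs a command and returns stdout, stderr and the
# exit code in one hash, instead of only stdout like backticks do.
def run_command(*command)
  stdout, stderr, status = Open3.capture3(*command)
  { stdout: stdout, stderr: stderr, status: status.exitstatus }
end

# A tiny stand-in for the scraper: writes to both streams, then exits with 1.
result = run_command(RbConfig.ruby, '-e', 'puts "out"; warn "err"; exit 1')
puts "stdout: #{result[:stdout].inspect}"
puts "stderr: #{result[:stderr].inspect}"
puts "status: #{result[:status]}"
```

Passing the command as separate arguments also sidesteps the shell quoting that the bash -c string has to get right.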

Now to the interesting bit here – our internal test HTTP server. So far, we’ve kept requesting example.com over and over, which is hardly fair to it. And we could never change its response(s) to the one we want to simulate.

So let’s try requiring capybara/server and using its host and port in the setup that runs before each test:

require "capybara/server"

def setup
  app = proc { |_env| [200, {}, ['Hello, Sailor!']] }
  server = Capybara::Server.new(app).boot
  @testserver_url = "http://#{server.host}:#{server.port}"
end

But then, we get this error:

NoMethodError: undefined method `server_port' for Capybara:Module
/home/mt/.rvm/gems/ruby-2.6.0/gems/capybara-2.14.0/lib/capybara/server.rb:63:in `initialize'
./test:25:in `new'
./test:25:in `setup'

Looking a bit at the code of the Server class, it seems to call Capybara.server_port … these methods are probably simply not defined because we are only requiring a single file out of the whole gem. I fiddled around with requiring some of the “config” files from the Capybara gem, but that didn’t seem to work. Another idea would have been to just (re-)define the Capybara module ourselves and add the methods we need by trial and error, but that seems like a long road to go down, and hard to maintain.

So we’re just loading up Capybara completely:

require 'capybara'

and the error goes away.

So close now…let’s make an actually useful assertion now. We’re setting up the server to return just a string, no HTML:

def setup
  app = proc { |_env| [200, {}, ['Hello, Sailor!']] }
  server = Capybara::Server.new(app).boot
  @testserver_url = "http://#{server.host}:#{server.port}"
end

Then we have a test case that checks that in this case, we should see a warning for the user that the website seems to be wonky:

def test_index_page_has_no_projects
  assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper
end

And when we run the tests now, they rightfully complain that the scraper doesn’t give us a useful error message, but only some junk output and a result of “0 campaigns”.

So now we’re doing actual TDD 🙂 We’ve tested something that is not actually a feature yet in our code. Let’s add that now:

if detail_urls.empty?
  puts "No projects found, has the site changed? Check the URL given?"
else
  # (the existing summary output continues here)
end

Aaaaand:

3) Failure:
TestScraper#test_index_page_has_no_projects [./test:45]:
--- expected
+++ actual
@@ -1 +1,5 @@
-"No projects found, has the site changed? Check the URL given?"
+"GET \"http://127.0.0.1:36409\"
+[]
+[]
+No projects found, has the site changed? Check the URL given?
+"

Of course, this still fails because of the extra debugging output, but yaaaay!
This Capybara mini-server is great, hooray for open source!

The code is at this commit now: https://github.com/MGPalmer/crowdfunder_scraper/commit/aff8e7abddb4981fc73ea700821dd2e736213337

At this point, we have a good setup, it seems. We can now completely mock out the “real” webserver and replace it in our tests with one that we can control from within the tests (even though we currently only return ‘Hello, Sailor!’). And since our little command-line tool only produces output via stdout (i.e. it has no side-effects like database entries added or files written), we can completely black-box-test it.
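To make the “one that we can control” part a bit more concrete: a Rack app is just something callable that takes the request environment and returns status, headers and body. So answering differently per path could look roughly like the sketch below. The build_app helper, the paths and the response bodies are all made up for illustration; this is not code from the repository:

```ruby
# Build a Rack-style app from a table of canned responses.
# `responses` maps a request path to [status, body]; `default` is the
# catch-all for every other path. All names here are hypothetical.
def build_app(responses, default = [200, 'Hello, Sailor!'])
  proc do |env|
    status, body = responses.fetch(env['PATH_INFO'], default)
    [status, { 'content-type' => 'text/html' }, [body]]
  end
end

canned = {
  '/projects' => [200, '<a href="/projects/1">Project 1</a>'],
  '/broken' => [500, 'Internal Server Error']
}
app = build_app(canned)

# Rack apps are plain callables, so we can poke at one without booting a server.
status, _headers, body = app.call('PATH_INFO' => '/broken')
puts "#{status} #{body.first}"
```

An app built like this could presumably be handed to Capybara::Server.new in place of the one-line proc used in setup.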

Before we wrap this post up, let’s get this one assertion green. The code actually works, but the script produces debugging output which is 1) pretty ugly 2) not useful normally, only for the developer and only if something goes wrong. Classic case of making this optional (but still keeping it as part of the code – I don’t want to go back and add and remove debugging output every time I encounter an issue). In a web framework, we’d use a “debug” log level and the usually provided logging facilities, but here the usual thing is to add a flag to the script. AFAIK, the convention is to call it -v (and in long-form --verbose). And again, this seems like something that should have been solved a lot of times before, so some googling for “option parser command line” later, we find this:

https://www.ruby-toolbox.com/categories/CLI_Option_Parsers

It seems like this is a popular field in which to create libraries, there are a lot! We’ll just go with the built-in ‘OptionParser’ from the standard library, especially since the first example is our verbose flag.

Oh, and while we’re at it, we should also make it print out a good usage example:

require 'optparse'

options = {}
OptionParser.new do |opts|
  opts.banner = "Usage: ./scrape_crowdfunder [options] http://example.com/projects"

  opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
    options[:verbose] = v
  end
end.order!

…and this is what it looks like:

crowdfunder_scraper/ruby $ ./scrape_crowdfunder --help
Usage: ./scrape_crowdfunder http://example.com/projects
-v, --[no-]verbose Run verbosely

Within the actual script, we now just use something like this:

verbose = options[:verbose]
pp(page_urls) if verbose

and now one of the tests is passing \o/
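If you want to play with the flag handling in isolation, here is a small self-contained sketch. It wraps the parser in a hypothetical parse_args helper that takes an argument list instead of reading ARGV, which is an assumption made for the example; the real script parses ARGV in place as shown above:

```ruby
require 'optparse'

# Hypothetical helper: parses the given arguments and returns
# [options, remaining_args] without touching ARGV.
def parse_args(argv)
  options = { verbose: false }
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: ./scrape_crowdfunder [options] http://example.com/projects'
    opts.on('-v', '--[no-]verbose', 'Run verbosely') { |v| options[:verbose] = v }
  end
  # parse (without the bang) leaves argv alone and returns the non-option arguments
  [options, parser.parse(argv)]
end

options, rest = parse_args(['-v', 'http://example.com/projects'])
puts "verbose=#{options[:verbose]} url=#{rest.first}"
```

Defaulting :verbose to false keeps the later `if verbose` checks free of nil surprises.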

Whew, this turned out to be another long post!

Join us next time for even more tests…