dev-diary – monogreen.de

Crowd-funding platform scraper, part 4 – Finishing the tests

Hello again everyone, let’s finish this series up!

Let’s continue from last time and speed up a bit. So far, we have built up a little set of helper methods in the tests:

#setup is called automatically before each test, and sets up the internal test server
#run_scraper “shells out” to an external process, runs the scraper script against the test server, and captures any output as well as the exit code
#assert_error checks for a specific output on stderr and an exit code of 1

But until now, we’ve only tested a negative case, i.e. some condition where the test server only returns an HTTP error code, or a nonsensical HTTP body. Time to get positive!

The obvious first step is to add an assertion to check for output on stdout, and an exit code of 0:

def assert_success(expected_output, options = {})
  result = run_scraper(options)
  puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

Eagle-eyed readers will have noticed the addition of this line:

puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]

This is basically a debugging helper for the tests themselves. Since our shelling-out code captures all output, it so happened that during development of these tests, I was finding they would fail for some reason I couldn’t see. It turned out the test setup was causing the test to fail, and the script was producing an error message – but I didn’t see it, as stderr was captured, while the test only complained about stdout being empty. Curses. So this line (and the equivalent in the assert_error method) will output what we normally can’t see if the test fails, but only if it fails (otherwise, we’d clutter the console while running the tests every time).

Finally, the meat of the tests is creating a way to pre-define certain paths in the temporary server which will return specific HTTP status codes and response bodies. I spent a bit of time going back and forth until I found a way which lets us define several different paths with different responses beforehand, while also having a single “default” catchall response which can be “overwritten”. This is the result for the simplest case:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
end

def start_test_server
  @app ||= proc { |env| [@response_status, {}, [@response_body]] }
  @test_server = Capybara::Server.new(@app).boot
  @testserver_url = "http://#{@test_server.host}:#{@test_server.port}"
end

def assert_success(expected_output, options = {})
  start_test_server unless @test_server
  result = run_scraper(options)
  puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

def set_up_simple_success_case
  # Set it up so there's only one page (the index page), i.e. no pagination, and only one project
  index_html = lambda do
    <<-HTML
    <html><body>
    <h1>Le index page</h1>
    <div class='campaign-details'><a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a></div>
    </body></html>
    HTML
  end
  project_1_html = lambda do
    <<-HTML
    <html><body>
    <h1>Le project 1 page</h1>
    <h5 class='campaign-goal'> € 1,000 </h5>
    <p class='remaining-amount'> Bla bla <span> €80,0 </span> </p>
    </body></html>
    HTML
  end
  @app = proc do |env|
    html = case env["REQUEST_PATH"]
    when "/project/1" then project_1_html.call
    else index_html.call
    end
    [@response_status, {}, [html]]
  end
end

def test_simple_success_case
  set_up_simple_success_case
  assert_success "1 campaigns, 1000€ total, 800€ remaining, 200€ earned"
end

Oh boy, that’s quite a jump from the error cases 😉 Let’s pick this apart a bit. The structure is pretty much the same as before, except for one addition: The #start_test_server method. This is now separate from the #setup so we can call it at a time when we need it (i.e. after test-specific setup), not always at the start of everything.

Let’s just walk through this in the order it is actually being executed:

When running the test script, the Ruby process reads all this code in, and after that minitest automatically calls first the #setup method and then the #test_simple_success_case method
#setup just sets up two default response variables. Since they are instance variables, they are “global” to this test run (and get discarded after each run)
In #test_simple_success_case we just call #set_up_simple_success_case and then the #assert_success method. #set_up_simple_success_case is only a method on its own because we will re-use this setup in other tests (not shown here)
#set_up_simple_success_case defines the @app handler as an instance variable by making the central proc that is basically our “router” in the test server. In it, we check the request env for the path of the HTTP request, and either return the HTML for the simulated “index” page, or one for a specific, hard-coded “/project/1” project page. But the HTML is not simply a block of text – we define each as a lambda, which in turn evaluates a “heredoc” (i.e. a multiline string literal), and then, we use #call on the lambda objects when the app handler decides which page is being served. Why is this so convoluted, you ask? The problem is one of timing: We need to set up the HTML responses and @app handler before we start the test server (since it’s a constructor argument), but since we link from within the HTML to other pages (we simply use the inline string interpolation inside of the heredocs: "<a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>"), we need to know the server’s @testserver_url. Which is only known after we start the server. Using a lambda lets us get around this conundrum, as the closure it generates will evaluate the instance variable @testserver_url only when the test is running, at which point the server has been started and the @testserver_url variable actually contains something useful.
That was the setup, now we continue to running the assertion helper #assert_success, to which we pass the expected output on stdout
Here, finally, and just in time, we start the internal test server via this line: “start_test_server unless @test_server” .
So let’s look into #start_test_server – this is basically the same setup we used all this time, except this picks up the @app instance variable and keeps its contents if it has already been defined (well, technically, if it’s “truthy”), but also sets up a simple default @app handler (in fact, the one we’ve been using for all the negative examples so far) if no one has bothered to do so yet.
Finally, everything is in place, and #assert_success shells out to run the actual scraper script, and compares the captured output to what we expected. The script makes an HTTP GET request to @testserver_url, which ends up in our @app handler, which executes #call on the “index” lambda, which “renders” the heredoc, which includes the @testserver_url + /project/1 as a link, which then is scraped by the script, which then again results in an HTTP GET request, which we again detect in the @app handler, which returns the other heredoc, which “renders” a “detail” page finally containing the amounts the script scrapes, adds up, and outputs to stdout. Phew.
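
To make the timing trick from the walkthrough a bit more tangible, here is a minimal sketch without any real server. The class name DeferredHtmlDemo and the hard-coded URL are made up for illustration; only @testserver_url and the link markup are taken from the test code above. The lambda is created while @testserver_url is still unset, and only reads the variable when #call is invoked:

```ruby
# Hypothetical stand-in for the test class; not the real test code.
class DeferredHtmlDemo
  attr_reader :index_html

  def initialize
    # Defined before any "server" exists: @testserver_url is still nil here.
    @index_html = lambda do
      "<a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>"
    end
  end

  # Pretend to boot a server; the URL only becomes known at this point.
  # The address is an invented example value.
  def start_test_server
    @testserver_url = "http://127.0.0.1:4567"
  end
end

demo = DeferredHtmlDemo.new
before = demo.index_html.call # nil interpolates to "", so the href is just "/project/1"
demo.start_test_server
after = demo.index_html.call  # the same lambda now sees the assigned URL

puts before
puts after
```

Had we used a plain string instead of a lambda, the interpolation would have happened once, at definition time, and the link would have stayed broken.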

I just realized that this explanation is much longer than the actual code 😉 I hope someone learned a bit. If anyone reads this, let me know in the comments if this was much too verbose for you, or if it cleared something up?

Something to note here is that, of course, the HTML returned by the test server here is quite different from the real crowdfunding site. However, the bare-bones elements are there, with the same structure of HTML tags and attributes necessary so that the scraper finds the data during the tests. A completely different approach would be to capture and save the actual HTML from the website, and use that for these tests. That has the advantage of “freezing” the website in time and letting later developers know what the site looked like when the code was written, in case it changes later. With a “real” project, I would certainly consider this, but in this case we need to preserve the anonymity of the website, so I can’t check in real HTML. Another factor is that these examples, being the bare minimum, are just barely short enough to fit right in with the test examples (records of the real responses would need to go into files or somesuch, otherwise the test code would be unreadable), and illustrate the structure we’re looking for quite clearly. As usual, we have a tradeoff to consider.

So, this setup is the main trick of the test code. We allow each test to define its own @app handler, with semi-dynamic HTML response bodies and HTTP status codes for specific request paths, and for each case, set up a scenario we want the script to respond to accordingly. Check out the current version of the code to see all tests – I’ve added a lot more than we talked about here. Some are quite elaborate, as the real site has three levels of links: the index page with detail page links, then a bunch of pagination links, each of which returns a page with more detail pages. All of them need to be simulated at least once for the tests to be meaningful.
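
For the more elaborate scenarios, one way to keep such handlers readable is a small lookup table of paths. This is only a sketch under my own assumptions, not what the linked code does: build_app is a hypothetical helper, and the routes are invented. It can be exercised without a server by calling the handler directly with a Rack-style env hash:

```ruby
# Hypothetical helper: builds a Rack-style handler from a table of
# path => [status, body] entries, with a catch-all default response.
def build_app(routes, default: [200, "Hello, Sailor!"])
  lambda do |env|
    status, body = routes.fetch(env["REQUEST_PATH"], default)
    # A body may be a lambda, so it is only rendered at request time.
    body = body.call if body.respond_to?(:call)
    [status, {}, [body]]
  end
end

routes = {
  "/project/1" => [200, -> { "<h1>Le project 1 page</h1>" }],
  "/project/2" => [404, "Not found"],
}
app = build_app(routes)

p app.call({ "REQUEST_PATH" => "/project/1" })
p app.call({ "REQUEST_PATH" => "/project/2" })
p app.call({ "REQUEST_PATH" => "/anything-else" }) # falls back to the default
```

The catch-all default plays the same role as the `else` branch in the case statement, while each test only has to list the paths it cares about.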

Let’s wrap this all up now. Just some final notes:

During coding, I used #proc and #lambda pretty interchangeably, and later cleaned it up to only use #lambda, for clarity. There are some differences between these two ways of setting up a closure in Ruby, but for our purposes here they don’t matter. I just like lambda better because it reminds me of Half-Life 😉 [Update: Actually, I’ve gone over this again and changed both #proc and #lambda to the “new” stabby lambda: -> {} Sorry, Mr. Freeman…]
I’ve also added tests for the “verbose” flag (and also the short form, “-v”), and for that trimmed the debugging output down and made it less Ruby-specific. In fact, only the requested URLs are now printed in verbose mode, which (together with the return status code and the byte-length of the response body) is already a pretty good overview of what happens in the script.
While working on the tests, I actually found two bugs in the scraper script. One was introduced during refactoring, and was caught pretty much immediately by the tests, and the other one was an unhelpful crash that occurred when a pagination link returned something unexpected. This was caught by me methodically writing tests for each simple error condition (404 error, empty HTML) for each URL requested by the script. Hooray for TDD!
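
For the curious, the two differences between #proc and #lambda mentioned in the first note are argument checking and the behaviour of `return`. A small sketch with made-up names, unrelated to the scraper itself:

```ruby
# Arity: a lambda checks its argument count, a proc does not.
strict  = lambda { |a, b| [a, b] }
lenient = proc   { |a, b| [a, b] }

lenient_result = lenient.call(1) # the missing argument simply becomes nil
strict_error = begin
  strict.call(1)
rescue ArgumentError => e
  e.class
end

# Return: inside a lambda, `return` only leaves the lambda itself;
# inside a proc, it returns from the enclosing method.
def return_from_lambda
  l = lambda { return :from_lambda }
  l.call
  :after_lambda
end

def return_from_proc
  pr = proc { return :from_proc }
  pr.call
  :after_proc # never reached
end

# The stabby syntax creates an ordinary lambda.
stabby = -> { :hello }

p lenient_result
p strict_error
p return_from_lambda
p return_from_proc
p stabby.lambda?
```

Since the handlers in the tests always receive exactly one env argument and never use `return`, neither difference comes into play there.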

Well, and that’s it! I hope someone learned something from all this, or at least was entertained. I certainly learned a lot about writing about my code, instead of just hacking it out. The next thing I want to do with this, now that we have a pretty thorough language-agnostic test harness, is to try out writing the same functionality in programming languages I’m not so fluent in (like JavaScript), or ones completely new to me, like Rust!

So, see you laters, scraper-alligators!

Crowd-funding platform scraper, part 3 – what’s your exit strategy?

Since the last post was pretty long, let’s make this one a bit shorter and only show one addition to the script and the tests.

The assertion I wanted to tackle next is still not a positive case, but rather another error scenario (again, one that I came across during the initial development), which is that the first URL called returns an HTTP 404 error code.

For that, we need to change the testing setup from a hardcoded 200 response:

app = proc { |_env| [200, {}, ['Hello, Sailor!']] }

so that it doesn’t always return the same response code and body, but something we can control for each assertion.

This should do the trick:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
  app = proc { |_env| [@response_status, {}, [@response_body]] }
end

Since the app proc uses instance variables, we can change them in the assertions even after the setup has been run.

Here is the assertion now:

def test_index_page_returns_404
  @response_status = 404
  assert_equal "Projects page returned HTTP '404' - Is the site down? Check the URL given?", run_scraper
end

Et voilà, we get an expected failure:

1) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Of course, we didn’t yet implement this specific error message. Let’s become more user-friendly:

abort("Projects page returned HTTP #{status} - Is the site down? Check the URL given?") if index.nil?

You will also notice we’ve added another little detail – we’re now exiting with a proper UNIX error code, via the #abort method.
Not only does this save us from unnecessarily deep nesting in the script, it also lets us play nice with other programs.

Here is a guide I’ve used to brush up on Ruby exit codes: https://www.honeybadger.io/blog/how-to-exit-a-ruby-program/

Thanks to this guide we now know that we should also be using stderr instead of the usual stdout to print error messages. Luckily, #abort already does this for us.
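
If you want to convince yourself of what #abort does without leaving the current process, here is a tiny sketch. It relies on the fact that #abort raises SystemExit, which can be rescued; the message text is just an example, and it will show up on your stderr when you run this:

```ruby
# abort writes the message to stderr and raises SystemExit with status 1.
# Rescuing the exception lets us inspect it instead of actually exiting.
exit_status = begin
  abort("Projects page returned HTTP 404 - Is the site down? Check the URL given?")
rescue SystemExit => e
  e.status
end

# For comparison: a plain `exit` without arguments signals success.
clean_status = begin
  exit
rescue SystemExit => e
  e.status
end

puts "abort status: #{exit_status}"
puts "exit status:  #{clean_status}"
```

In the real script we of course don’t rescue anything, so the process ends right there with the failure code.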

Oh, but when running the test, we get no output at all:

4) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Ah yes, we knew from the guide about shelling out that the `backticks` don’t capture stderr…two steps forward, one step back 🙂

Some more googling later:

https://www.honeybadger.io/blog/capturing-stdout-stderr-from-shell-commands-via-ruby/

Two articles from Honeybadger in a row, these guys are helping us out today 🙂

So, now we know how to run a subshell properly, and capture everything we want to know:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  [stdout.strip, stderr.strip, status]
end

assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper[1]

Hah! This is much better, and also allows us to get at the exit code (now that it means something).

However, while I think it’s a good interface to return an array of results from the run_scraper method (they all belong together semantically), the access via the brackets [1] seems iffy to me – you can’t tell from looking at this what we’re trying to access there.

How about we wrap the result in a hash and then access it like so: run_scraper[:stderr]

That would be better. However, we’ll probably need to define testing helper methods sooner or later anyway to cut down on repetition – and I’d like to test the return code in one go as well, while we’re at it. In fact, let’s do both. Both is good:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

This looks so much nicer. But we still get an error in the test:

3) Failure:
TestScraper#test_index_page_has_no_projects [./test:62]:
--- expected
+++ actual
@@ -1 +1 @@
-1
+#<Process::Status: pid 22225 exit 1>

Interesting – Open3.capture3 doesn’t give us a simple integer but an object. Let’s see:

https://ruby-doc.org/core-2.5.0/Process/Status.html

I had assumed that the “status” from this line:

stdout, stderr, status = *Bundler.with_original_env

would simply be the exit status from the subshell, but it turns out to be an object with some more information. What we really want is status.exitstatus.
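
Here is a small standalone sketch of that difference. Instead of the scraper, it runs a one-line child Ruby script that fails via #abort; the message "nope" and the use of RbConfig.ruby to locate the interpreter are my own choices for this illustration:

```ruby
require "open3"
require "rbconfig"

# Child process: a one-line Ruby script that fails via abort.
# RbConfig.ruby is the path of the currently running Ruby interpreter.
_stdout, stderr, status = Open3.capture3(RbConfig.ruby, "-e", 'abort("nope")')

puts status.class               # the wrapper object, not an Integer
puts status.is_a?(Integer)      # which is why comparing it against 1 in a test fails
puts status.exitstatus          # the plain integer exit code we actually want
puts status.success?            # handy shortcut for "exit code was 0"
puts stderr.strip               # the message abort wrote to stderr
```

Passing the command as separate arguments (instead of one string) also avoids going through a shell, so there is no quoting to worry about.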

And now, all together:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status.exitstatus,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

def test_index_page_returns_404
  @response_status = 404
  assert_error "Projects page returned HTTP 404 - Is the site down? Check the URL given?"
end

Ah, this gives me a warm and fuzzy feeling 😉 The actual assertion helper to be re-used is short and punchy, tests two things that are always occurring together in one go, and in turn uses a helper method with a clear purpose and a structured return value.

In my experience, if you are on the right path with Ruby (and keep refactoring), what emerges is usually something like a small domain-specific language, i.e. “talking” methods, usually short, that have intuitive use and return values, even if it’s something simple without any metaprogramming. Ruby still wins all the beauty contests in my opinion.

Here’s the state as of now.

Next time: Finally finishing up all assertions.

Bloody bloody computers (TODO #1 update)

This is the update to TODO #1 – pointing my old monogreen.de domain to this blog (which is the hosted “free” plan from wordpress.com, i.e. no bells and whistles).

A bit of history: I’ve been using hosteurope to host my mailboxes and to register some domain names for years (I get a warm and fuzzy feeling by actually owning my email address and my mails – I’m hoping of course that hosteurope, compared to say, Google, is not interested in scanning my emails and selling my habits…). But I only bought the “email” package from hosteurope, and not anything else like servers or webspace hosting.

Naturally I’d like to point monogreen.de to this blog. As I wrote last time, you have to have a webspace package booked in order to set that up, so I bought the cheapest plan.

First hurdle: It turns out they make that available quite fast (<15min) after ordering – but you have to log out and back in to see the new package in KIS. Groan.

Next hurdle: As I now had two packages (one email-only, one webspace which also includes a mailbox as well as webspace, php hosting, etc.), I had to “transfer” the domain monogreen.de from the email package. With some trepidation, I did so, as it seemed like the “redirect” setting would only be available on the webspace package, and I assumed I could still have the email addresses themselves point to the actual inbox I’ve been using for years.

Well, that worked alright, and I could immediately set up the redirect under the webspace package. However…the redirect didn’t work. Knowing things sometimes take a while to start working at hosteurope, I waited a while, but still nothing. Then, being suspicious, I sent myself an email from a different address/provider, and – it bounced o.O . I had managed to break my primary email address.

Photo Credit: Kalle Gustafsson https://flic.kr/p/GweE1X

Half a panicky hour later I had everything back to as it was before (including setting back up the various email aliases to the actual mailbox, which the “move” had severed) – and here are two tips for anyone in this situation, and for myself in the future:

Every package in KIS has its own block of menu entries – and especially the “Domain settings”. There is also a general one. So in my case there were three places to look when trying to move the domain back from the webspace package to the email package (the correct one is the “general” area)
Many changes in KIS take ~15min to take effect (they say that in the interface, actually, but you usually assume this is only a general “yeah yeah mostly immediately but let’s cover our asses” policy – it’s not with hosteurope). Needless to say, this makes trying things out extremely cumbersome and error-prone…

And fun fact: After all that, the damn redirect still didn’t work…until next morning, after I had given up, and it suddenly worked just perfectly. Groan.

In summary: Mission accomplished but with too much panicky clicking around…