Crowd-funding platform scraper, part 3 – what’s your exit strategy?

After the last post has been pretty long, let’s make this a bit shorter and only show one addition to the script and the tests.

The assertion I wanted to tackle next is still not a positive case, but rather another error scenario (again, one that I came across during the initial development), which is that the first URL called returns an HTTP 404 error code.

For that, we need to change the testing setup from a hardcoded 200 response:

app = proc { |_env| [200, {}, ['Hello, Sailor!']] }

so that it doesn’t always return the same response code and body, but something we can control for each assertion.

This should do the trick:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
  app = proc { |_env| [@response_status, {}, [@response_body]] }
end

Since the app proc uses instance variables, we can change them in the assertions even after the setup has been run.

Here is the assertion now:

def test_index_page_returns_404
  @response_status = 404
  assert_equal "Projects page returned HTTP '404' - Is the site down? Check the URL given?", run_scraper
end

Et voilà, we get an expected failure:

1) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Of course, we didn’t yet implement this specific error message. Let’s become more user-friendly:

abort("Projects page returned HTTP #{status} - Is the site down? Check the URL given?") if index.nil?

You will also notice we’ve added another little detail – we’re now exiting with a proper UNIX error code, via the #abort method.
Not only does this save us from unnecessarily deep nesting in the script, it also lets us play nice with other programs.

Here is a guide I’ve used to brush up on Ruby exit codes: https://www.honeybadger.io/blog/how-to-exit-a-ruby-program/

Thanks to this guide we now know that we should also be using stderr instead of the usual stdout to print error messages. Luckily, #abort already does this for us.

Oh, but when running the test, we get no output at all:

4) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Ah yes, we knew from the guide about shelling out that the `backticks` don’t capture stderr…two steps forward, one step back 🙂

Some more googling later:

https://www.honeybadger.io/blog/capturing-stdout-stderr-from-shell-commands-via-ruby/

Two articles from Honeybadger in a row, these guys are helping us out today 🙂

So, now we know how to run a subshell properly, and capture everything we want to know:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  [stdout.strip, stderr.strip, status]
end

assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper[1]

Hah! This is much better, and also allows us to get at the exit code (now that it means something).

However, while I think it’s a good interface to return an array of results from the run_scraper method (they all belong together semantically), the access via the brackets [1] seems iffy to me – you can’t tell from looking at this what we’re trying to access there.

How about we wrap the result in a hash and then access it like so: run_scraper[:stderr]

That would be better. However, we’ll probably need to define testing helper methods sooner or later anyway to cut down on repetition – and I’d like to test the return code in one go as well, while we’re at it. In fact, let’s do both. Both is good:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

This looks so much nicer. But we still get an error in the test:

3) Failure:
TestScraper#test_index_page_has_no_projects [./test:62]:
--- expected
+++ actual
@@ -1 +1 @@
-1
+#<Process::Status: pid 22225 exit 1>

Interesting – Open3.capture3 doesn’t give us a simple integer but an object. Let’s see:

https://ruby-doc.org/core-2.5.0/Process/Status.html

I had assumed that the “status” from this line:

stdout, stderr, status = *Bundler.with_original_env

would simply be the exit status from the subshell, but it turns out to be an object with some more information. What we really want is status.exitstatus.

And now, all together:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status.exitstatus,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

def test_index_page_returns_404
  @response_status = 404
  assert_error "Projects page returned HTTP 404 - Is the site down? Check the URL given?"
end

Ah, this gives me a warm and fuzzy feeling 😉 The actual assertion helper to be re-used is short and punchy, tests two things that are always occurring together in one go, and in turn uses a helper method with a clear purpose and a structured return value.

In my experience, if you are on the right path with Ruby (and keep refactoring), what emerges is usually something like a small domain-specific language, i.e. “talking” methods, usually short, that have intuitive use and return values, even if it’s something simple without any metaprogramming. Ruby still wins all the beauty contests in my opinion.

Here’s the state as of now.

Next time: Finally finishing up all assertions.

Bloody bloody computers (TODO #1 update)

This is the update to TODO #1 – pointing my old monogreen.de domain to this blog (which is the hosted “free” plan from wordpress.com, i.e. no bells and whistles).

A bit of history: I’ve been using hosteurope to host my mailboxes and to register some domain names for years (I get a warm and fuzzy feeling by actually owning my email address and my mails – I’m hoping of course that hosteurope, compared to say, Google, is not interested in scanning my emails and selling my habits…). But I only bought the “email” package from hosteurope, and not anything else like servers or webspace hosting.

Naturally I’d like to point monogreen.de to this blog. As I’ve written last time, you have to have a webspace package booked in order to set that up, so I bought the cheapest plan.

First hurdle: It turns out they make that available quite fast (<15min) after ordering – but you have to log out and back in to see the new package in KIS. Groan.

Next hurdle: As I now had two packages (one email-only, one webspace which also includes a mailbox as well as webspace, php hosting, etc.), I had to “transfer” the domain monogreen.de from the email package. With some trepidation, I did so, as it seemed like the “redirect” setting would only be available on the webspace package, and I assumed I could still have the email addresses themselves point to the actual inbox I’ve been using for years.

Well that worked alright, and could immediately set up the redirect under the webspace package. However…the redirect didn’t work. Knowing things sometimes take a while to start working at hosteurope, I waited a while, but still nothing. Then, being suspicious, I sent myself an email from a different address/provider, and – it bounced o.O . I had managed to break my primary email address.

Photo Credit: Kalle Gustafsson https://flic.kr/p/GweE1X

Half a panicky hour later I had everything back to as it was before (including setting back up the various email aliases to the actual mailbox, which the “move” had severed) – and here’s two tips for anyone in this situation, and for myself in the future:

  • Every package in KIS has its own block of menu entries – and especially the “Domain settings”. Also there is a general one. So in my case there were three places to look for when trying to move the domain back from the webspace package to the email package (the correct one is the “general” area)
  • Many changes in KIS take ~15min to take effect (they say that in the interface, actually, but you usually assume this is only a general “yeah yeah mostly immediately but let’s cover our asses” policy – it’s not with hosteurope. Needless to say that this makes trying out things extremely cumbersome and error-prone…

And fun fact: After all that, the damn redirect still didn’t work…until next morning, after I had given up, and it suddenly worked just perfectly. Groan.

In summary: Mission accomplished but with too much panicky clicking around…