Crowd-funding platform scraper, part 4 – Finishing the tests

Hello again everyone, let’s finish this series up!

Let’s continue from last time and speed up a bit. Until now, we had built up a little set of helper methods in the tests:

  1. #setup is called automatically before each test, and sets up the internal test server
  2. #run_scraper “shells out” to an external process to run the scraper script against the test server, capturing all output as well as the exit code (a sketch follows below)
  3. #assert_error checks for a specific output on stderr and an exit code of 1

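Since we won’t reproduce the whole harness from the earlier parts here, a minimal sketch of what the shelling-out helper might look like, using Open3 from the standard library (the script name and the :args option are simplified assumptions, not the exact original):

require 'open3'

# Sketch of the helper from the earlier parts: run the scraper script as an
# external process against the test server and capture everything it does.
# 'scraper.rb' and the :args option are assumptions for illustration.
def run_scraper(options = {})
  args = options.fetch(:args, [])
  stdout, stderr, status = Open3.capture3('ruby', 'scraper.rb', @testserver_url, *args)
  { stdout: stdout.strip, stderr: stderr.strip, status: status.exitstatus }
end
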
But until now, we’ve only tested negative cases, i.e. conditions where the test server returns an HTTP error code or a nonsensical HTTP body. Time to get positive!

The obvious first step is to add an assertion to check for output on stdout, and an exit code of 0:

def assert_success(expected_output, options = {})
  result = run_scraper(options)
  puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

Eagle-eyed readers will have noticed the addition of this line:

puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]

This is basically a debugging helper for the tests themselves. Since our shelling-out code captures all output, it so happened during development of these tests that they would fail for a reason I couldn’t see. It turned out the test setup was making the script fail, and the script was producing an error message, but I never saw it, because stderr was captured while the test only complained about stdout being empty. Curses. So this line (and its equivalent in the assert_error method, sketched below for reference) will output what we normally can’t see if the test fails, and only if it fails (otherwise, we’d clutter the console on every test run).

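For reference, here is how #assert_error could look with that equivalent debugging line added (the method itself was built in the previous part; this mirrors #assert_success, so take it as an approximation rather than the exact original):

def assert_error(expected_stderr, options = {})
  result = run_scraper(options)
  # Mirror image of the helper line above: on failure, show the stdout
  # we would otherwise never see.
  puts result[:stdout] if result[:stdout] and expected_stderr != result[:stderr]
  assert_equal(expected_stderr, result[:stderr])
  assert_equal(1, result[:status])
  result
end
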
Finally, the meat of the tests is a way to pre-define certain paths on the temporary server that return specific HTTP status codes and response bodies. I spent a bit of time going back and forth until I found an approach that lets us define several different paths with different responses beforehand, while also having a single “default” catch-all response that can be “overwritten”. This is the result for the simplest case:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
end

def start_test_server
  @app ||= proc { |env| [@response_status, {}, [@response_body]] }
  @test_server = Capybara::Server.new(@app).boot
  @testserver_url = "http://#{@test_server.host}:#{@test_server.port}"
end

def assert_success(expected_output, options = {})
  start_test_server unless @test_server
  result = run_scraper(options)
  puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

def set_up_simple_success_case
  # Set it up so there's only one page (the index page), i.e. no pagination, and only one project
  index_html = lambda do
    <<-HTML
    <html><body>
    <h1>Le index page</h1>
    <div class='campaign-details'><a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a></div>
    </body></html>
    HTML
  end
  project_1_html = lambda do
    <<-HTML
    <html><body>
    <h1>Le project 1 page</h1>
    <h5 class='campaign-goal'> € 1,000 </h5>
    <p class='remaining-amount'> Bla bla <span> €80,0 </span> </p>
    </body></html>
    HTML
  end
  @app = proc do |env|
    html = case env["REQUEST_PATH"]
    when "/project/1" then project_1_html.call
    else index_html.call
    end
    [@response_status, {}, [html]]
  end
end

def test_simple_success_case
  set_up_simple_success_case
  assert_success "1 campaigns, 1000€ total, 800€ remaining, 200€ earned"
end

Oh boy, that’s quite a jump from the error cases 😉 Let’s pick this apart a bit. The structure is pretty much the same as before, except for one addition: the #start_test_server method. It is now separate from #setup so we can call it at the moment we need it (i.e. after test-specific setup), not unconditionally at the start of everything.

Let’s just walk through this in the order it is actually executed:

  1. When running the test script, the Ruby process reads in all this code, and then minitest automatically calls first the #setup method and then the #test_simple_success_case method
  2. #setup just sets up two default response variables. Since they are instance variables, they are “global” to this test run (and get discarded after each run)
  3. In #test_simple_success_case we just call #set_up_simple_success_case and then the #assert_success method. #set_up_simple_success_case is only a method of its own because we will re-use this setup in other tests (not shown here)
  4. #set_up_simple_success_case defines the @app handler as an instance variable: the central proc that is basically our “router” in the test server. In it, we check the request env for the path of the HTTP request and return either the HTML for the simulated “index” page or the HTML for a specific, hard-coded “/project/1” project page. But the HTML is not simply a block of text. We define each page as a lambda, which in turn evaluates a “heredoc” (i.e. a multiline string literal), and we only #call the lambda objects when the app handler decides which page is being served. Why so convoluted, you ask? The problem is one of timing: we need to set up the HTML responses and the @app handler before we start the test server (since @app is a constructor argument), but since we link from within the HTML to other pages (via plain string interpolation inside the heredocs: "<a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>"), we need to know the server’s @testserver_url, which is only known after we start the server. Using a lambda gets us around this conundrum: the closure it creates evaluates the instance variable @testserver_url only when the test is running, at which point the server has been started and @testserver_url actually contains something useful (there’s a tiny demonstration of this after the list)
  5. That was the setup; now we continue by running the assertion helper #assert_success, to which we pass the expected output on stdout
  6. Here, finally, and just in time, we start the internal test server via this line: start_test_server unless @test_server
  7. So let’s look into #start_test_server. This is basically the same setup we’ve used all along, except it picks up the @app instance variable and keeps its contents if it has already been defined (well, technically, if it’s “truthy”), and otherwise sets up a simple default @app handler (in fact, the one we’ve been using for all the negative examples so far)
  8. Finally, everything is in place, and #assert_success shells out to run the actual scraper script and compares the captured output to what we expected. The script makes an HTTP GET request to @testserver_url, which ends up in our @app handler, which executes #call on the “index” lambda, which “renders” the heredoc, which includes @testserver_url + /project/1 as a link, which is then scraped by the script, which results in another HTTP GET request, which we again detect in the @app handler, which returns the other heredoc, which “renders” a “detail” page finally containing the amounts the script scrapes, adds up, and outputs to stdout. Phew.

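To make the timing trick from step 4 tangible, here is a tiny standalone demonstration (the URL is made up): the instance variable is read when the lambda is called, not when it is defined:

greeting = -> { "The server lives at #{@testserver_url}" }
greeting.call # => "The server lives at " (@testserver_url is still nil)
@testserver_url = "http://127.0.0.1:4567" # pretend the server just booted
greeting.call # => "The server lives at http://127.0.0.1:4567"
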
I just realized that this explanation is much longer than the actual code 😉 I hope someone learned a bit. If anyone reads this, let me know in the comments whether this was much too verbose for you, or whether it cleared something up.

Something to note here is that the HTML returned by the test server is, of course, quite different from the real crowdfunding site. However, the bare-bones elements are there, with the same structure of HTML tags and attributes, so that the scraper finds the data during the tests. A completely different approach would be to capture and save the actual HTML from the website and use that for these tests. That has the advantage of “freezing” the website in time, letting later developers know what the site looked like when the code was written, in case it changes later. On a “real” project I would certainly consider this, but in this case we need to preserve the anonymity of the website, so I can’t check in real HTML. Another factor is that these hand-written examples, being the bare minimum, are just barely short enough to fit right in with the test code (records of the real responses would need to go into files or some such, otherwise the test code would be unreadable), and they illustrate the structure we’re looking for quite clearly. As usual, we have a tradeoff to consider.

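If we did go the fixture route, the shape would be something like this (the paths and the fixture helper are hypothetical; no such files exist in this project, for the anonymity reasons above):

def fixture(name)
  # Hypothetical: read a captured copy of the real site's HTML from disk.
  File.read(File.join(__dir__, 'fixtures', "#{name}.html"))
end

@app = proc do |env|
  html = case env["REQUEST_PATH"]
  when "/project/1" then fixture("project_1")
  else fixture("index")
  end
  [200, {}, [html]]
end
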
So, this setup is the main trick of the test code. We allow each test to define its own @app handler, with semi-dynamic HTML response bodies and HTTP status codes for specific request paths, and for each case we set up a scenario we want the script to respond to accordingly. Check out the current version of the code to see all tests; I’ve added a lot more than we talked about here. Some are quite elaborate, since the real site has three levels of links (the index page with detail-page links, then a bunch of pagination links, each of which returns a page with more detail-page links), and all of them need to be simulated at least once for the tests to be meaningful. A condensed sketch of such a pagination scenario follows below.

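To give an idea without reproducing the whole test file, a pagination setup might look roughly like this (the paths, the 'pagination' class, and the helper names are invented stand-ins; the real tests differ):

def set_up_paginated_success_case
  # Invented markup for illustration, mirroring the simple case above.
  campaign_link = ->(id) { "<div class='campaign-details'><a class='campaign-link' href='#{@testserver_url}/project/#{id}'>project #{id}</a></div>" }
  detail_page = -> { "<html><body><h5 class='campaign-goal'> € 1,000 </h5><p class='remaining-amount'><span> €80,0 </span></p></body></html>" }
  @app = proc do |env|
    html = case env["REQUEST_PATH"]
    when "/page/2" then "<html><body>#{campaign_link.call(2)}</body></html>"
    when %r{\A/project/} then detail_page.call
    # index: one project link plus one pagination link to a second page
    else "<html><body>#{campaign_link.call(1)}<a class='pagination' href='#{@testserver_url}/page/2'>2</a></body></html>"
    end
    [@response_status, {}, [html]]
  end
end
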
Let’s wrap this all up now. Just some final notes:

  • During coding, I used #proc and #lambda pretty interchangeably, and later cleaned it up to only use #lambda, for clarity. There are some differences between these two ways of setting up a closure in Ruby, but for our purposes here they don’t matter (see the quick comparison after this list). I just like lambda better because it reminds me of Half-Life 😉 [Update: Actually, I’ve gone over this again and changed both #proc and #lambda to the “new” stabby lambda: -> {} Sorry, Mr. Freeman…]
  • I’ve also added tests for the “verbose” flag (and its short form, “-v”), and for that I refactored the debugging output to be much terser and less Ruby-specific. In fact, only the requested URLs are now printed in verbose mode, which (together with the response status code and the byte length of the response body) already gives a pretty good overview of what happens in the script.
  • While working on the tests, I actually found two bugs in the scraper script. One was introduced during refactoring and was caught pretty much immediately by the tests; the other was an unhelpful crash that occurred when a pagination link returned something unexpected. That one was caught by me methodically writing tests for each simple error condition (404 error, empty HTML) for each URL requested by the script. Hooray for TDD!

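For the curious, the three closure spellings side by side. For a handler that only ever gets #call-ed, as in our test server, they behave the same; the differences lie in arity checking and return semantics:

handler_a = proc { |env| [200, {}, ['ok']] }   # a Proc: lenient about argument count
handler_b = lambda { |env| [200, {}, ['ok']] } # a lambda: strict arity, 'return' exits the lambda only
handler_c = ->(env) { [200, {}, ['ok']] }      # the "stabby" form: alternative syntax for lambda

# All three respond to #call, which is all the test server cares about:
handler_c.call({}) # => [200, {}, ["ok"]]
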
Well, and that’s it! I hope someone learned something from all this, or at least was entertained. I certainly learned a lot about writing about my code, instead of just hacking it out. The next thing I want to do, now that we have a pretty thorough language-agnostic test harness, is to try writing the same functionality in programming languages I’m not so fluent in (like JavaScript), or ones completely new to me, like Rust!

So, see you laters, scraper-alligators!
