Hello again everyone, let’s finish this series up!
Let’s continue from last time and speed up a bit. Until now, we had built up a little set of helper methods in the tests:
- `#setup` is called automatically before each test and sets up the internal test server
- `#run_scraper` "shells out" to an external process, running the scraper script against the test server, and captures any output as well as the exit code
- `#assert_error` checks for a specific output on stderr and an exit code of 1
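For reference, here is a minimal sketch of what `#run_scraper` could look like (the real helper was built up in the earlier posts; the `scraper.rb` filename and the `options[:args]` key are assumptions for illustration), using Ruby's `Open3` from the standard library:

```ruby
require 'open3'

# Hypothetical sketch: run the scraper script against the test server
# and capture its output. Empty streams become nil, so checks like
# `if result[:stderr]` behave as intended (empty strings are truthy in Ruby).
def run_scraper(options = {})
  cmd = ['ruby', 'scraper.rb', @testserver_url, *Array(options[:args])]
  stdout, stderr, status = Open3.capture3(*cmd)
  strip = ->(s) { s.strip.empty? ? nil : s.strip }
  { stdout: strip.(stdout), stderr: strip.(stderr), status: status.exitstatus }
end
```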
But until now, we’ve only tested a negative case, i.e. some condition where the test server only returns an HTTP error code, or a nonsensical HTTP body. Time to get positive!
The obvious first step is to add an assertion to check for output on stdout, and an exit code of 0:
```ruby
def assert_success(expected_output, options = {})
  result = run_scraper(options)
  puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end
```
Eagle-eyed readers will have noticed the addition of this line:
```ruby
puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
```
This is basically a debugging helper for the tests themselves. Since our shelling-out code captures all output, it so happened during development of these tests that they would fail for a reason I couldn't see. It turned out the test setup was causing the failure and the script was producing an error message – but I never saw it, as stderr was captured, while the test only complained about stdout being empty. Curses. So this line (and its equivalent in the `#assert_error` method) will output what we normally can't see if the test fails, but only if it fails (otherwise, we'd clutter the console on every test run).
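For completeness, the `#assert_error` counterpart could look roughly like this – a sketch of its assumed shape, based on what it checks:

```ruby
# Sketch of the counterpart helper: expects a specific message on stderr
# and an exit code of 1. The debug line mirrors the one in #assert_success,
# but dumps stdout instead – the stream we would otherwise not see
# when this assertion fails.
def assert_error(expected_output, options = {})
  result = run_scraper(options)
  puts result[:stdout] if result[:stdout] and expected_output != result[:stderr]
  assert_equal(expected_output, result[:stderr])
  assert_equal(1, result[:status])
  result
end
```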
Finally, the meat of the tests is a way to pre-define certain paths on the temporary server that return specific HTTP status codes and response bodies. I spent a bit of time going back and forth until I found an approach that lets us define several different paths with different responses beforehand, while also having a single "default" catch-all response which can be "overwritten". This is the result for the simplest case:
```ruby
def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
end

def start_test_server
  @app ||= proc { |env| [@response_status, {}, [@response_body]] }
  @test_server = Capybara::Server.new(@app).boot
  @testserver_url = "http://#{@test_server.host}:#{@test_server.port}"
end

def assert_success(expected_output, options = {})
  start_test_server unless @test_server
  result = run_scraper(options)
  puts result[:stderr] if result[:stderr] and expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

def set_up_simple_success_case
  # Set it up so there's only one page (the index page), i.e. no pagination, and only one project
  index_html = lambda do
    <<-HTML
      <html><body>
        <h1>Le index page</h1>
        <div class='campaign-details'><a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a></div>
      </body></html>
    HTML
  end

  project_1_html = lambda do
    <<-HTML
      <html><body>
        <h1>Le project 1 page</h1>
        <h5 class='campaign-goal'> € 1,000 </h5>
        <p class='remaining-amount'> Bla bla <span> €80,0 </span> </p>
      </body></html>
    HTML
  end

  @app = proc do |env|
    html = case env["REQUEST_PATH"]
           when "/project/1" then project_1_html.call
           else index_html.call
           end
    [@response_status, {}, [html]]
  end
end

def test_simple_success_case
  set_up_simple_success_case
  assert_success "1 campaigns, 1000€ total, 800€ remaining, 200€ earned"
end
```
Oh boy, that’s quite a jump from the error cases 😉 Let’s pick this apart a bit. The structure is pretty much the same as before, except for one addition: the `#start_test_server` method. This is now separate from `#setup` so we can call it when we need it (i.e. after test-specific setup), not always at the start of everything.
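To illustrate why the ordering matters: a negative test can just tweak the default response variables and rely on the catch-all `@app` handler, while the success cases replace `@app` entirely before the server boots. Here is a hypothetical example of such a negative test (the test name and expected message are made up):

```ruby
# Hypothetical negative test: only the default response variables change;
# the catch-all @app handler set up in #start_test_server does the rest.
def test_reports_error_on_server_failure
  @response_status = 500
  @response_body   = 'Oops'
  assert_error "Error: could not fetch index page" # message is made up
end
```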
Let’s just walk through this in the order it is actually being executed:
- When running the test script, the Ruby process reads all this code in, after which minitest automatically calls first the `#setup` method and then the `#test_simple_success_case` method.
- `#setup` just sets up two default response variables. Since they are instance variables, they are “global” to this test run (and get discarded after each run).
- In `#test_simple_success_case` we just call `#set_up_simple_success_case` and then the `#assert_success` method. `#set_up_simple_success_case` is only a method of its own because we will re-use this setup in other tests (not shown here).
- `#set_up_simple_success_case` defines the `@app` handler as an instance variable by building the central proc that is basically our “router” in the test server. In it, we check the request `env` for the path of the HTTP request, and return either the HTML for the simulated “index” page or that of a specific, hard-coded “/project/1” project page. But the HTML is not simply a block of text – we define each page as a lambda, which in turn evaluates a “heredoc” (i.e. a multiline string literal), and then we use `#call` on the lambda objects when the app handler decides which page is being served. Why is this so convoluted, you ask? The problem is one of timing: we need to set up the HTML responses and the `@app` handler before we start the test server (since it’s a constructor argument), but since we link from within the HTML to other pages (using plain inline string interpolation inside the heredocs: `"<a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>"`), we need to know the server’s `@testserver_url` – which is only known after we start the server. Using a lambda lets us get around this conundrum: the closure it creates will evaluate the instance variable `@testserver_url` only when the test is running, at which point the server has been started and `@testserver_url` actually contains something useful. (A small demo of this trick follows after this list.)
- That was the setup; now we continue to the assertion helper `#assert_success`, to which we pass the expected output on stdout.
- Here, finally, and just in time, we start the internal test server via this line: `start_test_server unless @test_server`.
- So let’s look into `#start_test_server` – this is basically the same setup we’ve used all this time, except it picks up the `@app` instance variable and keeps its contents if it has already been defined (well, technically, if it’s “truthy”), but also sets up a simple default `@app` handler (in fact, the one we’ve been using for all the negative examples so far) if nobody has bothered to do so.
- Finally, everything is in place, and `#assert_success` shells out to run the actual scraper script and compares the captured output to what we expected. The script makes an HTTP GET request to `@testserver_url`, which ends up in our `@app` handler, which executes `#call` on the “index” lambda, which “renders” the heredoc, which includes `@testserver_url + "/project/1"` as a link, which is then scraped by the script, which again results in an HTTP GET request, which we again detect in the `@app` handler, which returns the other heredoc, which “renders” a “detail” page finally containing the amounts the script scrapes, adds up, and outputs to stdout. Phew.
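Here is the timing trick in isolation, stripped of all the test scaffolding – the lambda body only reads `@testserver_url` when it is called, so it picks up whatever the variable holds at that point:

```ruby
# The closure reads @testserver_url at call time, not at definition time.
@testserver_url = nil
link = -> { "<a href='#{@testserver_url}/project/1'>project 1</a>" }

@testserver_url = 'http://127.0.0.1:4567' # pretend the server just booted
puts link.call
# => <a href='http://127.0.0.1:4567/project/1'>project 1</a>
```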
I just realized that this explanation is much longer than the actual code 😉 I hope someone learned a bit. If anyone reads this, let me know in the comments whether this was much too verbose for you, or whether it cleared something up.
Something to note here is that, of course, the HTML returned by the test server is quite different from the real crowdfunding site. However, the bare-bones elements are there, with the same structure of HTML tags and attributes, so that the scraper finds the data during the tests. A completely different approach would be to capture and save the actual HTML from the website and use that for these tests. That has the advantage of “freezing” the website in time and letting later developers know what the site looked like when the code was written, in case it changes later. With a “real” project, I would certainly consider this, but in this case we need to preserve the anonymity of the website, so I can’t check in real HTML. Another factor is that these examples, being the bare minimum, are just barely short enough to fit right in with the test code (records of the real responses would need to go into files or somesuch, otherwise the test code would be unreadable), and they illustrate the structure we’re looking for quite clearly. As usual, we have a tradeoff to consider.
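If we did go the recorded-HTML route, the lambdas would simply read from fixture files instead – a sketch, with the `fixtures/` directory and file names made up:

```ruby
# Hypothetical alternative: serve captured HTML from fixture files.
index_html     = -> { File.read(File.join(__dir__, 'fixtures', 'index.html')) }
project_1_html = -> { File.read(File.join(__dir__, 'fixtures', 'project_1.html')) }
```

One wrinkle with that approach: recorded pages would link to the real site, so we would have to rewrite their URLs to point at `@testserver_url` before serving them.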
So, this setup is the main trick of the test code. We allow each test to define its own `@app` handler, with semi-dynamic HTML response bodies and HTTP status codes for specific request paths, and for each case we set up a scenario we want the script to respond to accordingly. Check out the current version of the code to see all tests – I’ve added a lot more than we talked about here. Some are quite elaborate, as the real site has three levels of links: the index page with detail-page links, then a bunch of pagination links, each of which returns a page with more detail pages – all of them need to be simulated at least once for the tests to be meaningful.
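As a taste of what those more elaborate setups look like, here is a sketch of a three-level router – the paths and the page lambdas are hypothetical, see the actual repo for the real tests:

```ruby
# Hypothetical three-level router: index page, pagination pages,
# and individual project detail pages.
@app = proc do |env|
  html = case env["REQUEST_PATH"]
         when %r{\A/project/(\d+)\z} then project_html.call(Regexp.last_match(1))
         when %r{\A/page/(\d+)\z}    then page_html.call(Regexp.last_match(1))
         else index_html.call
         end
  [@response_status, {}, [html]]
end
```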
Let’s wrap this all up now. Just some final notes:
- During coding, I used `#proc` and `#lambda` pretty interchangeably, and later cleaned it up to only use `#lambda`, for clarity. There are some differences between these two ways of setting up a closure in Ruby, but for our purposes here they don’t matter (a tiny demo of the differences follows after these notes). I just like lambda better because it reminds me of Half-Life 😉 [Update: Actually, I’ve gone over this again and changed both `#proc` and `#lambda` to the “new” stabby lambda: `-> {}`. Sorry, Mr. Freeman…]
- I’ve also added tests for the “verbose” flag (and also the short form, “-v”), and for that refactored the debugging output to be much shorter and less Ruby-specific. In fact, only the requested URLs are now printed in verbose mode, which (together with the returned status code and the byte length of the response body) is already a pretty good overview of what happens in the script.
- While working on the tests, I actually found two bugs in the scraper script. One was introduced during refactoring and was caught pretty much immediately by the tests; the other was an unhelpful crash that occurred when a pagination link returned something unexpected. I caught that one by methodically writing tests for each simple error condition (404 error, empty HTML) for each URL requested by the script. Hooray for TDD!
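For the curious, the differences mentioned above in a nutshell: lambdas (and the stabby `->` form, which is just alternative syntax) are strict about their argument count and treat `return` as returning from the lambda itself, while procs fill in missing arguments with nil and `return` inside them returns from the enclosing method:

```ruby
doubler = ->(x) { x * 2 }     # stabby lambda, same as lambda { |x| x * 2 }
loose   = proc { |x, y| x }   # proc

doubler.call(21)  # => 42
doubler.call      # raises ArgumentError – lambdas are strict about arity
loose.call(1)     # => 1 – procs fill missing arguments with nil
```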
Well, and that’s it! I hope someone learned something from all this, or at least was entertained. I certainly learned a lot about writing about my code, instead of just hacking it out. The next thing I want to do with this, now that we have a pretty thorough language-agnostic test harness, is to try out writing the same functionality in programming languages I’m not so fluent in (like JavaScript), or ones completely new to me, like Rust!
So, see you laters, scraper-alligators!