Hello and welcome back again, dear potential readers. It has been some time, right? Well, I have an excuse: I’ve been travelling. Maybe I’ll post some photos someday soon? Who knows!
Let’s get back to that scraping project for now. When we last left off, we had a basically working Ruby script, but no tests at all. Later, while working on the tests, I also spent a lot of unexpected time on an issue with Bundler and running Ruby in subshells.
(Editorial note: Of course, this all really happened some time ago, and not in the clean timeline presented here. I’m writing this from the notes I took while working on the tests a while back.)
I’d like to add tests and an options parser, and clean everything up, before switching to another language, refactoring the Ruby code, and/or trying out parallelization. Parallelizysing. Para-make-it-go-at-the-same-time.
Of course, all those hard-coded puts and pp calls will never do. I did need them for debugging while developing, though, which is a strong indication we should keep them – just behind a “verbose” mode that’s off by default. I’m resisting the urge to keep tinkering with that, though, as we first really need some tests.
Strangely enough, even after so many years of writing oh so many tests, it still feels like a chore. They’re absolutely indispensable, and once they exist, any code feels so much better to work with. But actually starting to write them still feels like psyching oneself up to go to the gym.
First of all, we will need some kind of switch to tell the code we’re in test mode. At first I thought of a simple env variable like “TEST_MODE=true”, but then decided to take a page out of Rails’ book and use something like RAILS_ENV or RACK_ENV, with a non-test default.
(Another note from the editorial future: In the end, it turns out I didn’t even need those env variables and found a much nicer way. I was tempted to clean all this back and forth up a bit, for brevity, but one thing I want to do with these posts is show how a “real” developer arrives at solutions. A pet peeve of mine is how clean-shaven and readily-sprung-from-the-brow-of-Zeus the solutions in articles often look, when the reality is more like making sausage: messy. But that’s normal, and if you clean up after yourself, the result is very tasty, especially if you have some mustard… ok, let’s stop that metaphor here.)
So let’s use “RUN_MODE”. When the value is “test”, we’ll mock the HTTP responses, and otherwise just run normally.
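Just to make that concrete, here’s a sketch of how the script would check it (spoiler from the editorial future: this never made it in):

# Hypothetical sketch of the RUN_MODE switch – this approach gets dropped later.
RUN_MODE = ENV.fetch("RUN_MODE", "production")

if RUN_MODE == "test"
  # serve canned HTTP responses instead of hitting the network
else
  # do real HTTP requests
end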
I guess env variables are a pretty portable way of “communicating” with the code, i.e. they’ll work in most programming languages. On the other hand, they kinda feel like global variables, so I’m somewhat iffy about using them too much.
But let’s also think about what we want to do. We want a bunch of test cases where we define the HTTP responses in the test setup, run the scraper, and then check the stdout output (stdoutput?) for what we expect. While writing this, I realized we’ll need to return not only the body of the HTTP response, but also the status code (to test the problem I had during development: the 404 error on one of the detail pages), and possibly HTTP headers and such.
Additionally, we’ll need to define and return several different responses, in a defined order and/or depending on the URL being called… since we might want to test that the index page contains one, none, or several detail page links, etc.
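To make that concrete, here’s a rough sketch of “different responses depending on the URL”, written as a Rack-style app (the [status, headers, body] convention we’ll meet again below – the paths and bodies here are made up):

# Hypothetical sketch: dispatch canned responses based on the requested path.
app = proc do |env|
  case env["PATH_INFO"]
  when "/projects"
    [200, {}, ["<html>…index page with detail links…</html>"]]
  when "/projects/broken"
    [404, {}, ["Not found"]] # simulate the 404 on a detail page
  else
    [200, {}, ["<html>…a detail page…</html>"]]
  end
end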
This might be obvious to the dear readers, and I also didn’t really think about it before. With the kind of Ruby projects I’m used to working with, there are lots of tools for this – the mocking framework in rspec (https://relishapp.com/rspec/rspec-mocks/docs) makes this pretty easy, and there are more sophisticated tools like https://github.com/bblimke/webmock and https://github.com/vcr/vcr .
However, these all work by reaching into the current Ruby process and redefining how the actual HTTP requests are made. This works beautifully in Ruby (all you readers who are screaming about dependency injection out there, please calm down. It’s fine, actually), but will super duper not work when we try out another language.
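Just for illustration, stubbing a request with webmock looks roughly like this (a sketch from memory – check the README for the real thing):

require "webmock/minitest"

# webmock redefines Net::HTTP & friends inside *this* Ruby process –
# which is exactly why it can't help us once the scraper isn't Ruby anymore.
stub_request(:get, "http://example.com/projects").
  to_return(status: 200, body: "<html>…</html>")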
So, another approach would be to go back to our trusty friend the env variable and start defining the HTTP response codes and bodies (probably from saved “fixture” files) there, i.e.:
FIRST_HTTP_RESPONSE_CODE=200 FIRST_HTTP_RESPONSE_BODY=./fixtures/first_response.html ./test ruby/scrape_crowdfunder
But, well, just look how fugly this is. Nope nope nope. It also means our actual application code would need to understand all these env vars and contain a lot of test-specific code. Nope nope nope.
An adventurous thought occurs: we do have the CROWDFUNDER_PROJECTS_URL variable (or, actually, the command-line parameter)… what if we ran an actual HTTP server, controlled from within our test code, and pointed the script at that? It’s quite easy to run simple HTTP servers from Ruby code, and we could neatly define each response from within each test case…
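For example, with WEBrick from the standard library, a throwaway server is only a few lines – a sketch:

require "webrick"

# Port: 0 lets the OS pick a free port; the logger options keep test output clean.
server = WEBrick::HTTPServer.new(Port: 0, AccessLog: [], Logger: WEBrick::Log.new(File::NULL))
server.mount_proc("/") { |_req, res| res.body = "Hello, Sailor!" }
Thread.new { server.start }

# After booting, the actual port is in server.config[:Port],
# so we'd point the scraper at "http://127.0.0.1:#{server.config[:Port]}"…
server.shutdown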
Another thought occurs: Somebody probably already did this. Let’s google…
Ok, here are some options. As expected, people mostly use mock frameworks that work only within Ruby, but there are some ideas here (I found most of these via https://stackoverflow.com/questions/10166611/launching-a-web-server-inside-ruby-tests):
- These snippets: https://gist.github.com/mojavelinux/a7e0cabb1b401300a4a5f7fa1ea6689c
- they seem quite low-level, we’d have to write a lot ourselves
- This is much more elaborate: https://www.fedux.org/articles/2015/08/02/setting-up-a-ruby-based-http-server-to-back-your-test-suite.html
- but it spawns a child process for the server, which makes me think it’s probably difficult to control the HTTP responses from within the tests. The author seemed to need only a server that serves files out of a folder. Also, it’s not packaged as a library.
- Now we’re getting somewhere: https://github.com/grosser/stub_server
- This is pretty much it… except it’s a small library that hasn’t seen much action, apparently, and the examples are only for rspec – I’d like to try minitest this time. But we’ll come back to this.
- Looking at this:
- https://github.com/teamcapybara/capybara/blob/master/lib/capybara/server.rb
- https://github.com/teamcapybara/capybara/blob/master/spec/server_spec.rb
- I’m thinking it’s probably possible to just use this – we can load up the Capybara gem, but use only this class out of it. This has the advantage that Capybara is heavily battle-tested and will be kept up to date. Looking at the usage in the spec file, it seems easy to set up HTTP responses; they’re using the Rack interface.
So let’s try picking the server out of the Capybara gem. Off to the code-mobile!
(Some time in the code-mobile 🚗 later…)
After that roadblock (har har) is out of the way, here is the first test code, another simple Ruby shell script. It has an accompanying Gemfile, and we’re using minitest as a simple testing framework. I’ve always used rspec in my projects before and wanted to try this out. It works pretty nicely for a minimal setup like this (you just have to declare the test class and require "minitest/autorun", and it, well, autoruns after all the code has been read by the Ruby process. And every method whose name starts with test_ automagically becomes a test case):
#!/usr/bin/env ruby
# frozen_string_literal: true

require 'rubygems'
require 'bundler/setup'
require 'pp'
require 'pry'

SCRAPER_PATH = ARGV.first

unless SCRAPER_PATH and File.exist?(SCRAPER_PATH)
  raise ArgumentError, "Please provide the scraper you want to test, i.e. './test ruby/scrape_crowdfunder'"
end

require "minitest/autorun"

class TestScraper < Minitest::Test
  def setup
    @testserver_url = "http://example.com"
  end

  def run_scraper
    path, script = SCRAPER_PATH.split("/")
    command = ""
    # We need to cd into any sub-folder(s) so the scripts there can do setup like rvm, bundler, nvm, etc.
    command += "cd #{path} && " if path
    command += "./#{script}"
    command += " #{@testserver_url}"
    # We need to have a "clean" Bundler env (i.e., forget any currently loaded gems),
    # as the script called might be another Ruby script, with its own Gemfile, and by default
    # shelling out "keeps" the gems from this test runner, making the script fail
    Bundler.with_original_env do
      `bash -c '#{command}'`
    end
  end

  def test_that_shit_works
    assert_equal "OHAI!", run_scraper
  end
end
Here is the first “successful” run of the test script:
crowdfunder_scraper $ ./test ruby/scrape_crowdfunder
Run options: --seed 48205
# Running:
F
Finished in 0.708704s, 1.4110 runs/s, 1.4110 assertions/s.
1) Failure:
TestScraper#test_that_shit_works [./test:37]:
--- expected
+++ actual
@@ -1 +1,6 @@
-"OHAI!"
+"GET \"http://example.com\"
+[]
+[]
+[]
+0 campaigns, 0€ total, 0€ remaining, 0€ earned
+"
1 runs, 1 assertions, 1 failures, 0 errors, 0 skips
It is actually running the script, and failing because there’s too much output! Also, of course, our placeholder example.com domain has no project links, so it fails even more. We should add checks so that a wrong domain, or a changed page, produces a more explicit message. We’ll add that to the tested code once the test actually sets up something sensible.
Now to the interesting bit here – our internal test HTTP server. So far, we’ve kept requesting example.com over and over, which is hardly fair to it. And we could never change its responses to the ones we want to simulate.
So let’s try requiring capybara/server and using its host and port in each assertion:
require "capybara/server"
def setup
app = proc { |_env| [200, {}, ['Hello, Sailor!']] }
server = Capybara::Server.new(app).boot
@testserver_url = "http://#{server.host}:#{server.port}"
end
But then, we get this error:
NoMethodError: undefined method `server_port' for Capybara:Module
/home/mt/.rvm/gems/ruby-2.6.0/gems/capybara-2.14.0/lib/capybara/server.rb:63:in `initialize'
./test:25:in `new'
./test:25:in `setup'
Looking a bit at the code of the Server class, it seems to call Capybara.server_port… these methods are probably simply not defined because we are only requiring a single file out of the whole gem. I fiddled around with requiring some of the “config” files from the Capybara gem, but that didn’t seem to work. Another idea would have been to just (re-)define the Capybara module ourselves and add the methods we need by trial and error, but that seems like a long road to go down, and hard to maintain.
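(Sketched out, that dead end would have started something like this – all names and return values guessed:)

# The rejected idea: fake just enough of the Capybara module
# for the Server class to be happy, guessing via trial and error.
module Capybara
  def self.server_port; nil;         end # nil should make Server pick a free port
  def self.server_host; "127.0.0.1"; end
  # …plus whatever else Server turns out to call…
end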
So we’re just loading up Capybara completely:
require 'capybara'
and the error goes away.
So close now… let’s make an actually useful assertion. We’re setting up the server to return just a string, no HTML:
def setup
  app = proc { |_env| [200, {}, ['Hello, Sailor!']] }
  server = Capybara::Server.new(app).boot
  @testserver_url = "http://#{server.host}:#{server.port}"
end
Then we add a test case checking that, in this case, we show the user a warning that the website seems to be wonky:
def test_index_page_has_no_projects
  assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper
end
And when we run the tests now, they rightfully complain that the scraper doesn’t give us a useful error message, but only some junk output and a result of “0 projects”.
So now we’re doing actual TDD 🙂 We’ve written a test for something that isn’t a feature in our code yet. Let’s add it now:
if detail_urls.size == 0
  puts "No projects found, has the site changed? Check the URL given?"
else
  # …the existing scraping/output code continues here…
end
Aaaaand:
3) Failure:
TestScraper#test_index_page_has_no_projects [./test:45]:
--- expected
+++ actual
@@ -1 +1,5 @@
-"No projects found, has the site changed? Check the URL given?"
+"GET \"http://127.0.0.1:36409\"
+[]
+[]
+No projects found, has the site changed? Check the URL given?
+"
Of course, this still fails because of the extra debugging output, but yaaaay!
This Capybara mini-server is great, hooray to open source!
The code is at this commit now: https://github.com/MGPalmer/crowdfunder_scraper/commit/aff8e7abddb4981fc73ea700821dd2e736213337
At this point, we have a good setup, it seems. We can now completely mock out the “real” webserver and replace it in our tests with one we control from within the tests (even though we currently only return ‘Hello, Sailor!’). And since our little command-line tool only produces output via stdout (i.e. it has no side effects like database entries added or files written), we can completely black-box-test it.
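One refinement we’ll want sooner or later (not in the commit above – consider this a sketch): if the Rack proc closes over an instance variable, every test case can set up its own response, which is the whole point of controlling the server from within the tests:

def setup
  # Default response; individual tests can overwrite @app_response.
  @app_response = [200, {}, ["Hello, Sailor!"]]
  app = proc { |_env| @app_response }
  server = Capybara::Server.new(app).boot
  @testserver_url = "http://#{server.host}:#{server.port}"
end

def test_index_page_with_one_project
  # "fixtures/one_project.html" is a hypothetical saved fixture file.
  @app_response = [200, {}, [File.read("fixtures/one_project.html")]]
  assert_includes run_scraper, "1 campaigns"
end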
Before we wrap this post up, let’s get this one assertion green. The code actually works, but the script produces debugging output which is 1) pretty ugly and 2) not normally useful – only for the developer, and only if something goes wrong. A classic case for making it optional (but still keeping it as part of the code – I don’t want to add and remove debugging output every time I encounter an issue). In a web framework, we’d use a “debug” log level and the logging facilities usually provided, but for a script the usual thing is to add a flag. AFAIK, the convention is to call it -v (and in long form, --verbose). And again, this seems like something that has been solved many times before, so some googling for “option parser command line” later, we find this:
https://www.ruby-toolbox.com/categories/CLI_Option_Parsers
It seems like this is a popular field in which to create libraries, there are a lot! We’ll just go with the built-in OptionParser from the standard library, especially since the very first example in its docs is our verbose flag.
Oh, and while we’re at it, we should also make it print out a good usage example:
require 'optparse'

options = {}
OptionParser.new do |opts|
  opts.banner = "Usage: ./scrape_crowdfunder [options] http://example.com/projects"

  opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
    options[:verbose] = v
  end
end.order! # order! stops at the first non-option argument, so the URL stays in ARGV
…and this is how it looks:
crowdfunder_scraper/ruby $ ./scrape_crowdfunder --help
Usage: ./scrape_crowdfunder [options] http://example.com/projects
    -v, --[no-]verbose               Run verbosely
Within the actual script, we now just use something like this:
verbose = options[:verbose]
pp(page_urls) if verbose
and now one of the tests is passing \o/
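(By the way, if those if verbose checks start spreading everywhere, one idea for later – just a sketch, the debug helper name is made up – would be to funnel all debugging output through one method:)

# Constants are visible inside methods, unlike the local `options` hash.
VERBOSE = options[:verbose]

def debug(*args)
  pp(*args) if VERBOSE
end

debug(page_urls)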
Whew, this turned out to be another long post!