Crowd-funding platform scraper, part 4 – Finishing the tests

Hello again everyone, let’s finish this series up!

Let’s continue from last time and speed up a bit. Until now, we had built up a little set of helper methods in the tests:

  1. #setup is called automatically before each test, and sets up the internal test server
  2. #run_scraper “shells out” to run the scraper script in an external process against the test server, capturing its output as well as the exit code
  3. #assert_error checks for a specific message on stderr and an exit code of 1

But until now, we’ve only tested negative cases, i.e. conditions where the test server returns nothing but an HTTP error code or a nonsensical HTTP body. Time to get positive!

The obvious first step is to add an assertion to check for output on stdout, and an exit code of 0:

def assert_success(expected_output, options = {})
  result = run_scraper(options)
  puts result[:stderr] if !result[:stderr].empty? && expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

Eagle-eyed readers will have noticed the addition of this line:

puts result[:stderr] if !result[:stderr].empty? && expected_output != result[:stdout]

This is basically a debugging helper for the tests themselves. Since our shelling-out code captures all output, it happened during development of these tests that they would fail for a reason I couldn’t see. It turned out the test setup was making the script produce an error message – but I never saw it, as stderr was captured, while the test only complained about stdout being empty. Curses. So this line (and its equivalent in the assert_error method) prints what we normally can’t see, but only if the test fails (otherwise, we’d clutter the console on every test run).

Finally, the meat of the tests is a way to pre-define certain paths on the temporary server that return specific HTTP status codes and response bodies. I spent a bit of time going back and forth until I found a way that lets us define several different paths with different responses beforehand, while also having a single “default” catch-all response that can be “overwritten”. This is the result for the simplest case:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
end

def start_test_server
  @app ||= proc { |env| [@response_status, {}, [@response_body]] }
  @test_server = Capybara::Server.new(@app).boot
  @testserver_url = "http://#{@test_server.host}:#{@test_server.port}"
end

def assert_success(expected_output, options = {})
  start_test_server unless @test_server
  result = run_scraper(options)
  puts result[:stderr] if !result[:stderr].empty? && expected_output != result[:stdout]
  assert_equal(expected_output, result[:stdout])
  assert_equal(0, result[:status])
  result
end

def set_up_simple_success_case
  # Set it up so there's only one page (the index page), i.e. no pagination, and only one project
  index_html = lambda do
    <<-HTML
    <html><body>
    <h1>Le index page</h1>
    <div class='campaign-details'><a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a></div>
    </body></html>
    HTML
  end
  project_1_html = lambda do
    <<-HTML
    <html><body>
    <h1>Le project 1 page</h1>
    <h5 class='campaign-goal'> € 1,000 </h5>
    <p class='remaining-amount'> Bla bla <span> €80,0 </span> </p>
    </body></html>
    HTML
  end
  @app = proc do |env|
    html = case env["REQUEST_PATH"]
    when "/project/1" then project_1_html.call
    else index_html.call
    end
    [@response_status, {}, [html]]
  end
end

def test_simple_success_case
  set_up_simple_success_case
  assert_success "1 campaigns, 1000€ total, 800€ remaining, 200€ earned"
end

Oh boy, that’s quite a jump from the error cases 😉 Let’s pick this apart a bit. The structure is pretty much the same as before, except for one addition: the #start_test_server method. This is now separate from #setup so we can call it when we need it (i.e. after test-specific setup), not always at the start of everything.

Let’s just walk through this in the order it is actually being executed:

  1. When running the test script, the Ruby process reads all this code in, and after that minitest automatically calls first the #setup method and then the #test_simple_success_case method
  2. #setup just sets up two default response variables. Since they are instance variables, they are “global” to this test run (and get discarded after each run)
  3. In #test_simple_success_case we just call #set_up_simple_success_case and then the #assert_success method. #set_up_simple_success_case is only a method of its own because we will re-use this setup in other tests (not shown here)
  4. #set_up_simple_success_case defines the @app handler as an instance variable, building the central proc that is basically our “router” in the test server. In it, we check the request env for the path of the HTTP request, and return either the HTML for the simulated “index” page or the one for a specific, hard-coded “/project/1” project page. But the HTML is not simply a block of text – we define each page as a lambda, which in turn evaluates a “heredoc” (i.e. a multiline string literal), and we only #call the lambda objects when the app handler decides which page is being served. Why is this so convoluted, you ask? The problem is one of timing: we need to set up the HTML responses and the @app handler before we start the test server (since the handler is a constructor argument), but since the HTML links to other pages (via plain string interpolation inside the heredocs: "<a class='campaign-link' href='#{@testserver_url}/project/1'>project 1</a>"), we need to know the server’s @testserver_url – which is only known after we start the server. A lambda gets us around this conundrum: the closure it creates evaluates the instance variable @testserver_url only when the test is running, at which point the server has been started and @testserver_url actually contains something useful (see the little sketch after this list)
  5. That was the setup; now we continue to the assertion helper #assert_success, to which we pass the expected output on stdout
  6. Here, finally, and just in time, we start the internal test server via this line: start_test_server unless @test_server
  7. So let’s look into #start_test_server – this is basically the same setup we’ve used all along, except that it picks up the @app instance variable and keeps its contents if it has already been defined (well, technically, if it’s “truthy”), and otherwise sets up a simple default @app handler (in fact, the one we’ve been using for all the negative examples so far)
  8. Finally, everything is in place, and #assert_success shells out to run the actual scraper script and compares the captured output to what we expected. The script makes an HTTP GET request to @testserver_url, which ends up in our @app handler, which executes #call on the “index” lambda, which “renders” the heredoc, which includes the @testserver_url + /project/1 link, which is then scraped by the script, which again results in an HTTP GET request, which we again detect in the @app handler, which returns the other heredoc, which “renders” a “detail” page finally containing the amounts the script scrapes, adds up, and outputs to stdout. Phew.
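
To make that timing issue from step 4 concrete, here is a minimal sketch (not from the test code – the port number is made up):

@testserver_url = nil # the server hasn't booted yet

eager = "<a href='#{@testserver_url}/project/1'>project 1</a>"
lazy  = -> { "<a href='#{@testserver_url}/project/1'>project 1</a>" }

@testserver_url = "http://127.0.0.1:36409" # now it has

eager     # => "<a href='/project/1'>project 1</a>" – the host is missing
lazy.call # => "<a href='http://127.0.0.1:36409/project/1'>project 1</a>"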

I just realized that this explanation is much longer than the actual code 😉 I hope someone learned a bit. If anyone reads this, let me know in the comments whether this was much too verbose for you, or whether it cleared something up.

Something to note here: the HTML returned by the test server is, of course, quite different from the real crowdfunding site. However, the bare-bones elements are there, with the same structure of HTML tags and attributes the scraper needs to find the data during the tests. A completely different approach would be to capture and save the actual HTML from the website, and use that for these tests. That has the advantage of “freezing” the website in time and letting later developers know what the site looked like when the code was written, in case it changes later. With a “real” project, I would certainly consider this, but in this case we need to preserve the anonymity of the website, so I can’t check in real HTML. Another factor is that these examples, being the bare minimum, are just barely short enough to fit right in with the test code (records of the real responses would need to go into files or somesuch, otherwise the test code would be unreadable), and they illustrate the structure we’re looking for quite clearly. As usual, we have a tradeoff to consider.
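
For illustration, here’s a minimal sketch of what that fixture-file variant could look like in our setup (the fixtures folder and file name are hypothetical):

# Hypothetical alternative: serve a saved copy of the real index page.
def set_up_recorded_index_case
  index_html = File.read(File.expand_path('fixtures/index.html', __dir__))
  @app = proc { |_env| [200, {}, [index_html]] }
end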

So, this setup is the main trick of the test code. We allow each test to define its own @app handler, with semi-dynamic HTML response bodies and HTTP status codes for specific request paths, and for each case we set up a scenario we want the script to respond to accordingly. Check out the current version of the code to see all tests – I’ve added a lot more than we talked about here. Some are quite elaborate: the real site has three levels of links – the index page with detail page links, then a bunch of pagination links, each of which returns a page with more detail pages – and all of them need to be simulated at least once for the tests to be meaningful.
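
To give you an idea, a handler simulating pagination can be built along the same lines – this is just an illustrative sketch, not the actual test code (the paths and the pagination_html lambda are made up):

@app = proc do |env|
  html = case env["REQUEST_PATH"]
         when %r{\A/project/} then project_html.call    # any detail page
         when %r{\A/page/}    then pagination_html.call # a pagination page with more detail links
         else                      index_html.call      # the index, linking to pagination and details
         end
  [@response_status, {}, [html]]
end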

Let’s wrap this all up now. Just some final notes:

  • During coding, I used #proc and #lambda pretty interchangeably, and later cleaned it up to only use #lambda, for clarity. There are some differences between these two ways of setting up a closure in Ruby, but for our purposes here they don’t matter – see the comparison after these notes. I just like lambda better because it reminds me of Half-Life 😉 [Update: Actually, I’ve gone over this again and changed both #proc and #lambda to the “new” stabby lambda: -> {} Sorry, Mr. Freeman…]
  • I’ve also added tests for the “verbose” flag (and its short form, “-v”), and for that I refactored the debugging output to be much sparser and less Ruby-specific. In fact, only the requested URLs are now printed in verbose mode, which (together with the response status code and the byte length of the response body) already gives a pretty good overview of what happens in the script.
  • While working on the tests, I actually found two bugs in the scraper script. One was introduced during refactoring, and was caught pretty much immediately by the tests, and the other one was an unhelpful crash that occurred when a pagination link returned something unexpected. This was caught by me methodically writing tests for each simple error condition (404 error, empty HTML) for each URL requested by the script. Hooray for TDD!
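
Back to that proc/lambda note for a second – for reference, here are the three spellings side by side; any of them builds a closure that works as our test-server app:

# The same closure three ways; the differences (return semantics,
# argument checking) don't matter for a Rack-style app like ours.
app1 = proc    { |env| [200, {}, ['Hello, Sailor!']] }
app2 = lambda  { |env| [200, {}, ['Hello, Sailor!']] }
app3 = ->(env) { [200, {}, ['Hello, Sailor!']] }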

Well, and that’s it! I hope someone learned something from all this, or at least was entertained. I certainly learned a lot by writing about my code, instead of just hacking it out. The next thing I want to do with this, now that we have a pretty thorough language-agnostic test harness, is to write the same functionality in programming languages I’m not so fluent in (like JavaScript), or ones completely new to me, like Rust!

So, see you laters, scraper-alligators!

Crowd-funding platform scraper, part 3 – what’s your exit strategy?

Since the last post was pretty long, let’s make this one a bit shorter and only show one addition to the script and the tests.

The assertion I wanted to tackle next is still not a positive case, but rather another error scenario (again, one that I came across during the initial development), which is that the first URL called returns an HTTP 404 error code.

For that, we need to change the testing setup from a hardcoded 200 response:

app = proc { |_env| [200, {}, ['Hello, Sailor!']] }

so that it doesn’t always return the same response code and body, but something we can control for each assertion.

This should do the trick:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
  app = proc { |_env| [@response_status, {}, [@response_body]] }
  server = Capybara::Server.new(app).boot
  @testserver_url = "http://#{server.host}:#{server.port}"
end

Since the app proc uses instance variables, we can change them in the assertions even after the setup has been run.

Here is the assertion now:

def test_index_page_returns_404
  @response_status = 404
  assert_equal "Projects page returned HTTP '404' - Is the site down? Check the URL given?", run_scraper
end

Et voilà, we get an expected failure:

1) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Of course, we didn’t yet implement this specific error message. Let’s become more user-friendly:

abort("Projects page returned HTTP #{status} - Is the site down? Check the URL given?") if index.nil?

You will also notice we’ve added another little detail – we’re now exiting with a proper UNIX error code, via the #abort method.
Not only does this save us from unnecessarily deep nesting in the script, it also lets us play nice with other programs.

Here is a guide I’ve used to brush up on Ruby exit codes: https://www.honeybadger.io/blog/how-to-exit-a-ruby-program/

Thanks to this guide we now know that we should also be using stderr instead of the usual stdout to print error messages. Luckily, #abort already does this for us.
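
By the way, a quick shell check shows both behaviors at once – the message lands on stderr (still visible in the terminal here), and the exit code is 1:

$ ruby -e 'abort("Something broke")'; echo $?
Something broke
1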

Oh, but when running the test, we get no output at all:

4) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Ah yes, we knew from the guide about shelling out that the `backticks` don’t capture stderr…two steps forward, one step back 🙂
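
You can see this right in irb – backticks return only stdout, while stderr leaks straight through to the terminal:

out = `bash -c 'echo to-stdout; echo to-stderr 1>&2'`
# "to-stderr" is printed to the terminal at this point...
out # => "to-stdout\n" – ...and only stdout ends up in the string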

Some more googling later:

https://www.honeybadger.io/blog/capturing-stdout-stderr-from-shell-commands-via-ruby/

Two articles from Honeybadger in a row, these guys are helping us out today 🙂

So, now we know how to run a subshell properly, and capture everything we want to know:

def run_scraper
  # NB: Open3 is part of the standard library – this needs a require 'open3' at the top of the file
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  [stdout.strip, stderr.strip, status]
end

assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper[1]

Hah! This is much better, and also allows us to get at the exit code (now that it means something).

However, while I think it’s a good interface to return an array of results from the run_scraper method (they all belong together semantically), the access via the brackets [1] seems iffy to me – you can’t tell from looking at this what we’re trying to access there.

How about we wrap the result in a hash and then access it like so: run_scraper[:stderr]

That would be better. However, we’ll probably need to define testing helper methods sooner or later anyway to cut down on repetition – and I’d like to test the return code in one go as well, while we’re at it. In fact, let’s do both. Both is good:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

This looks so much nicer. But we still get an error in the test:

3) Failure:
TestScraper#test_index_page_has_no_projects [./test:62]:
--- expected
+++ actual
@@ -1 +1 @@
-1
+#<Process::Status: pid 22225 exit 1>

Interesting – Open3.capture3 doesn’t give us a simple integer but an object. Let’s see:

https://ruby-doc.org/core-2.5.0/Process/Status.html

I had assumed that the “status” from this line:

stdout, stderr, status = *Bundler.with_original_env do ... end

would simply be the exit status from the subshell, but it turns out to be an object with some more information. What we really want is status.exitstatus.
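
A quick irb session confirms it (the pid shown here is illustrative, of course):

require 'open3'

_out, _err, status = Open3.capture3("bash -c 'exit 1'")
status            # => #<Process::Status: pid 12345 exit 1>
status.exitstatus # => 1
status.success?   # => false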

And now, all together:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status.exitstatus,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

def test_index_page_returns_404
  @response_status = 404
  assert_error "Projects page returned HTTP 404 - Is the site down? Check the URL given?"
end

Ah, this gives me a warm and fuzzy feeling 😉 The actual assertion helper to be re-used is short and punchy, tests two things that always occur together in one go, and in turn uses a helper method with a clear purpose and a structured return value.

In my experience, if you are on the right path with Ruby (and keep refactoring), what emerges is usually something like a small domain-specific language, i.e. “talking” methods, usually short, with intuitive usage and return values – even if it’s something simple without any metaprogramming. Ruby still wins all the beauty contests, in my opinion.

Here’s the state as of now.

Next time: Finally finishing up all assertions.

Crowd-funding platform scraper, part 2 – A test harness with a real HTTP server, adding an option parser

Hello and welcome back again, dear potential readers. It has been some time, right? Well I have an excuse: I’ve been travelling. Maybe I’ll post some photos some day soon? Who knows!

Let’s get back to that scraping project for now. When we left off last time, we had a basically working Ruby script, but no tests at all. Some time later, while working on the tests, I spent a lot of unexpected time on an issue with Bundler and running Ruby in subshells.

(Editorial note: Of course this all really happened some time ago, and not in a clean timeline as presented here. I’m now writing this from the notes I took while working on the tests a while ago)

I’d like to add tests, an options parser, and clean everything up before switching to another language, and/or refactoring the Ruby code, or trying out parallelization. Parallelizysing. Para-make-it-go-at-the-same-time.

Of course, all those hard-coded puts and pp calls will never do. I did need them for debugging while developing, though, which seems like a strong indication that we should keep them – but behind a “verbose” mode, not enabled by default. I’m resisting the urge to keep tinkering with that, though, as we first really need some tests.

Strangely enough, even after so many years of writing oh so many tests, it still feels like a chore. They’re absolutely indispensable, and once you have them, any code feels so much better to work with. But still, starting to write them feels like psyching oneself up to go to the gym.

First of all, we will need some kind of switch to tell the code we’re in test mode. At first I thought of a simple env variable like “TEST_MODE=true”, but then decided to take a page out of Rails’ book and use something like RAILS_ENV or RACK_ENV, defaulting to “non-test”.

(Another note from the editorial future: In the end, it turned out I didn’t even need those env variables and found a much nicer way. I was tempted to clean all this back and forth up a bit, for brevity, but one thing I want to do with these posts is to show how a “real” developer arrives at solutions. A pet peeve of mine is how clean-shaven, readily-sprung-from-the-brow-of-Zeus solutions in articles often look, when the reality is more like making sausage: messy. But that’s normal, and if you clean up after yourself, the result is very tasty, especially if you have some mustard… ok, let’s stop that metaphor here.)

So let’s use “RUN_MODE”. When the value is “test”, we’ll mock the HTTP responses, and otherwise just run normally.
I guess env variables are a pretty good way of “communicating” with the code in a portable way, i.e. it’ll work in most programming languages. On the other hand, they kinda feel like global variables, so I’m somewhat iffy about using them too much.
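
Just as an illustration – keeping in mind from the note above that this RUN_MODE switch never made it into the final code – the idea was something like:

# Illustration only – this switch was later dropped entirely.
run_mode = ENV.fetch('RUN_MODE', 'normal')
if run_mode == 'test'
  # serve canned responses instead of making real HTTP requests
end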

But let’s also think about what we want to do. We want a bunch of test cases where we define the HTTP responses in the test setup, run the scraper, and then check the stdout output (stdoutput?) for what we expect. While writing this, I realized we’ll need to control not only the body of the HTTP response, but also the status code (to test the problem I had during development: the 404 error on one of the detail pages). Possibly also HTTP headers and such.

Additionally, we’ll need to define and return several different responses in a defined order, and/or depending on the URL being called – since we might want to test that the index page contains only one, or none, or several detail page links, etc.

This might be obvious to the dear readers, but I hadn’t really thought about it before either. With the kind of Ruby projects I’m used to working with, there are lots of tools for this – the mocking framework in rspec (https://relishapp.com/rspec/rspec-mocks/docs) makes this pretty easy, and there are more sophisticated tools like https://github.com/bblimke/webmock and https://github.com/vcr/vcr .

However, these all work by reaching into the current Ruby process and redefining how the actual HTTP requests are made. This works beautifully in Ruby (all you readers who are screaming about dependency injection out there, please calm down. It’s fine, actually), but will super duper not work when we try out another language.

So, another approach would be to go back to our trusty friend the env variable and start defining the HTTP response codes and bodies (probably from saved “fixture” files) there, i.e.:

./test ruby/scrape_crowdfunder FIRST_HTTP_RESPONSE_CODE=200 FIRST_HTTP_RESPONSE_BODY=./fixtures/first_response.html

But just look at how fugly this is. Nope nope nope. It also means that our actual application code would need to understand all these env vars and contain a lot of test-specific code. Nope nope nope.

An adventurous thought occurs: We do have the CROWDFUNDER_PROJECTS_URL variable (or, actually, the command line parameter)…what if we ran an actual HTTP server, controlled from within our test code, and point the script at this? It’s quite easy to run simple http servers from Ruby code, and we could neatly define each response from within each test case…

Another thought occurs: Somebody probably already did this. Let’s google…

Ok, here are some options. As expected, most people are using mock frameworks that only work within Ruby, but there are a few ideas for real test servers (I found most of these via https://stackoverflow.com/questions/10166611/launching-a-web-server-inside-ruby-tests).

So let’s try picking the server out of the Capybara gem. Off to the code-mobile!

(Some time in the code-mobile 🚗 later…)

After that roadblock (har har) is out of the way, here is the first test code – another simple Ruby shell script. It has an accompanying Gemfile, and we’re using minitest as a simple testing framework. I’ve always used rspec in my projects before and wanted to try this out. It works pretty nicely for a minimal setup like this: you just declare the class and require 'minitest/autorun', and it, well, autoruns after all the code has been read by the Ruby process. And every method whose name starts with test_ automagically becomes a test case:

#!/usr/bin/env ruby
# frozen_string_literal: true

require 'rubygems'
require 'bundler/setup'
require 'pp'
require 'pry'

SCRAPER_PATH = ARGV.first
unless SCRAPER_PATH and File.exist?(SCRAPER_PATH)
  raise ArgumentError, "Please provide the scraper you want to test, i.e. './test ruby/scrape_crowdfunder'"
end

require "minitest/autorun"

class TestScraper < Minitest::Test
  def setup
    @testserver_url = "http://example.com"
  end

  def run_scraper
    path, script = SCRAPER_PATH.split("/")
    command = ""
    # We need to cd into any sub-folder(s) so the scripts there can do setup like rvm, bundler, nvm, etc.
    command += "cd #{path} && " if path
    command += "./#{script}"
    command += " #{@testserver_url}"
    # We need to have a "clean" Bundler env (i.e., forget any currently loaded gems),
    # as the script called might be another Ruby script, with its own Gemfile, and by default
    # shelling out "keeps" the gems from this test runner, making the script fail
    Bundler.with_original_env do
      `bash -c '#{command}'`
    end
  end

  def test_that_shit_works
    assert_equal "OHAI!", run_scraper
  end
end

Here is the first “successful” run of the test script:

crowdfunder_scraper $ ./test ruby/scrape_crowdfunder 
Run options: --seed 48205

# Running:

F

Finished in 0.708704s, 1.4110 runs/s, 1.4110 assertions/s.

1) Failure:
TestScraper#test_that_shit_works [./test:37]:
--- expected
+++ actual
@@ -1 +1,6 @@
-"OHAI!"
+"GET \"http://example.com\"
+[]
+[]
+[]
+0 campaigns, 0€ total, 0€ remaining, 0€ earned
+"

1 runs, 1 assertions, 1 failures, 0 errors, 0 skips


It is actually running the script, and failing because there’s too much output! Also, of course, our temporary example.com domain has no project links, so it fails even more. We should add checks so that a wrong domain or a changed page produces a more explicit message. We’ll add that to the tested code once the test actually sets up something sensible.

Now to the interesting bit here – our internal test HTTP server. So far, we’ve kept requesting example.com over and over, which is hardly fair to it. And we could never change its response(s) to the one we want to simulate.

So let’s try requiring capybara/server and using its host and port in each assertion:

require "capybara/server"

def setup
app = proc { |_env| [200, {}, ['Hello, Sailor!']] }
server = Capybara::Server.new(app).boot
@testserver_url = "http://#{server.host}:#{server.port}"
end

But then, we get this error:

NoMethodError: undefined method `server_port' for Capybara:Module
/home/mt/.rvm/gems/ruby-2.6.0/gems/capybara-2.14.0/lib/capybara/server.rb:63:in `initialize'
./test:25:in `new'
./test:25:in `setup'

Looking a bit at the code of the Server class, it seems to call Capybara.server_port … these methods are probably simply not defined because we are only requiring a single file out of the whole gem. I fiddled around with requiring some of the “config” files from the Capybara gem, but it didn’t seem to work. Another idea would have been to just (re-)define the Capybara module ourselves and add the methods we need by trial and error, but that seems like a long road to go down, and hard to maintain.

So we’re just loading up Capybara completely:

require 'capybara'

and the error goes away.

So close now… let’s make an actually useful assertion. We’re setting up the server to return just a string, no HTML:

def setup
  app = proc { |_env| [200, {}, ['Hello, Sailor!']] }
  server = Capybara::Server.new(app).boot
  @testserver_url = "http://#{server.host}:#{server.port}"
end

Then we have a test case checking that, in this case, the user gets a warning that the website seems to be wonky:

def test_index_page_has_no_projects
  assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper
end

And when we run the tests now, they rightfully complain that the scraper doesn’t give us a useful error message, but only some junk output and a result of “0 projects”.

So now we’re doing actual TDD 🙂 We’ve tested something that is not actually a feature yet in our code. Let’s add that now:

if detail_urls.size == 0
  puts "No projects found, has the site changed? Check the URL given?"
else
  # (... the rest of the scraping code moves in here ...)
end

Aaaaand:

3) Failure:
TestScraper#test_index_page_has_no_projects [./test:45]:
--- expected
+++ actual
@@ -1 +1,5 @@
-"No projects found, has the site changed? Check the URL given?"
+"GET \"http://127.0.0.1:36409\"
+[]
+[]
+No projects found, has the site changed? Check the URL given?
+"

Of course, this still fails because of the extra debugging output, but yaaaay!
This Capybara mini-server is great, hooray to open source!

The code is at this commit now: https://github.com/MGPalmer/crowdfunder_scraper/commit/aff8e7abddb4981fc73ea700821dd2e736213337

At this point, we have a good setup, it seems. We can now completely mock out the “real” webserver and replace it in our tests with one we control from within the tests (even though we currently only return ‘Hello, Sailor!’). And since our little command-line tool only produces output via stdout (i.e. it has no side effects like database entries added or files written), we can completely black-box-test it.

Before we wrap this post up, let’s get this one assertion green. The code actually works, but the script produces debugging output which is 1) pretty ugly and 2) not normally useful – only for the developer, and only if something goes wrong. A classic case for making it optional (but still keeping it as part of the code – I don’t want to add and remove debugging output every time I encounter an issue). In a web framework, we’d use a “debug” log level and the usually provided logging facilities, but for a command-line script the usual thing is to add a flag. AFAIK, the convention is to call it -v (and in long form, --verbose). And again, this seems like something that has been solved many times before, so some googling for “option parser command line” later, we find this:

https://www.ruby-toolbox.com/categories/CLI_Option_Parsers

It seems like this is a popular field in which to create libraries, there are a lot! We’ll just go with the built-in ‘OptionParser’ from the standard library, especially since the first example is our verbose flag.

Oh, and while we’re at it, we should also make it print out a good usage example:

require 'optparse' # OptionParser comes from the standard library

options = {}
OptionParser.new do |opts|
  opts.banner = "Usage: ./scrape_crowdfunder [options] http://example.com/projects"

  opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
    options[:verbose] = v
  end
end.order!

…and this is how it looks:

crowdfunder_scraper/ruby $ ./scrape_crowdfunder --help
Usage: ./scrape_crowdfunder [options] http://example.com/projects
-v, --[no-]verbose Run verbosely

Within the actual script, we now just use something like this:

verbose = options[:verbose]
pp(page_urls) if verbose

and now one of the tests is passing \o/

Whew, this turned out to be another long post!

Join us next time for even more tests…

Scraping a crowd-funding platform for fun and (non-)profit, part 1

Hello again dear readers (I now actually have a couple, because I’m sending the articles to friends and forcing them to read them. Hi Oana!).
Today we are going to do another somewhat pointless project. Slightly less useless than last time, I hope!

I’ve been looking at a crowdfunding website, which collects donations to user-created good causes (“projects” or “campaigns”). I’ll omit the name and URL here – it’s not that the info is private, but it seems uncouth to point at someone specifically (it’s a smallish company). They don’t talk about it much, but according to the Terms and Conditions, they are financing themselves by taking a percentage (3-8%) of donations to projects.

So I’m curious how much money they are actually moving – knowing nothing else about their finances or business model, does it seem feasible that they are cashflow-positive?

And looking at the website, they have a paginated list of projects (of reasonable size, 8 pages of 8 projects each, i.e. up to 64 projects currently), which I will assume are all that are currently active.

On the list page, they show a countdown à la “1234€ to go”. The total budget of the project is only shown on the detail page, let’s say in our example, that would be 1500€. So that project, currently, would have 1500 – 1234 = 266€ pledged to it, earning the platform about 266 * 0.03 ~ 8€, enough for some tasty Falafel for two!

At first, we could scrape all projects and see how much money in total has been pledged so far via the platform. This of course will miss any previous projects that have been completed (if there are any), but there is no way of knowing, so everything found out here is a lower bound. It’s mostly a finger exercise, anyway…

We could do an additional step and record a scraping run, and come back every couple of days, and then compare to see how the “velocity” of donations is, i.e. do they handle a lot of donations per day? This is of course more involved, as we’d need to record each scraping run and/or the results somewhere (a database etc.). Let’s shelve that for now. Baby steps.

But I also want to take an opportunity here to do something I think will be quite easy for me to do in Ruby (I have been scraping websites and consuming APIs with Ruby extensively in a past life), and see how it’ll work in other languages.

So let’s dump some ideas on how this ought to look:

  • It should be a command-line tool, run via a single (bash) script, i.e. ./scrape_crowdfunder. It’ll write detailed debugging info to stdout when given -v as an option (note to self: Look at libraries for command-line option parsing), but otherwise will just output errors or in the best case "X campaigns, Y€ total, Z€ remaining, A€ earned".
  • The URL of where to start parsing will be given via env variable CROWDFUNDER_PROJECTS_URL so I can keep this out of the repository and protect the innocent.
  • There should be a sort of test harness which takes as input saved HTML pages for the projects index page, and one or two detail pages. A test script runs the tool, and checks the output on stdout for the expected result. This is an extreme example of “integration” testing (Is there a better name for this?), which allows us to swap different language implementations. We’ll probably have to add another env variable or something to make a switch somewhere to use the canned pages instead of the real ones.
  • I’ll write the tests in Ruby, because I’m most comfortable with it, and it’s well at home on the command line, and its dynamic nature works well with testing. Performance is not really a concern here.
  • Since we want the same test to be used on different implementations, let’s make it all one big repository. In a bigger project, we’d probably want a “main” project that pulls in the different implementations as libraries, but let’s stay lowtech for now.
  • Which means we can set up the tests like this: have, under the main folder, the test code and one folder per language implementation:
    project
      testcode1.rb
      testcode2.rb
      test-data/
      ruby/
        scrape_crowdfunder
      javascript/
        scrape_crowdfunder
      ...
    Then we can run the tests via a bash script, and just give the script name of the runner as a parameter:
    ./test ./ruby/scrape_crowdfunder
  • We might, eventually, also look into using different HTTP clients and parallelism there (I just stumbled upon httpx – another one in a long line of Ruby HTTP clients). We’ll need to tread lightly here, though, as we don’t want to hammer the site repeatedly. My impression is that there is not much traffic going on there, so we’d probably cause visible traffic spikes if we went all-out, and that would just be impolite.

So now we have an idea where we’re going – let’s first set up the Ruby stuff and just hack something out, and then clean it up while writing the tests. I usually do tasks in this order:

  1. Just think about the whole project and write down lotsa notes
  2. Hack around to try out the rough edges
  3. Start writing tests once I know what structure the code is going to take, and then go all TDD and write code and tests in lockstep

I’m suspicious of people that claim to be really doing TDD by writing the test first before anything else. Maybe this works if you are adding a routine extension to an existing project, but if you are going green-field or doing some complex new feature? It seems like an invitation to a lot of rewriting :/

Of course, just now I’m also writing down these words here for the blog post, which I don’t usually do when working for a job. I wonder if I should? It would make everything a lot slower, but would provide seamless documentation over the long term. Also it seems that writing down clears up your thoughts quite a bit…Something to ponder in another post?

Let’s just get going:

$ mkdir crowdfunder_scraper
$ cd crowdfunder_scraper/
$ git init

I’m copying a .gitignore from another project, and adding a Gemfile – we’ll at least need an HTTP library, and I hate Ruby’s built-in Net::HTTP with a passion:

$ bundle init

At this point I remember that I want to make this code public, so I create a GitHub repo and retroactively link my local one to it:

$ git remote add origin git@github.com:MGPalmer/crowdfunder_scraper.git

Now you can follow along or look at the last state here: https://github.com/MGPalmer/crowdfunder_scraper

$ mkdir ruby
$ cd ruby
$ touch scrape_crowdfunder
$ chmod +x scrape_crowdfunder

Let’s skip some more boring details. After some back and forth, here we are:

Gemfile:

# frozen_string_literal: true

source "https://rubygems.org"

git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }

gem "httpx"
gem "nokogiri"

scrape_crowdfunder:

#!/usr/bin/env ruby

require 'rubygems'
require 'bundler/setup'
require 'httpx'
require 'nokogiri'

puts "Hello World"

Aannnnnnd:

$ ./scrape_crowdfunder
Hello World

Yaaaay we got a working Ruby script with Bundler.

At this point I go into irb, load up the Gemfile, and fiddle around.

$ irb
2.4.2 :001 > require 'rubygems'; require 'bundler/setup'; require 'httpx'; require 'nokogiri'
2.4.2 :003 > page = Nokogiri.parse(HTTPX.get("https://example.org").to_s).css("a.clicky")
etc.

(fiddle fiddle fiddle)

Yay, got it working!

Here’s the current code, left intentionally ugly:

Check it out on Github: https://github.com/MGPalmer/crowdfunder_scraper/commit/1784bd1b18db4230b1b099f1c943ea5d7b883413


#!/usr/bin/env ruby
# frozen_string_literal: true

require 'rubygems'
require 'bundler/setup'
require 'httpx'
require 'nokogiri'
require 'pp'

index_url  = ENV['CROWDFUNDER_PROJECTS_URL']
index_html = HTTPX.get(index_url).to_s
index      = Nokogiri.parse(index_html)

def get_n_parse(url)
  res = HTTPX.get(url)
  unless res.status == 200
    puts "AAAAAAAAAA HTTP error for #{url} - #{res.status}"
    return nil
  end
  Nokogiri.parse(res.body.to_s)
end

def parse_detail_page_urls(page)
  page.css('.campaign-details a.campaign-link').map { |a| a[:href] }
end

pages = [index]
page_urls = index.css('ul.pagination a.page-link').map { |a| a[:href] }
pp(page_urls)
page_urls.each do |page_url|
  pages << get_n_parse(page_url)
end

pages.compact!

detail_urls = []
pages.each do |page|
  detail_urls += parse_detail_page_urls(page)
end

pp(detail_urls)

campaigns = detail_urls.map do |detail_url|
  puts detail_url
  page = get_n_parse(detail_url)
  next unless page

  campaign_goal    = Integer(page.css('h5.campaign-goal').text.gsub(/€|,/, ''))
  remaining_amount = Integer(page.css('p.remaining-amount').inner_html.gsub(',', '').scan(/€(\d+)?\s/m).flatten.first)
  {
    url: detail_url,
    campaign_goal: campaign_goal,
    remaining_amount: remaining_amount
  }
end.compact

pp(campaigns)

count     = campaigns.size
total     = campaigns.inject(0) { |t, n| t + n[:campaign_goal] }
remaining = campaigns.inject(0) { |t, n| t + n[:remaining_amount] }

puts "#{count} campaigns, #{total}€ total, #{remaining}€ remaining, #{total - remaining}€ earned"

Some notes:

  • It was a little tricky getting the first-page, then each-pagination, then each-detail links right
  • Stumbled hard over one 404 page – httpx will happily give you an empty "" body for that :/ This needs to be tested, i.e. the tests should include cases where each of the HTTP calls returns an error, and check that the script doesn’t choke on them.
  • The markup is a bit of a bitch for the amounts (total and pledged) – I had to use some regexps which are a little more complex than I’m really comfortable with (there’s a worked example after these notes). This needs to be tested thoroughly so we can refactor it later.
  • I should’ve first added verbose mode and a trigger for it – I ended up throwing puts and pp around a lot.
  • Also, adding and using a debugger would have helped a lot, but I didn’t want to slow down for that…
  • The script runs for quite a while – it has to do a couple dozen HTTP calls, and when the code fails in one of the later ones, it’s a real PITA to re-run everything.
  • I’ve moved repeated code into methods, but of course nothing is properly organized.
  • But note how everything happens in discrete steps, and collects data from the previous step, making it easy to inspect the data at each point, and only in the end summing up the derived information we actually want.
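
Since the remaining-amount regexp is the gnarliest bit, here’s a worked example on markup shaped like the real thing (the snippet itself is made up):

inner = " Bla bla <span> €80,0 </span> "   # what p.remaining-amount's inner_html might look like
no_commas = inner.gsub(',', '')            # => " Bla bla <span> €800 </span> "
matches   = no_commas.scan(/€(\d+)?\s/m)   # => [["800"]]
Integer(matches.flatten.first)             # => 800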

But we want the numbers now! After running, and omitting all the debugging output:

57 campaigns, 2829964€ total, 2818168€ remaining, 11796€ earned

At, let’s say, a 3% commission, this means the current projects are earning the platform about 11796 * 0.03 ≈ 354€.

This buys you a lot of Falafel, but it’s not much to run a company on :/ But of course everybody has to start small, and again, we really don’t have all the facts here. But hey, our code works, even though the numbers it produces might be meaningless. Ready for a career in business intelligence 😉

Tune in next time when we clean up this mess, and add tests!