Crowd-funding platform scraper, part 3 – what’s your exit strategy?

After the last post has been pretty long, let’s make this a bit shorter and only show one addition to the script and the tests.

The assertion I wanted to tackle next is still not a positive case, but rather another error scenario (again, one that I came across during the initial development), which is that the first URL called returns an HTTP 404 error code.

For that, we need to change the testing setup from a hardcoded 200 response:

app = proc { |_env| [200, {}, ['Hello, Sailor!']] }

so that it doesn’t always return the same response code and body, but something we can control for each assertion.

This should do the trick:

def setup
  @response_status = 200
  @response_body = 'Hello, Sailor!'
  app = proc { |_env| [@response_status, {}, [@response_body]] }
end

Since the app proc uses instance variables, we can change them in the assertions even after the setup has been run.

Here is the assertion now:

def test_index_page_returns_404
  @response_status = 404
  assert_equal "Projects page returned HTTP '404' - Is the site down? Check the URL given?", run_scraper
end

Et voilà, we get an expected failure:

1) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Of course, we didn’t yet implement this specific error message. Let’s become more user-friendly:

abort("Projects page returned HTTP #{status} - Is the site down? Check the URL given?") if index.nil?

You will also notice we’ve added another little detail – we’re now exiting with a proper UNIX error code, via the #abort method.
Not only does this save us from unnecessarily deep nesting in the script, it also lets us play nice with other programs.

Here is a guide I’ve used to brush up on Ruby exit codes: https://www.honeybadger.io/blog/how-to-exit-a-ruby-program/

Thanks to this guide we now know that we should also be using stderr instead of the usual stdout to print error messages. Luckily, #abort already does this for us.

Oh, but when running the test, we get no output at all:

4) Failure:
TestScraper#test_index_page_returns_404 [./test:47]:
--- expected
+++ actual
@@ -1 +1 @@
-"Projects page returned HTTP '404' - Is the site down? Check the URL given?"
+""

Ah yes, we knew from the guide about shelling out that the `backticks` don’t capture stderr…two steps forward, one step back 🙂

Some more googling later:

https://www.honeybadger.io/blog/capturing-stdout-stderr-from-shell-commands-via-ruby/

Two articles from Honeybadger in a row, these guys are helping us out today 🙂

So, now we know how to run a subshell properly, and capture everything we want to know:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  [stdout.strip, stderr.strip, status]
end

assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper[1]

Hah! This is much better, and also allows us to get at the exit code (now that it means something).

However, while I think it’s a good interface to return an array of results from the run_scraper method (they all belong together semantically), the access via the brackets [1] seems iffy to me – you can’t tell from looking at this what we’re trying to access there.

How about we wrap the result in a hash and then access it like so: run_scraper[:stderr]

That would be better. However, we’ll probably need to define testing helper methods sooner or later anyway to cut down on repetition – and I’d like to test the return code in one go as well, while we’re at it. In fact, let’s do both. Both is good:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

This looks so much nicer. But we still get an error in the test:

3) Failure:
TestScraper#test_index_page_has_no_projects [./test:62]:
--- expected
+++ actual
@@ -1 +1 @@
-1
+#<Process::Status: pid 22225 exit 1>

Interesting – Open3.capture3 doesn’t give us a simple integer but an object. Let’s see:

https://ruby-doc.org/core-2.5.0/Process/Status.html

I had assumed that the “status” from this line:

stdout, stderr, status = *Bundler.with_original_env

would simply be the exit status from the subshell, but it turns out to be an object with some more information. What we really want is status.exitstatus.

And now, all together:

def run_scraper
  (...)
  stdout, stderr, status = *Bundler.with_original_env do
    Open3.capture3(command)
  end
  {
    stdout: stdout.strip,
    stderr: stderr.strip,
    status: status.exitstatus,
  }
end

def assert_error(expected_message)
  result = run_scraper
  assert_equal(expected_message, result[:stderr])
  assert_equal(1, result[:status])
  result
end

def test_index_page_returns_404
  @response_status = 404
  assert_error "Projects page returned HTTP 404 - Is the site down? Check the URL given?"
end

Ah, this gives me a warm and fuzzy feeling 😉 The actual assertion helper to be re-used is short and punchy, tests two things that are always occurring together in one go, and in turn uses a helper method with a clear purpose and a structured return value.

In my experience, if you are on the right path with Ruby (and keep refactoring), what emerges is usually something like a small domain-specific language, i.e. “talking” methods, usually short, that have intuitive use and return values, even if it’s something simple without any metaprogramming. Ruby still wins all the beauty contests in my opinion.

Here’s the state as of now.

Next time: Finally finishing up all assertions.

Crowd-funding platform scraper, part 2 – A test harness with a real HTTP server, adding an option parser

Hello and welcome back again, dear potential readers. It has been some time, right? Well I have an excuse: I’ve been travelling. Maybe I’ll post some photos some day soon? Who knows!

Let’s get back to that scraping project for now. Last time we left off, we had a basically working Ruby script, but no tests at all, and some time later, while working on the tests, I spent a lot of unexpected time on an issue with bundler and running Ruby in subshells.

(Editorial note: Of course this all really happened some time ago, and not in a clean timeline as presented here. I’m now writing this from the notes I took while working on the tests a while ago)

I’d like to add tests, an options parser, and clean everything up before switching to another language, and/or refactoring the Ruby code, or trying out parallelization. Parallelizysing. Para-make-it-go-at-the-same-time.

Of course, all those hard-coded puts and pp calls will never do. I did need them for debugging while developing, though, so that seems like a strong indication that we should keep them, but they should go to the “verbose” mode, and not be enabled by default. I’m resisting the urge to keep tinkering with that, though, as we first really need some tests.

Strangely enough, even after so many years of writing oh so many tests, it still feels like a chore. They’re absolutely indispensable, and after having them any code feels so much better to work with. But still, starting to write them feels like psyching oneself to go to the gym.

First of all, we will need some kind of switch to tell the code we’re in test mode. At first I thought to have a simple env variable like “TEST_MODE=true” but then decided on taking a page out of Rails’ book and use something like the RAILS_ENV or RACK_ENV, defaulting to “non-test”.

(Another note from the editorial future: In the end, it turns out I didn’t even need those env variables and found a much nicer way. I was tempted to clean all this back and forth up a bit, for brevity, but one thing I want to do with these posts is to show how a “real” developer arrives at solutions. A pet peeve of mine is how clean-shaven, readily-sprung-from-the-brow-of-zeus solutions in articles often look like, and how the reality is more like making sausage: Messy. But it’s normal, and if you clean up after yourself, the result is very tasty especially if you have some mustard… ok let’s stop that metaphor here).

So let’s use “RUN_MODE”. When the value is “test”, we’ll mock the HTTP responses, and otherwise just run normally.
I guess env variables are a pretty good way of “communicating” with the code in a portable way, i.e. it’ll work in most programming languages. On the other hand, they kinda feel like global variables, so I’m somewhat iffy about using them too much.

But let’s also think about what we want to do. We want to have a bunch of test cases, where we define the HTTP responses in the test setup, run the scraper, and then check the stdout output (stdoutput?) for what we expect. While writing this, I realized we’ll need to not only return the body of the HTTP response, but also the status code (to test the problem I had during developing, the 404 error of one of the detail pages). Possibly also HTTP headers and such.

Additionally, we’ll need to define and return several different responses in a defined order, and/or depending on the URL being called…since we might want to test that the index page contains only one, or none, or several detail page links, etc.

This might be obvious to the dear readers, and I also didn’t really think about this before. With the kind of Ruby projects I’m used to work with, there are lots of tools for this – the mocking framework in rspec (https://relishapp.com/rspec/rspec-mocks/docs) makes this pretty easy, and there are more sophisticated tools like https://github.com/bblimke/webmock and https://github.com/vcr/vcr .

However, these all work by reaching into the current Ruby process and redefining how the actual HTTP requests are being done. This works beautifully in Ruby (all you readers what are screaming about dependency injection out there, please calm down. It’s fine, actually), but will super duper not work when we try out another language.

So, another approach would be to go back to our trusty friend the env variable and start defining the HTTP response codes and bodies (probably from saved “fixture” files) there, i.e.:

./test ruby/scrape_crowdfunder FIRST_HTTP_RESPONSE_CODE=200 FIRST_HTTP_RESPONSE_BODY=./fixtures/first_response.html

But well just look how fugly this is. Nope nope nope. It also means that our actual application code needs understand all these env vars and contain a lot of test-specific code. Nope nope nope.

An adventurous thought occurs: We do have the CROWDFUNDER_PROJECTS_URL variable (or, actually, the command line parameter)…what if we ran an actual HTTP server, controlled from within our test code, and point the script at this? It’s quite easy to run simple http servers from Ruby code, and we could neatly define each response from within each test case…

Another thought occurs: Somebody probably already did this. Let’s google…

Ok, here are some options. As expected, mostly people are using mock frameworks that work only within Ruby, but here are some ideas (found most of these following: https://stackoverflow.com/questions/10166611/launching-a-web-server-inside-ruby-tests):

So let’s try picking the server out of the Capybara gem. Off to the code-mobile!

(Some time in the code-mobile 🚗 later…)

After that roadblock (har har) is out of the way, here is the first test code, another simple Ruby shell script. It has an accompanying Gemfile, and we’re using minitest as a simple testing framework. I’ve always used rspec in my projects before, and wanted to try this out. It works pretty nicely for a minimal setup like this (you just have to declare the class and require 'autorun' and it, well, autoruns after all the code has been read by the Ruby process. And every method that starts with test_automagically becomes an assertion):

#!/usr/bin/env ruby
# frozen_string_literal: true

require 'rubygems'
require 'bundler/setup'
require 'pp'
require 'pry'

SCRAPER_PATH = ARGV.first
unless SCRAPER_PATH and File.exist?(SCRAPER_PATH)
raise ArgumentError, "Please provide the scraper you want to test, i.e. './test ruby/scrape_crowdfunder'"
end

require "minitest/autorun"

class TestScraper < Minitest::Test
def setup
@testserver_url = "http://example.com"
end

def run_scraper
path, script = SCRAPER_PATH.split("/")
command = ""
# We need to cd into any sub-folder(s) so the scripts there can do setup like rvm, bundler, nvm, etc.
command += "cd #{path} && " if path
command += "./#{script}"
  command += " #{@testserver_url}"
# We need to have a "clean" Bundler env (i.e., forget any currently loaded gems),
# as the script called might be another Ruby script, with its own Gemfile, and by default
# shelling out "keeps" the gems from this test runner, making the script fail
Bundler.with_original_env do
`bash -c '#{command}'`
end
end

def test_that_shit_works
assert_equal "OHAI!", run_scraper
end
end

Here is the first “successful” run of the test script:

crowdfunder_scraper $ ./test ruby/scrape_crowdfunder 
Run options: --seed 48205

# Running:

F

Finished in 0.708704s, 1.4110 runs/s, 1.4110 assertions/s.

1) Failure:
TestScraper#test_that_shit_works [./test:37]:
--- expected
+++ actual
@@ -1 +1,6 @@
-"OHAI!"
+"GET \"http://example.com\"
+[]
+[]
+[]
+0 campaigns, 0€ total, 0€ remaining, 0€ earned
+"

1 runs, 1 assertions, 1 failures, 0 errors, 0 skips


It is actually running the script, and failing because there’s too much output! Also, of course, our temporary example.com domain has no links, and so fails even more. We should have checks that the wrong domain, or a changed page, outputs something more explicit. We’ll add that to the tested code once the test actually sets something sensible.

Now to the interesting bit here – our internal test HTTP server. So far, we’ve kept requesting example.com over and over, which is hardly fair to it. And we could never change its response(s) to the one we want to simulate.

So let’s try requiring capybara/server and using its host and port in each assertion:

require "capybara/server"

def setup
app = proc { |_env| [200, {}, ['Hello, Sailor!']] }
server = Capybara::Server.new(app).boot
@testserver_url = "http://#{server.host}:#{server.port}"
end

But then, we get this error:

NoMethodError: undefined method `server_port' for Capybara:Module
/home/mt/.rvm/gems/ruby-2.6.0/gems/capybara-2.14.0/lib/capybara/server.rb:63:in `initialize'
./test:25:in `new'
./test:25:in `setup'

Looking a bit at the code of the Server class, it seems to call Capybara.server_port … these methods are probably simply not defined because we are only requiring a single file out of the whole gem. Fiddling around with requiring some of the “config” files from the Capybara gem, but it doesn’t seem to work. Another idea would have been to just (re-)define the Capybara module ourselves and add the methods we need by trial and error, but that seems like a long road to go down, and hard to maintain.

So we’re just loading up Capybara completely:

require 'capybara'

and the error goes away.

So close now…let’s make an actually useful assertion now. We’re setting up the server to return just a string, no HTML:

def setup
app = proc { |_env| [200, {}, ['Hello, Sailor!']] }
server = Capybara::Server.new(app).boot
@testserver_url = "http://#{server.host}:#{server.port}"
end

Then we have a test case that checks that in this case, we should see a warning for the user that the website seems to be wonky:

def test_index_page_has_no_projects
assert_equal "No projects found, has the site changed? Check the URL given?", run_scraper
end

And when we run the tests now, they rightfully complain that the scraper doesn’t give us a useful error message, but only some junk output and a result of “0 projects”.

So now we’re doing actual TDD 🙂 We’ve tested something that is not actually a feature yet in our code. Let’s add that now:

if detail_urls.size == 0
puts "No projects found, has the site changed? Check the URL given?"
else

Aaaaand:

3) Failure:
TestScraper#test_index_page_has_no_projects [./test:45]:
--- expected
+++ actual
@@ -1 +1,5 @@
-"No projects found, has the site changed? Check the URL given?"
+"GET \"http://127.0.0.1:36409\"
+[]
+[]
+No projects found, has the site changed? Check the URL given?
+"

Of course, this still fails because of the extra debugging output, but yaaaay!
This Capybara mini-server is great, hooray to open source!

The code is at this commit now: https://github.com/MGPalmer/crowdfunder_scraper/commit/aff8e7abddb4981fc73ea700821dd2e736213337

At this point, we have a good setup, it seems. We can now completely mock out the “real” webserver and replace it in our tests with one that we can control from within the tests (even though we currently only return ‘Hello Sailor’). And since our little command-line tool only produces output via stdout (i.e. it has no side-effects like database entries added or file written), we can completely black-box-test it.

Before we wrap this post up, let’s get this one assertion green. The code actually works, but the script produces debugging output which is 1) pretty ugly 2) not useful normally, only for the developer and only if something goes wrong. Classic case of making this optional (but still keeping it as part of the code – I don’t want to go back and add and remove debugging output every time I encounter an issue). In a web framework, we’d use a “debug” log level and the usually provided logging facilities, but here the usual thing is to add a flag to the script. AFAIK, the convention is to call it -v (and in long-form --verbose). And again, this seems like something that should have been solved a lot of times before, so some googling for “option parser command line” later, we find this:

https://www.ruby-toolbox.com/categories/CLI_Option_Parsers

It seems like this is a popular field in which to create libraries, there are a lot! We’ll just go with the built-in ‘OptionParser’ from the standard library, especially since the first example is our verbose flag.

Oh, and while we’re at it, we should also make it print out a good usage example:

options = {}
OptionParser.new do |opts|
opts.banner = "Usage: ./scrape_crowdfunder [options] http://example.com/projects"

opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
options[:verbose] = v
end
end.order!

…and this is how it looks like:

crowdfunder_scraper/ruby $ ./scrape_crowdfunder --help
Usage: ./scrape_crowdfunder http://example.com/projects
-v, --[no-]verbose Run verbosely

Within the actual script, we now just use something like this:

verbose = options[:verbose]
pp(page_urls) if verbose

and now one of the tests is passing \o/

Whew, this turned out to be another long post!

Join us next time for even more tests…

Solving “Bundler::GemfileNotFound” or mysteriously missing gem

Here’s a short interlude: I was having the worst time with this error, partly because the error message is pretty misleading, and partly because I’m an idiot. One of these problems could be solved…the other has to be worked around 😉

So, the issue was that we were trying to run one Ruby script from another via the “shell-out” mechanism. There are a couple of ways to do this (Here is a good overview), but we’re using the good old `Backticks` as we are not concerned about security for now (there’s no user input, everything is hardcoded). But when running the “inner” script, we get this error:

/home/mt/.rvm/gems/ruby-2.4.2/gems/bundler-1.17.1/lib/bundler/definition.rb:32:in `build': /home/mt/Development/crowdfunder_scraper/Gemfile not found (Bundler::GemfileNotFound)

when using the “inline” Gemfile syntax, or when using a normal Gemfile:

`require': cannot load such file -- httpx (LoadError)

one of our defined gems would be mysteriously missing. When cd-ing into the “inner” directory and running the script, it works, of course.

This is of course the abridged version. I was spending a whole lot of time fiddling around, trying to pinpoint the problem. A misleading thought was that the issue might stem from rvm and bundler not properly loading in the sub-shell environment, so a lot of cd-ing around within the backticks was tried: cd ruby && ./scrape_crowdfunder

To make a tedious story short, the hint that was finally pointing to the solution was that discovering (after converting both “outer” and “inner” scripts to a normal Gemfile again) that the same issue also occurs when running the inner script from the outer directory (both containing Gemfiles):

crowdfunder_scraper $ ./ruby/scrape_crowdfunder 
./ruby/scrape_crowdfunder:6:in `require': cannot load such file -- httpx (LoadError)

So what seems to happen is that the “inner” script keeps the gems loaded from the “outer” environment. The same thing happens when running via Backticks. Googling “ruby subshell inherits gems?” finally has the solution: Use Bundler.with_clean_env ! Go ahead and read that article, it explains the issue quite well. Basically, bundler sets up a couple of ENV variables when a Gemfile is encountered, and within the same shell and directory, doesn’t change it again. When putting the shell-out backticks within that method’s block, all is well.

So just some additional notes here: Since that article was written, the Bundler method was renamed to Bundler.with_original_env .

And also, I made a small git repository to demonstrate and test the issue for me and anyone else: https://github.com/MGPalmer/bundler_env_error_test

In it, we have the same setup as my problem: An “outer” and an “inner” directory, both having a Gemfile in it. The outer Gemfile actually requires no gems at all, the inner one wants cowsay.

In both dirs we have a script, the outer one simply prints a message and then shells out to the inner script, once within Bundler.with_original_env‘s block, and once without it. The former call works, the latter one reproduces the problem that the inner script can’t find the gems it wants:

bundler_env_error_test $ ./wrangler 

Let's get wranglin'.
 _______________________ 
| Moo moo I'm a cow yo. |
 ----------------------- 
      \   ^__^
       \  (oo)\_______
          (__)\       )\/\
              ||----w |
              ||     ||
Traceback (most recent call last):
	1: from ./cow:5:in `<main>'
./cow:5:in `require': cannot load such file -- cowsay (LoadError)

So there we go. Another one of these stumbling blocks when developing. I learned a little more about how Bundler works, but there was so much wasted time, a pity.

A somewhat amusing coda to this – after changing the test code from the last article to use with_original_env , the issue was solved as described above. But then suddenly, the “inner” Ruby script didn’t pick up the provided CROWDFUNDER_PROJECTS_URL env variable value anymore. For a short time I wasn’t sure I was using Ruby’s accessor to the ENV correctly, but I then realized that with_original_env was doing exactly what it’s saying on the tin – it resets the ENV, wiping out what we added to it within the script 😀

I realized that it’s a much cleaner interface anyway to simply add the url as a command-line parameter, and switched the code to that. So, next time: Finally finishing the tests.

Scraping a crowd-funding platform for fun and (non-)profit, part 1

Hello again dear readers (I now actually have a couple, because I’m sending the articles to friends and forcing them to read them. Hi Oana!).
Today we are going to do another somewhat pointless project. Slightly less useless than last time, I hope!

I’ve been looking at a crowdfunding website, which collects donations to user-created good causes (“projects” or “campaigns”). I’ll omit the name and URL here – it’s not that the info is private, but it seems uncouth to point at someone specifically (it’s a smallish company). They don’t talk about it much, but according to the Terms and Conditions, they are financing themselves by taking a percentage (3-8%) of donations to projects.

So I’m curious how much money they are actually moving – knowing nothing else about their finances or business model, does it seem feasible that they are cashflow-positive?

And looking at the website, they have a paginated list of projects (of reasonable size, 8 pages of 8 projects each, i.e. up to 64 projects currently), which I will assume are all that are currently active.

On the list page, they show a countdown à la “1234€ to go”. The total budget of the project is only shown on the detail page, let’s say in our example, that would be 1500€. So that project, currently, would have 1500 – 1234 = 266€ pledged to it, earning the platform about 266 * 0.03 ~ 8€, enough for some tasty Falafel for two!

At first, we could scrape all projects and see how much money in total has been pledged so far via the platform. This of course will miss any previous projects that have been completed (if there are any), but there is no way of knowing, so everything found out here is a lower bound. It’s mostly a finger exercise, anyway…

We could do an additional step and record a scraping run, and come back every couple of days, and then compare to see how the “velocity” of donations is, i.e. do they handle a lot of donations per day? This is of course more involved, as we’d need to record each scraping run and/or the results somewhere (a database etc.). Let’s shelve that for now. Baby steps.

But I also want to take an opportunity here to do something I think will be quite easy for me to do in Ruby (I have been scraping websites and consuming APIs with Ruby extensively in a past life), and see how it’ll work in other languages.

So let’s dump some ideas on how this ought to look like:

  • It should be a command-line tool, run via a single (bash) script, i.e. ./scrape_crowdfunder. It’ll write detailed debugging info to stdout when given -v as an option (note to self: Look at libraries for command-line option parsing), but otherwise will just output errors or in the best case "X campaigns, Y€ total, Z€ remaining, A€ earned".
  • The URL of where to start parsing will be given via env variable CROWDFUNDER_PROJECTS_URL so I can keep this out of the repository and protect the innocent.
  • There should be a sort of test harness which takes as input saved HTML pages for the projects index page, and one or two detail pages. A test script runs the tool, and checks the output on stdout for the expected result. This is an extreme example of “integration” testing (Is there a better name for this?), which allows us to swap different language implementations. We’ll probably have to add another env variable or something to make a switch somewhere to use the canned pages instead of the real ones.
  • I’ll write the tests in Ruby, because I’m most comfortable with it, and it’s well at home on the command line, and its dynamic nature works well with testing. Performance is not really a concern here.
  • Since we want the same test to be used on different implementations, let’s make it all one big repository. In a bigger project, we’d probably want a “main” project that pulls in the different implementations as libraries, but let’s stay lowtech for now.
  • Which means we can set up the tests like this: Have, under the main folder, one for the tests, and one for each language implementation:
    project
    /testcode1.rb
    /testcode2.rb
    /test-data
    /ruby
    ...../scrape_crowdfunder
    /javascript
    ...../scrape_crowdfunder

    etc.
    then we can run the tests also via a bash script, and just give the scriptname of the runner as a parameter:
    ./test ./ruby/scrape_crowdfunder
  • We might, eventually, also look into using different HTTP clients and parallelism there. I just stumbled upon httpx – another one in a long line of Ruby HTTP clients). We’ll need to tread lightly here though, as we don’t want to hammer the site repeatedly. My impression is that there is not so much traffic going on there so we’d probably be able to visibly cause traffic spikes if we go all-out, and that would be just impolite.

So now we have an idea where we’re going – let’s first set up the Ruby stuff and just hack something out, and then clean it up while writing the tests. I’m usually doing tasks in this order:

  1. Just think about the whole project and write down lotsa notes
  2. Hack around to try out the rough edges
  3. Start writing tests once I know what structure the code is going to take, and then go all TDD and write code and tests in lockstep

I’m suspicious of people that claim to be really doing TDD by writing the test first before anything else. Maybe this works if you are adding a routine extension to an existing project, but if you are going green-field or doing some complex new feature? It seems like an invitation to a lot of rewriting :/

Of course, just now I’m also writing down these words here for the blog post, which I don’t usually do when working for a job. I wonder if I should? It would make everything a lot slower, but would provide seamless documentation over the long term. Also it seems that writing down clears up your thoughts quite a bit…Something to ponder in another post?

Let’s just get going:

$ mkdir crowdfunder_scraper
$ cd crowdfunder_scraper/
$ git init

I’m copying a .gitignore from another project, and adding a Gemfile – we’ll at least need a http library, and I hate the built-in Net::HTTP library from Ruby with a passion:

$ bundle init

At this time I remember I want to make this code public, and make a github repo and retroactively link my local one to it:

$ git remote add origin git@github.com:MGPalmer/crowdfunder_scraper.git

Now you can follow along or look at the last state here: https://github.com/MGPalmer/crowdfunder_scraper

$ mkdir ruby
$ cd ruby
$ touch scrape_crowdfunder
$ chmod +x scrape_crowdfunder

Let’s skip some more boring details. Just some notes:

After some back and forth, here we are:

Gemfile:

# frozen_string_literal: true

source "https://rubygems.org"

git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }

gem "httpx"
gem "nokogiri"
scrape_crowdfunder:

#!/usr/bin/env ruby

require 'rubygems'
require 'bundler/setup'
require 'httpx'
require 'nokogiri'

puts "Hello World"

Aannnnnnd:

$ ./scrape_crowdfunder 
 Hello World

Yaaaay we got a working Ruby script with Bundler.

At this point I go in irb, load up the Gemfile, and fiddle around.

$ irb
$ require 'rubygems'; require 'bundler/setup'; require 'httpx'; require 'nokogiri'
$ 2.4.2 :003 > page = Nokogiri.parse(HTTPX.get("https://example.org").to_s).css("a.clicky")
etc.

(fiddle fiddle fiddle)

Yay, got it working!

Here’s the current code, left intentionally ugly:

Check it out on Github: https://github.com/MGPalmer/crowdfunder_scraper/commit/1784bd1b18db4230b1b099f1c943ea5d7b883413


#!/usr/bin/env ruby
# frozen_string_literal: true

require 'rubygems'
require 'bundler/setup'
require 'httpx'
require 'nokogiri'
require 'pp'

index_url  = ENV['CROWDFUNDER_PROJECTS_URL']
index_html = HTTPX.get(index_url).to_s
index      = Nokogiri.parse(index_html)

def get_n_parse(url)
  res = HTTPX.get(url)
  unless res.status == 200
    puts "AAAAAAAAAA HTTP error for #{url} - #{res.status}"
    return nil
  end
  Nokogiri.parse(res.body.to_s)
end

def parse_detail_page_urls(page)
  page.css('.campaign-details a.campaign-link').map { |a| a[:href] }
end

pages = [index]
page_urls = index.css('ul.pagination a.page-link').map { |a| a[:href] }
pp(page_urls)
page_urls.each do |page_url|
  pages << get_n_parse(page_url)
end

pages.compact!

detail_urls = []
pages.each do |page|
  detail_urls += parse_detail_page_urls(page)
end

pp(detail_urls)

campaigns = detail_urls.map do |detail_url|
  puts detail_url
  page = get_n_parse(detail_url)
  next unless page

  campaign_goal    = Integer(page.css('h5.campaign-goal').text.gsub(/€|,/, ''))
  remaining_amount = Integer(page.css('p.remaining-amount').inner_html.gsub(',', '').scan(/€(\d+)?\s/m).flatten.first)
  {
    url: detail_url,
    campaign_goal: campaign_goal,
    remaining_amount: remaining_amount
  }
end.compact

pp(campaigns)

count     = campaigns.size
total     = campaigns.inject(0) { |t, n| t + n[:campaign_goal] }
remaining = campaigns.inject(0) { |t, n| t + n[:remaining_amount] }

puts "#{count} campaigns, #{total}€ total, #{remaining}€ remaining, #{total - remaining}€ earned"

Some notes:

  • It was a little tricky getting the first-page, then each-pagination, then each-detail links right
  • Stumbled hard over one 404 page, httpx will happily give you a “” body for that :/ . This needs to be tested, i.e. the tests should include cases for all of the HTTP calls to return errors, and check that the script doesn’t choke on them.
  • The markup is a bit of a bitch for the amounts (total and pledged) – had to use some regexps which are a little more complex than I’m really comfortable with. This needs to be tested thoroughly so we can refactor it later.
  • Should’ve first added verbose mode and a trigger for it, I ended up throwing puts and pp around a lot.
  • Also adding and using a debugger would have helped a lot, I didn’t want to slow down for that…
  • The script runs for quite a while – it has to do a couple of dozen HTTP calls, and when the code fails in one of the later ones, it’s a real PITA to have to re-run everything.
  • I’ve moved repeated code into methods, but of course nothing is properly organized.
  • But note how everything happens in discrete steps, and collects data from the previous step, making it easy to inspect the data at each point, and only in the end summing up the derived information we actually want.

But we want the numbers now! After running, and omitting all the debugging output:

57 campaigns, 2829964€ total, 2818168€ remaining, 11796€ earned

At, let’s say, 5% commission, this means the current projects are earning 11796 * 0.03 ~ 353€

This buys you a lot of Falafel, but it’s not much to run a company on :/ But of course everybody has to start small, and again, we really don’t have all the facts here. But hey, our code works, even though the numbers it produces might be meaningless. Ready for a career in business intelligence 😉

Tune in next time when we clean up this mess, and add tests!

Keeping Ruby weird with emojis

Have a look at this: https://medium.com/carwow-product-engineering/emoji-driven-development-in-ruby-2d54264f7b08

Isn’t that just grand? 😀 It’s been quite a while (it might be 15 years or so) since I stumbled upon this little unusual programming language from Japan, which I soon fell in love with. Matchmaker was the weird, quirky, clever code-art that _why the lucky stiff made. Sadly he just up and info-suicided all his online stuff and was gone. This reminds me of his stuff 🙂