Solving “Bundler::GemfileNotFound” or mysteriously missing gem

Here’s a short interlude: I was having the worst time with this error, partly because the error message is pretty misleading, and partly because I’m an idiot. One of these problems could be solved…the other has to be worked around 😉

So, the issue was that we were trying to run one Ruby script from another via the “shell-out” mechanism. There are a couple of ways to do this (Here is a good overview), but we’re using the good old `Backticks` as we are not concerned about security for now (there’s no user input, everything is hardcoded). But when running the “inner” script, we get this error:

/home/mt/.rvm/gems/ruby-2.4.2/gems/bundler-1.17.1/lib/bundler/definition.rb:32:in `build': /home/mt/Development/crowdfunder_scraper/Gemfile not found (Bundler::GemfileNotFound)

when using the “inline” Gemfile syntax, or when using a normal Gemfile:

`require': cannot load such file -- httpx (LoadError)

one of our defined gems would be mysteriously missing. When cd-ing into the “inner” directory and running the script, it works, of course.

This is of course the abridged version. I was spending a whole lot of time fiddling around, trying to pinpoint the problem. A misleading thought was that the issue might stem from rvm and bundler not properly loading in the sub-shell environment, so a lot of cd-ing around within the backticks was tried: cd ruby && ./scrape_crowdfunder

To make a tedious story short, the hint that finally pointed to the solution was discovering (after converting both “outer” and “inner” scripts back to a normal Gemfile) that the same issue also occurs when running the inner script from the outer directory (both containing Gemfiles):

crowdfunder_scraper $ ./ruby/scrape_crowdfunder 
./ruby/scrape_crowdfunder:6:in `require': cannot load such file -- httpx (LoadError)

So what seems to happen is that the “inner” script keeps the gems loaded from the “outer” environment. The same thing happens when running via backticks. Googling “ruby subshell inherits gems?” finally turned up the solution: Use Bundler.with_clean_env ! Go ahead and read that article, it explains the issue quite well. Basically, Bundler sets up a couple of ENV variables (BUNDLE_GEMFILE among them) when a Gemfile is encountered, and within the same shell and directory, doesn’t change them again. A script started via shell-out inherits those variables, too. When putting the shell-out backticks within that method’s block, all is well.

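So just some additional notes here: Since that article was written, Bundler has moved on. I ended up using Bundler.with_original_env , and newer versions (Bundler 2.1 and later) deprecate with_clean_env in favor of Bundler.with_unbundled_env .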
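
To make that concrete, here is a minimal sketch of the pattern, not the actual scraper code. The helper name run_outside_bundle is made up for illustration, and a plain echo stands in for the inner script:

```ruby
require 'bundler'

# Hypothetical helper (not part of Bundler): run a shell command with the
# environment as it was before Bundler changed it, so that a child Ruby
# script resolves its own Gemfile instead of inheriting ours.
def run_outside_bundle(command)
  Bundler.with_original_env { `#{command}` }
end

# `echo` stands in for the inner script, e.g. ./ruby/scrape_crowdfunder
output = run_outside_bundle('echo "hello from the inner script"')
puts output
```

The backticks still return the command's stdout as a string, the block only changes which ENV the child process gets to see.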

And also, I made a small git repository to demonstrate and test the issue for me and anyone else: https://github.com/MGPalmer/bundler_env_error_test

In it, we have the same setup as my problem: An “outer” and an “inner” directory, both having a Gemfile in it. The outer Gemfile actually requires no gems at all, the inner one wants cowsay.

In both dirs we have a script, the outer one simply prints a message and then shells out to the inner script, once within Bundler.with_original_env‘s block, and once without it. The former call works, the latter one reproduces the problem that the inner script can’t find the gems it wants:

bundler_env_error_test $ ./wrangler 

Let's get wranglin'.
 _______________________ 
| Moo moo I'm a cow yo. |
 ----------------------- 
      \   ^__^
       \  (oo)\_______
          (__)\       )\/\
              ||----w |
              ||     ||
Traceback (most recent call last):
	1: from ./cow:5:in `<main>'
./cow:5:in `require': cannot load such file -- cowsay (LoadError)

So there we go. Another one of these stumbling blocks when developing. I learned a little more about how Bundler works, but there was so much wasted time, a pity.

A somewhat amusing coda to this – after changing the test code from the last article to use with_original_env , the issue was solved as described above. But then suddenly, the “inner” Ruby script didn’t pick up the provided CROWDFUNDER_PROJECTS_URL env variable value anymore. For a short time I wasn’t sure I was using Ruby’s accessor to the ENV correctly, but I then realized that with_original_env was doing exactly what it’s saying on the tin – it resets the ENV, wiping out what we added to it within the script 😀

I realized that it’s a much cleaner interface anyway to simply add the url as a command-line parameter, and switched the code to that. So, next time: Finally finishing the tests.
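
Here is a small sketch of that gotcha, assuming CROWDFUNDER_PROJECTS_URL is not already set in your shell. The example.org URL is a placeholder, and printf stands in for the inner script:

```ruby
require 'bundler'
require 'shellwords'

# Set *after* Bundler was loaded, so it is not part of the "original" env.
ENV['CROWDFUNDER_PROJECTS_URL'] = 'https://example.org/projects'

# Inside the block, ENV is swapped for the snapshot Bundler took at load time.
inside_block = Bundler.with_original_env { ENV['CROWDFUNDER_PROJECTS_URL'] }
after_block  = ENV['CROWDFUNDER_PROJECTS_URL']

# Passing the value as a command-line argument sidesteps the problem.
url    = 'https://example.org/projects'
echoed = Bundler.with_original_env { `printf %s #{Shellwords.escape(url)}` }

puts "inside block: #{inside_block.inspect}"
puts "after block:  #{after_block.inspect}"
puts "via argument: #{echoed.inspect}"
```

The variable should be missing inside the block and back again afterwards, while the argument makes it through either way.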

Scraping a crowd-funding platform for fun and (non-)profit, part 1

Hello again dear readers (I now actually have a couple, because I’m sending the articles to friends and forcing them to read them. Hi Oana!).
Today we are going to do another somewhat pointless project. Slightly less useless than last time, I hope!

I’ve been looking at a crowdfunding website, which collects donations to user-created good causes (“projects” or “campaigns”). I’ll omit the name and URL here – it’s not that the info is private, but it seems uncouth to point at someone specifically (it’s a smallish company). They don’t talk about it much, but according to the Terms and Conditions, they are financing themselves by taking a percentage (3-8%) of donations to projects.

So I’m curious how much money they are actually moving – knowing nothing else about their finances or business model, does it seem feasible that they are cashflow-positive?

And looking at the website, they have a paginated list of projects (of reasonable size, 8 pages of 8 projects each, i.e. up to 64 projects currently), which I will assume are all that are currently active.

On the list page, they show a countdown à la “1234€ to go”. The total budget of the project is only shown on the detail page, let’s say in our example, that would be 1500€. So that project, currently, would have 1500 – 1234 = 266€ pledged to it, earning the platform about 266 * 0.03 ~ 8€, enough for some tasty Falafel for two!
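
Spelled out as a tiny Ruby sketch (the method names are made up for this example, and 3% is just the low end of the commission range mentioned above):

```ruby
# Amount pledged so far: the total budget minus the "to go" countdown.
def pledged(goal, remaining)
  goal - remaining
end

# The platform's cut; the rate is given as a whole-number percentage.
def commission(amount, percent)
  amount * percent / 100.0
end

so_far = pledged(1500, 1234)
cut    = commission(so_far, 3)
puts "#{so_far}€ pledged, about #{cut.round}€ for the platform"
```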

At first, we could scrape all projects and see how much money in total has been pledged so far via the platform. This of course will miss any previous projects that have been completed (if there are any), but there is no way of knowing, so everything found out here is a lower bound. It’s mostly a finger exercise, anyway…

We could do an additional step and record a scraping run, come back every couple of days, and then compare the runs to see what the “velocity” of donations is, i.e. do they handle a lot of donations per day? This is of course more involved, as we’d need to record each scraping run and/or the results somewhere (a database etc.). Let’s shelve that for now. Baby steps.

But I also want to take an opportunity here to do something I think will be quite easy for me to do in Ruby (I have been scraping websites and consuming APIs with Ruby extensively in a past life), and see how it’ll work in other languages.

So let’s dump some ideas on what this ought to look like:

  • It should be a command-line tool, run via a single (bash) script, i.e. ./scrape_crowdfunder. It’ll write detailed debugging info to stdout when given -v as an option (note to self: Look at libraries for command-line option parsing), but otherwise will just output errors or in the best case "X campaigns, Y€ total, Z€ remaining, A€ earned".
  • The URL of where to start parsing will be given via env variable CROWDFUNDER_PROJECTS_URL so I can keep this out of the repository and protect the innocent.
  • There should be a sort of test harness which takes as input saved HTML pages for the projects index page, and one or two detail pages. A test script runs the tool, and checks the output on stdout for the expected result. This is an extreme example of “integration” testing (Is there a better name for this?), which allows us to swap different language implementations. We’ll probably have to add another env variable or something to make a switch somewhere to use the canned pages instead of the real ones.
  • I’ll write the tests in Ruby, because I’m most comfortable with it, and it’s well at home on the command line, and its dynamic nature works well with testing. Performance is not really a concern here.
  • Since we want the same test to be used on different implementations, let’s make it all one big repository. In a bigger project, we’d probably want a “main” project that pulls in the different implementations as libraries, but let’s stay lowtech for now.
  • Which means we can set up the tests like this: Have, under the main folder, one for the tests, and one for each language implementation:
    project
    /testcode1.rb
    /testcode2.rb
    /test-data
    /ruby
    ...../scrape_crowdfunder
    /javascript
    ...../scrape_crowdfunder

    etc.
    then we can run the tests also via a bash script, and just give the scriptname of the runner as a parameter:
    ./test ./ruby/scrape_crowdfunder
  • We might, eventually, also look into using different HTTP clients and parallelism there (I just stumbled upon httpx, another one in a long line of Ruby HTTP clients). We’ll need to tread lightly here though, as we don’t want to hammer the site repeatedly. My impression is that there is not so much traffic going on there so we’d probably be able to visibly cause traffic spikes if we go all-out, and that would be just impolite.
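
As for the note to self about option parsing: Ruby ships with OptionParser in the standard library, so the -v switch and the summary line from the first bullet could start out roughly like this. It's only a sketch, and parse_options and summary_line are names invented here:

```ruby
require 'optparse'

# Parse the command line; only -v/--verbose is understood in this sketch.
def parse_options(argv)
  options = { verbose: false }
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: scrape_crowdfunder [-v]'
    opts.on('-v', '--verbose', 'Write detailed debugging info to stdout') do
      options[:verbose] = true
    end
  end
  parser.parse(argv) # non-destructive: leaves argv itself untouched
  options
end

# Build the one-line summary in the format from the list above.
def summary_line(count, total, remaining)
  "#{count} campaigns, #{total}€ total, #{remaining}€ remaining, " \
    "#{total - remaining}€ earned"
end

options = parse_options(['-v'])
puts 'verbose mode on' if options[:verbose]
puts summary_line(2, 3000, 2500)
```

In the real script, parse_options would be called with ARGV instead of a hardcoded array.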

So now we have an idea where we’re going – let’s first set up the Ruby stuff and just hack something out, and then clean it up while writing the tests. I’m usually doing tasks in this order:

  1. Just think about the whole project and write down lotsa notes
  2. Hack around to try out the rough edges
  3. Start writing tests once I know what structure the code is going to take, and then go all TDD and write code and tests in lockstep
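
For step 3, the “run the tool and check stdout” harness from the ideas list could begin as something like the sketch below. Everything in it is an assumption for illustration: output_matches? is an invented name, FAKE_COUNT is an invented env variable, and a one-line Ruby script stands in for ./ruby/scrape_crowdfunder.

```ruby
require 'open3'
require 'rbconfig'

# Run a command, optionally with extra env variables, and report whether it
# exited cleanly and printed exactly the expected text.
def output_matches?(command, expected, env = {})
  stdout, status = Open3.capture2(env, *command)
  status.success? && stdout.strip == expected
end

# Stand-in for the real tool: prints a line that depends on an env variable.
fake_tool = [RbConfig.ruby, '-e', 'puts "#{ENV["FAKE_COUNT"]} campaigns"']

ok = output_matches?(fake_tool, '2 campaigns', 'FAKE_COUNT' => '2')
puts ok ? 'PASS' : 'FAIL'
```

Because the command is just an array, swapping in a different language implementation only means passing a different path.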

I’m suspicious of people that claim to be really doing TDD by writing the test first before anything else. Maybe this works if you are adding a routine extension to an existing project, but if you are going green-field or doing some complex new feature? It seems like an invitation to a lot of rewriting :/

Of course, just now I’m also writing down these words here for the blog post, which I don’t usually do when working for a job. I wonder if I should? It would make everything a lot slower, but would provide seamless documentation over the long term. Also it seems that writing down clears up your thoughts quite a bit…Something to ponder in another post?

Let’s just get going:

$ mkdir crowdfunder_scraper
$ cd crowdfunder_scraper/
$ git init

I’m copying a .gitignore from another project, and adding a Gemfile – we’ll at least need an HTTP library, and I hate the built-in Net::HTTP library from Ruby with a passion:

$ bundle init

At this time I remember I want to make this code public, and make a github repo and retroactively link my local one to it:

$ git remote add origin git@github.com:MGPalmer/crowdfunder_scraper.git

Now you can follow along or look at the last state here: https://github.com/MGPalmer/crowdfunder_scraper

$ mkdir ruby
$ cd ruby
$ touch scrape_crowdfunder
$ chmod +x scrape_crowdfunder

Let’s skip some more boring details. Just some notes:

After some back and forth, here we are:

Gemfile:

# frozen_string_literal: true

source "https://rubygems.org"

git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }

gem "httpx"
gem "nokogiri"

scrape_crowdfunder:

#!/usr/bin/env ruby

require 'rubygems'
require 'bundler/setup'
require 'httpx'
require 'nokogiri'

puts "Hello World"

Aannnnnnd:

$ ./scrape_crowdfunder 
Hello World

Yaaaay we got a working Ruby script with Bundler.

At this point I go in irb, load up the Gemfile, and fiddle around.

$ irb
2.4.2 :001 > require 'rubygems'; require 'bundler/setup'; require 'httpx'; require 'nokogiri'
2.4.2 :002 > page = Nokogiri.parse(HTTPX.get("https://example.org").to_s).css("a.clicky")
etc.

(fiddle fiddle fiddle)

Yay, got it working!

Here’s the current code, left intentionally ugly:

Check it out on Github: https://github.com/MGPalmer/crowdfunder_scraper/commit/1784bd1b18db4230b1b099f1c943ea5d7b883413


#!/usr/bin/env ruby
# frozen_string_literal: true

require 'rubygems'
require 'bundler/setup'
require 'httpx'
require 'nokogiri'
require 'pp'

index_url  = ENV['CROWDFUNDER_PROJECTS_URL']
index_html = HTTPX.get(index_url).to_s
index      = Nokogiri.parse(index_html)

def get_n_parse(url)
  res = HTTPX.get(url)
  unless res.status == 200
    puts "AAAAAAAAAA HTTP error for #{url} - #{res.status}"
    return nil
  end
  Nokogiri.parse(res.body.to_s)
end

def parse_detail_page_urls(page)
  page.css('.campaign-details a.campaign-link').map { |a| a[:href] }
end

pages = [index]
page_urls = index.css('ul.pagination a.page-link').map { |a| a[:href] }
pp(page_urls)
page_urls.each do |page_url|
  pages << get_n_parse(page_url)
end

pages.compact!

detail_urls = []
pages.each do |page|
  detail_urls += parse_detail_page_urls(page)
end

pp(detail_urls)

campaigns = detail_urls.map do |detail_url|
  puts detail_url
  page = get_n_parse(detail_url)
  next unless page

  campaign_goal    = Integer(page.css('h5.campaign-goal').text.gsub(/€|,/, ''))
  remaining_amount = Integer(page.css('p.remaining-amount').inner_html.gsub(',', '').scan(/€(\d+)?\s/m).flatten.first)
  {
    url: detail_url,
    campaign_goal: campaign_goal,
    remaining_amount: remaining_amount
  }
end.compact

pp(campaigns)

count     = campaigns.size
total     = campaigns.inject(0) { |t, n| t + n[:campaign_goal] }
remaining = campaigns.inject(0) { |t, n| t + n[:remaining_amount] }

puts "#{count} campaigns, #{total}€ total, #{remaining}€ remaining, #{total - remaining}€ earned"

Some notes:

  • It was a little tricky getting the first-page, then each-pagination, then each-detail links right
  • Stumbled hard over one 404 page, httpx will happily give you a “” body for that :/ . This needs to be tested, i.e. the tests should include cases for all of the HTTP calls to return errors, and check that the script doesn’t choke on them.
  • The markup is a bit of a bitch for the amounts (total and pledged) – had to use some regexps which are a little more complex than I’m really comfortable with. This needs to be tested thoroughly so we can refactor it later.
  • Should’ve first added verbose mode and a trigger for it, I ended up throwing puts and pp around a lot.
  • Also adding and using a debugger would have helped a lot, I didn’t want to slow down for that…
  • The script runs for quite a while – it has to do a couple of dozen HTTP calls, and when the code fails in one of the later ones, it’s a real PITA to have to re-run everything.
  • I’ve moved repeated code into methods, but of course nothing is properly organized.
  • But note how everything happens in discrete steps, and collects data from the previous step, making it easy to inspect the data at each point, and only in the end summing up the derived information we actually want.
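
Regarding the amount-parsing regexps: one way to make them testable is to pull them out into a small helper that can be fed plain strings. The sketch below is not what the script does today, and the sample strings are invented, since the real markup isn't shown here:

```ruby
# Hypothetical helper: pull the first number out of a text snippet and
# return it as an Integer, ignoring thousands separators and the € sign.
def parse_euro_amount(text)
  digits = text.to_s[/\d[\d,]*/]
  return nil unless digits

  Integer(digits.delete(','), 10)
end

samples = ['€1,500', "\n  €1,234 to go\n", 'no amount here']
samples.each do |sample|
  puts "#{sample.strip.inspect} => #{parse_euro_amount(sample).inspect}"
end
```

Returning nil instead of raising lets the calling code decide whether a missing amount should skip the campaign or abort the run.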

But we want the numbers now! After running, and omitting all the debugging output:

57 campaigns, 2829964€ total, 2818168€ remaining, 11796€ earned

At, let’s say, 3% commission (the low end of their range), this means the current projects are earning 11796 * 0.03 ~ 354€

This buys you a lot of Falafel, but it’s not much to run a company on :/ But of course everybody has to start small, and again, we really don’t have all the facts here. But hey, our code works, even though the numbers it produces might be meaningless. Ready for a career in business intelligence 😉

Tune in next time when we clean up this mess, and add tests!

How not to create a stupid Rails extension that screams at you

Hello dear nonexistent readers, it’s story time again! This time, as last time, we’ll talk about how computers make our lives harder when we try to make them do stuff.

TL;DR of actually useful bits of info in this rant:

  • Spammers ruin everything. They suck.
  • Modern browsers make it impossible to just up and play sound at the user. Some kind of interaction is necessary beforehand (there are exceptions for sites that repeatedly use sound, like a video site), usually a button press etc. This actually makes a lot of sense and seems to be implemented pretty neatly.
  • Before fiddling endlessly with stuff and reading messy Stack Overflow answers, it makes super duper sense to take some minutes to read the actual browser docs…

Ok back to ranting

What I wanted to make was really stupid: An extension to the Rails active record framework, which would allow anyone to add a long-missed feature to validations: That when an error occurs, the page would not only display an error message but would also play a sound file really loudly that screams at you for being so stupid to cause errors.

Yeah yeah. One of those ideas that seemed hilarious at 2 am on the way home from the pub. Although this video, which is the inspiration, still cracks me up…

But I wanted to go at it really professionally. I had everything lined up – a list of requirements, possible bonus features (I18n for sound files? Try to play them in the Rails console as well?), plans for seeing how to create and publish a rubygem in 2019 (I made some gems before but the helpers around that kept changing), etc.

And so I spent quite some time on writing up lists of that. But I should have listened to a nagging feeling – that I should check first on how to actually play sound files in a browser on a page load. Because that sounded a bit like it might be misused and therefore be a bit tricky.

So I finally started on doing that. Some googling led to this SO question … which leads to a sinking feeling in my gut. There’s a whole lot of answers, all quite different, more or less contradicting, and lots of comments saying “doesn’t work in Browser x”…this seems like one of those browser-support-morasses where there’s a lot of history around a feature, everything evolved differently in different browsers, and lots of more or less smart people created workarounds and libraries and left a lot of deprecated info flying around online…i.e. the bad kind of web development. Been there. Not fun.

I also found something a bit more professional: https://developers.google.com/web/updates/2017/09/autoplay-policy-changes but it’s about videos? So I didn’t really read it.

But oh well, let’s actually do something. The first SO answer had something simple looking, so let’s make a test page:

<!doctype html>
<html lang=en>
  <head>
    <meta charset=utf-8>
    <title>Audio test</title>
  </head>
  <body>
    <p>Blah blah</p>
    <script>
      var audio = new Audio("aaa.ogg");
      audio.play();
    </script>
  </body>
</html>

Saved as a .html file and opened in a browser, it should directly play that “aaa.ogg” sound file. Easy peasy.

Right, we need a sound snippet, right? Let’s just record a quick soundbite (well, a scream, in keeping with the idea) with the Laptop microphone.

Another aside: On the joys and sorrows of using Ubuntu

I’m one of those freaks in web development that does not use a Mac. Sue me, I never got used to them – I was a Windows kid, then switched to Ubuntu when using Ruby/Rails on Windows was way, way too painful (never say the word “cygwin” to me…)

Ubuntu the good bit: Using the “home” button and typing “record” shows there are no programs installed that would help me, but asks if I want to install “Sound Recorder”. I say yes. Bam, it installs, and opens, and I can record sound. Takes like 20 seconds. Ain’t that something?

Buuut…ok where is my recording? Hello? Mr. Sound Recorder Program?

Not shown: A sensible UI

There is absolutely no way of actually interacting with your recordings – you can only play or delete them. After some educated guessing, it turns out the files end up in ~/Recordings. And they are in .ogg 😦

Ok back to playing .ogg files in the browser

Now we can put the aaa.ogg file in the same folder as the test html file, open it again, and….nothing.

Well except for this helpful error message (in the dev tools console): “Uncaught (in promise) DOMException” . Well fuck you too, Chrome. Googling reveals that this means “Autoplaying is not allowed, fool”.

Firefox has a much better error message: “NotAllowedError: The play method is not allowed by the user agent or the platform in the current context, possibly because the user denied permission.”

So now we get to the end of this little adventure. After actually reading this (which I skimmed over earlier): https://developers.google.com/web/updates/2017/09/autoplay-policy-changes – well, it really makes sense, doesn’t it? If it were possible to just start playing sound/video files, all pages would be full of ads and other bits of junk that blared at you. Some more fiddling to actually see this in action:

<!doctype html>
<html lang=en>
  <head>
    <meta charset=utf-8>
    <title>Audio test</title>
  </head>
  <body>
    <p id="hovercraft">Press play on tape</p>
    <button id="tape">Play</button>
    <script>
      var audio = new Audio("aaa.ogg");
      document.getElementById("tape").addEventListener('click', function() {
        audio.play();
      });
      document.getElementById("hovercraft").addEventListener('mouseover', function() {
        audio.play();
      });
    </script>
  </body>
</html>

If you load this in a browser and move the mouse pointer over the “Press play on tape” text, nothing happens. But if you click the button, the sound plays, and if you mouse over the text after that, it also plays. The button press is needed to demonstrate that the user actually wants something from the page, and after that it is allowed to play sounds directly.

Which is all well and good but it kinda makes the original idea moot – because the fun bit is the immediate screaming after you submit a form. No fun if you have to allow it first 😦

The moral of the story

Well, one learning is that browsers are really quite sophisticated – I never thought about the implications of having an Audio API and dealing with malicious websites. I think the approach taken is pretty neat here.

Also, I could have saved a whole lot of time by looking at that doc in the first place. This is something I thought I wasn’t bad at, but I went ahead and made a castle in the sky before checking for the obvious cloud zoning laws and stratospheric building construction permits 😉

Keeping Ruby weird with emojis

Have a look at this: https://medium.com/carwow-product-engineering/emoji-driven-development-in-ruby-2d54264f7b08

Isn’t that just grand? 😀 It’s been quite a while (it might be 15 years or so) since I stumbled upon this little unusual programming language from Japan, which I soon fell in love with. Matchmaker was the weird, quirky, clever code-art that _why the lucky stiff made. Sadly he just up and info-suicided all his online stuff and was gone. This reminds me of his stuff 🙂

Bloody bloody computers (TODO #1 update)

This is the update to TODO #1 – pointing my old monogreen.de domain to this blog (which is the hosted “free” plan from wordpress.com, i.e. no bells and whistles).

A bit of history: I’ve been using hosteurope to host my mailboxes and to register some domain names for years (I get a warm and fuzzy feeling by actually owning my email address and my mails – I’m hoping of course that hosteurope, compared to say, Google, is not interested in scanning my emails and selling my habits…). But I only bought the “email” package from hosteurope, and not anything else like servers or webspace hosting.

Naturally I’d like to point monogreen.de to this blog. As I’ve written last time, you have to have a webspace package booked in order to set that up, so I bought the cheapest plan.

First hurdle: It turns out they make that available quite fast (<15min) after ordering – but you have to log out and back in to see the new package in KIS. Groan.

Next hurdle: As I now had two packages (one email-only, one webspace which also includes a mailbox as well as webspace, php hosting, etc.), I had to “transfer” the domain monogreen.de from the email package. With some trepidation, I did so, as it seemed like the “redirect” setting would only be available on the webspace package, and I assumed I could still have the email addresses themselves point to the actual inbox I’ve been using for years.

Well that worked alright, and I could immediately set up the redirect under the webspace package. However…the redirect didn’t work. Knowing things sometimes take a while to start working at hosteurope, I waited a while, but still nothing. Then, being suspicious, I sent myself an email from a different address/provider, and – it bounced o.O . I had managed to break my primary email address.

Photo Credit: Kalle Gustafsson https://flic.kr/p/GweE1X

Half a panicky hour later I had everything back to as it was before (including setting back up the various email aliases to the actual mailbox, which the “move” had severed) – and here are two tips for anyone in this situation, and for myself in the future:

  • Every package in KIS has its own block of menu entries, including its own “Domain settings”. There is also a general one. So in my case there were three places to look in when trying to move the domain back from the webspace package to the email package (the correct one is the “general” area)
  • Many changes in KIS take ~15min to take effect (they say that in the interface, actually, but you usually assume this is only a general “yeah yeah mostly immediately but let’s cover our asses” policy – it’s not with hosteurope). Needless to say that this makes trying out things extremely cumbersome and error-prone…

And fun fact: After all that, the damn redirect still didn’t work…until next morning, after I had given up, and it suddenly worked just perfectly. Groan.

In summary: Mission accomplished but with too much panicky clicking around…

TODO #1 – my own damn domain

TODO for next time (update: here it is) : Make monogreen.de show this blog. I own the domain, I’ve registered it at hosteurope.com a long time ago – but it appears you can’t add any sort of redirect etc. except editing the DNS settings, and it’s probably quite useless to have monogreen.de point to WordPress’s server(s). After getting lost for many a year in KIS, the ever-so-confusing admin interface from hosteurope, I ordered the smallest “Web Pack”, which hopefully then lets me add redirects…you also get classic shared managed hosting – maybe I’ll also play around with that. Should be strange after years of administrating proper servers from scratch…