Fun with nokogiri: Screen-scraping ikea.com

I recently discovered that IKEA is really scraping-friendly: the categories and the products belonging to each category are linked very cleanly, and the product pages themselves include, as JSON, the complete product data necessary to display a product somewhere else ( say, on an iPad ). They also link the assembly instructions as PDF, and more fun stuff. Just open any product page and browse the source; you will discover a field named “jProductData”, which is just what you want.

So I decided it’s time for me to try Nokogiri, a beautiful and fast library for parsing HTML and searching through it. I managed to write a working scraper in only a few lines. Since the JSON is embedded in regular JavaScript, I had to use rkelly to parse the JavaScript part and extract the right data.
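To give a rough idea of the extraction step, here is a minimal sketch. It deliberately swaps in a plain regular expression and Ruby's standard `json` library instead of Nokogiri and rkelly, so it only works when the value assigned to the variable is strict JSON sitting on its own statement. The sample markup, the field names inside it, and the helper `extract_js_json` are all made up for illustration; the real page layout is not shown here.

```ruby
require "json"

# Hypothetical page fragment. The real markup and the fields inside
# jProductData are assumptions made for this sketch.
SAMPLE_HTML = <<~HTML
  <html><head>
  <script type="text/javascript">
    var jProductData = {"product": {"name": "BILLY", "price": 49.0}};
  </script>
  </head><body></body></html>
HTML

# Find `var <name> = {...};` in the page source and parse the object
# literal as JSON. Returns nil when the variable is not present.
# This is a regex shortcut, not a real JavaScript parser.
def extract_js_json(html, name)
  match = html.match(/var\s+#{Regexp.escape(name)}\s*=\s*(\{.*?\})\s*;\s*$/m)
  match && JSON.parse(match[1])
end

data = extract_js_json(SAMPLE_HTML, "jProductData")
puts data["product"]["name"]
```

A real JavaScript parser is the more robust choice as soon as the embedded data is not strict JSON, for example when keys are unquoted or the literal contains comments.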

For the data model, I used Ohm with Redis, which takes next to no time to set up and works as advertised.

If you are curious, the source is available on GitHub.

Flattr, a great idea, a great video, great people behind it

Flattr is exactly what I dreamed of when I wrote my post about the donation-button dilemma in open-source projects. Flattr, by the creators of The Pirate Bay, is a service that lets you specify an amount you want to spend each month. I don’t know what they plan, but let’s say 10€. You can then flattr any project you like, much like digging a story on Digg, and each project gets a share of your money. Of course, this only works because the money adds up.

There are still some unknowns for me. I don’t know how much of the money actually reaches the designated receiver, how privacy and security are ensured, how money can be transferred to Flattr, and so on. Still, this idea opens up new horizons for people who want to invest time in software they don’t mean to sell, but who still want to live a life worth living.

I mentioned the video: it’s a well-made one, see below.

( note: I’m using the logo without permission, hope the pirate-bay guys won’t mind that. )

iType Demo Video online – finally getting real

iType, my project, is finally getting close to release state. But talking is boring, so here is a short demo clip. If you have any questions, please write to itypeapp @ gmail.com

I am looking for beta testers, so if you want to test it, please let me know!

iType Demo Video from Moritz Haarmann on Vimeo.

La Fonera 2n – Hands on, darling

Since Saturday, I’ve been lying sick in bed, trying not to annoy my girlfriend too much ( well, not that successfully, I suppose ). Anyhow, I ordered a Fonera 2n last week, and it happened to arrive today. So after some unboxing ( UPS packaging materials can be so turbo-time-consuming ) and several cups of tea, I swapped the old and the new Fonera, expecting everything to work.

It didn’t. It took a while to figure out that the F2N is a bit picky when it comes to connecting to the router that provides it with a vital IP address and connectivity. But after some tries, it worked, and it’s really working well.

The first convenience is that you can instantly connect to your new Fonera through cable or WiFi using an actual domain name ( fonera.lan ), not some ugly IP. This feature is, of course, nothing substantially new, but it’s nice to have.

Next thing: Coooool web-interface. Just the way I want a web-interface to be: clean, well-structured, decent color-scheme ( see the screenshots ) and speedy. The navigation is simple: the so-called dashboard serves as an entry-point, and on top you can find a listing of all applications currently installed on the tinytinytiny internal flash memory. Below that, there is a separate box with built-in features like Settings, File Manager ( later.. ) and UMTS/3G connectivity.

Of course, there are more applications that can be installed easily, also through the web-interface, like a printer-sharing extension that lets you use a printer connected to the external USB-port. Right, there is a USB-port, prominently featured on the front of the device, in fact the only port there. This indicates that USB is seen more as an end-user business than as belonging to the admin’s ( or whoever is doing the job ) domain. This is an interesting point of view, and given the nice administration UI and the easy-to-use applications featured there, it is also reasonable.

But this USB-port is more than just a simple USB-port, because unlike many other routers out there, the Fonera 2n is advertised as letting you connect any number of USB devices ( of course via a hub ) to that port and use them remotely. These devices include webcams, audio-interfaces ( I found a working one for 10€ in a local store ), printers and, of course, mass storage devices. So, if your ambitions are not too sophisticated, this tiny little box could be everything you need, with a power consumption that is presumably way below anything else you might have.

It’s really simple: once your disks or flash sticks or whatever are connected, you are free to access them using the SMB protocol, a de-facto standard supported on most platforms. There are guides explaining how to use the Fonera as a budget Time Capsule replacement, which is also very handy, and with fast WiFi, chances are your Mac’s content will be backed up within this century.

But the first impression also includes some not-so-cool things, like the random connection losses I experienced over WiFi. This is especially nasty while running huge downloads overnight. Of course, once reconnected the download continues, yet I’m not the type who gets up every two hours just to check the connection. Not yet.

Summary? The best you can get, in my opinion. It’s not only a full-fledged router, 4-port ethernet switch and 802.11b/g/n access point, it’s also perfectly capable of hosting tons of data and acting as a print server. It downloads your torrents and uploads your videos to YouTube, so what more do you need?

And oh yeah, the webcam module features motion detection, which is the coolest and most unnecessary feature of all.

Business ideas: Which is the right one?

Honestly, I don’t know. But from what I can tell, the ideas that solve problems people have, even problems they were not aware of before, are the ones that work. The other ideas usually don’t. So what are problems worth solving?

I’d define a problem worth solving as something that has 3 attributes. First, it is persistent, in the sense that it reoccurs so many times that it has an impact on some activity or workflow. That is something worth resolving. The second point is the “factor of annoyance”. Remember Microsoft’s Paperclip? High factor of annoyance. One dialog asking for a bit of love in the form of an OK button? Not that annoying. But keep in mind that this is all a very subjective rating. So if you find dialog boxes more annoying than speaking paperclips, just change the order accordingly.

The third factor is the most important one: the time spent working around the problem. The attributes mentioned above may be pointless if it takes no time to work around the issue. If something takes a serious amount of time to be worked around, solved, corrected or understood, it’s time to do something. Recently, Gmail and my Mac synced. The result was an Address Book disaster. Time to correct it: 3 hours. That’s a problem.

Anything that’s built to return money should solve a problem or improve something. Anything else? Naaa.. That’s my point. What is yours?

Finally: ActiveResource with Service Discovery and Authentication

Yay. In case you’re looking for a release, there is none. Not yet. We may release one, but it’s a question of time rather than a lack of good will.

Why? In my current position, I am building a set of applications ( most of them Rails based ) communicating with each other in a RESTful manner. This is, well, just continue reading my ActiveResource rant here. It’s not really nice to use the official ActiveResource thing. It involves a lot of hardcoding ( e.g. you have to set the remote service’s URL in the model, not in some kind of configuration file, which makes switching from development to testing and production a pain ) and has other shortcomings. It’s a good idea, yet far from perfect. The two things that bothered me most were service discovery, meaning the ability to easily resolve a service by its name rather than by its URL, and authentication. Both of them are crucial for a system that goes beyond hello world. And that is what I’m building. So, utilizing all of Ruby’s beauty, we developed a) a Rails plugin and b) a standalone server acting as a Central Authentication Service and Service Discovery instance. And from what I can tell, it’s beautiful ( not the codebase, at the moment, but the functionality ).

What is this thing able to do? Well, for the simple parts, it handles all your authentication needs. No more password juggling, just do it in one place, and nowhere else. OpenID compatibility is on the way, both as consumer and provider. It’s nice to get this kind of functionality just by installing a plugin and creating a before_filter.

The next big thing is the service discovery. The entry point for my customizations wasn’t ActiveResource itself, it was HyperactiveResource. I extended it so that it connects to the above-mentioned central instance ( the address of this instance is defined in a configuration file, by the way ) to retrieve a service’s address. A simple thing, yet it makes life so much easier.

What’s the best part? By bundling the two features above, you are able to allow and disallow communication between two services at will. Permissions are set per direction, so assuming you have an E-Mail-service and an AddressBook-service, you can allow the E-Mail-service to access your AddressBook without allowing the other direction. Authentication is handled completely transparently to the developer, and the rest of the usage is like HyperactiveResource. It works like a charm.
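To make the idea concrete, here is a tiny in-memory sketch of name-based discovery combined with per-direction permissions. It is not the plugin’s or the central server’s actual API: the class `ServiceRegistry`, its methods, the service names and the URLs are all hypothetical, and authentication itself is left out entirely.

```ruby
# Hypothetical central registry: maps service names to URLs and keeps a
# list of allowed caller -> target pairs. Authentication is not modeled.
class ServiceRegistry
  class AccessDenied < StandardError; end
  class UnknownService < StandardError; end

  def initialize
    @urls = {}      # service name => base URL
    @allowed = {}   # caller name  => array of target names it may call
  end

  def register(name, url)
    @urls[name] = url
  end

  # Grants access in one direction only: caller -> target.
  def allow(caller_name, target_name)
    (@allowed[caller_name] ||= []) << target_name
  end

  # Returns the target's URL if the caller is allowed to talk to it.
  def resolve(caller_name, target_name)
    raise UnknownService, target_name unless @urls.key?(target_name)
    unless @allowed.fetch(caller_name, []).include?(target_name)
      raise AccessDenied, "#{caller_name} may not call #{target_name}"
    end
    @urls[target_name]
  end
end

registry = ServiceRegistry.new
registry.register("email", "http://localhost:3001")
registry.register("addressbook", "http://localhost:3002")
registry.allow("email", "addressbook")

puts registry.resolve("email", "addressbook")
begin
  registry.resolve("addressbook", "email")
rescue ServiceRegistry::AccessDenied => e
  puts "denied: #{e.message}"
end
```

In a setup like the one described above, a lookup of this kind would sit behind the resource class, so a model only names the service it wants and never hardcodes a URL.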

And for me? Fun is back :-)

Still working on usabilitizing the Web: ideas, cont’d

Hello, today I continue my list of ideas that would make the Web, and computers in general, much easier and safer to use. Although it won’t change anything, I just want to make you think about it and let me know your opinion.

I’m still a bit frustrated that there is nothing like a general file metadata format. While every OS nowadays has its own metadata-supplying filesystem, there doesn’t seem to be a way to exchange that information or reuse it, nor to publish it online in a useful manner. I was thinking about a format that lets you store almost arbitrary information about a file: album art, MD5 hashes, author information, keywords. Anything that helps to categorize a file. A download would then start by clicking on the metadata file, and a nice screen would pop up showing the file type and the supplied information, giving you a quick overview of the file.

This could help to prevent the download of malicious software, etc. And it would help to store files correctly without user interaction. As a teacher of mine correctly noted, it’s impossible to store and organize files in a useful way using the old-school directory-filename scheme plus the good old 3-char extension, e.g. avi or txt. Something more sophisticated is needed, and it would add a bit more usability to this web.
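As a very rough illustration of what such a metadata file could carry, here is a sketch that builds a small JSON record for a file and checks a downloaded copy against it. The field names and both helper functions are invented for this example; no existing format is implied.

```ruby
require "digest/md5"
require "json"
require "tempfile"

# Build a metadata record for a file. The field names are invented for
# this sketch and do not follow any existing standard.
def build_metadata(path, extra = {})
  {
    "filename" => File.basename(path),
    "size"     => File.size(path),
    "md5"      => Digest::MD5.file(path).hexdigest
  }.merge(extra)
end

# Check a (downloaded) file against previously published metadata.
def matches_metadata?(path, metadata)
  File.size(path) == metadata["size"] &&
    Digest::MD5.file(path).hexdigest == metadata["md5"]
end

Tempfile.create("song") do |file|
  file.write("pretend this is audio data")
  file.flush

  meta = build_metadata(file.path, "author" => "Someone", "keywords" => ["demo"])
  puts JSON.pretty_generate(meta)

  # Round-trip through JSON, as if the record had been published online.
  published = JSON.parse(JSON.generate(meta))
  puts matches_metadata?(file.path, published)
end
```

A hash check like this only tells you that the file matches what the publisher described. Whether the publisher can be trusted is a separate problem.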

This is once again just a rough idea, and I would love to hear your comments.

Test drive Piwik Analytics Software

Finally, I found the time to install Piwik, an open-source Google Analytics lookalike. I’ve chosen to run a tracking solution in addition to analyzing server stats. Why? Because the snippet has to execute some code, most spiders are simply not counted, regardless of how well they are disguised.

Back to Piwik: the two main advantages over Google’s solution are that you keep the data local and can do whatever you want with it, and, for the curious, that statistics are realtime.

The installation went through without any problems, as expected. Unfortunately, it’s yet another piece of PHP software requiring MySQL, though Piwik uses DataObjects, so porting it to work with any other DBMS shouldn’t be a big deal.

The snippet to include is a bit larger than Google’s. Piwik’s looks like this:




While Google’s snippet looks something like this:


Note that Google recently updated their snippet to include a try-catch block. Read the Analytics page for more info here. If you want to try out Piwik, just drop me a line and I’ll give you an account here.

I’m quite curious what the stats will look like. I feel better hosting them here than anywhere else, and I’ll also post a comparison of the actual stat results. Stay tuned!