1. Pipes got a new block for Reddit

    Summary: Since Reddit did not work well with our RSS feed block, despite its slow request limit and a new improved detection of 429 responses, Pipes now got a dedicated block to access subreddit feeds. That new block uses an rss-bridge instance to solve the problem.

    As far as social media sites go Reddit is one of the accessible ones. Not only is the activity on the site visible without an account, Reddit even provides RSS feeds for the submissions to their subreddits. That is very useful especially for sites like Pipes that work well with RSS items.

    But until recently Pipes and Reddit did not work well together. When importing a Reddit feed into a pipe, it sometimes worked and sometimes did not. It turned out that the Reddit servers again and again would refuse to serve content when Pipes asked.

    That was surprising because at that time our downloader looked like this (slightly simplified):

    require 'open-uri'
    require 'lru_redux'
    
    # Download and cache downloads. Limit requests to the same domain to not spam it
    
    class Downloader
    
        def initialize()
            begin
                @@limiter
            rescue
                @@limiter = LruRedux::TTL::ThreadSafeCache.new(1000, 2)
            end
        end
    
        def get(url, js = false)
            url = URI.parse(URI.escape(url))
            result, date = Database.instance.getCache(key: 'url_' + url.to_s + '_' + js.to_s)
            
            if date.nil? || (date + 600) < Time.now.to_i
                while (@@limiter.key?(url.host))
                    sleep(1)
                end
                @@limiter[url.host] = 1
    
                result = URI.open(url, :allow_redirections => :all).read
    
                Database.instance.cache(key: 'url_' + url.to_s, value: result)
            end
            return result        
        end
        
    end
    

    That code did some effort to avoid spamming other sites:

    1. It caches every downloaded URL for 10 minutes, checking the date an URL was cached with if date.nil? || (date + 600) < Time.now.to_i.
    2. Pipes will only download data when requested externally, by a feed reader for example, which might be a lot less often than that 10 minute interval.
    3. It will remember the last 1000 hosts and wait 2 second before sending a new request to a host it just saw. That works via the combination of the LruRedux queue with a time-to-live of 2 seconds and the while (@@limiter.key?(url.host))

    But Reddit still sent a 429 response as soon as it saw Pipes requesting an RSS feed.


    429 responses are a way for a server to tell a client to slow down. They can a also contain instructions on how long to wait. Reddit for example did set the retry-after header and filled it with a 7, telling us to wait 7 seconds before trying again. So we did just that in the new version of the downloader:

    response = HTTParty.get(url)
    if response.code == 429
        if response.headers['retry-after'].to_i < 20
            sleep response.headers['retry-after'].to_i
            response = HTTParty.get(url)
            result = response.body
        else
            result = ""
        end
    else
        result = response.body
    end
    

    Whenever a server – that might be Reddit or a different site – responds with a 429 header and sets a reasonable retry-after limit, this code will wait before trying it one more time. If that second requests also fails it will give up.

    There is a second change: HTTParty. That is a HTTP client as a ruby gem and an alternative to the open-uri method used before, the hope was that it will improve compatibility.

    I made some more changes to accommodate Reddit. Instead of waiting a minimum of two seconds between requests to the same host the downloader would wait two or three seconds, via a two second sleep when the host was still in the queue. And at some point the downloader had code to specifically slow down when it saw that the requested URL lead to Reddit.

    But nothing helped, the Reddit server was still not happy.

    That is where the new Reddit block comes in. It does two things to reliably get a subreddit’s RSS feed:

    1. It uses the FOSS project RSS-Bridge to fetch and cache the requested feed
    2. That software is running on a different server than Pipes itself

    The way RSS-Bridge requests Reddit’s RSS feed works, it seems to be slow and cached enough to not offend the server. And by using a different server with a different IP we make sure that additional requests Pipes might make do not add to the load RSS-Bridge causes, which reduces the danger that Reddit’s server places new limits on our RSS-Bridge instance.

    The Reddit block is a new addition to the integration menu of the editor that got added last month. In my testing the block proved to work reliably so far, even when a single pipe contained multiple Reddit feeds, as in this example with five random subreddits:

    5 reddit blocks in a pipe

    The improvements made to the downloader will remain active, they will help to reduce the load on other sites and to react better to future 429 responses.

  2. More blocks for external sites

    With the help of RSS Box Pipes now supports more external sites directly.

    Not every site has an easily accessible RSS feed or a HTML structure that can be transformed into one. For those, or where it would be more comfortable to enter for example a channel name, having a special block is helpful to get data into Pipes. So far, there was a block for Twitter and the feed block detected hidden feeds for Youtube. Now there are additional blocks for these sites:

    They join the Twitter block in the new integration menu of the editor sidebar.

    onli,
  3. The Architecture and Software behind Pipes

    Now that Pipes got an open source release, I’d like to present this dive into the architecture of the software. I hope it will be helpful for anyone trying to build a similar system, and especially for developers wanting to customize Pipes CE.

    Software overview

    Pipes goal is to provide infrastructure for users to work with data from the internet. It provides a visual interface where data flows from block to block and each block does one operation. Internally, each data item is an item in an RSS feed that gets sent from block object to block object, and also external data input is often an RSS feed, though it is possible to fetch data from webpages and to send XML or JSON data to webhook blocks. A typical example for what users do with Pipes is combining multiple feeds into one, filtering them, or creating feeds for sites that do not provide them. It definitely stands in the tradition of Yahoo! Pipes, hence the name.

    Pipes example

    Pipes is a Ruby/Sinatra application, uses SQLite as database engine and Raphaël for the main part of the frontend, the editor. Ruby works well here, since it is a very expressive language with a healthy ecosystem around XML (oga, nokogiri) and some useful modules for RSS/ATOM normalization (feedparser, feedjira), and creation (included). Sinatra provides the webserver parts of the system and the HTML views, this part is rather small. SQLite is the data store for everything, including the pipes serialized to JSON and cache data. And finally Raphaël abstracts creating the SVG part of the javascript powered editor, painting the blocks and manipulating them.

    Portier is used as the auth system. It enables logins without having to store passwords. Users log in with their email address and only have to confirm that they control that address, by either clicking on a link sent to them or, for Gmail, logging in with their Google account. That work is done by an external broker, Pipes just has to check the token sent back by that system.

    Backend

    I will start with a description of the backend, because defining the data structure is also how Pipes started.

    It began with the realization that a pipe can be thought of as a tree. The pipe’s output object is the root of the tree, the block that connects to its input are its children, and so on, until you reach the blocks that have no inputs, the leaves. That will usually be a Feed or a Download block, though that later expanded to also allow other Pipes (as represented by one block), Twitter or the Webhook block. Running a pipe means asking the root object of the tree for output, which will ask its children for input, which iterates through the whole tree until blocks that provide input are encountered. This data then flows back through the tree up to the Output block, with each block doing its own operations on the transported data.

    So when you look at this pipe, where two Feed blocks fetch some data, one is filtered, then both are combined into one feed and that combined feed set as pipes output:

    pipes example in editor

    You are actually looking at this execution tree:

    pipes example tree

    The code that enables this structure is minimal. There is a Pipe object, defined in pipe.rb. There is also an abstract Block object, defined in block.rb, the parent class for all blocks. Those two classes together define the pipe execution flow. An instantiated subclass of Block has an array of inputs, containing other blocks. When the system runs a stored pipe, this happens:

    1. A Pipe object is created
    2. That object fetches the JSON structure as stored in the database and sets the output block as its root
    3. Then it creates the block objects that are the inputs of the current block, by recursively calling createInputs on its children and their children.
    4. To finally get the output, it calls Pipe.run(), which calls Block.run(), which will go through the tree as described above.
    5. The actual data manipulation work is done in a process function each indiviudal block has to define on its own. Look for example at the process function of the FilterBlock.

    To define pipe execution like that has a hidden advantage: When you want to see the output of a pipe at a specific node of the tree, at one specific block, all you have to do is set that block as the root element of a new Pipe.

    This describes almost all classes and files of the backend. Additionally, there is a server.rb, the entrypoint for the Sinatra application and where all routes as well as some helper functions are defined. twitterclient.rb is a small wrapper around the twitter gem and used by the twitter block. downloader.rb is central for most pipes, as it is used by the Feeds and the Download block to fetch data, and the central cache is set here that prevents requesting data from other servers too often. The Gemfile describes the requirements, the config.ru makes the application runnable by the usual ruby commands and also initializes the Moneta session store, that prevents users from being logged out when the server process restarts. HTML templates are stored under views/, ERB is used for them. The individual blocks are stored under blocks/. Images, Javascript, CSS etc are in public/, accessible to users.

    As you probably noted, this is all bundled into one application. The way a pipe works, it would be easy enough to divide the Sinatra part of the application and the Pipes part, into what could also be called server frontend and worker backend. That would provide a way to scale horizontally, over multiple processes and even multiple servers. So far this just has not been necessary, upgrading the server has been enough to keep pipes.digital running. But keep that option in mind if you run into performance limitations in deployments with a big amount of individual pipes.

    Frontend

    The website uses server side rendering, as provided by Sinatra. But that describes only the starting point and the less important parts of the frontend adequately. The heart is the editor that gets initialized by a HTML template (views/editor.erb), but then fetches and sends data via AJAX, more akin to a Single Page Application. It is a big SVG canvas powered by Raphaël, with absolutely positioned HTML input elements mixed into it, to make the blocks configurable. Its code lives in public/pipes-ui.js.

    The main objects here are blocks and connectors. Connectors can be Inputs or Outputs, and there exist TextInputs and TextOutputs for the text fields that double as parameters when calling a pipe. Let’s look at the Filter block as an example.

    Filter Block

    In simplified form it is defined like this:

    function FilterBlock(x, y) {
        Block.call(this, 1, 1, x, y, 'Filter', 200, 150);
        
        this.userinputs[0].style.width = '150px';
        this.userinputs[0].required = true;
        this.userinputs[0].placeholder = 'keyword';
        
        var textinput = new TextInput(x + 29, y + 37, this);
        this.textinputs.push(textinput);
        
        var field = document.createElement('select');
        var all = document.createElement('option');
        all.value = 'all';
        all.text = 'all';
        var content = document.createElement('option');
        content.value = 'content';
        content.text = 'item.content';
        
        field.appendChild(all);
        field.appendChild(content);
        
        field.style.position = 'absolute';
        field.setAttribute('data-xoffset', 30);
        field.setAttribute('data-yoffset', 75);
        field.style.left = x + 30 + 'px';
        field.style.top = y + 75 + 'px';
        field.style.width = '150px';
        field.name = 'field';
        document.querySelector('#program').appendChild(field);
        this.userinputs.push(field)
    }
    

    What happens here? First, Block.call creates a Block object. The constructor of the Block object (that’s just the function Block) creates the needed SVG elements. The two 1s define the number of Inputs and Outputs the block will have, x and y are the position on the canvas, 'Filter' the title as shown in the title bar of the block, and finally the width and height are set to 200 and 150.

    The Block constructor also creates a userinputs array and adds one HTML input element to it, a text box. That array is where all the inputs of a block have to be stored. That way, other code like the drag’n drop handlers know which absolutely positioned inputs belong to which block. This first userinput element is then changed, a width is set, it is marked as required and it gets a placeholder.

    Next, it also gets a TextInput, allowing connections to TextBlocks. That’s the green circle at the left of the text input shown above.

    The code that follows shows how to add additional userinput elements, in this case a HTML select element with two options. Note how elements like this are added to the DOM (since they are absolutely positioned and regular HTML elements) and also to the userinputs array, for the link to the block.

    Not shown here but working similar to the userinputs and textinputs arrays is the deco array that does the same thing for Raphaël SVG elements, used for example to add text and lines to blocks.

    Other main classes, but something you probably won’t need to change, is Connector, the parent of Input and Output, the half-circles at the left and right of input blocks. They manage the block connections.

    When a pipe is saved these steps happen:

    1. The function save() is called.
    2. It calls serialize()
    3. serialize iterates through the blocks array and creates a JSON representation of their current state, including which blocks they connect to
    4. This JSON array gets sent to the backend

    While I’m happy with the resulting user interface, this approach to building it involves a lot of manual coding work and is not modern. It would be quite interesting to see an implementation of this interface based for example on Vue, as long as it involves no javascript build process. Not bloating the development process with such an abomination was a high priority goal when starting the project.

    Database

    SQLite is used to store all data in a file. The file database.rb creates all tables and the functions for accessing them.

    Important for understanding how Pipes works are only two: users and pipes. Let’s look at users first:

    CREATE TABLE IF NOT EXISTS users(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        email TEXT UNIQUE
    );
    

    Every user get its own id and a field for the email address. Note how that address is not the id, to make it easier to change it later.

    The data for each pipe is stored in the second table:

    CREATE TABLE IF NOT EXISTS pipes(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        pipe TEXT,
        title TEXT,
        description TEXT,
        user INTEGER,
        preview TEXT,
        public BOOLEAN DEFAULT 0,
        publicdate INTEGER DEFAULT 0,
        date INTEGER DEFAULT CURRENT_TIMESTAMP,
        FOREIGN KEY(user) REFERENCES users(id)
    );
    

    The purpose of most of these fields should be clear. Each pipe gets its own id, can have a title and a description and is linked to a user id. For shared pipes public is set to 1. Most important is the pipe column. It contains a JSON representation of a pipe, as created in the frontend when serializing the editor content. That JSON structure is later transformed into the tree, as described above.

    Outlook: Open development questions

    The above describes Pipes as it is today. But how could future development look like?

    Much of the development so far was user driven. Exchanges by email or in the issue tracker lead to improvements, big and small, like being able to run javascript before crawling sites, create CSS selector for fetching data from sites by clicking on the target elements in an overlay, or keeping media enclosures. I’m sure more will follow, but that development is unpredictable to me.

    One thought I return to quite often is the focus on RSS feeds. Right now, that is baked hard into the software, as the data exchange format between blocks. That plus normalization is a big part of what makes the current approach as powerful as it is. But I often wonder whether there could be a better approach, something that still allows manipulating RSS feeds and creating them as pipe output, but making it more straightforward to work with raw text, XML, JSON or other structured data instead. That was also the starting point of the usecase focused blog series describing how Pipes can be used today. Finding a better concept could allow supporting more usecases (how would they look like?) and also remove the need for normalizing RSS/ATOM/JSON-Feeds before manipulating them, which could become an optional step of the feed block. One advantage there would be avoiding the issue of sometimes removing parts of the original feed not covered by that normalization.

    Splitting the backend into a server part and a multiprocess program running the pipes would be an optimization useful for big deployments. It could also be nice when deploying to multicore servers with weak cores, like those ARM servers scaleway used to offer or a Raspberry Pi at home. Resque could work well for that.

    Porting the user interface to a more modern javascript library could be worthwhile if it would result in new capabilities for the user, maybe even supporting new usecases that way.


    Questions, suggestions? Join the pipes gitter channel or open a github issue.

    onli,
  4. Pipes CE: A free version of Pipes

    If you ever wanted to run Pipes locally or look at the source code, now you can. Pipes CE is an AGPL licensed version of Pipes. Right now it is almost equal to what is run on pipes.digital, minus stuff like billing code and with a slightly different color scheme.

    /blog/upload/pipesce.png

    That’s a big step! It moves Pipes away from being a closed project to it being free software, software that can survive even without the involvement of this site. If you are a developer, have a look at the Github repository, and if you have want to see some new code merged into the core you now have the option to send in a pull request.

    onli,
  5. How to get notified of new YouTube videos in your feed reader

    This is part 6 in a series explaining how pipes works and what you can do with it. The article before this showed how to embed tweets into RSS feeds created by pipes.

    Some sites have working RSS feeds that can be used to easily feed them into pipes, but they hide them. Regular discovery can’t work when the feed is not linked in the head of a page, which is exactly what happens sometimes. The biggest offender here is probably YouTube. But since the feeds do exist and their url follows a known scheme, pipes can access them anyway.

    Let’s say I want to see all new videos added to the T90Official Age of Empires 2 channel. Load that page:

    /blog/upload/youtube_t90.jpg

    All we need is the url of the channel, https://www.youtube.com/channel/UCZUT79WUUpZlZ-XMF7l4CFg. Some channels also have a nicer to look at user url, like https://www.youtube.com/user/GamersNexus. That would also work.

    Create a new pipe, drag a feed block from the left toolbar into the editor area and connect it to the pipe output to the right:

    /blog/upload/youtube_feedblock.png

    Now the created feed contains items with links to all new uploaded videos:

    /blog/upload/youtube_t90_feed.png

    The second YouTube element that can be accessed are playlists. For example the game rating discussions by GamersGlobal:

    /blog/upload/youtube_gg_playlist.png

    Set the url https://www.youtube.com/playlist?list=PLK9Cx_P99XLIBMkdW48UMA8OCnwBXUEVZ as target of a feed block. And the feed will now contain the videos in the playlist:

    /blog/upload/youtube_playlist_feed.png

    Both type of feeds get updated as one would expect when new videos get added.

    Having those feeds is useful to filter out videos you don’t like, combine multiple channels into one or just to be notified about new videos of your favorite channel not on the YouTube website or app, but in your feed reader.