Pipes Techblog

Why Pipes has new Timeouts and a new Pricing Plan, Incident Report
Summary: Among other optimizations, Pipes had to introduce a limit on how long each free pipe can run to keep the server from falling over. A new, cheaper paid plan is meant as an option for those who need the longer execution time back and are not covered by an enterprise plan. The following describes in detail what happened.

The incident

On December 23, Pipes went offline. During the last year this happened just a few times. The server is meant to autorestart Pipes’ ruby service via supervisor if something goes wrong, but there are several situations where this can fail, e.g. when the Out-Of-Memory (OOM) protection kills the ruby process of the webserver we use, puma. That OOM situation did happen a bit too often a while ago, but earlier in the year a server admin colleague of mine gave expert advice on how to configure the server for our workload, and since then Pipes had been stable. So when I got notified this time I restarted the service and chalked it up to a load spike and bad luck. Pipes was up and running.

But then it broke again. And again. Restarting the service, even restarting the server, only helped for a short while. Something was wrong and I had to get serious about this, otherwise Pipes could have been unstable over the holidays. And no one wants to work on this while preparing a Christmas dinner ;) But the site not being available for so long was not an option.

Investigation

So this is what I tried, escalating step by step:
1. I investigated why the ruby process was dying. The log showed that at first it was indeed the OOM killer just doing its work, but why was it triggered now?
2. To get an answer, I ran the local development environment with a production-size database. This revealed that the hourly SQLite VACUUM after a cache clean was very heavy with such a slightly bigger database and, together with the base load, enough to be a problem for the server. Maybe that was the root cause: our database got bigger over time and now hit a critical level? Indeed, after removing the VACUUM instruction Pipes came up nicely and worked for a while. But no luck, after a while the site got stuck again.
3. But this time, no OOM action was in the logs. Related to VACUUM or not, I could finally observe the state the ruby process had been in when it was killed earlier. First observation: 100% processor load on one core. The only long-running actions Pipes has are downloads (which should not cause high processor load) and running a pipe itself. That’s the moment when I had to try a strict limit on how long a pipe can run.
4. While Pipes was working now, very soon after the restart, with the time limit applied, the CPU load spiked again to 100% and stayed there. Nice change: This time Pipes as a whole continued to work somewhat. This improvement had to be caused by the timeout limiting the total load, but it was not as effective as expected. Via rbspy (big recommendation!) a flamegraph gathered from the running process revealed that a whole pipe could still run a lot longer than the target timeout, as long as a single block did not finish. Additional timeout checks during block execution followed; before that, the check was only applied at block transitions.
5. With those changes Pipes stayed stable overnight, though some bigger pipes will now give no or fewer results.
This is the gathered flamegraph minus some details (it’s an svg, click on it to get the full sized version):

Letting a single pipe run that long was a mistake of the original design. Pipes is threaded via the puma webserver, but not highly concurrent via multiple processes. That means it took just a few of those long-running pipes to overload the ruby program. Also, the nginx webserver would terminate the connection anyway after some seconds, so continuing with the pipe would not help in avoiding a 504 error (though after one successful run the result would be cached for a bit). Implementing a timeout on the ruby side was the correct thing to do, even though it means that some pipes that worked before will not work anymore.
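To illustrate the difference between checking the time limit only at block transitions and also checking it while a block is working, here is a minimal sketch. It is not the code Pipes uses: the names Deadline, PipeTimeout and filter_items, as well as the 5 second budget in the usage note, are assumptions made up for this example.

```ruby
# Hypothetical sketch, not the actual Pipes implementation.
# One Deadline object is shared by all blocks of a single pipe run.
class PipeTimeout < StandardError; end

class Deadline
    # seconds: time budget for the whole pipe run
    # clock: injectable so the behaviour can be tested without waiting
    def initialize(seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
        @clock = clock
        @expires_at = @clock.call + seconds
    end

    def expired?
        @clock.call >= @expires_at
    end

    # Raise as soon as the time budget is used up
    def check!
        raise PipeTimeout, 'pipe exceeded its time limit' if expired?
    end
end

# Stand-in for the work of one block: keep items containing the keyword
def filter_items(items, keyword, deadline)
    deadline.check!             # check at the block transition
    items.select do |item|
        deadline.check!         # additional check while the block is working
        item.include?(keyword)
    end
end
```

A pipe run would create one object like Deadline.new(5) and hand it to every block. The checks are cooperative: a block that never calls check! inside its own loop can still overrun the limit, which matches what the flamegraph showed.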

Outlook

In the next year, I want to improve this situation.

First measure, and that offer starts right now, is a new cheaper supporter plan that provides more pipes than the free plan and raises the pipe time limit to the original level.
Pipes has to manage the server resources better, and limiting the pipes made under a free plan a bit more strictly than the fewer paid pipes will help with that. At the same time, Pipes is supposed to be a useful web infrastructure for everyone, so the limitations should not be too severe. The paid plans we had so far were targeted at professionals using Pipes for commercial projects, which certainly does not cover everyone interested in Pipes. The new plan is likely more attractive for those not in that category.

The next step will be to work on a new architecture of the software, with the goal of better distributing the workload caused by pipes across multiple processor cores and also separating the website from that pipe backend. That way, the ruby program could more easily keep longer-running pipes around and better keep load spikes under control. There is however sadly no guarantee that a new architecture will come together and actually be helpful, since operations like that can always fail. But for now I am confident that it can succeed in providing an improvement.

Finally, moving to a stronger server would help all users, as each pipe would finish faster. That would mean higher project costs though. I hope additional income from paid plans can balance that. It would be awesome if you could support me there, be it via the new supporter plan or one of the other two!
onli, 26.12.2020
Pipes got a new block for Reddit
Summary: Since Reddit did not work well with our RSS feed block, despite its slow request limit and a new improved detection of 429 responses, Pipes now got a dedicated block to access subreddit feeds. That new block uses an rss-bridge instance to solve the problem.

As far as social media sites go Reddit is one of the accessible ones. Not only is the activity on the site visible without an account, Reddit even provides RSS feeds for the submissions to their subreddits. That is very useful especially for sites like Pipes that work well with RSS items.

But until recently Pipes and Reddit did not work well together. When importing a Reddit feed into a pipe, it sometimes worked and sometimes did not. It turned out that the Reddit servers again and again would refuse to serve content when Pipes asked.

That was surprising because at that time our downloader looked like this (slightly simplified):
```
require 'open-uri'
require 'lru_redux'

# Download and cache downloads. Limit requests to the same domain to not spam it

class Downloader

    def initialize()
        begin
            @@limiter
        rescue
            @@limiter = LruRedux::TTL::ThreadSafeCache.new(1000, 2)
        end
    end

    def get(url, js = false)
        url = URI.parse(URI.escape(url))
        result, date = Database.instance.getCache(key: 'url_' + url.to_s + '_' + js.to_s)
        
        if date.nil? || (date + 600) < Time.now.to_i
            while (@@limiter.key?(url.host))
                sleep(1)
            end
            @@limiter[url.host] = 1

            result = URI.open(url, :allow_redirections => :all).read

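            # Use the same key as the getCache lookup above, otherwise the cached value is never found
            Database.instance.cache(key: 'url_' + url.to_s + '_' + js.to_s, value: result)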
        end
        return result        
    end
    
end
```
That code did some effort to avoid spamming other sites:
1. It caches every downloaded URL for 10 minutes, checking the date a URL was cached with if date.nil? || (date + 600) < Time.now.to_i.
2. Pipes will only download data when requested externally, by a feed reader for example, which might be a lot less often than that 10 minute interval.
3. It will remember the last 1000 hosts and wait 2 seconds before sending a new request to a host it just saw. That works via the combination of the LruRedux cache with a time-to-live of 2 seconds and the while (@@limiter.key?(url.host)) loop.
But Reddit still sent a 429 response as soon as it saw Pipes requesting an RSS feed.

429 responses are a way for a server to tell a client to slow down. They can also contain instructions on how long to wait. Reddit for example did set the retry-after header and filled it with a 7, telling us to wait 7 seconds before trying again. So we did just that in the new version of the downloader:
```
response = HTTParty.get(url)
if response.code == 429
    if response.headers['retry-after'].to_i < 20
        sleep response.headers['retry-after'].to_i
        response = HTTParty.get(url)
        result = response.body
    else
        result = ""
    end
else
    result = response.body
end
```
Whenever a server, be it Reddit or a different site, responds with a 429 status code and sets a reasonable retry-after limit, this code will wait before trying one more time. If that second request also fails it will give up.
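One detail the snippet above does not cover: HTTP allows the retry-after header to carry either a number of seconds or an HTTP date. The following sketch handles both forms with Ruby's standard library only. The helper names retry_after_seconds and wait_before_retry are made up for this example and are not part of the Pipes downloader; the max_wait default mirrors the 20 second limit used above.

```ruby
require 'time'

# Hypothetical helper: returns the number of seconds to wait,
# or nil if the header is missing or cannot be parsed.
def retry_after_seconds(value, now: Time.now)
    return nil if value.nil? || value.strip.empty?
    # Form 1: plain number of seconds, e.g. "7"
    return value.to_i if value.strip.match?(/\A\d+\z/)
    begin
        # Form 2: HTTP date, e.g. "Tue, 21 Jul 2020 12:00:10 GMT"
        delay = (Time.httpdate(value.strip) - now).ceil
        delay.negative? ? 0 : delay
    rescue ArgumentError
        nil
    end
end

# Returns the seconds to sleep before a single retry,
# or nil when the wait is unknown or too long to be worth it.
def wait_before_retry(value, max_wait: 20, now: Time.now)
    seconds = retry_after_seconds(value, now: now)
    return nil if seconds.nil? || seconds >= max_wait
    seconds
end
```

Unlike the snippet above, this sketch does not retry when the header is missing. That is a design choice of the example: without a hint from the server, an immediate second request is likely to hit the same limit.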

There is a second change: HTTParty. That is an HTTP client available as a ruby gem and an alternative to the open-uri method used before. The hope was that it would improve compatibility.

I made some more changes to accommodate Reddit. Instead of waiting a minimum of two seconds between requests to the same host, the downloader would wait two or three seconds, via a two second sleep when the host was still in the queue. And at some point the downloader had code to specifically slow down when it saw that the requested URL led to Reddit.

But nothing helped, the Reddit server was still not happy.

That is where the new Reddit block comes in. It does two things to reliably get a subreddit’s RSS feed:
1. It uses the FOSS project RSS-Bridge to fetch and cache the requested feed
2. That software is running on a different server than Pipes itself
The way RSS-Bridge requests Reddit’s RSS feed seems to be slow and cached enough to not offend the server. And by using a different server with a different IP we make sure that additional requests Pipes might make do not add to the load RSS-Bridge causes, which reduces the danger that Reddit’s server places new limits on our RSS-Bridge instance.

The Reddit block is a new addition to the integration menu of the editor that got added last month. In my testing the block proved to work reliably so far, even when a single pipe contained multiple Reddit feeds, as in this example with five random subreddits:

The improvements made to the downloader will remain active, they will help to reduce the load on other sites and to react better to future 429 responses.
onli, 21.07.2020 update
More blocks for external sites
With the help of RSS Box Pipes now supports more external sites directly.

Not every site has an easily accessible RSS feed or an HTML structure that can be transformed into one. For those, or where it would be more comfortable to enter for example a channel name, having a special block is helpful to get data into Pipes. So far, there was a block for Twitter and the feed block detected hidden feeds for Youtube. Now there are additional blocks for these sites:
- Vimeo
- Speedrun.com
- Dailymotion
- Periscope
- Mixcloud
- SVT Play
They join the Twitter block in the new integration menu of the editor sidebar.
onli, 26.06.2020
The Architecture and Software behind Pipes
Now that Pipes got an open source release, I’d like to present this dive into the architecture of the software. I hope it will be helpful for anyone trying to build a similar system, and especially for developers wanting to customize Pipes CE.

Software overview

Pipes’ goal is to provide infrastructure for users to work with data from the internet. It provides a visual interface where data flows from block to block and each block does one operation. Internally, each data item is an item in an RSS feed that gets sent from block object to block object. External data input is often an RSS feed too, though it is possible to fetch data from webpages and to send XML or JSON data to webhook blocks. A typical example of what users do with Pipes is combining multiple feeds into one, filtering them, or creating feeds for sites that do not provide them. It definitely stands in the tradition of Yahoo! Pipes, hence the name.

Pipes is a Ruby/Sinatra application, uses SQLite as database engine and Raphaël for the main part of the frontend, the editor. Ruby works well here, since it is a very expressive language with a healthy ecosystem around XML (oga, nokogiri) and some useful modules for RSS/ATOM normalization (feedparser, feedjira), and creation (included). Sinatra provides the webserver parts of the system and the HTML views, this part is rather small. SQLite is the data store for everything, including the pipes serialized to JSON and cache data. And finally Raphaël abstracts creating the SVG part of the javascript powered editor, painting the blocks and manipulating them.

Portier is used as the auth system. It enables logins without having to store passwords. Users log in with their email address and only have to confirm that they control that address, by either clicking on a link sent to them or, for Gmail, logging in with their Google account. That work is done by an external broker, Pipes just has to check the token sent back by that system.

Backend

I will start with a description of the backend, because defining the data structure is also how Pipes started.

It began with the realization that a pipe can be thought of as a tree. The pipe’s output object is the root of the tree, the blocks that connect to its input are its children, and so on, until you reach the blocks that have no inputs, the leaves. Those will usually be Feed or Download blocks, though that was later expanded to also allow other pipes (each represented by one block), the Twitter block or the Webhook block. Running a pipe means asking the root object of the tree for output, which will ask its children for input, and so on through the whole tree until blocks that provide input are encountered. This data then flows back through the tree up to the Output block, with each block doing its own operations on the transported data.

So when you look at this pipe, where two Feed blocks fetch some data, one is filtered, then both are combined into one feed and that combined feed is set as the pipe’s output:

You are actually looking at this execution tree:

The code that enables this structure is minimal. There is a Pipe object, defined in pipe.rb. There is also an abstract Block object, defined in block.rb, the parent class for all blocks. Those two classes together define the pipe execution flow. An instantiated subclass of Block has an array of inputs, containing other blocks. When the system runs a stored pipe, this happens:
1. A Pipe object is created
2. That object fetches the JSON structure as stored in the database and sets the output block as its root
3. Then it creates the block objects that are the inputs of the current block, by recursively calling createInputs on its children and their children.
4. To finally get the output, it calls Pipe.run(), which calls Block.run(), which will go through the tree as described above.
5. The actual data manipulation work is done in a process function each individual block has to define on its own. Look for example at the process function of the FilterBlock.
To define pipe execution like that has a hidden advantage: When you want to see the output of a pipe at a specific node of the tree, at one specific block, all you have to do is set that block as the root element of a new Pipe.
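The execution flow described above can be sketched in a few lines. This is a simplified stand-in, not the code from pipe.rb and block.rb: items are plain strings instead of feed items, the blocks are built by hand instead of from stored JSON, and the constructor signatures as well as the CombineBlock name are assumptions for this example.

```ruby
# Simplified stand-ins for the classes described above, not the Pipes CE code
class Block
    attr_reader :inputs

    def initialize(inputs = [], **options)
        @inputs = inputs        # the child blocks that feed into this one
        @options = options
    end

    # Ask all children for their output first, then do this block's own work
    def run
        process(@inputs.map(&:run))
    end

    # Every subclass has to define its own data manipulation
    def process(_input_results)
        raise NotImplementedError
    end
end

class FeedBlock < Block         # a leaf: has no inputs, provides data
    def process(_input_results)
        @options[:items]
    end
end

class FilterBlock < Block
    def process(input_results)
        input_results.flatten.select { |item| item.include?(@options[:keyword]) }
    end
end

class CombineBlock < Block
    def process(input_results)
        input_results.flatten
    end
end

class Pipe
    def initialize(root)
        @root = root
    end

    def run
        @root.run
    end
end

# Two feeds, one of them filtered, then both combined
feed_a = FeedBlock.new([], items: ['ruby 3 released', 'weather report'])
feed_b = FeedBlock.new([], items: ['sqlite tips'])
filtered = FilterBlock.new([feed_a], keyword: 'ruby')
combined = CombineBlock.new([filtered, feed_b])
```

Running Pipe.new(combined).run walks the whole tree, while Pipe.new(filtered).run returns only what the filter block produces. The second call is the trick mentioned above: any block can serve as the root of a new Pipe to inspect its output.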

This describes almost all classes and files of the backend. Additionally, there is a server.rb, the entrypoint for the Sinatra application and where all routes as well as some helper functions are defined. twitterclient.rb is a small wrapper around the twitter gem and used by the twitter block. downloader.rb is central for most pipes, as it is used by the Feeds and the Download block to fetch data, and the central cache is set here that prevents requesting data from other servers too often. The Gemfile describes the requirements, the config.ru makes the application runnable by the usual ruby commands and also initializes the Moneta session store, that prevents users from being logged out when the server process restarts. HTML templates are stored under views/, ERB is used for them. The individual blocks are stored under blocks/. Images, Javascript, CSS etc are in public/, accessible to users.

As you probably noted, this is all bundled into one application. The way a pipe works, it would be easy enough to divide the Sinatra part of the application and the Pipes part, into what could also be called server frontend and worker backend. That would provide a way to scale horizontally, over multiple processes and even multiple servers. So far this just has not been necessary, upgrading the server has been enough to keep pipes.digital running. But keep that option in mind if you run into performance limitations in deployments with a big amount of individual pipes.

Frontend

The website uses server-side rendering, as provided by Sinatra. But that adequately describes only the starting point and the less important parts of the frontend. The heart is the editor, which gets initialized by an HTML template (views/editor.erb) but then fetches and sends data via AJAX, more akin to a Single Page Application. It is a big SVG canvas powered by Raphaël, with absolutely positioned HTML input elements mixed into it, to make the blocks configurable. Its code lives in public/pipes-ui.js.

The main objects here are blocks and connectors. Connectors can be Inputs or Outputs, and there exist TextInputs and TextOutputs for the text fields that double as parameters when calling a pipe. Let’s look at the Filter block as an example.

In simplified form it is defined like this:
```
function FilterBlock(x, y) {
    Block.call(this, 1, 1, x, y, 'Filter', 200, 150);
    
    this.userinputs[0].style.width = '150px';
    this.userinputs[0].required = true;
    this.userinputs[0].placeholder = 'keyword';
    
    var textinput = new TextInput(x + 29, y + 37, this);
    this.textinputs.push(textinput);
    
    var field = document.createElement('select');
    var all = document.createElement('option');
    all.value = 'all';
    all.text = 'all';
    var content = document.createElement('option');
    content.value = 'content';
    content.text = 'item.content';
    
    field.appendChild(all);
    field.appendChild(content);
    
    field.style.position = 'absolute';
    field.setAttribute('data-xoffset', 30);
    field.setAttribute('data-yoffset', 75);
    field.style.left = x + 30 + 'px';
    field.style.top = y + 75 + 'px';
    field.style.width = '150px';
    field.name = 'field';
    document.querySelector('#program').appendChild(field);
    this.userinputs.push(field)
}
```
What happens here? First, Block.call creates a Block object. The constructor of the Block object (that’s just the function Block) creates the needed SVG elements. The two 1s define the number of Inputs and Outputs the block will have, x and y are the position on the canvas, 'Filter' the title as shown in the title bar of the block, and finally the width and height are set to 200 and 150.

The Block constructor also creates a userinputs array and adds one HTML input element to it, a text box. That array is where all the inputs of a block have to be stored. That way, other code like the drag’n drop handlers know which absolutely positioned inputs belong to which block. This first userinput element is then changed, a width is set, it is marked as required and it gets a placeholder.

Next, it also gets a TextInput, allowing connections to TextBlocks. That’s the green circle at the left of the text input shown above.

The code that follows shows how to add additional userinput elements, in this case an HTML select element with two options. Note how elements like this are added to the DOM (since they are absolutely positioned and regular HTML elements) and also to the userinputs array, for the link to the block.

Not shown here but working similarly to the userinputs and textinputs arrays is the deco array that does the same thing for Raphaël SVG elements, used for example to add text and lines to blocks.

Another main class, though one you probably won’t need to change, is Connector, the parent of Input and Output, the half-circles at the left and right of input blocks. They manage the block connections.

When a pipe is saved these steps happen:
1. The function save() is called.
2. It calls serialize()
3. serialize iterates through the blocks array and creates a JSON representation of their current state, including which blocks they connect to
4. This JSON array gets sent to the backend
While I’m happy with the resulting user interface, this approach to building it involves a lot of manual coding work and is not modern. It would be quite interesting to see an implementation of this interface based for example on Vue, as long as it involves no javascript build process. Not bloating the development process with such an abomination was a high priority goal when starting the project.

Database

SQLite is used to store all data in a file. The file database.rb creates all tables and the functions for accessing them.

Only two tables are important for understanding how Pipes works: users and pipes. Let’s look at users first:
```
CREATE TABLE IF NOT EXISTS users(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE
);
```
Every user gets its own id and a field for the email address. Note how that address is not the id, to make it easier to change it later.

The data for each pipe is stored in the second table:
```
CREATE TABLE IF NOT EXISTS pipes(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pipe TEXT,
    title TEXT,
    description TEXT,
    user INTEGER,
    preview TEXT,
    public BOOLEAN DEFAULT 0,
    publicdate INTEGER DEFAULT 0,
    date INTEGER DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY(user) REFERENCES users(id)
);
```
The purpose of most of these fields should be clear. Each pipe gets its own id, can have a title and a description and is linked to a user id. For shared pipes public is set to 1. Most important is the pipe column. It contains a JSON representation of a pipe, as created in the frontend when serializing the editor content. That JSON structure is later transformed into the tree, as described above.
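How a stored JSON string can be turned back into a tree is sketched below. The JSON layout is invented for this example (a flat list of blocks with ids, plus the id of the output block); the real format in the pipe column is not described here and may look quite different. Node and build_tree are equally hypothetical names.

```ruby
require 'json'

# Hypothetical serialization, NOT the real content of the pipe column:
# a flat list of blocks, each naming the ids of the blocks it takes input from.
stored = <<~JSON
    {
        "output": "4",
        "blocks": [
            {"id": "1", "type": "FeedBlock", "inputs": []},
            {"id": "2", "type": "FeedBlock", "inputs": []},
            {"id": "3", "type": "FilterBlock", "inputs": ["1"]},
            {"id": "4", "type": "CombineBlock", "inputs": ["3", "2"]}
        ]
    }
JSON

Node = Struct.new(:id, :type, :children)

# Recursively create the node for one block id and all of its inputs.
# Assumes the stored connections contain no cycles.
def build_tree(blocks_by_id, id)
    block = blocks_by_id.fetch(id)
    children = block['inputs'].map { |input_id| build_tree(blocks_by_id, input_id) }
    Node.new(id, block['type'], children)
end

data = JSON.parse(stored)
blocks_by_id = data['blocks'].to_h { |block| [block['id'], block] }
root = build_tree(blocks_by_id, data['output'])
```

Starting at the output block and following the input ids means that blocks not connected to the output never become part of the tree, so they are never executed.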

Outlook: Open development questions

The above describes Pipes as it is today. But what could future development look like?

Much of the development so far was user driven. Exchanges by email or in the issue tracker led to improvements, big and small, like being able to run javascript before crawling sites, creating CSS selectors for fetching data from sites by clicking on the target elements in an overlay, or keeping media enclosures. I’m sure more will follow, but that development is unpredictable to me.

One thought I return to quite often is the focus on RSS feeds. Right now, that is baked hard into the software, as the data exchange format between blocks. That plus normalization is a big part of what makes the current approach as powerful as it is. But I often wonder whether there could be a better approach, something that still allows manipulating RSS feeds and creating them as pipe output, but makes it more straightforward to work with raw text, XML, JSON or other structured data instead. That was also the starting point of the usecase focused blog series describing how Pipes can be used today. Finding a better concept could allow supporting more usecases (what would they look like?) and also remove the need for normalizing RSS/ATOM/JSON-Feeds before manipulating them, which could become an optional step of the feed block. One advantage there would be avoiding the issue of sometimes removing parts of the original feed not covered by that normalization.

Splitting the backend into a server part and a multiprocess program running the pipes would be an optimization useful for big deployments. It could also be nice when deploying to multicore servers with weak cores, like those ARM servers Scaleway used to offer or a Raspberry Pi at home. Resque could work well for that.

Porting the user interface to a more modern javascript library could be worthwhile if it would result in new capabilities for the user, maybe even supporting new usecases that way.

Questions, suggestions? Join the Pipes Gitter channel or open a GitHub issue.
onli, 23.04.2020
Pipes CE: A free version of Pipes

If you ever wanted to run Pipes locally or look at the source code, now you can. Pipes CE is an AGPL licensed version of Pipes. Right now it is almost equal to what is run on pipes.digital, minus stuff like billing code and with a slightly different color scheme.

That’s a big step! It moves Pipes away from being a closed project to it being free software, software that can survive even without the involvement of this site. If you are a developer, have a look at the GitHub repository, and if you want to see some new code merged into the core you now have the option to send in a pull request.

onli, 17.04.2020
