Pipes Feed Preview: Pipes Techblog

  1. Why Pipes has new Timeouts and a new Pricing Plan, Incident Report

    Sat, 26 Dec 2020 10:44:31 -0000

    <p><em>Summary: Among other optimizations, Pipes had to introduce a limit on how long each free pipe can run to keep the server from falling over. A new, cheaper paid plan is meant as an option for those who need the longer execution time back and are not covered by an enterprise plan. The following will describe in detail what happened.</em></p> <h3>The incident</h3> <p>On December 23, Pipes went offline. During the last year this happened just a few times. The server is meant to autorestart Pipes’ ruby service via supervisor if something goes wrong, but there are several situations where this can fail, e.g. when the Out-Of-Memory (OOM) protection kills the ruby process of the webserver we use, puma. That OOM situation did happen a bit too often a while ago, but earlier in the year a server admin colleague of mine gave expert advice on how to configure the server for our workload, and since then Pipes had been stable. So when I got notified this time I restarted the service and chalked it up to a load spike and bad luck. Pipes was up and running.</p> <p><a href="/blog/upload/joshua-aragon-FkjaN-7gWC0-unsplash.jpg"><img src="/blog/upload/joshua-aragon-FkjaN-7gWC0-unsplash.jpg" alt="/blog/upload/joshua-aragon-FkjaN-7gWC0-unsplash.jpg" /></a></p> <p>But then it broke again. And again. Restarting the service, and even restarting the server, only helped for a short while. Something was wrong and I had to get serious about this, otherwise Pipes could have been unstable over the holidays. And no one wants to work on this while preparing a Christmas dinner ;) But the site not being available for so long was not an option.</p> <h3>Investigation</h3> <p>This is what I tried, escalating step by step:</p> <ol> <li>I investigated why the ruby process was dying. The log showed that, at first, it was indeed the OOM killer just doing its work, but why was it triggered now?</li> <li>To get an answer, I ran the local development environment with a production-size database. 
This revealed that the hourly SQLite VACUUM after a cache clean was very heavy with such a slightly bigger database; together with the base load, it was enough to be a problem for the server. Maybe that was the root cause: had our database grown over time and now hit a critical size? Indeed, after removing the VACUUM instruction Pipes came up nicely and worked for a while. But no luck: after a while the site got stuck again.</li> <li>But this time, no OOM action was in the logs. Whether related to VACUUM or not, I could finally observe the state the ruby process had been in when it was killed earlier. First observation: 100% processor load on one core. The only long-running actions Pipes has are downloads (which should not cause high processor load) and running a pipe itself. That’s the moment when I had to try <strong>a strict limit on how long a pipe can run</strong>.</li> <li>While Pipes was working now, very soon after the restart - with the time limit applied - the CPU load spiked again to 100% and stayed there. Nice change: This time Pipes as a whole continued to work somewhat. This improvement had to be caused by the timeout limiting the total load, but it was not as effective as expected. Via <a href="https://rbspy.github.io/">rbspy</a> (big recommendation!) a flamegraph gathered from the running process revealed that a whole pipe could still run a lot longer than the target timeout, as long as a single block did not finish. 
Additional timeout checks during block execution followed; before that, the check was only applied at block transitions.</li> <li>With those changes Pipes stayed stable overnight, though some bigger pipes will now return fewer results or none at all.</li> </ol> <p>This is the gathered flamegraph minus some details (it’s an SVG; click on it to get the full-sized version):</p> <p><a href="/blog/upload/rbspy-2020-12-23-rkl5ZWksYv.flamegraph.svg"><img src="/blog/upload/rbspy-2020-12-23-rkl5ZWksYv.flamegraph.svg" alt="/blog/upload/rbspy-2020-12-23-rkl5ZWksYv.flamegraph.svg" width="400" /></a></p> <p>Letting a single pipe run that long was a mistake of the original design. Pipes is threaded via the puma webserver, but not highly concurrent via multiple processes. That means it took just a few of those long-running pipes to overload the ruby program. Also, the nginx webserver would terminate the connection anyway after some seconds, so continuing with the pipe would not help to avoid a 504 error (though after one successful run the result would be cached for a bit). Implementing a timeout on the ruby side was the correct thing to do, even though it means that some pipes that worked before will not work anymore.</p> <h3>Outlook</h3> <p>In the next year, I want to improve this situation.</p> <p>The first measure, and that offer starts right now, is a new <a href="https://www.pipes.digital/pricing">cheaper supporter plan</a> that provides more pipes than the free plan and raises the pipe time limit to the original level.<br /> Pipes has to manage the server resources better, and limiting the pipes made under the free plan a bit more strictly than the less numerous paid pipes will help with that. At the same time, Pipes is supposed to be a useful web infrastructure for everyone, so the limitations should not be too severe. The paid plans we had so far were targeted at professionals using Pipes for commercial projects, which certainly does not cover everyone interested in Pipes. 
The new plan is likely more attractive for those not in that category.</p> <p>The next step will be to work on a new architecture of the software, with the goal of better distributing the workload caused by pipes across multiple processor cores and also separating the website from that pipe backend. That way, the ruby program could more easily keep longer-running pipes around and better keep load spikes under control. Sadly, there is no guarantee that a new architecture will come together and actually help; operations like that can always fail. But for now I am confident that it can succeed in providing an improvement.</p> <p>Finally, moving to a stronger server will help all users, since each pipe will finish faster. That would mean higher project costs though. I hope additional income from paid plans can balance that. It would be awesome if you could support me there, be it via the new supporter plan or one of the other two!</p>
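    <p>The two kinds of timeout check described in the investigation (at block transitions, and again while a block is working) could be sketched roughly like this. This is a minimal illustration, not the actual Pipes code: <code>Deadline</code>, <code>PipeTimeout</code>, <code>TimedSource</code> and <code>TimedBlock</code> are hypothetical names, and the injectable clock only exists to make the sketch easy to test.</p>

```ruby
# Hypothetical sketch of a cooperative run-time limit for a pipe.
# None of these class names come from the Pipes code base.
class PipeTimeout < StandardError; end

class Deadline
  # The clock is injectable so the behaviour can be tested without waiting.
  def initialize(seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @expires_at = @clock.call + seconds
  end

  # Raise as soon as the allowed run time is used up.
  def check!
    raise PipeTimeout, 'pipe exceeded its time limit' if @clock.call > @expires_at
  end
end

# A leaf block that only provides items.
class TimedSource
  def initialize(items)
    @items = items
  end

  def run(deadline)
    deadline.check!
    @items
  end
end

# A block that transforms every item delivered by its inputs.
class TimedBlock
  def initialize(inputs, &work)
    @inputs = inputs
    @work = work
  end

  def run(deadline)
    deadline.check! # check at the block transition
    items = @inputs.flat_map { |input| input.run(deadline) }
    items.map do |item|
      deadline.check! # check again while the block is working
      @work.call(item)
    end
  end
end
```

    <p>Because the check is cooperative, a block that spends a long time inside a single call can still overrun the limit; the sketch only shows why checking inside the loop catches more than checking between blocks.</p>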
  2. Pipes got a new block for Reddit

    Tue, 21 Jul 2020 14:37:19 -0000

    <p><em>Summary: Since Reddit did not work well with our RSS feed block, despite its slow request limit and a new improved detection of <code>429</code> responses, Pipes now got a dedicated block to access subreddit feeds. That new block uses an <a href="https://github.com/RSS-Bridge/rss-bridge">rss-bridge</a> instance to solve the problem.</em></p> <p>As far as social media sites go, Reddit is one of the accessible ones. Not only is the activity on the site visible without an account, Reddit even provides RSS feeds for the submissions to their subreddits. That is very useful especially for sites like Pipes that work well with RSS items.</p> <p>But until recently Pipes and Reddit did not work well together. When importing a Reddit feed into a pipe, it sometimes worked and sometimes did not. It turned out that the Reddit servers again and again would refuse to serve content when Pipes asked.</p> <p>That was surprising because at that time our downloader looked like this (slightly simplified):</p> <pre><code>require 'open-uri'
require 'lru_redux'

# Download and cache downloads. Limit requests to the same domain to not spam it
class Downloader
    def initialize()
        begin
            @@limiter
        rescue
            @@limiter = LruRedux::TTL::ThreadSafeCache.new(1000, 2)
        end
    end

    def get(url, js = false)
        url = URI.parse(URI.escape(url))
        result, date = Database.instance.getCache(key: 'url_' + url.to_s + '_' + js.to_s)
        if date.nil? || (date + 600) &lt; Time.now.to_i
            while (@@limiter.key?(url.host))
                sleep(1)
            end
            @@limiter[url.host] = 1
            result = URI.open(url, :allow_redirections =&gt; :all).read
            # store under the same key that getCache reads above
            Database.instance.cache(key: 'url_' + url.to_s + '_' + js.to_s, value: result)
        end
        return result
    end
end
</code></pre> <p>That code made some effort to avoid spamming other sites:</p> <ol> <li>It caches every downloaded URL for 10 minutes, checking the date a URL was cached with <code>if date.nil? || (date + 600) &lt; Time.now.to_i</code>.</li> <li>Pipes will only download data when requested externally, by a feed reader for example, which might be a lot less often than that 10 minute interval.</li> <li>It will remember the last 1000 hosts and wait 2 seconds before sending a new request to a host it just saw. That works via the combination of the LruRedux queue with a <em>time-to-live</em> of 2 seconds and the <code>while (@@limiter.key?(url.host))</code> loop.</li> </ol> <p>But Reddit still sent a <code>429</code> response as soon as it saw Pipes requesting an RSS feed.</p> <hr /> <p><code>429</code> responses are a way for a server to tell a client to slow down. They can also contain instructions on how long to wait. Reddit for example did set the <code>retry-after</code> header and filled it with a <code>7</code>, telling us to wait 7 seconds before trying again. So we did just that in the new version of the downloader:</p> <pre><code>response = HTTParty.get(url)
if response.code == 429
    if response.headers['retry-after'].to_i &lt; 20
        sleep response.headers['retry-after'].to_i
        response = HTTParty.get(url)
        result = response.body
    else
        result = ""
    end
else
    result = response.body
end
</code></pre> <p>Whenever a server – that might be Reddit or a different site – responds with a <code>429</code> status and sets a reasonable <code>retry-after</code> limit, this code will wait before trying it one more time. If that second request also fails it will give up.</p> <p>There is a second change: <code>HTTParty</code>. That is an HTTP client, available as a <a href="https://github.com/jnunemaker/httparty">ruby gem</a>, and an alternative to the <code>open-uri</code> method used before; the hope was that it would improve compatibility.</p> <p>I made some more changes to accommodate Reddit. Instead of waiting a minimum of two seconds between requests to the same host the downloader would wait two or three seconds, via a two second sleep when the host was still in the queue. 
And at some point the downloader had code to specifically slow down when it saw that the requested URL led to Reddit.</p> <p><strong>But nothing helped, the Reddit server was still not happy.</strong></p> <p>That is where the new Reddit block comes in. It does two things to reliably get a subreddit’s RSS feed:</p> <ol> <li>It uses the FOSS project <a href="https://github.com/RSS-Bridge/rss-bridge">RSS-Bridge</a> to fetch and cache the requested feed</li> <li>That software is running on a different server than Pipes itself</li> </ol> <p>The way RSS-Bridge requests Reddit’s RSS feed seems to be slow and cached enough not to offend the server. And by using a different server with a different IP we make sure that additional requests Pipes might make do not add to the load RSS-Bridge causes, which reduces the danger that Reddit’s server places new limits on our RSS-Bridge instance.</p> <p>The Reddit block is a new addition to the integration menu of the editor that <a href="https://pipes.digital/blog/9/More%20blocks%20for%20external%20sites">got added last month</a>. In my testing the block proved to work reliably so far, even when a single pipe contained multiple Reddit feeds, as in this example with five random subreddits:</p> <p><a href="/blog/upload/redditblog.webp"><img src="/blog/upload/redditblog.webp" alt="5 reddit blocks in a pipe" /></a></p> <p>The improvements made to the downloader will remain active; they will help to reduce the load on other sites and to react better to future <code>429</code> responses.</p>
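    <p>The <code>429</code> handling described above can also be written as a small function that is testable without network access. This is only a sketch under assumptions: <code>fetch_with_retry</code>, <code>SketchResponse</code>, <code>MAX_WAIT</code> and the injected <code>fetcher</code> and <code>sleeper</code> are hypothetical names, not part of Pipes or HTTParty. Unlike the snippet in the article, it returns an empty string when the second attempt also fails.</p>

```ruby
# Hypothetical stand-in for an HTTP response (status code, headers, body).
SketchResponse = Struct.new(:code, :headers, :body)

MAX_WAIT = 20 # seconds; mirrors the limit of 20 used in the article's snippet

# fetcher: callable that takes a URL and returns a SketchResponse.
# sleeper: callable that waits for the given number of seconds.
def fetch_with_retry(url, fetcher:, sleeper: ->(seconds) { sleep(seconds) })
  response = fetcher.call(url)
  return response.body unless response.code == 429

  # A missing or non-numeric header becomes 0, so the retry happens at once.
  wait = response.headers['retry-after'].to_i
  return '' if wait >= MAX_WAIT # asked to wait too long: give up

  sleeper.call(wait)
  second = fetcher.call(url)
  second.code == 200 ? second.body : '' # only one retry, then give up
end
```

    <p>In real use the <code>fetcher</code> would wrap whatever HTTP client is in place, and the <code>sleeper</code> default simply calls <code>sleep</code>.</p>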
  3. More blocks for external sites

    Fri, 26 Jun 2020 11:47:35 -0000

    <p>With the help of <a href="https://github.com/stefansundin/rssbox">RSS Box</a> Pipes now supports more external sites directly.</p> <p>Not every site has an easily accessible RSS feed or a HTML structure that can be transformed into one. For those, or where it would be more comfortable to enter for example a channel name, having a special block is helpful to get data into Pipes. So far, there was a block for Twitter and the feed block detected hidden feeds for Youtube. Now there are additional blocks for these sites:</p> <ul> <li><a href="https://vimeo.com/">Vimeo</a></li> <li><a href="https://www.speedrun.com/">Speedrun.com</a></li> <li><a href="https://www.dailymotion.com/">Dailymotion</a></li> <li><a href="https://www.periscope.tv/">Periscope</a></li> <li><a href="https://www.mixcloud.com/">Mixcloud</a></li> <li><a href="https://www.svtplay.se/">SVT Play</a></li> </ul> <p>They join the Twitter block in the new integration menu of the editor sidebar.</p>
  4. The Architecture and Software behind Pipes

    Thu, 23 Apr 2020 18:10:11 -0000

    <p>Now that Pipes got an <a href="https://pipes.digital/blog/7/Pipes%20CE:%20A%20free%20version%20of%20Pipes">open source release</a>, I’d like to present this dive into the architecture of the software. I hope it will be helpful for anyone trying to build a similar system, and especially for developers wanting to customize <a href="https://github.com/pipes-digital/pipes">Pipes CE</a>.</p> <h3>Software overview</h3> <p>Pipes goal is to provide infrastructure for users to work with data from the internet. It provides a visual interface where data flows from block to block and each block does one operation. Internally, each data item is an item in an RSS feed that gets sent from block object to block object, and also external data input is often an RSS feed, though it is possible to fetch data from webpages and to send XML or JSON data to webhook blocks. A typical example for what users do with Pipes is combining multiple feeds into one, filtering them, or creating feeds for sites that do not provide them. It definitely stands in the tradition of <a href="https://en.wikipedia.org/wiki/Yahoo!_Pipes">Yahoo! Pipes</a>, hence the name.</p> <p><a href="https://www.pipes.digital/pipe-example2.png"><img src="https://www.pipes.digital/pipe-example2.png" alt="Pipes example" /></a></p> <p>Pipes is a <a href="http://sinatrarb.com/">Ruby/Sinatra</a> application, uses <a href="https://www.sqlite.org/index.html">SQLite</a> as database engine and <a href="https://dmitrybaranovskiy.github.io/raphael/">Raphaël</a> for the main part of the frontend, the editor. 
Ruby works well here, since it is a very expressive language with a healthy ecosystem around XML (<a href="https://gitlab.com/yorickpeterse/oga">oga</a>, <a href="https://nokogiri.org/">nokogiri</a>) and some useful modules for RSS/ATOM normalization (<a href="https://github.com/feedparser/feedparser">feedparser</a>, <a href="https://github.com/feedjira/feedjira">feedjira</a>), and creation (<a href="https://ruby-doc.org/stdlib-2.4.0/libdoc/rss/rdoc/RSS.html">included</a>). Sinatra provides the webserver parts of the system and the HTML views; this part is rather small. SQLite is the data store for everything, including the pipes serialized to JSON and cache data. And finally Raphaël abstracts creating the SVG part of the javascript powered editor, painting the blocks and manipulating them.</p> <p><a href="https://portier.github.io/">Portier</a> is used as the auth system. It enables logins without having to store passwords. Users log in with their email address and only have to confirm that they control that address, by either clicking on a link sent to them or, for Gmail, logging in with their Google account. That work is done by an external broker, Pipes just has to check the token sent back by that system.</p> <h3>Backend</h3> <p>I will start with a description of the backend, because defining the data structure is also how Pipes started.</p> <p>It began with the realization that a pipe can be thought of as a tree. The pipe’s output object is the root of the tree, the blocks that connect to its input are its children, and so on, until you reach the blocks that have no inputs, the leaves. Those will usually be Feed or Download blocks, though that later expanded to also allow other pipes (each represented by one block), Twitter or the Webhook block. Running a pipe means asking the root object of the tree for output, which will ask its children for input, and so on through the whole tree until blocks that provide input are encountered. 
This data then flows back through the tree up to the Output block, with each block doing its own operations on the transported data.</p> <p>So when you look at this pipe, where two Feed blocks fetch some data, one is filtered, then both are combined into one feed and that combined feed set as pipes output:</p> <p><a href="/blog/architecture_example_editor.png"><img src="/blog/architecture_example_editor.png" alt="pipes example in editor" /></a></p> <p>You are actually looking at this execution tree:</p> <p><a href="/blog/architecture_example_pipe.svg"><img src="/blog/architecture_example_pipe.svg" alt="pipes example tree" /></a></p> <p>The code that enables this structure is minimal. There is a <code>Pipe</code> object, defined in <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/pipe.rb">pipe.rb</a>. There is also an abstract <code>Block</code> object, defined in <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/block.rb">block.rb</a>, the parent class for all blocks. Those two classes together define the pipe execution flow. An instantiated subclass of <code>Block</code> has an array of <code>inputs</code>, containing other blocks. 
When the system runs a stored pipe, this happens:</p> <ol> <li>A <code>Pipe</code> object is created</li> <li>That object fetches the JSON structure as stored in the database and sets the output block as its <code>root</code></li> <li>Then it creates the block objects that are the inputs of the current block, by recursively calling <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/pipe.rb#L68"><code>createInputs</code></a> on its children and their children.</li> <li>To finally get the output, it calls <code>Pipe.run()</code>, which calls <code>Block.run()</code>, which will go through the tree as described above.</li> <li>The actual data manipulation work is done in a <code>process</code> function each individual block has to define on its own. Look for example at <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/blocks/filterblock.rb#L6">the process function</a> of the <code>FilterBlock</code>.</li> </ol> <p>Defining pipe execution like that has a hidden advantage: When you want to see the output of a pipe at a specific node of the tree, at one specific block, all you have to do is set that block as the root element of a new <code>Pipe</code>.</p> <p>This describes almost all classes and files of the backend. Additionally, there is a <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/server.rb">server.rb</a>, the entrypoint for the Sinatra application and where all routes as well as some helper functions are defined. <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/twitterclient.rb">twitterclient.rb</a> is a small wrapper around the <a href="https://github.com/sferik/twitter">twitter gem</a> and used by the twitter block. 
<a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/downloader.rb">downloader.rb</a> is central for most pipes, as it is used by the Feeds and the Download block to fetch data, and the central cache is set here that prevents requesting data from other servers too often. The <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/Gemfile">Gemfile</a> describes the requirements, the <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/config.ru">config.ru</a> makes the application runnable by the usual ruby commands and also initializes the <a href="https://github.com/moneta-rb/moneta">Moneta</a> session store, that prevents users from being logged out when the server process restarts. HTML templates are stored under <a href="https://github.com/pipes-digital/pipes/tree/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/views">views/</a>, ERB is used for them. The individual blocks are stored under <a href="https://github.com/pipes-digital/pipes/tree/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/blocks">blocks/</a>. Images, Javascript, CSS etc are in <a href="https://github.com/pipes-digital/pipes/tree/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/public">public/</a>, accessible to users.</p> <p>As you probably noted, this is all bundled into one application. The way a pipe works, it would be easy enough to divide the Sinatra part of the application and the Pipes part, into what could also be called server frontend and worker backend. That would provide a way to scale horizontally, over multiple processes and even multiple servers. So far this just has not been necessary, upgrading the server has been enough to keep pipes.digital running. But keep that option in mind if you run into performance limitations in deployments with a big amount of individual pipes.</p> <h3>Frontend</h3> <p>The website uses server side rendering, as provided by Sinatra. 
But that describes only the starting point and the less important parts of the frontend adequately. The heart is the editor that gets initialized by an HTML template (<a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/views/editor.erb">views/editor.erb</a>), but then fetches and sends data via AJAX, more akin to a Single Page Application. It is a big SVG canvas powered by Raphaël, with absolutely positioned HTML input elements mixed into it, to make the blocks configurable. Its code lives in <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/public/pipes-ui.js">public/pipes-ui.js</a>.</p> <p>The main objects here are blocks and connectors. Connectors can be Inputs or Outputs, and there exist TextInputs and TextOutputs for the text fields that double as parameters when calling a pipe. Let’s look at the Filter block as an example.</p> <p><img src="https://www.pipes.digital/imgs/filter_block_tiny.png" alt="Filter Block" /></p> <p>In simplified form it is defined like this:</p> <pre><code>function FilterBlock(x, y) {
    Block.call(this, 1, 1, x, y, 'Filter', 200, 150);
    this.userinputs[0].style.width = '150px';
    this.userinputs[0].required = true;
    this.userinputs[0].placeholder = 'keyword';

    var textinput = new TextInput(x + 29, y + 37, this);
    this.textinputs.push(textinput);

    var field = document.createElement('select');
    var all = document.createElement('option');
    all.value = 'all';
    all.text = 'all';
    var content = document.createElement('option');
    content.value = 'content';
    content.text = 'item.content';
    field.appendChild(all);
    field.appendChild(content);
    field.style.position = 'absolute';
    field.setAttribute('data-xoffset', 30);
    field.setAttribute('data-yoffset', 75);
    field.style.left = x + 30 + 'px';
    field.style.top = y + 75 + 'px';
    field.style.width = '150px';
    field.name = 'field';
    document.querySelector('#program').appendChild(field);
    this.userinputs.push(field)
}
</code></pre> <p>What 
happens here? First, Block.call creates a Block object. The constructor of the Block object (that’s just the function <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/public/pipes-ui.js#L193"><code>Block</code></a>) creates the needed SVG elements. The two <code>1</code>s define the number of Inputs and Outputs the block will have, <code>x</code> and <code>y</code> are the position on the canvas, <code>'Filter'</code> the title as shown in the title bar of the block, and finally the width and height are set to <code>200</code> and <code>150</code>.</p> <p>The Block constructor also creates a userinputs array and adds one HTML input element to it, a text box. That array is where all the inputs of a block have to be stored. That way, other code like the drag’n drop handlers know which absolutely positioned inputs belong to which block. This first userinput element is then changed, a width is set, it is marked as required and it gets a placeholder.</p> <p>Next, it also gets a TextInput, allowing connections to TextBlocks. That’s the green circle at the left of the text input shown above.</p> <p>The code that follows shows how to add additional userinput elements, in this case a HTML select element with two options. 
Note how elements like this are added to the DOM (since they are absolutely positioned and regular HTML elements) and also to the userinputs array, for the link to the block.</p> <p>Not shown here but working similar to the userinputs and textinputs arrays is the <code>deco</code> array that does the same thing for Raphaël SVG elements, used for example to add text and lines to blocks.</p> <p>Other main classes, but something you probably won’t need to change, is <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/public/pipes-ui.js#L1173"><code>Connector</code></a>, the parent of <code>Input</code> and <code>Output</code>, the half-circles at the left and right of input blocks. They manage the block connections.</p> <p>When a pipe is saved these steps happen:</p> <ol> <li>The function <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/public/pipes-ui.js#L1551"><code>save()</code></a> is called.</li> <li>It calls <code>serialize()</code></li> <li><code>serialize</code> iterates through the <code>blocks</code> array and creates a JSON representation of their current state, including which blocks they connect to</li> <li>This JSON array gets sent to the backend</li> </ol> <p>While I’m happy with the resulting user interface, this approach to building it involves a lot of manual coding work and is not modern. It would be quite interesting to see an implementation of this interface based for example on Vue, as long as it involves no javascript build process. Not bloating the development process with such an abomination was a high priority goal when starting the project.</p> <h3>Database</h3> <p>SQLite is used to store all data in a file. 
The file <a href="https://github.com/pipes-digital/pipes/blob/12b5e3681bb7f75788b46dc403ff79cdbcb4066e/database.rb"><code>database.rb</code></a> creates all tables and the functions for accessing them.</p> <p>Only two tables are important for understanding how Pipes works: <em>users</em> and <em>pipes</em>. Let’s look at users first:</p> <pre><code>CREATE TABLE IF NOT EXISTS users(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE
);
</code></pre> <p>Every user gets their own id and a field for the email address. Note how that address is not the id, to make it easier to change it later.</p> <p>The data for each pipe is stored in the second table:</p> <pre><code>CREATE TABLE IF NOT EXISTS pipes(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pipe TEXT,
    title TEXT,
    description TEXT,
    user INTEGER,
    preview TEXT,
    public BOOLEAN DEFAULT 0,
    publicdate INTEGER DEFAULT 0,
    date INTEGER DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY(user) REFERENCES users(id)
);
</code></pre> <p>The purpose of most of these fields should be clear. Each pipe gets its own id, can have a title and a description and is linked to a user id. For shared pipes <code>public</code> is set to 1. Most important is the <code>pipe</code> column. It contains a JSON representation of a pipe, as created in the frontend when serializing the editor content. That JSON structure is later transformed into the tree, as described above.</p> <h3>Outlook: Open development questions</h3> <p>The above describes Pipes as it is today. But what could future development look like?</p> <p>Much of the development so far was user driven. Exchanges by email or in <a href="https://github.com/pipes-digital/pipes/issues">the issue tracker</a> led to improvements, big and small, like being able to run javascript before crawling sites, creating CSS selectors for fetching data from sites by clicking on the target elements in an overlay, or keeping media enclosures. 
I’m sure more will follow, but that development is unpredictable to me.</p> <p>One thought I return to quite often is the focus on RSS feeds. Right now, that is baked hard into the software, as the data exchange format between blocks. That plus normalization is a big part of what makes the current approach as powerful as it is. But I often wonder whether there could be a better approach, something that still allows manipulating RSS feeds and creating them as pipe output, but making it more straightforward to work with raw text, XML, JSON or other structured data instead. That was also the starting point of the <a href="https://www.pipes.digital/blog/archive/howto">usecase focused blog series</a> describing how Pipes can be used today. Finding a better concept could allow supporting more usecases (how would they look like?) and also remove the need for normalizing RSS/ATOM/JSON-Feeds before manipulating them, which could become an optional step of the feed block. One advantage there would be avoiding the issue of sometimes removing parts of the original feed not covered by that normalization.</p> <p>Splitting the backend into a server part and a multiprocess program running the pipes would be an optimization useful for big deployments. It could also be nice when deploying to multicore servers with weak cores, like those ARM servers scaleway used to offer or a Raspberry Pi at home. <a href="https://github.com/resque/resque">Resque</a> could work well for that.</p> <p>Porting the user interface to a more modern javascript library could be worthwhile if it would result in new capabilities for the user, maybe even supporting new usecases that way.</p> <hr /> <p>Questions, suggestions? Join the <a href="https://gitter.im/pipes-digital/community">pipes gitter channel</a> or <a href="https://github.com/pipes-digital/pipes/issues">open a github issue</a>.</p>
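    <p>The execution tree described in the Backend section can be illustrated with a very small model. This is a sketch under assumptions, not the code from <code>pipe.rb</code> or <code>block.rb</code>: the class names <code>TreeBlock</code>, <code>FeedLeaf</code>, <code>KeywordFilter</code>, <code>Combine</code> and <code>SketchPipe</code> are hypothetical, and plain title strings stand in for RSS items.</p>

```ruby
# Hypothetical, heavily simplified model of the block tree.
class TreeBlock
  def initialize(inputs = [])
    @inputs = inputs
  end

  # Ask every child for its output first, then do this block's own work.
  def run
    process(@inputs.map(&:run))
  end

  # Each concrete block defines its own data manipulation.
  def process(_inputs)
    raise NotImplementedError
  end
end

# A leaf: it has no inputs and only provides data.
class FeedLeaf < TreeBlock
  def initialize(items)
    super([])
    @items = items
  end

  def process(_inputs)
    @items
  end
end

# Keeps only the items that contain the keyword.
class KeywordFilter < TreeBlock
  def initialize(input, keyword)
    super([input])
    @keyword = keyword
  end

  def process(inputs)
    inputs.first.select { |title| title.include?(@keyword) }
  end
end

# Merges the output of all inputs into one list.
class Combine < TreeBlock
  def process(inputs)
    inputs.flatten
  end
end

# Running a pipe means asking its root block for output.
class SketchPipe
  def initialize(root)
    @root = root
  end

  def run
    @root.run
  end
end

news = FeedLeaf.new(['Ruby 3 released', 'Weather report'])
blog = FeedLeaf.new(['New Pipes block'])
filtered = KeywordFilter.new(news, 'Ruby')
pipe = SketchPipe.new(Combine.new([filtered, blog]))
```

    <p>Previewing the output at one block then amounts to <code>SketchPipe.new(filtered).run</code>, which mirrors the hidden advantage mentioned above: any block can serve as the root of a new pipe.</p>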
  5. Pipes CE: A free version of Pipes

    Fri, 17 Apr 2020 09:45:32 -0000

    <p>If you ever wanted to run Pipes locally or look at the source code, now you can. <a href="https://github.com/pipes-digital/pipes">Pipes CE</a> is an AGPL licensed version of Pipes. Right now it is almost equal to what is run on pipes.digital, minus stuff like billing code and with a slightly different color scheme.</p> <p><a href="/blog/upload/pipesce.png"><img src="/blog/upload/pipesce.png" alt="/blog/upload/pipesce.png" /></a></p> <p>That’s a big step! It moves Pipes away from being a closed project to it being free software, software that can survive even without the involvement of this site. If you are a developer, have a look at <a href="https://github.com/pipes-digital/pipes">the GitHub repository</a>, and if you want to see some new code merged into the core you now have the option to send in a pull request.</p>
  6. How to get notified of new YouTube videos in your feed reader

    Wed, 26 Feb 2020 08:49:25 -0000

    <p><em>This is part 6 in a series explaining how pipes works and what you can do with it. The article before this showed how to <a href="https://www.pipes.digital/blog/5/How%20to%20convert%20tweets%20to%20RSS%20feeds">embed tweets into RSS feeds</a> created by pipes.</em></p> <p>Some sites have working RSS feeds that can be used to easily feed them into pipes, but they hide them. Regular discovery can’t work when the feed is not linked in the head of a page, which is exactly what happens sometimes. The biggest offender here is probably YouTube. But since the feeds do exist and their url follows a known scheme, pipes can access them anyway.</p> <p>Let’s say I want to see all new videos added to the <a href="https://www.youtube.com/channel/UCZUT79WUUpZlZ-XMF7l4CFg">T90Official Age of Empires 2 channel</a>. Load that page:</p> <p><a href="/blog/upload/youtube_t90.jpg"><img src="/blog/upload/youtube_t90.jpg" alt="/blog/upload/youtube_t90.jpg" /></a></p> <p>All we need is the url of the channel, <code>https://www.youtube.com/channel/UCZUT79WUUpZlZ-XMF7l4CFg</code>. Some channels also have a nicer to look at user url, like <a href="https://www.youtube.com/user/GamersNexus"><code>https://www.youtube.com/user/GamersNexus</code></a>. That would also work.</p> <p><a href="https://www.pipes.digital/editor">Create a new pipe</a>, drag a feed block from the left toolbar into the editor area and connect it to the pipe output to the right:</p> <p><a href="/blog/upload/youtube_feedblock.png"><img src="/blog/upload/youtube_feedblock.png" alt="/blog/upload/youtube_feedblock.png" /></a></p> <p>Now the created feed contains items with links to all new uploaded videos:</p> <p><a href="/blog/upload/youtube_t90_feed.png"><img src="/blog/upload/youtube_t90_feed.png" alt="/blog/upload/youtube_t90_feed.png" /></a></p> <p>The second YouTube element that can be accessed are playlists. 
For example, the <a href="https://www.youtube.com/playlist?list=PLK9Cx_P99XLIBMkdW48UMA8OCnwBXUEVZ">game rating discussions</a> by GamersGlobal:</p> <p><a href="/blog/upload/youtube_gg_playlist.png"><img src="/blog/upload/youtube_gg_playlist.png" alt="/blog/upload/youtube_gg_playlist.png" /></a></p> <p>Set the url <code>https://www.youtube.com/playlist?list=PLK9Cx_P99XLIBMkdW48UMA8OCnwBXUEVZ</code> as the target of a feed block, and the feed will contain the videos in the playlist:</p> <p><a href="/blog/upload/youtube_playlist_feed.png"><img src="/blog/upload/youtube_playlist_feed.png" alt="/blog/upload/youtube_playlist_feed.png" /></a></p> <p>Both types of feeds get updated as one would expect when new videos get added.</p> <p>Having those feeds is useful to filter out videos you don’t like, to combine multiple channels into one, or just to be notified about new videos from your favorite channel in your feed reader instead of on the YouTube website or app.</p>
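    <p>The "known scheme" mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name <code>youtube_feed_url</code> is made up for this example, and the assumption is that YouTube serves its feeds under <code>/feeds/videos.xml</code> with a <code>channel_id</code>, <code>user</code> or <code>playlist_id</code> parameter. How pipes does this internally is not shown here.</p>

```python
from urllib.parse import urlparse, parse_qs, urlencode

# Assumed base url of YouTube's feeds; not taken from the pipes source code.
FEED_BASE = "https://www.youtube.com/feeds/videos.xml"


def youtube_feed_url(page_url):
    """Map a YouTube channel, user or playlist page url to a feed url."""
    parsed = urlparse(page_url)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "channel":
        query = {"channel_id": parts[1]}
    elif len(parts) >= 2 and parts[0] == "user":
        query = {"user": parts[1]}
    elif parts[:1] == ["playlist"]:
        # The playlist id lives in the ?list=... query parameter.
        list_ids = parse_qs(parsed.query).get("list")
        if not list_ids:
            raise ValueError("playlist url without list parameter")
        query = {"playlist_id": list_ids[0]}
    else:
        raise ValueError("not a channel, user or playlist url: " + page_url)
    return FEED_BASE + "?" + urlencode(query)


channel_feed = youtube_feed_url(
    "https://www.youtube.com/channel/UCZUT79WUUpZlZ-XMF7l4CFg")
playlist_feed = youtube_feed_url(
    "https://www.youtube.com/playlist?list=PLK9Cx_P99XLIBMkdW48UMA8OCnwBXUEVZ")
```

    <p>With pipes you do not need any of this, the feed block accepts the page url directly. The sketch might still help if you want to subscribe to such a feed in a reader without going through a pipe.</p>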
  7. How to convert tweets to RSS feeds

    Fri, 21 Feb 2020 08:36:18 -0000

    <p><em>This is part 5 in a series explaining how pipes works and what you can do with it. Read <a href="https://www.pipes.digital/blog/1/How%20to%20filter%20RSS%20feeds">part 1</a> to learn how to filter feeds, <a href="https://www.pipes.digital/blog/2/How%20to%20combine%20RSS%20feeds">part 2</a> to see how to combine feeds, <a href="https://www.pipes.digital/blog/3/How%20to%20create%20RSS%20feeds%20for%20sites%20that%20have%20none">part 3</a> to create feeds for sites that have none and <a href="https://www.pipes.digital/blog/4/How%20to%20make%20a%20full%20text%20RSS%20feed%20out%20of%20a%20shortened%20feed">part 4</a> to convert a shortened feed to a full text RSS feed.</em></p> <p>Maybe you do not want to read tweets on Twitter, or you want to embed a bunch of them somewhere else. It might be useful to access them as an RSS feed, and in the process be able to sort and combine them into one bigger feed. That’s something pipes can help you with.</p> <p>As always, we will first <a href="https://www.pipes.digital/editor">create a new pipe</a>.</p> <p>Now in the editor, start by dragging a Tweets block from the left toolbar into the editor area. Pipes supports three modes for accessing tweets. Let’s start with the search: just enter a keyword. Note how the first tweets appear in the inspector:</p> <p><a href="/blog/upload/tweets_search.jpg"><img src="/blog/upload/tweets_search.jpg" alt="/blog/upload/tweets_search.jpg" /></a></p> <p>The second method is accessing a hashtag by prepending <code>#</code>. 
I also added a truncate block; that way we can limit how big the final feed will be:</p> <p><a href="/blog/upload/tweets_hashtag.jpg"><img src="/blog/upload/tweets_hashtag.jpg" alt="/blog/upload/tweets_hashtag.jpg" /></a></p> <p>And the third option is accessing the public tweets of a user account, by entering <code>@username</code>:</p> <p><a href="/blog/upload/tweets_username.jpg"><img src="/blog/upload/tweets_username.jpg" alt="/blog/upload/tweets_username.jpg" /></a></p> <p>You can combine all three. Place one tweets block for every type of tweet you want to access and connect them to a combine block. Remember that the combine block adds the feeds together block by block, so you almost always want to add a sort block behind it:</p> <p><a href="/blog/upload/tweets_combined.jpg"><img src="/blog/upload/tweets_combined.jpg" alt="/blog/upload/tweets_combined.jpg" /></a></p> <p>A further small improvement is possible: If, as in our example, the tweets blocks might output the same tweets, those duplicates can be filtered out by adding a unique block to the mix. Here I also changed the sort order to the more Twitter-like <em>descending</em>:</p> <p><a href="/blog/upload/tweets_combined_unique.jpg"><img src="/blog/upload/tweets_combined_unique.jpg" alt="/blog/upload/tweets_combined_unique.jpg" /></a></p> <p>And that’s it. Adding tweets to your pipe is as simple as adding a tweets block. The result is a working RSS feed (with a rather strange mix of tweets in this example) you could use anywhere you want:</p> <p><a href="/blog/upload/tweets_preview.jpg"><img src="/blog/upload/tweets_preview.jpg" alt="/blog/upload/tweets_preview.jpg" /></a></p> <p>You can see the pipe I created for this article <a href="https://www.pipes.digital/pipe/K9P84V9D">here</a>. You could fork it to access the tweets for hashtags, accounts and search keywords you would prefer.</p>
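    <p>To make the effect of the unique block more concrete, here is a rough Python sketch of the idea. It is not the code pipes runs: the items are plain dictionaries with hypothetical <code>id</code> and <code>published</code> keys, and the assumption is that two items count as duplicates when they share the same id.</p>

```python
def unique_items(items):
    """Keep the first occurrence of every item id, preserving the order."""
    seen = set()
    result = []
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            result.append(item)
    return result


# Hypothetical output of two tweets blocks; tweet "t1" shows up in both.
search = [{"id": "t1", "published": 3}, {"id": "t2", "published": 1}]
hashtag = [{"id": "t3", "published": 2}, {"id": "t1", "published": 3}]

combined = search + hashtag            # combine: block by block
deduplicated = unique_items(combined)  # unique: drop repeated ids
newest_first = sorted(deduplicated,
                      key=lambda item: item["published"],
                      reverse=True)    # sort: descending
```

    <p>In this sketch the duplicate <code>t1</code> survives only once, and the final list is ordered from newest to oldest.</p>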
  8. How to make a full text RSS feed out of a shortened feed

    Thu, 20 Feb 2020 07:58:54 -0000

    <p><em>This is part 4 in a series explaining how pipes works and what you can do with it. Read <a href="https://www.pipes.digital/blog/1/How%20to%20filter%20RSS%20feeds">part 1</a> to learn how to filter feeds, <a href="https://www.pipes.digital/blog/2/How%20to%20combine%20RSS%20feeds">part 2</a> to see how to combine feeds and <a href="https://www.pipes.digital/blog/3/How%20to%20create%20RSS%20feeds%20for%20sites%20that%20have%20none">part 3</a> to create feeds for sites that have none.</em></p> <p>Some sites prefer not to put the whole article into their RSS feed. For them, having an RSS feed is nice to inform users that new articles are available, but they want you to visit the site, to see the article in the full design, and often to have a chance to show ads. As a reader you might want to respect that. But if not, that’s okay as well. Pipes can often transform such a feed into a full text feed, and here I will show how that works.</p> <p>The prior article explaining <a href="https://www.pipes.digital/blog/3/How%20to%20create%20RSS%20feeds%20for%20sites%20that%20have%20none">how to create a feed for a site</a> will be useful for this, because that’s basically what we will do to get the full text feed content, but with the original feed as a starting point.</p> <p>As always, we will first <a href="https://www.pipes.digital/editor/">create a new pipe</a>.</p> <p>First drag a feed block into the editor area. Enter the feed url. In this example, we will also truncate it to get only the two newest feed items: if we crawl the site anyway, it is only fair to keep the load low. But that’s optional.</p> <p><a href="/blog/upload/fullfeed_truncated.jpg"><img src="/blog/upload/fullfeed_truncated.jpg" alt="/blog/upload/fullfeed_truncated.jpg" /></a></p> <p>From the shortened feed we need the links to the articles. This time I decided to take them from the <code>guid</code> element; the <code>link</code> element would have worked as well. 
Use an extract block to get the text stored in those elements:</p> <p><a href="/blog/upload/fullfeed_extract_link.jpg"><img src="/blog/upload/fullfeed_extract_link.jpg" alt="/blog/upload/fullfeed_extract_link.jpg" /></a></p> <p>Next comes downloading these articles. But since we have multiple links, we need a foreach block:</p> <p><a href="/blog/upload/fullfeed_foreach.jpg"><img src="/blog/upload/fullfeed_foreach.jpg" alt="/blog/upload/fullfeed_foreach.jpg" /></a></p> <p>Place it, then add a download block next to it, and then drag the download block onto the foreach block (the foreach block should turn yellow when you hover the download block over it). This means: For each url given to this block, download it:</p> <p><a href="/blog/upload/fullfeed_foreach_download.jpg"><img src="/blog/upload/fullfeed_foreach_download.jpg" alt="/blog/upload/fullfeed_foreach_download.jpg" /></a></p> <p>Since we need multiple elements from the downloaded articles, at this point duplicate the output of the foreach block with a duplicate block:</p> <p><a href="/blog/upload/fullfeed_foreach_download_duplicated.jpg"><img src="/blog/upload/fullfeed_foreach_download_duplicated.jpg" alt="/blog/upload/fullfeed_foreach_download_duplicated.jpg" /></a></p> <p>The next step is extracting the elements we need for the full text feed from the article pages. Again it’s the extract block that does the work, and which selectors to pick depends on the HTML structure of the site and will vary a lot. But this time, unlike when using a download block on its own, the downloaded site is wrapped inside items of an RSS feed, which is the output of the foreach block. Thus activate the option <em>Start at item.content</em> on each extract block when entering the CSS selectors. 
For the example site they look like this:</p> <p><a href="/blog/upload/fullfeed_full_extract.jpg"><img src="/blog/upload/fullfeed_full_extract.jpg" alt="/blog/upload/fullfeed_full_extract.jpg" /></a></p> <p>A feed builder block is the only regular part of the pipe missing. Place it near the extract blocks and connect them together, set a title and connect its output to the pipe output at the right (which I forgot to do before taking this screenshot):</p> <p><a href="/blog/upload/fullfeed_createfeed.jpg"><img src="/blog/upload/fullfeed_createfeed.jpg" alt="/blog/upload/fullfeed_createfeed.jpg" /></a></p> <p>One last thing is not perfect yet: images. Some sites like the example site load them via JavaScript only when they become visible. That will not work in a feed since most readers will not execute the JavaScript, even if it were inlined into the article text. But we can fix that: The real image link is normally stored in a data attribute, here it is <code>data-src</code>. With one replace block, remove the current <code>src</code>, which just holds a placeholder; with a second, turn <code>data-src</code> into <code>src</code>:</p> <p><a href="/blog/upload/fullfeed_imagefix.jpg"><img src="/blog/upload/fullfeed_imagefix.jpg" alt="/blog/upload/fullfeed_imagefix.jpg" /></a></p> <p>Now the images load. The whole feed looks good:</p> <p><a href="/blog/upload/fullfeed_preview.jpg"><img src="/blog/upload/fullfeed_preview.jpg" alt="/blog/upload/fullfeed_preview.jpg" /></a></p> <p>And that’s everything you need to create a full text RSS feed when the site you want to follow just delivers a shortened RSS feed.</p> <p>You can see the pipe I created for this article <a href="https://www.pipes.digital/pipe/1qWE0vNy">here</a>. You can fork it to use it as a starting point to transform the shortened feed of your favorite site into a full text feed.</p>
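    <p>The image fix with the two replace blocks can be approximated in Python like this. The regular expressions are my own guess at what such replacements could look like, not the exact patterns entered in the pipe, and the function name <code>fix_lazy_images</code> is made up for this sketch.</p>

```python
import re


def fix_lazy_images(html):
    """Swap lazy-loading placeholders for the real image urls."""
    # Step 1: remove the placeholder src attribute. The lookbehind makes
    # sure the "src" inside "data-src" is left alone.
    without_placeholder = re.sub(r'(?<![\w-])src="[^"]*"\s*', "", html)
    # Step 2: rename data-src to src so feed readers load the real image.
    return without_placeholder.replace("data-src=", "src=")


lazy = '<img src="placeholder.gif" data-src="https://example.com/a.jpg" alt="a">'
fixed = fix_lazy_images(lazy)
```

    <p>Note that this simple version also strips the <code>src</code> from images that have no <code>data-src</code> at all, so it only fits content where every image is lazy loaded.</p>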
  9. How to create RSS feeds for sites that have none

    Wed, 19 Feb 2020 14:23:22 -0000

    <p><em>This is part 3 in a series explaining how pipes works and what you can do with it. <a href="https://www.pipes.digital/blog/1/How%20to%20filter%20RSS%20feeds">Read part 1</a> to learn how to filter feeds and <a href="https://www.pipes.digital/blog/2/How%20to%20combine%20RSS%20feeds">part 2</a> to see how to combine feeds.</em></p> <p>Now that we have seen how to filter and combine feeds, what to do about sites that don’t provide an RSS feed? Is there no way to work with them in Pipes? There is, if it’s an HTML site. Pipes can download the site and extract the elements needed to create a proper RSS feed.</p> <p>Let’s start by <a href="https://www.pipes.digital/editor">creating a new pipe</a>.</p> <p>First, drag a download block from the toolbar on the left into the editor area:</p> <p><a href="/blog/upload/create_start.jpg"><img src="/blog/upload/create_start.jpg" alt="/blog/upload/create_start.jpg" /></a></p> <p>The example here will be this blog, so enter its url <em>https://www.pipes.digital/blog/</em>. Now add an extractor block, and connect it to the download block:</p> <p><a href="/blog/upload/create_extract.jpg"><img src="/blog/upload/create_extract.jpg" alt="/blog/upload/create_extract.jpg" /></a></p> <p>You see that it asks for a selector, and <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors">CSS selectors</a> work best for HTML pages. The correct selectors will vary for every site. There is a helper tool, but we can also enter the selector manually and should start with that.</p> <p>To find the selector, I think it’s easiest to inspect the target site with your browser (in Firefox: right click on something on a page and click on <em>Inspect Element</em>). Then you can look at the HTML and create a selector based on the structure, ids and classes. 
For example, here in this blog, to get the <strong>article text</strong> I’d use the selector <code>article &gt; div</code>:</p> <p><a href="/blog/upload/create_extract_content.jpg"><img src="/blog/upload/create_extract_content.jpg" alt="/blog/upload/create_extract_content.jpg" /></a></p> <p>But we need three more things: The article <strong>titles, dates and links</strong>.</p> <p>For the date we can try the helper tool, which opens via the extractor button right next to the selector field. An overlay opens:</p> <p><a href="/blog/upload/create_extract_extractor.jpg"><img src="/blog/upload/create_extract_extractor.jpg" alt="/blog/upload/create_extract_extractor.jpg" /></a></p> <p>In this overlay, if you click on an element it tries to find a selector for it. If you click on a second element, the tool will try to extend the selector to also match it. If you click twice on an element, that marks it as a non-target. In this example a simple click on the date is enough to get the correct selector, <code>time</code>. Click on OK at the top to use the found selector and close the overlay.</p> <p>Article title and link both come from the title of the entries here, and that’s the same selector <code>article header h1 a</code>. For the link, the difference is that we grab the content of the <code>href</code> attribute:</p> <p><a href="/blog/upload/create_extract_linkandtitle.jpg"><img src="/blog/upload/create_extract_linkandtitle.jpg" alt="/blog/upload/create_extract_linkandtitle.jpg" /></a></p> <p>Now we have everything we need to construct the feed. Place a <em>Build Feed</em> block:</p> <p><a href="/blog/upload/create_build.jpg"><img src="/blog/upload/create_build.jpg" alt="/blog/upload/create_build.jpg" /></a></p> <p>It gets a title (for the whole feed) and takes 4 inputs, for the items’ content, title, date and link. Just connect the extractor blocks. 
And don’t forget to connect the feed builder block to the pipe output!</p> <p><a href="/blog/upload/create_build_connected.jpg"><img src="/blog/upload/create_build_connected.jpg" alt="/blog/upload/create_build_connected.jpg" /></a></p> <p>Now we can look at the preview:</p> <p><a href="/blog/upload/create_preview.jpg"><img src="/blog/upload/create_preview.jpg" alt="/blog/upload/create_preview.jpg" /></a></p> <p>Perfect. If the blog didn’t have a feed already, it could still be read with an RSS reader thanks to this newly created feed.</p> <p>The pipe I created for this article <a href="https://www.pipes.digital/pipe/DNJGnkOa">is here</a>; you can fork it to use it as a starting point for your own custom feed.</p>
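    <p>If you are curious what the extractor blocks do conceptually, the following Python sketch mimics the selectors <code>article header h1 a</code> and <code>time</code> with the standard library only. Everything in it is an assumption made for illustration: the class name <code>BlogExtractor</code>, the sample HTML (the real markup of this blog may differ) and the simplification that each tag in the title selector must be the direct child of the previous one.</p>

```python
from html.parser import HTMLParser


class BlogExtractor(HTMLParser):
    """Collect title, link and date for every <article> element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of the currently open tags
        self.items = []  # one dictionary per article

    def _in_title_link(self):
        # Stricter than the CSS selector: direct children only.
        return self.stack[-4:] == ["article", "header", "h1", "a"]

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "article":
            self.items.append({"title": "", "link": None, "date": ""})
        elif self._in_title_link():
            # For the link we grab the href attribute, not the text.
            self.items[-1]["link"] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        # Pop up to and including the closed tag (tolerates unclosed tags).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.items or "article" not in self.stack:
            return
        if self._in_title_link():
            self.items[-1]["title"] += data.strip()
        elif self.stack[-1] == "time":
            self.items[-1]["date"] += data.strip()


# Hypothetical markup, only loosely modelled on a blog page.
sample = """
<article>
  <header><h1><a href="/blog/3/example">Example post</a></h1>
  <time>2020-02-19</time></header>
  <div><p>Article text</p></div>
</article>
"""

parser = BlogExtractor()
parser.feed(sample)
items = parser.items
```

    <p>Each dictionary in <code>items</code> then holds what the feed builder needs for one entry, apart from the article text, which would be collected in the same way.</p>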
  10. How to combine RSS feeds

    Tue, 18 Feb 2020 17:53:34 -0000

    <p><em>This is part 2 in a series explaining how pipes works and what you can do with it. <a href="https://www.pipes.digital/blog/1/How%20to%20filter%20RSS%20feeds">Read part 1</a> to learn how to filter feeds.</em></p> <p>After filtering, creating a combined RSS feed is one of the most popular things people do with pipes. And it’s easy to do. I will show how you can create an ultimate tech review feed, combining three publications into one: <a href="https://www.techpowerup.com/review/">TechPowerUp</a>, <a href="https://www.gamersnexus.net/reviews/hwreviews">GamersNexus</a> and <a href="https://www.techspot.com/reviews/">TechSpot</a>.</p> <p>First, place three feed blocks into your pipe by dragging them from the toolbar at the left into the editor area:</p> <p><a href="/blog/upload/combine_start.jpg"><img src="/blog/upload/combine_start.jpg" alt="/blog/upload/combine_start.jpg" /></a></p> <p>And now enter the feeds you want to combine. It’s not always easy to find the right one if the autodetection fails, especially if you just want the feed for a specific subcategory. If you don’t want to search through the source code of the site you can use a browser extension like <a href="https://addons.mozilla.org/en-US/firefox/addon/awesome-rss">Awesome RSS</a> to see all linked RSS feeds. In this case, our source feeds are these three:</p> <ol> <li><a href="https://www.techspot.com/reviews.xml">https://www.techspot.com/reviews.xml</a></li> <li><a href="https://www.techpowerup.com/rss/reviews">https://www.techpowerup.com/rss/reviews</a></li> <li><a href="https://www.gamersnexus.net/reviews/hwreviews/feed?type=rss">https://www.gamersnexus.net/reviews/hwreviews/feed?type=rss</a></li> </ol> <p>With the feeds found, enter them into the feed blocks:</p> <p><a href="/blog/upload/combine_feeds.jpg"><img src="/blog/upload/combine_feeds.jpg" alt="/blog/upload/combine_feeds.jpg" /></a></p> <p>Now to combine the three feeds add a combine block and connect it to the feed blocks. 
In this example I also named the feed blocks to make it easier to know which one is which:</p> <p><a href="/blog/upload/combine_combined.jpg"><img src="/blog/upload/combine_combined.jpg" alt="/blog/upload/combine_combined.jpg" /></a></p> <p>And they are combined!</p> <p>If you check the feed preview now you might find the result not perfect. The combine block combines the feeds, but it does so one feed at a time. So first are all the articles of the first feed, then the ones of the second, then those of the third. To change that, let’s add a sort block and sort by <code>item.published</code>, the date of the publication:</p> <p><a href="/blog/upload/combine_sorted.jpg"><img src="/blog/upload/combine_sorted.jpg" alt="/blog/upload/combine_sorted.jpg" /></a></p> <p>Now that’s really it! The <a href="https://www.pipes.digital/feedpreview/LOM48MNr">feed preview</a> looks exactly like I wanted:</p> <p><a href="/blog/upload/combine_preview.jpg"><img src="/blog/upload/combine_preview.jpg" alt="/blog/upload/combine_preview.jpg" /></a></p> <p>You can see the pipe I created for this article <a href="https://www.pipes.digital/pipe/LOM48MNr">here</a>. You could fork it to create a mega feed for some publications you would like to combine!</p> <p>The next article in this series will explain how to create feeds for sites that don’t have one already.</p>
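    <p>Why the sort block is needed after the combine block can be shown with a small Python sketch. The feeds are plain lists with made-up items and a hypothetical <code>published</code> key. This only mirrors the behaviour described above and is not how pipes implements the blocks.</p>

```python
from datetime import datetime


def combine(*feeds):
    """Append the feeds one after another, like the combine block."""
    combined = []
    for feed in feeds:
        combined.extend(feed)
    return combined


def sort_by_published(items, descending):
    """Order the items by their publication date."""
    return sorted(items, key=lambda item: item["published"],
                  reverse=descending)


# Made-up items standing in for the three source feeds.
techspot = [{"title": "A", "published": datetime(2020, 2, 18)},
            {"title": "B", "published": datetime(2020, 2, 10)}]
techpowerup = [{"title": "C", "published": datetime(2020, 2, 17)}]
gamersnexus = [{"title": "D", "published": datetime(2020, 2, 12)}]

combined = combine(techspot, techpowerup, gamersnexus)
newest_first = sort_by_published(combined, descending=True)

combined_titles = [item["title"] for item in combined]
sorted_titles = [item["title"] for item in newest_first]
```

    <p>Before sorting, the titles come out feed by feed (A, B, C, D). After sorting by date, the newest article of any publication comes first (A, C, D, B).</p>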