Where parallels cross

Interesting bits of life

Moldable Emacs: capture links from HTML with Playground

Too long; didn't read

There is a simple way in moldable-emacs to grab a list of links from an HTML page. This shows you one more way of using the Playground.

The problem

I like to read blogs on my ebook reader, and I have spent too much time selecting rectangles on my screen to grab a bunch of links. It is cool, but often these links are in an HTML list: it would be so easy to grab them, if only I could turn HTML into something I know how to navigate. Something like a Lisp list. Wait! Wouldn't that be possible?!

It is a problem indeed

There is a more generic problem here. We are prisoners of the way we present data. When our (typical) browser shows an HTML page, we can only search that view. If we want to grab a list of links, we need to rely on an extension.

Rather than being forced to consume a pre-chewed view, I would like a language that allows me to query the data. This makes me an explorer. It gives me a thrill to learn the language in depth. I become a more evolved user and I can extract novel and unexpected value from the provided data. (I guess I should also be able to pick the language I like most, so we get the benefits of (bio-)diversity, but that is for another blog post.)

Can we already achieve that in the marvellous Emacs somehow?

And there is a solution

If we are familiar with Lisp, we could already grab links. I was not doing that because there is friction: you have to jump from the web to an Emacs scratch file, then hack your way through. The browser extension way seemed much easier.

After I made a little Elisp-Pandoc script to make Epub files, I started feeling the pain.

Now let's assume I want to download Nyxt blog posts: https://nyxt.atlas.engineer/articles. There are a lot of links there.

If we use Emacs' url-retrieve-synchronously to get the HTML in a buffer, we will need the following.

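(progn
  ;; Fetch the page; the returned buffer holds the HTTP headers followed by the body.
  (switch-to-buffer (url-retrieve-synchronously "https://nyxt.atlas.engineer/articles"))
  (goto-char (point-min))
  ;; Drop the headers: url-http-end-of-headers marks where they end in this buffer.
  (delete-region (point-min) url-http-end-of-headers)
  (html-mode))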

We would get some HTML like this.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><title>Nyxt</title><head><meta name="viewport" content="width=device-width, initial-scale=1"><meta charset="UTF-8"><link rel="stylesheet" type="text/css" media="screen" href="/static/css/pure-min.css"></link><link rel="stylesheet" type="text/css" media="screen" href="/static/css/grids-responsive-min.css"></link><link rel="stylesheet" type="text/css" media="screen" href="/static/css/main.css"></link></head><body><header><div class="pure-menu pure-menu-horizontal"><ul class="pure-menu-list"><li class="pure-menu-heading menu-brand"><a href="/"><img src="/image/nyxt_128x128.png" /></a></li><li class="pure-menu-item"><a href="/applications" class="pure-menu-link">Applications</a></li><li class="pure-menu-item pure-menu-selected"><a href="/articles" class="pure-menu-link">Articles</a></li><li class="pure-menu-item"><a href="/download" class="pure-menu-link">Download</a></li><li class="pure-menu-item"><a href="/faq" class="pure-menu-link">FAQ</a></li><li class="pure-menu-item"><a href="https://discourse.atlas.engineer/" class="pure-menu-link">Forum</a></li><li class="pure-menu-item"><a href="https://kiwiirc.com/nextclient/irc.libera.chat/nyxt" class="pure-menu-link">Chat</a></li><li class="pure-menu-item"><a href="https://github.com/atlas-engineer/nyxt" class="pure-menu-link">Source</a></li><li class="pure-menu-item"><a...

We just went from a readable web page to an illegible view of structured data. At this point it may seem we made our situation worse.

I shall prove the contrary with moldable-emacs!

Assuming you have installed emacs-tree-sitter and the basic languages it supports, let me show what happens when you call me/mold and choose the CodeAsTree mold on this HTML buffer.

/assets/blog/2021/07/19/moldable-emacs-capture-links-from-html-with-playground/screen-2021-06-25-01-33-02.jpg

The buffer on the left is a flattened list representing all the HTML data. In the excerpt below, for example, there is an attribute entry and an attribute_name entry that the first includes.

...
(:type attribute :text "class=\"pure-menu-link\"" :begin 8260 :end 8282 :buffer " *http nyxt.atlas.engineer:443*-432463" :buffer-file nil)
 (:type attribute_name :text "class" :begin 8260 :end 8265 :buffer " *http nyxt.atlas.engineer:443*-432463" :buffer-file nil)
...

This view lays all the data flat at our feet. I usually exploit the :type to filter the information I need. I first search the tree for the type of information I want. In this case I am looking for the articles. One article is "Superuser batch downloading", so I search for the smallest element of the list that contains this text. (At some point I may decide to make a function to speed up this search.)
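A function to speed up that search could be sketched as follows. This is only a minimal sketch, not part of moldable-emacs: the name my/smallest-entry-containing is hypothetical, and it assumes the entries are plists with a :text string, like the ones shown above.

```elisp
(require 'seq)

(defun my/smallest-entry-containing (needle entries)
  "Return the entry of ENTRIES with the shortest :text containing NEEDLE.
ENTRIES is assumed to be a list of plists like the flattened tree.
Return nil when no entry matches.  (Hypothetical helper.)"
  (let* ((case-fold-search nil)         ; match NEEDLE case-sensitively
         (matches (seq-filter
                   (lambda (entry)
                     (string-match-p (regexp-quote needle)
                                     (or (plist-get entry :text) "")))
                   entries)))
    ;; `seq-filter' returns a fresh list, so sorting it in place is safe.
    (car (sort matches
               (lambda (a b)
                 (< (length (plist-get a :text))
                    (length (plist-get b :text))))))))
```

In a Playground, where the flattened tree is available as self, a call such as (my/smallest-entry-containing "Superuser batch downloading" self) should return the narrowest matching entry.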

I find the following entry.

(:type element :text "<li><a href=\"/article/sharing-files.org\">Superuser batch downloading</a> <small>(2021-04-05)</small></li>" :begin 2609 :end 2714 :buffer " *http nyxt.atlas.engineer:443*-432463" :buffer-file nil)

This reminds me that I want the link though! So now I search for "/article/sharing-files.org".

I find this.

(:type attribute_value :text "/article/sharing-files.org" :begin 2622 :end 2648 :buffer " *http nyxt.atlas.engineer:443*-432463" :buffer-file nil)

So we can infer that the links I want have :type attribute_value and end with the .org extension.

This is the moment to call me/mold again and select Playground!

Let me show how it looks before explaining what I did.

/assets/blog/2021/07/19/moldable-emacs-capture-links-from-html-with-playground/screen-2021-06-25-01-47-03.jpg

You can see I got my list of links. Let's look at my throwaway code.

(--> (me/by-type 'attribute_value self)
  (--map (plist-get it :text) it)
  (--filter
   (s-ends-with? ".org" it)
   it)
  (--map
   (concat
    "https://nyxt.atlas.engineer"
    it)
   it))

First, the flattened tree is in the buffer variable self. So when I use me/by-type, I keep only the entries whose :type field is attribute_value. Then I get the text of those entries and pick only the ones ending with ".org". Cherry on the cake, I prepend the site URL because the Nyxt folks are using relative links.
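To try the same steps outside a Playground, here is a sketch that uses only built-in functions and a few sample entries. The names my/by-type and my/org-links are hypothetical; my/by-type is a stand-in written for this example, and the real me/by-type may behave differently.

```elisp
(require 'seq)

(defun my/by-type (type entries)
  "Return the ENTRIES whose :type is TYPE.
Hypothetical stand-in for me/by-type; the real function may differ."
  (seq-filter (lambda (entry) (eq (plist-get entry :type) type))
              entries))

(defun my/org-links (entries base-url)
  "Return absolute URLs for the .org attribute values found in ENTRIES."
  (let* ((attribute-entries (my/by-type 'attribute_value entries))
         (texts (mapcar (lambda (entry) (plist-get entry :text))
                        attribute-entries))
         (org-paths (seq-filter (lambda (text) (string-suffix-p ".org" text))
                                texts)))
    ;; The links are relative, so prepend the site URL.
    (mapcar (lambda (path) (concat base-url path)) org-paths)))

;; Sample entries shaped like the flattened tree shown earlier.
(my/org-links
 '((:type attribute_name :text "href")
   (:type attribute_value :text "/article/sharing-files.org")
   (:type attribute_value :text "pure-menu-link"))
 "https://nyxt.atlas.engineer")
;; => ("https://nyxt.atlas.engineer/article/sharing-files.org")
```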

I like to evaluate while I develop the code (REPL <3), which is why you see the results appearing in the Playground buffer (I shall share my code for that, totally inspired/copied from Malabarba's post).

Now let's evaluate with another mold: EvalSexp.

/assets/blog/2021/07/19/moldable-emacs-capture-links-from-html-with-playground/screen-2021-06-25-01-54-39.jpg

The list of links is there now.

You can still use the old Playground if you want to download those! That is an Elisp buffer so you can evaluate what you like.
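For the download itself, something along these lines could be evaluated in that buffer. It is a rough sketch: my/download-links is a hypothetical name, the target directory is up to you, and each file simply contains whatever the server returns for that URL.

```elisp
(require 'url-handlers)

(defun my/download-links (links directory)
  "Download each URL in LINKS into DIRECTORY.
Return the list of saved file names.  (Hypothetical helper.)"
  (make-directory directory t)          ; create DIRECTORY if missing
  (mapcar (lambda (link)
            ;; Name the local file after the last segment of the URL.
            (let ((target (expand-file-name (file-name-nondirectory link)
                                            directory)))
              ;; Third argument t: overwrite an existing file.
              (url-copy-file link target t)
              target))
          links))
```

With the list of links at hand, (my/download-links links "~/nyxt-articles/") would fetch them one after the other; the directory name here is only an example.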

Do you see? Now the data is available to us. Most structured data becomes available because we use tree-sitter. (Even Clojure programs!)

Once the data is available and moldable, you can create your own view. For example, you can use a Playground mold to make a list. And then a mold that transforms that list into a CSV buffer. And you can plot that with another mold. Or bake your own visualization!
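The list-to-CSV step could look roughly like this in plain Elisp. This is a sketch under assumptions: the function names and the default buffer name are hypothetical, the input is a list of rows where each row is a list of fields, and registering it as an actual mold is left out.

```elisp
(defun my/csv-field (value)
  "Format VALUE as one CSV field, quoting it when needed."
  (let ((text (format "%s" value)))
    (if (string-match-p "[\",\n]" text)
        ;; Double any embedded quotes, then wrap the field in quotes.
        (concat "\"" (replace-regexp-in-string "\"" "\"\"" text t t) "\"")
      text)))

(defun my/rows-to-csv-buffer (rows &optional buffer-name)
  "Write ROWS, a list of lists, as CSV into a buffer and return the buffer.
BUFFER-NAME defaults to \"*my-csv*\".  (Hypothetical helper.)"
  (let ((buffer (get-buffer-create (or buffer-name "*my-csv*"))))
    (with-current-buffer buffer
      (erase-buffer)
      (dolist (row rows)
        (insert (mapconcat #'my/csv-field row ",") "\n")))
    buffer))

(with-current-buffer
    (my/rows-to-csv-buffer
     '(("title" "url")
       ("Superuser batch downloading"
        "https://nyxt.atlas.engineer/article/sharing-files.org")))
  (buffer-string))
;; => "title,url\nSuperuser batch downloading,https://nyxt.atlas.engineer/article/sharing-files.org\n"
```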

This is how you can mold your tools and world to enable fantastic explorations.

Conclusion

Transform views into data and make the views that suit you best. With moldable-emacs this becomes simple. Load it in your Emacs and start exploring your structured data. Once you start with it, you will see data everywhere. Ah, and let me know if you have trouble running it: this mode is still in a very early release.

Happy exploring!
