
How would you make an HTML Parser?

Hola folks,

Here we will walk through how HTML is parsed by a widely used parser: AngleSharp.
You can find it here!
This is just a walkthrough, meant to give an idea of the breadth of issues one has to deal with while designing a parser.

Let's take a sneak peek at what kind of data structures are used and how the code is structured.

It all begins with this line:

var parser = new HtmlParser();

We have the following variations for constructing the HtmlParser object:
        public HtmlParser()
            : this(Configuration.Default)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom options.
        /// </summary>
        /// <param name="options">The options to use.</param>
        public HtmlParser(HtmlParserOptions options)
            : this(options, Configuration.Default)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom configuration.
        /// </summary>
        /// <param name="configuration">The configuration to use.</param>
        public HtmlParser(IConfiguration configuration)
            : this(new HtmlParserOptions { IsScripting = configuration.IsScripting() }, configuration)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom options and configuration.
        /// </summary>
        /// <param name="options">The options to use.</param>
        /// <param name="configuration">The configuration to use.</param>
        public HtmlParser(HtmlParserOptions options, IConfiguration configuration)
            : this(options, BrowsingContext.New(configuration))
        {
        }

        /// <summary>
        /// Creates a new parser with the custom options and the given context.
        /// </summary>
        /// <param name="options">The options to use.</param>
        /// <param name="context">The context to use.</param>
        public HtmlParser(HtmlParserOptions options, IBrowsingContext context)
        {
            _options = options;
            _context = context;
        }

You can find three things here:
1. Configuration: This is the interesting one. Opening it up, we find an array of associated standard services.
These help us create a customized configuration for contexts. Wondering what Factory is?
It is a static class bundling the available factories. Factories are mostly instance mappings.
Let's break open one factory class, say AttributeSelectorFactory.
It has a dictionary that stores a combinator symbol as the key and a SimpleSelector creator as the value.
E.g. the combinator symbol for an exact match is public static readonly String Exactly = "=";
The corresponding value is the SimpleSelector creator AttrMatch:

        public static SimpleSelector AttrMatch(String match, String value, String prefix = null)
        {
            var front = match;

            if (!String.IsNullOrEmpty(prefix))
            {
                front = FormFront(prefix, match);
                match = FormMatch(prefix, match);
            }

            var code = FormCode(front, "=", value.CssString());
            return new SimpleSelector(_ => _.GetAttribute(match).Is(value), Priority.OneClass, code);
        }

In short, the dictionary maps each operator to the code that implements it.
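To make the "operator mapped to code" idea concrete, here is a minimal, self-contained sketch of such a dictionary. It is not AngleSharp's implementation: FakeElement, AttributeMatchers and Create are hypothetical names made up for this illustration, and edge cases (such as empty values) are ignored.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an element: just a bag of attributes.
public sealed class FakeElement
{
    private readonly Dictionary<string, string> _attributes = new Dictionary<string, string>();

    public FakeElement Set(string name, string value)
    {
        _attributes[name] = value;
        return this;
    }

    // Returns null when the attribute is absent.
    public string GetAttribute(string name)
    {
        string value;
        return _attributes.TryGetValue(name, out value) ? value : null;
    }
}

// Hypothetical factory: operator symbol -> function that builds a predicate
// from (attribute name, expected value).
public static class AttributeMatchers
{
    private static readonly Dictionary<string, Func<string, string, Predicate<FakeElement>>> Creators =
        new Dictionary<string, Func<string, string, Predicate<FakeElement>>>
        {
            { "=",  (name, value) => el => el.GetAttribute(name) == value },
            { "^=", (name, value) => el => (el.GetAttribute(name) ?? "").StartsWith(value, StringComparison.Ordinal) },
            { "$=", (name, value) => el => (el.GetAttribute(name) ?? "").EndsWith(value, StringComparison.Ordinal) },
        };

    public static Predicate<FakeElement> Create(string op, string name, string value)
    {
        Func<string, string, Predicate<FakeElement>> creator;
        if (!Creators.TryGetValue(op, out creator))
        {
            throw new ArgumentException("Unsupported operator: " + op, nameof(op));
        }
        return creator(name, value);
    }
}

public static class Program
{
    public static void Main()
    {
        var link = new FakeElement().Set("href", "https://example.com");

        var exact = AttributeMatchers.Create("=", "href", "https://example.com");
        var startsWith = AttributeMatchers.Create("^=", "href", "http://");

        Console.WriteLine(exact(link));      // True
        Console.WriteLine(startsWith(link)); // False ("https://..." does not start with "http://")
    }
}
```

Adding a new operator then only means adding one more dictionary entry, which is probably why this kind of lookup table is a convenient fit for a selector factory.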

2. HtmlParserOptions: This is a simple struct with four members: three boolean flags (IsEmbedded, IsScripting, IsStrictMode) and an OnCreated callback.
3. IBrowsingContext: A simple and lightweight browsing context. It has event listeners for actions like Parsed, Requested, Requesting etc., and holds data regarding the creator, history, security etc. We won't be going deep here.

To sum up, all of this in essence readies the HtmlDocument for us: it has the basic information regarding the BrowsingContext and the HtmlParserOptions, and a proper mapping of what kind of operations are to be performed for the different supported formats.

Parsing itself then comes down to:

            var document = CreateDocument(source);
            var parser = new HtmlDomBuilder(document);
            return parser.Parse(_options);

1. CreateDocument:

This does two things: 
(a) var textSource = new TextSource(source);
(b) var document = new HtmlDocument(_context, textSource);

TextSource: It's just a stream abstraction that handles encoding and more.
HtmlDocument: Represents a document node that contains only HTML nodes.
It has methods like Clone() and LoadAsync(), getters/setters for the title, etc.

2. HtmlDomBuilder: It takes in this HtmlDocument, which is not yet parsed. HtmlDomBuilder essentially constructs the tree (as described here: ). It reads tokens and decides what each one is: open tag, close tag, formatting element, plain text, script etc. How exactly it does this parsing, seen from a code architecture perspective, deserves another blog post, which we will surely do.
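As a rough mental model of "read tokens, decide what each one is, grow the tree", here is a toy tokenizer and tree builder. It is a heavily simplified sketch and not how HtmlDomBuilder works internally: all names (TokenKind, Token, ToyNode, ToyTreeBuilder) are hypothetical, there are no attributes, comments, entities or insertion modes, and the only recovery rule is that a close tag implicitly closes everything opened after its matching open tag.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum TokenKind { OpenTag, CloseTag, Text }

public sealed class Token
{
    public TokenKind Kind { get; set; }
    public string Value { get; set; }
}

public sealed class ToyNode
{
    public string Name { get; set; }   // tag name, "#text" or "#document"
    public string Text { get; set; }   // only used by text nodes
    public List<ToyNode> Children { get; } = new List<ToyNode>();
}

public static class ToyTreeBuilder
{
    // Splits the input into open tags, close tags and text runs.
    public static IEnumerable<Token> Tokenize(string source)
    {
        var i = 0;
        while (i < source.Length)
        {
            var close = source[i] == '<' ? source.IndexOf('>', i) : -1;
            if (close >= 0)
            {
                var inner = source.Substring(i + 1, close - i - 1);
                var isEnd = inner.StartsWith("/", StringComparison.Ordinal);
                yield return new Token
                {
                    Kind = isEnd ? TokenKind.CloseTag : TokenKind.OpenTag,
                    Value = isEnd ? inner.Substring(1) : inner
                };
                i = close + 1;
            }
            else
            {
                // Plain text (or an unterminated '<') up to the next '<' or the end.
                var next = source.IndexOf('<', i + 1);
                if (next < 0) next = source.Length;
                yield return new Token { Kind = TokenKind.Text, Value = source.Substring(i, next - i) };
                i = next;
            }
        }
    }

    // Builds the tree using a stack of currently open elements.
    public static ToyNode Build(string source)
    {
        var root = new ToyNode { Name = "#document" };
        var open = new Stack<ToyNode>();
        open.Push(root);

        foreach (var token in Tokenize(source))
        {
            switch (token.Kind)
            {
                case TokenKind.OpenTag:
                    var element = new ToyNode { Name = token.Value };
                    open.Peek().Children.Add(element);
                    open.Push(element);
                    break;
                case TokenKind.CloseTag:
                    // Stray close tags are ignored; otherwise pop until the match is removed.
                    if (open.Any(n => n != root && n.Name == token.Value))
                    {
                        while (open.Pop().Name != token.Value) { }
                    }
                    break;
                case TokenKind.Text:
                    open.Peek().Children.Add(new ToyNode { Name = "#text", Text = token.Value });
                    break;
            }
        }
        return root;
    }

    public static string Describe(ToyNode node)
    {
        if (node.Name == "#text") return "\"" + node.Text + "\"";
        return node.Name + "[" + string.Join(", ", node.Children.Select(Describe)) + "]";
    }
}

public static class Program
{
    public static void Main()
    {
        // The <b> is never closed; </p> closes it implicitly.
        var tree = ToyTreeBuilder.Build("<p>Hi <b>there</p><i>!</i>");
        Console.WriteLine(ToyTreeBuilder.Describe(tree));
        // #document[p["Hi ", b["there"]], i["!"]]
    }
}
```

A real HTML tree builder has to handle far more cases than this, but the stack of open elements is the part that makes unclosed tags tractable.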

ParseAsync prefetches tokens, taking the size and the current position into consideration, and runs as an async job, whereas Parse simply takes the entire document and runs in the foreground as expected. Both are trying to create the DOM tree, i.e. to fill in the "elements" of the HtmlDocument.
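From the caller's side, the two entry points look roughly like this. This is a small usage sketch under assumptions: it targets the 0.9.x line of AngleSharp that this post walks through (namespace AngleSharp.Parser.Html, methods Parse and ParseAsync); in later releases the namespace is AngleSharp.Html.Parser and the synchronous method is called ParseDocument. It needs the AngleSharp NuGet package and a compiler that supports async Main (C# 7.1 or later).

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x

public static class Program
{
    public static async Task Main()
    {
        // Options are passed explicitly here; new HtmlParser() would use the defaults.
        var options = new HtmlParserOptions { IsScripting = false };
        var parser = new HtmlParser(options);

        // Synchronous: parses the whole source in the foreground.
        var document = parser.Parse("<title>Demo</title><p>Hello</p>");
        Console.WriteLine(document.Title);
        Console.WriteLine(document.QuerySelector("p").TextContent);

        // Asynchronous: same resulting DOM, but awaited.
        var asyncDocument = await parser.ParseAsync("<p>Hello again</p>");
        Console.WriteLine(asyncDocument.Body.TextContent);
    }
}
```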
It will be interesting to walk through the Home and Foreign methods; the code takes care of cases such as foreign elements being present or a tag not being closed.
The connection between IElement and INode is intriguing. Please note:
- The Element interface represents an object within a DOM document.
- The Node interface is an interface from which a number of DOM types inherit, and allows these various types to be treated similarly.

Element implements the Node interface, so it's bound to have the expected members such as ChildNodes etc.
There are things like shadow roots and pseudo elements that will need attention in the next post too.
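The Element/Node relationship described above can be sketched in a few lines. The interfaces below are hypothetical, cut-down stand-ins (IToyNode, IToyElement), not AngleSharp's INode and IElement; the point is only that code written against the node interface works for elements and text nodes alike.

```csharp
using System;
using System.Collections.Generic;

// Base abstraction: everything in the tree is a node.
public interface IToyNode
{
    string NodeName { get; }
    IReadOnlyList<IToyNode> ChildNodes { get; }
}

// An element is a node that additionally carries attributes.
public interface IToyElement : IToyNode
{
    string GetAttribute(string name);
}

public sealed class ToyText : IToyNode
{
    private static readonly IToyNode[] NoChildren = new IToyNode[0];

    public ToyText(string data) { Data = data; }

    public string Data { get; }
    public string NodeName => "#text";
    public IReadOnlyList<IToyNode> ChildNodes => NoChildren;
}

public sealed class ToyElement : IToyElement
{
    private readonly List<IToyNode> _children = new List<IToyNode>();
    private readonly Dictionary<string, string> _attributes = new Dictionary<string, string>();

    public ToyElement(string name) { NodeName = name; }

    public string NodeName { get; }
    public IReadOnlyList<IToyNode> ChildNodes => _children;

    public ToyElement Append(IToyNode child) { _children.Add(child); return this; }
    public ToyElement SetAttribute(string name, string value) { _attributes[name] = value; return this; }

    // Returns null when the attribute is absent.
    public string GetAttribute(string name)
    {
        string value;
        return _attributes.TryGetValue(name, out value) ? value : null;
    }
}

public static class NodeWalker
{
    // Works on any node type because it only relies on IToyNode.
    public static int Count(IToyNode node)
    {
        var total = 1;
        foreach (var child in node.ChildNodes) total += Count(child);
        return total;
    }
}

public static class Program
{
    public static void Main()
    {
        var div = new ToyElement("div")
            .SetAttribute("id", "main")
            .Append(new ToyElement("p").Append(new ToyText("hello")));

        Console.WriteLine(NodeWalker.Count(div));  // 3 (div, p, text)
        Console.WriteLine(div.GetAttribute("id")); // main
    }
}
```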

This is just a quarter of the walk through the code. I would be glad to cover the finer details in the upcoming post.

Hope it's helpful for someone trying to dive deep into the AngleSharp code.
If you have queries, feel free to ping me in the comments or via mail; I would be happy to discuss.

See you in the next post!
Soon enough!
Till then, keep hacking~


