Monday, 18 September 2017

How would you make an HTML Parser?

Hola folks,

Here we will walk through how HTML is parsed by a widely used parser: AngleSharp.
You can find it here!
This is just a walkthrough; it gives an idea of the breadth of issues one has to deal with while designing a parser.

Let's take a sneak peek at what kind of data structures are used and how the code is structured.

It all begins with this line:

var parser = new HtmlParser();

We have the following variations for constructing the HtmlParser object:

        public HtmlParser()
            : this(Configuration.Default)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom options.
        /// </summary>
        /// <param name="options">The options to use.</param>
        public HtmlParser(HtmlParserOptions options)
            : this(options, Configuration.Default)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom configuration.
        /// </summary>
        /// <param name="configuration">The configuration to use.</param>
        public HtmlParser(IConfiguration configuration)
            : this(new HtmlParserOptions { IsScripting = configuration.IsScripting() }, configuration)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom options and configuration.
        /// </summary>
        /// <param name="options">The options to use.</param>
        /// <param name="configuration">The configuration to use.</param>
        public HtmlParser(HtmlParserOptions options, IConfiguration configuration)
            : this(options, BrowsingContext.New(configuration))
        {
        }

        /// <summary>
        /// Creates a new parser with the custom options and the given context.
        /// </summary>
        /// <param name="options">The options to use.</param>
        /// <param name="context">The context to use.</param>
        public HtmlParser(HtmlParserOptions options, IBrowsingContext context)
        {
            _options = options;
            _context = context;
        }
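
The pattern in these overloads is constructor chaining: every convenience overload fills in one default and hands over to the next, and only the last constructor stores anything. Here is a minimal sketch of that idea. MiniParser, MiniOptions and MiniContext are hypothetical stand-ins invented for this illustration; they are not AngleSharp types.

```csharp
using System;

// Hypothetical stand-ins for illustration only, not AngleSharp types.
public sealed class MiniOptions
{
    public bool IsScripting { get; set; }
}

public sealed class MiniContext
{
    public string Name { get; }
    public MiniContext(string name) { Name = name; }
}

public sealed class MiniParser
{
    private readonly MiniOptions _options;
    private readonly MiniContext _context;

    // Convenience overload: supplies default options and delegates.
    public MiniParser()
        : this(new MiniOptions())
    {
    }

    // Convenience overload: supplies a default context and delegates.
    public MiniParser(MiniOptions options)
        : this(options, new MiniContext("default"))
    {
    }

    // The only constructor that actually stores state.
    public MiniParser(MiniOptions options, MiniContext context)
    {
        _options = options ?? throw new ArgumentNullException(nameof(options));
        _context = context ?? throw new ArgumentNullException(nameof(context));
    }

    public string Describe()
    {
        return $"scripting={_options.IsScripting}, context={_context.Name}";
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(new MiniParser().Describe());
        Console.WriteLine(new MiniParser(new MiniOptions { IsScripting = true }).Describe());
    }
}
```

The benefit of funnelling everything into one constructor is that validation and field assignment live in exactly one place, however many overloads get added later.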

Three things show up in these constructors:
1. Configuration: This is the interesting one. Opening it up, we find an array of associated standard services like:
These let us create a customized configuration for a context. Wondering what Factory is?
It is a static class bundling the available factories. Factories are mostly instance mappings.
Let's break open one factory class, say AttributeSelectorFactory.
It has a dictionary that stores a CombinatorSymbol as the key and a SimpleSelector attribute creator as the value.
E.g. the combinator symbol for an exact match is public static readonly String Exactly = "=";
The corresponding value is the SimpleSelector AttrMatch:

        public static SimpleSelector AttrMatch(String match, String value, String prefix = null)
        {
            var front = match;

            if (!String.IsNullOrEmpty(prefix))
            {
                front = FormFront(prefix, match);
                match = FormMatch(prefix, match);
            }

            var code = FormCode(front, "=", value.CssString());
            return new SimpleSelector(_ => _.GetAttribute(match).Is(value), Priority.OneClass, code);
        }

So it is, in effect, a mapping from operators to their corresponding code.

2. HtmlParserOptions: This is a simple struct with four members: the boolean flags IsEmbedded, IsScripting and IsStrictMode, plus an OnCreated callback.
3. IBrowsingContext: A simple and lightweight browsing context. It has EventListeners for actions like Parsed, Requested, Requesting etc. and holds data regarding creator, history, security etc. We won't be going deep in here.
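
To make the "operator to code" mapping from point 1 concrete, here is a small self-contained sketch of a factory built around a dictionary. It is not AngleSharp's AttributeSelectorFactory: the class name AttributeMatcherFactory, the Creator delegate and the plain dictionary used to represent an element's attributes are all assumptions made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a dictionary-backed factory, not AngleSharp's code.
public static class AttributeMatcherFactory
{
    // A creator turns (attribute name, expected value) into a predicate
    // that is evaluated against an element's attributes.
    public delegate Predicate<IReadOnlyDictionary<string, string>> Creator(string name, string value);

    // Operator symbol -> creator, mirroring the "instance mapping" idea.
    private static readonly Dictionary<string, Creator> Creators = new Dictionary<string, Creator>
    {
        { "=",  (name, value) => attrs => attrs.TryGetValue(name, out var v) && v == value },
        { "^=", (name, value) => attrs => attrs.TryGetValue(name, out var v)
                                          && value.Length > 0
                                          && v.StartsWith(value, StringComparison.Ordinal) },
        { "$=", (name, value) => attrs => attrs.TryGetValue(name, out var v)
                                          && value.Length > 0
                                          && v.EndsWith(value, StringComparison.Ordinal) },
    };

    public static Predicate<IReadOnlyDictionary<string, string>> Create(string symbol, string name, string value)
    {
        if (Creators.TryGetValue(symbol, out var creator))
        {
            return creator(name, value);
        }

        throw new ArgumentException("Unknown operator: " + symbol, nameof(symbol));
    }
}

public static class Program
{
    public static void Main()
    {
        var link = new Dictionary<string, string>
        {
            { "href", "https://example.org" },
            { "rel", "nofollow" },
        };

        var exact = AttributeMatcherFactory.Create("=", "rel", "nofollow");
        var prefix = AttributeMatcherFactory.Create("^=", "href", "https://");

        Console.WriteLine(exact(link));
        Console.WriteLine(prefix(link));
    }
}
```

Adding a new operator then means adding one dictionary entry; the code that looks up and applies the operator does not change.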

In short, all of this readies the HtmlDocument for us: it carries the basic information about the BrowsingContext and the HtmlParserOptions, and it has a proper mapping of which operations to perform for the different supported formats.

Parsing a source then boils down to these lines:

            var document = CreateDocument(source);
            var parser = new HtmlDomBuilder(document);
            return parser.Parse(_options);

1. CreateDocument:

This does two things: 
(a) var textSource = new TextSource(source);
(b) var document = new HtmlDocument(_context, textSource);

textSource: It's just a stream abstraction to handle encoding and more.
HtmlDocument: Represents a document node that contains only HTML nodes.
It has methods like Clone(), LoadAsync, Get/Set title etc.

2. HtmlDomBuilder: It takes in this HtmlDocument, which is not yet parsed. HtmlDomBuilder essentially constructs the tree (as described here: ). It parses tokens and decides what each one is: open tag, closing tag, formatting element, plain text, script etc. How it does this parsing, with a perspective on code architecture, deserves another blog post, which we will surely do.
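
To get a feel for the "tokens in, tree out" idea, here is a deliberately tiny sketch. It is nowhere near what HtmlDomBuilder does and shares no code with it: MiniHtml, Token and TreeNode are hypothetical names, attributes are dropped, and comments, void elements, scripts and entities are not handled at all. It only shows a tokenizer feeding a stack of open elements, including what happens when a tag is never closed.

```csharp
using System;
using System.Collections.Generic;

public enum TokenKind { StartTag, EndTag, Text }

public sealed class Token
{
    public TokenKind Kind { get; }
    public string Value { get; }
    public Token(TokenKind kind, string value) { Kind = kind; Value = value; }
}

public sealed class TreeNode
{
    public string Name { get; }
    public List<TreeNode> Children { get; } = new List<TreeNode>();
    public TreeNode(string name) { Name = name; }
}

// Hypothetical toy parser for illustration; not AngleSharp code.
public static class MiniHtml
{
    // Splits the source into start tags, end tags and text runs.
    public static IEnumerable<Token> Tokenize(string source)
    {
        var i = 0;
        while (i < source.Length)
        {
            var close = source[i] == '<' ? source.IndexOf('>', i) : -1;
            if (close >= 0)
            {
                var inner = source.Substring(i + 1, close - i - 1).Trim().ToLowerInvariant();
                if (inner.StartsWith("/", StringComparison.Ordinal))
                    yield return new Token(TokenKind.EndTag, inner.Substring(1).Trim());
                else
                    yield return new Token(TokenKind.StartTag, inner.Split(' ')[0]); // attributes are dropped
                i = close + 1;
            }
            else
            {
                // Text runs up to the next '<' (or to the end of the input).
                var next = source.IndexOf('<', i + 1);
                if (next < 0) next = source.Length;
                yield return new Token(TokenKind.Text, source.Substring(i, next - i));
                i = next;
            }
        }
    }

    // Builds a tree by keeping a stack of currently open elements.
    public static TreeNode Build(string source)
    {
        var root = new TreeNode("#document");
        var open = new Stack<TreeNode>();
        open.Push(root);

        foreach (var token in Tokenize(source))
        {
            switch (token.Kind)
            {
                case TokenKind.StartTag:
                    var element = new TreeNode(token.Value);
                    open.Peek().Children.Add(element);
                    open.Push(element);
                    break;
                case TokenKind.EndTag:
                    // Close everything up to the matching element; ignore stray end tags.
                    if (IsOpen(open, token.Value))
                    {
                        while (open.Pop().Name != token.Value) { }
                    }
                    break;
                case TokenKind.Text:
                    open.Peek().Children.Add(new TreeNode("#text:" + token.Value));
                    break;
            }
        }

        return root;
    }

    private static bool IsOpen(Stack<TreeNode> open, string name)
    {
        // The document root sits at the bottom of the stack and is never closed by a tag.
        var remaining = open.Count - 1;
        foreach (var node in open)
        {
            if (remaining-- == 0) break;
            if (node.Name == name) return true;
        }
        return false;
    }

    public static string Dump(TreeNode node, int depth = 0)
    {
        var text = new string(' ', depth * 2) + node.Name + "\n";
        foreach (var child in node.Children) text += Dump(child, depth + 1);
        return text;
    }
}

public static class Program
{
    public static void Main()
    {
        // The <p> is never closed; the </div> end tag closes it implicitly.
        Console.Write(MiniHtml.Dump(MiniHtml.Build("<div><p>Hello <b>world</b></div>")));
    }
}
```

A real HTML tree builder follows many more rules than "pop until it matches", but the stack of open elements is the core data structure in both cases.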

ParseAsync prefetches tokens, taking size and current position into consideration, and runs as an async job, whereas Parse simply takes the entire document and runs in the foreground as expected. Both are trying to create the DOM tree, i.e. filling in the HtmlDocument's "elements".
It will be interesting to walk through the Home and Foreign methods; the code takes care of cases such as foreign elements being present or a tag not being closed.
The connection between IElement and INode is intriguing. Please note:
- The Element interface represents an object within a DOM document.
- The Node interface is an interface from which a number of DOM types inherit, and allows these various types to be treated similarly.

And since Element implements the Node interface, it is bound to have the expected members such as ChildNodes.
Shadow roots and pseudo elements will also need attention in the next post.
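
The Element/Node relationship can be sketched in a few lines. The interfaces below are simplified, hypothetical versions written for this post (hence the "Sketch" suffix); the real INode and IElement in AngleSharp carry many more members.

```csharp
using System;
using System.Collections.Generic;

// Simplified, hypothetical interfaces; the real INode / IElement are much larger.
public interface INodeSketch
{
    string NodeName { get; }
    IReadOnlyList<INodeSketch> ChildNodes { get; }
}

// An element is a node with extra capabilities, such as attributes.
public interface IElementSketch : INodeSketch
{
    string GetAttribute(string name);
}

public sealed class TextSketch : INodeSketch
{
    private static readonly INodeSketch[] NoChildren = new INodeSketch[0];

    public string Data { get; }
    public string NodeName => "#text";
    public IReadOnlyList<INodeSketch> ChildNodes => NoChildren;

    public TextSketch(string data) { Data = data; }
}

public sealed class ElementSketch : IElementSketch
{
    private readonly List<INodeSketch> _children = new List<INodeSketch>();
    private readonly Dictionary<string, string> _attributes = new Dictionary<string, string>();

    public string NodeName { get; }
    public IReadOnlyList<INodeSketch> ChildNodes => _children;

    public ElementSketch(string name) { NodeName = name; }

    public ElementSketch Append(INodeSketch child)
    {
        _children.Add(child);
        return this;
    }

    public ElementSketch SetAttribute(string name, string value)
    {
        _attributes[name] = value;
        return this;
    }

    // Returns null when the attribute is not present.
    public string GetAttribute(string name)
    {
        return _attributes.TryGetValue(name, out var value) ? value : null;
    }
}

public static class Program
{
    // Works on any node type because it only relies on the shared node interface.
    public static int CountNodes(INodeSketch node)
    {
        var count = 1;
        foreach (var child in node.ChildNodes) count += CountNodes(child);
        return count;
    }

    public static void Main()
    {
        var link = new ElementSketch("a").SetAttribute("href", "#top").Append(new TextSketch("Back to top"));
        var paragraph = new ElementSketch("p").Append(new TextSketch("See: ")).Append(link);

        Console.WriteLine(CountNodes(paragraph));
        Console.WriteLine(link.GetAttribute("href"));
    }
}
```

CountNodes never needs to know whether it is looking at an element or a text node, which is exactly what "allows these various types to be treated similarly" means in practice.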

This is just a quarter of the walk into the code; I would be glad to cover the finer details in the upcoming post.

Hope it's helpful for someone trying to dive deep into the AngleSharp code.
If you have queries, feel free to ping me in the comments or via mail; I would be happy to discuss.

See you in the next post!
Soon enough!
Till then, keep hacking~

