Monday, 18 September 2017

How would you make an HTML Parser?

Hola folks,

Here we will walk through how HTML is parsed by a widely popular parser: AngleSharp.
You can find it here!
This is just a walkthrough, meant to give an idea of the breadth of issues one has to deal with while designing a parser.

Let's take a sneak peek at what kind of data structures are used and how the code is structured.

It all begins with this line:

var parser = new HtmlParser();

We have the following variations for constructing the HtmlParser object:
        public HtmlParser()
            : this(Configuration.Default)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom options.
        /// </summary>
        /// <param name="options">The options to use.</param>
        public HtmlParser(HtmlParserOptions options)
            : this(options, Configuration.Default)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom configuration.
        /// </summary>
        /// <param name="configuration">The configuration to use.</param>
        public HtmlParser(IConfiguration configuration)
            : this(new HtmlParserOptions { IsScripting = configuration.IsScripting() }, configuration)
        {
        }

        /// <summary>
        /// Creates a new parser with the custom options and configuration.
        /// </summary>
        /// <param name="options">The options to use.</param>
        /// <param name="configuration">The configuration to use.</param>
        public HtmlParser(HtmlParserOptions options, IConfiguration configuration)
            : this(options, BrowsingContext.New(configuration))
        {
        }

        /// <summary>
        /// Creates a new parser with the custom options and the given context.
        /// </summary>
        /// <param name="options">The options to use.</param>
        /// <param name="context">The context to use.</param>
        public HtmlParser(HtmlParserOptions options, IBrowsingContext context)
        {
            _options = options;
            _context = context;
        }
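
To see the chaining pattern on its own, here is a minimal sketch that compiles without AngleSharp. All the Sketch* types are hypothetical stand-ins I made up for illustration, not the library's classes; only the shape (every overload funnels into the one constructor that stores the fields) mirrors the code above.

```csharp
using System;

// Hypothetical stand-ins; none of these are AngleSharp types.
public sealed class SketchConfiguration
{
    public static readonly SketchConfiguration Default = new SketchConfiguration(false);

    public SketchConfiguration(bool scripting) { Scripting = scripting; }

    public bool Scripting { get; }
}

public struct SketchOptions
{
    public bool IsScripting { get; set; }
}

public sealed class SketchContext
{
    public SketchContext(SketchConfiguration configuration) { Configuration = configuration; }

    public SketchConfiguration Configuration { get; }
}

public sealed class SketchParser
{
    private readonly SketchOptions _options;
    private readonly SketchContext _context;

    // No arguments: fall back to the default configuration.
    public SketchParser()
        : this(SketchConfiguration.Default)
    {
    }

    // Options only: pair them with the default configuration.
    public SketchParser(SketchOptions options)
        : this(options, SketchConfiguration.Default)
    {
    }

    // Configuration only: derive the options from it.
    public SketchParser(SketchConfiguration configuration)
        : this(new SketchOptions { IsScripting = configuration.Scripting }, configuration)
    {
    }

    // Options and configuration: wrap the configuration in a context.
    public SketchParser(SketchOptions options, SketchConfiguration configuration)
        : this(options, new SketchContext(configuration))
    {
    }

    // The only constructor that actually stores anything.
    public SketchParser(SketchOptions options, SketchContext context)
    {
        _options = options;
        _context = context;
    }

    public SketchOptions Options => _options;

    public SketchContext Context => _context;
}
```

Whichever overload you call, you end up in the last constructor, so there is exactly one place where the fields get assigned.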

You can find three things in these constructors:
1. Configuration: This is an interesting one. Opening it up, we find an array of associated standard services.
These can help us create customized configurations for contexts. Wondering what Factory is?
It is a static class bundling the available factories. Factories are mostly instance mappings.
Let's break open one factory class. Take AttributeSelectorFactory:
It has a dictionary that stores pairs of a combinator symbol (the key) and a SimpleSelector creator (the value).
E.g. the combinator symbol for "exactly" is public static readonly String Exactly = "=";
The corresponding value is the SimpleSelector creator AttrMatch:

        public static SimpleSelector AttrMatch(String match, String value, String prefix = null)
        {
            var front = match;

            if (!String.IsNullOrEmpty(prefix))
            {
                front = FormFront(prefix, match);
                match = FormMatch(prefix, match);
            }

            var code = FormCode(front, "=", value.CssString());
            return new SimpleSelector(_ => _.GetAttribute(match).Is(value), Priority.OneClass, code);
        }

So it is kind of a mapping from operators to their corresponding code.

2. HtmlParserOptions: This is a simple struct with four members: boolean flags saying whether the document IsEmbedded, IsScripting and IsStrictMode, plus an OnCreated callback.
3. IBrowsingContext: A simple and lightweight browsing context. It has different event listeners attached to actions like Parsed, Requested, Requesting etc., and holds data regarding creator, history, security etc. We won't be going deep in here.
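
The "operator mapped to code" idea from point 1 can be sketched without AngleSharp at all. This is a minimal sketch under my own assumptions: the names SketchSelectorFactory, Matcher and MatcherCreator are hypothetical, an "element" is reduced to a plain attribute dictionary, and besides "=" I added the standard CSS operators "^=", "$=" and "*=" purely for illustration.

```csharp
using System;
using System.Collections.Generic;

// Answers: does an element with these attributes satisfy the selector?
public delegate bool Matcher(IReadOnlyDictionary<string, string> attributes);

// Builds a matcher for a given attribute name and expected value.
public delegate Matcher MatcherCreator(string name, string value);

// Hypothetical sketch; the real AttributeSelectorFactory differs in detail.
public static class SketchSelectorFactory
{
    // Key: the operator symbol. Value: the code that creates the matching logic.
    private static readonly Dictionary<string, MatcherCreator> Creators =
        new Dictionary<string, MatcherCreator>
        {
            { "=", (name, value) => attrs =>
                attrs.TryGetValue(name, out var v) && v == value },
            { "^=", (name, value) => attrs =>
                attrs.TryGetValue(name, out var v) && v.StartsWith(value, StringComparison.Ordinal) },
            { "$=", (name, value) => attrs =>
                attrs.TryGetValue(name, out var v) && v.EndsWith(value, StringComparison.Ordinal) },
            { "*=", (name, value) => attrs =>
                attrs.TryGetValue(name, out var v) && v.Contains(value) },
        };

    // Looks up the creator for the symbol and lets it build the matcher.
    public static Matcher Create(string symbol, string name, string value)
    {
        if (Creators.TryGetValue(symbol, out var creator))
        {
            return creator(name, value);
        }

        throw new ArgumentException("Unknown operator: " + symbol, nameof(symbol));
    }
}
```

Supporting a new operator then means adding one dictionary entry instead of growing an if/else chain, which is presumably why a factory keeps such mappings around.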

Cutting it short, this in essence readies the HtmlDocument for us: it carries the basic information regarding the BrowsingContext and the HtmlParserOptions, and it has a proper mapping of what kind of operations are to be performed in the different supported formats.

Once the parser is constructed, parsing a source comes down to these lines:

            var document = CreateDocument(source);
            var parser = new HtmlDomBuilder(document);
            return parser.Parse(_options);

1. CreateDocument:

This does two things: 
(a) var textSource = new TextSource(source);
(b) var document = new HtmlDocument(_context, textSource);

textSource: It's just a stream abstraction to handle encoding and more.
HtmlDocument: Represents a document node that contains only HTML nodes.
It has methods like Clone(), LoadAsync, Get/Set title etc.

2. HtmlDomBuilder: It takes in this HtmlDocument, which is not yet parsed. HtmlDomBuilder essentially constructs the tree (as described here: ). It reads tokens and decides what each one is: an opening tag, a closing tag, a formatting element, plain text, a script etc. How it does this parsing, with a perspective on code architecture, deserves another blog post, which we will surely do.

ParseAsync prefetches tokens, taking size and current position into consideration, and runs as an async job, whereas Parse simply takes the entire document and runs in the foreground, as expected. But both are trying to create the DOM tree, i.e. filling in the HtmlDocument's "elements".
It will be interesting to walk through the Home and Foreign methods; the code takes care of cases such as foreign elements being present or a tag not being closed.
The connection between IElement and INode is intriguing. Please note:
- The Element interface represents an object within a DOM document.
- The Node interface is an interface from which a number of DOM types inherit, and allows these various types to be treated similarly.

And Element implements the Node interface, so it is bound to have the expected members such as ChildNodes.
There are things like shadow roots and pseudo elements that will need attention in the next post too.
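
The Element/Node relationship can be shown with a heavily trimmed sketch. The ISketchNode and ISketchElement interfaces below are hypothetical and carry only a couple of members each; the real DOM interfaces have many more. The point is only that code written against the node interface works for elements and for text nodes alike.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down stand-in for a DOM node interface.
public interface ISketchNode
{
    string NodeName { get; }

    IReadOnlyList<ISketchNode> ChildNodes { get; }
}

// An element is a node with a tag name and attributes on top.
public interface ISketchElement : ISketchNode
{
    string TagName { get; }

    string GetAttribute(string name);
}

// A text node: a node, but not an element.
public sealed class SketchText : ISketchNode
{
    public SketchText(string data) { Data = data; }

    public string Data { get; }

    public string NodeName => "#text";

    // Text nodes never have children.
    public IReadOnlyList<ISketchNode> ChildNodes => Array.Empty<ISketchNode>();
}

public sealed class SketchElement : ISketchElement
{
    private readonly List<ISketchNode> _children = new List<ISketchNode>();
    private readonly Dictionary<string, string> _attributes = new Dictionary<string, string>();

    public SketchElement(string tagName) { TagName = tagName; }

    public string TagName { get; }

    public string NodeName => TagName.ToUpperInvariant();

    public IReadOnlyList<ISketchNode> ChildNodes => _children;

    // Returns null when the attribute is not set.
    public string GetAttribute(string name) =>
        _attributes.TryGetValue(name, out var value) ? value : null;

    public void SetAttribute(string name, string value) { _attributes[name] = value; }

    public void AppendChild(ISketchNode child) { _children.Add(child); }
}

public static class SketchTree
{
    // Needs only ChildNodes, so it accepts any kind of node.
    public static int CountNodes(ISketchNode node)
    {
        var count = 1;

        foreach (var child in node.ChildNodes)
        {
            count += CountNodes(child);
        }

        return count;
    }
}
```

CountNodes never asks whether it got an element or a text node, and that is exactly the "treated similarly" part of the Node description above.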

This is just a quarter of the walk into the code; I would be glad to cover the finer details in the upcoming post.

Hope it's helpful for someone trying to dive deep into the AngleSharp code.
If you have queries, feel free to ping in the comments or via mail, would be happy to discuss.

See you in the next post!
Soon enough!
Till then, keep hacking~


Sunday, 29 January 2017


Meanwhile, my cache went blank. No usage. How likely is a cache to go blank when a particular app isn't used at all?
*me makes a search engine query*

Hmm, so there has been rather a block on me restarting the tech blogs I used to write. Truth is, I wrote many but couldn't finish them, so they sit in drafts. Well, this would be the feeblest realization of writer's block, if any.

Anyway, what am I intending to write here, or will you just end up listening to a person who sounds tech and blabbers? Umm, no.
(Not for this post, at least.)

So, here are quite a few realizations regarding tech I have been having lately.

One is tech at use. I don't know what software engineering has done to me but I end up finding a tech solution to my day-to-day problems.

One be "No-bai".
No, it's not a foreign language phrase, though it may seem like one.
I just want to say "no-bai" as in "no maid". This problem happens when the maid doesn't show up for work and you are unable to keep track of her because, well, mornings are still meant for sleeping. So whether the maid steps in or steps out, you are mostly clueless until your sleeping ears turn sensitive to the adamant sounds of the utensils she turns, overturns and places on some subtle surface so that they make the loudest sound possible, thereby making it evident: "She is working".

So, to solve this problem, I came up with a simple solution: what if I ask her to just tap a button on an app when she enters the kitchen? What will that button do? It clicks her picture and sends it to a server with a timestamp :P
And then a small server-side program with a rather simple interface calculates the number of days she worked.
The only issue was that the app was supposed to be for Android, and this can't be executed until my flatmate gets her new handset :P Even that is risky. I searched for some Raspberry Pi/Arduino kind of thing, but that would be more complicated to handle. I don't want her to leave the job anyway :P

The next thing, umm, is the saddening impact of big data. Whether the consequences show up as some unexpected political victory or as anything else, this is a true sadness prevalent all around. Humans are being taken advantage of for having a psyche type, a personality, an emotion, as it gets recorded in terms of likes or upvotes or shares or comments.
Just think how you began today. What did you do, which apps did you use? Now just imagine the traces of your day and preferences you have left on the web today!
Be it booking an Uber or withdrawing 1500 INR or taking a selfie or ordering food or making a purchase online/offline via debit/credit cards, it's all recorded. In fact, just once, for knowledge's sake, go to Google Settings and find out the data it collects. It even tracks which app you clicked how many times on your Android device, and in what order.
You think this is useless? Let's all just wait to see how these patterns are being monetized.
It is high time, and this calls for action from all of us. At least until we have a just law to govern us all, only awareness can help us.
I really appreciate the way Mozilla is driving the data privacy concerns. Its initiatives are held in high regard.
Have a look here:

Only we can make a change here.
Just think twice before installing an Android app. Check what permissions it asks for.
Data theft from text messages is not a new thing. Elaborating a bit: suppose you give some app access to read your text messages. Now that app can read your whole inbox, the inbox that has all your crucial details along with the bank updates. Would you like to make your bank details available to a third-party app so easily?
Think twice.
The thing is, be it apps or web services, they are not all free. Your data is the new currency.
Stay conscious, stay aware!

PS: The post might be inconsistent or random. Reason: external influences at the time of jotting it down :P But this for sure breaks the pause on this blog. See you in the posts to be rolled out ahead, at a rather more frequent pace. Till then, keep exploring and keep hacking.


A secret love message ~

Hmm, so the trick of putting up a catchy seasonal title worked out, we got you here! Folks, we will be talking about a really cool tech...