Parsing HTML with C# by the Book (DTD)

by Tyler Jensen 11. May 2006 21:14
HTML parser written in C# based on the DTD; free with source code (BSD license).

Recently I had to write an HTML parser for a project I've been working on for some time now. First I tried translating an open source C++ parser but it really wasn't what I wanted and it was also under the GPL. After contacting the author and realizing (or re-remembering) that I could not use a GPL derivative in a commercial library or application, I scrapped that and went back to the source: the official HTML DTD.

Re-remembering how to read a DTD after not having done so for so long was a chore, but the folks at Autistic Cuckoo helped. So I found a very helpful tutorial. I spent the next day or two writing the code in the file you linked below. I took some inspiration from a few files I found while browsing the FireFox code under the Mozilla license. The rest of it came from studying the DTD and trying to figure out a way to encapsulate that in a usable object model.

Here's an example of how to use it:

HtmlDocument doc = new HtmlDocument(url, html);
StringBuilder sb = new StringBuilder();
Collection<HtmlTag> pcdata = doc.GetList(DtdElement.A);
foreach (HtmlTag tag in pcdata)
{
  if (!tag.EndTag)
  {
    Dictionary<string, string> attributes = doc.GetAttributes(tag);
    sb.AppendLine("");
    sb.AppendLine("A: " + doc.ReadSlice(tag.Slice));

    foreach (KeyValuePair<string, string> pair in attributes)
    {
      sb.AppendLine(" " + pair.Key + "=" + pair.Value);
    }
  }
}

I'm releasing it under the BSD license, which I like much more than the GPL as I'm not really a "true" free software zealot. The only think I ask is that if you fix a bug or make an improvement, please share it with me and I'll put up a new version here.  

NetBrick.Net.OpenUtils1.zip (35.31 KB)

Tags:

Code

blog comments powered by Disqus

Me...

Tyler Jensen

Tyler Jensen
.NET Developer and Architect

Month List

Other Stuff