Chapter 18: IE-based HTML-to-PDF Conversion

Contents

18.1 OpenUrl Method: Overview

18.1.1 OpenUrl vs. ImportFromUrl

As of Version 2.9, AspPDF.NET expands its HTML-to-PDF functionality by adding a new method, OpenUrl, to the PdfDocument object. Like the ImportFromUrl method described in Chapter 13, OpenUrl helps convert an arbitrary URL, or an HTML string, to a PDF document, but it does so differently.

While ImportFromUrl relies on our own in-house HTML-rendering engine, OpenUrl delegates the job to Microsoft Internet Explorer by connecting to the IE WebBrowser object residing in the MSHTML.dll library. Harnessing the robust IE engine helps achieve the perfect PDF snapshot of any HTML document, no matter how complex, whereas the ImportFromUrl method is only capable of rendering basic HTML.

However, unlike ImportFromUrl, the OpenUrl method produces a rasterized image of the HTML document. This method's output is essentially a static bitmap convertible to PDF. The scalability and searchability of the original document is not preserved.

18.1.2 OpenUrl Usage

The PdfDocument.OpenUrl method expects four arguments: a URL or HTML string, a list of parameters, and a username and password. The first argument is required, the others are optional. The method returns an instance of the PdfImage object representing the image of the specified URL or HTML string. This image can then be drawn on a PDF page via the PdfCanvas.DrawImage method in an arbitrary position and at an arbitrary scale. By default, the entire web document is converted to a single image. Pagination is covered in the next section.

The following code sample converts our corporate web site http://www.persits.com to a PDF:

PdfManager objPdf = new PdfManager();

// Create empty document
PdfDocument objDoc = objPdf.CreateDocument();

// Add a new page
PdfPage objPage = objDoc.Pages.Add();

// Convert www.persits.com to image
PdfImage objImage = objDoc.OpenUrl("http://www.persits.com/old/index.html");

// Shrink as needed to fit width
float fScale = objPage.Width / objImage.Width;

// Align image top with page top
float fY = objPage.Height - objImage.Height * fScale;

// Draw image
PdfParam objParam = objPdf.CreateParam();
objParam["x"] = 0;
objParam["y"] = fY;
objParam["ScaleX"] = fScale;
objParam["ScaleY"] = fScale;
objPage.Canvas.DrawImage(objImage, objParam);

// Save document, the Save method returns generated file name
string strFilename = objDoc.Save(Server.MapPath("iehtmltopdf.pdf"), false);

lblResult.Text = "Success! Download your PDF file <A HREF=" + strFilename + ">here</A>";
Dim objPdf As PdfManager = new PdfManager()

' Create empty document
Dim objDoc As PdfDocument = objPdf.CreateDocument()

' Add a new page
Dim objPage As PdfPage= objDoc.Pages.Add()

' Convert www.persits.com to image
Dim objImage As PdfImage= objDoc.OpenUrl("http://www.persits.com/old/index.html")

' Shrink as needed to fit width
Dim fScale As Single = objPage.Width / objImage.Width

' Align image top with page top
Dim fY As Single = objPage.Height - objImage.Height * fScale

' Draw image
Dim objParam As PdfParam = objPdf.CreateParam()
objParam("x") = 0
objParam("y") = fY
objParam("ScaleX") = fScale
objParam("ScaleY") = fScale
objPage.Canvas.DrawImage(objImage, objParam)

' Save document, the Save method returns generated file name
Dim strFilename As String = objDoc.Save(Server.MapPath("iehtmltopdf.pdf"), False)

Click the links below to run this code sample:

18.1.3 Authentication

If the URL being opened is protected with Basic authentication, the valid username and password must be passed to the OpenUrl method as the 3rd and 4th arguments, as follows:

PdfImage objImage = objDoc.OpenUrl( url, "", "username", "password" );

Under .NET, the Username and Password arguments can instead be used to pass an authentication cookie in case both the script calling OpenUrl and the URL itself are protected by the same user account under .NET Forms authentication. To pass a cookie to OpenUrl, the cookie name prepended with the prefix "Cookie:" is passed via the Username argument, and the cookie value via the Password argument. The following example illustrates this technique.

Suppose you need to implement a "Click here for a PDF version of this page" feature in a .NET-based web application. The application is protected with .NET Forms Authentication:

<authentication mode="Forms">
  <forms name="MyAuthForm" loginUrl="login.aspx" protection="All">
    <credentials passwordFormat = "SHA1">
      <user name="JSmith" password="13A23E365BFDBA30F788956BC2B8083ADB746CA3"/>
       ... other users
    </credentials>
  </forms>
</authentication>

The page that needs to be converted to PDF, say report.aspx, contains the button "Download PDF version of this report" that invokes another script, say convert.aspx, which calls OpenUrl. Both scripts reside in the same directory under the same protection.

If convert.aspx simply calls objDoc.OpenUrl( "http://localhost/dir/report.aspx", ... ), the page that ends up being converted will be login.aspx and not report.aspx, because AspPDF.NET itself has not been authenticated against the user database and naturally will be forwarded to the login screen.

To solve this problem, we just need to pass the authentication cookie whose name is MyAuthForm (the same as the form name) to OpenUrl. The following code (placed in convert.aspx) demonstrates this technique:

...
string strName = "Cookie:" + Request.Cookies["MyAuthForm"].Name;
string strValue = Request.Cookies["MyAuthForm"].Value;

// Convert URL to image
PdfImage objImage = objDoc.OpenUrl("http://localhost/dir/report.aspx", "", strName, strValue);
...
Dim strName As String = "Cookie:" + Request.Cookies("MyAuthForm").Name
Dim strValue As String = Request.Cookies("MyAuthForm").Value

' Convert URL to image
Dim objImage As PdfImage = objDoc.OpenUrl("http://localhost/dir/report.aspx", "", strName, strValue)

18.1.4 Direct HTML Feed

The first argument to the OpenUrl method can be used to directly pass an HTML string instead of a URL. The string must start with the characters "<HTML" or "<html" to signal that the value is to be treated as an HTML text and not a URL. The non-ASCII characters in the string must be in Unicode format, and not encoded in any way. For example:

string strText = "<html><table border><tr><th>AA</th><th>BB</th></tr><tr><td>X</td><td>Y</td></tr></table></html>";
strText = strText.Replace("AA", "Greek");
strText = strText.Replace("BB", "Chinese");
strText = strText.Replace('X', Convert.ToChar(0x03A9));
strText = strText.Replace('Y', Convert.ToChar(0x56FD));

Set Image = Doc.OpenUrl( strText )
...

The script above produces the following output:

Note that in the direct HTML feed mode, there is no "base" URL by default, so if your HTML string contains images and other objects pointed to via their relative paths, you must also provide the base URL information via the <base> tag, as follows:

string strText = "<html><base href=\"c:\\images\\\">
...
<img src=\"logo.png\">
...
</html>";

18.2 Pagination

18.2.1 PageHeight & AspectRatio Parameters

As mentioned above, the OpenUrl method returns the snapshot of the HTML document as a single continuous image by default. For long HTML documents spanning multiple pages, this default behavior may not be practical as multiple images representing the individual pages of the document are needed instead.

The OpenUrl method is capable of splitting the HTML document's snapshot image into multiple pages. When used in the pagination mode, OpenUrl generates a linked list of images. The method returns an instance of the PdfImage object which represents the top page of the document. The subsequent images are obtained via the PdfImage.NextImage property. This property returns the next PdfImage object in the sequence or Nothing (null) if the current image is the last one in the linked list.

The pixel height of each individual page image can either be specified directly, via the PageHeight parameter, or be computed based on the current document's page width and the desired aspect ratio specified via the AspectRatio parameter. For example, the line

PdfImage objImage = objDoc.OpenUrl( url, "PageHeight=792" );

makes all page images (except possibly the last one) 792 pixels high.

Since the width of the document image wholly depends on the underlying HTML code and is not always known in advance, it is often more practical to specify the page height indirectly, via an aspect ratio that matches the aspect ratio of the PDF page on which this image is ultimately to be drawn. For example, the line

PdfImage objImage = objDoc.OpenUrl( url, "AspectRatio=0.7727" );

makes the aspect ratio of all the page images (except possibly the last one) the same as that of the standard US Letter page (which is 8.5"/11" = 0.7727.) The image height is computed automatically by dividing the document width, whatever it happens to be, by the specified aspect ratio value. When drawing the images on the PDF pages, a scaling factor has to be applied to make the image occupy the entire area of the page. The PageHeight and AspectRatio parameters are mutually exclusive. If both are specified, PageHeight is ignored.

The following code sample converts the URL http://support.persits.com/default.asp?displayall=1 to a multi-page PDF with pagination based on the US Letter aspect ratio:

PdfManager objPdf = new PdfManager();

// Create empty document
PdfDocument objDoc = objPdf.CreateDocument();

PdfParam objParam = objPdf.CreateParam();

// Convert URL to image
PdfImage objImage = objDoc.OpenUrl("http://support.persits.com/default.asp?displayall=1",
  "AspectRatio=0.7727");

// Iterate through all images
while( objImage != null )
{
  // Add a new page
  PdfPage objPage = objDoc.Pages.Add();

  // Compute scale based on image width and page width
  float fScale = objPage.Width / objImage.Width;

  // Draw image
  objParam["x"] = 0;
  objParam["y"] = objPage.Height - objImage.Height * fScale;
  objParam["ScaleX"] = fScale;
  objParam["ScaleY"] = fScale;
  objPage.Canvas.DrawImage( objImage, objParam );

  // Go to next image
  objImage = objImage.NextImage;
}

// Save document, the Save method returns generated file name
string strFilename = objDoc.Save( Server.MapPath("pages.pdf"), false );
Dim objPdf As PdfManager = new PdfManager()

' Create empty document
Dim objDoc As PdfDocument = objPdf.CreateDocument()

Dim objParam As PdfParam = objPdf.CreateParam()

' Convert URL to image
Dim objImage As PdfImage = objDoc.OpenUrl("http://support.persits.com/default.asp?displayall=1", _
  "AspectRatio=0.7727")

' Iterate through all images
While Not objImage Is Nothing
  ' Add a new page
  Dim objPage As PdfPage = objDoc.Pages.Add()

  ' Compute scale based on image width and page width
  Dim fScale As Single = objPage.Width / objImage.Width

  ' Draw image
  objParam("x") = 0
  objParam("y") = objPage.Height - objImage.Height * fScale
  objParam("ScaleX") = fScale
  objParam("ScaleY") = fScale
  objPage.Canvas.DrawImage( objImage, objParam )

  ' Go to next image
  objImage = objImage.NextImage
End While

' Save document, the Save method returns generated file name
Dim strFilename As String = objDoc.Save( Server.MapPath("pages.pdf"), False )

Click the links below to run this code sample:

18.2.2 Hemming

The code sample in the previous subsection produces a paginated PDF document in which the page delimiters often fall on critical content such as text or images, as shown below:

For cleaner cutting, the OpenUrl method can be instructed to push the bottom edge of each page upwards until it meets a relatively blank row of pixels. For the lack of a better term, we dubbed this process "hemming", a word used by tailors. Hemming reduces the height of some or all page images somewhat.

By default, OpenUrl performs no hemming. If the Hem parameter is specified and set to a non-zero value, OpenUrl scans the specified number of pixel rows of each page, starting with the bottom row, looking for a row with the fewest number of pixels deviating from the white background. Once this row is found, it is used as the new page delimiter row, and the next page begins with the row directly below it. If Hem is set to a negative number such as -1, the entire page image is scanned in search for a suitable row. The background color against which the pixels are compared is specified via the HemColor parameter and is usually white. If this parameter is omitted, the predominant color for each row is computed and used as the base color instead.

The image below demonstrates the improvement in pagination if the code sample above is modified by adding the Hem and HemColor parameter, as follows:

PdfImage objImage = objDoc.OpenUrl( "http://support.persits.com/default.asp?displayall=1", "AspectRatio=0.7727; Hem=40; HemColor=white");

18.2.3 Colored Page Breaks

As of Version 3.0, the OpenUrl method is capable of splitting a document into pages along colored horizontal delimiters contained in the document. The parameter PageBreakColor specifies the color of the delimiter. For example, an HTML document may contain the following construct where the page break should be:

<div style="background-color: green; width: 100%; height: 1pt"></div>

This construct appears in the document as a thin green horizontal line. Setting the PageBreakColor parameter to green, will cause OpenUrl to create a page break right before this green line, as follows:

PdfImage objImage = objDoc.OpenUrl( strUrl, "AspectRatio=0.7727; PageBreakColor=green");

When PageBreakColor is specified, OpenUrl scans each page image from the top down looking for a row of pixels of the specified color. The parameter PageBreakThreshold specifies what percentage of pixels in a row must be of the specified color for this row to be considered a page break line. By default, this value is 0.8 which defines the default threshold percentage to be 80%.

18.3 Hyperlinks

The OpenUrl method is capable of preserving the hyperlinks on the HTML document being converted. If the method is called with the parameter Hyperlinks set to True, every image object it generates is populated with the collection of PdfRect objects, each representing a hyperlink depicted on this image. It is your application's responsibility to draw those hyperlinks (in the form of link annotations connected to URL actions) on the PDF pages along with the images themselves. Annotations and actions are described in Chapter 10 - Interactive Features.

As of Version 2.9, the PdfImage object is equipped with the Hyperlinks property which returns a collection of PdfRect objects. The PdfRect object encalsulates the standard properties of a rectange (Left, Bottom, Top, Right, Width, Height) and also a string property, Text. The properties (Left, Bottom) and (Width, Height) return the coordinates of the lower-left corner of the hyperlink relative to the lower-left corner of the image to which this hyperlink belongs, and the hyperlink's dimensions, respectively. The Text property returns the target URL of this hyperlink.

Note that the coordinates of the hyperlinks are provided in the coordinate space of the image (with its origin in the lower-left corner, as in standard PDF practice.) When the image is drawn on a PDF page at a certain location (as specified by the X and Y parameters of the DrawImage method), the hyperlink annotations must be drawn with the same X and Y displacements. Also, if scaling is applied to the image via the ScaleX and ScaleY parameters of the DrawImage method, the same scaling must apply to the hyperlink coordinates and dimensions as well. Failure to adjust the hyperlink coordinates properly will result in a misalignment between the depiction of the hyperlink on the page and the actual clickable hyperlink area.

The following code sample performs hemming (described in the previous section) as well as hyperlink rendering. Clickable hyperlinks on the PDF pages are created with the help of the PdfAnnot and PdfAction objects.

Note that the same scaling is applied to both the image and the coordinates and dimensions of the link annotations. In addition to that, the Y-coordinate shift applied to the image is also applied to the annotations (the X-coordinate shift is 0 in our example.)

...
objHyperlinkParam["x"] = objRect.Left * fScale;
objHyperlinkParam["y"] = objRect.Bottom * fScale + objParam["y"];
objHyperlinkParam["width"] = objRect.Width * fScale;
objHyperlinkParam["height"] = objRect.Height * fScale;
...
PdfManager objPdf = new PdfManager();

// Create empty document
PdfDocument objDoc = objPdf.CreateDocument();

// Parameter object for image drawing
PdfParam objParam = objPdf.CreateParam();

// Parameter object for hyperlink annotation drawing
PdfParam objHyperlinkParam = objPdf.CreateParam();
objHyperlinkParam.Set( "Type = link" );

// Convert URL to image. Enable hyperlinks. Use hemming.
PdfImage objImage = objDoc.OpenUrl( "http://support.persits.com/default.asp?displayall=1",
  "AspectRatio=0.7727; hyperlinks=true; hem=50; hemcolor=white" );

// Iterate through all images
while( objImage != null )
{
  // Add a new page
  PdfPage objPage = objDoc.Pages.Add();

  // Compute scale based on image width and page width
  float fScale = objPage.Width / objImage.Width;

  // Draw image
  objParam["x"] = 0;
  objParam["y"] = objPage.Height - objImage.Height * fScale;
  objParam["ScaleX"] = fScale;
  objParam["ScaleY"] = fScale;
  objPage.Canvas.DrawImage( objImage, objParam );

  // Now draw hyperlinks from the Image.Hyperlinks collection
  foreach( PdfRect objRect in objImage.Hyperlinks )
  {
    objHyperlinkParam["x"] = objRect.Left * fScale;
    // Y-coordinate must be lifted by the same amount as the image itself
    objHyperlinkParam["y"] = objRect.Bottom * fScale + objParam["y"];
    objHyperlinkParam["width"] = objRect.Width * fScale;
    objHyperlinkParam["height"] = objRect.Height * fScale;
    objHyperlinkParam["border"] = 0;

    // Create link annotation
    PdfAnnot objAnnot = objPage.Annots.Add("", objHyperlinkParam);
    objAnnot.SetAction( objDoc.CreateAction("type=URI", objRect.Text) );
  }

  // Go to next image
  objImage = objImage.NextImage;
}

// Save document, the Save method returns generated file name string strFilename = objDoc.Save( Server.MapPath("hyperlinks.pdf"), false );
Dim objPdf As PdfManager = New PdfManager()

' Create empty document
Dim objDoc As PdfDocument = objPdf.CreateDocument()

' Parameter object for image drawing
Dim objParam As PdfParam = objPdf.CreateParam()

' Parameter object for hyperlink annotation drawing
Dim objHyperlinkParam As PdfParam = objPdf.CreateParam()
objHyperlinkParam.Set("Type = link")

' Convert URL to image. Enable hyperlinks. Use hemming.
Dim objImage As PdfImage = objDoc.OpenUrl("http://support.persits.com/default.asp?displayall=1", _
  "AspectRatio=0.7727; hyperlinks=true; hem=50; hemcolor=white")

' Iterate through all images
While Not objImage Is Nothing
  ' Add a new page
  Dim objPage As PdfPage = objDoc.Pages.Add()

  ' Compute scale based on image width and page width
  Dim fScale As Single = objPage.Width / objImage.Width

  ' Draw image
  objParam("x") = 0
  objParam("y") = objPage.Height - objImage.Height * fScale
  objParam("ScaleX") = fScale
  objParam("ScaleY") = fScale
  objPage.Canvas.DrawImage(objImage, objParam)

  ' Now draw hyperlinks from the Image.Hyperlinks collection
  For Each objRect As PdfRect In objImage.Hyperlinks
    objHyperlinkParam("x") = objRect.Left * fScale
    ' Y-coordinate must be lifted by the same amount as the image itself
    objHyperlinkParam("y") = objRect.Bottom * fScale + objParam("y")
    objHyperlinkParam("width") = objRect.Width * fScale
    objHyperlinkParam("height") = objRect.Height * fScale
    objHyperlinkParam("border") = 0

    ' Create link annotation
    Dim objAnnot As PdfAnnot = objPage.Annots.Add("", objHyperlinkParam)
    objAnnot.SetAction(objDoc.CreateAction("type=URI", objRect.Text))
  Next

  ' Go to next image
  objImage = objImage.NextImage
End While

' Save document, the Save method returns generated file name Dim strFilename As String = objDoc.Save(Server.MapPath("hyperlinks.pdf"), False)

Click the links below to run this code sample:

18.4 Miscellaneous Parameters

18.4.1 Allowed Content

By default, the OpenUrl method instructs Internet Explorer to only display images and videos when rendering the HTML document. The method supports six parameter, all optional, that control the type of content IE is allowed to load. These parameters are:

  • Images (True by default) - instructs IE to load images;
  • Video (True by default) - instructs IE to load video;
  • DownloadActiveX (False by default) - instructs IE to download ActiveX controls;
  • RunActiveX (False by default) - instructs IE to run ActiveX controls;
  • Java (False by default) - instructs IE to enable Java applets;
  • Scripts (False by default) - instructs IE to enable scripts.

18.4.2 Threading

OpenUrl creates a separate thread of execution to communicate with IE. Failure to create a separate thread would cause the method to hang when used under ASP.NET (but not under classic ASP.) While it is recommended that the separate-thread mode be used under both ASP and ASP.NET, there is still a way to enable the single-thread mode by setting the parameter SingleThread to True. The single-thread mode of operation can only be used in classic ASP and stand-alone environments but not in ASP.NET.

18.4.3 Internal Window Dimensions

OpenUrl creates an internal window to hold the WebBrowser control. The default dimensions of this window is 100x100. In most cases, the window is resized automatically to accommodate the HTML document. However, if the specified URL is a frameset, the window may retain its original size which is usually not large enough for the HTML content to fit. The parameters WindowWidth and WindowHeight enable you to specify a window size large enough to accommodate your frameset.

18.5 IE Compatibility Mode

In most cases, the IE rendering engine runs under a "compatibility mode" by default. Sometimes this causes serious rendering issues -- the output generated by the OpenUrl method looks considerably different than what is displayed by a browser.

To switch the IE rendering engine to the "regular" mode, the following simple change in the registry is needed:

Under the key

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Internet Explorer\MAIN\FeatureControl\FEATURE_BROWSER_EMULATION

and, on a 64-bit server, under the key

HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Internet Explorer\MAIN\FeatureControl\FEATURE_BROWSER_EMULATION

the following DWORD entry must be added for IIS:

Name=w3wp.exe, Value=9000 (decimal), as follows:

IIS has to be reset (iisreset command at the command prompt) for the change to take effect.