XHTML2RTF: An HTML to RTF conversion tool based on XSL

Gauloran

This article describes a conversion tool which takes an HTML document as input and generates a Microsoft Word document for printing.

https://www.codeproject.com/KB/HTML/XHTML2RTF/XHTML2RTF_src.zip

Overview
It all started when I had to work on a new information system with hundreds of computers. We decided to go for a 100% web-based application. Everything was fine until we had to print official documents from the application...

Although standardization efforts are in progress (both at the W3C with XHTML-PRINT and at the IEEE with the Print Working Group), and although there are some good tools for printing HTML (HTML Print from Bersoft, ScriptX from MeadCo), none of these seemed to address my needs. I wanted to keep my Web-based application and reuse the generated HTML to feed a printer...

Have you tried to print HTML documents? Have you tried to format your HTML documents for printing, with specific fonts, sizes, headers, footers, and margins?

If you have, then you know that HTML is not well suited to printing. You can, however, use tools that convert HTML documents into a Microsoft Word format, which is suitable for printing... And this is what this article is about.

Contents
Features
Introduction
Usage
Samples (documents and code)
Implementation
To do list
References
Features
The XHTML2RTF conversion tool:

Converts XHTML documents into RTF documents.
Generated RTF can be previewed and printed by Microsoft Word (commercialware) and Word Viewer (freeware).
Uses an XSL style sheet and Microsoft XML SDK 3.0.
Runs on Windows XP and Windows 2000 Server (and probably others).
Can be plugged into Web-based (ASP) or Batch (WSH) applications.
Is highly extensible and customizable - new tags can be supported easily, and direct RTF commands can be sent to the output (with no rendering in the HTML flow) with the <xhtml2rtf:raw> tag.
Supports RTF-specific fields like page numbering and total number of pages via <xhtml2rtf:page_number> and <xhtml2rtf:total_number_of_pages> tags.
Introduction
The XHTML2RTF conversion tool uses an XSL style sheet to convert an XHTML document into an RTF document, suitable for previewing and printing with Word (or Word Viewer).

XHTML = HTML + XML
The Extensible HyperText Markup Language (XHTML) is a family of current and future document types and modules that reproduce, subset, and extend HTML, reformulated in XML. XHTML family document types are all XML-based, and ultimately are designed to work in conjunction with XML-based user agents. XHTML is the successor of HTML, and a series of specifications has been developed for XHTML.

The XHTML2RTF conversion tool reads XHTML documents as input. As a consequence, you have to adapt your application so that it generates XHTML before you can use this tool.

XSL
XSL stands for eXtensible Stylesheet Language. It is a family of recommendations for defining XML document transformation and presentation. It consists of three parts:

A programming language for transforming XML documents: XSL Transformations (XSLT).
An expression language used by XSLT to access or refer to parts of an XML document: XML Path Language (XPath). XSLT uses XPath expressions in pattern matching (xsl:template match), conditional statements (xsl:when test), loops (xsl:for-each), etc...
An XML vocabulary for specifying formatting semantics: similar to W3C cascading style sheets (CSS), this vocabulary provides enhanced presentation features.
For more about XSL, please refer to the XSL reference pages.

The XHTML2RTF conversion tool uses XSL to transform XHTML documents (XML documents) into RTF documents. This is the core of the tool - anything else is just glue to build your application. Everything is in the XSL style sheet.

Microsoft XML SDK 3.0
Microsoft provides an XML SDK for processing XML and XSL documents. It's often installed with the operating system, but you can download and install the latest SDK. See the References section for more on the MSXML SDK.

The XHTML2RTF conversion tool uses XML SDK objects and methods to process XHTML and transform it into RTF. The XML SDK API is available to Web applications as well as batch applications, and so is the XHTML2RTF conversion tool.
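
As a rough illustration of the kind of MSXML calls involved, the sketch below loads a style sheet and an XHTML file and runs the transformation. It is not the tool's actual script, which this article does not show. The function name transformXhtmlToRtf and the createObject factory argument are hypothetical; the ProgIDs and the XSLTemplate/processor calls are standard MSXML 3.0. The code is written in ES3 style so that it could run as JScript under Windows Script Host.

```javascript
// Sketch only: NOT the script shipped with XHTML2RTF.
// createObject is a hypothetical factory; under WSH it would wrap ActiveXObject.
function transformXhtmlToRtf(createObject, xhtmlPath, xslPath, params) {
  // Load the XSL style sheet. XSLTemplate requires a free-threaded DOM.
  var xsl = createObject("Msxml2.FreeThreadedDOMDocument.3.0");
  xsl.async = false;
  if (!xsl.load(xslPath)) {
    throw new Error("Cannot load style sheet: " + xslPath);
  }

  // Load the XHTML input. load() returns false if the XML is not well-formed.
  var xml = createObject("Msxml2.DOMDocument.3.0");
  xml.async = false;
  if (!xml.load(xhtmlPath)) {
    throw new Error("Cannot load XHTML: " + xhtmlPath);
  }

  // Compile the style sheet and create a processor bound to the input.
  var template = createObject("Msxml2.XSLTemplate.3.0");
  template.stylesheet = xsl;
  var processor = template.createProcessor();
  processor.input = xml;

  // Override style sheet parameters (for example page-start-number).
  for (var name in params) {
    if (Object.prototype.hasOwnProperty.call(params, name)) {
      processor.addParameter(name, params[name]);
    }
  }

  processor.transform();
  return processor.output; // the generated RTF text
}
```

Under WSH, createObject could be `function (id) { return new ActiveXObject(id); }`, and the returned string could be written with WScript.StdOut.Write. Passing the factory in as an argument keeps the function free of any direct dependency on ActiveX.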

Microsoft Rich Text Format (RTF)
Microsoft created an exchange format for Word documents: Rich Text Format (RTF). Unlike the native Word format, it is documented; moreover, RTF has been around for some time (so you can view RTF documents with good old Word 97). There is also a free RTF viewer (Word 97/2000 Viewer), and even WordPad (installed with most Windows releases) can open, view and edit RTF documents.

XHTML2RTF
The XHTML to RTF converter consists of an XSL style sheet for parsing XHTML tags and generating their RTF equivalents.

Usage
From HTML to XHTML
You have to adapt your application to generate XHTML documents if you want to use the XHTML2RTF conversion tool:

Include an XML declaration at the beginning of the document

Code:
<?xml version="1.0" encoding="iso-8859-1" ?>

Include the XHTML namespace declaration (the default) and the XHTML2RTF namespace declaration in the <html> tag:

Code:
<html
   xmlns="http://www.w3.org/1999/xhtml"
   xmlns:xhtml2rtf="http://www.lutecia.info/download/xmlns/xhtml2rtf">
...
</html>

Use lower case for both tag names and attribute names:
<P></P> becomes <p></p>
<A HREF="...">...</A> becomes <a href="...">...</a>
etc...
Add termination for all tags (XHTML is more strict than HTML):
<link rel="stylesheet" href="..."> becomes <link rel="stylesheet" href="..." />
<hr> becomes <hr />
<br> becomes <br />
Quote all attribute values:
<table class=noprint> becomes <table class="noprint">
<a href=mypage.asp> becomes <a href="mypage.asp">
Use encoded characters for non-ASCII and/or special characters:
& becomes &amp;
é becomes &#233;
è becomes &#232; etc...
Replace HTML character entities by their code (XML knows very few character entity references - use character codes instead):
&nbsp; becomes &#160;
&egrave; becomes &#232;
&eacute; becomes &#233;
&ecirc; becomes &#234; etc...
Do not use direct style for tags (use class and an external CSS style sheet instead)
<div style='background:#c0c0c0; font-size: 125%; padding:1.0pt 10.0pt 1.0pt 10.0pt;'>
becomes <div class="mydivstyle">.
Thus, you will be able to customize the RTF output for your class (it's too hard to parse an HTML style declaration within an XSL style sheet).
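
The entity rule above can be automated. Here is a minimal sketch, assuming a hypothetical helper named toNumericReferences (it is not part of XHTML2RTF) and a deliberately small entity table that you would extend for your own content. The five entities predefined by XML are left untouched.

```javascript
// Illustrative subset of HTML named entities; extend as needed.
const NAMED_ENTITIES = new Map([
  ["nbsp", 160],
  ["agrave", 224],
  ["ccedil", 231],
  ["egrave", 232],
  ["eacute", 233],
  ["ecirc", 234],
  ["euro", 8364],
]);

// Entities that XML itself predefines; an XML parser already understands them.
const XML_PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

// Replace known HTML named entities with numeric character references.
// Unknown names are left as-is so that they can be spotted and fixed by hand.
function toNumericReferences(html) {
  return html.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (XML_PREDEFINED.has(name)) return match;
    const code = NAMED_ENTITIES.get(name);
    return code === undefined ? match : `&#${code};`;
  });
}

console.log(toNumericReferences("caf&eacute;&nbsp;&amp; cr&egrave;me"));
```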

Spaces in HTML and RTF
In HTML, spaces are mostly not significant - browsers collapse runs of them when they render the document. Microsoft Word (and RTF), on the other hand, renders every space as a visible character. Be careful when building your HTML document: do not generate extra spaces, or they will show up in your Word document.
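
One way to avoid stray spaces is to normalize text before it reaches the converter. The sketch below shows two behaviours that match the descriptions of the normalize-space and my-normalize-space parameters listed later: collapse and trim, or collapse only. The function names are hypothetical, and this is not the style sheet's own code.

```javascript
// Whitespace as defined by XML: space, tab, carriage return, line feed.
const XML_WHITESPACE = /[ \t\r\n]+/g;

// Collapse runs of whitespace to one space and trim both ends,
// which is what the XPath normalize-space() function does.
function normalizeSpace(text) {
  return text.replace(XML_WHITESPACE, " ").replace(/^ | $/g, "");
}

// Collapse runs of whitespace to one space but keep a single leading or
// trailing space, so that adjacent inline elements do not run together.
function collapseSpace(text) {
  return text.replace(XML_WHITESPACE, " ");
}

console.log(JSON.stringify(normalizeSpace("  Hello,\n   World!  ")));
console.log(JSON.stringify(collapseSpace("  Hello,\n   World!  ")));
```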

Header and footer in HTML and RTF
The default header in the RTF document contains the HTML <title> (from the <head> section). You can change the header by setting the parameters header-font-size-default, header-distance-from-edge, and header-indentation-left (see parameters below). You can also create your own header by using the classes "rtf_header" and "rtf_header_first" in your HTML document:

rtf_header_first defines a complete HTML content for the header on the first page of the document
rtf_header defines a complete HTML content for the header on all other pages of the document
The default footer in the RTF document contains the page number and the document date (current date and time; i.e. print date and time). You can change the footer by setting the parameters footer-font-size-default, footer-distance-from-edge and use-default-footer (see parameters below).

Page break
To force a page break in your RTF document, you can use the CSS style "page-break-before" or "page-break-after" with the value "always":

Code:
This is on page 1
<p style="page-break-before:always"/>
This is on page 2

Note that other values for these CSS styles (left, right, auto...) are not supported (only "always" is supported).

XSL style sheet parameters
The XSL style sheet xhtml2rtf.xsl provides a set of parameters so that you can change the stylesheet's default behavior:

page-start-number: Page start number (default: 1)
page-setup-paper-width: Paper width in TWIPS (default: 11907 TWIPS = 21 cm, i.e. A4 format)
page-setup-paper-height: Paper height in TWIPS (default: 16840 TWIPS = 29.7 cm, i.e. A4 format)
page-setup-margin-top: Top margin in TWIPS (default: 1440 TWIPS = 1 inch = 2.54 cm)
page-setup-margin-bottom: Bottom margin in TWIPS (default: 1440 TWIPS = 1 inch = 2.54 cm)
page-setup-margin-left: Left margin in TWIPS (default: 1134 TWIPS = 2 cm)
page-setup-margin-right: Right margin in TWIPS (default: 1134 TWIPS = 2 cm)
font-size-default: Default font size in half-points, the unit RTF uses for font sizes (default: 20 half-points = 10 pt.)
font-name-default: Default font name (default: 'Times New Roman')
font-name-fixed: Default font name for fixed-width text, like PRE or CODE (default: 'Courier New')
font-name-barcode: Barcode font name (default: '3 of 9 Barcode')
header-font-size-default: Header default font size in half-points (default: 14 half-points = 7 pt.)
header-distance-from-edge: Default distance between top of page and top of header, in TWIPS (default: 720 TWIPS = 1.27 cm)
header-indentation-left: Header left indentation in TWIPS (default: 0)
footer-font-size-default: Footer default font size in half-points (default: 14 half-points = 7 pt.)
footer-distance-from-edge: Default distance between bottom of page and bottom of footer, in TWIPS (default: 720 TWIPS = 1.27 cm)
use-default-footer: Boolean flag: 1 to use default footer (page number and date) or 0 no footer (default: 1)
document-protected: Boolean flag: 1 protected (cannot be modified) or 0 unprotected (default: 1)
normalize-space: Boolean flag: 1 spaces are normalized and trimmed, or 0 no normalization no trim (default: 0)
my-normalize-space: Boolean flag: 1 spaces are normalized (not trimmed), or 0 no normalization (default: 1)
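
Most of the page setup values above are in TWIPS (1 inch = 1440 TWIPS, 1 point = 20 TWIPS), while the font size defaults follow the pattern 20 = 10 pt, which corresponds to RTF half-points. A small conversion sketch, with hypothetical helper names, can make it easier to compute your own values:

```javascript
const TWIPS_PER_INCH = 1440;
const CM_PER_INCH = 2.54;

// Length conversions, rounded to whole TWIPS as RTF expects integers.
function inchesToTwips(inches) {
  return Math.round(inches * TWIPS_PER_INCH);
}

function cmToTwips(cm) {
  return Math.round((cm / CM_PER_INCH) * TWIPS_PER_INCH);
}

function twipsToCm(twips) {
  return (twips / TWIPS_PER_INCH) * CM_PER_INCH;
}

// RTF font sizes are expressed in half-points.
function pointsToHalfPoints(points) {
  return Math.round(points * 2);
}

console.log(cmToTwips(2));                 // left/right margin of 2 cm
console.log(twipsToCm(16840).toFixed(1));  // A4 paper height in cm
console.log(pointsToHalfPoints(10));       // default font size
```

Rounding explains small differences between tools: a computed value can differ by a TWIP or so from the constants that Word writes for the same paper size.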

Batch mode (WSH)
I wrote a batch program (XHTML2RTF.BAT) which relies on Windows Script Host (WSH) to call the XML DOM SDK and transform an HTML file into its RTF equivalent (the output is written to stdout).

To use this component from batch, call the program XHTML2RTF.BAT with the HTML file name as parameter. The RTF is written to stdout, so you should redirect the output to a file with the ">" operator. Then you can open the generated file with Microsoft Word (or WordPad):

Code:
C:\> XHTML2RTF.BAT Readme.htm > Readme.rtf
C:\> START WINWORD Readme.rtf

To pass parameters to the XHTML2RTF program, use the -p flag followed by the parameter name and value.

For example:

Code:
C:\> XHTML2RTF.BAT -p page-start-number=5 -p document-protected=0 ^
              -p font-name-default='Arial' Readme.htm > Readme.rtf
C:\> START WINWORD Readme.rtf
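
To make the -p convention concrete, here is a minimal sketch of how such arguments could be split into style sheet parameters and an input file name. The parseArgs function is hypothetical and only illustrates the command-line shape shown above; it is not the parsing code used by XHTML2RTF.BAT.

```javascript
// Split a list of command-line tokens into { params, inputFile }.
// Each "-p" must be followed by a "name=value" token; any other token
// is treated as the input file name (the last one wins).
function parseArgs(argv) {
  const params = {};
  let inputFile = null;
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === "-p") {
      const pair = argv[++i];
      if (pair === undefined || !pair.includes("=")) {
        throw new Error("-p expects name=value");
      }
      const eq = pair.indexOf("=");
      params[pair.slice(0, eq)] = pair.slice(eq + 1);
    } else {
      inputFile = argv[i];
    }
  }
  return { params, inputFile };
}

console.log(parseArgs(["-p", "page-start-number=5", "Readme.htm"]));
```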


Web-based (ASP)
I wrote a simple ASP library to call the component from an ASP page, producing an RTF document from live, dynamic content (results from a SQL database request, for example).

To use this component from a web page, you have to include the XHTML2RTF.inc file in your page, and call the function XHTMLString2RTF(), passing the XHTML document (as a string):

Code:
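<%@ Language="JScript" %>
<!--#include file="XHTML2RTF.inc"-->
<%
// Build a small XHTML document as a string (demonstration only).
// Every line inside the literal ends with a backslash to continue the string.
var strXHTML = " \
<html xmlns=\"http://www.w3.org/1999/xhtml\" \
      xmlns:xhtml2rtf=\"http://www.lutecia.info/download/xmlns/xhtml2rtf\"> \
  <head> \
    <title>Hello, World! from string</title> \
  </head> \
  <body> \
    <h1>Hello, World!</h1> \
  </body> \
</html> \
";
// Convert the XHTML string to RTF with the function from XHTML2RTF.inc.
XHTMLString2RTF(strXHTML);
%>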

Note: The real production system uses SQL requests, generates XML output, transforms it into XHTML via a first XSL style sheet, and then transforms it into an RTF document. The example above is just that - an example for demonstration purposes. Please do not generate HTML via strings on your production system ;-)

Raw RTF output
The XHTML2RTF conversion tool provides a direct RTF output with no rendering in XHTML. The tool processes a special tag (<xhtml2rtf:raw>) to send the RTF directly. For example, this code will send a TAB character in the RTF output: <xhtml2rtf:raw class="rtf">\tab </xhtml2rtf:raw>. This code will not be rendered in your Web browser, since the class "rtf" is defined in the CSS style sheet as "display:none".

There are many uses for this raw output - in particular, you can work around most of the current limitations in the conversion tool (as listed in the TODO section). For example, you can send the RTF code for an image, even if the conversion tool doesn't handle images yet:

Code:
<xhtml2rtf:raw class="rtf">
 {\*\shppict{\pict\picw3043\pich3043\picwgoal1725\pichgoal1725\pngblip
  89504e470d0a1a0a0000000d49484452000000730000007308020000002421
  aab1000000017352474200aece1ce90000000467414d410000b18f0bfc61050000
  ...
 }}
</xhtml2rtf:raw>

To find out what RTF code is appropriate for this image, I just used Word to edit a document with a picture, and then saved it in the RTF format. I opened the resulting file as text, and copied/pasted the RTF code into the XHTML output, within the <xhtml2rtf:raw> tags.
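
If the raw RTF is built from literal text rather than pasted from Word, that text has to be escaped first, because backslash and braces are RTF control characters. A minimal sketch, assuming a hypothetical helper named rtfEscape (not part of XHTML2RTF), which writes non-ASCII characters as \uN escapes with a "?" fallback character:

```javascript
// Escape plain text so that it can be embedded in an RTF stream.
function rtfEscape(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    const code = text.charCodeAt(i); // UTF-16 code unit
    if (ch === "\\" || ch === "{" || ch === "}") {
      // RTF control characters are escaped with a backslash.
      out += "\\" + ch;
    } else if (code > 127) {
      // \uN takes a signed 16-bit decimal value, followed by a fallback char.
      const signed = code > 32767 ? code - 65536 : code;
      out += "\\u" + signed + "?";
    } else {
      out += ch;
    }
  }
  return out;
}

console.log(rtfEscape("Prix {total}: 10 \u20AC"));
```

Since the result is placed inside an XHTML document, characters such as & and < still need the usual XML escaping on top of this.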

RTF-specific fields
Some RTF-specific fields are available in the conversion tool.

Page number
You can display the current page number in your RTF document via <xhtml2rtf:page_number>:

Code:
PAGE <xhtml2rtf:page_number/>

Total number of pages
You can display the total number of pages in your RTF document via <xhtml2rtf:total_number_of_pages>:

Code:
PAGE <xhtml2rtf:page_number/> / <xhtml2rtf:total_number_of_pages/>
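
For reference, Word stores page numbers in RTF as fields: PAGE for the current page and NUMPAGES for the total. The sketch below builds such field groups as strings. It assumes, without confirmation from the style sheet source, that the two tags above map to these fields; the rtfField helper is hypothetical.

```javascript
// Build an RTF field group. The placeholder is the cached result that is
// shown until Word recalculates the field.
function rtfField(instruction, placeholder) {
  return "{\\field{\\*\\fldinst " + instruction + "}{\\fldrslt " + placeholder + "}}";
}

// "PAGE x / y" as it might appear in a footer.
const pageOfPages =
  "PAGE " + rtfField("PAGE", "1") + " / " + rtfField("NUMPAGES", "1");

console.log(pageOfPages);
```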

Samples

Hello, World! (HTML and RTF) https://www.codeproject.com/KB/HTML/XHTML2RTF/HelloWorld.zip
Custom Header, two pages (HTML and RTF) https://www.codeproject.com/KB/HTML/XHTML2RTF/CustomHeader.zip
No Footer (HTML and RTF) https://www.codeproject.com/KB/HTML/XHTML2RTF/NoFooter.zip
Table (HTML and RTF) https://www.codeproject.com/KB/HTML/XHTML2RTF/SimpleTable.zip
The Readme file you're reading in RTF https://www.codeproject.com/KB/HTML/XHTML2RTF/Readme.zip

Implementation

The XHTML to RTF converter consists of an XSL style sheet for parsing XHTML tags and generating their RTF equivalents.

To do list
Full support for XHTML tags <ul>, <li>, <ol> (not fully supported)
Full support for XHTML tags <table>, <tr>, <td> (not fully supported)
Support XHTML objects (<object>), images (<img>), and applets (<applet>) (not supported yet)
Support the XHTML title attribute with RTF annotations (bugs in the current version)
Support XHTML hyphen and soft hyphen characters
Support XHTML INS and DEL elements
Support XHTML lists (<ul>, <ol>, <li>, <dl>, <dt>, <dd>)- unordered, ordered, and definition lists
Support XHTML DIR and MENU elements (deprecated)
Support XHTML table captions: The CAPTION element
Support XHTML row groups: the THEAD, TFOOT, and TBODY elements
Support XHTML column groups: the COLGROUP and COL elements
Support XHTML STYLE element
Support XHTML font color attribute (even if deprecated)
Support another popular format for printing: Adobe's PDF format (tough one)
 