Docbook notes

From Public wiki of Kevin P. Inscoe

Executive summary

What is DocBook?

DocBook is an open standard for creating technical documentation in XML. The documentation for Spring and for Hibernate, for example, is generated from DocBook. Particularly suited to computer-related content, DocBook is a set of XML tags, defined by a Document Type Definition (DTD) and an XML schema, for technical content. In addition to the DTD, DocBook and other open source projects supply a collection of tools and frameworks that enable developers to transform DocBook-compliant XML into PDF, HTML, Eclipse Help, and even man pages. This removes the need to write the same material multiple times or manually convert from one format to another.

No big deal, you say. I can write HTML, Eclipse Help, and PDF if I have to. Why is DocBook different? In the DocBook model, you write the raw documentation once in XML and then transform or "compile" it to the desired target formats using DocBook tools, a paradigm familiar to developers, who already code and compile. So DocBook is ideal for developers who have to write documentation, not only because it is targeted at technical content but also because DocBook files, like any other XML files, can be managed and edited within an IDE. No longer is documentation someone else's problem; it's right there next to your code. This encourages you to keep the software documentation in sync as you develop.



Several Linux distributions, such as Fedora and Red Hat, use [[3]] to "brand" and otherwise bind together DocBook sources as one cohesive-looking collection. I shall use this going forward.

Why Docbook

From John Simpson


On 2010-01-19, at 1629, tom foster wrote:
> Just out of curiosity, why not stick with latex?  There's
> nothing wrong with docbook, of course.  Is it that you just
> prefer one way of marking up text to another?

hrmmm... that's actually a good question. at first it was a case of trying both, just so that when/if i did decide on one or the other, it was a more informed decision rather than just a knee-jerk thing.

what i've found so far...

  • TeX and LaTeX seem to be geared heavily towards academic and technical documents. docbook seems to be more of a generic solution.
  • TeX and LaTeX have a way to write out mathematical equations using a form of plain-text notation. for example, "${x = \frac{{ - b \pm \sqrt {b^2 - 4ac} }}{{2a}}}$" renders as the quadratic formula. docbook doesn't have anything like this, but then it's rare that i need to render mathematical formulas like this.
  • the TeX/LaTeX language itself is like nothing else i've ever seen. it took me a while to get used to its structure, and i'm still not really that comfortable with it. it's not even consistent within itself- for example, the notations for "{\bf bold} text" and "\emph{italic} text" are structurally different, which just seems kinda weird to me.
  • LaTeX seems to be the only language with anything like the "\marginpar" environment.
  • the only halfway-usable free editor for LaTeX is lyx, and when i did spend an hour playing with it, i couldn't really get my head wrapped around how it worked. a google search found four or five free editors, either for docbook or for XML in general (and one of those was lyx.) i played with several of them, and for now i've settled on "XMLmind XML Editor Personal Edition", which has native versions for windows and mac osx, as well as a java version which will run on any machine (including linux) with a java runtime environment. the free version doesn't have built-in commands to render the docbook xml into other formats, but i'm okay with that- the truth is i would rather edit a Makefile and see what commands are actually involved in building a document.

  • docbook appears to have more active development going on than LaTeX does, but i could just be missing the bulk of the LaTeX community.
  • the docbook language is standard XML, which is much easier for me (or anybody who knows HTML) to understand. the structure is much more consistent, all of the tags are structured the same way.
  • over the past few years, i've gone from straight HTML to combining HTML with CSS. i've gotten used to using stylesheets to separate the content from the presentation. docbook seems to be a lot closer to the ideal of separating content from presentation than LaTeX is. it even uses stylesheets (even if they are written in xslt instead of CSS.)
  • docbook has a transformation which builds a set of linked HTML pages from a single document, complete with navigation links between the pages and tables of contents at various levels within the pages. the closest i've seen for LaTeX is that it can export a single HTML page of the entire document. if you ever see a web page structured like that, you can tell the original source was a docbook document.
  • docbook has a working transform to build an ePub file, and after building a Makefile to create both HTML and PDF output from a single docbook file, it took me less than ten minutes to figure out how to add ePub output to that Makefile. of course, part of that is because i've been playing with ePub files recently- the docbook stylesheet produces all but one of the necessary files, and i was able to write Makefile instructions to create that one file (called "mimetype", contents "application/epub+zip") and then to zip things up in the right order (the "mimetype" file must be first, and it must be stored with zero compression.) not a big deal, "xsltproc" did all of the content-related work for me, and the rest was simple:
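# NAME (the document's base name) is assumed to be set earlier in the Makefile
IMAGES=fig1.png fig2.png

$(NAME).valid: $(NAME).xml
	xmllint --valid --noout $(NAME).xml && touch $(NAME).valid

# printf rather than echo, so the file holds no trailing newline
mimetype:
	printf 'application/epub+zip' > $@

# mimetype is added to the zip first, with zero compression
$(NAME).epub: $(NAME).valid $(NAME)-epub.xsl $(NAME).xml mimetype
	rm -rf META-INF OEBPS
	xsltproc $(NAME)-epub.xsl $(NAME).xml
	cp $(IMAGES) OEBPS/
	rm -f $@
	zip -X0 $@ mimetype
	zip -Xr9D $@ META-INF OEBPS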

and for the curious, $(NAME)-epub.xsl is a wrapper which includes the normal stylesheet, then modifies a few variables to customize the appearance of the final result. it looks like this:

<?xml version="1.0" encoding="ASCII"?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
<xsl:import href='/usr/local/share/xml/xsl/docbook-xsl/epub/docbook.xsl' />
<xsl:param name='chapter.autolabel'    select='1' />
<xsl:param name='generate.chapter.toc' select='0' />
<xsl:param name='section.autolabel'    select='1' />
</xsl:stylesheet>

in general, LaTeX seems to have more packages for doing custom formatting within a document, but it's definitely geared more towards producing a single document, especially if it's an academic or technical document.

docbook seems to have more options for the overall format of the final presentation, and it seems to be suited for a wider assortment of types of documents. it also has a shallower learning curve, at least for me.

so while i haven't fully decided which direction i'm going to go with, i'm definitely leaning towards docbook- not only for this project, but because i see it being useful for future projects as well.

a docbook file is just an XML file, which is like HTML only using different tags, and a lot more strict about those tags balancing out correctly.

the point of docbook is that you create the initial document in XML, and then run it through one or more processing programs to convert it to some other format. one of the biggest strengths is that you can use different processing chains on the same input file, and produce multiple types of output, from the same input file. for example, you can convert the same input file to a PDF file, a single-page HTML document, a multi-page set of HTML documents, or an epub file suitable for use on one of those ebook reader devices/programs.

for example, let's say you have "mybook.xml", written in docbook.

you would normally start off by running "xmllint" on it, to make sure it's valid XML. if it's not, you need to fix it before going any further.

$ xmllint --valid --noout mybook.xml

(if you get no output, it means there were no errors and you're good.)
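As a rough cross-check that needs no external tools, the well-formedness half of that test can be sketched in Python. This is only an illustration: "is_well_formed" is a made-up helper name, the sample strings are invented, and the standard library parser does not validate against the DocBook DTD the way "xmllint --valid" does.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message).

    Hypothetical helper: this checks well-formedness only (balanced,
    properly nested tags), not validity against the DocBook DTD.
    """
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


# Invented sample input: the second string never closes its <para>.
good = "<book><title>My Book</title><chapter><para>Hello.</para></chapter></book>"
bad = "<book><title>My Book</title><chapter><para>Hello.</chapter></book>"

print(is_well_formed(good)[0])
print(is_well_formed(bad)[0])
```

A file that fails here will certainly fail xmllint as well; a file that passes may still be invalid DocBook.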

the "xsltproc" program applies an xsl stylesheet against your xml input file, transforming it into some other format (like XHTML.) the trick (for me at least) is knowing which stylesheet to use, as well as possibly customizing the stylesheets.

for example, to convert your docbook input to XHTML, you might run...

$ mkdir html1
$ xsltproc -o html1/ /usr/share/docbook-xsl/xhtml/onechunk.xsl mybook.xml

this will write an "index.html" file within the html1 directory.

or to make a set of XHTML files with working links between them, you might run...

$ mkdir html2
$ xsltproc -o html2/ /usr/share/docbook-xsl/xhtml/chunk.xsl mybook.xml

this creates a whole set of XHTML files, with index.html, in the html2 directory.

by using different stylesheets against the same input, you are able to transform your docbook content into any number of other formats. of course, some of them are a bit more complicated- for example, the epub stylesheet creates all of the content-related files you need for an epub file, but it doesn't actually build the final epub file itself. for that, you need to do something like...

$ rm -rf META-INF OEBPS mybook.epub
$ xsltproc /usr/share/docbook-xsl/epub/docbook.xsl mybook.xml
$ printf 'application/epub+zip' > mimetype
$ zip -X0 mybook.epub mimetype
$ zip -Xr9D mybook.epub META-INF OEBPS

an epub file is just a zip file with a specific set of contents. the first file in the zip must be called "mimetype", and must be stored with zero compression. the zip must also contain a file called META-INF/container.xml, which has a pointer to the root description of the content. in most cases this is called OEBPS/content.opf, and the other content (which consists of xhtml files, along with any stylesheets, images, etc. which may be needed) is also stored in the OEBPS directory. note that the OEBPS directory and its contents don't have to have those specific names, but "META-INF/container.xml" must point to wherever the .opf file is which contains the pointers for the other content. using the OEBPS structure is just a commonly used convention.
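The container layout described above can also be sketched with Python's standard "zipfile" module. This is a minimal sketch, not a replacement for the zip commands: "build_epub" is a hypothetical helper name, and the .opf content is a placeholder rather than a usable package file.

```python
import io
import zipfile

# Standard OCF container file pointing at the conventional OEBPS/content.opf.
CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(files):
    """Pack files (dict of archive name -> bytes) into EPUB bytes in memory.

    Hypothetical helper: the mimetype entry is written first and stored
    uncompressed; every other entry is deflated.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


epub_bytes = build_epub({
    "META-INF/container.xml": CONTAINER_XML.encode("utf-8"),
    "OEBPS/content.opf": b"<package/>",  # placeholder, not a real OPF file
})

# Re-open the archive to confirm the ordering and compression of mimetype.
with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
    first = zf.infolist()[0]
    print(first.filename, first.compress_type == zipfile.ZIP_STORED)
```

In a real build the dictionary would be filled from the META-INF and OEBPS directories that the epub stylesheet writes.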

the stylesheet tells xsltproc to create the META-INF/container.xml file, along with the OEBPS directory and its contents (the .opf file, a .ncx table of contents file, and the xhtml files which contain the actual content.) the rest of the commands just create the zip file (with a .epub name) using the proper compression options.

customizing the stylesheets isn't overly difficult- basically all you do is write your own .xsl file which includes the default .xsl file, then sets some variables which the rules within the xslt code look for, to modify how they work... and then when you call xsltproc, you tell it to run your stylesheet instead of the system standard one. for example, you might call this "mybook-html.xsl" ...

 <?xml version="1.0" encoding="ASCII"?>
 <xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
 <xsl:import href='/usr/share/docbook-xsl/xhtml/chunk.xsl' />
 <xsl:param name='chapter.autolabel'    select='1' />
 <xsl:param name='generate.chapter.toc' select='0' />
 <xsl:param name='section.autolabel'    select='1' />
 <xsl:param name='html.stylesheet'>mybook.css</xsl:param>
 </xsl:stylesheet>

this stylesheet first includes the "xhtml/chunk.xsl" stylesheet, which contains rules to transform docbook input to multiple XHTML file output. these rules look for certain variables, and the stylesheet has default values for them. by changing the values of these variables, we change how the actual transform is done.

in this example, chapters will be given automatic labels (i.e. 1, 2, 3, etc.) in the table of contents, we will not be creating tables of contents at the top of each chapter, and sections within each chapter will be automatically numbered (i.e. "1.1", "1.2", "1.3", "2.1", "2.2", "3.1", etc.) when it generates each xhtml file, it will also include the appropriate "<link rel='stylesheet' ... />" tag to make it use a stylesheet with that name. (of course you have to write that .css file, and copy it into the directory where the xhtml files are written, yourself.)

with all that said.

as you know, there *are* text editors designed specifically for editing HTML and CSS. there are also editors, some of which are the same HTML editors, which have features to handle XML and docbook. these are nice, not only because they highlight the syntax as you're typing, but they can also help you auto-complete the tags and automatically make sure your text is syntactically correct (i.e. all opening tags have the appropriate closing tags, in the right order.)

in fact, the editor i use on the mac doesn't even show you the docbook markup directly- you type into an editor which is almost WYSIWYG, although you can view a tree structure of the tags and content (i usually keep that open alongside what i'm typing.)

and while the professional version of that program apparently knows how to run the docbook->whatever transforms within the program (i'm using the free version, nice editor but no built-in transforms) what it all comes down to is that it's just a glorified text editor. as long as you know the commands for the transforms you need done, you can put them into a shell script or (even better) a Makefile, use the editor to edit the input file, and run the transforms from a command line (by running the script, or by typing "make".) this is how i'm doing it on the mac, it works fine, and to be honest i'd rather do it this way- because i *know* what's happening in order to transform the document.

doing the transforms on linux should be the same general process, although you may need to figure out where the standard docbook stylesheets are actually installed on your system (or you may not need to, usually the RPM files set up a system-wide "catalog" for you. i'm running it on mac, so no catalog, which means i need to give full pathnames to the stylesheets.)

From Jesse Goerz


Docbook has absolutely nothing to do with styles. Really. I could say "It's like saying CSS is a part of the HTML standard", but that's a terrible analogy because HTML includes elements whose sole purpose is to dictate presentation.

In Docbook, styles are accomplished via stylesheets. Docbook is an XML documentation schema (well, DTD). That's it. However, it's pretty useless unless you wish everyone to read your raw XML files. So what you need to control style is the Docbook stylesheets created by Norman Walsh and this little reference:

So all you do is tag your content by what it is, not how it should look. The paradox is that when you're typing your document using the docbook schema for the first time, you are almost invariably thinking about how this will look. Why do that? The only reason is if there is some visual component. Just put a placeholder there and worry about it later.

I haven't used it in a long time, but I loved it. Here are my positives:

  1. It's plain text baby! diff, awk, grep, svn, cvs, git, need I mention them all!
  2. Single source documentation!
  3. Automated build via make for multiple output formats.
  4. My "presentation needs" have almost always been met by the default XSL stylesheets. When they weren't, minor tweaks were sufficient, or I just lived with it.
  5. I can edit it with vim! abbreviations, macros, syntax highlighting, oh my!
  6. With XSLT you can transform to anything; you could create your own docbook programming language with the right XSL transform (ok, that's sick).
  7. XInclude and hierarchy is the bomb! Combined with a folding editor, you can work on some seriously large and complex documents without much fuss. I converted 4 people's documents from various versions of word to text and tagged all the content (including an index) in about 4 hours. I never could have done this in a GUI and mouse configuration. Touch typing with vim and macros all the way. The final output was around 80 pages in a Letter sized PDF. This was for my senior design course at UCF.

What I didn't like:

  1. TABLES! arghh! I once wrote a simple one off awk script to read a csv file to produce the large tables I needed.
  2. MATH! arghh! I hope this has changed, there was quite a push to include mathml, I broke down and used other tools to create images and included them that way, not perfect, but doable.
  3. Trying to convince anyone who has used a graphical editor with built in styling to use this. And that includes little paragraph buttons and silly font icons. I can be very productive using this tool, but when you're on a team, everyone must buy in. Unless of course you're willing to manage all the project documentation, all the time.

If your documents are highly structured and fairly consistent across the full range of documents you write, docbook and the docbook stylesheets will serve you very well.
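The one-off CSV-to-table script mentioned above is not reproduced in these notes, but the idea can be sketched in Python. This is an illustrative guess at the approach, not Jesse's awk script: "csv_to_docbook_table" is a hypothetical name, the sample data is invented, and the output uses DocBook's "informaltable" with the first CSV row as the header.

```python
import csv
import io
from xml.sax.saxutils import escape


def csv_to_docbook_table(csv_text):
    """Convert CSV text (first row = header) into a DocBook informaltable.

    Hypothetical helper; cell text is escaped so that & and < stay legal XML.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    def row_xml(cells):
        entries = "".join(f"<entry>{escape(c)}</entry>" for c in cells)
        return f"      <row>{entries}</row>"

    lines = [
        "<informaltable>",
        f'  <tgroup cols="{len(header)}">',
        "    <thead>",
        row_xml(header),
        "    </thead>",
        "    <tbody>",
    ]
    lines += [row_xml(r) for r in body]
    lines += ["    </tbody>", "  </tgroup>", "</informaltable>"]
    return "\n".join(lines)


# Invented sample data, including an ampersand that must be escaped.
sample = "Tool,Purpose\nxmllint,validation\nxsltproc,transforms & output\n"
table_xml = csv_to_docbook_table(sample)
print(table_xml)
```

The generated fragment can be pasted into a section or pulled in from a separate file.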

From Klaatu

Crash course

Docbook (-XML) for Dummies

Highlights from with my own observations thrown in.

1. Why to use DocBook instead of (Microsoft Word/Lyx,TeX,<Insert your favorite text processor here>)

  • It's an open standard, it's open-source, lots of people already use it
  • It's plain text (Vim)!
  • What I hope to achieve: portability of printing using styles (XSLT). Outputs include epub, PDF, HTML, printed book press ready and yeah I know publishers (read "Agents") want everything in Microsoft Word format and yeah I can do that also using conversion tools combined with style.

Why I wrote this document: on getting a docbook system set up: "currently I feel like playing a text-adventure, getting different types of hints from different persons." - Yeah what Ken said.

2. What's DocBook?

Ok, here we go. DocBook* is an open standard for how to describe a document. You know how Microsoft Word reads ".doc" files and Adobe Acrobat reads ".pdf" files? DocBook is another format for files. But it's open, unlike Word, and it's simple enough you can write it yourself, unlike PDF. Plus, it's plain text, so you can use all your old text processing tools (sed, grep, wc), and it's XML, so you can use any XML processing tools you find (**like what?).

So if you want to publish a document with DocBook, you can just open a text file in your editor and start typing:

 <?xml version="1.0"?>
   <!DOCTYPE ???>
      <para>It was the best of times, it was the worst of times.</para>

That's not so tough. That's DocBook. Text goes in between <para> </para>, which mark the start and end of a paragraph. There are a bunch of other "tags" like this you can use to emphasize text, make lists or tables, build a table of contents, or include graphics in your text. There's a big list here (URL???). You don't have to memorize them. Just go look it up.

But that text file you typed isn't very useful to people who want to read your book. They'll want to read it in their web browser, for example. Or maybe they'll want to print it out.

What you need are some DocBook tools. -- hmm, i'll need xml here... -- no, not really

Actually "DocBook-XML", since there's an older version based on SGML instead of XML, but that version is dying. And only some computer geeks care that it's based on XML. I mean, when you're writing an HTML file to put on your web page, do you care that HTML is based on SGML? Nah. Just be aware that it's technically "DocBook-XML" if you plan to search old Usenet postings, because originally it wasn't XML.

The executive summary:

-- docbook is just a standard -- xml and sgml, though sgml is dying
-- xslt is a way to convert xml to something else
-- saxon is a java xslt processor. xsltproc is another (but not java). (xalan is, but probably not as good.)
-- note, saxon still needs a sax parser (what's sax?); i use crimson
-- everybody uses norm walsh's xslt stylesheets
-- they can generate html (chunked and not), javahelp, and "fo"
-- you can make "fo" into pdf with [FOP]
-- (i could also use passivetex to use TeX to make a pdf, but it's not java)

3. Getting things working

-- start converting
-- getting it to work at all (xmlto)

4. Getting things working smoothly

-- integrating with ant

  -- need java versions of everything
     -- saxon (docbook->html, docbook->javahelp, docbook->fo), fop (fo->pdf)
  -- download saxon
     -- tried saxon 7.1 (newest one that works with java 1.3)
     -- strange error message; apparently that's code for "you need to downgrade to saxon 6.5.2"
  -- download fop
     -- strange error messages
     -- apparently it's something wrong with my <qandaset>
  -- making [html]

-- validation

*** here below is all Kevin Inscoe's notes ****

5. Editing

Any XML-aware editor can do the job for you ([[5]]). Personally, I like Vi (the standard Unix-like editor), and [[6]] will do syntax highlighting of XML/SGML. If you like Emacs, it has sgml-mode, and it works.

How is spell checking accomplished?

In vim or some other text editor, you may need to add all the tags to the dictionary. That way, if you misspell one, it'll trigger, and if you get it right, you don't get spurious spelling errors.

In emacs, the sgml mode ignores tags, so you're on your own there. Otherwise, it checks the data just fine.
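Another route, independent of the editor, is to strip the markup first and hand only the text to whatever spell checker is available. A minimal sketch in Python, assuming the file is well-formed XML; "text_for_spellcheck" is a hypothetical helper name and the sample chapter is invented.

```python
import xml.etree.ElementTree as ET


def text_for_spellcheck(xml_text):
    """Return only the character data of an XML document, tags removed.

    Hypothetical helper. Text pieces are joined with spaces, so a word
    split by inline markup (part of it inside <emphasis>, say) would come
    out as two words.
    """
    root = ET.fromstring(xml_text)
    # itertext() yields every text and tail string in document order;
    # split() then collapses the runs of whitespace left behind by the tags.
    words = " ".join(root.itertext()).split()
    return " ".join(words)


doc = ("<chapter><title>Intro</title>"
       "<para>Docbook is <emphasis>plain</emphasis> text.</para></chapter>")
print(text_for_spellcheck(doc))
```

The tag names never reach the spell checker this way, so nothing has to be added to the dictionary.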

Writing books in DocBook

"DocBook contains a large number of XML elements from which we only use a subset. Our subset, however, uses other elements than the subset used by Simplified DocBook; that is why we are not using Simplified DocBook (which would otherwise be a nice idea, because many editors support Simplified DocBook only). Be aware that XML has two special characters that cannot be used literally in text: < (the opening angle bracket) and & (the ampersand sign). Whenever you need to use those two characters literally, you have to use entities, like in this example: &lt;literal&gt; &amp; &lt;programlisting&gt;

We structure our documentation using nested section elements. A book as DocBook defines it would look like this:



Editors (viewer only)

Books on DocBook (Sample chapter:

Is DocBook dead?

"DocBook excels at what I call monolithic single sourcing: creating multiple outputs of a large (or large-ish) document. A document can be split out into manuals for different levels of user experience and different operating systems, as well as into multiple formats. In the end, though, you’re still dealing with the book-chapter-section model of writing and publishing.

DITA, on the other hand, is more for discrete single sourcing. DITA, as many of us know, is oriented towards topics. You can take various bits of content — at a very granular level — and combine them as needed. With some extensive customization, however, this can be done with DocBook too.

So, is DocBook dead? Is it dying? I don’t think so, on either count."

Publishing, publishing tools, stylesheets, output and conversion

[Publishing Model]

I choose xsltproc instead of [[7]] for my XSLT processor.

I am considering using [[8]] for my entire publishing end to end.

in Gentoo notes are stored in /usr/share/doc/saxon* and /usr/share/saxon/package.env with online documentation at

This is how I arrive at my target outputs:

|| border=1
||! Source ||! Output target ||! Style processor ||! Formatting Objects processor ||
|| DocBook || PDF || xsltproc || ? ||
|| DocBook || HTML || xsltproc || ? ||
|| DocBook || ePub || xsltproc || ? ||
|| DocBook || Microsoft Word (for some publishing agents) || Saxon? || ? ||
|| DocBook || Mobile web || Saxon? || ? ||
|| DocBook || Apple iPad/iPhone application || custom? || ? ||


On Gentoo I have Docbook 4.5 [app-text/docbook-xml-dtd] and Docbook XSL Stylesheets [app-text/docbook-xsl-stylesheets 1.75] installed. This installed the following style sheets:


The xsltproc route

See this [[9]] from IBM.

Command goes like:

 # xsltproc /usr/share/sgml/docbook/xsl-ns-stylesheets/html/docbook.xsl mybook.xml

EPUB output

See this [[10]] from IBM.

My typical build script looks like this:

 # Check for errors
 echo "Checking for errors..."
 xmllint --xinclude --noout --postvalid --timing --noent MyBook.xml

 echo " "
 echo " "
 echo "Making HTML output..."
 xsltproc --xinclude -o MyBook.html /usr/share/sgml/docbook/xsl-stylesheets/html/docbook.xsl MyBook.xml

 echo " "
 echo " "
 echo "Making PDF output..."
 # The fo stylesheet only produces XSL-FO; an FO processor (for example FOP) must still turn MyBook.fo into a PDF
 xsltproc --xinclude -o MyBook.fo /usr/share/sgml/docbook/xsl-stylesheets/fo/docbook.xsl MyBook.xml

 echo " "
 echo " "
 echo "Making EPUB output..."
 # Writes META-INF/ and OEBPS/; these still have to be zipped into the .epub
 xsltproc --xinclude /usr/share/sgml/docbook/xsl-stylesheets/epub/docbook.xsl MyBook.xml

My typical book main document with Xincludes

A note about Xincludes:

<?xml version="1.0" encoding="utf-8"?>
<!-- <!DOCTYPE book SYSTEM "/usr/share/sgml/docbook/xml-dtd-4.5/docbookx.dtd"> -->
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
        "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
        <title>Kevin's Book</title>
                <ulink url=""></ulink>
        <edition>Draft 1</edition>

                                <email>kevin at inscoe dot org</email>

                <holder>Kevin Patrick Inscoe</holder>
                <para>Permission to use, copy, modify and distribute
                the DocBook DTD and its accompanying documentation for any purpose and
                without fee is hereby granted in perpetuity, provided that the above
                copyright notice and this paragraph appear in all copies.</para>


                <publishername>Self published</publishername>

                <para>If your book has an abstract then it should go here.</para>


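        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="dedication.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="acknowledgements.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="forward.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="preface.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="chap1.xml" />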