What are entities?

Entities are simply a data-replacement facility. An entity declaration says, "wherever you see X, replace it with Y". Then, you sprinkle a bunch of X's around your document, which are the entity references. When the XML parser sees an X, it mentally substitutes Y in its place.

(Note the distinction: X is the entity reference, Y is the entity, and the entity declaration is what ties the two together. One entity declaration, one entity, and possibly many entity references.)

Any entity can be internal or external. An external entity points to data in an external file, so it functions like an "import" or "include" statement. An internal entity's value is defined in-line, so it functions like a macro. (Again, both includes and macros basically serve the same purpose: "wherever you see X, replace it with Y".)

Here's an example:

(In the DTD:)
<!-- declare the standard copyright notice: -->
<!ENTITY copyright "Copyright 1998-2002 Acme Inc.">

(In the XML:)
<!-- include the standard copyright notice here: -->
<p>&copyright;</p>
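To see the replacement happen, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It assumes the declaration is placed in the document's internal DTD subset rather than in a separate file, so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

# The entity declaration lives in the DTD (here: the internal subset),
# and the entity reference lives in the XML content.
DOC = """<!DOCTYPE p [
  <!-- declare the standard copyright notice: -->
  <!ENTITY copyright "Copyright 1998-2002 Acme Inc.">
]>
<p>&copyright;</p>"""

p = ET.fromstring(DOC)

# The parser has already substituted the entity's value for the reference.
print(p.text)
```

The application never sees &copyright; at all, only the replacement text.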

So far so good. Now it gets a bit more obscure:

Any entity can be a general entity (GE) or a parameter entity (PE). GE references are written &name; and can only be used in the XML; PE references are written %name; and can only be used in the DTD. (Note that every entity *declaration* appears in the DTD, not the XML.)
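The split between the two kinds of reference can be sketched with Python's standard xml.parsers.expat module. The entity names decl and greeting are made up for illustration. A parameter entity is referenced inside the DTD, where its replacement text happens to be another declaration, and the general entity declared that way is then referenced in the XML content. Parameter entity parsing is off by default in this parser, so the sketch switches it on.

```python
import xml.parsers.expat as expat

DOC = """<!DOCTYPE p [
  <!-- a parameter entity whose value is itself a declaration -->
  <!ENTITY % decl "<!ENTITY greeting 'hello'>">
  <!-- PE reference: only legal inside the DTD -->
  %decl;
]>
<p>&greeting;</p>"""

chunks = []

parser = expat.ParserCreate()
# Ask expat to expand parameter entity references (disabled by default).
parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
# Collect the character data of the document after entity replacement.
parser.CharacterDataHandler = chunks.append
parser.Parse(DOC, True)

text = "".join(chunks)
print(text)
```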

An external general entity may be an unparsed entity. An unparsed entity doesn't have to be a file containing XML or DTD; it might be a GIF file, for example. This is the one exception to the replacement rule. It wouldn't make any sense to take the contents of a GIF file and put them in the middle of XML or DTD. So instead, unparsed entities are just references to the external data file, whatever it may be.

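So, in combination, you've got:

- internal general entities
- external parsed general entities
- external unparsed general entities
- internal parameter entities
- external parameter entities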
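As a rough illustration of these combinations, the sketch below declares one entity of each kind and lets Python's standard xml.parsers.expat module report how it sees each declaration. The entity names, file names and the gif notation are all hypothetical. None of the external files is ever read, because the entities are only declared, never referenced.

```python
import xml.parsers.expat as expat

DOC = """<!DOCTYPE doc [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY copyright "Copyright 1998-2002 Acme Inc.">
  <!ENTITY chapter1 SYSTEM "chapter1.xml">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
  <!ENTITY % inline "b | i">
  <!ENTITY % common SYSTEM "common.dtd">
]>
<doc/>"""

kinds = {}

def on_entity_decl(name, is_parameter, value, base,
                   system_id, public_id, notation):
    # An internal entity carries its value in-line; an external one does not.
    if is_parameter:
        kind = "internal parameter" if value is not None else "external parameter"
    elif value is not None:
        kind = "internal general"
    elif notation is None:
        kind = "external parsed general"
    else:
        # An NDATA notation marks the entity as unparsed.
        kind = "external unparsed general"
    kinds[name] = kind

parser = expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.Parse(DOC, True)

for name, kind in kinds.items():
    print(name, "->", kind)
```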

There are also references like &lt; and &#10; and &#x00F4; which behave like pre-defined internal general entities. The 5 entity references &lt; &gt; &amp; &quot; &apos; are all pre-defined to be equal to < > & " ' respectively. Character references like &#10; and &#x00F4; are not entity references in the strict sense, but they work the same way: each one stands for the character specified by its decimal or hexadecimal value, respectively.
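A quick way to check these equivalences is to parse a small document and look at the resulting text. This sketch uses Python's standard xml.etree.ElementTree module; the | separators are only there to make the output easier to read.

```python
import xml.etree.ElementTree as ET

# Five pre-defined entity references, then one decimal and one
# hexadecimal character reference.
elem = ET.fromstring("<p>&lt;&gt;&amp;&quot;&apos;|&#10;|&#x00F4;</p>")
text = elem.text

# repr() makes the newline produced by &#10; visible.
print(repr(text))
```

No DTD is needed here: these references work in any XML document.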

Here's a more complicated example.


What is the DTD, really?

Most people refer to the file whose name ends with .dtd as the DTD. In fact, this file contains only part of the DTD - it contains the "external subset" of the DTD. The "internal subset" is contained in-line in the DOCTYPE statement, e.g.:

<!DOCTYPE root PUBLIC "http://xxx" "yyy.dtd"
[
  <!ELEMENT parent (child)*>
  <!ELEMENT child (#PCDATA)>
]>
The part between the [square brackets] is the internal subset.

The entire conceptual "grammar" that is defined by combining the internal subset and the external subset is called the DTD, which stands for Document Type Definition. By contrast, the DOCTYPE statement is the Document Type *Declaration* (which is NOT the DTD - this part has no abbreviation). It contains the actual internal subset, as well as a *reference* to the external subset.
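The two halves of the DOCTYPE statement can be observed separately with Python's standard xml.parsers.expat module. In this sketch, which reuses the placeholder identifiers from the example above, the parser reports the reference to the external subset (the public and system identifiers) and, independently, the declarations it finds in the internal subset. The external file itself is not read here.

```python
import xml.parsers.expat as expat

DOC = """<!DOCTYPE root PUBLIC "http://xxx" "yyy.dtd"
[
  <!ELEMENT parent (child)*>
  <!ELEMENT child (#PCDATA)>
]>
<root/>"""

doctype = {}
internal_elements = []

def on_doctype(name, system_id, public_id, has_internal_subset):
    # The Document Type Declaration: a name, a reference to the
    # external subset, and a flag for the internal subset.
    doctype.update(name=name, system_id=system_id, public_id=public_id,
                   has_internal_subset=bool(has_internal_subset))

def on_element_decl(name, model):
    # Called for each <!ELEMENT ...> found in the internal subset.
    internal_elements.append(name)

parser = expat.ParserCreate()
parser.StartDoctypeDeclHandler = on_doctype
parser.ElementDeclHandler = on_element_decl
parser.Parse(DOC, True)

print(doctype)
print(internal_elements)
```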

So what is the .dtd file? It's an external entity. The DOCTYPE statement declares this external entity, and it is also special in that it automatically acts as a reference to this entity. It's basically the same as this pseudo-XML:

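<!DOCTYPE root
[
  <!-- declare the external entity -->
  <!ENTITY % externalsubset PUBLIC "http://xxx" "yyy.dtd">
  <!-- the internal subset part comes first -->
  <!ELEMENT parent (child)*>
  <!ELEMENT child (#PCDATA)>
  <!-- then reference the external entity -->
  %externalsubset;
]>
(The reference goes last because the internal subset is considered to occur before the external subset, which is why declarations in the internal subset take precedence.)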
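That the external subset is handled as an external entity can be seen with Python's standard xml.parsers.expat module: when parameter entity parsing is enabled, the parser asks for the .dtd file through the same callback it uses for any other external parsed entity. In this sketch the contents of yyy.dtd are an assumption and are supplied from a string rather than fetched from anywhere.

```python
import xml.parsers.expat as expat

# Hypothetical contents of yyy.dtd (the external subset).
EXTERNAL_SUBSET = "<!ELEMENT child (#PCDATA)>"

DOC = """<!DOCTYPE parent PUBLIC "http://xxx" "yyy.dtd"
[
  <!ELEMENT parent (child)*>
]>
<parent><child>hi</child></parent>"""

events = []

parser = expat.ParserCreate()
parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

def on_element_decl(name, model):
    events.append(("element", name))

def on_external_entity(context, base, system_id, public_id):
    # Called when the parser wants the external entity's content.
    events.append(("external entity", system_id))
    sub = parser.ExternalEntityParserCreate(context)
    sub.Parse(EXTERNAL_SUBSET, True)
    return 1  # tell the parser that the entity was handled

parser.ElementDeclHandler = on_element_decl
parser.ExternalEntityRefHandler = on_external_entity
parser.Parse(DOC, True)

for event in events:
    print(event)
```

With this parser, the declaration from the internal subset is reported before the external entity is requested.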