groff 1.23.0 added .MR to its -man macro package. The NEWS file states
that the inclusion of the macro "was prompted by its introduction to
Plan 9 from User Space's troff in August 2020." From d32deab it seems
that the name for Plan 9 from User Space's implementation was suggested
by groff maintainer G. Branden Robinson.

I'm not sure whether the intention was to make these definitions compatible, but
it would be nice if they were.

Currently, Plan 9 from User Space's .MR expects its second argument to
be parenthesized; groff's .MR does not. As a result, extra
parentheses appear in manual references when Plan 9 from User
Space's manual pages are viewed on a system using groff.
.TH HTML 3
.SH NAME
parsehtml,
printitems,
validitems,
freeitems,
freedocinfo,
dimenkind,
dimenspec,
targetid,
targetname,
fromStr,
toStr
\- HTML parser
.SH SYNOPSIS
.nf
.PP
.ft L
#include <u.h>
#include <libc.h>
#include <html.h>
.ft P
.PP
.ta \w'\fLToken* 'u
.B
Item* parsehtml(uchar* data, int datalen, Rune* src, int mtype,
.B
int chset, Docinfo** pdi)
.PP
.B
void printitems(Item* items, char* msg)
.PP
.B
int validitems(Item* items)
.PP
.B
void freeitems(Item* items)
.PP
.B
void freedocinfo(Docinfo* d)
.PP
.B
int dimenkind(Dimen d)
.PP
.B
int dimenspec(Dimen d)
.PP
.B
int targetid(Rune* s)
.PP
.B
Rune* targetname(int targid)
.PP
.B
uchar* fromStr(Rune* buf, int n, int chset)
.PP
.B
Rune* toStr(uchar* buf, int n, int chset)
.SH DESCRIPTION
.PP
This library implements a parser for HTML 4.0 documents.
The parsed HTML is converted into an intermediate representation that
describes how the formatted HTML should be laid out.
.PP
.I Parsehtml
parses an entire HTML document contained in the buffer
.I data
and having length
.IR datalen .
The URL of the document should be passed in as
.IR src .
.I Mtype
is the media type of the document, which should be either
.B TextHtml
or
.BR TextPlain .
The character set of the document is described in
.IR chset ,
which can be one of
.BR US_Ascii ,
.BR ISO_8859_1 ,
.B UTF_8
or
.BR Unicode .
The return value is a linked list of
.B Item
structures, described in detail below.
As a side effect,
.BI * pdi
is set to point to a newly created
.B Docinfo
structure, containing information pertaining to the entire document.
.PP
The library expects two allocation routines to be provided by the
caller,
.B emalloc
and
.BR erealloc .
These routines are analogous to the standard malloc and realloc routines,
except that they should not return if the memory allocation fails.
In addition,
.B emalloc
is required to zero the memory.
.PP
For debugging purposes,
.I printitems
may be called to display the contents of an item list; individual items may
be printed using the
.B %I
print verb, installed on the first call to
.IR parsehtml .
.I Validitems
traverses the item list, checking that all of the pointers are valid.
It returns
.B 1
if everything is ok, and
.B 0
if an error was found.
Normally, one would not call these routines directly.
Instead, one sets the global variable
.I dbgbuild
and the library calls them automatically.
One can also set
.IR warn ,
to cause the library to print a warning whenever it finds a problem with the
input document, and
.IR dbglex ,
to print debugging information in the lexer.
.PP
When an item list is finished with, it should be freed with
.IR freeitems .
Then,
.I freedocinfo
should be called on the pointer returned in
.BI * pdi\f1.
.PP
.I Dimenkind
and
.I dimenspec
are provided to interpret the
.B Dimen
type, as described in the section
.IR "Dimension Specifications" .
.PP
Frame target names are mapped to integer ids via a global, permanent mapping.
To find the value for a given name, call
.IR targetid ,
which allocates a new id if the name hasn't been seen before.
The name of a given, known id may be retrieved using
.IR targetname .
The library predefines
.BR FTtop ,
.BR FTself ,
.B FTparent
and
.BR FTblank .
.PP
The library handles all text as Unicode strings (type
.BR Rune* ).
Character set conversion is provided by
.I fromStr
and
.IR toStr .
.I FromStr
takes
.I n
Unicode characters from
.I buf
and converts them to the character set described by
.IR chset .
.I ToStr
takes
.I n
bytes from
.IR buf ,
interpreted as belonging to character set
.IR chset ,
and converts them to a Unicode string.
Both routines null-terminate the result, and use
.B emalloc
to allocate space for it.
.SS Items
The return value of
.I parsehtml
is a linked list of variant structures,
with the generic portion described by the following definition:
.PP
.EX
.ta 6n +\w'Genattr* 'u
typedef struct Item Item;
struct Item
{
	Item* next;
	int width;
	int height;
	int ascent;
	int anchorid;
	int state;
	Genattr* genattr;
	int tag;
};
.EE
.PP
The field
.B next
points to the successor in the linked list of items, while
.BR width ,
.BR height ,
and
.B ascent
are intended for use by the caller as part of the layout process.
.BR Anchorid ,
if non-zero, gives the integer id assigned by the parser to the anchor that
this item is in (see section
.IR Anchors ).
.B State
is a collection of flags and values described as follows:
.PP
.EX
.ta 6n +\w'IFindentshift = 'u
enum
{
	IFbrk = 0x80000000,
	IFbrksp = 0x40000000,
	IFnobrk = 0x20000000,
	IFcleft = 0x10000000,
	IFcright = 0x08000000,
	IFwrap = 0x04000000,
	IFhang = 0x02000000,
	IFrjust = 0x01000000,
	IFcjust = 0x00800000,
	IFsmap = 0x00400000,
	IFindentshift = 8,
	IFindentmask = (255<<IFindentshift),
	IFhangmask = 255
};
.EE
.PP
.B IFbrk
is set if a break is to be forced before placing this item.
.B IFbrksp
is set if a one-line space should be added to the break (in which case
.B IFbrk
is also set).
.B IFnobrk
is set if a break is not permitted before the item.
.B IFcleft
is set if left floats should be cleared (that is, if the list of pending left floats should be placed)
before this item is placed, and
.B IFcright
is set for right floats.
In both cases,
.B IFbrk
is also set.
.B IFwrap
is set if the line containing this item is allowed to wrap.
.B IFhang
is set if this item hangs into the left indent.
.B IFrjust
is set if the line containing this item should be right justified,
and
.B IFcjust
is set for center justified lines.
.B IFsmap
is used to indicate that an image is a server-side map.
The low 8 bits, represented by
.BR IFhangmask ,
indicate the current hang into the left indent, in tenths of a tab stop.
The next 8 bits, represented by
.B IFindentmask
and
.BR IFindentshift ,
indicate the current indent in tab stops.
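.PP
The packed indent and hang values can be pulled out of the state word with a mask and a shift.
The sketch below is illustrative only: the three constants are copied from the enum in this section, while the helper names stateindent and statehang are hypothetical and are not part of the library.
The state is taken as an unsigned value so that the bit operations are well defined in standard C.

```c
/* Constants copied from the state enum documented in this section. */
enum
{
	IFindentshift = 8,
	IFindentmask = (255<<IFindentshift),
	IFhangmask = 255
};

/* Hypothetical helper: current indent, in tab stops. */
static int
stateindent(unsigned int state)
{
	return (int)((state & IFindentmask) >> IFindentshift);
}

/* Hypothetical helper: current hang into the left indent,
 * in tenths of a tab stop. */
static int
statehang(unsigned int state)
{
	return (int)(state & IFhangmask);
}
```

.PP
A caller would pass the state field of an item, cast to unsigned int, to either helper; the flag bits above bit 15 are ignored by both.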
.PP
The field
.B genattr
is an optional pointer to an auxiliary structure, described in the section
.IR "Generic Attributes" .
.PP
Finally,
.B tag
describes which variant type this item has.
It can have one of the values
.BR Itexttag ,
.BR Iruletag ,
.BR Iimagetag ,
.BR Iformfieldtag ,
.BR Itabletag ,
.B Ifloattag
or
.BR Ispacertag .
For each of these values, there is an additional structure defined, which
includes Item as an unnamed initial substructure, and then defines additional
fields.
.PP
Items of type
.B Itexttag
represent a piece of text, using the following structure:
.PP
.EX
.ta 6n +\w'Rune* 'u
struct Itext
{
	Item;
	Rune* s;
	int fnt;
	int fg;
	uchar voff;
	uchar ul;
};
.EE
.PP
Here
.B s
is a null-terminated Unicode string of the actual characters making up this text item,
.B fnt
is the font number (described in the section
.IR "Font Numbers" ),
and
.B fg
is the RGB encoded color for the text.
.B Voff
measures the vertical offset from the baseline; subtract
.B Voffbias
to get the actual value (negative values represent a displacement down the page).
The field
.B ul
is the underline style:
.B ULnone
if no underline,
.B ULunder
for conventional underline, and
.B ULmid
for strike-through.
.PP
Items of type
.B Iruletag
represent a horizontal rule, as follows:
.PP
.EX
.ta 6n +\w'Dimen 'u
struct Irule
{
	Item;
	uchar align;
	uchar noshade;
	int size;
	Dimen wspec;
};
.EE
.PP
Here
.B align
is the alignment specification (described in the corresponding section),
.B noshade
is set if the rule should not be shaded,
.B size
is the height of the rule (as set by the size attribute),
and
.B wspec
is the desired width (see section
.IR "Dimension Specifications" ).
.PP
Items of type
.B Iimagetag
describe embedded images, for which the following structure is defined:
.PP
.EX
.ta 6n +\w'Iimage* 'u
struct Iimage
{
	Item;
	Rune* imsrc;
	int imwidth;
	int imheight;
	Rune* altrep;
	Map* map;
	int ctlid;
	uchar align;
	uchar hspace;
	uchar vspace;
	uchar border;
	Iimage* nextimage;
};
.EE
.PP
Here
.B imsrc
is the URL of the image source,
.B imwidth
and
.BR imheight ,
if non-zero, contain the specified width and height for the image,
and
.B altrep
is the text to use as an alternative to the image, if the image is not displayed.
.BR Map ,
if set, points to a structure describing an associated client-side image map.
.B Ctlid
is reserved for use by the application, for handling animated images.
.B Align
encodes the alignment specification of the image.
.B Hspace
contains the number of pixels to pad the image with on either side, and
.B Vspace
the padding above and below.
.B Border
is the width of the border to draw around the image.
.B Nextimage
points to the next image in the document (the head of this list is
.BR Docinfo.images ).
.PP
For items of type
.BR Iformfieldtag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Formfield* 'u
struct Iformfield
{
	Item;
	Formfield* formfield;
};
.EE
.PP
This adds a single field,
.BR formfield ,
which points to a structure describing a field in a form, described in section
.IR Forms .
.PP
For items of type
.BR Itabletag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Table* 'u
struct Itable
{
	Item;
	Table* table;
};
.EE
.PP
.B Table
points to a structure describing the table, described in the section
.IR Tables .
.PP
For items of type
.BR Ifloattag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Ifloat* 'u
struct Ifloat
{
	Item;
	Item* item;
	int x;
	int y;
	uchar side;
	uchar infloats;
	Ifloat* nextfloat;
};
.EE
.PP
The
.B item
points to a single item (either a table or an image) that floats (the text of the
document flows around it), and
.B side
indicates the margin that this float sticks to; it is either
.B ALleft
or
.BR ALright .
.B X
and
.B y
are reserved for use by the caller; these are typically used for the coordinates
of the top of the float.
.B Infloats
is used by the caller to keep track of whether it has placed the float.
.B Nextfloat
is used by the caller to link together all of the floats that it has placed.
.PP
For items of type
.BR Ispacertag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Item; 'u
struct Ispacer
{
	Item;
	int spkind;
};
.EE
.PP
.B Spkind
encodes the kind of spacer, and may be one of
.B ISPnull
(zero height and width),
.B ISPvline
(takes on height and ascent of the current font),
.B ISPhspace
(has the width of a space in the current font) and
.B ISPgeneral
(for all other purposes, such as between markers and lists).
.SS Generic Attributes
.PP
The genattr field of an item, if non-nil, points to a structure that holds
the values of attributes not specific to any particular
item type, as they occur on a wide variety of underlying HTML tags.
The structure is as follows:
.PP
.EX
.ta 6n +\w'SEvent* 'u
typedef struct Genattr Genattr;
struct Genattr
{
	Rune* id;
	Rune* class;
	Rune* style;
	Rune* title;
	SEvent* events;
};
.EE
.PP
Fields
.BR id ,
.BR class ,
.B style
and
.BR title ,
when non-nil, contain values of correspondingly named attributes of the HTML tag
associated with this item.
.B Events
is a linked list of events (with corresponding scripted actions) associated with the item:
.PP
.EX
.ta 6n +\w'SEvent* 'u
typedef struct SEvent SEvent;
struct SEvent
{
	SEvent* next;
	int type;
	Rune* script;
};
.EE
.PP
Here,
.B next
points to the next event in the list,
.B type
is one of
.BR SEonblur ,
.BR SEonchange ,
.BR SEonclick ,
.BR SEondblclick ,
.BR SEonfocus ,
.BR SEonkeypress ,
.BR SEonkeyup ,
.BR SEonload ,
.BR SEonmousedown ,
.BR SEonmousemove ,
.BR SEonmouseout ,
.BR SEonmouseover ,
.BR SEonmouseup ,
.BR SEonreset ,
.BR SEonselect ,
.B SEonsubmit
or
.BR SEonunload ,
and
.B script
is the text of the associated script.
.SS Dimension Specifications
.PP
Some structures include a dimension specification, used where
a number can be followed by a
.B %
or a
.B *
to indicate
percentage of total or relative weight.
This is encoded using the following structure:
.PP
.EX
.ta 6n +\w'int 'u
typedef struct Dimen Dimen;
struct Dimen
{
	int kindspec;
};
.EE
.PP
Separate kind and spec values are extracted using
.I dimenkind
and
.IR dimenspec .
.I Dimenkind
returns one of
.BR Dnone ,
.BR Dpixels ,
.B Dpercent
or
.BR Drelative .
.B Dnone
means that no dimension was specified.
In all other cases,
.I dimenspec
should be called to find the absolute number of pixels, the percentage of total,
or the relative weight.
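.PP
As an illustration of how a layout engine might consume the two values, the sketch below turns a kind and spec pair into a pixel width.
Everything in it is an assumption made for the example: the helper name resolvewidth, the avail and totalweight parameters, the fallback used for Dnone, and the numeric enum values, which are placeholders standing in for the real constants from html.h.
The kind and spec arguments stand for the results of dimenkind and dimenspec.

```c
/* Placeholder values for illustration; the real constants
 * are defined by the library header. */
enum { Dnone, Dpixels, Dpercent, Drelative };

/* Hypothetical helper: convert a dimension to pixels.
 * avail is the space available, totalweight the sum of all
 * relative weights competing for that space. */
static int
resolvewidth(int kind, int spec, int avail, int totalweight)
{
	switch(kind){
	case Dpixels:
		return spec;			/* absolute pixel count */
	case Dpercent:
		return avail*spec/100;		/* percentage of total */
	case Drelative:
		if(totalweight <= 0)
			return 0;
		return avail*spec/totalweight;	/* share by weight */
	default:
		return avail;			/* Dnone: caller's policy */
	}
}
```

.PP
Returning the full available width for Dnone is only one possible policy; a caller may prefer to size such items from their content instead.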
.SS Background Specifications
.PP
It is possible to set the background of the entire document, and also
for some parts of the document (such as tables).
This is encoded as follows:
.PP
.EX
.ta 6n +\w'Rune* 'u
typedef struct Background Background;
struct Background
{
	Rune* image;
	int color;
};
.EE
.PP
.BR Image ,
if non-nil, is the URL of an image to use as the background.
If this is nil,
.B color
is used instead, as the RGB value for a solid fill color.
.SS Alignment Specifications
.PP
Certain items have alignment specifiers taken from the following
enumerated type:
.PP
.EX
.ta 6n
enum
{
	ALnone = 0, ALleft, ALcenter, ALright, ALjustify,
	ALchar, ALtop, ALmiddle, ALbottom, ALbaseline
};
.EE
.PP
These values correspond to the various alignment types named in the HTML 4.0
standard.
If an item has an alignment of
.B ALleft
or
.BR ALright ,
the library automatically encapsulates it inside a float item.
.PP
Tables, and the various rows, columns and cells within them, have a more
complex alignment specification, composed of separate vertical and
horizontal alignments:
.PP
.EX
.ta 6n +\w'uchar 'u
typedef struct Align Align;
struct Align
{
	uchar halign;
	uchar valign;
};
.EE
.PP
.B Halign
can be one of
.BR ALnone ,
.BR ALleft ,
.BR ALcenter ,
.BR ALright ,
.B ALjustify
or
.BR ALchar .
.B Valign
can be one of
.BR ALnone ,
.BR ALmiddle ,
.BR ALbottom ,
.B ALtop
or
.BR ALbaseline .
.SS Font Numbers
.PP
Text items have an associated font number (the
.B fnt
field), which is encoded as
.BR style*NumSize+size .
Here,
.B style
is one of
.BR FntR ,
.BR FntI ,
.B FntB
or
.BR FntT ,
for roman, italic, bold and typewriter font styles, respectively, and size is
.BR Tiny ,
.BR Small ,
.BR Normal ,
.B Large
or
.BR Verylarge .
The total number of possible font numbers is
.BR NumFnt ,
and the default font number is
.B DefFnt
(which is roman style, normal size).
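.PP
The encoding style*NumSize+size can be reversed with an integer division and a remainder.
The sketch below assumes the constants are numbered from zero in the order this section lists them, which gives NumSize a value of 5; that numbering, the NumStyle name and the helper names mkfnt, fntstyle and fntsize are assumptions for illustration, not definitions taken from the library header.

```c
/* Assumed numbering: constants counted from zero in the order
 * given in the text. The library header is authoritative. */
enum { FntR, FntI, FntB, FntT, NumStyle };
enum { Tiny, Small, Normal, Large, Verylarge, NumSize };
enum
{
	NumFnt = NumStyle*NumSize,
	DefFnt = FntR*NumSize+Normal
};

/* Hypothetical helper: build a font number from style and size. */
static int
mkfnt(int style, int size)
{
	return style*NumSize+size;
}

/* Hypothetical helper: recover the style from a font number. */
static int
fntstyle(int fnt)
{
	return fnt/NumSize;
}

/* Hypothetical helper: recover the size from a font number. */
static int
fntsize(int fnt)
{
	return fnt%NumSize;
}
```

.PP
Under these assumptions a caller could index a table of NumFnt loaded fonts directly by the fnt field of a text item.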
.SS Document Info
.PP
Global information about an HTML page is stored in the following structure:
.PP
.EX
.ta 6n +\w'DestAnchor* 'u
typedef struct Docinfo Docinfo;
struct Docinfo
{
	// stuff from HTTP headers, doc head, and body tag
	Rune* src;
	Rune* base;
	Rune* doctitle;
	Background background;
	Iimage* backgrounditem;
	int text;
	int link;
	int vlink;
	int alink;
	int target;
	int chset;
	int mediatype;
	int scripttype;
	int hasscripts;
	Rune* refresh;
	Kidinfo* kidinfo;
	int frameid;

	// info needed to respond to user actions
	Anchor* anchors;
	DestAnchor* dests;
	Form* forms;
	Table* tables;
	Map* maps;
	Iimage* images;
};
.EE
.PP
.B Src
gives the URL of the original source of the document,
and
.B base
is the base URL.
.B Doctitle
is the document's title, as set by a
.B <title>
element.
.B Background
is as described in the section
.IR "Background Specifications" ,
and
.B backgrounditem
is set to be an image item for the document's background image (if given as a URL),
or else nil.
.B Text
gives the default foreground text color of the document,
.B link
the unvisited hyperlink color,
.B vlink
the visited hyperlink color, and
.B alink
the color for highlighting hyperlinks (all in 24-bit RGB format).
.B Target
is the default target frame id.
.B Chset
and
.B mediatype
are as for the
.I chset
and
.I mtype
parameters to
.IR parsehtml .
.B Scripttype
is the type of any scripts contained in the document, and is always
.BR TextJavascript .
.B Hasscripts
is set if the document contains any scripts.
Scripting is currently unsupported.
.B Refresh
is the contents of a
.B "<meta http-equiv=Refresh ...>"
tag, if any.
.B Kidinfo
is set if this document is a frameset (see section
.IR Frames ).
.B Frameid
is this document's frame id.
.PP
.B Anchors
is a list of hyperlinks contained in the document,
and
.B dests
is a list of hyperlink destinations within the page (see the following section for details).
.BR Forms ,
.B tables
and
.B maps
are lists of the various forms, tables and client-side maps contained
in the document, as described in subsequent sections.
.B Images
is a list of all the image items in the document.
.SS Anchors
.PP
The library builds two lists for all of the
.B <a>
elements (anchors) in a document.
Each anchor is assigned a unique anchor id within the document.
For anchors which are hyperlinks (the
.B href
attribute was supplied), the following structure is defined:
.PP
.EX
.ta 6n +\w'Anchor* 'u
typedef struct Anchor Anchor;
struct Anchor
{
	Anchor* next;
	int index;
	Rune* name;
	Rune* href;
	int target;
};
.EE
.PP
.B Next
points to the next anchor in the list (the head of this list is
.BR Docinfo.anchors ).
.B Index
is the anchor id; each item within this hyperlink is tagged with this value
in its
.B anchorid
field.
.B Name
and
.B href
are the values of the correspondingly named attributes of the anchor
(in particular, href is the URL to go to).
.B Target
is the value of the target attribute (if provided) converted to a frame id.
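.PP
Because each item carries only an anchor id, a caller that wants the hyperlink behind an item has to search the anchor list for a matching index.
The sketch below shows one way to do that, assuming a cut-down stand-in for the Anchor structure (href is a plain C string here rather than a Rune string) and a hypothetical helper name, findanchor.

```c
#include <stddef.h>

/* Cut-down stand-in for the documented Anchor structure;
 * in the library, href is a Rune* and more fields exist. */
typedef struct Anchor Anchor;
struct Anchor
{
	Anchor* next;
	int index;
	const char* href;
};

/* Hypothetical helper: find the hyperlink whose id matches an
 * item's anchorid, or NULL if there is none. */
static Anchor*
findanchor(Anchor *anchors, int anchorid)
{
	Anchor *a;

	if(anchorid == 0)
		return NULL;	/* the item is not inside an anchor */
	for(a = anchors; a != NULL; a = a->next)
		if(a->index == anchorid)
			return a;
	return NULL;
}
```

.PP
A NULL result for a non-zero id is not necessarily an error: the id may belong to an anchor that has no href attribute and is therefore absent from this list.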
.PP
Destinations within the document (anchors with the name attribute set)
are held in the
.B Docinfo.dests
list, using the following structure:
.PP
.EX
.ta 6n +\w'DestAnchor* 'u
typedef struct DestAnchor DestAnchor;
struct DestAnchor
{
	DestAnchor* next;
	int index;
	Rune* name;
	Item* item;
};
.EE
.PP
.B Next
is the next element of the list,
.B index
is the anchor id,
.B name
is the value of the name attribute, and
.B item
points to the item within the parsed document that should be considered
to be the destination.
.SS Forms
.PP
Any forms within a document are kept in a list, headed by
.BR Docinfo.forms .
The elements of this list are as follows:
.PP
.EX
.ta 6n +\w'Formfield* 'u
typedef struct Form Form;
struct Form
{
	Form* next;
	int formid;
	Rune* name;
	Rune* action;
	int target;
	int method;
	int nfields;
	Formfield* fields;
};
.EE
.PP
.B Next
points to the next form in the list.
.B Formid
is a serial number for the form within the document.
.B Name
is the value of the form's name or id attribute.
.B Action
is the value of any action attribute.
.B Target
is the value of the target attribute (if any) converted to a frame target id.
.B Method
is one of
.B HGet
or
.BR HPost .
.B Nfields
is the number of fields in the form, and
.B fields
is a linked list of the actual fields.
.PP
The individual fields in a form are described by the following structure:
.PP
.EX
.ta 6n +\w'Formfield* 'u
typedef struct Formfield Formfield;
struct Formfield
{
	Formfield* next;
	int ftype;
	int fieldid;
	Form* form;
	Rune* name;
	Rune* value;
	int size;
	int maxlength;
	int rows;
	int cols;
	uchar flags;
	Option* options;
	Item* image;
	int ctlid;
	SEvent* events;
};
.EE
.PP
Here,
.B next
points to the next field in the list.
.B Ftype
is the type of the field, which can be one of
.BR Ftext ,
.BR Fpassword ,
.BR Fcheckbox ,
.BR Fradio ,
.BR Fsubmit ,
.BR Fhidden ,
.BR Fimage ,
.BR Freset ,
.BR Ffile ,
.BR Fbutton ,
.B Fselect
or
.BR Ftextarea .
.B Fieldid
is a serial number for the field within the form.
.B Form
points back to the form containing this field.
.BR Name ,
.BR value ,
.BR size ,
.BR maxlength ,
.B rows
and
.B cols
each contain the values of corresponding attributes of the field, if present.
.B Flags
contains per-field flags, of which
.B FFchecked
and
.B FFmultiple
are defined.
.B Image
is only used for fields of type
.BR Fimage ;
it points to an image item containing the image to be displayed.
.B Ctlid
is reserved for use by the caller, typically to store a unique id
of an associated control used to implement the field.
.B Events
is the same as the corresponding field of the generic attributes
associated with the item containing this field.
.B Options
is only used by fields of type
.BR Fselect ;
it consists of a list of possible options that may be selected for that
field, using the following structure:
.PP
.EX
.ta 6n +\w'Option* 'u
typedef struct Option Option;
struct Option
{
	Option* next;
	int selected;
	Rune* value;
	Rune* display;
};
.EE
.PP
.B Next
points to the next element of the list.
.B Selected
is set if this option is to be displayed initially.
.B Value
is the value to send when the form is submitted if this option is selected.
.B Display
is the string to display on the screen for this option.
.SS Tables
.PP
The library builds a list of all the tables in the document,
headed by
.BR Docinfo.tables .
Each element of this list has the following format:
.PP
.EX
.ta 6n +\w'Tablecell*** 'u
typedef struct Table Table;
struct Table
{
	Table* next;
	int tableid;
	Tablerow* rows;
	int nrow;
	Tablecol* cols;
	int ncol;
	Tablecell* cells;
	int ncell;
	Tablecell*** grid;
	Align align;
	Dimen width;
	int border;
	int cellspacing;
	int cellpadding;
	Background background;
	Item* caption;
	uchar caption_place;
	Lay* caption_lay;
	int totw;
	int toth;
	int caph;
	int availw;
	Token* tabletok;
	uchar flags;
};
.EE
.PP
.B Next
points to the next element in the list of tables.
.B Tableid
is a serial number for the table within the document.
.B Rows
is an array of row specifications (described below) and
.B nrow
is the number of elements in this array.
Similarly,
.B cols
is an array of column specifications, and
.B ncol
the size of this array.
.B Cells
is a list of all cells within the table (structure described below)
and
.B ncell
is the number of elements in this list.
Note that a cell may span multiple rows and/or columns, so
.B ncell
may be smaller than
.BR nrow*ncol .
.B Grid
is a two-dimensional array of cells within the table; the cell
at row
.B i
and column
.B j
is
.BR Table.grid[i][j] .
A cell that spans multiple rows and/or columns will
be referenced by
.B grid
multiple times, but it occurs only once in
.BR cells .
.B Align
gives the alignment specification for the entire table,
and
.B width
gives the requested width as a dimension specification.
.BR Border ,
.B cellspacing
and
.B cellpadding
give the values of the corresponding attributes for the table,
and
.B background
gives the requested background for the table.
.B Caption
is a linked list of items to be displayed as the caption of the
table, either above or below depending on whether
.B caption_place
is
.B ALtop
or
.BR ALbottom .
Most of the remaining fields are reserved for use by the caller,
except
.BR tabletok ,
which is reserved for internal use.
The type
.B Lay
is not defined by the library; the caller can provide its
own definition.
.PP
The
.B Tablecol
structure is defined for use by the caller.
The library ensures that the correct number of these
is allocated, but leaves them blank.
The fields are as follows:
.PP
.EX
.ta 6n +\w'Point 'u
typedef struct Tablecol Tablecol;
struct Tablecol
{
	int width;
	Align align;
	Point pos;
};
.EE
.PP
The rows in the table are specified as follows:
.PP
.EX
.ta 6n +\w'Background 'u
typedef struct Tablerow Tablerow;
struct Tablerow
{
	Tablerow* next;
	Tablecell* cells;
	int height;
	int ascent;
	Align align;
	Background background;
	Point pos;
	uchar flags;
};
.EE
.PP
.B Next
is only used during parsing; it should be ignored by the caller.
.B Cells
provides a list of all the cells in a row, linked through their
.B nextinrow
fields (see below).
.BR Height ,
.B ascent
and
.B pos
are reserved for use by the caller.
.B Align
is the alignment specification for the row, and
.B background
is the background to use, if specified.
.B Flags
is used by the parser; ignore this field.
.PP
The individual cells of the table are described as follows:
.PP
.EX
.ta 6n +\w'Background 'u
typedef struct Tablecell Tablecell;
struct Tablecell
{
	Tablecell* next;
	Tablecell* nextinrow;
	int cellid;
	Item* content;
	Lay* lay;
	int rowspan;
	int colspan;
	Align align;
	uchar flags;
	Dimen wspec;
	int hspec;
	Background background;
	int minw;
	int maxw;
	int ascent;
	int row;
	int col;
	Point pos;
};
.EE
.PP
.B Next
is used to link together the list of all cells within a table
.RB ( Table.cells ),
whereas
.B nextinrow
is used to link together all the cells within a single row
.RB ( Tablerow.cells ).
.B Cellid
provides a serial number for the cell within the table.
.B Content
is a linked list of the items to be laid out within the cell.
.B Lay
is reserved for the user to describe how these items have
been laid out.
.B Rowspan
and
.B colspan
are the number of rows and columns spanned by this cell,
respectively.
.B Align
is the alignment specification for the cell.
.B Flags
is some combination of
.BR TFparsing ,
.B TFnowrap
and
.B TFisth
or'd together.
Here
.B TFparsing
is used internally by the parser, and should be ignored.
.B TFnowrap
means that the contents of the cell should not be
wrapped if they don't fit the available width;
rather, the table should be expanded if need be
(this is set when the nowrap attribute is supplied).
.B TFisth
means that the cell was created by the
.B <th>
element (rather than the
.B <td>
element),
indicating that it is a header cell rather than a data cell.
.B Wspec
provides a suggested width as a dimension specification,
and
.B hspec
provides a suggested height in pixels.
.B Background
gives a background specification for the individual cell.
.BR Minw ,
.BR maxw ,
.B ascent
and
.B pos
are reserved for use by the caller during layout.
.B Row
and
.B col
give the indices of the row and column of the top left-hand
corner of the cell within the table grid.
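.PP
Since a spanning cell appears in several grid slots, code that walks the grid usually needs to act on each cell only once.
One way to do that, sketched below, is to handle a cell only at the slot matching its own row and col fields, which mark its top left-hand corner.
The structure is a cut-down stand-in holding just the fields the example needs, the helper name countcells is hypothetical, and the check for empty slots is a defensive assumption rather than documented behaviour.

```c
#include <stddef.h>

/* Cut-down stand-in for the documented Tablecell structure. */
typedef struct Tablecell Tablecell;
struct Tablecell
{
	int rowspan;
	int colspan;
	int row;
	int col;
};

/* Hypothetical helper: count distinct cells reachable through
 * the grid, counting each spanning cell only at its top
 * left-hand corner. */
static int
countcells(Tablecell ***grid, int nrow, int ncol)
{
	int i, j, n;
	Tablecell *c;

	n = 0;
	for(i = 0; i < nrow; i++)
		for(j = 0; j < ncol; j++){
			c = grid[i][j];
			if(c != NULL && c->row == i && c->col == j)
				n++;
		}
	return n;
}
```

.PP
The same corner test can guard any per-cell work done during a grid traversal, such as layout or drawing.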
.SS Client-side Maps
.PP
The library builds a list of client-side maps, headed by
.BR Docinfo.maps ,
and having the following structure:
.PP
.EX
.ta 6n +\w'Rune* 'u
typedef struct Map Map;
struct Map
{
	Map* next;
	Rune* name;
	Area* areas;
};
.EE
.PP
.B Next
points to the next element in the list,
.B name
is the name of the map (used to bind it to an image), and
.B areas
is a list of the areas within the image that comprise the map,
using the following structure:
.PP
.EX
.ta 6n +\w'Dimen* 'u
typedef struct Area Area;
struct Area
{
	Area* next;
	int shape;
	Rune* href;
	int target;
	Dimen* coords;
	int ncoords;
};
.EE
.PP
.B Next
points to the next element in the map's list of areas.
.B Shape
describes the shape of the area, and is one of
.BR SHrect ,
.B SHcircle
or
.BR SHpoly .
.B Href
is the URL associated with this area in its role as
a hypertext link, and
.B target
is the target frame it should be loaded in.
.B Coords
is an array of coordinates for the shape, and
.B ncoords
is the size of this array (number of elements).
.SS Frames
.PP
If the
.B Docinfo.kidinfo
field is set, the document is a frameset.
In this case, it is typical for
.I parsehtml
to return nil, as a document which is a frameset should have no actual
items that need to be laid out (such will appear only in subsidiary documents).
It is possible that items will be returned by a malformed document; the caller
should check for this and free any such items.
.PP
The
.B Kidinfo
structure itself reflects the fact that framesets can be nested within a document.
It is defined as follows:
.PP
.EX
.ta 6n +\w'Kidinfo* 'u
typedef struct Kidinfo Kidinfo;
struct Kidinfo
{
	Kidinfo* next;
	int isframeset;

	// fields for "frame"
	Rune* src;
	Rune* name;
	int marginw;
	int marginh;
	int framebd;
	int flags;

	// fields for "frameset"
	Dimen* rows;
	int nrows;
	Dimen* cols;
	int ncols;
	Kidinfo* kidinfos;
	Kidinfo* nextframeset;
};
.EE
.PP
.B Next
is only used if this structure is part of a containing frameset; it points to the next
element in the list of children of that frameset.
.B Isframeset
is set when this structure represents a frameset; if clear, it is an individual frame.
.PP
Some fields are used only for framesets.
.B Rows
is an array of dimension specifications for rows in the frameset, and
.B nrows
is the length of this array.
.B Cols
is the corresponding array for columns, of length
.BR ncols .
.B Kidinfos
points to a list of components contained within this frameset, each
of which may be a frameset or a frame.
.B Nextframeset
is only used during parsing, and should be ignored.
.PP
The remaining fields are used if the structure describes a frame, not a frameset.
.B Src
provides the URL for the document that should be initially loaded into this frame.
Note that this may be a relative URL, in which case it should be interpreted
using the containing document's URL as the base.
.B Name
gives the name of the frame, typically supplied via a name attribute in the HTML.
If no name was given, the library allocates one.
.BR Marginw ,
.B marginh
and
.B framebd
are the values of the marginwidth, marginheight and frameborder attributes, respectively.
.B Flags
can contain some combination of the following:
.B FRnoresize
(the frame had the noresize attribute set, and the user should not be allowed to resize it),
.B FRnoscroll
(the frame should not have any scroll bars),
.B FRhscroll
(the frame should have a horizontal scroll bar),
.B FRvscroll
(the frame should have a vertical scroll bar),
.B FRhscrollauto
(the frame should be automatically given a horizontal scroll bar if its contents
would not otherwise fit), and
.B FRvscrollauto
(the frame gets a vertical scrollbar only if required).
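.PP
Because framesets nest, code that visits every frame is naturally recursive: follow next across siblings and descend through kidinfos whenever isframeset is set.
The sketch below counts the individual frames under a frameset using a cut-down stand-in for the Kidinfo structure; the helper name countframes is hypothetical.

```c
#include <stddef.h>

/* Cut-down stand-in for the documented Kidinfo structure,
 * keeping only the fields needed to walk the tree. */
typedef struct Kidinfo Kidinfo;
struct Kidinfo
{
	Kidinfo* next;
	int isframeset;
	Kidinfo* kidinfos;
};

/* Hypothetical helper: count the individual frames in a list
 * of siblings, descending into nested framesets. */
static int
countframes(Kidinfo *k)
{
	int n;

	n = 0;
	for(; k != NULL; k = k->next){
		if(k->isframeset)
			n += countframes(k->kidinfos);
		else
			n++;
	}
	return n;
}
```

.PP
Replacing the increment with a call that loads the frame's src would give the skeleton of a loader for all subsidiary documents.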
.SH SOURCE
.B \*9/src/libhtml
.SH SEE ALSO
.MR fmt 1
.PP
W3C World Wide Web Consortium,
``HTML 4.01 Specification''.
.SH BUGS
The entire HTML document must be loaded into memory before
any of it can be parsed.