Web Media: HTML Basics

Summary

Introduction
- Purpose of this document
- Limits of this document
Structure separate from appearance
- Electronic publishing features
- Structure
- Appearance
Elements and tags
- Purpose of elements
- Elements are delimited by tags
- Container elements should not overlap
- Some tags can take one or more attributes
Information whereabouts
- Hypertext and addressing
- URL format
- Protocols
- File specifications
- Fragment identifiers and labels
Set up the framework of the page
- Purpose of the framework
- HTML container
- HEAD container
- BODY container
Fill in content
- TITLE container
- P container (optionally an empty element)
- Character entities
- IMG empty element
- EMBED empty element
Delineate structure
- H containers
- Style and Phrase containers
- BR empty element
- HR empty element
- List containers
- TABLE and related containers
Establish links
- A container with HREF attribute (link origin)
- A container with NAME attribute (link destination)

Note: For improved legibility, this file uses Cascading Style Sheets--supported in version 4 and later of the Netscape and Microsoft browsers.

Introduction
- Purpose of this document
  - This document is intended as an introduction for first-time authors to the HyperText Markup Language (HTML), the standard used to compose web pages. It assumes a basic understanding of Unix naming conventions (discussed separately on this site at http://www.sanedraw.com/LEARN/WEBMEDIA/UNIXSTRT.HTM) and of Internet communications.
  - In this document, we will attempt to condense the knowledge necessary to create a basic web page into three concepts and four activities:
    - Concept 1: the structure of a web page and its appearance are defined separately.
    - Concept 2: the structure of a web page is indicated by elements, which in turn are delimited by tags.
    - Concept 3: the information within web pages can be linked to other information items, each identified in a standard manner.
    - Activity 1: define the basic framework for the page.
    - Activity 2: fill in multimedia content (text, pictures, animations, etc.).
    - Activity 3: delineate the structure of the page--so readers can better use it.
    - Activity 4: establish origin and destination of links to additional material.
- Limits of this document
  - The basic instructions given here leave out many advanced web features. The main goal here is ease of learning (and, as an added bonus, the generation of pages compatible with virtually all browser programs).
    - You should be able to flesh out the basic pages resulting from this process by incrementally adding elements, without losing any of the work you did initially.
  - The information presented here is based on my own understanding of the HTML standard, and of its actual implementations. Due to wide-ranging and rapid changes in this area, not everything will work as described here in all circumstances.
    - Regarding terminology, I have tried to comply with general usage as much as possible. Changing and conflicting definitions, as well as a desire to simplify the topic, may have led me to use some terms in non-standard ways.

Structure separate from appearance
- Electronic publishing features
  - Unlike desktop publishing, which is geared towards the inflexible medium of paper, web page design should take full advantage of the flexibility of electronic publishing.
  - To this end, the standard used to define web pages, the HyperText Markup Language (HTML), allows for strict separation between the structure of the web page and the way that same structure is expressed (visually, aurally, spatially, etc.).
  - As HTML evolved, the separation of structure and appearance was increasingly muddled by ad-hoc additions (such as font specifications). In later versions of the standard, this separation was reaffirmed by the introduction of style sheets (discussed separately on this site at http://www.sanedraw.com/LEARN/WEBMEDIA/CSSSTART.HTM). Style sheets also afford more visual diversity than was previously possible.
- Structure
  - Defining the structure of the web page is what the author does when s/he marks up the page.
  - The structure of the page defines the relative importance and purpose of its elements. For example:
    - Main headlines are defined as more prominent than subheadings.
    - Body type is divided into paragraphs.
    - Lists of items are classified hierarchically.
    - Tabular material is divided into rows and columns.
  - Structural specifications do not imply a specific appearance for the content. Rather, they spell out the purpose of appearance specifications (for instance, a main headline may take on a bold appearance for the purpose of being more prominent).
- Appearance
  - The appearance of the web page is determined at the time the page is formatted by the browser program.
  - Formatting is partly the result of the author's choices and the browser's defaults--but ultimately both should yield to the reader's preferences.
    - Even when readers use the same browser software and the same type of computer as the author, differences in the appearance of the page will occur--due to availability of fonts, width of windows, type of browser window features enabled, display hardware and software settings, and many other variables. Attempting to fully specify appearance in a web page is a counterproductive waste of time.
  - The goal of formatting should be to convey the structure and content of the page in a way best suited to the individual reader. It should not be to constrain the page within a preconceived notion of what looks or works best.

Elements and tags
- Purpose of elements
  - Markup involves assigning each content item to a specific place in the structure of a web page. This is done by assigning the content item to an element.
  - Except for special circumstances, browsers ignore any other formatting of the text in the HTML file. In particular, carriage returns, tabs, and runs of spaces are collapsed into a single space when the page is displayed: they can be used to clarify the HTML code itself for the benefit of the author--without interfering with the reader's view of the page.
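  - As a minimal sketch of the whitespace rule above (the P tags are described later in this document, and the sentence is only sample text), most browsers should display the following two fragments identically:

    ```html
    <P>Mix eggs and flour in a bowl.</P>

    <P>Mix   eggs
       and flour
            in a bowl.</P>
    ```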
- Container elements are delimited by start and end tags
  - A start tag in HTML is a word (the name of the tag) enclosed in angle brackets (less-than and greater-than symbols).
    <sampletag>
  - An end tag is the same as the corresponding start tag, but adds a slash immediately after the opening angle bracket.
    <sampletag></sampletag>
- Empty elements are indicated by a start tag only.
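  - For illustration, assuming the BR element (a line break, described later in this document), an empty element marks a single point in the content instead of enclosing part of it:

    ```html
    Carrots are orange<BR>
    Berries are blue
    ```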
- Containers should not overlap
  - If a container starts before a previous container closes, then it should also close before the previous container.
  - In other words, if a container is inside of another container, no part of it should 'stick out'.
  - The following containers are okay, since they do not overlap (container B is fully inside container A)
    <sampleA> <sampleB> </sampleB> </sampleA>
  - The following containers are okay, since they also do not overlap (container B is fully outside container A)
    <sampleA> </sampleA> <sampleB> </sampleB>
  - The following containers won't work, since they do overlap (container B is partially inside container A)
    <sampleA> <sampleB> </sampleA> </sampleB>
- Some tags can take one or more attributes
  - Attributes are used to fully specify the meaning of the tag. For instance, a tag that calls up a picture will take an attribute with the location of the picture file.
  - Attributes appear within the same angle brackets as the tag name, following the tag name. The tag name and each attribute are separated by one or more spaces.
    - Attributes are listed only in the start tag, never in the end tag.
  - Each attribute is made up of a word (the name of the attribute), followed by = (equals sign), followed by the value of the attribute.
    <sampletag attribute1="some text" attribute2=256>
    - The attribute value may be a number, a text string, or whatever combination is appropriate for the specific attribute.
    - Enclosing the attribute value in " (straight double quotes) should always work, whereas leaving the quotes out will not (a value containing spaces, for instance, needs them). However, I've had the occasional plug-in balk at quoted numeric values. As a general suggestion, if it doesn't work with quotes, try without.

Information whereabouts
- Hypertext and addressing
  - An essential feature of the Web is that it supports hypertext--the linking of information items in various locations within a page, within a web site, and even across web sites.
  - To achieve this linking it is necessary to specify the location of the information items down to a very detailed level--more detailed than previous Internet addressing standards would allow. Linking requirements led to the development of Uniform Resource Locators (URLs), an addressing scheme that extends previous Internet standards.
- URL format
  - Most fully qualified URLs contain the specification for the computer file that contains the information item, a protocol (indicating how the reader's computer should access the file), and a fragment identifier (indicating a labeled location within the file).
    - If all three are present, they are listed in the following order: the protocol, then a : (colon), then the file specification, then a # (pound sign), then the fragment identifier. For example:
      http://www.sanedraw.com/LEARN/WEBMEDIA/HTMSTART.HTM#urls
      is the complete URL to access this section of the file you are currently reading, using a web browser.
  - Some protocols are not used to access files, and the corresponding URLs have different formats.
  - While the actual access rules are dependent on the computer hosting the web site, you should assume that URLs are case sensitive.
- Protocols
  - The most common protocol is the HyperText Transfer Protocol (HTTP), used to access files through web server software. It is indicated by http.
  - Other commonly used protocols are:
    - File Transfer Protocol (FTP), used to download and upload large files. Indicated by ftp.
    - Gopher, used for electronic publishing over the Internet before the Web. Indicated by gopher.
    - File--indicating that the file should be accessed through the normal operating system facilities of the reader's computer. This generally means that the file is stored on the same computer that is running the browser software. Indicated by file.
  - Two other commonly used protocols that require a special URL format:
    - Mail--used to send e-mail. Indicated by mailto. Instead of a file specification and fragment identifier, it takes a standard Internet mail address. For example:
      mailto:somebody@some.place.com
      would send mail to the user 'somebody' of the computer with DNS name 'some.place.com'.
    - Network News Transfer Protocol (NNTP), indicated by news. Takes a standard Internet newsgroup designation instead of a file specification and fragment identifier. For example:
      news:alt.somehobby
      would access the messages posted to the newsgroup 'alt.somehobby'.
- File specifications
  - A relative specification starts from the location of the file containing the URL, and lists the folders (directories) that must be opened in order to get to the file containing the information item.
    - Relative specifications are preferable, since they allow some changes to the location of the files without having to update the URLs.
    - Relative specifications cannot be used to access files on a site different from the one where the URL resides.
    - The elements of the file specification are listed according to Unix conventions: directory names are separated by a / (forward slash), and an enclosing (parent) directory is indicated by .. (two periods). For example:
      myfolder/myfile.html
      is the path to a file called 'myfile.html' located in a subdirectory (called 'myfolder') of the current directory;
      ../yourfile.html
      is the path to a file called 'yourfile.html' located in the parent directory (the one directly enclosing the current directory).
  - Absolute file specifications start from a fixed location, then list folders leading to the file containing the information item.
    - A file specification starting with a leading slash is an absolute file specification starting at the root (top level) of the filesystem of the same site where the URL resides. For example:
      /LEARN/WEBMEDIA/HTMSTART.HTM
      is an absolute pathname (note the slash at the beginning) to the file you are currently reading, which will only work starting from files on this same site.
    - To refer to a location on a different site, the file specification should start with // (two forward slashes), the DNS name or IP address corresponding to the site, then the absolute path name to the file starting from the root of the site. For example:
      //www.sanedraw.com/LEARN/WEBMEDIA/HTMSTART.HTM
      is an absolute pathname to the file you are currently reading, valid from anywhere on the Internet.
  - If the file specification is missing, the browser program assumes that it should look within the file that contains the URL itself.
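  - The forms of file specification described above might appear in link origins as sketched below. The A container and its HREF attribute are covered later in this document, the file names are taken from the examples in this section, and the comments (text between <!-- and -->) are not displayed by the browser:

    ```html
    <!-- Relative: a file in a subdirectory of the current directory -->
    <A HREF="myfolder/myfile.html">my file</A>

    <!-- Relative: a file in the parent directory -->
    <A HREF="../yourfile.html">your file</A>

    <!-- Absolute, starting at the root of the same site -->
    <A HREF="/LEARN/WEBMEDIA/HTMSTART.HTM">HTML basics</A>

    <!-- Absolute, naming the site explicitly, with the protocol in front -->
    <A HREF="http://www.sanedraw.com/LEARN/WEBMEDIA/HTMSTART.HTM">HTML basics</A>
    ```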
- Fragment identifiers and labels
  - Prior to using a URL containing a fragment identifier, the corresponding label must be attached to a portion of the destination file using the appropriate markup.
  - If a URL does not include a fragment identifier, it defaults to the beginning of the destination file.
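  - A minimal sketch of how a label and fragment identifiers fit together, assuming the A container described later in this document and the label 'urls' used in the example URL earlier in this section. The link texts are made up for illustration:

    ```html
    <!-- In the destination file: attach the label 'urls' to a location -->
    <A NAME="urls">URL format</A>

    <!-- In another file on the same site: file specification plus fragment identifier -->
    <A HREF="/LEARN/WEBMEDIA/HTMSTART.HTM#urls">about URL format</A>

    <!-- In the destination file itself: file specification omitted -->
    <A HREF="#urls">about URL format</A>

    <!-- No fragment identifier: leads to the beginning of the file -->
    <A HREF="/LEARN/WEBMEDIA/HTMSTART.HTM">HTML basics</A>
    ```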

Set up the framework of the page
- Purpose of the framework
  - This is the bare minimum necessary to create a valid, but empty, web page. It sets up the locations where the actual contents will be placed.
- HTML container
  - The 'outermost' container for the page. All other containers are located inside this container.
    <HTML> </HTML>
- HEAD container
  - The first part of the web page, mostly containing items invisible to the reader (for instance, indexing information, language, authorship) and used by automatic retrieval systems.
  - The HEAD container goes inside the HTML container.
    <HTML> <HEAD> </HEAD> </HTML>
- BODY container
  - The main part of the web page, containing the information displayed to the reader.
  - The BODY container goes inside the HTML container, outside of and after the HEAD container.
    <HTML> <HEAD> </HEAD> <BODY> </BODY> </HTML>

Fill in content
- TITLE container
  - This is the only content visible to the reader which goes inside the HEAD container. All other content items we will consider will go inside the BODY container.
  - TITLE contains a short piece of text describing the purpose of the web page.
    <TITLE>How to Bake Carrot Muffins</TITLE>
  - A succinct but descriptive title will help readers find the page in the history and bookmark menus available in most browsers. It will also improve automatic indexing of the page by Internet search engines.
  - By default, most browsers display the contents of TITLE in the title bar of the browser window.
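  - Placed within the framework set up earlier, the title from the example above would sit as follows (a sketch; the BODY is still empty at this point):

    ```html
    <HTML>
    <HEAD>
    <TITLE>How to Bake Carrot Muffins</TITLE>
    </HEAD>
    <BODY>
    </BODY>
    </HTML>
    ```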
- P container (optionally an empty element)
  - A separate P element is used for each paragraph of the text content of the web page.
  - Most browsers accept both container and empty variants of the P element. The following two examples are generally interpreted in the same manner:
    <P>Mix eggs and flour in a bowl.
    <P>Mix eggs and flour in a bowl.</P>
  - Using paragraph elements appropriately highlights the meaning of the text, and makes it more legible.
  - By default, most browsers separate paragraphs with blank lines.
- Character entities
  - Some text characters cannot be inserted directly into an HTML document. These include:
    - Characters used in HTML markup--whenever you want the viewer to see them in the rendered page. For example, the < and > (normally used to bracket tags) need special handling if you want to include the formula x > -10, x < 1 in your page.
    - Characters not included in the set common to virtually all computers. This is the ASCII set, which includes the digits 0 through 9, the uppercase and lowercase letters of the Latin alphabet (without diacritical marks such as accents and diereses), most punctuation marks, and a basic set of symbols (the top row of the keyboard).
      - So as not to complicate this document beyond its stated limits, we will assume the use of the ISO Latin-1 encoding (which comprises the characters needed for most languages originating in Western Europe).
      - If you need support for other languages, refer to the World Wide Web Consortium's page on Internationalization at http://www.w3.org/International/.
  - The special characters mentioned above can be represented as character entities. There are two ways to do this:
    - Named entities are easier to remember because they are represented by an abbreviated mnemonic. For instance, the named entity for the character ã ('a' with a tilde) is atilde. Note that names of entities are case-sensitive: Atilde yields Ã.
    - Numbered entities are represented by a number (indicating the character's position in the ISO Latin-1 encoding) prepended with a #. Since some browsers have incomplete support for named entities, numbered entities may be a more compatible alternative.
    - Both named and numbered entities are preceded by a & and followed by a ; (semicolon). The city of São Paulo can appear in your HTML either as S&atilde;o Paulo or S&#227;o Paulo.
    - A thorough list of named and numbered entities maintained by the Web Design Group is available at http://www.htmlhelp.com/reference/charset/.
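  - As a sketch of the first case above, the formula x > -10, x < 1 could be entered with the named entities gt and lt, or with the numbered entities for the same characters (62 and 60). Because the ampersand begins every entity, displaying an ampersand itself calls for the entity amp. The sentences are sample text only:

    ```html
    <P>The formula is x &gt; -10, x &lt; 1.</P>
    <P>The formula is x &#62; -10, x &#60; 1.</P>
    <P>Eggs &amp; flour</P>
    ```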
- IMG empty element
  - IMG elements are used for images displayed within the body of the web page, called inline images. Non-inline images are the ones displayed separately from the calling web page, using a hyperlink.
  - The IMG element is empty (it does not require a matching end tag). It does however require a SRC attribute, whose value is the URL pointing to the image file:
    <IMG SRC="http://pix.sample.com/images/sampleimage.gif">
  - Adding attributes for the size (in pixels) of the image will speed up the display of the page, since the text can be laid out immediately, before the graphics complete loading. Many graphics programs (such as Photoshop) will provide this information.
    <IMG SRC="otherimage.gif" WIDTH=256 HEIGHT=128>
  - By default, most browsers will display GIF and JPEG images inline. More recent versions add support for PNG images.
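  - To contrast the two cases described above, here is a sketch with hypothetical file names and sizes. The first line displays a small picture inline; the second offers a link (using the A container covered later in this document) that the reader can follow to view a larger picture separately from the page:

    ```html
    <IMG SRC="muffin-small.gif" WIDTH=64 HEIGHT=64>
    <A HREF="muffin-large.jpg">See the full-size muffin picture</A>
    ```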
- EMBED empty element
  - This is used for other inline media elements (video, sound, animation, etc.). It is a Netscape extension, not an approved part of the HTML standard, and may eventually be replaced by a similar OBJECT element.
  - The EMBED element is also empty. In its basic format it is similar to the IMG element--it requires a SRC attribute, whose value is the URL pointing to the media file:
    <EMBED SRC="http://glitz.sample.com/animations/samplemovie.dcr">
  - The EMBED element may take a variety of additional attributes, depending on the specific requirements of the type of media displayed. WIDTH and HEIGHT are commonly used for visual media.
  - Embedded media elements are generally not supported directly by the browser--readers will need to install appropriate software add-ons called plugins.
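  - A sketch of an EMBED element with size attributes, assuming a hypothetical movie file in the same directory as the page and an arbitrary size of 320 by 240 pixels:

    ```html
    <EMBED SRC="samplemovie.dcr" WIDTH=320 HEIGHT=240>
    ```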

Delineate structure
- H containers
  - This is a family of containers, delimited by tags named H1, H2, ... through H6. These containers hold text headings, decreasing in prominence as the tag number increases. Typically, H1 is used only once for the main headline at the top of the page, while H2 and H3 are used for subheads within the text.
  - In the following example, the main headline is followed by a brief paragraph, then by a subhead and another paragraph:
    <H1>Carrot Muffin Central</H1>
    <P>Carrot muffins are good for you. Here is how to whip 'em up.</P>
    <H2>Getting Started</H2>
    <P>Mix eggs and flour in a bowl.</P>
  - Heads and subheads are important signposts that assist the reader in understanding how the web page is organized.
  - By default, most browsers display headings as larger, bold type, followed by a blank line. Beyond H3 or H4, however, it is generally hard to distinguish the various heading levels.
- Style and Phrase containers
  - These containers are used for small portions of text that need to be displayed in a unique manner (for instance, to emphasize a technical word).
  - Some style tags will mandate the manner in which the content is displayed (for instance, by italicizing it), thus partially violating the separation between structure and appearance:
    <P><B>Carrot</B> muffins are <I>good</I> for you.</P>
    - In this example, the word 'carrot' will be bolded, and 'good' will be italicized.
  - Other tags simply indicate that the text needs to be differentiated somehow, leaving the specifics to the browser and, possibly, to the reader:
    <P><STRONG>Carrot</STRONG> muffins are <EM>good</EM> for you.</P>
    - This example shows the two available emphasis tags, EM (interpreted as italics in most browsers), and STRONG (generally interpreted as bold). Different visual and/or aural devices could be used to highlight the text in these containers.
  - A special case is where type needs to be displayed in a monospaced font (one whose characters are all the same width). This may be necessary to align tabular data without resorting to tables (which may not be supported in some browsers). Putting type in a TT container will accomplish this:
    <H2>Ingredients</H2>
    <P><TT>Flour__________2 lb.</TT></P>
    <P><TT>Eggs___________6</TT></P>
    <P><TT>Carrots________1.5 lb.</TT></P>
- BR empty element
  - BR inserts a line break in the text content of the web page. Notice that this is conceptually different from delineating a paragraph, and is generally displayed differently in browsers (P is followed by a blank line, BR isn't).
  - One use of BR is to break the lines of a poem according to the meter of the verse, separately from the division into stanzas (which may be handled as paragraphs):
    <P>Carrots are orange<BR>
    Berries are blue<BR>
    Muffins are yummy<BR>
    And berries are, too.</P>
- HR empty element
  - HR inserts a horizontal rule (line) in the body of the web page.
  - HR is useful to indicate the boundaries between major sections of content, and to visually organize the web page:
    <P>This paragraph concludes our discussion of carrot muffins.</P>
    <HR>
    <H3>All About Blueberry Muffins</H3>
    <P>[Wo]man does not live by carrot muffins alone.</P>
- List elements
  - This is another family of elements, used to list content items in hierarchical order--usually displayed in an indented outline format.
  - Most lists are made up of three nested levels:
    - The enclosing list container. Of the many types included in early versions of HTML, only two are now recommended:
      - UL (unordered list), generally displayed as a bulleted list.
      - OL (ordered list), generally displayed as a numbered list.
    - One or more LI (list item) elements inside the list container, usually written with a start tag only (the end tag is optional).
    - Actual content items (text and/or graphics) follow each LI tag.
  - A special case is a DL (definition list). Instead of listing simple LIs, it contains:
    - DT (definition term), meant to be used for the word or words being defined.
    - DD (definition description), used for the explanation of DT's contents, and usually displayed indented from it.
  - To create more complex outlines, lists can be nested (e.g., an LI can contain a list container, possibly of a different kind).
  - In the example below, an unordered list (types of berries) is nested within an ordered list (preliminary steps to baking):
    <OL>
    <LI>Build up an appetite
    <LI>Pick your berry
      <UL>
      <LI>Boysenberry
      <LI>Blueberry
      <LI>Raspberry
      <LI>Strawberry
      </UL>
    <LI>Storm into the kitchen
    </OL>
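  - The definition list described above can be sketched as follows, with made-up terms and definitions. The end tags of DT and DD are left out here, in the same way as the LI tags in the nested list example:

    ```html
    <DL>
    <DT>Batter
    <DD>The mixture of eggs and flour, before baking.
    <DT>Muffin tin
    <DD>The pan that holds the batter in the oven.
    </DL>
    ```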
- TABLE and related containers
  - TABLE, with its subcontainers, is used to list content items in tabular order, usually displayed in a grid of rows and columns.
    - TABLEs are used extensively to create more complex layouts than early HTML would allow. Unfortunately this makes for a rigid arrangement that is not amenable to alternative appearances. More flexible approaches for positioning content elements have emerged in later versions of HTML.
  - A complete table is made up of four nested levels:
    - The enclosing TABLE
    - One or more TR (table row) containers inside TABLE
    - One or more cell containers inside each row. These can be TH (table headers, displayed more prominently) or TD (table data, ordinary content).
    - Actual content items (text and/or graphics) inside each cell.
  - To create more complex arrangements, tables can be nested (e.g., a cell container can contain a TABLE container).
  - Except for TABLE itself (which always requires a closing tag), the other containers can be entered either with or without an end tag.
  - The example below shows a simple 5-row, 2-column table with a row of 2 header cells and 4 rows of 2 data cells each:
    <TABLE>
    <TR> <TH>Berry       <TH>Color
    <TR> <TD>Blueberry   <TD>blue
    <TR> <TD>Boysenberry <TD>purple
    <TR> <TD>Raspberry   <TD>magenta
    <TR> <TD>Strawberry  <TD>red
    </TABLE>
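  - As a sketch of the nesting and end-tag options mentioned above, the table below writes out every end tag and places a second TABLE inside one of its cells. The contents are sample text only:

    ```html
    <TABLE>
    <TR>
      <TH>Muffin</TH>
      <TH>Berries used</TH>
    </TR>
    <TR>
      <TD>Mixed berry</TD>
      <TD>
        <TABLE>
        <TR> <TD>Blueberry</TD> <TD>blue</TD> </TR>
        <TR> <TD>Raspberry</TD> <TD>magenta</TD> </TR>
        </TABLE>
      </TD>
    </TR>
    </TABLE>
    ```

    Writing out the end tags is optional for the row and cell containers, but it makes it easier to see where the inner table closes.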

Establish links
- A container with HREF attribute (link origin)
  - These contain the text or picture which the reader should click to activate the link.
    <A HREF="http://www.someplace.com/news/info.html#contents">click here to see a news summary</A>
  - The value of the HREF attribute is the URL pointing to the destination of the link.
  - By default, most browsers display the contents of these containers as blue underlined text. The color changes after the link has been activated, to remind the reader that s/he has followed the link before.
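  - Since the link origin can be a picture as well as text, an IMG element can be placed inside the A container. A sketch with hypothetical file names and sizes:

    ```html
    <A HREF="recipe.html"><IMG SRC="muffin.gif" WIDTH=64 HEIGHT=64></A>
    ```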
- A container with NAME attribute (link destination)
  - These contain the text or picture that should be displayed after the reader clicks a link leading to the labeled location.
    <A NAME="contents">News Summary</A>
  - The value of the NAME attribute is the label attached to the contents. This label matches the fragment identifier in URLs pointing to the location.
  - By default, browsers do not highlight the contents of these containers in any special way.
  - In the absence of labels, browsers show the beginning of the document.

HyperText Markup Language basics

part of SaneDraw's learning resources
