MarkupUsingBrackets

Things the markup code should do

Use only brackets?

This is some text. This is [b bold text].

A line (or group of lines) flanked by blank lines is a paragraph.
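As a rough sketch of the paragraph rule above, blank lines can serve as separators. The function name split_paragraphs is hypothetical and is not part of the code further down; whitespace-only lines are assumed to count as blank.

```php
<?php
// Hypothetical helper: split Wiki text into paragraphs, where a paragraph
// is a line (or group of lines) flanked by blank lines.
function split_paragraphs($text) {
	// Make all end-of-line characters consistent
	$text = str_replace(array("\r\n", "\r"), "\n", $text);

	// A newline followed by one or more blank (or whitespace-only) lines
	// separates two paragraphs. Empty results are dropped.
	return preg_split('/\n(?:[ \t]*\n)+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}
```

Each returned string may still contain single newlines, which a later pass could turn into <br /> tags.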

[table
	[row
		[td ]
	]
]

Pass 1 evaluates line types and page structure

Assume that a new section MUST start with a single command on a line by itself: [table], [list], [numlist], [quote], [item], [comment], [code], [html], and [/].

Then we do multiple passes through the Wiki text. The first pass just divides the text into section commands and the text between those commands.

	// Break the Wiki text down by section (because different sections must be
	// handled differently).

	$string = str_replace("\r\n", "\n", $string);
	$string = str_replace("\r", "\n", $string);
	$lines = explode("\n", $string);
	$i = 0;
	$sections = array('');
	$cmds = array(
		'quote', 'table', 'list', 'numlist', 'item',
		'comment', 'code', 'html', '/'
	);
	foreach ($lines as $line) {
		$cmd = trim($line);
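		// A section command is a known word in brackets on a line by itself.
		if ($cmd !== '' and $cmd[0] == '[' and substr($cmd, -1) == ']') {
			$word = substr($cmd, 1, -1); // strip both brackets
			if (in_array($word, $cmds)) {
				$sections[$i+1] = $word;
				$i += 2;
				$sections[$i] = ''; // start the next text section empty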
			} else {
				$sections[$i] .= "$line\n";
			}
		} else {
			$sections[$i] .= "$line\n";
		}
	}

	// Handle newline commands according to section: "[<<]" and "[>>]" work
	// in every section except "[code]" and "[html]".

	$stack = array(''); // $stack[0] is the top level (no section open)
	$sp = 0;
	$i = 0;
	$count = count($sections);
	while ($i < $count) {
		$line = $sections[$i];

		// For all sections except [code] and [html], resolve newline
		// commands: [>>] at the start of a line fuses the line to the
		// end of the preceding line, and [<<] inserts a newline, thus
		// splitting a line in two.

		if ($stack[$sp] != 'code' and $stack[$sp] != 'html') {
			$line = str_replace ("\n[>>]", '', $line);
			$line = str_replace ('[<<]', "\n", $line);
		}

		// For all sections except [html], convert HTML characters ('<',
		// '>', and '&') into HTML entities. (Yes, including those inside
		// embedded bracket commands.)

		if ($stack[$sp] != 'html') {
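			// '&' must be converted first; otherwise the '&' in '&lt;'
			// and '&gt;' would itself be converted again.
			$line = str_replace ('&', '&amp;', $line);
			$line = str_replace ('<', '&lt;', $line);
			$line = str_replace ('>', '&gt;', $line);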
		}

		$sections[$i] = $line;

		// As long as we haven't reached the last section already, get the
		// next section name. If it is "/", pop the previous section name
		// from the stack; otherwise push the new section name onto the
		// stack. Make sure the section pointer ($i) is pointing at the
		// next text item.

		if ($i + 1 < $count) {
			$cmd = $sections[$i + 1];
			if ($cmd == '/') {
				--$sp; if ($sp < 0) $sp = 0;
			} else {
				++$sp; $stack[$sp] = $cmd;
			}
		}

		// Always advance, even after the last section; otherwise the
		// loop would never terminate.
		$i += 2;
	}

After this we can do other things. We can start adding some HTML tags right away:

  • The first nonblank line in a series is preceded by <p> (or <tr><td> in a [table], or enclosed in <li>...</li> in a [list] or [numlist]).
  • Each subsequent nonblank line is preceded by <br /> (or </td><td> in a [table], or enclosed in <li>...</li>).
  • The first blank line after a nonblank line (or the end of the text) is preceded by </p> (or </td></tr>, or nothing in a [list] or [numlist]).
  • The [table] section is enclosed in <table>...</table>.
  • The [quote] section is enclosed in <blockquote>...</blockquote>. If the section is in a table but not an item, it is further enclosed in <td>...</td>.
  • The [list] section is enclosed in <ul>...</ul>. If the section is in a table but not an item, it is further enclosed in <td>...</td>.
  • The [numlist] section is enclosed in <ol>...</ol>. If the section is in a table but not an item, it is further enclosed in <td>...</td>.
  • The [item] section in a table is enclosed in <td>...</td>.
  • The [item] section in a list or numlist is enclosed in <li>...</li>.
  • The [comment] section is simply removed.
  • The [code] section is enclosed in <pre>...</pre>.
  • The [html] section is NOT enclosed in tags.
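
The wrapper rules in the list above could be collected into one lookup, sketched below. The name section_wrapper and the $parent argument are assumptions made for illustration; the sketch covers only the enclosing tags, not the per-line <p>, <br />, and <li> handling.

```php
<?php
// Hypothetical helper: returns array(opening tags, closing tags) for a
// section, or null when the section is removed entirely ([comment]).
// $parent is the name of the enclosing section ('' at the top level).
function section_wrapper($section, $parent = '') {
	$tags = array(
		'table'   => array('<table>', '</table>'),
		'quote'   => array('<blockquote>', '</blockquote>'),
		'list'    => array('<ul>', '</ul>'),
		'numlist' => array('<ol>', '</ol>'),
		'code'    => array('<pre>', '</pre>'),
		'html'    => array('', ''), // NOT enclosed in tags
	);
	if ($section == 'comment') {
		return null;
	}
	if ($section == 'item') {
		// An item is a cell in a table and a list item otherwise.
		return ($parent == 'table')
			? array('<td>', '</td>')
			: array('<li>', '</li>');
	}
	if (!isset($tags[$section])) {
		return array('', ''); // unknown section: no wrapper
	}
	list($open, $close) = $tags[$section];

	// Directly inside a table, these sections are also wrapped in a cell.
	if ($parent == 'table' && in_array($section, array('quote', 'list', 'numlist'))) {
		$open = '<td>' . $open;
		$close = $close . '</td>';
	}
	return array($open, $close);
}
```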

Normalize line endings

  • Find "\n[>>]" and remove it
  • Find "[<<]", replace with newline
  • Find "[:xxxx:]", ensure it is preceded by newline
	// Make all end-of-line characters consistent
	$string = str_replace("\r\n", "\n", $string);
	$string = str_replace("\r", "\n", $string);

	// Resolve newline commands
	$string = str_replace("\n[>>]", "", $string);
	$string = str_replace("[<<]", "\n", $string);
	// Insert a newline before "[:" only where one is not already there
	$string = preg_replace('/(?<=[^\n])\[:/', "\n[:", $string);

	// Split Wiki text into lines
	$lines = explode("\n", $string);

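	$output = '';

	// Section flags; assumed to be updated by section-command handling
	// that is not shown here.
	$pre = FALSE;  // TRUE while inside a [:code:] or [:html:] section
	$html = FALSE; // TRUE while inside an [:html:] section

	foreach ($lines as $line) {
		$sp = 0; // stack pointer
		$stack = array(''); // $stack[0] collects the finished line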
		$space = FALSE;
		$i = 0;
		$count = strlen($line);

		do {
			while ($i < $count) {
				$c = $line[$i++];

				// If current section is NOT [:code:] or [:html:],
				// compress spans of whitespace into single spaces.

				if ($pre == FALSE && ($c == ' ' or $c == "\t")) {
					if ($space == FALSE) {
						$stack[$sp] .= ' ';
						$space = TRUE;
					}
					continue; // whitespace is fully handled here
				}
				$space = FALSE;

				// If character begins a command, go up one
				// stack level. If character ends a command,
				// execute the command and append the result
				// to the text on the next stack level down.
				// (This has the effect of running innermost
				// nested commands first.)

				if ($c == '[') {
					++$sp;
					$stack[$sp] = ''; // start the new level empty
				} elseif ($c == ']') {

					// If user entered too many closing
					// brackets, stack would "underflow"
					// and crash this program. Excess
					// closing brackets can be ignored.

					if ($sp > 0) {
						$result = command($stack[$sp]);
						--$sp;
						$stack[$sp] .= $result;
					}
				}

				// If character is an HTML special character,
				// and current section is not [:html:], then
				// replace character with an HTML entity.

				elseif ($c == '<' && $html == FALSE) {
					$stack[$sp] .= '&lt;';
				} elseif ($c == '>' && $html == FALSE) {
					$stack[$sp] .= '&gt;';
				} elseif ($c == '&' && $html == FALSE) {
					$stack[$sp] .= '&amp;';
				}

				// Otherwise just append the character to the
				// text at the current stack level.

				else {
					$stack[$sp] .= $c;
				}
			}

			// If we've reached the end of the line and still have
			// items on the stack, then append closing brackets to
			// the line so that the stack can be cleaned up neatly.

			if ($sp > 0) {
				$count += $sp;
				$line .= str_repeat (']', $sp);
			}
		} while ($i < $count);

		// We now have a line of HTML at the bottom of the stack.
		$output .= $stack[0];
	}

The command() function might itself keep its own stack, for HTML tags that haven't yet been closed. If the command is enclosed in colons, then some or all of these tags might be pulled from the stack and sent to output.

Etc.
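
One possible shape for command(), borrowing the "cmd_x" function-name idea noted under "Older stuff" below, is sketched here. It is an assumption, not settled design: the handler cmd_em is hypothetical, the tag stack mentioned above is left out, and names such as "to-do" that are not valid PHP identifiers simply fall through to the literal-text case.

```php
<?php
// Hypothetical handler for [em ...].
function cmd_em($arg) {
	return "<em>$arg</em>";
}

// Sketch of command(): $text is everything between one pair of brackets.
function command($text) {
	// The first word is the command name; the rest is its argument text.
	$parts = explode(' ', trim($text), 2);
	$name = $parts[0];
	$arg = isset($parts[1]) ? $parts[1] : '';

	// Only names that look like identifiers are looked up as functions.
	$func = 'cmd_' . $name;
	if (preg_match('/^[a-zA-Z][a-zA-Z0-9_]*$/', $name) && function_exists($func)) {
		return $func($arg);
	}

	// Unknown command: print the command itself as text.
	return '[' . $text . ']';
}
```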

Ruminations

  • Commands require a nonalphanumeric character after the opening bracket? [.film Star Wars], [.code ls -l], [.em Important!], [.sb in brackets], [.set count 500], [.sum 500 count], [.term ergativity], [+code] to open section, [-code] to close. No, this is ugly. For a section, use a colon: [:code:] to open a section, then [::] to close the section at the top of the nesting stack, then [code ls -l] for an inline "code" section.
  • What I need first is something that will take the Wiki text apart into its components -- text, section openings, section closings, inline commands, and inline command arguments (some of which might be phrases in quotes).
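
A minimal sketch of such a splitter follows, assuming the colon syntax proposed above ([:name:] opens a section, [::] closes one). The function name tokenize_wiki and the token labels are hypothetical. Nested commands and quoted arguments are not handled; only innermost bracket groups are recognized.

```php
<?php
// Hypothetical tokenizer: returns a list of tokens, each an array whose
// first element is 'text', 'open', 'close', or 'command'.
function tokenize_wiki($text) {
	$tokens = array();

	// Split on bracket groups that contain no nested brackets, keeping
	// the groups themselves in the result.
	$parts = preg_split('/(\[[^\[\]]*\])/', $text, -1,
		PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

	foreach ($parts as $part) {
		if ($part === '[::]') {
			$tokens[] = array('close', '');
		} elseif (preg_match('/^\[:([a-zA-Z_0-9]+):\]$/', $part, $m)) {
			$tokens[] = array('open', $m[1]);
		} elseif (preg_match('/^\[(\S+)\s*(.*)\]$/s', $part, $m)) {
			// Inline command: name, then the raw argument text
			$tokens[] = array('command', $m[1], $m[2]);
		} else {
			$tokens[] = array('text', $part);
		}
	}
	return $tokens;
}
```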

Code to search for "[:...:]":

function test_regexp() {
	$subject = 'see if [:code it:] is [:not:] found';
	$pattern = '/\[\:[a-zA-Z_0-9]+\:\]/';
	preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
	echo "<pre>\n";
	print_r($matches);
	echo "</pre>\n";
}

RESULTS:
Array
(
    [0] => Array
        (
            [0] => [:not:]
            [1] => 22
        )

)

Older stuff

  • Alternative to below: Just gather data about each line: bare text, commands, each arg for each command. Array has two-element arrays as elements, first element in pair is 't' for text, 'c' for command, 'x' for closing command, or 'a' for arg; second element is the text, command, or argument itself. Thus if array has only one element and it is a 'c' or 'x', then it could be a section command; if more than one element, then there is no section command.
  • Alternative: Commands do NOT vary according to whether they are on a line by themselves or not. A section command begins with + -- so [+quote] always begins a blockquote section and [/] always closes the most recently opened section. And [code] is used for inline stuff, but you use [+code] to end the current paragraph or item and start a code (<pre>) section.
  • Replace "\r\n" and then "\r" with "\n" -- then [<<] with [newline] -- then [>>] with [mergeline] -- then split the text into lines -- then scan each line for '['. For each '[', call command(). At the end of each line, call endline(). Each bracket command "x" corresponds to a real PHP function "cmd_x" that should have been included already. (Before calling the function, we check if the function exists, and if not, we print the command itself as text.)
	// Make all end-of-line characters consistent
	$string = str_replace("\r\n", "\n", $string);
	$string = str_replace("\r", "\n", $string);

	// Convert symbol commands into named commands
	$string = str_replace('[<<]', '[newline]', $string);
	$string = str_replace('[>>]', '[mergeline]', $string);

	// Split Wiki text into lines
	$lines = explode("\n", $string);

	foreach ($lines as $line) {

		// Trim line
		$trimmed = trim($line);

		// If command on line by itself, it may be a new-section command.
		if ($trimmed !== '' and $trimmed[0] == '[' and substr($trimmed, -1) == ']') {
			$trimmed = substr($trimmed, 0, -1); // remove final ']'
			if (substr($trimmed, 1, 1) == '/') {
				$closing = true;
				$trimmed = substr($trimmed, 2);
			} else {
				$closing = false;
				$trimmed = substr($trimmed, 1);
			}
			if ($trimmed == 'comment') {

			} elseif ($trimmed == 'quote') {

			} elseif ($trimmed == 'list') {

			} elseif ($trimmed == 'numlist') {

			} elseif ($trimmed == 'item') {

			} elseif ($trimmed == 'table') {

			} elseif ($trimmed == 'code') {

			} elseif ($trimmed == 'html') {

			} else {

			}
		}

		// Trim line
		// ($section is assumed to hold the name of the current section.)
		if ($section != 'code' and $section != 'html') {
			$line = trim($line);
		}
		// Need also check for section-command,
		// e.g., "[/code]" or "[/html]" on a line by itself.

		// (Shouldn't we split line into parts in brackets
		// vs. parts out of brackets here?)

		// Convert HTML characters into HTML entities
		// QUESTION!!!! Do we do this inside brackets?????
		if ($section != 'html') {
			$line = str_replace('&', '&amp;', $line);
			$line = str_replace('<', '&lt;', $line);
			$line = str_replace('>', '&gt;', $line);
		}
		// For each bracket on the line, starting from the left:

			// function command($name):
			// Create function name "cmd_" + first word in brackets
			// (but if word has nonalphanumeric characters, just look
			// word up in glossary of text substitutions).
			// Find end of command in line; return offset into line
			// to next character to process.

		// At end of line, do special processing (???)
	}
  • Convert free-standing square brackets into HTML entities ("&#91;" and "&#93;"). This lets us use '[' and ']' in the saved HTML so run-time processing can occur.
  • Search for [>>] at end of each line -- where found, merge the line it is on with the next line.
  • Search for [<<], replace with newline character.
  • Grab Wiki text a line at a time. Tags that are opened in the line must be closed at or before the end of the line; that is, if the user entered two quote marks to turn on italics but forgot to close italics, then the parser needs to add the </em> tag.
  • If the current section is not [code] or [html], trim leading and trailing whitespace from the line.
  • Compare each line with the previous one.
    • If the current line is blank, and the previous line was part of a paragraph, then close the paragraph.
    • If the previous line ended a paragraph or was part of something that wasn't a paragraph, then the current line might be the start of a new paragraph (if it doesn't begin a table).

Any two consecutive lines are fused by ending the first line with two backslashes. Any line can be split into two lines by entering [<] at the point where the line should be split. Therefore, my code must remove all instances of two-backslashes-and-a-linebreak, and then replace each instance of "[<]" (not inside a tag) with a newline.
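
A minimal sketch of those two replacements follows. The function name resolve_line_breaks is hypothetical, and the sketch ignores the "not inside a tag" condition, so every "[<]" is replaced.

```php
<?php
// Hypothetical helper: fuse and split lines before any other parsing.
function resolve_line_breaks($text) {
	// Make all end-of-line characters consistent
	$text = str_replace(array("\r\n", "\r"), "\n", $text);

	// Two backslashes followed by a linebreak fuse two lines into one.
	$text = str_replace("\\\\\n", '', $text);

	// [<] splits a line in two.
	$text = str_replace('[<]', "\n", $text);

	return $text;
}
```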

Markup scheme

I rely mainly on commands in square brackets. I'll allow some conventional Wiki formatting commands, but only a handful.

Commands should be nestable, so that a field that calculates a sum can be nested inside a formatting command, e.g., [em [sum vowelcount conscount]]. (Here I'm assuming that fields can also set constants, e.g., [set vowelcount 8].)

Note that a command cannot cover text on more than one line (unless it is a section command on a line by itself), so my code must close any commands that the user has neglected to close before the end of the line.
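
The nesting and end-of-line rules might be sketched as below. This is an illustration under assumptions, not the intended implementation: eval_brackets is a hypothetical name, only [set], [sum], and [em] are handled, and command names are assumed to be lowercase letters. Innermost commands run first because the pattern only matches bracket groups with no brackets inside them.

```php
<?php
// Hypothetical evaluator for one line. $vars holds values stored by [set].
function eval_brackets($line, array &$vars) {
	// Close any commands the user left open at the end of the line.
	$open = substr_count($line, '[') - substr_count($line, ']');
	if ($open > 0) {
		$line .= str_repeat(']', $open);
	}

	// Matches only innermost groups (no brackets inside).
	$pattern = '/\[([a-z]+)\s*([^\[\]]*)\]/';
	do {
		$line = preg_replace_callback($pattern, function ($m) use (&$vars) {
			$args = trim($m[2]);
			switch ($m[1]) {
				case 'set':
					list($name, $value) = array_pad(explode(' ', $args, 2), 2, '');
					$vars[$name] = $value;
					return '';
				case 'sum':
					$total = 0;
					foreach (preg_split('/\s+/', $args, -1, PREG_SPLIT_NO_EMPTY) as $name) {
						$total += isset($vars[$name]) ? (int) $vars[$name] : 0;
					}
					return (string) $total;
				case 'em':
					return "<em>$args</em>";
				default:
					return $args; // unknown command: keep its text
			}
		}, $line, -1, $count);
	} while ($count > 0); // repeat until no bracket groups remain
	return $line;
}
```

With vowelcount set to 8 as in the text and conscount set to an arbitrary 14, the line "[set vowelcount 8][set conscount 14][em [sum vowelcount conscount]]" is reduced in two passes: first the two [set] commands and the inner [sum], then the outer [em].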

Example commands:

  • [film Star Wars] -- changes "Star Wars" into a link to imdb.com and displays "Star Wars" in italics.
  • [code ls -l] -- displays "ls -l" in a monospaced font.
  • [link karig.net my home page] -- displays "my home page" as a link to "http://karig.net/". (The first word is assumed to be a URL. If the URL contains spaces, enclose it in doublequotes.)
  • [red This is an error message] -- displays "This is an error message" in red letters.
  • [tags "John Cleese" "Monty Python" comedy 1970s] -- makes "John Cleese", "Monty Python", "comedy", and "1970s" into tags attached to the current article.
  • [comment This section needs more work] -- allows comments by not displaying the text following the word "comment".
  • [to-do 2007-09-15 Ask again about this proposal] -- adds the text "Ask again about this proposal", and a link back to the current article, to the calendar on the date "2007-09-15".
  • [set vowelcount 8] -- creates a variable "vowelcount" and assigns it the value "8".
  • [sum vowelcount conscount] -- displays the sum of the most recent versions of the variables listed (here "vowelcount" and "conscount").
  • [sb in brackets] -- displays "[in brackets]".
  • [em important] -- displays "important" in italics.
  • [term ergativity] -- displays "ergativity" in a style appropriate for terms being introduced, e.g., in bold italics.
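
Several of these commands take arguments that may be phrases in doublequotes. One way to split such an argument string is sketched below; split_args is a hypothetical name, and escaped quotes inside a phrase are not supported.

```php
<?php
// Hypothetical helper: split command arguments on whitespace, treating
// a phrase in doublequotes as a single argument.
function split_args($text) {
	preg_match_all('/"([^"]*)"|(\S+)/', $text, $matches, PREG_SET_ORDER);
	$args = array();
	foreach ($matches as $m) {
		// Group 2 is present only when an unquoted word matched.
		$args[] = isset($m[2]) ? $m[2] : $m[1];
	}
	return $args;
}
```

For [link karig.net my home page], the first element would be the URL and the remaining elements could be joined back together as the link text.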

Hard-coded commands are nonalphanumeric commands (user-defined commands must begin with a letter):

  • [<] -- forces newline
  • [-] -- forces minus sign (en-dash)
  • -- forces em dash
  • ---- (a line of hyphens) -- on a line by itself, forces a new section (in plain text, it creates a horizontal line; in a table, it forces a new row)
  • [&161] or [&xA1] -- creates HTML entity ("&#161;" or "&#xA1;", both displayed as "¡"). Use [&x5b] or [&91] for '[' and [&x5d] or [&93] for ']'.
  • I may add a similar command for italic, but it'd be better to have most commands be based on the purpose of the text, not on their look.
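
The numeric entity command is simple enough to sketch with one regular expression. The function name entity_commands is hypothetical; the other hard-coded commands are not covered here.

```php
<?php
// Hypothetical helper: turn [&161] (decimal) or [&xA1] (hexadecimal)
// into the numeric HTML entities &#161; or &#xA1;.
function entity_commands($text) {
	return preg_replace('/\[&(x[0-9A-Fa-f]+|[0-9]+)\]/', '&#$1;', $text);
}
```

This would have to run after the pass that converts '&' into '&amp;', or the ampersand it produces would be escaped again.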

Markup sections

A markup section is enclosed in an opening section command and a closing section command. Each command must be on a line by itself; otherwise the command is treated as inline (and its influence ends at the end of the line it is on).

The section commands are:

  • [quote] ... [/quote] to enclose a blockquote (or to nest a blockquote within another blockquote).
  • [list] ... [/list] to enclose items in a list. Lists can be nested. Items are normally preceded with bullets, but if the opening tag is [list #], then the items are numbered.
  • [table] ... [/table] to enclose a table of data.
  • [item] ... [/item] to enclose a multiline item in a list (or a multiline cell in a table). Normally each line in a [list] section becomes an item in the list; [item] allows multiple lines, and such things as blockquotes and small tables, to be part of a single list item or table cell.
  • [code] ... [/code] to create a "<pre>" section where newlines and spaces are preserved.
  • [html] ... [/html] allows raw HTML or embedded PHP or JavaScript to be added to the page (and run when the page is displayed). This directive is treated like [code] unless the user's account actually allows the [html] tag.
  • [comment] ... [/comment] to remove text from the HTML and leave it in the Wiki -- for comments.

Note that if [quote] or [table] or [code] is on a line with other text, then it would produce something inline, something that would be terminated at the end of the line even if the [/quote] or [/table] or [/code] command is missing.

| | text | [quote] | [list] | [table] | [item] | [code] | [html] |
|---|---|---|---|---|---|---|---|
| How text line ends | Ends with <br /> or </p> tag | Ends with <br /> or </p> tag | Enclosed in <ul> or <ol> tags | Enclosed in <td> tags | Ends with <br /> or </p> tag | Ends as is | Ends as is |
| Blank line | Ends paragraph (group of lines) | Ends paragraph (group of lines) | Ignored (each line is a list item) | Starts new row | Ends paragraph (lines in an item) | Left as is | Left as is |
| Line of hyphens | Ends paragraph, adds horiz. line | Ends paragraph, adds horiz. line | Adds horiz. line | Starts new row | Ends paragraph, adds horiz. line | Left as is | Left as is |
| Angle brackets | To HTML entities | To HTML entities | To HTML entities | To HTML entities | To HTML entities | To HTML entities | Left as is |
| Spaces at start/end of line | Ignored | Ignored | Ignored | Ignored | Ignored | Left as is | Left as is |

Formatting codes (might not be needed after all)

My markup will use primarily commands in brackets instead of conventional Wiki formatting codes, but some formatting codes are handy:

  • **bold**, //italic//, and ##bold italic## -- you can't nest these
  • ---- on a line by itself to create a horizontal line (but in a table, it marks the following row of cells as header cells)
  • +Heading (one to six plus signs at the beginning of a line marks the line as a section header)
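
If these codes were kept, a single line could be converted roughly as follows. This is a sketch under assumptions: format_codes is a hypothetical name, the table meaning of "----" is ignored, and "//" inside a URL such as "http://..." would be mistaken for italics.

```php
<?php
// Hypothetical helper: apply the conventional formatting codes to one line.
function format_codes($line) {
	// ---- on a line by itself becomes a horizontal line.
	if (trim($line) === '----') {
		return '<hr />';
	}

	// Inline codes; they cannot be nested, so simple patterns suffice.
	$line = preg_replace('/\*\*(.+?)\*\*/', '<strong>$1</strong>', $line);
	$line = preg_replace('#//(.+?)//#', '<em>$1</em>', $line);
	$line = preg_replace('/##(.+?)##/', '<strong><em>$1</em></strong>', $line);

	// One to six plus signs at the beginning of a line mark a heading.
	if (preg_match('/^(\+{1,6})\s*(.*)$/', $line, $m)) {
		$level = strlen($m[1]);
		$line = "<h$level>{$m[2]}</h$level>";
	}
	return $line;
}
```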

To make links or other things, you'd use bracket commands.

I can dispense with bold, italic, and bold italic! Just use "I must [em really] stress..." or "A [term collie] is..." or "[critical WARNING!]". Also just use [h1] through [h6]. This leaves you with just [...], and [/...] for end tags, and [\...] for literal brackets.

Page last modified on February 04, 2008, at 01:53 PM