KE

I might take up writing PHP again. I'd like the site to serve multiple articles per page, instead of one article per page, which is what wiki software does.

Alternative

KE

KE is my next attempt at a web-snippet application.

File Formats

  • Snippets.
    • Snippet files are stored in folder "snip".
    • Each snippet occupies one line.
      • Format: Keywords (sep. by spaces); tab; linefied text (CRs replaced with "&#13;" and newlines replaced with "&#10;").
    • Each file contains up to 2048 snippets.
    • Location of snippet #N: in file with name (idhigh(N)), at line (idlow(N)).
  • Versions (old snippets).
    • Version files are stored in subfolders of folder "vers".
    • Each snippet gets its own version file.
    • Each version is one line, in the same format as in snippet files.
    • Subfolder name is (idhigh(N)); file name is (idlow(N)).
    • Versions are appended, so they are stored earliest first.
  • Keywords.
    • Why? To make keyword-based searches faster. (Keywords are stored separately from full-text indexes to make keyword searches as fast as possible.)
    • Keyword files are stored in folder "keys".
    • Keyword files contain only keywords, separated by spaces.
    • Location of keys for snippet #N: in file with name (idhigh(N)), at line (idlow(N)).
  • Full-text index.
    • Why? Because creating searchable text when editing and saving a single snippet uses much less memory than creating searchable text when searching through a whole file of snippets (which may even fail because the server is stingy with memory).
    • Index files are stored in folder "ftix".
    • File format is the same as for snippets, except that:
      • Each line begins with the snippet ID and a space. (This speeds up searches by eliminating the need to count lines or newline characters.)
      • All text is lowercase.
      • All letters with diacritical marks are changed to plain-ASCII equivalents.
      • All tabs, newlines, CRs, underscores, and hard spaces are converted to plain spaces.
      • Each run of more than one space is converted to a single space.
      • All characters other than letters, numbers, and spaces are removed.
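
The full-text rules above can be sketched in a few lines of PHP. This is only a sketch under assumptions: the name ftix_line() is hypothetical, and the diacritics table is a small illustrative sample rather than a complete mapping.

```php
<?php
// Hypothetical helper: build one full-text index line ("<ID> <text>\n")
// from a snippet ID and its raw text, following the rules listed above.
function ftix_line($id, $text) {
    // Tabs, newlines, CRs, underscores, and hard spaces become plain spaces.
    $text = str_replace(array("\t", "\n", "\r", '_', "\xC2\xA0"), ' ', $text);
    // Fold a few accented letters to plain ASCII (illustrative sample only).
    $fold = array(
        'á' => 'a', 'à' => 'a', 'â' => 'a', 'ä' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'í' => 'i', 'ó' => 'o', 'ö' => 'o', 'ú' => 'u', 'ü' => 'u',
        'ñ' => 'n', 'ç' => 'c',
        'Á' => 'a', 'É' => 'e', 'Í' => 'i', 'Ó' => 'o', 'Ú' => 'u',
        'Ñ' => 'n', 'Ç' => 'c',
    );
    $text = strtolower(strtr($text, $fold));
    // Remove everything that is not a letter, digit, or space.
    $text = preg_replace('/[^a-z0-9 ]/', '', $text);
    // Collapse runs of spaces into a single space.
    $text = trim(preg_replace('/ +/', ' ', $text));
    return $id . ' ' . $text . "\n";
}

echo ftix_line(2050, "Café_Olé:\tA  Tasty\r\nTreat!");
// prints "2050 cafe ole a tasty treat"
```

Any accented letter missing from the table is simply dropped by the final filter, so a real table would need to be much larger.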

Code

The file snip.php contains the heart of Snip. All of its functionality is available through just three functions:

  • function load($theID, $theVersion) -- Returns array: [0] is the snippet text, [1] is the last-modified date, and other elements are keywords. Specify $theVersion only to load a version other than the current (most recent) one.
  • function save($theKeys, $theText, $theID) -- Saves a new version of the snippet. (You don't need to specify a date; save() does that for you.) If you don't specify $theID, or if $theID doesn't exist in the database, save() creates a new snippet.
  • function find($theCriteria) -- Returns $theCriteria, filled in with data from the snippets that match the criteria given in $theCriteria[0]:
    • Phrases to find.
    • Search text as well as keywords?
    • Start from ID:
    • Max number of snippets to return.
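
As an illustration of what load() has to do once it has a line in hand, here is a minimal sketch. It assumes that each stored line starts with a timestamp followed by the keywords (as the save() draft writes it); the name parse_snippet_line() is hypothetical.

```php
<?php
// Hypothetical helper: turn one stored line
// ("<timestamp> <keywords>\t<linefied text>\n") into the array shape
// described for load(): [0] text, [1] date, then the keywords.
function parse_snippet_line($line) {
    $parts = explode("\t", rtrim($line, "\n"), 2);
    $head = $parts[0];
    $body = isset($parts[1]) ? $parts[1] : '';
    // Undo linefication.
    $text = str_replace(array('&#13;', '&#10;'), array("\r", "\n"), $body);
    // The first word of the head is the timestamp; the rest are keywords.
    $words = preg_split('/ +/', trim($head), -1, PREG_SPLIT_NO_EMPTY);
    $date = count($words) ? (int) array_shift($words) : 0;
    return array_merge(array($text, $date), $words);
}

$snippet = parse_snippet_line("1181227680 php wiki\tLine one&#10;Line two\n");
// $snippet[0] is the two-line text, $snippet[1] is 1181227680,
// and $snippet[2] and $snippet[3] are "php" and "wiki".
```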

define ('SNIP_DIR', 'snipdata');
define ('OLD_SNIP_DIR', 'snipvers');
define ('KEY_INDEX_DIR', 'snipkeys');
define ('TEXT_INDEX_DIR', 'sniptext');
define ('LINES_PER_SNIP_FILE', 2048);

function save ($theKeys, $theText, $theID = null) {
	foreach (array (SNIP_DIR, OLD_SNIP_DIR, KEY_INDEX_DIR, TEXT_INDEX_DIR) as $dir) {
		// Directories need the execute bit to be traversable, hence 0755.
		if (!file_exists($dir)) mkdir ($dir, 0755);
	}

	// TODO: also ask for a new ID when $theID isn't in the database.
	// (new_id() is still to be written.)
	if (!isset($theID)) {
		$theID = new_id();
	}

	list ($hid, $lid) = split_id ($theID);

	// PRESERVE EXISTING SNIPPET AS PREVIOUS VERSION
	// (Constants are not expanded inside quoted strings, so concatenate.)
	$snipFile = SNIP_DIR . "/$hid";
	$snips = file_exists($snipFile) ? file ($snipFile) : array();
	if (isset($snips[$lid]) && trim($snips[$lid]) != '') {
		$versDir = OLD_SNIP_DIR . "/$hid";
		if (!file_exists($versDir)) mkdir ($versDir, 0755);
		if ($handle = fopen("$versDir/$lid", 'a')) {
			fwrite ($handle, $snips[$lid]);
			fclose ($handle);
		}
	}
	$snips = array();

	// SAVE NEW VERSION OF SNIPPET AS CURRENT VERSION
	$theKeys = trim(preg_replace('/\s+/', ' ', $theKeys));
	$k = sprintf('%d ', time()) . $theKeys;
	$t = str_replace(array("\r","\n"), array("&#13;","&#10;"), $theText) . "\n";
	overwrite_line ($snipFile, $k . chr(9) . $t, $lid);

	// UPDATE KEYWORD INDEX
	overwrite_line (KEY_INDEX_DIR . "/$hid", $theKeys . "\n", $lid);

	// UPDATE FULL-TEXT INDEX (each line starts with the full snippet ID)
	overwrite_line (TEXT_INDEX_DIR . "/$hid", $theID . ' ' . fulltextify($theText) . "\n", $lid);

	return $theID; // Needed when a new snippet is created!
}

function split_id ($theID) {
	return array(
		(int) floor ($theID / LINES_PER_SNIP_FILE),
		($theID % LINES_PER_SNIP_FILE)
	);
}

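function overwrite_line ($thePath, $theLine, $lineNumber) {
	if ($lineNumber < 0) return false;
	// file() returns false for a missing file, so start from an empty list.
	$lines = file_exists($thePath) ? file($thePath) : array();
	// Pad with empty lines so that line $lineNumber exists and stays in order.
	$lines = array_pad($lines, $lineNumber + 1, "\n");
	$lines [$lineNumber] = $theLine;
	if (!$handle = fopen($thePath, 'w')) return false;
	fwrite ($handle, join ('', $lines));
	fclose ($handle);
	return true;
}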

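function fulltextify ($text) {
	$text = strtolower(strip_tags($text));
	// Underscores count as spaces; other punctuation is dropped.
	// (Folding diacritics to plain ASCII is not done here yet.)
	$text = str_replace('_', ' ', $text);
	$text = preg_replace('/[^a-z0-9\s]/', '', $text);
	$text = trim(preg_replace('/\s+/', ' ', $text));
	return $text;
}
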
class Snippet {
	var $id;
	var $date;
	var $keywords;
	var $text; //unlinefied
	var $version;

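	function __construct($theID = null, $theKeys = null, $theText = null) {
		// Assign to the object's properties, not to local variables.
		$this->id = isset($theID)? $theID : $this->new_id();
		$this->keywords = isset($theKeys)? $theKeys : array();
		$this->text = isset($theText)? $theText : '';
	}

	function Load($theID, $theVersion = null) {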
		if (!isset($theVersion)) {
			// LOAD LINE FROM SNIPPET FILE
		} else {
			// LOAD LINE FROM VERSION FILE
		}
		// GET DATE, KEYWORDS AND TEXT FROM LINE
	}

	function new_id() {
		// Look at server data for next snippet ID to use
		// and return ID.
	}

	function linefy ($text) {
		return str_replace(array("\r","\n"), array("&#13;","&#10;"), $text) . "\n";
	}

	function delinefy ($text) {
		return str_replace(array("&#13;","&#10;"), array("\r","\n"), rtrim($text));
	}
}

Actions

Top-level actions

  • Read snippet (return keywords and text ready to be displayed)
  • Write snippet (write keywords and text back into database)
  • Find snippets (list up to N IDs of snippets [starting at ID X] that have certain keywords or phrases)

Use classes?

  • Class Snippet would have public items: $ID (set it to load a different snippet), $keywords[], and $text. Snippet can also echo itself, and save itself (which updates files and indexes).
  • Class Query would let you change the list of keywords sought. It would then provide a list of snippet IDs. (This list might be cached on the server; this cache would be tied to a query ID of some kind, which could be passed in a URL.)

Utilities

define ('LINES_PER_SNIP_FILE', 2048);
function debug_log($text) {append_file("debuglog.txt", $text);}
function snip_pack($keys, $text) {
	return str_replace(chr(9), ' ', $keys) . chr(9) . linefy($text);
}
function overwrite_file($path, $data) {return do_write_file($path, $data, 'w');}
function append_file($path, $data) {return do_write_file($path, $data, 'a');}
function do_write_file($path, $data, $mode) {
	if (!$handle = fopen ($path, $mode)) {return false;}
	fwrite ($handle, $data);
	fclose ($handle);
	return true;
}
function snip_path ($ID) {return '/snip/' . idhigh($ID);}
function snip_line ($ID) {return idlow($ID);}
function version_path ($ID) {return '/vers/' . idhigh($ID) . '/' . idlow($ID);}
function idhigh ($ID) {return floor ($ID / LINES_PER_SNIP_FILE);}
function idlow ($ID) {return ($ID % LINES_PER_SNIP_FILE);}
function linefy ($text) {
	return str_replace(array("\r","\n"), array("&#13;","&#10;"), $text) . "\n";
}
function delinefy ($text) {
	return str_replace(array("&#13;","&#10;"), array("\r","\n"), rtrim($text));
}

Old

(Reconsider)

  • Maybe store snippets in the same kind of setup as for versions -- "s" folder has up to, say, 2048 files, each of which is the current snippet. (It's the full-text version of the current snippet that is stored in a file with other snippets, in order to make full-text searches fast without requiring a concordance or index for full-text searches. Same with keywords.)

Parts

File: .htaccess

This tells the server to serve "index.php" whenever a 404 (file not found) error occurs. (See also "The Perfect 404.")

Script: index.php

This serves up one or more snippets for reading.

Each snippet has an "Edit" link, which invokes priv/edit.php on that snippet.

The top of the screen displays a search box.

search(int offset, int count, array words)

The search function opens the "keys" file and jumps to the line specified in "offset".

The search function searches for snippets and returns an array:

  • [0] contains an array of names of snippets whose keyword set contains all of the words in "words".
  • [1] contains an array of names of snippets whose keyword set contains one or more, but not all, of the words in "words".
  • [2] contains an array of names of snippets whose text contains all of the phrases in "words".
  • [3] contains an array of names of snippets whose text contains one or more, but not all, of the phrases in "words".

The function works by opening the "keys" file, offsetting to a specific line (set by "offset"), and reading in a number of lines (set by "count"). Each line contains the name of a snippet and the keywords associated with that snippet. Elements [0] and [1] are created using these.
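
The keyword half of the search (elements [0] and [1]) might look something like the sketch below. The function name match_keywords() is hypothetical, and the sketch assumes the lines have already been read from the "keys" file and that each line is a snippet name followed by space-separated keywords.

```php
<?php
// Hypothetical helper: split snippets into "has all of the words" and
// "has some of the words", given lines of the form "<name> <kw> <kw> ...".
function match_keywords(array $lines, array $words) {
    $words = array_unique($words);
    $all = array();
    $some = array();
    foreach ($lines as $line) {
        $fields = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
        if (count($fields) == 0) continue;      // skip empty lines
        $name = array_shift($fields);           // first field: snippet name
        $hits = count(array_intersect($words, $fields));
        if ($hits == 0) continue;
        if ($hits == count($words)) {
            $all[] = $name;
        } else {
            $some[] = $name;
        }
    }
    return array($all, $some);
}

list($all, $some) = match_keywords(
    array("a php wiki\n", "b php\n", "c cooking\n"),
    array('php', 'wiki')
);
// $all is array('a'); $some is array('b').
```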

After this, the snippets' text files are opened and searched to create elements [2] and [3].

Folder: s (snippets)

Each snippet has a number, which is used to derive (1) the file in which the snippet is stored, and (2) the number of the line containing the snippet.

LINES_PER_SNIPPET_FILE = 2048
filename               = floor(snippet_ID / LINES_PER_SNIPPET_FILE)
line_number            = snippet_ID % LINES_PER_SNIPPET_FILE

Whenever a snippet is saved, it is "linefied" -- &#13; replaces each carriage-return character, and &#10; replaces each newline character. An actual newline character is appended to the end of the resulting line.

All snippets are saved in files in the "s" folder. If each file is allowed up to 2048 lines (LINES_PER_SNIPPET_FILE), and a server directory can contain up to 32,766 files, then KE can store up to 67,104,768 snippets (2048 × 32,766).
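
The arithmetic above can be checked with a short sketch. The names locate_snippet() and MAX_FILES_PER_DIR are hypothetical; the two numbers are the ones quoted in the text.

```php
<?php
const LINES_PER_SNIPPET_FILE = 2048;  // lines per snippet file, from the text
const MAX_FILES_PER_DIR = 32766;      // directory limit assumed in the text

// Hypothetical helper: map a snippet ID to its file name and line number.
function locate_snippet($id) {
    return array(
        'file' => intdiv($id, LINES_PER_SNIPPET_FILE),
        'line' => $id % LINES_PER_SNIPPET_FILE,
    );
}

$capacity = LINES_PER_SNIPPET_FILE * MAX_FILES_PER_DIR;
echo $capacity, "\n";                 // prints 67104768

$where = locate_snippet(5000);
echo $where['file'], ' ', $where['line'], "\n";   // prints "2 904"
```

The highest usable ID is therefore 67,104,767, which lands on line 2047 of file 32765.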

Folder: v (versions)

Whenever a snippet is edited, the previous version of the snippet is appended to a version file, which is stored in a subfolder of the "v" folder.

The name of the version file is the same as the line_number above, which is the offset into a snippet file to the current version of the snippet. The name of the subfolder is the same as the snippet file name. Thus all non-current versions of a given snippet are stored together in a single file, and each subfolder contains versions of only those snippets whose current versions comprise the contents of a particular snippet file.

Folder: t (texts)

A snippet is saved twice -- once as HTML into the snippet file, and once as plain text into the full-text file, which is used for searches. Each snippet in the full-text file is saved as a snippet ID, a space, the "fulltextified" text (all lowercase, no punctuation, no whitespace except single spaces), and a newline. (The presence of the snippet ID simplifies searches: As soon as you find a phrase or keyword, search back for the newline and grab the numeric characters that follow.)
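
The "search back for the newline" idea could be sketched as follows. The function name find_ids() is hypothetical, and the sketch assumes the full-text file (or a chunk of it that begins at the start of a line) is already in memory as $buffer.

```php
<?php
// Hypothetical helper: return the IDs of the snippets in $buffer whose
// full-text line contains $phrase. Lines look like "<ID> <text>\n".
function find_ids($buffer, $phrase) {
    $ids = array();
    $pos = 0;
    while (($hit = strpos($buffer, $phrase, $pos)) !== false) {
        // Search back from the hit for the newline that ends the previous line.
        $nl = strrpos(substr($buffer, 0, $hit), "\n");
        $start = ($nl === false) ? 0 : $nl + 1;
        // The digits at the start of the line are the snippet ID.
        $idLen = strspn($buffer, '0123456789', $start);
        if ($hit < $start + $idLen + 1) {
            // The hit is inside the ID itself, not the text; keep looking.
            $pos = $hit + 1;
            continue;
        }
        $ids[] = (int) substr($buffer, $start, $idLen);
        // Skip to the next line so that each snippet is reported only once.
        $next = strpos($buffer, "\n", $hit);
        if ($next === false) break;
        $pos = $next + 1;
    }
    return $ids;
}

$buffer = "7 hello world\n12 another hello there\n2050 nothing here\n";
print_r(find_ids($buffer, 'hello'));  // IDs 7 and 12
```

The substr() call copies the buffer up to each hit, which is simple but not the fastest approach; it is fine for a sketch.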

Folder: k (keywords)

Each snippet can have keywords associated with it. These are stored in a keyword file, whose structure mirrors that of the snippet file.

Keywords are stored separately from text, because we're supposed to be able to assign keywords to snippets without including keywords in snippet text, and keyword searches are intended to be faster and to work like categories.

Folder: priv

This is a password-protected folder containing the edit.php and save.php scripts.

Script: priv/edit.php

This script takes a parameter "n" containing the number (ID) of the snippet to edit. If this is not given, then edit.php displays a blank screen so you can create a new snippet.

This script displays three fields: snippet ID, keyword list, and snippet text. You can:

  • save changes and return to index.php (save.php?back=2)
  • save changes and continue editing (save.php)
  • see previous versions of the snippet (restore.php)

Script: priv/save.php

This script takes a parameter "back", indicating how many steps back JavaScript should take through the browser history. The default is 1, so that once the save is completed, you are returned to the editor.

Script: priv/restore.php

This script takes a parameter "n" containing the ID of the snippet. If this is not given, the script displays a list of snippets and asks you to pick one and click "See versions".

This script lets you restore a previous version of a snippet. It displays previous versions of a given snippet, so that you can delete one or more of them, or select one for editing. (Once edit.php is running, you click "Save" to restore the text as the new current version of the snippet.)
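
Listing the versions for restore.php could start from something like this sketch. It assumes the version file holds one linefied version per line, earliest first, with keywords before the first tab; the name list_versions() is hypothetical.

```php
<?php
// Hypothetical helper: read a version file and return its versions,
// newest first, with the text delinefied for display.
function list_versions($path) {
    if (!file_exists($path)) return array();
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $versions = array();
    foreach ($lines as $n => $line) {
        $parts = explode("\t", $line, 2);
        $body = isset($parts[1]) ? $parts[1] : '';
        $versions[] = array(
            'version'  => $n,          // 0 is the earliest version
            'keywords' => $parts[0],
            'text'     => str_replace(array('&#13;', '&#10;'), array("\r", "\n"), $body),
        );
    }
    return array_reverse($versions);   // newest first
}
```

restore.php would then loop over the result and show each entry with "delete" and "edit" controls.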

Code overview

Most code should go into "ke.php" which other files include -- particularly all functions that load data from or save data to files or alter folders or save info on the most recently used snippet ID.

Design

KE should avoid reading data from disk unnecessarily. For example, it should never read an entire file into memory (unless it has to get at the data in the last line of the file). It should also open files at the beginning of a search and not close them until everything is displayed.

One strategy might involve using fread to read in 64KB or 128KB or 256KB at a time and then parsing the buffer for newline characters (or using substr_count to count them). (This should be OK if you proceed on the assumption that snippet files are never written by any software except KE.)

Experiments:

  • See how fast it is to open a really huge text file using a 128KB buffer and then use strpos to jump from line to line until you reach the start of a particular line number.
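
A starting point for that experiment might look like the sketch below. The name line_offset() is hypothetical; it reads the file in 128KB chunks with fread and uses strpos to hop from newline to newline.

```php
<?php
// Hypothetical helper: return the byte offset at which line $lineNumber
// (0-based) starts, or false if the file has too few newlines.
// If the file ends right after the matching newline, the offset
// returned equals the file size.
function line_offset($path, $lineNumber, $bufSize = 131072) {
    if ($lineNumber == 0) return 0;
    if (!$handle = fopen($path, 'rb')) return false;
    $seen = 0;   // newlines seen so far
    $base = 0;   // file offset of the start of the current buffer
    while (!feof($handle)) {
        $buf = fread($handle, $bufSize);
        if ($buf === false || $buf === '') break;
        $pos = 0;
        while (($nl = strpos($buf, "\n", $pos)) !== false) {
            $seen++;
            if ($seen == $lineNumber) {
                fclose($handle);
                return $base + $nl + 1;
            }
            $pos = $nl + 1;
        }
        $base += strlen($buf);
    }
    fclose($handle);
    return false;
}
```

With the offset in hand, fseek() followed by fgets() would fetch just the one line. Timing this against file() on a large snippet file would show whether the extra code is worth it.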

Constants

  • LINES_PER_SNIPPET_FILE = 2048

Functions

  • linefy($text): Substitutes "&#10;" for each newline character, and "&#13;" for each carriage-return character, in "text"; appends "\n"; returns the result.
  • delinefy($text): Replaces each occurrence of "&#10;" with a newline character, and each occurrence of "&#13;" with a carriage-return character, and returns the result.
  • fulltextify($text): Converts "text" to all-lowercase, then removes all diacritical marks and all characters that are not alphanumeric, then converts each run of one or more whitespace characters into a single space, and returns the result.
  • There is no defulltextify() function.
  • idhigh($ID): Returns floor($ID / LINES_PER_SNIPPET_FILE).
  • idlow($ID): Returns $ID % LINES_PER_SNIPPET_FILE.
  • writefile($ID, $folder, $text): Overwrites file ($folder.'/'.idhigh($ID).'/'.idlow($ID)) with $text. Used to save changes to snippet ("s") files. (Version ("v") files use the same folder structure, but text would be "linefied" and then appended to existing file.)
  • writeline($ID, $folder, $text): Overwrites line (idlow($ID)) of file ($folder.'/'.idhigh($ID)) with (linefy($text)). Used to save changes to full-text ("t") and keyword ("k") files.

Tasks

edit.php: Load snippet/keywords into editor

	// $ID received
	<textarea cols="80" rows="24"><?php
		$file = '../s/'.idhigh($ID).'/'.idlow($ID);
		$lines = file_get_contents($file);
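		// get two parts separated by first tab in $lines
		list($keys, $text) = array_pad(explode("\t", $lines, 2), 2, '');
		// do something with first part (keywords)
		// echo second part into textarea, escaped so HTML in the snippet shows as text
		echo htmlspecialchars($text);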
	?></textarea>

	// More likely we do "echo loadsnip($ID)" which is
	// defined in another file "ke.php" we include here.
	// Load keywords with "echo loadkeys($ID)".

List snippet versions (in restore.php)

Load snippet version into edit.php (from restore.php)

Find next snippet with specific keywords

Find next snippet with specific phrases in text

FAQs

  • Q: What about snippet titles?
    A: Use a unique keyword for a snippet, e.g., time_to_reorg_2007. The permalink would be: karig.net/time_to_reorg_2007.
Page last modified on June 07, 2007, at 10:48 AM