<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Karig &#187; algorithms</title>
	<atom:link href="http://karig.net/topics/algorithms/feed/" rel="self" type="application/rss+xml" />
	<link>http://karig.net</link>
	<description>My humble home on the Web</description>
	<lastBuildDate>Thu, 17 Dec 2009 20:24:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Marks and tracks</title>
		<link>http://karig.net/2009/12/marks-and-tracks/</link>
		<comments>http://karig.net/2009/12/marks-and-tracks/#comments</comments>
		<pubDate>Tue, 15 Dec 2009 21:05:42 +0000</pubDate>
		<dc:creator>Karig</dc:creator>
				<category><![CDATA[algorithms]]></category>

		<guid isPermaLink="false">http://karig.net/?p=758</guid>
		<description><![CDATA[I'll want my text editor to be able to internally "mark" words and phrases and characters within the text. Here I discuss the basics of a system for doing this. I'll need this system to implement such features as syntax highlighting, background spellchecking, bookmarks, and highlighting the results of previous searches.]]></description>
			<content:encoded><![CDATA[<p><em>In </em><a href="/2009/11/textbuffer-overview/"><em>a previous post</em></a><em>, I listed some of the features I wanted my TextBuffer class to have. One of these was &#8220;tags.&#8221; I decided to call these &#8220;marks&#8221; instead.</em></p>
<p><strong><em>Marks</em></strong> are something that my text editor will have under the hood; they are not a feature that would be directly visible to the user. Specifically, a mark is a chunk of data that is associated with a <strong><em>span</em></strong> of text (typically a single word or phrase) within a file being edited. Marks are not saved with the text as part of the file, but the file, while open, could have thousands of marks associated with various parts of the text. Each mark remains associated with its span even as the text is being edited. Marks would be useful for implementing syntax highlighting, bookmarks, and other features that a modern text editor would be expected to offer.</p>
<p>Each mark belongs to a <strong><em>track</em></strong>, which is conceived as being of the same length as the main text. Each point on a track corresponds to a character position within the main text. Each mark takes up certain points on its track, so each mark corresponds to the characters at the corresponding positions — the mark&#8217;s span. Each point on a track is occupied by no more than one mark, so marks on the same track cannot overlap. Thus, if a span of text is to have two marks, the second mark has to belong to a separate track. The editor can generate as many of these separate tracks as needed.</p>
<p>Here is an example involving three tracks of marks:</p>
<pre>Track 1: art noun adverb_ verb_ noun art adjective noun
Track 2: U                      U                      PB
Track 3:    O    O       O     O    O   O         O     M
Text:    The girl quickly found Fido the yellowish mutt.&lt;</pre>
<p>(Note that the less-than sign at the end of the text represents a newline character.)</p>
<p>Take a look at each track in this example:</p>
<ul>
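<li>Track 1 marks each word (each contiguous sequence of letters) with its part of speech: article (abbreviated &#8220;art&#8221; in the diagram), noun, verb, adjective, or adverb. Each mark in this track takes up the length of its corresponding word; the trailing underscores in the diagram just pad a name out to that length.</li>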
<li>Track 2 assigns a class to each character other than a space or a lowercase letter — &#8220;U&#8221; marks each uppercase letter, &#8220;P&#8221; marks each punctuation character, and &#8220;B&#8221; marks the mandatory-break character at the end. In this case, each mark takes up no more than one character.</li>
<li>Finally, Track 3 marks the characters where the line can be broken on a wordwrapped display: &#8220;O&#8221; marks optional-break characters like spaces, while &#8220;M&#8221; marks mandatory-break characters like newlines. Again, each mark corresponds to a single character.</li>
</ul>
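<p>Tracks 2 and 3 depend only on individual characters, so they could be generated by classifying each character in turn. Here is a rough Python sketch of that idea. The function names (<code>char_class</code>, <code>break_class</code>, <code>render</code>) are invented for illustration, and treating ASCII punctuation as the &#8220;P&#8221; class is an assumption.</p>

```python
import string

TEXT = "The girl quickly found Fido the yellowish mutt.\n"

def char_class(ch):
    """Track 2: class of one character, or "" if it gets no mark."""
    if ch == "\n":
        return "B"          # mandatory-break character
    if ch.isupper():
        return "U"          # uppercase letter
    if ch in string.punctuation:
        return "P"          # punctuation (ASCII only: an assumption)
    return ""               # spaces and lowercase letters stay unmarked

def break_class(ch):
    """Track 3: line-break class of one character, or "" if none."""
    if ch == "\n":
        return "M"          # mandatory break
    if ch == " ":
        return "O"          # optional break
    return ""

def render(text, classify):
    """Draw a track as one column per text position, blank where unmarked."""
    return "".join(classify(ch) or " " for ch in text)

track2 = render(TEXT, char_class)
track3 = render(TEXT, break_class)
print("Track 2: " + track2)
print("Track 3: " + track3)
```

<p>The two printed lines should line up with Tracks 2 and 3 in the diagram above. Track 1 is a different matter: assigning parts of speech needs a dictionary or a parser, not a per-character test.</p>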
<p>Note that each mark has a <strong><em>name</em></strong>. This name in effect assigns a class or category to the mark&#8217;s span. Track 1 marks some words as nouns and others as articles; Track 2 marks some characters as uppercase letters; Track 3 marks each space as an optional-break character to make wordwrap easier to implement. Mark names allow the text editor to treat different spans in the same way if they have the same mark name, but differently if their mark names are different.</p>
<p>(Another way of thinking about this, if you&#8217;re familiar with object-oriented programming, is that a track corresponds to an object property, such as part of speech, character type, or line-break type, and a mark corresponds to a value assigned to that property.)</p>
<p>You can see how this would be useful in the bowels of a text editor. Marks would be useful not only for marking the locations of line-break opportunities or of different types of characters; they would also be useful in syntax highlighting (marking the locations of various classes of words or phrases) and for spellchecking-as-you-type.</p>
<p>Marks would likely be most useful for marking up just the text that is about to be displayed on the screen: finding where each line of text can be wrapped, determining the &#8220;classes&#8221; of words so that the editor knows what font and style to use when drawing the text, finding misspellings, and so on. Marking up the rest of the file would be a time-consuming chore that would generate a lot of data that would then have to be stored somewhere in case it is needed later, so this kind of busywork is best avoided.</p>
<h3>Implementation</h3>
<p>The text buffer contains nothing but text from the open file. Loading and saving files is simpler if the buffer contains only the text loaded from or saved to files. Search and replace is also simpler if the buffer contains all of the data, and only the data, to be searched. (Syntax highlighting relies on searches in the background, so searches need to be fast, and they are faster when they are simpler.) So marks and tracks are stored elsewhere.</p>
<p>Marks never overlap within a track, so a track could be represented as an array of short records, where each record contains just a mark name (as String) and a span length (as Integer). If the mark name is blank, then the span is just unmarked text. So the data for Track 1 above would look like this:</p>
<pre>Name       Length     | (Corresponding
as String  as Integer |  span)
---------  ---------- | ---------------
article        3      | "The"
               1      | " "
noun           4      | "girl"
               1      | " "
adverb         7      | "quickly"
               1      | " "
verb           5      | "found"
               1      | " "
noun           4      | "Fido"
               1      | " "
article        3      | "the"
               1      | " "
adjective      9      | "yellowish"
               1      | " "
noun           4      | "mutt"
               2      | "." + EndOfLine</pre>
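<p>As a sanity check on this layout, here is a minimal Python sketch of the same table. The <code>MarkRecord</code> class and the <code>spans</code> helper are hypothetical names, not part of any existing TextBuffer code; the point is only that walking the records in order recovers each span, and that the lengths must add up to the length of the text.</p>

```python
from dataclasses import dataclass

@dataclass
class MarkRecord:
    name: str     # "" means the span is just unmarked text
    length: int   # number of characters the span covers

TEXT = "The girl quickly found Fido the yellowish mutt.\n"

# Track 1 from the table above, one record per row.
TRACK1 = [
    MarkRecord("article", 3),   MarkRecord("", 1),
    MarkRecord("noun", 4),      MarkRecord("", 1),
    MarkRecord("adverb", 7),    MarkRecord("", 1),
    MarkRecord("verb", 5),      MarkRecord("", 1),
    MarkRecord("noun", 4),      MarkRecord("", 1),
    MarkRecord("article", 3),   MarkRecord("", 1),
    MarkRecord("adjective", 9), MarkRecord("", 1),
    MarkRecord("noun", 4),      MarkRecord("", 2),
]

def spans(track, text):
    """Yield (name, span_text) pairs by walking the records left to right."""
    offset = 0
    for rec in track:
        yield rec.name, text[offset:offset + rec.length]
        offset += rec.length

# A track is conceived as being the same length as the main text.
assert sum(rec.length for rec in TRACK1) == len(TEXT)

nouns = [span for name, span in spans(TRACK1, TEXT) if name == "noun"]
print(nouns)   # ['girl', 'Fido', 'mutt']
```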
<p>I&#8217;m storing lengths, not offsets, to make it easier to keep marks and spans together when text is inserted or deleted. If text is typed in the middle of a span, all that needs to happen is to increase the length in that mark&#8217;s record by the number of characters typed; no other record has to change. If I need to find the mark that corresponds to a specific offset within the text, I just keep adding span lengths, starting from the first record, until the running total passes the offset. The record that pushes the total past the offset is the one whose span contains that position.</p>
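<p>Both operations can be sketched in a few lines of Python. This is only an illustration under some assumptions: a track is a plain list of <code>[name, length]</code> pairs, <code>find_record</code> and <code>insert_text</code> are made-up names, and an insertion always grows the span that contains the character at the insertion offset. What should happen exactly at the boundary between two spans, or at the very end of the text, is a policy question this sketch does not settle.</p>

```python
def find_record(track, offset):
    """Return (index, start) of the record whose span contains offset."""
    start = 0
    for index, (name, length) in enumerate(track):
        if offset < start + length:
            return index, start
        start += length
    raise IndexError("offset is past the end of the track")

def insert_text(track, offset, count):
    """Account for count characters typed at offset by growing one span."""
    index, _ = find_record(track, offset)
    track[index][1] += count   # no other record needs to change

# Records covering the text "The girl "
track = [["article", 3], ["", 1], ["noun", 4], ["", 1]]

print(find_record(track, 5))    # (2, 4): the "noun" record, starting at 4
insert_text(track, 6, 2)        # type two characters inside "girl"
print(track[2])                 # ['noun', 6]
print(find_record(track, 10))   # (3, 10): the trailing space moved right
```

<p>The lookup is a linear scan, so its cost grows with the number of records before the offset. That seems acceptable if only the text near the screen is marked up, as suggested earlier.</p>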
<p>I&#8217;ll have more to say about this later. I&#8217;ll need to spell out exactly how this information is laid out in memory. I&#8217;ll also have to figure out how I want this information saved into temp files and then reloaded when the mark information is needed. But that&#8217;s for later. Right now I just wanted to get something posted.</p>
]]></content:encoded>
			<wfw:commentRss>http://karig.net/2009/12/marks-and-tracks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
