Recent Changes - Search:

Home Pages Pidgin   Azarennya (S|N) Mac Textanium Reference ToDo Food Local Edit

Local: Hide

Language: Hide

Fantasy: Hide

SciFi: Hide

Film: Hide

Music: Hide

REALbasic: Hide

ResourcesGarageUniversityWebRingForums:REALElfDataPlugins and Code:BKeeneyDeclareSubEinhugurJoeRestrepoTempelmannZAZ

Coding: Hide

Forums:PowWebPHPWebmasterCodingWalkersPerlIntroMonksPHPJavaScriptToolboxUnobtrusiveJavaScriptJavaScriptCompressorRegularExpressions (test)JSLintSQLCocoaCocoaBuilderCocoaDevCocoaLabAppleScriptBBSUserlandFaqintoshFileMakerFileMakerTipsFileMakerWorldFileMakerPlugins

Science: Hide

History: Hide

1421

News/Politics: Hide

Cults/Crime: Hide

ClambakeInfidels

Miscellaneous: Hide

EntryFolders

The technique chosen

Whenever an entry is created and saved, it is saved into a subfolder of the "Entries" folder. This folder has the first three characters (converted to lowercase) of the entry's name. Thus an entry named "McLean,_Virginia" would be saved in the "mcl" subfolder in "Entries".

Proposed techniques

Allow up to one billion by juggling files in folders

Entry folders are subfolders of a folder "Entries". Entries are sorted by name, and the complete list of sorted entries is divided into parts, with each part going into a subfolder, and the subfolder being named after the first entry in that part. Example:

Entries/
 |
 |__ About_the_program/
 |    |
 |    |__ About_the_program.txt
 |    |__ Brainstorming.txt
 |    |__ Desirable_features.txt
 |    |__ Entry_folders.txt
 |
 |__ Entry_structure/
 |    |
 |    |__ Entry_structure.txt
 |    |__ Ideas_for_the_future.txt
 |    |__ Letters_and_emails.txt
 |    |__ Project_plan_and_schedule.txt
 |
 |__ Suggestions/
      |
      |__ Suggestions.txt
      |__ ToDo.txt
      |__ Version_history.txt
      |__ What_I_Want.txt

The idea is that, to find an entry, you get the list of folders in "Entries" and sort the folder names. Then you use a binary search, comparing the entry name to each folder name until you find the folder where the entry should be.

This setup requires maintenance as time goes on. If a folder gets too full, then either:

  • Some of the current folder's first entries need to be moved to the end of the previous folder, and the current folder needs to be renamed (after its new first entry).
  • Some of the current folder's last entries need to be moved to the beginning of the next folder, and the next folder needs to be renamed (after its new first entry).

It's a good idea not to fill a folder more than half full unless the "Entries" folder already has a full capacity of half-full subfolders.

Note that the process of shifting files around would occur when an entry is first created and saved, not at any other time.

What are the chances of reaching a billion entries?

Assume that the average overhead for an entry is 2KB. Then the overhead for one million entries is around 2GB, and that for sixty-four million entries is around 128GB. A billion entries would require around two terabytes of storage.

Keep a database of names and locations

The idea is to keep entry names and paths in a sorted file, so that the entry files would not themselves have to be moved around.

Suppose that each line in the file requires 64 bytes (i.e., the file is a series of fixed-sized records). A 2GB file would hold 33.5 million (32M) lines. If a folder holds 32K files, then 32M files requires 1K folders. If the average overhead per entry is 2KB, then 32M entries require 64GB.

Suppose that the file has blank records interspersed throughout the file, so that the system does not have to rewrite the whole file when adding an entry in order to keep the file sorted; only a chunk of records needs to be read and rewritten. The system might load 64KB (1024 records) from the file and look for a blank record; as long as there is a blank record somewhere in that chunk, then other records can be moved as needed to make room in the correct place for a new recordful of data.

(If there is no blank record in the chunk, other chunks can be loaded from the file until a blank record is found. The worst-case scenario is when the file is just about full -- when you start at the beginning of the file and have to go all the way to the end of the file before finding a blank record, which then has to be moved up to the beginning of the file.)

In order to produce these blank records, when a record is being saved, there should ideally be a blank record both before and after the new record. This can be guaranteed for the first 16M records, though enforcing the guarantee for more than the first 8M records may introduce performance problems. In practice, you might just look for the 64KB (or 128KB) chunk where a new entry belongs and restrict yourself to just moving things around in that chunk.

(During saves, you'd need to lock the file, get the entry written, and unlock the file as quickly as possible.)

MATH: 32K entries take up 32K * 64 or 2MB. 64K entries take up 4MB. 256K entries take up 16MB.

FOLDERS: Folders can have two-character 36-base names, beginning with '00' and ending at 'zz'. Thus each entry in the index reserves 62 characters for the entry name and two characters for the folder name.

Each subfolder handles two letters

This idea is even simpler to implement: Simply take the first two characters (case-insensitive) of the entry name, create a subfolder with those two characters as its name, and save the entry there.

MATH: Suppose each of 26 letters has an average of 20 combinations. That yields 520 subfolders, which is well within the 32,766-file limit for UNIX directories. This would also raise the maximum number of entries to 26 x 20 x 32,766 = 17,038,320.

If I up the ante and use three characters instead of two, I might have 30 x 20 x 20 or 12,000 subfolders. This is still within the 32K limit. (And 32K x 12K = 384M.)

So this is the way to go: For each entry created and saved, take the first three characters of the entry name, make them lowercase, create a folder with the three-character name, and save the entry there.

Edit - History - Print - Recent Changes - Search
Page last modified on August 03, 2007, at 01:44 PM