I call it a WordString because it uses 2 bytes (1 word) to store the length of the string, whereas a shortstring uses only 1 byte to store the length.
Question
Why not just use an ansistring? Answer: a wordstr is fast because it sits in fixed memory on the stack, whereas an ansistring is allocated on the heap and reference counted, which makes it slower. Stack based strings are still useful for speed, such as in compilers. For example, I discovered a bug in the FPC compiler where a command line option I was sending it was truncated, which is the kind of problem a longer stack based string avoids.
Another alternative is the heap based CapStr, which is a fast capacitance string.
Algorithm/Data Structure
Here is one way you could implement a WordStr by hand yourself (compiler magic would be needed to make it easier to use):
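program Proj1;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // FPC: enables Result and a 32-bit integer
{$APPTYPE CONSOLE}
const
  MaxWord = High(word);
type
  // elements -1 and 0 hold the length, the text starts at element 1;
  // the code assumes a one-byte char (FPC, or Delphi before 2009)
  TWordStr = array [-1..MaxWord-1] of char;

{ turn two bytes into a word }
function MakeWord(a, b: byte): word;
begin
  Result := (b shl 8) or a;
end;

{ set length of word string }
procedure SetWordStrLen(var ws: TWordStr; size: word);
begin
  ws[-1] := char(Lo(size)); // low byte of storage word
  ws[0] := char(Hi(size)); // high byte of storage word
end;

{ get length of word string }
function WordStrLen(const ws: TWordStr): word;
begin
  Result := MakeWord(byte(ws[-1]), byte(ws[0]));
end;

{ append count chars starting at src to dest and update the stored length }
procedure AppendChars(const src; count: word; var dest: TWordStr);
var
  destLen, room: longint;
begin
  destLen := WordStrLen(dest);
  room := (MaxWord - 2) - destLen; // keep the last element free for a #0
  if room <= 0 then
    Exit;
  if count > room then
    count := room; // truncate instead of writing past the array
  Move(src, dest[destLen + 1], count);
  SetWordStrLen(dest, destLen + count);
  dest[destLen + count + 1] := #0; // lets the text be printed through a pchar
end;

{ concatenate (combine) 2 wordstrings }
procedure WordStrConcat(const ws1: TWordStr; var dest: TWordStr); overload;
begin
  AppendChars(ws1[1], WordStrLen(ws1), dest);
end;

{ concatenates a short string with a wordstring }
procedure WordStrConcat(const s1: shortstring; var dest: TWordStr); overload;
begin
  AppendChars(s1[1], Length(s1), dest);
end;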
var
  // global variables start out zero filled, so the text written below is
  // followed by #0 and can be printed through a pchar
  wordstr1: TWordStr;
  wordstr2: TWordStr;
  i: longint; // loop
  sstring: shortstring;
begin
  writeln('The program will prompt you for several tests (hit enter)');
  readln;
  writeln('Maxint is: ', MaxInt);
  writeln('Maxword is: ', High(word));
  writeln('Maxbyte is: ', High(byte));
  writeln;
  SetWordStrLen(wordstr1, 5); // no memory allocation occurs
  wordstr1[1] := 'h';
  wordstr1[2] := 'e';
  wordstr1[3] := 'l';
  wordstr1[4] := 'l';
  wordstr1[5] := 'o';
  writeln('Length of WordString1: ', WordStrLen(wordstr1));
  writeln('Actual Memory Size of Word String: ', SizeOf(wordstr1));
  writeln('WordString 1 Contents: ', pchar(@wordstr1[1]));
  writeln;
  SetWordStrLen(wordstr2, 7); // no memory allocation occurs
  wordstr2[1] := ' ';
  wordstr2[2] := 'w';
  wordstr2[3] := 'o';
  wordstr2[4] := 'r';
  wordstr2[5] := 'l';
  wordstr2[6] := 'd';
  wordstr2[7] := '!';
  writeln('Length of WordString2: ', WordStrLen(wordstr2));
  writeln('Actual Memory Size of Word String: ', SizeOf(wordstr2));
  writeln('WordString 2 Contents: ', pchar(@wordstr2[1]));
  WordStrConcat(wordstr2, wordstr1); // no memory allocation occurs
  write('After concatenation of two wordstrings: ');
  writeln(pchar(@wordstr1[1]));
  writeln;
  writeln('Want to see over 65,000 ''a'' characters? (press enter)');
  readln;
  SetWordStrLen(wordstr2, MaxWord - 2); // no memory allocation occurs
  for i := 1 to MaxWord - 2 do
  begin
    wordstr2[i] := 'a';
  end;
  // the last array element holds the #0 that printing through a pchar needs
  wordstr2[MaxWord - 1] := #0;
  // write over 65,000 characters, close to the maximum capacity of a WordString
  writeln(pchar(@wordstr2[1]));
  // Note: You are not forced to use 65,000 characters, you could use less,
  // such as WordStr1[1500] since the compiler would implement a specifiable
  // length option just like shortstrings. Typical wordstrings would be
  // string[1000] rather than string[65000]
  { now let's combine a short string with a wordstring }
  writeln;
  writeln('Let''s combine a short string with a wordstring.. (press enter)');
  readln;
  sstring := ', interesting world.';
  WordStrConcat(sstring, wordstr1); // no memory allocated
  // now write the contents of the new wordstring
  writeln(pchar(@wordstr1[1]));
  writeln;
  writeln('Done (enter to exit)');
  readln;
end.
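The program above keeps its wordstrings in global variables. As a rough sketch of the "specifiable length" idea mentioned in its comments, here is what a smaller wordstring held in a local (stack) variable might look like. The names Capacity, TWordStr1000 and Demo are made up for this illustration, and AnsiChar is used so that each element is one byte.

```pascal
program SizedWordStr;

const
  Capacity = 1000; // hypothetical capacity, in the spirit of string[1000]
type
  // two length bytes at -1 and 0, then Capacity characters
  TWordStr1000 = array [-1..Capacity] of AnsiChar;

procedure Demo;
var
  ws: TWordStr1000; // local variable, so its 1002 bytes live on the stack
  msg: shortstring;
begin
  msg := 'stack based';
  ws[-1] := AnsiChar(Length(msg) and $FF); // low byte of the length
  ws[0] := AnsiChar(Length(msg) shr 8);    // high byte of the length
  Move(msg[1], ws[1], Length(msg));
  ws[Length(msg) + 1] := #0; // terminator, only needed for the PAnsiChar print
  writeln('SizeOf(ws) = ', SizeOf(ws));
  writeln(PAnsiChar(@ws[1]));
end;

begin
  Demo;
end.
```

A real compiler feature would let one set of routines work with any declared capacity; with hand written array types like this one, each capacity is a separate type.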
The shortstring/string/longstring/ansistring naming scheme confuses many people. People confuse longstrings with ansistrings, strings with ansistrings, shortstrings with strings, and shortstrings with ansistrings. So what is the solution? Call the string by its length representation:
- ByteString: what you know as a shortstring. I think shortstring is a stupid name, considering short is a relative term. Byte is more specific.
- WordString: a fixed string like a short string, but which can hold more than 255 elements. WordString is a much more specific name than something stupid like "superstring" or "hugestring" because super/huge/short are relative terms and mean essentially nothing. Another name may be a 2BString (2byte based string).
- LongString: a fixed string like a WordString, but which can hold more than 65535 elements. Computer memory is growing and we may end up calling a 65535 element string a short string some time - a better name may be a 4BString (4 byte based string).
- Ansistring: this is a reference counted string. Maybe it should have been called refstring. For now, Ansistring is an okay name. We'll leave it at that.
Before you read this page, here are some definitions to remember:
WordStr, WordString: My recommended name for a fixed string longer than 255 characters, with 2 bytes (word) for length storage. This is similar to a short string but instead of going up to 255 elements it can go up to more than 65000, actually. It uses fixed memory like the shortstring, making it fast for concatenations, with no reference count. Allocated on the stack!
Extended Pascal String: This is NOT an ansistring, this is a special shortstring that can hold more than 255 elements, which was available in extended pascal compilers. FPC and Delphi are NOT extended Pascal compilers. An example of a compiler that uses an extended Pascal string is Prospero Pascal. Up to more than 32000 elements can be stored in this string form. I'm not sure whether this is the same as a WordStr - I haven't studied Dec Pascal/Extended Pascal enough to understand what exactly an extended string is in those languages. It seems similar to the idea of a wordstr, but more verification will be needed.
Longstring: This is some string that FPC implemented in early versions but didn't complete the implementation. I haven't studied it enough to find out if it is a wordstr or not. My guess is that it is a LongInt based string.
Shortstring: This is a short string with only 255 elements. It was available in Delphi, Borland Turbo Pascal, and Standard Pascal compilers. Allocated on the stack!
Ansistring: This is a special string which has the abilities of a pchar, with memory allocated on the fly, but with reference counting and automation. Pchar has no automation and no length storage or reference counting, while an ansistring does. Allocated on the heap.
Pchar: Not really applicable to the article below, but a pchar is a null terminated string, or a pointer to a char (which can move to different locations of several chars, so acts like an array but really is a pointer to a char) with a null terminator. Allocated on the heap.
Remember not to mix up the term ansistring with longstring. Some Delphi programmers call ansistrings "long strings" but this is confusing. It is better to just call them ansistrings, so we don't confuse them with Extended Pascal's string and FPC's old LongString, which are NOT ansistrings.
And now, at last, the Article...
Borland Pascal, Delphi, and FPC have had Shortstrings and Ansistrings available to the programmer. Older versions of Borland Pascal did not have ansistrings. What about a shortstring which can contain more than 255 elements, though?
Some DEC Pascal compilers and other Extended Pascal compilers have had the ability to use a short string past 255 chars. How did they implement this special extended short string?
First let's analyze the way a shortstring works and why it has a 255 limit. The way a 255 element short string works is by storing the length of the shortstring in the first byte string[0]. Since a Byte can hold information up to 255, you are limited to 255 elements. Your string[0] byte will only hold length info up to 255.
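A small sketch of the length byte at work, assuming FPC or Delphi, where element 0 of a shortstring can be read and written directly (the program name is made up for this illustration):

```pascal
program ShortLenDemo;

var
  s: shortstring;
begin
  s := 'hello';
  // the length lives in element 0, stored as a single byte
  writeln('Length(s) = ', Length(s));
  writeln('Ord(s[0]) = ', Ord(s[0]));
  // shortening the string is just a matter of rewriting that byte
  s[0] := Chr(3);
  writeln('After s[0] := Chr(3): ', s);
  // one byte cannot count past 255, and that is the capacity limit
  writeln('High(byte) = ', High(byte));
  writeln('SizeOf(s) = ', SizeOf(s)); // 1 length byte + 255 characters
end.
```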
A special short string which allows more than 255 elements (characters) uses string[-1] and string[0] to store length information. It uses both the 0 element and the negative 1 element to store info, not just the 0 element. So an extended shortstring is a negative 1 based array. A shortstring is a zero based array to the computer, but a 1 based array to the programmer. You can only access string[1]-string[255] with a shortstring. An extended shortstring is a negative 1 based array to the computer, but a 1 based array to the programmer. In an extended short string you can access string[1] all the way up to string[32000] and, depending on the implementation limit, sometimes all the way up to more than string[65000]. Your special extended shortstring is now able to hold more than 255 characters, more than 1000 characters even. Borland Pascal, FPC, and Delphi never implemented this special shortstring (sometimes called a longstring, but not to be confused with an ansistring). And why didn't they implement it? Most likely it is a common attitude that "255 is more than enough". But this is false, and here is proof why:
Have you ever seen programmers write code that has used an Array of Chars? Sure you have. You've seen lots of that.
Why would they use an array of chars if they could just use a short string? Isn't 255 characters more than enough? No, not for this case - an array of characters is a different situation, you say! Incomparable, you say! I would counter your argument, though. This is a buffer situation, you say! And I'm comparing apples to oranges, you say. But in fact, I'm not comparing apples to oranges. That's what I would counter.
You say: "A buffer is an orange, and a shortstring is an apple, you can't compare the two! They are two different things!" But a character buffer is really a shortstring, I say back to you. An array of characters is just a "dumb" shortstring with more than 255 elements. It's a dumb "extended shortstring". An array of characters doesn't have any of the "smarts", "brains", or "intelligence" of an extended shortstring with more than 255 elements.
Some programmers use a ByteArray or a CharArray for buffers. But why use a CharArray if you need the advantages that a string type has:
- Concatenation using the plus sign.
- Length stored in first two bytes.
- other automation that makes string work a breeze
Say you wanted a Buffer that holds roughly 5KB. So you use a CharArray or a BufferArray with about 5000 elements, right?
array [0..4999] of char;
array [0..4999] of byte;
array [0..5119] of char;
array [1..5120] of byte;
etc.
etc.
But what if you need the power of an Array of Characters, along with the power of a shortstring? A shortstring buffer that expands past 255 characters. You must come out of the cave that you have been living in for a while in order to understand why this is useful. You've been living in a 255 based world for years.. come out of your cave.
No need to reinvent that wheel - that has already been invented - it's a longstring or an extended shortstring that goes past 255 elements. It's really a buffer WITH the advantages of a shortstring. But no disadvantages of the Ansistring. It's a smart version of a char buffer, or a smart version of an array of characters.
And why do these folks use Array of Characters when they could just use a bunch of 255 limited shortstrings combined together? If shortstrings are enough forever - then why have you seen code that uses array of chars? There must be some reason. Sure, if you really don't need the power of a shortstring, and you need to extend past 255, then you can use an array of chars. But really - if you are dealing with an array of chars, you are essentially dealing with text, yes? And what does working with text entail? It entails text like operations. Do Arrays Of Chars allow you to do easy text operations? No, not compared to shortstrings - which only go to 255. Get the idea?
There is no concatenation with plus sign possible - with an array of chars. So all your code that relies on concatenation is now rendered officially broken if you ever change a short string to an Array of Chars further down the road when things grow bigger. You'd have to change all the code to concatenate without the plus sign using array of char tricks.
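To make the difference concrete, here is a minimal sketch that builds the same text twice, once with a shortstring and the plus sign, and once with an array of chars. BufSize, buf, used and part are names made up for this illustration, and AnsiChar is used so that each element is one byte.

```pascal
program ConcatDemo;

const
  BufSize = 5000; // hypothetical buffer size for this sketch

var
  s: shortstring;
  buf: array [0..BufSize - 1] of AnsiChar;
  used: integer; // the array has no length field, so we track it by hand
  part: shortstring;
begin
  // shortstring: the plus sign does everything
  s := 'hello';
  s := s + ' world';
  writeln(s);

  // array of chars: copy the bytes and update the length ourselves
  used := 0;
  part := 'hello';
  Move(part[1], buf[used], Length(part));
  Inc(used, Length(part));
  part := ' world';
  Move(part[1], buf[used], Length(part));
  Inc(used, Length(part));
  buf[used] := #0; // terminate so the buffer can be printed as a PAnsiChar
  writeln(PAnsiChar(@buf[0]));
end.
```

Both halves print the same text, but the second half has to do its own bookkeeping, and it has no overflow check at all unless you write one.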
If your existing code was using shortstrings, and you decide to expand this code in the future to require a procedure or function parameter to expect incoming data longer than 255 elements, you now have to change that procedure parameter to accept a list of short strings or an array of chars, because you've hit the 255 barrier. You could have just changed the type to a longstring and you'd save all that time changing your code around to work with a CharArray or some other type which is reinventing the longstring. The longstring has already been invented - don't reinvent it using your own messy array of char code!
Use the longstring. But wait, it isn't available in Borland Pascal or Delphi. Because they didn't think it was needed - or they simply didn't have time and resources to implement it.
I've asked the fpc devel team what they think about implementing it. They already have some built in support for it but it needs some dusting as it was never a fully implemented feature.
Update
The above idea is good, but it just requires adding functions to the RTL to make it compatible with sysutils.pas/system.pas. It also requires compiler magic added for functions such as writeln/readln. For example all the RTL string functions would need to have intelligence about the WordStr/LongStr.
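As a rough idea of what one such RTL style function could look like, here is a sketch of a Pos counterpart for wordstrings. WordStrPos and WordStrAssign are hypothetical names made up for this illustration, not existing RTL routines, and the TWordStr type is declared with AnsiChar so that each element is one byte.

```pascal
program WordStrRtlSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

const
  MaxWord = High(word);
type
  TWordStr = array [-1..MaxWord - 1] of AnsiChar;

{ read the two length bytes at -1 (low) and 0 (high) }
function WordStrLen(const ws: TWordStr): word;
begin
  Result := Ord(ws[-1]) or (Ord(ws[0]) shl 8);
end;

{ hypothetical helper: copy a shortstring into a wordstring }
procedure WordStrAssign(var ws: TWordStr; const s: shortstring);
begin
  Move(s[1], ws[1], Length(s));
  ws[-1] := AnsiChar(Length(s)); // low byte; a shortstring never exceeds 255
  ws[0] := #0;                   // high byte
end;

{ hypothetical counterpart of Pos; returns 0 when sub is not found }
function WordStrPos(const sub: shortstring; const ws: TWordStr): word;
var
  i, j, n, m: integer;
begin
  Result := 0;
  n := WordStrLen(ws);
  m := Length(sub);
  if (m = 0) or (m > n) then
    Exit;
  for i := 1 to n - m + 1 do
  begin
    j := 1;
    while (j <= m) and (ws[i + j - 1] = sub[j]) do
      Inc(j);
    if j > m then
    begin
      Result := i; // every character of sub matched at position i
      Exit;
    end;
  end;
end;

var
  ws: TWordStr;
begin
  WordStrAssign(ws, 'hello world');
  writeln(WordStrPos('world', ws)); // prints 7
  writeln(WordStrPos('xyz', ws));   // prints 0
end.
```

Every other string routine (Copy, Delete, Insert and so on) would need a similar counterpart, which is why the compiler and RTL support matters.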