
Posted (edited)

You might say I'm the guy who caught on last, but I have to check with the experienced souls out there whether I'm on the right track.

 

I've always been confused about binary representations of strings.

 

For example, I have never understood how to represent the string "OR", where the characters are numbered A = 1 through Z = 26, so O = 15 and R = 18 in binary.

So if I want to represent this string in 5 bits per character, I always thought you would have to do this.

 

01111 10010 for 'O' and 'R' respectively. Can you imagine the space required if I'm building a computer!

 

I've just realized that the bitwise operation OR, as in 'O' OR 'R', gives me what looks like the correct 5-bit representation of the string "OR", because 01111 OR 10010 = 11111. Is my assumption correct here? But this doesn't work in every case, so I'm still confused about how to represent a string with only a limited number of bits (5 in this case).

 

 

 

If this is the case, then how do I extract the 'O' and the 'R' from the binary representation of "OR", i.e. the binary value 31? Is this done via bit-masking operations?

 

Thanks

Edited by krusty
Posted

Strings typically use 8 bits per character, which allows 256 different characters to be represented. The printable Latin alphabet alone could be represented in fewer bits, but there is no real reason to do so. To support internationalization, 16 or 32 bits per character can be used.

 

Strings are then just concatenations of these. So to represent "OR" you would need two 8-bit characters: 79 and 82 in the standard ASCII code. You also need some way of indicating the length of the string (how many characters are in it). Some languages do this by including an extra count byte at the start. Others (e.g. C) add a zero byte at the end to mark the end of the string.

 

By OR'ing your two 5 bit characters you lose information because many different pairs of characters will give you the same result.

Posted (edited)

Hmm. But how is it that a 4-byte string variable can fit a whole string? For example, in C++ I would declare a string as below. If it's 8 bits per character, then I would need 26 x 8 bits to store this string?

#include <iostream>
#include <string>
using namespace std;

int main()
{
    string mystring = "The quick brown fox jumps over the lazy dog";
    cout << mystring << endl;
    return 0;
}
Edited by krusty
Posted

That looks like 43 characters to me so, basically, 43*8 bits. But as it is a C++ string object there will be some extra bytes used to define extra information about the object (I don't know the details of how C++ string types are represented).


 

But how is it that a 4-byte string variable can fit a whole string?

 

Not sure I understand this... Do you mean, how can a four-byte variable represent an arbitrary string? That is because what the "string" variable contains is actually just a pointer (the address in memory where the string object is stored).

Posted

Thanks for the help so far. What I was asking is that a string variable in C++ is 4 bytes on a 32-bit system.

 

Therefore, if I have a string variable as above, this 4-byte string is still 4 bytes after I initialize it with "The quick brown fox jumps over the lazy dog". But you say the 4-byte string contains a pointer to the characters (yes, sorry, it is 43 characters). So the 4-byte string actually takes up a lot more memory than 4 bytes? Is that right?

Posted

Yes, the 4-byte string variable is essentially a pointer. In C++ it refers to a structure which contains information about the string, including a pointer to the actual characters in memory. In C++ the memory used for the characters is allocated automatically when you assign a value to the string.

 

In a language like C, you need to manually allocate the space for the string. So, if you want to get a better understanding of how strings work, it might be worth looking at C strings first.

Posted (edited)

This is where I'm getting confused. I understand C strings, though probably not in the depth that I need to.

 

The table below comes from the wikipedia link: http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch .

This is related to LZW compression and decompression. If you look at the decoding table, third column, under 'output sequence', at the point where it says 'OR', the first column says that the bit sequence representing 'OR' is 011111. How does one get to this? Let's assume the table at the top of the linked page is the alphabet in numerical form, A through Z.

 

How do O (01111) and R (10010) become 011111, i.e. "OR"? I realise if you just slide them into each other you get 011111, but try doing that with "EO", where 'E' (00101) and 'O' (01111) become "EO" (011110). How was "EO" represented in binary using the individual characters?

Edited by krusty
Posted

This is related to LZW compression and decompression...

 

I can see why you're confused: LZW bit strings are not text strings. They just happened to use the upper-case alphabet in their example. In LZW the symbols can represent unique patterns of any data, such as an image, and the binary codes for the symbols will be as long as needed.

 

What you're seeing with "OR" in the example is a new entry being added to the dictionary: it represents that pair of letters with a single symbol, whose bit pattern is just the next available code number. That is the heart of the compression algorithm.

Posted

Thanks. Well that clears it up. It was a whole group of misconceptions I think, but I believe I've got it now.

 

Cheers
