krusty Posted January 2, 2014 (edited) You might say I'm the guy who caught on last, but I have to confirm this with the experienced souls out there to check if I'm on the right track. I've always been confused about binary representations of strings. For example, take the string "OR", where (numbering the alphabet A = 1 through Z = 26) O = 15 and R = 18. If I want to represent this string in 5 bits per character, I always thought you would have to do this: 01111 10010 for O and R respectively. Can you imagine the space required if I'm building a computer? I've just realized that the bitwise operation OR, as in 'O' OR 'R', gives me a single 5-bit value for the string "OR", because 01111 OR 10010 = 11111. Is my assumption correct here? But this doesn't work in every case, so I'm still confused about how to represent a string with only a limited number of bits, 5 in this case. And if this is the case, how do I extract the 'O' and the 'R' from the binary representation of "OR", i.e. the value 31? Is this via bit masking operations? Thanks Edited January 2, 2014 by krusty
Strange Posted January 2, 2014 Strings typically use 8 bits per character. This allows up to 256 different characters to be represented. Just the printable Latin alphabet could be represented in fewer bits, but there is no real reason to do so. To support internationalization, 16 or 32 bits per character can be used. Strings are then just concatenations of these. So to represent "OR" you would need two 8-bit characters: 79 and 82 in the standard ASCII code. You also need some way of indicating the length of the string (how many characters are in it). Some languages do this by including an extra count byte at the start. Others (e.g. C) add a zero byte at the end to mark the end of the string. By OR'ing your two 5-bit characters you lose information, because many different pairs of characters will give you the same result.
krusty (Author) Posted January 2, 2014 (edited) Hmm. But how is it that a 4 byte string can I fit a string? For example, in C++ I would declare a string and print it as below. If it's 8 bits per character, then I would need 26 x 8 bits to store this string?

#include <iostream>
#include <string>

using namespace std;

int main()
{
    string mystring = "The quick brown fox jumps over the lazy dog";
    cout << mystring << endl;
}

Edited January 2, 2014 by krusty
Strange Posted January 2, 2014 That looks like 43 characters to me, so, basically, 43 * 8 bits. But as it is a C++ string object, there will be some extra bytes used to hold information about the object (I don't know the details of how C++ string types are represented). "But how is it that a 4 byte string can I fit a string" Not sure I understand this... Do you mean: how can a four-byte variable represent an arbitrary string? That is because what the "string" variable contains is actually just a pointer (the address in memory where the string object is stored).
krusty (Author) Posted January 2, 2014 Thanks for the help so far. What I was asking is that a string in C++ is 4 bytes on a 32-bit system. Therefore, if I have a string variable as per above, this 4-byte string is still 4 bytes after I initialize it with "The quick brown fox jumps over the lazy dog". But you say the 4-byte string contains a pointer to the characters (yes, sorry, it is 43 characters). So a 4-byte string actually takes up a lot more memory than 4 bytes? Is that right?
Strange Posted January 2, 2014 Yes, the 4-byte string variable is a pointer. In C++ this is a pointer to a structure which contains information about the string, including a pointer to the actual characters in memory. In C++ the memory used for the characters is allocated automatically when you assign a value to the string. In a language like C, you need to manually allocate the space for it. So, if you want to get a better understanding of how strings work, it might be worth looking at C strings first.
krusty (Author) Posted January 3, 2014 (edited) This is where I'm getting confused. I understand C strings, though probably not in the light that I need to. The table below comes from the Wikipedia link: http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch . This is related to LZW compression and decompression. If you look at the decoding table, down the third column under "output sequence", at the point where it says 'OR', the first column says that the bit sequence that represents 'OR' is 011111. How does one get to this? Let's assume the table at the top of the linked page is the alphabet in numerical form, A through to Z. How does O (01111) and R (10010) become 011111, or "OR"? I realise if you just slide them into each other you get 011111, but try doing that to "EO", where 'E' (00101) and 'O' (01111) becomes "EO" (011110). How was "EO" represented in binary using the individual characters? Edited January 3, 2014 by krusty
Pulvinar Posted January 3, 2014 "This is related to LZW compression and decompression..." I can see why you're confused: LZW bit strings are not text strings. They just happened to be using the upper-case alphabet in their example. In LZW the symbols could represent unique patterns of any data, such as an image, and the binary codes for the symbols will be as long as needed. What you're seeing with "OR" in the example is a new symbol being added to the dictionary that represents that pair of letters with a single symbol (whose bit pattern is just the next available code). It's the heart of the compression algorithm.
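To make that concrete, here is a hypothetical minimal LZW compressor in C++. The A = 1 through Z = 26 starting dictionary follows the Wikipedia example; the function name and everything else are an illustrative sketch, and the exact code assigned to "OR" depends on what precedes it in the input. The point is that the code for "OR" is simply the next free dictionary slot, not any bitwise combination of 'O' and 'R':

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

std::vector<int> lzw_compress(const std::string &input) {
    // Start the dictionary with the single letters, A = 1 ... Z = 26
    std::map<std::string, int> dict;
    for (int i = 0; i < 26; ++i)
        dict[std::string(1, 'A' + i)] = i + 1;
    int next_code = 27;  // new multi-letter entries get codes from here

    std::vector<int> out;
    std::string w;  // longest sequence matched so far
    for (char c : input) {
        std::string wc = w + c;
        if (dict.count(wc)) {
            w = wc;  // keep extending the current match
        } else {
            out.push_back(dict[w]);   // emit code for the longest match
            dict[wc] = next_code++;   // add the new pair, e.g. "OR"
            w = std::string(1, c);    // restart matching from c
        }
    }
    if (!w.empty()) out.push_back(dict[w]);
    return out;
}
```

Starting from a fresh dictionary, compressing "ORORO" emits 15 ('O'), 18 ('R'), then 27 (the new "OR" entry), then 15 again: the second "OR" is covered by one dictionary symbol, which is where the compression comes from.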
krusty (Author) Posted January 17, 2014 Thanks. Well, that clears it up. It was a whole group of misconceptions, I think, but I believe I've got it now. Cheers