|
| 24 May 2015 05:57 PM |
[I understand that this is a very long text. Only those who wish to learn something should read.] Roblox filter keep blocking my text, so I'll have to write it in several posts.
Howdy! A few days ago, I've been interested in the world of data compression, more precisely, about the RLE algorithm. I've had previously created a topic with my questions about it(http://www.roblox.com/Forum/AddPost.aspx?PostID=162481066&mode=flat). The answers given there were accurate, but incomplete. Some ideas I learned there is the idea of an escape code(thane first mentioned it), compressing at the bit level(Cntkillme's idea), a good idea, but useless on Roblox so I will not talk about it here.
I did more research on my own(and experimented) and I was able to answer most of my previously asked question quite accurately(Although the ideas were spread onto several different websites, and I had to figure some on my own). What I'll do here is post everything I've found, in case others are curious to know. You could even go ahead and edit the Wikipedia page, since more tricks are explained here.
RLE is short for Run Length Encoding. This is an algorithm that compress repeated characters into a smaller form. "AAAAAAAAAABBBBB" would become "A10B5" for example. As you can see, writing A10B5 is much shorter than writing A ten times and B five times. That's the principle RLE is based on. Now all that is nice and good, but this form can cause ambiguity. Compressing "AAAAA12" would give "A512". If you were to decompress this, you would end up with the letter A repeated 512 times instead of repeating it five times, followed by the number 12! Here's the tricks I've read about or figured out myself to remove this vagueness: |
|
|
| Report Abuse |
|
|
|
| 24 May 2015 05:58 PM |
1)Use an escape code to indicate that a character is repeated multiple times. You could use "*" for example. AAAAAAAAAABBBBB would be written as A*10B*5. To solve the "AAAAA12" problem, you would have to use the escape code twice, so it becomes A*5*12 . A much better idea than using an escape code here, is to repeat the character twice. "AAAAAAAAAABBBBB" would turn into AA10BB5. Since the letter is repeated twice, the decompressor can understand that the character is repeated multiple times. "AAAAA12" would become AA5A12 . Adding an extra A after 5 is necessary to make the distinction between 5 and 12.
2)Instead of writing the amount of times a character is repeated, write how many EXTRA times it is repeated. instead of changing "AAAAA" into "AA5", turn it into "AA3", which is the same as saying that A is repeated 2+3=5 times. It's a small change, but it can help the compression rate. Figured out this one by myself.
3)Whenever exactly 3 symbols are repeated, "BBB10" for example, Do not convert it to "BB1B10", since this makes it lengthier. Keep it as "BBB10". Your decoder can figure out that this part was not compressed, because it sees 3 similar symbol. Figured out this one by myself.
4)Get two sets of symbols for the counters. Instead of writing "119123" for "1111111111123", you could write "11J23" In this case J would represent 9(A=0 B=1 C=2..). Figured this one out myself. |
|
|
| Report Abuse |
|
|
eLunate
|
  |
| Joined: 29 Jul 2014 |
| Total Posts: 13268 |
|
|
| 24 May 2015 06:06 PM |
The most immediate thing I think of to make it is
local function RLE(s) local fin = ''; local prevchar = ''; local prevloc = 1; local count = 1; for i=1,#s do local c = s:sub(i,i); if c == prevchar then count = count+1; else fin = fin..prevchar..count; prevchar = c; count = 1; end end; return fin:sub(2,-1); end
Although, for decoding you can kind of safely assume that it goes char,number alternatingly... |
|
|
| Report Abuse |
|
|
chimmihc
|
  |
| Joined: 01 Sep 2014 |
| Total Posts: 17143 |
|
|
| 24 May 2015 06:06 PM |
That is a rather basic compression method.
I script -~ chimmihc |
|
|
| Report Abuse |
|
|
|
| 24 May 2015 06:07 PM |
--[[Roblox is blocking #5 for no reasons, read next post to see what it said]]
6)When only two characters are repeated followed by a symbol used to count, such as "BBµ" you got a few possibilities. First, you could write it as "BBµBµ", adding two extra characters Or you could write it as BBµµ.(remember that I said I'd use µ as 0). The Decoder can figure out that since the letter is repeated twice with a symbol representing 0, that it should only remove the first µ. However, using the 2nd notation(BBµµ), it is impossible to send data to the compressor twice. The first time it would become BBµµ the next time BBµµµ and so on. You can also use a special escape code just for this like "BB/µ" but you would have to implement a way to add the / symbol into your data. IF ANYONE HAS A BETTER IDEA FOR THIS ONE SAY SO!
Here's some example using all these tips(Uncompressed=Compressed): "AAAAA122444VVV¶ÆÆÆÆÆ92½½½½½½" = "AAÆ122444VVV¶ÆÆ3Æ92½½4" --total compression of 6 "SAND33µ999µ213599999®®®®®®BC" = "SAND33µ3µ999µ213599Æ9®®4BC" --total compression of 2
It's impossible to make a compression algorithm that can compress Every kind of data(It's mathematically proven), so every algorithms have flaws. In this case the flaw is #6, where two characters only are repeated followed by a count symbol. "AA¬¬AA¬¬AA¬¬AA¬¬AA" This is an example where the algorithm would not be able to compress the data.
|
|
|
| Report Abuse |
|
|
eLunate
|
  |
| Joined: 29 Jul 2014 |
| Total Posts: 13268 |
|
|
| 24 May 2015 06:07 PM |
| But yes you still have that issue of the 512, because numbers can go higher than 9. Just stick \n in them or something, or just some char that won't really come up otherwise |
|
|
| Report Abuse |
|
|
cntkillme
|
  |
| Joined: 07 Apr 2008 |
| Total Posts: 44956 |
|
|
| 24 May 2015 06:11 PM |
You can easily do it on the bit level on Lua. You don't store the number represented in ASCII from before ("10A") but you store the byte of how many there is ("\10A").
You can then easily get the next byte using string.byte and string.sub |
|
|
| Report Abuse |
|
|
eLunate
|
  |
| Joined: 29 Jul 2014 |
| Total Posts: 13268 |
|
|
| 24 May 2015 06:15 PM |
Taking in mind what cnt said,
local function RLE(s) local fin = ''; local prevchar = ''; local prevloc = 1; local count = 1; for i=1,#s do local c = s:sub(i,i); if c == prevchar then count = count+1; else fin = fin..prevchar..string.char(count); prevchar = c; count = 1; end end; return fin:sub(2,-1); end
local function RLEdec(s) local fin = {} for i=1,#s,2 do table.insert(fin,s:sub(i,i):rep(s:sub(i+1,i+1):byte())); end return table.concat(fin,''); end
Or start learning to use huffman trees to try and represent your data because huffman trees. |
|
|
| Report Abuse |
|
|
|
| 24 May 2015 06:18 PM |
| 5)Pick rare letters for your counters. Roblox keep blocking this line, so imagine a line of ascii characters that are not used often. |
|
|
| Report Abuse |
|
|
cntkillme
|
  |
| Joined: 07 Apr 2008 |
| Total Posts: 44956 |
|
|
| 24 May 2015 06:19 PM |
| Just do it the correct way D: |
|
|
| Report Abuse |
|
|
eLunate
|
  |
| Joined: 29 Jul 2014 |
| Total Posts: 13268 |
|
|
| 24 May 2015 06:20 PM |
local function RLE(s) local fin = ''; local prevchar = ''; local prevloc = 1; local count = 1; for i=1,#s do local c = s:sub(i,i); if c == prevchar then count = count+1; else fin = fin..prevchar..string.char(count); prevchar = c; count = 1; end end; fin = fin..prevchar..string.char(count); return fin:sub(2,-1); end
local function RLEdec(s) local fin = {} for i=1,#s,2 do table.insert(fin,s:sub(i,i):rep(s:sub(i+1,i+1):byte())); end return table.concat(fin,''); end
I somehow managed to clip the end off of the converted string in my RLE |
|
|
| Report Abuse |
|
|
cntkillme
|
  |
| Joined: 07 Apr 2008 |
| Total Posts: 44956 |
|
| |
|
eLunate
|
  |
| Joined: 29 Jul 2014 |
| Total Posts: 13268 |
|
|
| 24 May 2015 06:23 PM |
| You know, I actually have no idea. I'd intended on using it at some point, but I'm tired and the thing just formulated in my head without prevloc when I was writing it. |
|
|
| Report Abuse |
|
|
|
| 24 May 2015 06:23 PM |
the 512 woudln't be a problem if you use all the tricks I named. AAAAA12 becomes AA3A12
Instead of going for huffman coding I'll go with Arithmetic coding, It seems to achieve a better compression rate. |
|
|
| Report Abuse |
|
|
|
| 24 May 2015 06:48 PM |
@cntkillme
How many characters will "/65" be saved as? 8 bits? 24 bits? |
|
|
| Report Abuse |
|
|
cntkillme
|
  |
| Joined: 07 Apr 2008 |
| Total Posts: 44956 |
|
| |
|
cntkillme
|
  |
| Joined: 07 Apr 2008 |
| Total Posts: 44956 |
|
|
| 24 May 2015 06:50 PM |
'/65' is different then '\65' so never mind. Yours would use 3 bytes, mine would be 1. |
|
|
| Report Abuse |
|
|
|
| 24 May 2015 06:57 PM |
Then it still holds that compression at the bit level is impossible on roblox.("/" was a typo on my part)
Your idea is great, but I can't see how to use it on roblox(so I'll use it outside roblox ;p)
How do you save Less than 8 bits of information? Since you said "\65" is 1 byte I can assume that "\10" is 1 byte too and so on.. "insert 5 bits here"
|
|
|
| Report Abuse |
|
|
cntkillme
|
  |
| Joined: 07 Apr 2008 |
| Total Posts: 44956 |
|
|
| 24 May 2015 07:00 PM |
In strings on Lua, all characters are 1 byte large, you don't want to save only '5' bits of information because that means things like \3 you would think to save 2 bits, but then you get a varying size and have no idea how to parse that.
The point is to use a full byte before the byte that represents the character as a 'how many repetitions'. |
|
|
| Report Abuse |
|
|
eLunate
|
  |
| Joined: 29 Jul 2014 |
| Total Posts: 13268 |
|
|
| 24 May 2015 07:06 PM |
| If you need to encode on a bit level, you'll need to encode your entire thing into bits and then break it into bytes and use string.char. Just make sure your decoder understands where your last character is (up to 7 bits worth of ambiguity at the end) |
|
|
| Report Abuse |
|
|
|
| 24 May 2015 07:20 PM |
Oh you ment like
AAAAABCCDEFFF
as 5A1B2C1D1E3F
If so, use the proposed solution that I posted, since using this method requires you to add an extra byte for all characters( ABCDE would give 1A1B1C1D1E)
Lunare is it possible to do that on roblox?(if so how, since im very interested!) |
|
|
| Report Abuse |
|
|
cntkillme
|
  |
| Joined: 07 Apr 2008 |
| Total Posts: 44956 |
|
|
| 24 May 2015 07:22 PM |
OFC it's possible. You can do what you mentioned before and have a 'skip' count, so if the number is negative you would not duplicate |num| bytes ahead. But that limites you from 0->255 to -128->127. Then again you can use 2 bytes instead of 1 to store this information but it really depends what you are compressing. |
|
|
| Report Abuse |
|
|
|
| 24 May 2015 07:41 PM |
ah, the packbit variant. I've read about this before but I didn't find it useful to add that to this post because it was starting to get long enough(And I'll make a 2nd post to explain the other techniques such as the one you mentionned and BWT transform)
Do you know an algorithm that can find the longest repeated string?
Example: "12KIL3ABC032ABCDEFGHIJKABCDEFGHIJK3491AB4298G"
In this case it would be 'ABCDEFGHIJK' I'd need a method better than Brute-Force testing. |
|
|
| Report Abuse |
|
|