Items tagged with: LynneTeachesTech
Content warning: how does compression work? why does a compressed JPEG look different, but a JPEG in a compressed zip file looks the same? (long, serious)
you could store it as:
to represent that there are 3 a's, 4 b's, etc. you would then simply need a program to reverse the process - decompression. this is seldom used, however, as it had a fairly major flaw:
which is twice as large! RLE compression is best used for data with lots of repetition, and will generally make more random files *larger* rather than smaller.
a more complex method is to make a "dictionary" of things contained in the file and use that to make things smaller. for example, you could replace all occurrences of "the united states of america" with "🇺🇲", and then state that "🇺🇲" refers to "the united states of america" in the dictionary. this would allow you to save a (relatively) huge amount of space if the full phrase appears dozens of times.
what i've been talking about so far are lossless compression methods. the file is exactly the same after being compressed and decompressed. lossy compression, on the other hand, allows for some data loss. this would be unacceptable in, for example, a computer program or a text file, because you can't just remove a chunk of text or code and approximate what used to be there. you can, however, do this with an image. this is how JPEGs work. the compression is lossy, which means that some data is removed. this is relatively imperceptible at higher quality settings, but becomes more obvious the more you sacrifice quality for size. PNG files are (almost always) lossless, however. your phone camera takes photos in JPEG instead of PNG, though, because even though some quality is lost, a photo stored as a PNG would be much, much larger.
some examples of file formats that typically use lossy compression are JPEG, MP4, MP3, OGG, and AVI. some examples of lossless compression formats are FLAC, PNG, ZIP, RAR, and ALAC. some examples of lossless, uncompressed files are WAV, TXT, JS, BMP, and TAR. in terms of file size, you'll always find that lossy files are smaller than the lossless files they were created from (unless it's an horrendously inefficient compression format), and that losslessly compressed files are smaller than uncompressed ones.
you'll find that putting a long text file in a zip makes it much smaller, but putting an MP3 in a zip has a much less major effect. this is because MP3 files are already compressed quite efficiently, and there's not really much that a lossless algorithm can do.
there are benefits to all three types of formats. lossily compressed files are much smaller, losslessly compressed files are perfectly true to the original sound/image/etc while being much smaller, and uncompressed data is very easy for computers to work with, as they don't have to apply any decompression or compression algorithms to it. this is (partly) why BMP and WAV files still exist, despite PNG and FLAC being much more efficient.
as an example of how dramatic these differences often are, i looked at the file sizes for the downloadable version of master boot record's album "internet protocol" in three formats: WAV, FLAC, and MP3. you can see that the file size (shown in megabytes) is nearly 90 megabytes smaller with the FLAC version, and the MP3 version is only ~13% of the size of the WAV version. note that these downloads are in ZIP format - the WAV files would be even larger than shown here. this is not representative of all compression algorithms, nor is it representative of all music - this is just an illustrative example. TV static in particular compresses very poorly, because it's so random, which makes it hard for algorithms to find patterns. watch a youtube video of TV static to see this in effect - you'll notice obvious "block" shapes and blurriness that shouldn't be there as the algorithm struggles to do its job. the compression on youtube is particularly strong to ensure that the servers can keep up with the enormous demand, but not so much so that videos become blurry, unwatchable messes.
Content warning: how does OCRbot work? (long, serious)
tesseract first tries to split the image into chunks of what it believes to be text. it then splits these chunks into words by looking for spaces, and then tries to identify the letters making up those words. if any of these steps fail, the entire process fails.
neural networks work in a similar way to how a brain does, but on a much smaller scale - computers really aren't up to the task of simulating an entire brain right now. powerful computers are used to train the neural network for hours and hours, making it become more accurate at reading text. this has some drawbacks - if it was always trained on helvetica text, it'd only be able to read that, for example. thus it's important to make sure you train it on lots of different fonts. tesseract provides these datasets for you, but you can train your own if you'd like.
finally, after extracting the text, OCRbot does some rudimentary fixes (like replacing | with I, as tesseract thinks | is a lot more frequent than it really is) and posts the reply.