Items tagged with: streams
HN Discussion: https://news.ycombinator.com/item?id=19463306
Posted by itamarhaber (karma: 720)
Post stats: Points: 303 - Comments: 46 - 2019-03-22T15:13:21Z
#HackerNews #data #pure #redis #streams #structure
The new Redis data structure introduced in Redis 5 under the name of “Streams” generated quite some interest in the community. Sooner or later I want to run a community survey, talking with users having production use cases, and blogging about it. Today I want to address another issue: I’m starting to suspect that many users are only thinking of Streams as a way to solve Kafka(TM)-like use cases. Actually the data structure was designed to also work in the context of messaging with producers and consumers, but to think that Redis Streams are just good for that is incredibly reductive. Streaming is a terrific pattern and “mental model” that can be applied when designing systems with great success, but Redis Streams, like most Redis data structures, are more general, and can be used to model dozens of different unrelated problems. So in this blog post I’ll focus on Streams as a pure data structure, completely ignoring its blocking operations, consumer groups, and all the messaging parts.

## Streams are CSV files on steroids

If you want to log a series of structured data items and have decided that databases are overrated after all, you may say something like: let’s just open a file in append only mode, and log every row as a CSV (Comma Separated Value) item:

    (open data.csv in append only)
    time=1553096725029,cpu_temp=23.2,load=2.1

Looks simple, and people did this for ages and still do: it’s a solid pattern if you know what you are doing. But what is the in-memory equivalent of that? Memory is more powerful than an append only file and can automagically remove the limitations of a CSV file like that:

1. It’s hard (inefficient) to do range queries here.
2. There is too much redundant information: the time is almost the same in every entry and the fields are duplicated. At the same time, removing it would make the format less flexible if I want to switch to a different set of fields.
3. Item offsets are just the byte offset in the file: if we change the file structure the offset will be wrong, so there is no true concept of a primary ID here. Entries are basically not uniquely addressable in any way.
4. I can’t remove entries, but only mark them as no longer valid, with no way to garbage collect them other than rewriting the log. Log rewriting usually sucks for several reasons, and if it can be avoided, it’s good.

Still, such a log of CSV entries is also great in some ways: there is no fixed structure and fields may change, it is trivial to generate, and after all it is quite compact as well. The idea with Redis Streams was to retain the good things, but overcome the limitations. The result is a hybrid data structure very similar to Redis Sorted Sets: they feel like a fundamental data structure, but to get such an effect, internally it uses multiple representations.

## Streams 101 (you may skip this if you already know Redis Stream basics)

Redis Streams are represented as delta-compressed macro nodes that are linked together by a radix tree. The effect is to be able to seek to random entries in a very fast way, to obtain ranges if needed, remove old items to create a capped stream, and so forth. Yet our interface to the programmer is very similar to a CSV file:

    > XADD mystream * cpu-temp 23.4 load 2.3
    "1553097561402-0"
    > XADD mystream * cpu-temp 23.2 load 2.1
    "1553097568315-0"

As you can see from the example above, the XADD command auto generates and returns the entry ID, which is monotonically incrementing and has two parts: <time>-<counter>. The time is in milliseconds and the counter increases for entries generated in the same millisecond.

So the first new abstraction on top of the “append only CSV file” idea is that, since we used the asterisk as the ID argument of XADD, we get the entry ID for free from the server. Such an ID is not only useful to point to a specific item inside a stream, it’s also related to the time when the entry was added to the stream. In fact with XRANGE it is possible to perform range queries or fetch single items:

    > XRANGE mystream 1553097561402-0 1553097561402-0
    1) 1) "1553097561402-0"
       2) 1) "cpu-temp"
          2) "23.4"
          3) "load"
          4) "2.3"

In this case I used the same ID as the start and the stop of the range in order to identify a single element. However I can use any range, and a COUNT argument to limit the number of results. Similarly, there is no need to specify full IDs as the range: I can just use the millisecond unix time part of the IDs to get elements in a given range of time:

    > XRANGE mystream 1553097560000 1553097570000
    1) 1) "1553097561402-0"
       2) 1) "cpu-temp"
          2) "23.4"
          3) "load"
          4) "2.3"
    2) 1) "1553097568315-0"
       2) 1) "cpu-temp"
          2) "23.2"
          3) "load"
          4) "2.1"

There is no need to show you more of the Streams API here; there is the Redis documentation for that. For now let’s just focus on that usage pattern: XADD to add stuff, XRANGE (but also XREAD) to fetch back ranges (depending on what you want to do), and let’s see why I claim Streams are so powerful as a data structure.

However if you want to learn more about Redis Streams and their API, make sure to visit the tutorial here: https://redis.io/topics/streams-intro

## Tennis players

A few days ago I was modeling an application with a friend of mine who is learning Redis these days: an app to keep track of local tennis courts, local players and matches. The way you model players in Redis is quite obvious: a player is a small object, so a Hash is all you need, with key names like player:<id>. As you model the application data further, to use Redis as its primary database, you immediately realize you need a way to track the games played in a given tennis club. If player:1 and player:2 played a game, and player 1 won, we could write the following entry in a stream:

    > XADD club:1234.matches * player-a 1 player-b 2 winner 1
    "1553254144387-0"

With this simple operation we have:

1. A unique identifier of the match: the ID in the stream.
2. No need to create an object in order to identify a match.
3. Range queries for free to paginate the matches, or check the matches played at a given moment in the past.

Before Streams we needed to create a sorted set scored by time: the sorted set element would be the ID of the match, living in a different key as a Hash value. This is not just more work, it’s also an incredible amount of memory wasted. Much more than you could guess (see later).

For now the point to show is that Redis Streams are kind of a Sorted Set in append only mode, keyed by time, where each element is a small Hash. And in its simplicity this is a revolution in the context of modeling for Redis.

## Memory usage

The above use case is not just a matter of a more solid pattern. The memory cost of the Stream solution is so different from the old approach of having a Sorted Set + Hash for every object that it makes certain things that were not viable now perfectly fine. These are the numbers for storing one million matches in the configurations described previously:

    Sorted Set + Hash memory usage = 220 MB (242 RSS)
    Stream memory usage = 16.8 MB (18.11 RSS)

This is more than an order of magnitude difference (about 13 times), and it means that use cases that yesterday were too costly for in-memory are now perfectly viable. The magic is all in the representation of Redis Streams: the macro nodes can contain several elements that are encoded in a data structure called listpack in a very compact way. Listpacks will take care, for instance, to encode integers in binary form even if they are semantically strings. On top of that, we then apply delta compression and same-fields compression. Yet we are able to seek by ID or time because such macro nodes are linked in the radix tree, which was also designed to use little memory. All these things together account for the low memory usage, but the interesting part is that semantically the user does not see any of the implementation details making Streams efficient.

Now let’s do some simple math. If I can store 1 million entries in about 18 MB of memory, I can store 10 million in 180 MB, and 100 million in 1.8 GB. With just 18 GB of memory I can have 1 billion items.

## Time series

One important thing to note is, in my opinion, how the usage above, where we used a Stream to represent a tennis match, was semantically very different from using a Redis Stream for a time series. Yes, logically we are still logging some kind of event, but one fundamental difference is that in one case we use the logging and the creation of entries in order to render objects, while in the case of time series we are just metering something happening externally, which does not really represent an object. You may think that this difference is trivial but it’s not. It is important for Redis users to build the idea that Redis Streams can be used in order to create small objects that have a total order, and assign IDs to such objects.
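To make the idea of small objects with a total order and server-assigned IDs more concrete, here is a minimal Python sketch. It is a toy in-memory model, not how Redis implements Streams (there are no macro nodes, listpacks or radix tree here), and the names `MiniStream` and `parse_id` are hypothetical. It only mimics the `<time>-<counter>` ID scheme and the range-by-ID or range-by-millisecond queries described earlier.

```python
import bisect


def parse_id(text, default_seq):
    """Parse '<ms>-<seq>' or a bare '<ms>' into an (ms, seq) tuple."""
    ms, _, seq = text.partition("-")
    return (int(ms), int(seq) if seq else default_seq)


class MiniStream:
    """Toy append-only log keyed by (ms, seq) IDs. Illustration only."""

    def __init__(self):
        self.ids = []      # (ms, seq) tuples, always in increasing order
        self.entries = []  # field dicts, parallel to self.ids

    def xadd(self, now_ms, fields):
        # Same millisecond (or a clock that went backwards): keep the last
        # time part and bump the counter, so IDs never decrease.
        last_ms, last_seq = self.ids[-1] if self.ids else (0, -1)
        new_id = (now_ms, 0) if now_ms > last_ms else (last_ms, last_seq + 1)
        self.ids.append(new_id)
        self.entries.append(dict(fields))
        return f"{new_id[0]}-{new_id[1]}"

    def xrange(self, start, end):
        # A bare millisecond start means counter 0; a bare end means
        # "any counter within that millisecond".
        lo = bisect.bisect_left(self.ids, parse_id(start, 0))
        hi = bisect.bisect_right(self.ids, parse_id(end, float("inf")))
        return [(f"{ms}-{seq}", self.entries[i])
                for i, (ms, seq) in enumerate(self.ids[lo:hi], start=lo)]


matches = MiniStream()
first = matches.xadd(1553254144387, {"player-a": "1", "player-b": "2", "winner": "1"})
second = matches.xadd(1553254144387, {"player-a": "3", "player-b": "1", "winner": "3"})
same_second = matches.xrange("1553254144000", "1553254145000")
```

Both matches land in the same millisecond, so they get IDs that differ only in the counter, and each ID doubles as the identifier of the match object and as its position in time.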
However even the most basic use case of time series is, obviously, a huge one here, because before Streams Redis was a bit hopeless in regard to such use cases. The memory characteristics and flexibility of streams, plus the ability to have capped streams (see the XADD options), are a very important tool in the hands of the developer.

## Conclusions

Streams are flexible and have lots of use cases; however, I wanted to keep this blog post short to make sure that there is a clear take-home message in the above examples and analysis of the memory usage. Perhaps this was already obvious to many readers, but talking with people in the last months gave me the feeling that there was a strong association between Streams and the streaming use case, as if the data structure was only good at that. That’s not the case 😀
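The memory analysis above is simple enough to reproduce as back-of-envelope arithmetic. The sketch below uses only the figures quoted in the post (220 MB versus 16.8 MB, and "about 18 MB" per million entries). The linear scaling is an assumption that holds only for entries shaped like the tennis matches, and the function names are hypothetical. The second helper shows how the same figure could be used to pick a length for a capped stream.

```python
# Figures quoted in the post for one million match entries.
SORTED_SET_PLUS_HASH_MB = 220.0
STREAM_MB = 16.8
ABOUT_MB_PER_MILLION = 18.0  # the rounded figure used in the post's math


def estimate_mb(entries, mb_per_million=ABOUT_MB_PER_MILLION):
    """Linear extrapolation of stream memory; assumes similar small entries."""
    return entries / 1_000_000 * mb_per_million


def max_entries_for_budget(budget_mb, mb_per_million=ABOUT_MB_PER_MILLION):
    """Rough cap on the stream length that fits in a given memory budget."""
    return int(budget_mb / mb_per_million * 1_000_000)


ratio = SORTED_SET_PLUS_HASH_MB / STREAM_MB
for n in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} entries -> about {estimate_mb(n):,.0f} MB")
print(f"Sorted Set + Hash vs Stream: about {ratio:.1f}x")
print(f"Entries that fit in 512 MB: about {max_entries_for_budget(512):,}")
```

Real numbers depend on field names, value sizes and how well the entries compress, so the result is best treated as a starting point to be checked against measurements on actual data.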
- No more than a four-hour drive from either our current #homes
- Friendly #Second #Amendment #2a state and location
- #Remote with a lot of #wooded #land around us, not necessarily our land, and,
- Available #water; #Ponds, #streams, and #aquifers.
After negotiations, which lasted a week, the #seller and we settled on a price $6,000 less than the asking one, for a total price of $2,300 an acre for 34 acres. The original idea was for us to #subdivide the #property into four sections and sell these parcels for a profit that would cover our original outlay. We did not do this in the end, for several reasons which I will not go into right now. Still, for $78k, 34 acres isn't bad. #prep #prepare #BOLO #bugout #remote #collapse
Which websites featured on the Federation have the worst privacy?
My last post highlighted how ticking the OEmbed box to add a website picture to a post can compromise Federation users if it contains a tracker.
I also mentioned tools, like Disconnect, we could use to detect websites which track their users. In this post I reveal some of the most popular reference websites on the Federation with low privacy and high tracking rates.
I believe Federation users should consider not embedding, or at least warning their readers about the surveillance techniques carried out by these sites.
A Princeton University study identified almost a million websites that track their users. Here are just 5 examples of websites whose stories are commonly quoted on the Federation:
Wired is a popular website referenced on the Federation by many users because it publishes great tech-based stories. But how private is it?
Although it offers an ‘ad-free’ version for subscribers, normal visitors are ruthlessly fleeced for their data.
WIRED has embed deals (agreements to embed tracking codes into their pages for money or gain) with a staggering 171 third parties including Google, Amazon, Facebook, Vogue, GQ, Golf Digest, Bonappetit and Vanity Fair.
Some tracking beacons embedded on WIRED and captured by Ublock Origin
151 of these third parties are known tracking or advertising companies like Google, Amazon, Facebook, Turn, Add This, Scorecard Research, Adobe, Twitter Analytics, Typekit, Criteo and Quantserve. Aggressive trackers like Google Tag Manager (GTM), Add This and Turn are present here.
Below is a screengrab of the many scripts NoScript has blocked from the WIRED website, the 33 scripts, gifs and beacons blocked by Ublock Origin and a couple by Disconnect.
WIRED sets 25 short-term and 28 long-term cookies itself, while allowing its third party partners (including 69 tracking companies) to set 26 short-term and 133 long-term cookies.
It uses Google Analytics without the anonymization feature enabled, so user details are sent to Google servers.
All WIRED servers are based in the US so GDPR privacy rules can be ignored.
Websites loading this many scripts/cookies are usually blacklisted by most users, not least because they drain a device’s battery.
WIRED claims that subscribing with them will mean an ad free experience, but I find it hard to believe that a subscription to WIRED will suddenly load a clean page without a single tracker retrieving data. But then I am not a WIRED subscriber. Please comment if you are and have no trackers.
Seen by some as a safe pro-privacy resource celebrating Free and Open Source Software, FOSSPOST lets its users down by digitally fingerprinting their devices and loading 19 trackers into a browser.
FOSSPOST has embed deals with 27 third parties, putting its embed rating in the ‘low’ category, including Google, Amazon, Creative Commons and WordPress.
13 of these are known tracking or advertising companies like Google, Amazon, Mailerlite, One Signal and the data-hungry caterpillar that is WordPress.
FOSSPOST sets 2 short-term and 2 long-term cookies itself while allowing its third party partners (including 3 tracking companies) to set 4 long-term cookies.
It uses Google Analytics without the anonymization feature so user details are sent to Google servers. All FOSSPOST servers are based in the US so GDPR privacy rules can be ignored.
Acquired by Yahoo’s parent company, Oath (a company that includes AOL), under the Verizon umbrella, in 2010, this is a popular reference source for researchers and Federation users.
Historically, Yahoo deserves some kudos as they were one of the few big tech companies that objected to sharing their users’ details with the PRISM
The Bush administration threatened them with $250k a day fines until they complied. Verizon bought them in 2017. Yahoo suffered the largest data breach in history in 2018.
The link to this NYT story is not embedded (consider blocking the GTM tracker on the site)
TECHCRUNCH.com fingerprints the user’s device and dumps 2-7 Yahoo trackers in their browser, depending on the page loaded.
TECHCRUNCH.com has embed deals with 27 third parties, including Google, Facebook, Yahoo and WordPress.
15 of these are known tracking or advertising companies like Google, Facebook, Yahoo, WordPress, Atwola, Typekit, AOL and Scorecard Research.
TECHCRUNCH.com sets 4 short-term and 5 long-term cookies itself while allowing its third party partners (including 4 tracking companies) to set 1 short-term and 7 long-term cookies.
It uses Google Analytics but interestingly enables the anonymization feature so some user details are not sent to Google servers.
All servers are based in the US so forget about GDPR privacy rules.
THE REGISTER .co.uk
Although a great resource with well-written and groundbreaking stories, it isn’t as private as I’d hoped.
There is no obvious digital fingerprinting, but it seems to have gathered more Google syndication in the last couple of years (9 of its 16 embed deals are with the Big G). 12 known tracking or advertising companies like Google, Admedo and the Amp Project gather data.
THE REGISTER sets 3 short-term and 4 long-term cookies itself while allowing its third party partners (including 2 tracking companies) to set 7 long-term cookies.
It uses Google Analytics without enabling the anonymization feature so user details are sent to Google servers. Although THE REGISTER’s domain is in the UK, both its data and email servers are based in the US so GDPR privacy rules could be compromised here, though I am not a lawyer.
The Guardian .com
I’ve been sitting on this for a few years now but it’s about time I blew the whistle.
I first noticed the Guardian newspaper’s website was digitally fingerprinting its users’ devices when they published an article on, um, Canvas Fingerprinting.
That page has been removed since, but they still continued doing it, long before Facebook, though not before Google.
I’ve kept quiet about this surveillance because I admire the paper for its incredible journalism, especially exclusives like the Snowden revelations, and its general championing of freedom issues across many sectors of society. But the hypocrisy has started to wear me down.
Some tracking items & widgets embedded on Guardian .com and captured by Ublock Origin
The Guardian has embed deals with a privacy-sapping 142 third parties, including Google, Amazon, Bing, Twitter, and, despite being one of its main critics, Facebook. 132 of these third party partners are known tracking or advertising companies like Google, Amazon, Facebook, Turn, AddThis, Scorecard Research, Blue Kai, Twitter Analytics, Rubicon, Criteo and Quantserve.
Some of the most aggressive trackers like GTM, AddThis and Turn are present here.
The Guardian also sets 3 short-term and 5 long-term cookies itself, while allowing its third party partners (including 51 tracking companies) to set 10 short-term and 131 long-term cookies.
Yes, we NEED the Guardian’s continued existence, but castigating Facebook et al while allowing them to track its users doesn’t sit well with me.
The website uses Google Analytics but at least enables the anonymization feature, so some user details are not sent to Google servers.
Although The Guardian’s data servers are in Germany, their email servers are based in the US so GDPR privacy rules could be compromised here, though, again, I am not a lawyer.
In conclusion, I’ve given just 5 examples of popular sites Federation users quote in their posts.
I am NOT advocating a boycott of these sites but politely suggest we don’t OEmbed them, just feature a hyperlink and give readers the heads-up about these privacy concerns.
Alternatively, look for other sources featuring the same story. It’s also worth highlighting which websites do NOT add a tracker when we OEmbed a story, or have a low level of surveillance. Please promote those guys.
#news #fakenews #journalism #FreePress #PressFreedom #theguardian
#privacy #tracking #trackers #facebook #social #mass-surveillance #gdpr #google #location #user #device #setup #private #secure #internet #tips #tricks #online #os #windows #apple #ios #advertising #ad #revenue #streams #developers #media #data #corporations #telemetry #consent #spyware #surveillancecapitalism #humanrights #anonymity #cookies #surveillance #browser #proxy #relay #network #www #leaks #fingerprint #activity #activitytrackers #thefederation #pods #federation #fediverse #friendica #mastodon #pleroma #socialhome #Gnusocial #Funkwhale #Peertube #pixelfed #hubzilla #Diaspora
Operating systems – can we make them private?
Every #operating #system (and #application) ever created becomes less #private with each new version.
As technology has evolved, developers are under increasing pressure to spy on their #customers and extract their #data for #exploitation.
Users are always advised to update their #software to improve its #security, its #interface and embrace new features – sometimes with good reason (like #patching a known #vulnerability). Other times the #developer is simply adding spyware. Some 'useful' applications are designed solely as #spyware and do nothing but #collect data.
As most here know, the #OS with the biggest data collection appetite is Windows 10. #Microsoft have invested their time and money into a #business model that demands its online products extract user data to drive their #ad #targeting #revenue #streams.
Gone are the days when we could install an OS from a CD or have a choice to accept or reject an update.
#Windows is no longer a product but a “service” and with #services come #fees. Microsoft will charge for its OSes in future. Even #windows7 will incur fees from January 2020 for users who prefer it to #windows10, and the cost will rise each year.
Microsoft 'Confirms' Windows 7 New Monthly Charge
Above link details
This site sets 1 long-term and 2 short-term cookies we can delete. It uses a MEDIUM number of third party embeds (16) that set 3 short-term and 0 long-term cookies. 2 tracker companies do not set cookies although Disconnect blocked 18 trackers and this link has an embedded Forbes tracker we can block.
Windows 10 has had 3 major update scandals this year alone, where #devices froze or random files were #deleted.
Thousands of Windows 7 & 8 users have had their devices upgraded to 10 without their permission, while many businesses are refusing to change from Windows 7, forcing Microsoft to extend their support for it.
I will try to explain how we can make Windows 10 #safer in future posts but users will have to face the fact that
Microsoft's Software is Malware
Furthermore it contains backdoors. I have tested this myself. I’ve turned off all updates on a Windows 7 device yet have still received updates! These were flagged up by Windows 7’s event manager – ironically, a Microsoft product betraying another Microsoft product. However, I did not disable updates in the #registry on that #device – the surest way of truly stopping updates – because one mistake can trash the system. The #hacking #community tells me they have developed a souped-up version of #XP running with all #backdoors closed, although I have no proof of this.
Microsoft's Software is Malware
Above link details
This site sets NO long-term or short-term cookies. It uses NO third party embeds and NO tracking companies.
It is very rare and
How can Federation users post more safely?
You know how it goes. We find a great story online and we want to share it with our supporters or feature it in our feed with appropriate hashtags for maximum reach.
But do we check the website featuring the story for privacy before we post?
When we embed a link by selecting the OEmbed box (often ticked by default) this displays an image or video on our post from the website we’ve featured.
They may look cool, but these images can contain beacons or other trackers. Embedded trackers also load into the browsers of any user who scrolls down the public feeds.
Should we ensure the website is safe before linking to it?
Actually some do. Posts that don’t feature a website’s images (with the OEmbed box unchecked as below) can actually protect Federation users from a serious amount of surveillance.
Some thoughtful users actually reproduce the article’s main points in their post, to protect their readers from visiting the site itself. They usually supply a link to the original content if one wants more detail and perhaps is protected with tracker blockers. So how do we know a site we recommend is safe?
Here are some privacy tips:
• Consider checking the page’s security/privacy before linking to it.
Use Tor, or a beefed-up Firefox fork or version (for detecting digital fingerprinting), and/or the Disconnect, NoScript or Ublock Origin add-ons to reveal a multitude of trackers.
• There is usually more than one website featuring the same story. Consider picking the website with the fewest trackers and the least digital fingerprinting.
• Issue a warning in your post about any of the site’s surveillance methods and privacy issues you’ve detected.
• Embedding a picture/video could also make users vulnerable. Consider unchecking the OEmbed box.
In the next post I’ll give examples of a number of websites with low privacy and excessive trackers, commonly featured in the public feeds.
#secure #internet #windows #apple #revenue #streams #developers #Social #media #data #corporations #tracking #trackers #facebook #social #mass-surveillance #gdpr #google #alphabet #location #user #device #setup #private #secure #internet #chrome #tips #tricks #online #os #mobile #ie #safari #apple #ios #ad #revenue #streams #developers #telemetry #consent #windows10 #windows7 #windows81 #microsoft #linux #debian #ubuntu #mate #gnome #grub #iphone #firefox #advertising #android #chrome #browser #browsers #phone #phones #device #Tor #privacy, #humanrights, #anonymity #internet #security #cookies #surveillance #browser #web #onion #router #torbrowser #bridge #proxy #relay #leaks #fingerprint #activity #activitytrackers #spyware #surveillancecapitalism