Overview
A write in Cassandra goes first to the commit log and to an in-memory memtable; once the memtable is full, it is flushed to disk as an SSTable, so each memtable flush creates a new SSTable. The process of merging these SSTables to improve the performance of partition scans and to reclaim space from deleted data (tombstones) is called compaction.
Full compaction: A full compaction applies only to SizeTieredCompactionStrategy; it merges all SSTables into one large SSTable.
What happens during Compaction?
Compaction runs as a background process, triggered as new SSTables accumulate. During compaction, Cassandra merges SSTables by combining row fragments, evicting expired tombstones, and rebuilding indexes. Because SSTables are sorted, the merge is efficient and requires no random disk I/O. Once a newly merged SSTable is complete, the input SSTables are reference-counted and removed as soon as possible. During compaction there is a temporary spike in disk space usage and disk I/O.
Impact of compaction
Compaction affects both read and write performance: while it runs, the extra disk I/O and utilization cause some temporary performance degradation.
Impact on reads: Because of its high disk I/O, a running compaction can slow reads that are not served from the cache. Once the compaction completes, however, off-cache read performance improves, since fewer SSTable files on disk need to be checked.
Impact on disk: A full compaction is I/O- and CPU-intensive and can temporarily require up to double the disk space when no old versions or tombstones are evicted; it is therefore advisable to provision disks at twice the size of your data. Conversely, compaction reclaims disk space when much of the data in the input SSTables consists of tombstones.
Types of Compaction Strategies
Cassandra provides two compaction strategies:
SizeTieredCompactionStrategy: This is the default compaction strategy. It gathers SSTables of similar size (four by default) and compacts them together into a larger SSTable.
Figure 2 shows what we expect much later, when the second-tier SSTables have been combined to create a third tier, the third tier combined to create a fourth, and so on.
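For illustration, here is how the size-tiered options might be set on a table, sketched in CQL 3 syntax (the table name is hypothetical, and option spellings can vary between Cassandra versions). The min_threshold option controls how many similar-sized SSTables must accumulate before they are compacted together:

    ALTER TABLE status_updates
    WITH compaction = {
      -- gather at least 4 similar-sized SSTables before compacting
      'class': 'SizeTieredCompactionStrategy',
      'min_threshold': 4
    };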
LeveledCompactionStrategy: Introduced in Cassandra 1.0, this
strategy creates SSTables of a fixed, relatively small size (5 MB by default)
that are grouped into levels. Within each level, SSTables are guaranteed to be
non-overlapping. Each level (L0, L1, L2 and so on) is 10 times as large as the
previous.
In Figure 3, new SSTables are added to the first level, L0, and immediately compacted with the SSTables in L1 (blue). When L1 fills up, extra SSTables are promoted to L2 (violet). Subsequent SSTables generated in L1 will be compacted with the SSTables in L2 with which they overlap. As more data is added, leveled compaction results in a situation like the one shown in Figure 4.
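Leveled compaction is configured the same way; the sstable_size_in_mb option sets the fixed SSTable size (5 MB in the versions described here; later releases use a larger default). The table name is again hypothetical:

    ALTER TABLE status_updates
    WITH compaction = {
      'class': 'LeveledCompactionStrategy',
      'sstable_size_in_mb': 5
    };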
What problems does Leveled Compaction solve?
There are three problems with size-tiered compaction in update-heavy workloads:
- Read performance can be inconsistent because there are no guarantees as to how many SSTables a row may be spread across: in the worst case, we could have columns from a given row in every SSTable.
- A substantial amount of space can be wasted, since there is no guarantee as to how quickly obsolete columns (tombstones) will be merged out of existence; this is particularly noticeable when there is a high ratio of deletes.
- Space can also be a problem as SSTables grow larger from repeated compactions, since an obsolete SSTable cannot be removed until the merged SSTable is completely written. In the worst case (a single set of large SSTables with no obsolete rows to remove), Cassandra would need 100% as much free space as is used by the SSTables being compacted, into which to write the merged one.
Leveled compaction solves these problems with tiered compaction as follows:
- Leveled compaction guarantees that 90% of all reads will be satisfied from a single SSTable (assuming nearly-uniform row size). The worst case is bounded by the total number of levels, e.g., seven for 10 TB of data (see the arithmetic after this list).
- At most 10% of space will be wasted by obsolete rows.
- Only enough space for 10x the SSTable size needs to be reserved for temporary use by compaction.
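The seven-level figure follows from the level sizing: L1 holds roughly ten 5 MB SSTables, and each further level is ten times larger, so

    \[
      \text{size}(L_n) \approx 50\,\text{MB} \times 10^{\,n-1},
      \qquad
      \text{size}(L_6) \approx 5\,\text{TB},
      \qquad
      \text{size}(L_7) \approx 50\,\text{TB},
    \]

which means 10 TB of data fits within seven levels.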
When to use which compaction
Each compaction strategy serves different use cases well, so it is important to understand when to use which.
When to use Leveled Compaction
The following are scenarios in which leveled compaction is recommended:
1. High Sensitivity to Read Latency:
In addition to lowering read latencies in general, leveled compaction lowers
the amount of variability in read latencies. If your application has strict
latency requirements for the 99th percentile, leveled compaction may be the
only reliable way to meet those requirements, because it gives you a known
upper bound on the number of SSTables that must be read.
2. High Read/Write Ratio:
If you perform at least twice as many reads as you do writes, leveled
compaction may actually save you disk I/O, despite consuming more I/O for
compaction. This is especially true if your reads are fairly random and don’t
focus on a single, hot dataset.
3. Rows Are Frequently Updated: If you have wide rows to which new columns are constantly added, those rows will be spread across multiple SSTables with size-tiered compaction. Leveled compaction, on the other hand, keeps the number of SSTables that a row is spread across very low, even with frequent row updates. So, if you're using wide rows, leveled compaction can be useful.
4. Deletions or TTLed Columns in Wide Rows: If you maintain event timelines in wide rows and set TTLs on the columns in order to limit the timeline to a window of recent time, those columns will be replaced by tombstones when they expire. Similarly, if you use a wide row as a work queue, deleting columns after they have been processed, you will wind up with a lot of tombstones in each row. So, even if there aren't many live columns still in a row, there may be tombstones for that row spread across extra SSTables; these too need to be read by Cassandra, potentially requiring additional disk seeks. In this scenario, leveled compaction is a better choice, as it limits how far a row spreads and purges tombstones more promptly while merging.
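A minimal sketch of this timeline pattern, in CQL 3 syntax (table and column names are hypothetical): every column carries a TTL, so expired events become tombstones that leveled compaction can purge relatively quickly.

    -- Hypothetical event timeline: one wide row per user, one column per event.
    CREATE TABLE user_timeline (
      user_id  text,
      event_at timeuuid,
      payload  text,
      PRIMARY KEY (user_id, event_at)
    ) WITH compaction = {'class': 'LeveledCompactionStrategy'};

    -- Each event expires after 24 hours, leaving a tombstone behind.
    INSERT INTO user_timeline (user_id, event_at, payload)
    VALUES ('user42', now(), 'logged in')
    USING TTL 86400;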
When to use Size Tiered Compaction
The following are scenarios in which size-tiered compaction is recommended:
1. Write-heavy Workloads: If you have more writes than reads, size-tiered compaction is recommended: leveled compaction may struggle to keep up with write-heavy workloads, and because reads are infrequent, there is little benefit from the extra compaction I/O.
2. Rows Are Write-Once:
If your rows are always written entirely at once and are never updated, they
will naturally always be contained by a single SSTable when using size-tiered
compaction. Thus, there’s really nothing to gain from leveled compaction.
3. Low I/O Performance: If your cluster is already pressed for I/O, switching to leveled compaction will almost certainly only worsen the problem, so it's better to stay with size-tiered compaction when disk IOPS are limited.
Verification Test
The following test can be performed to verify the impact of the different compaction strategies on read performance and disk usage:
1. Create a column family in Cassandra for storing users' status updates: the user ID is the row key, the current timestamp is the column name, and the status text is the column value (see the CQL sketch after this list).
2. Set a TTL on the status columns so that statuses older than 5 minutes are removed.
3. Use the default compaction strategy, size-tiered compaction.
4. Keep inserting lots of messages for the same user for about an hour.
5. Meanwhile, from another program, read the status updates for the same user.
6. Watch the read latency and disk usage over time.
7. Perform the same test with the leveled compaction strategy and note the difference.
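Steps 1-3 and 7 might look like this in CQL 3 syntax (table and column names are illustrative, not prescribed by the test):

    -- Step 1: one wide row per user; the timestamp is the clustering column.
    -- Step 3: size-tiered is the default, shown here explicitly.
    CREATE TABLE status_updates (
      user_id   text,
      posted_at timeuuid,
      status    text,
      PRIMARY KEY (user_id, posted_at)
    ) WITH compaction = {'class': 'SizeTieredCompactionStrategy'};

    -- Steps 2 and 4: each status expires 5 minutes (300 s) after insertion.
    INSERT INTO status_updates (user_id, posted_at, status)
    VALUES ('user42', now(), 'feeling lucky')
    USING TTL 300;

    -- Step 7: switch strategies, then repeat the insert/read workload.
    ALTER TABLE status_updates
    WITH compaction = {'class': 'LeveledCompactionStrategy'};

For step 6, nodetool cfstats reports the disk space used per column family, and nodetool cfhistograms shows how many SSTables each read touches.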