aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • Engineering
  • Software Engineering

Building Advanced Beam Pipelines In Scala With SCIO

  • aster.cloud
  • December 1, 2022
  • 4 minute read
Apache Beam is an open source, unified programming model with a set of language-specific SDKs for defining and executing data processing workflows.Scio, pronounced shee-o, is Scala API for Beam developed by Spotify to build both Batch and Streaming pipelines.
In this blog we will uncover the need for SCIO and a few reference patterns.

Why Scio

SCIO provides high level abstraction for developers and is preferred for following reasons:


Partner with aster.cloud
for your next big idea.
Let us know here.



From our partners:

CITI.IO :: Business. Institutions. Society. Global Political Economy.
CYBERPOGO.COM :: For the Arts, Sciences, and Technology.
DADAHACKS.COM :: Parenting For The Rest Of Us.
ZEDISTA.COM :: Entertainment. Sports. Culture. Escape.
TAKUMAKU.COM :: For The Hearth And Home.
ASTER.CLOUD :: From The Cloud And Beyond.
LIWAIWAI.COM :: Intelligence, Inside and Outside.
GLOBALCLOUDPLATFORMS.COM :: For The World's Computing Needs.
FIREGULAMAN.COM :: For The Fire In The Belly Of The Coder.
ASTERCASTER.COM :: Supra Astra. Beyond The Stars.
BARTDAY.COM :: Prosperity For Everyone.

  • Striking balance between concise and performance. Pipeline written in Scala are concise compared to java with similar performance
  • Easier migration for Scalding/Spark developers due to similar semantics compared to Beam API thereby avoiding a steep learning curve for developers.
  • Enables access to a large ecosystem of infrastructure libraries in Java e.g. Hadoop, Avro, Parquet and high level numerical processing libraries in Scala like Algebird and Breeze.
  • Supports Interactive exploration of data and code snippets using SCIO REPL

Reference Patterns

Let us checkout few concepts along with examples:

1. Graph Composition

If you have a complex pipeline consisting of several transforms, the feasible approach is tocompose the logically related transforms into blocks. This would make it easy to manageand debug the graph rendered on dataflow UI. Let us consider an example using popularWordCount pipeline.

Fig: Word Count Pipeline without Graph Composition

Let us modify the code to group the related transforms into blocks: 

Fig: Word Count Pipeline with Graph Composition

2. Distributed Cache

Distributed Cache allows to load the data from a given URI on workers and use thecorresponding data across all tasks (DoFn’s) executing on the worker. Some of the commonuse cases are loading serialized machine learning model from object stores like GoogleCloud Storage for running predictions, lookup data references etc.

Read More  Reduce Scaling Costs By Up To 50% In Cloud Spanner With Doubled Provisioned Storage

Let us checkout an example that loads lookup data from CSV file on worker duringinitialization and utilizes to count the number of matching lookups for each input element.

Fig: Example demonstrating Distribute Cache

3. Scio Joins

Joins in Beam are expressed using CoGroupByKey while Scio allows to express variousjoin types like inner, left outer and full outer joins through flattening the CoGbkResult.

Hash joins (syntactic sugar over a beam side input) can be used, if one of the dataset isextremely small (max ~1GB) by representing a smaller dataset on the right hand side. Sideinputs are small, in-memory data structures that are replicated to all workers and avoidsshuffling.

MultiJoin can be used to join up to 22 data sets. It is recommended that all data sets beordered in descending size, because non-shuffle joins do require the largest data sets to be onthe left of any chain of operators

Sparse Joins can be used for cases where the left collection (LHS) is much larger than theright collection (RHS) that cannot fit in memory but contains a sparse intersection of keysmatching with the left collection . Sparse Joins are implemented by constructing aBloom filter of keys from the right collection and split the left side collection into 2partitions. Only the partition with keys in the filter go through the join and the rest areeither concatenated (i.e Outer join) or discarded (Inner join). Sparse Join is especially usefulfor joining historical aggregates with incremental updates.

Skewed Joins are a more appropriate choice for cases where the left collection (LHS) ismuch larger and contains hotkeys. Skewed join uses Count Mink Sketch which is aprobabilistic data structure to count the frequency of keys in the LHS collection. LHS ispartitioned into Hot and chill partitions. While the Hot partition is joined withcorresponding keys on RHS using a Hash join, chill partition uses a regular join and finallyboth the partitions are combined through union operation.

Read More  Shopify Engineers Deliver On Peak Performance During Black Friday Cyber Monday 2021

Fig: Example demonstrating Scio Joins

Note that while using Beam Java SDK you can also take advantage of some of the similarjoin abstractions using Join Library extension

4. AlgeBird Aggregators and SemiGroup

Algebird is Twitter’s abstract algebra library containing several reusable modules for parallelaggregation and approximation. Algebird Aggregator or Semigroup can be used withaggregate and sum transforms on SCollection[T] or aggregateByKey and sumByKeytransforms on SCollection[(K, V)]. Below example illustrates computing parallelaggregation on customer orders and composition of result into OrderMetrics class

Fig: Example demonstrating Algebird Aggregators

Below code snippet expands on previous example and demonstrates the SemiGroup foraggregation of objects by combining fields. 

Fig: Example demonstrating Algebird SemiGroup

5. GroupMap and GroupMapReduce

GroupMap can be used as a replacement of groupBy(key) + mapValues(_.map(func))or_.map(e => kv.of(keyfunc, valuefunc))+groupBy(key)

Let us consider the below example that calculates the length of words for each type. Insteadof grouping by each type and applying length function, the GroupMap allows combiningthese operations by applying keyfunc and valuefunc.

Fig: Example demonstrating GroupMap

GroupMapReduce can be used to derive the key and apply the associative operation on thevalues associated with each key. The associative function is performed locally on eachmapper similarly to a “combiner” in MapReduce (aka combiner lifting) before sending theresults to the reducer. This is equivalent to keyBy(keyfunc) + reduceByKey(reducefunc)Let us consider the below example that calculates the cumulative sum of odd and evennumbers in a given range. In this case individual values are combined on each worker andthe local results are aggregated to calculate the final result

Fig: Example demonstrating GroupMapReduce

Read More  Google Cloud Spanner Dialect For SQLAlchemy

Conclusion

Thanks for reading and I hope now you are motivated to learn more about SCIO. Beyond thepatterns covered above, SCIO contains several interesting features like implicit coders forScala case classes, Chaining jobs using I/O Taps , Distinct Count using HyperLogLog++ ,Writing sorted output to files etc. Several use case specific libraries like BigDiffy (comparisonof large datasets) , FeaTran (used for ML Feature Engineering) were also built on top of SCIO.

For Beam lovers with Scala background, SCIO is the perfect recipe for building complexdistributed data pipelines.

 

By: Prathap Kumar Parvathareddy (Staff Data Engineer)
Source: Google Cloud Blog


For enquiries, product placements, sponsorships, and collaborations, connect with us at [email protected]. We'd love to hear from you!

Our humans need coffee too! Your support is highly appreciated, thank you!

aster.cloud

Related Topics
  • Apache Beam
  • Coding
  • Data Analytics
  • Google Cloud
  • Programming
  • Scala
  • SCIO
  • Tutorials
You May Also Like
View Post
  • Engineering
  • Technology

Guide: Our top four AI Hypercomputer use cases, reference architectures and tutorials

  • March 9, 2025
View Post
  • Software Engineering
  • Technology

Claude 3.7 Sonnet and Claude Code

  • February 25, 2025
View Post
  • Computing
  • Engineering

Why a decades old architecture decision is impeding the power of AI computing

  • February 19, 2025
View Post
  • Engineering
  • Software Engineering

This Month in Julia World

  • January 17, 2025
View Post
  • Engineering
  • Software Engineering

Google Summer of Code 2025 is here!

  • January 17, 2025
View Post
  • Data
  • Engineering

Hiding in Plain Site: Attackers Sneaking Malware into Images on Websites

  • January 16, 2025
View Post
  • Computing
  • Design
  • Engineering
  • Technology

Here’s why it’s important to build long-term cryptographic resilience

  • December 24, 2024
IBM and Ferrari Premium Partner
View Post
  • Data
  • Engineering

IBM Selected as Official Fan Engagement and Data Analytics Partner for Scuderia Ferrari HP

  • November 7, 2024

Stay Connected!
LATEST
  • college-of-cardinals-2025 1
    The Definitive Who’s Who of the 2025 Papal Conclave
    • May 7, 2025
  • conclave-poster-black-smoke 2
    The World Is Revalidating Itself
    • May 6, 2025
  • 3
    Conclave: How A New Pope Is Chosen
    • April 25, 2025
  • Getting things done makes her feel amazing 4
    Nurturing Minds in the Digital Revolution
    • April 25, 2025
  • 5
    AI is automating our jobs – but values need to change if we are to be liberated by it
    • April 17, 2025
  • 6
    Canonical Releases Ubuntu 25.04 Plucky Puffin
    • April 17, 2025
  • 7
    United States Army Enterprise Cloud Management Agency Expands its Oracle Defense Cloud Services
    • April 15, 2025
  • 8
    Tokyo Electron and IBM Renew Collaboration for Advanced Semiconductor Technology
    • April 2, 2025
  • 9
    IBM Accelerates Momentum in the as a Service Space with Growing Portfolio of Tools Simplifying Infrastructure Management
    • March 27, 2025
  • 10
    Tariffs, Trump, and Other Things That Start With T – They’re Not The Problem, It’s How We Use Them
    • March 25, 2025
about
Hello World!

We are aster.cloud. We’re created by programmers for programmers.

Our site aims to provide guides, programming tips, reviews, and interesting materials for tech people and those who want to learn in general.

We would like to hear from you.

If you have any feedback, enquiries, or sponsorship request, kindly reach out to us at:

[email protected]
Most Popular
  • 1
    IBM contributes key open-source projects to Linux Foundation to advance AI community participation
    • March 22, 2025
  • 2
    Co-op mode: New partners driving the future of gaming with AI
    • March 22, 2025
  • 3
    Mitsubishi Motors Canada Launches AI-Powered “Intelligent Companion” to Transform the 2025 Outlander Buying Experience
    • March 10, 2025
  • PiPiPi 4
    The Unexpected Pi-Fect Deals This March 14
    • March 13, 2025
  • Nintendo Switch Deals on Amazon 5
    10 Physical Nintendo Switch Game Deals on MAR10 Day!
    • March 9, 2025
  • /
  • Technology
  • Tools
  • About
  • Contact Us

Input your search keywords and press Enter.