<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Dataflow on bramp.net</title>
    <link>https://blog.bramp.net/</link>
    <description>Recent content in Dataflow on bramp.net</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-GB</language>
    <lastBuildDate>Sat, 05 Jan 2019 07:59:08 -0800</lastBuildDate>
    <atom:link href="https://blog.bramp.net/tags/dataflow/" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Apache Beam and Google Dataflow in Go</title>
      <link>https://blog.bramp.net/post/2019/01/05/apache-beam-and-google-dataflow-in-go/</link>
      <pubDate>Sat, 05 Jan 2019 07:59:08 -0800</pubDate>
      
      <guid>https://blog.bramp.net/post/2019/01/05/apache-beam-and-google-dataflow-in-go/</guid>
      <description><p><em>Originally <a href="https://blog.gopheracademy.com/advent-2018/apache-beam/">published</a> as part of the Go Advent 2018 series</em></p>
<h1 id="overview">Overview</h1>
<p><a href="https://beam.apache.org/">Apache Beam</a> (<strong>b</strong>atch and str<strong>eam</strong>) is a powerful tool for handling <a href="https://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a> workloads. It is an evolution of <a href="https://ai.google/research/pubs/pub35650">Google’s Flume</a>, which provides batch and streaming data processing based on <a href="https://en.wikipedia.org/wiki/MapReduce">MapReduce</a> concepts. One of the novel features of Beam is that it’s agnostic to the platform that runs the code. For example, a pipeline can be written once, and run locally, across <a href="https://flink.apache.org/">Flink</a> or <a href="https://spark.apache.org/">Spark</a> clusters, or on <a href="https://cloud.google.com/dataflow/">Google Cloud Dataflow</a>.</p>
<p>An experimental <a href="https://beam.apache.org/documentation/sdks/go/">Go SDK</a> was created for Beam, and while it is still immature compared to Beam for <a href="https://beam.apache.org/documentation/sdks/python/">Python</a> and <a href="https://beam.apache.org/documentation/sdks/java/">Java</a>, it is able to do some impressive things. The remainder of this article will briefly recap a simple example from the Apache Beam site, and then work through a more complex example running on Dataflow. Consider this a more advanced version of the <a href="https://beam.apache.org/get-started/">official getting started guide</a> on the Apache Beam site.</p>
<p>Before we begin, it’s worth pointing out that if you can do your analysis on a single machine, it is likely to be faster and more cost effective. Beam is more suitable when your data processing needs are large enough that they must run in a distributed fashion.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="#concepts">Concepts</a></li>
<li><a href="#shakespeare-simple-example">Shakespeare (simple example)</a>
<ul>
<li><a href="#running-the-pipeline">Running the pipeline</a></li>
</ul>
</li>
<li><a href="#art-history-more-complex-example">Art history (more complex example)</a>
<ul>
<li><a href="#stateful-functions">Stateful functions</a></li>
<li><a href="#iterating-over-a-cogbk">Iterating over a CoGBK</a></li>
<li><a href="#data-enrichment">Data enrichment</a></li>
<li><a href="#error-handling-and-dead-letters">Error handling and dead letters</a></li>
</ul>
</li>
<li><a href="#gotchas">Gotchas</a>
<ul>
<li><a href="#marshing">Marshing</a></li>
<li><a href="#errors">Errors</a></li>
<li><a href="#difference-between-direct-and-dataflow-runners">Difference between direct and dataflow runners</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<h1 id="concepts">Concepts</h1>
<p>Beam already has good documentation that explains all the <a href="https://beam.apache.org/documentation/programming-guide/">main concepts</a>. We will cover some of the basics.</p>
<figure><img src="/post/2019/01/05/apache-beam-and-google-dataflow-in-go/design-your-pipeline-linear.png" width="720" height="175"><figcaption>
      <h4>Pipeline stages</h4>
    </figcaption>
</figure>

<p>A pipeline is made up of multiple steps that take some input, operate on that data, and finally produce output. The steps that operate on the data are called PTransforms (parallel transforms), and the data is always stored in PCollections (parallel collections). A PTransform takes one item at a time from the PCollection and operates on it. PTransforms are assumed to be hermetic, using no global state, which ensures that they always produce the same output for a given input. These properties allow the data to be sharded into multiple smaller datasets and processed in any order across multiple machines. The code you write ends up being very simple, but can be seamlessly split across hundreds of machines.</p>
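<p>To make the sharding idea concrete, here is a minimal sketch that uses only the Go standard library rather than Beam. The names <code>processSharded</code> and <code>toUpper</code> are made up for this illustration, and goroutines stand in for worker machines. Because the transform is hermetic, each shard can be processed independently. Unlike a real PCollection, which is unordered, this sketch keeps results in input order purely for readability.</p>

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// toUpper is a hermetic transform: its output depends only on its input.
func toUpper(s string) string {
	return strings.ToUpper(s)
}

// processSharded splits the input into shards of at most shardSize elements,
// applies fn to every element of a shard in its own goroutine, and stores
// each result at the index of its input element.
// (Hypothetical helper for illustration; this is not a Beam API.)
func processSharded(in []string, shardSize int, fn func(string) string) []string {
	if shardSize < 1 {
		shardSize = 1
	}
	out := make([]string, len(in))
	var wg sync.WaitGroup
	for start := 0; start < len(in); start += shardSize {
		end := start + shardSize
		if end > len(in) {
			end = len(in)
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			// Each goroutine writes to a disjoint range of out, so no
			// locking is needed.
			for i := start; i < end; i++ {
				out[i] = fn(in[i])
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	lines := []string{"to be", "or not", "to be", "that is", "the question"}
	fmt.Println(processSharded(lines, 2, toUpper))
}
```

<p>If <code>toUpper</code> read or modified shared state, the result could depend on how the shards were scheduled, which is exactly what the hermetic assumption rules out.</p>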
<h1 id="shakespeare-simple-example">Shakespeare (simple example)</h1>
<div style="float: right; width: 200px">
	<img src="word-count.png" width=200 height=436>
</div>
<p>A classic example is counting the words in Shakespeare. In brief, the pipeline counts the number of times each word appears across Shakespeare’s works, and outputs a simple key-value list of word to word-count. There is an <a href="https://github.com/apache/beam/blob/master/sdks/go/examples/minimal_wordcount/minimal_wordcount.go">example</a> provided with the Beam SDK, along with a great <a href="https://beam.apache.org/get-started/wordcount-example/">walkthrough</a>. I suggest you read that before continuing. I will however dive into some of the Go specifics, and add additional context.</p>
<p>The example begins with <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam/io/textio#Read"><code>textio.Read</code></a>, which reads all the files under the shakespeare directory stored on <a href="https://cloud.google.com/storage/">Google Cloud Storage</a> (GCS). The files are stored on GCS, so when this pipeline runs across a cluster of machines, they will all have access. <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam/io/textio#Read"><code>textio.Read</code></a> always returns a <code>PCollection&lt;string&gt;</code> which contains one element for every line in the given files.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="nx">lines</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">textio</span><span class="p">.</span><span class="nf">Read</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;gs://apache-beam-samples/shakespeare/*&#34;</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>The <code>lines</code> PCollection is then processed by a ParDo (<strong>Par</strong>allel <strong>Do</strong>), a type of PTransform. Most transforms are built with a <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam#ParDo"><code>beam.ParDo</code></a>. It will execute a supplied function in parallel on the source PCollection. In this example, the function is defined inline and very simply splits the input lines into words with a regexp. Each word is then emitted to another <code>PCollection&lt;string&gt;</code> named <code>words</code>. Note how for every line, zero or more words may be emitted, making this new collection a different size to the original.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="nx">splitFunc</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="kd">func</span><span class="p">(</span><span class="nx">line</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">emit</span><span class="w"> </span><span class="kd">func</span><span class="p">(</span><span class="kt">string</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">for</span><span class="w"> </span><span class="nx">_</span><span class="p">,</span><span class="w"> </span><span class="nx">word</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="k">range</span><span class="w"> </span><span class="nx">wordRE</span><span class="p">.</span><span class="nf">FindAllString</span><span class="p">(</span><span class="nx">line</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nf">emit</span><span class="p">(</span><span class="nx">word</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nx">words</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nf">ParDo</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="nx">splitFunc</span><span class="p">,</span><span class="w"> </span><span class="nx">lines</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>An interesting trick used by the Apache Beam Go API is passing functions as an <code>interface{}</code>, and using reflection to infer the types. Specifically, since <code>lines</code> is a <code>PCollection&lt;string&gt;</code> it is expected that the first argument of <code>splitFunc</code> is a string type. The second argument to <code>splitFunc</code> will allow Beam to infer the type of the <code>words</code> output PCollection. In this example it is a function with a single string argument. Thus the output type will be <code>PCollection&lt;string&gt;</code>. If <code>emit</code> was defined as <code>func(int)</code> then the return type would be a <code>PCollection&lt;int&gt;</code>, and the next PTransform would be expected to handle ints.</p>
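<p>The reflection trick can be illustrated with a small stand-alone sketch. It uses only the standard <code>reflect</code> package and is not how the Beam SDK is actually implemented. The helper <code>outputType</code> is a hypothetical name: it looks for a trailing <code>emit</code>-style argument and, failing that, falls back to the first return value.</p>

```go
package main

import (
	"fmt"
	"reflect"
)

// outputType inspects a function passed as interface{} and reports the
// element type it would produce: the argument type of a trailing emit
// function if one is present, otherwise the type of the first return value.
// (Hypothetical helper for illustration; this is not a Beam API.)
func outputType(fn interface{}) (reflect.Type, error) {
	t := reflect.TypeOf(fn)
	if t == nil || t.Kind() != reflect.Func {
		return nil, fmt.Errorf("expected a function, got %v", t)
	}
	if n := t.NumIn(); n > 0 {
		last := t.In(n - 1)
		// An emit function takes exactly one argument and returns nothing.
		if last.Kind() == reflect.Func && last.NumIn() == 1 && last.NumOut() == 0 {
			return last.In(0), nil
		}
	}
	if t.NumOut() > 0 {
		return t.Out(0), nil
	}
	return nil, fmt.Errorf("function %v produces no output", t)
}

func main() {
	splitFunc := func(line string, emit func(string)) {}
	lengthFunc := func(line string, emit func(int)) {}
	lengthOf := func(line string) int { return len(line) }

	for _, fn := range []interface{}{splitFunc, lengthFunc, lengthOf} {
		t, err := outputType(fn)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%T produces elements of type %v\n", fn, t)
	}
}
```

<p>The cost of this flexibility is that type mismatches surface when the pipeline is constructed, not when the program is compiled.</p>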
<p>The next step uses one of the library’s higher level constructs.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="nx">counted</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">stats</span><span class="p">.</span><span class="nf">Count</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="nx">words</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p><a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam/transforms/stats#Count"><code>stats.Count</code></a> takes a <code>PCollection&lt;X&gt;</code>, counts each unique element, and outputs a key-value pair of (X, int) as a <code>PCollection&lt;KV&lt;X, int&gt;&gt;</code>. In this specific example, the input is a <code>PCollection&lt;string&gt;</code>, thus the output is <code>PCollection&lt;KV&lt;string, int&gt;&gt;</code>.</p>
<p>Internally, <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam/transforms/stats#Count"><code>stats.Count</code></a> is made up of multiple ParDos and a <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam#GroupByKey"><code>beam.GroupByKey</code></a>, but it hides that to make it easier to use.</p>
<p>At this point, the count of each word has been calculated, and the results need to be written to a simple text file. To do this the <code>PCollection&lt;KV&lt;string, int&gt;&gt;</code> is converted to a <code>PCollection&lt;string&gt;</code>, containing one element for each line to be written out.</p>
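<p>As a rough mental model of that decomposition, the following sketch performs the same three steps on an in-memory slice using plain Go. It is not the SDK’s implementation, and the names <code>pairWithOne</code>, <code>groupByKey</code> and <code>sumValues</code> are invented for this example.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// pair is a simple key-value element, standing in for KV<string, int>.
type pair struct {
	key   string
	value int
}

// pairWithOne mimics a ParDo that turns each word into a (word, 1) pair.
func pairWithOne(words []string) []pair {
	pairs := make([]pair, 0, len(words))
	for _, w := range words {
		pairs = append(pairs, pair{key: w, value: 1})
	}
	return pairs
}

// groupByKey collects all the values that share a key.
func groupByKey(pairs []pair) map[string][]int {
	groups := make(map[string][]int)
	for _, p := range pairs {
		groups[p.key] = append(groups[p.key], p.value)
	}
	return groups
}

// sumValues mimics a second ParDo that reduces each group to its total.
func sumValues(groups map[string][]int) map[string]int {
	totals := make(map[string]int, len(groups))
	for key, values := range groups {
		for _, v := range values {
			totals[key] += v
		}
	}
	return totals
}

// count chains the three steps together.
func count(words []string) map[string]int {
	return sumValues(groupByKey(pairWithOne(words)))
}

func main() {
	counts := count([]string{"the", "king", "the", "fool", "the"})

	// Sort the keys so the output order is stable.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %d\n", k, counts[k])
	}
}
```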
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="nx">formatFunc</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="kd">func</span><span class="p">(</span><span class="nx">w</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">c</span><span class="w"> </span><span class="kt">int</span><span class="p">)</span><span class="w"> </span><span class="kt">string</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">return</span><span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nf">Sprintf</span><span class="p">(</span><span class="s">&#34;%s: %v&#34;</span><span class="p">,</span><span class="w"> </span><span class="nx">w</span><span class="p">,</span><span class="w"> </span><span class="nx">c</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nx">formatted</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nf">ParDo</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="nx">formatFunc</span><span class="p">,</span><span class="w"> </span><span class="nx">counted</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>Again a <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam#ParDo"><code>beam.ParDo</code></a> is used, but you’ll notice the <code>formatFunc</code> is slightly different from the <code>splitFunc</code> above. The <code>formatFunc</code> takes two arguments, a string (the key) and an int (the value). These are the pairs in the <code>PCollection&lt;KV&lt;string, int&gt;&gt;</code>. However, the <code>formatFunc</code> does not take an <code>emit func(...)</code>; instead it simply returns a string.</p>
<p>Since the PTransform outputs a single line for each input element, a simpler form of the function can be specified, one where the output element is just returned from the function. The <code>emit func(...)</code> is useful when the number of output elements differs from the number of input elements. If it’s a 1:1 mapping, a return makes the function easier to read. As above, this is all inferred at runtime with reflection when the pipeline is being constructed.</p>
<p>Multiple return arguments can also be used. For example, if the output was expected to be <code>PCollection&lt;KV&lt;float64, bool&gt;&gt;</code>, the function could be declared as <code>func(...) (float64, bool)</code>.</p>
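<p>The difference between the emit form, the single-return form and the two-value return form can be sketched without Beam. The helpers below (<code>applyEmit</code>, <code>applyReturn</code>, <code>applyReturnKV</code>) are hypothetical stand-ins that operate on slices; they only illustrate how each function shape maps input elements to output elements.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// kv stands in for a KV<string, int> element.
type kv struct {
	Key   string
	Value int
}

// applyEmit runs fn on every element; fn may emit zero or more outputs.
func applyEmit(in []string, fn func(string, func(string))) []string {
	var out []string
	emit := func(s string) { out = append(out, s) }
	for _, elem := range in {
		fn(elem, emit)
	}
	return out
}

// applyReturn runs fn on every element; fn returns exactly one output.
func applyReturn(in []string, fn func(string) string) []string {
	out := make([]string, 0, len(in))
	for _, elem := range in {
		out = append(out, fn(elem))
	}
	return out
}

// applyReturnKV runs fn on every element; the two return values of fn
// become the key and the value of one output pair.
func applyReturnKV(in []string, fn func(string) (string, int)) []kv {
	out := make([]kv, 0, len(in))
	for _, elem := range in {
		k, v := fn(elem)
		out = append(out, kv{Key: k, Value: v})
	}
	return out
}

// splitWords emits one element per whitespace-separated word.
func splitWords(line string, emit func(string)) {
	for _, w := range strings.Fields(line) {
		emit(w)
	}
}

// shout is a 1:1 mapping, so a plain return is enough.
func shout(word string) string { return strings.ToUpper(word) }

// wordLength returns a key and a value for each word.
func wordLength(word string) (string, int) { return word, len(word) }

func main() {
	lines := []string{"to be or not", "", "to be"}
	words := applyEmit(lines, splitWords)
	fmt.Println(len(lines), "lines ->", len(words), "words")
	fmt.Println(applyReturn(words, shout))
	fmt.Println(applyReturnKV(words, wordLength))
}
```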
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="nx">textio</span><span class="p">.</span><span class="nf">Write</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;wordcounts.txt&#34;</span><span class="p">,</span><span class="w"> </span><span class="nx">formatted</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>Finally <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam/io/textio#Write"><code>textio.Write</code></a> takes the formatted <code>PCollection&lt;string&gt;</code> and writes it to a file named “wordcounts.txt” with one line per element.</p>
<h2 id="running-the-pipeline">Running the pipeline</h2>
<p>To test the pipeline it can easily be run locally like so:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">go get github.com/apache/beam/sdks/go/examples/wordcount
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> <span class="nv">$GOPATH</span>/src/github.com/apache/beam/sdks/go/examples/wordcount
</span></span><span class="line"><span class="cl">go run wordcount.go --runner<span class="o">=</span>direct
</span></span></code></pre></div><p>For a more realistic run, the pipeline can be executed on <a href="https://cloud.google.com/dataflow/">GCP Dataflow</a>. Before you do so, you need to create a GCP project, create a GCS bucket, enable the Cloud Dataflow APIs, and create a service account. This is documented in the <a href="https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python">Python quickstart guide</a>, under “Before you begin”.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">GOOGLE_APPLICATION_CREDENTIALS</span><span class="o">=</span><span class="nv">$PWD</span>/your-gcp-project.json
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">BUCKET</span><span class="o">=</span>your-gcs-bucket
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">PROJECT</span><span class="o">=</span>your-gcp-project
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> <span class="nv">$GOPATH</span>/src/github.com/apache/beam/sdks/go/examples/wordcount
</span></span><span class="line"><span class="cl">go run wordcount.go <span class="se">\
</span></span></span><span class="line"><span class="cl">    --runner dataflow <span class="se">\
</span></span></span><span class="line"><span class="cl">    --input gs://dataflow-samples/shakespeare/kinglear.txt <span class="se">\
</span></span></span><span class="line"><span class="cl">    --output gs://<span class="si">${</span><span class="nv">BUCKET</span><span class="p">?</span><span class="si">}</span>/counts <span class="se">\
</span></span></span><span class="line"><span class="cl">    --project <span class="si">${</span><span class="nv">PROJECT</span><span class="p">?</span><span class="si">}</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">    --temp_location gs://<span class="si">${</span><span class="nv">BUCKET</span><span class="p">?</span><span class="si">}</span>/tmp/ <span class="se">\
</span></span></span><span class="line"><span class="cl">    --staging_location gs://<span class="si">${</span><span class="nv">BUCKET</span><span class="p">?</span><span class="si">}</span>/binaries/ <span class="se">\
</span></span></span><span class="line"><span class="cl">    --worker_harness_container_image<span class="o">=</span>apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515
</span></span></code></pre></div><p>If this works correctly you’ll see something similar to the following printed:</p>
<pre tabindex="0"><code>Cross-compiling .../wordcount.go as .../worker-1-1544590905654809000
Staging worker binary:  .../worker-1-1544590905654809000
Submitted job: 2018-12-11_21_02_29
Console: https://console.cloud.google.com/dataflow/job/2018-12-11...
Logs: https://console.cloud.google.com/logs/viewer?job_id%2F2018-12-11...
Job state: JOB_STATE_PENDING …
Job still running …
Job still running …
...
Job succeeded!
</code></pre><p>Let&rsquo;s take a moment to explain what’s going on, starting with the various flags. The <code>--runner dataflow</code> flag tells the Apache Beam SDK to run this on GCP Dataflow, including executing all the steps required to make that happen. This includes compiling the code and uploading it to the <code>--staging_location</code>. Later the staged binary will be run by Dataflow under the <code>--project</code> project. As this will be running “in the cloud”, the pipeline will not be able to access local files. Thus both the <code>--input</code> and <code>--output</code> flags are set to paths on GCS, as this is a convenient place to store files. Finally the <code>--worker_harness_container_image</code> flag specifies the docker image that Dataflow will use to host the wordcount.go binary that was uploaded to the <code>--staging_location</code>.</p>
<p>Once wordcount.go is running, it prints out helpful information, such as links to the Dataflow console. The console displays current progress as well as a visualization of the pipeline as a directed graph. The local wordcount.go continues to run only to display status updates. It can be interrupted at any time, but the pipeline will continue to run on Dataflow until it either succeeds or fails. Once that occurs, the logs link can provide useful information.</p>
<h1 id="art-history-more-complex-example">Art history (more complex example)</h1>
<div style="float: right; width: 300px">
	<img src="palette.png" width=300 height=411>
</div>
<p>Now we’ll construct a more complex pipeline that demonstrates some other features of Beam and Dataflow. In this pipeline we will take 100,000 paintings from the last 600 years and process them to extract information about their color palettes. Specifically, the question we aim to answer is, “Have the color palettes of paintings changed over the decades?”. This may not be a pipeline we run repeatedly, but it is a fun example and demonstrates many advanced topics.</p>
<p>We will skip over the details of the color extraction algorithm, and provide that in a later article. Here we’ll focus on how to create a pipeline to accomplish this task.</p>
<p>We start by reading a CSV file that contains metadata for each painting, such as the artist, the year it was painted, and a GCS path to a JPEG of the painting. The paintings will then be grouped by the decade they were painted, and the color palette for each group will be determined. Each palette will be saved to a PNG file (DrawColorPalette), and all the palettes will be saved to a single large JSON file (WriteIndex). To finish it off, the pipeline will be productionised, so it is easier to debug and re-run. The full source code is <a href="https://github.com/bramp/dataflow-art">available here</a>.</p>
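<p>The grouping step is easy to picture outside of Beam. The sketch below is only an assumption about the shape of the data: the <code>Painting</code> struct here is a trimmed-down, hypothetical record, the artists and <code>gs://example-bucket</code> paths are placeholders, and <code>decadeOf</code> and <code>groupByDecade</code> are illustrative names that work on an in-memory slice rather than a PCollection.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// Painting is a hypothetical, trimmed-down metadata record.
type Painting struct {
	Artist string
	Year   int
	Path   string // placeholder GCS path to the image
}

// decadeOf returns a label such as "1880s" for a given year.
func decadeOf(year int) string {
	return fmt.Sprintf("%ds", year/10*10)
}

// groupByDecade buckets paintings by the decade in which they were painted.
func groupByDecade(paintings []Painting) map[string][]Painting {
	groups := make(map[string][]Painting)
	for _, p := range paintings {
		key := decadeOf(p.Year)
		groups[key] = append(groups[key], p)
	}
	return groups
}

func main() {
	paintings := []Painting{
		{Artist: "Artist A", Year: 1872, Path: "gs://example-bucket/a.jpg"},
		{Artist: "Artist B", Year: 1888, Path: "gs://example-bucket/b.jpg"},
		{Artist: "Artist C", Year: 1889, Path: "gs://example-bucket/c.jpg"},
	}
	groups := groupByDecade(paintings)

	// Sort the decade labels so the output order is stable.
	decades := make([]string, 0, len(groups))
	for d := range groups {
		decades = append(decades, d)
	}
	sort.Strings(decades)
	for _, d := range decades {
		fmt.Printf("%s: %d painting(s)\n", d, len(groups[d]))
	}
}
```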
<p>To start with, the main function for the pipeline looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kn">import</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="o">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="s">&#34;github.com/apache/beam/sdks/go/pkg/beam&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="o">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// If beamx or Go flags are used, flags must be parsed first.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">flag</span><span class="p">.</span><span class="nf">Parse</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// beam.Init() is an initialization hook that must be called on startup. On</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// distributed runners, it is used to intercept control.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">beam</span><span class="p">.</span><span class="nf">Init</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">p</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nf">NewPipeline</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">s</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">p</span><span class="p">.</span><span class="nf">Root</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nf">buildPipeline</span><span class="p">(</span><span class="nx">s</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">ctx</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">context</span><span class="p">.</span><span class="nf">Background</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">beamx</span><span class="p">.</span><span class="nf">Run</span><span class="p">(</span><span class="nx">ctx</span><span class="p">,</span><span class="w"> </span><span class="nx">p</span><span class="p">);</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="nx">log</span><span class="p">.</span><span class="nf">Fatalf</span><span class="p">(</span><span class="nx">ctx</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;Failed to execute job: %v&#34;</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>That is the standard boilerplate for a Beam pipeline: it parses the flags, initialises Beam, delegates the pipeline construction to the <code>buildPipeline</code> function, and finally runs the pipeline.</p>
<p>The interesting code begins in the <code>buildPipeline</code> function, which constructs the pipeline by passing PCollections from one function to the next, building up the tree we see in the above diagram.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="nf">buildPipeline</span><span class="p">(</span><span class="nx">s</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nx">Scope</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// nothing -&gt; PCollection&lt;Painting&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">paintings</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">csvio</span><span class="p">.</span><span class="nf">Read</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="nx">index</span><span class="p">,</span><span class="w"> </span><span class="nx">reflect</span><span class="p">.</span><span class="nf">TypeOf</span><span class="p">(</span><span class="nx">Painting</span><span class="p">{}))</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// PCollection&lt;Painting&gt; -&gt; PCollection&lt;CoGBK&lt;string, Painting&gt;&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">paintingsByGroup</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nf">GroupByDecade</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="nx">paintings</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// PCollection&lt;CoGBK&lt;string, Painting&gt;&gt; -&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">//   (PCollection&lt;KV&lt;string, Histogram&gt;&gt;, PCollection&lt;KV&lt;string, string&gt;&gt;)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">histograms</span><span class="p">,</span><span class="w"> </span><span class="nx">errors1</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nf">ExtractHistogram</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="nx">paintingsByGroup</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// Calculate the color palette for the combined histograms.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// PCollection&lt;KV&lt;string, Histogram&gt;&gt; -&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">//   (PCollection&lt;KV&lt;string, []color.RGBA&gt;&gt;, PCollection&lt;KV&lt;string, string&gt;&gt;)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">palettes</span><span class="p">,</span><span class="w"> </span><span class="nx">errors2</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nf">CalculateColorPalette</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="nx">histograms</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// PCollection&lt;KV&lt;string, []color.RGBA&gt;&gt; -&gt; PCollection&lt;KV&lt;string, string&gt;&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">errors3</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nf">DrawColorPalette</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="nx">outputPrefix</span><span class="p">,</span><span class="w"> </span><span class="nx">palettes</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// PCollection&lt;KV&lt;string, []color.RGBA&gt;&gt; -&gt; nothing</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nf">WriteIndex</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="nx">morebeam</span><span class="p">.</span><span class="nf">Join</span><span class="p">(</span><span class="o">*</span><span class="nx">outputPrefix</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;index.json&#34;</span><span class="p">),</span><span class="w"> </span><span class="nx">palettes</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// PCollection&lt;KV&lt;string, string&gt;&gt; -&gt; nothing</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nf">WriteErrorLog</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;errors.log&#34;</span><span class="p">,</span><span class="w"> </span><span class="nx">errors1</span><span class="p">,</span><span class="w"> </span><span class="nx">errors2</span><span class="p">,</span><span class="w"> </span><span class="nx">errors3</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>To make it easy to follow, each function describes the step, and is annotated with a comment that explains what kind of PCollection is accepted and returned. Let&rsquo;s highlight some interesting steps.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kd">var</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">index</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">flag</span><span class="p">.</span><span class="nf">String</span><span class="p">(</span><span class="s">&#34;index&#34;</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;art.csv&#34;</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;Index of the art.&#34;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">// Painting represents a single painting in the dataset.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">type</span><span class="w"> </span><span class="nx">Painting</span><span class="w"> </span><span class="kd">struct</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">Artist</span><span class="w"> </span><span class="kt">string</span><span class="w"> </span><span class="s">`csv:&#34;artist&#34;`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">Title</span><span class="w">  </span><span class="kt">string</span><span class="w"> </span><span class="s">`csv:&#34;title&#34;`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">Date</span><span class="w">   </span><span class="kt">string</span><span class="w"> </span><span class="s">`csv:&#34;date&#34;`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">Genre</span><span class="w">  </span><span class="kt">string</span><span class="w"> </span><span class="s">`csv:&#34;genre&#34;`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">Style</span><span class="w">  </span><span class="kt">string</span><span class="w"> </span><span class="s">`csv:&#34;style&#34;`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">Filename</span><span class="w"> </span><span class="kt">string</span><span class="w"> </span><span class="s">`csv:&#34;new_filename&#34;`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="o">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="o">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="nf">buildPipeline</span><span class="p">(</span><span class="nx">s</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nx">Scope</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// nothing -&gt; PCollection&lt;Painting&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">paintings</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">csvio</span><span class="p">.</span><span class="nf">Read</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="nx">index</span><span class="p">,</span><span class="w"> </span><span class="nx">reflect</span><span class="p">.</span><span class="nf">TypeOf</span><span class="p">(</span><span class="nx">Painting</span><span class="p">{}))</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="o">...</span><span class="w">
</span></span></span></code></pre></div><p>The very first step uses <a href="https://godoc.org/github.com/bramp/morebeam/csvio#Read"><code>csvio.Read</code></a> to read the CSV file specified by the <code>--index</code> flag, and returns a PCollection of Painting structs. In all the examples we’ve seen before, the PCollections only contain basic types, e.g. strings, ints, etc. More complex types, such as slices and structs, are allowed (but not maps and interfaces). This makes it easier to pass rich information between the PTransforms. The only caveat is that the type must be JSON-serialisable. This is because in a distributed pipeline, the PTransforms could be processed on different machines, and the PCollection needs to be marshalled to be passed between them.</p>
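<p>To make the JSON-serialisable requirement concrete, here is a minimal sketch that round-trips a struct through <code>encoding/json</code>. It uses only the standard library, not Beam itself, so treat it as an approximation of what happens when an element moves between workers. The trimmed-down <code>Painting</code> struct, its <code>json</code> tags, and the <code>roundTrip</code> helper are assumptions made for this example.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Painting mirrors a few fields of the article's struct. The json tags are
// an assumption added for this sketch.
type Painting struct {
	Artist string `json:"artist"`
	Title  string `json:"title"`
	Date   string `json:"date"`
}

// roundTrip marshals a Painting to JSON and back again, which is roughly
// what must succeed for a value to travel between machines.
func roundTrip(p Painting) (Painting, error) {
	data, err := json.Marshal(p)
	if err != nil {
		return Painting{}, err
	}
	var out Painting
	if err := json.Unmarshal(data, &out); err != nil {
		return Painting{}, err
	}
	return out, nil
}

func main() {
	in := Painting{Artist: "Example Artist", Title: "Example Title", Date: "1906"}
	out, err := roundTrip(in)
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	// All fields are exported strings, so the copy equals the original.
	fmt.Println(out == in) // prints "true"
}
```

<p>A round trip like this in a unit test is one cheap way to check that a type survives marshalling before the pipeline is run on a remote runner.</p>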
<p>For Beam to successfully unmarshal your data, the types must also be registered. This is typically done within the init() function, by calling <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam#RegisterType"><code>beam.RegisterType</code></a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="nf">init</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">beam</span><span class="p">.</span><span class="nf">RegisterType</span><span class="p">(</span><span class="nx">reflect</span><span class="p">.</span><span class="nf">TypeOf</span><span class="p">(</span><span class="nx">Painting</span><span class="p">{}))</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>If you forget to register the type, an error will occur at runtime, for example:</p>
<pre tabindex="0"><code>java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error received from SDK harness for instruction -224: execute failed: panic: reflect: Call using main.Painting as type struct { Artist string; Title string; ... } goroutine 70 [running]:
</code></pre><p>This can be a little frustrating, as when running the pipeline locally with the <code>direct</code> runner, it does not marshal your data, so errors like this aren’t exposed until running on Dataflow.</p>
<p>Now we have a collection of Paintings, we group them by decade:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="c1">// GroupByDecade takes a PCollection&lt;Painting&gt; and returns a </span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">// PCollection&lt;CoGBK&lt;string, Painting&gt;&gt; of the paintings grouped by decade.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="nf">GroupByDecade</span><span class="p">(</span><span class="nx">s</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nx">Scope</span><span class="p">,</span><span class="w"> </span><span class="nx">paintings</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nx">PCollection</span><span class="p">)</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nx">PCollection</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">s</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">s</span><span class="p">.</span><span class="nf">Scope</span><span class="p">(</span><span class="s">&#34;GroupBy Decade&#34;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// PCollection&lt;Painting&gt; -&gt; PCollection&lt;KV&lt;string, Painting&gt;&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">paintingsWithKey</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">morebeam</span><span class="p">.</span><span class="nf">AddKey</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="kd">func</span><span class="p">(</span><span class="nx">art</span><span class="w"> </span><span class="nx">Painting</span><span class="p">)</span><span class="w"> </span><span class="kt">string</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="k">return</span><span class="w"> </span><span class="nx">art</span><span class="p">.</span><span class="nf">Decade</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="p">},</span><span class="w"> </span><span class="nx">paintings</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="c1">// PCollection&lt;KV&lt;string, Painting&gt;&gt; -&gt; PCollection&lt;CoGBK&lt;string, Painting&gt;&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">return</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nf">GroupByKey</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="nx">paintingsWithKey</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>The first line in this function, <code>s.Scope(&quot;GroupBy Decade&quot;)</code>, allows us to name this step and group multiple sub-steps. For example, in the above diagram “GroupBy Decade” is a single step, which can be expanded to show an <a href="https://godoc.org/github.com/bramp/morebeam#AddKey"><code>AddKey</code></a> and a <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam#GroupByKey"><code>GroupByKey</code></a> step.</p>
<p><code>GroupByDecade</code> returns a <code>PCollection&lt;CoGBK&lt;string, Painting&gt;&gt;</code>. CoGBK is short for <strong>Co</strong><strong>G</strong>roup<strong>B</strong>y<strong>K</strong>ey. It is a special collection, where (as you’ll see later) each element is a tuple of a key and an iterable collection of values. The key in this case is the decade in which the painting was painted. The <code>PCollection&lt;Painting&gt;</code> is transformed into a <code>PCollection&lt;KV&lt;string, Painting&gt;&gt;</code> by the <a href="https://godoc.org/github.com/bramp/morebeam#AddKey"><code>morebeam.AddKey</code></a> step, adding a key to each value. Then the <code>GroupByKey</code> will use that key to produce the final PCollection.</p>
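<p>As a rough mental model, the two sub-steps behave like the plain Go below, which keys each painting and then collects the values that share a key. This is a single-machine sketch, not how Beam implements grouping. The <code>Decade</code> method here is a hypothetical stand-in for the article's method of the same name, and <code>groupByDecade</code> is made up for this example.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// Painting is a trimmed-down version of the article's struct.
type Painting struct {
	Title string
	Date  string // year as text, e.g. "1889"
}

// Decade is a hypothetical stand-in for the article's Painting.Decade
// method: it maps "1889" to "1880s".
func (p Painting) Decade() string {
	if len(p.Date) < 4 {
		return "unknown"
	}
	return p.Date[:3] + "0s"
}

// groupByDecade mimics AddKey followed by GroupByKey in memory.
func groupByDecade(paintings []Painting) map[string][]Painting {
	groups := make(map[string][]Painting)
	for _, p := range paintings {
		key := p.Decade()                    // the AddKey step
		groups[key] = append(groups[key], p) // the GroupByKey step
	}
	return groups
}

func main() {
	groups := groupByDecade([]Painting{
		{Title: "A", Date: "1889"},
		{Title: "B", Date: "1885"},
		{Title: "C", Date: "1901"},
	})

	// Sort the keys so the output order is stable.
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %d painting(s)\n", k, len(groups[k]))
	}
}
```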
<p>Next up is the <code>ExtractHistogram</code>, which takes the <code>PCollection&lt;CoGBK&lt;string, Painting&gt;&gt;</code>, and returns two PCollections. The first PCollection is a <code>PCollection&lt;KV&lt;string, Histogram&gt;&gt;</code>, which contains a <a href="https://en.wikipedia.org/wiki/Color_histogram">color histogram</a> for every decade of paintings. The second PCollection is related to error handling, and will be explained later.</p>
<p>The ExtractHistogram function demonstrates three new concepts, “Stateful functions”, “Data enrichment”, and “Error handling”.</p>
<h2 id="stateful-functions">Stateful functions</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kd">var</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">artPrefix</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">flag</span><span class="p">.</span><span class="nf">String</span><span class="p">(</span><span class="s">&#34;art&#34;</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;gs://mybucket/art&#34;</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;Path to where the art is kept.&#34;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="nf">init</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">beam</span><span class="p">.</span><span class="nf">RegisterType</span><span class="p">(</span><span class="nx">reflect</span><span class="p">.</span><span class="nf">TypeOf</span><span class="p">((</span><span class="o">*</span><span class="nx">extractHistogramFn</span><span class="p">)(</span><span class="kc">nil</span><span class="p">)).</span><span class="nf">Elem</span><span class="p">())</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">type</span><span class="w"> </span><span class="nx">extractHistogramFn</span><span class="w"> </span><span class="kd">struct</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">ArtPrefix</span><span class="w"> </span><span class="kt">string</span><span class="w"> </span><span class="s">`json:&#34;art_prefix&#34;`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">fs</span><span class="w"> </span><span class="nx">filesystem</span><span class="p">.</span><span class="nx">Interface</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">// ExtractHistogram calculates the color histograms for all the Paintings in</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">// the CoGBK.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="nf">ExtractHistogram</span><span class="p">(</span><span class="nx">s</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nx">Scope</span><span class="p">,</span><span class="w"> </span><span class="nx">files</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nx">PCollection</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="nx">beam</span><span class="p">.</span><span class="nx">PCollection</span><span class="p">,</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nx">PCollection</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">s</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">s</span><span class="p">.</span><span class="nf">Scope</span><span class="p">(</span><span class="s">&#34;ExtractHistogram&#34;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">return</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nf">ParDo2</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="nx">extractHistogramFn</span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="nx">ArtPrefix</span><span class="p">:</span><span class="w"> </span><span class="o">*</span><span class="nx">artPrefix</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="p">},</span><span class="w"> </span><span class="nx">files</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>Instead of passing a simple function to <code>beam.ParDo</code>, a struct containing two fields is passed. The exported field, <code>ArtPrefix</code>, is the path to where the painting jpgs are stored, and the unexported field, <code>fs</code>, is a filesystem client for reading these jpgs.</p>
<p>When the pipeline runs on remote workers, global variables set by the launching program are not available, and that includes the command line flag variables. For example, when running this pipeline we may start it like so:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">go run main.go <span class="se">\
</span></span></span><span class="line"><span class="cl">  --art gs://<span class="si">${</span><span class="nv">BUCKET</span><span class="p">?</span><span class="si">}</span>/art/ <span class="se">\
</span></span></span><span class="line"><span class="cl">  --runner dataflow <span class="se">\
</span></span></span><span class="line"><span class="cl">  ...
</span></span></code></pre></div><p>When the code actually runs on the Dataflow workers, the <code>--art</code> flag is not specified. Thus <code>*artPrefix</code> will hold the default value. To pass the real value to the Dataflow workers, it must be part of the DoFn struct that is passed to <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam#ParDo"><code>beam.ParDo</code></a>. So in this example, we create an <code>extractHistogramFn</code> struct, with the exported <code>ArtPrefix</code> field set to the value of the <code>--art</code> flag. This <code>extractHistogramFn</code> is then marshalled and passed to the workers. As with the unmarshalled PCollection values, the <code>extractHistogramFn</code> must also be registered with Beam during <code>init</code>.</p>
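<p>The difference between the exported and unexported fields can be seen with a small standard-library sketch. It assumes the struct travels as JSON, as the <code>json</code> tag on <code>ArtPrefix</code> suggests; Beam's actual wire format may differ. The <code>histogramFn</code> type and the <code>shipToWorker</code> helper are hypothetical names used only here.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// histogramFn is a hypothetical stand-in for the article's
// extractHistogramFn, with a string in place of the filesystem client.
type histogramFn struct {
	ArtPrefix string `json:"art_prefix"` // exported: included in the JSON
	client    string // unexported: ignored by encoding/json
}

// shipToWorker simulates sending the struct to another machine by
// marshalling it to JSON and unmarshalling it into a fresh value.
func shipToWorker(fn histogramFn) (histogramFn, error) {
	data, err := json.Marshal(fn)
	if err != nil {
		return histogramFn{}, err
	}
	var remote histogramFn
	if err := json.Unmarshal(data, &remote); err != nil {
		return histogramFn{}, err
	}
	return remote, nil
}

func main() {
	local := histogramFn{ArtPrefix: "gs://mybucket/art", client: "local-only"}
	remote, err := shipToWorker(local)
	if err != nil {
		fmt.Println("marshalling failed:", err)
		return
	}
	// The exported field arrives intact; the unexported one is empty and
	// has to be rebuilt on the worker.
	fmt.Printf("ArtPrefix=%q client=%q\n", remote.ArtPrefix, remote.client)
}
```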
<p>When the pipeline executes this step it calls the <code>extractHistogramFn</code>’s <code>ProcessElement</code> method. This method works in a similar way to simple DoFn functions. The arguments and return value are reflected at runtime and mapped to the PCollections being processed and returned.</p>
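<p>To give a feel for what reflecting on arguments and return values involves, the sketch below uses Go's <code>reflect</code> package to list the parameter and result types of an arbitrary function. It is not Beam's implementation, only the kind of inspection such a framework can perform. The <code>describe</code> helper and the <code>wordLen</code> function are made up for this example.</p>

```go
package main

import (
	"fmt"
	"reflect"
)

// describe returns the parameter and result type names of a function value.
func describe(fn interface{}) (params, results []string, err error) {
	t := reflect.TypeOf(fn)
	if t == nil || t.Kind() != reflect.Func {
		return nil, nil, fmt.Errorf("describe: %v is not a function", fn)
	}
	for i := 0; i < t.NumIn(); i++ {
		params = append(params, t.In(i).String())
	}
	for i := 0; i < t.NumOut(); i++ {
		results = append(results, t.Out(i).String())
	}
	return params, results, nil
}

func main() {
	// A hypothetical DoFn-style function: one string element in, one int out.
	wordLen := func(word string) int { return len(word) }

	params, results, err := describe(wordLen)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(params, results) // prints "[string] [int]"
}
```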
<h2 id="iterating-over-a-cogbk">Iterating over a CoGBK</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="p">(</span><span class="nx">fn</span><span class="w"> </span><span class="o">*</span><span class="nx">extractHistogramFn</span><span class="p">)</span><span class="w"> </span><span class="nf">ProcessElement</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="nx">ctx</span><span class="w"> </span><span class="nx">context</span><span class="p">.</span><span class="nx">Context</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="nx">key</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">values</span><span class="w"> </span><span class="kd">func</span><span class="p">(</span><span class="o">*</span><span class="nx">Painting</span><span class="p">)</span><span class="w"> </span><span class="kt">bool</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="nx">errors</span><span class="w"> </span><span class="kd">func</span><span class="p">(</span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="kt">string</span><span class="p">))</span><span class="w"> </span><span class="nx">HistogramResult</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">log</span><span class="p">.</span><span class="nf">Infof</span><span class="p">(</span><span class="nx">ctx</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;%q: ExtractHistogram started&#34;</span><span class="p">,</span><span class="w"> </span><span class="nx">key</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="kd">var</span><span class="w"> </span><span class="nx">art</span><span class="w"> </span><span class="nx">Painting</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">for</span><span class="w"> </span><span class="nf">values</span><span class="p">(</span><span class="o">&amp;</span><span class="nx">art</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="nx">filename</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">morebeam</span><span class="p">.</span><span class="nf">Join</span><span class="p">(</span><span class="nx">fn</span><span class="p">.</span><span class="nx">ArtPrefix</span><span class="p">,</span><span class="w"> </span><span class="nx">art</span><span class="p">.</span><span class="nx">Filename</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="nx">h</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">fn</span><span class="p">.</span><span class="nf">extractHistogram</span><span class="p">(</span><span class="nx">ctx</span><span class="p">,</span><span class="w"> </span><span class="nx">key</span><span class="p">,</span><span class="w"> </span><span class="nx">filename</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">			</span><span class="err">…</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="nx">result</span><span class="p">.</span><span class="nx">Histogram</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">result</span><span class="p">.</span><span class="nx">Histogram</span><span class="p">.</span><span class="nf">Combine</span><span class="p">(</span><span class="nx">h</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">return</span><span class="w"> </span><span class="nx">result</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p><code>ProcessElement</code> is called once for every unique group in the <code>PCollection&lt;CoGBK&lt;string, Painting&gt;&gt;</code>. The <code>key string</code> argument will be the key for that group, and <code>values func(*Painting) bool</code> is used to iterate over all values within the group. The contract is that <code>values</code> is passed a pointer to a <code>Painting</code> struct, which is populated on each iteration. As long as there are more paintings to process in the group, the <code>values</code> function returns true. Once it returns false, the group has been fully processed. This iterator pattern is unique to the <code>CoGBK</code> and makes it convenient to apply an operation to every element in the group.</p>
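<p>The iterator contract can be tried out without Beam. The sketch below builds a <code>func(*Painting) bool</code> over a slice and consumes it with the same <code>for values(&amp;art)</code> loop. The <code>newIterator</code> and <code>collectTitles</code> helpers are hypothetical; in a real pipeline the runner supplies the iterator.</p>

```go
package main

import "fmt"

// Painting is a trimmed-down version of the article's struct.
type Painting struct {
	Title string
}

// newIterator returns a function that follows the same contract as the
// values argument: it fills in the pointer and returns true while elements
// remain, then returns false.
func newIterator(items []Painting) func(*Painting) bool {
	next := 0
	return func(out *Painting) bool {
		if next >= len(items) {
			return false
		}
		*out = items[next]
		next++
		return true
	}
}

// collectTitles drains an iterator using the loop shape from ProcessElement.
func collectTitles(values func(*Painting) bool) []string {
	var titles []string
	var art Painting
	for values(&art) {
		titles = append(titles, art.Title)
	}
	return titles
}

func main() {
	values := newIterator([]Painting{{Title: "A"}, {Title: "B"}})
	fmt.Println(collectTitles(values)) // prints "[A B]"
}
```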
<p>In this case, <code>extractHistogram</code> is called for each Painting; it fetches a jpg of the artwork and extracts a <a href="https://en.wikipedia.org/wiki/Color_histogram">histogram of colors</a>. The histograms from all paintings in that group are combined, and finally one result per group is returned.</p>
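<p>The article does not show the <code>Histogram</code> type, so the following is only one possible shape for it: a map from colour to pixel count, with a <code>Combine</code> method that sums the counts. Both the representation and the method body are assumptions for illustration; the real type may differ.</p>

```go
package main

import (
	"fmt"
	"image/color"
)

// Histogram is a hypothetical representation: pixel counts per colour.
type Histogram map[color.RGBA]int

// Two sample colours used by the example below.
var (
	red  = color.RGBA{R: 255, A: 255}
	blue = color.RGBA{B: 255, A: 255}
)

// Combine returns a new Histogram holding the summed counts of h and other.
// Neither input is modified, and a nil receiver is treated as empty.
func (h Histogram) Combine(other Histogram) Histogram {
	out := make(Histogram, len(h)+len(other))
	for c, n := range h {
		out[c] += n
	}
	for c, n := range other {
		out[c] += n
	}
	return out
}

func main() {
	var total Histogram // starts out nil, like a zero-value result
	total = total.Combine(Histogram{red: 3})
	total = total.Combine(Histogram{red: 1, blue: 2})
	fmt.Println(total[red], total[blue]) // prints "4 2"
}
```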
<h2 id="data-enrichment">Data enrichment</h2>
<p>Reading the paintings from an external service (such as <a href="https://cloud.google.com/storage/">GCS</a>) demonstrates a data enrichment step. This is where an external service is used to “enrich” the dataset the pipeline is processing. You could imagine a user service being called when processing log entries, or a product service when processing purchases. It should be noted that any external action should be <a href="https://en.wikipedia.org/wiki/Idempotence">idempotent</a>. If a worker fails, it is possible the same element is retried, and thus processed multiple times. Dataflow keeps track of failures and ensures the final result only has each element processed once.</p>
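<p>A small sketch shows why idempotence matters when an element is retried. The in-memory <code>store</code> below is a hypothetical stand-in for an external service: writing by key gives the same state however many times it is repeated, while appending does not.</p>

```go
package main

import "fmt"

// store is a hypothetical in-memory stand-in for an external service.
type store struct {
	byKey map[string]string
	log   []string
}

// newStore returns an empty store.
func newStore() *store {
	return &store{byKey: make(map[string]string)}
}

// Put is idempotent: repeating it with the same arguments leaves the store
// in the same state.
func (s *store) Put(key, value string) {
	s.byKey[key] = value
}

// Append is not idempotent: every call adds another entry.
func (s *store) Append(value string) {
	s.log = append(s.log, value)
}

func main() {
	s := newStore()

	// Simulate the same element being processed twice after a retry.
	for attempt := 0; attempt < 2; attempt++ {
		s.Put("1880s", "palette.png")
		s.Append("1880s")
	}

	// The keyed write is stored once; the append is duplicated.
	fmt.Println(len(s.byKey), len(s.log)) // prints "1 2"
}
```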
<p>When calling a remote service, typically some kind of client is needed to make the request. In this pipeline we read the images from GCS, thus setting up a GCS client at startup is useful. Since we are using a struct-based DoFn, there are some additional methods that can be defined.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="p">(</span><span class="nx">fn</span><span class="w"> </span><span class="o">*</span><span class="nx">extractHistogramFn</span><span class="p">)</span><span class="w"> </span><span class="nf">Setup</span><span class="p">(</span><span class="nx">ctx</span><span class="w"> </span><span class="nx">context</span><span class="p">.</span><span class="nx">Context</span><span class="p">)</span><span class="w"> </span><span class="kt">error</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="kd">var</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="kt">error</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">fn</span><span class="p">.</span><span class="nx">fs</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">filesystem</span><span class="p">.</span><span class="nf">New</span><span class="p">(</span><span class="nx">ctx</span><span class="p">,</span><span class="w"> </span><span class="nx">fn</span><span class="p">.</span><span class="nx">ArtPrefix</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="k">return</span><span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nf">Errorf</span><span class="p">(</span><span class="s">&#34;filesystem.New(%q) failed: %s&#34;</span><span class="p">,</span><span class="w"> </span><span class="nx">fn</span><span class="p">.</span><span class="nx">ArtPrefix</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">return</span><span class="w"> </span><span class="kc">nil</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="p">(</span><span class="nx">fn</span><span class="w"> </span><span class="o">*</span><span class="nx">extractHistogramFn</span><span class="p">)</span><span class="w"> </span><span class="nf">Teardown</span><span class="p">()</span><span class="w"> </span><span class="kt">error</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">return</span><span class="w"> </span><span class="nx">fn</span><span class="p">.</span><span class="nx">fs</span><span class="p">.</span><span class="nf">Close</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>When the DoFn is initialized on the worker, the <code>Setup</code> method is called. Here a new <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam/io/filesystem">Filesystem client</a> is created and stored in the struct’s <code>fs</code> field. Later, when the DoFn is no longer needed, the <code>Teardown</code> method is called, giving us an opportunity to clean up the client. As with all things distributed, don’t rely on <code>Teardown</code> always being called.</p>
<p>There are also some simple best practices around error handling that should be followed when calling external services.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="p">(</span><span class="nx">fn</span><span class="w"> </span><span class="o">*</span><span class="nx">extractHistogramFn</span><span class="p">)</span><span class="w"> </span><span class="nf">extractHistogram</span><span class="p">(</span><span class="nx">ctx</span><span class="w"> </span><span class="nx">context</span><span class="p">.</span><span class="nx">Context</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nx">key</span><span class="p">,</span><span class="w"> </span><span class="nx">filename</span><span class="w"> </span><span class="kt">string</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="nx">palette</span><span class="p">.</span><span class="nx">Histogram</span><span class="p">,</span><span class="w"> </span><span class="kt">error</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">ctx</span><span class="p">,</span><span class="w"> </span><span class="nx">cancel</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">context</span><span class="p">.</span><span class="nf">WithTimeout</span><span class="p">(</span><span class="nx">ctx</span><span class="p">,</span><span class="w"> </span><span class="mi">30</span><span class="o">*</span><span class="nx">time</span><span class="p">.</span><span class="nx">Second</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">defer</span><span class="w"> </span><span class="nf">cancel</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">fd</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">fn</span><span class="p">.</span><span class="nx">fs</span><span class="p">.</span><span class="nf">OpenRead</span><span class="p">(</span><span class="nx">ctx</span><span class="p">,</span><span class="w"> </span><span class="nx">filename</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="k">return</span><span class="w"> </span><span class="kc">nil</span><span class="p">,</span><span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nf">Errorf</span><span class="p">(</span><span class="s">&#34;fs.OpenRead(%q) failed: %s&#34;</span><span class="p">,</span><span class="w"> </span><span class="nx">filename</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">defer</span><span class="w"> </span><span class="nx">fd</span><span class="p">.</span><span class="nf">Close</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">img</span><span class="p">,</span><span class="w"> </span><span class="nx">_</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">image</span><span class="p">.</span><span class="nf">Decode</span><span class="p">(</span><span class="nx">fd</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="k">return</span><span class="w"> </span><span class="kc">nil</span><span class="p">,</span><span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nf">Errorf</span><span class="p">(</span><span class="s">&#34;image.Decode(%q) failed: %s&#34;</span><span class="p">,</span><span class="w"> </span><span class="nx">filename</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">return</span><span class="w"> </span><span class="nx">palette</span><span class="p">.</span><span class="nf">NewColorHistogram</span><span class="p">(</span><span class="nx">img</span><span class="p">),</span><span class="w"> </span><span class="kc">nil</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>The function begins by using a <a href="https://golang.org/pkg/context/#WithTimeout"><code>context.WithTimeout</code></a>. This ensures that if the external service does not respond in a timely manner, the context will be cancelled and an error returned. If this timeout wasn’t set, the external call might never return, and the pipeline would never terminate.</p>
<p>Since the pipeline could be running across 100s of machines, it could generate significant load on a remote service. It is wise to implement appropriate <a href="https://cloud.google.com/storage/docs/exponential-backoff">backoff and retry logic</a>. In some cases it is even worth <a href="https://cloud.google.com/service-infrastructure/docs/rate-limiting">rate limiting</a> your pipeline’s execution, or tagging your pipeline’s traffic at a <a href="https://www.usenix.org/conference/srecon17asia/program/presentation/sheerin">lower QoS</a> so it can be easily shed.</p>
<p>The external service may also return permanent errors, which no amount of retrying will fix. Thus a more robust error handling pattern is needed.</p>
<h2 id="error-handling-and-dead-letters">Error handling and dead letters</h2>
<p>When Beam processes a PCollection, it bundles up multiple elements and processes one bundle at a time. If the PTransform returns an error, panics, or otherwise fails (such as running out of memory), the full bundle is retried. With Dataflow, bundles are <a href="https://cloud.google.com/dataflow/docs/resources/faq#how-are-java-exceptions-handled-in-cloud-dataflow">retried up to four times</a>, after which the entire pipeline is aborted. This can be inconvenient, so where appropriate, instead of returning an error we use a <a href="https://en.wikipedia.org/wiki/Dead_letter_queue">dead letter queue</a>. This is a new PCollection that collects processing errors. These errors can then be persisted at the end of the pipeline, manually inspected, and processed again later.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="k">return</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nf">ParDo2</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="nx">extractHistogramFn</span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">ArtPrefix</span><span class="p">:</span><span class="w"> </span><span class="o">*</span><span class="nx">artPrefix</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">},</span><span class="w"> </span><span class="nx">files</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>A keen observer would have noticed that <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam#ParDo2"><code>beam.ParDo2</code></a> was used by ExtractHistogram, instead of <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam#ParDo"><code>beam.ParDo</code></a>. This function works the same, but returns two PCollections. In our case, the first is the normal output, and the second is a <code>PCollection&lt;KV&lt;string, string&gt;&gt;</code>. This second collection is keyed on the unique identifier of the painting having an issue, and the value is the error message.</p>
<p>Since emitting an error is optional, the errors PCollection is passed to <code>extractHistogramFn</code>’s <code>ProcessElement</code> as an emitter function, <code>errors func(string, string)</code>, which is called only when something goes wrong.</p>
<p>Every stage of the pipeline produces this kind of error PCollection, and at the end of the pipeline they are collected together and output to a single error log file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="c1">// WriteErrorLog takes multiple PCollection&lt;KV&lt;string,string&gt;&gt;s combines them</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="c1">// and writes them to the given filename.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="nf">WriteErrorLog</span><span class="p">(</span><span class="nx">s</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nx">Scope</span><span class="p">,</span><span class="w"> </span><span class="nx">filename</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">errors</span><span class="w"> </span><span class="o">...</span><span class="nx">beam</span><span class="p">.</span><span class="nx">PCollection</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">s</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">s</span><span class="p">.</span><span class="nf">Scope</span><span class="p">(</span><span class="nx">fmt</span><span class="p">.</span><span class="nf">Sprintf</span><span class="p">(</span><span class="s">&#34;Write %q&#34;</span><span class="p">,</span><span class="w"> </span><span class="nx">filename</span><span class="p">))</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">c</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nf">Flatten</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="nx">errors</span><span class="o">...</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">c</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nf">ParDo</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="kd">func</span><span class="p">(</span><span class="nx">key</span><span class="p">,</span><span class="w"> </span><span class="nx">value</span><span class="w"> </span><span class="kt">string</span><span class="p">)</span><span class="w"> </span><span class="kt">string</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="k">return</span><span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nf">Sprintf</span><span class="p">(</span><span class="s">&#34;%s,%s&#34;</span><span class="p">,</span><span class="w"> </span><span class="nx">key</span><span class="p">,</span><span class="w"> </span><span class="nx">value</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="p">},</span><span class="w"> </span><span class="nx">c</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">textio</span><span class="p">.</span><span class="nf">Write</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="nx">morebeam</span><span class="p">.</span><span class="nf">Join</span><span class="p">(</span><span class="o">*</span><span class="nx">outputPrefix</span><span class="p">,</span><span class="w"> </span><span class="nx">filename</span><span class="p">),</span><span class="w"> </span><span class="nx">c</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>Since each output line is key, comma, value, the file can easily be re-read to retry just the failed keys.</p>
<p>The rest of the pipeline is much the same, and thus won’t be explained in detail. <code>CalculateColorPalette</code> takes the color histograms and runs a K-Means clustering algorithm to extract the color palettes for those paintings. Those palettes are written out to PNG files by <code>DrawColorPalette</code>, and finally all the palettes are written out to a JSON file in <code>WriteIndex</code>.</p>
<h2 id="gotchas">Gotchas</h2>
<h3 id="marshing">Marshalling</h3>
<p>Always remember to register the types that will be transmitted between workers. This is anything that’s inside a PCollection, as well as any DoFn. Not all types are allowed, but slices, structs, and primitives are. For other types, custom JSON marshalling can be used.</p>
<p>It is also worth remembering that global state is not allowed. Flags and other global variables will not always be populated when running on a remote worker. Also, examples like this may catch you out:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="nx">prefix</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="s">&#34;X&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nx">s</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">s</span><span class="p">.</span><span class="nf">Scope</span><span class="p">(</span><span class="s">&#34;Prefix &#34;</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nx">prefix</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nx">c</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">beam</span><span class="p">.</span><span class="nf">ParDo</span><span class="p">(</span><span class="nx">s</span><span class="p">,</span><span class="w"> </span><span class="kd">func</span><span class="p">(</span><span class="nx">value</span><span class="w"> </span><span class="kt">string</span><span class="p">)</span><span class="w"> </span><span class="kt">string</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">return</span><span class="w"> </span><span class="nx">prefix</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nx">value</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">},</span><span class="w"> </span><span class="nx">c</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>This simple example appears to add “X” to the beginning of each element; however, it will prefix nothing. This is because the anonymous function is marshalled, and then unmarshalled on the worker. When it is invoked on the worker, it does not have the closure, and thus has not captured the value of <code>prefix</code>. Instead, <code>prefix</code> is the zero value. For this example to work, <code>prefix</code> must be defined inside the anonymous function, or a DoFn struct must be used which contains the prefix as a marshalled field.</p>
<h3 id="errors">Errors</h3>
<p>Since the pipeline could be running across 100s of workers, errors are to be expected. Extensively using <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam/log#Infof"><code>log.Infof</code></a>, <a href="https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam/log#Debugf"><code>log.Debugf</code></a>, etc. will make your life easier. They make it much easier to debug why the pipeline got stuck, or mysteriously failed.</p>
<p>While debugging this pipeline, it would occasionally fail due to exceeding the memory limits of the Dataflow workers. Standard Go infrastructure can be used to help debug this, such as <a href="https://golang.org/pkg/net/http/pprof/">pprof</a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kn">import</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="s">&#34;net/http&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="nx">_</span><span class="w"> </span><span class="s">&#34;net/http/pprof&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">func</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="o">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">go</span><span class="w"> </span><span class="kd">func</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="c1">// HTTP Server for pprof (and other debugging)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">		</span><span class="nx">log</span><span class="p">.</span><span class="nf">Info</span><span class="p">(</span><span class="nx">ctx</span><span class="p">,</span><span class="w"> </span><span class="nx">http</span><span class="p">.</span><span class="nf">ListenAndServe</span><span class="p">(</span><span class="s">&#34;localhost:8080&#34;</span><span class="p">,</span><span class="w"> </span><span class="kc">nil</span><span class="p">))</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="p">}()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="o">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>This starts a web server that can export useful stats, and can be used to grab pprof profiling data.</p>
<h3 id="difference-between-direct-and-dataflow-runners">Difference between direct and dataflow runners</h3>
<p>Running the pipeline locally is a quick way to validate that the pipeline is set up correctly, and that it runs as expected. However, running locally won’t run the pipeline in parallel, and it is obviously constrained to a single machine. There are some other differences, mostly around marshalling data. It’s always a good idea to also test on Dataflow, perhaps using a smaller or sampled dataset as input for a smoke test.</p>
<h1 id="conclusion">Conclusion</h1>
<p>This article has covered the basics of creating an Apache Beam pipeline with the Go SDK, while also covering some more advanced topics. The results of the specific pipeline will be revealed in a later article; until then, the <a href="https://github.com/bramp/dataflow-art">code is available here</a>.</p>
<p>While the Beam Go SDK is still experimental, there are many great tutorials and examples using the more mature Java and Python Beam SDKs [<a href="https://medium.com/google-cloud/popular-java-projects-on-github-that-could-use-some-help-analyzed-using-bigquery-and-dataflow-dbd5753827f4">1</a>, <a href="https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a">2</a>]. Google themselves even published a series of generic articles [<a href="https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1">part 1</a>, <a href="https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-2">part 2</a>] explaining common use cases.</p>
</description>
    </item>
    
  </channel>
</rss>
