Add a batch write flow control example for Bigtable #9314
Conversation
Hello,
Please address the following questions:
- We ask for a single code sample per file. This code appears to show two samples. What does this code sample demonstrate?
- This code sample has no region tags. What documentation will use it?
- We do not host code samples without tests. Why are there no tests?
Pipeline p = Pipeline.create(options);

PCollection<Long> numbers = p.apply(generateLabel, GenerateSequence.from(0).to(numRows));

if (options.getUseCloudBigtableIo()) {
  System.out.println("Using CloudBigtableIO");
  PCollection<org.apache.hadoop.hbase.client.Mutation> mutations = numbers.apply(mutationLabel,
      ParDo.of(new CreateHbaseMutationFn(options.getBigtableColsPerRow(),
          options.getBigtableBytesPerCol())));

  mutations.apply(
      String.format("Write data to table %s via CloudBigtableIO", options.getBigtableTableId()),
      CloudBigtableIO.writeToTable(new CloudBigtableTableConfiguration.Builder()
          .withProjectId(options.getProject())
          .withInstanceId(options.getBigtableInstanceId())
          .withTableId(options.getBigtableTableId())
          .withConfiguration(BigtableOptionsFactory.BIGTABLE_ENABLE_BULK_MUTATION_FLOW_CONTROL,
              "true")
          .withConfiguration(BigtableOptionsFactory.BIGTABLE_BULK_MAX_REQUEST_SIZE_BYTES,
              "1048576")
          .build()));
} else {
  System.out.println("Using BigtableIO");
  PCollection<KV<ByteString, Iterable<Mutation>>> mutations = numbers.apply(mutationLabel,
      ParDo.of(new CreateMutationFn(options.getBigtableColsPerRow(),
          options.getBigtableBytesPerCol())));

  BigtableIO.Write write = BigtableIO.write()
      .withProjectId(options.getProject())
      .withInstanceId(options.getBigtableInstanceId())
      .withTableId(options.getBigtableTableId())
      .withFlowControl(true); // This enables batch write flow control

  mutations.apply(
      String.format("Write data to table %s via BigtableIO", options.getBigtableTableId()),
      write);
}

p.run();
This block of code is hard to read. Since it is a code sample, it should be easy to understand. Please reformat it so that it reads as a series of steps, each calling the pipeline's apply method. See the dataflow-bigquery-read-tablerows sample as a reference.
This code sample shows how to enable the flow control feature. It has two parts because we have two different connectors, and they're configured differently. I was unsure whether it's worth duplicating the rest of the code to keep the sample simple. Please advise and I can split it.
We're going to write the doc that will use this code sample.
Yes, I'll add tests.
I was unsure whether it's worth duplicating the rest of the code to keep the sample simple. Please advise and I can split it.
It seems to me that both branches of the if/else block apply only two transformations, and the first transformations are very similar and hard to distinguish in the current version.
I suggest either looking into an option to abstract these transformations (by inheriting from a proper Beam class) or, at least, implementing each of them in a separate method with a meaningful name.
Another option would be to implement each pipeline in a separate class, from start to end.
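As a sketch of the "separate method" suggestion (the structure below is illustrative, not the actual PR code; the method name and the options type `BigtablePipelineOptions` are assumptions about the surrounding sample), each connector's branch could become one clearly named method so the main flow reads as steps:

```java
// Illustrative refactoring sketch: the BigtableIO branch moves into its own
// method, so the pipeline reads as three named apply() steps.
// BigtablePipelineOptions and buildBigtableIoSteps are assumed names.
static void buildBigtableIoSteps(Pipeline p, BigtablePipelineOptions options, long numRows) {
  p.apply("Generate row indexes", GenerateSequence.from(0).to(numRows))
      .apply("Create mutations", ParDo.of(new CreateMutationFn(
          options.getBigtableColsPerRow(), options.getBigtableBytesPerCol())))
      .apply("Write via BigtableIO", BigtableIO.write()
          .withProjectId(options.getProject())
          .withInstanceId(options.getBigtableInstanceId())
          .withTableId(options.getBigtableTableId())
          .withFlowControl(true)); // enables batch write flow control
}
```

A parallel method for the CloudBigtableIO branch would keep the two configurations side by side but clearly separated.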
We're going to write the doc that will use this code sample.
Please either acquire a region tag, or create a new tag using the devrel/ site, and add it to the code.
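For reference, region tags in Google Cloud samples are comment markers that delimit the snippet the documentation embeds. A minimal sketch, with a placeholder tag name (a real tag must be registered, not invented):

```java
// [START bigtable_sample_placeholder_tag]  // placeholder name, not an assigned tag
// ... the code to embed in the documentation goes here ...
// [END bigtable_sample_placeholder_tag]
```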
p.run();
Is this intended to run asynchronously? Please append a waitUntilFinish() call to the result of run().
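The requested change would replace the bare run() with a blocking call, roughly:

```java
// Run the pipeline and block until it completes, so the sample binary
// exits only after all writes have finished.
p.run().waitUntilFinish();
```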
So far our example jobs all run asynchronously; is it better practice to run synchronously? I'm happy to learn the reasoning behind the practice.
This is a code sample. It is good practice to have the binary terminate after the sampled behavior is complete.
If there are practical differences between implementing batch flow control asynchronously and synchronously, consider creating multiple code samples that demonstrate both behaviors.
Includes using BigtableIO and CloudBigtableIO
Description
Add a batch write flow control example for Bigtable
Checklist
- pom.xml parent set to latest shared-configuration
- mvn clean verify passes (required)
- mvn -P lint checkstyle:check passes (required)
- mvn -P lint clean compile pmd:cpd-check spotbugs:check passes (advisory only)