A Review Of Virtual Professional-Data-Engineer Exam Prep

Exam Code: Professional-Data-Engineer (Practice Exam Latest Test Questions VCE PDF)
Exam Name: Google Professional Data Engineer Exam
Certification Provider: Google
Free Today! Guaranteed Training- Pass Professional-Data-Engineer Exam.

Check Professional-Data-Engineer free dumps before getting the full version:

NEW QUESTION 1

If a dataset contains rows with individual people and columns for year of birth, country, and income, how many of the columns are continuous and how many are categorical?

  • A. 1 continuous and 2 categorical
  • B. 3 categorical
  • C. 3 continuous
  • D. 2 continuous and 1 categorical

Answer: D

Explanation:
The columns can be grouped into two types—categorical and continuous columns:
A column is called categorical if its value can only be one of the categories in a finite set. For example, the native country of a person (U.S., India, Japan, etc.) or the education level (high school, college, etc.) are categorical columns.
A column is called continuous if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.
Year of birth and income are continuous columns. Country is a categorical column.
You could use bucketization to turn year of birth and/or income into categorical features, but the raw columns are continuous.
Reference: https://www.tensorflow.org/tutorials/wide#reading_the_census_data
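
For illustration, a minimal sketch using TensorFlow's legacy tf.feature_column API (hypothetical feature names), showing both column types and the optional bucketization mentioned above:

import tensorflow as tf

# Continuous (numeric) columns: values in a continuous numerical range.
year_of_birth = tf.feature_column.numeric_column("year_of_birth")
income = tf.feature_column.numeric_column("income")

# Categorical column: value drawn from a finite set of categories.
country = tf.feature_column.categorical_column_with_vocabulary_list(
    "country", ["US", "India", "Japan"])

# Optionally, bucketize a continuous column so it can be treated as categorical.
birth_year_buckets = tf.feature_column.bucketized_column(
    year_of_birth, boundaries=[1960, 1980, 2000])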

NEW QUESTION 2

Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.
The data scientists have written the following code to read the data for the new key features in the logs:
BigQueryIO.Read
.named("ReadLogData")
.from("clouddataflow-readonly:samples.log_data")
You want to improve the performance of this data read. What should you do?

  • A. Specify the TableReference object in the code.
  • B. Use .fromQuery operation to read specific fields from the table.
  • C. Use of both the Google BigQuery TableSchema and TableFieldSchema classes.
  • D. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.

Answer: D
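
A rough equivalent of the snippet above in the Apache Beam Python SDK (the table name comes from the question; the staging bucket is a hypothetical assumption), where each element of the resulting PCollection is a dictionary representing a single table row:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# "gs://my-bucket/tmp" is a hypothetical staging location for the BigQuery export.
options = PipelineOptions(temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadLogData" >> beam.io.ReadFromBigQuery(
            table="clouddataflow-readonly:samples.log_data")
        # Each element of this PCollection is a dict keyed by column name, one per row.
        | "CountRows" >> beam.combiners.Count.Globally()
        | "PrintCount" >> beam.Map(print)
    )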

NEW QUESTION 3

MJTelco’s Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

  • A. The zone
  • B. The number of workers
  • C. The disk size per worker
  • D. The maximum number of workers

Answer: A

NEW QUESTION 4

Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error.
SELECT person FROM `project1.example.table1` WHERE city = "London"
How would you correct the error?

  • A. Add ", UNNEST(person)" before the WHERE clause.
  • B. Change "person" to "person.city".
  • C. Change "person" to "city.person".
  • D. Add ", UNNEST(city)" before the WHERE clause.

Answer: A

Explanation:
To access the person.city column, you need to "UNNEST(person)" and JOIN it to table1 using a comma. Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#nested_repeated_resu
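
As a sketch, assuming Application Default Credentials and that person is a repeated RECORD containing a city field, the corrected query can be run with the BigQuery Python client:

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT person
    FROM `project1.example.table1`, UNNEST(person)
    WHERE city = "London"
"""
for row in client.query(query).result():
    print(row)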

NEW QUESTION 5

You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?

  • A. Deploy small Kafka clusters in your data centers to buffer events.
  • B. Have the data acquisition devices publish data to Cloud Pub/Sub.
  • C. Establish a Cloud Interconnect between all remote data centers and Google.
  • D. Write a Cloud Dataflow pipeline that aggregates all data in session windows.

Answer: A

NEW QUESTION 6

You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose?

  • A. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.
  • B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
  • C. Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.
  • D. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.

Answer: A

NEW QUESTION 7

You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:
  • The user profile: What the user likes and doesn’t like to eat
  • The user account information: Name, address, preferred meal times
  • The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?

  • A. BigQuery
  • B. Cloud SQL
  • C. Cloud Bigtable
  • D. Cloud Datastore

Answer: A

NEW QUESTION 8

Which of these numbers are adjusted by a neural network as it learns from a training dataset (select 2 answers)?

  • A. Weights
  • B. Biases
  • C. Continuous features
  • D. Input values

Answer: AB

Explanation:
A neural network is a simple mechanism that’s implemented with basic math. The only difference between the traditional programming model and a neural network is that you let the computer determine the parameters (weights and bias) by learning from training datasets.
Reference:
https://cloud.google.com/blog/big-data/2016/07/understanding-neural-networks-with-tensorflow-playground
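
A quick way to see this is to list the trainable variables of a small Keras model; they are exactly the weight (kernel) and bias tensors that training adjusts (the layer sizes below are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(1),
])

# Only kernels (weights) and biases are listed; input values and features are not trained.
for variable in model.trainable_variables:
    print(variable.name, variable.shape)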

NEW QUESTION 9

Google Cloud Bigtable indexes a single value in each row. This value is called the ________.

  • A. primary key
  • B. unique key
  • C. row key
  • D. master key

Answer: C

Explanation:
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key.
Reference: https://cloud.google.com/bigtable/docs/overview
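
A minimal sketch with the Cloud Bigtable Python client (project, instance, table, and column family names are hypothetical), showing that both writes and reads are addressed by the row key:

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("sensor-data")

# The row key is the single indexed value; design it around your read patterns.
row_key = b"sensor#42#20240101"
row = table.direct_row(row_key)
row.set_cell("metrics", b"temperature", b"21.5")
row.commit()

fetched = table.read_row(row_key)
print(fetched.cells["metrics"][b"temperature"][0].value)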

NEW QUESTION 10

Your company’s on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?

  • A. Put the data into Google Cloud Storage.
  • B. Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.
  • C. Tune the Cloud Dataproc cluster so that there is just enough disk for all data.
  • D. Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.

Answer: B

NEW QUESTION 11

You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?

  • A. Eliminate features that are highly correlated to the output labels.
  • B. Combine highly co-dependent features into one representative feature.
  • C. Instead of feeding in each feature individually, average their values in batches of 3.
  • D. Remove the features that have null values for more than 50% of the training records.

Answer: B
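
As a toy illustration (synthetic feature names), the correlation matrix flags a highly co-dependent pair, which can then be combined into one representative feature, for example with PCA:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame({"humidity": rng.normal(size=1000), "temp_c": rng.normal(size=1000)})
df["temp_f"] = df["temp_c"] * 1.8 + 32 + rng.normal(scale=0.1, size=1000)

print(df.corr())  # temp_c and temp_f show a correlation near 1.0

# Replace the co-dependent pair with a single representative feature.
df["temp_combined"] = PCA(n_components=1).fit_transform(df[["temp_c", "temp_f"]])[:, 0]
df = df.drop(columns=["temp_c", "temp_f"])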

NEW QUESTION 12

You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.
You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)

  • A. Redis
  • B. HBase
  • C. MySQL
  • D. MongoDB
  • E. Cassandra
  • F. HDFS with Hive

Answer: BDF

NEW QUESTION 13

Which of the following is not true about Dataflow pipelines?

  • A. Pipelines are a set of operations
  • B. Pipelines represent a data processing job
  • C. Pipelines represent a directed graph of steps
  • D. Pipelines can share data between instances

Answer: D

Explanation:
The data and transforms in a pipeline are unique to, and owned by, that pipeline. While your program can create multiple pipelines, pipelines cannot share data or transforms.
Reference: https://cloud.google.com/dataflow/model/pipelines
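
For illustration, a minimal Apache Beam pipeline: the transforms below form the directed graph of steps owned by this single pipeline, and its data is not visible to any other pipeline:

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([1, 2, 3])
        | "Square" >> beam.Map(lambda x: x * x)
        | "Print" >> beam.Map(print)
    )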

NEW QUESTION 14

Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiently?

  • A. Assign global unique identifiers (GUID) to each data entry.
  • B. Compute the hash value of each data entry, and compare it with all historical data.
  • C. Store each data entry as the primary key in a separate database and apply an index.
  • D. Maintain a database table to store the hash value and other metadata for each data entry.

Answer: D
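
A minimal sketch of the hash-plus-metadata approach (field names are hypothetical; an in-memory set stands in for the metadata table a real pipeline would use):

import hashlib
import json

seen_hashes = set()  # stand-in for a database table of hashes and metadata

def is_new_entry(entry: dict) -> bool:
    # Hash only the payload, not the transmission timestamp, so a re-transmitted
    # record produces the same hash as the original.
    payload = {k: v for k, v in entry.items() if k != "transmitted_at"}
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if digest in seen_hashes:
        return False  # duplicate re-transmission, drop it
    seen_hashes.add(digest)
    return True

print(is_new_entry({"sku": "A1", "qty": 5, "transmitted_at": "2024-01-01T00:00:00Z"}))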

NEW QUESTION 15

Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of errors in the input data, and you need to improve the reliability of the pipeline (including being able to reprocess all failing data). What should you do?

  • A. Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.
  • B. Add a try… catch block to your DoFn that transforms the data, extract erroneous rows from logs.
  • C. Add a try… catch block to your DoFn that transforms the data, write erroneous rows to PubSub directly from the DoFn.
  • D. Add a try… catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to PubSub later.

Answer: C
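
A rough sketch of the try/except pattern, assuming the Apache Beam Python SDK and a hypothetical Pub/Sub topic for the failing rows:

import json
import apache_beam as beam
from google.cloud import pubsub_v1

class TransformRow(beam.DoFn):
    def setup(self):
        self.publisher = pubsub_v1.PublisherClient()
        self.topic = "projects/my-project/topics/failed-rows"  # hypothetical topic

    def process(self, row):
        try:
            row["amount"] = float(row["amount"])  # hypothetical transformation logic
            yield row
        except (KeyError, ValueError, TypeError):
            # Publish the erroneous row so it can be inspected and reprocessed later.
            self.publisher.publish(self.topic, json.dumps(row).encode("utf-8"))

# Used inside a pipeline as: ... | "Transform" >> beam.ParDo(TransformRow())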

NEW QUESTION 16

Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?

  • A. Cloud Dataflow
  • B. Cloud Composer
  • C. Cloud Dataprep
  • D. Cloud Dataproc

Answer: D

NEW QUESTION 17

You work for a shipping company that uses handheld scanners to read shipping labels. Your company has strict data privacy standards that require scanners to only transmit tracking numbers when events are sent to Kafka topics. A recent software update caused the scanners to also transmit recipients’ personally identifiable information (PII) to analytics systems, which violates user privacy rules. You want to quickly build a scalable solution using cloud-native managed services to prevent exposure of PII to the analytics systems. What should you do?

  • A. Create an authorized view in BigQuery to restrict access to tables with sensitive data.
  • B. Install a third-party data validation tool on Compute Engine virtual machines to check the incoming data for sensitive information.
  • C. Use Stackdriver logging to analyze the data passed through the total pipeline to identify transactions that may contain sensitive information.
  • D. Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.

Answer: A

NEW QUESTION 18

You need to copy millions of sensitive patient records from a relational database to BigQuery. The total size of the database is 10 TB. You need to design a solution that is secure and time-efficient. What should you do?

  • A. Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
  • B. Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
  • C. Export the records from the database into a CSV file. Create a public URL for the CSV file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the CSV file into BigQuery using the BigQuery web UI in the GCP Console.
  • D. Export the records from the database as an Avro file. Create a public URL for the Avro file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.

Answer: A
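
After the Avro file has been exported and uploaded to Cloud Storage with gsutil, the final load step can also be scripted with the BigQuery Python client instead of the web UI; a minimal sketch with hypothetical bucket and table names:

from google.cloud import bigquery

client = bigquery.Client()
job = client.load_table_from_uri(
    "gs://my-bucket/patient_records/*.avro",     # hypothetical bucket path
    "my-project.clinical.patient_records",       # hypothetical destination table
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
)
job.result()  # wait for the load job to finish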

NEW QUESTION 19

Which Google Cloud Platform service is an alternative to Hadoop with Hive?

  • A. Cloud Dataflow
  • B. Cloud Bigtable
  • C. BigQuery
  • D. Cloud Datastore

Answer: C

Explanation:
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis.
Google BigQuery is an enterprise data warehouse. Reference: https://en.wikipedia.org/wiki/Apache_Hive

NEW QUESTION 20

You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change the data type of the column DT to TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

  • A. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
  • B. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column DT for each row. Reference the column TS instead of the column DT from now on.
  • C. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
  • D. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
  • E. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.

Answer: D
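
Whichever option is chosen, the underlying conversion is the same: cast the STRING epoch value to INT64 and wrap it in TIMESTAMP_SECONDS. A minimal sketch with the BigQuery Python client (project and dataset names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT * EXCEPT (DT),
           TIMESTAMP_SECONDS(CAST(DT AS INT64)) AS TS
    FROM `my-project.analytics.CLICK_STREAM`
"""
rows = client.query(query).result()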

NEW QUESTION 21

Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to allow them to work with multiple GCP products in their projects. Your organization requires that all BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects. What should you do?

  • A. Enable data access logs in each Data Analyst’s project. Restrict access to Stackdriver Logging via Cloud IAM roles.
  • B. Export the data access logs via a project-level export sink to a Cloud Storage bucket in the Data Analysts’ projects. Restrict access to the Cloud Storage bucket.
  • C. Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project with the exported logs.
  • D. Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.

Answer: D

NEW QUESTION 22

What are two of the benefits of using denormalized data structures in BigQuery?

  • A. Reduces the amount of data processed, reduces the amount of storage required
  • B. Increases query speed, makes queries simpler
  • C. Reduces the amount of storage required, increases query speed
  • D. Reduces the amount of data processed, increases query speed

Answer: B

Explanation:
Denormalization increases query speed for tables with billions of rows because BigQuery's performance degrades when doing JOINs on large tables, but with a denormalized data structure, you don't have to use JOINs, since all of the data has been combined into one table. Denormalization also makes queries simpler because you do not have to use JOIN clauses.
Denormalization increases the amount of data processed and the amount of storage required because it creates redundant data.
Reference:
https://cloud.google.com/solutions/bigquery-data-warehouse#denormalizing_data
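
For illustration, compare a normalized query, which needs a JOIN, with a denormalized equivalent that scans a single table (all table and column names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Normalized layout: large JOINs are where query performance degrades.
normalized = """
    SELECT o.order_id, c.country
    FROM `shop.orders` AS o
    JOIN `shop.customers` AS c ON o.customer_id = c.customer_id
"""

# Denormalized layout: customer fields are repeated on each order row, so the
# same question becomes a simpler, JOIN-free scan (at the cost of extra storage).
denormalized = """
    SELECT order_id, customer_country
    FROM `shop.orders_denormalized`
"""

results = client.query(denormalized).result()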

NEW QUESTION 23

When you store data in Cloud Bigtable, what is the recommended minimum amount of stored data?

  • A. 500 TB
  • B. 1 GB
  • C. 1 TB
  • D. 500 GB

Answer: C

Explanation:
Cloud Bigtable is not a relational database. It does not support SQL queries, joins, or multi-row transactions. It is not a good solution for less than 1 TB of data.
Reference: https://cloud.google.com/bigtable/docs/overview#title_short_and_other_storage_options

NEW QUESTION 24
......

Recommend!! Get the Full Professional-Data-Engineer dumps in VCE and PDF From 2passeasy, Welcome to Download: https://www.2passeasy.com/dumps/Professional-Data-Engineer/ (New 239 Q&As Version)