Create a new dataset from specific columns of two other datasets in Scala


zubug55

I have the following datasets with 2 different schemas.

case class schema1(a: Double, b: Double) -> dataset1
case class schema2(c: Double, d: Double, e: Double, f: Double) -> dataset2

I want to create another dataset with the following schema:

case class schema3(c: Double,  b: Double) -> dataset3

That is, dataset3 (schema3) contains the first column, c, of dataset2 (schema2) and the second column, b, of dataset1 (schema1).

How can I create the third dataset, based on schema3, using the data in column c of dataset2 and column b of dataset1?

Or, more simply, I have to create a third dataset by taking one column from the first dataset and another column from the second dataset, and mapping them to the third schema defined above.

Please help.

Srinivas

Use monotonically_increasing_id and row_number to add a unique id value to each row of both datasets, join the two datasets on the id column while selecting the desired columns from each, and finally drop the id column from the resulting dataset.

Please check the code below.

scala> case class schema1(a: Double, b: Double)
defined class schema1

scala> case class schema2(c: Double, d: Double, e: Double, f: Double)
defined class schema2

scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._

scala> val sa = Seq(schema1(11,12),schema1(22,23)).toDF.withColumn("id",monotonically_increasing_id()).withColumn("id",row_number().over(Window.orderBy("id")))
sa: org.apache.spark.sql.DataFrame = [a: double, b: double ... 1 more field]

scala> val sb = Seq(schema2(22,23,24,25),schema2(32,33,34,35),schema2(132,133,134,135)).toDF.withColumn("id",monotonically_increasing_id()).withColumn("id",row_number().over(Window.orderBy("id")))
sb: org.apache.spark.sql.DataFrame = [c: double, d: double ... 3 more fields]

scala> sa.show(false)
+----+----+---+
|a   |b   |id |
+----+----+---+
|11.0|12.0|1  |
|22.0|23.0|2  |
+----+----+---+


scala> sb.show(false)
+-----+-----+-----+-----+---+
|c    |d    |e    |f    |id |
+-----+-----+-----+-----+---+
|22.0 |23.0 |24.0 |25.0 |1  |
|32.0 |33.0 |34.0 |35.0 |2  |
|132.0|133.0|134.0|135.0|3  |
+-----+-----+-----+-----+---+

scala> sb.select("c","id").join(sa.select("b","id"),Seq("id"),"full").drop("id").show(false)
+-----+----+
|c    |b   |
+-----+----+
|22.0 |12.0|
|32.0 |23.0|
|132.0|null|
+-----+----+
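
The join above returns an untyped DataFrame, while the question asks for a dataset typed as schema3. A minimal sketch of that last step follows, under a few assumptions: the helper names withRowId and build and the object BuildSchema3 are hypothetical, the session runs in local mode, and an inner join is used in place of the full join so that no null ends up in a non-nullable Double field (decoding a null into schema3 may fail at runtime; b: Option[Double] would be the alternative if unmatched rows must be kept).

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

case class schema1(a: Double, b: Double)
case class schema2(c: Double, d: Double, e: Double, f: Double)
case class schema3(c: Double, b: Double)

object BuildSchema3 {
  // Hypothetical helper: numbers the rows 1, 2, 3, ... in their current order,
  // using the same two-step id technique as the answer above.
  def withRowId(df: DataFrame): DataFrame =
    df.withColumn("id", monotonically_increasing_id())
      .withColumn("id", row_number().over(Window.orderBy("id")))

  // Pairs row n of ds2 with row n of ds1 and keeps only columns c and b.
  def build(spark: SparkSession,
            ds1: Dataset[schema1],
            ds2: Dataset[schema2]): Dataset[schema3] = {
    import spark.implicits._
    withRowId(ds2.select("c"))
      .join(withRowId(ds1.select("b")), Seq("id"), "inner") // drops unmatched rows
      .drop("id")
      .as[schema3] // columns are matched to the case class fields by name
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for illustration
      .appName("schema3-sketch")
      .getOrCreate()
    import spark.implicits._

    val dataset1 = Seq(schema1(11, 12), schema1(22, 23)).toDS()
    val dataset2 = Seq(schema2(22, 23, 24, 25), schema2(32, 33, 34, 35),
                       schema2(132, 133, 134, 135)).toDS()

    val dataset3: Dataset[schema3] = build(spark, dataset1, dataset2)
    dataset3.show(false)
    spark.stop()
  }
}
```

Note that pairing rows by position only makes sense when both datasets have a meaningful, stable row order. Window.orderBy without a partitionBy also moves all rows to a single partition, so this approach is best suited to small or moderately sized data.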
