Merge pull request #1251 from ydai1124/master
Release 0.8.0
ydai1124 authored Sep 2, 2016
2 parents 82422d6 + f2f2a1d commit ca58061
Showing 1 changed file with 174 additions and 5 deletions.
179 changes: 174 additions & 5 deletions CHANGELOG.md
@@ -1,3 +1,177 @@
GOBBLIN 0.8.0
-------------

#### Created Date: 08/22/2016

## Highlights

* Gobblin can now convert Avro files to ORC through Hive. Documentation: http://gobblin.readthedocs.io/en/latest/adaptors/Hive-Avro-To-ORC-Converter/.
* Gobblin can now write data to Kafka using a new `KafkaWriter`. Documentation: http://gobblin.readthedocs.io/en/latest/sinks/Kafka/.
* Gobblin distcp can now replicate Hive tables between different Hive Metastores. Documentation: http://gobblin.readthedocs.io/en/latest/case-studies/Hive-Distcp/.
* Gobblin now supports Hive-based retention. Documentation: http://gobblin.readthedocs.io/en/latest/data-management/Gobblin-Retention/.
* Gobblin now supports job templates, which reduce the effort of writing a Gobblin job. Documentation: http://gobblin.readthedocs.io/en/latest/user-guide/Gobblin-template/.

## NEW FEATURES

* [Kafka] [PR 1016] Integration with Confluent Schema Registry, Confluent Deserializers, and Kafka Deserializers
* [Avro to ORC] [PR 1031] Adding Avro To ORC conversion logic and related framework modifications
* [General FileSystem Support] [PR 1066] Config file monitor for general file system
* [Avro to ORC] [PR 1068] Nested Avro to Nested ORC conversion support
* [General FileSystem Support] [PR 1073] extension of loading config file from general file system
* [AWS] [PR 1088] Gobblin on AWS
* [Kafka Writer] [PR 1089] Kafka writer
* [JDBC Extractor] [PR 1090] Teradata JDBC Extractor and Source
* [Avro to ORC] [PR 1093] Support for schema evolution, staging, selective column projection and compatibility check for Avro to ORC
* [Hive Retention] [PR 1106] Hive Based Retention
* [Job Templates] [PR 1145] Initial commit for job configuration template
* [Http Writer] [PR 1186] HttpWriter including SalesForceRestWriter, ThrottleWriter, etc
* [Avro to ORC] [PR 1188] Avro to ORC data validation
* [Job Templates] [PR 1197] Kafka-template
* [Job Launcher] [PR 1203] New std driver2
* [Core] [PR 1216] Adding a simple console writer to gobblin

## BUG FIXES

* [YARN] [PR 982] Using new zk port numbers for unit tests
* [Kafka] [PR 996] Fix offset related bug in KafkaSource
* [Core] [PR 999] distcp-ng throws UnsupportedOperationException
* [Build] [PR 1001] Setting heap size for gobblin-runtime tests due to OOM in some cases
* [Core] [PR 1002] Set explicit 755 permissions to state store
* [Core] [PR 1005] Fixing SOURCE_QUERYBASED_LOW_WATERMARK_BACKUP_SECS no default value
* [Config Management] [PR 1043] Fix includes order
* [JDBC Writer] [PR 1050] JDBCWriter. Bug fix on SQL statements. Bug fix on data type mapping.
* [Data Management] [PR 1051] Fix default blacklist key
* [Salesforce] [PR 1069] Adding security token to Salesforce bulk API login
* [Runtime] [PR 1078] Fixing possible NPE in SourceDecorator
* [Documentation] [PR 1081] Fixing search for Gobblin ReadTheDocs
* [Documentation] [PR 1107] Minor text formatting fix for README.md
* [Salesforce] [PR 1118] gobblin salesforce update to new proxy
* [Config Management] [PR 1135] Revert changes to ConfigUtils
* [Utility] [PR 1147] Capture exceptions correctly in HadoopUtilsTest.testSafeRenameRecursively
* [Salesforce] [PR 1152] Updated gobblin salesforce to resolve entity.source and extract.table.name
* [Build] [PR 1153] Make sure maven central repo is first; bug fixes
* [Utility] [PR 1154] Fix for failing createProxiedFileSystemUsingToken
* [Avro to ORC] [PR 1155] Changed Hive validation to make it compatible with old Hive version with auth turned on, and Hive query generation compile with new Hive version
* [Build] [PR 1156] Upgrade wix-embedded-mysql
* [Runtime] [PR 1157] Move test MR jobs dir to /tmp to avoid issues with DistributedCache
* [Distcp] [PR 1160] Fixed a race condition in CopyDataPublisher.
* [Metrics] [PR 1170] Do not fail the task if metricsReport fails to stop
* [Metrics] [PR 1176] Added a backwards compatible constructor to SchemaRegistryVersionWriter
* [Retention] [PR 1182] Throw exception when retention dataset finder fails to initialize
* [Retention] [PR 1202] Bug fix - Retention does not blacklist dataset
* [Runtime] [PR 1215] Fixed silent failures and hung application when a standalone service fails to initialize.
* [Example] [PR 1217] Fixing console writer example

## IMPROVEMENTS

* [YARN] [PR 978] Initial commit for gobblin-cluster; gobblin-yarn refactoring
* [Core] [PR 979] Initial commit for HTTP Writer APIs
* [Core] [PR 980] Add metadata after completion of job to a specific metadata directory
* [Hive Distcp] [PR 983] need to deregister existing table
* [Documentation] [PR 988] Adding documentation page for Gobblin Distcp
* [Documentation] [PR 989] Added retention docs
* [Documentation] [PR 991] Add Hive registration doc
* [Kafka] [PR 992] Making Kafka metadata read more resilient to issues with the brokers
* [Documentation] [PR 993] open source wiki for config management
* [Data Management] [PR 998] Merge the two LongWatermarks
* [Hive Distcp] [PR 1003] Added the predicate check to skip full table diff if the existing table's registration time > source table's mod time
* [Distcp] [PR 1008] ETL-4470: Implementation of HTTP file puller using Distcp-ng
* [Documentation] [PR 1012] Document changes in PR#952
* [Documentation] [PR 1013] Update documents
* [Build] [PR 1023] Adding parallel test Travis VMs
* [Hive Registration] [PR 1027] Added configuration to Hive client for getting credentials.
* [Hive Registration] [PR 1034] Hive metastore initialization should support empty HCat uri ie default to platform defaults
* [Avro to ORC] [PR 1035] Use table schema and partition schema
* [Avro to ORC] [PR 1036] Hive metastore connection pool optimization, Fixes for: backward compatibility for Hive in AvroToOrc, schema parser deserialization from schema literal, database name in Hive DDL query generation, Hive metastore connection pool initialization NPE if Hcat uri is platform provided
* [Avro to ORC] [PR 1037] Add sla events for avro to orc conversion
* [Hive Registration] [PR 1038] Made Hive metastore connection auto returnable to connection pool after Hive dataset discovery
* [Avro to ORC] [PR 1044] Made HiveAvroToOrcConverter compatible with Hive v0.13 version
* [Hive Distcp] [PR 1045] Add bootstrap low watermark support for HiveSource in data management
* [Avro to ORC] [PR 1046] Mark all workunits of a dataset as failed if one task fails
* [Hive Distcp] [PR 1053] Add lookback days for HiveSource
* [Hive Registration] [PR 1054] Converted Hive dereg / registration to post publish steps, fixed missing fileset.
* [Distcp] [PR 1055] Parallelize commit rebased
* [Hive Distcp] [PR 1056] Add lastDataPublishTime in hive table/partition properties
* [Runtime] [PR 1060] MR launcher does not write tasks to the jobstate file in HDFS.
* [Hive Distcp] [PR 1062] Enable AvroSchemaManager to read schema from Kafka schema registry
* [Hive Distcp] [PR 1067] Add a backfill hive source that does not check watermarks
* [Data Management] [PR 1071] Add ConvertibleHiveDataset and config store support to HiveDatasetFinder
* [Documentation] [PR 1082] Updating the README and other outdated docs to encourage use of Gobblin Releases
* [Avro to ORC] [PR 1087] Add support for nested and flattened orc conversion configuration
* [Kafka] [PR 1091] Confluent schema registry example for kafka writer
* [Json Converter] [PR 1092] Added JsonConverter to parse Json files to a format such that JsonIntermediateToAvro converter can parse
* [Avro to ORC] [PR 1095] Refactored to rename HiveAvroORCQueryUtils to HiveAvroORCQueryGenerator
* [Compaction] [PR 1096] Added simulate mode in Hive JDBC Connector to simulate query execution
* [Avro to ORC] [PR 1097] Added limit clause to Hive query generation to enable conversion validation of sample subset
* [Avro to ORC] [PR 1098] Added Azkaban job that can validate conversion result by comparing source and target Hive tables
* [Core] [PR 1102] Intern strings in deserialized States to reduce memory usage.
* [Documentation] [PR 1104] Added powered by section in wiki for companies using Gobblin
* [Documentation] [PR 1105] Added Gobblin meetup June 2016 presentations on Talks and Tech Blogs wiki
* [Documentation] [PR 1109] Updating the code contributions documentation
* [Documentation] [PR 1110] Added videos from June 2016 meetup to talks-and-tech-blogs wiki page
* [Documentation] [PR 1111] Made order of presentations chronological in talks-and-tech-blogs wiki page
* [Documentation] [PR 1112] Update Gobblin on AWS video presentation link with right start time in playback
* [Documentation] [PR 1113] Added Paypal to powered by wiki page
* [Documentation] [PR 1115] Adding Sandia National Labs to Powered-By page
* [Avro to ORC] [PR 1119] Changed concatenated queries string to list in Hive converter publisher
* [Avro to ORC] [PR 1120] Added Hive query generation to optionally support explicit database names
* [Avro to ORC] [PR 1122] Made changes to handle Hive-6129 (inverted exchange partition bug) and corresponding support for backward incompatible changes in Hive
* [Hive Distcp] [PR 1126] Make distcp publisher safer: renameRecursively fails appropriately, hive registration fails if location doesn't exist.
* [Avro to ORC] [PR 1127] Drop hourly partitions when daily data gets converted to ORC
* [Hive Registration] [PR 1128] Added events in hive-registration
* [Avro to ORC] [PR 1138] Change Hive Avro to ORC publish to use Gobblin constructs instead of Hive exchange partition query
* [Avro to ORC] [PR 1139] Added support to escape the Hive nested field names when derived from destination table as raw string
* [Data Management] [PR 1140] Moved WhitelistBlacklist from data-management to utility.
* [Avro to ORC] [PR 1141] Renamed partitionDir.prefixLocationHint to source.dataPathIdentifier to be more consistent with naming across Hive data conversion
* [Build] [PR 1142] Add gradle property withFindBugsXmlReport to enable XML FindBugs reports
* [Avro to ORC] [PR 1148] Support for distcp-ng registration time in isOlderThanLookback check and minor refactoring
* [Avro to ORC] [PR 1151] Changed Hive conversion validation job to use HIVE_DATASET_CONFIG_PREFIX consistent with HiveAvroToOrcSource
* [Avro to ORC] [PR 1163] Fail Avro to ORC validation job on at least one failure
* [Hive Registration] [PR 1165] Add create time to newly registered Hive tables and partitions.
* [Hive Distcp] [PR 1167] Adding options in watermarkCopyableFileFilter and some refactoring
* [Metrics] [PR 1169] Gobblin metrics registers the base schemas instead of inferring them from events.
* [Avro to ORC] [PR 1171] Added more SLA event metadata to Avro to Orc conversion job
* [Avro to ORC] [PR 1172] Use camel case for event names
* [Avro to ORC] [PR 1173] Parallelize Avro to ORC validation job
* [Utility] [PR 1175] Schema files (schema.avsc) will be written with 774 permission.
* [Hive Distcp] [PR 1180] Add createtime when altering a table.
* [Job Templates] [PR 1183] change the key name of required.attributes
* [Job Templates] [PR 1184] Fixed name of ResourceBasedTemplate.
* [Job Templates] [PR 1185] Fix naming of template and template class file.
* [Avro to ORC] [PR 1189] cache data modTime to reduce too many HDFS calls
* [Hive Retention] [PR 1190] Add logs to hive retention. Support more DatasetFinder constructors
* [Data Management] [PR 1192] Add config store uri builder for hive datasets
* [Core] [PR 1204] Refactor methods between HadoopFsHelper and AvroFsHelper
* [Avro to ORC] [PR 1205] AvroToOrc - Implemented a per-partition watermark
* [Job Launcher] [PR 1206] Refactored SchedulerUtils into a new PullFileLoader that uses Config to load pull files.
* [Documentation] [PR 1207] Added template wiki doc
* [Kafka] [PR 1210] Make topic suffix configurable for lookup in Confluent Schema Registry
* [Job Templates] [PR 1211] Restored template functionality removed accidentally. Add unit test for the functionality.
* [Kafka] [PR 1218] Making Kafka consumer configurable for Kafka extract
* [Runtime] [PR 1220] Refactored MR mode to use GobblinInputFormat.
* [Kafka Writer] [PR 1226] Making kafka writer more robust, adding tests
* [Job Templates] [PR 1228] Templates use config instead of properties.

## EXTERNAL CONTRIBUTIONS

We would like to thank all our external contributors for helping improve Gobblin.

* singhd10:
  - Add metadata after completion of job to a specific metadata directory (PR 980)
* shelocks:
  - Fixing SOURCE_QUERYBASED_LOW_WATERMARK_BACKUP_SECS no default value (PR 1005)
* lbendig, Lorand Bendig:
  - Document changes in PR#952 (PR 1012)
  - Make topic suffix configurable for lookup in Confluent Schema Registry (PR 1210)
* jinhyukchang, Jinhyuk Chang:
  - JDBCWriter. Bug fix on SQL statements. Bug fix on data type mapping. (PR 1050)
  - HttpWriter including SalesForceRestWriter, ThrottleWriter, etc (PR 1186)
* ypopov, Eugene Popov:
  - Teradata JDBC Extractor and Source (PR 1090)
* pldash:
  - Added JsonConverter to parse Json files to a format such that JsonIntermediateToAvro converter can parse (PR 1092)

GOBBLIN 0.7.0
-------------

@@ -48,9 +222,7 @@ GOBBLIN 0.7.0
* [Publisher] [PR 657] Issue #561 - fix for BaseDataPublisher to mark WorkingState correctly
* [Core] [PR 661] Change ParallelRunner.close to wait for all futures to finish
* [Core] [PR 663] ParallelRunner catches exceptions correctly and has failure policies.
* [Build] [PR 664] Fix broken Gobblin version resolution ( fixes #662 )
* [Build] [PR 665] Gobblin-compaction tarball doesn't contain gobblin-compaction.jar
* [Core] [PR 670] Fixing FindBugs warnings
* [Core] [PR 676] Ensure that parallel runner waits for the underlying tasks to finish
* [Core] [PR 677] Fix race condition in FsStateStore
* [Compaction] [PR 680] Fix a ConcurrentModificationException in MRCompactor
@@ -60,9 +232,6 @@ GOBBLIN 0.7.0
* [Distcp] [PR 691] Fix permissions for directories in distcp.
* [Core] [PR 700] Add missing jars to gobblin mapreduce runner, sort.
* [Core] [PR 706] Fixing CliOptions config file fs
* [Core] [PR 722] Fixing FindBugs warnings in gobblin-compaction
* [Build] [PR 743] Fixing skipTestGroup option
* [Build] [PR 775] Fix javadoc warnings by only adding linksOffline to projects that the current project depends on.
* [Core] [PR 797] Fixing Fork + Task Retry Logic #776
* [Distcp] [PR 884] Fix issue with replicating owner and permission of system directories in distcp
* [Data Management] [PR 887] Fix NPE in DateTimeDatasetVersionFinder