Show simple item record

dc.contributor.author: Mahasivam, Nivethika
dc.contributor.author: Nikolov, Nikolay
dc.contributor.author: Sukhobok, Dina
dc.contributor.author: Roman, Dumitru
dc.date.accessioned: 2018-01-23T07:04:53Z
dc.date.available: 2018-01-23T07:04:53Z
dc.date.created: 2018-01-20T21:41:32Z
dc.date.issued: 2017
dc.identifier.citation: Lecture Notes in Computer Science. 2017, 10465 LNCS 125-139. [nb_NO]
dc.identifier.issn: 0302-9743
dc.identifier.uri: http://hdl.handle.net/11250/2478902
dc.description.abstract: Data preparation is the process of collecting, cleaning and consolidating raw datasets into cleaned data of a certain quality. It is an important aspect of almost every data analysis process, and yet it remains tedious and time-consuming. The complexity of the process is further increased by the recent tendency to derive knowledge from very large datasets. Existing data preparation tools provide limited capabilities to effectively process such large volumes of data. On the other hand, frameworks and software libraries that do address the requirements of big data require expert knowledge in various technical areas. In this paper, we propose a dynamic, service-based, scalable data preparation approach that aims to solve the challenges in data preparation on a large scale, while retaining the accessibility and flexibility provided by data preparation tools. Furthermore, we describe its implementation and integration with Grafterizer, an existing framework for data preparation. Our solution is based on Apache Spark, and exposes application programming interfaces (APIs) to integrate with external tools. Finally, we present experimental results that demonstrate the improvements to the scalability of Grafterizer. [nb_NO]
dc.language.iso: eng [nb_NO]
dc.title: Data preparation as a service based on Apache Spark [nb_NO]
dc.type: Journal article [nb_NO]
dc.type: Peer reviewed [nb_NO]
dc.description.version: acceptedVersion [nb_NO]
dc.source.pagenumber: 125-139 [nb_NO]
dc.source.volume: 10465 LNCS [nb_NO]
dc.source.journal: Lecture Notes in Computer Science [nb_NO]
dc.identifier.doi: 10.1007/978-3-319-67262-5_10
dc.identifier.cristin: 1548459
cristin.unitcode: 7401,90,12,0
cristin.unitname: Nettbaserte systemer og tjenester
cristin.ispublished: true
cristin.fulltext: postprint
cristin.qualitycode: 1

