Data preparation as a service based on Apache Spark

Mahasivam, Nivethika; Nikolov, Nikolay; Sukhobok, Dina; Roman, Dumitru
Journal article, Peer reviewed
Accepted version
paper22-camera-ready.docx+%28002%29.pdf (948.7Kb)
URI
http://hdl.handle.net/11250/2478902
Date
2017
Collections
  • Publikasjoner fra CRIStin - SINTEF AS [4329]
  • SINTEF Digital [1671]
Original version
Lecture Notes in Computer Science. 2017, 10465 LNCS 125-139. DOI: 10.1007/978-3-319-67262-5_10
Abstract
Data preparation is the process of collecting, cleaning and consolidating raw datasets into cleaned data of a certain quality. It is an important part of almost every data analysis process, and yet it remains tedious and time-consuming. The complexity of the process is further increased by the recent tendency to derive knowledge from very large datasets. Existing data preparation tools provide limited capabilities for processing such large volumes of data effectively. On the other hand, frameworks and software libraries that do address the requirements of big data require expert knowledge in various technical areas. In this paper, we propose a dynamic, service-based, scalable data preparation approach that aims to solve the challenges of data preparation on a large scale, while retaining the accessibility and flexibility provided by data preparation tools. Furthermore, we describe its implementation and integration with Grafterizer, an existing framework for data preparation. Our solution is based on Apache Spark and exposes application programming interfaces (APIs) to integrate with external tools. Finally, we present experimental results that demonstrate the improvements to the scalability of Grafterizer.
Journal
Lecture Notes in Computer Science
