Amazon Simple Storage Service (S3)
Amazon S3 is a scalable, web-based cloud storage service for application data and for online backup and archiving, offering high speed at low cost. To make web-scale computing as simple as possible for developers, Amazon S3 was intentionally designed with a minimal feature set. Amazon S3 is an object storage service, a concept distinct from file and block storage: each object is stored together with a unique identifier (its key) and associated metadata, and applications use this identifier to access the object.
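To make the key-and-metadata model concrete, here is a minimal sketch using the boto3 Python SDK; the bucket name, key, and metadata values are hypothetical placeholders.

```python
# Minimal sketch of object storage with Amazon S3, assuming boto3 is installed
# and AWS credentials are configured. Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

# Store an object: the key identifies it, and the metadata travels with it.
s3.put_object(
    Bucket="example-bucket",
    Key="reports/2020/summary.csv",
    Body=b"id,value\n1,42\n",
    Metadata={"source": "nightly-export"},
)

# Retrieve the object by its key; the response also carries its metadata.
obj = s3.get_object(Bucket="example-bucket", Key="reports/2020/summary.csv")
print(obj["Body"].read())
print(obj["Metadata"])
```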
Related articles
Importing data to Databricks: external tables and Delta Lake
Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python
During a Machine Learning project, we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of models developed at a later…
May 21, 2020
Introducing Apache Airflow on AWS
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: Airflow, Oozie, Spark, PySpark, Docker, Learning and tutorial, AWS, Python
Apache Airflow offers a potential solution to the growing challenge of managing an increasingly complex landscape of data management tools, scripts and analytics processes. It is an open-source…
May 5, 2020
Cloudera CDP and Cloud migration of your Data Warehouse
Categories: Big Data, Cloud Computing | Tags: Cloudera, Data Hub, Data Lake, Data Warehouse, Azure
While one of our customers is anticipating a move to the Cloud, and with the recent announcement of Cloudera CDP availability in mid-September during the Strata conference, it seems like the appropriate…
By David WORMS
Dec 16, 2019
Should you move your Big Data and Data Lake to the Cloud
Categories: Big Data, Cloud Computing | Tags: DevOps, AWS, Cloud, CDP, Databricks, GCP, Azure
Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS and Azure? During the Strata Data Conference in New York, a general focus was put on moving customer's Big…
Dec 9, 2019
Hadoop Ozone part 3: advanced replication strategy with Copyset
Categories: Infrastructure | Tags: HDFS, Ozone, Cluster, Kubernetes, Node
Hadoop Ozone provides a way of setting a ReplicationType for every write you make on the cluster. Right now, HDFS and Ratis are supported, but more advanced replication strategies can be achieved. In this…
Dec 3, 2019
Hadoop Ozone part 2: tutorial and getting started with its features
Categories: Infrastructure | Tags: HDFS, CLI, Learning and tutorial, REST, Ozone, Amazon S3, Cluster
The releases of Hadoop Ozone come with a handy docker-compose file to try out Ozone. The instructions below provide details on how to use it. You can also use the Katacoda training sandbox which…
Dec 3, 2019
Hadoop Ozone part 1: an introduction to the new filesystem
Categories: Infrastructure | Tags: HDFS, Ozone, Cluster, Kubernetes
Hadoop Ozone is an object store for Hadoop. It is designed to scale to billions of objects of varying sizes. It is currently in development. The roadmap is available on the project wiki. This article…
Dec 3, 2019