Photo by Fikri Rasyid on Unsplash

EDA on Tanzania Water Pumps

Vidya Menon
8 min read · Jul 20, 2020


Hand-driven and gravity-fed water pumps are a key source of potable water throughout much of Africa and Asia, and the ongoing management and maintenance of such pumps is an ever-present challenge for many communities. Using data from Taarifa and the Tanzanian Ministry of Water, we want to predict which pumps are functional, which need some repairs, and which do not work at all.

The dataset contains many features, and we need to work out which of them influence whether a pump keeps operating. Understanding these can help improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

According to Tanzania’s Ministry of Water, more than 74,000 such pumps can be found throughout the country. While the installation of these pumps is largely funded via contributions from charitable and other non-governmental organizations (NGOs), their ongoing maintenance is typically the responsibility of the local community within which they reside. Unfortunately, the cost of maintaining the pumps is often beyond the means of the local community, resulting in pumps becoming non-functional. Also, local communities are often unaware of the need to perform the required maintenance, because a pump typically shows no significant problems until the point at which it fails.

Through this analysis, we will answer the following questions:

Q1. Does the source of water influence the functionality of the wells?

Q2. Does age affect the condition of the wells?

Q3. What is the best mode of payment for maintenance of wells?

Q4. Does holding public meetings help keep the wells functional?

The dataset is available on Kaggle. You can also get the data from my GitHub.

In the article below, I explain some of the features that matter most for determining the functionality of wells. For a detailed explanation, please check out my GitHub.

We have two datasets, target.csv and Pred.csv. After loading each file into a dataframe (df1 and df2), we merge the two on their shared id column:

import pandas as pd

# Merge the two dataframes on their shared 'id' column
data = pd.merge(df1, df2, on='id', how='inner')
data.head()
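Before relying on the merged frame, it can be worth checking that each id appears exactly once on both sides and that the inner join did not silently drop rows. The following is a minimal sketch on toy data: the rows are invented for illustration and are not taken from the dataset, while `validate` is a standard `pd.merge` parameter.

```python
import pandas as pd

# Toy stand-ins for df1 (labels) and df2 (features); all values are made up.
df1 = pd.DataFrame({"id": [1, 2, 3],
                    "status_group": ["functional", "non functional", "functional"]})
df2 = pd.DataFrame({"id": [1, 2, 3],
                    "basin": ["Basin A", "Basin B", "Basin C"]})

# validate="one_to_one" raises a MergeError if an id is duplicated on either side.
data = pd.merge(df1, df2, on="id", how="inner", validate="one_to_one")

# An inner join drops ids that are missing from one side, so compare row counts.
dropped = len(df1) - len(data)
print(f"rows merged: {len(data)}, rows dropped: {dropped}")
```

If `dropped` is greater than zero on the real files, some ids exist in only one of the two files and deserve a closer look.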

Exploration of Our Target Variable:

status_group
The target feature in this dataset is the ‘status_group’ feature. It indicates whether the water well is ‘functional’, ‘functional needs repair’, or ‘non functional’.

The three classes are heavily imbalanced, so we combine “functional needs repair” and “non functional” into a single “needs repair” class. The resulting two-class split is much more balanced:

functional      32259
needs repair    27141
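The class merge described above can be sketched as follows. This is one possible way to do it, not necessarily the code used in the original notebook; the toy labels are made up, and the merged label name “needs repair” follows the counts shown above.

```python
import pandas as pd

# Toy labels using the three original class names; the rows are made up.
status = pd.Series(["functional", "functional needs repair", "non functional",
                    "functional", "non functional"], name="status_group")

# Map both problem classes onto a single 'needs repair' label.
binary_status = status.replace({"functional needs repair": "needs repair",
                                "non functional": "needs repair"})

print(binary_status.value_counts())
```

Note that merging the classes trades detail for balance: the model can no longer distinguish a pump that needs minor repairs from one that has failed completely.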

Understanding the other features in the dataset:

  • amount_tsh: Total static head (amount of water available to the water point)
  • date_recorded: The date the row was entered
  • funder: Who funded the well
  • gps_height: Altitude of the well
  • installer: Organization that installed the well
  • longitude: GPS coordinate
  • latitude: GPS coordinate
  • wpt_name: Name of the water point if there is one
  • num_private:
  • basin: Geographic water basin
  • subvillage: Geographic location
  • region: Geographic location
  • region_code: Geographic location (coded)
  • district_code: Geographic location (coded)
  • lga: Geographic location
  • ward: Geographic location
  • population: Population around the well
  • public_meeting: True/False
  • recorded_by: Group entering this row of data
  • scheme_management: Who operates the water point
  • scheme_name: Who operates the water point
  • permit: If the water point is permitted
  • construction_year: Year the water point was constructed
  • extraction_type: The kind of extraction the water point uses
  • extraction_type_group: The kind of extraction the water point uses
  • extraction_type_class: The kind of extraction the water point uses
  • management: How the water point is managed
  • management_group: How the water point is managed
  • payment: Payment schedule
  • payment_type: Payment schedule
  • water_quality: The quality of the water
  • quality_group: The quality of the water
  • quantity: The quantity of water
  • quantity_group: The quantity of water
  • source: The source of the water
  • source_type: The source of the water
  • source_class: The source of the water
  • waterpoint_type: The kind of water point
  • waterpoint_type_group: The kind of water point

This is how the water pump statuses look now:

Creating a baseline

We can use these numbers as a baseline when comparing subgroups. Overall, about 54% of wells are functional (32,259 of 59,400). So if a region has less than 54% functionality, we know it is below average and that some features within that region are affecting the functionality of its wells.

This will help us identify important features more easily, and give us a frame of reference for further data exploration.

In addition, when we identify wells that are above average, we can examine the features that are contributing to this above average behavior, and apply the necessary steps for bringing subpar wells up to the same level of performance.
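As a rough illustration of the baseline comparison, the sketch below computes the overall functional share from the class counts reported earlier and then flags subgroups that fall below it. The region names and rows are made up for the example.

```python
import pandas as pd

# Overall baseline from the class counts reported in this article.
counts = pd.Series({"functional": 32259, "needs repair": 27141})
baseline = counts["functional"] / counts.sum()
print(f"baseline functional share: {baseline:.1%}")

# Toy rows (made up) to show the subgroup comparison.
data = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B", "B"],
    "status_group": ["functional", "functional", "needs repair",
                     "needs repair", "needs repair", "functional", "needs repair"],
})

# Share of functional wells per region, compared against the baseline.
share = (data["status_group"] == "functional").groupby(data["region"]).mean()
below_baseline = share[share < baseline].index.tolist()
print(share)
print("below baseline:", below_baseline)
```

The same pattern works for any categorical feature: swap `region` for the column of interest.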

Exploration of Administrative Features:

Administrative features are features that are used by/for the company performing the survey. In the case of our Tanzania Waterwell dataset, the following features can be categorized as administrative:

  • ‘id’ : the unique identification number for each individual waterwell.
  • ‘date_recorded’ : the date on which the survey company entered the data into the dataset.
  • ‘recorded_by’ : the name of the company responsible for surveying and recording the waterwell statistics. In this dataset the company name is constant across all wells: ‘GeoData Consultants Ltd’.

Because these administrative features are used primarily for creating the dataset, they most likely possess little or no predictive potential.
Hence, we will not be using them in our modelling.
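A quick way to confirm that a column such as ‘recorded_by’ carries no information is to count its unique values before dropping it. The sketch below uses a toy frame with made-up rows; only the column names and the company name come from the description above.

```python
import pandas as pd

# Toy frame (made-up rows) containing the three administrative columns.
data = pd.DataFrame({
    "id": [101, 102, 103],
    "date_recorded": ["2011-03-14", "2013-03-06", "2013-02-25"],
    "recorded_by": ["GeoData Consultants Ltd"] * 3,
    "basin": ["Basin A", "Basin B", "Basin C"],
})

# A column with a single unique value cannot help separate the classes.
constant_cols = [c for c in data.columns if data[c].nunique(dropna=False) == 1]
print("constant columns:", constant_cols)

# Drop the administrative columns before modelling.
admin_cols = ["id", "date_recorded", "recorded_by"]
features = data.drop(columns=admin_cols)
print(features.columns.tolist())
```

If any derived feature depends on ‘date_recorded’, it should be computed before the column is dropped.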

Exploration of Categorical Features:

1. Funder: The Government of Tanzania has funded the construction of more wells than any other funder.

Further, when we visualize the graph to check the functionality of pumps:

Here, we can see that many of the wells funded by the Government are not working or need repair. This is not a good sign.

2. Installer:

DWE and the Government are the major installers, which is similar to what we saw for funders.

3. wpt_name:

There are a lot of unique values here. Some refer to a school, a mosque, or even a person’s name. Hence, we will not be considering this feature for our modelling.
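One way to spot columns like wpt_name is to compare the number of distinct values with the number of rows. The sketch below is illustrative only: the rows are invented, and the 0.5 cutoff is an arbitrary assumption rather than a value from the original analysis.

```python
import pandas as pd

# Toy frame (made-up values): one near-unique column and one low-cardinality column.
data = pd.DataFrame({
    "wpt_name": ["Shuleni", "Msikitini", "Kwa Juma", "Zahanati", "Shuleni", "Kwa Asha"],
    "basin": ["Basin A", "Basin A", "Basin B", "Basin B", "Basin A", "Basin B"],
})

# Ratio of distinct values to rows: close to 1 means almost every row is its own category.
cardinality = data.nunique() / len(data)
print(cardinality.sort_values(ascending=False))

# The 0.5 threshold is an arbitrary choice for this illustration.
high_cardinality = cardinality[cardinality > 0.5].index.tolist()
print("high-cardinality columns:", high_cardinality)
```

Columns flagged this way would explode into thousands of sparse indicator columns if one-hot encoded, which is a practical reason to drop or group them.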

Geographical Features:

Tanzania is divided into thirty-one regions (mkoa in Swahili). Each region is subdivided into districts (wilaya in Swahili). The districts are sub-divided into divisions (tarafa in Swahili) and further into local wards (kata in Swahili). Wards are further subdivided for management purposes: for urban wards into streets (mtaa in Swahili) and for rural wards into villages (kijiji in Swahili). The villages may be further subdivided into hamlets (vitongoji in Swahili).

This is how the geographical features in the dataset relate to one another:

Basin is the largest unit and sub-village is the smallest.

Basin: Checking the functionality of water pumps for each basin

Lake Nyasa has the most functional wells connected to it.

Three basins (Lake Nyasa, Pangani, Rufiji) have more than 60% of their pumps functioning.
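Per-basin percentages like the ones above can be obtained with a normalized cross-tabulation. This is a sketch on toy data with placeholder basin names and made-up rows, so its output does not reproduce the real figures.

```python
import pandas as pd

# Toy rows (made up); the real frame has one row per pump.
data = pd.DataFrame({
    "basin": ["Basin A", "Basin A", "Basin A", "Basin B", "Basin B",
              "Basin C", "Basin C"],
    "status_group": ["functional", "functional", "needs repair",
                     "functional", "needs repair",
                     "needs repair", "needs repair"],
})

# normalize="index" turns counts into row-wise shares, so each basin sums to 1.
basin_share = pd.crosstab(data["basin"], data["status_group"], normalize="index")
print(basin_share.round(2))

# Basins where more than 60% of pumps are functional.
above_60 = basin_share.index[basin_share["functional"] > 0.60].tolist()
print("above 60% functional:", above_60)
```

Working with shares rather than raw counts matters here, because basins with more pumps will naturally have more functional pumps in absolute terms.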

4. Region:

Iringa has the most functional wells.
Lindi and Mtwara have the most wells that are non-functional or need repair.

5. Public Meeting:

We see that holding public meetings helps with the maintenance of the wells.

6. Permit:

Whether a water point is permitted also appears to be an important factor.

7. Water quality:

The water_quality variable indicates the quality of the water produced by a given pump. It is dominated by the soft value, which accounts for 50,818 of the 59,400 pumps (about 86%).

8. Water Source:

Rainwater harvesting is an important source of water that helps keep wells functioning.

Correlation among the different sources of water:

Numerical Features:

9. gps_height: This variable represents the altitude of the pump’s location.

The higher the pump’s location, the more likely it is to be functional.

10. Age of Wells:

Newer wells are more likely to be functional.
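The dataset has no age column, so age has to be derived. The sketch below assumes age is the year of date_recorded minus construction_year, and that a construction_year of 0 marks a missing value; neither assumption is stated in this article, and the rows and age bands are made up for illustration.

```python
import pandas as pd

# Toy rows (made up). Assumption: a construction_year of 0 marks a missing value.
data = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-03-06", "2013-02-25", "2012-10-01"],
    "construction_year": [1999, 2010, 0, 2009],
    "status_group": ["needs repair", "functional", "functional", "functional"],
})

recorded_year = pd.to_datetime(data["date_recorded"]).dt.year
# mask() replaces 0 with NaN so missing years do not turn into 2000-year-old wells.
built_year = data["construction_year"].mask(data["construction_year"] == 0)
data["age"] = recorded_year - built_year

# Bucket the ages (band edges are arbitrary) and compare the functional share per band.
data["age_band"] = pd.cut(data["age"], bins=[-1, 5, 15, 60],
                          labels=["0-5", "6-15", "16+"])
is_functional = data["status_group"] == "functional"
share_by_band = is_functional.groupby(data["age_band"], observed=True).mean()
print(share_by_band)
```

Rows with an unknown construction year end up with a missing age and are left out of the per-band comparison rather than being counted as very old wells.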

Modelling:

Please see my GitHub for the modelling details.

I used LinearSVC, XGBoost and RandomForest.

The most important factors for determining the functionality of a well are: age of the well, permit, payment type, scheme management, lakes, and gps_height.

Conclusion and Recommendation

Groundwater is very important for maintaining the functionality of the wells. Almost the entire water supply to the wells depends on groundwater. Hence, we should look into methods such as rainwater harvesting and soil conservation, which would also help sustain more water in the lakes.

We see that wells are better maintained when users pay per bucket or monthly. So payment is an important factor.

The age of a well is clearly an important factor in predicting its functionality. The newer the well, the more water it has and the better its condition.

We can see that holding public meetings helps keep the wells functioning. More than 50% of wells are functional where public meetings are held. Thus, public meetings are an important factor in the functioning of wells.

Finally, designated people should be sent out to inspect the pumps flagged by the model and assess what needs to be done.
