#DataAnalysisTools


Text
dataautomationtools

Data Analysis Tools Are Meaning-Making Machines

Data analysis tools occupy a rarefied position atop the data ecosystem, converting data into a story. What they really do is make meaning. They don’t just compute numbers or render charts. They decide what gets counted, how it’s grouped, what’s visible, what’s comparable, and what’s ignored. Those decisions shape how people understand reality inside an organization. In that sense, analytics isn’t downstream of meaning — it’s one of the primary ways meaning is constructed.


[Image: blueprint-style schematic of various data analysis tools and their pipeline flows]

For example, an analytics dashboard doesn’t merely report revenue. It asserts a story about how revenue should be understood: what time frame matters, which segments are relevant, what constitutes success or failure. A funnel chart doesn’t just show drop-off; it implies causality, priority, and responsibility. Someone is now accountable for that slope.


Analytics platforms are meaning-making machines because they:


- Select: Out of infinite possible measurements, a platform surfaces a few.
- Frame: Through dimensions, filters, and time windows, it defines context.
- Stabilize: It freezes interpretations into reusable metrics (“this is what churn means here”).
- Authorize: Numbers displayed on an “official” dashboard carry institutional weight.
- Normalize: Over time, repeated views become assumed truths rather than hypotheses.

That’s why disagreements over dashboards feel political rather than technical. People aren’t arguing about SQL — they’re arguing about which interpretation of reality gets to be the default. And this is also why analytics platforms differ so dramatically in philosophy.


Take Looker’s insistence on a semantic layer, for example. It isn’t just about governance — it’s about centralizing meaning. It says: interpretation should be controlled, hierarchical, deliberate, and versioned. Tableau’s emphasis on free exploration, on the other hand, reflects a different belief: meaning should emerge through visual interaction and human intuition. Sigma’s spreadsheet model leans on familiarity, letting people reason in a language they already trust. Power BI’s tight integration with Excel acknowledges that meaning often forms outside formal BI systems — in ad-hoc models and side calculations. None of these are neutral choices.


Even seemingly mundane design decisions carry interpretive weight. Does the platform default to month-over-month or year-over-year? Does it make cumulative metrics easy and cohort analysis hard? Does it encourage slicing by geography or by customer segment? Each of these nudges how people think, not just what they see. And this extends beyond platforms to analysis itself.


Analysis is not the act of discovering objective truth hidden in data. It’s the act of constructing a plausible narrative from incomplete, biased, and historically contingent signals. The math matters, but the story matters more — because the story is what gets acted on.


This is why “self-service BI” is such a complicated phrase. What’s really being democratized isn’t access to data; it’s access to interpretation. And interpretation without shared context can fragment meaning instead of clarifying it. That’s why analytics maturity often follows a curve: exploration first, then chaos, then governance, then—if you’re lucky—shared understanding.


[Image: a blueprint schematic of a data analysis tool and how it’s situated in the data ecosystem]

It also explains why analytics tools so often disappoint when organizations expect them to “settle debates.” They don’t. They formalize them. Once a metric is codified in a dashboard, the argument shifts from “what happened?” to “why does this number say that?” and eventually to “should this number even exist?”


In that sense, data analysis tools are more like a language than a machine. They provide grammar (models, metrics, dimensions), vocabulary (fields, measures), and syntax (charts, dashboards). Different tools encourage different dialects. Over time, organizations develop an accent — a way of speaking about performance, risk, and success that feels natural but is entirely constructed.


So in the end, data analysis tools help create the reality an organization believes it’s operating in. And that’s why choosing one is never a purely technical decision. It’s a choice about how meaning gets made, shared, and enforced inside the system.


What are data analysis tools actually used for?


Most organizations use data analytics tools for four overlapping jobs.


The first is exploration. Analysts and power users need a place to poke at data, slice it different ways, test hypotheses, and follow threads without writing a new query for every idea. This is where filters, drill-downs, joins, and ad-hoc calculations live. A good data analysis tool makes exploration feel playful. A bad one turns curiosity into work.


The second is dashboards and operational visibility. These are the shared views—the revenue board, the funnel chart, the SLA dashboard—that become the organization’s ambient awareness. They’re consulted daily, sometimes obsessively. When they’re right, teams become focused on shared goals. When they’re wrong, entire weeks can be wasted arguing over whose numbers are “correct.”


The third is reporting and distribution. Scheduled reports, embedded analytics, board decks, regulatory summaries—analytics platforms are the engine behind all of them. This is where permissions, formatting, and delivery reliability start to matter as much as query performance.


The fourth—and often the most underestimated—is semantic modeling and governance. This is where analytics platforms either shine or slowly poison trust. If “revenue,” “active user,” or “conversion” mean different things in different dashboards, the platform is failing, no matter how pretty the charts look.


What Makes One Data Analysis Tool Better Than Others?


The difference between a good analytics tool and death by a thousand cuts isn’t beautiful, easy-to-read charts or a quick and slick dashboard. It’s whether the tool survives contact with actual organizations, and the flawed humans they’re made of. A good data analysis tool - like any data tool - must deftly compensate for the irrationalities, inefficiencies, and absurdities that we humans are made of.


Our gift for irrationality requires a tool that insists on rigorous data governance. Platforms that allow (or require) metrics to be defined once and reused everywhere dramatically reduce confusion and rework. Platforms that shunt this responsibility onto individual dashboard authors allow metric drift that, over time, erodes confidence in the rules and definitions that are the meat of the analysis. You’ll have your pretty charts but they’ll be illustratin’ doo-doo. You’ll think it’s chocolate milk but it’s watered-down Yoohoo. I’m talkin (see the sketch just after this list):


- Can you define metrics once and reuse them everywhere?
- Row-level security, object permissions, audit logs, lineage-ish capabilities.
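
To make that first question concrete, here’s a minimal, hypothetical sketch of the “define once, reuse everywhere” idea in plain Python. It isn’t any particular platform’s semantic layer, just the shape of one; the table, column, and metric names are made up.

# A tiny, hypothetical "semantic layer": every metric is defined exactly once,
# and every dashboard or report compiles its SQL from the same definition
# instead of re-deriving its own. Names are illustrative only.

METRICS = {
    "revenue": {
        "expression": "SUM(order_total)",
        "table": "orders",
        "date_column": "order_date",
        "filters": ["status = 'completed'"],
    },
    "active_users": {
        "expression": "COUNT(DISTINCT user_id)",
        "table": "events",
        "date_column": "event_date",
        "filters": ["event_type = 'session_start'"],
    },
}

def build_query(metric_name, start_date, end_date):
    """Compile a metric definition into SQL so every consumer gets identical logic."""
    m = METRICS[metric_name]
    conditions = m["filters"] + [
        f"{m['date_column']} BETWEEN '{start_date}' AND '{end_date}'"
    ]
    return (
        f"SELECT {m['expression']} AS {metric_name} "
        f"FROM {m['table']} WHERE {' AND '.join(conditions)}"
    )

# Two different dashboards asking for "revenue" over January get byte-identical SQL.
print(build_query("revenue", "2024-01-01", "2024-01-31"))

The point isn’t the code; it’s that interpretation lives in one governed place instead of being re-invented, slightly differently, in every chart.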

Similarly, our many very human inefficiencies make how the tool performs at scale - and especially under concurrency - a second criterion that makes or breaks a good data analysis tool. A platform that works beautifully for one analyst can collapse when hundreds of people open dashboards at once. How a tool handles compute (some lean heavily on in-memory engines, others push computation down into the warehouse, and still others blend caching strategies) makes a big difference at enterprise scale. These architectural choices show up very quickly in real usage, especially on Monday mornings.


Usability does for our myriad human absurdities what good parents do for a family: it defines norms with a gentle touch. In my experience, and in what I hear in conversations with other developers, the best platforms are elegantly opinionated. They guide users toward sane patterns and discourage destructive ones. A tool that lets everyone do everything often ends up doing nothing well.


Extensibility also matters a lot to anyone trying to embed analytics in a product, a portal, or an internal tool. That requires APIs, embedding controls, tenant isolation, and pricing models that don’t punish success.


Finally, there’s total cost of ownership. License (or usage) price is only part of the story. Training time, admin overhead, duplicated datasets, broken dashboards, and governance cleanup all cost real money—even if they don’t show up on an invoice.


The Chase


Ok smart guy, you say. If it all boils down to firm but gentle governance, stability at scale, simplicity, extensibility, and cost, then let’s cut to the chase - which data analysis tool has all five in spades? Ah, would that it were so sweetly simple that the best data analysis tool is merely the one that checks all five boxes on that list. I wish.


Instead, what you find is something like a tension graph, where each quality pulls against one or more of the others. I’ve had this conversation so many times with various developers, informally polling to get the lay of the data analytics land, and the same handful of questions and the same handful of names kept turning up again and again. Eventually I knew this was both a question in need of an answer and an answer in desperate need of further analysis.


If you’d like to read me making sweet meaning out of the field of data analysis tools, as well as lay your cones and rods upon a data visualization so sensual it will touch you right in your soul (or thereabouts), then for god’s sake click here to read my piece on the best data analysis tools.


Data Analysis Tool FAQs


Which data analysis tool should I choose?

It depends on your stack and users. Power BI works well in Microsoft environments, Tableau is strong for visualization, Looker for governed modeling, and Sigma for warehouse-native analysis. The “best” tool is the one that fits your data sources, budget, and team skills.

Do we really need a data warehouse first?

For serious analytics, yes. Running reports directly on application databases doesn’t scale. A warehouse like Snowflake, BigQuery, or Redshift provides performance, separation from production systems, and a central place for clean, modeled data.

What’s the difference between BI tools and analytics tools?

BI tools focus on dashboards and reporting. Analytics tools can include data preparation, modeling, statistics, and exploration. Many modern analytics platforms blend both capabilities.

Extract-based or live-query tools—what should we use?

Extracts are fast but require extra pipelines. Live-query tools simplify architecture but rely on warehouse performance. Choose based on data size, freshness needs, and infrastructure costs.

How do I integrate analytics into my application?

Most platforms offer APIs and embedding options so dashboards can live inside your product with single sign-on and programmatic access controls.

How important is the semantic layer?

It’s crucial. A strong semantic layer ensures metrics are defined once and used consistently, preventing conflicting reports and duplicated SQL logic.

What’s the biggest risk when adopting a data tool?

Poor data quality and governance. The tool matters less than having clean, modeled, and trusted data feeding it. See my article about data cleaning before visualization to learn more.

Text
statswork

Top Data Analysis Tools Used by Researchers in 2026

In 2026, research success depends heavily on how effectively data is analysed. With increasing data volume, complex research designs, and higher publication standards, researchers now rely on advanced data analysis tools to ensure accuracy, efficiency, and credibility. Whether you are working on academic research, market studies, healthcare projects, or business analytics, choosing the right software makes a significant difference.

This blog explores the top data analysis tools used by researchers in 2026, their applications, and how they support both quantitative data analysis and qualitative data analysis.

Why Data Analysis Tools Matter in Modern Research

Modern research is no longer limited to spreadsheets. Today’s studies involve big datasets, multiple variables, and mixed-method approaches. Advanced data analytics tools help researchers:

  • Improve accuracy and consistency
  • Save time in data processing
  • Visualize results clearly
  • Support statistical interpretation
  • Enhance research credibility

Professional data analysis services often use these tools to deliver reliable insights for academic and business clients.

1. SPSS – Still a Research Standard

SPSS remains one of the most widely used tools for statistical data analysis in social sciences, healthcare, education, and business research.

Key Features:

  • Descriptive and inferential statistics
  • Regression analysis and hypothesis testing
  • Easy-to-use interface
  • Strong reporting support

2. R Programming – Power for Advanced Analytics

R continues to dominate among data scientists and academic researchers due to its flexibility and open-source ecosystem.

Why Researchers Prefer R:

  • Advanced statistical modelling
  • Data visualization libraries
  • Meta-analysis and regression tools
  • Ideal for complex research studies

R is commonly used in meta-analysis research, bioinformatics, and large-scale data analytics projects.

3. Python – Versatile and Research Friendly

Python has become one of the most trusted tools for data analytics services because of its simplicity and powerful libraries.

Popular Libraries:

  • Pandas for data handling
  • NumPy for numerical analysis
  • Matplotlib and Seaborn for visualization
  • SciPy for statistical testing

Python is widely used in business research, healthcare analytics, and predictive studies.
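
As a quick illustration, here is a minimal sketch of that stack in action: loading a (hypothetical) survey file, summarising it, running a simple statistical test, and plotting the result. The file and column names are invented for the example.

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Load a hypothetical survey dataset (file and column names are illustrative only).
df = pd.read_csv("survey_results.csv").dropna(subset=["group", "score"])

# Pandas for data handling, NumPy for numerical summaries.
print(df.groupby("group")["score"].describe())
print("Overall mean score:", np.round(df["score"].mean(), 2))

# SciPy for a simple inferential test: compare mean scores of two groups.
group_a = df.loc[df["group"] == "A", "score"]
group_b = df.loc[df["group"] == "B", "score"]
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch's t-test: t = {t_stat:.3f}, p = {p_value:.4f}")

# Matplotlib for a quick visualization of the two distributions.
plt.hist([group_a, group_b], bins=20, label=["Group A", "Group B"])
plt.xlabel("Score")
plt.ylabel("Count")
plt.legend()
plt.show()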

4. NVivo – Best for Qualitative Data Analysis

NVivo is the top choice for researchers working with interviews, focus groups, and textual data.

Applications:

  • Thematic analysis
  • Content analysis
  • Coding of transcripts
  • Qualitative research reporting

It supports professional qualitative data analysis services with structured and transparent workflows.

5. Excel – Still Relevant in 2026

Despite advanced tools, Excel remains essential for quick data cleaning, sorting, and basic statistical calculations.

Best Uses:

  • Data pre-processing
  • Charts and tables
  • Basic descriptive statistics

Excel often acts as a starting point before advanced analysis.

6. Tableau – Data Visualization Excellence

Tableau helps researchers convert complex datasets into easy-to-understand visual stories.

Benefits:

  • Interactive dashboards
  • Business and academic reporting
  • Easy sharing of insights

Visualization improves research communication and decision-making.

7. Stata – Strong for Econometrics and Social Research

Stata is widely used in economics, sociology, and policy research.

Core Strengths:

  • Time-series analysis
  • Panel data analysis
  • Regression modelling
  • Survey data analysis

It supports high-quality statistical analysis services.

8. SAS – Enterprise-Level Analytics

SAS remains a leader in healthcare, finance, and large research institutions.

Key Capabilities:

  • Advanced predictive analytics
  • Data management
  • Regulatory compliance reporting

SAS is commonly used by professional data analytics solutions providers.

Choosing the Right Data Analysis Tool

The best tool depends on:

  • Research objectives
  • Data type (quantitative or qualitative)
  • Sample size
  • Complexity of analysis
  • Reporting requirements

Many researchers now rely on outsourced data analysis services to select and apply the right tools effectively.

How Statswork Supports Researchers in 2026

Statswork provides expert Data Analysis Services, Quantitative Data Analysis Services, and Qualitative Data Analysis Services using industry-leading tools such as SPSS, R, Python, NVivo, Stata, and SAS.

With professional analysts, accurate reporting, and publication-ready outputs, Statswork helps researchers focus on insights instead of software challenges.

Future Trends in Research Data Analysis

In 2026, researchers are moving toward:

  • Automation in statistical testing
  • Integration of multiple data sources
  • Advanced visualization techniques
  • Cloud-based analytics platforms
  • Collaborative research workflows

These trends are shaping the future of data science and analytics services worldwide.

Final Thoughts

Choosing the right data analysis tool is not just a technical decision—it directly impacts research quality, credibility, and publication success. Whether you are a student, academic researcher, or business professional, using the right tools ensures accurate and meaningful insights.

For professional support, outsourcing to expert data analysis services like Statswork ensures precision, compliance, and research excellence.

Text
pythonjobsupport

Data analysis and visualization project in Malayalam | Just 30 min #dataanalysis #dataanalysistools

Discover how to analyze and visualize climate change data using Python in this comprehensive tutorial! Perfect for beginners and …

Text
jameswilliamsus23

Transform Data into Growth Opportunities

Harness the power of AI-driven business analytics software to gain real-time insights, streamline operations, and make smarter decisions that drive your business forward.

Text
ltslean

Manufacturing Shop floor Data Collection Software: Data-driven journey towards seamless production


Shop floor data collection software automates your data-driven manufacturing strategy while boosting throughput and profitability.

For more details read our blog:
https://shopfloordatacollectionsoftware.leantransitionsolutions.com/software-blog/manufacturing-shop-floor-data-collection-software

Text
iemlabs

Best Data Analysis Tools and Software in 2022

The term “data” has been around for a very long time. Data is essential for decision-making in today’s world, where 2.5 quintillion bytes of data are produced every day. But how do we manage that much data? The data analyst is one of several important roles in the industry today that work with data to obtain insights. To extract insights from data, a data analyst has to use various techniques.

Now share your experience with us in the Comment section

Read the full blog: https://bit.ly/3i1xZd9

Text
growingpage

AdsReport


AdsReport is a Facebook advertising data reporting tool. It automatically generates reports from Facebook ad data through dashboards.
Our advantages:
1. Channel advantages: We focus on generating Facebook advertising data reports, which lets us build an excellent platform for this niche.
2. Content advantages: Reports are generated through dashboard design. We provide advertising data templates and 130 advertising indicators to choose from, so you can create the most suitable advertising report.
3. Free to use.
4. Useful functions: one-click PDF download, online report link sharing, and large-scale data storage, so no matter how far back the data in your advertising account goes, you can generate reports immediately.

Read the full article

Text
azucenacoursera

Data Analysis Tools - Part Two

For the second assignment of the Data Analysis Tools course, we’ve been asked to run a Chi Square Independence Test. 

By using the Gapminder dataset, I would like to understand if there is an association between income group (explanatory variable) and life expectancy group (response variable).

  • Ho - There is no association between income group and life expectancy.
  • Ha - There is an association between income group and life expectancy.

Then we run our Chi Square test in SAS, and we have the below results:

Based on the above, we can see that the probability is <.0001; therefore we can reject the Ho and accept the Ha. The results show an association between income group and life expectancy group, or in chi-square terms, the explanatory variable shows a dependency on the response variable.

Finally, we want to understand which of the possible pairwise combinations show actual significance. For this, we run a sequential analysis comparing the 10 possible pairs. We also use the Bonferroni adjustment to check whether the results are significant enough to reject the Ho.

Bonferroni adjustment = Probability 0.05 / # of comparisons

In this case, we have the below:

B. Adjustment = 0.05 / 10 = 0.005 - So if a comparison shows a probability value lower than 0.005, we’ll be able to reject the Ho.
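
The original analysis was run in SAS (the code and output appear as images in the post). For readers working in Python, a rough sketch of the same Bonferroni-adjusted pairwise chi-square idea might look like the following; the file and column names are hypothetical.

from itertools import combinations
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: one row per country, with categorical income and
# life-expectancy group labels (names invented for illustration).
df = pd.read_csv("gapminder_groups.csv")

income_levels = sorted(df["income_group"].unique())
pairs = list(combinations(income_levels, 2))
bonferroni_threshold = 0.05 / len(pairs)  # 0.05 / 10 = 0.005 with five income groups

# Run a chi-square test for every pair of income groups.
for a, b in pairs:
    subset = df[df["income_group"].isin([a, b])]
    table = pd.crosstab(subset["income_group"], subset["life_expectancy_group"])
    chi2, p, dof, expected = chi2_contingency(table)
    verdict = "significant" if p < bonferroni_threshold else "not significant"
    print(f"{a} vs {b}: chi2 = {chi2:.2f}, p = {p:.5f} ({verdict})")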

Possible combinations:

SAS code used:

Summary results table:

As we can see, almost all the combinations showed a significant value, except Short Life vs V Short Life and Long Life vs Average Life.

More to come! :)

Text
azucenacoursera

Data Analysis Tools - Part One

For the first assignment of the Data Analysis Tools course, we’ve been asked to run an analysis of variance.

By using the Gapminder dataset, I would like to understand if there is an association between continent (explanatory variable) and life expectancy (response variable).

  • Ho - There is no association between continent and life expectancy.
  • Ha - There is an association between the continent and life expectancy.

Then we run our ANOVA analysis in SAS, and we have the below results:

Based on the results above, we can see there is an F value of 57.10 and a probability of <.0001; therefore we can reject the Ho and accept the Ha that there is an association between continent and life expectancy.

When running an Analysis of Variance (ANOVA), the results tell us whether there is a difference in means. However, they don’t pinpoint which means are different. Duncan’s Multiple Range Test (DMRT) is a post hoc test that measures specific differences between pairs of means.

Here is the code to run Duncan’s test:

Duncan’s results:

Based on the above, we can see that there is a significant difference between Africa and Europe.
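
The ANOVA and Duncan’s test above were run in SAS (shown as images in the original post). For readers working in Python, a rough sketch of a comparable workflow is below. Note that Duncan’s Multiple Range Test is not available in SciPy or statsmodels, so Tukey’s HSD is used here as a stand-in post hoc test, and the file and column names are assumptions.

import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical Gapminder-style data: one row per country with a continent
# label and a life-expectancy value (column names are illustrative only).
df = pd.read_csv("gapminder.csv")
df["lifeexpectancy"] = pd.to_numeric(df["lifeexpectancy"], errors="coerce")
df = df.dropna(subset=["continent", "lifeexpectancy"])

# One-way ANOVA: does mean life expectancy differ across continents?
groups = [g["lifeexpectancy"].values for _, g in df.groupby("continent")]
f_value, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_value:.2f}, p = {p_value:.4f}")

# Post hoc pairwise comparison of continents (Tukey's HSD as a substitute for Duncan's test).
tukey = pairwise_tukeyhsd(endog=df["lifeexpectancy"], groups=df["continent"], alpha=0.05)
print(tukey.summary())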

More to come! :)

Text
artemshulgin-blog

Data Analysis Tools Assignment 4

INTRODUCTION

The aim of this assignment was to run a correlation coefficient analysis that includes a moderator. As in my previous assignment, I tested the dependence between income per person and life expectancy. However, I split countries into two groups: those with high alcohol consumption and those with low. The outcome of my investigation can be seen below. For the Python script, please scroll to the end.

INVESTIGATION

The countries were split by a particular criterion. Those that consumed less than or equal to 8 litres of alcohol per person per year were defined as “low alcohol consuming countries”; those that consumed more were defined as “high alcohol consuming countries”. The first group contains 109 countries and the second contains 62. The r coefficient and p-value for each group are as follows:

association between income per person and life expectancy for LOW alcohol consuming countries
(0.5085799406548519, 1.64151689069347e-08)

association between income per person and life expectancy for HIGH alcohol consuming countries
(0.6093640344213327, 1.4716677770817285e-07)

There is indeed a strong linear correlation between variables and the p-value is small enough to reject the null hypothesis.

GRAPHS

And the combined graph comparing the two groups:

To conclude, I would say that alcohol consumption does act as a moderator in this case. From the combined graph we can see that the orange dots (high alcohol consuming countries) are located to the right of the blue ones (low alcohol consuming countries). This means that people in countries from the second group need a higher income to live the same lifespan as people in countries from the first group. Put another way, people with the SAME income live shorter lives in high alcohol consuming countries than in low alcohol consuming countries.

PYTHON SCRIPT

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)

# Coerce the variables of interest to numeric (convert_objects was removed
# from pandas, so pandas.to_numeric is used instead).
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['incomeperperson'] = data['incomeperperson'].replace(' ', numpy.nan)

data_clean = data.dropna()

# Overall correlation between income per person and life expectancy.
print(scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['lifeexpectancy']))

# Moderator: split countries into low (<= 8 litres/year) and high (> 8 litres/year)
# alcohol consumption groups.
def alcohol(row):
    if row['alcconsumption'] <= 8:
        return 1
    elif row['alcconsumption'] > 8:
        return 2

data_clean['alcohol'] = data_clean.apply(lambda row: alcohol(row), axis=1)

chk1 = data_clean['alcohol'].value_counts(sort=False, dropna=False)
print(chk1)

sub1 = data_clean[data_clean['alcohol'] == 1]
sub2 = data_clean[data_clean['alcohol'] == 2]

# Correlation within each moderator group.
print('association between income per person and life expectancy for LOW alcohol consuming countries')
print(scipy.stats.pearsonr(sub1['incomeperperson'], sub1['lifeexpectancy']))
print('       ')
print('association between income per person and life expectancy for HIGH alcohol consuming countries')
print(scipy.stats.pearsonr(sub2['incomeperperson'], sub2['lifeexpectancy']))
print('       ')

# Scatterplots for each group.
scat1 = seaborn.regplot(x="incomeperperson", y="lifeexpectancy", data=sub1)
plt.xlabel('Income per person')
plt.ylabel('Life Expectancy')
plt.title('Scatterplot for the association between income per person and life expectancy for LOW alcohol consuming countries')
print(scat1)
#%%
scat2 = seaborn.regplot(x="incomeperperson", y="lifeexpectancy", fit_reg=False, data=sub2)
plt.xlabel('Income per person')
plt.ylabel('Life Expectancy')
plt.title('Scatterplot for the association between income per person and life expectancy for HIGH alcohol consuming countries')
print(scat2)

Text
artemshulgin-blog

Data Analysis Tools Assignment 3

INTRODUCTION

The aim of my assignment was to find two cases with a strong relationship between two variables. The investigation is based on the GapMinder dataset and provides a scatterplot, Pearson coefficient, and p-value for each case.

CASE 1

The investigation aims to determine the correlation between income per person and oil consumption per person. Python script:

import pandas
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)

# Coerce the variables of interest to numeric.
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['oilperperson'] = pandas.to_numeric(data['oilperperson'], errors='coerce')

data['incomeperperson'] = data['incomeperperson'].replace(' ', numpy.nan)
data['oilperperson'] = data['oilperperson'].replace(' ', numpy.nan)

# Scatterplot with a fitted regression line.
scat1 = seaborn.regplot(x="incomeperperson", y="oilperperson", fit_reg=True, data=data)
plt.xlabel('Income per person')
plt.ylabel('Oil per person')
plt.title('Scatterplot for the association between income per person and oil per person')

data_clean = data.dropna()

# Pearson correlation coefficient and p-value.
print('association between income per person and oil per person')
print(scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['oilperperson']))

The outcome:

association between income per person and oil per person
(0.5418925087373309, 6.4737526715952944e-06)

As we can see from the outcome, the Pearson coefficient shows a strong positive relationship between the variables. Moreover, the p-value is small enough to reject the null hypothesis. It means that income does influence oil consumption.

CASE 2

The investigation aims to determine the correlation between income per person and life expectancy. Python script:

import pandas
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)

# Coerce the variables of interest to numeric.
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')

data['incomeperperson'] = data['incomeperperson'].replace(' ', numpy.nan)
data['lifeexpectancy'] = data['lifeexpectancy'].replace(' ', numpy.nan)

# Scatterplot with a fitted regression line.
scat2 = seaborn.regplot(x="incomeperperson", y="lifeexpectancy", fit_reg=True, data=data)
plt.xlabel('Income per person')
plt.ylabel('Expectancy of life duration')
plt.title('Scatterplot for the association between income per person and expectancy of life duration')

data_clean = data.dropna()

# Pearson correlation coefficient and p-value.
print('association between income per person and expectancy of life duration')
print(scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['lifeexpectancy']))

The outcome:

association between income per person and expectancy of life duration
(0.7415154434411546, 8.186143219511415e-12)

As we can see from the outcome, the Pearson coefficient shows a strong positive relationship between the variables. Moreover, the p-value is small enough to reject the null hypothesis. It means that income does influence life expectancy.

Text
artemshulgin-blog

Data Analysis Tools Assignment 2

INTRODUCTION

The aim of this research was to use the chi-square test to identify whether there is a dependence between alcohol abuse/dependence and the frequency of alcohol consumption (in the 18-50 age group). The investigation is based on the NESARC dataset. The variables are the following: ALCABDEP12DX (defines whether a person is dependent on alcohol or not), CONSUMER (defines current drinking status), S2AQ4B (defines the frequency of alcohol consumption), and AGE.

The variable S2AQ4B contains 10 levels. However, I recombined some levels by unifying those that are alike (e.g. those who drank “every day” and “nearly every day” come together). Therefore, I got a new variable called ALCFREQ that contains 6 levels. It describes the average number of days per year that a person drank.

Python script can be found at the end (scroll down please).

GRAPH AND OUTCOME

The following graph depicts the relationship between the number of days per year that a person drank and the rate of alcohol abuse & dependence.

The outcome also showed:

chi-square value, p value
235.39486036562974, 7.45921009538277e-49 

The values are significant enough to reject the null hypothesis. However, I implemented a post hoc test to protect my investigation from a type I error.

I will not reproduce the whole output of my programme, but I will list the chi-square and p values for each pair tested. The “true” comment in parentheses marks significant pairs:
1v2 9.061540802080241, 0.0026104191196679874 (true)

1v3 15.380628594617393, 8.78846593320258e-05 (true)

1v4 67.32748024836498, 2.299537198380458e-16 (true)

1v5 123.48161575378339, 1.0939517758987028e-28 (true)

1v6 111.10281945108211, 5.618005476992593e-26 (true)

2v3 0.6800817415679685, 0.40955857813105057 (false)

2v4 14.422143353293013, 0.00014607455031474549 (true)

2v5 52.944355058013045, 3.431327502983536e-13 (true)

2v6 49.17696322066388, 2.3388131521615183e-12 (true)

3v4 6.895239341312582, 0.008642559115931894 (false)

3v5 38.26901054674063, 6.163369742450784e-10 (true)

3v6 35.87346972039927, 2.105563413376138e-09 (true)

4v5 16.581367949184713, 4.660665852959781e-05 (true)

4v6 15.71181270396884, 7.376207194912583e-05 (true)

5v6 0.002177601809954723, 0.9627804002347792 (false)

As we can see from the test, only pairs 2v3, 3v4 and 5v6 are not significant. It is interesting to note that each of these pairs consists of adjacent consumption groups.

PYTHON SCRIPT

The following Python script implements the whole investigation:

from itertools import combinations

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('dataset.csv', low_memory=False)

data['ALCABDEP12DX'] = pandas.to_numeric(data['ALCABDEP12DX'], errors='coerce')
data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
data['S2AQ4B'] = pandas.to_numeric(data['S2AQ4B'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')

# Current drinkers aged 18-50.
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 50) & (data['CONSUMER'] == 1)]

sub2 = sub1.copy()

sub2['S2AQ4B'] = sub2['S2AQ4B'].replace(99, numpy.nan)
sub2['ALCABDEP12DX'] = sub2['ALCABDEP12DX'].replace(1, numpy.nan)
sub2['ALCABDEP12DX'] = sub2['ALCABDEP12DX'].replace(2, numpy.nan)

# Collapse the 10 frequency levels into 6, expressed as average days drunk per year.
recode1 = {1: 340, 2: 340, 3: 183, 4: 80, 5: 80, 6: 30, 7: 12, 8: 5, 9: 5, 10: 5}
sub2['ALCFREQ'] = sub2['S2AQ4B'].map(recode1)

# Overall chi-square test of alcohol abuse/dependence against consumption frequency.
ct1 = pandas.crosstab(sub2['ALCABDEP12DX'], sub2['ALCFREQ'])
print(ct1)

colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

sub2["ALCFREQ"] = sub2["ALCFREQ"].astype('category')
sub2['ALCABDEP12DX'] = pandas.to_numeric(sub2['ALCABDEP12DX'], errors='coerce')

# Bar chart of abuse/dependence rate per consumption level
# (factorplot was renamed catplot in newer seaborn versions).
seaborn.factorplot(x="ALCFREQ", y="ALCABDEP12DX", data=sub2, kind="bar", ci=None)
plt.xlabel('Days drunk per year')
plt.ylabel('Alcohol Abuse & Dependence')

# Post hoc comparisons: run the same chi-square test for every pair of the six
# ALCFREQ levels (the 15 explicit blocks of the original script, folded into a loop).
levels = [5, 12, 30, 80, 183, 340]
for a, b in combinations(levels, 2):
    sub2['COMP'] = sub2['ALCFREQ'].map({a: a, b: b})

    ct = pandas.crosstab(sub2['ALCABDEP12DX'], sub2['COMP'])
    print(ct)

    colsum = ct.sum(axis=0)
    colpct = ct / colsum
    print(colpct)

    print('chi-square value, p value, expected counts')
    cs = scipy.stats.chi2_contingency(ct)
    print(cs)

Text
dataanalysistools-jf-blog

SAS vs Python vs R

I’ve already invested a lot of time learning R. I’ve spent a little time with Python, though not much yet on statistics and data mining/analysis. I’m not looking forward to having to learn a new tool. It would be great if there were an R track to mirror the SAS and Python learning options.